Goblin News
AI news by promptgoblins.ai
Filtered by: benchmarks
Today
7
Stanford Study: AI Outperforms Law Professors as Tutors in Blind Test
Research
1
1h ago
Yesterday
6
Microsoft ASSERT: Open-Source Framework Turns Plain-Language Rules into AI Test Cases
Open Source
1
9h ago
Monday
7
WindBorne Claims WeatherMesh-6 Outperforms ECMWF Gold Standard
Models
1
1d ago
Last Week
7
Tencent's Hy3 LLM Tops OpenRouter Rankings Despite Mediocre Benchmarks
Markets
1
4d ago
6
CogCAPTCHA30: Process-Based AI Detection Shows Frontier Models Are Least Human-Like
Research
1
4d ago
7
Open Models Trail Closed Frontier by 8–10 Months on Private Benchmarks
Research
1
4d ago
6
Human-Made Ads Outperform AI-Generated Counterparts Despite Near-Identical Appearance
Research
1
4d ago
7
LLMs Absorb False Beliefs Even When Explicitly Warned They Are False
Research
1
5d ago
7
Neuromorphic Ising Machine Tackles Hard Optimization Problems AI Cannot
Research
1
5d ago
7
DeepSWE: Contamination-Free Benchmark for Long-Horizon Coding Agents
Research
1
6d ago
7
ITBench-AA: Frontier AI Models Score Below 50% on Enterprise IT Agentic Tasks
Research
1
6d ago
7
MAI-Image-2.5 Launches at No. 3 on Arena Text-to-Image Leaderboard
Models
1
6d ago
6
WIRED Fact-Checker: AI Search Tools Inaccurate 45–60% of the Time
Safety
1
May 26
6
Research: LLM Coding Agents Degrade Sharply Under Structural Constraints
Research
1
May 25
2 Weeks Ago
7
LiteFrame Cuts Video LLM Inference Latency 35% with Compact Encoder
Research
1
May 21
7
2,000-Run Study Identifies Optimal Mixture-of-Experts Config Rules
Research
1
May 21
6
The 'Good Enough' AI Era: Cheaper Models Close Gap on Frontier Labs
Markets
1
May 21
6
WavFlow Generates Audio Directly in Raw Waveform Space
Research
1
May 21
7
Empirical Study: Grep Outperforms Vector Search in Agentic Retrieval Across Agent Harnesses
Research
1
May 20
6
AI Agents Gain Physical Form: Code-as-Policy Robotics Goes Consumer
Research
1
May 20
7
Multiscreen Architecture Matches Transformers with 30% Fewer Parameters
Research
1
May 19
7
Cerebras Runs Kimi K2.6 at 981 Tokens/sec — 29x Faster Than Official Endpoint
Infra
1
May 19
7
HRM-Text: Full Foundation Model Pretraining Framework for Under $1,500
Open Source
1
May 19
8
NVIDIA Vera CPU: First Agentic AI Processor Delivered to Anthropic, OpenAI, SpaceXAI, and Oracle — Benchmarks Confirm Claims
Updated
Infra
3
6d ago
8
LLMs Autonomously Optimize LLM Training, Beat Human Records on nanoGPT Speedrun
Research
1
May 18
7
Lighthouse Attention: 17× Faster Long-Context Training via Hierarchical Selection
Research
1
May 18
7
Open Agent Leaderboard: Benchmarking Full AI Systems, Not Just Models
Research
1
May 18
7
Aurora Optimizer Fixes Muon Neuron Death Bug, Sets New Speedrun SoTA
Research
1
May 18
6
LangChain Deep Agents Launches Model-Specific Harness Profiles, Yielding 10–20 Point Benchmark Gains
Products
1
May 18
3 Weeks Ago
6
Elite CTF Competitor Argues Frontier AI Has Broken Competitive Hacking Format
Security
1
May 16
7
Microsoft Research: LLMs Show 19-34% Artifact Fidelity Loss in Delegated Multi-Step Tasks
Research
1
May 15
7
Datadog Releases Toto 2.0: Scalable Open-Weights Time Series Models
Open Source
1
May 15
8
Microsoft MDASH Multi-Agent System Tops CyberGym Cybersecurity Benchmark
Security
1
May 14
7
Token Superposition Training Cuts LLM Pretraining Time 2.5x Without Architecture Changes
Research
1
May 14
7
IBM Granite Embedding R2: Open Multilingual Models with 32K Context Top Sub-100M Benchmarks
Open Source
1
May 14
6
Forum AI's Campbell Brown Benchmarks AI on High-Stakes Topics with Expert-Built Evaluations
Safety
1
May 14
7
Perceptron Mk1: Video Analysis AI Model Priced 80-90% Below Frontier Rivals
Models
1
May 13
7
RL Fine-Tuning Enables Small 4B Models to Match Large LLMs as Recursive Agents
Research
1
May 13
7
Research Reframes Tokenization as a Compute Scaling Variable
Research
1
May 13
6
Parameter Golf ML Challenge: Lessons from 2,000 Submissions
Research
1
May 13
7
AutoTTS: Agentic Framework Auto-Discovers LLM Test-Time Scaling Strategies
Research
1
May 12
7
A²RD: Agentic Diffusion Architecture Achieves 30% Consistency Gains in Long-Form Video Generation
Research
1
May 12
7
NVIDIA-Backed Sparsity Technique Reported to Deliver 20% LLM Speedup on H100 GPUs
Research
1
May 12
7
Cactus Open-Sources Needle: 26M Parameter Tool-Calling Model for Consumer Devices
Open Source
1
May 12
7
Tsinghua Study: Visual Generation Boosts AI Spatial Reasoning
Research
1
May 12
6
AI Coding Proficiency Is Shifting Language Choice Away From Python Toward Systems Languages
Products
1
May 12
6
Normalizing Trajectory Models Enable High-Quality Few-Step Diffusion with Exact Likelihood
Research
1
May 12
6
Claude Reinvented 3,000 Lines of Existing Python Libraries Instead of Importing Them
Research
1
May 12
7
Mathematician Reports ChatGPT 5.5 Pro Produced PhD-Level Math Research in About an Hour
Models
1
May 11
6
Forced Memory Consolidation Degrades LLM Episodic Memory Quality
Research
1
May 11