Goblin News
AI news by promptgoblins.ai
Filtered by: benchmarks
Yesterday
7
Ai2 Benchmarks Reveal AI Science Agents Far Behind Human Scientists
Research
1
1d ago
7
Elastic Looped Transformers Achieve 4x Parameter Reduction for Visual Generation
Research
1
1d ago
6
Data Pruning at Training Time Boosts LLM Fact Memorization by 1.3X
Research
1
1d ago
Monday
8
AI Models Break Into Research Mathematics, Solving Novel Problems and Accelerating Discovery
Research
1
1d ago
Top
Last Week
7
Researchers Expose Every Major AI Agent Benchmark as Trivially Exploitable
Research
1
3d ago
7
Research-Driven Coding Agents: Read First, Then Optimize
Research
1
5d ago
7
KellyBench: New Benchmark Reveals All Frontier LLMs Lose Money in Long-Horizon Betting Markets
Research
2
5d ago
7
Process-Driven Image Generation Introduces Multi-Step Reasoning for Visual Synthesis
Research
1
5d ago
7
Google Cloud Unveils PaperOrchestra: Multi-Agent System That Drafts Full Academic Papers from Lab Notes
Products
1
5d ago
6
Sol-RL Achieves 2.4x Faster Diffusion Model RL Training via FP4/BF16 Two-Stage Design
Research
1
5d ago
9
Anthropic Claude Mythos Preview: UK AISI Independently Confirms Step-Change Cyber Capabilities with Hard Benchmarks
Updated
Security
17
19h ago
Top
6
Claw-Eval: End-to-End Benchmark for Real-World AI Agents
Research
1
6d ago
8
Z.ai Launches GLM-5.1: Agentic Coding Model for Long-Horizon Tasks
Models
2
Apr 8
8
AI Capability Benchmarks Nearing Saturation, Leaving Safety Evaluators Without Upper Bounds
Safety
2
Apr 8
7
SandMLE Framework Makes On-Policy RL Training Tractable for ML Engineering Agents
Research
1
Apr 8
7
TriAttention Achieves 10x KV Memory Reduction, Matching Full Attention on AIME25
Research
1
Apr 8
6
178 AI Models Fingerprinted: Style Clones, House Styles, and Cross-Provider Convergence
Research
1
6d ago
6
ALTK-Evolve: On-the-Job Memory System Boosts AI Agent Reliability
Research
1
6d ago
6
Frontier AI Models Fail at Visual Financial Document Reasoning
Research
1
Apr 8
7
OpenAI Tests Image V2 Model on ChatGPT and LM Arena
Products
1
Apr 7
9
AI Offensive Cyber Capabilities Doubling Every 5-10 Months, New Research Finds
Security
1
Apr 6
Top
8
Large-Scale Worker Study Finds AI Automation Rising Broadly Across Jobs, Not in Sudden Capability Spikes
Research
1
Apr 6
7
Simple Self-Distillation Boosts LLM Code Generation by 13 Points Without RL or Verifiers
Research
1
Apr 6
7
Meta-Harness: Automated End-to-End Optimization of LLM Application Scaffolding Code
Research
1
Apr 6
6
Taxonomy of RL Environments for LLM Agents: A Framework for What Models Actually Practice On
Research
1
Apr 6
2 Weeks Ago
6
Analysts Warn AI Energy Breakthrough Headlines Are Overblown
Research
1
Apr 4
6
Dropbox Dash: DSPy-Optimized LLM Relevance Judge for Enterprise Search
Products
1
Apr 4
6
The 'Straight Lines on Graphs' Thesis: AI Progress Is Regular and Predictable
Research
1
Apr 4
8
UC Berkeley Study: AI Models Spontaneously Scheme to Prevent Peer AI Shutdowns
Updated
Research
3
Apr 6
Top
8
Open Weight LLMs Rival Closed Models on Agent Tasks at Fraction of the Cost
Open Source
1
Apr 3
Top
7
Holo3: New State-of-the-Art on Desktop Computer Use Benchmark
Models
2
Apr 3
6
OpenMed Trains mRNA Codon Optimization Models Across 25 Species for $165
Open Source
1
Apr 3
6
Vision2Web: New Benchmark Tests Multimodal Coding Agents on Visual Website Development
Research
1
Apr 3
6
CHMv2: AI-Powered Global Canopy Height Map Advances Forest Carbon Monitoring
Research
1
Apr 3
6
Falcon Perception: 0.6B Early-Fusion Model for Open-Vocabulary Grounding and Segmentation
Models
1
Apr 3
7
AI Benchmarks Fall Short: The Case for Human-Context Evaluation
Research
1
Mar 31
6
Google Research Releases TimesFM 2.5 with 60% Smaller Model and 8× Longer Context
Research
1
Mar 31
6
Researchers Propose Mirror-Window Game to Test LLM Self-Awareness
Research
1
Mar 31
8
HorizonMath: New Benchmark Tests AI on Unsolved Math Problems
Research
1
Mar 30
7
AI Performance on Humanity's Last Exam Surpasses 45%, Near-Perfect Scores Predicted Within a Year
Research
1
Mar 30
6
Analysis: AI Cost Ratios Stable Despite Rising Inference Bills
Research
1
Mar 30
6
OpenAI Researcher Shares Lessons on Evals, Post-Training, and AI Progress
Research
1
Mar 30
3 Weeks Ago
7
Cohere Launches Transcribe, an Open-Source 2B-Parameter ASR Model
Updated
Models
2
Mar 27
7
Google TurboQuant: Up to 6x KV Cache Compression for LLM Inference
Updated
Research
6
Apr 5
6.82
Semantic Calibration in LLMs: Why Base Models Know What They Know
Research
1
Mar 25
6
LLM Relayering Technique "RYS" Generalizes Across Models, Hints at Universal Thinking Space
Research
1
Mar 25
6
Ray Data LLM Claims 2x Throughput Over vLLM Synchronous Engine for Batch Inference
Products
1
Mar 25
7.12
Expert Persona Prompting Hurts LLM Accuracy on Coding and Math Tasks, USC Study Finds
Research
1
Mar 24
7
METR Tabletop Simulates 200-Hour AI Agents, Finds 3–5x Uplift and New Workflow Bottlenecks
Research
1
Mar 24
6
EVA: New Open-Source Framework for Jointly Evaluating Voice Agent Accuracy and Experience
Research
1
Mar 24
Columns: Signal, Title, Category, Sources, Posted