AI Capability Benchmarks Nearing Saturation, Leaving Safety Evaluators Without Upper Bounds

Safety · 2 sources · Apr 8

Summary

  • Frontier AI benchmarks are being saturated faster than new ones can be created
  • Claude Opus 4.6 completes nearly all METR Time Horizon tasks; 50% horizon is 12 hours, 95% UCB is 60 hours
  • Anthropic's ASL-4 determination for Opus 4.6 now relies on a 16-person researcher survey, not quantitative benchmarks
  • Analyst projects that by mid-2027, no existing benchmark can rule out dangerous AI capabilities (author's opinion, not METR's official position)

Details

1. Insight

Benchmark saturation has accelerated sharply from 2024 to 2026, undermining capability upper-bounding

GPQA was considered extremely challenging in early 2024 but was largely saturated within a year. By early 2026, newer agentic and long-horizon benchmarks face the same trajectory. The author frames this as a structural, worsening trend rather than an isolated measurement challenge.

2. Research

METR Time Horizon suite nearly saturated; Claude Opus 4.6 achieves 50% horizon of 12 hours, 95% UCB of 60 hours

The Time Horizon suite was designed to measure long-horizon autonomous task completion. Frontier models like Opus 4.6 and GPT-5.3 can now reliably complete all but roughly a dozen tasks in the suite, making it difficult to establish a meaningful upper bound.
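The 50% horizon and the 95% UCB (upper confidence bound) are statistical estimates fitted over the task suite, not raw task counts. As a rough illustration of the kind of calculation involved — a minimal sketch, not METR's actual methodology, code, or data; the task results below are invented — one can fit a logistic curve of model success against log task length, read off the length where predicted success crosses 50%, and bootstrap an upper bound:

```python
# Minimal sketch of a "time horizon" estimate: fit a logistic curve of
# success probability vs. log2(task length), then read off the length at
# which predicted success crosses 50%, plus a bootstrap 95% upper bound.
# Task results below are invented for illustration; not METR's data or code.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_hours = np.array([0.5, 1, 2, 4, 8, 12, 16, 24, 32, 48])  # human completion time
succeeded  = np.array([1,   1, 1, 1, 1, 1,  0,  1,  0,  0])   # did the model succeed?

def horizon_50(hours, success):
    """Task length (hours) at which predicted success probability is 50%."""
    X = np.log2(hours).reshape(-1, 1)
    clf = LogisticRegression().fit(X, success)
    # p = 0.5 where intercept + coef * log2(h) = 0
    return 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])

point_estimate = horizon_50(task_hours, succeeded)

# Bootstrap over tasks to get a one-sided 95% upper confidence bound.
rng = np.random.default_rng(0)
boot = []
while len(boot) < 1000:
    idx = rng.integers(0, len(task_hours), len(task_hours))
    if succeeded[idx].min() == succeeded[idx].max():  # skip all-pass/all-fail resamples
        continue
    boot.append(horizon_50(task_hours[idx], succeeded[idx]))

print(f"50% horizon ≈ {point_estimate:.1f} h, 95% UCB ≈ {np.percentile(boot, 95):.1f} h")
```

The gap between the point estimate and the upper bound widens as the number of still-unsolved long tasks shrinks, which is the saturation problem the author describes.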

3. Policy

Anthropic's ASL-4 evaluations were maxed out by Opus 4.6; no-ASL-4 determination now rests on a 16-person internal researcher survey

Prior Anthropic models were ruled out of ASL-4 via quantitative benchmarks. Opus 4.6 saturated those evaluations, so the determination now depends on whether 16 staff believe the model meets the 'fully automate an entry-level Anthropic researcher' definition.

4. Financials

Human baselines for 50 new 32-hour tasks would cost over $1 million and require 3,200+ specialist hours

GPQA cost ~$100,000 to develop in 2024. The author estimates that human baselining for new long-horizon benchmarks costs an order of magnitude more, before any task-development work. Benchmarks built on serial timelines also risk being saturated before they are finished.
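A back-of-envelope version of that arithmetic is below; the two-attempts-per-task figure and the ~$325/hour specialist rate are my assumptions for illustration, not figures from the piece:

```python
# Back-of-envelope check of the baselining estimate. The two-attempts-per-task
# count and ~$325/hour specialist rate are assumptions, not source figures.
tasks, hours_per_task, attempts_per_task, hourly_rate = 50, 32, 2, 325

specialist_hours = tasks * hours_per_task * attempts_per_task   # 3,200 hours
baseline_cost = specialist_hours * hourly_rate                  # ~$1,040,000

print(f"{specialist_hours:,} specialist hours, ~${baseline_cost:,} before task development")
```

Under those assumptions the totals land at roughly 3,200 specialist hours and just over $1 million, consistent with the figures cited above.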

5. Insight

Author projects that by mid-2027, no 2026-era benchmark can rule out dangerous AI capabilities

This is the author's personal forward-looking projection — explicitly not METR's official position — based on current saturation trends. It implies safety evaluation must move beyond fixed benchmarks to alternative methodologies.

6. Strategy

Alternative evaluation methodologies — surveys, observational data — must supplement or replace benchmarks

METR is already exploring alternatives after finding its uplift methodology less reliable. The piece does not detail what alternative frameworks would look like at scale, but frames this shift as unavoidable.

7. Industry Update

New 2025 benchmarks launched to address saturation: τ2-Bench, MCP-Atlas, terminal-bench, Finance Agent

These represent a wave of more challenging agentic benchmarks by academic and industry teams. The author's thesis is that even these are now approaching saturation, illustrating how quickly the evaluation landscape is outpaced by model capability improvements.

Insight = author analysis (personal opinion), Research = empirical findings, Policy = safety frameworks, Financials = cost data, Strategy = proposed direction, Industry Update = new releases

What This Means

A METR-affiliated analyst, writing in a personal capacity, argues the field is approaching a structural crisis in AI safety evaluation: the benchmarks used to certify that frontier models do not pose dangerous risks are being saturated faster than replacements can be built or validated. If this analysis holds, safety evaluation may increasingly rely on qualitative researcher surveys rather than rigorous quantitative measurement — a significant weakening of the evidentiary basis for safety claims. This matters not just for researchers but for policymakers and AI developers whose safety commitments are operationalized through these exact evaluation frameworks.

Sentiment

Mostly alarmed at benchmark saturation and measurement crisis, with calls for better evals and coordination

Chris Painter (@ChrisPainterYup) · Head of policy @METR_Evals
Alarmed

Over the last year, dangerous capability evaluations have moved into a state where it's difficult to find any Q&A benchmark that models don't saturate. Work has had to shift toward... quick surveys of researchers... or... uplift studies.

Will Fithian (@wfithian) · UC Berkeley Statistics prof, AI risk researcher
Alarmed

Thank you Chris. It’s critical that people with your expertise speak out publicly. But why ignore the idea of a pause treaty, while making the full case for it?

Toby Ord (@tobyordoxford) · Senior Researcher at Oxford University, author of The Precipice
Concerned

METR found that half of the code written by AIs on a prominent benchmark which had been graded as correct would have actually been rejected by humans for inadequate quality. What does this mean for their famous time-horizon metric?

Youssef El Manssouri (@yoemsri) · Co-Founder & CEO @SesterceGroup
Concerned

The thermometer metaphor is the whole problem. We built capability before we built measurement. Now we're trying to calibrate instruments while the temperature keeps rising. That's not a policy failure, it's a sequencing failure baked in from the start.

Vaclav Milizé (@clwdbot) · Co-founder @mangoai, AI dev
Alarmed

the buried story isn't 14.5 hours. it's 'our task suite is nearly saturated.' METR is essentially saying: we broke our own yardstick. the 6-to-98-hour confidence interval isn't uncertainty about the model, it's uncertainty about the measurement. we're flying blind on capability growth right now.

Split

~70/30: alarmed at the evals crisis and calling for coordination vs. seeing saturation as a signal of rapid progress
