Ai2 Benchmarks Reveal AI Science Agents Far Behind Human Scientists
Summary
- Ai2 benchmarks expose a wide gap between science agent hype and real performance.
- Top frontier models complete only ~20% of harder DiscoveryWorld scientific tasks.
- Human scientists with advanced degrees complete the same tasks ~70% of the time.
- Benchmarks test end-to-end scientific reasoning, not just knowledge recall.
Details
Science agent announcements proliferate without rigorous supporting evidence
Teams publicly announce agents capable of designing experiments, writing code, and producing research papers. Ai2 argues the evidence behind such claims is typically weak, motivating the development of formal benchmarks.
ScienceWorld (2022): top models scored below 10% on elementary science experiments
Despite high marks on multiple-choice grade-school science exams, the same models scored below 10% when performing experiments in a virtual environment — exposing the gap between declarative knowledge ('book smarts') and applied reasoning ('street smarts').
By early 2025, frontier models reached scores in the low 80s on ScienceWorld
Real progress over three years, but top systems still haven't fully solved a 4th-grade science curriculum: ScienceWorld remains an unsolved benchmark despite significant model advancement.
DiscoveryWorld (2024): first benchmark for end-to-end scientific investigation by AI agents
Agents must form hypotheses, design experiments, run them, and analyze results, often over hundreds of in-game actions. The benchmark scores both task completion and adherence to scientific process, distinguishing genuine insight from lucky guessing (a minimal sketch of such a loop follows).
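To make the two-part scoring concrete, here is a minimal sketch of what an evaluation loop of this kind might look like. The environment and agent interfaces, the `process_milestones` signal, and the 500-action budget are illustrative assumptions for this sketch, not DiscoveryWorld's actual API.

```python
# Hypothetical sketch of an agent evaluation loop in a DiscoveryWorld-style
# environment. The env/agent interfaces below are assumptions for
# illustration; they are not the benchmark's real code.
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    task_completed: bool = False
    # Process score: which scientific-method steps the agent actually
    # performed (hypothesize, measure, control variables, conclude, ...).
    process_steps_hit: set = field(default_factory=set)


def run_episode(env, agent, max_actions: int = 500) -> EpisodeResult:
    """Run one task episode, scoring completion and process separately."""
    result = EpisodeResult()
    observation = env.reset()  # assumed: returns the initial observation
    for _ in range(max_actions):  # tasks can take hundreds of actions
        action = agent.act(observation)          # assumed agent interface
        observation, signals = env.step(action)  # assumed env interface
        # Credit any process milestones this action triggered, so a lucky
        # final answer alone cannot earn a full process score.
        result.process_steps_hit |= signals.get("process_milestones", set())
        if signals.get("task_completed"):
            result.task_completed = True
            break
    return result
```

Keeping the completion flag separate from the accumulated process milestones is what lets a benchmark tell an agent that reasoned its way to the answer apart from one that guessed correctly.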
Top AI completes ~20% of harder DiscoveryWorld tasks vs ~70% for human scientists
The ~50 percentage point gap directly challenges claims about AI matching human scientific capability. This gap persists despite frontier models' progress on the easier ScienceWorld benchmark.
DiscoveryWorld: 120 tasks, 8 domains, 3 difficulty levels, set on fictional Planet X
Domains include proteomics, rocket science, radioisotope dating, and epidemiology. Tasks are set in a fictional space colony so agents cannot simply recall answers from their training data. Parametric variations change the data, the solution, and the environment layout on each run (sketched below).
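One way to picture parametric variation: derive the data, the correct answer, and the layout from a per-run seed, so a memorized transcript from one run is useless on the next. The generator below is a hedged illustration of that mechanism; its names and structure are hypothetical, not DiscoveryWorld's actual implementation.

```python
# Illustrative sketch: seed-driven task generation, so each run presents
# different data, a different correct answer, and a different layout.
# Names and structure are hypothetical, not the benchmark's real code.
import random
from dataclasses import dataclass


@dataclass
class TaskInstance:
    measurements: list[float]  # the data the agent must collect and analyze
    solution: str              # the answer that counts as task completion
    layout_seed: int           # drives placement of sites, instruments, etc.


def generate_task(seed: int) -> TaskInstance:
    rng = random.Random(seed)
    # e.g., a radioisotope-dating task: the true age is drawn fresh per run
    true_age = rng.uniform(1_000, 100_000)
    # noisy readings around the true value stand in for collected samples
    measurements = [true_age * rng.gauss(1.0, 0.05) for _ in range(5)]
    return TaskInstance(
        measurements=measurements,
        solution=f"site age ~ {true_age:,.0f} years",
        layout_seed=rng.randrange(2**32),
    )
```

Because the solution itself changes with the seed, an agent cannot succeed by recalling a memorized answer; it has to take the measurements and reason from them.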
Ai2's Peter Jansen questions whether newly announced science agents show genuine progress
Jansen notes that if best-in-class systems a year ago couldn't solve most easy DiscoveryWorld problems, claims of dramatically better agents today warrant scrutiny. He led development of both ScienceWorld and DiscoveryWorld.
DiscoveryWorld paper cited ~80 times, covered by New Scientist since 2024
Growing adoption in the research community signals demand for rigorous evaluation frameworks as science agent development and marketing claims accelerate.
What This Means
The gap between AI science agent marketing and measurable performance is significant: frontier models in early 2025 still struggle to complete tasks that credentialed human scientists handle with relative ease. For AI developers, procurement teams, and researchers evaluating science agent tools, benchmarks like ScienceWorld and DiscoveryWorld provide a more honest signal than self-reported demos or cherry-picked results. As pressure mounts to deploy AI in scientific workflows, rigorous, manipulation-resistant evaluations are becoming critical infrastructure for the field.
