Ai2 Benchmarks Reveal AI Science Agents Far Behind Human Scientists
Summary
- Ai2 benchmarks expose a wide gap between science agent hype and real performance.
- Top frontier models complete only ~20% of harder DiscoveryWorld scientific tasks.
- Human scientists with advanced degrees complete the same tasks ~70% of the time.
- Benchmarks test end-to-end scientific reasoning, not just knowledge recall.
Details
Science agent announcements proliferate without rigorous supporting evidence
Teams publicly announce agents capable of designing experiments, writing code, and producing research papers. Ai2 argues the evidence behind such claims is typically weak, motivating the development of formal benchmarks.
ScienceWorld (2022): top models scored below 10% on elementary science experiments
Despite high marks on multiple-choice grade-school science exams, the same models scored below 10% when performing experiments in a virtual environment — exposing the gap between declarative knowledge ('book smarts') and applied reasoning ('street smarts').
By early 2025, frontier models reached scores in the low 80s on ScienceWorld
Real progress over three years, but top systems still haven't fully solved a 4th-grade science curriculum: ScienceWorld remains an unsolved benchmark despite significant model advancement.
DiscoveryWorld (2024): first benchmark for end-to-end scientific investigation by AI agents
Agents must form hypotheses, design experiments, run them, and analyze results, often over hundreds of in-game actions. The benchmark scores both task completion and adherence to scientific process, distinguishing genuine insight from lucky guessing (a minimal sketch of such a loop follows).
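To make the two-part scoring concrete, here is a minimal sketch of what an evaluation loop of this kind might look like. The environment and agent interfaces, the `process_milestones` signal, and the 500-action budget are illustrative assumptions for this sketch, not DiscoveryWorld's actual API.

```python
# Hypothetical sketch of an agent evaluation loop in a DiscoveryWorld-style
# environment. The env/agent interfaces below are assumptions for
# illustration; they are not the benchmark's real code.
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    task_completed: bool = False
    # Process score: which scientific-method steps the agent actually
    # performed (hypothesize, measure, control variables, conclude, ...).
    process_steps_hit: set = field(default_factory=set)


def run_episode(env, agent, max_actions: int = 500) -> EpisodeResult:
    """Run one task episode, scoring completion and process separately."""
    result = EpisodeResult()
    observation = env.reset()  # assumed: returns the initial observation
    for _ in range(max_actions):  # tasks can take hundreds of actions
        action = agent.act(observation)          # assumed agent interface
        observation, signals = env.step(action)  # assumed env interface
        # Credit any process milestones this action triggered, so a lucky
        # final answer alone cannot earn a full process score.
        result.process_steps_hit |= signals.get("process_milestones", set())
        if signals.get("task_completed"):
            result.task_completed = True
            break
    return result
```

Keeping the completion flag separate from the accumulated process milestones is what lets a benchmark tell an agent that reasoned its way to the answer apart from one that guessed correctly.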
Top AI completes ~20% of harder DiscoveryWorld tasks vs ~70% for human scientists
The ~50 percentage point gap directly challenges claims about AI matching human scientific capability. This gap persists despite frontier models' progress on the easier ScienceWorld benchmark.
DiscoveryWorld: 120 tasks, 8 domains, 3 difficulty levels, set on fictional Planet X
Domains include proteomics, rocket science, radioisotope dating, and epidemiology. Tasks are set in a fictional space colony so agents cannot simply recall answers from their training data. Parametric variations change the data, the solution, and the environment layout on each run (sketched below).
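One way to picture parametric variation: derive the data, the correct answer, and the layout from a per-run seed, so a memorized transcript from one run is useless on the next. The generator below is a hedged illustration of that mechanism; its names and structure are hypothetical, not DiscoveryWorld's actual implementation.

```python
# Illustrative sketch: seed-driven task generation, so each run presents
# different data, a different correct answer, and a different layout.
# Names and structure are hypothetical, not the benchmark's real code.
import random
from dataclasses import dataclass


@dataclass
class TaskInstance:
    measurements: list[float]  # the data the agent must collect and analyze
    solution: str              # the answer that counts as task completion
    layout_seed: int           # drives placement of sites, instruments, etc.


def generate_task(seed: int) -> TaskInstance:
    rng = random.Random(seed)
    # e.g., a radioisotope-dating task: the true age is drawn fresh per run
    true_age = rng.uniform(1_000, 100_000)
    # noisy readings around the true value stand in for collected samples
    measurements = [true_age * rng.gauss(1.0, 0.05) for _ in range(5)]
    return TaskInstance(
        measurements=measurements,
        solution=f"site age ~ {true_age:,.0f} years",
        layout_seed=rng.randrange(2**32),
    )
```

Because the solution itself changes with the seed, an agent cannot succeed by recalling a memorized answer; it has to take the measurements and reason from them.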
Ai2's Peter Jansen questions whether newly announced science agents show genuine progress
Jansen notes that if best-in-class systems a year ago couldn't solve most easy DiscoveryWorld problems, claims of dramatically better agents today warrant scrutiny. He led development of both ScienceWorld and DiscoveryWorld.
DiscoveryWorld paper cited ~80 times, covered by New Scientist since 2024
Growing adoption in the research community signals demand for rigorous evaluation frameworks as science agent development and marketing claims accelerate.
What This Means
The gap between AI science agent marketing and measurable performance is significant: frontier models in early 2025 still struggle to complete tasks that credentialed human scientists handle with relative ease. For AI developers, procurement teams, and researchers evaluating science agent tools, benchmarks like ScienceWorld and DiscoveryWorld provide a more honest signal than self-reported demos or cherry-picked results. As pressure mounts to deploy AI in scientific workflows, rigorous, manipulation-resistant evaluations are becoming critical infrastructure for the field.
