Harvey Open-Sources Legal Agent Benchmark (LAB) for Real-World Law Firm Tasks

harvey legal benchmarks agent-evaluation ai-agents open-source-release

Summary

• Harvey releases LAB: open-source benchmark with 1,200+ tasks across 24 legal practice areas
• 75,000+ expert-written rubric criteria evaluate long-horizon agent performance on real law firm workflows
• First benchmark designed to signal AI agent readiness for production legal deployment
• Leaderboard intentionally withheld at launch; research partner baselines expected in the coming weeks

Details

#	Type	Key Point	Context
1	Product Launch	LAB: 1,200+ tasks, 24 practice areas, fully open-source	Each task mirrors an actual law firm assignment: an instruction, a client matter with materials, and a requirement to produce a reviewable work product. LAB is open-sourced so that model providers, agent builders, researchers, and law firms can share a common evaluation framework.
2	Research	75,000+ expert-written rubric criteria enable rigorous, contextual evaluation	Legal quality is highly contextual and resists simple automated scoring, so each task carries its own expert-written rubric to allow nuanced assessment. LAB is described as substantially more comprehensive than any prior legal AI benchmark in scope and evaluation depth.
3	Strategy	Leaderboard withheld to prevent premature benchmark gaming	Harvey expects the dataset to evolve and wants community input before locking results. In the coming weeks, research partners will publish baseline results and standards for normalizing submissions, reflecting lessons learned from benchmark overfitting in other domains.
4	Insight	Explicit SWE-Bench analogy: benchmark as leading indicator of production readiness	Harvey cites Karpathy's observation that coding agents 'basically didn't work before December and basically work since.' Coding benchmark scores preceded and predicted real-world productivity shifts. LAB aims to provide the same legible signal for legal work — telling firms where agents are ready to deploy and where human-in-the-loop remains essential.
5	Industry Update	Domain-specific long-horizon benchmarks proliferating across professional services	GDPval, OSWorld-Verified, BrowseComp, MCP Atlas, FinanceAgent, Humanity's Last Exam, and APEX-Agents represent a broader shift: the field is moving past general reasoning evals toward measuring agents against specific professional workflows and ROI thresholds.

Product Launch = new tool released; Research = evaluation methodology detail; Strategy = deliberate positioning; Insight = analytical framing; Industry Update = broader sector trend
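
As a rough illustration of the task and rubric structure described in rows 1 and 2, the Python sketch below models a task as an instruction plus matter materials, graded against a list of rubric criteria. Everything in it is an assumption made for illustration: the class names, field names, sample task, and the scoring rule (fraction of criteria met) are hypothetical and are not taken from LAB's actual schema or grading method.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCriterion:
    """One expert-written check. Field names are hypothetical."""
    description: str
    met: bool = False  # set by a grader (human or model) after review


@dataclass
class LegalTask:
    """Hypothetical shape of one benchmark task, based on the summary above."""
    practice_area: str
    instruction: str
    matter_materials: list[str]
    criteria: list[RubricCriterion] = field(default_factory=list)


def rubric_score(task: LegalTask) -> float:
    """Fraction of criteria met. An assumed scoring rule, not LAB's own."""
    if not task.criteria:
        return 0.0
    return sum(c.met for c in task.criteria) / len(task.criteria)


# Invented sample task, graded by hand for the sake of the example.
task = LegalTask(
    practice_area="M&A",
    instruction="Draft a due diligence memo on change-of-control clauses.",
    matter_materials=["supply_agreement.pdf", "credit_facility.pdf"],
    criteria=[
        RubricCriterion("Identifies every change-of-control clause", met=True),
        RubricCriterion("Flags consent requirements", met=True),
        RubricCriterion("Cites the governing-law provision", met=False),
        RubricCriterion("Summarizes termination risk for the client", met=True),
    ],
)

print(f"{rubric_score(task):.2f}")  # 3 of 4 criteria met

# Headline figures from the summary, both stated as lower bounds.
average_criteria_per_task = 75_000 / 1_200
```

Taken at face value, the headline figures imply roughly 60 criteria per task (75,000 / 1,200 = 62.5), though both numbers are stated as lower bounds, so the true average may differ.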

What This Means

LAB provides the first credible, expert-validated benchmark for measuring whether legal agents are ready for production deployment — and which of 24 practice areas are closest to that threshold. Law firms evaluating AI vendors now have a shared, independent yardstick; model providers have a public target to optimize against.

Sources

Introducing Harvey's Legal Agent Benchmark (12 minute read), Harvey
