Harvey Open-Sources Legal Agent Benchmark (LAB) for Real-World Law Firm Tasks
Summary
- Harvey releases LAB: open-source benchmark with 1,200+ tasks across 24 legal practice areas
- 75,000+ expert-written rubric criteria evaluate long-horizon agent performance on real law firm workflows
- First benchmark designed to signal AI agent readiness for production legal deployment
- Leaderboard intentionally withheld at launch; research partner baselines coming in weeks
Details
LAB: 1,200+ tasks, 24 practice areas, fully open-source
Each task mirrors an actual law firm assignment: an instruction, a client matter with supporting materials, and a requirement to produce a reviewable work product. The benchmark is open-sourced so that model providers, agent builders, researchers, and law firms can share a common evaluation framework.
75,000+ expert-written rubric criteria enable rigorous, contextual evaluation
Legal quality is highly contextual and resists simple automated scoring, so each task carries its own expert-written rubric to allow nuanced assessment. Harvey positions LAB as substantially more comprehensive than prior legal AI benchmarks in both scope and evaluation depth.
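To make per-task rubric scoring concrete, here is a minimal Python sketch. It is not LAB's actual schema or scoring method, neither of which is described here: the `Task` and `Criterion` shapes, the `weight` field, and the weighted-fraction formula are all illustrative assumptions. It also assumes some reviewer (human or model) has already judged each criterion as met or not met.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One expert-written rubric line (hypothetical shape, not LAB's schema)."""
    description: str
    weight: float = 1.0  # assumed; the source does not say criteria are weighted


@dataclass
class Task:
    """Hypothetical task record: instruction, client-matter materials, rubric."""
    practice_area: str
    instruction: str
    materials: list[str]
    rubric: list[Criterion] = field(default_factory=list)


def rubric_score(task: Task, met: list[bool]) -> float:
    """Return the weighted fraction of rubric criteria judged as met (0.0 to 1.0)."""
    if len(met) != len(task.rubric):
        raise ValueError("need exactly one judgment per criterion")
    total = sum(c.weight for c in task.rubric)
    if total == 0:
        raise ValueError("rubric has no weight")
    earned = sum(c.weight for c, ok in zip(task.rubric, met) if ok)
    return earned / total


def average_criteria_per_task(total_criteria: int, total_tasks: int) -> float:
    """Average rubric size implied by the headline counts."""
    return total_criteria / total_tasks


if __name__ == "__main__":
    # Invented example task; the text below is placeholder content.
    task = Task(
        practice_area="example practice area",
        instruction="Draft a memo summarizing the key risks in the attached agreement.",
        materials=["agreement.pdf"],
        rubric=[
            Criterion("Identifies the governing-law clause", weight=2.0),
            Criterion("Flags the indemnification cap"),
            Criterion("Uses a reviewable memo structure"),
        ],
    )
    print(rubric_score(task, [True, False, True]))
    print(average_criteria_per_task(75_000, 1_200))
```

Because both headline figures are lower bounds (1,200+ tasks, 75,000+ criteria), the implied average of roughly 60 criteria per task is only an order-of-magnitude estimate.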
Leaderboard withheld to prevent premature benchmark gaming
Harvey expects the dataset to evolve and wants community input before locking results, a choice that reflects lessons learned from benchmark overfitting in other domains. In the coming weeks, research partners will publish baseline results and standards for normalizing submissions.
Explicit SWE-Bench analogy: benchmark as leading indicator of production readiness
Harvey cites Karpathy's observation that coding agents 'basically didn't work before December and basically work since.' Coding benchmark scores preceded and predicted real-world productivity shifts. LAB aims to provide the same legible signal for legal work — telling firms where agents are ready to deploy and where human-in-the-loop remains essential.
Domain-specific long-horizon benchmarks proliferating across professional services
GDPval, OSWorld-Verified, BrowseComp, MCP Atlas, FinanceAgent, Humanity's Last Exam, and APEX-Agents represent a broader shift: the field is moving past general reasoning evals toward measuring agents against specific professional workflows and ROI thresholds.
What This Means
LAB provides the first credible, expert-validated benchmark for measuring whether legal agents are ready for production deployment, and for showing which of the 24 practice areas are closest to that threshold. Law firms evaluating AI vendors now have a shared, independent yardstick; model providers have a public target to optimize against.
