Claw-Eval: End-to-End Benchmark for Real-World AI Agents
Summary
- Claw-Eval benchmark evaluates AI agents on 300 real-world tasks
- Pass^3 metric requires 3 independent successful runs to count as a pass
- Three splits: general, multimodal, and multi-turn conversational tasks
- Open-source under MIT License with public leaderboard on Hugging Face
Details
Pass^3 metric requires 3 successful runs to eliminate lucky results
A model must pass a task in all 3 independent trials. This separates genuinely reliable agents from occasionally capable ones and directly addresses the inconsistency of single-run benchmarks.
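To make the all-trials rule concrete, here is a minimal sketch of how such a score could be computed from per-task trial outcomes. The function names and the example data are hypothetical and not taken from the Claw-Eval codebase; the sketch only assumes that each trial reduces to a pass/fail boolean.

```python
from typing import Dict, List


def pass_all_k(outcomes_by_task: Dict[str, List[bool]], k: int = 3) -> float:
    """Share of tasks whose first k trials ALL succeeded (the strict rule)."""
    if not outcomes_by_task:
        return 0.0
    passed = 0
    for task_id, outcomes in outcomes_by_task.items():
        if len(outcomes) < k:
            raise ValueError(f"{task_id} has {len(outcomes)} trials, need {k}")
        if all(outcomes[:k]):
            passed += 1
    return passed / len(outcomes_by_task)


def pass_any_k(outcomes_by_task: Dict[str, List[bool]], k: int = 3) -> float:
    """Share of tasks with AT LEAST ONE success in the first k trials."""
    if not outcomes_by_task:
        return 0.0
    passed = sum(1 for outcomes in outcomes_by_task.values() if any(outcomes[:k]))
    return passed / len(outcomes_by_task)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "task-a": [True, True, True],     # reliable: counts under both rules
    "task-b": [True, False, True],    # flaky: counts only under "any"
    "task-c": [False, False, False],  # never passes
    "task-d": [True, True, True],
}

print(pass_all_k(example))  # strict all-trials score
print(pass_any_k(example))  # lenient at-least-once score
```

On this made-up data the strict score is 0.5 while the lenient score is 0.75: the flaky task is exactly what the all-trials rule filters out.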
300 tasks split: general (161), multimodal (101), multi-turn (38)
Full-trajectory auditing grades Completion, Safety, and Robustness; covers webpage generation, video QA, document extraction, and conversational tasks with simulated user personas
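As a quick check on the split sizes, the short sketch below confirms that the three counts add up to 300 and prints each split's share of the benchmark. The dictionary keys are illustrative labels, not identifiers from the dataset.

```python
# Task counts per split, as reported for the benchmark.
splits = {"general": 161, "multimodal": 101, "multi-turn": 38}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total tasks: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The general split makes up a little over half of the tasks, and the multi-turn split roughly one in eight.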
Built on OpenClaw, PinchBench, OfficeQA, OneMillion-Bench and others
Consolidates and extends prior real-world agent evaluation work from PKU and HKU; the codebase is being audited so the community can verify end-to-end reproducibility
CLI supports sandboxed parallel evaluation: --trials 3 --parallel 16
Available on Hugging Face (claw-eval/Claw-Eval) under MIT License; leaderboard and individual task cases at claw-eval.github.io
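The sketch below illustrates what "3 trials per task, up to 16 at a time" means as a scheduling pattern. It is not the Claw-Eval CLI or its internals: `run_all` and `fake_trial` are hypothetical names, a thread pool stands in for whatever workers the real tool uses, and no sandboxing is attempted here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_all(
    task_ids: List[str],
    run_trial: Callable[[str, int], bool],
    trials: int = 3,
    parallel: int = 16,
) -> Dict[str, List[bool]]:
    """Run `trials` independent trials per task with at most `parallel` in flight."""
    # One job per (task, trial index) pair.
    jobs = [(task_id, i) for task_id in task_ids for i in range(trials)]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map returns results in the same order as `jobs`.
        outcomes = list(pool.map(lambda job: run_trial(*job), jobs))
    results: Dict[str, List[bool]] = {task_id: [] for task_id in task_ids}
    for (task_id, _), ok in zip(jobs, outcomes):
        results[task_id].append(ok)
    return results


def fake_trial(task_id: str, trial_index: int) -> bool:
    """Stand-in for a real agent run: the 'flaky' task fails its second trial."""
    return not (task_id == "flaky" and trial_index == 1)


results = run_all(["stable", "flaky"], fake_trial, trials=3, parallel=16)
print(results)
```

Collecting every trial outcome per task, rather than stopping at the first success, is what allows an all-trials metric to be applied afterwards.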
What This Means
AI practitioners evaluating agents for production deployment now have a standardized, reproducible benchmark that requires consistent performance. Because a task counts only when all three trials succeed, the Pass^3 requirement makes it much harder to claim capability from a single evaluation run or a cherry-picked demo.
