Claw-Eval: End-to-End Benchmark for Real-World AI Agents

Research · 1 source · Apr 9

Summary

• Claw-Eval benchmark evaluates AI agents on 300 real-world tasks
• Pass^3 metric requires 3 independent successful runs to count as a pass
• Three splits: general, multimodal, and multi-turn conversational tasks
• Open-source under MIT License; available on Hugging Face, with a public leaderboard at claw-eval.github.io


Details

#	Type	Key Point	Context
1	Research	Pass^3 metric requires 3 successful runs to eliminate lucky results	Model must pass a task in all 3 independent trials; surfaces genuinely reliable agents versus occasionally capable ones, directly addressing single-run benchmark inconsistency
2	Research	300 tasks split: general (161), multimodal (101), multi-turn (38)	Full-trajectory auditing grades Completion, Safety, and Robustness; covers webpage generation, video QA, document extraction, and conversational tasks with simulated user personas
3	Context	Built on OpenClaw, PinchBench, OfficeQA, OneMillion-Bench and others	Consolidates and extends prior real-world agent evaluation work from PKU and HKU; the codebase is being audited so the community can verify end-to-end reproducibility
4	Tech Info	CLI supports sandboxed parallel evaluation: --trials 3 --parallel 16	Available on Hugging Face (claw-eval/Claw-Eval) under MIT License; leaderboard and individual task cases at claw-eval.github.io
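
The Pass^3 rule in row 1 can be illustrated with a short sketch. The source states only that a task counts as passed when all 3 independent trials succeed. The result layout (a dict of per-task booleans), the function names, and the choice to report the fraction of passing tasks are assumptions made for illustration, not Claw-Eval's actual code.

```python
from typing import Dict, List


def pass_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeeded (hypothetical helper)."""
    if not trials:
        raise ValueError("no task results given")
    passed = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"{task_id} has fewer than {k} trials")
        # A single failed trial is enough to fail the whole task.
        if all(outcomes[:k]):
            passed += 1
    return passed / len(trials)


def mean_single_run(trials: Dict[str, List[bool]]) -> float:
    """Average per-trial success rate, shown for comparison with pass_k."""
    total = sum(len(outcomes) for outcomes in trials.values())
    return sum(sum(outcomes) for outcomes in trials.values()) / total


# Made-up outcomes for four tasks, three trials each.
results = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [False, False, False],
    "task-d": [True, True, True],
}

print(f"Pass^3:          {pass_k(results):.2f}")
print(f"Mean single run: {mean_single_run(results):.2f}")
```

In this made-up data, task-b succeeds in two of three trials and still fails under Pass^3, so the strict score falls below the average single-run rate. If trials were independent with per-trial success probability p, the chance of passing all three would be p^3, so an agent that succeeds 80% of the time per trial would pass a given task only about 51% of the time.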


What This Means

AI practitioners evaluating agents for production deployment now have a standardized benchmark that rewards consistent performance. Because a task counts only when the agent passes all 3 independent trials, a single successful run or a cherry-picked demo is no longer enough to claim capability.

Sources

Claw-Eval Benchmark for AI Agents (GitHub Repo) · GitHub
