Claw-Eval: End-to-End Benchmark for Real-World AI Agents

Research · 1 source · Apr 9

Summary

  • Claw-Eval benchmark evaluates AI agents on 300 real-world tasks
  • Pass^3 metric requires 3 independent successful runs to count as a pass
  • Three splits: general, multimodal, and multi-turn conversational tasks
  • Open-source under MIT License with public leaderboard on Hugging Face

Details

1. Research

Pass^3 metric requires 3 successful runs to eliminate lucky results

A task counts as passed only if the model succeeds in all 3 independent trials. This separates genuinely reliable agents from occasionally capable ones and directly addresses the inconsistency of single-run benchmarks.
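
The all-trials rule can be illustrated with a minimal Python sketch. It is not Claw-Eval's own scoring code: the function names, the task IDs, and the `(task_id, passed)` input shape are assumptions made for this example.

```python
from collections import defaultdict


def pass_k(results, k=3):
    """Fraction of tasks passed in every one of k trials.

    `results` is an iterable of (task_id, passed) pairs, one per trial.
    This input shape is hypothetical, chosen only for illustration.
    """
    by_task = defaultdict(list)
    for task_id, passed in results:
        by_task[task_id].append(bool(passed))
    if not by_task:
        return 0.0
    for task_id, trials in by_task.items():
        if len(trials) != k:
            raise ValueError(f"task {task_id!r} has {len(trials)} trials, expected {k}")
    # A task counts only if all k of its trials succeeded.
    solved = sum(1 for trials in by_task.values() if all(trials))
    return solved / len(by_task)


def mean_trial_rate(results):
    """Average success rate over individual trials, shown for contrast."""
    outcomes = [bool(passed) for _, passed in results]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Made-up outcomes for three hypothetical tasks, three trials each.
runs = [
    ("task-a", True), ("task-a", True), ("task-a", True),
    ("task-b", True), ("task-b", True), ("task-b", False),
    ("task-c", False), ("task-c", False), ("task-c", False),
]
print(pass_k(runs))           # 1 of 3 tasks passes all trials
print(mean_trial_rate(runs))  # 5 of 9 individual trials succeed
```

In this made-up data, the hypothetical `task-b` succeeds in two of three trials, so it raises the per-trial average but contributes nothing under the all-trials rule.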

2. Research

300 tasks split: general (161), multimodal (101), multi-turn (38)

Full-trajectory auditing grades Completion, Safety, and Robustness; covers webpage generation, video QA, document extraction, and conversational tasks with simulated user personas

3. Context

Built on OpenClaw, PinchBench, OfficeQA, OneMillion-Bench and others

Consolidates and extends prior real-world agent evaluation work from PKU and HKU; the codebase is being audited so that the community can verify end-to-end reproducibility

4. Tech Info

CLI supports sandboxed parallel evaluation: --trials 3 --parallel 16

Available on Hugging Face (claw-eval/Claw-Eval) under MIT License; leaderboard and individual task cases at claw-eval.github.io
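
To make the two flag values concrete, here is a hedged Python sketch of what "3 trials per task, 16 at a time" might look like in general. It is not the Claw-Eval CLI or its internals, and it does no sandboxing: `run_all`, `fake_trial`, and the task names are hypothetical, and the stub stands in for a real agent run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(task_ids, run_trial, trials=3, parallel=16):
    """Run every task `trials` times with at most `parallel` jobs in flight.

    `run_trial(task_id, trial_index)` is a caller-supplied function
    (hypothetical here) that returns True when the trial succeeds.
    """
    jobs = [(task_id, i) for task_id in task_ids for i in range(trials)]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map yields results in the same order as `jobs`.
        outcomes = list(pool.map(lambda job: run_trial(*job), jobs))
    results = {}
    for (task_id, _), passed in zip(jobs, outcomes):
        results.setdefault(task_id, []).append(bool(passed))
    return results


def fake_trial(task_id, trial_index):
    """Deterministic stub standing in for a real agent run."""
    return not (task_id == "flaky" and trial_index == 1)


results = run_all(["steady", "flaky"], fake_trial, trials=3, parallel=16)
passed_all = [task for task, outcomes in results.items() if all(outcomes)]
print(results)
print(passed_all)
```

Collecting per-task outcome lists keeps the all-trials check a single `all(...)` call once every job has finished.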

Technical details of Claw-Eval's design, methodology, and evaluation infrastructure

What This Means

AI practitioners evaluating agents for production deployment now have a standardized, reproducible benchmark that rewards consistent performance. Because a task counts only when all three trials succeed, Pass^3 makes it significantly harder to claim capability from a single evaluation run or a cherry-picked demo.
