ITBench-AA: Frontier AI Models Score Below 50% on Enterprise IT Agentic Tasks
Summary
- Artificial Analysis and IBM launch ITBench-AA, a new agentic enterprise IT benchmark
- All frontier models score below 50% on Kubernetes incident response tasks
- Claude Opus 4.7 leads at 47%, with GPT-5.5 and Qwen3.7 Max close behind
- Longer agent trajectories don't improve accuracy; over-investigation increases false positives
Details
Artificial Analysis and IBM jointly launch ITBench-AA for agentic enterprise IT evaluation
Developed over six months, ITBench-AA is the first benchmark in a planned series targeting enterprise IT operations. IBM's Software Innovation Lab built the underlying dataset; Artificial Analysis contributed the frontier model evaluation framework. Site reliability engineering (SRE) is the launch domain, with FinOps and CISO tasks planned next.
All frontier models score below 50% on SRE Kubernetes incident diagnosis
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. This makes ITBench-AA SRE one of the least saturated agentic benchmarks currently available, indicating significant headroom before models approach production reliability.
Turn counts vary nearly 3x across models; longer trajectories reduce accuracy
GPT-5.5 (xhigh) averages 31 turns per task at 46% accuracy, while Gemini 3.1 Pro Preview averages 83 turns at only 30%. Over-investigation leads models to report upstream fault-injection mechanisms or co-occurring symptoms as root causes. These false positives are a critical failure mode for autonomous SRE agents.
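The "nearly 3x" figure can be checked directly from the two data points reported above. The following is a small arithmetic sketch; the dictionary layout and the function name `turn_ratio` are illustrative choices, and only the turn counts and accuracies come from the article.

```python
# Average turns and accuracy as reported in the article for two models.
reported = {
    "GPT-5.5 (xhigh)": {"avg_turns": 31, "accuracy": 0.46},
    "Gemini 3.1 Pro Preview": {"avg_turns": 83, "accuracy": 0.30},
}


def turn_ratio(a: str, b: str) -> float:
    """Return how many times more turns model `a` averages than model `b`."""
    return reported[a]["avg_turns"] / reported[b]["avg_turns"]


ratio = turn_ratio("Gemini 3.1 Pro Preview", "GPT-5.5 (xhigh)")
accuracy_gap = (
    reported["GPT-5.5 (xhigh)"]["accuracy"]
    - reported["Gemini 3.1 Pro Preview"]["accuracy"]
)

# Prints: turn ratio: 2.68x, accuracy gap: 16 points
print(f"turn ratio: {ratio:.2f}x, accuracy gap: {accuracy_gap * 100:.0f} points")
```

Two models are too few to establish a trend on their own; the calculation only shows that the longest reported trajectories did not buy higher accuracy.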
Open-weights leader GLM-5.1 reaches 40%, effectively tied with Gemini 3.5 Flash
GLM-5.1 (Reasoning) leads open-weights models at 40%, matching the proprietary Gemini 3.5 Flash (high). Among other open-weights models, DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38% and Gemma 4 31B (Reasoning) at 37%. The proprietary Gemini 3.1 Pro Preview trails all of these at 30%.
59 SRE tasks across public and held-out splits using Kubernetes incident snapshots
Tasks include 40 public and 19 brand-new held-out scenarios. Each provides a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology. Faults span resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
Strict scoring: models score 0.0 if any root cause is missed
Models must submit a minimal set of root-cause Kubernetes entities, which is matched against IBM-provided ground truth. If any ground-truth root cause is missed, the model scores 0.0 for that repeat. This mirrors real operational stakes, where an incomplete diagnosis equals failure.
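The strict rule above can be sketched as a small scoring function. This is an illustration under stated assumptions, not the official ITBench-AA scorer: the article confirms only that missing any ground-truth root cause scores 0.0. How extra entities are penalized is not stated, so the precision-style penalty here is an assumption, as are the function name `score_submission` and the entity names.

```python
def score_submission(predicted: set[str], ground_truth: set[str]) -> float:
    """Score one repeat of one task (illustrative sketch, not the official scorer)."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one root cause")

    # Rule stated in the article: missing any ground-truth root cause scores 0.0.
    if not ground_truth <= predicted:
        return 0.0

    # ASSUMPTION: extra (false-positive) entities lower the score via precision.
    # The article asks for a "minimal set" but does not give the exact penalty.
    return len(ground_truth) / len(predicted)


# Hypothetical entity names, for illustration only.
truth = {"deployment/checkout"}

exact = score_submission({"deployment/checkout"}, truth)
padded = score_submission({"deployment/checkout", "service/cart"}, truth)
missed = score_submission({"service/cart"}, truth)
```

Under these assumptions, `exact` is 1.0, `padded` is 0.5, and `missed` is 0.0. The padded case illustrates why over-investigation can hurt: naming a co-occurring symptom alongside the true root cause costs score even when the diagnosis is otherwise complete.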
Evaluation runs via Stirrup open-source harness with sandboxed shell access
Each model gets shell access to a sandboxed file system containing the relevant logs and snapshots. Tasks are capped at 100 turns, and each task is run 3 times to reduce variance. The open-source reference harness enables reproducibility and community extension.
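One plausible way to turn per-repeat scores into a single benchmark number is to average the repeats within each task and then average across tasks. The sketch below assumes that aggregation; the article states the 3 repeats and the 100-turn cap but not the exact formula. The names `benchmark_score`, `REPEATS_PER_TASK`, and the task IDs are hypothetical, and none of this is the Stirrup API.

```python
from statistics import mean

MAX_TURNS = 100        # per-task turn cap stated in the article
REPEATS_PER_TASK = 3   # repeats per task stated in the article


def benchmark_score(per_task_scores: dict[str, list[float]]) -> float:
    """Average repeats within each task, then average across tasks.

    ASSUMPTION: the article does not specify the aggregation; a mean of
    per-task means is used here purely for illustration.
    """
    if not per_task_scores:
        raise ValueError("no task scores provided")
    task_means = []
    for task_id, scores in per_task_scores.items():
        if len(scores) != REPEATS_PER_TASK:
            raise ValueError(
                f"{task_id}: expected {REPEATS_PER_TASK} repeats, got {len(scores)}"
            )
        task_means.append(mean(scores))
    return mean(task_means)


# Hypothetical task IDs and scores, for illustration only.
example = {
    "task-quota": [1.0, 0.0, 1.0],
    "task-rollout": [0.0, 0.0, 0.0],
}
overall = benchmark_score(example)
```

Here the first task averages 2/3 and the second 0, so `overall` is 1/3. Averaging over repeats smooths out run-to-run variation in agent behavior, which is the stated purpose of running each task 3 times.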
ITBench-AA roadmap expands to FinOps and CISO task domains beyond SRE
The SRE launch is the first in a series. Financial Operations (FinOps) and CISO task sets are next, broadening coverage across three high-stakes enterprise IT functions where autonomous AI agents are being actively deployed or evaluated.
What This Means
ITBench-AA exposes a significant capability gap: even the best frontier AI models cannot reliably diagnose production Kubernetes incidents, with the top performer reaching only 47% accuracy. For AI practitioners and enterprise IT teams, this is a grounding data point: autonomous SRE agents are not yet ready for unsupervised production deployment. The finding that more investigative turns do not yield better results, and in the reported cases coincide with worse ones, is particularly significant. It suggests that current agentic reasoning architectures struggle to know when to stop, which is a critical requirement for operational trustworthiness. As AI deployment in enterprise infrastructure accelerates, rigorous domain-specific benchmarks like this will be essential for setting realistic expectations and guiding model development.
