
ITBench-AA: Frontier AI Models Score Below 50% on Enterprise IT Agentic Tasks

Research · 1 source · 6d ago

Summary

  • Artificial Analysis and IBM launch ITBench-AA, a new agentic enterprise IT benchmark
  • All frontier models score below 50% on Kubernetes incident response tasks
  • Claude Opus 4.7 leads at 47%, with GPT-5.5 and Qwen3.7 Max close behind
  • Longer agent trajectories don't improve accuracy; over-investigation increases false positives

Details

1. Product Launch

Artificial Analysis and IBM jointly launch ITBench-AA for agentic enterprise IT evaluation

Developed over six months, ITBench-AA is the first benchmark in a planned series targeting enterprise IT operations. IBM's Software Innovation Lab built the underlying dataset; Artificial Analysis contributed the frontier model evaluation framework. SRE is the launch domain, with FinOps and CISO tasks planned next.

2. Research

All frontier models score below 50% on SRE Kubernetes incident diagnosis

Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. This makes ITBench-AA SRE one of the least saturated agentic benchmarks currently available, indicating significant headroom before models approach production reliability.

3. Stat

Turn counts vary nearly 3x across models; longer trajectories coincide with lower accuracy

GPT-5.5 (xhigh) averages 31 turns per task at 46% accuracy; Gemini 3.1 Pro Preview averages 83 turns at only 30%. Over-investigation causes models to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives — a critical failure mode for autonomous SRE agents.
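The "nearly 3x" figure can be checked from the two turn counts quoted above. This is a minimal sketch using only those two data points; treating them as the extremes across all evaluated models is an assumption, since the summary does not list turn counts for the other models.

```python
# Average turns per task and accuracy, as quoted in the summary above.
# Only these two models have turn counts reported here.
reported = {
    "GPT-5.5 (xhigh)": {"avg_turns": 31, "accuracy": 0.46},
    "Gemini 3.1 Pro Preview": {"avg_turns": 83, "accuracy": 0.30},
}


def turn_ratio(results: dict) -> float:
    """Ratio of the largest to the smallest average turn count."""
    turns = [r["avg_turns"] for r in results.values()]
    return max(turns) / min(turns)


def accuracy_gap_points(results: dict) -> float:
    """Accuracy gap, in percentage points, between the shortest-running
    and the longest-running model."""
    by_turns = sorted(results.values(), key=lambda r: r["avg_turns"])
    return (by_turns[0]["accuracy"] - by_turns[-1]["accuracy"]) * 100


print(f"turn ratio: {turn_ratio(reported):.2f}x")
print(f"accuracy gap: {accuracy_gap_points(reported):.0f} points")
```

Two data points show a contrast rather than a trend, so the calculation supports "nearly 3x" but does not by itself establish that longer trajectories cause lower accuracy.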

4. Stat

Open-weights leader GLM-5.1 reaches 40%, effectively tied with Gemini 3.5 Flash

GLM-5.1 (Reasoning) leads open-weights models at 40%, matching the proprietary Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38% and Gemma 4 31B (Reasoning) at 37%. For comparison, the proprietary Gemini 3.1 Pro Preview trails at 30%.

5. Tech Info

59 SRE tasks across public and held-out splits using Kubernetes incident snapshots

Tasks include 40 public and 19 brand-new held-out scenarios. Each provides a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology. Faults span resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
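One way to picture a task record is sketched below. The class and field names (`IncidentSnapshot`, `SreTask`, `split`, and so on) are hypothetical and chosen only to mirror the signal types listed above; the actual ITBench-AA file layout is not described in this summary.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class IncidentSnapshot:
    """Hypothetical container for the signals listed in the summary."""
    alerts: list = field(default_factory=list)
    events: list = field(default_factory=list)
    traces: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    logs: list = field(default_factory=list)
    topology: dict = field(default_factory=dict)


@dataclass
class SreTask:
    """Hypothetical task record: an ID, its split, and its snapshot."""
    task_id: str
    split: Literal["public", "held_out"]
    snapshot: IncidentSnapshot


def split_counts(tasks: list) -> dict:
    """Count tasks per split."""
    return dict(Counter(task.split for task in tasks))


# Placeholder tasks matching the reported 40 public / 19 held-out split.
tasks = [SreTask(f"public-{i}", "public", IncidentSnapshot()) for i in range(40)]
tasks += [SreTask(f"held-out-{i}", "held_out", IncidentSnapshot()) for i in range(19)]
print(split_counts(tasks), len(tasks))
```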

6. Tech Info

Strict scoring: models score 0.0 if any root cause is missed

Models must submit a minimal set of root-cause Kubernetes entities matched against IBM-provided ground truth. If any ground-truth root cause is missed, the model scores 0.0 for that repeat — mirroring real operational stakes where incomplete diagnosis equals failure.
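The all-or-nothing rule can be sketched as a small scoring function. Only the "any missed root cause scores 0.0" behavior comes from the description above. How extra, non-root-cause entities are penalized is not stated here, so the precision-style penalty below is an assumption, as are the entity names in the usage example.

```python
def score_repeat(predicted: set, ground_truth: set) -> float:
    """Score one repeat of one task.

    Stated rule: missing any ground-truth root cause yields 0.0.
    ASSUMPTION: when every root cause is found, extra entities reduce
    the score in proportion to precision. The summary says only that
    the submitted set should be "minimal".
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one root cause")
    if not ground_truth.issubset(predicted):
        return 0.0
    return len(ground_truth) / len(predicted)


# Hypothetical Kubernetes entity names, for illustration only.
truth = {"deployment/checkout", "resourcequota/team-a"}
print(score_repeat({"deployment/checkout"}, truth))
print(score_repeat(truth, truth))
print(score_repeat(truth | {"service/cart", "pod/cart-7f9"}, truth))
```

Under this sketch, a model that over-investigates and reports co-occurring symptoms alongside the true root causes is penalized, while a model that misses even one root cause gets nothing for that repeat.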

7. Infrastructure

Evaluation runs via Stirrup open-source harness with sandboxed shell access

Each model gets shell access to a sandboxed file system containing the relevant logs and snapshots. Tasks are capped at 100 turns, and each task is run 3 times to reduce variance. The open-source reference harness enables reproducibility and community extension.
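The turn cap and repeat structure can be sketched as a simple loop. This does not use Stirrup's actual API. The `agent_step` callable is a hypothetical stand-in for one agent turn, and two behaviors are assumptions: a run that hits the cap without answering is treated as having no answer, and the benchmark score is the mean of per-task means.

```python
from statistics import mean
from typing import Callable, Optional

MAX_TURNS = 100  # turn cap stated in the summary
REPEATS = 3      # repeats per task stated in the summary


def run_repeat(agent_step: Callable[[int], Optional[set]],
               max_turns: int = MAX_TURNS) -> tuple:
    """Call the agent once per turn until it answers or the cap is hit.

    `agent_step(turn)` is hypothetical: it returns None to keep
    investigating, or a set of root-cause entities to finish.
    """
    for turn in range(1, max_turns + 1):
        answer = agent_step(turn)
        if answer is not None:
            return answer, turn
    # ASSUMPTION: hitting the cap counts as submitting no answer.
    return None, max_turns


def benchmark_score(scores_by_task: dict) -> float:
    """ASSUMPTION: average repeats within a task, then average tasks."""
    return mean(mean(repeats) for repeats in scores_by_task.values())


# Toy agents: one answers on turn 5, the other never answers.
quick = lambda turn: {"deployment/checkout"} if turn == 5 else None
stuck = lambda turn: None
print(run_repeat(quick))
print(run_repeat(stuck))
print(benchmark_score({"task-a": [1.0, 0.0, 1.0], "task-b": [0.0, 0.0, 0.0]}))
```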

8. Strategy

ITBench-AA roadmap expands to FinOps and CISO task domains beyond SRE

The SRE launch is the first in a series. Financial Operations (FinOps) and CISO task sets are next, broadening coverage across three high-stakes enterprise IT functions where autonomous AI agents are being actively deployed or evaluated.

Research = study findings, Stat = quantitative result, Product Launch = new tool/benchmark, Tech Info = technical specification, Infrastructure = evaluation tooling, Strategy = roadmap/direction

What This Means

ITBench-AA exposes a significant capability gap: even the best frontier AI models cannot reliably diagnose production Kubernetes incidents, with the top performer reaching only 47% accuracy. For AI practitioners and enterprise IT teams, this is a grounding data point: autonomous SRE agents are not yet ready for unsupervised production deployment. The finding that more investigative turns correlate with worse performance is particularly notable. It suggests that current agentic reasoning architectures struggle to know when to stop, which is a critical requirement for operational trustworthiness. As AI deployment in enterprise infrastructure accelerates, rigorous domain-specific benchmarks like this will be essential for setting realistic expectations and guiding model development.
