ITBench-AA: Frontier AI Models Score Below 50% on Enterprise IT Agentic Tasks

Research · 1 source · 6d ago

benchmarks agent-evaluation ai-agents kubernetes anthropic claude enterprise-ai ibm

Summary

• Artificial Analysis and IBM launch ITBench-AA, a new agentic enterprise IT benchmark
• All frontier models score below 50% on Kubernetes incident response tasks
• Claude Opus 4.7 leads at 47%, with GPT-5.5 and Qwen3.7 Max close behind
• Longer agent trajectories don't improve accuracy — over-investigation increases false positives

Details

#	Type	Key Point	Context
1	Product Launch	Artificial Analysis and IBM jointly launch ITBench-AA for agentic enterprise IT evaluation	Developed over six months, ITBench-AA is the first benchmark in a planned series targeting enterprise IT operations. IBM's Software Innovation Lab built the underlying dataset; Artificial Analysis contributed the frontier model evaluation framework. SRE is the launch domain, with FinOps and CISO tasks planned next.
2	Research	All frontier models score below 50% on SRE Kubernetes incident diagnosis	Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. This makes ITBench-AA SRE one of the least saturated agentic benchmarks currently available, indicating significant headroom before models approach production reliability.
3	Stat	Turn counts vary nearly 3x across models; longer trajectories do not improve accuracy	GPT-5.5 (xhigh) averages 31 turns per task at 46% accuracy; Gemini 3.1 Pro Preview averages 83 turns at only 30%. Over-investigation leads models to report upstream fault-injection mechanisms or co-occurring symptoms as root causes, producing false positives. This is a critical failure mode for autonomous SRE agents.
4	Stat	Open-weights leader GLM-5.1 reaches 40%, tied with Gemini 3.5 Flash	GLM-5.1 (Reasoning) leads open-weights models at 40%, matching Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38% and Gemma 4 31B (Reasoning) at 37%. Gemini 3.1 Pro Preview trails these models at 30%.
5	Tech Info	59 SRE tasks across public and held-out splits using Kubernetes incident snapshots	Tasks include 40 public and 19 brand-new held-out scenarios. Each provides a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and application topology. Faults span resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
6	Tech Info	Strict scoring: models score 0.0 if any root cause is missed	Models must submit a minimal set of root-cause Kubernetes entities matched against IBM-provided ground truth. If any ground-truth root cause is missed, the model scores 0.0 for that repeat — mirroring real operational stakes where incomplete diagnosis equals failure.
7	Infrastructure	Evaluation runs via Stirrup open-source harness with sandboxed shell access	Each model gets shell access to a sandboxed file system containing the relevant logs and snapshots. Tasks are capped at 100 turns, and each task is run 3 times to reduce variance. The open-source reference harness enables reproducibility and community extension.
8	Strategy	ITBench-AA roadmap expands to FinOps and CISO task domains beyond SRE	The SRE launch is the first in a series. Financial Operations (FinOps) and CISO task sets are next, broadening coverage across three high-stakes enterprise IT functions where autonomous AI agents are being actively deployed or evaluated.

Research = study findings, Stat = quantitative result, Product Launch = new tool/benchmark, Tech Info = technical specification, Infrastructure = evaluation tooling, Strategy = roadmap/direction
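
The strict scoring rule in the Details table (row 6) can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scorer: the source only states that missing any ground-truth root cause yields 0.0 for that repeat and that models should submit a minimal set. The precision-style penalty for extra entities, the function names `score_repeat` and `score_task`, and the entity names are all assumptions made for illustration.

```python
from statistics import mean


def score_repeat(predicted: set[str], ground_truth: set[str]) -> float:
    """Score one repeat of one task."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one root cause")
    # Rule stated in the source: missing ANY ground-truth root cause gives 0.0.
    if not ground_truth <= predicted:
        return 0.0
    # ASSUMPTION (not from the source): extra entities are penalised by
    # precision, reflecting the "minimal set" requirement.
    return len(ground_truth) / len(predicted)


def score_task(repeats: list[set[str]], ground_truth: set[str]) -> float:
    """Average the per-repeat scores (the source reports 3 repeats per task)."""
    return mean(score_repeat(p, ground_truth) for p in repeats)


# Hypothetical Kubernetes entity names, for illustration only.
truth = {"deployment/checkout", "resourcequota/payments"}
repeats = [
    # Exact match: 1.0
    {"deployment/checkout", "resourcequota/payments"},
    # One root cause missed: 0.0
    {"deployment/checkout"},
    # Both found plus two false positives: 2 / 4 = 0.5 under the assumption
    {"deployment/checkout", "resourcequota/payments",
     "pod/chaos-injector", "service/frontend"},
]
print(score_task(repeats, truth))  # 0.5
```

Under this assumed penalty, the third repeat shows how over-investigation hurts: the model found every root cause but halved its score by also reporting a fault-injection pod and a co-occurring symptom.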

What This Means

ITBench-AA exposes a significant capability gap: even the best frontier AI models cannot reliably diagnose production Kubernetes incidents, with the top performer reaching only 47% accuracy. For AI practitioners and enterprise IT teams, this is a grounding data point: autonomous SRE agents are not yet ready for unsupervised production deployment. The finding that more investigative turns do not improve accuracy, and can coincide with worse results, is particularly significant. It suggests that current agentic models struggle to know when to stop, which is a critical requirement for operational trustworthiness. As AI deployment in enterprise infrastructure accelerates, rigorous domain-specific benchmarks like this will be essential for setting realistic expectations and guiding model development.
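
The turn-count comparison can be checked with simple arithmetic. The sketch below uses only the two data points reported in the source (average turns per task and accuracy). The "accuracy points per turn" figure is an illustrative derived number, not a metric the benchmark reports, and two models are far too few to establish a correlation on their own.

```python
# Figures reported in the source: (average turns per task, accuracy in %).
reported = {
    "GPT-5.5 (xhigh)": (31, 46),
    "Gemini 3.1 Pro Preview": (83, 30),
}


def turn_ratio(more: str, fewer: str) -> float:
    """How many times more turns one model uses than another."""
    return reported[more][0] / reported[fewer][0]


def points_per_turn(model: str) -> float:
    """Illustrative only: accuracy points divided by average turns."""
    turns, accuracy = reported[model]
    return accuracy / turns


ratio = turn_ratio("Gemini 3.1 Pro Preview", "GPT-5.5 (xhigh)")
print(round(ratio, 2))  # 2.68, i.e. "nearly 3x"

for name in reported:
    print(name, round(points_per_turn(name), 2))
# GPT-5.5 (xhigh) 1.48
# Gemini 3.1 Pro Preview 0.36
```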

Sources

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks, by Artificial Analysis and IBM (Hugging Face)
