Introducing ITBench-AA: Revolutionizing Site Reliability Engineering Benchmarking
Artificial Analysis, in collaboration with IBM Software Innovation Lab, has launched ITBench-AA, a benchmark suite for evaluating AI models on critical enterprise IT tasks. The first release focuses on Site Reliability Engineering (SRE), and its results show that even frontier models struggle, with all of them scoring below 50%.
Understanding Site Reliability Engineering Tasks
ITBench-AA benchmarks AI performance on Kubernetes incident response, a challenging domain where models must analyze logs, trace dependencies, and identify root-cause entities across complex infrastructures. The benchmark draws on IBM’s extensive experience in enterprise IT operations and uses a dataset designed specifically for this evaluation.
Key Findings from ITBench-AA
The initial results from the ITBench-AA SRE tasks are revealing:
- Model Performance: The leading model, Claude Opus 4.7, achieved a score of 47%, closely followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Notably, all frontier models scored below 50%, highlighting a significant gap in performance.
- Investigation Efficiency: Models exhibited varied turn counts, and longer interaction trajectories did not necessarily correlate with improved accuracy. For example, GPT-5.5 required an average of 31 turns for a 46% score, whereas Gemini 3.1 averaged 83 turns yet scored only 30%. This suggests that excessive investigation could lead to inaccuracies.
- Performance Comparison: Open weights models such as GLM-5.1 and Gemma 4 31B scored 40% and 37% respectively. However, models that adopted exhaustive investigation techniques often faced penalties.
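To make the turn-count comparison concrete, the short sketch below divides each quoted score by the quoted average number of turns. The "points per turn" ratio is only an illustrative figure introduced here, not a metric reported by ITBench-AA, and it uses just the two models for which both numbers are given above.

```python
# Figures quoted in the findings above: (average turns, score in percent).
results = {
    "GPT-5.5": (31, 46),
    "Gemini 3.1": (83, 30),
}


def points_per_turn(turns: int, score: float) -> float:
    """Illustrative ratio (not an ITBench-AA metric): score points per turn."""
    return score / turns


for model, (turns, score) in results.items():
    print(f"{model}: {points_per_turn(turns, score):.2f} points per turn")
# GPT-5.5: 1.48 points per turn
# Gemini 3.1: 0.36 points per turn
```

On these numbers, the shorter trajectory yields roughly four times as many score points per turn, which is consistent with the observation that longer investigations did not translate into higher accuracy.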
Overview of ITBench-AA SRE Tasks
ITBench-AA encompasses a total of 59 SRE tasks, which include:
- 40 public tasks and 19 new, held-out tasks.
- Each task presents a Kubernetes incident snapshot, including logs, traces, alerts, and metrics, challenging models to accurately identify independent root-cause entities.
The fault scenarios cover a wide array of typical SRE failure modes, such as infrastructure failures and resource quota exhaustion, testing models across various critical situations.
Methodology Details
The methodology of ITBench-AA is designed for clear and fair evaluation:
- Each task is run through the Stirrup reference harness, which gives models shell access to a sandboxed environment containing the relevant logs and snapshots.
- Models are required to submit a structured JSON diagnosis that identifies root-cause entities such as Kubernetes Deployments, Services, and Pods.
- Scoring is based on average precision at full recall, which rewards accurate diagnoses and penalizes false positives.
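The scoring rule can be illustrated with a minimal sketch. It assumes one plausible reading of "average precision at full recall": a submission earns its precision only if it contains every true root-cause entity, earns zero otherwise, and the benchmark score is the mean over tasks. The JSON field names (`root_causes`, `kind`, `name`) and the entity names are hypothetical; the actual ITBench-AA schema and scorer may differ.

```python
import json


def score_task(predicted, ground_truth) -> float:
    """Precision of a submission, credited only at full recall (assumed rule)."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    # Assumption: missing any true root cause means the task earns no credit.
    if not predicted or not ground_truth <= predicted:
        return 0.0
    # At full recall every true entity is in the submission, so
    # precision = |ground truth| / |submitted entities|.
    return len(ground_truth) / len(predicted)


def benchmark_score(results) -> float:
    """Mean per-task score over (predicted, ground_truth) pairs."""
    return sum(score_task(p, g) for p, g in results) / len(results)


# Hypothetical diagnosis payload; the field names are illustrative only.
diagnosis = json.loads(
    '{"root_causes": ['
    '{"kind": "Deployment", "name": "checkout"},'
    '{"kind": "Service", "name": "cart"}]}'
)
predicted = [(e["kind"], e["name"]) for e in diagnosis["root_causes"]]
truth = [("Deployment", "checkout")]

print(score_task(predicted, truth))  # 0.5: one correct entity out of two submitted
```

Under this reading, listing an extra unrelated entity halves the score even though the true root cause was found, which is why padding a diagnosis with additional candidates is costly.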
Highlights of the Benchmark
- Structured Investigations: Tasks require agents to analyze snapshots, reviewing alerts and logs to diagnose issues accurately. In one example, an agent investigating user-facing failures efficiently traced the issue to a network policy blocking a critical service.
- Impact of Turn Count: While some models engaged in lengthy explorations, their accuracy didn’t improve proportionally. Models that submitted excess irrelevant entities were penalized, underscoring the importance of precise root-cause identification without digressions.
- Cost-Effective Performance: Open weights models like Gemma 4 31B demonstrated competitive performance at a lower cost per task, showing that more economical models can remain competitive.
Collaboration with IBM
ITBench-AA is a partnership with IBM that draws on IBM's IT benchmarking expertise. The collaboration sets the stage for the framework to expand over time beyond SRE tasks into areas such as Financial Operations (FinOps) and responsibilities typically associated with a Chief Information Security Officer (CISO).
By focusing on agentic enterprise IT tasks, ITBench-AA aims to redefine performance standards for AI models in complex operational environments, ultimately refining the capabilities of future AI applications.

