ITBench-AA Report: Frontier Models Score Below 50% on Agentic Enterprise IT Benchmark Built With IBM

Introducing ITBench-AA: Revolutionizing Site Reliability Engineering Benchmarking

Artificial Analysis, in collaboration with IBM Software Innovation Lab, has launched ITBench-AA, a benchmark suite for evaluating AI models on enterprise IT tasks. The first release focuses on Site Reliability Engineering (SRE), and it shows that even frontier models struggle: every model tested scored below 50%.

Contents

Introducing ITBench-AA: Revolutionizing Site Reliability Engineering Benchmarking
Understanding Site Reliability Engineering Tasks
Key Findings from ITBench-AA
Overview of ITBench-AA SRE Tasks
Methodology Details
Highlights of the Benchmark
Collaboration with IBM

Understanding Site Reliability Engineering Tasks

ITBench-AA benchmarks AI performance on Kubernetes incident response, a demanding domain in which models must analyze logs, trace dependencies, and identify root-cause entities across complex infrastructure. The benchmark draws on IBM’s experience in enterprise IT operations and uses a dataset designed specifically for this evaluation.

Key Findings from ITBench-AA

The initial results from the ITBench-AA SRE tasks are revealing:

Model Performance: The leading model, Claude Opus 4.7, achieved a score of 47%, closely followed by GPT-5.5 at 46%, and Qwen3.7 Max at 42%. Notably, all frontier models scored below 50%, highlighting a significant gap in performance.
Investigation Efficiency: Turn counts varied widely across models, and longer trajectories did not translate into higher accuracy. For example, GPT-5.5 averaged 31 turns and scored 46%, whereas Gemini 3.1 averaged 83 turns and scored only 30%. This suggests that investigating longer does not by itself improve a diagnosis and may even hurt it.
Performance Comparison: Open-weights models such as GLM-5.1 and Gemma 4 31B scored 40% and 37% respectively. Models that investigated exhaustively and then submitted extra entities were often penalized in scoring.
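To make the turn-count comparison concrete, the short Python sketch below divides each reported score by the reported average number of turns. "Points per turn" is an illustrative ratio introduced here, not a metric that ITBench-AA reports, and only the two models with published turn counts above are included.

```python
# (model, average turns, score in percent), taken from the findings above.
results = [
    ("GPT-5.5", 31, 46),
    ("Gemini 3.1", 83, 30),
]


def points_per_turn(turns: int, score_pct: float) -> float:
    """Illustrative ratio: percentage points of score per investigation turn."""
    return score_pct / turns


for model, turns, score in results:
    print(f"{model}: {points_per_turn(turns, score):.2f} points per turn")
# GPT-5.5: 1.48 points per turn
# Gemini 3.1: 0.36 points per turn
```

On this rough measure, the shorter trajectory yields about four times as much score per turn, which is consistent with the observation that more turns did not buy more accuracy.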

Overview of ITBench-AA SRE Tasks

ITBench-AA comprises 59 SRE tasks:

40 public tasks and 19 new, held-out tasks.
Each task presents a Kubernetes incident snapshot, including logs, traces, alerts, and metrics, challenging models to accurately identify independent root-cause entities.

The fault scenarios cover a range of typical SRE failure modes, such as infrastructure failures and resource quota exhaustion.

Methodology Details

The methodology of ITBench-AA is designed for clear and fair evaluation:

Each task is run through the Stirrup reference harness, which gives the model shell access to a sandboxed environment containing the incident's logs and snapshot data.
Models must submit a structured JSON diagnosis that names the root-cause entities, such as Kubernetes Deployments, Services, and Pods.
Scoring is based on average precision at full recall, which rewards accurate diagnoses and penalizes false positives.
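The article does not spell out the diagnosis schema or the exact scoring rule, so the following Python sketch is only one plausible reading. It assumes a hypothetical JSON payload with a `root_causes` list of `kind`/`namespace`/`name` objects, and it assumes that a task earns its precision only when every ground-truth entity was found (full recall) and zero otherwise, with the benchmark score being the mean over tasks. The field names, the `Entity` type, and all function names are illustrative, not part of ITBench-AA or Stirrup.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A Kubernetes object named in a diagnosis (hypothetical shape)."""
    kind: str        # e.g. "Deployment", "Service", "Pod"
    namespace: str
    name: str


def parse_diagnosis(raw: str) -> set[Entity]:
    """Parse an assumed JSON diagnosis format into a set of entities."""
    payload = json.loads(raw)
    return {
        Entity(item["kind"], item["namespace"], item["name"])
        for item in payload["root_causes"]
    }


def precision_at_full_recall(predicted: set[Entity], truth: set[Entity]) -> float:
    """Assumed rule: precision counts only if all true root causes were found."""
    if not predicted or not (truth <= predicted):
        return 0.0  # at least one true root cause is missing
    return len(predicted & truth) / len(predicted)


def benchmark_score(per_task_scores: list[float]) -> float:
    """Average the per-task scores across the benchmark."""
    return sum(per_task_scores) / len(per_task_scores)


# Hypothetical incident: the true root cause is a single Deployment.
truth = {Entity("Deployment", "shop", "checkout")}

# The diagnosis finds it but also blames an unrelated Service.
raw = json.dumps({"root_causes": [
    {"kind": "Deployment", "namespace": "shop", "name": "checkout"},
    {"kind": "Service", "namespace": "shop", "name": "cart"},
]})

print(precision_at_full_recall(parse_diagnosis(raw), truth))  # 0.5
```

Under these assumptions, one extra entity halves the score for the task, and a diagnosis that misses the true root cause scores zero regardless of how few entities it lists. That matches the article's point that submitting irrelevant entities is penalized.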

Highlights of the Benchmark

Structured Investigations: Each task asks the agent to work through a snapshot, reviewing alerts and logs to diagnose the incident. In one example, an agent that started from user-facing failures traced the problem to a network policy blocking a critical service.
Impact of Turn Count: Some models explored at length, but their accuracy did not improve proportionally. Models that submitted irrelevant extra entities were penalized, which underlines the importance of identifying root causes precisely.
Cost-Effective Performance: Open-weights models such as Gemma 4 31B delivered competitive scores at a lower cost per task.

Collaboration with IBM

ITBench-AA is a partnership with IBM that draws on IBM's IT benchmarking expertise. Over time, the collaboration is intended to extend the framework beyond SRE to areas such as Financial Operations (FinOps) and tasks typically associated with a Chief Information Security Officer (CISO).

By focusing on agentic enterprise IT tasks, ITBench-AA aims to set a performance standard for AI models in complex operational environments.

