
ITBench-AA Report: Agentic Enterprise IT Models from IBM Fall Short with Scores Below 50% on Initial Benchmark — Insights from Artificial Analysis

aimodelkit
Last updated: May 27, 2026 7:00 pm

Introducing ITBench-AA: Revolutionizing Site Reliability Engineering Benchmarking

Artificial Analysis, in collaboration with IBM Software Innovation Lab, has launched ITBench-AA, a benchmark suite for evaluating AI models on critical enterprise IT tasks. The first release focuses on Site Reliability Engineering (SRE), and its results show that even frontier models struggle: every model tested scored below 50%.

Contents
  • Introducing ITBench-AA: Revolutionizing Site Reliability Engineering Benchmarking
  • Understanding Site Reliability Engineering Tasks
  • Key Findings from ITBench-AA
  • Overview of ITBench-AA SRE Tasks
  • Methodology Details
  • Highlights of the Benchmark
  • Collaboration with IBM

Understanding Site Reliability Engineering Tasks

ITBench-AA benchmarks AI performance on Kubernetes incident response, a challenging domain where models must analyze logs, trace dependencies, and identify root-cause entities across complex infrastructures. The benchmark draws on IBM’s extensive experience in enterprise IT operations and uses a dataset designed specifically for this evaluation.

Key Findings from ITBench-AA

The initial results from the ITBench-AA SRE tasks are revealing:

  1. Model Performance: The leading model, Claude Opus 4.7, achieved a score of 47%, closely followed by GPT-5.5 at 46%, and Qwen3.7 Max at 42%. Notably, all frontier models scored below 50%, highlighting a significant gap in performance.

  2. Investigation Efficiency: Turn counts varied widely across models, and longer interaction trajectories did not necessarily produce better accuracy. For example, GPT-5.5 averaged 31 turns for a 46% score, whereas Gemini 3.1 averaged 83 turns and scored only 30%. This suggests that a longer investigation does not by itself improve a diagnosis and may even hurt it.

  3. Performance Comparison: Open-weights models such as GLM-5.1 and Gemma 4 31B scored 40% and 37% respectively. However, models that investigated exhaustively were often penalized, because submitting irrelevant entities lowers the score.
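
The turn-count comparison above can be made concrete with a little arithmetic. The sketch below uses only the two score and turn figures quoted in this article; the "points per turn" ratio is an illustrative measure introduced here, not a metric reported by the benchmark.

```python
# Figures quoted in the article: (model, score in percent, average turns).
# Only these two models have turn counts stated in the text.
reported = [
    ("GPT-5.5", 46, 31),
    ("Gemini 3.1", 30, 83),
]


def points_per_turn(score_pct: float, avg_turns: float) -> float:
    """Score percentage points per average turn (illustrative ratio only)."""
    return score_pct / avg_turns


for model, score, turns in reported:
    ratio = points_per_turn(score, turns)
    print(f"{model}: {score}% in {turns} turns -> {ratio:.2f} points per turn")
```

On these figures, the shorter investigation earns roughly four times as many points per turn, which is consistent with the observation that extra turns did not translate into accuracy.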

Overview of ITBench-AA SRE Tasks

ITBench-AA encompasses a total of 59 SRE tasks, which include:

  • 40 public tasks and 19 new, held-out tasks.
  • Each task presents a Kubernetes incident snapshot, including logs, traces, alerts, and metrics, challenging models to accurately identify independent root-cause entities.

The fault scenarios cover a wide array of typical SRE failure modes, such as infrastructure failures and resource quota exhaustion, testing models across various critical situations.

Methodology Details

The methodology of ITBench-AA is designed for clear and fair evaluation:

  • Each task is run with the Stirrup reference harness, which gives models shell access to a sandboxed environment containing the relevant logs and snapshots.
  • Models must submit a structured JSON diagnosis that identifies the root-cause entities, such as Kubernetes Deployments, Services, and Pods.
  • Scoring is based on average precision at full recall, which rewards accurate diagnoses and penalizes false positives.
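
The article does not spell out the JSON schema or the exact scoring formula, so the following is only a sketch of one plausible reading. It assumes that a task earns credit only when every ground-truth entity is submitted (full recall), that the credit equals the precision of the submitted set, and that the benchmark score is the mean over tasks. The field names (`root_causes`, `kind`, `name`) and the entity names are hypothetical.

```python
import json


def score_task(submitted: set[str], truth: set[str]) -> float:
    """Assumed reading of 'precision at full recall' for a single task.

    Returns 0.0 unless every ground-truth entity was submitted; otherwise
    returns the fraction of submitted entities that are correct.
    """
    if not submitted or not truth <= submitted:
        return 0.0
    return len(truth & submitted) / len(submitted)


def benchmark_score(results: list[tuple[set[str], set[str]]]) -> float:
    """Mean of per-task scores over (submitted, truth) pairs."""
    return sum(score_task(s, t) for s, t in results) / len(results)


def entities(payload: str) -> set[str]:
    """Flatten a diagnosis payload into 'Kind/name' identifiers."""
    return {f'{e["kind"]}/{e["name"]}' for e in json.loads(payload)["root_causes"]}


# Hypothetical diagnosis payload; the schema is illustrative, not the real one.
diagnosis_json = json.dumps({
    "root_causes": [
        {"kind": "Deployment", "name": "checkout"},
        {"kind": "Service", "name": "payments"},
    ]
})

truth = {"Deployment/checkout"}
# One of the two submitted entities is correct, so precision is halved.
print(score_task(entities(diagnosis_json), truth))
```

Under this reading, each irrelevant entity added to a diagnosis dilutes the score, which matches the article's point that exhaustive submissions were penalized.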

Highlights of the Benchmark

  1. Structured Investigations: Tasks require agents to analyze snapshots, reviewing alerts and logs to diagnose issues accurately. For example, an agent faced with user-facing failures efficiently traced the issue to a network policy blocking a critical service.

  2. Impact of Turn Count: While some models engaged in lengthy explorations, their accuracy didn’t improve proportionally. Models submitting excess irrelevant entities were penalized, signifying the importance of precise root-cause identification without digressions.

  3. Cost-Effective Performance: Open-weights models like Gemma 4 31B delivered competitive performance at a lower cost per task, showing that economical models can stay close to frontier accuracy.

Collaboration with IBM

ITBench-AA is the product of a partnership with IBM and draws on IBM's IT benchmarking expertise. Over time, the collaboration is expected to extend the framework beyond SRE tasks to areas such as Financial Operations (FinOps) and responsibilities typically associated with a Chief Information Security Officer (CISO).

By focusing on agentic enterprise IT tasks, ITBench-AA aims to redefine performance standards for AI models in complex operational environments, ultimately refining the capabilities of future AI applications.

Inspired by: Source
