Introducing Terminal-Bench 2.0 and Harbor: Transforming AI Agent Evaluation
The world of autonomous AI agents just got a major upgrade. The developers behind Terminal-Bench, a benchmark suite for evaluating AI agent performance on terminal-based tasks, have released version 2.0. Accompanying it is Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments. The dual launch aims to tackle persistent challenges in AI evaluation, especially for agents operating autonomously in realistic developer settings.
Why Terminal-Bench 2.0 Matters
After its initial release in May 2025, Terminal-Bench 1.0 quickly became a default benchmark in the AI agent community, giving developers a way to measure agent performance in terminal environments. However, its wide-ranging scope introduced problems: community feedback flagged many tasks as poorly defined or broken by changes in external services, undermining the benchmark's reliability.
Terminal-Bench 2.0 addresses these concerns systematically. The new version comprises 89 rigorously validated tasks, each subjected to hours of manual and Large Language Model (LLM)-assisted review. The focus is on realism, clarity, and solvability, raising the performance bar while keeping tasks stable and reproducible. For example, the flaky download-youtube task, which depended on an unstable third-party API, has been removed or restructured.
Co-creator Alex Shaw notes that although the benchmark is harder, many users may find state-of-the-art (SOTA) performance comparable to 1.0, suggesting the improvement lies in task quality rather than a simple increase in difficulty.
Harbor: A Framework for Scalable Evaluations
The launch of Harbor is a significant addition to the AI agent evaluation toolkit. The framework lets developers scale tests across thousands of cloud containers, and it is compatible with container platforms such as Daytona and Modal. Harbor was tested internally during the development of Terminal-Bench 2.0, where it ran tens of thousands of rollouts.
Key Features of Harbor
Harbor stands out as a versatile framework that supports numerous features, including:
- Evaluation of Any Container-Installable Agent: This opens avenues for testing various agent architectures.
- Scalable Supervised Fine-Tuning and Reinforcement Learning Pipelines: It efficiently integrates fine-tuning methods suited to diverse models.
- Custom Benchmark Creation and Deployment: Developers can tailor benchmarks to fit specific needs.
- Full Integration with Terminal-Bench 2.0: This provides a cohesive system for evaluation and improvement.
Developers can access Harbor via its website, harborframework.com, which hosts complete documentation for testing agents and submitting them to a public leaderboard.
Initial Leaderboard Results: Who’s Leading the Pack?
Early results from the Terminal-Bench 2.0 leaderboard reveal tight competition. The standout performer so far is OpenAI's Codex CLI agent running GPT-5, with a 49.6% success rate. The current top five:
- Codex CLI (GPT-5) — 49.6%
- Codex CLI (GPT-5-Codex) — 44.3%
- OpenHands (GPT-5) — 43.8%
- Terminus 2 (GPT-5-Codex) — 43.4%
- Terminus 2 (Claude Sonnet 4.5) — 42.8%
This clustering of results demonstrates a high level of competition, with no single agent managing to solve more than half of the tasks presented.
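That clustering can be checked quickly from the shell; the values below are copied from the list above:

```bash
# Summarize the reported leaderboard scores: best, worst, and spread.
printf '%s\n' 49.6 44.3 43.8 43.4 42.8 | awk '
  NR == 1 { min = $1; max = $1 }
  { if ($1 > max) max = $1; if ($1 < min) min = $1 }
  END { printf "best: %.1f%%, worst: %.1f%%, spread: %.1f points\n", max, min, max - min }
'
# prints: best: 49.6%, worst: 42.8%, spread: 6.8 points
```

All five entries land within about seven points of one another, and none crosses the 50% mark.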
Submission Process: Join the Evaluation Wave
Developers eager to test or submit their agents can engage with Terminal-Bench 2.0 through simple command-line interface (CLI) commands. To join the leaderboard, researchers must complete five benchmark runs and send the submission details to the development team for verification.
```bash
harbor run -d terminal-bench@2.0 -m "
```
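Since a submission requires five runs, the documented command can be wrapped in a simple loop. This is a sketch, not an official workflow: the model identifier below is a placeholder, and any `harbor` behavior beyond the command shown above is not confirmed by this article. `DRY_RUN=echo` prints the commands instead of executing them:

```bash
# Sketch of the five benchmark runs required for a leaderboard submission.
# DRY_RUN=echo prints each command; set DRY_RUN= to execute for real.
# "example-model" is a placeholder, not a model named in this article.
DRY_RUN=echo
MODEL="example-model"
for i in 1 2 3 4 5; do
  $DRY_RUN harbor run -d terminal-bench@2.0 -m "$MODEL"
done
```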
The integration of Terminal-Bench 2.0 into research workflows is already taking shape, focusing on fields such as agentic reasoning, tool use, and code generation. According to co-creator Mike Merrill, ongoing research efforts will soon produce a detailed preprint on the verification processes and methodologies behind the benchmark's design.
Towards Standardized Evaluation Across AI
The simultaneous launch of Terminal-Bench 2.0 and Harbor represents a pivotal step toward a more standardized and scalable framework for evaluating AI agents. As LLM agents proliferate in development and operational environments, reliable and reproducible testing methods are essential.
These comprehensive tools not only offer improvements in benchmarking and evaluation but also lay the groundwork for a unified stack that can support ongoing enhancements across the diverse AI ecosystem. With these advancements, the quest for robust, efficient AI is set to reach new heights.

