Terminal-Bench 2.0 is released with Harbor, a new framework for testing agents inside containers



The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents in real-world terminal-based tasks, have released version 2.0 with Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

This dual release aims to address long-standing pain points in testing and optimizing AI agents, especially those built to operate autonomously in realistic development environments.

With a more difficult and more rigorously validated set of tasks, Terminal-Bench 2.0 replaces version 1.0 as the standard for evaluating the agentic capabilities of frontier models.

The accompanying runtime framework, Harbor, lets developers and researchers scale evaluation across thousands of cloud containers and integrate with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had when we were making Terminal-Bench,” co-creator Alex Shaw wrote on X. “It is intended for agent, model, and benchmark developers and researchers who want to evaluate and improve their agents and models.”

Higher standards, cleaner data

Terminal-Bench 1.0 gained popularity rapidly after its release in May 2025 and became the default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with the system through the command line, mimicking how developers work under the hood rather than through a graphical user interface.

However, the original suite was not without problems. The community identified several tasks as underspecified or unstable because they depended on external services that changed over time.

Version 2.0 directly addresses these issues. The updated suite includes 89 tasks, each of which underwent several hours of manual and LLM-assisted validation. The focus is on raising the difficulty ceiling while improving reliability and reproducibility, ensuring that every task is solvable, realistic, and clearly specified.

A notable example is download-youtube, a task that was removed or refactored in 2.0 because it relied on an unstable third-party API.

“Astute terminal bench fans may find that SOTA’s performance is comparable to TB1.0, despite our claims that TB2.0 is more difficult,” Shaw wrote on X. “We believe this is because the quality of the tasks is significantly higher in the new benchmark.”

Harbor: large-scale rollouts in the cloud

In parallel with the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in containers deployed in the cloud.

Harbor supports large-scale rollout infrastructure and is compatible with leading providers, including Daytona and Modal.

Harbor is designed to be generalizable across agent architectures and supports:

  • Evaluating agents that can be installed in containers

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Creating and deploying custom benchmarks

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to perform tens of thousands of rollouts during the creation of new benchmarks. It is now publicly available via harborframework.com and includes documentation for testing your agent and submitting it to public leaderboards.
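
As a rough sketch of that workflow, the commands below show what a quick local evaluation might look like, based on the submission command the team published (reproduced in the "Submission and use" section below). The pip package name and the single-attempt smoke test are assumptions, so follow the instructions at harborframework.com for the exact steps.

# Install Harbor (package name assumed; see harborframework.com for the official instructions)
pip install harbor

# Single-attempt smoke test against the Terminal-Bench 2.0 dataset.
# The -d/-m/-a flags mirror the published submission command; "<model>" and
# "<agent>" are placeholders for your own model and agent identifiers.
harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 1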

Early results: GPT-5 leads in task success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI’s Codex CLI (command-line interface) agent running GPT-5 in the lead with a success rate of 49.6%, the highest of any agent tested to date.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 agent results (Terminal Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The tight clustering among the top models indicates active competition between platforms, with no single agent solving more than half of the tasks.

Submission and use

To test or submit an agent, users install Harbor and run benchmarks using simple CLI commands. Submission to leaderboards requires five benchmark runs, and results can be emailed to developers along with the job directory for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
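
In that command, --n-attempts 5 produces the five required benchmark runs, --jobs-dir sets the job directory that is shared for validation, and "<model>" and "<agent>" are placeholders for the model and agent being evaluated.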

Terminal-Bench 2.0 is already integrated into research workflows focused on agent inference, code generation, and tool usage. A detailed preprint covering the validation process and design methodology behind the benchmark is in the works, said co-creator Mike Merrill, a postdoctoral fellow at Stanford University.

Aiming for standardization

The combined release of Terminal-Bench 2.0 and Harbor represents a step toward a more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in development and production environments, the need for controlled and reproducible testing grows.

These tools provide a potential foundation for a unified evaluation stack, supporting standardization of model improvement, environment simulation, and benchmarking across the AI ecosystem.
