New Benchmark & Framework Aim to Standardize AI Agent Evaluation
SAN FRANCISCO, CA – A new iteration of the Terminal-Bench benchmark, version 2.0, has launched alongside Harbor, a framework designed for large-scale agent testing and evaluation in cloud-deployed containers. The release aims to address the growing need for consistent and reproducible testing as Large Language Model (LLM) agents become increasingly prevalent in developer and operational environments.
Terminal-Bench 2.0 presents a more challenging set of tasks than its predecessor, version 1.0. However, initial results show comparable performance, which the benchmark’s creators attribute to a substantial increase in task quality. “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” noted Alex Shaw on X. “We believe this is as task quality is substantially higher in the new benchmark.”
Harbor supports evaluation of any container-installable agent and is compatible with major cloud providers, including Daytona and Modal. Key features include scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines, custom benchmark creation and deployment, and full integration with Terminal-Bench 2.0. The framework was used internally to run tens of thousands of rollouts during the development of the new benchmark and is now publicly available at harborframework.com.
Early results from the Terminal-Bench 2.0 leaderboard place OpenAI’s Codex CLI (command-line interface), powered by GPT-5, in the lead with a 49.6% success rate. Other top performers include additional GPT-5 variants and agents based on Claude Sonnet 4.5.
Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5) – 49.6%
- Codex CLI (GPT-5-Codex) – 44.3%
- OpenHands (GPT-5) – 43.8%
- Terminus 2 (GPT-5-Codex) – 43.4%
- Terminus 2 (Claude Sonnet 4.5) – 42.8%
Users can test or submit agents by installing Harbor and using simple CLI commands. Submissions to the public leaderboard require five benchmark runs, with results and job directories submitted for validation. An example submission command is shown below.
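The invocation below is the project's published example command; the line break and inline comments are added here for readability only, and the bracketed values are placeholders for your own model, agent, and output directory:

```sh
# Run the Terminal-Bench 2.0 dataset through Harbor with five attempts per task,
# as required for a public leaderboard submission. The job directories written
# to --jobs-dir are what get submitted for validation.
# Replace <model>, <agent>, and <path/to/output> with your own values.
harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" \
  --n-attempts 5 --jobs-dir <path/to/output>
```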
According to Mike Merrill, a postdoctoral researcher at Stanford and co-creator of Terminal-Bench, a detailed preprint outlining the benchmark’s verification process and design methodology is currently in progress. The combined release of Terminal-Bench 2.0 and Harbor represents a move toward a unified evaluation stack for the AI ecosystem, supporting model enhancement, habitat simulation, and benchmark standardization.