
Terminal-Bench 2.0 & Harbor: AI Agent Evaluation Framework

by Rachel Kim – Technology Editor

New Benchmark & Framework Aim to Standardize AI Agent Evaluation

SAN FRANCISCO, CA – A new iteration of the Terminal-Bench benchmark, version 2.0, has launched alongside Harbor, a framework designed for large-scale agent testing and evaluation in cloud-deployed containers. The release aims to address the growing need for consistent and reproducible testing as Large Language Model (LLM) agents become increasingly prevalent in developer and operational environments.

Terminal-Bench 2.0 presents a more challenging set of tasks than its predecessor, version 1.0. Even so, initial results show comparable performance, which the benchmark’s creators attribute to a substantial increase in task quality. “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” noted Alex Shaw on X. “We believe this is as task quality is substantially higher in the new benchmark.”

Harbor supports evaluation of any container-installable agent and is compatible with major cloud providers including Daytona and Modal. Key features include scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines, custom benchmark creation and deployment, and full integration with Terminal-Bench 2.0. The framework was used internally to run tens of thousands of rollouts during the development of the new benchmark and is now publicly available at harborframework.com.

Early results from the Terminal-Bench 2.0 leaderboard place OpenAI’s Codex CLI (command-line interface), powered by GPT-5, in the lead with a 49.6% success rate. Other top performers include additional GPT-5 variants and agents based on Claude Sonnet 4.5.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) – 49.6%
  2. Codex CLI (GPT-5-Codex) – 44.3%
  3. OpenHands (GPT-5) – 43.8%
  4. Terminus 2 (GPT-5-Codex) – 43.4%
  5. Terminus 2 (Claude Sonnet 4.5) – 42.8%

Users can test or submit agents by installing Harbor and using simple CLI commands. Submissions to the public leaderboard require five benchmark runs, with results and job directories submitted for validation. An example command is: harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>.
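For illustration, a submission run might be launched as shown below. The model name, agent name, and output directory are placeholder values chosen for this sketch, not configurations documented by the Terminal-Bench team; only the flags come from the command above.

# model, agent, and jobs directory below are illustrative placeholders
harbor run -d terminal-bench@2.0 -m "openai/gpt-5" -a "codex" --n-attempts 5 --jobs-dir ./tb2-submission

The five attempts requested via --n-attempts correspond to the five runs required for a public leaderboard entry, and the directory passed to --jobs-dir holds the results and job data that are then submitted for validation.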

According to Mike Merrill, a postdoctoral researcher at Stanford and co-creator of Terminal-Bench, a detailed preprint outlining the benchmark’s verification process and design methodology is currently in progress. The combined release of Terminal-Bench 2.0 and Harbor represents a move toward a unified evaluation stack for the AI ecosystem, supporting model enhancement, habitat simulation, and benchmark standardization.
