New Benchmark & Framework Aim to Standardize AI Agent Evaluation
SAN FRANCISCO, CA – A new iteration of the Terminal-Bench benchmark, version 2.0, has launched alongside Harbor, a framework designed for large-scale agent testing and evaluation in cloud-deployed containers. The release aims to address the growing need for consistent and reproducible testing as Large Language Model (LLM) agents become increasingly prevalent in developer and operational environments.
Terminal-Bench 2.0 presents a more challenging set of tasks than its predecessor, version 1.0. However, initial results show comparable performance, which the benchmark’s creators attribute to a substantial increase in task quality. “Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” noted Alex Shaw on X. “We believe this is as task quality is substantially higher in the new benchmark.”
Harbor supports evaluation of any container-installable agent and is compatible with major cloud providers, including Daytona and Modal. Key features include scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines, custom benchmark creation and deployment, and full integration with Terminal-Bench 2.0. The framework was used internally to run tens of thousands of rollouts during the development of the new benchmark and is now publicly available at harborframework.com.
Early results from the Terminal-Bench 2.0 leaderboard place OpenAI’s Codex CLI (command-line interface), powered by GPT-5, in the lead with a 49.6% success rate. Other top performers include additional GPT-5 variants and agents based on Claude Sonnet 4.5.
Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5) – 49.6%
- Codex CLI (GPT-5-Codex) – 44.3%
- OpenHands (GPT-5) – 43.8%
- Terminus 2 (GPT-5-Codex) – 43.4%
- Terminus 2 (Claude Sonnet 4.5) – 42.8%
Users can test or submit agents by installing Harbor and using simple CLI commands. Submissions to the public leaderboard require five benchmark runs, with results and job directories submitted for validation. An example submission command is shown below.
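The invocation below is the project's published example command; the line break and inline comments are added here for readability only, and the bracketed values are placeholders for your own model, agent, and output directory:

```sh
# Run the Terminal-Bench 2.0 dataset through Harbor with five attempts per task,
# as required for a public leaderboard submission. The job directories written
# to --jobs-dir are what get submitted for validation.
# Replace <model>, <agent>, and <path/to/output> with your own values.
harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" \
  --n-attempts 5 --jobs-dir <path/to/output>
```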
According to Mike Merrill, a postdoctoral researcher at Stanford and co-creator of Terminal-Bench, a detailed preprint outlining the benchmark’s verification process and design methodology is currently in progress. The combined release of Terminal-Bench 2.0 and Harbor represents a move toward a unified evaluation stack for the AI ecosystem, supporting model enhancement, habitat simulation, and benchmark standardization.