Codesota · Benchmark · Terminal-Bench 2
Agentic-coding lineage · Docker-sandboxed terminal tasks · Oct 2025

Terminal-Bench.

152 hand-built terminal tasks — devops, data, SWE, scientific computing — each scored by container-internal unit tests. This is where agentic-coding evaluation moved past SWE-bench: the harness, the shell environment, and the model are measured together as one system. You cannot win this leaderboard with raw model intelligence alone.

Lineage status · Active · TB2 live since Oct 2025 · TB3 + TB-Science in development

Terminal-Bench 1 (80 tasks) was retired Oct 2025 in favour of TB2 (152 tasks). Submissions cannot modify timeouts or resources, and results are verified by the Terminal-Bench team. Top entry: Codex on GPT-5.5 at 82.0%.

§ 01 · Lineage

From code patches to terminal sessions.

SWE-bench measured the model. SWE-bench Pro fixed contamination but still scored single patches. Terminal-Bench treats the agent + harness + shell as one system — closer to how engineers actually deploy these tools.

SWE-bench Verified
Aug 2024
Saturating

500 human-filtered GitHub issues, model-only scoring. OpenAI publicly stopped reporting it in Sep 2025, citing contamination.

SWE-bench Pro
Sep 2025
Active

Scale AI's contamination-controlled successor. 1,865 problems, held-out splits. Frontier drops to ~23%.

Terminal-Bench 2
Oct 2025
Active

Real terminal/devops/data/SWE tasks in a Docker sandbox. Agent-coupled — the harness counts.

§ 02 · SOTA

64.9% → 82.0%, two tracks.

Each dot is a record-setting agent + model pair on TB2. The closed/API line drives the frontier; open-weight harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail it by roughly 30 points.


API · latest · Apr 2026 · Codex + GPT-5.5 · 82.0%
Open · latest · Jan 2026 · OpenHands + Claude Opus 4.5 · 51.9%
Frontier gap · 30.1pp
[Fig 2 chart · Closed/API record line: 64.9 → 69.4 → 77.3 → 81.8 → 82.0 (ending with Codex) · Open-weight record line: 43.8 → 46.5 → 51.9 (ending with OpenHands) · x-axis Oct 2025 to Apr 2026, y-axis 0–100%]
Fig 2 · TB2 SOTA progression by record-setting agent (not model). Open-weight harnesses pair with closed models too — what's tracked here is whether the agent code is open.
§ 03 · Leaderboard

Top agents, by harness.

Top-30 closed and key open entries from the live tbench.ai TB2 leaderboard. The same model appears multiple times paired with different agents; that's the point: the harness changes the outcome (see the grouping sketch after the table).


Source · tbench.ai/leaderboard/terminal-bench/2.0
Snapshot · 2026-04-27
Rows · 39 of 124
# · Agent · Model · Org · Type · Submitted · Score · CI
01 · Codex · GPT-5.5 · OpenAI · API · Apr 2026 · 82.0 · ±2.2
02 · ForgeCode · GPT-5.4 · OpenAI · API · Mar 2026 · 81.8 · ±2.0
03 · TongAgents · Gemini 3.1 Pro · Google · API · Mar 2026 · 80.2 · ±2.6
04 · ForgeCode · Claude Opus 4.6 · Anthropic · API · Mar 2026 · 79.8 · ±1.6
05 · SageAgent · GPT-5.3-Codex · OpenAI · API · Mar 2026 · 78.4 · ±2.2
06 · ForgeCode · Gemini 3.1 Pro · Google · API · Mar 2026 · 78.4 · ±1.8
07 · Droid · GPT-5.3-Codex · OpenAI · API · Feb 2026 · 77.3 · ±2.2
08 · Capy · Claude Opus 4.6 · Anthropic · API · Mar 2026 · 75.3 · ±2.4
09 · Simple Codex · GPT-5.3-Codex · OpenAI · API · Feb 2026 · 75.1 · ±2.4
10 · Terminus-KIRA · Gemini 3.1 Pro · Google · API · Feb 2026 · 74.8 · ±2.6
11 · Terminus-KIRA · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 74.7 · ±2.6
12 · Mux · GPT-5.3-Codex · OpenAI · API · Mar 2026 · 74.6 · ±2.5
13 · MAYA-V2 · Claude Opus 4.6 · Anthropic · API · Mar 2026 · 72.1 · ±2.2
14 · TongAgents · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 71.9 · ±2.7
15 · Junie CLI · Multiple · JetBrains · API · Mar 2026 · 71.0 · ±2.9
16 · CodeBrain-1 · GPT-5.3-Codex · OpenAI · API · Feb 2026 · 70.3 · ±2.6
17 · Droid · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 69.9 · ±2.5
18 · Ante · Gemini 3 Pro · Google · API · Jan 2026 · 69.4 · ±2.1
19 · IndusAGI · GPT-5.3-Codex · OpenAI · API · Mar 2026 · 69.1 · ±2.3
20 · Crux · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 66.9 · —
21 · Deep Agents · GPT-5.2-Codex · OpenAI · API · Feb 2026 · 66.5 · ±3.1
22 · Mux · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 66.5 · ±2.5
23 · SageAgent · Gemini 3 Pro · Google · API · Feb 2026 · 65.2 · ±2.1
24 · Droid · GPT-5.2 · OpenAI · API · Dec 2025 · 64.9 · ±2.8
25 · Terminus 2 · GPT-5.3-Codex · OpenAI · API · Feb 2026 · 64.7 · ±2.7
26 · Junie CLI · Gemini 3 Flash · Google · API · Dec 2025 · 64.3 · ±2.8
27 · Droid · Claude Opus 4.5 · Anthropic · API · Dec 2025 · 63.1 · ±2.7
28 · Terminus 2 · Claude Opus 4.6 · Anthropic · API · Feb 2026 · 62.9 · ±2.7
29 · Codex CLI · GPT-5.2 · OpenAI · API · Dec 2025 · 62.9 · ±3.0
30 · Warp · Multiple · Warp · API · Dec 2025 · 61.2 · ±3.0
51 · OpenHands · Claude Opus 4.5 · Anthropic · OSS · Jan 2026 · 51.9 · ±2.9
58 · CAMEL-AI · Claude Sonnet 4.5 · Anthropic · OSS · Dec 2025 · 46.5 · ±2.4
60 · OpenHands · GPT-5 · OpenAI · OSS · Nov 2025 · 43.8 · ±3.0
68 · OpenHands · Claude Sonnet 4.5 · Anthropic · OSS · Nov 2025 · 42.6 · ±2.8
69 · Mini-SWE-Agent · Claude Sonnet 4.5 · Anthropic · OSS · Nov 2025 · 42.5 · ±2.8
71 · Mini-SWE-Agent · GPT-5-Codex · OpenAI · OSS · Nov 2025 · 41.3 · ±2.8
75 · OpenHands · Claude Opus 4.1 · Anthropic · OSS · Nov 2025 · 36.9 · ±2.7
81 · Mini-SWE-Agent · Claude Opus 4.1 · Anthropic · OSS · Nov 2025 · 35.1 · ±2.5
84 · Mini-SWE-Agent · GPT-5 · OpenAI · OSS · Nov 2025 · 33.9 · ±2.9
Fig 3 · CI = 95% confidence interval of the accuracy estimate. OSS rows mean the agent code is open; the model paired with it can be closed or open. The frontier agent + model gap is 30.1 points.
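
To see the harness effect directly, group the snapshot by model. A minimal sketch in Python over a hand-transcribed subset of the rows above (illustrative only; the full snapshot has 39 rows):

```python
from collections import defaultdict

# (agent, model, score) triples transcribed from the TB2 snapshot above.
# Subset shown for illustration.
rows = [
    ("ForgeCode", "Claude Opus 4.6", 79.8),
    ("Capy", "Claude Opus 4.6", 75.3),
    ("Terminus-KIRA", "Claude Opus 4.6", 74.7),
    ("Terminus 2", "Claude Opus 4.6", 62.9),
    ("SageAgent", "GPT-5.3-Codex", 78.4),
    ("Droid", "GPT-5.3-Codex", 77.3),
    ("Terminus 2", "GPT-5.3-Codex", 64.7),
]

by_model = defaultdict(list)
for agent, model, score in rows:
    by_model[model].append((score, agent))

for model, entries in by_model.items():
    entries.sort(reverse=True)          # highest score first
    best, worst = entries[0], entries[-1]
    spread = best[0] - worst[0]
    print(f"{model}: {worst[0]:.1f}-{best[0]:.1f} "
          f"({spread:.1f}pp spread, {worst[1]} -> {best[1]})")
```

Even on this subset, the same model swings by roughly 14 to 17 points depending on the harness it is paired with.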
§ 04 · Open vs closed

Harness gap is 30.1 points.

Open-weight harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail proprietary harnesses (Codex, ForgeCode, TongAgents). Even when the underlying model is identical, harness engineering moves the score by 20+ points.

Open harness avg · 41.6% · 9 rows · top: OpenHands (51.9%)
Closed harness avg · 71.2% · 30 rows · top: Codex (82.0%)
Frontier gap · 30.1pp · Codex − OpenHands
§ 05 · Methodology

The agent counts too.

Containerised tasks

Every task ships as a Docker image with seeded files, optional services, and a hidden test harness. The agent sees a shell; it cannot peek at the tests.
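
A minimal sketch of how container-internal scoring can work, assuming a hypothetical layout where the seeded files live under /app and the hidden tests under /opt/tb/tests. This is illustrative; TB2's actual task schema is not shown here:

```python
import subprocess

def score_task(container_id: str) -> bool:
    """Run the hidden tests inside the task container after the agent's
    shell session has ended (the container stays up for scoring).
    /opt/tb/tests is a hypothetical path the agent never sees referenced."""
    try:
        result = subprocess.run(
            ["docker", "exec", container_id,
             "python", "-m", "pytest", "/opt/tb/tests", "-q"],
            capture_output=True, text=True, timeout=600,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung test run counts as a failure
    return result.returncode == 0  # pass/fail is all the harness records
```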

Fixed budgets

Submissions cannot modify per-task timeouts or resources. This forces ceiling comparisons — no winning by buying compute.
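
A sketch of what pinned budgets could look like on the harness side. The task names and config fields are hypothetical; the point is that the budget table belongs to the harness, not to the submission:

```python
import subprocess

# Per-task budgets fixed by the benchmark; read-only for submissions.
TASK_BUDGETS = {
    "build-kernel-module":     {"timeout_s": 1800, "memory": "4g", "cpus": "2"},
    "fix-broken-cron-schedule": {"timeout_s": 300, "memory": "1g", "cpus": "1"},
}

def launch(task: str, image: str) -> str:
    budget = TASK_BUDGETS[task]
    out = subprocess.run(
        ["docker", "run", "-d",
         "--memory", budget["memory"],  # hard memory cap
         "--cpus", budget["cpus"],      # hard CPU cap
         image],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()  # container id; timeout_s enforced by the harness clock
```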

Agent-coupled

The score reflects the harness, the prompt scaffold, and the underlying model together. A great model with a poor harness loses to a worse model with a better harness.
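
Harness engineering is everything outside the model call. In the minimal loop below, `call_model` and `shell` are hypothetical stand-ins; the system prompt, the turn limit, and the output-truncation policy are all harness choices that move the score without touching the model:

```python
def run_agent(task_prompt: str, shell, call_model, max_turns: int = 50) -> None:
    history = [
        {"role": "system", "content": (
            "You are operating a Linux shell. Reply with exactly one command "
            "per turn. Reply DONE when the task is complete."
        )},
        {"role": "user", "content": task_prompt},
    ]
    for _ in range(max_turns):          # turn budget: a harness design choice
        command = call_model(history)   # model picks the next shell command
        if command.strip() == "DONE":
            return
        output = shell.run(command)     # executed inside the task container
        history.append({"role": "assistant", "content": command})
        history.append({"role": "user", "content": output[-4000:]})  # truncation policy
```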

Verified runs

Top submissions are independently re-run by Terminal-Bench team members. CI bars represent the run-to-run variance.
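
A sketch of the run-to-run reading of those bars, assuming several independent re-runs per submission each yielding an overall score (the exact TB2 procedure is not specified here). Normal-approximation 95% CI in Python:

```python
import statistics as stats

def ci95(run_scores: list[float]) -> tuple[float, float]:
    """Mean and 95% CI half-width across repeated runs, in percentage points."""
    mean = stats.mean(run_scores)
    sem = stats.stdev(run_scores) / len(run_scores) ** 0.5  # std error of the mean
    return mean, 1.96 * sem

mean, half = ci95([79.5, 84.1, 81.3, 83.0, 82.1])  # illustrative run scores
print(f"{mean:.1f} ±{half:.1f}")  # -> 82.0 ±1.5
```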

§ 06 · Resources

Papers and code.

Key papers
Repositories
