152 hand-built terminal tasks spanning devops, data, SWE, and scientific computing, each scored by container-internal unit tests. This is where agentic-coding evaluation moved past SWE-bench: the harness, the shell environment, and the model are all measured together. You cannot win this leaderboard on raw model intelligence alone.
Terminal-Bench 1 (80 tasks) was retired Oct 2025 in favour of TB2 (152 tasks). Submissions cannot modify timeouts or resources, and results are verified by the Terminal-Bench team. Top entry: Codex on GPT-5.5 at 82.0%.
SWE-bench measured the model. SWE-bench Pro fixed contamination but still scored single patches. Terminal-Bench treats the agent + harness + shell as one system — closer to how engineers actually deploy these tools.
**SWE-bench:** 500 human-filtered GitHub issues, model-only scoring. OpenAI publicly stopped reporting it in Sep 2025 because of contamination.
**SWE-bench Pro:** Scale AI's contamination-controlled successor. 1,865 problems with held-out splits. Frontier scores drop to ~23%.
**Terminal-Bench:** Real terminal/devops/data/SWE tasks in a Docker sandbox. Agent-coupled: the harness counts.
Each dot is a record-setting agent + model pair on TB2. The closed/API line drives the frontier; open-source harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail by roughly 30 points.
Top-30 closed and key open entries from the live tbench.ai TB2 leaderboard. The same model appears multiple times paired with different agents — that's the point: harness changes the outcome.
| # | Agent | Model | Org | Type | Submitted | Score | CI |
|---|---|---|---|---|---|---|---|
| 01 | Codex | GPT-5.5 | OpenAI | API | Apr 2026 | 82.0 | ±2.2 |
| 02 | ForgeCode | GPT-5.4 | OpenAI | API | Mar 2026 | 81.8 | ±2.0 |
| 03 | TongAgents | Gemini 3.1 Pro | Google | API | Mar 2026 | 80.2 | ±2.6 |
| 04 | ForgeCode | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 79.8 | ±1.6 |
| 05 | SageAgent | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 78.4 | ±2.2 |
| 06 | ForgeCode | Gemini 3.1 Pro | Google | API | Mar 2026 | 78.4 | ±1.8 |
| 07 | Droid | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 77.3 | ±2.2 |
| 08 | Capy | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 75.3 | ±2.4 |
| 09 | Simple Codex | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 75.1 | ±2.4 |
| 10 | Terminus-KIRA | Gemini 3.1 Pro | Google | API | Feb 2026 | 74.8 | ±2.6 |
| 11 | Terminus-KIRA | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 74.7 | ±2.6 |
| 12 | Mux | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 74.6 | ±2.5 |
| 13 | MAYA-V2 | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 72.1 | ±2.2 |
| 14 | TongAgents | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 71.9 | ±2.7 |
| 15 | Junie CLI | Multiple | JetBrains | API | Mar 2026 | 71.0 | ±2.9 |
| 16 | CodeBrain-1 | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 70.3 | ±2.6 |
| 17 | Droid | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 69.9 | ±2.5 |
| 18 | Ante | Gemini 3 Pro | Google | API | Jan 2026 | 69.4 | ±2.1 |
| 19 | IndusAGI | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 69.1 | ±2.3 |
| 20 | Crux | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 66.9 | — |
| 21 | Deep Agents | GPT-5.2-Codex | OpenAI | API | Feb 2026 | 66.5 | ±3.1 |
| 22 | Mux | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 66.5 | ±2.5 |
| 23 | SageAgent | Gemini 3 Pro | Google | API | Feb 2026 | 65.2 | ±2.1 |
| 24 | Droid | GPT-5.2 | OpenAI | API | Dec 2025 | 64.9 | ±2.8 |
| 25 | Terminus 2 | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 64.7 | ±2.7 |
| 26 | Junie CLI | Gemini 3 Flash | Google | API | Dec 2025 | 64.3 | ±2.8 |
| 27 | Droid | Claude Opus 4.5 | Anthropic | API | Dec 2025 | 63.1 | ±2.7 |
| 28 | Terminus 2 | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 62.9 | ±2.7 |
| 29 | Codex CLI | GPT-5.2 | OpenAI | API | Dec 2025 | 62.9 | ±3.0 |
| 30 | Warp | Multiple | Warp | API | Dec 2025 | 61.2 | ±3.0 |
| 51 | OpenHands | Claude Opus 4.5 | Anthropic | OSS | Jan 2026 | 51.9 | ±2.9 |
| 58 | CAMEL-AI | Claude Sonnet 4.5 | Anthropic | OSS | Dec 2025 | 46.5 | ±2.4 |
| 60 | OpenHands | GPT-5 | OpenAI | OSS | Nov 2025 | 43.8 | ±3.0 |
| 68 | OpenHands | Claude Sonnet 4.5 | Anthropic | OSS | Nov 2025 | 42.6 | ±2.8 |
| 69 | Mini-SWE-Agent | Claude Sonnet 4.5 | Anthropic | OSS | Nov 2025 | 42.5 | ±2.8 |
| 71 | Mini-SWE-Agent | GPT-5-Codex | OpenAI | OSS | Nov 2025 | 41.3 | ±2.8 |
| 75 | OpenHands | Claude Opus 4.1 | Anthropic | OSS | Nov 2025 | 36.9 | ±2.7 |
| 81 | Mini-SWE-Agent | Claude Opus 4.1 | Anthropic | OSS | Nov 2025 | 35.1 | ±2.5 |
| 84 | Mini-SWE-Agent | GPT-5 | OpenAI | OSS | Nov 2025 | 33.9 | ±2.9 |
Open-source harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail proprietary harnesses (Codex, ForgeCode, TongAgents). Even when the underlying model is identical, harness engineering moves the score by 20+ points.
Every task ships as a Docker image with seeded files, optional services, and a hidden test harness. The agent sees a shell; it cannot peek at the tests.
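What "hidden test harness" means in practice: the checks live inside the image and run against the container's final state after the agent's session ends. The sketch below is a hypothetical pytest checker, not an actual TB2 task; the task, file paths, and assertions are invented for illustration.

```python
# tests/test_task.py -- hypothetical hidden checker, baked into the image but
# kept out of the agent's working directory and run after the session ends.
# The task ("install a `wordfreq` CLI and write /app/report.json") is invented.
import json
import pathlib
import subprocess


def test_cli_installed():
    # Only observable container state is checked, never the agent's transcript.
    out = subprocess.run(["wordfreq", "--version"], capture_output=True, text=True)
    assert out.returncode == 0


def test_report_schema():
    # Verify the agent produced the requested artifact with the expected fields.
    report = json.loads(pathlib.Path("/app/report.json").read_text())
    assert report["top_word"] == "the"
    assert report["count"] > 0
```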
Submissions cannot modify per-task timeouts or resource limits, so every entry is compared under the same ceiling: no winning by buying extra compute.
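A rough sketch of how a runner could pin that budget, assuming a Docker-based executor; the limit values, flags, and function shape are illustrative, not Terminal-Bench's actual configuration.

```python
# Hypothetical runner-side enforcement of the fixed per-task budget; the numbers
# are made up, only the mechanism (runner-owned caps) is the point.
import subprocess

TASK_TIMEOUT_S = 900   # wall-clock budget the submission is not allowed to raise
MEMORY_LIMIT = "4g"    # container memory cap
CPU_LIMIT = "2"        # container CPU cap


def run_task(image: str, agent_cmd: list[str]) -> bool:
    """Run one task container under the benchmark-fixed limits."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm",
             f"--memory={MEMORY_LIMIT}", f"--cpus={CPU_LIMIT}",
             image, *agent_cmd],
            timeout=TASK_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False   # blowing the budget counts as a failed task
    return proc.returncode == 0
```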
The score reflects the harness, the prompt scaffold, and the underlying model together. A great model with a poor harness loses to a worse model with a better harness.
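To make "harness" concrete, here is a minimal command-loop scaffold of the kind these agents wrap around a model. `llm_complete` is a hypothetical stand-in for the provider call; real harnesses differ exactly in the parts flagged in the comments.

```python
# Minimal sketch of an agent loop: ask the model for the next shell command,
# run it, feed the output back. Not any specific leaderboard entry's code.
import subprocess


def llm_complete(messages: list[dict]) -> str:
    """Hypothetical stand-in for whatever model API the harness calls."""
    raise NotImplementedError


def run_agent(task_instruction: str, max_steps: int = 30) -> None:
    messages = [
        {"role": "system", "content": "Reply with exactly one shell command per "
                                      "turn, or the single word DONE when finished."},
        {"role": "user", "content": task_instruction},
    ]
    for _ in range(max_steps):
        command = llm_complete(messages).strip()
        if command == "DONE":
            return
        try:
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=120)
            observation = (result.stdout + result.stderr)[-4000:]
        except subprocess.TimeoutExpired:
            observation = "[command timed out]"
        # Everything below is "prompt scaffold": what gets echoed back, how the
        # output is truncated, retry rules. This is where harnesses diverge.
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user", "content": observation})
```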
Top submissions are independently re-run by Terminal-Bench team members. CI bars represent the run-to-run variance.
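One plausible way such a bar could be derived, assuming several independent re-runs of the same agent + model pair; the scores below are invented and this is not necessarily the team's exact procedure.

```python
# Mean pass rate and a normal-approximation 95% half-width across re-runs.
import statistics


def ci_halfwidth(run_scores: list[float]) -> tuple[float, float]:
    """run_scores: per-run percentage of tasks passed."""
    mean = statistics.mean(run_scores)
    stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, 1.96 * stderr


mean, half = ci_halfwidth([79.5, 84.1, 82.3, 80.2, 83.9])
print(f"{mean:.1f} ± {half:.1f}")   # 82.0 ± 1.8 for these made-up runs
```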