Codesota · Agentic AI · Autonomous Coding · Terminal-Bench 2.0
Autonomous Coding · benchmark dataset · 2026 · TERMINAL

Terminal-Bench 2.0.

Terminal-agent benchmark for software engineering, machine learning, security, data science, system administration, file operations, and related terminal workflows. Scores measure the agent harness and underlying model as one system.

Paper · Download dataset · Submit a result
§ 01 · Leaderboard

Best published scores.

20 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 20 rows
#  | Model                                  | Type | Org           | Submitted | Paper / code            | accuracy
01 | Codex / GPT-5.5                        | API  | OpenAI        | Apr 2026  | terminal-bench-official | 82.00
02 | ForgeCode / GPT-5.4                    | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 81.80
03 | TongAgents / Gemini 3.1 Pro            | API  | TongAgents    | Apr 2026  | terminal-bench-official | 80.20
04 | ForgeCode / Claude Opus 4.6            | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 79.80
05 | SageAgent / GPT-5.3-Codex              | API  | SageAgent     | Apr 2026  | terminal-bench-official | 78.40
06 | ForgeCode / Gemini 3.1 Pro             | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 78.40
07 | Droid / GPT-5.3-Codex                  | API  | Droid         | Apr 2026  | terminal-bench-official | 77.30
08 | Capy / Claude Opus 4.6                 | API  | Capy          | Apr 2026  | terminal-bench-official | 75.30
09 | Simple Codex / GPT-5.3-Codex           | API  | OpenAI        | Apr 2026  | terminal-bench-official | 75.10
10 | Terminus-KIRA / Gemini 3.1 Pro         | API  | Terminus-KIRA | Apr 2026  | terminal-bench-official | 74.80
11 | Terminus-KIRA / Claude Opus 4.6        | API  | Terminus-KIRA | Apr 2026  | terminal-bench-official | 74.70
12 | Mux / GPT-5.3-Codex                    | API  | Mux           | Apr 2026  | terminal-bench-official | 74.60
13 | MAYA-V2 / Claude 4.6 Opus              | API  | MAYA          | Apr 2026  | terminal-bench-official | 72.10
14 | TongAgents / Claude Opus 4.6           | API  | TongAgents    | Apr 2026  | terminal-bench-official | 71.90
15 | Junie CLI / Multiple                   | API  | JetBrains     | Apr 2026  | terminal-bench-official | 71.00
16 | CodeBrain-1 / GPT-5.3-Codex            | API  | CodeBrain     | Apr 2026  | terminal-bench-official | 70.30
17 | Droid / Claude Opus 4.6                | API  | Droid         | Apr 2026  | terminal-bench-official | 69.90
18 | Ante / Gemini 3 Pro                    | API  | Ante          | Apr 2026  | terminal-bench-official | 69.40
19 | IndusAGI Coding Agent / GPT-5.3-Codex  | API  | IndusAGI      | Apr 2026  | terminal-bench-official | 69.10
20 | Crux / Claude Opus 4.6                 | API  | Crux          | Apr 2026  | terminal-bench-official | 66.90
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

1 step
of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Apr 27, 2026 · Codex / GPT-5.5 · OpenAI · 82.00
Fig 3 · SOTA-setting models only. 1 entry, dated Apr 2026.
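The record-breaking filter behind the progress line above can be sketched as a running maximum over date-sorted leaderboard rows. The entries below are illustrative samples, not the full table, and the intermediate dates are placeholders (only Apr 27 appears on this page):

```python
# Sketch: derive a SOTA progress line from leaderboard entries by
# sorting on submission date and keeping only record-breaking rows.
from datetime import date

# Illustrative subset of the leaderboard; intermediate dates are placeholders.
entries = [
    (date(2026, 4, 15), "TongAgents / Gemini 3.1 Pro", 80.2),
    (date(2026, 4, 20), "ForgeCode / Claude Opus 4.6", 79.8),  # below the running best
    (date(2026, 4, 27), "Codex / GPT-5.5", 82.0),
]

def sota_line(rows):
    """Keep only rows that strictly beat the best score seen so far."""
    best = float("-inf")
    out = []
    for day, model, score in sorted(rows, key=lambda r: r[0]):
        if score > best:  # strictly better: record-setting entries only
            best = score
            out.append((day, model, score))
    return out

print(sota_line(entries))  # ForgeCode row is filtered out
```

The strict `>` comparison matters: a tie does not reset the record, which matches the page's rule that ties are broken by submission date.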
§ 06 · Contribute

Have a score that beats
this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit + seed
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
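A pre-flight check against the checklist above can be sketched as below. The field names are hypothetical; the actual Codesota submission schema may differ:

```python
# Hedged sketch: validate a draft submission against the checklist.
# Field names are hypothetical placeholders, not the real schema.
REQUIRED_FIELDS = {
    "checkpoint_or_endpoint",  # 01: public checkpoint or API endpoint
    "repro_script",            # 02: script with frozen commit + seed
    "environment",             # 03: Python version + dependencies
    "metrics",                 # 04: one row per declared metric
    "contact",                 # 05: address for follow-up
}

def missing_fields(submission: dict) -> set:
    """Return required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not submission.get(f)}

draft = {
    "checkpoint_or_endpoint": "https://example.com/model",  # placeholder URL
    "repro_script": "run.sh @ frozen commit, seed 42",
    "metrics": {"accuracy": 82.0},
    "contact": "you@example.com",
}
print(missing_fields(draft))  # the environment was never declared
```

Running a check like this before submitting avoids the most common follow-up: an undeclared evaluation environment that makes the score hard to reproduce.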