Who leads the Terminal-Bench 2.0 benchmark?

Codex / GPT-5.5 currently leads Terminal-Bench 2.0 with an accuracy score of 82.

What is the state-of-the-art score on Terminal-Bench 2.0?

The state-of-the-art result on Terminal-Bench 2.0 is an accuracy score of 82, achieved by Codex / GPT-5.5 as of 2026.

How many models are tracked on Terminal-Bench 2.0?

Codesota tracks 20 entries on Terminal-Bench 2.0, each pairing an agent scaffold with an underlying model.

When was the Terminal-Bench 2.0 leaderboard last updated?

The Terminal-Bench 2.0 leaderboard on Codesota includes results through 2026.

Terminal-Bench 2.0.

Name: Terminal-Bench 2.0 Benchmark Results
Creator: Unknown
Published: 2026-01-01
License: https://creativecommons.org/licenses/by/4.0/

Terminal-Bench 2.0 is a Stanford x Laude benchmark for AI agents operating in terminal environments. It evaluates terminal mastery across software engineering, machine learning, security, data science, system administration, file operations, and related operational workflows. The official site lists 89 high-quality tasks and a 124-entry live leaderboard.

§ 01 · Leaderboard

Results by metric.

accuracy

Accuracy is the reported evaluation metric for Terminal-Bench 2.0. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for accuracy: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published: an earlier or same-year result already scored better.
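The muting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Codesota's actual implementation: the `Result` type, the `is_muted` helper, and the three sample systems are hypothetical, and "scored better" is read as strictly greater, so ties are not muted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    system: str   # hypothetical label, not a real leaderboard entry
    score: float  # accuracy; higher is better
    year: int     # year the result was published


def is_muted(row: Result, results: list[Result]) -> bool:
    """Return True if an earlier or same-year result scored strictly better."""
    return any(
        other.year <= row.year and other.score > row.score
        for other in results
    )


# Hypothetical data chosen only to exercise the rule.
results = [
    Result("system-a", 60.0, 2025),
    Result("system-b", 55.0, 2026),
    Result("system-c", 70.0, 2026),
]

muted = {r.system: is_muted(r, results) for r in results}
print(muted)
```

Here `system-b` is muted because `system-a` had already scored higher a year earlier, while `system-a` and `system-c` were each the best result available in their year.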

Rank	System (scaffold / model)	Official rank	Trust	Score	Year
01	Codex / GPT-5.5	1	verified	82	2026
02	ForgeCode / GPT-5.4	2	verified	81.8	2026
03	TongAgents / Gemini 3.1 Pro	3	verified	80.2	2026
04	ForgeCode / Claude Opus 4.6	4	verified	79.8	2026
05	ForgeCode / Gemini 3.1 Pro	6	verified	78.4	2026
06	SageAgent / GPT-5.3-Codex	5	verified	78.4	2026
07	Droid / GPT-5.3-Codex	7	verified	77.3	2026
08	Capy / Claude Opus 4.6	8	verified	75.3	2026
09	Simple Codex / GPT-5.3-Codex	9	verified	75.1	2026
10	Terminus-KIRA / Gemini 3.1 Pro	10	verified	74.8	2026
11	Terminus-KIRA / Claude Opus 4.6	11	verified	74.7	2026
12	Mux / GPT-5.3-Codex	12	verified	74.6	2026
13	MAYA-V2 / Claude 4.6 Opus	13	verified	72.1	2026
14	TongAgents / Claude Opus 4.6	14	verified	71.9	2026
15	Junie CLI / Multiple	15	verified	71	2026
16	CodeBrain-1 / GPT-5.3-Codex	16	verified	70.3	2026
17	Droid / Claude Opus 4.6	17	verified	69.9	2026
18	Ante / Gemini 3 Pro	18	verified	69.4	2026
19	IndusAGI Coding Agent / GPT-5.3-Codex	19	verified	69.1	2026
20	Crux / Claude Opus 4.6	20	verified	66.9	2026
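Each entry in the table names an agent scaffold and an underlying model separated by " / ", so the same model can appear under several scaffolds. As a rough sketch of how a reader might measure how much the scaffold matters, the Python below groups a subset of the rows above by underlying model and reports the gap between the best and worst score. The helper names `split_system` and `spread_by_model` are hypothetical, and only the first eight rows are included.

```python
from collections import defaultdict

# Subset of the leaderboard rows above: (system, accuracy score).
ROWS = [
    ("Codex / GPT-5.5", 82.0),
    ("ForgeCode / GPT-5.4", 81.8),
    ("TongAgents / Gemini 3.1 Pro", 80.2),
    ("ForgeCode / Claude Opus 4.6", 79.8),
    ("ForgeCode / Gemini 3.1 Pro", 78.4),
    ("SageAgent / GPT-5.3-Codex", 78.4),
    ("Droid / GPT-5.3-Codex", 77.3),
    ("Capy / Claude Opus 4.6", 75.3),
]


def split_system(name: str) -> tuple[str, str]:
    """Split 'Scaffold / Model' into (scaffold, model)."""
    scaffold, _, model = name.partition(" / ")
    return scaffold.strip(), model.strip()


def spread_by_model(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Best-minus-worst score per model, for models seen with 2+ scaffolds."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for name, score in rows:
        _, model = split_system(name)
        by_model[model].append(score)
    return {
        model: round(max(scores) - min(scores), 1)
        for model, scores in by_model.items()
        if len(scores) > 1
    }


spread = spread_by_model(ROWS)
print(spread)
```

Within this subset, the largest gap between scaffolds sharing a model is 4.5 points (Claude Opus 4.6), which is consistent with the benchmark scoring the whole agent system and not the base model alone.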

Lineage

Terminal-Bench 2.0 in context.

Predecessors (1)

saturating · 2024-08

SWE-bench Verified

Terminal-Bench broadens from GitHub issue repair into terminal-native operational workflows: build systems, security, data processing, system administration, and file operations. It scores the full agent harness, not only the base model.

This benchmark (1)

active · 2026-04

Terminal-Bench 2.0

Successors: none yet. This is the current frontier.
