Codesota · Benchmark · Terminal-Bench 2.0

Terminal-Bench 2.0.

Terminal-Bench 2.0 is a Stanford x Laude benchmark for AI agents operating in terminal environments. It evaluates terminal mastery across software engineering, machine learning, security, data science, system administration, file operations, and related operational workflows. The official site lists 89 high-quality tasks and a 124-entry live leaderboard.

§ 01 · Leaderboard

Results by metric.

accuracy

Accuracy is the reported evaluation metric for Terminal-Bench 2.0. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
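The muting rule can be sketched in a few lines. This is a minimal illustration of one reading of the sentence above, not Codesota's actual implementation: the `Result` class and `is_muted` function are hypothetical names, and the comparison uses the year only, because that is the only date granularity the table shows. Under that assumption, every row with a higher-scoring result from the same or an earlier year counts as muted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure for illustration)."""
    system: str
    score: float
    year: int


def is_muted(row: Result, rows: list[Result]) -> bool:
    """Return True if an earlier or same-year result already scored better.

    Assumption: publication dates are compared at year granularity only.
    """
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
    )


# The top three rows of the table, as listed on this page.
rows = [
    Result("Codex / GPT-5.5", 82.0, 2026),
    Result("ForgeCode / GPT-5.4", 81.8, 2026),
    Result("TongAgents / Gemini 3.1 Pro", 80.2, 2026),
]

muted = [r.system for r in rows if is_muted(r, rows)]
print(muted)
```

With a finer-grained publication date, a row could have been state of the art on the day it appeared and still be overtaken later in the same year, so a production rule would likely need full dates rather than years.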

Each system couples an agent scaffold with an underlying model, listed as scaffold / model. The Official rank column gives the position on the official Terminal-Bench 2.0 leaderboard. The two entries at 78.4 appear here in the opposite order from the official leaderboard.

| Rank | System (scaffold / model) | Trust | Score | Year | Official rank |
| --- | --- | --- | --- | --- | --- |
| 01 | Codex / GPT-5.5 | verified | 82 | 2026 | 1 |
| 02 | ForgeCode / GPT-5.4 | verified | 81.8 | 2026 | 2 |
| 03 | TongAgents / Gemini 3.1 Pro | verified | 80.2 | 2026 | 3 |
| 04 | ForgeCode / Claude Opus 4.6 | verified | 79.8 | 2026 | 4 |
| 05 | ForgeCode / Gemini 3.1 Pro | verified | 78.4 | 2026 | 6 |
| 06 | SageAgent / GPT-5.3-Codex | verified | 78.4 | 2026 | 5 |
| 07 | Droid / GPT-5.3-Codex | verified | 77.3 | 2026 | 7 |
| 08 | Capy / Claude Opus 4.6 | verified | 75.3 | 2026 | 8 |
| 09 | Simple Codex / GPT-5.3-Codex | verified | 75.1 | 2026 | 9 |
| 10 | Terminus-KIRA / Gemini 3.1 Pro | verified | 74.8 | 2026 | 10 |
| 11 | Terminus-KIRA / Claude Opus 4.6 | verified | 74.7 | 2026 | 11 |
| 12 | Mux / GPT-5.3-Codex | verified | 74.6 | 2026 | 12 |
| 13 | MAYA-V2 / Claude 4.6 Opus | verified | 72.1 | 2026 | 13 |
| 14 | TongAgents / Claude Opus 4.6 | verified | 71.9 | 2026 | 14 |
| 15 | Junie CLI / Multiple | verified | 71 | 2026 | 15 |
| 16 | CodeBrain-1 / GPT-5.3-Codex | verified | 70.3 | 2026 | 16 |
| 17 | Droid / Claude Opus 4.6 | verified | 69.9 | 2026 | 17 |
| 18 | Ante / Gemini 3 Pro | verified | 69.4 | 2026 | 18 |
| 19 | IndusAGI Coding Agent / GPT-5.3-Codex | verified | 69.1 | 2026 | 19 |
| 20 | Crux / Claude Opus 4.6 | verified | 66.9 | 2026 | 20 |
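Because higher is better, the listing order follows from sorting by score, but two entries share a score of 78.4 and so have no order determined by score alone. The sketch below shows one common way to handle this, standard competition ranking, where tied systems share a rank and the next rank is skipped. It is an illustration only: the page does not say how Codesota or the official leaderboard break ties, and `competition_ranks` is a hypothetical helper name.

```python
from itertools import groupby

# (system, score) pairs for the top seven rows, as listed on this page.
rows = [
    ("Codex / GPT-5.5", 82.0),
    ("ForgeCode / GPT-5.4", 81.8),
    ("TongAgents / Gemini 3.1 Pro", 80.2),
    ("ForgeCode / Claude Opus 4.6", 79.8),
    ("ForgeCode / Gemini 3.1 Pro", 78.4),
    ("SageAgent / GPT-5.3-Codex", 78.4),
    ("Droid / GPT-5.3-Codex", 77.3),
]


def competition_ranks(rows):
    """Map each system to its rank; equal scores share the same rank."""
    # Sort descending because higher accuracy is better.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranks = {}
    position = 1
    # groupby works here because equal scores are adjacent after sorting.
    for _, group in groupby(ordered, key=lambda r: r[1]):
        members = list(group)
        for system, _ in members:
            ranks[system] = position
        # Skip as many ranks as there were tied systems.
        position += len(members)
    return ranks


ranks = competition_ranks(rows)
for system, score in rows:
    print(f"{ranks[system]:>2}  {score:>5}  {system}")
```

Under this scheme both 78.4 entries receive rank 5 and the next system receives rank 7, which makes the tie visible instead of implying that one system outperformed the other.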
Lineage

Terminal-Bench 2.0 in context.

This benchmark (1)
Terminal-Bench 2.0 · active · 2026-04
None yet — this is the current frontier.