Codesota · Agentic AI · Autonomous Coding · Terminal-Bench 2.0
Autonomous Coding · benchmark dataset · 2026 · TERMINAL

Terminal-Bench 2.0.

Terminal-agent benchmark for software engineering, machine learning, security, data science, system administration, file operations, and related terminal workflows. Scores measure the agent harness and underlying model as one system.

Paper · Download dataset · Submit a result
§ 01 · Leaderboard

Best published scores.

20 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 20 rows
#  | Model                                  | Type | Org           | Submitted | Paper / code            | accuracy
01 | Codex / GPT-5.5                        | API  | OpenAI        | Apr 2026  | terminal-bench-official | 82.00
02 | ForgeCode / GPT-5.4                    | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 81.80
03 | TongAgents / Gemini 3.1 Pro            | API  | TongAgents    | Apr 2026  | terminal-bench-official | 80.20
04 | ForgeCode / Claude Opus 4.6            | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 79.80
05 | SageAgent / GPT-5.3-Codex              | API  | SageAgent     | Apr 2026  | terminal-bench-official | 78.40
06 | ForgeCode / Gemini 3.1 Pro             | API  | ForgeCode     | Apr 2026  | terminal-bench-official | 78.40
07 | Droid / GPT-5.3-Codex                  | API  | Droid         | Apr 2026  | terminal-bench-official | 77.30
08 | Capy / Claude Opus 4.6                 | API  | Capy          | Apr 2026  | terminal-bench-official | 75.30
09 | Simple Codex / GPT-5.3-Codex           | API  | OpenAI        | Apr 2026  | terminal-bench-official | 75.10
10 | Terminus-KIRA / Gemini 3.1 Pro         | API  | Terminus-KIRA | Apr 2026  | terminal-bench-official | 74.80
11 | Terminus-KIRA / Claude Opus 4.6        | API  | Terminus-KIRA | Apr 2026  | terminal-bench-official | 74.70
12 | Mux / GPT-5.3-Codex                    | API  | Mux           | Apr 2026  | terminal-bench-official | 74.60
13 | MAYA-V2 / Claude 4.6 Opus              | API  | MAYA          | Apr 2026  | terminal-bench-official | 72.10
14 | TongAgents / Claude Opus 4.6           | API  | TongAgents    | Apr 2026  | terminal-bench-official | 71.90
15 | Junie CLI / Multiple                   | API  | JetBrains     | Apr 2026  | terminal-bench-official | 71.00
16 | CodeBrain-1 / GPT-5.3-Codex            | API  | CodeBrain     | Apr 2026  | terminal-bench-official | 70.30
17 | Droid / Claude Opus 4.6                | API  | Droid         | Apr 2026  | terminal-bench-official | 69.90
18 | Ante / Gemini 3 Pro                    | API  | Ante          | Apr 2026  | terminal-bench-official | 69.40
19 | IndusAGI Coding Agent / GPT-5.3-Codex  | API  | IndusAGI      | Apr 2026  | terminal-bench-official | 69.10
20 | Crux / Claude Opus 4.6                 | API  | Crux          | Apr 2026  | terminal-bench-official | 66.90
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

1 step
of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Apr 27, 2026 · Codex / GPT-5.5 · OpenAI · 82.00
Fig 3 · SOTA-setting models only. 1 entry, dated Apr 2026.
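The record-breaking filter behind the progress line above can be sketched as a running maximum over date-sorted leaderboard rows. The entries below are illustrative samples, not the full table, and the intermediate dates are placeholders (only Apr 27 appears on this page):

```python
# Sketch: derive a SOTA progress line from leaderboard entries by
# sorting on submission date and keeping only record-breaking rows.
from datetime import date

# Illustrative subset of the leaderboard; intermediate dates are placeholders.
entries = [
    (date(2026, 4, 15), "TongAgents / Gemini 3.1 Pro", 80.2),
    (date(2026, 4, 20), "ForgeCode / Claude Opus 4.6", 79.8),  # below the running best
    (date(2026, 4, 27), "Codex / GPT-5.5", 82.0),
]

def sota_line(rows):
    """Keep only rows that strictly beat the best score seen so far."""
    best = float("-inf")
    out = []
    for day, model, score in sorted(rows, key=lambda r: r[0]):
        if score > best:  # strictly better: record-setting entries only
            best = score
            out.append((day, model, score))
    return out

print(sota_line(entries))  # ForgeCode row is filtered out
```

The strict `>` comparison matters: a tie does not reset the record, which matches the page's rule that ties are broken by submission date.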
§ 06 · Contribute

Have a score that beats
this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit + seed
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
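A pre-flight check against the checklist above can be sketched as below. The field names are hypothetical; the actual Codesota submission schema may differ:

```python
# Hedged sketch: validate a draft submission against the checklist.
# Field names are hypothetical placeholders, not the real schema.
REQUIRED_FIELDS = {
    "checkpoint_or_endpoint",  # 01: public checkpoint or API endpoint
    "repro_script",            # 02: script with frozen commit + seed
    "environment",             # 03: Python version + dependencies
    "metrics",                 # 04: one row per declared metric
    "contact",                 # 05: address for follow-up
}

def missing_fields(submission: dict) -> set:
    """Return required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not submission.get(f)}

draft = {
    "checkpoint_or_endpoint": "https://example.com/model",  # placeholder URL
    "repro_script": "run.sh @ frozen commit, seed 42",
    "metrics": {"accuracy": 82.0},
    "contact": "you@example.com",
}
print(missing_fields(draft))  # the environment was never declared
```

Running a check like this before submitting avoids the most common follow-up: an undeclared evaluation environment that makes the score hard to reproduce.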