152 hand-built terminal tasks spanning devops, data, SWE, and scientific computing, each scored by container-internal unit tests. This is where agentic-coding evaluation moved past SWE-bench: the harness, the shell environment, and the model are all measured together. You cannot win this leaderboard on raw model intelligence alone.
Terminal-Bench 1 (80 tasks) was retired Oct 2025 in favour of TB2 (152 tasks). Submissions cannot modify timeouts or resources, and results are verified by the Terminal-Bench team. Top entry: Codex on GPT-5.5 at 82.0%.
SWE-bench measured the model. SWE-bench Pro fixed contamination but still scored single patches. Terminal-Bench treats the agent + harness + shell as one system — closer to how engineers actually deploy these tools.
**SWE-bench:** 500 human-filtered GitHub issues, model-only scoring. OpenAI publicly stopped reporting it in Sep 2025 because of contamination.
**SWE-bench Pro:** Scale AI's contamination-controlled successor. 1,865 problems with held-out splits. Frontier scores drop to ~23%.
**Terminal-Bench:** Real terminal/devops/data/SWE tasks in a Docker sandbox. Agent-coupled: the harness counts.
Each dot is a record-setting agent + model pair on TB2. The closed/API line drives the frontier; open-source harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail by roughly 30 points.
Top-30 closed and key open entries from the live tbench.ai TB2 leaderboard. The same model appears multiple times paired with different agents — that's the point: harness changes the outcome.
| # | Agent | Model | Org | Type | Submitted | Score | CI |
|---|---|---|---|---|---|---|---|
| 01 | Codex | GPT-5.5 | OpenAI | API | Apr 2026 | 82.0 | ±2.2 |
| 02 | ForgeCode | GPT-5.4 | OpenAI | API | Mar 2026 | 81.8 | ±2.0 |
| 03 | TongAgents | Gemini 3.1 Pro | Google | API | Mar 2026 | 80.2 | ±2.6 |
| 04 | ForgeCode | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 79.8 | ±1.6 |
| 05 | SageAgent | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 78.4 | ±2.2 |
| 06 | ForgeCode | Gemini 3.1 Pro | Google | API | Mar 2026 | 78.4 | ±1.8 |
| 07 | Droid | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 77.3 | ±2.2 |
| 08 | Capy | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 75.3 | ±2.4 |
| 09 | Simple Codex | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 75.1 | ±2.4 |
| 10 | Terminus-KIRA | Gemini 3.1 Pro | Google | API | Feb 2026 | 74.8 | ±2.6 |
| 11 | Terminus-KIRA | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 74.7 | ±2.6 |
| 12 | Mux | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 74.6 | ±2.5 |
| 13 | MAYA-V2 | Claude Opus 4.6 | Anthropic | API | Mar 2026 | 72.1 | ±2.2 |
| 14 | TongAgents | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 71.9 | ±2.7 |
| 15 | Junie CLI | Multiple | JetBrains | API | Mar 2026 | 71.0 | ±2.9 |
| 16 | CodeBrain-1 | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 70.3 | ±2.6 |
| 17 | Droid | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 69.9 | ±2.5 |
| 18 | Ante | Gemini 3 Pro | Google | API | Jan 2026 | 69.4 | ±2.1 |
| 19 | IndusAGI | GPT-5.3-Codex | OpenAI | API | Mar 2026 | 69.1 | ±2.3 |
| 20 | Crux | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 66.9 | — |
| 21 | Deep Agents | GPT-5.2-Codex | OpenAI | API | Feb 2026 | 66.5 | ±3.1 |
| 22 | Mux | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 66.5 | ±2.5 |
| 23 | SageAgent | Gemini 3 Pro | Google | API | Feb 2026 | 65.2 | ±2.1 |
| 24 | Droid | GPT-5.2 | OpenAI | API | Dec 2025 | 64.9 | ±2.8 |
| 25 | Terminus 2 | GPT-5.3-Codex | OpenAI | API | Feb 2026 | 64.7 | ±2.7 |
| 26 | Junie CLI | Gemini 3 Flash | Google | API | Dec 2025 | 64.3 | ±2.8 |
| 27 | Droid | Claude Opus 4.5 | Anthropic | API | Dec 2025 | 63.1 | ±2.7 |
| 28 | Terminus 2 | Claude Opus 4.6 | Anthropic | API | Feb 2026 | 62.9 | ±2.7 |
| 29 | Codex CLI | GPT-5.2 | OpenAI | API | Dec 2025 | 62.9 | ±3.0 |
| 30 | Warp | Multiple | Warp | API | Dec 2025 | 61.2 | ±3.0 |
| 51 | OpenHands | Claude Opus 4.5 | Anthropic | OSS | Jan 2026 | 51.9 | ±2.9 |
| 58 | CAMEL-AI | Claude Sonnet 4.5 | Anthropic | OSS | Dec 2025 | 46.5 | ±2.4 |
| 60 | OpenHands | GPT-5 | OpenAI | OSS | Nov 2025 | 43.8 | ±3.0 |
| 68 | OpenHands | Claude Sonnet 4.5 | Anthropic | OSS | Nov 2025 | 42.6 | ±2.8 |
| 69 | Mini-SWE-Agent | Claude Sonnet 4.5 | Anthropic | OSS | Nov 2025 | 42.5 | ±2.8 |
| 71 | Mini-SWE-Agent | GPT-5-Codex | OpenAI | OSS | Nov 2025 | 41.3 | ±2.8 |
| 75 | OpenHands | Claude Opus 4.1 | Anthropic | OSS | Nov 2025 | 36.9 | ±2.7 |
| 81 | Mini-SWE-Agent | Claude Opus 4.1 | Anthropic | OSS | Nov 2025 | 35.1 | ±2.5 |
| 84 | Mini-SWE-Agent | GPT-5 | OpenAI | OSS | Nov 2025 | 33.9 | ±2.9 |
Open-source harnesses (OpenHands, Mini-SWE-Agent, CAMEL-AI) trail proprietary harnesses (Codex, ForgeCode, TongAgents). Even when the underlying model is identical, harness engineering moves the score by 20+ points.
Every task ships as a Docker image with seeded files, optional services, and a hidden test harness. The agent sees a shell; it cannot peek at the tests.
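What "hidden test harness" means in practice: the checks live inside the image and run against the container's final state after the agent's session ends. The sketch below is a hypothetical pytest checker, not an actual TB2 task; the task, file paths, and assertions are invented for illustration.

```python
# tests/test_task.py -- hypothetical hidden checker, baked into the image but
# kept out of the agent's working directory and run after the session ends.
# The task ("install a `wordfreq` CLI and write /app/report.json") is invented.
import json
import pathlib
import subprocess


def test_cli_installed():
    # Only observable container state is checked, never the agent's transcript.
    out = subprocess.run(["wordfreq", "--version"], capture_output=True, text=True)
    assert out.returncode == 0


def test_report_schema():
    # Verify the agent produced the requested artifact with the expected fields.
    report = json.loads(pathlib.Path("/app/report.json").read_text())
    assert report["top_word"] == "the"
    assert report["count"] > 0
```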
Submissions cannot modify per-task timeouts or resource limits, so every entry is compared under the same ceiling: no winning by buying extra compute.
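A rough sketch of how a runner could pin that budget, assuming a Docker-based executor; the limit values, flags, and function shape are illustrative, not Terminal-Bench's actual configuration.

```python
# Hypothetical runner-side enforcement of the fixed per-task budget; the numbers
# are made up, only the mechanism (runner-owned caps) is the point.
import subprocess

TASK_TIMEOUT_S = 900   # wall-clock budget the submission is not allowed to raise
MEMORY_LIMIT = "4g"    # container memory cap
CPU_LIMIT = "2"        # container CPU cap


def run_task(image: str, agent_cmd: list[str]) -> bool:
    """Run one task container under the benchmark-fixed limits."""
    try:
        proc = subprocess.run(
            ["docker", "run", "--rm",
             f"--memory={MEMORY_LIMIT}", f"--cpus={CPU_LIMIT}",
             image, *agent_cmd],
            timeout=TASK_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False   # blowing the budget counts as a failed task
    return proc.returncode == 0
```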
The score reflects the harness, the prompt scaffold, and the underlying model together. A great model with a poor harness loses to a worse model with a better harness.
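To make "harness" concrete, here is a minimal command-loop scaffold of the kind these agents wrap around a model. `llm_complete` is a hypothetical stand-in for the provider call; real harnesses differ exactly in the parts flagged in the comments.

```python
# Minimal sketch of an agent loop: ask the model for the next shell command,
# run it, feed the output back. Not any specific leaderboard entry's code.
import subprocess


def llm_complete(messages: list[dict]) -> str:
    """Hypothetical stand-in for whatever model API the harness calls."""
    raise NotImplementedError


def run_agent(task_instruction: str, max_steps: int = 30) -> None:
    messages = [
        {"role": "system", "content": "Reply with exactly one shell command per "
                                      "turn, or the single word DONE when finished."},
        {"role": "user", "content": task_instruction},
    ]
    for _ in range(max_steps):
        command = llm_complete(messages).strip()
        if command == "DONE":
            return
        try:
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=120)
            observation = (result.stdout + result.stderr)[-4000:]
        except subprocess.TimeoutExpired:
            observation = "[command timed out]"
        # Everything below is "prompt scaffold": what gets echoed back, how the
        # output is truncated, retry rules. This is where harnesses diverge.
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user", "content": observation})
```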
Top submissions are independently re-run by Terminal-Bench team members. CI bars represent the run-to-run variance.
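One plausible way such a bar could be derived, assuming several independent re-runs of the same agent + model pair; the scores below are invented and this is not necessarily the team's exact procedure.

```python
# Mean pass rate and a normal-approximation 95% half-width across re-runs.
import statistics


def ci_halfwidth(run_scores: list[float]) -> tuple[float, float]:
    """run_scores: per-run percentage of tasks passed."""
    mean = statistics.mean(run_scores)
    stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, 1.96 * stderr


mean, half = ci_halfwidth([79.5, 84.1, 82.3, 80.2, 83.9])
print(f"{mean:.1f} ± {half:.1f}")   # 82.0 ± 1.8 for these made-up runs
```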