SWE-bench Verified — Agentic Leaderboard
500 GitHub issues, each manually verified as solvable by human engineers. The primary benchmark for software engineering agents. Results are tracked for autonomous agent scaffolds, not just bare model capability.
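The resolve-rate reported throughout is the percentage of the benchmark's issues that a system resolves. A minimal sketch of that arithmetic follows; the function name `resolve_rate`, the default of 500 issues, and the rounding to one decimal place (chosen to match the displayed scores) are illustrative assumptions, not part of any official harness.

```python
def resolve_rate(resolved: int, total: int = 500) -> float:
    """Percentage of benchmark issues resolved, rounded to one decimal.

    `total` defaults to the 500 issues in SWE-bench Verified.
    This helper is hypothetical and only illustrates the metric.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 1)


# Hypothetical run: 250 of the 500 issues resolved.
print(resolve_rate(250))  # 50.0
```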
Top score: Claude Opus 4.5 (Anthropic), 80.9 resolve-rate
[Chart: resolve-rate progress over time, showing 9 breakthroughs from Feb 2024 to Mar 2026]
Key Milestones
SWE-agent with GPT-4o backbone. Table 2, arxiv:2402.07927. Resolves 23.7% of SWE-bench Verified.
Agentless v1.5 with GPT-4o. arxiv:2405.15793. Localize-then-repair without an agent loop.
Amazon Q Developer. Updated from 25.6% (May 2024) to 38.8% with the reinvented agent released in Sept 2024.
Claude 3.7 Sonnet with agentic scaffold. Reported in system card arxiv:2502.18449.
codex-1 single-attempt score; 83.8% with 8 tries. Announced May 2025.
Claude Opus 4.5 via Claude Code scaffold. Top of SWE-bench Verified leaderboard as of March 2026.
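The codex-1 milestone above distinguishes a single-attempt score from a score with 8 tries. The source does not say how the 8 tries are aggregated; the sketch below assumes an issue counts as resolved if any of the first k attempts resolves it. The function names and the toy `outcomes` data are hypothetical.

```python
from typing import Dict, List


def single_attempt_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Percentage of issues resolved by the first attempt only."""
    hits = sum(1 for attempts in outcomes.values() if attempts and attempts[0])
    return 100.0 * hits / len(outcomes)


def any_of_k_rate(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Percentage of issues resolved by at least one of the first k attempts.

    Assumption: "with k tries" means any successful attempt counts.
    """
    hits = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return 100.0 * hits / len(outcomes)


# Toy data: per-issue attempt results (True = tests passed).
outcomes = {
    "issue-a": [True, True],
    "issue-b": [False, True],
    "issue-c": [False, False],
    "issue-d": [True, False],
}

print(single_attempt_rate(outcomes))  # 50.0
print(any_of_k_rate(outcomes, k=2))   # 75.0
```

Under this assumption the any-of-k rate can never fall below the single-attempt rate, which is consistent with the 8-try figure being the higher of the two.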
Top Models Performance Comparison
[Chart: top 10 models ranked by resolve-rate (primary metric)]
| # | Model | Access | Organization | Resolve rate (%) | Date |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | API | Anthropic | 80.9 | Feb 2025 |
| 2 | Claude Sonnet 4 | API | Anthropic | 72.7 | Feb 2025 |
| 3 | Codex (codex-1) | | OpenAI | 72.1 | May 2025 |
| 4 | o3 | API | OpenAI | 69.1 | Apr 2025 |
| 5 | Claude 3.7 Sonnet | API | Anthropic | 63.7 | Feb 2025 |
| 6 | Moatless Tools (Claude 3.7) | Open Source | Moatless / community | 57.6 | Feb 2025 |
| 7 | Devin 2.0 | | Cognition AI | 53.6 | May 2024 |
| 8 | OpenHands CodeAct (Claude 3.7) | Open Source | OpenHands / All-Hands AI | 53 | Jul 2024 |
| 9 | o3-mini | API | OpenAI | 49.3 | Jan 2025 |
| 10 | Claude Code (Sonnet 3.5) | | Anthropic | 49 | Feb 2025 |
| 11 | Amazon Q Developer | | Amazon Web Services | 38.8 | Sep 2024 |
| 12 | GPT-4o | API | OpenAI | 33.2 | Aug 2024 |
| 13 | Agentless (GPT-4o) | Open Source | UIUC / Microsoft | 30.2 | May 2024 |
| 14 | SWE-agent (GPT-4o) | Open Source | Princeton NLP | 23.7 | Feb 2024 |
| 15 | Devin 1.0 | | Cognition AI | 13.8 | May 2024 |