SWE-bench · 2024 · python

SWE-bench Verified — Agentic Leaderboard

500 manually verified GitHub issues, each confirmed solvable by human engineers. The primary benchmark for software-engineering agents; results here track full autonomous scaffolds, not just raw model capability.

Samples: 500
Metrics: resolve-rate
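A resolved instance is one where the agent's generated patch makes the instance's fail-to-pass tests pass without breaking its pass-to-pass tests; resolve-rate is the resolved fraction of the 500 instances, as a percentage. A minimal sketch of that arithmetic (the dict layout is illustrative, not the official harness's report schema):

```python
# Sketch: computing resolve-rate from per-instance results.
# The 'resolved' field here is an assumption for illustration;
# the official SWE-bench harness emits its own report format.
def resolve_rate(results: list[dict]) -> float:
    """results: one dict per benchmark instance, with a boolean
    'resolved' flag (fail-to-pass tests pass, pass-to-pass intact)."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["resolved"])
    return 100.0 * resolved / len(results)

# e.g. 404 of 500 instances resolved -> 80.8
sample = [{"resolved": i < 404} for i in range(500)]
print(round(resolve_rate(sample), 1))  # 80.8
```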
Current State of the Art

Claude Opus 4.5 (Anthropic): 80.9 resolve-rate

resolve-rate Progress Over Time

Showing 9 breakthroughs from Feb 2024 to Mar 2026

[Chart: resolve-rate vs. date, Feb 2024 to Mar 2026; y-axis ticks 18.0 to 86.6]

Key Milestones

Feb 2024 · SWE-agent (GPT-4o) · 23.7
SWE-agent with GPT-4o backbone. Table 2, arxiv:2402.07927. Resolves 23.7% of SWE-bench Verified.

May 2024 · Agentless (GPT-4o) · 30.2 (+27.4%)
Agentless v1.5 with GPT-4o. arxiv:2405.15793. Localize-then-repair without an agent loop.

Aug 2024 · GPT-4o · 33.2 (+9.9%)
GPT-4o (2024-08-06) via llm-stats leaderboard.

Sep 2024 · Amazon Q Developer · 38.8 (+16.9%)
Updated from 25.6% (May 2024) to 38.8% in its September 2024 agent release.

Jan 2025 · o3-mini · 49.3 (+27.1%)
o3-mini (2025-01-30) via llm-stats leaderboard.

Feb 2025 · Claude 3.7 Sonnet · 63.7 (+29.2%)
Claude 3.7 Sonnet with agentic scaffold. Reported in the system card, arxiv:2502.18449.

Apr 2025 · o3 · 69.1 (+8.5%)
o3 (2025-04-16) via llm-stats leaderboard.

May 2025 · Codex (codex-1) · 72.1 (+4.3%)
codex-1 single-attempt score; 83.8% with 8 tries. Announced May 2025.

Mar 2026 · Claude Opus 4.5 (Current SOTA) · 80.9 (+12.2%)
Claude Opus 4.5 via the Claude Code scaffold. Top of the SWE-bench Verified leaderboard as of March 2026.
Total Improvement: 241.4%
Time Span: 2y 2m
Breakthroughs: 9
Current SOTA: 80.9
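Each milestone's +% figure is its relative improvement over the previous milestone's score, and "Total Improvement" is the relative gain from the first milestone (23.7) to the current SOTA (80.9). A quick check, with the scores copied from the milestone list:

```python
# Milestone scores in chronological order (Feb 2024 .. Mar 2026).
scores = [23.7, 30.2, 33.2, 38.8, 49.3, 63.7, 69.1, 72.1, 80.9]

# Relative improvement of each milestone over its predecessor, in percent.
deltas = [100.0 * (b - a) / a for a, b in zip(scores, scores[1:])]
print([round(d, 1) for d in deltas])
# [27.4, 9.9, 16.9, 27.1, 29.2, 8.5, 4.3, 12.2] -- matches the list above

# Total improvement: first milestone vs. current SOTA.
total = 100.0 * (scores[-1] - scores[0]) / scores[0]
print(round(total, 1))  # 241.4
```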

Top Models Performance Comparison

Top 10 models ranked by resolve-rate

#  | Model                          | resolve-rate | % of best
1  | Claude Opus 4.5                | 80.9         | 100.0%
2  | Claude Sonnet 4                | 72.7         | 89.9%
3  | Codex (codex-1)                | 72.1         | 89.1%
4  | o3                             | 69.1         | 85.4%
5  | Claude 3.7 Sonnet              | 63.7         | 78.7%
6  | Moatless Tools (Claude 3.7)    | 57.6         | 71.2%
7  | Devin 2.0                      | 53.6         | 66.3%
8  | OpenHands CodeAct (Claude 3.7) | 53.0         | 65.5%
9  | o3-mini                        | 49.3         | 60.9%
10 | Claude Code (Sonnet 3.5)       | 49.0         | 60.6%
Best Score: 80.9
Top Model: Claude Opus 4.5
Models Compared: 10
Score Range: 31.9
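The "% of best" column normalizes each model's score against the current best (80.9), and "Score Range" is the spread between the best and worst of the ten compared models. A sketch of both calculations, using three of the ten models for brevity:

```python
# Three of the ten compared models; the first is the current best,
# the last is the lowest of the top 10.
scores = {
    "Claude Opus 4.5": 80.9,
    "Claude Sonnet 4": 72.7,
    "Claude Code (Sonnet 3.5)": 49.0,
}
best = max(scores.values())

# "% of best": each score as a percentage of the top score.
pct_of_best = {m: round(100.0 * s / best, 1) for m, s in scores.items()}
print(pct_of_best["Claude Sonnet 4"])  # 89.9, matching the chart

# "Score Range": best minus worst among the compared models.
score_range = round(best - min(scores.values()), 1)
print(score_range)  # 31.9
```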

resolve-rate (Primary)

#  | Model                                      | Org                      | Score | Date
1  | Claude Opus 4.5 [API]                      | Anthropic                | 80.9  | Feb 2025
2  | Claude Sonnet 4 [API]                      | Anthropic                | 72.7  | Feb 2025
3  | Codex (codex-1)                            | OpenAI                   | 72.1  | May 2025
4  | o3 [API]                                   | OpenAI                   | 69.1  | Apr 2025
5  | Claude 3.7 Sonnet [API]                    | Anthropic                | 63.7  | Feb 2025
6  | Moatless Tools (Claude 3.7) [Open Source]  | Moatless / community     | 57.6  | Feb 2025
7  | Devin 2.0                                  | Cognition AI             | 53.6  | May 2024
8  | OpenHands CodeAct (Claude 3.7) [Open Source] | OpenHands / All-Hands AI | 53.0 | Jul 2024
9  | o3-mini [API]                              | OpenAI                   | 49.3  | Jan 2025
10 | Claude Code (Sonnet 3.5)                   | Anthropic                | 49.0  | Feb 2025
11 | Amazon Q Developer                         | Amazon Web Services      | 38.8  | Sep 2024
12 | GPT-4o [API]                               | OpenAI                   | 33.2  | Aug 2024
13 | Agentless (GPT-4o) [Open Source]           | UIUC / Microsoft         | 30.2  | May 2024
14 | SWE-agent (GPT-4o) [Open Source]           | Princeton NLP            | 23.7  | Feb 2024
15 | Devin 1.0                                  | Cognition AI             | 13.8  | May 2024

Related Papers (5)

Claude 3.7 Sonnet System Card
Feb 2025 · Models: Claude Opus 4.5, Claude Sonnet 4, Claude 3.7 Sonnet, +2 more
Devin: The First AI Software Engineer
May 2024 · Models: Devin 2.0, Devin 1.0