Code Generation | 2024 | python

SWE-bench Verified Subset

A high-quality subset of SWE-bench: 500 GitHub issues, each manually verified as solvable by human engineers.

Metrics: resolve-rate
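As a rough illustration of the metric, resolve-rate can be read as the percentage of benchmark instances whose issue a model's patch resolves. The sketch below assumes that definition. The function name `resolve_rate` and the instance IDs are hypothetical, and this is not the official evaluation harness.

```python
def resolve_rate(results: dict[str, bool]) -> float:
    """Return the percentage of instances marked as resolved.

    `results` maps an instance ID to True if the model's patch resolved
    the issue (assumed definition; not the official harness).
    """
    if not results:
        raise ValueError("results must contain at least one instance")
    resolved = sum(results.values())  # True counts as 1, False as 0
    return 100.0 * resolved / len(results)


# Hypothetical outcomes for four instances (IDs are made up).
example = {
    "repo-a-101": True,
    "repo-a-102": False,
    "repo-b-7": True,
    "repo-c-33": True,
}
print(f"{resolve_rate(example):.1f}")  # 3 of 4 resolved -> prints 75.0
```

Under this reading, a score on the full 500-issue set would be the same ratio computed over all 500 instances.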
Current State of the Art

Claude Opus 4.5 (Anthropic): 80.9 resolve-rate

Top Models Performance Comparison

Top 10 models ranked by resolve-rate

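| # | Model | resolve-rate | % of best |
|---|-------|--------------|-----------|
| 1 | Claude Opus 4.5 | 80.9 | 100.0% |
| 2 | Claude Opus 4.6 | 80.8 | 99.9% |
| 3 | Gemini 3.1 Pro | 80.6 | 99.6% |
| 4 | GPT-5.2 Thinking | 80.0 | 98.9% |
| 5 | Claude Sonnet 4.6 | 79.6 | 98.4% |
| 6 | Gemini 3 Flash | 78.0 | 96.4% |
| 7 | Claude Sonnet 4.5 | 77.2 | 95.4% |
| 8 | GPT-5.1 | 76.3 | 94.3% |
| 9 | Gemini 3 Pro | 76.2 | 94.2% |
| 10 | GPT-5 | 74.9 | 92.6% |
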
Best Score: 80.9
Top Model: Claude Opus 4.5
Models Compared: 10
Score Range: 6.0
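The summary figures in this comparison appear to follow from simple arithmetic on the top-10 scores. The sketch below assumes that "% of best" is each score divided by the best score, and that "Score Range" is the best score minus the lowest of the ten. Both formulas are inferred from the listed numbers rather than stated by the page, and the variable names are illustrative.

```python
# Top-10 resolve-rate scores as listed in the comparison.
top10 = {
    "Claude Opus 4.5": 80.9,
    "Claude Opus 4.6": 80.8,
    "Gemini 3.1 Pro": 80.6,
    "GPT-5.2 Thinking": 80.0,
    "Claude Sonnet 4.6": 79.6,
    "Gemini 3 Flash": 78.0,
    "Claude Sonnet 4.5": 77.2,
    "GPT-5.1": 76.3,
    "Gemini 3 Pro": 76.2,
    "GPT-5": 74.9,
}

best_model = max(top10, key=top10.get)
best_score = top10[best_model]

# Assumed formula: each score as a percentage of the best, to one decimal.
pct_of_best = {
    model: round(100.0 * score / best_score, 1)
    for model, score in top10.items()
}

# Assumed formula: spread between the best and worst of the listed scores.
score_range = round(best_score - min(top10.values()), 1)

for model, score in top10.items():
    print(f"{model:<20} {score:>5.1f} {pct_of_best[model]:>6.1f}%")
print(f"Best: {best_model} ({best_score}), range: {score_range}")
```

With these assumptions the computed values agree with the listed ones, for example 92.6% for GPT-5 and a score range of 6.0.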

resolve-rate (Primary)

| # | Model | Access | Organization | Score | Date |
|---|-------|--------|--------------|-------|------|
| 1 | Claude Opus 4.5 | API | Anthropic | 80.9 | Mar 2026 |
| 2 | Claude Opus 4.6 | API | Anthropic | 80.8 | Mar 2026 |
| 3 | Gemini 3.1 Pro | API | Google | 80.6 | Mar 2026 |
| 4 | GPT-5.2 Thinking | API | OpenAI | 80.0 | Mar 2026 |
| 5 | Claude Sonnet 4.6 | API | Anthropic | 79.6 | Mar 2026 |
| 6 | Gemini 3 Flash | API | Google | 78.0 | Mar 2026 |
| 7 | Claude Sonnet 4.5 | API | Anthropic | 77.2 | Mar 2026 |
| 8 | GPT-5.1 | API | OpenAI | 76.3 | Mar 2026 |
| 9 | Gemini 3 Pro | API | Google | 76.2 | Mar 2026 |
| 10 | GPT-5 | API | OpenAI | 74.9 | Mar 2026 |
| 11 | Claude Haiku 4.5 | API | Anthropic | 73.3 | Mar 2026 |
| 12 | Devstral 2 | Open Source | Mistral | 72.2 | Mar 2026 |
| 13 | Claude Sonnet 4 | API | Anthropic | 72.2 | Mar 2026 |
| 14 | o3 | API | OpenAI | 69.1 | Mar 2026 |
| 15 | o4-mini | API | OpenAI | 68.1 | Mar 2026 |
| 16 | Gemini 2.5 Pro | API | Google | 63.8 | Mar 2026 |
| 17 | Grok 3 | API | xAI | 63.8 | Mar 2026 |
| 18 | Claude 3.7 Sonnet | API | Anthropic | 62.3 | Mar 2026 |
| 19 | DeepSeek R1-0528 | Open Source | DeepSeek | 57.6 | Mar 2026 |
| 20 | o3-mini | API | OpenAI | 55.8 | Mar 2026 |
| 21 | GPT-4.1 | API | OpenAI | 54.6 | Mar 2026 |
| 22 | Gemini 2.5 Flash | API | Google | 54.0 | Mar 2026 |
| 23 | DeepSeek-R1 | Open Source | DeepSeek | 49.2 | Mar 2026 |
| 24 | Claude 3.5 Sonnet | API | Anthropic | 49.0 | Mar 2026 |
| 25 | o1 | API | OpenAI | 48.9 | Mar 2026 |
| 26 | DeepSeek V3 | Open Source | DeepSeek | 42.0 | Mar 2026 |
| 27 | GPT-4o | API | OpenAI | 41.2 | Mar 2026 |
| 28 | Claude 3.5 Haiku | API | Anthropic | 40.6 | Mar 2026 |
| 29 | DeepSeek V2.5 | Open Source | DeepSeek | 37.0 | Mar 2026 |
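
For readers who want to slice the leaderboard, the sketch below shows one way to pick the highest-scoring entry per access type or per organization. It uses only a handful of rows copied from the leaderboard. The tuple layout and the helper `best_by` are assumptions made for illustration, not part of the site.

```python
# A few rows copied from the leaderboard:
# (model, access, organization, resolve-rate).
rows = [
    ("Claude Opus 4.5", "API", "Anthropic", 80.9),
    ("Gemini 3.1 Pro", "API", "Google", 80.6),
    ("GPT-5.2 Thinking", "API", "OpenAI", 80.0),
    ("Devstral 2", "Open Source", "Mistral", 72.2),
    ("Grok 3", "API", "xAI", 63.8),
    ("DeepSeek R1-0528", "Open Source", "DeepSeek", 57.6),
    ("DeepSeek-R1", "Open Source", "DeepSeek", 49.2),
]

SCORE = 3  # index of the score field in each row


def best_by(rows, column):
    """Return the highest-scoring row for each distinct value in `column`."""
    best = {}
    for row in rows:
        key = row[column]
        if key not in best or row[SCORE] > best[key][SCORE]:
            best[key] = row
    return best


best_per_access = best_by(rows, 1)  # keyed by "API" / "Open Source"
best_per_org = best_by(rows, 2)     # keyed by organization name

for access, row in best_per_access.items():
    print(f"{access}: {row[0]} ({row[SCORE]})")
```

On this subset, the top open-source entry is Devstral 2 and the top API entry is Claude Opus 4.5, matching their positions in the full ranking.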

SWE-bench Verified Benchmark - Code Generation | CodeSOTA