SWE-Bench Verified
A set of 500 GitHub issues, each manually verified as solvable by human engineers. A high-quality subset of SWE-bench.

Benchmark Stats

Models: 29 · Papers: 29 · Metrics: 1


Metric: resolve-rate (higher is better) — the percentage of issues resolved by the model's generated patch.
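As a rough illustration (not the official SWE-bench harness), the resolve-rate reported above can be computed as the share of benchmark issues whose patch passes the repository's tests. The function name and input shape here are hypothetical:

```python
def resolve_rate(results):
    """Percentage of resolved issues.

    results: list of booleans, one per benchmark issue
             (True = the generated patch resolved the issue).
    """
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Example: 3 of 4 issues resolved -> 75.0
print(resolve_rate([True, True, False, True]))
```

On SWE-Bench Verified the denominator is the full set of 500 verified issues, so a score of 80.9 corresponds to roughly 404 resolved issues.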

Leaderboard (all entries sourced from editorial curation):

| Rank | Model | Notes | Score (%) | Year |
|------|-------|-------|-----------|------|
| 1 | Claude Opus 4.5 | First model to break 80%. With agent scaffolding. | 80.9 | 2026 |
| 2 | Claude Opus 4.6 | Simple bash + file-edit scaffold. | 80.8 | 2026 |
| 3 | Gemini 3.1 Pro | With custom agent scaffold. | 80.6 | 2026 |
| 4 | GPT-5.2 Thinking | GPT-5.2 with thinking mode and agent scaffold. | 80.0 | 2026 |
| 5 | Claude Sonnet 4.6 | Simple bash + file-edit scaffold. | 79.6 | 2026 |
| 6 | Gemini 3 Flash | With custom agent scaffold. | 78.0 | 2026 |
| 7 | Claude Sonnet 4.5 | | 77.2 | 2026 |
| 8 | GPT-5.1 | GPT-5.1 with agent scaffold. | 76.3 | 2026 |
| 9 | Gemini 3 Pro | With custom agent scaffold. | 76.2 | 2026 |
| 10 | GPT-5 | GPT-5 with agent scaffold. | 74.9 | 2026 |
| 11 | Claude Haiku 4.5 | 128K thinking budget. Simple scaffold. | 73.3 | 2026 |
| 12 | Claude Sonnet 4 | | 72.2 | 2026 |
| 13 | Devstral 2 | Devstral 2 with SWE-agent. | 72.2 | 2026 |
| 14 | o3 | | 69.1 | 2026 |
| 15 | o4-mini | With agent scaffolding. | 68.1 | 2026 |
| 16 | Grok 3 | With custom agent scaffold. | 63.8 | 2026 |
| 17 | Gemini 2.5 Pro | | 63.8 | 2026 |
| 18 | Claude 3.7 Sonnet | | 62.3 | 2026 |
| 19 | DeepSeek R1-0528 | Self-reported score with Agentless framework. | 57.6 | 2026 |
| 20 | o3-mini | | 55.8 | 2026 |
| 21 | GPT-4.1 | | 54.6 | 2026 |
| 22 | Gemini 2.5 Flash | With custom agent scaffold. | 54.0 | 2026 |
| 23 | DeepSeek-R1 | | 49.2 | 2026 |
| 24 | Claude 3.5 Sonnet | | 49.0 | 2026 |
| 25 | o1 | | 48.9 | 2026 |
| 26 | DeepSeek-V3 | | 42.0 | 2026 |
| 27 | GPT-4o | | 41.2 | 2026 |
| 28 | Claude 3.5 Haiku | Standard scaffold. | 40.6 | 2026 |
| 29 | DeepSeek-V2.5 | | 37.0 | 2026 |