Code Generation · 2024 · python

SWE-bench Verified Subset

A high-quality subset of SWE-bench: 500 manually verified GitHub issues, each confirmed solvable by human engineers.

Metrics: resolve-rate
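
As a rough illustration of the metric, resolve-rate can be read as the percentage of benchmark issues a model resolves. The sketch below assumes that reading; the function name `resolve_rate` and the count of 245 resolved issues are hypothetical, chosen only so that the result matches a score of 49 on a 500-issue set.

```python
def resolve_rate(resolved: int, total: int = 500) -> float:
    """Return the percentage of issues resolved.

    Assumption: resolve-rate = 100 * resolved / total, with total
    defaulting to the 500 issues in this subset.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Hypothetical count: 245 of 500 issues resolved.
print(resolve_rate(245))
```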
Current State of the Art

Claude 3.5 Sonnet (Anthropic): resolve-rate 49

Top Models Performance Comparison

Top 3 models ranked by resolve-rate

Rank  Model              resolve-rate  % of best
1     Claude 3.5 Sonnet  49.0          100.0%
2     GPT-4o             41.2          84.1%
3     DeepSeek V2.5      37.0          75.5%

Best Score: 49.0
Top Model: Claude 3.5 Sonnet
Models Compared: 3
Score Range: 12.0
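
The summary figures above can be reproduced from the three scores. The following is a minimal sketch, assuming "% of best" means each score divided by the top score and "Score Range" means the top score minus the lowest; the helper name `summarize` is hypothetical and not part of the leaderboard.

```python
# Scores as listed on this page.
scores = {
    "Claude 3.5 Sonnet": 49.0,
    "GPT-4o": 41.2,
    "DeepSeek V2.5": 37.0,
}


def summarize(scores: dict[str, float]) -> dict:
    """Rank models by score and derive the summary statistics."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    worst = ranked[-1][1]
    rows = [
        # (rank, model, score, percent of best rounded to one decimal)
        (rank, name, score, round(100 * score / best, 1))
        for rank, (name, score) in enumerate(ranked, start=1)
    ]
    return {
        "best_score": best,
        "top_model": ranked[0][0],
        "models_compared": len(ranked),
        "score_range": round(best - worst, 1),
        "rows": rows,
    }


summary = summarize(scores)
for rank, name, score, pct in summary["rows"]:
    print(f"{rank}  {name:<18} {score:>5.1f}  {pct:>5.1f}%")
```

Rounding to one decimal place mirrors the precision shown in the comparison; a different rounding rule would change the last digit of the percentages.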

resolve-rate (Primary)

#  Model                                      Score  Paper / Code  Date
1  Claude 3.5 Sonnet (API, Anthropic)         49     -             Dec 2025
2  GPT-4o (API, OpenAI)                       41.2   -             Dec 2025
3  DeepSeek V2.5 (Open Source, DeepSeek)      37     -             Dec 2025

Other Code Generation Datasets