Code Generation | 2024 | python
SWE-bench Verified Subset
500 manually verified GitHub issues, each confirmed solvable by human engineers. A high-quality subset of SWE-bench.
Metrics: resolve-rate
Current State of the Art
Claude 3.5 Sonnet (Anthropic): 49 resolve-rate
Top Models Performance Comparison
Top 3 models ranked by resolve-rate
Best Score: 49.0
Top Model: Claude 3.5 Sonnet
Models Compared: 3
Score Range: 12.0
resolve-rate (primary metric)
| # | Model | Score | Paper / Code | Date |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (API, Anthropic) | 49 | | Dec 2025 |
| 2 | GPT-4o (API, OpenAI) | 41.2 | | Dec 2025 |
| 3 | DeepSeek V2.5 (Open Source, DeepSeek) | 37 | | Dec 2025 |
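
The summary figures listed above the table (best score, top model, models compared, score range) can be recomputed from the table rows. The following is a minimal Python sketch, assuming the score range is the difference between the highest and lowest listed scores. The `rows` list and the `summarize` helper are illustrative names, not part of the leaderboard itself.

```python
# Leaderboard rows copied from the table: (model, resolve-rate score).
rows = [
    ("Claude 3.5 Sonnet", 49.0),
    ("GPT-4o", 41.2),
    ("DeepSeek V2.5", 37.0),
]


def summarize(rows):
    """Return summary figures for a list of (model, score) pairs.

    Assumption: "score range" means highest score minus lowest score.
    """
    top_model, best = max(rows, key=lambda r: r[1])
    worst = min(score for _, score in rows)
    return {
        "best_score": best,
        "top_model": top_model,
        "models_compared": len(rows),
        "score_range": round(best - worst, 1),
    }


summary = summarize(rows)
print(summary)
```

Under that assumption, the computed values agree with the figures shown on the page.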