500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.
39 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | resolve-rate (%) |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | — | Apr 2026 | vendor | 87.60 |
| 02 | Claude Opus 4.5 (API) | Anthropic | Nov 2025 | anthropic-blog | 80.90 |
| 03 | Claude Opus 4.6 (API) | Anthropic | Feb 2026 | anthropic-blog | 80.80 |
| 04 | Gemini 3.1 Pro (API) | Google | Feb 2026 | google-blog | 80.60 |
| 05 | MiniMax M2.5 (OSS) | MiniMax | Feb 2026 | minimax-blog | 80.20 |
| 06 | GPT-5.2 Thinking (API) | OpenAI | Dec 2025 | openai-blog | 80.00 |
| 07 | Claude Sonnet 4.6 (API) | Anthropic | Feb 2026 | anthropic-blog | 79.60 |
| 08 | Gemini 3 Flash (API) | Google | Dec 2025 | google-blog | 78.00 |
| 09 | Claude Sonnet 4.5 (API) | Anthropic | Mar 2026 | anthropic-blog | 77.20 |
| 10 | Kimi K2.5 (API) | Moonshot AI | Mar 2026 | moonshot-blog | 76.80 |
| 11 | GPT-5.1 (API) | OpenAI | Mar 2026 | openai-blog | 76.30 |
| 12 | Gemini 3 Pro (API) | Google | Mar 2026 | google-blog | 76.20 |
| 13 | GPT-5 (API) | OpenAI | Mar 2026 | openai-blog | 74.90 |
| 14 | MiniMax M2.1 (API) | MiniMax | Mar 2026 | minimax-blog | 74.00 |
| 15 | Claude Haiku 4.5 (API) | Anthropic | Mar 2026 | anthropic-blog | 73.30 |
| 16 | Claude Sonnet 4 (API) | Anthropic | Mar 2026 | anthropic-blog | 72.70 |
| 17 | Claude Opus 4 (API) | Anthropic | Mar 2026 | anthropic-blog | 72.50 |
| 18 | Devstral 2 (OSS) | Mistral | Mar 2026 | mistral-blog | 72.20 |
| 19 | Qwen3-Coder 480B A35B (OSS) | Alibaba Cloud | Mar 2026 | qwen-blog | 69.60 |
| 20 | MiniMax M2 (API) | MiniMax | Mar 2026 | minimax-blog | 69.40 |
| 21 | o3 (API) | OpenAI | Mar 2026 | openai-blog | 69.10 |
| 22 | o4-mini (API) | OpenAI | Mar 2026 | swebench-leaderboard | 68.10 |
| 23 | DeepSeek-V3.1 (OSS) | DeepSeek | Mar 2026 | deepseek-blog | 66.00 |
| 24 | Kimi-K2 (OSS) | Moonshot AI | Mar 2026 | kimi-techreport | 65.80 |
| 25 | Grok 3 (API) | xAI | Mar 2026 | xai-blog | 63.80 |
| 26 | Gemini 2.5 Pro (API) | Google | Mar 2026 | google-blog | 63.80 |
| 27 | Claude 3.7 Sonnet (API) | Anthropic | Mar 2026 | anthropic-blog | 63.70 |
| 28 | Gemini 2.5 Flash (API) | Google | Mar 2026 | google-blog | 60.40 |
| 29 | DeepSeek-R1-0528 (OSS) | DeepSeek | Mar 2026 | deepseek-blog | 57.60 |
| 30 | o3-mini (API) | OpenAI | Mar 2026 | swebench-leaderboard | 55.80 |
| 31 | GPT-4.1 (API) | OpenAI | Mar 2026 | swebench-leaderboard | 54.60 |
| 32 | Claude 3.5 Sonnet (API) | Anthropic | Mar 2026 | anthropic-blog | 50.80 |
| 33 | DeepSeek R1 (OSS) | DeepSeek | Mar 2026 | swebench-leaderboard | 49.20 |
| 34 | o1 (API) | OpenAI | Mar 2026 | swebench-leaderboard | 48.90 |
| 35 | Devstral Small 2505 (OSS) | Mistral | Mar 2026 | mistral-blog | 46.80 |
| 36 | DeepSeek-V3 (OSS) | DeepSeek | Mar 2026 | swebench-leaderboard | 42.00 |
| 37 | GPT-4o (API) | OpenAI | Mar 2026 | swebench-leaderboard | 41.20 |
| 38 | Claude 3.5 Haiku (API) | Anthropic | Mar 2026 | anthropic-blog | 40.60 |
| 39 | DeepSeek-V2.5 (OSS) | DeepSeek | Mar 2026 | deepseek-blog | 37.00 |
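The ordering rule stated above the table (higher resolve-rate first, ties broken by submission date) can be sketched as a two-part sort key. This is only an illustration, not the site's actual ranking code: the tuple layout and the `rank` helper are hypothetical, the tie-break direction (earlier submission first) is an assumption, and the day of month is set to 1 because the table gives only month and year.

```python
from datetime import date

# Hypothetical records: (model, submitted, resolve_rate).
# Scores and months are copied from the table; day=1 is a placeholder.
rows = [
    ("Grok 3", date(2026, 3, 1), 63.80),
    ("Claude Opus 4.7", date(2026, 4, 1), 87.60),
    ("Claude Opus 4.6", date(2026, 2, 1), 80.80),
    ("Gemini 2.5 Pro", date(2026, 3, 1), 63.80),
    ("Claude Opus 4.5", date(2025, 11, 1), 80.90),
]


def rank(rows):
    """Sort by resolve-rate descending, then by submission date ascending.

    Assumption: on equal scores the earlier submission ranks higher.
    """
    return sorted(rows, key=lambda r: (-r[2], r[1]))


ranked = rank(rows)
for position, (model, submitted, score) in enumerate(ranked, start=1):
    print(f"{position:02d} {model:<16} {submitted:%b %Y} {score:.2f}")
```

Because `sorted` is stable, entries that tie on both score and month (Grok 3 and Gemini 2.5 Pro here) keep their input order, so a finer-grained date would be needed to separate them.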
Each row below marks a model that broke the previous record on resolve-rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
Higher scores win. Each subsequent entry improved upon the previous best.
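The selection rule described here amounts to a running maximum over submissions in date order: an entry is kept only if it strictly beats every earlier score. A minimal sketch, assuming a hypothetical `sota_progression` helper and a small sample of rows taken from the leaderboard (it is not the page's actual progress chart, and day=1 is a placeholder since only month and year are listed):

```python
from datetime import date


def sota_progression(submissions):
    """Return the entries that strictly beat all earlier resolve-rates."""
    best = float("-inf")
    steps = []
    # sorted() is stable, so same-month entries keep their input order.
    for model, submitted, score in sorted(submissions, key=lambda s: s[1]):
        if score > best:
            best = score
            steps.append((model, submitted, score))
    return steps


# Illustrative sample copied from the leaderboard rows.
submissions = [
    ("Claude Opus 4.5", date(2025, 11, 1), 80.90),
    ("GPT-5.2 Thinking", date(2025, 12, 1), 80.00),
    ("Claude Opus 4.6", date(2026, 2, 1), 80.80),
    ("Claude Opus 4.7", date(2026, 4, 1), 87.60),
]

steps = sota_progression(submissions)
for model, submitted, score in steps:
    print(f"{submitted:%b %Y}  {model}  {score:.2f}")
```

On this sample only Claude Opus 4.5 and Claude Opus 4.7 are kept; the two submissions in between score below the 80.90 record and stay in the leaderboard only.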
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.