500 manually verified GitHub issues, each confirmed solvable by human engineers. The primary benchmark for software engineering agents. Results are tracked for autonomous scaffolds, so a score reflects the whole agent setup and not just model capability.
81 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | resolve-rate (%) |
|---|---|---|---|---|---|
| 01 | Claude Mythos Preview | Anthropic | Apr 2026 | editorial | 93.90 |
| 02 | Claude Opus 4.5 (API) | Anthropic | Apr 2026 | editorial | 80.90 |
| 03 | Claude Opus 4.6 (API) | Anthropic | Apr 2026 | editorial | 80.80 |
| 04 | Gemini 3.1 Pro (API) | Google | Apr 2026 | editorial | 80.60 |
| 05 | MiniMax M2.5 (OSS) | MiniMax | Apr 2026 | editorial | 80.20 |
| 06 | GPT-5.2 (API) | OpenAI | Apr 2026 | editorial | 80.00 |
| 07 | Claude Sonnet 4.6 (API) | Anthropic | Apr 2026 | editorial | 79.60 |
| 08 | Qwen3.6 Plus | Alibaba Cloud | Apr 2026 | editorial | 78.80 |
| 09 | MiMo-V2-Pro (OSS) | Xiaomi | Apr 2026 | editorial | 78.00 |
| 10 | Gemini 3 Flash (API) | Google | Apr 2026 | editorial | 78.00 |
| 11 | GLM-5 (OSS) | Zhipu AI | Apr 2026 | editorial | 77.80 |
| 12 | Muse Spark | Meta | Apr 2026 | editorial | 77.40 |
| 13 | Kimi K2.5 (API) | Moonshot AI | Apr 2026 | editorial | 76.80 |
| 14 | Seed 2.0 Pro | ByteDance | Apr 2026 | editorial | 76.50 |
| 15 | Qwen3.5-397B-A17B | Alibaba Cloud | Apr 2026 | editorial | 76.40 |
| 16 | GPT-5.1 Instant | OpenAI | Apr 2026 | editorial | 76.30 |
| 17 | GPT-5.1 Thinking | OpenAI | Apr 2026 | editorial | 76.30 |
| 18 | GPT-5.1 (API) | OpenAI | Apr 2026 | editorial | 76.30 |
| 19 | Gemini 3 Pro (API) | Google | Apr 2026 | editorial | 76.20 |
| 20 | GPT-5 (API) | OpenAI | Apr 2026 | editorial | 74.90 |
| 21 | MiMo-V2-Omni | Xiaomi | Apr 2026 | editorial | 74.80 |
| 22 | GPT-5 Codex | OpenAI | Apr 2026 | editorial | 74.50 |
| 23 | Claude Opus 4.1 | Anthropic | Apr 2026 | editorial | 74.50 |
| 24 | Step-3.5-Flash (OSS) | StepFun | Apr 2026 | editorial | 74.40 |
| 25 | GLM-4.7 | Zhipu AI | Apr 2026 | editorial | 73.80 |
| 26 | GPT-5.1 Codex | OpenAI | Apr 2026 | editorial | 73.70 |
| 27 | Seed 2.0 Lite | ByteDance | Apr 2026 | editorial | 73.50 |
| 28 | MiMo-V2-Flash | Xiaomi | Apr 2026 | editorial | 73.40 |
| 29 | Claude Haiku 4.5 (API) | Anthropic | Apr 2026 | editorial | 73.30 |
| 30 | DeepSeek-V3.2-Speciale | DeepSeek | Apr 2026 | editorial | 73.10 |
| 31 | DeepSeek-V3.2 (Thinking) | DeepSeek | Apr 2026 | editorial | 73.10 |
| 32 | Claude Sonnet 4 (API) | Anthropic | Apr 2026 | editorial | 72.70 |
| 33 | Claude Opus 4 (API) | Anthropic | Apr 2026 | editorial | 72.50 |
| 34 | Qwen3.5-27B | Alibaba Cloud | Apr 2026 | editorial | 72.40 |
| 35 | Qwen3.5-122B-A10B | Alibaba Cloud | Apr 2026 | editorial | 72.00 |
| 36 | Kimi K2-Thinking-0905 (OSS) | Moonshot AI | Apr 2026 | editorial | 71.30 |
| 37 | Grok Code Fast 1 | xAI | Apr 2026 | editorial | 70.80 |
| 38 | Claude 3.7 Sonnet (API) | Anthropic | Apr 2026 | editorial | 70.30 |
| 39 | LongCat-Flash-Thinking-2601 | Meituan | Apr 2026 | editorial | 70.00 |
| 40 | Qwen3-Coder 480B A35B (OSS) | Alibaba Cloud | Apr 2026 | editorial | 69.60 |
| 41 | Qwen3 Max (OSS) | Alibaba Cloud | Apr 2026 | editorial | 69.60 |
| 42 | MiniMax M2 (API) | MiniMax | Apr 2026 | editorial | 69.40 |
| 43 | Qwen3.5-35B-A3B | Alibaba Cloud | Apr 2026 | editorial | 69.20 |
| 44 | o3 (API) | OpenAI | Apr 2026 | editorial | 69.10 |
| 45 | o4-mini (API) | OpenAI | Apr 2026 | editorial | 68.10 |
| 46 | GLM-4.6 | Zhipu AI | Apr 2026 | editorial | 68.00 |
| 47 | DeepSeek-V3.2-Exp | DeepSeek | Apr 2026 | editorial | 67.80 |
| 48 | Gemini 2.5 Pro Preview | Google | Apr 2026 | editorial | 67.20 |
| 49 | MiniMax M2.1 (API) | MiniMax | Apr 2026 | editorial | 67.00 |
| 50 | DeepSeek-V3.1 (OSS) | DeepSeek | Apr 2026 | editorial | 66.00 |
| 51 | Kimi K2-Instruct-0905 | Moonshot AI | Apr 2026 | editorial | 65.80 |
| 52 | GLM-4.5 | Zhipu AI | Apr 2026 | editorial | 64.20 |
| 53 | Gemini 2.5 Pro (API) | Google | Apr 2026 | editorial | 63.20 |
| 54 | Devstral Medium | Mistral AI | Apr 2026 | editorial | 61.60 |
| 55 | LongCat-Flash-Chat | Meituan | Apr 2026 | editorial | 60.40 |
| 56 | Gemini 2.5 Flash (API) | Google | Apr 2026 | editorial | 60.40 |
| 57 | LongCat-Flash-Thinking | Meituan | Apr 2026 | editorial | 59.40 |
| 58 | GLM-4.7-Flash | Zhipu AI | Apr 2026 | editorial | 59.20 |
| 59 | GLM-4.5-Air | Zhipu AI | Apr 2026 | editorial | 57.60 |
| 60 | MiniMax M1 80K | MiniMax | Apr 2026 | editorial | 56.00 |
| 61 | MiniMax M1 40K | MiniMax | Apr 2026 | editorial | 55.60 |
| 62 | GPT-4.1 (API) | OpenAI | Apr 2026 | editorial | 54.60 |
| 63 | LongCat-Flash-Lite | Meituan | Apr 2026 | editorial | 54.40 |
| 64 | Nemotron 3 Super (120B) | NVIDIA | Apr 2026 | editorial | 53.70 |
| 65 | Devstral Small 1.1 | Mistral AI | Apr 2026 | editorial | 53.60 |
| 66 | o3-mini (API) | OpenAI | Apr 2026 | editorial | 49.30 |
| 67 | Claude 3.5 Sonnet (API) | Anthropic | Apr 2026 | editorial | 49.00 |
| 68 | Sarvam-105B | Sarvam AI | Apr 2026 | editorial | 45.00 |
| 69 | DeepSeek-R1-0528 (OSS) | DeepSeek | Apr 2026 | editorial | 44.60 |
| 70 | DeepSeek-V3 (OSS) | DeepSeek | Apr 2026 | editorial | 42.00 |
| 71 | o1-preview (API) | OpenAI | Apr 2026 | editorial | 41.30 |
| 72 | o1 (API) | OpenAI | Apr 2026 | editorial | 41.00 |
| 73 | Claude 3.5 Haiku (API) | Anthropic | Apr 2026 | editorial | 40.60 |
| 74 | Nemotron 3 Nano (30B) | NVIDIA | Apr 2026 | editorial | 38.80 |
| 75 | GPT-4.5 (API) | OpenAI | Apr 2026 | editorial | 38.00 |
| 76 | Sarvam-30B | Sarvam AI | Apr 2026 | editorial | 34.00 |
| 77 | GPT-4o (API) | OpenAI | Apr 2026 | editorial | 33.20 |
| 78 | Gemini 2.5 Flash-Lite | Google | Apr 2026 | editorial | 31.60 |
| 79 | GPT-4.1 mini (API) | OpenAI | Apr 2026 | editorial | 23.60 |
| 80 | Gemini Diffusion | Google | Apr 2026 | editorial | 22.90 |
| 81 | DeepSeek-V2.5 (OSS) | DeepSeek | Apr 2026 | editorial | 16.80 |
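The ordering rule stated above the table (highest resolve-rate first, ties broken by submission date) can be sketched as a two-key sort. This is only an illustration, not the site's actual ranking code: the `Result` type, the model names, and the scores are hypothetical, and it assumes that the earlier submission wins a tie, which the page does not spell out. It also assumes full submission dates are available, whereas the table shows only month and year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One leaderboard entry (hypothetical structure for illustration)."""
    model: str
    resolve_rate: float  # percent of issues resolved
    submitted: date      # assumed full date; the table lists only month and year


def rank(results: list[Result]) -> list[Result]:
    """Sort by resolve-rate, highest first; on a tie, earlier submission first.

    The tie direction (earlier wins) is an assumption.
    """
    return sorted(results, key=lambda r: (-r.resolve_rate, r.submitted))


# Made-up entries: two of them tie on resolve-rate.
results = [
    Result("model-a", 76.3, date(2026, 4, 12)),
    Result("model-b", 80.0, date(2026, 4, 20)),
    Result("model-c", 76.3, date(2026, 4, 3)),
]

ranked = rank(results)
for position, r in enumerate(ranked, start=1):
    print(f"{position:02d} {r.model} {r.resolve_rate:.2f}")
```

Negating the score inside the sort key keeps the sort to a single pass while still ordering dates in ascending order.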
Each row below marks a model that broke the previous record on resolve-rate, where higher is better. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
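The record-setting rule described here amounts to a running-maximum filter over submissions in chronological order. The sketch below is a minimal illustration under that reading. The function name `sota_steps`, the model names, and the scores are all hypothetical, and it assumes an entry must strictly exceed the previous best to count as a new record.

```python
def sota_steps(submissions: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Return the entries that strictly beat the best score seen so far.

    `submissions` must already be in chronological order.
    """
    best = float("-inf")
    steps = []
    for name, score in submissions:
        if score > best:  # an equal score does not set a new record (assumption)
            steps.append((name, score))
            best = score
    return steps


# Made-up history, oldest first.
history = [
    ("model-a", 41.0),
    ("model-b", 38.0),  # below the current best, so not a record
    ("model-c", 49.0),
    ("model-d", 49.0),  # ties the best, so not a record
    ("model-e", 72.5),
]

steps = sota_steps(history)
for name, score in steps:
    print(f"{name}: {score:.2f}")
```

Every entry in `history` would still appear in the full leaderboard; only the entries in `steps` would be re-listed as SOTA-setting.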
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.