Leaderboards for three benchmarks covering different coding abilities: LiveCodeBench (contest problems), SWE-bench Verified (real GitHub issues), and HumanEval+ (enhanced unit-test coverage).
Continuously updated contest problems from LeetCode, Codeforces, and AtCoder scraped after model training cutoffs. Tests code generation, self-repair, and test-output prediction on truly unseen problems.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | Gemini 3 Pro Preview | Google DeepMind | 91.7% | Apr 2026 |
| 2 | Gemini 3 Flash | Google DeepMind | 90.8% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 85% | Apr 2026 |
| 4 | Grok 4 | xAI | 79% | Apr 2026 |
| 5 | Gemini 2.5 Pro | Google DeepMind | 75.6% | Apr 2026 |
| 6 | DeepSeek-R1-0528 | DeepSeek | 73.3% | May 2025 |
| 7 | o4-mini | OpenAI | 72.8% | Mar 2026 |
| 8 | Qwen3-235B-A22B | Alibaba | 70.7% | May 2025 |
| 9 | o3-mini | OpenAI | 66.9% | Mar 2026 |
| 10 | DeepSeek R1 | DeepSeek | 65.9% | Jan 2025 |
| 11 | o3 | OpenAI | 65.3% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% | Jan 2025 |
| 13 | Gemini 2.5 Flash | Google DeepMind | 63.9% | Apr 2026 |
| 14 | Kimi k1.5 | Moonshot AI | 62.5% | Jan 2025 |
| 15 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% | Jan 2025 |
| 16 | Claude Opus 4 | Anthropic | 57.8% | Mar 2026 |
| 17 | GPT-4.1 | OpenAI | 54.4% | Mar 2026 |
| 18 | Claude Sonnet 4 | Anthropic | 52.8% | Mar 2026 |
| 19 | DeepSeek-V3-0324 | DeepSeek | 49.2% | Mar 2025 |
| 20 | DeepSeek-V3 | DeepSeek | 49.2% | Mar 2026 |
| 21 | GPT-4.1 mini | OpenAI | 48.3% | Apr 2026 |
| 22 | Qwen2.5-Coder 32B | Alibaba | 47.8% | Mar 2026 |
| 23 | Llama-4-Maverick | Meta | 43.4% | Apr 2025 |
| 24 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% | Mar 2026 |
| 25 | GPT-4o | OpenAI | 40.8% | Mar 2026 |
| 26 | Gemma 3 27B IT | Google DeepMind | 39% | Mar 2025 |
| 27 | Llama-4-Scout | Meta | 32.8% | Apr 2025 |
| 28 | Gemma 3 12B IT | Google DeepMind | 32% | Mar 2025 |
| 29 | Codestral 22B | Mistral | 29.5% | Mar 2026 |
| 30 | Gemma 3 4B IT | Google DeepMind | 23% | Mar 2025 |
Source: livecodebench.github.io · Problems released after training cutoffs.
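Pass@1 in these tables is typically computed with the unbiased pass@k estimator from the HumanEval/Codex paper (Chen et al., 2021): generate n samples per problem, count the c that pass all tests, and estimate the probability that a random draw of k samples contains at least one correct solution. A minimal sketch in Python, with illustrative sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated for a problem, c: samples passing all tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 10 samples per problem, varying correct counts.
per_problem = [(10, 7), (10, 0), (10, 3)]
pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {pass_at_1:.3f}")  # 0.333
```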
500 real GitHub issues from popular Python repos. Human-verified to ensure the issue description is clear and the fix is testable. Measures real-world software engineering — not toy problems.
| # | Model | Provider | % Resolved | Date |
|---|---|---|---|---|
| ★ | Claude Opus 4.7 | Anthropic | 87.6% | Apr 2026 |
| 2 | Claude Opus 4.5 | Anthropic | 80.9% | Mar 2026 |
| 3 | Claude Opus 4.6 | Anthropic | 80.8% | Mar 2026 |
| 4 | Gemini 3.1 Pro | Google DeepMind | 80.6% | Mar 2026 |
| 5 | MiniMax M2.5 | MiniMax | 80.2% | Mar 2026 |
| 6 | GPT-5.2 Thinking | OpenAI | 80% | Mar 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | 79.6% | Mar 2026 |
| 8 | Gemini 3 Flash | Google DeepMind | 78% | Mar 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic | 77.2% | Mar 2026 |
| 10 | Kimi K2.5 | Moonshot AI | 76.8% | Mar 2026 |
| 11 | GPT-5.1 | OpenAI | 76.3% | Mar 2026 |
| 12 | Gemini 3 Pro | Google DeepMind | 76.2% | Mar 2026 |
| 13 | GPT-5 | OpenAI | 74.9% | Mar 2026 |
| 14 | MiniMax M2.1 | MiniMax | 74% | Mar 2026 |
| 15 | Claude Haiku 4.5 | Anthropic | 73.3% | Mar 2026 |
| 16 | Claude Sonnet 4 | Anthropic | 72.7% | Mar 2026 |
| 17 | Claude Opus 4 | Anthropic | 72.5% | Mar 2026 |
| 18 | Devstral 2 | Mistral | 72.2% | Mar 2026 |
| 19 | Qwen3-Coder 480B A35B | Alibaba | 69.6% | Mar 2026 |
| 20 | MiniMax M2 | MiniMax | 69.4% | Mar 2026 |
| 21 | o3 | OpenAI | 69.1% | Mar 2026 |
| 22 | o4-mini | OpenAI | 68.1% | Mar 2026 |
| 23 | DeepSeek-V3.1 | DeepSeek | 66% | Mar 2026 |
| 24 | Kimi-K2 | Moonshot AI | 65.8% | Mar 2026 |
| 25 | Grok 3 | xAI | 63.8% | Mar 2026 |
| 26 | Gemini 2.5 Pro | Google DeepMind | 63.8% | Mar 2026 |
| 27 | Claude 3.7 Sonnet | Anthropic | 63.7% | Mar 2026 |
| 28 | Gemini 2.5 Flash | Google DeepMind | 60.4% | Mar 2026 |
| 29 | DeepSeek-R1-0528 | DeepSeek | 57.6% | Mar 2026 |
| 30 | o3-mini | OpenAI | 55.8% | Mar 2026 |
| 31 | GPT-4.1 | OpenAI | 54.6% | Mar 2026 |
| 32 | Claude 3.5 Sonnet | Anthropic | 50.8% | Mar 2026 |
| 33 | DeepSeek R1 | DeepSeek | 49.2% | Mar 2026 |
| 34 | o1 | OpenAI | 48.9% | Mar 2026 |
| 35 | Devstral Small 2505 | Mistral | 46.8% | Mar 2026 |
| 36 | DeepSeek-V3 | DeepSeek | 42% | Mar 2026 |
| 37 | GPT-4o | OpenAI | 41.2% | Mar 2026 |
| 38 | Claude 3.5 Haiku | Anthropic | 40.6% | Mar 2026 |
| 39 | DeepSeek-V2.5 | DeepSeek | 37% | Mar 2026 |
Source: swebench.com · Verified subset, agent scaffolding allowed.
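Under the hood, "% Resolved" means the model's patch makes the issue's reproduction tests pass without regressing the existing suite. A simplified sketch of that check, assuming the FAIL_TO_PASS and PASS_TO_PASS test lists that SWE-bench instances carry; `run_tests` here is a hypothetical stand-in for the harness's real Docker-based runner:

```python
import subprocess

def run_tests(repo_dir: str, tests: list[str]) -> bool:
    """Hypothetical helper: run the named tests, True if all pass.
    The real harness executes each instance in a pinned Docker image."""
    result = subprocess.run(
        ["python", "-m", "pytest", *tests],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, patch_file: str,
                fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Apply the model-generated patch at the issue's base commit.
    applied = subprocess.run(
        ["git", "apply", patch_file], cwd=repo_dir
    ).returncode == 0
    if not applied:
        return False
    # FAIL_TO_PASS: tests that reproduced the bug must now pass.
    # PASS_TO_PASS: previously passing tests must not regress.
    return (run_tests(repo_dir, fail_to_pass)
            and run_tests(repo_dir, pass_to_pass))
```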
EvalPlus extends HumanEval with 80x more test inputs per problem, catching solutions that pass original tests but fail on edge cases. More rigorous than the base HumanEval benchmark.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | o3-mini (high) | OpenAI | 95.1% | Feb 2025 |
| 2 | Claude 3.7 Sonnet | Anthropic | 94.3% | Feb 2025 |
| 3 | Gemini 2.5 Pro | Google DeepMind | 93.7% | Mar 2025 |
| 4 | DeepSeek-R1 | DeepSeek | 91.2% | Jan 2025 |
| 5 | Claude 3.5 Sonnet | Anthropic | 88.1% | Jun 2024 |
| 6 | GPT-4o | OpenAI | 87.4% | May 2024 |
| 7 | Gemini 1.5 Pro | Google DeepMind | 80.3% | May 2024 |
| 8 | Llama 3 70B Instruct | Meta | 75.9% | Apr 2024 |
Source: evalplus/evalplus · 80x augmented test cases vs. original HumanEval.
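To see why the extra inputs matter, here is a hypothetical candidate solution (not from HumanEval itself) that would pass a sparse base test yet fail once augmented inputs probe even-length lists:

```python
def median(nums):
    """Buggy candidate: correct for odd-length input (what a sparse
    base test checks) but wrong for even lengths."""
    return sorted(nums)[len(nums) // 2]

print(median([3, 1, 2]))     # 2   -> passes a typical base test
print(median([1, 2, 3, 4]))  # 3   -> wrong; the true median is 2.5
# An 80x-augmented suite samples many such inputs, so a solution like
# this slips through base HumanEval but fails under EvalPlus checking.
```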
LiveCodeBench is the best single benchmark for general coding ability — it avoids contamination and is continuously updated. For software engineering tasks (debugging, refactoring real codebases), use SWE-bench Verified. HumanEval is saturated and no longer differentiates frontier models.
Problems are scraped from competitive programming platforms, and each model is evaluated only on problems released after its training cutoff, so no model can have seen the exact problems during training.
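A minimal sketch of that date filtering, with illustrative field names rather than the benchmark's actual schema:

```python
from datetime import date

# Illustrative records; LiveCodeBench's real data schema differs.
problems = [
    {"id": "lc-3201", "released": date(2025, 6, 14)},
    {"id": "cf-1998B", "released": date(2024, 8, 2)},
]
model_cutoff = date(2025, 1, 1)

# Evaluate a model only on problems released after its training cutoff,
# so the exact problems cannot appear in its training data.
eval_set = [p for p in problems if p["released"] > model_cutoff]
print([p["id"] for p in eval_set])  # ['lc-3201']
```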
LiveCodeBench rewards algorithmic reasoning under tight constraints — reasoning models excel here. SWE-bench rewards understanding codebases, writing clean patches, and following project conventions — instruction-following models with longer context tend to have an edge.