Pass@1 scores across the two most-cited Python coding benchmarks. HumanEval tests algorithmic problem-solving; MBPP tests practical scripting ability.
Both benchmarks are largely saturated at the frontier; to differentiate today's best models, see LiveCodeBench or SWE-bench.
164 hand-written Python functions with unit tests, released by OpenAI in 2021. Pass@1 = the fraction of problems where a single greedy sample passes all unit tests.
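For reference, the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) reduces to the plain pass rate when one greedy sample is drawn per problem (n = 1, k = 1). A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem
    c: samples that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With greedy decoding (n = 1), pass@1 per problem is simply 0 or 1;
# the leaderboard number is the mean over all 164 problems.
```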
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | o4-mini | OpenAI | 97.3% | Mar 2026 |
| 2 | Claude Opus 4.6 | Anthropic | 96.3% | Apr 2026 |
| 3 | o3-mini | OpenAI | 96.3% | Mar 2026 |
| 4 | GPT-5 | OpenAI | 95.1% | Apr 2026 |
| 5 | o3 | OpenAI | 94.8% | Apr 2026 |
| 6 | GPT-4.1 | OpenAI | 94.5% | Mar 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | 94.1% | Apr 2026 |
| 8 | GPT-4.1 mini | OpenAI | 93.8% | Apr 2025 |
| 9 | Qwen2.5-Coder 32B | Alibaba | 92.7% | Mar 2026 |
| 10 | o1-preview | OpenAI | 92.4% | Mar 2026 |
| 11 | o1-mini | OpenAI | 92.4% | Mar 2026 |
| 12 | Claude Opus 4 | Anthropic | 92.2% | Mar 2026 |
| 13 | Claude 3.5 Sonnet | Anthropic | 92% | Mar 2026 |
| 14 | GPT-4o | OpenAI | 91% | Mar 2026 |
| 15 | Claude Sonnet 4 | Anthropic | 90.6% | Mar 2026 |
| 16 | GPT-4o | OpenAI | 90.2% | Apr 2026 |
| 17 | DeepSeek-Coder-V2-Instruct | DeepSeek | 90.2% | Apr 2026 |
| 18 | Llama 3.1 405B | Meta | 89% | Mar 2026 |
| 19 | GPT-4.5 Preview | OpenAI | 88.6% | Mar 2026 |
| 20 | Llama 3.3 70B Instruct | Meta | 88.4% | Apr 2026 |
| 21 | Grok 2 | xAI | 88.4% | Mar 2026 |
| 22 | GPT-4 Turbo | OpenAI | 88.2% | Mar 2026 |
| 23 | Gemma 3 27B IT | Google DeepMind | 87.8% | Mar 2025 |
| 24 | o3 | OpenAI | 87.4% | Mar 2026 |
| 25 | GPT-4o mini | OpenAI | 87.2% | Mar 2026 |
| 26 | GPT-4 Turbo | OpenAI | 86.6% | Apr 2026 |
| 27 | Gemma 3 12B IT | Google DeepMind | 85.4% | Mar 2025 |
| 28 | Codestral 25.01 | Mistral AI | 85.3% | Apr 2026 |
| 29 | Claude 3 Opus | Anthropic | 84.9% | Mar 2026 |
| 30 | Phi-4 | Microsoft | 82.6% | Dec 2024 |
| 31 | DeepSeek-V3 | DeepSeek | 82.6% | Mar 2026 |
| 32 | Llama 3 70B | Meta | 81.7% | Mar 2026 |
| 33 | Codestral 22B | Mistral AI | 81.1% | Mar 2026 |
| 34 | Llama 3.1 70B | Meta | 80.5% | Mar 2026 |
| 35 | DeepSeek-Coder-33B-Instruct | DeepSeek | 79.3% | Apr 2026 |
| 36 | Gemini 1.5 Pro | Google DeepMind | 71.9% | Mar 2026 |
| 37 | Gemma 3 4B IT | Google DeepMind | 71.3% | Mar 2025 |
| 38 | Code Llama 34B | Meta | 62.4% | Mar 2026 |
| 39 | StarCoder2 15B | BigCode | 46.9% | Mar 2026 |
| 40 | Codex (davinci-002) | OpenAI | 46.9% | Apr 2026 |
Source: openai/human-eval · Greedy decode (temperature 0), Pass@1.
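For reproduction, the openai/human-eval harness expects one JSONL record per completion and scores it with a bundled CLI. A minimal sketch, with `generate_completion` standing in for your model call (a placeholder, not part of the harness):

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # task_id -> {"prompt", "entry_point", "test", ...}

def generate_completion(prompt: str) -> str:
    # Placeholder (not part of human-eval): call your model at temperature 0
    # and return only the code that continues `prompt`.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Score the file (this executes untrusted model code; the repo's README
# recommends running it inside a sandbox):
#   $ evaluate_functional_correctness samples.jsonl
```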
~500 crowd-sourced Python problems collected by Google Research, covering basic data structures, string manipulation, and simple algorithms. Tests practical scripting fluency more than algorithmic reasoning.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | o4-mini | OpenAI | 94.9% | Mar 2026 |
| 2 | o3-mini | OpenAI | 93.3% | Mar 2026 |
| 3 | Claude Opus 4 | Anthropic | 92% | Mar 2026 |
| 4 | GPT-4.1 | OpenAI | 90.9% | Mar 2026 |
| 5 | Qwen2.5-Coder 32B | Alibaba | 90.2% | Mar 2026 |
| 6 | Claude Sonnet 4 | Anthropic | 89.6% | Mar 2026 |
| 7 | DeepSeek-Coder-V2-Instruct | DeepSeek | 89.4% | Sep 2024 |
| 8 | DeepSeek-V3 | DeepSeek | 89.3% | Mar 2026 |
| 9 | Claude 3.5 Sonnet | Anthropic | 89.2% | Dec 2025 |
| 10 | GPT-4o | OpenAI | 87.8% | Dec 2025 |
| 11 | Llama 4 Maverick | Meta | 77.6% | Apr 2025 |
| 12 | Codestral 22B | Mistral AI | 75.4% | Mar 2026 |
| 13 | Gemma 3 27B IT | Google DeepMind | 74.4% | Mar 2025 |
| 14 | Gemma 3 12B IT | Google DeepMind | 73% | Mar 2025 |
| 15 | Llama 4 Scout | Meta | 67.8% | Apr 2025 |
| 16 | Gemma 3 4B IT | Google DeepMind | 63.2% | Mar 2025 |
| 17 | Code Llama 34B | Meta | 62.6% | Mar 2026 |
| 18 | StarCoder2 15B | BigCode | 54.4% | Mar 2026 |
Source: google-research/mbpp · 3-shot evaluation, sanitized split (374 problems).
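For context, the standard MBPP setup prepends three worked examples before the target task. A sketch of one common prompt template (mirroring the one in the MBPP paper; exact wording varies across harnesses), using the Hugging Face copy of the dataset:

```python
from datasets import load_dataset

mbpp = load_dataset("mbpp")  # default "full" config: text, code, test_list, ...

def render(example: dict, include_solution: bool) -> str:
    """Format one MBPP task; worked examples include the reference solution."""
    tests = "\n".join(example["test_list"])
    block = (
        f"You are an expert Python programmer, and here is your task: "
        f"{example['text']}\nYour code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
    if include_solution:
        block += f"{example['code']}\n[DONE]\n\n"
    return block

# Three worked examples from the dedicated few-shot split, then the target task.
shots = "".join(render(ex, True) for ex in list(mbpp["prompt"])[:3])
prompt = shots + render(mbpp["test"][0], False)
```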
164 hand-written Python programming problems released by OpenAI in 2021. Each problem includes a function signature, docstring, and test cases. Models generate the function body, which is executed against the tests.
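To make the format concrete, here is a toy problem in the same shape (illustrative only, not an actual benchmark item); the real harness executes completions in a sandbox rather than a bare `exec`:

```python
# Hypothetical problem in the HumanEval format: signature + docstring.
prompt = (
    "def running_max(xs: list) -> list:\n"
    '    """Return the running maximum of xs."""\n'
)

# The model sees `prompt` and generates only the indented function body.
completion = (
    "    out, cur = [], float('-inf')\n"
    "    for x in xs:\n"
    "        cur = max(cur, x)\n"
    "        out.append(cur)\n"
    "    return out\n"
)

# Scoring: execute prompt + completion, then run the problem's unit tests.
namespace = {}
exec(prompt + completion, namespace)
assert namespace["running_max"]([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
```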
~500 Python tasks collected from crowd workers by Google Research. Problems are simpler than HumanEval's: string manipulation, list operations, basic math. Useful for evaluating small and mid-sized models that struggle with harder benchmarks.
With top scores at or above 95% on both benchmarks, neither separates frontier models any longer: GPT-4-class models are essentially tied. LiveCodeBench uses live contest problems to avoid training-set contamination and still provides meaningful signal for current models.