Pass@1 scores across the two most-cited Python coding benchmarks. HumanEval tests algorithmic problem-solving; MBPP tests practical scripting ability.
Both benchmarks are largely saturated at the frontier; to differentiate today's best models, see LiveCodeBench or SWE-bench.
164 hand-written Python functions with unit tests, released by OpenAI in 2021. Pass@1 = the fraction of problems where a single greedy sample passes all unit tests.
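For reference, the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) reduces to the plain pass rate when one greedy sample is drawn per problem (n = 1, k = 1). A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem
    c: samples that passed all unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With greedy decoding (n = 1), pass@1 per problem is simply 0 or 1;
# the leaderboard number is the mean over all 164 problems.
```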
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | o4-mini | OpenAI | 97.3% | Mar 2026 |
| 2 | Claude Opus 4.6 | Anthropic | 96.3% | Apr 2026 |
| 3 | o3-mini | OpenAI | 96.3% | Mar 2026 |
| 4 | GPT-5 | OpenAI | 95.1% | Apr 2026 |
| 5 | o3 | OpenAI | 94.8% | Apr 2026 |
| 6 | GPT-4.1 | OpenAI | 94.5% | Mar 2026 |
| 7 | Claude Sonnet 4.6 | Anthropic | 94.1% | Apr 2026 |
| 8 | GPT-4.1 mini | OpenAI | 93.8% | Apr 2025 |
| 9 | Qwen2.5-Coder 32B | Alibaba | 92.7% | Mar 2026 |
| 10 | o1-preview | OpenAI | 92.4% | Mar 2026 |
| 11 | o1-mini | OpenAI | 92.4% | Mar 2026 |
| 12 | Claude Opus 4 | Anthropic | 92.2% | Mar 2026 |
| 13 | Claude 3.5 Sonnet | Anthropic | 92% | Mar 2026 |
| 14 | GPT-4o | OpenAI | 91% | Mar 2026 |
| 15 | Claude Sonnet 4 | Anthropic | 90.6% | Mar 2026 |
| 16 | GPT-4o | OpenAI | 90.2% | Apr 2026 |
| 17 | DeepSeek-Coder-V2-Instruct | DeepSeek | 90.2% | Apr 2026 |
| 18 | Llama 3.1 405B | Meta | 89% | Mar 2026 |
| 19 | GPT-4.5 Preview | OpenAI | 88.6% | Mar 2026 |
| 20 | Llama 3.3 70B Instruct | Meta | 88.4% | Apr 2026 |
| 21 | Grok 2 | xAI | 88.4% | Mar 2026 |
| 22 | GPT-4 Turbo | OpenAI | 88.2% | Mar 2026 |
| 23 | Gemma 3 27B IT | Google DeepMind | 87.8% | Mar 2025 |
| 24 | o3 | OpenAI | 87.4% | Mar 2026 |
| 25 | GPT-4o mini | OpenAI | 87.2% | Mar 2026 |
| 26 | GPT-4 Turbo | OpenAI | 86.6% | Apr 2026 |
| 27 | Gemma 3 12B IT | Google DeepMind | 85.4% | Mar 2025 |
| 28 | Codestral 25.01 | Mistral AI | 85.3% | Apr 2026 |
| 29 | Claude 3 Opus | Anthropic | 84.9% | Mar 2026 |
| 30 | Phi-4 | Microsoft | 82.6% | Dec 2024 |
| 31 | DeepSeek-V3 | DeepSeek | 82.6% | Mar 2026 |
| 32 | Llama 3 70B | Meta | 81.7% | Mar 2026 |
| 33 | Codestral 22B | Mistral AI | 81.1% | Mar 2026 |
| 34 | Llama 3.1 70B | Meta | 80.5% | Mar 2026 |
| 35 | DeepSeek-Coder-33B-Instruct | DeepSeek | 79.3% | Apr 2026 |
| 36 | Gemini 1.5 Pro | Google DeepMind | 71.9% | Mar 2026 |
| 37 | Gemma 3 4B IT | Google DeepMind | 71.3% | Mar 2025 |
| 38 | Code Llama 34B | Meta | 62.4% | Mar 2026 |
| 39 | StarCoder2 15B | BigCode | 46.9% | Mar 2026 |
| 40 | Codex (davinci-002) | OpenAI | 46.9% | Apr 2026 |
Source: openai/human-eval · Greedy decode (temperature 0), Pass@1.
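For reproduction, the openai/human-eval harness expects one JSONL record per completion and scores it with a bundled CLI. A minimal sketch, with `generate_completion` standing in for your model call (a placeholder, not part of the harness):

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # task_id -> {"prompt", "entry_point", "test", ...}

def generate_completion(prompt: str) -> str:
    # Placeholder (not part of human-eval): call your model at temperature 0
    # and return only the code that continues `prompt`.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Score the file (this executes untrusted model code; the repo's README
# recommends running it inside a sandbox):
#   $ evaluate_functional_correctness samples.jsonl
```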
~500 crowd-sourced Python problems collected by Google Research, covering basic data structures, string manipulation, and simple algorithms. Tests practical scripting fluency more than algorithmic reasoning.
| # | Model | Provider | Pass@1 | Date |
|---|---|---|---|---|
| ★ | o4-mini | OpenAI | 94.9% | Mar 2026 |
| 2 | o3-mini | OpenAI | 93.3% | Mar 2026 |
| 3 | Claude Opus 4 | Anthropic | 92% | Mar 2026 |
| 4 | GPT-4.1 | OpenAI | 90.9% | Mar 2026 |
| 5 | Qwen2.5-Coder 32B | Alibaba | 90.2% | Mar 2026 |
| 6 | Claude Sonnet 4 | Anthropic | 89.6% | Mar 2026 |
| 7 | DeepSeek-Coder-V2-Instruct | DeepSeek | 89.4% | Sep 2024 |
| 8 | DeepSeek-V3 | DeepSeek | 89.3% | Mar 2026 |
| 9 | Claude 3.5 Sonnet | Anthropic | 89.2% | Dec 2025 |
| 10 | GPT-4o | OpenAI | 87.8% | Dec 2025 |
| 11 | Llama 4 Maverick | Meta | 77.6% | Apr 2025 |
| 12 | Codestral 22B | Mistral AI | 75.4% | Mar 2026 |
| 13 | Gemma 3 27B IT | Google DeepMind | 74.4% | Mar 2025 |
| 14 | Gemma 3 12B IT | Google DeepMind | 73% | Mar 2025 |
| 15 | Llama 4 Scout | Meta | 67.8% | Apr 2025 |
| 16 | Gemma 3 4B IT | Google DeepMind | 63.2% | Mar 2025 |
| 17 | Code Llama 34B | Meta | 62.6% | Mar 2026 |
| 18 | StarCoder2 15B | BigCode | 54.4% | Mar 2026 |
Source: google-research/mbpp · 3-shot evaluation, sanitized split (374 problems).
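For context, the standard MBPP setup prepends three worked examples before the target task. A sketch of one common prompt template (mirroring the one in the MBPP paper; exact wording varies across harnesses), using the Hugging Face copy of the dataset:

```python
from datasets import load_dataset

mbpp = load_dataset("mbpp")  # default "full" config: text, code, test_list, ...

def render(example: dict, include_solution: bool) -> str:
    """Format one MBPP task; worked examples include the reference solution."""
    tests = "\n".join(example["test_list"])
    block = (
        f"You are an expert Python programmer, and here is your task: "
        f"{example['text']}\nYour code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
    if include_solution:
        block += f"{example['code']}\n[DONE]\n\n"
    return block

# Three worked examples from the dedicated few-shot split, then the target task.
shots = "".join(render(ex, True) for ex in list(mbpp["prompt"])[:3])
prompt = shots + render(mbpp["test"][0], False)
```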
164 hand-written Python programming problems released by OpenAI in 2021. Each problem includes a function signature, docstring, and test cases. Models generate the function body, which is executed against the tests.
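To make the format concrete, here is a toy problem in the same shape (illustrative only, not an actual benchmark item); the real harness executes completions in a sandbox rather than a bare `exec`:

```python
# Hypothetical problem in the HumanEval format: signature + docstring.
prompt = (
    "def running_max(xs: list) -> list:\n"
    '    """Return the running maximum of xs."""\n'
)

# The model sees `prompt` and generates only the indented function body.
completion = (
    "    out, cur = [], float('-inf')\n"
    "    for x in xs:\n"
    "        cur = max(cur, x)\n"
    "        out.append(cur)\n"
    "    return out\n"
)

# Scoring: execute prompt + completion, then run the problem's unit tests.
namespace = {}
exec(prompt + completion, namespace)
assert namespace["running_max"]([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
```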
~500 Python tasks collected from crowd workers by Google Research. Problems are simpler than HumanEval's: string manipulation, list operations, basic math. Useful for evaluating small and mid-sized models that struggle with harder benchmarks.
With top scores at or above 95% on both benchmarks, neither separates frontier models any longer: GPT-4-class models are essentially tied. LiveCodeBench uses live contest problems to avoid training-set contamination and still provides meaningful signal for current models.