Coding · classic · updated April 2026

HumanEval & MBPP.

Pass@1 scores across the two most-cited Python coding benchmarks. HumanEval tests algorithmic problem-solving; MBPP tests practical scripting ability. Most frontier models have saturated both.

Lineage status · Saturated

Both benchmarks are largely saturated at the frontier. For differentiating today's best models, see LiveCodeBench or SWE-bench.

§ 01 · HumanEval

164 functions, greedy decode.

164 hand-written Python functions with unit tests, released by OpenAI in 2021. Each problem is scored pass/fail; under greedy decoding Pass@1 is simply the fraction of the 164 problems whose single completion passes every test.
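For reference, the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), as implemented in the openai/human-eval harness; with greedy decoding (one sample per problem) it collapses to a plain pass/fail average:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples generated per problem, c = samples passing all tests,
    # k = evaluation budget. Computes 1 - C(n-c, k) / C(n, k) as a
    # numerically stable running product.
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Greedy decoding means one sample per problem (n = 1, k = 1), so the
# benchmark score is just the mean of per-problem pass/fail:
print(pass_at_k(n=1, c=1, k=1))  # 1.0
print(pass_at_k(n=1, c=0, k=1))  # 0.0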

# · Model · Provider · Pass@1 · Date
1 · o4-mini · OpenAI · 97.3% · Mar 2026
2 · Claude Opus 4.6 · Anthropic · 96.3% · Apr 2026
3 · o3-mini · OpenAI · 96.3% · Mar 2026
4 · GPT-5 · OpenAI · 95.1% · Apr 2026
5 · o3 · OpenAI · 94.8% · Apr 2026
6 · GPT-4.1 · OpenAI · 94.5% · Mar 2026
7 · Claude Sonnet 4.6 · Anthropic · 94.1% · Apr 2026
8 · GPT-4.1 mini · OpenAI · 93.8% · Apr 2025
9 · Qwen2.5-Coder 32B · Alibaba · 92.7% · Mar 2026
10 · Qwen2.5-Coder 32B · Alibaba · 92.7% · Apr 2026
11 · o1-preview · OpenAI · 92.4% · Mar 2026
12 · o1-mini · OpenAI · 92.4% · Mar 2026
13 · Claude Opus 4 · Anthropic · 92.2% · Mar 2026
14 · Claude 3.5 Sonnet · Anthropic · 92% · Mar 2026
15 · GPT-4o · OpenAI · 91% · Mar 2026
16 · Claude Sonnet 4 · Anthropic · 90.6% · Mar 2026
17 · GPT-4o · OpenAI · 90.2% · Apr 2026
18 · DeepSeek-Coder-V2-Instruct · DeepSeek · 90.2% · Apr 2026
19 · DeepSeek-Coder-V2-Instruct · DeepSeek · 90.2% · Mar 2026
20 · Llama 3.1 405B · Meta · 89% · Mar 2026
21 · GPT-4.5 Preview · OpenAI · 88.6% · Mar 2026
22 · Llama-3.3-70B-Instruct · meta-llama · 88.4% · Apr 2026
23 · Grok 2 · xAI · 88.4% · Mar 2026
24 · GPT-4 Turbo · OpenAI · 88.2% · Mar 2026
25 · Gemma-3-27b · Google · 87.8% · Mar 2025
26 · o3 · OpenAI · 87.4% · Mar 2026
27 · GPT-4o mini · OpenAI · 87.2% · Mar 2026
28 · GPT-4 Turbo · OpenAI · 86.6% · Apr 2026
29 · Gemma 3 12B IT · Google DeepMind · 85.4% · Mar 2025
30 · Codestral 25.01 · Mistral AI · 85.3% · Apr 2026
31 · Claude 3 Opus · Anthropic · 84.9% · Mar 2026
32 · Phi-4 · Microsoft · 82.6% · Dec 2024
33 · DeepSeek-V3 · DeepSeek · 82.6% · Mar 2026
34 · Llama 3 70B · Meta · 81.7% · Mar 2026
35 · Codestral 22B · Mistral · 81.1% · Mar 2026
36 · Llama 3.1 70B · Meta · 80.5% · Mar 2026
37 · DeepSeek-Coder-33B-Instruct · DeepSeek · 79.3% · Apr 2026
38 · Gemini 1.5 Pro · Google · 71.9% · Mar 2026
39 · Gemma 3 4B IT · Google DeepMind · 71.3% · Mar 2025
40 · Code Llama 34B · Meta · 62.4% · Mar 2026
41 · StarCoder2 15B · BigCode · 46.9% · Mar 2026
42 · Codex (davinci-002) · OpenAI · 46.9% · Apr 2026

Source: openai/human-eval · Greedy decode (temperature 0), Pass@1.
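Scoring itself is mechanical: concatenate prompt, completion, and tests, then run the result. A minimal sketch of the per-problem check, assuming HumanEval's record layout (prompt, test, and entry_point fields); the run_candidate helper and the temp-file approach are ours, and the official harness adds sandboxing and resource limits that this sketch omits:

import os
import subprocess
import sys
import tempfile

def run_candidate(prompt: str, completion: str, test: str, entry_point: str,
                  timeout: float = 10.0) -> bool:
    # HumanEval record layout: `prompt` is the signature + docstring,
    # `completion` is the model-generated body, and `test` defines a
    # check(candidate) function. Concatenate and run; exit code 0 = pass.
    # WARNING: executes untrusted model output; the official harness
    # wraps this in a sandbox with resource limits, omitted here.
    program = prompt + completion + "\n" + test + f"\ncheck({entry_point})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)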

§ 02 · MBPP

~500 Python tasks, practical scripting.

~500 crowd-sourced Python problems from Google, covering basic data structures, string manipulation, and simple algorithms. Tests practical scripting fluency more than algorithmic reasoning.
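For a sense of the format, here is a hypothetical record in the MBPP schema ("task_id", "text", and "test_list" are the dataset's real field names; the task itself is invented for illustration):

# Hypothetical record in the MBPP schema; the task is invented.
example = {
    "task_id": 999,
    "text": "Write a function to count the vowels in a given string.",
    "test_list": [
        'assert count_vowels("hello") == 2',
        'assert count_vowels("xyz") == 0',
        'assert count_vowels("AEiou") == 5',
    ],
}

# Any solution that satisfies every assert counts as a pass:
def count_vowels(s: str) -> int:
    return sum(ch.lower() in "aeiou" for ch in s)

for t in example["test_list"]:
    exec(t)  # raises AssertionError on failure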

# · Model · Provider · Pass@1 · Date
1 · o4-mini · OpenAI · 94.9% · Mar 2026
2 · o3-mini · OpenAI · 93.3% · Mar 2026
3 · Claude Opus 4 · Anthropic · 92% · Mar 2026
4 · GPT-4.1 · OpenAI · 90.9% · Mar 2026
5 · Qwen2.5-Coder 32B · Alibaba · 90.2% · Mar 2026
6 · Claude Sonnet 4 · Anthropic · 89.6% · Mar 2026
7 · DeepSeek-Coder-V2-Instruct · DeepSeek · 89.4% · Sep 2024
8 · DeepSeek-Coder-V2-Instruct · DeepSeek · 89.4% · Mar 2026
9 · DeepSeek-V3 · DeepSeek · 89.3% · Mar 2026
10 · Claude 3.5 Sonnet · Anthropic · 89.2% · Dec 2025
11 · GPT-4o · OpenAI · 87.8% · Dec 2025
12 · Llama-4-Maverick · Meta · 77.6% · Apr 2025
13 · Codestral 22B · Mistral · 75.4% · Mar 2026
14 · Gemma-3-27b · Google · 74.4% · Mar 2025
15 · Gemma 3 12B IT · Google DeepMind · 73% · Mar 2025
16 · Llama-4-Scout · Meta · 67.8% · Apr 2025
17 · Gemma 3 4B IT · Google DeepMind · 63.2% · Mar 2025
18 · Code Llama 34B · Meta · 62.6% · Mar 2026
19 · StarCoder2 15B · BigCode · 54.4% · Mar 2026

Source: google-research/mbpp · 3-shot evaluation, sanitized split (374 problems).
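The 3-shot protocol prepends three solved examples to each prompt. A sketch of one common template (the exact wording and delimiters vary between harnesses; build_3shot_prompt is our name, and the [BEGIN]/[DONE] markers follow a widely used convention, reproduced here from memory):

def build_3shot_prompt(shots, task):
    # `shots`: three (description, tests_str, solution) triples from the
    # few-shot pool; `task`: the problem to solve (a dict with "text" and
    # "test_list" fields as in the MBPP schema).
    parts = []
    for text, tests, code in shots:
        parts.append(
            f"You are an expert Python programmer, and here is your task: "
            f"{text} Your code should pass these tests:\n\n{tests}\n"
            f"[BEGIN]\n{code}\n[DONE]"
        )
    tests = "\n".join(task["test_list"])
    parts.append(
        f"You are an expert Python programmer, and here is your task: "
        f"{task['text']} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
    return "\n".join(parts)

# The model's continuation is read up to the next [DONE] marker, then
# executed against task["test_list"].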

§ 03 · Methodology

Frequently asked.

What is HumanEval?

164 hand-written Python programming problems released by OpenAI in 2021. Each problem includes a function signature, docstring, and test cases. Models generate the function body, which is executed against the tests.
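A made-up problem in the same shape (not from the benchmark itself): the model sees the signature and docstring and must emit the indented body.

from typing import List

def interleave(a: List[int], b: List[int]) -> List[int]:
    """Interleave two lists element by element, appending any leftover tail.
    >>> interleave([1, 3, 5], [2, 4])
    [1, 2, 3, 4, 5]
    """
    # --- everything above is the prompt; the body below is what the
    # --- model must generate, after which the harness runs hidden tests
    result = []
    for x, y in zip(a, b):
        result += [x, y]
    longer = a if len(a) >= len(b) else b
    return result + longer[min(len(a), len(b)):]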

What is MBPP?

~500 Python tasks collected from crowd workers by Google Research. Problems are simpler than HumanEval — string manipulation, list operations, basic math. Useful for evaluating small and mid-sized models that struggle with harder benchmarks.

Why are frontier models not compared on HumanEval anymore?

With scores above 95%, the benchmark no longer separates frontier models. All GPT-4-class models are essentially tied. LiveCodeBench uses live contest problems to avoid contamination and provides meaningful signal for current models.

§ 04 · Related

Continue reading.

Coding Benchmarks · Coding, frontier · LiveCodeBench, SWE-bench, HumanEval+
Math Benchmarks · Math · GSM8K, MATH, AIME, AMC
All LLM Benchmarks · Index · MMLU-Pro, GPQA Diamond, HLE, LiveCodeBench