Code Generation · 2021 · Python

HumanEval: Hand-Written Evaluation Set

164 hand-written Python programming problems, each with a function signature, a docstring, and unit tests. A standard benchmark for code generation.
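To make the problem format concrete, here is a minimal sketch of what one such task and its check might look like. The problem itself (`running_max`) is hypothetical and not taken from the dataset, and the `passes` helper is an illustration only; the names `PROMPT`, `COMPLETION`, `TEST`, and `entry_point` are assumptions chosen to echo the signature, docstring, and unit-test structure described above.

```python
# Hypothetical problem in the HumanEval style; NOT taken from the dataset.
# The prompt is a function signature plus docstring that the model must complete.
PROMPT = '''
def running_max(values: list) -> list:
    """Return a list where element i is the maximum of values[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate function body, standing in for model output.
COMPLETION = '''
    result = []
    best = None
    for v in values:
        best = v if best is None else max(best, v)
        result.append(best)
    return result
'''

# Unit tests: the completion counts as passing only if every assertion holds.
TEST = '''
def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Execute prompt + completion + tests; return True if nothing raises."""
    namespace: dict = {}
    try:
        # Generated code is untrusted; a real harness would run this in a sandbox.
        exec(prompt + completion + test, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


print(passes(PROMPT, COMPLETION, TEST, "running_max"))
```

A completion is scored as correct only when all tests pass, so partially correct code earns no credit.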

Metrics: pass@1, pass@10, pass@100
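pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that estimator; the function names are illustrative, and individual leaderboard entries may instead report a single greedy sample per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: evaluation budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts for three problems, 10 samples each (made-up numbers).
example = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(example, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.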
Current State of the Art

o4-mini (OpenAI): 97.3 pass@1

HumanEval — pass@1

30 results · 6 SOTA advances · higher is better

[Chart: HumanEval pass@1 by year, 2024 to 2027, with StarCoder2-15B, Qwen2.5-Coder-32B-Instruct, and o4-mini labeled.]

pass@1 Progress Over Time

Showing 5 breakthroughs from Aug 2023 to Mar 2026

[Chart: pass@1 (y-axis) against date (x-axis), Aug 2023 to Mar 2026.]

Key Milestones

| Date | Model | pass@1 | Gain vs. previous | Source / notes |
|---|---|---|---|---|
| Aug 2023 | Code Llama 34B | 62.4 | | Code Llama paper, Table 3, arxiv:2308.12950. 0-shot pass@1. |
| Jun 2024 | DeepSeek-Coder-V2-Instruct | 90.2 | +44.6% | Table 1, arxiv:2406.11931. Zero-shot pass@1. |
| Sep 2024 | Qwen2.5-Coder-32B-Instruct | 92.7 | +2.8% | Table 2, arxiv:2409.12186. Zero-shot pass@1. |
| Apr 2025 | GPT-4.1 mini | 93.8 | +1.2% | gpt-4.1-mini-2025-04-14 |
| Mar 2026 | o4-mini | 97.3 | +3.7% | Current SOTA. Zero-shot, pass@1. Default reasoning effort. |
Total Improvement: 55.9%
Time Span: 2y 8m
Breakthroughs: 5
Current SOTA: 97.3
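The gain figures in the milestones are relative changes in the score, not percentage-point differences. A short sketch, using only the milestone scores listed above, shows how the per-step and total figures can be reproduced; the helper name `relative_gain` is illustrative.

```python
# Milestone scores as listed in Key Milestones (model, pass@1).
milestones = [
    ("Code Llama 34B", 62.4),
    ("DeepSeek-Coder-V2-Instruct", 90.2),
    ("Qwen2.5-Coder-32B-Instruct", 92.7),
    ("GPT-4.1 mini", 93.8),
    ("o4-mini", 97.3),
]


def relative_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# Gain of each milestone over the one before it.
steps = [
    relative_gain(prev[1], curr[1])
    for prev, curr in zip(milestones, milestones[1:])
]

# Gain of the latest milestone over the first, relative and in absolute points.
total_relative = relative_gain(milestones[0][1], milestones[-1][1])
total_points = round(milestones[-1][1] - milestones[0][1], 1)

print(steps)           # [44.6, 2.8, 1.2, 3.7]
print(total_relative)  # 55.9
print(total_points)    # 34.9
```

The 55.9% total therefore corresponds to an absolute gain of 34.9 points on pass@1.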

Top Models Performance Comparison

Top 10 models ranked by pass@1

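| Rank | Model | pass@1 | % of best |
|---|---|---|---|
| 1 | o4-mini | 97.3 | 100.0% |
| 2 | o3-mini | 96.3 | 99.0% |
| 3 | GPT-4.1 | 94.5 | 97.1% |
| 4 | GPT-4.1 mini | 93.8 | 96.4% |
| 5 | Qwen2.5-Coder-32B-Instruct | 92.7 | 95.3% |
| 6 | o1-preview | 92.4 | 95.0% |
| 7 | o1-mini | 92.4 | 95.0% |
| 8 | Claude Opus 4 | 92.2 | 94.8% |
| 9 | Claude 3.5 Sonnet | 92.0 | 94.6% |
| 10 | GPT-4o | 91.0 | 93.5% |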
Best Score: 97.3
Top Model: o4-mini
Models Compared: 10
Score Range: 6.3
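The "% of best" values and the score range follow directly from the top-10 scores: each score is divided by the best score, and the range is the best score minus the tenth. A minimal sketch using the scores from the comparison above; the variable names are illustrative.

```python
# Top-10 pass@1 scores as listed in the comparison.
top10 = {
    "o4-mini": 97.3,
    "o3-mini": 96.3,
    "GPT-4.1": 94.5,
    "GPT-4.1 mini": 93.8,
    "Qwen2.5-Coder-32B-Instruct": 92.7,
    "o1-preview": 92.4,
    "o1-mini": 92.4,
    "Claude Opus 4": 92.2,
    "Claude 3.5 Sonnet": 92.0,
    "GPT-4o": 91.0,
}

best_model = max(top10, key=top10.get)
best_score = top10[best_model]

# Each score as a percentage of the best score, rounded to one decimal.
pct_of_best = {m: round(s / best_score * 100, 1) for m, s in top10.items()}

# Spread between the best and the lowest score among the ten.
score_range = round(best_score - min(top10.values()), 1)

print(best_model, best_score)  # o4-mini 97.3
print(pct_of_best["GPT-4o"])   # 93.5
print(score_range)             # 6.3
```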

pass@1 (Primary)

| # | Model | Access | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o4-mini | API | OpenAI | 97.3 | | Mar 2026 |
| 2 | o3-mini | API | OpenAI | 96.3 | | Mar 2026 |
| 3 | GPT-4.1 | API | OpenAI | 94.5 | | Mar 2026 |
| 4 | GPT-4.1 mini | API | OpenAI | 93.8 | | Apr 2025 |
| 5 | Qwen2.5-Coder-32B-Instruct | Open Source | Alibaba | 92.7 | | Sep 2024 |
| 6 | o1-preview | | OpenAI | 92.4 | | Mar 2026 |
| 7 | o1-mini | API | OpenAI | 92.4 | | Mar 2026 |
| 8 | Claude Opus 4 | API | Anthropic | 92.2 | | Mar 2026 |
| 9 | Claude 3.5 Sonnet | API | Anthropic | 92.0 | | Mar 2026 |
| 10 | GPT-4o | API | OpenAI | 91.0 | | Mar 2026 |
| 11 | Claude Sonnet 4 | API | Anthropic | 90.6 | | Mar 2026 |
| 12 | DeepSeek-Coder-V2-Instruct | Open Source | DeepSeek | 90.2 | | Jun 2024 |
| 13 | Llama 3.1 405B | Open Source | Meta | 89.0 | | Mar 2026 |
| 14 | GPT-4.5 Preview | API | OpenAI | 88.6 | | Mar 2026 |
| 15 | Grok 2 | API | xAI | 88.4 | | Mar 2026 |
| 16 | GPT-4 Turbo | API | OpenAI | 88.2 | | Mar 2026 |
| 17 | Gemma-3-27b | | Google | 87.8 | | Mar 2025 |
| 18 | o3 | API | OpenAI | 87.4 | | Mar 2026 |
| 19 | GPT-4o mini | | OpenAI | 87.2 | | Mar 2026 |
| 20 | Gemma 3 12B IT | | Google DeepMind | 85.4 | | Mar 2025 |
| 21 | Claude 3 Opus | API | Anthropic | 84.9 | | Mar 2026 |
| 22 | DeepSeek-V3 | Open Source | DeepSeek | 82.6 | | Mar 2026 |
| 23 | Phi-4 | | Microsoft | 82.6 | | Dec 2024 |
| 24 | Llama 3 70B | Open Source | Meta | 81.7 | | Mar 2026 |
| 25 | Codestral 22B | | Mistral | 81.1 | Codestral: Hello, World! | May 2024 |
| 26 | Llama 3.1 70B | Open Source | Meta | 80.5 | | Mar 2026 |
| 27 | Gemini 1.5 Pro | API | Google | 71.9 | | Mar 2026 |
| 28 | Gemma 3 4B IT | | Google DeepMind | 71.3 | | Mar 2025 |
| 29 | Code Llama 34B | Open Source | Meta | 62.4 | | Mar 2026 |
| 30 | StarCoder2-15B | Open Source | BigCode | 46.9 | | Feb 2024 |

Related Papers (3)

Other Code Generation Datasets