Code Generation · 2021 · Python
HumanEval: Hand-Written Evaluation Set
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
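To make the problem format concrete, here is a minimal sketch of what a HumanEval-style item and its check could look like. The `running_max` problem, the candidate completion, and the `passes` helper are all hypothetical and written for illustration; none of them is an actual benchmark item or the official evaluation harness.

```python
# Hypothetical problem in the HumanEval style: a signature plus a docstring.
PROMPT = '''def running_max(values: list[int]) -> list[int]:
    """Return a list whose i-th element is the maximum of values[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate completion as a model might produce it (function body only).
COMPLETION = '''    result, best = [], None
    for v in values:
        best = v if best is None else max(best, v)
        result.append(best)
    return result
'''

# Unit tests wrapped in a check() function that receives the candidate.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in test.

    Assumption: this runs in a trusted environment. Calling exec() on
    model output is unsafe without sandboxing and timeouts.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
    except Exception:
        return False
    return True


print(passes(PROMPT, COMPLETION, TEST, "running_max"))
```

A completion counts as correct only if every assertion passes; a single failing test or runtime error marks the problem as unsolved for that sample.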
Current State of the Art
o4-mini (OpenAI): 97.3 pass@1
HumanEval — pass@1
30 results · 6 SOTA advances · higher is better
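For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests; for k = 1 this reduces to c / n. The sketch below assumes that definition. The names `pass_at_k` and `benchmark_score` are illustrative, and this page does not state how many samples each listed result used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Four hypothetical problems, one sample each, three solved.
print(benchmark_score([(1, 1), (1, 0), (1, 1), (1, 1)], k=1))
```

With a single sample per problem, pass@1 is simply the percentage of problems whose one completion passes all tests.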
pass@1 Progress Over Time
Showing 5 breakthroughs from Aug 2023 to Mar 2026
Key Milestones
Total Improvement: 55.9%
Time Span: 2y 8m
Breakthroughs: 5
Current SOTA: 97.3
Top Models Performance Comparison
Top 10 models ranked by pass@1
Best Score: 97.3
Top Model: o4-mini
Models Compared: 10
Score Range: 6.3
pass@1 (primary metric)
| # | Model | Access | Organization | Score (pass@1) | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o4-mini | API | OpenAI | 97.3 | | Mar 2026 |
| 2 | o3-mini | API | OpenAI | 96.3 | | Mar 2026 |
| 3 | GPT-4.1 | API | OpenAI | 94.5 | | Mar 2026 |
| 4 | GPT-4.1 mini | API | OpenAI | 93.8 | | Apr 2025 |
| 5 | Qwen2.5-Coder-32B-Instruct | Open Source | Alibaba | 92.7 | | Sep 2024 |
| 6 | o1-preview | | OpenAI | 92.4 | | Mar 2026 |
| 7 | o1-mini | API | OpenAI | 92.4 | | Mar 2026 |
| 8 | Claude Opus 4 | API | Anthropic | 92.2 | | Mar 2026 |
| 9 | Claude 3.5 Sonnet | API | Anthropic | 92 | | Mar 2026 |
| 10 | GPT-4o | API | OpenAI | 91 | | Mar 2026 |
| 11 | Claude Sonnet 4 | API | Anthropic | 90.6 | | Mar 2026 |
| 12 | DeepSeek-Coder-V2-Instruct | Open Source | DeepSeek | 90.2 | | Jun 2024 |
| 13 | Llama 3.1 405B | Open Source | Meta | 89 | | Mar 2026 |
| 14 | GPT-4.5 Preview | API | OpenAI | 88.6 | | Mar 2026 |
| 15 | Grok 2 | API | xAI | 88.4 | | Mar 2026 |
| 16 | GPT-4 Turbo | API | OpenAI | 88.2 | | Mar 2026 |
| 17 | Gemma-3-27b | | Google | 87.8 | | Mar 2025 |
| 18 | o3 | API | OpenAI | 87.4 | | Mar 2026 |
| 19 | GPT-4o mini | | OpenAI | 87.2 | | Mar 2026 |
| 20 | Gemma 3 12B IT | | Google DeepMind | 85.4 | | Mar 2025 |
| 21 | Claude 3 Opus | API | Anthropic | 84.9 | | Mar 2026 |
| 22 | DeepSeek-V3 | Open Source | DeepSeek | 82.6 | | Mar 2026 |
| 23 | Phi-4 | | Microsoft | 82.6 | | Dec 2024 |
| 24 | Llama 3 70B | Open Source | Meta | 81.7 | | Mar 2026 |
| 25 | Codestral 22B | | Mistral | 81.1 | Codestral: Hello, World! | May 2024 |
| 26 | Llama 3.1 70B | Open Source | Meta | 80.5 | | Mar 2026 |
| 27 | Gemini 1.5 Pro | API | Google | 71.9 | | Mar 2026 |
| 28 | Gemma 3 4B IT | | Google DeepMind | 71.3 | | Mar 2025 |
| 29 | Code Llama 34B | Open Source | Meta | 62.4 | | Mar 2026 |
| 30 | StarCoder2-15B | Open Source | BigCode | 46.9 | | Feb 2024 |
Related Papers (3)
Qwen2.5-Coder Technical Report
Sep 2024 · Models: Qwen2.5-Coder-32B-Instruct
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Jun 2024 · Models: DeepSeek-Coder-V2-Instruct
StarCoder2 and The Stack v2: The Next Generation
Feb 2024 · Models: StarCoder2-15B