Code Generation · 2021 · Python
HumanEval: Hand-Written Evaluation Set
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
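To make the problem format concrete, here is a minimal sketch of how one such task could be scored. The task (`running_max`), the completion, and the `passes` helper are hypothetical illustrations, not taken from the dataset or its official harness. The sketch only shows the idea that a completion counts as correct when the prompt plus the completion passes the problem's unit tests.

```python
# Hypothetical task in the HumanEval style; NOT an actual dataset entry.
PROMPT = '''def running_max(xs):
    """Return a list whose i-th element is the maximum of xs[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate function body, standing in for model output.
COMPLETION = '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''

# Unit tests that decide whether the completion counts as correct.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in test."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)        # define the function
        exec(test, namespace)                       # define check()
        namespace["check"](namespace[entry_point])  # run the unit tests
        return True
    except Exception:
        # Any failed assertion or runtime error counts as a failed sample.
        return False


print(passes(PROMPT, COMPLETION, TEST, "running_max"))
```

Calling `exec` on model output is unsafe outside a sandbox; a real evaluation would isolate execution and enforce timeouts, which this sketch leaves out.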
Current State of the Art
o1-preview (OpenAI): 92.4 pass@1
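For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The page does not say how many samples were drawn per model, so the sample counts in the usage example are made up, and the function names `pass_at_k` and `benchmark_score` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, as a percentage.

    results holds one (n, c) pair per problem.
    """
    per_problem = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(per_problem) / len(per_problem)


# Made-up counts for four problems, one sample each (greedy decoding).
example = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_score(example, k=1))
```

With a single sample per problem, pass@1 is simply the fraction of problems solved, which is how a headline figure such as 92.4 is usually read.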
Top Models Performance Comparison
Top 5 models ranked by pass@1
Best Score: 92.4
Top Model: o1-preview
Models Compared: 5
Score Range: 10.7 points (81.7 to 92.4)
Primary metric: pass@1
| # | Model | Organization | Access | Score (pass@1) | Date |
|---|---|---|---|---|---|
| 1 | o1-preview | OpenAI | | 92.4 | Dec 2025 |
| 2 | Claude 3.5 Sonnet | Anthropic | API | 92 | Dec 2025 |
| 3 | GPT-4o | OpenAI | API | 90.2 | Dec 2025 |
| 4 | DeepSeek V3 | DeepSeek | Open Source | 82.6 | Dec 2025 |
| 5 | Llama 3 70B | Meta | Open Source | 81.7 | Dec 2025 |
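The summary figures reported for this comparison (best score, top model, models compared, score range) follow directly from the table rows. A small sketch of that arithmetic, with the scores copied from the table and the variable names chosen here for illustration:

```python
# pass@1 scores copied from the leaderboard table.
scores = {
    "o1-preview": 92.4,
    "Claude 3.5 Sonnet": 92.0,
    "GPT-4o": 90.2,
    "DeepSeek V3": 82.6,
    "Llama 3 70B": 81.7,
}

top_model = max(scores, key=scores.get)
best_score = scores[top_model]
models_compared = len(scores)
# Range is best minus worst; round to one decimal to absorb float noise.
score_range = round(best_score - min(scores.values()), 1)

print(top_model, best_score, models_compared, score_range)
```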