Code Generation · 2021 · Python

HumanEval: Hand-Written Evaluation Set

A benchmark of 164 hand-written Python programming problems, each consisting of a function signature, a docstring, and unit tests. It is a standard benchmark for evaluating code generation.
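To make the format concrete, here is a problem in the HumanEval style (this one mirrors the benchmark's first task): the model is prompted with the signature and docstring and must generate the body, which is then checked against unit tests.

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # A reference solution; the model only sees the lines above.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Each problem ships with unit tests along these lines:
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```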

Metrics: pass@1, pass@10, pass@100
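The pass@k metrics report the probability that at least one of k sampled completions passes all unit tests. The HumanEval paper estimates this without bias by drawing n ≥ k samples per problem, counting the c that pass, and computing 1 − C(n−c, k)/C(n, k):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn per problem,
    c of them passing all unit tests, evaluated at budget k."""
    if n - c < k:
        # Fewer failures than the budget: some sample must pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 200 samples of which 50 pass, pass@1 reduces to c/n = 0.25,
# and larger k can only raise the estimate:
pass_at_k(200, 50, 1)
pass_at_k(200, 50, 10)
```

Scores are averaged over all 164 problems; pass@1 with a single greedy sample (n = k = 1) is the number usually quoted on leaderboards like the one below.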
Current State of the Art

o1-preview (OpenAI): 92.4 pass@1

Top Models Performance Comparison

Top 5 models ranked by pass@1 (share of best score in parentheses):

1. o1-preview         92.4  (100.0%)
2. Claude 3.5 Sonnet  92.0  (99.6%)
3. GPT-4o             90.2  (97.6%)
4. DeepSeek V3        82.6  (89.4%)
5. Llama 3 70B        81.7  (88.4%)
Best Score: 92.4
Top Model: o1-preview
Models Compared: 5
Score Range: 10.7

pass@1 (primary metric)

#  Model              Organization  Access       Score  Date
1  o1-preview         OpenAI                     92.4   Dec 2025
2  Claude 3.5 Sonnet  Anthropic     API          92.0   Dec 2025
3  GPT-4o             OpenAI        API          90.2   Dec 2025
4  DeepSeek V3        DeepSeek      Open Source  82.6   Dec 2025
5  Llama 3 70B        Meta          Open Source  81.7   Dec 2025

Other Code Generation Datasets