SATURATED — Benchmark no longer differentiates top models

HumanEval: From 28.8% to 99% in five years.

The complete history of the benchmark that defined AI code generation: 43 models tracked from July 2021 through saturation in 2026.

164 problems · Published Jul 2021 · Current SOTA ~99% · Status: Saturated

What is HumanEval?

HumanEval is a benchmark of 164 hand-written Python programming problems created by OpenAI in July 2021. Each problem includes a function signature, a docstring describing what the function should do, and a set of unit tests to verify correctness.
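A problem in this format looks roughly like the following sketch, modeled on the first task in the suite (the completion and `check` harness shown here are illustrative, not the benchmark's verbatim code):

```python
# The model receives the signature and docstring, and must
# complete the function body.

def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion begins here ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Unit tests then verify the completion against expected behavior.
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False

check(has_close_elements)
```

A completion counts as a pass only if every assertion in the test harness succeeds.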

The metric is pass@1: the percentage of problems where the model's first attempt passes all unit tests. No retries, no cherry-picking.
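In practice, scores are computed with the unbiased pass@k estimator from Chen et al. (2021): sample n completions per problem, count the c that pass, and estimate the chance that at least one of k draws succeeds. A minimal sketch (the illustrative (n, c) pairs are made up):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability
    that at least one of k completions, drawn without replacement
    from n generated samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw fraction of correct samples:
p = pass_at_k(10, 3, 1)  # ~0.3 when 3 of 10 samples pass
# The benchmark score averages the per-problem estimates:
per_problem = [(10, 7), (10, 10), (10, 0)]  # hypothetical (n, c) pairs
score = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```

With k = 1 and a single greedy sample per problem, this collapses to the simple "fraction of 164 problems passed on the first try" described above.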

It was introduced alongside Codex in the paper “Evaluating Large Language Models Trained on Code” (Chen et al., 2021) and quickly became the standard yardstick for code generation ability.

Quick Facts

Language: Python
Problems: 164
Tests/problem: ~7.7 avg
Metric: pass@1
Creator: OpenAI
Paper: arxiv:2107.03374
Status: Saturated

The Progression

SOTA Progression (pass@1)

Best score at each point in time, July 2021 – March 2026

Best Score by Organization

Each org's highest verified HumanEval pass@1

Every Model, Plotted

Each dot is a model. Color = organization. Size = parameter count. Organizations tracked include OpenAI, Anthropic, Google, Meta, DeepSeek, Mistral AI, Alibaba, Moonshot AI, Microsoft, IBM, HuggingFace, Salesforce, CMU, EleutherAI, and Phind.

Complete Timeline

2021

The Starting Line

OpenAI publishes "Evaluating Large Language Models Trained on Code" with 164 hand-written Python problems. Codex, fine-tuned on GitHub, sets the first bar.

Jul 2021 · GPT-3 (OpenAI) · 0% (cannot generate code)
Jul 2021 · GPT-J 6B (EleutherAI) · 11.4%
Jul 2021 · Codex 300M (OpenAI) · 13.2%
Jul 2021 · Codex 2.5B (OpenAI) · 21.4%
Jul 2021 · Codex 12B (OpenAI) · 28.8% (first SOTA on HumanEval)
2022

Specialized Code Models

The field realizes code needs its own models. Salesforce, Google, and OpenAI push toward 50%.

2022 · PolyCoder 2.7B (CMU) · 5.6%
2022 · InCoder 6.7B (Meta) · 15.2%
2022 · CodeGen-Mono 6.1B (Salesforce) · 26.1%
2022 · text-davinci-002 (OpenAI) · 30.5%
2022 · PaLM-Coder 540B (Google) · 36%
2022 · code-davinci-002 (OpenAI) · 47% (nearly doubles the original Codex)
2023

The ChatGPT Explosion

Chat-tuned models smash through 70%. GPT-4 reaches 85%. Open-source catches up fast with WizardCoder and Code Llama.

Mar 2023 · StarCoder 15B (HuggingFace) · 33.6%
Mar 2023 · GPT-3.5 Turbo (OpenAI) · 72.2% (ChatGPT breaks 70%)
Jun 2023 · WizardCoder 15B (Microsoft) · 57.3%
Jun 2023 · GPT-4 (OpenAI) · 67% (0-shot; 82.7% with optimized prompting)
Aug 2023 · Code Llama 34B (Meta) · 53.7%
Aug 2023 · Code Llama Instruct 70B (Meta) · 67.8%
Oct 2023 · Phind-CodeLlama v2 (Phind, 34B) · 73.8%
Nov 2023 · WizardCoder-Python 34B (Microsoft) · 73.2%
Nov 2023 · GPT-4-1106-Preview (OpenAI) · 85.7% (first model near 90%)
2024

Breaking 90%

GPT-4o breaks 90% in May. Claude 3.5 Sonnet follows at 92%. Alibaba's Qwen2.5-Coder hits 92.7%. The ceiling is in sight.

Feb 2024 · Claude 3 Opus (Anthropic) · 84.9%
Apr 2024 · GPT-4 Turbo (OpenAI) · 87.1%
May 2024 · Codestral 22B (Mistral AI) · 81.1%
May 2024 · GPT-4o (OpenAI) · 90.2% (first to break 90%)
Jun 2024 · Claude 3.5 Sonnet (Anthropic) · 92%
Jun 2024 · DeepSeek-Coder-V2 (DeepSeek) · 90.2%
Jul 2024 · Mistral Large 2 (Mistral AI) · 92%
Jul 2024 · Llama 3.1 405B (Meta) · 89%
Sep 2024 · o1-mini (OpenAI) · 92.4%
Sep 2024 · Qwen2.5-Coder 32B (Alibaba) · 92.7%
Oct 2024 · Claude 3.5 Sonnet v2 (Anthropic) · 93.7% (new SOTA)
Nov 2024 · Amazon Nova Pro (Amazon) · 89%
Dec 2024 · Llama 3.3 70B (Meta) · 88.4%
2025

Saturation

Scores converge above 90% across all major vendors. The benchmark can no longer differentiate top models. The community pivots to harder tests.

Jan 2025 · Mistral Small 3 (Mistral AI, 24B) · 84.8%
Mar 2025 · Gemma 3 27B (Google) · 87.8%
Mar 2025 · Mistral Small 3.1 (Mistral AI, 24B) · 88.4%
Apr 2025 · Granite 3.3 8B (IBM) · 89.7% (an 8B model near 90%)
Jul 2025 · Kimi K2 Instruct (Moonshot AI) · 93.3%
Aug 2025 · GPT-5 (OpenAI) · 93.4%
Sep 2025 · Kimi K2 0905 (Moonshot AI) · 94.5% (highest verified score)
2026

Post-Saturation

Multiple models approach or claim 99%. HumanEval is retired as a meaningful differentiator. The community has moved on.

Mar 2026 · Sarvam-30B (Sarvam AI) · 92.1%
2026 · Gemini 2.5 Pro (Google) · 99% (approaching a perfect score)
2026 · Kimi K2.5 (Moonshot AI) · 99%

Key Milestones

28.8% (Jul 2021) · Codex launches HumanEval
The starting line. A 12B-parameter model sets the first bar.

47% (2022) · code-davinci-002 nearly doubles it
Specialized code models prove the approach works.

72.2% (Mar 2023) · ChatGPT breaks 70%
Chat-tuned models are surprisingly good at code.

85.7% (Nov 2023) · GPT-4 approaches 90%
The 90% barrier is within reach for the first time.

90.2% (May 2024) · GPT-4o breaks 90%
The psychological barrier falls. Three models follow within weeks.

93.7% (Oct 2024) · Claude 3.5 Sonnet v2
Anthropic takes the lead. The gap between vendors shrinks to noise.

94.5% (Sep 2025) · Kimi K2 pushes the ceiling
The highest verified score. Improving from ~93% to ~95% takes a full year.

~99% (2026) · Effectively solved
Multiple models approach a perfect score. The benchmark can no longer differentiate.

Why HumanEval is saturated

HumanEval served its purpose brilliantly. In 2021, it was the right benchmark at the right time — simple enough to be reproducible, hard enough to be meaningful. But three structural limitations made saturation inevitable:

164

Too few problems

With only 164 problems, each one is worth about 0.6% of the total score, so statistical noise from a single problem can shift rankings.

~7.7

Too few tests per problem

Many problems have only 3–5 unit tests. Models can pass with subtly wrong solutions that happen to satisfy weak test suites.

Public

Data contamination

The problems have been public on GitHub for nearly five years and appear in virtually every training dataset. Models may have memorized solutions, not learned to code.
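The weak-test-suite problem is easy to demonstrate. Below is a hypothetical case (both the buggy function and the tests are invented for illustration, not taken from the benchmark): a subtly wrong median slips through a three-case suite.

```python
# Hypothetical illustration of a weak test suite.

def median_wrong(xs):
    # Bug: for even-length lists this returns the upper-middle
    # element instead of averaging the two middle elements.
    return sorted(xs)[len(xs) // 2]

# A weak suite that only uses odd-length lists never triggers the bug:
weak_tests = [([3, 1, 2], 2), ([5], 5), ([9, 7, 8, 1, 3], 7)]
assert all(median_wrong(xs) == want for xs, want in weak_tests)

# A single even-length case exposes it: the median of [1, 2, 3, 4]
# is 2.5, but this implementation returns 3.
assert median_wrong([1, 2, 3, 4]) != 2.5
```

A grader that only runs the weak suite would mark this completion as a pass, which is exactly how ~7.7 tests per problem can overstate correctness.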

This doesn't mean HumanEval scores are meaningless — a model scoring 30% is genuinely worse at coding than one scoring 90%. But the difference between 93% and 95% is mostly noise.
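The "mostly noise" claim can be sanity-checked with a back-of-the-envelope binomial confidence interval, treating the 164 problems as independent trials (a normal-approximation sketch, not the methodology of any particular leaderboard):

```python
from math import sqrt

def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of a normal-approximation 95% confidence
    interval for a pass rate p measured over n problems."""
    return z * sqrt(p * (1.0 - p) / n)

# At a 94% pass rate over 164 problems, the 95% CI is roughly
# +/- 3.6 percentage points, wider than the 93%-vs-95% gaps
# separating the top models on the leaderboard.
hw = binomial_ci_halfwidth(0.94, 164)
print(f"94% +/- {100 * hw:.1f} points")
```

Under these assumptions, two models at 93% and 95% sit well inside each other's intervals, which is the statistical sense in which the benchmark stops differentiating.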

What comes after HumanEval

The community has moved to harder, more realistic benchmarks such as SWE-bench, LiveCodeBench, and BigCodeBench.

Data sources & methodology

Scores compiled from original papers (arXiv), official model cards, the llm-stats.com leaderboard, and HumanEval Revisited (arxiv:2402.14852).

All scores are pass@1 unless noted. Where multiple evaluations exist for the same model, we prefer the officially reported score. Prompting strategy (0-shot vs few-shot, system prompt) can cause a variance of 5–15 percentage points.

Last updated: March 17, 2026.
