164 hand-written Python problems, scored by unit tests. The 2021 OpenAI Codex benchmark that every frontier lab still cites, even though the top of the table now sits above 95%.
HumanEval is still useful as a historical anchor, but the leaderboard is ceiling-bound and likely contaminated for modern models. For current capability comparisons, use HumanEval+, LiveCodeBench, SWE-bench Pro, or Terminal-Bench depending on the task.
Ranked by reported pass@1. The SOTA row is highlighted; everything below it preserves the historical record.
| # | Model | Vendor | Pass@1 | Access | Reported |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 96.3% | API | 2026-01 |
| 2 | GPT-5 | OpenAI | 95.1% | API | 2025-12 |
| 3 | o3 | OpenAI | 94.8% | API | 2025-04 |
| 4 | Claude Sonnet 4.6 | Anthropic | 94.1% | API | 2026-01 |
| 5 | Qwen2.5-Coder-32B-Instruct | Alibaba | 92.7% | Open Weights | 2025-03 |
| 6 | DeepSeek-Coder-V2-Instruct | DeepSeek | 90.2% | Open Weights | 2024-06 |
| 7 | GPT-4o | OpenAI | 90.2% | API | 2024-05 |
| 8 | Llama-3.3-70B-Instruct | Meta | 88.4% | Open Weights | 2024-12 |
| 9 | GPT-4 Turbo | OpenAI | 86.6% | API | 2023-11 |
| 10 | Codestral 25.01 | Mistral AI | 85.3% | Weights | 2025-01 |
| 11 | DeepSeek-Coder-33B-Instruct | DeepSeek | 79.3% | Open Weights | 2023-11 |
| 12 | Codex (davinci-002) | OpenAI | 46.9% | API | 2021-07 |
Scores as reported by vendors or referenced in model cards. Codesota does not re-run HumanEval. See § 04 for how we verify.
Released by OpenAI in July 2021 alongside the original Codex paper (Chen et al., arXiv:2107.03374), HumanEval was built to sidestep BLEU-style text-similarity metrics that say nothing about whether code runs.
The dataset is small on purpose: 164 problems, each hand-written to avoid overlap with the public training corpora of the day. Every problem ships with a function signature, a docstring, a reference body, and a set of hidden unit tests. A generation counts as solved only if all tests pass.
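To make the format concrete, here is a sketch of a problem in HumanEval's shape. The function, docstring, solution, and tests below are invented for illustration, not drawn from the dataset. A model is prompted with the signature and docstring, and its completion is run against the hidden `check` function:

```python
# A hypothetical problem in HumanEval's format (not from the dataset).
# The model sees only the signature + docstring and must complete the body.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[: i + 1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # --- reference solution (held out from the prompt) ---
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# --- hidden unit tests: a generation counts as solved only if all pass ---
def check(candidate):
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([]) == []
    assert candidate([-2, -7, -1]) == [-2, -2, -1]

check(running_max)
```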
The headline metric is pass@1: the probability that the first sample a model produces passes all tests. It is the “production” number: what you actually get when you ask a model for code once.
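For k > 1, pass@k is estimated rather than measured directly: generate n ≥ k samples per problem, count the c that pass, and apply the unbiased estimator from the Codex paper, pass@k = E[1 − C(n−c, k) / C(n, k)]. A minimal sketch of the numerically stable form used by the official harness:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples that passed all tests.
    Computes 1 - C(n - c, k) / C(n, k) without forming large binomials.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark score = mean over the 164 problems. With k = n = 1 this
# reduces to the raw fraction of problems solved on the first try.
print(pass_at_k(n=200, c=130, k=1))   # ≈ 0.65, i.e. c / n
print(pass_at_k(n=200, c=130, k=10))  # ≈ 1.0: near-certain within 10 draws
```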
Codex, the first model evaluated on HumanEval, scored 46.9% pass@1 in 2021. By late 2024 every frontier model cleared 90%. The current top entry sits at 96.3%. At that altitude a one-point gap is usually noise from sampling temperature, not capability: each of the 164 problems is worth about 0.6 points, and the binomial standard error on a true score of 95% is roughly ±1.7 points.
The honest caveat: HumanEval has been in public training mixes for years. Problems have been translated, paraphrased, and mirrored across GitHub. A score of 95+ on a model trained after 2023 tells you the model can reproduce HumanEval, not that it can write production code. For a harder signal on real software engineering, see SWE-bench or the guide *SWE-bench, explained*.
Why keep tracking it? Because vendors keep reporting it. Every Codex, GPT, Claude, Gemini, Qwen, DeepSeek, and Llama model card lists a HumanEval number. It is the longest continuous series in code-generation benchmarking, and that historical continuity is worth preserving even as its ceiling becomes uninformative.
Accepted: pass@1 numbers from official model cards, vendor technical reports, or the primary arXiv paper. Each row in § 01 links to its source.
Excluded: Twitter/X screenshots, unverified forks, pass@k results converted to pass@1 without the conversion shown, and numbers that drift from what the vendor originally reported.
“Reported” is the month the number first appeared in a primary source, not the date we ingested it.
Codesota does not independently re-execute HumanEval. Where a vendor reports both greedy (temperature 0, one deterministic sample) and sampled (estimator over n samples) scores, we use the greedy pass@1.
Dataset: github.com/openai/human-eval · huggingface.co/datasets/openai_humaneval · Paper: arXiv:2107.03374
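For readers who want to reproduce a number locally (Codesota does not), the official repo exposes a small harness. The sketch below follows its README; `generate_one_completion` is a placeholder for whatever model call you use:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model here and return the completed body.
    raise NotImplementedError

problems = read_problems()  # the 164 tasks, keyed by task_id
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then score in a sandbox (the tests execute untrusted model code):
#   $ evaluate_functional_correctness samples.jsonl
```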