HumanEval, measured honestly.

164 hand-written Python problems, scored by unit tests. The 2021 OpenAI Codex benchmark that every frontier lab still cites, even though the top of the table now sits above 95%.

OpenAI · 2021 · Python · pass@1 · 164 problems · Saturated

Lineage status · saturated and superseded

Do not use HumanEval as a frontier coding signal.

HumanEval is still useful as a historical anchor, but the leaderboard is ceiling-bound and likely contaminated for modern models. For current capability comparisons, use HumanEval+, LiveCodeBench, SWE-bench Pro, or Terminal-Bench depending on the task.

See HumanEval in the coding lineage · Successor: HumanEval+ · Current agentic signal: Terminal-Bench 2
§ 01

Leaderboard: pass@1, zero-shot

Ranked by reported pass@1. The SOTA row is highlighted; everything below it preserves the historical record.

| # | Model | Vendor | Pass@1 | Access | Reported |
|---|-------|--------|--------|--------|----------|
| 1 | Claude Opus 4.6 | Anthropic | 96.3% | API | 2026-01 |
| 2 | GPT-5 | OpenAI | 95.1% | API | 2025-12 |
| 3 | o3 | OpenAI | 94.8% | API | 2025-04 |
| 4 | Claude Sonnet 4.6 | Anthropic | 94.1% | API | 2026-01 |
| 5 | Qwen 2.5-Coder-32B-Inst | Alibaba | 92.7% | Open Weights | 2025-03 |
| 6 | DeepSeek-Coder-V2-Instruct | DeepSeek | 90.2% | Open Weights | 2024-06 |
| 7 | GPT-4o | OpenAI | 90.2% | API | 2024-05 |
| 8 | Llama-3.3-70B-Instruct | Meta | 88.4% | Open Weights | 2024-12 |
| 9 | GPT-4 Turbo | OpenAI | 86.6% | API | 2023-11 |
| 10 | Codestral 25.01 | Mistral AI | 85.3% | Weights | 2025-01 |
| 11 | DeepSeek-Coder-33B-Inst | DeepSeek | 79.3% | Open Weights | 2023-11 |
| 12 | Codex (davinci-002) | OpenAI | 46.9% | API | 2021-07 |

Scores as reported by vendors or referenced in model cards. Codesota does not re-run HumanEval. See § 04 for how we verify.

§ 02

What HumanEval actually is

Released by OpenAI in July 2021 alongside the original Codex paper (Chen et al., arXiv:2107.03374), HumanEval was built to sidestep BLEU-style text-similarity metrics that say nothing about whether code runs.

The dataset is small on purpose: 164 problems, each hand-written to avoid overlap with the public training corpora of the day. Every problem ships with a function signature, a docstring, a canonical reference solution, and a set of unit tests that never appear in the prompt. A generation counts as solved only if every test passes.
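For concreteness, here is the shape of one record, sketched in Python. The field names (task_id, prompt, canonical_solution, test, entry_point) match the released dataset; the problem itself is an invented illustration, not an actual HumanEval entry, and the scoring step at the end is a simplified stand-in for the official sandboxed harness.

```python
# Sketch of one HumanEval-style record. Field names match the released
# dataset; the problem below is illustrative, not a real entry.
record = {
    "task_id": "HumanEval/illustrative",
    # Prompt = signature + docstring; the model must write the body.
    "prompt": (
        "def running_max(xs: list) -> list:\n"
        '    """Return a list where element i is max(xs[:i+1])."""\n'
    ),
    # Reference solution; validates the tests, never shown to the model.
    "canonical_solution": (
        "    out, cur = [], float('-inf')\n"
        "    for x in xs:\n"
        "        cur = max(cur, x)\n"
        "        out.append(cur)\n"
        "    return out\n"
    ),
    # Unit tests; a completion counts as solved only if all asserts pass.
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, 3, 2]) == [1, 3, 3]\n"
        "    assert candidate([]) == []\n"
    ),
    "entry_point": "running_max",
}

# Scoring (simplified): run prompt + completion + tests, then call check()
# on the entry point. Any failing assert means the problem is unsolved.
program = record["prompt"] + record["canonical_solution"] + record["test"]
scope = {}
exec(program, scope)
scope["check"](scope[record["entry_point"]])  # raises if any test fails
```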

The headline metric is pass@1: the probability that the first sample a model produces passes all tests. It is the “production” number: what you actually get when you ask a model for code once.
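Concretely, the Codex paper estimates pass@k from n samples per problem, of which c pass, as pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems. The snippet below follows the numerically stable form published in the paper; with k = 1 it collapses to the passing fraction c/n.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).

    n: samples drawn for one problem; c: samples that passed all tests;
    k: the k in pass@k. Computes 1 - C(n-c, k) / C(n, k) as a stable
    running product instead of raw binomial coefficients.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With k = 1 the estimator collapses to the passing fraction c / n:
assert abs(pass_at_k(200, 53, 1) - 53 / 200) < 1e-9
```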

§ 03

Why it is saturated, and why it still ships

Codex, the first model evaluated on HumanEval, scored 46.9% pass@1 in 2021. By late 2024 every frontier model cleared 90%. The current top entry sits at 96.3%. At that altitude a one-point gap is usually noise from sampling temperature, not capability.

The honest caveat: HumanEval has been in public training mixes for years. Problems have been translated, paraphrased, and mirrored across GitHub. A score of 95%+ on a model trained after 2023 tells you the model can reproduce HumanEval, not that it can write production code. For a harder signal on real software engineering, see SWE-bench or the guide SWE-bench, explained.

Why keep tracking it? Because vendors keep reporting it. Every Codex, GPT, Claude, Gemini, Qwen, DeepSeek, and Llama model card lists a HumanEval number. It is the longest continuous series in code-generation benchmarking, and that historical continuity is worth preserving even as its ceiling becomes uninformative.

§ 04

Methodology side notes

What we accept

Pass@1 numbers from official model cards, vendor technical reports, or the primary arXiv paper. Each row in § 01 links to the source.

What we reject

Twitter/X screenshots, unverified forks, pass@k converted to pass@1 without the conversion formula, and numbers that drift from what the vendor originally reported.

Dates

“Reported” is the month the number first appeared in a primary source, not the date we ingested it.

No re-runs

Codesota does not independently re-execute HumanEval. Where a vendor reports both greedy and sampled scores, we use the greedy pass@1.
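For readers who do want to reproduce a score themselves, here is a minimal sketch against the official harness from github.com/openai/human-eval. generate_one_completion is a placeholder for your own model call; use greedy decoding to match the pass@1 convention above.

```python
# Minimal reproduction sketch with the official harness:
#   git clone https://github.com/openai/human-eval && pip install -e human-eval
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model here (greedy decoding, to match the
    # pass@1 convention above) and return only the completed function body.
    raise NotImplementedError

problems = read_problems()  # task_id -> {"prompt": ..., "test": ..., ...}
samples = [
    {"task_id": tid, "completion": generate_one_completion(p["prompt"])}
    for tid, p in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Then score in a sandbox from the shell:
#   evaluate_functional_correctness samples.jsonl
```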

§ 05

Related benchmarks

HumanEval+ (successor) · LiveCodeBench · SWE-bench Pro · Terminal-Bench

Dataset: github.com/openai/human-eval · huggingface.co/datasets/openai_humaneval · Paper: arXiv:2107.03374