164 hand-written Python problems, scored by unit tests. The 2021 OpenAI Codex benchmark that every frontier lab still cites, even though the top of the table now sits above 95%.
HumanEval is still useful as a historical anchor, but the leaderboard is ceiling-bound and likely contaminated for modern models. For current capability comparisons, use HumanEval+, LiveCodeBench, SWE-bench Pro, or Terminal-Bench depending on the task.
Ranked by reported pass@1. The SOTA row is highlighted; everything below it preserves the historical record.
| # | Model | Vendor | Pass@1 | Access | Reported |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 96.3% | API | 2026-01 |
| 2 | GPT-5 | OpenAI | 95.1% | API | 2025-12 |
| 3 | o3 | OpenAI | 94.8% | API | 2025-04 |
| 4 | Claude Sonnet 4.6 | Anthropic | 94.1% | API | 2026-01 |
| 5 | Qwen2.5-Coder-32B-Instruct | Alibaba | 92.7% | Open Weights | 2025-03 |
| 6 | DeepSeek-Coder-V2-Instruct | DeepSeek | 90.2% | Open Weights | 2024-06 |
| 7 | GPT-4o | OpenAI | 90.2% | API | 2024-05 |
| 8 | Llama-3.3-70B-Instruct | Meta | 88.4% | Open Weights | 2024-12 |
| 9 | GPT-4 Turbo | OpenAI | 86.6% | API | 2023-11 |
| 10 | Codestral 25.01 | Mistral AI | 85.3% | Weights | 2025-01 |
| 11 | DeepSeek-Coder-33B-Instruct | DeepSeek | 79.3% | Open Weights | 2023-11 |
| 12 | Codex (davinci-002) | OpenAI | 46.9% | API | 2021-07 |
Scores as reported by vendors or referenced in model cards. Codesota does not re-run HumanEval. See § 04 for how we verify.
Released by OpenAI in July 2021 alongside the original Codex paper (Chen et al., arXiv:2107.03374), HumanEval was built to sidestep BLEU-style text-similarity metrics that say nothing about whether code runs.
The dataset is small on purpose: 164 problems, each hand-written to avoid overlap with the public training corpora of the day. Every problem ships with a function signature, a docstring, a reference body, and a set of hidden unit tests. A generation counts as solved only if all tests pass.
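To make the format concrete, here is a sketch of a problem in HumanEval's shape. The function, docstring, solution, and tests below are invented for illustration, not drawn from the dataset. A model is prompted with the signature and docstring, and its completion is run against the hidden `check` function:

```python
# A hypothetical problem in HumanEval's format (not from the dataset).
# The model sees only the signature + docstring and must complete the body.

def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[: i + 1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # --- reference solution (held out from the prompt) ---
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result


# --- hidden unit tests: a generation counts as solved only if all pass ---
def check(candidate):
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([]) == []
    assert candidate([-2, -7, -1]) == [-2, -2, -1]

check(running_max)
```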
The headline metric is pass@1: the probability that the first sample a model produces passes all tests. It is the “production” number: what you actually get when you ask a model for code once.
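For k > 1, pass@k is estimated rather than measured directly: generate n ≥ k samples per problem, count the c that pass, and apply the unbiased estimator from the Codex paper, pass@k = E[1 − C(n−c, k) / C(n, k)]. A minimal sketch of the numerically stable form used by the official harness:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples that passed all tests.
    Computes 1 - C(n - c, k) / C(n, k) without forming large binomials.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark score = mean over the 164 problems. With k = n = 1 this
# reduces to the raw fraction of problems solved on the first try.
print(pass_at_k(n=200, c=130, k=1))   # ≈ 0.65, i.e. c / n
print(pass_at_k(n=200, c=130, k=10))  # ≈ 1.0: near-certain within 10 draws
```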
Codex, the first model evaluated on HumanEval, scored 46.9% pass@1 in 2021. By late 2024 every frontier model cleared 90%. The current top entry sits at 96.3%. At that altitude a one-point gap is usually noise from sampling temperature, not capability: each of the 164 problems is worth about 0.6 points, and the binomial standard error on a true score of 95% is roughly ±1.7 points.
The honest caveat: HumanEval has been in public training mixes for years. Problems have been translated, paraphrased, and mirrored across GitHub. A score of 95+ on a model trained after 2023 tells you the model can reproduce HumanEval, not that it can write production code. For a harder signal on real software engineering, see SWE-bench or the guide *SWE-bench, explained*.
Why keep tracking it? Because vendors keep reporting it. Every Codex, GPT, Claude, Gemini, Qwen, DeepSeek, and Llama model card lists a HumanEval number. It is the longest continuous series in code-generation benchmarking, and that historical continuity is worth preserving even as its ceiling becomes uninformative.
Accepted: pass@1 numbers from official model cards, vendor technical reports, or the primary arXiv paper. Each row in § 01 links to its source.
Excluded: Twitter/X screenshots, unverified forks, pass@k results converted to pass@1 without the conversion shown, and numbers that drift from what the vendor originally reported.
“Reported” is the month the number first appeared in a primary source, not the date we ingested it.
Codesota does not independently re-execute HumanEval. Where a vendor reports both greedy (temperature 0, one deterministic sample) and sampled (estimator over n samples) scores, we use the greedy pass@1.
Dataset: github.com/openai/human-eval · huggingface.co/datasets/openai_humaneval · Paper: arXiv:2107.03374
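For readers who want to reproduce a number locally (Codesota does not), the official repo exposes a small harness. The sketch below follows its README; `generate_one_completion` is a placeholder for whatever model call you use:

```python
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your model here and return the completed body.
    raise NotImplementedError

problems = read_problems()  # the 164 tasks, keyed by task_id
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then score in a sandbox (the tests execute untrusted model code):
#   $ evaluate_functional_correctness samples.jsonl
```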