SATURATED — Benchmark no longer differentiates top models

HumanEval: From 28.8% to 99% in five years.

The complete history of the benchmark that defined AI code generation: 43 models tracked from July 2021 through saturation in 2026.

164 problems · Published Jul 2021 · Current SOTA ~99% · Status: Saturated

What is HumanEval?

HumanEval is a benchmark of 164 hand-written Python programming problems created by OpenAI in July 2021. Each problem includes a function signature, a docstring describing what the function should do, and a set of unit tests to verify correctness.
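A problem in this format looks roughly like the following sketch, modeled on the first task in the suite (the completion and `check` harness shown here are illustrative, not the benchmark's verbatim code):

```python
# The model receives the signature and docstring, and must
# complete the function body.

def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each
    other than the given threshold."""
    # --- model-generated completion begins here ---
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Unit tests then verify the completion against expected behavior.
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.5) is False

check(has_close_elements)
```

A completion counts as a pass only if every assertion in the test harness succeeds.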

The metric is pass@1: the percentage of problems where the model's first attempt passes all unit tests. No retries, no cherry-picking.
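In practice, scores are computed with the unbiased pass@k estimator from Chen et al. (2021): sample n completions per problem, count the c that pass, and estimate the chance that at least one of k draws succeeds. A minimal sketch (the illustrative (n, c) pairs are made up):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): probability
    that at least one of k completions, drawn without replacement
    from n generated samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw fraction of correct samples:
p = pass_at_k(10, 3, 1)  # ~0.3 when 3 of 10 samples pass
# The benchmark score averages the per-problem estimates:
per_problem = [(10, 7), (10, 10), (10, 0)]  # hypothetical (n, c) pairs
score = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```

With k = 1 and a single greedy sample per problem, this collapses to the simple "fraction of 164 problems passed on the first try" described above.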

It was introduced alongside Codex in the paper “Evaluating Large Language Models Trained on Code” (Chen et al., 2021) and quickly became the standard yardstick for code generation ability.

Quick Facts

Language: Python
Problems: 164
Tests/problem: ~7.7 avg
Metric: pass@1
Creator: OpenAI
Paper: arxiv:2107.03374
Status: Saturated

The Progression

SOTA Progression (pass@1)

Best score at each point in time, July 2021 – March 2026

Best Score by Organization

Each org's highest verified HumanEval pass@1

Every Model, Plotted

Each dot is a model. Color = organization. Size = parameter count. Organizations tracked include OpenAI, Anthropic, Google, Meta, DeepSeek, Mistral AI, Alibaba, Moonshot AI, Microsoft, IBM, HuggingFace, Salesforce, CMU, EleutherAI, and Phind.

Complete Timeline

2021

The Starting Line

OpenAI publishes "Evaluating Large Language Models Trained on Code" with 164 hand-written Python problems. Codex, fine-tuned on GitHub, sets the first bar.

Jul 2021 · GPT-3 (OpenAI) · 0% (cannot generate code)
Jul 2021 · GPT-J 6B (EleutherAI) · 11.4%
Jul 2021 · Codex 300M (OpenAI) · 13.2%
Jul 2021 · Codex 2.5B (OpenAI) · 21.4%
Jul 2021 · Codex 12B (OpenAI) · 28.8% (first SOTA on HumanEval)
2022

Specialized Code Models

The field realizes code needs its own models. Salesforce, Google, and OpenAI push toward 50%.

2022 · PolyCoder 2.7B (CMU) · 5.6%
2022 · InCoder 6.7B (Meta) · 15.2%
2022 · CodeGen-Mono 6.1B (Salesforce) · 26.1%
2022 · text-davinci-002 (OpenAI) · 30.5%
2022 · PaLM-Coder 540B (Google) · 36%
2022 · code-davinci-002 (OpenAI) · 47% (nearly doubles the original Codex)
2023

The ChatGPT Explosion

Chat-tuned models smash through 70%. GPT-4 reaches 85%. Open-source catches up fast with WizardCoder and Code Llama.

Mar 2023 · StarCoder 15B (HuggingFace) · 33.6%
Mar 2023 · GPT-3.5 Turbo (OpenAI) · 72.2% (ChatGPT breaks 70%)
Jun 2023 · WizardCoder 15B (Microsoft) · 57.3%
Jun 2023 · GPT-4 (OpenAI) · 67% (0-shot; 82.7% with optimized prompting)
Aug 2023 · Code Llama 34B (Meta) · 53.7%
Aug 2023 · Code Llama Instruct 70B (Meta) · 67.8%
Oct 2023 · Phind-CodeLlama v2 (Phind, 34B) · 73.8%
Nov 2023 · WizardCoder-Python 34B (Microsoft) · 73.2%
Nov 2023 · GPT-4-1106-Preview (OpenAI) · 85.7% (first model near 90%)
2024

Breaking 90%

GPT-4o breaks 90% in May. Claude 3.5 Sonnet follows at 92%. Alibaba's Qwen2.5-Coder hits 92.7%. The ceiling is in sight.

Feb 2024 · Claude 3 Opus (Anthropic) · 84.9%
Apr 2024 · GPT-4 Turbo (OpenAI) · 87.1%
May 2024 · Codestral 22B (Mistral AI) · 81.1%
May 2024 · GPT-4o (OpenAI) · 90.2% (first to break 90%)
Jun 2024 · Claude 3.5 Sonnet (Anthropic) · 92%
Jun 2024 · DeepSeek-Coder-V2 (DeepSeek) · 90.2%
Jul 2024 · Mistral Large 2 (Mistral AI) · 92%
Jul 2024 · Llama 3.1 405B (Meta) · 89%
Sep 2024 · o1-mini (OpenAI) · 92.4%
Sep 2024 · Qwen2.5-Coder 32B (Alibaba) · 92.7%
Oct 2024 · Claude 3.5 Sonnet v2 (Anthropic) · 93.7% (new SOTA)
Nov 2024 · Amazon Nova Pro (Amazon) · 89%
Dec 2024 · Llama 3.3 70B (Meta) · 88.4%
2025

Saturation

Scores converge above 90% across all major vendors. The benchmark can no longer differentiate top models. The community pivots to harder tests.

Jan 2025 · Mistral Small 3 (Mistral AI, 24B) · 84.8%
Mar 2025 · Gemma 3 27B (Google) · 87.8%
Mar 2025 · Mistral Small 3.1 (Mistral AI, 24B) · 88.4%
Apr 2025 · Granite 3.3 8B (IBM) · 89.7% (an 8B model near 90%)
Jul 2025 · Kimi K2 Instruct (Moonshot AI) · 93.3%
Aug 2025 · GPT-5 (OpenAI) · 93.4%
Sep 2025 · Kimi K2 0905 (Moonshot AI) · 94.5% (highest verified score)
2026

Post-Saturation

Multiple models approach or claim 99%. HumanEval is retired as a meaningful differentiator. The community has moved on.

Mar 2026 · Sarvam-30B (Sarvam AI) · 92.1%
2026 · Gemini 2.5 Pro (Google) · 99% (approaching a perfect score)
2026 · Kimi K2.5 (Moonshot AI) · 99%

Key Milestones

28.8% (Jul 2021) · Codex launches HumanEval
The starting line. A 12B-parameter model sets the first bar.

47% (2022) · code-davinci-002 nearly doubles it
Specialized code models prove the approach works.

72.2% (Mar 2023) · ChatGPT breaks 70%
Chat-tuned models are surprisingly good at code.

85.7% (Nov 2023) · GPT-4 approaches 90%
The 90% barrier is within reach for the first time.

90.2% (May 2024) · GPT-4o breaks 90%
The psychological barrier falls. Three models follow within weeks.

93.7% (Oct 2024) · Claude 3.5 Sonnet v2
Anthropic takes the lead. The gap between vendors shrinks to noise.

94.5% (Sep 2025) · Kimi K2 pushes the ceiling
The highest verified score. Improving from ~93% to ~95% takes a full year.

~99% (2026) · Effectively solved
Multiple models approach a perfect score. The benchmark can no longer differentiate.

Why HumanEval is saturated

HumanEval served its purpose brilliantly. In 2021, it was the right benchmark at the right time — simple enough to be reproducible, hard enough to be meaningful. But three structural limitations made saturation inevitable:

164

Too few problems

With only 164 problems, each one is worth about 0.6% of the total score, so statistical noise from a single problem can shift rankings.

~7.7

Too few tests per problem

Many problems have only 3–5 unit tests. Models can pass with subtly wrong solutions that happen to satisfy weak test suites.

Public

Data contamination

The problems have been public on GitHub for nearly five years and appear in virtually every training dataset. Models may have memorized solutions, not learned to code.
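The weak-test-suite problem is easy to demonstrate. Below is a hypothetical case (both the buggy function and the tests are invented for illustration, not taken from the benchmark): a subtly wrong median slips through a three-case suite.

```python
# Hypothetical illustration of a weak test suite.

def median_wrong(xs):
    # Bug: for even-length lists this returns the upper-middle
    # element instead of averaging the two middle elements.
    return sorted(xs)[len(xs) // 2]

# A weak suite that only uses odd-length lists never triggers the bug:
weak_tests = [([3, 1, 2], 2), ([5], 5), ([9, 7, 8, 1, 3], 7)]
assert all(median_wrong(xs) == want for xs, want in weak_tests)

# A single even-length case exposes it: the median of [1, 2, 3, 4]
# is 2.5, but this implementation returns 3.
assert median_wrong([1, 2, 3, 4]) != 2.5
```

A grader that only runs the weak suite would mark this completion as a pass, which is exactly how ~7.7 tests per problem can overstate correctness.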

This doesn't mean HumanEval scores are meaningless — a model scoring 30% is genuinely worse at coding than one scoring 90%. But the difference between 93% and 95% is mostly noise.
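The "mostly noise" claim can be sanity-checked with a back-of-the-envelope binomial confidence interval, treating the 164 problems as independent trials (a normal-approximation sketch, not the methodology of any particular leaderboard):

```python
from math import sqrt

def binomial_ci_halfwidth(p, n, z=1.96):
    """Half-width of a normal-approximation 95% confidence
    interval for a pass rate p measured over n problems."""
    return z * sqrt(p * (1.0 - p) / n)

# At a 94% pass rate over 164 problems, the 95% CI is roughly
# +/- 3.6 percentage points, wider than the 93%-vs-95% gaps
# separating the top models on the leaderboard.
hw = binomial_ci_halfwidth(0.94, 164)
print(f"94% +/- {100 * hw:.1f} points")
```

Under these assumptions, two models at 93% and 95% sit well inside each other's intervals, which is the statistical sense in which the benchmark stops differentiating.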

What comes after HumanEval

The community has moved to harder, more realistic benchmarks such as SWE-bench, LiveCodeBench, and BigCodeBench.

Data sources & methodology

Scores compiled from original papers (arXiv), official model cards, the llm-stats.com leaderboard, and HumanEval Revisited (arxiv:2402.14852).

All scores are pass@1 unless noted. Where multiple evaluations exist for the same model, we prefer the officially reported score. Prompting strategy (0-shot vs few-shot, system prompt) can cause a variance of 5–15 percentage points.

Last updated: March 17, 2026.
