HumanEval+.

Name: HumanEval+ Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

An extension of HumanEval with 80× more test cases per problem. It tests code robustness and edge-case handling.


§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

pass@1

Pass@1 is the reported evaluation metric for HumanEval+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
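As a rough illustration of how a pass@1 figure is typically computed, the sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the `results` layout are assumptions made for this example; they are not taken from Codesota or from the EvalPlus harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement contain no correct solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over all problems, as a percentage.

    `results` maps a (hypothetical) problem id to one boolean per sample:
    True if that sample passed every test for the problem.
    """
    per_problem = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return 100.0 * sum(per_problem) / len(per_problem)
```

With one greedy sample per problem, pass@1 reduces to the share of problems whose single generated solution passes all tests.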

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Rank	Model	Details	Trust	Score	Year	Source
01	Qwen2.5-Coder-32B	Qwen2.5-Coder-32B-Instruct (Alibaba, Nov 2024). HumanEval+ pass@1 87.2%. Table 16 of Qwen2.5-Coder technical report.	verified	87.2	2024	Source ↗
02	DeepSeek-V3	DeepSeek-V3 (DeepSeek AI, Dec 2024). HumanEval+ pass@1 86.6%. From EvalPlus leaderboard results.json.	verified	86.6	2025	Source ↗
03	GPT-4o	GPT-4o (2024-08-06). HumanEval+ pass@1 86.0%. Table 16 of Qwen2.5-Coder technical report.	verified	86.0	2024	Source ↗
04	DeepSeek-Coder-V2	DeepSeek-Coder-V2-Instruct (236B). HumanEval+ pass@1 82.3%. Table 16 of Qwen2.5-Coder technical report.	verified	82.3	2024	Source ↗
05	DeepSeek-Coder-33B	DeepSeek-Coder-33B-Instruct. HumanEval+ pass@1 75.0%. Table 16 of Qwen2.5-Coder technical report.	verified	75.0	2024	Source ↗

§ 03 · Lineage

HumanEval+ in context.


Predecessors (1)

saturated · 2021-07

HumanEval

EvalPlus added 80× more test cases per problem to catch the edge cases that the original tests for HumanEval's 164 problems missed. This reopened the gap between models on saturated leaderboards.
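A toy sketch of why extra test cases matter: a solution with a latent bug can pass a small base test set and still fail once edge cases are added. The problem, the `buggy_max` function, and both test lists are invented for this example; they are not an actual HumanEval task or the EvalPlus tooling.

```python
def buggy_max(xs):
    """Return the largest element of xs (contains a deliberate bug)."""
    best = 0  # bug: wrong starting value when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


def passes(fn, tests):
    """True only if fn returns the expected value for every (args, expected) pair."""
    return all(fn(*args) == expected for args, expected in tests)


# Hypothetical base tests in the style of a small hand-written suite.
BASE_TESTS = [(([1, 2, 3],), 3), (([5, 3],), 5)]
# Hypothetical extra tests that probe edge cases.
EXTRA_TESTS = [(([-3, -1],), -1), (([0],), 0)]

base_ok = passes(buggy_max, BASE_TESTS)                # bug goes unnoticed
plus_ok = passes(buggy_max, BASE_TESTS + EXTRA_TESTS)  # all-negative input exposes it
```

Here `base_ok` is True while `plus_ok` is False, so the same solution counts as correct under the smaller suite and incorrect under the extended one.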

This benchmark (1)

active · 2023-05

HumanEval+

Successors (1)

active · 2023-09

LiveCodeBench

Leaderboard attention moved here once EvalPlus problems also began saturating. LiveCodeBench's by-date contamination control became the new credibility floor.
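By-date contamination control can be sketched as a simple filter: score a model only on problems released after its training cutoff, so it cannot have seen them during training. The `Problem` class, its field names, and the function below are assumptions made for illustration, not LiveCodeBench's actual code or data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str  # hypothetical identifier
    released: date   # date the problem was first published


def after_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Invented example data.
pool = [
    Problem("a", date(2023, 6, 1)),
    Problem("b", date(2024, 2, 15)),
    Problem("c", date(2024, 9, 30)),
]
eval_set = after_cutoff(pool, training_cutoff=date(2024, 1, 1))
```

In this example `eval_set` keeps problems "b" and "c". A model with a later cutoff would be scored on a smaller, more recent subset.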

