Codesota · Benchmark · HumanEval+

HumanEval+.

Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.
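Why the extra tests matter: a candidate solution can pass the original sparse test suite yet break on inputs it never sees. A minimal illustration in Python (a hypothetical task for demonstration, not an actual HumanEval problem):

```python
def median(xs):
    """Candidate solution with a latent bug: it ignores even-length inputs."""
    xs = sorted(xs)
    return xs[len(xs) // 2]

# A sparse, HumanEval-style suite (odd lengths only) passes:
assert median([3, 1, 2]) == 2
assert median([5]) == 5

# An EvalPlus-style extended edge case exposes the bug:
print(median([1, 2, 3, 4]))  # returns 3; the extended tests expect 2.5
```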

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

pass@1

Pass@1 — the probability that a single generated sample passes all tests — is the reported evaluation metric for HumanEval+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
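For reference, pass@k is usually computed with the unbiased estimator from Chen et al. (2021); with greedy decoding, pass@1 reduces to the fraction of problems whose single completion passes every test. A minimal sketch in Python (the sample counts below are illustrative, not taken from the leaderboard):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n = samples generated per problem, c = samples passing all tests.
    Estimates the probability that at least one of k samples passes.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 per problem is just the passing fraction c/n:
print([pass_at_k(n=10, c=c, k=1) for c in (10, 7, 0)])  # [1.0, 0.7, 0.0]
```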

Trust tiers for pass@1: verified · paper · vendor · community · unverified

Rank · Model · Trust · Score · Year · Source

01 · Qwen2.5-Coder-32B · verified · 87.2 · 2024 · Source ↗
Qwen2.5-Coder-32B-Instruct (Alibaba, Nov 2024). HumanEval+ pass@1 87.2%. Table 16 of the Qwen2.5-Coder technical report.

02 · DeepSeek-V3 · verified · 86.6 · 2025 · Source ↗
DeepSeek-V3 (DeepSeek AI, Dec 2024). HumanEval+ pass@1 86.6%. From the EvalPlus leaderboard results.json.

03 · GPT-4o · verified · 86.0 · 2024 · Source ↗
GPT-4o (2024-08-06). HumanEval+ pass@1 86.0%. Table 16 of the Qwen2.5-Coder technical report.

04 · DeepSeek-Coder-V2 · verified · 82.3 · 2024 · Source ↗
DeepSeek-Coder-V2-Instruct (236B). HumanEval+ pass@1 82.3%. Table 16 of the Qwen2.5-Coder technical report.

05 · DeepSeek-Coder-33B · verified · 75.0 · 2024 · Source ↗
DeepSeek-Coder-33B-Instruct. HumanEval+ pass@1 75.0%. Table 16 of the Qwen2.5-Coder technical report.
§ 03 · Lineage

HumanEval+ in context.

See full coding benchmarks lineage →
This benchmark (1)
HumanEval+ · active · 2023-05
§ 04 · Submit a result

Add to the leaderboard.
