Codesota · Benchmark · HumanEval+

HumanEval+.

HumanEval+ extends HumanEval with 80x more test cases. It tests code robustness and edge-case handling.
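To illustrate why a larger test suite matters, here is a minimal sketch in which a flawed solution passes a small base suite but fails once edge cases are added. The function `has_close_pair`, the helper `passes`, and both test lists are hypothetical and written for this illustration only; they are not taken from the HumanEval or HumanEval+ datasets.

```python
def has_close_pair(numbers, threshold):
    """Hypothetical flawed solution: only compares neighbours in input order,
    so it misses close values that are not adjacent in the list."""
    return any(abs(a - b) < threshold for a, b in zip(numbers, numbers[1:]))


def passes(solution, tests):
    """Return True only if the solution matches the expected output on every test."""
    return all(solution(*args) == expected for args, expected in tests)


# Small base suite (made up): the flawed solution happens to pass both cases.
BASE_TESTS = [
    (([1.0, 2.0, 3.0], 0.5), False),
    (([1.0, 1.2, 3.0], 0.5), True),
]

# Extra cases (made up): a non-adjacent close pair, an empty list, a single element.
EXTRA_TESTS = [
    (([1.0, 5.0, 1.1], 0.5), True),
    (([], 0.5), False),
    (([2.0], 0.5), False),
]

base_ok = passes(has_close_pair, BASE_TESTS)
extended_ok = passes(has_close_pair, BASE_TESTS + EXTRA_TESTS)

print("passes base suite:    ", base_ok)
print("passes extended suite:", extended_ok)
```

The solution is counted as correct under the base suite and as incorrect under the extended one, which is the kind of gap a larger test set is meant to expose.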

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Pass 1

Pass 1 is the reported evaluation metric for HumanEval+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass 1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

Rank | Model | Trust | Score | Year
01 | Llama 3 (405B, Instruct) | unverified | 89 | 2024
02 | Qwen2.5-Plus | unverified | 87.8 | 2024
03 | Qwen2.5-VL-72B | unverified | 87.8 | 2025
04 | MiniCPM-o 4.5-Instruct | unverified | 86.6 | 2026
05 | Step-3.5-Flash Base | unverified | 81.1 | 2026
06 | Aria | unverified | 73.2 | 2024
07 | Code Llama - Instruct 70B | unverified | 67.8 | 2023
08 | BLT-Entropy 8B | unverified | 35.4 | 2024
09 | Llama 2 70B (5-shot) | unverified | 29.9 | 2023
10 | LLaMA-65B | unverified | 23.7 | 2023
11 | SmoLM2 (1.7B) | unverified | 22.6 | 2025
12 | BLOOM-176B | unverified | 15.52 | 2022
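
The muting rule stated above the table can be sketched as a small filter. This is only one reading of that rule, under assumptions: a row is muted when another row from the same or an earlier year has a strictly higher score, ties do not mute, and only the year (not the exact publication date) is compared. The `Row` class and `muted_models` function are hypothetical names; the sample rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


def muted_models(rows):
    """Return the models that were not state of the art when published.

    Assumed rule: a row is muted if any other row with year <= its year
    has a strictly higher score.
    """
    return [
        r.model
        for r in rows
        if any(o is not r and o.year <= r.year and o.score > r.score for o in rows)
    ]


# A subset of the rows listed in the table.
rows = [
    Row("Llama 3 (405B, Instruct)", 89.0, 2024),
    Row("Qwen2.5-Plus", 87.8, 2024),
    Row("Code Llama - Instruct 70B", 67.8, 2023),
    Row("Llama 2 70B (5-shot)", 29.9, 2023),
    Row("BLOOM-176B", 15.52, 2022),
]

print(muted_models(rows))
```

On this subset, Qwen2.5-Plus and Llama 2 70B (5-shot) are muted because a same-year row scores higher, while the other three rows were the best result available in their year.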

pass@1

Pass@1 is the reported evaluation metric for HumanEval+: the percentage of problems for which a single generated solution passes every test. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
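
As a rough sketch of how a pass@1 score can be computed, the snippet below uses the unbiased pass@k estimator introduced with the original HumanEval benchmark: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. Whether each score on this page was produced this way (for example with greedy decoding and n = 1, or with sampling) is not stated here, so treat that as an assumption. The function names and the sample data are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass all tests, k: budget.
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """Average pass@1 over problems, as a percentage.

    results: list of (n_samples, n_correct) pairs, one per problem.
    """
    return 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Made-up example: four problems, one greedy sample each, three solved.
greedy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_pass_at_1(greedy_results))

# Made-up example: one problem, 10 samples, 3 correct; pass@1 equals c/n.
print(pass_at_k(10, 3, 1))
```

With one sample per problem the score is simply the share of solved problems; with several samples per problem, averaging c/n gives a lower-variance estimate of the same quantity.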

Higher is better

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

Rank | Model | Trust | Score | Year | Notes
01 | Qwen2.5-Coder-32B | verified | 87.2 | 2024 | Qwen2.5-Coder-32B-Instruct (Alibaba, Nov 2024). HumanEval+ pass@1 87.2%. Table 16 of Qwen2.5-Coder technical report.
02 | DeepSeek-V3 | verified | 86.6 | 2025 | DeepSeek-V3 (DeepSeek AI, Dec 2024). HumanEval+ pass@1 86.6. From EvalPlus leaderboard results.json.
03 | GPT-4o | verified | 86 | 2024 | GPT-4o (2024-08-06). HumanEval+ pass@1 86.0%. Table 16 of Qwen2.5-Coder technical report.
04 | DeepSeek-Coder-V2 | verified | 82.3 | 2024 | DeepSeek-Coder-V2-Instruct (236B). HumanEval+ pass@1 82.3%. Table 16 of Qwen2.5-Coder technical report.
05 | DeepSeek-Coder-33B | verified | 75 | 2024 | DeepSeek-Coder-33B-Instruct. HumanEval+ pass@1 75.0%. Table 16 of Qwen2.5-Coder technical report.

§ 03 · Lineage

HumanEval+ in context.

This benchmark (1)
HumanEval+ (active, 2023-05)
