Codesota · Benchmark · MBPP+

MBPP+.

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

pass@1

pass@1 is the evaluation metric reported for MBPP+: the percentage of problems for which a model's single generated solution passes every test case. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
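As a rough illustration of how a pass@1 percentage can be derived from per-problem results, the sketch below uses the widely used unbiased pass@k estimator, which reduces to c/n for k = 1. The page does not say how each listed source computed its score, so treating them as following this recipe is an assumption; the function names `pass_at_k` and `benchmark_pass_at_1` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems, as a percentage.

    results: one (n, c) pair per problem (assumed input shape).
    """
    if not results:
        raise ValueError("no results")
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With greedy decoding there is one sample per problem (n = 1), so each problem contributes either 0 or 1 and the benchmark score is simply the share of problems solved.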

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
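One possible reading of the muting rule can be sketched as a running maximum over results ordered by year. This is not Codesota's actual implementation: the `Row` type, the `mark_muted` helper, and the choice to leave exact ties unmuted are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


def mark_muted(rows: list[Row]) -> dict[str, bool]:
    """Map each model to True if an earlier or same-year row scored strictly higher."""
    muted: dict[str, bool] = {}
    best_so_far = float("-inf")
    # Walk years in order; within a year, visit the highest score first so
    # that it can mute lower same-year results.
    for row in sorted(rows, key=lambda r: (r.year, -r.score)):
        muted[row.model] = best_so_far > row.score
        best_so_far = max(best_so_far, row.score)
    return muted


# Scores and years as listed in the pass@1 table on this page.
rows = [
    Row("Qwen2.5-Coder-32B", 76.4, 2024),
    Row("DeepSeek-V3", 73.0, 2025),
    Row("GPT-4o", 71.2, 2024),
    Row("DeepSeek-Coder-33B", 66.0, 2024),
]
muted = mark_muted(rows)
```

Under this reading, only the top 2024 score stays unmuted, because every other listed result was preceded or matched in year by a higher one.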

| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Coder-32B | verified | 76.4 | 2024 | Qwen2.5-Coder-32B-Instruct (Alibaba, Nov 2024). MBPP+ pass@1 76.4%. Table 16 of Qwen2.5-Coder technical report. |
| 02 | DeepSeek-V3 | verified | 73.0 | 2025 | DeepSeek-V3 (DeepSeek AI, Dec 2024). MBPP+ pass@1 73.0. From EvalPlus leaderboard results.json (evalplus.github.io). |
| 03 | GPT-4o | verified | 71.2 | 2024 | GPT-4o (2024-08-06). MBPP+ pass@1 71.2%. Table 16 of Qwen2.5-Coder technical report. |
| 04 | DeepSeek-Coder-33B | verified | 66.0 | 2024 | DeepSeek-Coder-33B-Instruct. MBPP+ pass@1 66.0%. Table 16 of Qwen2.5-Coder technical report. |

§ 03 · Lineage

MBPP+ in context.

This benchmark (1)
active · 2023-05
MBPP+
None yet — this is the current frontier.