
MBPP+.

MBPP+ extends MBPP with additional test cases. It uses 399 hand-verified problems from MBPP-sanitized.
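To illustrate why additional test cases matter, here is a minimal sketch. The task, the flawed solution, and both test lists are hypothetical and are not taken from MBPP or MBPP+; they only show how a solution can pass a small base test set yet fail once more cases are added.

```python
# Hypothetical flawed solution: it only compares the first two elements,
# so it is not a correct "is the list sorted?" check.
def candidate_is_sorted(xs):
    return len(xs) < 2 or xs[0] <= xs[1]

# Hypothetical base tests (small, as in an original benchmark task).
base_tests = [([1, 2, 3], True), ([3, 1, 2], False)]

# Hypothetical extra tests of the kind an extended benchmark might add.
extra_tests = [([1, 3, 2], False), ([], True)]

def passes(fn, tests):
    """Return True only if fn gives the expected output on every test."""
    return all(fn(inp) == expected for inp, expected in tests)

passes_base = passes(candidate_is_sorted, base_tests)
passes_extended = passes(candidate_is_sorted, base_tests + extra_tests)
```

Under these assumptions the flawed solution is accepted by the base tests and rejected by the extended set, which is the kind of false positive a larger test suite is meant to catch.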

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Only 4 models are listed on this benchmark. Help build the community leaderboard by submitting your model results.

pass@1

Pass@1 is the reported evaluation metric for MBPP+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
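As a rough illustration of the metric, the sketch below uses the unbiased pass@k estimator that is widely used for code generation benchmarks, specialised to k = 1. The function names and the input layout (a mapping from problem ID to per-sample pass/fail flags) are assumptions made for this example; the page does not say how each listed source computed its score.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass all tests."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over all problems, as a percentage."""
    scores = [pass_at_k(len(flags), sum(flags), 1) for flags in results.values()]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical results: one sample per problem, as with greedy decoding.
example = {"task_a": [True], "task_b": [False], "task_c": [True], "task_d": [True]}
score = benchmark_pass_at_1(example)
```

With a single sample per problem, pass@1 reduces to the fraction of problems whose one generated solution passes every test.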

Trust tiers for pass@1: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score (pass@1, %) | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Coder-32B (Qwen2.5-Coder-32B-Instruct; Alibaba, Nov 2024) | verified | 76.4 | 2024 | Table 16 of the Qwen2.5-Coder technical report |
| 02 | DeepSeek-V3 (DeepSeek AI, Dec 2024) | verified | 73.0 | 2025 | EvalPlus leaderboard results.json (evalplus.github.io) |
| 03 | GPT-4o (2024-08-06) | verified | 71.2 | 2024 | Table 16 of the Qwen2.5-Coder technical report |
| 04 | DeepSeek-Coder-33B (DeepSeek-Coder-33B-Instruct) | verified | 66.0 | 2024 | Table 16 of the Qwen2.5-Coder technical report |
§ 03 · Lineage

MBPP+ in context.

This benchmark (1)
MBPP+ (active, 2023-05)
None yet; this is the current frontier.
§ 04 · Submit a result

Add to the leaderboard.
