Who leads the MBPP+ benchmark?

Qwen2.5-72B-Instruct currently leads MBPP+ with a score of 88.2 on Pass 1.

What is the state-of-the-art score on MBPP+?

The state-of-the-art result on MBPP+ is 88.2 (Pass 1), achieved by Qwen2.5-72B-Instruct as of 2026.

How many models are tracked on MBPP+?

Codesota tracks 13 models on MBPP+ across 2 metrics.

When was the MBPP+ leaderboard last updated?

The MBPP+ leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2023.


MBPP+.

Name: MBPP+ Benchmark Results
Creator: Unknown
Published: 2023-01-01
License: https://creativecommons.org/licenses/by/4.0/

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.


§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Pass 1

Pass 1 is the reported evaluation metric for MBPP+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass 1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year
01	Qwen2.5-72B-Instruct	unverified	88.2	2024
02	Qwen3-235B-A22B	unverified	81.4	2025
03	Step-3.5-Flash Base	unverified	79.4	2026
04	Llama 3 (405B, Instruct)	unverified	78.8	2024
05	MiniCPM-o 4.5-Instruct	unverified	76.7	2026
06	Code Llama - Python 70B (3-shot)	unverified	65.6	2023
07	Apertus-70B-Instruct	unverified	47	2025
08	BLT-Entropy 8B	unverified	41.8	2024
09	LLaMA-65B	unverified	37.7	2023
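The muting rule stated above the table can be sketched in a few lines. This is a literal reading of that sentence, not Codesota's actual implementation: the `Result` type and the function names are hypothetical, and only year granularity is available here, so any higher score from the same or an earlier year counts as "already scored better".

```python
from typing import NamedTuple


class Result(NamedTuple):
    """One leaderboard row (hypothetical type for illustration)."""
    model: str
    score: float
    year: int


# Rows copied from the Pass 1 table above.
ROWS = [
    Result("Qwen2.5-72B-Instruct", 88.2, 2024),
    Result("Qwen3-235B-A22B", 81.4, 2025),
    Result("Step-3.5-Flash Base", 79.4, 2026),
    Result("Llama 3 (405B, Instruct)", 78.8, 2024),
    Result("MiniCPM-o 4.5-Instruct", 76.7, 2026),
    Result("Code Llama - Python 70B (3-shot)", 65.6, 2023),
    Result("Apertus-70B-Instruct", 47.0, 2025),
    Result("BLT-Entropy 8B", 41.8, 2024),
    Result("LLaMA-65B", 37.7, 2023),
]


def is_muted(row: Result, rows: list[Result]) -> bool:
    """True if an earlier or same-year result already scored strictly higher."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


def sota_at_publication(rows: list[Result]) -> list[str]:
    """Models whose score was not beaten by any earlier or same-year result."""
    return [r.model for r in rows if not is_muted(r, rows)]


if __name__ == "__main__":
    for name in sota_at_publication(ROWS):
        print(name)
```

Under this reading, only rows with the best score up to and including their own year stay unmuted. With finer publication dates the outcome could differ for results published within the same year.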

pass@1

Pass@1 is the reported evaluation metric for MBPP+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
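For readers unfamiliar with the metric, here is a minimal sketch of how a pass@k score is commonly computed, using the unbiased estimator introduced with HumanEval: for a problem with `n` sampled solutions of which `c` pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The page does not say how each listed score was produced (many reports use a single greedy sample per problem, where pass@1 reduces to the fraction of problems solved), so the function names and the sample counts below are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, as a percentage.

    per_problem holds one (n, c) pair per benchmark problem.
    """
    total = sum(pass_at_k(n, c, k) for n, c in per_problem)
    return 100.0 * total / len(per_problem)


if __name__ == "__main__":
    # Hypothetical run: one greedy sample per problem, 3 of 4 problems solved.
    counts = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(benchmark_score(counts))  # 75.0
```

For k = 1 the estimator simplifies to `c / n` per problem, which is why a pass@1 score can be read as the average share of generated solutions that pass every test.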

Higher is better

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Source
01	Qwen2.5-Coder-32B-Instruct (Alibaba, Nov 2024)	verified	76.4	2024	Table 16 of Qwen2.5-Coder technical report
02	DeepSeek-V3 (DeepSeek AI, Dec 2024)	verified	73.0	2025	EvalPlus leaderboard results.json (evalplus.github.io)
03	GPT-4o (2024-08-06)	verified	71.2	2024	Table 16 of Qwen2.5-Coder technical report
04	DeepSeek-Coder-33B-Instruct	verified	66.0	2024	Table 16 of Qwen2.5-Coder technical report

Lineage

MBPP+ in context.

See full coding benchmarks lineage →

Predecessors (1)

saturated · 2021-08

MBPP

Same EvalPlus adversarial-test treatment applied to MBPP.

This benchmark (1)

active · 2023-05

MBPP+

Successors (0)

None yet; this is the current frontier.

