Who leads the MBPP benchmark?

o4-mini currently leads MBPP with a score of 94.9 on pass@1.

What is the state-of-the-art score on MBPP?

The state-of-the-art result on MBPP is 94.9 (pass@1), achieved by o4-mini as of 2026.

How many models are tracked on MBPP?

Codesota tracks 30 models on MBPP across 2 metrics.

When was the MBPP leaderboard last updated?

The MBPP leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2024.

MBPP.

Name: MBPP Benchmark Results
Creator: Unknown
Published: 2024-01-01
License: https://creativecommons.org/licenses/by/4.0/

MBPP consists of 974 crowd-sourced Python programming problems suitable for beginners, covering programming fundamentals and the standard library.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

pass@1

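Pass@1 is the reported evaluation metric for MBPP: the percentage of problems for which a single generated solution passes all of the problem's test cases. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.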

Higher is better
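As a rough illustration of how a pass@1 figure can be computed, the sketch below uses the unbiased pass@k estimator that is commonly applied to code benchmarks. This page does not state how each source computed its score, so the estimator choice, the function names `pass_at_k` and `benchmark_pass_at_k`, and the sample counts are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Assumption: this is the commonly used unbiased estimator
    1 - C(n - c, k) / C(n, k); individual sources may compute it differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, as a percentage."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: four problems, one sample each, three of them passing.
example_score = benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 1)], k=1)
```

With k = 1 the estimator reduces to c / n per problem, so a single greedy sample per problem makes pass@1 simply the share of problems solved.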

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Source note	Trust	Score	Year	Links
01	o4-mini	OpenAI model card. MBPP pass@1.	verified	94.9	2026	Source
02	o3-mini	OpenAI o3-mini model card. MBPP pass@1.	verified	93.3	2026	Source
03	Claude Opus 4	Anthropic model card. MBPP pass@1.	verified	92	2026	Source
04	Claude 3.5 Sonnet (Oct 2024)	Qwen2.5-Coder tech report Table 16	verified	91	2024	Source
05	GPT-4.1	OpenAI GPT-4.1 model card. MBPP pass@1.	verified	90.9	2026	Source
06	Qwen2.5-Coder-32B-Instruct	Qwen2.5-Coder tech report Table 16	verified	90.2	2024	Source
07	Qwen2.5-Coder 32B	Table 2, arxiv:2409.12186. MBPP pass@1.	verified	90.2	2024	Paper, Code
08	Claude Sonnet 4	Anthropic model card. MBPP pass@1.	verified	89.6	2026	Source
09	DeepSeek-Coder-V2-Instruct	Qwen2.5-Coder tech report Table 16	verified	89.4	2024	Source
10	DeepSeek-V3	DeepSeek-V3 tech report. MBPP pass@1.	verified	89.3	2026	Source
11	Claude 3.5 Sonnet	-	unverified	89.2	2025	Source
12	claude-35-sonnet	-	paper	89.2	2025	Source
13	GPT-4o	-	unverified	87.8	2025	Source
14	GPT-4o (Aug 2024)	Qwen2.5-Coder tech report Table 16	verified	86.8	2024	Source
15	Qwen2.5-Coder-7B-Instruct	Qwen2.5-Coder tech report Table 16	verified	83.5	2024	Source
16	Codestral 22B v0.1	Qwen2.5-Coder tech report Table 16	verified	78.2	2024	Source
17	Llama 4 Maverick	Meta Llama 4 Maverick model card	verified	77.6	2025	Source
18	Llama 4 Maverick (17B-128E)	Meta Llama 4 Maverick model card	verified	77.6	2025	Source
19	Codestral 22B	Mistral official blog, May 2024. MBPP pass@1.	verified	75.4	2024	Source
20	Gemma-3-27b	Gemma 3 tech report	verified	74.4	2025	Source
21	Gemma 3 27B IT	Gemma 3 tech report	verified	74.4	2025	Source
22	Gemma 3 12B IT	Gemma 3 tech report	verified	73	2025	Source
23	Llama 4 Scout (17B-16E)	Meta Llama 4 Scout model card, pre-trained	verified	67.8	2025	Source
24	Llama-4-Scout	Meta Llama 4 Scout model card, pre-trained	verified	67.8	2025	Source
25	Gemma 3 4B IT	Gemma 3 tech report	verified	63.2	2025	Source
26	Code Llama 34B	Code Llama paper, arxiv:2308.12950. MBPP pass@1.	verified	62.6	2026	Source
27	StarCoder2 15B	Table 2, arxiv:2402.19173. StarCoder2-15B base model.	verified	54.4	2024	Paper, Code
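The muted-row rule stated above the table can be sketched as a simple check: a row is muted when some other result from the same or an earlier year has a strictly higher score. The `Row` type, the `is_muted` helper, and the handling of ties are assumptions for illustration, since the page does not spell out its exact implementation. The sample rows are taken from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure for this sketch)."""
    model: str
    score: float
    year: int


def is_muted(row: Row, rows: list[Row]) -> bool:
    """Return True if an earlier or same-year result scored strictly better.

    Assumption: equal scores do not mute each other.
    """
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
    )


# A handful of rows copied from the pass@1 table.
rows = [
    Row("o4-mini", 94.9, 2026),
    Row("Claude Opus 4", 92.0, 2026),
    Row("Claude 3.5 Sonnet (Oct 2024)", 91.0, 2024),
    Row("Llama 4 Maverick", 77.6, 2025),
    Row("Codestral 22B", 75.4, 2024),
]

# Rows that were state of the art for their year under this rule.
sota = [r.model for r in rows if not is_muted(r, rows)]
print(sota)  # ['o4-mini', 'Claude 3.5 Sonnet (Oct 2024)']
```

Under this reading only the best result up to each year stays unmuted, which is why a 2026 score of 92 is muted by the 94.9 posted in the same year.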

Pass 1

Pass 1 is the reported evaluation metric for MBPP. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Pass 1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Links
01	Qwen2.5-72B-Instruct	unverified	88.2	2024	Paper, Code
02	Qwen3-235B-A22B	unverified	81.4	2025	Paper, Code
03	Llama 3 (405B, Instruct)	unverified	78.8	2024	Paper, Code