Codesota · Benchmark · MBPP

MBPP.

MBPP (Mostly Basic Python Problems) is a set of 974 crowd-sourced Python programming problems suitable for beginners. The problems cover programming fundamentals and use of the standard library.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

pass@1

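Pass@1 is the reported evaluation metric for MBPP: the percentage of problems for which a model's first sampled solution passes the problem's test cases. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.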

Higher is better
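
As a rough illustration of how a pass@1 figure can be computed, the sketch below uses the unbiased pass@k estimator popularized by the Codex paper (Chen et al., 2021), which reduces to the fraction of passing samples when k = 1. This is an assumption about methodology: the page does not state how each source computed its score, and the function names `pass_at_k` and `benchmark_pass_at_1` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: how many of them pass all tests,
    k: attempt budget being estimated (k = 1 for pass@1).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems, as a percentage.

    results holds one (n, c) pair per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With one greedy sample per problem, each pair is `(1, 1)` for a solved problem or `(1, 0)` for a failed one, so the result is simply the percentage of problems solved.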

Trust tiers for pass@1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.
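
The muting rule can be sketched as follows. This is one possible reading, not Codesota's actual implementation: it assumes "scored better" means a strictly higher score, and it compares by year only, since the page shows no finer publication date. The `Row` type and `muted_rows` function are hypothetical names; the sample scores and years are taken from the pass@1 table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


def muted_rows(rows: list[Row]) -> list[Row]:
    """Return rows that were not state of the art when published.

    A row is muted if some row from the same year or an earlier year
    has a strictly higher score.
    """
    return [
        row
        for row in rows
        if any(o.year <= row.year and o.score > row.score for o in rows)
    ]


rows = [
    Row("o4-mini", 94.9, 2026),
    Row("Claude 3.5 Sonnet (Oct 2024)", 91.0, 2024),
    Row("Qwen2.5-Coder-32B-Instruct", 90.2, 2024),
    Row("Llama 4 Maverick", 77.6, 2025),
]
muted_names = [r.model for r in muted_rows(rows)]
```

On this sample, the two rows that trail the 2024 score of 91 are muted, while the top 2024 and 2026 entries are not.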

| Rank | Model | Source note | Trust | Score | Year | Links |
|------|-------|-------------|-------|-------|------|-------|
| 01 | o4-mini | OpenAI model card. MBPP pass@1. | verified | 94.9 | 2026 | Source |
| 02 | o3-mini | OpenAI o3-mini model card. MBPP pass@1. | verified | 93.3 | 2026 | Source |
| 03 | Claude Opus 4 | Anthropic model card. MBPP pass@1. | verified | 92 | 2026 | Source |
| 04 | Claude 3.5 Sonnet (Oct 2024) | Qwen2.5-Coder tech report Table 16 | verified | 91 | 2024 | Source |
| 05 | GPT-4.1 | OpenAI GPT-4.1 model card. MBPP pass@1. | verified | 90.9 | 2026 | Source |
| 06 | Qwen2.5-Coder-32B-Instruct | Qwen2.5-Coder tech report Table 16 | verified | 90.2 | 2024 | Source |
| 07 | Qwen2.5-Coder 32B | Table 2, arxiv:2409.12186. MBPP pass@1. | verified | 90.2 | 2024 | Paper, Code |
| 08 | Claude Sonnet 4 | Anthropic model card. MBPP pass@1. | verified | 89.6 | 2026 | Source |
| 09 | DeepSeek-Coder-V2-Instruct | Qwen2.5-Coder tech report Table 16 | verified | 89.4 | 2024 | Source |
| 10 | DeepSeek-V3 | DeepSeek-V3 tech report. MBPP pass@1. | verified | 89.3 | 2026 | Source |
| 11 | Claude 3.5 Sonnet | | unverified | 89.2 | 2025 | Source |
| 12 | claude-35-sonnet | | paper | 89.2 | 2025 | Source |
| 13 | GPT-4o | | unverified | 87.8 | 2025 | Source |
| 14 | GPT-4o (Aug 2024) | Qwen2.5-Coder tech report Table 16 | verified | 86.8 | 2024 | Source |
| 15 | Qwen2.5-Coder-7B-Instruct | Qwen2.5-Coder tech report Table 16 | verified | 83.5 | 2024 | Source |
| 16 | Codestral 22B v0.1 | Qwen2.5-Coder tech report Table 16 | verified | 78.2 | 2024 | Source |
| 17 | Llama 4 Maverick | Meta Llama 4 Maverick model card | verified | 77.6 | 2025 | Source |
| 18 | Llama 4 Maverick (17B-128E) | Meta Llama 4 Maverick model card | verified | 77.6 | 2025 | Source |
| 19 | Codestral 22B | Mistral official blog, May 2024. MBPP pass@1. | verified | 75.4 | 2024 | Source |
| 20 | Gemma-3-27b | Gemma 3 tech report | verified | 74.4 | 2025 | Source |
| 21 | Gemma 3 27B IT | Gemma 3 tech report | verified | 74.4 | 2025 | Source |
| 22 | Gemma 3 12B IT | Gemma 3 tech report | verified | 73 | 2025 | Source |
| 23 | Llama 4 Scout (17B-16E) | Meta Llama 4 Scout model card, pre-trained | verified | 67.8 | 2025 | Source |
| 24 | Llama-4-Scout | Meta Llama 4 Scout model card, pre-trained | verified | 67.8 | 2025 | Source |
| 25 | Gemma 3 4B IT | Gemma 3 tech report | verified | 63.2 | 2025 | Source |
| 26 | Code Llama 34B | Code Llama paper, arxiv:2308.12950. MBPP pass@1. | verified | 62.6 | 2026 | Source |
| 27 | StarCoder2 15B | Table 2, arxiv:2402.19173. StarCoder2-15B base model. | verified | 54.4 | 2024 | Paper, Code |

Lineage

MBPP in context.

This benchmark (1): MBPP (2021-08, saturated)