Who leads the MBPP benchmark?

o4-mini currently leads MBPP with a score of 94.90 on pass@1.

What is the state-of-the-art score on MBPP?

The state-of-the-art result on MBPP is 94.90 (pass@1), achieved by o4-mini as of 2026.

How many models are tracked on MBPP?

Codesota tracks 21 models on MBPP across 2 metrics.

When was the MBPP leaderboard last updated?

The MBPP leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2024.


Code Generation · benchmark dataset · 2021 · PYTHON

Mostly Basic Python Problems.

Name: Mostly Basic Python Problems Benchmark Results
Creator: Codesota
Published: 2024-01-01
License: https://creativecommons.org/licenses/by/4.0/

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.
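To make the scoring concrete, the sketch below checks one candidate solution against assert-style test cases, which is the general shape of an MBPP problem: a short task description paired with a few `assert` statements. The `passes_tests` helper and the sample problem are illustrative assumptions, not an entry from the dataset and not Codesota's harness. A real harness would also run untrusted code in a sandbox with a timeout.

```python
def passes_tests(candidate_src: str, tests: list[str]) -> bool:
    """Return True if the candidate code passes every assert-style test.

    WARNING: exec() runs arbitrary code. This is a teaching sketch only;
    never run model-generated code this way outside an isolated sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function(s)
        for test in tests:
            exec(test, namespace)        # each test is a single `assert ...` line
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True


# Hypothetical problem in the MBPP style (not copied from the dataset).
tests = [
    "assert is_even(4) == True",
    "assert is_even(7) == False",
]
good = "def is_even(n):\n    return n % 2 == 0\n"
bad = "def is_even(n):\n    return n % 2 == 1\n"

print(passes_tests(good, tests))  # True
print(passes_tests(bad, tests))   # False
```

A problem counts as solved only when every test passes, so a single failing assert marks the whole sample as incorrect.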


§ 01 · Leaderboard

Best published scores.

22 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.

Primary: pass@1 · higher is better
All metrics: pass-1, pass@1
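This page does not state how each source computed its pass@1 figure. As a point of reference, the sketch below implements the widely used unbiased pass@k estimator, where `n` samples are drawn per problem and `c` of them pass all tests. With one sample per problem it reduces to the plain fraction of problems solved. The function names and the sample counts are assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts: list[int], n: int) -> float:
    """Mean pass@1 over all problems, as a percentage."""
    total = sum(pass_at_k(n, c, 1) for c in correct_counts)
    return 100.0 * total / len(correct_counts)


# Hypothetical run: 4 problems, 1 sample each, 3 solved.
print(benchmark_pass_at_1([1, 0, 1, 1], n=1))  # 75.0
```

Scores from different sources are only comparable if they share the same sampling setup (number of samples, temperature, prompt format), which is worth checking before reading small gaps in the table as meaningful.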

pass-1

3 rows

#	Model	Org	Submitted	Paper / code	pass-1
01	Qwen2.5-72B-Instruct	Alibaba	Dec 2024	Qwen2.5 Technical Report · code	88.20
02	Qwen3-235B-A22B (Open)	Alibaba	May 2025	Qwen3 Technical Report · code	81.40
03	Llama 3 (405B, Instruct)	Meta	Jul 2024	The Llama 3 Herd of Models · code	78.80

pass@1 · primary

19 rows

#	Model	Org	Submitted	Paper / code	pass@1
01	o4-mini	OpenAI	Mar 2026	official-model-card	94.90
02	o3-mini (API)	OpenAI	Mar 2026	official-model-card	93.30
03	Claude Opus 4	Anthropic	Mar 2026	official-model-card	92.00
04	GPT-4.1	OpenAI	Mar 2026	official-model-card	90.90
05	Qwen2.5-Coder 32B (Open)	Alibaba	Sep 2024	Qwen2.5-Coder Technical Report · code	90.20
06	Claude Sonnet 4	Anthropic	Mar 2026	official-model-card	89.60
07	DeepSeek-Coder-V2-Instruct (Open)	DeepSeek	Sep 2024	arxiv-2409.12186	89.40
08	DeepSeek-Coder-V2-Instruct (Open)	DeepSeek	Jun 2024	DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source… · code	89.40
09	DeepSeek-V3 (Open)	DeepSeek	Mar 2026	arxiv	89.30
10	Claude 3.5 Sonnet (API)	Anthropic	Dec 2025	anthropic-blog	89.20
11	GPT-4o (API)	OpenAI	Dec 2025	openai-blog	87.80
12	Llama 4 Maverick (Open)	Meta	Apr 2025	meta-model-card	77.60
13	Codestral 22B (Open)	Mistral	May 2024	official-blog	75.40
14	Gemma-3-27b (Open)	Google	Mar 2025	arxiv-2503.19786	74.40
15	Gemma 3 12B IT (Open)	Google DeepMind	Mar 2025	arxiv-2503.19786	73.00
16	Llama-4-Scout (Open)	Meta	Apr 2025	meta-model-card	67.80
17	Gemma 3 4B IT (Open)	Google DeepMind	Mar 2025	arxiv-2503.19786	63.20
18	Code Llama 34B (Open)	Meta	Mar 2026	arxiv	62.60
19	StarCoder2 15B (Open)	BigCode	Feb 2024	StarCoder2 and The Stack v2: The Next Generation · code	54.40

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.

§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on pass@1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · pass@1

Date	Model	Org	pass@1
Feb 29, 2024	StarCoder2 15B	BigCode	54.40
May 29, 2024	Codestral 22B	Mistral	75.40
Jun 17, 2024	DeepSeek-Coder-V2-Instruct	DeepSeek	89.40
Sep 19, 2024	Qwen2.5-Coder 32B	Alibaba	90.20
Mar 27, 2026	o4-mini	OpenAI	94.90

Fig 3 · SOTA-setting models only. 5 entries span Feb 2024 → Mar 2026.
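The record-setting rule described in this section can be sketched as a running maximum over date-ordered results. This is a minimal illustration, assuming each result carries a single date and that only a strictly higher score sets a new record. The `sota_steps` helper is hypothetical and is not Codesota's actual pipeline.

```python
from datetime import date


def sota_steps(results):
    """Return the (date, model, score) entries that strictly beat every earlier score."""
    steps = []
    best = float("-inf")
    # Process results chronologically; the sort is stable for same-day entries.
    for day, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            steps.append((day, model, score))
            best = score
    return steps


results = [
    (date(2024, 2, 29), "StarCoder2 15B", 54.40),
    (date(2024, 5, 29), "Codestral 22B", 75.40),
    (date(2024, 6, 17), "DeepSeek-Coder-V2-Instruct", 89.40),
    (date(2024, 9, 19), "Qwen2.5-Coder 32B", 90.20),
    # Day of month is a placeholder: the leaderboard lists only "Apr 2025".
    (date(2025, 4, 1), "Llama 4 Maverick", 77.60),
    (date(2026, 3, 27), "o4-mini", 94.90),
]

for day, model, score in sota_steps(results):
    print(day.isoformat(), model, f"{score:.2f}")
```

Llama 4 Maverick is filtered out because 77.60 does not exceed the 90.20 record standing at that point, which mirrors how intermediate submissions stay on the leaderboard but not on the SOTA line.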

§ 04 · Literature

6 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

Qwen3 Technical Report · May 2025 · Qwen3-235B-A22B · arXiv, code
Qwen2.5 Technical Report · Dec 2024 · Qwen2.5-72B-Instruct · arXiv, code
Qwen2.5-Coder Technical Report · Sep 2024 · Qwen2.5-Coder 32B · arXiv, code
The Llama 3 Herd of Models · Jul 2024 · Llama 3 (405B, Instruct) · arXiv, code
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence · Jun 2024 · DeepSeek-Coder-V2-Instruct · arXiv, code
StarCoder2 and The Stack v2: The Next Generation · Feb 2024 · StarCoder2 15B · arXiv, code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.


What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
