How many models are tracked on MBPP+?

Codesota tracks 9 models on MBPP+.

When was the MBPP+ leaderboard last updated?

The MBPP+ leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2023.

Code Generation · benchmark dataset · 2023 · PYTHON

MBPP+ Extended Version.

Name: MBPP+ Extended Version Benchmark Results
Creator: Codesota
Published: 2023-01-01
License: https://creativecommons.org/licenses/by/4.0/

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
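Why additional test cases matter can be shown with a minimal sketch. The toy problem, the assertions, and the `passes` helper below are all hypothetical; none of them come from the MBPP+ dataset or its official harness.

```python
def passes(solution_src: str, tests: list[str]) -> bool:
    """Return True if the solution source satisfies every assertion string."""
    namespace: dict = {}
    try:
        # Define the candidate function. A real harness would sandbox this,
        # since exec() on untrusted model output is unsafe.
        exec(solution_src, namespace)
        for test in tests:
            exec(test, namespace)  # each test is a bare `assert ...` statement
    except Exception:
        return False
    return True


# Hypothetical model output: wrong whenever every element is negative.
candidate = """
def largest(xs):
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best
"""

# Hypothetical tests: a small original set and an extended set.
base_tests = ["assert largest([1, 5, 3]) == 5"]
extra_tests = ["assert largest([-4, -2, -9]) == -2"]

print(passes(candidate, base_tests))                # True
print(passes(candidate, base_tests + extra_tests))  # False
```

The candidate passes the small test set but fails once the edge case is added, which is the kind of false positive an extended test suite is meant to catch.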

§ 01 · Leaderboard

Best published scores.

9 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: pass@1 · higher is better
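As a rough sketch of how a pass@1 figure can be computed, the snippet below uses the unbiased pass@k estimator commonly applied to code benchmarks and averages it over problems. The function names and the per-problem counts are hypothetical, and the page does not state whether each listed score came from greedy decoding or from sampling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, tuple[int, int]]) -> float:
    """Mean pass@1 over problems, as a percentage. Maps problem id -> (n, c)."""
    scores = [pass_at_k(n, c, 1) for n, c in results.values()]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical results: one sample per problem, three of four solved.
results = {"p1": (1, 1), "p2": (1, 0), "p3": (1, 1), "p4": (1, 1)}
print(benchmark_pass_at_1(results))  # 75.0
```

For k = 1 the estimator reduces to c / n, so with a single sample per problem the score is simply the percentage of problems solved.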

#	Model	Org	Submitted	Paper / code	pass@1
01	Qwen2.5-72B-Instruct	—	Dec 2024	Qwen2.5 Technical Report · code	88.20
02	Qwen3-235B-A22B	Alibaba	May 2025	Qwen3 Technical Report · code	81.40
03	Step-3.5-Flash Base	—	Feb 2026	Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code	79.40
04	Llama 3 (405B, Instruct)	Meta	Jul 2024	The Llama 3 Herd of Models · code	78.80
05	MiniCPM-o 4.5-Instruct	—	Apr 2026	MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code	76.70
06	Code Llama - Python 70B (3-shot)	—	Aug 2023	Code Llama: Open Foundation Models for Code · code	65.60
07	Apertus-70B-Instruct	—	Sep 2025	Apertus: Democratizing Open and Compliant LLMs for Globa… · code	47.00
08	BLT-Entropy 8B	—	Dec 2024	Byte Latent Transformer: Patches Scale Better Than Token… · code	41.80
09	LLaMA-65B	—	Feb 2023	LLaMA: Open and Efficient Foundation Language Models · code	37.70

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
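The ordering rule can be sketched in a few lines. This is an illustration, not Codesota's implementation: it assumes ties go to the earlier date (the page does not say which direction), and it sets the day of month to 1 because the table lists months only.

```python
from datetime import date

# (model, date, pass@1) for four rows of the table; day of month is assumed.
rows = [
    ("LLaMA-65B", date(2023, 2, 1), 37.70),
    ("Qwen2.5-72B-Instruct", date(2024, 12, 1), 88.20),
    ("Qwen3-235B-A22B", date(2025, 5, 1), 81.40),
    ("Llama 3 (405B, Instruct)", date(2024, 7, 1), 78.80),
]


def rank(entries):
    """Sort by score descending, then by date ascending (assumed tie-break)."""
    return sorted(entries, key=lambda r: (-r[2], r[1]))


ranked = rank(rows)
sota = ranked[0]  # the row that would be shaded as current SOTA

for position, (model, when, score) in enumerate(ranked, start=1):
    print(f"{position:02d}\t{model}\t{when:%b %Y}\t{score:.2f}")
```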

§ 04 · Literature

9 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Apr 2026 · MiniCPM-o 4.5-Instruct · arXiv · Code

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 2026 · Step-3.5-Flash Base · arXiv · Code

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Sep 2025 · Apertus-70B-Instruct · arXiv · Code

Qwen3 Technical Report
May 2025 · Qwen3-235B-A22B · arXiv · Code

Qwen2.5 Technical Report
Dec 2024 · Qwen2.5-72B-Instruct · arXiv · Code

Byte Latent Transformer: Patches Scale Better Than Tokens
Dec 2024 · BLT-Entropy 8B · arXiv · Code

The Llama 3 Herd of Models
Jul 2024 · Llama 3 (405B, Instruct) · arXiv · Code

Code Llama: Open Foundation Models for Code
Aug 2023 · Code Llama - Python 70B (3-shot) · arXiv · Code

LLaMA: Open and Efficient Foundation Language Models
Feb 2023 · LLaMA-65B · arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
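The five requirements above could be gathered into a small manifest and checked before submitting. The sketch below is hypothetical throughout: the `Submission` class, its field names, and the example values are illustrative and do not reflect an actual Codesota submission format.

```python
from dataclasses import dataclass

# The only metric this page declares for the dataset.
REQUIRED_METRICS = {"pass@1"}


@dataclass
class Submission:
    model_ref: str             # 01: public checkpoint or API endpoint
    commit: str                # 02: frozen commit of the reproduction script
    seed: int                  # 02: random seed used for the run
    python_version: str        # 03: evaluation environment
    dependencies: list[str]    # 03: pinned dependencies
    metrics: dict[str, float]  # 04: one value per declared metric
    contact: str               # 05: where to send questions


def missing_items(sub: Submission) -> list[str]:
    """Return a description of each requirement the submission does not meet."""
    problems = []
    if not sub.model_ref:
        problems.append("01: no checkpoint or endpoint")
    if not sub.commit:
        problems.append("02: no frozen commit")
    if not sub.python_version or not sub.dependencies:
        problems.append("03: evaluation environment not declared")
    for name in sorted(REQUIRED_METRICS - sub.metrics.keys()):
        problems.append(f"04: metric {name} missing")
    if not sub.contact:
        problems.append("05: no contact")
    return problems


# Placeholder values only; none of these refer to a real model or result.
example = Submission(
    model_ref="https://example.com/checkpoint",
    commit="0123abc",
    seed=0,
    python_version="3.11",
    dependencies=["example-package==1.0.0"],
    metrics={"pass@1": 0.0},
    contact="author@example.com",
)
print(missing_items(example))  # []
```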
