Code Generation · benchmark dataset · 2023 · PYTHON

MBPP+ Extended Version.

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
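To illustrate why additional test cases matter, here is a minimal sketch in which a buggy solution passes a small base test set but fails once extra cases are added. The task, the function `largest`, and both test lists are hypothetical and are not taken from MBPP or MBPP+.

```python
def largest(values):
    """Hypothetical model-generated solution with a subtle bug."""
    best = 0  # bug: wrong for lists that contain only negative numbers
    for v in values:
        if v > best:
            best = v
    return best


# Hypothetical test sets: (input, expected output) pairs.
BASE_TESTS = [([1, 2, 3], 3), ([5, 1], 5)]
EXTRA_TESTS = [([-3, -1, -2], -1), ([0], 0)]


def passes(fn, tests):
    """Return True only if fn produces the expected output on every test."""
    return all(fn(list(args)) == expected for args, expected in tests)


base_ok = passes(largest, BASE_TESTS)                    # bug goes unnoticed
extended_ok = passes(largest, BASE_TESTS + EXTRA_TESTS)  # bug is caught
print(base_ok, extended_ok)
```

A solution counted as correct under the base tests can therefore be counted as incorrect under the extended ones, which is why scores on an extended test suite tend to be lower than on the original.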

§ 01 · Leaderboard

Best published scores.

9 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: pass@1 (pass-1) · higher is better · 9 rows
# | Model | Org | Submitted | Paper / code | pass-1
01 | Qwen2.5-72B-Instruct | - | Dec 2024 | Qwen2.5 Technical Report · code | 88.20
02 | Qwen3-235B-A22B (Open) | Alibaba | May 2025 | Qwen3 Technical Report · code | 81.40
03 | Step-3.5-Flash Base | - | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 79.40
04 | Llama 3 (405B, Instruct) | Meta | Jul 2024 | The Llama 3 Herd of Models · code | 78.80
05 | MiniCPM-o 4.5-Instruct | - | Apr 2026 | MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code | 76.70
06 | Code Llama - Python 70B (3-shot) | - | Aug 2023 | Code Llama: Open Foundation Models for Code · code | 65.60
07 | Apertus-70B-Instruct | - | Sep 2025 | Apertus: Democratizing Open and Compliant LLMs for Globa… · code | 47
08 | BLT-Entropy 8B | - | Dec 2024 | Byte Latent Transformer: Patches Scale Better Than Token… · code | 41.80
09 | LLaMA-65B | - | Feb 2023 | LLaMA: Open and Efficient Foundation Language Models · code | 37.70
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
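For readers unfamiliar with the metric, the sketch below shows one common way to compute pass@k: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is the number that pass all tests. This is an assumption about methodology; the page does not state how each listed paper computed its pass@1, and the function names and sample data here are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems, as a percentage.

    results is a list of (n, c) pairs, one per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, 3 of 4 problems solved.
single_sample = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_pass_at_k(single_sample, k=1)
print(score)
```

With a single sample per problem (n = 1), pass@1 reduces to the percentage of problems whose one generated solution passes every test.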
§ 04 · Literature

9 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
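The commit, seed, and environment items above could be captured in a small manifest written by the reproduction script. The following is a minimal sketch using only the Python standard library; the manifest layout, the function name `build_manifest`, and the placeholder values are assumptions for illustration and are not a format the site specifies.

```python
import json
import platform
import random
from importlib import metadata


def build_manifest(commit, seed, packages):
    """Collect commit, seed, Python version, and dependency versions."""
    deps = {}
    for name in packages:
        try:
            deps[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            deps[name] = None  # package is not installed in this environment
    return {
        "commit": commit,
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "dependencies": deps,
    }


SEED = 1234              # hypothetical seed
COMMIT = "<commit-sha>"  # placeholder: fill in the frozen commit of your repo
random.seed(SEED)        # seed any other libraries you use as well

# "some-eval-package" is a placeholder dependency name.
manifest = build_manifest(COMMIT, SEED, ["some-eval-package"])
print(json.dumps(manifest, indent=2))
```

Emitting the manifest alongside the score makes it easier to trace a discrepancy back to a different interpreter or dependency version.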