Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
9 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | pass@1 |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | — | Dec 2024 | Qwen2.5 Technical Report · code | 88.20 |
| 02 | Qwen3-235B-A22B | Alibaba | May 2025 | Qwen3 Technical Report · code | 81.40 |
| 03 | Step-3.5-Flash Base | — | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 79.40 |
| 04 | Llama 3 (405B, Instruct) | Meta | Jul 2024 | The Llama 3 Herd of Models · code | 78.80 |
| 05 | MiniCPM-o 4.5-Instruct | — | Apr 2026 | MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code | 76.70 |
| 06 | Code Llama - Python 70B (3-shot) | — | Aug 2023 | Code Llama: Open Foundation Models for Code · code | 65.60 |
| 07 | Apertus-70B-Instruct | — | Sep 2025 | Apertus: Democratizing Open and Compliant LLMs for Globa… · code | 47.00 |
| 08 | BLT-Entropy 8B | — | Dec 2024 | Byte Latent Transformer: Patches Scale Better Than Token… · code | 41.80 |
| 09 | LLaMA-65B | — | Feb 2023 | LLaMA: Open and Efficient Foundation Language Models · code | 37.70 |
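The ranking rule stated above the table (highest score first, ties broken by submission date) can be sketched roughly as follows. This is only an illustration, not the site's actual code: the `Entry` class, the `rank` function, and the `hypothetical-tied-model` row are invented for the example. It also assumes that the earlier submission wins a tie and that each "Submitted" month maps to its first day, neither of which the page states.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    model: str
    submitted: date  # assumption: first day of the listed month
    score: float     # score as listed in the table


def rank(entries):
    """Sort by score descending; equal scores go to the earlier submission.

    The tie-break direction (earlier wins) is an assumption.
    """
    return sorted(entries, key=lambda e: (-e.score, e.submitted))


entries = [
    Entry("LLaMA-65B", date(2023, 2, 1), 37.70),
    Entry("Llama 3 (405B, Instruct)", date(2024, 7, 1), 78.80),
    Entry("Qwen2.5-72B-Instruct", date(2024, 12, 1), 88.20),
    # Hypothetical row, not on the leaderboard, added only to exercise the tie-break.
    Entry("hypothetical-tied-model", date(2025, 1, 1), 88.20),
]

ranked = rank(entries)
sota = ranked[0]  # the row that would be shaded as current SOTA

for position, entry in enumerate(ranked, start=1):
    print(f"{position:02d} {entry.model} {entry.score:.2f}")
```

Negating the score inside the sort key keeps the whole ordering in a single `sorted` call, so the date only matters when two scores are exactly equal.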
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.