MBPP+ extends MBPP with additional test cases. It uses 399 hand-verified problems from MBPP-sanitized.
Pass@1 is the reported evaluation metric for MBPP+. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
Muted rows were not state of the art when published — an earlier or same-year result already scored better.
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | unverified | 88.2 | 2024 | Paper ↗ Code ↗ |
| 02 | Qwen3-235B-A22B | unverified | 81.4 | 2025 | Paper ↗ Code ↗ |
| 03 | Step-3.5-Flash Base | unverified | 79.4 | 2026 | Paper ↗ Code ↗ |
| 04 | Llama 3 (405B, Instruct) | unverified | 78.8 | 2024 | Paper ↗ Code ↗ |
| 05 | MiniCPM-o 4.5-Instruct | unverified | 76.7 | 2026 | Paper ↗ Code ↗ |
| 06 | Code Llama - Python 70B (3-shot) | unverified | 65.6 | 2023 | Paper ↗ Code ↗ |
| 07 | Apertus-70B-Instruct | unverified | 47 | 2025 | Paper ↗ Code ↗ |
| 08 | BLT-Entropy 8B | unverified | 41.8 | 2024 | Paper ↗ Code ↗ |
| 09 | LLaMA-65B | unverified | 37.7 | 2023 | Paper ↗ Code ↗ |
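
The muting rule stated before the table can be read as a simple check: a row is muted if some other row from the same or an earlier year has a strictly higher score. The sketch below applies that reading to the scores and years in the table above. The `Entry` type and `is_muted` function are hypothetical names, and the site may break ties or compare publication dates at a finer granularity than the year shown here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float
    year: int


def is_muted(entry: Entry, entries: list[Entry]) -> bool:
    """Assumed rule: muted if an earlier or same-year row scored strictly higher."""
    return any(
        other.year <= entry.year and other.score > entry.score
        for other in entries
    )


# Scores and years copied from the table above.
entries = [
    Entry("Qwen2.5-72B-Instruct", 88.2, 2024),
    Entry("Qwen3-235B-A22B", 81.4, 2025),
    Entry("Step-3.5-Flash Base", 79.4, 2026),
    Entry("Llama 3 (405B, Instruct)", 78.8, 2024),
    Entry("MiniCPM-o 4.5-Instruct", 76.7, 2026),
    Entry("Code Llama - Python 70B (3-shot)", 65.6, 2023),
    Entry("Apertus-70B-Instruct", 47.0, 2025),
    Entry("BLT-Entropy 8B", 41.8, 2024),
    Entry("LLaMA-65B", 37.7, 2023),
]

# Rows that were state of the art for their year under the assumed rule.
not_muted = [e.model for e in entries if not is_muted(e, entries)]
print(not_muted)
# ['Qwen2.5-72B-Instruct', 'Code Llama - Python 70B (3-shot)']
```

Under this reading, only the best 2023 result and the best 2024 result stay unmuted, because every later row scores below the 2024 leader.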
Pass@1 scores for MBPP+ are also tracked in a second table, where every listed entry is marked verified. As above, higher is better, and muted rows were not state of the art when published.
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Coder-32B | verified | 76.4 | 2024 | Source ↗ |
| 02 | DeepSeek-V3 | verified | 73 | 2025 | Source ↗ |
| 03 | GPT-4o | verified | 71.2 | 2024 | Source ↗ |
| 04 | DeepSeek-Coder-33B | verified | 66 | 2024 | Source ↗ |
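
For readers unfamiliar with the metric, pass@k is commonly estimated per problem from n sampled solutions, of which c pass every test, as 1 - C(n-c, k) / C(n, k). With k = 1 this reduces to c / n, and with a single greedy sample per problem it is simply the share of problems solved. The sketch below assumes that standard estimator. The leaderboard does not state how each source sampled or scored its solutions, and the function names and example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, expressed as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: four problems, one greedy sample each, three solved.
print(benchmark_score([(1, 1), (1, 0), (1, 1), (1, 1)]))  # 75.0
```

A score such as 76.4 in the tables above would, under this reading, mean that about 76.4% of the benchmark problems were solved on the first attempt.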