Code Generation · 2021 · python

Mostly Basic Python Problems

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.
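As a rough illustration of how a problem of this kind is typically scored, the sketch below runs a candidate solution against assert-based tests. The problem, the field names `text` and `test_list`, and the helper `passes_all_tests` are hypothetical and chosen for illustration; they are not taken from this page.

```python
# Hypothetical MBPP-style problem: a short prompt plus assert-based tests.
problem = {
    "text": "Write a function to return the sum of the even numbers in a list.",
    "test_list": [
        "assert sum_even([1, 2, 3, 4]) == 6",
        "assert sum_even([]) == 0",
        "assert sum_even([1, 3, 5]) == 0",
    ],
}

# A candidate solution, as a model might generate it.
candidate = """
def sum_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)
"""


def passes_all_tests(candidate_source: str, tests: list[str]) -> bool:
    """Return True if the candidate defines code that satisfies every test.

    Warning: exec on untrusted model output is unsafe; a real harness
    would run this inside a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is a bare assert statement
    except Exception:
        return False
    return True


print(passes_all_tests(candidate, problem["test_list"]))
```

A problem counts as solved only if every assert passes; the fraction of problems solved on the first attempt is what pass@1 reports.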

Metrics: pass@1, pass@10
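The page does not define pass@k. A common choice is the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition; whether every entry on this leaderboard was measured this way is not stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Equals the probability that at least one of k samples drawn without
    replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 correct.
print(round(pass_at_k(10, 3, 1), 3))   # 0.3
print(round(pass_at_k(10, 3, 10), 3))  # 1.0
```

The benchmark-level score is the mean of this value over all problems. With k = 1 it reduces to c / n, the plain fraction of passing samples.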
Current State of the Art

o4-mini

OpenAI

94.9

pass@1

MBPP — pass@1

19 results · 5 SOTA advances · higher is better

[Chart: pass@1 SOTA frontier by date, 2024 to 2027, y-axis 50 to 95. Labelled points: StarCoder2-15B, DeepSeek-Coder-V2-Instruct, o4-mini.]
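One way to derive a SOTA frontier from dated results is a running maximum: walk the results in date order and keep each one that beats everything before it. The sketch below applies that to the scores and month-level dates listed in the pass@1 table on this page. It assumes that only the best score within a month is considered; how the site itself computes the frontier is not stated.

```python
# (model, pass@1, (year, month)) as listed in the results table.
results = [
    ("o4-mini", 94.9, (2026, 3)),
    ("o3-mini", 93.3, (2026, 3)),
    ("Claude Opus 4", 92.0, (2026, 3)),
    ("GPT-4.1", 90.9, (2026, 3)),
    ("Qwen2.5-Coder-32B-Instruct", 90.2, (2024, 9)),
    ("Claude Sonnet 4", 89.6, (2026, 3)),
    ("DeepSeek-Coder-V2-Instruct", 89.4, (2024, 9)),
    ("DeepSeek-Coder-V2-Instruct", 89.4, (2024, 6)),
    ("DeepSeek-V3", 89.3, (2026, 3)),
    ("Claude 3.5 Sonnet", 89.2, (2025, 12)),
    ("GPT-4o", 87.8, (2025, 12)),
    ("Llama-4-Maverick", 77.6, (2025, 4)),
    ("Codestral 22B", 75.4, (2024, 5)),
    ("Gemma-3-27b", 74.4, (2025, 3)),
    ("Gemma 3 12B IT", 73.0, (2025, 3)),
    ("Llama-4-Scout", 67.8, (2025, 4)),
    ("Gemma 3 4B IT", 63.2, (2025, 3)),
    ("Code Llama 34B", 62.6, (2026, 3)),
    ("StarCoder2-15B", 54.4, (2024, 2)),
]


def sota_frontier(entries):
    """Return (month, model, score) for each result that set a new best."""
    # Assumption: within one month only the top score competes.
    best_per_month = {}
    for model, score, month in entries:
        if month not in best_per_month or score > best_per_month[month][1]:
            best_per_month[month] = (model, score)

    frontier = []
    best_so_far = float("-inf")
    for month in sorted(best_per_month):
        model, score = best_per_month[month]
        if score > best_so_far:  # strictly better than all earlier months
            frontier.append((month, model, score))
            best_so_far = score
    return frontier


for (year, month), model, score in sota_frontier(results):
    print(f"{year}-{month:02d}  {model}  {score}")
```

With the table's dates this yields five frontier points, from StarCoder2-15B through o4-mini. Code Llama 34B does not appear because the table dates it Mar 2026.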

Model Size vs Score — Pareto Frontier

5 models · log scale · Pareto frontier shown

[Chart: pass@1 (63.0 to 77.5) against parameters (log scale, 3B to 235B). Series: Global, Bielik, PLLuM, Pareto.]
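A Pareto frontier over size and score keeps every model that no smaller-or-equal model outscores. The sketch below illustrates the idea with models from this page whose parameter counts appear in their names. Reading the size from the name is an assumption, and these are not necessarily the five models plotted in the chart.

```python
# (model, parameters in billions read from the name, pass@1 from this page)
models = [
    ("Gemma 3 4B IT", 4, 63.2),
    ("Gemma 3 12B IT", 12, 73.0),
    ("StarCoder2-15B", 15, 54.4),
    ("Codestral 22B", 22, 75.4),
    ("Gemma-3-27b", 27, 74.4),
    ("Qwen2.5-Coder-32B-Instruct", 32, 90.2),
    ("Code Llama 34B", 34, 62.6),
]


def pareto_frontier(points):
    """Return names of models not dominated by a smaller-or-equal model."""
    frontier = []
    best_score = float("-inf")
    # Sort by size ascending; at equal size, try the higher score first.
    for name, size, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:  # beats every model of this size or smaller
            frontier.append(name)
            best_score = score
    return frontier


print(pareto_frontier(models))
```

For this subset the frontier is Gemma 3 4B IT, Gemma 3 12B IT, Codestral 22B, and Qwen2.5-Coder-32B-Instruct.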

pass@1 Progress Over Time

Showing 4 breakthroughs from Aug 2023 to Mar 2026

[Chart: pass@1 (59.4 to 98.1) by date, Aug 2023 to Mar 2026.]

Key Milestones

| Date | Model | pass@1 | Change | Source |
|---|---|---|---|---|
| Aug 2023 | Code Llama 34B | 62.6 | | Code Llama paper, arxiv:2308.12950 |
| Jun 2024 | DeepSeek-Coder-V2-Instruct | 89.4 | +42.8% | Table 1, arxiv:2406.11931 |
| Sep 2024 | Qwen2.5-Coder-32B-Instruct | 90.2 | +0.9% | Table 2, arxiv:2409.12186 |
| Mar 2026 | o4-mini (Current SOTA) | 94.9 | +5.2% | OpenAI model card |
Total Improvement: 51.6% · Time Span: 2y 8m · Breakthroughs: 4 · Current SOTA: 94.9
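The percentage changes in the milestones are relative gains over the previous best, not differences in percentage points. For example, the step from 62.6 to 89.4 is 26.8 points but +42.8% relative. A short check, using only the four milestone scores listed on this page:

```python
milestones = [
    ("Code Llama 34B", 62.6),
    ("DeepSeek-Coder-V2-Instruct", 89.4),
    ("Qwen2.5-Coder-32B-Instruct", 90.2),
    ("o4-mini", 94.9),
]


def relative_gain(previous: float, current: float) -> float:
    """Relative improvement of current over previous, in percent."""
    return (current - previous) / previous * 100


# Gain of each milestone over the one before it.
step_gains = [
    round(relative_gain(prev[1], cur[1]), 1)
    for prev, cur in zip(milestones, milestones[1:])
]
# Gain of the latest milestone over the first.
total_gain = round(relative_gain(milestones[0][1], milestones[-1][1]), 1)

print(step_gains)  # [42.8, 0.9, 5.2]
print(total_gain)  # 51.6
```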

Top Models Performance Comparison

Top 10 models ranked by pass@1

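| Rank | Model | pass@1 | % of best |
|---|---|---|---|
| 1 | o4-mini | 94.9 | 100.0% |
| 2 | o3-mini | 93.3 | 98.3% |
| 3 | Claude Opus 4 | 92.0 | 96.9% |
| 4 | GPT-4.1 | 90.9 | 95.8% |
| 5 | Qwen2.5-Coder-32B-Instruct | 90.2 | 95.0% |
| 6 | Claude Sonnet 4 | 89.6 | 94.4% |
| 7 | DeepSeek-Coder-V2-Instruct | 89.4 | 94.2% |
| 8 | DeepSeek-Coder-V2-Instruct | 89.4 | 94.2% |
| 9 | DeepSeek-V3 | 89.3 | 94.1% |
| 10 | Claude 3.5 Sonnet | 89.2 | 94.0% |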
Best Score: 94.9 · Top Model: o4-mini · Models Compared: 10 · Score Range: 5.7
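The "% of best" column and the score range follow directly from the ten listed scores: each score is divided by the top score, and the range is the gap between the first and tenth entries. A minimal check using the values on this page:

```python
top10 = [
    ("o4-mini", 94.9),
    ("o3-mini", 93.3),
    ("Claude Opus 4", 92.0),
    ("GPT-4.1", 90.9),
    ("Qwen2.5-Coder-32B-Instruct", 90.2),
    ("Claude Sonnet 4", 89.6),
    ("DeepSeek-Coder-V2-Instruct", 89.4),
    ("DeepSeek-Coder-V2-Instruct", 89.4),
    ("DeepSeek-V3", 89.3),
    ("Claude 3.5 Sonnet", 89.2),
]

scores = [score for _, score in top10]
best = max(scores)

# Each score as a percentage of the best score, to one decimal place.
percent_of_best = [round(score / best * 100, 1) for score in scores]
# Gap between the highest and lowest of the ten scores.
score_range = round(best - min(scores), 1)

print(percent_of_best)
print(score_range)  # 5.7
```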

pass@1 (Primary)

| # | Model | Access | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o4-mini | API | OpenAI | 94.9 | | Mar 2026 |
| 2 | o3-mini | API | OpenAI | 93.3 | | Mar 2026 |
| 3 | Claude Opus 4 | API | Anthropic | 92 | | Mar 2026 |
| 4 | GPT-4.1 | API | OpenAI | 90.9 | | Mar 2026 |
| 5 | Qwen2.5-Coder-32B-Instruct | Open Source | Alibaba | 90.2 | | Sep 2024 |
| 6 | Claude Sonnet 4 | API | Anthropic | 89.6 | | Mar 2026 |
| 7 | DeepSeek-Coder-V2-Instruct | Open Source | DeepSeek | 89.4 | | Sep 2024 |
| 8 | DeepSeek-Coder-V2-Instruct | Open Source | DeepSeek | 89.4 | | Jun 2024 |
| 9 | DeepSeek-V3 | Open Source | DeepSeek | 89.3 | | Mar 2026 |
| 10 | Claude 3.5 Sonnet | API | Anthropic | 89.2 | | Dec 2025 |
| 11 | GPT-4o | API | OpenAI | 87.8 | | Dec 2025 |
| 12 | Llama-4-Maverick | Open Source | Meta | 77.6 | | Apr 2025 |
| 13 | Codestral 22B | | Mistral | 75.4 | Codestral: Hello, World! | May 2024 |
| 14 | Gemma-3-27b | | Google | 74.4 | | Mar 2025 |
| 15 | Gemma 3 12B IT | | Google DeepMind | 73 | | Mar 2025 |
| 16 | Llama-4-Scout | Open Source | Meta | 67.8 | | Apr 2025 |
| 17 | Gemma 3 4B IT | | Google DeepMind | 63.2 | | Mar 2025 |
| 18 | Code Llama 34B | Open Source | Meta | 62.6 | | Mar 2026 |
| 19 | StarCoder2-15B | Open Source | BigCode | 54.4 | | Feb 2024 |


MBPP Benchmark - Code Generation | CodeSOTA