Code Generation

Generating program code from natural-language descriptions, as benchmarked by datasets such as HumanEval and MBPP.
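Benchmarks in this family typically score a model by executing its sampled completions against unit tests and reporting pass@k: the probability that at least one of k samples solves the problem. Below is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval, pass@k = 1 - C(n-c, k)/C(n, k) for n samples of which c pass; the function name and example numbers are illustrative, not taken from any particular harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions, drawn without replacement from n samples of which
    c passed their unit tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 passing.
print(pass_at_k(n=200, c=37, k=1))   # ~0.185 (equals c/n)
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```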

9 datasets · 122 results · Canonical metric: resolve-rate

Canonical Benchmark

SWE-Bench Verified

A high-quality subset of SWE-bench: 500 GitHub issues manually reviewed and confirmed solvable by human engineers.

Primary metric: resolve-rate
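Resolve-rate is the percentage of issues for which the model's patch makes the benchmark's failing tests pass without breaking the previously passing ones. As a sketch, aggregating per-issue results could look like the following; the record fields and instance IDs are hypothetical placeholders, not the official harness API.

```python
from dataclasses import dataclass

@dataclass
class IssueResult:
    # Hypothetical per-issue record from an evaluation run.
    instance_id: str
    patch_applied: bool   # candidate patch applied cleanly
    tests_passed: bool    # fail-to-pass and pass-to-pass tests green

def resolve_rate(results: list[IssueResult]) -> float:
    """Percentage of issues resolved: patch applied and all tests green."""
    resolved = sum(r.patch_applied and r.tests_passed for r in results)
    return 100.0 * resolved / len(results)

# Illustrative run over four issues, three resolved -> 75.0
run = [
    IssueResult("repo__issue-1", True, True),
    IssueResult("repo__issue-2", True, True),
    IssueResult("repo__issue-3", True, False),
    IssueResult("repo__issue-4", True, True),
]
print(resolve_rate(run))  # 75.0
```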

Top 10

Leading models on SWE-Bench Verified.

Rank  Model              Resolve-rate (%)  Year  Source
1     Claude Opus 4.5    80.9              2026  paper
2     Claude Opus 4.6    80.8              2026  paper
3     Gemini 3.1 Pro     80.6              2026  paper
4     MiniMax M2.5       80.2              2026  paper
5     GPT-5.2 Thinking   80.0              2026  paper
6     Claude Sonnet 4.6  79.6              2026  paper
7     Gemini 3 Flash     78.0              2026  paper
8     Claude Sonnet 4.5  77.2              2026  paper
9     Kimi K2.5          76.8              2026  paper
10    GPT-5.1            76.3              2026  paper


All datasets

9 datasets tracked for this task.

Related tasks

Other tasks in Computer Code.

