Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
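Several of the datasets below (HumanEval, MBPP and their "+" variants) report pass@1, which is usually computed with the unbiased pass@k estimator from the HumanEval paper: with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch, assuming this standard estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct, k drawn."""
    if n - c < k:
        # Fewer incorrect samples than draws: at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct sample out of 2, pass@1 is 0.5.
print(pass_at_k(n=2, c=1, k=1))  # → 0.5
```

Averaging this quantity over all problems in the benchmark gives the reported pass@1 score.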
9 datasets · 122 results · Canonical metric: resolve-rate · Canonical benchmark: SWE-Bench Verified
SWE-Bench Verified
A high-quality subset of SWE-bench: 500 GitHub issues manually verified by human engineers as solvable.
Primary metric: resolve-rate
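resolve-rate is the percentage of benchmark issues whose repository test suite passes after the model's patch is applied. A minimal sketch with hypothetical counts (the 400/500 figures below are illustrative, not a reported result):

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Percentage of issues resolved (all tests pass after applying the patch)."""
    return 100.0 * resolved / total

# Hypothetical: 400 of SWE-Bench Verified's 500 issues resolved.
print(resolve_rate(400, 500))  # → 80.0
```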
Top 10
Leading models on SWE-Bench Verified.
| Rank | Model | resolve-rate | Year | Source |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 80.9 | 2026 | paper |
| 2 | Claude Opus 4.6 | 80.8 | 2026 | paper |
| 3 | Gemini 3.1 Pro | 80.6 | 2026 | paper |
| 4 | MiniMax M2.5 | 80.2 | 2026 | paper |
| 5 | GPT-5.2 Thinking | 80.0 | 2026 | paper |
| 6 | Claude Sonnet 4.6 | 79.6 | 2026 | paper |
| 7 | Gemini 3 Flash | 78.0 | 2026 | paper |
| 8 | Claude Sonnet 4.5 | 77.2 | 2026 | paper |
| 9 | Kimi K2.5 | 76.8 | 2026 | paper |
| 10 | GPT-5.1 | 76.3 | 2026 | paper |
All datasets
9 datasets tracked for this task.
| Dataset | Results | Metric | Top result |
|---|---|---|---|
| SWE-Bench Verified (canonical) | 38 | resolve-rate | Claude Opus 4.5 — 80.9 |
| HumanEval | 33 | pass@1 | o4-mini (high) — 99.3 |
| LiveCodeBench | 22 | pass@1 | DeepSeek-R1-0528 — 73.3 |
| MBPP | 14 | pass@1 | Claude 3.5 Sonnet (Oct 2024) — 91.0 |
| HumanEval+ | 5 | pass@1 | Qwen2.5-Coder-32B — 87.2 |
| MBPP+ | 4 | pass@1 | Qwen2.5-Coder-32B — 76.4 |
| APPS | 3 | pass@1 | CodeLlama-34B — 32.8 |
| CodeContests | 3 | pass@1 | GPT-4 + AlphaCodium — 44.0 |
| SWE-Bench | 0 | resolve-rate | |
Related tasks
Other tasks in Computer Code.