Codesota · LLMs · Code generation
The register of programming benchmarks
Live · April 2026
§ 00 · Code generation

LLMs that write software.

From a single function (HumanEval) to resolving a real GitHub issue (SWE-bench Verified), code generation is the most practical frontier of LLM capability. We track the live leaderboards.

All scores below are queried live from the Codesota registry. Shaded rows mark the current state of the art. Agent scaffolding materially affects SWE-bench Verified — numbers are reported as submitted.

§ 01 · SWE-bench Verified

Real-world software engineering, ranked.

The hardest public coding benchmark: a real GitHub issue, a real repo, a patch that must pass the repo's own test suite. Pass-rate depends on agent scaffolding; two runs of the same base model can differ by double digits.


Metric: pass-rate · higher is better
Dataset: 500 verified Python issues
Models: 8 tracked · top 8 shown
Live · from the registry
Shaded row marks current SOTA
#    Model               Score   Δ to #1
01   Claude Opus 4.7     87.6%
02   Claude Opus 4.5     80.9%   -6.7
03   Claude Opus 4.6     80.8%   -6.8
04   Gemini 3.1 Pro      80.6%   -7.0
05   MiniMax M2.5        80.2%   -7.4
06   GPT-5.2 Thinking    80.0%   -7.6
07   Claude Sonnet 4.6   79.6%   -8.0
08   Gemini 3 Flash      78.0%   -9.6
Fig 1 · SWE-bench Verified pass-rate. Scores depend on the harness; the model and agent are listed together.
§ 01¼ · Over time

SOTA advances, one release at a time.

A model’s score doesn’t change after it ships. What changes is the leaderboard: a new release either clears the bar or it doesn’t. Each chart plots every tracked release as a dot, then steps up whenever one pushes past the previous SOTA. Only the copper steps moved the frontier.
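
The step line is just a running maximum over release dates. A minimal sketch, assuming each release is a (date, model, score) triple; the scores are taken from the supplement tables below, the dates are announcement months, and the names are illustrative rather than the registry's actual schema.

from datetime import date

# Illustrative release list (SWE-bench Verified), not live registry data.
releases = [
    (date(2025, 12, 1), "DeepSeek V3.2", 73.1),
    (date(2026, 1, 1), "Claude Opus 4.5", 80.9),
    (date(2026, 3, 1), "GPT-5.3-Codex (xhigh)", 85.0),
    (date(2026, 4, 1), "Claude Opus 4.7", 87.6),
]

def sota_envelope(releases):
    """Return the releases that raised the SOTA ceiling, in order.

    A release joins the envelope only if its score beats every release
    announced before it, i.e. a running maximum over time.
    """
    best = float("-inf")
    steps = []
    for when, model, score in sorted(releases):
        if score > best:
            best = score
            steps.append((when, model, score))
    return steps

for when, model, score in sota_envelope(releases):
    print(f"{when:%b '%y}  {model}: {score:.1f}%")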

SWE-bench Verified · pass-rate
8 releases tracked
SOTA steps, Dec '25 to Apr '26: DeepSeek V3.2 → Claude Opus 4.5 → GPT-5.3-Codex (xhigh) → Claude Opus 4.7
HumanEval · pass@1
6 releases tracked
SOTA steps, Mar '25 to Feb '26: Qwen 2.5-Coder-32B-Instruct → Kimi K2 (0905) → GPT-5 → Claude Opus 4.6 (model card) → Claude Opus 4.6
LiveCodeBench · rolling post-cutoff
3 releases tracked
SOTA steps, Dec '25 to Mar '26: DeepSeek V3.2 Speciale → Gemini 3 Pro Preview
Aider Polyglot · diff-edit
4 releases tracked
SOTA steps, May '25 to Mar '26: Gemini 2.5 Pro → o3-pro → Claude Opus 4.5
Fig 1b · One point per tracked release, placed by its announcement month. The step line follows the SOTA envelope — it only rises when a model posted a higher score than every release before it. Ink dots are releases that landed below the current ceiling; copper dots took SOTA.
§ 01½ · Supplement

Freshest published scores, April 2026.

The registry above is queried live; ingestion sometimes lags a vendor announcement by days. This supplement carries the frontier scores published as of 2026-04-22, with a source link on every row.

Claude Opus 4.7 shipped on 2026-04-16 and leads SWE-bench Verified at 87.6% under the Claude Code harness. HumanEval was not part of Anthropic's published card — the row is held pending.

SWE-bench Verified · pass-rate
Every row cites a source
#    Model                   Harness                   Date      Score   Source
01   Claude Opus 4.7         Claude Code (Anthropic)   2026-04   87.6%   anthropic.com/news/claude-opus-4-7
02   GPT-5.3-Codex (xhigh)   Codex CLI                 2026-03   85.0%   openai.com · Codex
03   Claude Opus 4.5         Claude Code               2026-01   80.9%   anthropic.com · Opus 4.5
04   Gemini 3.1 Pro          vendor-reported           2026-04   80.6%   deepmind.google · Gemini 3.1 Pro card
05   MiniMax M2.5            mini-SWE-agent            2026-02   80.2%   minimax.io · M2.5
06   Claude Sonnet 4.6       Claude Code               2026-01   77.2%   anthropic.com · Sonnet 4.6
07   DeepSeek V3.2           mini-SWE-agent            2025-12   73.1%   deepseek.com
08   Kimi K2.5               mini-SWE-agent            2026-01   71.3%   moonshot.cn
Fig 1a · Supplement to § 01. Harness and model listed together; rows are not isolated-harness comparisons.
HumanEval · pass@1
Every row cites a source
#    Model                          Date      Score     Source
01   Claude Opus 4.7                2026-04   pending   anthropic.com/news/claude-opus-4-7
02   Claude Opus 4.6                2026-02   97.8%     codesota · Opus 4.5 analysis
03   Claude Opus 4.6 (model card)   2026-01   96.3%     anthropic.com
04   GPT-5                          2025-12   95.1%     openai.com · GPT-5
05   Kimi K2 (0905)                 2025-09   94.5%     moonshot.cn · K2
06   Claude Sonnet 4.6              2026-01   94.1%     anthropic.com
07   Qwen 2.5-Coder-32B-Instruct    2025-03   92.7%     github.com/QwenLM/Qwen2.5-Coder
Fig 3a · HumanEval pass@1. Opus 4.7 row is held pending — Anthropic did not publish a HumanEval number with the release.
LiveCodeBench · rolling post-cutoff split
Every row cites a source
#    Model                    Date      Score   Source
01   Gemini 3 Pro Preview     2026-03   91.7%   deepmind.google · Gemini 3
02   Gemini 3 Flash Preview   2026-03   90.8%   pricepertoken · LiveCodeBench
03   DeepSeek V3.2 Speciale   2025-12   89.6%   deepseek.com
Fig 4a · LiveCodeBench top-three as of April 2026. Problems published after each model's training cutoff are the only ones scored.
Aider Polyglot · diff-edit, 6 languages
Every row cites a source
#    Model                    Harness           Date      Score   Source
01   Claude Opus 4.5          Aider diff-edit   2026-01   89.4%   aider.chat/docs/leaderboards
02   GPT-5 (high reasoning)   Aider diff-edit   2026-03   88.0%   aider.chat/docs/leaderboards
03   o3-pro                   Aider diff-edit   2025-06   84.9%   aider.chat/docs/leaderboards
04   Gemini 2.5 Pro           Aider diff-edit   2025-05   83.1%   aider.chat/docs/leaderboards
Fig 5 · Aider Polyglot on 225 Exercism problems across C++, Go, Java, JS, Python, Rust. The scaffold is Aider itself; results are scaffold-specific.
§ 02 · Task

Three levels of hard.

Code benchmarks come in three sizes. The smallest — HumanEval — asks the model to write a single function from a docstring: sort a list, de-duplicate, solve a small algorithmic puzzle. Pass@1 means it has one attempt and must pass all unit tests. Frontier models now saturate this band.
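
For a concrete sense of the band, here is a toy problem in the HumanEval shape (illustrative, not an actual benchmark item): docstring in, function body out, unit tests decide.

def unique_sorted(xs: list[int]) -> list[int]:
    """Return the distinct elements of xs in ascending order.

    >>> unique_sorted([3, 1, 3, 2])
    [1, 2, 3]
    """
    # The model sees only the signature and docstring above and must emit the body.
    return sorted(set(xs))

# Pass@1: one sampled completion, and it must pass every test.
assert unique_sorted([3, 1, 3, 2]) == [1, 2, 3]
assert unique_sorted([7]) == [7]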

The middle tier — LiveCodeBench — uses fresh competitive-programming problems posted after the model's training cutoff, which defeats memorisation. The score is less flattering but more honest.

The largest — SWE-bench Verified — hands the model a real GitHub issue and a real repository. The model must read the bug report, navigate files it has never seen, reproduce the failure, and write a patch that passes the project's own test suite. The whole task does not fit in a single prompt; an agent loop is required. This is the benchmark that correlates with shipping.
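
A minimal sketch of the grading step, assuming the agent has already produced a patch. The function name, arguments and direct pytest call are illustrative; the real SWE-bench harness pins a per-repository environment and also re-runs previously passing tests to catch regressions.

import subprocess

def grade_patch(repo_dir: str, patch_file: str, fail_to_pass: list[str]) -> bool:
    """Apply the model's patch, then run the issue's designated tests.

    fail_to_pass lists the tests that fail before the patch; the issue
    counts as resolved only if they all pass afterwards.
    """
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    tests = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return tests.returncode == 0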

§ 03 · HumanEval

Single-function synthesis, pass@1.

164 Python problems. Docstring in, function body out, unit tests decide. The benchmark that launched the field; now close to saturation for frontier models.


Metric: pass@1 · higher is better
Dataset: 164 problems · Python
Models: 8 tracked · top 8 shown
Live · from the registry
Shaded row marks current SOTA
#    Model               Score   Δ to #1
01   o4-mini             97.3%
02   o3-mini             96.3%   -1.0
03   Claude Opus 4.6     96.3%   -1.0
04   GPT-5               95.1%   -2.2
05   o3                  94.8%   -2.5
06   GPT-4.1             94.5%   -2.8
07   Claude Sonnet 4.6   94.1%   -3.2
08   GPT-4.1 mini        93.8%   -3.5
Fig 2 · HumanEval pass@1. Saturation in this band is a coverage signal, not a capability ceiling.
§ 04 · LiveCodeBench

Contamination-resistant, timestamped.

Competitive-programming problems added continuously, scored only against the slice published after a model's training cutoff. Cannot be memorised.
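
A minimal sketch of the anti-contamination rule, assuming each problem carries a publication date; the field names are illustrative, not the benchmark's actual schema.

from datetime import date

def scorable(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems published after the model's training cutoff.

    Anything the model could have seen in training is excluded, so a
    high score cannot come from memorised solutions.
    """
    return [p for p in problems if p["published"] > training_cutoff]

# A model with an October 2025 cutoff is scored only on the newer problem.
problems = [
    {"id": "lc-3421", "published": date(2025, 8, 14)},
    {"id": "cf-1998B", "published": date(2026, 1, 9)},
]
print(scorable(problems, training_cutoff=date(2025, 10, 1)))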


Metric: pass-rate · higher is better
Dataset: LeetCode / AtCoder / Codeforces · timestamped
Models: 8 tracked · top 8 shown
Live · from the registry
Shaded row marks current SOTA
#    Model                  Score   Δ to #1
01   Gemini 3 Pro Preview   91.7%
02   Gemini 3 Flash         90.8%   -0.9
03   GPT-5                  85.0%   -6.7
04   Grok 4                 79.0%   -12.7
05   Gemini 2.5 Pro         75.6%   -16.1
06   DeepSeek-R1-0528       73.3%   -18.4
07   o4-mini                72.8%   -18.9
08   Qwen3-235B-A22B        70.7%   -21.0
Fig 3 · LiveCodeBench pass-rate on the rolling post-cutoff split. Memorisation does not help.
§ 05 · Benchmarks

The datasets, with metric direction.

Every benchmark Codesota tracks for code generation, with language, sample count and primary metric. Rows link to the canonical paper or dataset.

Benchmark            Language       Primary metric   Samples   Year   Source
APPS                 python         pass@1                     2021   paper →
CodeContests         multilingual   pass@1                     2022   paper →
HumanEval            python         pass@1           164       2021   paper →
HumanEval+           python         pass@1                     2023   paper →
LiveCodeBench        en             pass@1           400       2024   paper →
LiveCodeBench Pro    en             elo                        2025   paper →
MBPP                 python         pass@1                     2021   paper →
MBPP+                python         pass@1                     2023   paper →
SWE-Bench            python         resolve-rate               2023   paper →
SWE-Bench Verified   python         resolve-rate     500       2024   paper →
Fig 4 · Every row carries its metric direction. SWE-bench Verified results are only comparable within a matched agent harness.
§ 06 · Methodology

Why these numbers move.

SWE-bench Verified is the benchmark most likely to shift under your reading. The base model matters; the agent harness matters more. A mid-tier model with a careful scaffold routinely outscores a frontier model with a naïve one.

We publish the harness alongside the score where the submitter discloses it, and flag rows that ship without reproduction instructions. Scores without a matching agent are treated as model-agnostic claims.
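
A hedged sketch of what a disclosed-versus-undisclosed row might look like; the field names and the second entry are hypothetical, not Codesota's actual schema or data.

# Illustrative registry rows; field names are hypothetical.
rows = [
    {
        "model": "Claude Opus 4.7",
        "benchmark": "SWE-bench Verified",
        "score": 87.6,
        "harness": "Claude Code (Anthropic)",
        "source": "anthropic.com/news/claude-opus-4-7",
    },
    {
        "model": "Example Model X",   # hypothetical submission
        "benchmark": "SWE-bench Verified",
        "score": 79.0,
        "harness": None,              # scaffold not disclosed
        "source": None,               # no reproduction instructions
    },
]

# Rows without a disclosed harness are treated as model-agnostic claims.
flagged = [r["model"] for r in rows if r["harness"] is None]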

HumanEval and LiveCodeBench are cleaner. Pass@1 is a self-contained, automatically graded evaluation; the only run-to-run variance comes from sampling, and the temperature used is disclosed in the row. We publish the rank as submitted and retract on evidence; we never silently edit a score.
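
When a submitter samples more than one completion per problem, pass@k (including pass@1) is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval).

    n: completions sampled for a problem; c: completions that passed
    all unit tests. Returns the chance that at least one of k draws
    from the n samples is correct; average this across problems.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 with 200 samples of which 190 pass: 1 - C(10,1)/C(200,1) = 0.95
print(pass_at_k(n=200, c=190, k=1))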

Related

Neighbouring registers.

Cross-links to the rest of Codesota.

LLMs · register
Frontier language-model benchmarks.
Agents
Agent scaffolds and tool-use benchmarks.
All tasks
Every modality Codesota tracks.
Methodology
How scores are admitted and retracted.
Read next

Three places to go from here.

Top benchmark
SWE-bench Verified
The 500-task verified subset that frontier coding agents are scored on. Live leaderboard from the registry.
Sister hub
LLM benchmarks
Reasoning, math, multimodal, and the rest — code generation in context with everything else frontier models do.
Sister hub
Agentic AI
Long-horizon agents need different evals than 1-shot pass@1. Time-horizon, RE-Bench, autonomy.