Which coding model is best on average?
A single high score is easy to game — train on the test set, hand-tune one case, publish a paper. Average performance across many benchmarks is harder to fake.
We rank every coding model that has a score on at least 2 of 6 public benchmarks. Then, where we've run the model on our own held-out repos, we show our number next to the public consensus.
Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This neutralises the fact that some metrics are lower-is-better while others are higher-is-better: both end up on the same 0–100 axis.
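A minimal sketch of that rank-to-percentile step, in Python. The linear formula is an assumption: the page only says top = 100, and the table further down has lowest-ranked entries at 0, which a linear mapping of 100 × (n − rank) / (n − 1) would produce. The function name, the tie handling, and the example scores are hypothetical.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map raw scores on one benchmark to 0-100 percentiles (best model = 100).

    Assumed formula: percentile = 100 * (n - rank) / (n - 1), with rank 1 = best.
    Ties are broken by input order here, which is a simplification.
    """
    n = len(scores)
    if n == 0:
        return {}
    if n == 1:
        # A single entrant is trivially at the top.
        return {name: 100.0 for name in scores}
    # Best model first, regardless of the metric's direction.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {
        name: 100.0 * (n - 1 - position) / (n - 1)
        for position, name in enumerate(ordered)
    }


# Hypothetical scores: one higher-is-better metric, one lower-is-better metric.
pass_rate = {"model_a": 0.90, "model_b": 0.75, "model_c": 0.60}
error_rate = {"model_a": 0.10, "model_b": 0.25, "model_c": 0.40}

print(rank_percentiles(pass_rate))                           # model_a on top
print(rank_percentiles(error_rate, higher_is_better=False))  # model_a on top again
```

Both calls put `model_a` at 100 and `model_c` at 0, which is the point of ranking before averaging: the direction of the raw metric no longer matters.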
Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
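As a rough sketch, the Power score can be computed like this. The function name and the rounding to one decimal are assumptions chosen to match how the table displays scores; the percentile inputs are taken from two rows of the ranking table.

```python
from statistics import mean

MIN_BENCHMARKS = 2  # from the text: one strong showing isn't enough


def power_score(percentiles, min_benchmarks=MIN_BENCHMARKS):
    """Unweighted mean of a model's per-benchmark percentiles.

    `percentiles` maps benchmark name -> percentile, or None where the
    model has no score. Returns None if too few benchmarks are covered.
    """
    present = [p for p in percentiles.values() if p is not None]
    if len(present) < min_benchmarks:
        return None
    return round(mean(present), 1)


# Percentiles copied from the ranking table.
qwen_32b = {"HumanEval": 81, "MBPP": 92}
deepseek_v3 = {"HumanEval": 0, "HumanEval+": 75, "MBPP": 31}

print(power_score(qwen_32b))           # 86.5
print(power_score(deepseek_v3))        # 35.3
print(power_score({"HumanEval": 94}))  # None: only one benchmark
```

Because the mean is taken only over benchmarks where a score exists, a missing benchmark neither helps nor hurts a model; it shows up in the Coverage column instead.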
Where CodeSOTA has run its own eval (currently 0 of 14 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.
The Power Ranking, 14 models.
Sorted by average percentile across the benchmarks each model has a score on, out of 6. The Coverage column is load-bearing: a model on top with 2/6 is making a narrower claim than one on top with most axes covered.
Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | Qwen2.5-Coder-32B-Instruct | 86.5 | 2 / 6 | not yet | HumanEval 81, MBPP 92 |
| 02 | Claude 3.5 Sonnet (Oct 2024) | 86.0 | 2 / 6 | not yet | HumanEval 72, MBPP 100 |
| 03 | Claude 3.5 Sonnet | 73.0 | 2 / 6 | not yet | HumanEval 69, MBPP 77 |
| 04 | o4-mini | 69.5 | 2 / 6 | not yet | SWE-bench 45, HumanEval 94 |
| 05 | GPT-4o | 67.5 | 2 / 6 | not yet | HumanEval 66, MBPP 69 |
| 06 | o3-mini | 57.5 | 2 / 6 | not yet | SWE-bench 24, HumanEval 91 |
| 07 | DeepSeek-Coder-V2-Instruct | 55.0 | 2 / 6 | not yet | HumanEval 25, MBPP 85 |
| 08 | Qwen2.5-Coder-7B-Instruct | 52.0 | 2 / 6 | not yet | HumanEval 50, MBPP 54 |
| 09 | o3 | 42.5 | 2 / 6 | not yet | SWE-bench 47, HumanEval 38 |
| 10 | DeepSeek-V3 | 35.3 | 3 / 6 | not yet | HumanEval 0, HumanEval+ 75, MBPP 31 |
| 11 | Gemma 3 27B IT | 32.0 | 2 / 6 | not yet | HumanEval 41, MBPP 23 |
| 12 | GPT-4o | 27.5 | 2 / 6 | not yet | SWE-bench 5, HumanEval+ 50 |
| 13 | Gemma 3 12B IT | 21.5 | 2 / 6 | not yet | HumanEval 28, MBPP 15 |
| 14 | Gemma 3 4B IT | 1.5 | 2 / 6 | not yet | HumanEval 3, MBPP 0 |
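The pill colouring described above the table can be sketched as a simple threshold function. The ≥75 cut-off for copper comes from the text; treating "bottom quartile" as ≤25 is an assumption, since the page does not state the lower boundary, and `pill_tier` is a hypothetical name.

```python
def pill_tier(percentile):
    """Classify a per-benchmark percentile into a pill colour."""
    if percentile >= 75:
        return "copper"  # top quartile, threshold stated on the page
    if percentile <= 25:
        return "faded"   # bottom quartile; the <=25 boundary is assumed
    return "grey"        # everything in between


# Percentiles from the o4-mini row of the table.
o4_mini = {"SWE-bench": 45, "HumanEval": 94}
tiers = {benchmark: pill_tier(p) for benchmark, p in o4_mini.items()}
print(tiers)  # {'SWE-bench': 'grey', 'HumanEval': 'copper'}
```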
Public benchmarks aren’t enough.
Three problems compound. One: popular benchmarks are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the cases that actually pay rent — your data, your edge cases, your failure modes. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.
Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, while the actual test set rotates quarterly and stays private. Even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.
Currently 0 of 14 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.
Want a model verified against your data?
If you're choosing a coding model for production and a model on this list doesn't have a CodeSOTA-verified score, tell us. We evaluate on your own repositories and task types under a private hold-out — not the public set every model has already trained on.