Codesota · Code · Power Ranking
The condensed answer · who’s actually best on average
Issue: May 30, 2026
§ 00 · Premise

Which coding model is best on average?

A single high score is easy to game — train on the test set, hand-tune one case, publish a paper. Average performance across many benchmarks is harder to fake.

We rank every coding model that placed on at least 2 of 6 public benchmarks. Then — where we've run the model on our own held-out repos — we show our number next to the public consensus.

§ 01
Per-benchmark percentile

Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This neutralises the difference in metric direction: some metrics are lower-is-better and others are higher-is-better, but after ranking both end up on the same 0–100 axis.
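A minimal sketch of the rank-to-percentile step, under stated assumptions. The page only says that the top rank maps to 100; the linear spacing and the bottom rank mapping to 0 are assumptions (consistent with the 0 entries in the ranking table, but not confirmed). The function name `rank_percentiles` and the `higher_is_better` flag are hypothetical, and ties are simply broken by input order here.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map raw benchmark scores to 0-100 percentiles by rank (top = 100).

    scores: dict of model name -> raw score on ONE benchmark.
    Assumption: linear spacing between ranks, with the last rank at 0.
    """
    n = len(scores)
    if n == 1:
        # A single entrant is trivially the top of its benchmark.
        return {model: 100.0 for model in scores}
    # Best model first, whichever direction the metric runs.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    # Position 0 (best) -> 100, position n-1 (worst) -> 0.
    return {
        model: 100.0 * (n - 1 - position) / (n - 1)
        for position, model in enumerate(ordered)
    }


# Hypothetical raw scores, not taken from the page.
pass_rate = rank_percentiles({"a": 0.90, "b": 0.50, "c": 0.70})
error_rate = rank_percentiles({"a": 10, "b": 30, "c": 20}, higher_is_better=False)
```

In both calls model `a` lands at 100 and model `b` at 0, even though one metric is higher-is-better and the other lower-is-better, which is the point of ranking before averaging.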

§ 02
Average across coverage

Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
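The averaging rule can be sketched as below. The function name `power_score`, the use of `None` for a missing benchmark, and rounding to one decimal are assumptions made for illustration. The percentile inputs in the example are the ones listed for DeepSeek-V3 and Qwen2.5-Coder-32B-Instruct in the ranking table.

```python
MIN_COVERAGE = 2  # the coverage gate stated above


def power_score(percentiles, min_coverage=MIN_COVERAGE):
    """Unweighted mean of a model's per-benchmark percentiles.

    percentiles: dict of benchmark name -> percentile, or None where
    the model has no score. Returns None if the model fails the gate.
    """
    covered = [p for p in percentiles.values() if p is not None]
    if len(covered) < min_coverage:
        return None  # one strong showing isn't enough
    return round(sum(covered) / len(covered), 1)


deepseek_v3 = power_score({"HumanEval": 0, "HumanEval+": 75, "MBPP": 31})
qwen_32b = power_score({"HumanEval": 81, "MBPP": 92, "SWE-bench": None})
single = power_score({"HumanEval": 100})
```

Missing benchmarks are skipped rather than counted as zero, so a model is never penalised for an axis it was not measured on. That is why coverage has to be read alongside the score.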

§ 03
Our own column, when we have one

Where CodeSOTA has run its own eval (currently 0 of 14 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.

§ 01 · Ranking

The Power Ranking, 14 models.

Sorted by average percentile across the 6 axes. The Coverage column is load-bearing: a model on top with 2/6 is making a narrower claim than one on top with most axes covered.

Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
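The colour rule amounts to a threshold check, sketched here. Only the top cutoff (≥75) is stated; treating the bottom quartile as strictly below 25 is an assumption, and the name `pill_tier` is hypothetical.

```python
def pill_tier(percentile):
    """Classify a per-benchmark percentile into a pill colour."""
    if percentile >= 75:
        return "copper"  # top quartile, as stated
    if percentile < 25:  # assumed cutoff; the page does not give it
        return "faded"   # bottom quartile
    return "grey"        # middle


tiers = [pill_tier(p) for p in (81, 50, 3)]
```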

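| #  | Model                        | Power | Coverage | Per-benchmark percentile               | CodeSOTA verified |
|----|------------------------------|-------|----------|----------------------------------------|-------------------|
| 01 | Qwen2.5-Coder-32B-Instruct   | 86.5  | 2 / 6    | HumanEval 81 · MBPP 92                 | not yet           |
| 02 | Claude 3.5 Sonnet (Oct 2024) | 86.0  | 2 / 6    | HumanEval 72 · MBPP 100                | not yet           |
| 03 | Claude 3.5 Sonnet            | 73.0  | 2 / 6    | HumanEval 69 · MBPP 77                 | not yet           |
| 04 | o4-mini                      | 69.5  | 2 / 6    | SWE-bench 45 · HumanEval 94            | not yet           |
| 05 | GPT-4o                       | 67.5  | 2 / 6    | HumanEval 66 · MBPP 69                 | not yet           |
| 06 | o3-mini                      | 57.5  | 2 / 6    | SWE-bench 24 · HumanEval 91            | not yet           |
| 07 | DeepSeek-Coder-V2-Instruct   | 55.0  | 2 / 6    | HumanEval 25 · MBPP 85                 | not yet           |
| 08 | Qwen2.5-Coder-7B-Instruct    | 52.0  | 2 / 6    | HumanEval 50 · MBPP 54                 | not yet           |
| 09 | o3                           | 42.5  | 2 / 6    | SWE-bench 47 · HumanEval 38            | not yet           |
| 10 | DeepSeek-V3                  | 35.3  | 3 / 6    | HumanEval 0 · HumanEval+ 75 · MBPP 31  | not yet           |
| 11 | Gemma 3 27B IT               | 32.0  | 2 / 6    | HumanEval 41 · MBPP 23                 | not yet           |
| 12 | GPT-4o                       | 27.5  | 2 / 6    | SWE-bench 5 · HumanEval+ 50            | not yet           |
| 13 | Gemma 3 12B IT               | 21.5  | 2 / 6    | HumanEval 28 · MBPP 15                 | not yet           |
| 14 | Gemma 3 4B IT                | 1.5   | 2 / 6    | HumanEval 3 · MBPP 0                   | not yet           |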
Tab 1 · Power score = mean of per-benchmark percentiles. Coverage gate ≥ 2. CodeSOTA-verified column shows our own numbers when we have run the model in-house.
§ 02 · Why a second column

Public benchmarks aren’t enough.

Three problems compound. One: popular benchmarks are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the cases that actually pay rent — your data, your edge cases, your failure modes. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.

Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.

Currently 0 of 14 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.

§ 03 · Request

Want a model verified against your data?

If you're choosing a coding model for production and a model on this list doesn't have a CodeSOTA-verified score, tell us. We evaluate on your own repositories and task types under a private hold-out — not the public set every model has already trained on.