Codesota · LLM · Power Ranking
The condensed answer · who’s actually best on average
Issue: May 30, 2026

§ 00 · Premise

Which LLM is best on average?

A single high score is easy to game — train on the test set, hand-tune one case, publish a paper. Average performance across many benchmarks is harder to fake.

We rank every LLM that placed on at least 2 of 8 public benchmarks. Then — where we've verified the model on our own held-out set — we show our number next to the public consensus.


§ 01

Per-benchmark percentile

Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). Because the mapping works on rank rather than raw score, lower-is-better and higher-is-better metrics end up on the same 0–100 axis.
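The page does not state the exact rank-to-percentile formula. The MMLU pills in the ranking (100, 94, 89, ... 0 across 19 models) are consistent with a linear mapping, so the sketch below assumes `100 * (n - rank) / (n - 1)` with rank 1 as best. The function name, the `higher_is_better` flag, the tie handling (none: ties fall back to input order) and the sample scores are all illustrative assumptions.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map raw benchmark scores to 0-100 percentiles (best = 100, worst = 0).

    Assumed linear mapping: 100 * (n - rank) / (n - 1), rank 1 = best.
    Ties are not treated specially; tied models keep their input order.
    """
    n = len(scores)
    if n == 0:
        return {}
    if n == 1:
        # Assumption: a lone entrant counts as the top of its benchmark.
        return {model: 100.0 for model in scores}
    # Best model first, whichever direction the metric runs.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {
        model: 100.0 * (n - 1 - i) / (n - 1)
        for i, model in enumerate(ordered)
    }


# Hypothetical scores, not taken from any real benchmark.
accuracy = {"model_a": 90.0, "model_b": 80.0, "model_c": 70.0}
error_rate = {"model_a": 0.10, "model_b": 0.30}  # lower is better

acc_pct = rank_percentiles(accuracy)
err_pct = rank_percentiles(error_rate, higher_is_better=False)
print(acc_pct)
print(err_pct)
```

In both cases `model_a` lands at 100 even though one metric rewards high values and the other low ones, which is the point of ranking before averaging.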

§ 02

Average across coverage

Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
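As a worked check, the averaging step can be sketched as below. The published Power values match a mean taken over the whole-number percentiles shown in the ranking (for o4-mini, (72 + 94 + 86) / 3 = 84.0), so the sketch takes those displayed values as input. The function name and the `min_coverage` parameter are illustrative; the single-benchmark model at the end is hypothetical.

```python
def power_score(percentiles, min_coverage=2):
    """Unweighted mean of a model's per-benchmark percentiles.

    Returns None when the model is on fewer than `min_coverage` benchmarks,
    mirroring the coverage gate described above.
    """
    if len(percentiles) < min_coverage:
        return None
    return round(sum(percentiles.values()) / len(percentiles), 1)


# Percentiles copied from the ranking table on this page.
o3 = {"MMLU": 100, "GPQA": 100, "MATH": 89}
o1_preview = {"MMLU": 83, "GPQA": 75, "MATH": 36, "AIME": 100, "GSM8K": 100}

print(power_score(o3))            # 96.3
print(power_score(o1_preview))    # 78.8
print(power_score({"MMLU": 90}))  # None: hypothetical model on one benchmark
```

Missing benchmarks are skipped rather than counted as zero, which is why the coverage figure matters when comparing two scores.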

§ 03

Our own column, when we have one

Where CodeSOTA has run its own eval (currently 0 of 19 ranked models), the CodeSOTA verified column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.

§ 01 · Ranking

The Power Ranking, 19 models.

Sorted by Power score: the average percentile over whichever of the 8 benchmarks a model has a score on. The Coverage column is load-bearing: a model on top with 2/8 is making a narrower claim than one on top with most axes.

Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
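The colour rule can be written as a small threshold function. Only the top-quartile cut-off (≥75) is stated on the page. The bottom-quartile cut-off used here (≤25) is an assumption, and the function name is illustrative.

```python
def pill_tier(percentile):
    """Classify a per-benchmark percentile into a pill colour."""
    if percentile >= 75:
        return "copper"  # top quartile, threshold stated on the page
    if percentile <= 25:
        return "faded"   # bottom quartile; the exact boundary is assumed
    return "grey"        # everything in between


# Claude 3.5 Sonnet's pills, as listed in the ranking table.
sonnet = {"MMLU": 56, "GPQA": 50, "MATH": 14, "GSM8K": 75,
          "ARC-C": 100, "HellaSwag": 33, "Winogrande": 50}
tiers = {bench: pill_tier(p) for bench, p in sonnet.items()}
print(tiers)
```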

#	Model	Power	Coverage	CodeSOTA verified	Per-benchmark percentile
01	o3	96.3	3 / 8	not yet	MMLU 100 · GPQA 100 · MATH 89
02	o1	85.7	3 / 8	not yet	MMLU 94 · GPQA 88 · MATH 75
03	o4-mini	84.0	3 / 8	not yet	MMLU 72 · GPQA 94 · MATH 86
04	GPT-4.5 Preview	79.0	2 / 8	not yet	MMLU 89 · GPQA 69
05	o1-preview	78.8	5 / 8	not yet	MMLU 83 · GPQA 75 · MATH 36 · AIME 100 · GSM8K 100
06	GPT-4.1	70.5	2 / 8	not yet	MMLU 78 · GPQA 63
07	o3-mini	65.3	3 / 8	not yet	MMLU 22 · GPQA 81 · MATH 93
08	Claude 3.5 Sonnet	54.0	7 / 8	not yet	MMLU 56 · GPQA 50 · MATH 14 · GSM8K 75 · ARC-C 100 · HellaSwag 33 · Winogrande 50
09	DeepSeek V3	53.5	2 / 8	not yet	MMLU 61 · MATH 46
10	Llama 3.1 405B	52.5	2 / 8	not yet	MMLU 67 · GPQA 38
11	GPT-4o	48.8	8 / 8	not yet	MMLU 44 · GPQA 25 · MATH 29 · AIME 0 · GSM8K 25 · ARC-C 67 · HellaSwag 100 · Winogrande 100
12	Grok 2	47.0	2 / 8	not yet	MMLU 50 · GPQA 44
13	o1-mini	38.7	3 / 8	not yet	MMLU 17 · GPQA 56 · MATH 43
14	Claude 3 Opus	35.0	2 / 8	not yet	MMLU 39 · GPQA 31
15	GPT-4 Turbo	26.0	2 / 8	not yet	MMLU 33 · GPQA 19
16	Gemini 1.5 Pro	24.2	6 / 8	not yet	MMLU 28 · GPQA 13 · MATH 4 · GSM8K 0 · ARC-C 33 · HellaSwag 67
17	Llama 3 70B	10.0	5 / 8	not yet	MMLU 0 · GSM8K 50 · ARC-C 0 · HellaSwag 0 · Winogrande 0
18	Llama 3.1 70B	8.5	2 / 8	not yet	MMLU 11 · GPQA 6
19	GPT-4o Mini	5.7	3 / 8	not yet	MMLU 6 · GPQA 0 · MATH 11

Table 1 · Power score = mean of per-benchmark percentiles. Coverage gate: at least 2 benchmarks. The CodeSOTA verified column shows our own numbers when we have run the model in-house.

§ 02 · Why a second column

Public benchmarks aren’t enough.

Three problems compound. One: popular benchmarks are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the cases that actually pay rent — your data, your edge cases, your failure modes. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.

Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.

Currently 0 of 19 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.

§ 03 · Request

Want a model verified against your data?

If you're choosing an LLM for production and a model on this list doesn't have a CodeSOTA-verified score, tell us. We run a private, hold-out evaluation on the tasks you actually care about, so you're not choosing on the strength of a contaminated public number.
