Which LLM is best on average?
A single high score is easy to game — train on the test set, hand-tune one case, publish a paper. Average performance across many benchmarks is harder to fake.
We rank every LLM that has a score on at least 2 of 8 public benchmarks. Then, where we've verified the model on our own held-out set, we show our number next to the public consensus.
Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). Ranking first removes the difference in metric direction: whether a benchmark is lower-is-better or higher-is-better, its best model lands at 100 on the same 0–100 axis.
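A minimal sketch of the rank-to-percentile step, under stated assumptions: the page does not give its exact formula, so the linear mapping `100 * (n - rank) / (n - 1)` used here is an assumption, as are the function name, the input shape (a dict of raw scores per model) and the model names. Ties are not handled.

```python
def percentiles(scores, higher_is_better=True):
    """Map each model's raw score on one benchmark to a 0-100 percentile."""
    # Sort best-first, whichever direction the metric runs.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 100.0}
    # Assumed linear mapping: best model -> 100, worst model -> 0.
    return {model: 100.0 * (n - 1 - i) / (n - 1) for i, model in enumerate(ordered)}


# Hypothetical raw scores: one accuracy metric, one error-rate metric.
accuracy = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
error_rate = {"model_a": 0.10, "model_b": 0.20, "model_c": 0.30}

by_accuracy = percentiles(accuracy)                        # higher is better
by_error = percentiles(error_rate, higher_is_better=False)  # lower is better
print(by_accuracy == by_error)  # both put model_a at 100 and model_c at 0
```

Under this assumed mapping, second place among 19 scored models comes out at 100 × 17/18 ≈ 94.4, which is consistent with the 94 shown for o1 on MMLU, though the rounding rule is not stated on the page.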
Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
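The Power score rule can be sketched as follows. The function name, the `None` return for models below the coverage minimum, and rounding to one decimal are illustrative assumptions. The percentiles for o3 and GPT-4.5 Preview are copied from the ranking table on this page, and the single-benchmark entry is hypothetical.

```python
MIN_BENCHMARKS = 2  # minimum coverage stated in the methodology


def power_score(percentiles_by_benchmark, min_benchmarks=MIN_BENCHMARKS):
    """Unweighted mean of a model's percentiles, or None if coverage is too low."""
    count = len(percentiles_by_benchmark)
    if count < min_benchmarks:
        return None
    return round(sum(percentiles_by_benchmark.values()) / count, 1)


models = {
    "o3": {"MMLU": 100, "GPQA": 100, "MATH": 89},
    "GPT-4.5 Preview": {"MMLU": 89, "GPQA": 69},
    "single-benchmark model": {"MMLU": 100},  # hypothetical; excluded below
}

scores = {name: power_score(p) for name, p in models.items()}
# Keep only models that met the coverage minimum, best first.
ranked = sorted(
    (name for name, score in scores.items() if score is not None),
    key=scores.get,
    reverse=True,
)
print(scores["o3"])  # 96.3
print(ranked)        # ['o3', 'GPT-4.5 Preview']
```

Because the mean is unweighted and taken only over benchmarks a model has a score on, a model with 2 benchmarks and one with 8 sit on the same scale, which is why the coverage count matters when reading the result.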
Where CodeSOTA has run its own eval (currently 0 of 19 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.
The Power Ranking, 19 models.
Sorted by average percentile across whichever of the 8 axes a model has a score on. The Coverage column is load-bearing: a model on top with 2/8 is making a narrower claim than one on top with most axes.
Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | o3 | 96.3 | 3 / 8 | not yet | MMLU 100, GPQA 100, MATH 89 |
| 02 | o1 | 85.7 | 3 / 8 | not yet | MMLU 94, GPQA 88, MATH 75 |
| 03 | o4-mini | 84.0 | 3 / 8 | not yet | MMLU 72, GPQA 94, MATH 86 |
| 04 | GPT-4.5 Preview | 79.0 | 2 / 8 | not yet | MMLU 89, GPQA 69 |
| 05 | o1-preview | 78.8 | 5 / 8 | not yet | MMLU 83, GPQA 75, MATH 36, AIME 100, GSM8K 100 |
| 06 | GPT-4.1 | 70.5 | 2 / 8 | not yet | MMLU 78, GPQA 63 |
| 07 | o3-mini | 65.3 | 3 / 8 | not yet | MMLU 22, GPQA 81, MATH 93 |
| 08 | Claude 3.5 Sonnet | 54.0 | 7 / 8 | not yet | MMLU 56, GPQA 50, MATH 14, GSM8K 75, ARC-C 100, HellaSwag 33, Winogrande 50 |
| 09 | DeepSeek V3 | 53.5 | 2 / 8 | not yet | MMLU 61, MATH 46 |
| 10 | Llama 3.1 405B | 52.5 | 2 / 8 | not yet | MMLU 67, GPQA 38 |
| 11 | GPT-4o | 48.8 | 8 / 8 | not yet | MMLU 44, GPQA 25, MATH 29, AIME 0, GSM8K 25, ARC-C 67, HellaSwag 100, Winogrande 100 |
| 12 | Grok 2 | 47.0 | 2 / 8 | not yet | MMLU 50, GPQA 44 |
| 13 | o1-mini | 38.7 | 3 / 8 | not yet | MMLU 17, GPQA 56, MATH 43 |
| 14 | Claude 3 Opus | 35.0 | 2 / 8 | not yet | MMLU 39, GPQA 31 |
| 15 | GPT-4 Turbo | 26.0 | 2 / 8 | not yet | MMLU 33, GPQA 19 |
| 16 | Gemini 1.5 Pro | 24.2 | 6 / 8 | not yet | MMLU 28, GPQA 13, MATH 4, GSM8K 0, ARC-C 33, HellaSwag 67 |
| 17 | Llama 3 70B | 10.0 | 5 / 8 | not yet | MMLU 0, GSM8K 50, ARC-C 0, HellaSwag 0, Winogrande 0 |
| 18 | Llama 3.1 70B | 8.5 | 2 / 8 | not yet | MMLU 11, GPQA 6 |
| 19 | GPT-4o Mini | 5.7 | 3 / 8 | not yet | MMLU 6, GPQA 0, MATH 11 |
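The pill colouring described above the table can be sketched as a simple threshold function. The ≥75 cut-off for copper is stated on the page. The cut-off for the bottom quartile is not, so treating "bottom quartile" as strictly below 25 is an assumption, and the function name is illustrative.

```python
def pill_tier(percentile):
    """Return the pill colour for one per-benchmark percentile."""
    if percentile >= 75:   # top quartile, as stated on the page
        return "copper"
    if percentile < 25:    # assumed boundary; the page only says "bottom quartile"
        return "faded"
    return "grey"


# Claude 3.5 Sonnet's percentiles from the table above.
sonnet = {"MMLU": 56, "GPQA": 50, "MATH": 14, "GSM8K": 75,
          "ARC-C": 100, "HellaSwag": 33, "Winogrande": 50}
print({bench: pill_tier(p) for bench, p in sonnet.items()})
```

Under this assumption a percentile of exactly 25 (for example GPT-4o on GPQA) would render grey; an inclusive boundary would make it faded instead.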
Public benchmarks aren’t enough.
Three problems compound. One: popular benchmarks are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the cases that actually pay rent — your data, your edge cases, your failure modes. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.
Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.
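One way a quarterly rotation like this could work is sketched below. This is not a description of the CodeSOTA implementation: the private item pool, the retired-item list, the quarter labels, the secret salt and the function name are all hypothetical. The idea is only that each quarter draws unused items, so anything that leaked from an earlier quarter is no longer in play.

```python
import hashlib


def next_test_set(pool_ids, retired_ids, quarter, secret, size):
    """Pick `size` not-yet-used item ids for a quarter, deterministically."""
    retired = set(retired_ids)
    candidates = [item for item in pool_ids if item not in retired]
    if len(candidates) < size:
        raise ValueError("private pool is too small for another rotation")

    def sort_key(item_id):
        # Salted hash: the order depends on the secret and the quarter label.
        return hashlib.sha256(f"{secret}:{quarter}:{item_id}".encode()).hexdigest()

    return sorted(candidates, key=sort_key)[:size]


pool = [f"item-{i}" for i in range(10)]  # hypothetical private item ids
q1 = next_test_set(pool, [], "Q1", "example-secret", 3)
q2 = next_test_set(pool, q1, "Q2", "example-secret", 3)  # Q1 items are retired
print(set(q1).isdisjoint(q2))  # True: no item is reused across the two quarters
```

A real pool would need to be replenished over time, since every rotation permanently retires the items it used.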
Currently 0 of 19 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.
Want a model verified against your data?
If you're choosing an LLM for production and a model on this list doesn't have a CodeSOTA-verified score, tell us. We run a private, hold-out evaluation on the tasks you actually care about, so your choice doesn't rest on a contaminated public number.