Which vision model is best on average?
A single high score is easy to game — train on the test set, hand-tune one case, publish a paper. Average performance across many benchmarks is harder to fake.
We rank every vision model that placed on at least 2 of 5 public benchmarks. Then — where we've verified the model ourselves — we show our number next to the public consensus.
Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This neutralises the difference between lower-is-better and higher-is-better metrics: both end up on the same 0–100 axis.
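A minimal sketch of the rank-to-percentile step, for readers who want to reproduce it. The page does not state the exact formula or how ties are broken, so this assumes linear spacing between the top rank (100) and the bottom rank (0) and breaks ties by input order. The function name `rank_percentiles` and the `higher_is_better` flag are illustrative, not part of any published CodeSOTA tooling.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map each model's raw score on ONE benchmark to a 0-100 percentile.

    Assumption (not stated on the page): ranks are spaced linearly,
    so the best model gets 100 and the worst gets 0.
    Ties are not handled specially; tied models keep their input order.
    """
    # Best model first, whichever direction the metric runs.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ordered)
    if n == 1:
        # A single entrant is trivially on top.
        return {ordered[0]: 100.0}
    return {
        model: 100.0 * (n - 1 - position) / (n - 1)
        for position, model in enumerate(ordered)
    }


# Hypothetical raw scores, for illustration only.
accuracy = {"a": 90.0, "b": 80.0, "c": 70.0}    # higher is better
error_rate = {"a": 0.1, "b": 0.2, "c": 0.3}     # lower is better

print(rank_percentiles(accuracy))
print(rank_percentiles(error_rate, higher_is_better=False))
# Both calls give a=100.0, b=50.0, c=0.0: the metric direction no longer matters.
```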
Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
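The power score can be sketched in a few lines. The example below uses percentiles taken from the ranking table on this page; the names `power_score`, `rank_models`, and `MIN_BENCHMARKS` are illustrative, and rounding to one decimal place is an assumption based on how the table displays scores.

```python
MIN_BENCHMARKS = 2  # one strong showing isn't enough


def power_score(percentiles):
    """Unweighted mean of a model's per-benchmark percentiles.

    Returns None when the model has fewer than MIN_BENCHMARKS scores,
    meaning it is not ranked.
    """
    if len(percentiles) < MIN_BENCHMARKS:
        return None
    values = list(percentiles.values())
    # Rounding to one decimal is an assumption matching the table display.
    return round(sum(values) / len(values), 1)


def rank_models(per_model):
    """Sort eligible models by power score, best first."""
    scored = {model: power_score(p) for model, p in per_model.items()}
    eligible = [(m, s) for m, s in scored.items() if s is not None]
    return sorted(eligible, key=lambda pair: pair[1], reverse=True)


# Percentiles copied from the ranking table.
table = {
    "InternImage-H": {"COCO": 94, "ADE20K": 92},
    "EVA-02-L": {"ImageNet": 90, "COCO": 75, "CIFAR-100": 100},
    "ViT-H/14": {"ImageNet": 57, "CIFAR-100": 43},
}

for model, score in rank_models(table):
    coverage = len(table[model])
    print(f"{model}: {score} ({coverage} / 5)")
```

Because the mean is taken only over benchmarks where a model has a score, a model with 2 / 5 coverage can outrank one with 3 / 5; that is why the coverage figure should be read alongside the score.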
Where CodeSOTA has run its own eval (currently 0 of 8 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.
The Power Ranking, 8 models.
Sorted by average percentile across the 5 axes. The Coverage column is load-bearing: a model on top with 2/5 is making a narrower claim than one on top with most axes covered.
Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | InternImage-H | 93.0 | 2 / 5 | not yet | COCO 94, ADE20K 92 |
| 02 | EVA-02-L | 88.3 | 3 / 5 | not yet | ImageNet 90, COCO 75, CIFAR-100 100 |
| 03 | DeiT-B Distilled | 51.5 | 2 / 5 | not yet | ImageNet 43, CIFAR-10 60 |
| 04 | ViT-H/14 | 50.0 | 2 / 5 | not yet | ImageNet 57, CIFAR-100 43 |
| 05 | ViT-L/16 (IN-21K) | 35.5 | 2 / 5 | not yet | CIFAR-100 21, CIFAR-10 50 |
| 06 | EfficientNet-B7 | 26.0 | 2 / 5 | not yet | ImageNet 38, CIFAR-100 14 |
| 07 | ViT-B/16 | 13.0 | 2 / 5 | not yet | ImageNet 19, CIFAR-100 7 |
| 08 | ResNet-50 | 0.0 | 3 / 5 | not yet | ImageNet 0, CIFAR-100 0, CIFAR-10 0 |
Public benchmarks aren’t enough.
Three problems compound. One: popular benchmarks are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the cases that actually pay rent — your data, your edge cases, your failure modes. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.
Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, while the actual test set rotates quarterly and stays private. Even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.
Currently 0 of 8 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.
Want a model verified against your data?
If you're choosing a vision model for production and a model on this list doesn't have a CodeSOTA-verified score, tell us. We run a private hold-out evaluation on your imagery and edge cases — not the public set that's already in every training run.