Codesota · OCR · Power Ranking
The condensed answer · who’s actually best on average
Issue: April 27, 2026
§ 00 · Premise

Which OCR model is best on average?

A single high score is easy to game — train on the test set, hand-tune one document type, publish a paper. Average performance across many benchmarks is harder to fake.

We rank every OCR model that placed on at least 2 of 9 public OCR benchmarks. Then — where we’ve verified the model ourselves — we show our own number next to the public consensus.

§ 01
Per-benchmark percentile

Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This removes the direction mismatch between metrics: CER is lower-is-better while the OmniDoc composite is higher-is-better, yet both end up on the same 0–100 axis.
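A minimal sketch of the rank-to-percentile step. The page only states that the top model maps to 100, so the linear mapping used here (evenly spaced from 100 for the top rank down to 0 for the bottom) is an assumption, and so are the function name `rank_percentiles`, the `higher_is_better` flag and the sample scores. Ties are not given special treatment.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map raw scores on ONE benchmark to 0-100 percentiles (best model = 100).

    scores: dict of model name -> raw score.
    higher_is_better: False for error metrics such as CER.
    Assumption: ranks are spread linearly between 100 (best) and 0 (worst).
    """
    n = len(scores)
    if n == 0:
        return {}
    if n == 1:
        # A single entrant is trivially on top.
        return {name: 100.0 for name in scores}
    # Best model first: descending for higher-is-better, ascending otherwise.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {
        name: 100.0 * (n - 1 - position) / (n - 1)
        for position, name in enumerate(ordered)
    }


# Hypothetical raw scores, for illustration only.
cer = {"model_a": 0.03, "model_b": 0.10, "model_c": 0.05}   # lower is better
composite = {"model_a": 70, "model_b": 90, "model_c": 80}   # higher is better

cer_pct = rank_percentiles(cer, higher_is_better=False)
composite_pct = rank_percentiles(composite, higher_is_better=True)
```

After the mapping, `model_a` sits at 100 on the CER axis and at 0 on the composite axis, so the two metrics can be averaged without caring which direction the raw number runs.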

§ 02
Average across coverage

Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
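The averaging step can be sketched as follows. The function name `power_score` and the one-decimal rounding are assumptions chosen to match how scores are displayed in the ranking; the percentile lists are copied from the Gemini 2.5 Pro and TeleMM-2.0 rows of the table.

```python
MIN_COVERAGE = 2  # the coverage gate stated above


def power_score(percentiles, min_coverage=MIN_COVERAGE):
    """Unweighted mean of a model's per-benchmark percentiles.

    percentiles: list holding one value per benchmark the model has a score on.
    Returns None when the model does not clear the coverage gate.
    """
    if len(percentiles) < min_coverage:
        return None
    return round(sum(percentiles) / len(percentiles), 1)


# Percentiles taken from the ranking table.
gemini_25_pro = [51, 72, 71, 100, 75]   # coverage 5 / 9
telemm_20 = [86, 100]                   # coverage 2 / 9

print(power_score(gemini_25_pro))  # 73.8
print(power_score(telemm_20))      # 93.0
print(power_score([100]))          # None: one strong showing is not enough
```

Because the mean is unweighted, a model with two high percentiles can outrank one with five mixed ones, which is why the Coverage column has to be read alongside the Power score.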

§ 03
Our own column, when we have one

Where CodeSOTA has run its own eval (currently 2 of 31 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.

§ 01 · Ranking

The Power Ranking, 31 models.

Sorted by average percentile across the benchmarks each model covers, out of nine. The Coverage column is load-bearing: a model on top with 2/9 is making a narrower claim than one on top with 6/9.

Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.

| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | TeleMM-2.0 | 93.0 | 2 / 9 | not yet | OCRBench EN 86 · OCRBench ZH 100 |
| 02 | Gemini 3 Pro Preview | 89.0 | 2 / 9 | not yet | OCRBench EN 92 · OCRBench ZH 86 |
| 03 | Qwen2.5-VL-72B | 81.0 | 2 / 9 | not yet | OCRBench EN 83 · OCRBench ZH 79 |
| 04 | PaddleOCR-VL | 78.0 | 3 / 9 | not yet | OmniDoc 85 · OmniDoc 80 · olmOCR 69 |
| 05 | PaddleOCR-VL 1.5 | 76.0 | 2 / 9 | not yet | OmniDoc 93 · olmOCR 59 |
| 06 | Gemini 2.5 Pro | 73.8 | 5 / 9 | not yet | OmniDoc 51 · OCRBench EN 72 · OCRBench ZH 71 · MME-VideoOCR 100 · Thai-OCR 75 |
| 07 | Qianfan-OCR | 70.0 | 4 / 9 | not yet | OmniDoc 90 · OCRBench EN 67 · OCRBench ZH 57 · olmOCR 66 |
| 08 | Falcon-OCR | 65.5 | 2 / 9 | not yet | OmniDoc 59 · olmOCR 72 |
| 09 | Ovis2.5-9B | 65.0 | 2 / 9 | not yet | OCRBench EN 94 · OCRBench ZH 36 |
| 10 | Claude Sonnet 4 | 64.0 | 2 / 9 | not yet | OCRBench EN 28 · Thai-OCR 100 |
| 11 | Intern-S1-Pro | 62.5 | 2 / 9 | not yet | OCRBench EN 75 · OCRBench ZH 50 |
| 12 | Gemini 1.5 Pro | 60.0 | 2 / 9 | not yet | CC-OCR 100 · MME-VideoOCR 20 |
| 13 | DeepSeek-OCR-2 | 59.5 | 2 / 9 | not yet | OmniDoc 78 · olmOCR 41 |
| 14 | GLM-OCR | 55.5 | 2 / 9 | not yet | OmniDoc 95 · olmOCR 16 |
| 15 | dots.ocr 3B | 55.0 | 2 / 9 | not yet | OmniDoc 54 · olmOCR 56 |
| 16 | GPT-4o | 50.0 | 4 / 9 | not yet | OCRBench EN 64 · CC-OCR 25 · MME-VideoOCR 40 · KITAB 71 |
| 17 | minicpm-v-4.5-8b | 48.0 | 2 / 9 | not yet | OCRBench EN 53 · OCRBench ZH 43 |
| 18 | MonkeyOCR-pro-3B | 47.0 | 2 / 9 | not yet | OmniDoc 63 · olmOCR 31 |
| 19 | MinerU 2.5 | 46.3 | 3 / 9 | not yet | OmniDoc 73 · olmOCR 47 · olmOCR 19 |
| 20 | GPT-4o Mini | 44.0 | 2 / 9 | not yet | OCRBench EN 31 · KITAB 57 |
| 21 | sail-vl2-8b | 42.5 | 2 / 9 | not yet | OCRBench EN 56 · OCRBench ZH 29 |
| 22 | Qwen2.5-VL 72B | 40.0 | 2 / 9 | not yet | MME-VideoOCR 80 · Thai-OCR 0 |
| 23 | Mistral OCR 3 | 33.5 | 2 / 9 | 94.9 % Internal acc · 3.7 % CodeSOTA CER · 7.1 % CodeSOTA WER | OmniDoc 17 · olmOCR 50 |
| 24 | claude-3.5-sonnet | 32.5 | 2 / 9 | not yet | OCRBench EN 44 · OCRBench ZH 21 |
| 25 | DeepSeek-OCR | 32.0 | 2 / 9 | not yet | OmniDoc 39 · olmOCR 25 |
| 26 | Qwen2-VL-72B | 28.5 | 2 / 9 | not yet | OCRBench EN 50 · OCRBench ZH 7 |
| 27 | InternVL2.5-78B | 25.0 | 2 / 9 | not yet | OCRBench EN 36 · OCRBench ZH 14 |
| 28 | Qwen2.5-VL 32B | 25.0 | 2 / 9 | not yet | MME-VideoOCR 0 · Thai-OCR 50 |
| 29 | gpt-4o-2024 | 23.5 | 2 / 9 | not yet | OCRBench EN 47 · OCRBench ZH 0 |
| 30 | olmOCR | 23.0 | 2 / 9 | not yet | OmniDoc 24 · olmOCR 22 |
| 31 | mistral-ocr-2512 | 9.0 | 2 / 9 | 1.22 pages/s | OmniDoc 15 · OCRBench EN 3 |
Tab 1 · Power score = mean of per-benchmark percentiles. Coverage gate ≥ 2. CodeSOTA-verified column shows our own numbers when we have run the model in-house.
§ 02 · Why a second column

Public benchmarks aren’t enough.

Three problems compound. One: popular OCR benchmarks (OmniDoc, OCRBench, olmOCR) are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the document types that actually pay rent — Polish invoices, German handwritten medical forms, scanned legacy PDFs with deliberate redactions. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.

Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.

Currently 2 of 31 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.

§ 03 · Request

Want a model verified against your docs?

If you’re evaluating OCR for production and a model on this list doesn’t have a CodeSOTA-verified score, tell us. We’ll prioritise what real practitioners are about to deploy over what arXiv published last week.