Which OCR model is best on average?
A single high score is easy to game — train on the test set, hand-tune one document type, publish a paper. Average performance across many benchmarks is harder to fake.
We rank every OCR model that placed on at least 2 of 9 public OCR benchmarks. Then — where we’ve verified the model ourselves — we show our own number next to the public consensus.
Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This removes differences in metric direction: CER is lower-is-better while the OmniDoc composite is higher-is-better, yet both end up on the same 0–100 axis.
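The rank-to-percentile step can be sketched roughly as follows. This is a minimal illustration, not the site's confirmed implementation: the function name `rank_percentiles`, the linear formula, the lack of tie handling, and the sample scores are all assumptions. The formula was picked only because it puts the best model at 100 and the worst at 0.

```python
def rank_percentiles(scores, higher_is_better=True):
    """Map raw scores on one benchmark to 0-100 percentiles (best = 100, worst = 0).

    Assumed formula: 100 * (n - 1 - i) / (n - 1), where i is the 0-based rank.
    Ties are not handled specially; tied models get adjacent ranks.
    """
    n = len(scores)
    if n == 1:
        # A lone model is trivially on top.
        return {model: 100.0 for model in scores}
    # Best model first: descending for higher-is-better, ascending otherwise.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {model: 100.0 * (n - 1 - i) / (n - 1) for i, model in enumerate(ordered)}


# Hypothetical raw scores, for illustration only.
cer = {"model_a": 0.02, "model_b": 0.05, "model_c": 0.10}   # lower is better
composite = {"model_a": 70.0, "model_b": 90.0, "model_c": 80.0}  # higher is better

print(rank_percentiles(cer, higher_is_better=False))
print(rank_percentiles(composite, higher_is_better=True))
```

After the mapping, `model_a` scores 100 on the CER benchmark and 0 on the composite one, so the two metrics can be averaged despite pointing in opposite directions.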
Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
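As a sketch of that averaging rule, the snippet below computes a Power score from per-benchmark percentiles and rejects models with fewer than two benchmarks. The function name `power_score`, the `None` return for ineligible models, and rounding to one decimal are assumptions made for illustration; the percentile values are taken from the Gemini 2.5 Pro and GLM-OCR rows of the ranking table.

```python
from statistics import mean

MIN_BENCHMARKS = 2  # stated minimum: one strong showing isn't enough


def power_score(percentiles, min_benchmarks=MIN_BENCHMARKS):
    """Unweighted mean of a model's percentiles, or None if coverage is too thin."""
    if len(percentiles) < min_benchmarks:
        return None
    # Rounding to one decimal is an assumption that matches the table's display.
    return round(mean(percentiles.values()), 1)


gemini_25_pro = {
    "OmniDoc": 51,
    "OCRBench EN": 72,
    "OCRBench ZH": 71,
    "MME-VideoOCR": 100,
    "Thai-OCR": 75,
}
glm_ocr = {"OmniDoc": 95, "olmOCR": 16}

print(power_score(gemini_25_pro))       # 73.8
print(power_score(glm_ocr))             # 55.5
print(power_score({"OmniDoc": 95}))     # None: only one benchmark
```

The GLM-OCR case shows why the mean matters: a 95th-percentile OmniDoc result is pulled down to 55.5 by a weak olmOCR showing.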
Where CodeSOTA has run its own eval (currently 2 of 31 ranked models), the right-most column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.
The Power Ranking, 31 models.
Sorted by average percentile across the nine benchmarks. The Coverage column is load-bearing: a model on top with 2/9 is making a narrower claim than one on top with 6/9.
Pills below each model show per-benchmark percentile. Copper = top quartile (≥75), grey = middle, faded = bottom quartile.
| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | TeleMM-2.0 | 93.0 | 2 / 9 | not yet | OCRBench EN 86, OCRBench ZH 100 |
| 02 | Gemini 3 Pro Preview | 89.0 | 2 / 9 | not yet | OCRBench EN 92, OCRBench ZH 86 |
| 03 | Qwen2.5-VL-72B | 81.0 | 2 / 9 | not yet | OCRBench EN 83, OCRBench ZH 79 |
| 04 | PaddleOCR-VL | 78.0 | 3 / 9 | not yet | OmniDoc 85, OmniDoc 80, olmOCR 69 |
| 05 | PaddleOCR-VL 1.5 | 76.0 | 2 / 9 | not yet | OmniDoc 93, olmOCR 59 |
| 06 | Gemini 2.5 Pro | 73.8 | 5 / 9 | not yet | OmniDoc 51, OCRBench EN 72, OCRBench ZH 71, MME-VideoOCR 100, Thai-OCR 75 |
| 07 | Qianfan-OCR | 70.0 | 4 / 9 | not yet | OmniDoc 90, OCRBench EN 67, OCRBench ZH 57, olmOCR 66 |
| 08 | Falcon-OCR | 65.5 | 2 / 9 | not yet | OmniDoc 59, olmOCR 72 |
| 09 | Ovis2.5-9B | 65.0 | 2 / 9 | not yet | OCRBench EN 94, OCRBench ZH 36 |
| 10 | Claude Sonnet 4 | 64.0 | 2 / 9 | not yet | OCRBench EN 28, Thai-OCR 100 |
| 11 | Intern-S1-Pro | 62.5 | 2 / 9 | not yet | OCRBench EN 75, OCRBench ZH 50 |
| 12 | Gemini 1.5 Pro | 60.0 | 2 / 9 | not yet | CC-OCR 100, MME-VideoOCR 20 |
| 13 | DeepSeek-OCR-2 | 59.5 | 2 / 9 | not yet | OmniDoc 78, olmOCR 41 |
| 14 | GLM-OCR | 55.5 | 2 / 9 | not yet | OmniDoc 95, olmOCR 16 |
| 15 | dots.ocr 3B | 55.0 | 2 / 9 | not yet | OmniDoc 54, olmOCR 56 |
| 16 | GPT-4o | 50.0 | 4 / 9 | not yet | OCRBench EN 64, CC-OCR 25, MME-VideoOCR 40, KITAB 71 |
| 17 | minicpm-v-4.5-8b | 48.0 | 2 / 9 | not yet | OCRBench EN 53, OCRBench ZH 43 |
| 18 | MonkeyOCR-pro-3B | 47.0 | 2 / 9 | not yet | OmniDoc 63, olmOCR 31 |
| 19 | MinerU 2.5 | 46.3 | 3 / 9 | not yet | OmniDoc 73, olmOCR 47, olmOCR 19 |
| 20 | GPT-4o Mini | 44.0 | 2 / 9 | not yet | OCRBench EN 31, KITAB 57 |
| 21 | sail-vl2-8b | 42.5 | 2 / 9 | not yet | OCRBench EN 56, OCRBench ZH 29 |
| 22 | Qwen2.5-VL 72B | 40.0 | 2 / 9 | not yet | MME-VideoOCR 80, Thai-OCR 0 |
| 23 | Mistral OCR 3 | 33.5 | 2 / 9 | 94.9% internal acc, 3.7% CodeSOTA CER, 7.1% CodeSOTA WER | OmniDoc 17, olmOCR 50 |
| 24 | claude-3.5-sonnet | 32.5 | 2 / 9 | not yet | OCRBench EN 44, OCRBench ZH 21 |
| 25 | DeepSeek-OCR | 32.0 | 2 / 9 | not yet | OmniDoc 39, olmOCR 25 |
| 26 | Qwen2-VL-72B | 28.5 | 2 / 9 | not yet | OCRBench EN 50, OCRBench ZH 7 |
| 27 | InternVL2.5-78B | 25.0 | 2 / 9 | not yet | OCRBench EN 36, OCRBench ZH 14 |
| 28 | Qwen2.5-VL 32B | 25.0 | 2 / 9 | not yet | MME-VideoOCR 0, Thai-OCR 50 |
| 29 | gpt-4o-2024 | 23.5 | 2 / 9 | not yet | OCRBench EN 47, OCRBench ZH 0 |
| 30 | olmOCR | 23.0 | 2 / 9 | not yet | OmniDoc 24, olmOCR 22 |
| 31 | mistral-ocr-2512 | 9.0 | 2 / 9 | 1.22 pages/s | OmniDoc 15, OCRBench EN 3 |
Public benchmarks aren’t enough.
Three problems compound. One: popular OCR benchmarks (OmniDoc, OCRBench, olmOCR) are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the document types that actually pay rent — Polish invoices, German handwritten medical forms, scanned legacy PDFs with deliberate redactions. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.
Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.
Currently 2 of 31 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.
Want a model verified against your docs?
If you’re evaluating OCR for production and a model on this list doesn’t have a CodeSOTA-verified score, tell us. We’ll prioritise what real practitioners are about to deploy over what arXiv published last week.