Which OCR model is best on average?
A single high score is easy to game — train on the test set, hand-tune one document type, publish a paper. Average performance across many benchmarks is harder to fake.
We rank every OCR model that placed on at least 2 of 9 public OCR benchmarks. Then — where we’ve verified the model ourselves — we show our own number next to the public consensus.
Within each benchmark we rank every model that has a score, then map the rank to a 0–100 percentile (top = 100). This puts opposite-direction metrics on one axis: CER is lower-is-better while the OmniDoc composite is higher-is-better, yet both land on the same 0–100 scale.
Power score is the unweighted mean of a model's percentiles across the benchmarks where it has a score. We require a minimum of 2 benchmarks — one strong showing isn't enough.
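A minimal sketch of that arithmetic in Python. The function names, the `lower_is_better` flag, and the toy scores are ours for illustration, not CodeSOTA's published code, and a real harness would also need an explicit tie-breaking rule:

```python
def percentile(scores, model, lower_is_better=False):
    """Map one model's rank within a benchmark to 0-100 (best = 100).

    Flipping the sort order for lower-is-better metrics such as CER
    puts every benchmark on the same higher-is-better axis.
    """
    ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
    if len(ranked) == 1:
        return 100.0
    return 100.0 * (len(ranked) - 1 - ranked.index(model)) / (len(ranked) - 1)

def power_score(percentiles, min_coverage=2):
    """Unweighted mean of a model's percentiles across covered benchmarks."""
    if len(percentiles) < min_coverage:
        return None  # one strong showing isn't enough
    return sum(percentiles) / len(percentiles)

# Toy lower-is-better benchmark (CER): the lowest error ranks first.
cer = {"model-a": 0.021, "model-b": 0.034, "model-c": 0.058}
assert percentile(cer, "model-a", lower_is_better=True) == 100.0
assert percentile(cer, "model-c", lower_is_better=True) == 0.0
```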
Where CodeSOTA has run its own eval (currently 2 of the 21 ranked models), the CodeSOTA-verified column shows that score. When the public consensus and our number disagree, that disagreement is the most useful thing on this page.
The Power Ranking, 21 models.
Sorted by average percentile across the nine axes. The Coverage column is load-bearing: a model on top with 2/9 coverage is making a narrower claim than one on top with 6/9.
The right-most column lists each model's per-benchmark percentiles: ≥ 75 is a top-quartile showing, ≤ 25 a bottom-quartile one.
| # | Model | Power | Coverage | CodeSOTA verified | Per-benchmark percentile |
|---|---|---|---|---|---|
| 01 | Qwen2.5-VL-72B | 98.5 | 2 / 9 | not yet | OCRBench EN 97 · OCRBench ZH 100 |
| 02 | Gemini 2.5 Pro | 82.6 | 5 / 9 | not yet | OmniDoc 63 · OCRBench EN 87 · OCRBench ZH 88 · MME-VideoOCR 100 · Thai-OCR 75 |
| 03 | PaddleOCR-VL | 82.5 | 2 / 9 | not yet | OmniDoc 91 · olmOCR 74 |
| 04 | Qianfan-OCR | 79.3 | 4 / 9 | not yet | OmniDoc 94 · OCRBench EN 80 · OCRBench ZH 75 · olmOCR 68 |
| 05 | PaddleOCR-VL-1.5 | 77.5 | 2 / 9 | not yet | OmniDoc 97 · olmOCR 58 |
| 06 | Claude Sonnet 4 | 66.5 | 2 / 9 | not yet | OCRBench EN 33 · Thai-OCR 100 |
| 07 | minicpm-v-4.5-8b | 63.0 | 2 / 9 | not yet | OCRBench EN 63 · OCRBench ZH 63 |
| 08 | Gemini 1.5 Pro | 60.0 | 2 / 9 | not yet | CC-OCR 100 · MME-VideoOCR 20 |
| 09 | dots.ocr 3B | 59.5 | 2 / 9 | not yet | OmniDoc 66 · olmOCR 53 |
| 10 | sail-vl2-8b | 58.5 | 2 / 9 | not yet | OCRBench EN 67 · OCRBench ZH 50 |
| 11 | GPT-4o | 53.3 | 4 / 9 | not yet | OCRBench EN 77 · CC-OCR 25 · MME-VideoOCR 40 · KITAB 71 |
| 12 | MinerU 2.5 | 52.5 | 2 / 9 | not yet | OmniDoc 84 · olmOCR 21 |
| 13 | GPT-4o Mini | 47.0 | 2 / 9 | not yet | OCRBench EN 37 · KITAB 57 |
| 14 | claude-3.5-sonnet | 45.5 | 2 / 9 | not yet | OCRBench EN 53 · OCRBench ZH 38 |
| 15 | Qwen2.5-VL 72B | 40.0 | 2 / 9 | not yet | MME-VideoOCR 80 · Thai-OCR 0 |
| 16 | Qwen2-VL-72B | 36.5 | 2 / 9 | not yet | OCRBench EN 60 · OCRBench ZH 13 |
| 17 | Mistral OCR 3 | 34.5 | 2 / 9 | 94.9 % internal acc · 3.7 % CER · 7.1 % WER | OmniDoc 22 · olmOCR 47 |
| 18 | InternVL2.5-78B | 34.0 | 2 / 9 | not yet | OCRBench EN 43 · OCRBench ZH 25 |
| 19 | gpt-4o-2024 | 28.5 | 2 / 9 | not yet | OCRBench EN 57 · OCRBench ZH 0 |
| 20 | Qwen2.5-VL 32B | 25.0 | 2 / 9 | not yet | MME-VideoOCR 0 · Thai-OCR 50 |
| 21 | mistral-ocr-2512 | 11.0 | 2 / 9 | 1.22 pages/s | OmniDoc 19 · OCRBench EN 3 |
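For reference, the two error metrics quoted in the verified column are both normalised edit distances. A self-contained sketch (ours, not CodeSOTA's harness; production scoring typically normalises Unicode and whitespace before comparing):

```python
def edit_distance(a, b):
    # Classic Levenshtein DP over two sequences, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits needed / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: the same distance over whitespace tokens."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
```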
Public benchmarks aren’t enough.
Three problems compound. One: popular OCR benchmarks (OmniDoc, OCRBench, olmOCR) are easy to overfit — six months after a paper ships, the test set is in the next training run. Two: they miss the document types that actually pay rent — Polish invoices, German handwritten medical forms, scanned legacy PDFs with deliberate redactions. Three: a vendor’s self-reported score is a marketing artefact until somebody else runs the same eval.
Our verified column closes the third gap. The first two we close with a hold-out architecture: methodology and sample items are public, the actual test set rotates quarterly and stays private — so even when our questions eventually leak into a training corpus, they’re no longer the questions we’re using.
Currently 2 of 21 models on this page have a CodeSOTA-verified score. Expanding that coverage is the work.
Want a model verified against your docs?
If you’re evaluating OCR for production and a model on this list doesn’t have a CodeSOTA-verified score, tell us. We’ll prioritise what real practitioners are about to deploy over what arXiv published last week.