Which OCR benchmarks have the best model coverage for fair comparison? Data-driven analysis of 242 results across 10 OCR benchmarks and 42 models.
dots.ocr reports OmniDocBench but not OCRBench v2. Mistral OCR reports neither. GPT-5.4 reports both but with different metrics. For fair “apples-to-apples” comparison, you need benchmarks where multiple models have published results.
A single benchmark score is like a single data point: potentially misleading, easily gamed, and insufficient for real decisions. Understanding why requires seeing the four fundamental problems with single-benchmark evaluation.
Every single-benchmark evaluation suffers from at least one of these issues. The more benchmarks you use, the harder it is for these problems to hide.
- **Selective reporting**: model developers naturally report the benchmarks where their model excels.
- **Narrow scope**: one benchmark tests one skill, and OCR has dozens of distinct challenges.
- **Saturation**: when everyone scores 95%+, the benchmark stops being useful.
- **Sparse overlap**: you can only compare models on benchmarks they BOTH report.
A model that claims "SOTA on Benchmark X" might be genuinely the best, or it might be cherry-picking its strongest benchmark, riding a saturated metric, or simply untested on the capability you actually need.
Think of OCR models like restaurants, and each benchmark like a dish category. You want to find the best restaurant in town, but each restaurant shows off only its best dish: Restaurant A shows its pasta, B its steak, C its sushi. You cannot rank them until they all cook the same dish.
When everyone scores near-perfect, a benchmark loses its discriminative power. A saturated benchmark where every model lands between 95% and 99% tells you almost nothing; a benchmark with a 22-point spread gives a clear winner and a meaningful ranking.
OCR encompasses at least 8 distinct capabilities, and a model can excel at printed text while failing at handwriting. No single benchmark tests all of these:

- **Printed text**: clean printed documents
- **Tables**: structured tables with cells
- **Handwriting**: cursive and print handwriting
- **Scene text**: text in photos, signs, products
- **Math**: LaTeX equations, symbols
- **Complex layouts**: newspapers, academic papers
- **Degraded documents**: faded, blurry, old documents
- **Key information extraction**: structured data extraction from documents
Each row is a model; Y = benchmark reported, - = benchmark missing. Notice how little the benchmark sets overlap, which makes most direct comparisons impossible.
| Model | OmniDoc | OCRBench v2 | SROIE | IAM | ICDAR | olmOCR |
|---|---|---|---|---|---|---|
| Model A | Y | Y | Y | - | - | - |
| Model B | - | - | - | Y | Y | - |
| Model C | Y | - | - | Y | - | - |
Model A and Model C overlap only on OmniDoc, Model B and Model C only on IAM, and Model A and Model B share no benchmark at all. No single benchmark covers all three models.
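To make the overlap problem concrete, here is a minimal Python sketch of the toy matrix above (the model and benchmark names are the illustrative ones from the table, not real data). A pair of models is comparable only on the intersection of their reported benchmark sets:

```python
from itertools import combinations

# Which benchmarks each (toy) model has published results on.
reported = {
    "Model A": {"OmniDoc", "OCRBench v2", "SROIE"},
    "Model B": {"IAM", "ICDAR"},
    "Model C": {"OmniDoc", "IAM"},
}

# A pair of models is comparable only on the intersection
# of their reported benchmark sets.
for a, b in combinations(reported, 2):
    shared = reported[a] & reported[b]
    print(f"{a} vs {b}: {sorted(shared) or 'no shared benchmarks'}")
# Model A vs Model B: no shared benchmarks
# Model A vs Model C: ['OmniDoc']
# Model B vs Model C: ['IAM']
```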
The more shared benchmarks a comparison rests on, the more reliable it is. Here is the two-step process that turns scattered scores into a fair ranking.
Step 1: Look at each model's headline benchmark. Model A: 97% on BenchX | Model B: 95% on BenchY | Model C: 99% on BenchZ. These are three different benchmarks, so the numbers cannot be compared.
Step 2: Find benchmarks where ALL models have results. All three models were tested on OmniDocBench: A = 82%, B = 89%, C = 76%. Now you can actually compare, and Model B wins.
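The same two-step procedure as a code sketch, using the hypothetical scores from the walkthrough above:

```python
# Hypothetical scores keyed by model; the numbers mirror the
# walkthrough above and are illustrative only.
scores = {
    "Model A": {"BenchX": 97.0, "OmniDocBench": 82.0},
    "Model B": {"BenchY": 95.0, "OmniDocBench": 89.0},
    "Model C": {"BenchZ": 99.0, "OmniDocBench": 76.0},
}

# Step 1 is a trap: each model's headline number comes from a
# different benchmark. Step 2: intersect the benchmark sets.
common = set.intersection(*(set(s) for s in scores.values()))
print("Comparable on:", common)  # {'OmniDocBench'}

# Rank the models on a benchmark they all report.
for bench in common:
    ranking = sorted(scores, key=lambda m: scores[m][bench], reverse=True)
    print(bench, "->", ranking)
# OmniDocBench -> ['Model B', 'Model A', 'Model C']
```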
The Bottom Line: A model claiming "best OCR" based on one benchmark is like a student claiming "smartest in class" based on one quiz. Real evaluation requires comprehensive testing across multiple dimensions. That is what this page helps you do.
Which key OCR benchmarks is each model missing? Empty cells show gaps that prevent fair comparison.
| Model | Coverage | OmniDocBench | olmOCR-Bench | OCRBench v2 | OCRBench | SROIE | IAM Database | ICDAR 2015 | Total-Text | CC-OCR | KITAB-Bench | ThaiOCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 45% | ✓ | ✓ | ✓ | — | — | — | — | — | ✓ | ✓ | — |
| Gemini 2.5 Pro | 27% | ✓ | — | ✓ | — | — | — | — | — | — | — | ✓ |
| Mistral OCR 3 | 27% | ✓ | ✓ | ✓ | — | — | — | — | — | — | — | — |
| dots.ocr 3B | 18% | ✓ | ✓ | — | — | — | — | — | — | — | — | — |
| PaddleOCR-VL | 18% | ✓ | ✓ | — | — | — | — | — | — | — | — | — |
| MinerU 2.5 | 18% | ✓ | ✓ | — | — | — | — | — | — | — | — | — |
| Claude Sonnet 4 | 18% | — | — | ✓ | — | — | — | — | — | — | — | ✓ |
| Chandra v0.1.0 | 9% | — | ✓ | — | — | — | — | — | — | — | — | — |
| Qwen2.5-VL 72B | 9% | — | — | — | — | — | — | — | — | — | — | ✓ |
| Tesseract | 9% | — | — | — | — | — | — | — | — | — | ✓ | — |
| EasyOCR | 9% | — | — | — | — | — | — | — | — | — | ✓ | — |
| olmOCR v0.4.0 | 9% | — | ✓ | — | — | — | — | — | — | — | — | — |
| Marker 1.10.1 | 9% | — | ✓ | — | — | — | — | — | — | — | — | — |
| Gemini 3.1 Pro | 9% | — | — | — | — | — | — | — | — | ✓ | — | — |
✓ = Model has published results. — = No data available (cannot compare on this benchmark).
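The Coverage column is just the fraction of the 11 tracked benchmarks a model reports. A minimal sketch, using the GPT-5.4 row from the table above:

```python
# Minimal sketch of the Coverage column: the fraction of the 11
# benchmarks tracked above that a model reports. The GPT-5.4 row
# is copied from the table.
BENCHMARKS = [
    "OmniDocBench", "olmOCR-Bench", "OCRBench v2", "OCRBench",
    "SROIE", "IAM Database", "ICDAR 2015", "Total-Text",
    "CC-OCR", "KITAB-Bench", "ThaiOCRBench",
]

reported = {"OmniDocBench", "olmOCR-Bench", "OCRBench v2",
            "CC-OCR", "KITAB-Bench"}  # GPT-5.4's ✓ cells

coverage = len(reported) / len(BENCHMARKS)
print(f"GPT-5.4: {coverage:.0%}")  # 45%
```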
Benchmarks where multiple top models have published results: use these for apples-to-apples comparison. Together they cover different aspects of OCR while keeping good model coverage.
- **OmniDocBench**: 981 pages, 9 categories. Tests tables, formulas, layouts. The most comprehensive document parsing benchmark.
- **olmOCR-Bench**: 7,010 tests across 1,402 PDFs. Old scans, math, multi-column, tiny text. Tests real-world edge cases.
- **OCRBench v2**: 8 core capabilities, 23 tasks. Tests text recognition, referring, extraction across English and Chinese.
- **IAM Database**: 13,353 lines from 657 writers. The gold standard for handwriting recognition since 1999.
- **SROIE**: 626 receipt images. Key information extraction: company, date, address, total. The standard invoice benchmark.
- **ICDAR 2015**: 1,500 images from wearable cameras. The industry standard for scene text detection in the wild.
Sorted by the number of models with published results; the benchmarks with the most models form the strongest comparison sets.
| # | Benchmark | Category | Language | Models | Year | Priority |
|---|---|---|---|---|---|---|
| 1 | OCRBench v2 | OCR capabilities | Multilingual | 48 | 2024 | Must include |
| 2 | OmniDocBench | Document parsing | English | 34 | 2024 | Must include |
| 3 | olmOCR-Bench | Document parsing | English | 22 | 2024 | Must include |
| 4 | ParseBench | Document parsing | English | 14 | 2026 | Must include |
| 5 | FUNSD | Document understanding | English | 13 | 2019 | Must include |
| 6 | KITAB-Bench | Document OCR | Arabic | 8 | 2024 | Must include |
| 7 | MME-VideoOCR | OCR capabilities | English | 6 | 2024 | Recommended |
| 8 | IAM | Handwriting recognition | English | 5 | 1999 | Recommended |
| 9 | CC-OCR | OCR capabilities | Multilingual | 5 | 2024 | Recommended |
| 10 | ThaiOCRBench | Document OCR | Thai | 5 | 2024 | Recommended |
Benchmarks where model scores are spread out (not saturated). A 30-point spread means models actually differ; a 5-point spread means everyone performs similarly.
| Benchmark | Metric | Worst | Best | Spread | Models | Verdict |
|---|---|---|---|---|---|---|
| OmniDocBench | table-teds | 0.80 | 93.52 | 92.72 | 4 | Highly discriminative |
| OmniDocBench | composite | 31.70 | 94.62 | 62.92 | 33 | Highly discriminative |
| olmOCR-Bench | headers-footers | 42.00 | 96.10 | 54.10 | 4 | Highly discriminative |
| OCRBench v2 | overall-zh-public | 9.10 | 55.70 | 46.60 | 17 | Highly discriminative |
| ParseBench | accuracy | 45.20 | 84.90 | 39.70 | 14 | Highly discriminative |
| OCRBench v2 | overall-en-private | 23.40 | 62.20 | 38.80 | 31 | Highly discriminative |
| olmOCR-Bench | old-scans | 40.70 | 73.10 | 32.40 | 5 | Highly discriminative |
| OCRBench v2 | overall-en-public | 23.10 | 52.60 | 29.50 | 17 | Highly discriminative |
| olmOCR-Bench | pass-rate | 63.80 | 83.90 | 20.10 | 20 | Highly discriminative |
| OCRBench v2 | overall-zh-private | 45.70 | 63.70 | 18.00 | 9 | Good separation |
| FUNSD | f1 | 77.89 | 92.08 | 14.19 | 13 | Good separation |
| MME-VideoOCR | total-accuracy | 61.00 | 73.70 | 12.70 | 6 | Good separation |
Spread = Best score - Worst score. Higher spread = more meaningful for comparison.
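A sketch of the spread computation behind this table. The 20-point and 10-point verdict thresholds are assumptions chosen to match this table's labels, not published cutoffs, and the two middle scores in the example are invented filler; only the worst and best values come from the table:

```python
# Spread = best score - worst score; classify how discriminative
# a benchmark is. Thresholds (20 and 10 points) are assumptions
# fitted to the verdicts in the table above.
def spread_verdict(results: list[float]) -> tuple[float, str]:
    spread = round(max(results) - min(results), 2)
    if spread >= 20:
        return spread, "Highly discriminative"
    if spread >= 10:
        return spread, "Good separation"
    return spread, "Saturated"

# OCRBench v2 overall-zh-private: scores range from 45.70 to 63.70.
print(spread_verdict([45.70, 51.20, 58.30, 63.70]))
# (18.0, 'Good separation')
```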
Some important OCR benchmarks still have limited model coverage: blind spots where fair comparison is not yet possible. With that caveat, here are recommended benchmarks by use case (a small lookup sketch follows the list).

- **Document parsing**: use OmniDocBench (composite) + olmOCR-Bench (pass-rate). Both have 10+ models tested.
- **General OCR, receipts and invoices**: use OCRBench v2 (overall-en) for general OCR and SROIE (F1) for receipts and invoices.
- **Handwriting**: use IAM Database (CER). The standard since 1999, though only a handful of the models tracked here report it.
- **Scene text**: use ICDAR 2015 (F1) + Total-Text for curved text.
- **Non-Latin scripts**: use KITAB-Bench (Arabic), ThaiOCRBench (Thai), and OCRBench v2 (zh-private). Model coverage is limited, so be careful with claims.
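If you want this decision logic in code, here is a small lookup encoding the recommendations above; the use-case keys are our own naming, while the benchmark and metric picks are as stated in this section:

```python
# Map use cases to recommended benchmarks (with the metric named
# in this section in parentheses). Keys are illustrative.
RECOMMENDED = {
    "document parsing": ["OmniDocBench (composite)", "olmOCR-Bench (pass-rate)"],
    "receipts & invoices": ["OCRBench v2 (overall-en)", "SROIE (F1)"],
    "handwriting": ["IAM Database (CER)"],
    "scene text": ["ICDAR 2015 (F1)", "Total-Text"],
    "non-latin scripts": ["KITAB-Bench (Arabic)", "ThaiOCRBench (Thai)",
                          "OCRBench v2 (overall-zh-private)"],
}

def pick_benchmarks(use_case: str) -> list[str]:
    """Return the recommended benchmarks for a use case, or []."""
    return RECOMMENDED.get(use_case.lower(), [])

print(pick_benchmarks("Handwriting"))  # ['IAM Database (CER)']
```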
Use our benchmark data to make informed decisions about OCR models. Compare dots.ocr, GPT-5.4, Mistral OCR, PaddleOCR and more.