Open-weights leaders on OmniDocBench (GLM-OCR 94.62, PaddleOCR-VL 94.50) beat closed frontier APIs (GPT-5 ~85.80, Gemini 3 Pro ~84.20) on document parsing — at roughly 1/100th to 1/167th the cost when self-hosted. The buyer question is no longer “which is best” but “hosted ease or self-host economics?”
This page covers dedicated OCR products. For general VLMs that do OCR as a side-effect (GPT-5, Claude, Gemini, Pixtral), see Visual Question Answering.
Frontier hosted API · hyperscaler cloud · open weights · domain specialists. Pricing shown per 1,000 pages on each vendor’s default tier.
| Provider / Model | Tier | License | Cost / 1K | Bboxes | Tables | Math | Outputs | Langs | |
|---|---|---|---|---|---|---|---|---|---|
| Frontier | Proprietary API | ~$1 | ✓ | Strong | ✓ | Markdown · JSON | Broad (Latin + CJK + RTL) | Claim → | |
Re | Frontier | Proprietary API | ~$5–20 | ✓ | Strong | ✓ | Markdown · JSON · HTML | Broad | Claim → |
Mp | Specialist | Proprietary API | ~$5–15 | ✓ | Strong | ✓ | Markdown · LaTeX · MathML · JSON · HTML | English + math symbols | Claim → |
Mn | Specialist | Proprietary API | Per-doc pricing | ✓ | Decent | — | JSON | Broad (invoice-focused) | Claim → |
Ad | Specialist | Proprietary API | Subscription tiers | ✓ | Strong | — | JSON · CSV (tables) | Broad | Claim → |
| Cloud | Proprietary API | $1.50–$40 | ✓ | Strong | — | JSON | Latin scripts (limited CJK) | Claim → | |
| Cloud | Proprietary API | $1.50–$50 | ✓ | Strong | — | JSON · Markdown (Layout) | 140+ locales | Claim → | |
| Cloud | Proprietary API | $30–$65 | ✓ | Strong | — | JSON | Broad | Claim → | |
| Open | Open weights | Self-host | ✓ | Strong | ✓ | Markdown · JSON · HTML | Broad (en + CJK strong) | Claim → | |
| Open | Open weights | Self-host | ✓ | Strong | ✓ | Markdown · JSON · HTML | Broad (80+, CJK strong) | Claim → | |
| Open | Open weights | Self-host | ✓ | Decent | ✓ | Markdown · JSON | Broad | Claim → | |
| Open | Open weights | Self-host | ✓ | Decent | ✓ | Markdown · JSON | Broad | Claim → |
Pricing as of 2026-04. Hosted APIs show list-price per 1K pages; open-weights show estimated compute cost on a rented A100 at typical batch utilisation. Real unit cost varies with page complexity, image resolution, and whether you batch. Click any price to open the vendor’s pricing page. Spot an error? Tell us →
For most of the last decade, document parsing was a moat: AWS Textract, Google Document AI, and Azure Form Recognizer made tens of millions a year selling pixel recognition that nobody else could match.
That moat collapsed in 2025–2026. The OmniDocBench leaderboard now has two open-weights models — GLM-OCR (94.62) and PaddleOCR-VL (94.50) — sitting above GPT-5 (~85.80) and Gemini 3 Pro (~84.20) on the canonical multi-task document benchmark.
The cost gap is even more dramatic. Google Document AI lists at $30–65 per 1K pages. PaddleOCR-VL on a rented A100 costs ~$0.09 per 1K pages. That’s a 333–722× cost ratio at higher accuracy.
The pragmatic stack in 2026: default to a hosted VLM-based API (Mistral OCR or Reducto) for the developer experience and zero ops overhead, and only self-host when scale, sovereignty, or compliance forces your hand. The hyperscaler tier exists mostly for buyers who already have the MSA.
Picking an OCR provider is a function of volume, document type, and license. Shortcuts by use-case:
Best hosted API (default pick)
Mistral OCR · Reducto
Both are VLM-based and modern. Mistral OCR is the cheapest credible frontier API at ~$1/1K pages; Reducto wins on RAG-shaped output (chunking, layout-aware extraction).
Cheapest at scale (millions of pages)
PaddleOCR-VL · MonkeyOCR-pro · dots.ocr
Self-hosted open weights at $0.03–$0.10/1K pages — 1/100th the cost of Document AI. PaddleOCR-VL is the highest-quality of the three; dots.ocr is the lightest to deploy.
Math, equations, scientific PDFs
Mathpix · GLM-OCR · MonkeyOCR-pro
Mathpix is the only commercial API with first-class LaTeX. Among open weights, GLM-OCR and MonkeyOCR-pro both emit clean equation markup.
Invoices, receipts, AP automation
Mindee · AWS Textract AnalyzeDocument · Azure prebuilt-invoice
Specialist parsers return typed fields (vendor, amount, line items) — much faster to ship than wiring raw OCR into a schema. Mindee leads on prebuilt model breadth.
On-prem / sovereignty / regulated data
GLM-OCR · PaddleOCR-VL · MonkeyOCR-pro
Permissive licenses, run inside your VPC. PaddleOCR-VL and MonkeyOCR-pro are Apache 2.0; GLM-OCR is MIT-style. No data leaves your boundary.
Born-digital PDFs (InDesign / Acrobat originals)
Adobe PDF Extract · Reducto
Adobe leads on PDFs that came out of its own toolchain — element ordering and table structure stay intact. Reducto is the all-purpose modern alternative.
Enterprise with an existing hyperscaler MSA
Azure Document Intelligence · AWS Textract · Google Document AI
Already in the contract, IAM and audit story sorted. Azure leads on language coverage and emits Markdown directly; AWS on the receipt/form ecosystem; GCP on form parser quality.
Vendor benchmarks are run on hand-picked PDFs that flatter the model. Build your own 10-document evaluation set covering these failure modes — most providers stratify sharply:
Run each test against 3–5 providers blind and score on downstream task success (does the parsed output answer your real query?), not character-error-rate. CER is a 2010s metric that misses layout and structure entirely.
Newspapers, academic two-column papers. Tests whether the model preserves reading order or interleaves columns into nonsense.
Financial statements, scientific data tables with merged cells, footnotes, and rotated headers. The hardest sub-task in OCR — most APIs lose 10-20% of cells.
Inline and display math. Score against ground-truth LaTeX. Mathpix and the open-weights leaders can do this; most cannot.
Filled-in forms, signed contracts, notebook scans. Pure OCR is solved on print; handwriting is still a gap and the differentiator.
Phone photos of receipts, faxed invoices, low-DPI archival scans. Real production input is never the clean PDF in vendor demos.
English + Chinese / Japanese / Arabic in a single document. Latin-only models silently drop or transliterate non-Latin runs.
The classic OCR metrics — character error rate (CER), word accuracy, BLEU on the extracted string — were designed for OCR as a string-recognition task. In 2026 document parsing is a structured-extraction task. The model needs to give you Markdown, JSON, or a typed schema — and the metric needs to score that.
A model that transcribes every character perfectly but loses the table-row structure, merges two columns, or strips equations is useless to your RAG pipeline. CER says 99%; downstream task success says 40%.
The benchmarks worth tracking in 2026 are OmniDocBench (multi-task: text + table + formula + reading order), olmOCR-Bench (Allen Institute’s reference set), and DocVQA (does your output answer real questions?).
The pragmatic move: build your own 50-document eval that scores end-to-end on your downstream task — schema validity, RAG retrieval quality, downstream LLM answer correctness — and ignore the global leaderboards entirely.
Useful for academic comparison and open-weights training. Frontier API providers don’t train exclusively on these — most use proprietary annotated corpora orders of magnitude larger.
The 2026 canonical multi-task document benchmark. Scores layout, text, tables, formulas, and reading order in one number. The board where GLM-OCR (94.62) and PaddleOCR-VL (94.50) overtook frontier closed APIs.
Benchmark page →Allen Institute's reference benchmark released alongside olmOCR. Stress-tests rotation, multi-column, math, tables, and reading order across academic, technical, and legal PDFs.
Benchmark page →Industry documents (invoices, forms, reports). Scores OCR + reasoning end-to-end by asking real questions. The standard hostility check for document VQA — passing it requires both pixel recognition and structure understanding.
Benchmark page →Form Understanding in Noisy Scanned Documents. Annotated forms with key-value pair structure. Small but still used as a sanity check for layout-aware OCR — particularly for AP / KYC pipelines.
Benchmark page →IBM's large-scale document-layout corpus. Used to pre-train layout backbones (LayoutLM, DocFormer, Pix2Struct). The dataset that turned layout detection from a research problem into a solved one.
Benchmark page →Specialist benchmark for table extraction. Two tracks: table detection (where) and table structure recognition (rows, columns, merged cells). Still the reference set for table-focused OCR research.
Benchmark page →Don’t run a frontier general VLM for high-volume OCR. GPT-5 and Claude Opus 4.7 can read documents — but you’re paying $5–15 per 1K pages for a model that scores lower on OmniDocBench than dedicated OCR systems at 1/100th the cost. Use VLMs for the reasoning step downstream of OCR, not for the OCR itself.
Bounding boxes matter for grounded extraction. If your downstream system needs to cite the source region (audit trails, legal discovery, RAG with provenance), require bbox output. Mistral OCR, Reducto, all hyperscalers, and the open-weights leaders all return them; some specialist parsers do not.
Hosted vs self-host is a 3-axis trade-off. Hosted wins on time-to-ship, latency variance, and zero ops. Self-host wins on unit-cost (10-300×), throughput at scale, and data sovereignty. The cross-over point is usually around 1M pages/month — below that, hosted; above, build the GPU pool.
Cache by page hash. OCR is deterministic enough that ~30% of production traffic is repeated pages (re-uploads, retries, duplicate documents). Hash the page bytes → output. A 30-line Redis cache pays for itself in a week at any non-trivial volume.
Evaluate on YOUR documents, not vendor demos. Build a 50-document gold set from your real production inputs — scanned invoices in your customer’s actual scan quality, forms in your industry’s actual template variance. Vendor demos are run on the cleanest PDFs they could find; production input is not those PDFs.
CodeSOTA’s OCR comparison is read by engineers picking a document parser for production RAG, AP automation, and compliance pipelines. If you represent a vendor above — or one we missed — claim the listing to submit verified pricing, benchmark scores, demo links, and a logo. Free; credibility-gated, not pay-to-play.
Missing a vendor, a column we skipped, or a use case you need help picking for? Tell us — we reply within 48 hours and update the page based on what readers actually ask.
Real humans read every message. We track what people are asking for and prioritize accordingly.