
Multimodal · the most commoditized AI task of 2026

OCR & Document Parsing.

Open-weights leaders on OmniDocBench (GLM-OCR 94.62, PaddleOCR-VL 94.50) beat closed frontier APIs (GPT-5 ~85.80, Gemini 3 Pro ~84.20) on document parsing — at roughly 1/100th to 1/167th the cost when self-hosted. The buyer question is no longer “which is best” but “hosted ease or self-host economics?”

This page covers dedicated OCR products. For general VLMs that do OCR as a side effect (GPT-5, Claude, Gemini, Pixtral), see Visual Question Answering.


§ 01 · The matrix

12 providers, side by side.

Frontier hosted API · hyperscaler cloud · open weights · domain specialists. Pricing shown per 1,000 pages on each vendor’s default tier.

Provider	Model	Tier	License	Cost / 1K pages	Bboxes	Tables	Math	Outputs	Langs
Mistral	Mistral OCR	Frontier	Proprietary API	~$1	✓	Strong	✓	Markdown · JSON	Broad (Latin + CJK + RTL)
Reducto	Reducto Parse	Frontier	Proprietary API	~$5–20	✓	Strong	✓	Markdown · JSON · HTML	Broad
Mathpix	Mathpix Convert	Specialist	Proprietary API	~$5–15	✓	Strong	✓	Markdown · LaTeX · MathML · JSON · HTML	English + math symbols
Mindee	Mindee · Invoices / Receipts / Custom	Specialist	Proprietary API	Per-doc pricing	✓	Decent	—	JSON	Broad (invoice-focused)
Adobe	Adobe PDF Extract	Specialist	Proprietary API	Subscription tiers	✓	Strong	—	JSON · CSV (tables)	Broad
Amazon Web Services	Textract Detect / AnalyzeDocument	Cloud	Proprietary API	$1.50–$40	✓	Strong	—	JSON	Latin scripts (limited CJK)
Microsoft Azure	Document Intelligence · Read / Layout / Custom	Cloud	Proprietary API	$1.50–$50	✓	Strong	—	JSON · Markdown (Layout)	140+ locales
Google Cloud	Document AI · Form Parser / Custom Extractor	Cloud	Proprietary API	$30–$65	✓	Strong	—	JSON	Broad
Zhipu / THUDM (open)	GLM-OCR (GLM-V family)	Open	Open weights	Self-host	✓	Strong	✓	Markdown · JSON · HTML	Broad (en + CJK strong)
Baidu (open)	PaddleOCR-VL 1.5	Open	Open weights	Self-host	✓	Strong	✓	Markdown · JSON · HTML	Broad (80+, CJK strong)
rednote-hilab (open)	dots.ocr 3B	Open	Open weights	Self-host	✓	Decent	✓	Markdown · JSON	Broad
HUST / Yuliang Liu (open)	MonkeyOCR-pro	Open	Open weights	Self-host	✓	Decent	✓	Markdown · JSON	Broad

Pricing as of 2026-04. Hosted APIs show list price per 1K pages; open-weights show estimated compute cost on a rented A100 at typical batch utilisation. Real unit cost varies with page complexity, image resolution, and whether you batch.

§ 02 · Manifesto

OCR is the most commoditized AI task in 2026.

For most of the last decade, document parsing was a moat: AWS Textract, Google Document AI, and Azure Form Recognizer made tens of millions a year selling pixel recognition that nobody else could match.

That moat collapsed in 2025–2026. The OmniDocBench leaderboard now has two open-weights models — GLM-OCR (94.62) and PaddleOCR-VL (94.50) — sitting above GPT-5 (~85.80) and Gemini 3 Pro (~84.20) on the canonical multi-task document benchmark.

The cost gap is even more dramatic. Google Document AI lists at $30–65 per 1K pages. PaddleOCR-VL on a rented A100 costs ~$0.09 per 1K pages. That’s a 333–722× cost ratio at higher accuracy.
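The ratio can be checked from the two prices quoted in this section. A minimal sketch, using only those figures (the function name is illustrative):

```python
# Prices per 1,000 pages, taken from the figures quoted in this section.
DOCUMENT_AI_PER_1K = (30.00, 65.00)  # Google Document AI list-price range
PADDLEOCR_VL_PER_1K = 0.09           # estimated self-host cost on a rented A100


def cost_ratio(hosted_per_1k: float, self_host_per_1k: float) -> float:
    """How many times more expensive the hosted price is than self-hosting."""
    return hosted_per_1k / self_host_per_1k


low, high = (cost_ratio(p, PADDLEOCR_VL_PER_1K) for p in DOCUMENT_AI_PER_1K)
print(f"{low:.0f}x to {high:.0f}x")  # 333x to 722x
```

The self-host figure is an estimate at typical batch utilisation, so the real ratio moves with how busy you keep the GPU.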

The pragmatic stack in 2026: default to a hosted VLM-based API (Mistral OCR or Reducto) for the developer experience and zero ops overhead, and only self-host when scale, sovereignty, or compliance forces your hand. The hyperscaler tier exists mostly for buyers who already have the MSA.

§ 03 · Decision shortcuts

Which should I use?

Picking an OCR provider is a function of volume, document type, and license. Shortcuts by use-case:

Best hosted API (default pick)

Mistral OCR · Reducto

Both are VLM-based and modern. Mistral OCR is the cheapest credible frontier API at ~$1/1K pages; Reducto wins on RAG-shaped output (chunking, layout-aware extraction).

Cheapest at scale (millions of pages)

PaddleOCR-VL · MonkeyOCR-pro · dots.ocr

Self-hosted open weights at $0.03–$0.10/1K pages, roughly 1/300th or less of Document AI's $30–65 list price. PaddleOCR-VL is the highest-quality of the three; dots.ocr is the lightest to deploy.

Math, equations, scientific PDFs

Mathpix · GLM-OCR · MonkeyOCR-pro

Mathpix is the only commercial API with first-class LaTeX. Among open weights, GLM-OCR and MonkeyOCR-pro both emit clean equation markup.

Invoices, receipts, AP automation

Mindee · AWS Textract AnalyzeDocument · Azure prebuilt-invoice

Specialist parsers return typed fields (vendor, amount, line items) — much faster to ship than wiring raw OCR into a schema. Mindee leads on prebuilt model breadth.

On-prem / sovereignty / regulated data

GLM-OCR · PaddleOCR-VL · MonkeyOCR-pro

Permissive licenses, run inside your VPC. PaddleOCR-VL and MonkeyOCR-pro are Apache 2.0; GLM-OCR is MIT-style. No data leaves your boundary.

Born-digital PDFs (InDesign / Acrobat originals)

Adobe PDF Extract · Reducto

Adobe leads on PDFs that came out of its own toolchain — element ordering and table structure stay intact. Reducto is the all-purpose modern alternative.

Enterprise with an existing hyperscaler MSA

Azure Document Intelligence · AWS Textract · Google Document AI

Already in the contract, IAM and audit story sorted. Azure leads on language coverage and emits Markdown directly; AWS on the receipt/form ecosystem; GCP on form parser quality.

§ 04 · Methodology

What to actually test (vendor demos lie).

Vendor benchmarks are run on hand-picked PDFs that flatter the model. Build your own 10-document evaluation set covering the six failure modes below; providers stratify sharply on them. Run each test against 3–5 providers blind and score on downstream task success (does the parsed output answer your real query?), not character error rate. CER is a 2010s metric that misses layout and structure entirely.

Multi-column layouts

Newspapers, academic two-column papers. Tests whether the model preserves reading order or interleaves columns into nonsense.

Dense tables

Financial statements, scientific data tables with merged cells, footnotes, and rotated headers. The hardest sub-task in OCR: most APIs lose 10–20% of cells.

Equations & formulas

Inline and display math. Score against ground-truth LaTeX. Mathpix and the open-weights leaders can do this; most cannot.

Handwriting (mixed with print)

Filled-in forms, signed contracts, notebook scans. Pure OCR is solved on print; handwriting is still a gap and the differentiator.

Scans with rotation / skew / noise

Phone photos of receipts, faxed invoices, low-DPI archival scans. Real production input is never the clean PDF in vendor demos.

Multilingual mixed-script

English + Chinese / Japanese / Arabic in a single document. Latin-only models silently drop or transliterate non-Latin runs.
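One way to run the failure modes above as a blind comparison is sketched here. Everything in it is an assumption for illustration: `blind_eval`, the `Parser` and `Check` types, and the stub vendors are hypothetical, and each real provider call would sit behind a `Parser` wrapper you write yourself.

```python
import random
from collections import defaultdict
from typing import Callable

# Hypothetical types: a parser maps a document path to parsed text;
# a check returns True if the downstream task succeeded on that text.
Parser = Callable[[str], str]
Check = Callable[[str], bool]
Case = tuple[str, str, Check]  # (failure_mode, doc_path, check)


def blind_eval(providers: dict[str, Parser], cases: list[Case], seed: int = 0):
    """Return (success rate per anonymised label and failure mode, unblinding key)."""
    rng = random.Random(seed)
    names = sorted(providers)
    rng.shuffle(names)
    # Anonymise vendors so whoever reviews the scores cannot favour one.
    key = {name: f"provider_{i}" for i, name in enumerate(names)}

    passed = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for name, parse in providers.items():
        for mode, doc_path, check in cases:
            total[key[name]][mode] += 1
            passed[key[name]][mode] += int(check(parse(doc_path)))

    scores = {
        label: {mode: passed[label][mode] / n for mode, n in modes.items()}
        for label, modes in total.items()
    }
    return scores, key


# Stub vendors standing in for real API calls.
providers = {
    "vendor_a": lambda path: "Total due: 42.00",
    "vendor_b": lambda path: "Tota1 due: 4Z.OO",
}
cases = [("scan_noise", "receipt_photo.jpg", lambda out: "42.00" in out)]
scores, key = blind_eval(providers, cases)
print(scores)  # look at `key` only after scoring is final
```

Each check encodes a downstream question ("is the total recoverable?") rather than a character-level comparison, which is the scoring approach this section argues for.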

§ 05 · Metrics

Why CER and word-accuracy stopped meaning anything.

The classic OCR metrics — character error rate (CER), word accuracy, BLEU on the extracted string — were designed for OCR as a string-recognition task. In 2026 document parsing is a structured-extraction task. The model needs to give you Markdown, JSON, or a typed schema — and the metric needs to score that.

A model that transcribes every character perfectly but loses the table-row structure, merges two columns, or strips equations is useless to your RAG pipeline. Character accuracy says 99%; downstream task success says 40%.
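The gap between string metrics and structure can be reproduced on a toy table. The sketch below is illustrative only: `row_accuracy` is a deliberately simple stand-in for a structure metric (not the scoring any named benchmark uses), and the parser output is invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate after collapsing whitespace."""
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    return levenshtein(ref, hyp) / max(len(ref), 1)


def row_accuracy(ref_rows: list[list[str]], hyp_rows: list[list[str]]) -> float:
    """Fraction of reference rows reproduced exactly, in the same position."""
    hits = sum(
        1 for i, row in enumerate(ref_rows) if i < len(hyp_rows) and hyp_rows[i] == row
    )
    return hits / max(len(ref_rows), 1)


def flatten(rows: list[list[str]]) -> str:
    return " ".join(cell for row in rows for cell in row)


ref_rows = [["Region", "Q1", "Q2"], ["North", "10", "20"], ["South", "30", "40"]]
# Invented parser output: every character is right, but three rows became one.
hyp_rows = [["Region", "Q1", "Q2", "North", "10", "20", "South", "30", "40"]]

print(cer(flatten(ref_rows), flatten(hyp_rows)))  # 0.0
print(row_accuracy(ref_rows, hyp_rows))           # 0.0
```

A CER of 0.0 is a perfect score, yet no row survives: the string metric cannot see that the table is gone.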

The benchmarks worth tracking in 2026 are OmniDocBench (multi-task: text + table + formula + reading order), olmOCR-Bench (Allen Institute’s reference set), and DocVQA (does your output answer real questions?).

The pragmatic move: build your own 50-document eval that scores end-to-end on your downstream task — schema validity, RAG retrieval quality, downstream LLM answer correctness — and ignore the global leaderboards entirely.
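Schema validity is the easiest of those end-to-end scores to automate. A minimal sketch, assuming the parser is asked for JSON with three typed fields; `INVOICE_SCHEMA` and the sample outputs are hypothetical and not any vendor's format:

```python
import json

# Hypothetical invoice schema: field name -> accepted Python type(s) after decoding.
INVOICE_SCHEMA = {"vendor": str, "amount": (int, float), "line_items": list}


def is_valid(raw: str, schema: dict = INVOICE_SCHEMA) -> bool:
    """True if `raw` is a JSON object carrying every field with the expected type."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict):
        return False
    return all(
        # bool is a subclass of int in Python, so reject it explicitly.
        name in doc and isinstance(doc[name], typ) and not isinstance(doc[name], bool)
        for name, typ in schema.items()
    )


def schema_validity(outputs: list[str]) -> float:
    """Share of parser outputs that satisfy the schema."""
    return sum(is_valid(o) for o in outputs) / max(len(outputs), 1)


outputs = [
    '{"vendor": "Acme", "amount": 12.5, "line_items": []}',
    '{"vendor": "Acme"}',
    "not json at all",
]
print(schema_validity(outputs))
```

Validity is only a floor: a well-typed record can still carry the wrong amount, so pair it with answer-correctness checks on the same gold set.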

§ 06 · Reference benchmarks

The boards that matter.

Useful for academic comparison and open-weights training. Frontier API providers don’t train exclusively on these — most use proprietary annotated corpora orders of magnitude larger.

OmniDocBench

1.4K pages · text + table + formula + reading-order · 2025

The 2026 canonical multi-task document benchmark. Scores layout, text, tables, formulas, and reading order in one number. The board where GLM-OCR (94.62) and PaddleOCR-VL (94.50) overtook frontier closed APIs.


olmOCR-Bench

1.4K pages · diverse PDF benchmark · 2025

Allen Institute's reference benchmark released alongside olmOCR. Stress-tests rotation, multi-column, math, tables, and reading order across academic, technical, and legal PDFs.


DocVQA

50K questions · 12K document images · 2021

Industry documents (invoices, forms, reports). Scores OCR + reasoning end-to-end by asking real questions. The standard stress test for document VQA: passing it requires both pixel recognition and structure understanding.


FUNSD

199 forms · 31K word tokens · 9K entities · 2019

Form Understanding in Noisy Scanned Documents. Annotated forms with key-value pair structure. Small but still used as a sanity check for layout-aware OCR — particularly for AP / KYC pipelines.


PubLayNet

360K pages · scientific PDF layout annotations · 2019

IBM's large-scale document-layout corpus. Widely used to train and evaluate layout-detection models; LayoutLMv3 and DiT, for example, report results on it. The dataset that turned layout detection from a research problem into a solved one.


TableBank

417K tables · Word + LaTeX sources · 2019

Specialist benchmark for table extraction. Two tracks: table detection (where) and table structure recognition (rows, columns, merged cells). Still the reference set for table-focused OCR research.


§ 07 · Practical tips

Five rules for shipping OCR in 2026.

Don’t run a frontier general VLM for high-volume OCR. GPT-5 and Claude Opus 4.7 can read documents — but you’re paying $5–15 per 1K pages for a model that scores lower on OmniDocBench than dedicated OCR systems at 1/100th the cost. Use VLMs for the reasoning step downstream of OCR, not for the OCR itself.

Bounding boxes matter for grounded extraction. If your downstream system needs to cite the source region (audit trails, legal discovery, RAG with provenance), require bbox output. Mistral OCR, Reducto, all hyperscalers, and the open-weights leaders all return them; some specialist parsers do not.
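A sketch of what grounded extraction looks like once bboxes are available. The `Span` record and `cite` helper are hypothetical; real providers each use their own response shape and coordinate convention, so treat the normalised `(x0, y0, x1, y1)` layout as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalised to 0..1


def cite(spans: list[Span], value: str) -> list[Span]:
    """Return every span whose text contains the extracted value (case-insensitive)."""
    needle = value.casefold()
    return [s for s in spans if needle in s.text.casefold()]


# Invented OCR output for a single invoice page.
spans = [
    Span("Invoice total: $1,240.00", page=2, bbox=(0.62, 0.81, 0.93, 0.84)),
    Span("Due date: 2026-05-01", page=2, bbox=(0.62, 0.85, 0.93, 0.88)),
]
for hit in cite(spans, "$1,240.00"):
    print(f"page {hit.page}, bbox {hit.bbox}")
```

Storing the page and box next to every extracted value is what lets an audit trail or a RAG answer point back at the exact source region.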

Hosted vs self-host is a 3-axis trade-off. Hosted wins on time-to-ship, latency variance, and zero ops. Self-host wins on unit-cost (10-300×), throughput at scale, and data sovereignty. The cross-over point is usually around 1M pages/month — below that, hosted; above, build the GPU pool.
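The cross-over point can be estimated with a simple break-even model. This is a sketch under stated assumptions: the per-1K prices are the ~$1 hosted and ~$0.09 self-host figures quoted on this page, while the $1,000 fixed monthly overhead (idle GPU time, on-call, maintenance) is a hypothetical placeholder you should replace with your own number.

```python
def break_even_pages(
    hosted_per_1k: float, self_host_per_1k: float, fixed_monthly: float
) -> float:
    """Monthly page volume at which self-hosting costs the same as the hosted API."""
    saving_per_1k = hosted_per_1k - self_host_per_1k
    if saving_per_1k <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly / saving_per_1k * 1000


pages = break_even_pages(
    hosted_per_1k=1.00,      # ~$1/1K pages, the hosted price quoted on this page
    self_host_per_1k=0.09,   # ~$0.09/1K pages, the self-host estimate quoted on this page
    fixed_monthly=1000.00,   # HYPOTHETICAL fixed overhead per month, not from the source
)
print(f"{pages:,.0f} pages/month")  # 1,098,901 pages/month
```

With these inputs the break-even lands near the 1M pages/month rule of thumb; against a pricier hosted tier, or with lower overhead, it arrives much earlier.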

Cache by page hash. OCR is deterministic enough that ~30% of production traffic is repeated pages (re-uploads, retries, duplicate documents). Key the cache on a hash of the page bytes and store the parsed output as the value. A 30-line Redis cache pays for itself in a week at any non-trivial volume.
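A minimal sketch of the page-hash cache. To stay self-contained it uses an in-memory dict where production would use Redis; `PageCache`, the key format, and `fake_ocr` are all illustrative assumptions.

```python
import hashlib
from typing import Callable, MutableMapping, Optional


class PageCache:
    """Memoise OCR output by SHA-256 of the page bytes."""

    def __init__(
        self,
        ocr: Callable[[bytes], str],
        store: Optional[MutableMapping[str, str]] = None,
    ) -> None:
        self._ocr = ocr
        self._store = {} if store is None else store  # swap in a shared store in production
        self.hits = 0
        self.misses = 0

    def parse(self, page_bytes: bytes, model_version: str = "v1") -> str:
        # Include the model version so an upgrade does not serve stale output.
        key = f"ocr:{model_version}:{hashlib.sha256(page_bytes).hexdigest()}"
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        result = self._ocr(page_bytes)
        self._store[key] = result
        return result


calls = []


def fake_ocr(page: bytes) -> str:
    """Stand-in for a real OCR call; records how often it is invoked."""
    calls.append(page)
    return page.decode().upper()


cache = PageCache(fake_ocr)
for page in [b"invoice-1", b"invoice-2", b"invoice-1"]:
    cache.parse(page)
print(cache.hits, cache.misses)  # 1 2
```

Hashing raw bytes only catches exact duplicates; a re-scan of the same paper page produces different bytes and will miss the cache.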

Evaluate on YOUR documents, not vendor demos. Build a 50-document gold set from your real production inputs — scanned invoices in your customer’s actual scan quality, forms in your industry’s actual template variance. Vendor demos are run on the cleanest PDFs they could find; production input is not those PDFs.

