OCR · decision guide

Which OCR stack minimizes manual review cost?

Independent, reproducible, failure-mode-level evaluation.

For teams choosing OCR for production documents. EU languages, real scans, actual breakage patterns.

100% independent · GDPR compliant · Reproducible tests
§ 01 · Clarity

90-second clarity.

Who is this for?

Teams choosing OCR for production documents. EU languages (Polish, German, Czech), real scans with stamps and noise, regulatory compliance requirements.

What decision does this help?

Which OCR stack minimizes manual review cost. Not "best accuracy" but "least expensive failures for your document type."
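
One way to make "least expensive failures" concrete is to price each failure mode by the reviewer time it takes to fix. The sketch below is illustrative only: the stack names, failure rates, fix times, and hourly rate are made-up placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    rate_per_page: float   # expected failures per page (placeholder value)
    review_minutes: float  # reviewer time to fix one failure (placeholder value)


def review_cost_per_1000_pages(modes, hourly_rate):
    """Expected manual review cost for 1,000 pages, in the currency of hourly_rate."""
    minutes = sum(m.rate_per_page * m.review_minutes for m in modes) * 1000
    return minutes / 60 * hourly_rate


# Hypothetical stacks: A drops fewer diacritics, B collapses fewer tables.
stack_a = [FailureMode("dropped diacritics", 0.02, 1.0),
           FailureMode("table collapse", 0.01, 10.0)]
stack_b = [FailureMode("dropped diacritics", 0.05, 1.0),
           FailureMode("table collapse", 0.001, 10.0)]

for label, stack in (("A", stack_a), ("B", stack_b)):
    print(label, round(review_cost_per_1000_pages(stack, hourly_rate=30.0), 2))
```

With these placeholder numbers, stack B is cheaper to review even though it makes more diacritic errors, because its failures are quicker to fix.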

Why trust this?

100% independent, no vendor investment. We run our own benchmarks on real documents. GDPR compliant: data stays in the EU.

Next action?

Read the failure taxonomy below. Pick your document type. Request a private evaluation on your documents.

§ 02 · Failure Taxonomy

What actually breaks.

Forget accuracy percentages. Here's what fails in production and which models handle it.

Dropped Diacritics

Polish ą, ę, ó → a, e, o. German ä, ö, ü → a, o, u. Czech ř, ů → r, u. Changes legal meaning.

Handles: Gemini 2.5 Pro, Qwen2.5-VL 72B, PaddleOCR-VL
Fails: Tesseract, Azure OCR (older versions)
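
A minimal sketch of how dropped diacritics could be counted against a ground-truth transcription, using only Python's standard `unicodedata` module. It assumes the reference and the OCR output are already aligned character by character (same length), which real OCR output often is not. The function names are illustrative and not taken from any benchmark harness.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks: 'ą' -> 'a', 'ř' -> 'r'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dropped_diacritics(reference, ocr_output):
    """Return (expected, got) pairs where OCR emitted the bare base letter."""
    # Normalize to NFC so each accented letter is a single code point.
    reference = unicodedata.normalize("NFC", reference)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    drops = []
    for ref_ch, ocr_ch in zip(reference, ocr_output):
        if ref_ch != ocr_ch and strip_diacritics(ref_ch) == ocr_ch:
            drops.append((ref_ch, ocr_ch))
    return drops


print(dropped_diacritics("mąż", "maz"))
```

Letters without a Unicode decomposition, such as Polish ł, are not caught by this approach and would need an explicit mapping.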

Column Bleed (Multi-column Layouts)

Two-column PDFs read as "Line from left | Line from right | Next left | Next right". Destroys paragraph structure. Common in contracts, scientific papers.

Handles: GPT-5.4, Gemini 1.5 Pro, PaddleOCR-VL
Fails: Tesseract, EasyOCR, traditional OCR engines
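
To illustrate what correct reading order means here, the sketch below reorders text lines from a two-column page. It assumes each line comes with the x/y position of its top-left corner, that there are exactly two columns, and that they split at the page midline. The `Line` type and `reading_order` function are hypothetical, not part of any OCR engine's API.

```python
from typing import NamedTuple


class Line(NamedTuple):
    x: float   # left edge of the text line
    y: float   # top edge (y grows downward)
    text: str


def reading_order(lines, page_width):
    """Read the whole left column top to bottom, then the right column."""
    mid = page_width / 2
    left = sorted((ln for ln in lines if ln.x < mid), key=lambda ln: ln.y)
    right = sorted((ln for ln in lines if ln.x >= mid), key=lambda ln: ln.y)
    return [ln.text for ln in left + right]


# Lines sorted by y alone reproduce the bleed pattern: left, right, left, right.
page = [Line(50, 100, "Line from left"), Line(350, 100, "Line from right"),
        Line(50, 120, "Next left"), Line(350, 120, "Next right")]
print(reading_order(page, page_width=600))
```

Real layouts with full-width headings, three columns, or skewed scans need proper layout analysis; this only shows the intended output order.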

Numeric Substitution

8 → B, 0 → O, 1 → l (lowercase L), 5 → S. Fatal for invoice totals, account numbers, tax IDs. Low-quality scans amplify this.

Handles: Chandra OCR, GPT-5.4, Mistral OCR 3
Fails: PaddleOCR (basic), Tesseract on faxed documents
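
A simple post-OCR check can flag these substitutions in fields that are known to be numeric. The sketch below covers only the four confusions listed above. The function names are illustrative, and in practice flagging a field for review is usually safer than repairing it automatically.

```python
# Letter -> digit it is commonly confused with (the four pairs listed above).
CONFUSABLES = {"B": "8", "O": "0", "l": "1", "S": "5"}


def suspicious_chars(field):
    """Return (index, found_letter, likely_digit) for each confusable letter."""
    return [(i, ch, CONFUSABLES[ch])
            for i, ch in enumerate(field) if ch in CONFUSABLES]


def repair_numeric_field(field):
    """Replace confusable letters with digits. Use only on numeric-only fields."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in field)


total = "12B0.5O"  # OCR output for an invoice total
print(suspicious_chars(total))
print(repair_numeric_field(total))
```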

Header/Footer Hallucination

Page numbers, watermarks, "COPY" stamps read as content. "Page 3 of 12" appears mid-paragraph. Clutters extracted text, breaks search.

Handles: Claude Sonnet 4.6 (lowest hallucination), GPT-5.4
Fails: Most traditional OCR, some VLMs without layout awareness
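
When the marker ends up on its own line in the extracted text, a regular-expression filter can remove it after OCR. This is a minimal sketch: the pattern covers only the "Page N of M" and "COPY" examples above, and it does nothing for markers fused into the middle of a sentence, which need layout information to separate.

```python
import re

# Matches lines that consist solely of a page counter or a COPY stamp.
PAGE_MARKER = re.compile(r"^\s*(page\s+\d+\s+of\s+\d+|copy)\s*$", re.IGNORECASE)


def strip_page_markers(lines):
    """Drop lines that are only a page marker; keep everything else."""
    return [ln for ln in lines if not PAGE_MARKER.match(ln)]


extracted = ["The tenant shall", "Page 3 of 12", "pay rent monthly.", "COPY"]
print(strip_page_markers(extracted))
```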

Table Structure Collapse

Tables read as linear text. Loses row/column relationships. Invoice line items become gibberish. Measured by TEDS (Tree-Edit-Distance-based Similarity).

Handles: PaddleOCR-VL (88.56 TEDS), dots.ocr 3B, Mistral OCR 3
Fails: Tesseract, clearOCR (0.8% TEDS), basic vision models

Stamp/Signature Interference

Circular stamps, handwritten signatures overlay printed text. OCR reads both, creating garbage. "APPROVED" stamp corrupts underlying sentence.

Handles: GPT-5.4, Gemini 2.5 Pro, modern VLMs with layout understanding
Fails: Traditional OCR without preprocessing, basic pipelines
§ 03 · Matrix

If your priority is...

Choose your constraint, see recommended models with honest tradeoffs.

Privacy (GDPR, data residency, no cloud)

Medical records, legal documents, customer PII. Must process on-premise or EU cloud only.

Best

PaddleOCR-VL 0.9B

Open-source, runs on-premise. Strong table handling.

Tradeoff: Lower accuracy than GPT-5.4 on complex layouts

Alternative

Tesseract + post-correction

100% local, battle-tested, free.

Tradeoff: Needs language-specific tuning, poor on tables

Cost (processing millions of pages)

Scanning archives, digitizing libraries, high-volume automation.

Best

PaddleOCR (open-source)

Zero per-page cost. Good multilingual support.

Tradeoff: Worse than VLMs on handwriting, complex tables

Alternative

Mistral OCR 3 (batch)

$1 per 1,000 pages with batch API. Fast inference (1.22 pages/sec).

Tradeoff: Still costs money at scale, API dependency
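
As a back-of-the-envelope check on what "costs money at scale" means, the sketch below applies the batch price quoted above and the 1.22 pages/sec throughput figure reported in this guide to a one-million-page archive. It assumes a single sequential stream with no parallelism, retries, or rate limits, so treat the duration as a rough upper bound, not a quote.

```python
def batch_cost_usd(pages, usd_per_1000=1.0):
    """API cost at a flat per-1,000-page price."""
    return pages / 1000 * usd_per_1000


def processing_days(pages, pages_per_sec=1.22):
    """Wall-clock days for one sequential stream at a fixed throughput."""
    seconds = pages / pages_per_sec
    return seconds / 86_400  # seconds per day


archive = 1_000_000
print(f"cost: ${batch_cost_usd(archive):,.0f}")
print(f"time: {processing_days(archive):.1f} days")
```

Under these assumptions the archive costs about $1,000 and takes roughly nine and a half days on one stream.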

Table Extraction (invoices, reports, structured data)

Need to preserve row/column relationships. Extract line items, financial tables.

Best

PaddleOCR-VL

88.56 TEDS on OmniDocBench. HTML/Markdown table output.

Tradeoff: Requires GPU for good speed

Alternative

dots.ocr 3B

Best table TEDS among 3B models. Compact.

Tradeoff: Newer model, less battle-tested than PaddleOCR

Handwriting (forms, notes, signatures)

Handwritten forms, doctor’s notes, survey responses. Cursive and messy text.

Best

GPT-5.4

Strong handwriting support, multimodal context helps resolve ambiguous strokes.

Tradeoff: API cost, slower than specialized models

Alternative

Gemini 2.5 Pro

Robust on diverse handwriting styles and long documents.

Tradeoff: API cost, occasional layout drift on dense forms

Multi-language (40+ languages, diacritics, non-Latin)

Polish, German, Czech, Arabic, Thai, Korean. Mixed-language documents.

Best

Gemini 2.5 Pro

Tops OCRBench v2 Chinese, KITAB-Bench Arabic, MME-VideoOCR.

Tradeoff: API-only, cost for high volume

Alternative

Chandra OCR 0.1.0

40+ languages, open-source, strong on old scans.

Tradeoff: 9B params, slower inference

Speed (real-time processing, low latency)

Document upload flows, real-time data entry, mobile scanning apps.

Best

Mistral OCR 3

1.22 pages/sec verified by CodeSOTA. Good accuracy.

Tradeoff: API dependency, cost per page

Alternative

PaddleOCR (GPU)

Very fast on GPU, open-source, no API latency.

Tradeoff: Requires GPU infrastructure, setup complexity

§ 04 · Document Types

Quick guide by document.

Recommendations by document category with specific failure modes to watch.

Invoices & Receipts

Critical: Table structure, numeric accuracy, VAT/tax fields.

  • PaddleOCR-VL (tables)
  • Mistral OCR 3 (speed + accuracy)
  • Avoid: clearOCR (no table structure)

Contracts & Legal

Critical: Diacritics (name accuracy), column layout, stamps.

  • GPT-5.4 (layout + stamps)
  • Gemini 1.5 Pro (multi-column)
  • Avoid: Traditional OCR (column bleed)

Scientific PDFs

Critical: Formulas, multi-column, figures, citations.

  • PaddleOCR-VL (formulas)
  • Chandra OCR (old scans, math)
  • Avoid: Basic OCR (formula recognition)

Forms with Handwriting

Critical: Mixed print/handwriting, field extraction, checkboxes.

  • GPT-5.4 (mixed content)
  • Gemini 2.5 Pro (handwriting)
  • Avoid: PaddleOCR basic (handwriting)

ID Documents

Critical: Diacritics, security features, photo interference.

  • Gemini 2.5 Pro (multi-language)
  • Azure OCR (ID-specific)
  • Avoid: Open-source (compliance risk)

Low-quality Scans/Fax

Critical: Noise handling, numeric errors, degraded text.

  • Chandra OCR (old scans)
  • GPT-5.4 (noise robustness)
  • Avoid: Basic Tesseract (numeric errors)
§ 05 · Evaluation

Private OCR evaluation.

We run the same benchmark on your documents.

What you get

  • OCR benchmark on your actual documents (PDF, scans, images)
  • Failure-mode analysis: which errors you'll see in production
  • Model recommendations ranked by manual review cost for your docs
  • GDPR compliant: data processed in EU, deleted after report delivery
  • Runnable code + deployment guide for top-ranked model

No pricing yet: we're validating the format with early users.
Shape: 100-page sample → failure analysis + model ranking → ~1 week turnaround.

§ 06 · Trust

Why trust this guide.

100% Independent

No vendor investment, no affiliate links, no sponsored rankings. We make money from private evaluations, not OCR vendors.

GDPR Compliant

Data stays in EU. Private evaluations processed on EU servers, deleted after delivery. No US cloud providers for sensitive docs.

Open Methodology

All benchmarks documented. Read our methodology or see raw data.

#1 on OmniDocBench · 92.86 composite · SOTA shipped

Run the best OCR model on your Mac — $6

Hardparse runs PaddleOCR-VL-1.5 locally via Apple Metal. No cloud, no API keys, no subscription. Tables, formulas, handwriting, 109 languages.

Every purchase directly supports CodeSOTA's independent benchmark research.

Ready to choose your OCR stack?

Request a private evaluation on your documents, or start with our public benchmarks.
