Independent, reproducible, failure-mode-level evaluation.
For teams choosing OCR for production documents: EU languages (Polish, German, Czech), real scans with stamps and noise, regulatory compliance requirements.
Which OCR stack minimizes manual review cost. Not "best accuracy" but "least expensive failures for your document type."
100% independent, no vendor funding. We run our own benchmarks on real documents. GDPR compliant: data stays in the EU.
Read the failure taxonomy below, pick your document type, then request a private evaluation on your documents.
Forget accuracy percentages. Here's what fails in production and which models handle it.
Polish ą, ę, ó → a, e, o. German ä, ö, ü → a, o, u. Czech ř, ů → r, u. Changes legal meaning.
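A quick way to catch this failure automatically, assuming you have ground-truth text for a sample page: compare the OCR output to the reference after stripping combining marks with Python's `unicodedata`. If the texts match only after stripping, the model read the letters but dropped the diacritics. This is an illustrative sketch, not part of CodeSOTA's benchmark harness.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits "ř" into "r" + combining caron; drop the combining marks.
    # Note: some letters (e.g. Polish "ł") have no canonical decomposition
    # and survive stripping, so this check is a heuristic, not exhaustive.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def diacritics_lost(reference: str, ocr_output: str) -> bool:
    # True when OCR matches the reference only with diacritics removed.
    return (reference != ocr_output
            and strip_diacritics(reference) == strip_diacritics(ocr_output))

print(diacritics_lost("Dvořák", "Dvorak"))  # True: "ř" and "á" were flattened
```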
Two-column PDFs read as "Line from left | Line from right | Next left | Next right". Destroys paragraph structure. Common in contracts, scientific papers.
8 → B, 0 → O, 1 → l (lowercase L), 5 → S. Fatal for invoice totals, account numbers, tax IDs. Low-quality scans amplify this.
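A minimal post-processing guard for this failure, assuming you extract fields that should be digits-only (totals, account numbers, tax IDs): flag any letters in the field and propose the most common substitution as a repair candidate for human review, never as a silent correction. The confusion table below is an assumption based on the pairs listed above, not an exhaustive list.

```python
import re

# Common OCR confusions in numeric fields (letter -> digit it likely was).
CONFUSABLES = str.maketrans({"B": "8", "O": "0", "o": "0",
                             "l": "1", "I": "1", "S": "5"})

def check_numeric_field(raw: str) -> tuple[str, bool]:
    """Return (repair candidate, needs_review) for a digits-only field."""
    needs_review = bool(re.search(r"[A-Za-z]", raw))
    return raw.translate(CONFUSABLES), needs_review

print(check_numeric_field("1O8.5O"))  # ('108.50', True) -> route to review
```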
Page numbers, watermarks, "COPY" stamps read as content. "Page 3 of 12" appears mid-paragraph. Clutters extracted text, breaks search.
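A common mitigation is to filter lines matching known header/footer boilerplate before indexing. Sketched here with stdlib `re`; the pattern list is a hypothetical starter to tune per corpus, and it only catches artifacts on their own line, not ones fused mid-paragraph.

```python
import re

# Boilerplate that OCR injects into extracted text (starter list, tune per corpus).
BOILERPLATE = re.compile(r"^\s*(Page \d+ of \d+|COPY|CONFIDENTIAL)\s*$",
                         re.IGNORECASE)

def strip_artifacts(lines: list[str]) -> list[str]:
    # Drop lines that are pure page furniture, keep everything else.
    return [ln for ln in lines if not BOILERPLATE.match(ln)]

print(strip_artifacts(["First paragraph.", "Page 3 of 12", "continues here."]))
# ['First paragraph.', 'continues here.']
```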
Tables read as linear text, losing row/column relationships. Invoice line items become gibberish. Measured by TEDS (Tree-Edit-Distance-based Similarity).
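TEDS itself computes a tree edit distance over full HTML table trees. As a much cruder sanity check (not TEDS, and not our benchmark code), you can parse two HTML tables into cell grids with the standard library and see whether the row/column shape survived:

```python
from html.parser import HTMLParser

class TableGrid(HTMLParser):
    # Collect an HTML table into a list of rows, each a list of cell strings.
    def __init__(self):
        super().__init__()
        self.rows, self._cell = [], None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = ""
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None
    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

def grid(html: str) -> list[list[str]]:
    parser = TableGrid()
    parser.feed(html)
    return parser.rows

ref = grid("<table><tr><td>Item</td><td>Total</td></tr>"
           "<tr><td>Widget</td><td>9.99</td></tr></table>")
out = grid("<table><tr><td>Item Total</td></tr>"
           "<tr><td>Widget 9.99</td></tr></table>")
print(ref == out)  # False: the OCR output merged the columns
```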
Circular stamps, handwritten signatures overlay printed text. OCR reads both, creating garbage. "APPROVED" stamp corrupts underlying sentence.
Choose your constraint, see recommended models with honest tradeoffs.
Medical records, legal documents, customer PII. Must process on-premise or EU cloud only.
Open-source, runs on-premise. Strong table handling.
Tradeoff: Lower accuracy than GPT-5.4 on complex layouts
100% local, battle-tested, free.
Tradeoff: Needs language-specific tuning, poor on tables
Scanning archives, digitizing libraries, high-volume automation.
Zero per-page cost. Good multilingual support.
Tradeoff: Worse than VLMs on handwriting, complex tables
$1/1000 pages with batch API. Fast inference (1.2 pages/sec).
Tradeoff: Still costs money at scale, API dependency
Need to preserve row/column relationships. Extract line items, financial tables.
88.56 TEDS on OmniDocBench. HTML/Markdown table output.
Tradeoff: Requires GPU for good speed
Best table TEDS among 3B models. Compact.
Tradeoff: Newer model, less battle-tested than PaddleOCR
Handwritten forms, doctor’s notes, survey responses. Cursive and messy text.
Strong handwriting support, multimodal context helps resolve ambiguous strokes.
Tradeoff: API cost, slower than specialized models
Robust on diverse handwriting styles and long documents.
Tradeoff: API cost, occasional layout drift on dense forms
Polish, German, Czech, Arabic, Thai, Korean. Mixed-language documents.
Tops OCRBench v2 Chinese, KITAB-Bench Arabic, MME-VideoOCR.
Tradeoff: API-only, cost for high volume
40+ languages, open-source, strong on old scans.
Tradeoff: 9B params, slower inference
Document upload flows, real-time data entry, mobile scanning apps.
1.22 pages/sec verified by CodeSOTA. Good accuracy.
Tradeoff: API dependency, cost per page
Very fast on GPU, open-source, no API latency.
Tradeoff: Requires GPU infrastructure, setup complexity
Recommendations by document category with specific failure modes to watch.
Critical: Table structure, numeric accuracy, VAT/tax fields.
Critical: Diacritics (name accuracy), column layout, stamps.
Critical: Formulas, multi-column, figures, citations.
Critical: Mixed print/handwriting, field extraction, checkboxes.
Critical: Diacritics, security features, photo interference.
Critical: Noise handling, numeric errors, degraded text.
We run the same benchmark on your documents.
Early access. No spam. Unsubscribe anytime.
No pricing yet — we're validating the format with early users.
How it works: 100-page sample → failure analysis + model ranking → ~1 week turnaround.
No vendor funding, no affiliate links, no sponsored rankings. We make money from private evaluations, not OCR vendors.
Data stays in EU. Private evaluations processed on EU servers, deleted after delivery. No US cloud providers for sensitive docs.
All benchmarks documented. Read our methodology or see raw data.
Run the best OCR model on your Mac — $6
Hardparse runs PaddleOCR-VL-1.5 locally via Apple Metal. No cloud, no API keys, no subscription. Tables, formulas, handwriting, 109 languages.
Every purchase directly supports CodeSOTA's independent benchmark research.
Request a private evaluation on your documents, or start with our public benchmarks.