OCR Benchmarks
How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. Attention path tracks the frontier focus; branches show language-specific forks and metric-isolated variants.
OCR evaluation has had three distinct eras. (1) Pre-VLM (2002–2018): isolated word/line recognition on IAM, ICDAR scene-text, RIMES; the question was 'can the model read?'. (2) Form & document era (2019–2022): FUNSD, SROIE and DocVQA shifted the question to 'can the model find what it's reading?'. (3) VLM era (2023–): OCRBench bundled five sub-tasks into one composite; OCRBench v2 expanded that to 10K human-verified items across 31 sub-tasks. Attention then moved off 'can it read' entirely: OmniDocBench scores layout, tables, formulas and reading order as a composite, olmOCR-Bench evaluates PDF pass-rate at the page level, and ParseBench (LlamaIndex 2026) introduced rule-based agent evals across five orthogonal axes. As of April 2026, frontier VLMs and specialist OCR-VLMs (PaddleOCR-VL, dots.ocr, GLM-OCR) cluster within 2 points on the OmniDocBench composite; the frontier is now the long tail (handwriting, non-Latin scripts, tables with merged cells, scanned legacy PDFs). Language-specific benchmarks (KITAB-Bench for Arabic, ThaiOCRBench, PolEval-OCR) are the working benchmarks where the gap to humans is still material.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
IAM
657 writers, 1,539 scanned forms, 13,353 handwritten English text lines. The foundational handwriting recognition benchmark; every modern HTR paper still reports CER/WER on IAM. Long since saturated by specialist handwriting models; remains a real gap for general-purpose VLMs.
ICDAR 2015
Incidental scene text: photos taken without intent to capture text, oriented arbitrarily. The dataset that pushed scene-text recognition from clean cropped words to real-world reading. Spawned a decade of detection+recognition pipelines (CRAFT, EAST, ABINet) before VLMs absorbed the task wholesale.
FUNSD
199 noisy scanned forms with semantic role labels (header, question, answer, other) and key-value links. The benchmark that re-framed OCR as 'find the structure', not 'read the pixels'. Predecessor to LayoutLM and the entire form-understanding line.
OCRBench
1,000 questions across 5 OCR tasks (text recognition, scene-text VQA, document VQA, key-info extraction, handwritten math). The first widely-cited composite specifically for VLM-era OCR. Saturated by 2024 — top closed models all >700/1000.
OCRBench v2
10,000 human-verified items across 31 sub-tasks in English and Chinese, four splits (public/private × EN/ZH). The current standard for 'can a VLM read' — frontier models cluster around 60% on EN-private, leaving real headroom. Where most VLM papers report their OCR claim.
OmniDocBench
End-to-end document parsing scored as a composite over text edit distance, table TEDS, formula edit distance, reading order and layout mAP. The benchmark that moved OCR scoring from 'character accuracy' to 'document fidelity'. The current most-cited surface for OCR-VLM bake-offs.
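A minimal sketch of how a composite of this shape can be computed, assuming per-document predictions and references for text, formulas and tables. The edit-distance normalization, the placeholder TEDS value and the equal weighting below are illustrative assumptions; OmniDocBench defines its own matching, weights and scorers.

```python
# Simplified composite-style scoring over three axes. The real OmniDocBench
# pipeline does block matching, per-category weighting and its own TEDS scorer;
# this only shows how per-axis scores end up on one comparable scale.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means an exact match."""
    denom = max(len(pred), len(ref)) or 1
    return edit_distance(pred, ref) / denom

def composite(text_ned: float, formula_ned: float, table_teds: float) -> float:
    """Hypothetical equal-weight composite; higher is better.
    Edit distances are converted to similarities, TEDS already is one."""
    return ((1 - text_ned) + (1 - formula_ned) + table_teds) / 3

print(composite(
    text_ned=normalized_edit_distance("The quick brown fox", "The quick brown fax"),
    formula_ned=normalized_edit_distance(r"E = mc^2", r"E = mc^{2}"),
    table_teds=0.92,  # placeholder similarity from a tree-matching table scorer
))
```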
KITAB-Bench
Arabic-script OCR across 9 domains (newspapers, books, PDFs, handwriting). The honest stress-test for non-Latin scripts: frontier closed models still post CER >0.13 on the easy split. The Arabic companion to OCRBench v2's English-and-Chinese coverage.
olmOCR-Bench
1,403 PDF pages across nine sub-categories (arXiv, headers/footers, multi-column, old-scans, tables, long-tiny-text, etc.) scored on pass-rate per page. Harder than OmniDocBench because it targets the failure modes — old scans, math, mixed columns. The 'where does it actually break' surface.
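A minimal sketch of page-level pass-rate, assuming each page carries a set of binary checks (for example, required text present, running headers absent). The check representation and the all-checks-must-pass rule are assumptions; olmOCR-Bench defines its own unit-test types and aggregation.

```python
# Page-level pass-rate over binary checks: a page counts only if every check
# on it holds, and the score is the fraction of pages that pass.

from typing import Callable

Check = Callable[[str], bool]

def page_passes(parsed_text: str, checks: list[Check]) -> bool:
    """A page passes only if every check on it holds."""
    return all(check(parsed_text) for check in checks)

def pass_rate(pages: list[tuple[str, list[Check]]]) -> float:
    """Fraction of pages whose every check passes."""
    passed = sum(page_passes(text, checks) for text, checks in pages)
    return passed / len(pages)

# Hypothetical example: two pages, one fails because a running header leaked in.
pages = [
    ("Section 3. Results\nAccuracy improved by 4.2 points.",
     [lambda t: "Results" in t, lambda t: "Page 7 of 12" not in t]),
    ("Page 7 of 12\nTable 2 shows ...",
     [lambda t: "Table 2" in t, lambda t: "Page 7 of 12" not in t]),
]
print(pass_rate(pages))  # 0.5
```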
ThaiOCRBench
Thai-script document OCR with TED scoring (tree edit distance over the parse tree). A tight, modern benchmark for a script that gets ~zero attention in English-centric papers. Claude Sonnet 4 leads at 0.84.
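A minimal sketch of tree edit distance over two tiny parse trees, using the off-the-shelf zss implementation of the Zhang-Shasha algorithm (pip install zss). The tree shapes, labels and unit costs are hypothetical; ThaiOCRBench defines its own parse representation and cost model.

```python
# Tree edit distance between a reference parse and a prediction that mislabels
# the heading and drops one table cell; unit costs for relabel/insert/delete.

from zss import Node, simple_distance

# Reference parse: a document with a heading and a two-cell table row.
ref = (Node("doc")
       .addkid(Node("heading"))
       .addkid(Node("table").addkid(Node("row")
                                    .addkid(Node("cell"))
                                    .addkid(Node("cell")))))

# Prediction: the heading was read as a plain paragraph and one cell was lost.
pred = (Node("doc")
        .addkid(Node("paragraph"))
        .addkid(Node("table").addkid(Node("row").addkid(Node("cell")))))

print(simple_distance(ref, pred))  # 2: one relabel + one deletion
```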
ParseBench
2,078 human-verified pages from 1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five orthogonal axes: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, visual grounding. No LLM-as-judge — every score is reproducible. The current frontier for agent-document workloads.
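A minimal sketch of what a rule-based, reproducible check can look like for chart data, in the spirit of a data-point-match test: every reference (label, value) pair must be recovered within a numeric tolerance. The function name, tolerance and data structures are hypothetical, not ParseBench's actual definitions.

```python
# Deterministic chart check: no LLM judge, just an assertion over the
# extracted series against ground truth, so reruns give the same score.

def chart_data_point_match(extracted: dict[str, float],
                           reference: dict[str, float],
                           rel_tol: float = 0.02) -> bool:
    """Pass iff every reference point is recovered within relative tolerance."""
    for label, ref_value in reference.items():
        got = extracted.get(label)
        if got is None:
            return False
        if abs(got - ref_value) > rel_tol * max(abs(ref_value), 1e-9):
            return False
    return True

reference = {"2021": 14.2, "2022": 17.8, "2023": 21.5}
extracted = {"2021": 14.2, "2022": 17.9, "2023": 21.5}  # small read-off error
print(chart_data_point_match(extracted, reference))      # True: within 2%
```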
OCR · CER
CodeSOTA's isolated CER evaluation — strips away the layout/structure scoring of OmniDocBench/olmOCR and reports raw character-error-rate on the same hold-out so vendor self-reports can be cross-checked. Lower is better. Used as the reproduction column on the OCR Power Ranking.
OCR · WER
Word-level companion to OCR · CER. Insensitive to how many characters are wrong inside a misrecognized word, but harsher on word-boundary and segmentation errors. Reported alongside CER for any model where CodeSOTA has run independent verification.
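A minimal sketch of both metrics as they are typically defined: Levenshtein edit distance over characters for CER and over whitespace-split tokens for WER, each divided by the reference length. Text-normalization rules (case, punctuation, whitespace) vary between evaluations and are left out here.

```python
# CER and WER from the same edit-distance routine, applied to characters
# versus whitespace-split words; lower is better for both.

def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

ref = "the invoice total is 1250 euros"
hyp = "the invoice total is 1230 euros"  # one character misread
print(f"CER={cer(ref, hyp):.3f}  WER={wer(ref, hyp):.3f}")
# One wrong character barely moves CER but counts as a whole wrong word for WER.
```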