CodeSOTA · Lineage · OCR Benchmarks
12 benchmarks · 11 edges · Updated 2026-04-27
Benchmark lineage

OCR Benchmarks

How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. Attention path tracks the frontier focus; branches show language-specific forks and metric-isolated variants.

Editor's note

OCR evaluation has had three distinct eras. (1) Pre-VLM (2002–2018): isolated word/line recognition on IAM, ICDAR scene text, RIMES — the question was 'can the model read'. (2) Form & document era (2019–2022): FUNSD, SROIE and DocVQA shifted the question to 'can the model find what it's reading'. (3) VLM era (2023–): OCRBench bundled five sub-tasks into one composite; OCRBench v2 expanded coverage to 31 sub-tasks and 10K human-verified items.

Attention then moved off 'can it read' entirely — OmniDocBench scores layout, tables, formulas and reading order as a composite, olmOCR-Bench evaluates PDF parsing by per-page pass rate, and ParseBench (LlamaIndex, 2026) introduced rule-based agent evals across five orthogonal axes.

As of April 2026, frontier VLMs and specialist OCR-VLMs (PaddleOCR-VL, dots.ocr, GLM-OCR) cluster within 2 points on the OmniDocBench composite — the frontier is now the long tail: handwriting, non-Latin scripts, tables with merged cells, scanned legacy PDFs. Language-specific benchmarks (KITAB-Bench for Arabic, ThaiOCRBench, PolEval-OCR) are the working benchmarks where the gap to humans is still material.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path; dashed arrows mark scope shifts (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Each node is covered in detail in § 02.

Legend: edges are attention path, scope shift, or branch/fork; node status is active, saturating, or saturated/superseded.

[Lineage graph. Nodes: IAM (Sep 2002, SOTA 23.20); ICDAR 2015 (Aug 2015); FUNSD (May 2019); OCRBench (May 2023); OCRBench v2 (Dec 2024, SOTA 62.20); KITAB-Bench (Feb 2025, SOTA 4.95); ThaiOCRBench (Jun 2025, SOTA 0.84); OmniDocBench (Dec 2024, SOTA 97.50); olmOCR-Bench (Mar 2025, SOTA 99.90); ParseBench (Jan 2026, SOTA 84.9%); OCR · CER (Mar 2026); OCR · WER (Mar 2026). Edges follow below.]
IAM → ICDAR 2015 · scope shift
From clean handwritten text to incidental scene text — same 'read the pixels' task, fundamentally different visual domain. Spawned the decade of detection-then-recognition pipelines.
IAM → FUNSD · scope shift · attention
From transcription to structure: FUNSD reframed OCR as 'find the question, link to its answer' rather than 'recognise every character'. The shift that produced LayoutLM and the entire form-understanding line.
FUNSD → OCRBench · scope shift · attention
Once VLMs could read at all, evaluation needed to span more than forms. OCRBench bundled scene text, document VQA, KIE and handwritten math into one composite — the first VLM-era OCR benchmark.
OCRBench → OCRBench v2 · direct successor · attention
10× more items, human-verified, EN+ZH parity, four public/private splits to combat contamination. Original v1 saturated within 18 months; v2 reopened the gap.
OCRBench v2 → KITAB-Bench · fork
Arabic-script-specific fork — same VLM-era 'comprehensive OCR' framing, but on a script that English-centric benchmarks under-cover. Frontier models still post CER >0.13.
OCRBench v2 → ThaiOCRBench · fork
Thai-script equivalent — tree-edit-distance (TED) scoring over parse trees. Another working benchmark for a writing system that gets near-zero attention in English-centric papers.
OCRBench v2 → OmniDocBench · scope shift · attention
From character-level recognition to document-level fidelity. OmniDocBench scores layout, tables, formulas and reading order — answering 'did the system reconstruct this document' rather than 'did it read each glyph'.
OmniDocBench → olmOCR-Bench · direct successor · attention
PDF-focused, harder. olmOCR-Bench targets the failure modes OmniDocBench averages out — old scans, math equations, mixed columns, headers/footers. 'Where does it actually break.'
olmOCR-Bench → ParseBench · scope shift · attention
Agent-grade document parsing: enterprise docs (insurance, finance, government), 169K rule-based tests across five orthogonal axes, no LLM-as-judge. The frontier where 'OCR' meets 'agent ingestion pipeline'.
OmniDocBench → OCR · CER · variant
Strips OmniDocBench's composite back down to character-error-rate on the same hold-out so vendor claims can be reproduced in isolation. CodeSOTA's verified column.
OCR · CER → OCR · WER · variant
Word-level companion. Reported alongside CER for any model where CodeSOTA has run independent verification.
§ 02 · Benchmarks in this lineage

Nodes in detail.

IAM

IAM Handwriting Database

657 writers, 1,539 scanned forms, 13,353 handwritten English text lines. The foundational handwriting recognition benchmark — every modern HTR paper still reports CER/WER on IAM. Long saturated for specialist handwriting models; remains a real gap for general-purpose VLMs.

Marti & Bunke (University of Bern) · paper
Sep 2002 · Saturated

ICDAR 2015

ICDAR 2015 Robust Reading Competition

Incidental scene text — photos taken without intent to capture text, oriented arbitrarily. The dataset that pushed STR from clean cropped words to real-world reading. Spawned a decade of detection+recognition pipelines (CRAFT, EAST, ABINet) before VLMs absorbed the task wholesale.

Karatzas et al. · paper
Aug 2015 · Saturated

FUNSD

Form Understanding in Noisy Scanned Documents

199 noisy scanned forms with semantic role labels (header, question, answer, other) and key-value links. The benchmark that re-framed OCR as 'find the structure', not 'read the pixels'. Predecessor to LayoutLM and the entire form-understanding line.

Jaume et al. · paper
May 2019 · Saturated

OCRBench

OCRBench v1

1,000 questions across 5 OCR tasks (text recognition, scene-text VQA, document VQA, key-info extraction, handwritten math). The first widely-cited composite specifically for VLM-era OCR. Saturated by 2024 — top closed models all >700/1000.

Liu et al. · paper
May 2023 · Superseded

OCRBench v2

OCRBench v2

10,000 human-verified items across 31 sub-tasks in English and Chinese, four splits (public/private × EN/ZH). The current standard for 'can a VLM read' — frontier models cluster around 60% on EN-private, leaving real headroom. Where most VLM papers report their OCR claim.

Fu et al. · paper

OmniDocBench

OmniDocBench

End-to-end document parsing scored as a composite over text edit distance, table TEDS, formula edit distance, reading order and layout mAP. The benchmark that moved OCR scoring from 'character accuracy' to 'document fidelity'. The current most-cited surface for OCR-VLM bake-offs.

Ouyang et al. (OpenDataLab / Shanghai AI Lab) · paper
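The official component weights and matching logic live in the OmniDocBench repository. As a rough sketch of the two workhorse scores, normalized edit distance for text and TEDS for tables (TEDS = 1 - TED / max(|T_pred|, |T_gold|), computed over HTML table trees), the helpers below use our own names, not OmniDocBench's API.

```python
# Sketch of two OmniDocBench-style component scores. Function names are
# illustrative; the official implementation is in the OmniDocBench repo.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gold: str) -> float:
    """Text component: edit distance scaled to [0, 1]; lower is better."""
    if not pred and not gold:
        return 0.0
    return edit_distance(pred, gold) / max(len(pred), len(gold))

def teds(ted: int, pred_nodes: int, gold_nodes: int) -> float:
    """Table component: Tree-Edit-Distance-based Similarity,
    1 - TED / max(|T_pred|, |T_gold|); higher is better."""
    return 1.0 - ted / max(pred_nodes, gold_nodes)
```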

KITAB-Bench

KITAB-Bench Arabic OCR

Arabic-script OCR across 9 domains (newspapers, books, PDFs, handwriting). The honest stress-test for non-Latin scripts — frontier closed models still post CER >0.13 on the easy split. Companion to OCRBench v2's English-and-Chinese focus.

Heakl et al. (MBZUAI) · paper

olmOCR-Bench

olmOCR-Bench (PDF Document Parsing)

1,403 PDF pages across nine sub-categories (arXiv, headers/footers, multi-column, old-scans, tables, long-tiny-text, etc.) scored on pass-rate per page. Harder than OmniDocBench because it targets the failure modes — old scans, math, mixed columns. The 'where does it actually break' surface.

Allen Institute for AI (Ai2) · paper
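The test types and their aggregation are defined in Ai2's olmOCR repository; the sketch below only illustrates what per-page pass-rate scoring looks like, with invented field names.

```python
# Illustration of pass-rate aggregation, assuming each page carries a set
# of machine-checkable unit tests (text present, order preserved, table
# relations intact, ...). Field names are invented; the real test types
# and aggregation live in Ai2's olmOCR repository.
from collections import defaultdict

def pass_rates(results: list[dict]) -> dict[str, float]:
    """results: [{'category': 'old_scans', 'passed': True}, ...]
    Returns the fraction of tests passed per sub-category."""
    by_cat = defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r["passed"])
    return {cat: sum(flags) / len(flags) for cat, flags in by_cat.items()}
```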

ThaiOCRBench

ThaiOCRBench

Thai-script document OCR with TED scoring (tree edit distance over the parse tree). A tight, modern benchmark for a script that gets ~zero attention in English-centric papers. Claude Sonnet 4 leads at 0.84.

SCB 10X · paper
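ThaiOCRBench's parse-tree construction isn't detailed here, but TED itself is the classic Zhang–Shasha tree edit distance, which the Python zss package implements. The document trees below are hypothetical.

```python
# Tree edit distance (TED) in general, via the zss package (Zhang-Shasha).
# These toy document trees are hypothetical, not ThaiOCRBench's format.
from zss import Node, simple_distance

gold = Node("doc").addkid(Node("heading")).addkid(
    Node("table").addkid(Node("row")).addkid(Node("row")))
pred = Node("doc").addkid(Node("paragraph")).addkid(   # heading mislabeled
    Node("table").addkid(Node("row")))                 # one row dropped

# Minimum number of node insert/delete/relabel operations: here 2
# (relabel heading -> paragraph, delete one row).
print(simple_distance(gold, pred))
```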

ParseBench

ParseBench: Document Parsing for AI Agents

2,078 human-verified pages from 1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five orthogonal axes: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, visual grounding. No LLM-as-judge — every score is reproducible. The current frontier for agent-document workloads.

LlamaIndex · paper
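The actual harness and metrics (GTRM, ChartDataPointMatch) are LlamaIndex's; the hypothetical check below just shows what 'rule-based, no LLM-as-judge' means in practice: a deterministic assertion against parsed output, so every score replays identically.

```python
# Hypothetical shape of one ParseBench-style rule-based test. All field
# names are invented; the point is that the check is deterministic, so
# the pass/fail result is reproducible without an LLM judge.

def test_table_cell(parsed: dict, page: int, row: int, col: int,
                    expected: str) -> bool:
    """Pass iff the parser recovered exactly this cell value."""
    try:
        cell = parsed["pages"][page]["tables"][0]["cells"][row][col]
    except (KeyError, IndexError, TypeError):
        return False
    return cell.strip() == expected.strip()

# The benchmark score is then a pass rate over ~169K such checks,
# broken out along the five axes.
```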

OCR · CER

Character Error Rate (CodeSOTA)

CodeSOTA's isolated CER evaluation — strips away the layout/structure scoring of OmniDocBench/olmOCR and reports raw character-error-rate on the same hold-out so vendor self-reports can be cross-checked. Lower is better. Used as the reproduction column on the OCR Power Ranking.

CodeSOTA · paper
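CER is Levenshtein distance over characters divided by the reference length. CodeSOTA's exact normalization (case, whitespace, Unicode) isn't specified here, so treat the snippet below, which uses the jiwer package, as the textbook metric rather than their pipeline.

```python
# Textbook CER via the jiwer package: character-level edit distance
# divided by the reference length. CodeSOTA's own normalization rules
# are not reproduced here.
import jiwer

ref = "The quick brown fox"
hyp = "The quick brwn fox"   # one character dropped
print(jiwer.cer(ref, hyp))   # 1 edit / 19 reference chars ~= 0.053
```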

OCR · WER

Word Error Rate (CodeSOTA)

Word-level companion to OCR · CER. Higher tolerance to single-character substitutions, harsher penalty for word boundary errors. Reported alongside CER for any model where CodeSOTA has run independent verification.

CodeSOTA · paper
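Scoring the same pair at word level shows the boundary sensitivity: one mangled character flips an entire token, so WER reads 0.25 where CER read roughly 0.05. Same caveat as above on normalization.

```python
# Same hypothesis scored at word level with jiwer: the dropped character
# turns one whole token into a substitution.
import jiwer

ref = "The quick brown fox"
hyp = "The quick brwn fox"
print(jiwer.wer(ref, hyp))   # 1 substituted word / 4 reference words = 0.25
```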