OCR Benchmarks
How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. Attention path tracks the frontier focus; branches show language-specific forks and metric-isolated variants.
OCR evaluation has had three distinct eras. (1) Pre-VLM (2002–2018): isolated word/line recognition on IAM, ICDAR scene-text, RIMES; the question was 'can the model read?'. (2) Form & document era (2019–2022): FUNSD, SROIE and DocVQA shifted the question to 'can the model find what it's reading?'. (3) VLM era (2023–): OCRBench bundled five sub-tasks into one composite; OCRBench v2 expanded that to 10K human-verified items across 31 sub-tasks. Attention then moved off 'can it read' entirely: OmniDocBench scores layout, tables, formulas and reading order as a composite, olmOCR-Bench evaluates PDF pass-rate at the page level, and ParseBench (LlamaIndex 2026) introduced rule-based agent evals across five orthogonal axes. As of April 2026, frontier VLMs and specialist OCR-VLMs (PaddleOCR-VL, dots.ocr, GLM-OCR) cluster within 2 points on the OmniDocBench composite; the frontier is now the long tail (handwriting, non-Latin scripts, tables with merged cells, scanned legacy PDFs). Language-specific benchmarks (KITAB-Bench for Arabic, ThaiOCRBench, PolEval-OCR) are the working benchmarks where the gap to humans is still material.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
IAM
657 writers, 1,539 scanned forms, 13,353 handwritten English text lines. The foundational handwriting recognition benchmark; every modern HTR paper still reports CER/WER on IAM. Long since saturated by specialist handwriting models; remains a real gap for general-purpose VLMs.
ICDAR 2015
Incidental scene text: photos taken without intent to capture text, oriented arbitrarily. The dataset that pushed scene-text recognition from clean cropped words to real-world reading. Spawned a decade of detection+recognition pipelines (CRAFT, EAST, ABINet) before VLMs absorbed the task wholesale.
FUNSD
199 noisy scanned forms with semantic role labels (header, question, answer, other) and key-value links. The benchmark that re-framed OCR as 'find the structure', not 'read the pixels'. Predecessor to LayoutLM and the entire form-understanding line.
OCRBench
1,000 questions across 5 OCR tasks (text recognition, scene-text VQA, document VQA, key-info extraction, handwritten math). The first widely-cited composite specifically for VLM-era OCR. Saturated by 2024 — top closed models all >700/1000.
OCRBench v2
10,000 human-verified items across 31 sub-tasks in English and Chinese, four splits (public/private × EN/ZH). The current standard for 'can a VLM read' — frontier models cluster around 60% on EN-private, leaving real headroom. Where most VLM papers report their OCR claim.
OmniDocBench
End-to-end document parsing scored as a composite over text edit distance, table TEDS, formula edit distance, reading order and layout mAP. The benchmark that moved OCR scoring from 'character accuracy' to 'document fidelity'. The current most-cited surface for OCR-VLM bake-offs.
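A minimal sketch of how a composite of this shape can be computed, assuming per-document predictions and references for text, formulas and tables. The edit-distance normalization, the placeholder TEDS value and the equal weighting below are illustrative assumptions; OmniDocBench defines its own matching, weights and scorers.

```python
# Simplified composite-style scoring over three axes. The real OmniDocBench
# pipeline does block matching, per-category weighting and its own TEDS scorer;
# this only shows how per-axis scores end up on one comparable scale.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means an exact match."""
    denom = max(len(pred), len(ref)) or 1
    return edit_distance(pred, ref) / denom

def composite(text_ned: float, formula_ned: float, table_teds: float) -> float:
    """Hypothetical equal-weight composite; higher is better.
    Edit distances are converted to similarities, TEDS already is one."""
    return ((1 - text_ned) + (1 - formula_ned) + table_teds) / 3

print(composite(
    text_ned=normalized_edit_distance("The quick brown fox", "The quick brown fax"),
    formula_ned=normalized_edit_distance(r"E = mc^2", r"E = mc^{2}"),
    table_teds=0.92,  # placeholder similarity from a tree-matching table scorer
))
```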
KITAB-Bench
Arabic-script OCR across 9 domains (newspapers, books, PDFs, handwriting). The honest stress-test for non-Latin scripts: frontier closed models still post CER >0.13 on the easy split. The Arabic companion to OCRBench v2's English-and-Chinese coverage.
olmOCR-Bench
1,403 PDF pages across nine sub-categories (arXiv, headers/footers, multi-column, old-scans, tables, long-tiny-text, etc.) scored on pass-rate per page. Harder than OmniDocBench because it targets the failure modes — old scans, math, mixed columns. The 'where does it actually break' surface.
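A minimal sketch of page-level pass-rate, assuming each page carries a set of binary checks (for example, required text present, running headers absent). The check representation and the all-checks-must-pass rule are assumptions; olmOCR-Bench defines its own unit-test types and aggregation.

```python
# Page-level pass-rate over binary checks: a page counts only if every check
# on it holds, and the score is the fraction of pages that pass.

from typing import Callable

Check = Callable[[str], bool]

def page_passes(parsed_text: str, checks: list[Check]) -> bool:
    """A page passes only if every check on it holds."""
    return all(check(parsed_text) for check in checks)

def pass_rate(pages: list[tuple[str, list[Check]]]) -> float:
    """Fraction of pages whose every check passes."""
    passed = sum(page_passes(text, checks) for text, checks in pages)
    return passed / len(pages)

# Hypothetical example: two pages, one fails because a running header leaked in.
pages = [
    ("Section 3. Results\nAccuracy improved by 4.2 points.",
     [lambda t: "Results" in t, lambda t: "Page 7 of 12" not in t]),
    ("Page 7 of 12\nTable 2 shows ...",
     [lambda t: "Table 2" in t, lambda t: "Page 7 of 12" not in t]),
]
print(pass_rate(pages))  # 0.5
```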
ThaiOCRBench
Thai-script document OCR with TED scoring (tree edit distance over the parse tree). A tight, modern benchmark for a script that gets ~zero attention in English-centric papers. Claude Sonnet 4 leads at 0.84.
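A minimal sketch of tree edit distance over two tiny parse trees, using the off-the-shelf zss implementation of the Zhang-Shasha algorithm (pip install zss). The tree shapes, labels and unit costs are hypothetical; ThaiOCRBench defines its own parse representation and cost model.

```python
# Tree edit distance between a reference parse and a prediction that mislabels
# the heading and drops one table cell; unit costs for relabel/insert/delete.

from zss import Node, simple_distance

# Reference parse: a document with a heading and a two-cell table row.
ref = (Node("doc")
       .addkid(Node("heading"))
       .addkid(Node("table").addkid(Node("row")
                                    .addkid(Node("cell"))
                                    .addkid(Node("cell")))))

# Prediction: the heading was read as a plain paragraph and one cell was lost.
pred = (Node("doc")
        .addkid(Node("paragraph"))
        .addkid(Node("table").addkid(Node("row").addkid(Node("cell")))))

print(simple_distance(ref, pred))  # 2: one relabel + one deletion
```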
ParseBench
2,078 human-verified pages from 1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five orthogonal axes: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, visual grounding. No LLM-as-judge — every score is reproducible. The current frontier for agent-document workloads.
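A minimal sketch of what a rule-based, reproducible check can look like for chart data, in the spirit of a data-point-match test: every reference (label, value) pair must be recovered within a numeric tolerance. The function name, tolerance and data structures are hypothetical, not ParseBench's actual definitions.

```python
# Deterministic chart check: no LLM judge, just an assertion over the
# extracted series against ground truth, so reruns give the same score.

def chart_data_point_match(extracted: dict[str, float],
                           reference: dict[str, float],
                           rel_tol: float = 0.02) -> bool:
    """Pass iff every reference point is recovered within relative tolerance."""
    for label, ref_value in reference.items():
        got = extracted.get(label)
        if got is None:
            return False
        if abs(got - ref_value) > rel_tol * max(abs(ref_value), 1e-9):
            return False
    return True

reference = {"2021": 14.2, "2022": 17.8, "2023": 21.5}
extracted = {"2021": 14.2, "2022": 17.9, "2023": 21.5}  # small read-off error
print(chart_data_point_match(extracted, reference))      # True: within 2%
```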
OCR · CER
CodeSOTA's isolated CER evaluation — strips away the layout/structure scoring of OmniDocBench/olmOCR and reports raw character-error-rate on the same hold-out so vendor self-reports can be cross-checked. Lower is better. Used as the reproduction column on the OCR Power Ranking.
OCR · WER
Word-level companion to OCR · CER. Insensitive to how many characters are wrong inside a misrecognized word, but harsher on word-boundary and segmentation errors. Reported alongside CER for any model where CodeSOTA has run independent verification.
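A minimal sketch of both metrics as they are typically defined: Levenshtein edit distance over characters for CER and over whitespace-split tokens for WER, each divided by the reference length. Text-normalization rules (case, punctuation, whitespace) vary between evaluations and are left out here.

```python
# CER and WER from the same edit-distance routine, applied to characters
# versus whitespace-split words; lower is better for both.

def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

ref = "the invoice total is 1250 euros"
hyp = "the invoice total is 1230 euros"  # one character misread
print(f"CER={cer(ref, hyp):.3f}  WER={wer(ref, hyp):.3f}")
# One wrong character barely moves CER but counts as a whole wrong word for WER.
```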