Optical Character Recognition
Extracting text from document images
Document OCR extracts text from document images (scans, photos of pages, PDFs rendered to images) with awareness of document structure. Unlike scene text OCR, it handles full pages with paragraphs, tables, headers, and reading order. Modern pipelines (PaddleOCR, Surya, Tesseract 5) achieve 99%+ character accuracy on clean printed documents, but the challenge is maintaining that accuracy across diverse real-world document types and conditions.
History
HP develops the Tesseract OCR engine — originally for HP scanners, it later becomes the world's most widely used OCR tool
ABBYY FineReader and OmniPage dominate commercial document OCR; accuracy exceeds 99% on clean printed text for major languages
Google open-sources Tesseract; academic research shifts to neural approaches
Tesseract 4.0 replaces character-based recognition with LSTM-based line recognition, dramatically improving accuracy on degraded text
PaddleOCR v1 released by Baidu — begins the modern open-source OCR toolkit era with detection + recognition + classification
PaddleOCR v3 (PP-OCRv3) achieves best speed-accuracy tradeoff for document OCR; supports 80+ languages
Tesseract 5.0 adds more LSTM models; remains the default for academic and low-resource projects despite being overtaken in accuracy
Surya OCR emerges as the accuracy leader for multilingual document OCR; EasyOCR and docTR provide alternative open-source options
PaddleOCR v4 and Surya 2.0 push document OCR accuracy further; vision-language models (VLMs) such as GPT-4o approach dedicated document OCR engines in quality
How Optical Character Recognition Works
Page Preprocessing
Document images are deskewed (correcting rotation up to ±45°), denoised (removing scanner artifacts), and optionally binarized. Resolution is checked — OCR works best at 300 DPI; lower resolutions degrade accuracy.
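The binarization step is commonly done with Otsu's method, which picks the global threshold that maximizes between-class variance of the pixel histogram. A minimal NumPy sketch (illustrative only; function names are not from any particular engine):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold maximizing between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    probs = hist / hist.sum()
    omega = np.cumsum(probs)                 # P(pixel <= t) for each t
    mu = np.cumsum(probs * np.arange(256))   # cumulative mean up to t
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)         # empty classes contribute 0
    return int(np.argmax(sigma_b))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Map a grayscale page to a black/white image using Otsu's threshold."""
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8) * 255
```

Real pipelines usually prefer adaptive (local) thresholding for unevenly lit scans; the global version above shows the core idea.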
Text Detection
DBNet or CRAFT detects text regions and produces word-level or line-level bounding boxes. For documents, detection is often simpler than in scene-text images because text is typically horizontal and well separated, but tables and multi-column layouts require layout-aware detection.
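When the detector emits word-level boxes, they are typically merged into text lines before recognition. A sketch of vertical-overlap grouping (the 0.5 overlap threshold is an illustrative choice, not a fixed convention):

```python
def group_into_lines(boxes, overlap_thresh=0.5):
    """Group word boxes (x0, y0, x1, y1) into text lines.

    Two boxes join the same line when their vertical intersection exceeds
    `overlap_thresh` of the smaller height involved.
    """
    lines = []  # each line is a list of boxes
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        x0, y0, x1, y1 = box
        for line in lines:
            ly0 = min(b[1] for b in line)
            ly1 = max(b[3] for b in line)
            inter = min(y1, ly1) - max(y0, ly0)
            min_h = min(y1 - y0, ly1 - ly0)
            if min_h > 0 and inter / min_h >= overlap_thresh:
                line.append(box)
                break
        else:
            lines.append([box])
    # Within each line, order words left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```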
Line Recognition
Cropped text lines are recognized by a sequence model: ViT/CNN encoder → LSTM/transformer decoder → character sequence. PaddleOCR uses a lightweight PP-OCR architecture; Surya uses larger transformer models for higher accuracy.
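Recognizers in this family are commonly trained with CTC, whose greedy decode collapses repeated argmax symbols and drops blanks. An illustrative sketch (the blank-at-index-0 convention is an assumption, not any specific engine's API):

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: collapse repeats, then remove blanks.

    logits:  T x C list of per-timestep class scores
    charset: mapping from class index to character; index `blank`
             is reserved for the CTC blank symbol
    """
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    out, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)
```

Beam-search decoding with a language model typically recovers a further fraction of a percent of accuracy over this greedy version.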
Reading Order + Structure
Recognized text blocks are sorted into reading order (left-to-right, top-to-bottom, with column detection). Paragraphs are assembled from adjacent lines. Headers, footers, and page numbers are identified and separated.
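The column-aware sort described above can be sketched as: cluster blocks into columns by horizontal gaps, then read each column top to bottom, columns left to right (the 40-pixel gap is an illustrative parameter, not a standard value):

```python
def reading_order(blocks, min_gap=40):
    """Sort text blocks (x0, y0, x1, y1) into reading order.

    Blocks whose left edge falls within `min_gap` of the current
    column's right extent join that column; otherwise a new column
    starts. Columns are then read left to right, top to bottom.
    """
    columns = []
    for b in sorted(blocks, key=lambda blk: blk[0]):
        if columns and b[0] <= max(c[2] for c in columns[-1]) + min_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    ordered = []
    for col in columns:  # columns are already left to right
        ordered.extend(sorted(col, key=lambda blk: (blk[1], blk[0])))
    return ordered
```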
Output
The pipeline outputs structured text with bounding box coordinates (hOCR format), plain text in reading order, or structured formats (JSON with regions and text). Confidence scores per character/word enable downstream quality filtering.
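A minimal sketch of the JSON-style output with confidence filtering (the field names are illustrative, not a standard schema like hOCR):

```python
import json

def to_structured_output(lines, min_conf=0.5):
    """Emit OCR results as JSON regions, dropping low-confidence entries.

    `lines` is a list of (text, bbox, confidence) triples already in
    reading order; bbox is (x0, y0, x1, y1) in pixels.
    """
    regions = [
        {"text": text, "bbox": list(bbox), "confidence": round(conf, 3)}
        for text, bbox, conf in lines
        if conf >= min_conf
    ]
    plain_text = "\n".join(r["text"] for r in regions)
    return json.dumps({"regions": regions, "text": plain_text}, indent=2)
```

Downstream consumers can raise `min_conf` to trade recall for precision, e.g. when feeding OCR text into an indexing pipeline that tolerates gaps but not garbage.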
Current Landscape
Document OCR in 2025 is a mature technology with clear market segmentation. For clean, printed, well-scanned documents in major languages, the problem is solved — 99%+ accuracy from multiple tools. The differentiation is now in: (1) multilingual support for underserved scripts, (2) handling of degraded and historical documents, (3) integration with layout analysis and table extraction, and (4) speed optimization for batch processing. PaddleOCR dominates for production speed, Surya leads on multilingual accuracy, and Tesseract persists for its ubiquity and zero-cost deployment. Cloud APIs (Google, Azure, AWS) serve enterprises willing to pay for the best accuracy and simplest integration.
Key Challenges
Multi-column layouts — correctly separating and ordering text in multi-column documents (newspapers, academic papers) is a significant source of errors
Table text extraction — reading text within table cells in the correct row/column order requires combining OCR with table structure recognition
Mixed content — pages with text, handwriting, stamps, logos, and images require the OCR system to distinguish readable text from non-text elements
Historical and degraded documents — yellowed paper, bleed-through, faded ink, and physical damage cause character-level errors that compound across pages
Speed vs. accuracy tradeoff — Tesseract is fast but less accurate; Surya is accurate but slower; production systems must balance throughput with quality
Quick Recommendations
Best open-source accuracy: Surya 2.0. Highest character accuracy across multilingual document OCR benchmarks; supports 90+ languages.
Best speed-accuracy balance: PaddleOCR v4 (PP-OCRv4). Optimized for production throughput; excellent accuracy at 10-50 pages/second on GPU; mobile deployment support.
Maximum compatibility: Tesseract 5 (LSTM mode). Available everywhere, supports 100+ languages, extensive community; adequate accuracy for clean documents.
Cloud API (enterprise): Google Cloud Document AI or Azure Document Intelligence. Highest accuracy with layout understanding built in; SLA guarantees; handles tables and forms natively.
Integrated pipeline: Docling or Marker (both Surya-based). Document OCR + layout analysis + table extraction in one pipeline; output to Markdown or JSON.
What's Next
Document OCR is converging with document parsing — the trend is toward end-to-end models that read, structure, and understand documents simultaneously rather than separate OCR → layout → extraction pipelines. VLMs will increasingly perform OCR implicitly as part of document understanding. Remaining frontiers: real-time OCR for augmented reality (translating signs, menus, documents through phone cameras), 100+ language coverage including endangered scripts, and OCR for non-traditional documents (whiteboards, handwritten notes, screen captures).
Benchmarks & SOTA
All entries below are datasets tracked on Papers With Code unless a description is given. Format — dataset: state-of-the-art model, score (metric).

scut-ctw1500: FAST-T-512, 129.1 (fps)
cnn-/-daily-mail: Scrambled code + broken (alter), 48.18 (rouge-1)
icdar2013: DTrOCR 105M, 99.4 (accuracy)
dart: FactT5B, 97.6 (factspotter)
icdar2015: DTrOCR 105M, 93.5 (accuracy)
tabfact: ARTEMIS-DA, 93.1 (test)
sun-rgb-d: IM3D, 64.4 (iou)
inverse-text: DeepSolo (ViTAEv2-S, TextOCR), 75.8 (f-measure-full-lexicon)
videodb's-ocr-benchmark-public-collection: GPT-4o (OpenAI), 76.22 (accuracy)
pendigits: DnC-SC, 82.86 (nmi)
CodeSearchNet: benchmark for code summarization (docstring generation) across 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go); over 2M (code, docstring) pairs; primary metric BLEU-4. SOTA: GPT-4o (OpenAI), 25.3 (bleu-4)
lam(line-level): GFCN, 18.5 (test-wer)
howsumm-step: LexRank (query: step title), 39.6 (rouge-1)
e2e: HTLM (fine-tuning), 70.8 (rouge-l)
howsumm-method: LexRank (query: method + article + steps titles), 53.5 (rouge-1)
read2016(line-level): Span, 21.1 (test-wer)
iam(line-level): GFCN, 28.6 (test-wer)
urdudoc: ContourNet [69], 88.68 (recall)
belfort: PyLaia (human transcriptions + random split), 28.11 (wer)
reuters-21578: ApproxRepSet, 97.17 (accuracy)
KITAB-Bench (KITAB Arabic OCR Benchmark): 8,809 Arabic text samples across 9 domains; tests Arabic script recognition. SOTA: PaddleOCR (Baidu), 0.790 (cer)
wikibio: MBD, 56.16 (parent)
codesearchnet---java: CodeTrans-MT-Large, 21.87 (smoothed-bleu-4)
codesearchnet---javascript: Transformer, 25.61 (smoothed-bleu-4)
codesearchnet---php: CodeTrans-MT-Base, 26.23 (smoothed-bleu-4)
codesearchnet---go: CodeBERT (MLM), 26.79 (smoothed-bleu-4)
benchmarking-chinese-text-recognition:-datasets,-b: DTrOCR, 89.6 (accuracy)
codesearchnet---ruby: CodeTrans-MT-Base, 15.26 (smoothed-bleu-4)
codesearchnet---python: CodeTrans-MT-Base, 20.39 (smoothed-bleu-4)
tobacco-small-3482: Optimized Text CNN, 84 (accuracy)
mldoc-zero-shot-english-to-french: XLMft UDA, 96.05 (accuracy)
webnlg-(all): HTLM (fine-tuning), 55.6 (bleu)
mldoc-zero-shot-english-to-spanish: XLMft UDA, 96.8 (accuracy)
hoc: BioLinkBERT (large), 88.1 (f1)
webnlg-(seen): HTLM (fine-tuning), 65.4 (bleu)
webnlg-(unseen): HTLM (fine-tuning), 48.4 (bleu)
wikipedia-person-and-animal-dataset: VTM, 45.36 (rouge)
mldoc-zero-shot-english-to-russian: XLMft UDA, 89.7 (accuracy)
ThaiOCRBench (Thai OCR Benchmark): 2,808 Thai text samples across 13 tasks; tests Thai script structural understanding. SOTA: Claude Sonnet 4 (Anthropic), 0.840 (ted-score)
mldoc-zero-shot-english-to-german: XLMft UDA, 96.95 (accuracy)
mldoc-zero-shot-english-to-chinese: XLMft UDA, 93.32 (accuracy)
stdw: RetinaNet, 0.780 (ap)
mldoc-zero-shot-english-to-italian: MultiFiT, pseudo, 76.02 (accuracy)
bbcsport: MPAD-path, 99.59 (accuracy)
read-2016: HTR-VT(line-level), 16.5 (wer)
reuters-rcv1/rcv2-german-to-english: Biinclusion (Euro500kReuters), 84.4 (accuracy)
rotowire: HierarchicalEncoder + NR + IR, 55.88 (content-selection-f1)
fsns---test: STREET, 27.54 (sequence-error)
sut: CNN, 86 (accuracy)
dareczech: Query-doc RobeCzech (Roberta-base), 46.73 (p-10)
cub-200-2011: Q-SENN, 85.9 (top-1-accuracy)
bbc-xsum: BigBird-Pegasus, 47.12 (rouge-1)
mldoc-zero-shot-english-to-japanese: MultiFiT, pseudo, 69.57 (accuracy)
(dataset name missing): ApproxRepSet, 72.6 (accuracy)
amazon: ApproxRepSet, 94.31 (accuracy)
reuters-rcv1/rcv2-english-to-german: Biinclusion (Euro500kReuters), 92.7 (accuracy)
aapd: KD-LSTMreg, 72.9 (f1)
cedar-signature: Siamese_MultiHeadCrossAttention_SoftAttention (Siamese_MHCA_SA), 5.7 (far)
classic: REL-RWMD k-NN, 96.85 (accuracy)
clueweb09-b: XLNet, 31.1 (ndcg-20)
dise-2021-dataset: JDeskew, 0.860 (percentage-correct)
i2l-140k: I2L-NOPOOL, 89.09 (bleu)
icdar-2019: DiT-L (Cascade), 96.55 (weighted-average-f1-score)
imdb-m: Document Classification Using Importance of Sentences, 54.8 (accuracy)
recipe: ApproxRepSet, 59.06 (accuracy)
scidocs-(mag): SPECTER, 82 (f1-micro)
scidocs-(mesh): SciNCL, 88.7 (f1-micro)
simara: DAN, 14.79 (wer)
textzoom: CCD-ViT-Small, 21.84 (average-psnr-db)
wos-5736: ConvTextTM, 91.28 (accuracy)
ba: ELSC, 51.8 (accuracy)
arxiv-summarization-dataset: DeepPyramidion, 19.99 (rouge-2)
arxiv-hep-th-citation-graph: DeepPyramidion, 47.15 (rouge-1)
wikilingua-(tr->en): DOCmT5, 31.37 (rouge-l)
lun: ChuLo, 64.4 (accuracy)
jaffe: ELSC, 98.6 (accuracy)
iris: ELSC, 97.7 (accuracy)
mldoc-zero-shot-german-to-french: BiLSTM (Europarl), 75.45 (accuracy)
mpqa: MPAD-path, 89.81 (accuracy)
pixraw10p: ELSC, 96 (accuracy)
re-docred: VaeDiff-DocRE, 0.790 (f1)
im2latex-100k: I2L-STRIPS, 88.86 (bleu)
and-dataset: Siamese_MHCA_SA, 0.810 (average-f1)
iam-d: StackMix+Blots, 3.01 (cer)
reuters-de-en: BilBOWA, 75 (accuracy)
reuters-en-de: BilBOWA, 86.5 (accuracy)
iam-b: StackMix+Blots, 3.77 (cer)
hyperpartisan-news-detection: ChuLo, 95.38 (accuracy)
hkr: StackMix+Blots, 3.49 (cer)
saint-gall: StackMix+Blots, 3.65 (cer)
scene-text-recognition-benchmarks: CCD-ViT-Small, 84.9 (accuracy)
wine: ELSC, 75.8 (accuracy)
wos-11967: HDLTex, 86.07 (accuracy)
food-101: Bert, 84.41 (accuracy)
wos-46985: HDLTex, 76.58 (accuracy)
ephoie: LayoutLMv3, 99.21 (average-f1)
dwie: VaeDiff-DocRE, 0.731 (f1)
docred-ie: REXEL, 60.1 (relation-f1)
digital-peter: StackMix+Blots, 2.5 (cer)
textseg: CCD-ViT-Small, 84.8 (iou)
yelp-14: KD-LSTMreg, 69.4 (accuracy)
cl-scisumm: GCN Hybrid, 33.88 (rouge-2)
bentham: StackMix+Blots, 1.73 (cer)
bc8: BioRex+Directionality, 56.06 (evaluation-macro-f1)
warppie10p: ELSC, 53.4 (accuracy)
australian: ELSC, 70.9 (accuracy)
Internal Mistral Benchmark: no results tracked yet
CodeSOTA Polish (CodeSOTA Polish OCR Benchmark): 1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe); tests character-level OCR on diacritics with contamination-resistant synthetic categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline). No results tracked yet.
PolEval 2021 OCR (PolEval 2021 OCR Post-Correction Task): 979 Polish books (69,000 pages) from 1791-1998; focuses on OCR post-correction using NLP methods; a major benchmark for Polish historical document processing. No results tracked yet.
SROIE (Scanned Receipts OCR and Information Extraction): 626 receipt images; key task is extracting company, date, address, and total from each receipt. No results tracked yet.
IMPACT-PSNC (IMPACT Polish Digital Libraries Ground Truth): 478 pages of ground truth from four Polish digital libraries at 99.95% accuracy; annotations at region, line, word, and glyph levels; Gothic and antiqua fonts. No results tracked yet.
OCR WER Benchmark: no results tracked yet
OCR CER Benchmark: no results tracked yet
CodeSOTA Verification: no results tracked yet