Computer Vision

Optical Character Recognition

Extracting text from document images

114 datasets · 696 results

Document OCR extracts text from document images (scans, photos of pages, PDFs rendered to images) with awareness of document structure. Unlike scene text OCR, it handles full pages with paragraphs, tables, headers, and reading order. Modern pipelines (PaddleOCR, Surya, Tesseract 5) achieve 99%+ character accuracy on clean printed documents, but the challenge is maintaining that accuracy across diverse real-world document types and conditions.

History

1985

HP develops the Tesseract OCR engine — originally for HP scanners, it would later become the world's most widely used OCR tool

2005

ABBYY FineReader and OmniPage dominate commercial document OCR; accuracy exceeds 99% on clean printed text for major languages

2006

Google open-sources Tesseract; academic research shifts to neural approaches

2017

Tesseract 4.0 replaces character-based recognition with LSTM-based line recognition, dramatically improving accuracy on degraded text

2019

PaddleOCR v1 released by Baidu — begins the modern open-source OCR toolkit era with detection + recognition + classification

2021

PaddleOCR v3 (PP-OCRv3) achieves best speed-accuracy tradeoff for document OCR; supports 80+ languages

2022

Tesseract 5.0 adds more LSTM models; remains the default for academic and low-resource projects despite being overtaken in accuracy

2023

Surya OCR emerges as the accuracy leader for multilingual document OCR; EasyOCR and docTR provide alternative open-source options

2024

PaddleOCR v4 and Surya 2.0 push document OCR accuracy further; VLMs (GPT-4o) approach document OCR engines in quality

How Optical Character Recognition Works

Optical Character Recognition Pipeline
1

Page Preprocessing

Document images are deskewed (correcting rotation up to ±45°), denoised (removing scanner artifacts), and optionally binarized. Resolution is checked — OCR works best at 300 DPI; lower resolutions degrade accuracy.
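
The binarization step can be illustrated with Otsu's method, which picks the global threshold that maximizes between-class variance. This is a minimal NumPy sketch with illustrative function names, not the preprocessing code of any particular engine:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    omega = np.cumsum(hist) / total                # P(class 0) per candidate threshold
    mu = np.cumsum(hist * np.arange(256)) / total  # cumulative mean
    mu_t = mu[-1]                                  # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def binarize(gray: np.ndarray) -> np.ndarray:
    """1 = background, 0 = ink, assuming dark text on a light page."""
    return (gray > otsu_threshold(gray)).astype(np.uint8)
```

A global threshold is only adequate for evenly lit scans; engines typically fall back to adaptive (local) thresholding when illumination is uneven.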

2

Text Detection

DBNet or CRAFT detects text regions and produces word-level or line-level bounding boxes. For documents, detection is often simpler than scene text because text is typically horizontal and well-separated, but tables and multi-column layouts require layout-aware detection.
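
DBNet and CRAFT are learned detectors; on clean single-column pages, the classical alternative is a horizontal projection profile. A hedged sketch (illustrative names, assumes a binarized page with 1 = ink):

```python
import numpy as np

def detect_lines(binary: np.ndarray, min_height: int = 2) -> list[tuple[int, int]]:
    """Return (top, bottom) row spans of text lines on a binary page."""
    ink_per_row = binary.sum(axis=1)
    spans, start = [], None
    for y, ink in enumerate(ink_per_row):
        if ink > 0 and start is None:
            start = y                      # line begins where ink appears
        elif ink == 0 and start is not None:
            if y - start >= min_height:    # ignore speckle-thin spans
                spans.append((start, y))
            start = None
    if start is not None:                  # line touching the page bottom
        spans.append((start, len(ink_per_row)))
    return spans
```

Real documents break this quickly — multi-column layouts and tables are exactly why layout-aware learned detection is used instead.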

3

Line Recognition

Cropped text lines are recognized by a sequence model: ViT/CNN encoder → LSTM/transformer decoder → character sequence. PaddleOCR uses a lightweight PP-OCR architecture; Surya uses larger transformer models for higher accuracy.
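
When the recognizer is trained with CTC (common in PP-OCR-style models), the decoder's per-timestep class probabilities are collapsed into a character string by removing repeats and blanks. A minimal greedy-decoding sketch, not tied to any specific engine:

```python
import numpy as np

BLANK = 0  # CTC blank class index (convention assumed here)

def ctc_greedy_decode(logits: np.ndarray, charset: str) -> str:
    """logits: (time_steps, num_classes); classes 1..N map to charset[0..N-1].
    Collapse consecutive repeats, then drop blanks."""
    best = logits.argmax(axis=1)
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)
```

Beam-search decoding with a language model typically recovers a few extra points of accuracy over this greedy version on degraded text.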

4

Reading Order + Structure

Recognized text blocks are sorted into reading order (left-to-right, top-to-bottom, with column detection). Paragraphs are assembled from adjacent lines. Headers, footers, and page numbers are identified and separated.
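
The sorting described above can be sketched as x-overlap clustering into columns, then a top-to-bottom sort within each column (illustrative only; production layout engines also handle nested regions, tables, and rotated text):

```python
def reading_order(boxes):
    """boxes: list of (x0, y0, x1, y1). Cluster boxes into columns by
    horizontal overlap, order columns left-to-right, boxes top-to-bottom."""
    columns = []  # each entry: [col_x0, col_x1, [boxes]]
    for box in sorted(boxes, key=lambda b: b[0]):
        x0, _, x1, _ = box
        for col in columns:
            if x0 < col[1] and x1 > col[0]:          # overlaps this column
                col[0], col[1] = min(col[0], x0), max(col[1], x1)
                col[2].append(box)
                break
        else:
            columns.append([x0, x1, [box]])
    columns.sort(key=lambda c: c[0])                 # left-to-right
    return [b for col in columns for b in sorted(col[2], key=lambda b: b[1])]
```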

5

Output

The pipeline outputs structured text with bounding box coordinates (hOCR format), plain text in reading order, or structured formats (JSON with regions and text). Confidence scores per character/word enable downstream quality filtering.
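
As an illustration, emitting minimal hOCR-style spans with a confidence filter might look like this (a sketch only; real engines emit full hOCR with page, carea, paragraph, and line levels):

```python
import html

def to_hocr(words, min_conf=0.5):
    """words: list of dicts {text, bbox=(x0, y0, x1, y1), conf in [0, 1]}.
    Emit a minimal hOCR fragment, dropping low-confidence words."""
    lines = ['<div class="ocr_page">']
    for w in words:
        if w["conf"] < min_conf:
            continue  # downstream quality filtering via confidence scores
        x0, y0, x1, y1 = w["bbox"]
        lines.append(
            f'  <span class="ocrx_word" title="bbox {x0} {y0} {x1} {y1}; '
            f'x_wconf {round(w["conf"] * 100)}">{html.escape(w["text"])}</span>'
        )
    lines.append("</div>")
    return "\n".join(lines)
```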

Current Landscape

Document OCR in 2025 is a mature technology with clear market segmentation. For clean, printed, well-scanned documents in major languages, the problem is solved — 99%+ accuracy from multiple tools. The differentiation is now in: (1) multilingual support for underserved scripts, (2) handling of degraded and historical documents, (3) integration with layout analysis and table extraction, and (4) speed optimization for batch processing. PaddleOCR dominates for production speed, Surya leads on multilingual accuracy, and Tesseract persists for its ubiquity and zero-cost deployment. Cloud APIs (Google, Azure, AWS) serve enterprises willing to pay for the best accuracy and simplest integration.

Key Challenges

Multi-column layouts — correctly separating and ordering text in multi-column documents (newspapers, academic papers) is a significant source of errors

Table text extraction — reading text within table cells in the correct row/column order requires combining OCR with table structure recognition

Mixed content — pages with text, handwriting, stamps, logos, and images require the OCR system to distinguish readable text from non-text elements

Historical and degraded documents — yellowed paper, bleed-through, faded ink, and physical damage cause character-level errors that compound across pages

Speed vs. accuracy tradeoff — Tesseract is fast but less accurate; Surya is accurate but slower; production systems must balance throughput with quality

Quick Recommendations

Best open-source accuracy

Surya 2.0

Highest character accuracy across multilingual document OCR benchmarks; supports 90+ languages

Best speed-accuracy balance

PaddleOCR v4 (PP-OCRv4)

Optimized for production throughput; excellent accuracy at 10-50 pages/second on GPU; mobile deployment support

Maximum compatibility

Tesseract 5 (LSTM mode)

Available everywhere, supports 100+ languages, extensive community; adequate accuracy for clean documents

Cloud API (enterprise)

Google Cloud Document AI or Azure Document Intelligence

Highest accuracy with layout understanding built in; SLA guarantees; handles tables and forms natively

Integrated pipeline

Docling or Marker (Marker builds on Surya)

Document OCR + layout analysis + table extraction in one pipeline; output to Markdown or JSON

What's Next

Document OCR is converging with document parsing — the trend is toward end-to-end models that read, structure, and understand documents simultaneously rather than separate OCR → layout → extraction pipelines. VLMs will increasingly perform OCR implicitly as part of document understanding. Remaining frontiers: real-time OCR for augmented reality (translating signs, menus, documents through phone cameras), 100+ language coverage including endangered scripts, and OCR for non-traditional documents (whiteboards, handwritten notes, screen captures).

Benchmarks & SOTA

scut-ctw1500

2020 · 82 results

Dataset from Papers With Code

State of the Art

FAST-T-512

129.1

fps

icdar2013

2020 · 39 results

Dataset from Papers With Code

State of the Art

DTrOCR 105M

99.4

accuracy

icdar2015

2020 · 26 results

Dataset from Papers With Code

State of the Art

DTrOCR 105M

93.5

accuracy

inverse-text

2020 · 18 results

Dataset from Papers With Code

State of the Art

DeepSolo (ViTAEv2-S, TextOCR)

75.8

f-measure-full-lexicon

videodb's-ocr-benchmark-public-collection

2020 · 15 results

Dataset from Papers With Code

State of the Art

GPT-4o

OpenAI

76.22

accuracy

lam(line-level)

2020 · 12 results

Dataset from Papers With Code

State of the Art

GFCN

18.5

test-wer

read2016(line-level)

2020 · 9 results

Dataset from Papers With Code

State of the Art

Span

21.1

test-wer

iam(line-level)

2020 · 9 results

Dataset from Papers With Code

State of the Art

GFCN

28.6

test-wer

urdudoc

2020 · 9 results

Dataset from Papers With Code

State of the Art

ContourNet

88.68

recall

belfort

2020 · 8 results

Dataset from Papers With Code

State of the Art

PyLaia (human transcriptions + random split)

28.11

wer

KITAB-Bench

KITAB Arabic OCR Benchmark

2024 · 8 results

8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.

State of the Art

PaddleOCR

Baidu

0.790

cer

benchmarking-chinese-text-recognition:-datasets,-b

2020 · 7 results

Dataset from Papers With Code

State of the Art

DTrOCR

89.6

accuracy

tobacco-small-3482

2020 · 6 results

Dataset from Papers With Code

State of the Art

Optimized Text CNN

84

accuracy

ThaiOCRBench

Thai OCR Benchmark

2024 · 5 results

2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.

State of the Art

Claude Sonnet 4

Anthropic

0.840

ted-score

stdw

2020 · 4 results

Dataset from Papers With Code

State of the Art

RetinaNet

0.780

ap

read-2016

2020 · 4 results

Dataset from Papers With Code

State of the Art

HTR-VT(line-level)

16.5

wer

fsns---test

2020 · 3 results

Dataset from Papers With Code

State of the Art

STREET

27.54

sequence-error

cedar-signature

2020 · 2 results

Dataset from Papers With Code

State of the Art

Siamese_MultiHeadCrossAttention_SoftAttention (Siamese_MHCA_SA)

5.7

far

dise-2021-dataset

2020 · 2 results

Dataset from Papers With Code

State of the Art

JDeskew

0.860

percentage-correct

i2l-140k

2020 · 2 results

Dataset from Papers With Code

State of the Art

I2L-NOPOOL

89.09

bleu

icdar-2019

2020 · 2 results

Dataset from Papers With Code

State of the Art

DiT-L (Cascade)

96.55

weighted-average-f1-score

simara

2020 · 2 results

Dataset from Papers With Code

State of the Art

DAN

14.79

wer

textzoom

2020 · 2 results

Dataset from Papers With Code

State of the Art

CCD-ViT-Small

21.84

average-psnr-db

im2latex-100k

2020 · 1 result

Dataset from Papers With Code

State of the Art

I2L-STRIPS

88.86

bleu

and-dataset

2020 · 1 result

Dataset from Papers With Code

State of the Art

Siamese_MHCA_SA

0.810

average-f1

iam-d

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

3.01

cer

iam-b

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

3.77

cer

hkr

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

3.49

cer

saint-gall

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

3.65

cer

scene-text-recognition-benchmarks

2020 · 1 result

Dataset from Papers With Code

State of the Art

CCD-ViT-Small

84.9

accuracy

ephoie

2020 · 1 result

Dataset from Papers With Code

State of the Art

LayoutLMv3

99.21

average-f1

digital-peter

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

2.5

cer

textseg

2020 · 1 result

Dataset from Papers With Code

State of the Art

CCD-ViT-Small

84.8

iou

bentham

2020 · 1 result

Dataset from Papers With Code

State of the Art

StackMix+Blots

1.73

cer

Internal Mistral Benchmark

0 results

No results tracked yet

CodeSOTA Polish

CodeSOTA Polish OCR Benchmark

2025 · 0 results

1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).

No results tracked yet

PolEval 2021 OCR

PolEval 2021 OCR Post-Correction Task

2021 · 0 results

979 Polish books (69,000 pages) from 1791-1998. Focus on OCR post-correction using NLP methods. Major benchmark for Polish historical document processing.

No results tracked yet

SROIE

Scanned Receipts OCR and Information Extraction

2019 · 0 results

626 receipt images. Key task: extract company, date, address, total from receipts.

No results tracked yet

IMPACT-PSNC

IMPACT Polish Digital Libraries Ground Truth

2012 · 0 results

478 pages of ground truth from four Polish digital libraries at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.

No results tracked yet

OCR WER Benchmark

0 results

No results tracked yet

OCR CER Benchmark

0 results

No results tracked yet

CodeSOTA Verification

0 results

No results tracked yet
