Optical Character Recognition
Extracting text from document images
Document OCR extracts text from document images (scans, photos of pages, PDFs rendered to images) with awareness of document structure. Unlike scene text OCR, it handles full pages with paragraphs, tables, headers, and reading order. Modern pipelines (PaddleOCR, Surya, Tesseract 5) achieve 99%+ character accuracy on clean printed documents, but the challenge is maintaining that accuracy across diverse real-world document types and conditions.
History
HP develops the Tesseract OCR engine — originally for HP scanners, it later becomes the world's most widely used OCR tool
ABBYY FineReader and OmniPage dominate commercial document OCR; accuracy exceeds 99% on clean printed text for major languages
Google open-sources Tesseract; academic research shifts to neural approaches
Tesseract 4.0 replaces character-based recognition with LSTM-based line recognition, dramatically improving accuracy on degraded text
PaddleOCR v1 released by Baidu — begins the modern open-source OCR toolkit era with detection + recognition + classification
PaddleOCR v3 (PP-OCRv3) achieves best speed-accuracy tradeoff for document OCR; supports 80+ languages
Tesseract 5.0 adds more LSTM models; remains the default for academic and low-resource projects despite being overtaken in accuracy
Surya OCR emerges as the accuracy leader for multilingual document OCR; EasyOCR and docTR provide alternative open-source options
PaddleOCR v4 and Surya 2.0 push document OCR accuracy further; vision-language models (VLMs) such as GPT-4o approach dedicated document OCR engines in quality
How Optical Character Recognition Works
Page Preprocessing
Document images are deskewed (correcting rotation up to ±45°), denoised (removing scanner artifacts), and optionally binarized. Resolution is checked — OCR works best at 300 DPI; lower resolutions degrade accuracy.
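The binarization step is commonly done with Otsu's method, which picks the global threshold that maximizes between-class variance of the pixel histogram. A minimal NumPy sketch (illustrative only; function names are not from any particular engine):

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold maximizing between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    probs = hist / hist.sum()
    omega = np.cumsum(probs)                 # P(pixel <= t) for each t
    mu = np.cumsum(probs * np.arange(256))   # cumulative mean up to t
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)         # empty classes contribute 0
    return int(np.argmax(sigma_b))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Map a grayscale page to a black/white image using Otsu's threshold."""
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8) * 255
```

Real pipelines usually prefer adaptive (local) thresholding for unevenly lit scans; the global version above shows the core idea.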
Text Detection
DBNet or CRAFT detects text regions and produces word-level or line-level bounding boxes. For documents, detection is often simpler than in scene-text images because text is typically horizontal and well separated, but tables and multi-column layouts require layout-aware detection.
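When the detector emits word-level boxes, they are typically merged into text lines before recognition. A sketch of vertical-overlap grouping (the 0.5 overlap threshold is an illustrative choice, not a fixed convention):

```python
def group_into_lines(boxes, overlap_thresh=0.5):
    """Group word boxes (x0, y0, x1, y1) into text lines.

    Two boxes join the same line when their vertical intersection exceeds
    `overlap_thresh` of the smaller height involved.
    """
    lines = []  # each line is a list of boxes
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        x0, y0, x1, y1 = box
        for line in lines:
            ly0 = min(b[1] for b in line)
            ly1 = max(b[3] for b in line)
            inter = min(y1, ly1) - max(y0, ly0)
            min_h = min(y1 - y0, ly1 - ly0)
            if min_h > 0 and inter / min_h >= overlap_thresh:
                line.append(box)
                break
        else:
            lines.append([box])
    # Within each line, order words left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```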
Line Recognition
Cropped text lines are recognized by a sequence model: ViT/CNN encoder → LSTM/transformer decoder → character sequence. PaddleOCR uses a lightweight PP-OCR architecture; Surya uses larger transformer models for higher accuracy.
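Recognizers in this family are commonly trained with CTC, whose greedy decode collapses repeated argmax symbols and drops blanks. An illustrative sketch (the blank-at-index-0 convention is an assumption, not any specific engine's API):

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: collapse repeats, then remove blanks.

    logits:  T x C list of per-timestep class scores
    charset: mapping from class index to character; index `blank`
             is reserved for the CTC blank symbol
    """
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    out, prev = [], None
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)
```

Beam-search decoding with a language model typically recovers a further fraction of a percent of accuracy over this greedy version.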
Reading Order + Structure
Recognized text blocks are sorted into reading order (left-to-right, top-to-bottom, with column detection). Paragraphs are assembled from adjacent lines. Headers, footers, and page numbers are identified and separated.
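The column-aware sort described above can be sketched as: cluster blocks into columns by horizontal gaps, then read each column top to bottom, columns left to right (the 40-pixel gap is an illustrative parameter, not a standard value):

```python
def reading_order(blocks, min_gap=40):
    """Sort text blocks (x0, y0, x1, y1) into reading order.

    Blocks whose left edge falls within `min_gap` of the current
    column's right extent join that column; otherwise a new column
    starts. Columns are then read left to right, top to bottom.
    """
    columns = []
    for b in sorted(blocks, key=lambda blk: blk[0]):
        if columns and b[0] <= max(c[2] for c in columns[-1]) + min_gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    ordered = []
    for col in columns:  # columns are already left to right
        ordered.extend(sorted(col, key=lambda blk: (blk[1], blk[0])))
    return ordered
```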
Output
The pipeline outputs structured text with bounding box coordinates (hOCR format), plain text in reading order, or structured formats (JSON with regions and text). Confidence scores per character/word enable downstream quality filtering.
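A minimal sketch of the JSON-style output with confidence filtering (the field names are illustrative, not a standard schema like hOCR):

```python
import json

def to_structured_output(lines, min_conf=0.5):
    """Emit OCR results as JSON regions, dropping low-confidence entries.

    `lines` is a list of (text, bbox, confidence) triples already in
    reading order; bbox is (x0, y0, x1, y1) in pixels.
    """
    regions = [
        {"text": text, "bbox": list(bbox), "confidence": round(conf, 3)}
        for text, bbox, conf in lines
        if conf >= min_conf
    ]
    plain_text = "\n".join(r["text"] for r in regions)
    return json.dumps({"regions": regions, "text": plain_text}, indent=2)
```

Downstream consumers can raise `min_conf` to trade recall for precision, e.g. when feeding OCR text into an indexing pipeline that tolerates gaps but not garbage.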
Current Landscape
Document OCR in 2025 is a mature technology with clear market segmentation. For clean, printed, well-scanned documents in major languages, the problem is solved — 99%+ accuracy from multiple tools. The differentiation is now in: (1) multilingual support for underserved scripts, (2) handling of degraded and historical documents, (3) integration with layout analysis and table extraction, and (4) speed optimization for batch processing. PaddleOCR dominates for production speed, Surya leads on multilingual accuracy, and Tesseract persists for its ubiquity and zero-cost deployment. Cloud APIs (Google, Azure, AWS) serve enterprises willing to pay for the best accuracy and simplest integration.
Key Challenges
Multi-column layouts — correctly separating and ordering text in multi-column documents (newspapers, academic papers) is a significant source of errors
Table text extraction — reading text within table cells in the correct row/column order requires combining OCR with table structure recognition
Mixed content — pages with text, handwriting, stamps, logos, and images require the OCR system to distinguish readable text from non-text elements
Historical and degraded documents — yellowed paper, bleed-through, faded ink, and physical damage cause character-level errors that compound across pages
Speed vs. accuracy tradeoff — Tesseract is fast but less accurate; Surya is accurate but slower; production systems must balance throughput with quality
Quick Recommendations
Best open-source accuracy: Surya 2.0. Highest character accuracy across multilingual document OCR benchmarks; supports 90+ languages.
Best speed-accuracy balance: PaddleOCR v4 (PP-OCRv4). Optimized for production throughput; excellent accuracy at 10-50 pages/second on GPU; mobile deployment support.
Maximum compatibility: Tesseract 5 (LSTM mode). Available everywhere, supports 100+ languages, extensive community; adequate accuracy for clean documents.
Cloud API (enterprise): Google Cloud Document AI or Azure Document Intelligence. Highest accuracy with layout understanding built in; SLA guarantees; handles tables and forms natively.
Integrated pipeline: Docling or Marker (both Surya-based). Document OCR + layout analysis + table extraction in one pipeline; output to Markdown or JSON.
What's Next
Document OCR is converging with document parsing — the trend is toward end-to-end models that read, structure, and understand documents simultaneously rather than separate OCR → layout → extraction pipelines. VLMs will increasingly perform OCR implicitly as part of document understanding. Remaining frontiers: real-time OCR for augmented reality (translating signs, menus, documents through phone cameras), 100+ language coverage including endangered scripts, and OCR for non-traditional documents (whiteboards, handwritten notes, screen captures).
Benchmarks & SOTA
All entries below are datasets tracked on Papers With Code unless a description is given. Format — dataset: state-of-the-art model, score (metric).

scut-ctw1500: FAST-T-512, 129.1 (fps)
cnn-/-daily-mail: Scrambled code + broken (alter), 48.18 (rouge-1)
icdar2013: DTrOCR 105M, 99.4 (accuracy)
dart: FactT5B, 97.6 (factspotter)
icdar2015: DTrOCR 105M, 93.5 (accuracy)
tabfact: ARTEMIS-DA, 93.1 (test)
sun-rgb-d: IM3D, 64.4 (iou)
inverse-text: DeepSolo (ViTAEv2-S, TextOCR), 75.8 (f-measure-full-lexicon)
videodb's-ocr-benchmark-public-collection: GPT-4o (OpenAI), 76.22 (accuracy)
pendigits: DnC-SC, 82.86 (nmi)
CodeSearchNet: benchmark for code summarization (docstring generation) across 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go); over 2M (code, docstring) pairs; primary metric BLEU-4. SOTA: GPT-4o (OpenAI), 25.3 (bleu-4)
lam(line-level): GFCN, 18.5 (test-wer)
howsumm-step: LexRank (query: step title), 39.6 (rouge-1)
e2e: HTLM (fine-tuning), 70.8 (rouge-l)
howsumm-method: LexRank (query: method + article + steps titles), 53.5 (rouge-1)
read2016(line-level): Span, 21.1 (test-wer)
iam(line-level): GFCN, 28.6 (test-wer)
urdudoc: ContourNet [69], 88.68 (recall)
belfort: PyLaia (human transcriptions + random split), 28.11 (wer)
reuters-21578: ApproxRepSet, 97.17 (accuracy)
KITAB-Bench (KITAB Arabic OCR Benchmark): 8,809 Arabic text samples across 9 domains; tests Arabic script recognition. SOTA: PaddleOCR (Baidu), 0.790 (cer)
wikibio: MBD, 56.16 (parent)
codesearchnet---java: CodeTrans-MT-Large, 21.87 (smoothed-bleu-4)
codesearchnet---javascript: Transformer, 25.61 (smoothed-bleu-4)
codesearchnet---php: CodeTrans-MT-Base, 26.23 (smoothed-bleu-4)
codesearchnet---go: CodeBERT (MLM), 26.79 (smoothed-bleu-4)
benchmarking-chinese-text-recognition:-datasets,-b: DTrOCR, 89.6 (accuracy)
codesearchnet---ruby: CodeTrans-MT-Base, 15.26 (smoothed-bleu-4)
codesearchnet---python: CodeTrans-MT-Base, 20.39 (smoothed-bleu-4)
tobacco-small-3482: Optimized Text CNN, 84 (accuracy)
mldoc-zero-shot-english-to-french: XLMft UDA, 96.05 (accuracy)
webnlg-(all): HTLM (fine-tuning), 55.6 (bleu)
mldoc-zero-shot-english-to-spanish: XLMft UDA, 96.8 (accuracy)
hoc: BioLinkBERT (large), 88.1 (f1)
webnlg-(seen): HTLM (fine-tuning), 65.4 (bleu)
webnlg-(unseen): HTLM (fine-tuning), 48.4 (bleu)
wikipedia-person-and-animal-dataset: VTM, 45.36 (rouge)
mldoc-zero-shot-english-to-russian: XLMft UDA, 89.7 (accuracy)
ThaiOCRBench (Thai OCR Benchmark): 2,808 Thai text samples across 13 tasks; tests Thai script structural understanding. SOTA: Claude Sonnet 4 (Anthropic), 0.840 (ted-score)
mldoc-zero-shot-english-to-german: XLMft UDA, 96.95 (accuracy)
mldoc-zero-shot-english-to-chinese: XLMft UDA, 93.32 (accuracy)
stdw: RetinaNet, 0.780 (ap)
mldoc-zero-shot-english-to-italian: MultiFiT, pseudo, 76.02 (accuracy)
bbcsport: MPAD-path, 99.59 (accuracy)
read-2016: HTR-VT(line-level), 16.5 (wer)
reuters-rcv1/rcv2-german-to-english: Biinclusion (Euro500kReuters), 84.4 (accuracy)
rotowire: HierarchicalEncoder + NR + IR, 55.88 (content-selection-f1)
fsns---test: STREET, 27.54 (sequence-error)
sut: CNN, 86 (accuracy)
dareczech: Query-doc RobeCzech (Roberta-base), 46.73 (p-10)
cub-200-2011: Q-SENN, 85.9 (top-1-accuracy)
bbc-xsum: BigBird-Pegasus, 47.12 (rouge-1)
mldoc-zero-shot-english-to-japanese: MultiFiT, pseudo, 69.57 (accuracy)
(dataset name missing): ApproxRepSet, 72.6 (accuracy)
amazon: ApproxRepSet, 94.31 (accuracy)
reuters-rcv1/rcv2-english-to-german: Biinclusion (Euro500kReuters), 92.7 (accuracy)
aapd: KD-LSTMreg, 72.9 (f1)
cedar-signature: Siamese_MultiHeadCrossAttention_SoftAttention (Siamese_MHCA_SA), 5.7 (far)
classic: REL-RWMD k-NN, 96.85 (accuracy)
clueweb09-b: XLNet, 31.1 (ndcg-20)
dise-2021-dataset: JDeskew, 0.860 (percentage-correct)
i2l-140k: I2L-NOPOOL, 89.09 (bleu)
icdar-2019: DiT-L (Cascade), 96.55 (weighted-average-f1-score)
imdb-m: Document Classification Using Importance of Sentences, 54.8 (accuracy)
recipe: ApproxRepSet, 59.06 (accuracy)
scidocs-(mag): SPECTER, 82 (f1-micro)
scidocs-(mesh): SciNCL, 88.7 (f1-micro)
simara: DAN, 14.79 (wer)
textzoom: CCD-ViT-Small, 21.84 (average-psnr-db)
wos-5736: ConvTextTM, 91.28 (accuracy)
ba: ELSC, 51.8 (accuracy)
arxiv-summarization-dataset: DeepPyramidion, 19.99 (rouge-2)
arxiv-hep-th-citation-graph: DeepPyramidion, 47.15 (rouge-1)
wikilingua-(tr->en): DOCmT5, 31.37 (rouge-l)
lun: ChuLo, 64.4 (accuracy)
jaffe: ELSC, 98.6 (accuracy)
iris: ELSC, 97.7 (accuracy)
mldoc-zero-shot-german-to-french: BiLSTM (Europarl), 75.45 (accuracy)
mpqa: MPAD-path, 89.81 (accuracy)
pixraw10p: ELSC, 96 (accuracy)
re-docred: VaeDiff-DocRE, 0.790 (f1)
im2latex-100k: I2L-STRIPS, 88.86 (bleu)
and-dataset: Siamese_MHCA_SA, 0.810 (average-f1)
iam-d: StackMix+Blots, 3.01 (cer)
reuters-de-en: BilBOWA, 75 (accuracy)
reuters-en-de: BilBOWA, 86.5 (accuracy)
iam-b: StackMix+Blots, 3.77 (cer)
hyperpartisan-news-detection: ChuLo, 95.38 (accuracy)
hkr: StackMix+Blots, 3.49 (cer)
saint-gall: StackMix+Blots, 3.65 (cer)
scene-text-recognition-benchmarks: CCD-ViT-Small, 84.9 (accuracy)
wine: ELSC, 75.8 (accuracy)
wos-11967: HDLTex, 86.07 (accuracy)
food-101: Bert, 84.41 (accuracy)
wos-46985: HDLTex, 76.58 (accuracy)
ephoie: LayoutLMv3, 99.21 (average-f1)
dwie: VaeDiff-DocRE, 0.731 (f1)
docred-ie: REXEL, 60.1 (relation-f1)
digital-peter: StackMix+Blots, 2.5 (cer)
textseg: CCD-ViT-Small, 84.8 (iou)
yelp-14: KD-LSTMreg, 69.4 (accuracy)
cl-scisumm: GCN Hybrid, 33.88 (rouge-2)
bentham: StackMix+Blots, 1.73 (cer)
bc8: BioRex+Directionality, 56.06 (evaluation-macro-f1)
warppie10p: ELSC, 53.4 (accuracy)
australian: ELSC, 70.9 (accuracy)
Internal Mistral Benchmark: no results tracked yet
CodeSOTA Polish (CodeSOTA Polish OCR Benchmark): 1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe); tests character-level OCR on diacritics with contamination-resistant synthetic categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline). No results tracked yet.
PolEval 2021 OCR (PolEval 2021 OCR Post-Correction Task): 979 Polish books (69,000 pages) from 1791-1998; focuses on OCR post-correction using NLP methods; a major benchmark for Polish historical document processing. No results tracked yet.
SROIE (Scanned Receipts OCR and Information Extraction): 626 receipt images; key task is extracting company, date, address, and total from each receipt. No results tracked yet.
IMPACT-PSNC (IMPACT Polish Digital Libraries Ground Truth): 478 pages of ground truth from four Polish digital libraries at 99.95% accuracy; annotations at region, line, word, and glyph levels; Gothic and antiqua fonts. No results tracked yet.
OCR WER Benchmark: no results tracked yet
OCR CER Benchmark: no results tracked yet
CodeSOTA Verification: no results tracked yet