Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

16 tasks, 169 datasets, 1643 results

Computer Vision is one of the most mature areas of applied ML, with production systems processing billions of images daily. The field has evolved from hand-crafted features to deep learning, and now to vision-language models that understand images in context.

State of the Field (Dec 2024)

  • Vision Transformers (ViT) have largely replaced CNNs for high-accuracy tasks
  • Multimodal models (GPT-4o, Gemini 1.5, Claude 3.5) are changing how we approach OCR and document understanding
  • Real-time inference is now possible for most tasks on edge devices
  • Self-supervised pretraining (DINOv2) and promptable foundation models (SAM) provide strong foundations with little or no task-specific labeled data

Quick Recommendations

Document OCR (clean PDFs)

PaddleOCR or Tesseract 5

Free, fast, accurate enough for 90% of use cases

Document OCR (complex layouts)

Azure Document Intelligence or Google Document AI

Best at tables, forms, and mixed layouts

Handwriting Recognition

Google Cloud Vision or Microsoft Azure

Still the leaders for cursive and messy handwriting

Scene Text (signs, products)

EasyOCR or PaddleOCR

Trained on natural scene images, not just documents

Tasks & Benchmarks

Optical Character Recognition

Extracting text from document images

110 datasets, 680 results, SOTA tracked

Scene Text Detection

Detecting text regions in natural scene images

10 datasets, 465 results, SOTA tracked

Document Layout Analysis

Analyzing the layout structure of documents

5 datasets, 126 results, SOTA tracked

Scene Text Recognition

Recognizing text in natural scene images

11 datasets, 109 results, SOTA tracked

Document Image Classification

Classifying documents by type or category

7 datasets, 54 results, SOTA tracked

Document Parsing

Parsing document structure and content

2 datasets, 51 results, SOTA tracked

General OCR Capabilities

Comprehensive benchmarks covering multiple aspects of OCR performance.

4 datasets, 50 results, SOTA tracked

Table Recognition

Detecting and parsing tables in documents

5 datasets, 38 results, SOTA tracked

Handwriting Recognition

Recognizing handwritten text

6 datasets, 38 results, SOTA tracked

Image Classification

Categorizing images into predefined classes (ImageNet, CIFAR).

4 datasets, 25 results, SOTA tracked

Object Detection

Locating and classifying objects in images (COCO, Pascal VOC).

2 datasets, 5 results, SOTA tracked

Semantic Segmentation

Pixel-level classification of images (Cityscapes, ADE20K).

2 datasets, 2 results, SOTA tracked

Document Understanding

Understanding document content and structure

1 dataset, 0 results

Key Information Extraction

Extracting key-value pairs from documents

0 datasets, 0 results

LaTeX OCR

Converting mathematical formulas to LaTeX

0 datasets, 0 results

Polish OCR

OCR for Polish language including historical documents, gothic fonts, and diacritic recognition.

0 datasets, 0 results

Optical Character Recognition

CodeSOTA Polish (CodeSOTA Polish OCR Benchmark, 2025)

1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).

IMPACT-PSNC (IMPACT Polish Digital Libraries Ground Truth, 2012)

478 pages of ground truth from four Polish digital libraries at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.

KITAB-Bench (KITAB Arabic OCR Benchmark, 2024)
SOTA:0.79(cer)
PaddleOCR

8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.

PolEval 2021 OCR (PolEval 2021 OCR Post-Correction Task, 2021)

979 Polish books (69,000 pages) from 1791-1998. Focus on OCR post-correction using NLP methods. Major benchmark for Polish historical document processing.

SROIE (Scanned Receipts OCR and Information Extraction, 2019)

626 receipt images. Key task: extract company, date, address, total from receipts.

ThaiOCRBench (Thai OCR Benchmark, 2024)
SOTA:0.84(ted-score)
Claude Sonnet 4

2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.

aapd (2020)
SOTA:72.9(f1)
KD-LSTMreg

Dataset from Papers With Code

amazon (2020)
SOTA:94.31(accuracy)
ApproxRepSet

Dataset from Papers With Code

SOTA:0.81(average-f1)
Siamese_MHCA_SA

Dataset from Papers With Code

SOTA:47.15(rouge-1)
DeepPyramidion

Dataset from Papers With Code

SOTA:19.99(rouge-2)
DeepPyramidion

Dataset from Papers With Code

SOTA:70.9(accuracy)
ELSC

Dataset from Papers With Code

ba (2020)
SOTA:51.8(accuracy)
ELSC

Dataset from Papers With Code

SOTA:47.12(rouge-1)
BigBird-Pegasus

Dataset from Papers With Code

SOTA:99.59(accuracy)
MPAD-path

Dataset from Papers With Code

bc8 (2020)
SOTA:56.06(evaluation-macro-f1)
BioRex+Directionality

Dataset from Papers With Code

SOTA:28.11(wer)
PyLaia (human transcriptions + random split)

Dataset from Papers With Code

SOTA:89.6(accuracy)
DTrOCR 105M

Dataset from Papers With Code

SOTA:1.73(cer)
StackMix+Blots

Dataset from Papers With Code

SOTA:5.7(far)
Siamese_MultiHeadCrossAttention_SoftAttention (Siamese_MHCA_SA)

Dataset from Papers With Code

SOTA:33.88(rouge-2)
GCN Hybrid

Dataset from Papers With Code

SOTA:96.85(accuracy)
REL-RWMD k-NN

Dataset from Papers With Code

SOTA:31.1(ndcg-20)
XLNet

Dataset from Papers With Code

SOTA:48.18(rouge-1)
Scrambled code + broken (alter)

Dataset from Papers With Code

SOTA:15.99(smoothed-bleu-4)
CodeBERT (MLM+RTD)

Dataset from Papers With Code

SOTA:26.79(smoothed-bleu-4)
CodeBERT (MLM)

Dataset from Papers With Code

SOTA:21.87(smoothed-bleu-4)
CodeTrans-MT-Large

Dataset from Papers With Code

SOTA:25.61(smoothed-bleu-4)
Transformer

Dataset from Papers With Code

SOTA:26.23(smoothed-bleu-4)
CodeTrans-MT-Base

Dataset from Papers With Code

SOTA:20.39(smoothed-bleu-4)
CodeTrans-MT-Base

Dataset from Papers With Code

SOTA:15.26(smoothed-bleu-4)
CodeTrans-MT-Base

Dataset from Papers With Code

SOTA:85.9(top-1-accuracy)
Q-SENN

Dataset from Papers With Code

SOTA:46.73(p-10)
Query-doc RobeCzech (Roberta-base)

Dataset from Papers With Code

dart (2020)
SOTA:97.6(factspotter)
FactT5B

Dataset from Papers With Code

SOTA:2.5(cer)
StackMix+Blots

Dataset from Papers With Code

SOTA:0.86(percentage-correct)
JDeskew

Dataset from Papers With Code

SOTA:60.1(relation-f1)
REXEL

Dataset from Papers With Code

dwie (2020)
SOTA:0.73(f1)
VaeDiff-DocRE

Dataset from Papers With Code

e2e (2020)
SOTA:70.8(rouge-l)
HTLM (fine-tuning)

Dataset from Papers With Code

ephoie (2020)
SOTA:99.21(average-f1)
LayoutLMv3

Dataset from Papers With Code

SOTA:84.41(accuracy)
Bert

Dataset from Papers With Code

SOTA:27.54(sequence-error)
STREET

Dataset from Papers With Code

hkr (2020)
SOTA:3.49(cer)
StackMix+Blots

Dataset from Papers With Code

hoc (2020)
SOTA:88.1(f1)
BioLinkBERT (large)

Dataset from Papers With Code

SOTA:53.5(rouge-1)
LexRank (query: method + article + steps titles)

Dataset from Papers With Code

SOTA:39.6(rouge-1)
LexRank (query: step title)

Dataset from Papers With Code

SOTA:95.38(accuracy)
ChuLo

Dataset from Papers With Code

SOTA:89.09(bleu)
I2L-NOPOOL

Dataset from Papers With Code

SOTA:28.6(test-wer)
GFCN

Dataset from Papers With Code

iam-b (2020)
SOTA:3.77(cer)
StackMix+Blots

Dataset from Papers With Code

iam-d (2020)
SOTA:3.01(cer)
StackMix+Blots

Dataset from Papers With Code

SOTA:96.55(weighted-average-f1-score)
DiT-L (Cascade)

Dataset from Papers With Code

SOTA:99.4(accuracy)
DTrOCR 105M

Dataset from Papers With Code

SOTA:93.5(accuracy)
DTrOCR 105M

Dataset from Papers With Code

SOTA:88.86(bleu)
I2L-STRIPS

Dataset from Papers With Code

imdb-m (2020)
SOTA:54.8(accuracy)
Document Classification Using Importance of Sentences

Dataset from Papers With Code

SOTA:75.8(f-measure-full-lexicon)
DeepSolo (ViTAEv2-S, TextOCR)

Dataset from Papers With Code

iris (2020)
SOTA:97.7(accuracy)
ELSC

Dataset from Papers With Code

jaffe (2020)
SOTA:98.6(accuracy)
ELSC

Dataset from Papers With Code

SOTA:18.5(test-wer)
GFCN

Dataset from Papers With Code

lun (2020)
SOTA:64.4(accuracy)
ChuLo

Dataset from Papers With Code

SOTA:93.32(accuracy)
XLMft UDA

Dataset from Papers With Code

SOTA:96.05(accuracy)
XLMft UDA

Dataset from Papers With Code

SOTA:96.95(accuracy)
XLMft UDA

Dataset from Papers With Code

SOTA:76.02(accuracy)
MultiFiT, pseudo

Dataset from Papers With Code

SOTA:69.57(accuracy)
MultiFiT, pseudo

Dataset from Papers With Code

SOTA:89.7(accuracy)
XLMft UDA

Dataset from Papers With Code

SOTA:96.8(accuracy)
XLMft UDA

Dataset from Papers With Code

SOTA:75.45(accuracy)
BiLSTM (Europarl)

Dataset from Papers With Code

mpqa (2020)
SOTA:89.81(accuracy)
MPAD-path

Dataset from Papers With Code

SOTA:82.86(nmi)
DnC-SC

Dataset from Papers With Code

SOTA:96(accuracy)
ELSC

Dataset from Papers With Code

SOTA:0.79(f1)
VaeDiff-DocRE

Dataset from Papers With Code

SOTA:16.5(wer)
HTR-VT(line-level)

Dataset from Papers With Code

SOTA:21.1(test-wer)
Span

Dataset from Papers With Code

recipe (2020)
SOTA:59.06(accuracy)
ApproxRepSet

Dataset from Papers With Code

SOTA:97.17(accuracy)
ApproxRepSet

Dataset from Papers With Code

SOTA:75(accuracy)
BilBOWA

Dataset from Papers With Code

SOTA:86.5(accuracy)
BilBOWA

Dataset from Papers With Code

SOTA:92.7(accuracy)
Biinclusion (Euro500kReuters)

Dataset from Papers With Code

SOTA:84.4(accuracy)
Biinclusion (Euro500kReuters)

Dataset from Papers With Code

SOTA:55.88(content-selection-f1)
HierarchicalEncoder + NR + IR

Dataset from Papers With Code

SOTA:3.65(cer)
StackMix+Blots

Dataset from Papers With Code

SOTA:84.9(accuracy)
CCD-ViT-Small

Dataset from Papers With Code

SOTA:82(f1-micro)
SPECTER

Dataset from Papers With Code

SOTA:88.7(f1-micro)
SciNCL

Dataset from Papers With Code

SOTA:129.1(fps)
FAST-T-512

Dataset from Papers With Code

simara (2020)
SOTA:14.79(wer)
DAN

Dataset from Papers With Code

stdw (2020)
SOTA:0.78(ap)
RetinaNet

Dataset from Papers With Code

SOTA:64.4(iou)
IM3D

Dataset from Papers With Code

sut (2020)
SOTA:86(accuracy)
CNN

Dataset from Papers With Code

SOTA:93.1(test)
ARTEMIS-DA

Dataset from Papers With Code

SOTA:84.8(iou)
CCD-ViT-Small

Dataset from Papers With Code

SOTA:21.84(average-psnr-db)
CCD-ViT-Small

Dataset from Papers With Code

SOTA:84(accuracy)
Optimized Text CNN

Dataset from Papers With Code

SOTA:72.6(accuracy)
ApproxRepSet

Dataset from Papers With Code

SOTA:88.68(recall)
ContourNet [69]

Dataset from Papers With Code

SOTA:76.22(accuracy)
GPT-4o

Dataset from Papers With Code

SOTA:53.4(accuracy)
ELSC

Dataset from Papers With Code

SOTA:55.6(bleu)
HTLM (fine-tuning)

Dataset from Papers With Code

SOTA:65.4(bleu)
HTLM (fine-tuning)

Dataset from Papers With Code

SOTA:48.4(bleu)
HTLM (fine-tuning)

Dataset from Papers With Code

SOTA:56.16(parent)
MBD

Dataset from Papers With Code

SOTA:31.37(rouge-l)
DOCmT5

Dataset from Papers With Code

SOTA:45.36(rouge)
VTM

Dataset from Papers With Code

wine (2020)
SOTA:75.8(accuracy)
ELSC

Dataset from Papers With Code

SOTA:86.07(accuracy)
HDLTex

Dataset from Papers With Code

SOTA:76.58(accuracy)
HDLTex

Dataset from Papers With Code

SOTA:91.28(accuracy)
ConvTextTM

Dataset from Papers With Code

SOTA:69.4(accuracy)
KD-LSTMreg

Dataset from Papers With Code

Scene Text Detection

CTW1500 (Curved Text in the Wild 1500, 2019)

1500 images with curved text annotations. Focus on arbitrary-shaped text.

ICDAR 2015 (ICDAR 2015 Incidental Scene Text, 2015)
SOTA:93.96(precision)
TextFuseNet (ResNeXt-101)

1000 training + 500 test images captured with wearable cameras. Industry standard for scene text detection.

ICDAR 2019 ArT (ICDAR 2019 Arbitrary-Shaped Text, 2019)

Text in arbitrary shapes including curved and rotated text. 10,166 images total.

Total-Text (2017)
SOTA:152.8(fps)
FAST-T-448

Curved text benchmark. 1555 images with polygon annotations.

SOTA:81.9(1-1-accuracy)
CLIP4STR-L

Dataset from Papers With Code

SOTA:86.4(accuracy)
CLIP4STR-L (DataComp-1B)

Dataset from Papers With Code

SOTA:93.36(f-measure)
BDN

Dataset from Papers With Code

SOTA:97.4(precision)
CRAFT

Dataset from Papers With Code

SOTA:84.42(precision)
PMTD*

Dataset from Papers With Code

SOTA:137.2(fps)
FAST-T-512

Dataset from Papers With Code

Document Layout Analysis

d4la (2020)
SOTA:70.72(map)
DoPTA

Dataset from Papers With Code

SOTA:0.97(figure)
fglihai

Dataset from Papers With Code

SOTA:0.98(table)
DETR

Dataset from Papers With Code

SOTA:83.4(class-average-iou)
CV-Group

Dataset from Papers With Code

Scene Text Recognition

cute80 (2020)
SOTA:99.7(accuracy)
CLIP4STR-L (DataComp-1B)

Dataset from Papers With Code

host (2020)
SOTA:82.7(1-1-accuracy)
CLIP4STR-L

Dataset from Papers With Code

ic13 (2020)
SOTA:97.8(accuracy)
ABINet-LV+TPS++

Dataset from Papers With Code

SOTA:97.1(accuracy)
Yet Another Text Recognizer

Dataset from Papers With Code

iiit5k (2020)
SOTA:99.6(accuracy)
CLIP4STR-L (DataComp-1B)

Dataset from Papers With Code

msda (2020)
SOTA:42(accuracy)
MetaSelf-Learning

Dataset from Papers With Code

svt (2020)
SOTA:99.1(accuracy)
CLIP4STR-H (DFN-5B)

Dataset from Papers With Code

svt-p (2020)
SOTA:89.6(accuracy)
ABINet-LV+TPS++

Dataset from Papers With Code

svtp (2020)
SOTA:98.6(accuracy)
DTrOCR 105M

Dataset from Papers With Code

SOTA:92.2(accuracy)
CLIP4STR-L (DataComp-1B)

Dataset from Papers With Code

wost (2020)
SOTA:90.9(1-1-accuracy)
CLIP4STR-H (DFN-5B)

Dataset from Papers With Code

Document Image Classification

aip (2020)
SOTA:83.4(top-1-accuracy-verb)
ResNet-RS (ResNet-200 + RS training tricks)

Dataset from Papers With Code

SOTA:97.62(accuracy)
Pixel-level RC

Dataset from Papers With Code

SOTA:89.54(accuracy)
PCGAN-CHAR

Dataset from Papers With Code

SOTA:96.68(accuracy)
PCGAN-CHAR

Dataset from Papers With Code

SOTA:98.43(accuracy)
PCGAN-CHAR

Dataset from Papers With Code

SOTA:97.7(accuracy)
EAML

Dataset from Papers With Code

SOTA:95.57(accuracy)
DocXClassifier-L

Dataset from Papers With Code

Document Parsing

OmniDocBench (OmniDocBench v1.5, 2024)
SOTA:97.5(layout-map)
MinerU 2.5

981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.

olmOCR-Bench (2024)
SOTA:99.9(base)
Chandra v0.1.0

7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.

General OCR Capabilities

CC-OCR (Comprehensive Challenge OCR, 2024)
SOTA:83.25(multi-scene-f1)
Gemini 1.5 Pro

Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.

MME-VideoOCR (MME Video OCR Benchmark, 2024)
SOTA:73.7(total-accuracy)
Gemini 2.5 Pro

1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.

OCRBench v2 (2024)
SOTA:62.2(overall-zh-private)
Gemini 2.5 Pro

Tests 8 core OCR capabilities across 23 tasks. Evaluates LMMs on text recognition, referring, extraction.

reVISION (reVISION Polish Vision-Language Benchmark, 2025)

Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.

Table Recognition

SOTA:95.46(f-measure)
Proposed System (With post-processing)

Dataset from Papers With Code

SOTA:97.88(teds-struct)
Multi-Task Learning Model

Dataset from Papers With Code

SOTA:98.35(teds-simple-samples)
Re0

Dataset from Papers With Code

SOTA:91.87(teds-simple-samples)
EDD

Dataset from Papers With Code

wtw (2020)
SOTA:78.9(f1)
StrucTexTv2 (small)

Dataset from Papers With Code

Handwriting Recognition

CHURRO-DS (Cultural Heritage Understanding Research Repository OCR Dataset, 2024)
SOTA:82.3(printed-levenshtein)
CHURRO (3B)

Historical documents from 46 languages, 99K pages. Tests handwritten and printed text recognition across diverse scripts.

IAM (IAM Handwriting Database, 1999)
SOTA:23.2(wer)
Start, Follow, Read

13,353 handwritten text lines from 657 writers. Standard handwriting benchmark.

Polish EMNIST Extension (EMNIST Extended with Polish Diacritics, 2020)

Extension of EMNIST dataset with Polish handwritten characters including diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Tests recognition of Polish-specific characters.

SOTA:96.8(accuracy)
AKHCRNet

Dataset from Papers With Code

kohtd (2020)
SOTA:8.36(cer)
Bluche

Dataset from Papers With Code

Image Classification

CIFAR-10 (Canadian Institute for Advanced Research 10, 2009)
SOTA:99.1(accuracy)
DeiT-B Distilled

60K 32x32 color images in 10 classes. Classic small-scale image classification benchmark with 50K training and 10K test images.

CIFAR-100 (Canadian Institute for Advanced Research 100, 2009)
SOTA:94.55(accuracy)
ViT-H/14

60K 32x32 color images in 100 fine-grained classes grouped into 20 superclasses. More challenging than CIFAR-10.

ImageNet-1K (ImageNet Large Scale Visual Recognition Challenge 2012, 2012)
SOTA:91(top-1-accuracy)
CoCa (finetuned)

1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.

ImageNet-V2 (ImageNet-V2 Matched Frequency, 2019)
SOTA:84(top-1-accuracy)
Swin Transformer V2 Large

10K new test images following ImageNet collection process. Tests model generalization beyond the original test set.

Object Detection

COCO (Microsoft COCO: Common Objects in Context, 2014)
SOTA:66(mAP)
Co-DETR (Swin-L)

330K images, 1.5 million object instances, 80 object categories. Standard benchmark for object detection and segmentation.

Pascal VOC 2012 (Pascal Visual Object Classes Challenge 2012, 2012)

11,530 images with 27,450 ROI annotated objects and 6,929 segmentations. Classic object detection benchmark.

Semantic Segmentation

ADE20K (ADE20K Scene Parsing Benchmark, 2016)
SOTA:62.9(mIoU)
InternImage-H

20K training, 2K validation images annotated with 150 object categories. Complex scene parsing benchmark.

Cityscapes (Cityscapes Dataset, 2016)

5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.

Document Understanding

FUNSD (Form Understanding in Noisy Scanned Documents, 2019)

199 fully annotated forms. Tests semantic entity labeling and linking.

Key Information Extraction

No datasets indexed yet. Contribute on GitHub

LaTeX OCR

No datasets indexed yet. Contribute on GitHub

Polish OCR

No datasets indexed yet. Contribute on GitHub

Honest Takes

OCR is solved for clean documents

For printed text on white backgrounds, accuracy differences between models are negligible. The real challenge is messy real-world documents, handwriting, and multi-language support.

Benchmarks don't predict production performance

A model scoring 95% on ICDAR may fail on your specific invoice format. Always test on your own data before committing.
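
Testing on your own data can be as simple as scoring OCR output against a small hand-checked sample. The sketch below is one way to compute the character error rate (cer) and word error rate (wer) used by several benchmarks above, as edit distance divided by reference length. The function names and the sample pairs are illustrative assumptions, not part of any benchmark's official tooling, and real evaluations often normalize case, whitespace, or punctuation first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Total edits divided by total reference length over (reference, hypothesis) pairs."""
    total_edits = 0
    total_len = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_len += len(ref_tokens)
    if total_len == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_len


def cer(pairs):
    """Character error rate: edits counted per character."""
    return error_rate(pairs, list)


def wer(pairs):
    """Word error rate: edits counted per whitespace-separated word."""
    return error_rate(pairs, str.split)


if __name__ == "__main__":
    # Hypothetical sample: (ground truth, OCR output) for two of your own documents.
    sample = [
        ("Invoice total: 120.50", "Invoice total: 120.S0"),
        ("Due date 2024-12-01", "Due date 2024-12-01"),
    ]
    print(f"CER: {cer(sample):.3f}  WER: {wer(sample):.3f}")
```

Running this over a few dozen of your own pages per candidate model gives numbers that can be compared directly, which public leaderboard scores cannot guarantee.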

Vision LLMs are overkill for most tasks

GPT-4o is impressive but costs 100x more than specialized models. Use it for complex reasoning, not simple extraction.
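
To see what the cost gap means for a mixed pipeline, the sketch below estimates total spend when only a fraction of pages is routed to a vision LLM. The 100x multiplier comes from the statement above; the per-page base cost and page volume are placeholder assumptions, not real vendor prices.

```python
def blended_cost(pages, base_cost_per_page, llm_multiplier=100.0, llm_fraction=0.0):
    """Estimated total cost when `llm_fraction` of pages go to a vision LLM.

    base_cost_per_page is a hypothetical price for the specialized model;
    the vision LLM is assumed to cost llm_multiplier times as much per page.
    """
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    llm_pages = pages * llm_fraction
    cheap_pages = pages - llm_pages
    return (cheap_pages * base_cost_per_page
            + llm_pages * base_cost_per_page * llm_multiplier)


if __name__ == "__main__":
    # Placeholder volume and price, chosen only to illustrate the ratio.
    for fraction in (0.0, 0.1, 1.0):
        total = blended_cost(pages=1000, base_cost_per_page=0.001, llm_fraction=fraction)
        print(f"{fraction:.0%} of pages via vision LLM: {total:.2f}")
```

Under these assumptions, sending even a tenth of pages to the vision LLM dominates the bill, which is the argument for reserving it for documents that need reasoning.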

Need help choosing?

We can run these benchmarks on your actual documents. Same methodology, your data.