Computer Vision
Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.
Computer Vision is one of the most mature areas of applied ML, with production systems processing billions of images daily. The field has evolved from hand-crafted features to deep learning, and now to vision-language models that understand images in context.
State of the Field (Dec 2024)
- Vision Transformers (ViT) have largely replaced CNNs for high-accuracy tasks
- Multimodal models (GPT-4o, Gemini 1.5, Claude 3.5) are changing how we approach OCR and document understanding
- Real-time inference is now possible for most tasks on edge devices
- Self-supervised pretraining (DINOv2, SAM) provides strong foundations without labeled data
Quick Recommendations
Document OCR (clean PDFs)
PaddleOCR or Tesseract 5
Free, fast, and accurate enough for 90% of use cases; see the sketch after these recommendations
Document OCR (complex layouts)
Azure Document Intelligence or Google Document AI
Best at tables, forms, and mixed layouts
Handwriting Recognition
Google Cloud Vision or Microsoft Azure
Still the leaders for cursive and messy handwriting
Scene Text (signs, products)
EasyOCR or PaddleOCR
Trained on natural scene images, not just documents
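The open-source options above are a pip install away. A minimal sketch, assuming pytesseract (with the Tesseract binary installed) and easyocr are available; the path sample_invoice.png is a placeholder for your own image:

```python
# pip install pytesseract easyocr pillow
# Also requires the Tesseract binary itself (e.g. apt install tesseract-ocr).
from PIL import Image
import pytesseract
import easyocr

IMAGE_PATH = "sample_invoice.png"  # placeholder path; substitute your own document

# Tesseract 5: best suited to clean, printed documents.
printed_text = pytesseract.image_to_string(Image.open(IMAGE_PATH), lang="eng")
print(printed_text)

# EasyOCR: trained on natural scene images, better for signs and product photos.
reader = easyocr.Reader(["en"])  # downloads detection/recognition models on first run
for bbox, text, confidence in reader.readtext(IMAGE_PATH):
    print(f"{confidence:.2f}  {text}")
```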
Tasks & Benchmarks
Optical Character Recognition
Extracting text from document images
Scene Text Detection
Detecting text regions in natural scene images
Document Layout Analysis
Analyzing the layout structure of documents
Scene Text Recognition
Recognizing text in natural scene images
Document Image Classification
Classifying documents by type or category
Document Parsing
Parsing document structure and content
General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
Table Recognition
Detecting and parsing tables in documents
Handwriting Recognition
Recognizing handwritten text
Image Classification
Categorizing images into predefined classes (ImageNet, CIFAR).
Object Detection
Locating and classifying objects in images (COCO, Pascal VOC).
Semantic Segmentation
Pixel-level classification of images (Cityscapes, ADE20K).
Document Understanding
Understanding document content and structure
Key Information Extraction
Extracting key-value pairs from documents
LaTeX OCR
Converting mathematical formulas to LaTeX
Polish OCR
OCR for Polish language including historical documents, gothic fonts, and diacritic recognition.
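Diacritic coverage is often where default OCR configs fail first. A minimal sketch, assuming Tesseract's Polish language pack is installed alongside pytesseract; sample_page.png is a placeholder:

```python
# pip install pytesseract pillow
# Requires Tesseract plus its Polish traineddata (e.g. apt install tesseract-ocr-pol).
from PIL import Image
import pytesseract

# Placeholder path; substitute a scan of your own.
text = pytesseract.image_to_string(Image.open("sample_page.png"), lang="pol")

# Quick check that Polish-specific characters survived recognition.
diacritics = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
found = sorted(set(text) & diacritics)
print("Diacritics recognized:", "".join(found) or "none")
```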
Datasets & SOTA Results
Optical Character Recognition
1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).
478 pages of ground truth from four Polish digital libraries at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.
8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.
979 Polish books (69,000 pages) from 1791-1998. Focus on OCR post-correction using NLP methods. Major benchmark for Polish historical document processing.
626 receipt images. Key task: extract company, date, address, total from receipts.
2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.
Scene Text Detection
1,500 images with curved text annotations. Focus on arbitrary-shaped text.
1,000 training + 500 test images captured with wearable cameras. Industry standard for scene text detection.
Text in arbitrary shapes including curved and rotated text. 10,166 images total.
Curved text benchmark. 1,555 images with polygon annotations.
Document Parsing
981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.
7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.
General OCR Capabilities
Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.
1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
Tests 8 core OCR capabilities across 23 tasks. Evaluates LMMs on text recognition, referring, and extraction.
Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.
Handwriting Recognition
Historical documents from 46 languages, 99K pages. Tests handwritten and printed text recognition across diverse scripts.
13,353 handwritten text lines from 657 writers. Standard handwriting benchmark.
Extension of EMNIST dataset with Polish handwritten characters including diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Tests recognition of Polish-specific characters.
Image Classification
60K 32x32 color images in 10 classes. Classic small-scale image classification benchmark with 50K training and 10K test images.
60K 32x32 color images in 100 fine-grained classes grouped into 20 superclasses. More challenging than CIFAR-10.
1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
10K new test images following ImageNet collection process. Tests model generalization beyond the original test set.
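To reproduce a baseline against these benchmarks locally, the small CIFAR sets are a one-liner to pull. A minimal sketch using torchvision (assumed installed); the ./data directory is a placeholder and the accuracy helper is illustrative:

```python
# pip install torch torchvision
import torch
from torchvision import datasets, transforms

# Download CIFAR-10 (60K 32x32 images, 10 classes) to a local placeholder directory.
transform = transforms.ToTensor()
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)

# Evaluate any classifier that maps (N, 3, 32, 32) tensors to 10 logits.
def accuracy(model: torch.nn.Module) -> float:
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    return correct / len(test_set)
```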
Object Detection
330K images, 1.5 million object instances, 80 object categories. Standard benchmark for object detection and segmentation.
11,530 images with 27,450 ROI annotated objects and 6,929 segmentations. Classic object detection benchmark.
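COCO scores are conventionally produced with the pycocotools evaluator. A minimal sketch, assuming pycocotools is installed and detections.json holds your model's predictions in the standard COCO results format; both file paths are placeholders:

```python
# pip install pycocotools
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO val2017 ground truth and your detections
# (a list of {image_id, category_id, bbox, score} records).
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR at the standard IoU thresholds
```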
Semantic Segmentation
20K training, 2K validation images annotated with 150 object categories. Complex scene parsing benchmark.
5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
Document Understanding
199 fully annotated forms. Tests semantic entity labeling and linking.
Honest Takes
OCR is solved for clean documents
For printed text on white backgrounds, accuracy differences between models are negligible. The real challenge is messy real-world documents, handwriting, and multi-language support.
Benchmarks don't predict production performance
A model scoring 95% on ICDAR may fail on your specific invoice format. Always test on your own data before committing.
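Testing on your own data can be as simple as scoring character error rate against a handful of hand-transcribed pages. A minimal sketch using the jiwer library (assumed installed); the reference and hypothesis strings are placeholders:

```python
# pip install jiwer
import jiwer

# Placeholder pair: your hand-typed ground truth vs. an engine's raw output.
reference = "Invoice No. 2024-001\nTotal due: 1,234.56 PLN"
hypothesis = "Invoice No. 2O24-OO1\nTotal due: 1.234,56 PLN"

# Character error rate is the usual OCR metric; word error rate suits free text.
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```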
Vision LLMs are overkill for most tasks
GPT-4o is impressive but costs 100x more than specialized models. Use it for complex reasoning, not simple extraction.
In-Depth Guides
Need help choosing?
We can run these benchmarks on your actual documents. Same methodology, your data.
Get Private Evaluation