Computer Vision
Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.
Scene Text Detection
Detecting text regions in natural scene images.
1,000 training and 500 test images captured with wearable cameras. Industry standard for scene text detection.
Text in arbitrary shapes including curved and rotated text. 10,166 images total.
Curved text benchmark. 1,555 images with polygon annotations.
1,500 images with curved text annotations. Focus on arbitrary-shaped text.
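Scene text benchmarks with polygon annotations are typically scored by IoU-based matching between predicted and ground-truth regions. Below is a minimal sketch of that matching, assuming shapely for polygon geometry; the 0.5 threshold and greedy matching are illustrative, not any specific benchmark's protocol.

```python
# Minimal sketch of IoU-based matching for polygon-annotated scene text detection.
# Threshold and matching strategy are illustrative.
from shapely.geometry import Polygon

def polygon_iou(pred_pts, gt_pts):
    """Intersection-over-union of two text regions given as [(x, y), ...] lists."""
    pred, gt = Polygon(pred_pts), Polygon(gt_pts)
    if not pred.is_valid or not gt.is_valid:
        return 0.0
    inter = pred.intersection(gt).area
    union = pred.union(gt).area
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_thresh=0.5):
    """Greedy one-to-one matching; returns (true positives, false positives, false negatives)."""
    matched_gt = set()
    tp = 0
    for pred in preds:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched_gt:
                continue
            iou = polygon_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            matched_gt.add(best_j)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp
```

Precision, recall, and the usual H-mean follow directly from the returned counts.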
Document OCR
Converting scanned documents and images into machine-readable text.
626 receipt images. Key task: extracting the company, date, address, and total from each receipt.
8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.
2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.
979 Polish books (69,000 pages) from 1791-1998. Focus on OCR post-correction using NLP methods. Major benchmark for Polish historical document processing.
478 pages of ground truth from four Polish digital libraries, transcribed at 99.95% accuracy. Includes annotations at region, line, word, and glyph levels. Gothic and antiqua fonts.
1,000 synthetic and real Polish text images with 5 degradation levels (clean to severe). Tests character-level OCR on diacritics with contamination-resistant synthetic categories. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), wikipedia (potential contamination baseline).
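Document OCR results are usually reported as character error rate (CER): the edit distance between predicted and reference text, normalized by reference length. A self-contained sketch, with made-up example strings:

```python
# Character error rate (CER) from Levenshtein edit distance; examples are illustrative.
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between reference and hypothesis strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("Pan Tadeusz", "Pan Tadeusz"))  # 0.0
print(cer("łódź", "lodz"))                # 0.75 -- diacritic errors count as substitutions
```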
Handwriting Recognition
Recognizing handwritten text from images.
13,353 handwritten text lines from 657 writers. Standard handwriting benchmark.
Historical documents in 46 languages, 99K pages. Tests handwritten and printed text recognition across diverse scripts.
Extension of the EMNIST dataset with Polish handwritten characters, including diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż). Tests recognition of Polish-specific characters.
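Handwriting benchmarks such as IAM commonly report word error rate (WER) alongside CER. A minimal line-level sketch, with illustrative sample text:

```python
# Line-level word error rate (WER); the sample sentence is illustrative.
def edit_distance(ref, hyp):
    """Levenshtein distance over arbitrary token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

print(wer("a move to stop mr gaitskell", "a move to stop mr gaitskill"))  # 1 error in 6 words
```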
Document Understanding
Extracting semantic information and structure from documents (visual document understanding, VDU).
199 fully annotated forms. Tests semantic entity labeling and linking.
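Semantic entity labeling on form datasets is usually scored with entity-level precision, recall, and F1. A hedged sketch, treating entities as (label, text) pairs; the sample form fields are made up:

```python
# Entity-level precision / recall / F1 for semantic entity labeling; samples are illustrative.
def entity_f1(pred_entities, gold_entities):
    """Entities are (label, text) tuples; exact match counts as correct."""
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("header", "Invoice"), ("question", "Date:"), ("answer", "2024-01-05")}
pred = {("header", "Invoice"), ("question", "Date:"), ("answer", "2024-01-06")}
print(entity_f1(pred, gold))  # (0.666..., 0.666..., 0.666...)
```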
Document Parsing
Converting documents (like PDFs) into structured formats (Markdown/HTML).
981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.
7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.
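End-to-end parsing quality is often scored by comparing predicted and reference Markdown per document category. The sketch below uses difflib similarity as a lightweight stand-in for the edit-distance-based metrics these benchmarks report; the record keys ("category", "reference_md", "predicted_md") are assumptions, not any benchmark's actual schema:

```python
# Per-category scoring of parsed Markdown; difflib similarity stands in for
# edit-distance-based metrics, and the sample records are made up.
from difflib import SequenceMatcher

def similarity(reference_md: str, predicted_md: str) -> float:
    """Rough text similarity between reference and predicted Markdown (1.0 = identical)."""
    return SequenceMatcher(None, reference_md, predicted_md).ratio()

def score_by_category(samples):
    """samples: iterable of dicts with 'category', 'reference_md', 'predicted_md' keys (assumed layout)."""
    per_cat = {}
    for s in samples:
        per_cat.setdefault(s["category"], []).append(similarity(s["reference_md"], s["predicted_md"]))
    return {cat: sum(vals) / len(vals) for cat, vals in per_cat.items()}

print(score_by_category([
    {"category": "table", "reference_md": "| a | b |", "predicted_md": "| a | b |"},
    {"category": "formula", "reference_md": "$E = mc^2$", "predicted_md": "$E = mc^{2}$"},
]))
```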
General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
Tests 8 core OCR capabilities across 23 tasks. Evaluates large multimodal models (LMMs) on text recognition, referring, and extraction.
Benchmark covering multi-scene text reading, key information extraction, multilingual text, and document parsing.
1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.
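When LMMs answer OCR questions in free-form text, benchmarks often use relaxed string matching: a prediction counts as correct if a normalized reference answer appears in it. A small sketch; the normalization rules and QA samples are illustrative:

```python
# Relaxed string-containment accuracy for free-form OCR QA; samples are illustrative.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def is_correct(prediction: str, answers: list[str]) -> bool:
    """Count a prediction as correct if any reference answer appears in it."""
    pred = normalize(prediction)
    return any(normalize(a) in pred for a in answers)

qa_pairs = [
    {"prediction": "The sign reads 'EXIT'.", "answers": ["exit"]},
    {"prediction": "Total: 42,50 zł", "answers": ["42,50"]},
]
accuracy = sum(is_correct(q["prediction"], q["answers"]) for q in qa_pairs) / len(qa_pairs)
print(accuracy)  # 1.0
```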
Polish OCR
OCR for the Polish language, including historical documents, Gothic fonts, and diacritic recognition.
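For Polish OCR in particular, it can help to separate diacritic mistakes from base-letter mistakes. A small sketch that folds Polish diacritics to their ASCII base letters; the mapping and examples are illustrative:

```python
# Fold Polish diacritics to base letters to check whether a mismatch is diacritic-only.
FOLD = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

def diacritic_only_error(reference: str, prediction: str) -> bool:
    """True if the strings differ, but only in Polish diacritical marks."""
    return reference != prediction and reference.translate(FOLD) == prediction.translate(FOLD)

print(diacritic_only_error("łódź", "lodz"))  # True: only diacritics were dropped
print(diacritic_only_error("łódź", "todz"))  # False: a base letter is wrong too
```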
Image Classification
Categorizing images into predefined classes (ImageNet, CIFAR).
1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
10K new test images following ImageNet collection process. Tests model generalization beyond the original test set.
60K 32x32 color images in 10 classes. Classic small-scale image classification benchmark with 50K training and 10K test images.
60K 32x32 color images in 100 fine-grained classes grouped into 20 superclasses. More challenging than CIFAR-10.
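ImageNet-style classification results are reported as top-1 and top-5 accuracy. A minimal numpy sketch; the tiny logits array is made up:

```python
# Top-k accuracy from raw logits; the example arrays are illustrative.
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """logits: (N, num_classes) scores; labels: (N,) integer class ids."""
    topk = np.argsort(logits, axis=1)[:, -k:]        # indices of the k highest scores per image
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

logits = np.array([[0.1, 2.0, 0.3], [1.5, 0.1, 0.9]])
labels = np.array([1, 2])
print(topk_accuracy(logits, labels, k=1))  # 0.5: first image correct, second not
print(topk_accuracy(logits, labels, k=2))  # 1.0: true class is within the top 2 for both
```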
Object Detection
Locating and classifying objects in images (COCO, Pascal VOC).
330K images with 1.5M object instances across 80 object categories. Standard benchmark for object detection and segmentation.
11,530 images with 27,450 ROI-annotated objects and 6,929 segmentations. Classic object detection benchmark.
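COCO-style detection is scored with mean average precision averaged over IoU thresholds 0.50:0.95, usually via pycocotools. A hedged sketch; the file paths are placeholders for your own annotation and result files:

```python
# Standard COCO mAP evaluation via pycocotools; file paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")    # ground-truth annotations
coco_dt = coco_gt.loadRes("my_detections.json")         # detections in COCO result format
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")  # use "segm" for instance masks
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR at the standard IoU thresholds
```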
Semantic Segmentation
Pixel-level classification of images (Cityscapes, ADE20K).
5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
20K training and 2K validation images annotated with 150 object categories. Complex scene parsing benchmark.
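Semantic segmentation is scored with mean intersection-over-union (mIoU), computed from a per-class confusion matrix over all pixels. A minimal sketch; the tiny label maps and class count are made up:

```python
# Mean IoU (mIoU) from a class confusion matrix; the example label maps are illustrative.
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """pred, gt: integer label maps of the same shape."""
    mask = (gt >= 0) & (gt < num_classes)              # ignore out-of-range / void labels
    idx = num_classes * gt[mask].astype(int) + pred[mask]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf: np.ndarray) -> float:
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return float(iou[union > 0].mean())                # average over classes that appear

gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(mean_iou(confusion_matrix(pred, gt, num_classes=3)))  # 0.5
```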