Document OCR
Converting scanned documents and images into machine-readable text.
Benchmarks & Datasets
SROIE
626 receipt images. Key task: extract the company, date, address, and total fields from each receipt.
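Extraction of the four fields can be scored by exact string match per field. The sketch below is illustrative and not the official SROIE scorer: the function name `field_f1`, the dict-based record layout, and the micro-averaging choice are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# The four fields named in the task description.
FIELDS = ("company", "date", "address", "total")


def field_f1(
    predictions: List[Dict[str, str]],
    references: List[Dict[str, str]],
) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over all fields.

    Hypothetical helper: each record is a dict mapping field name to text.
    A missing or empty field counts as "not predicted" / "not present".
    """
    true_pos = n_pred = n_ref = 0
    for pred, ref in zip(predictions, references):
        for field in FIELDS:
            p = pred.get(field, "").strip()
            r = ref.get(field, "").strip()
            if p:
                n_pred += 1
            if r:
                n_ref += 1
            if p and p == r:
                true_pos += 1
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Exact match is strict: a single wrong digit in the total makes the whole field count as an error, which is why field-level scores on receipts tend to be lower than character-level scores on the same images.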
KITAB-Bench
8,809 Arabic text samples across 9 domains. Tests Arabic script recognition.
ThaiOCRBench
2,808 Thai text samples across 13 tasks. Tests Thai script structural understanding.
PolEval 2021 OCR
979 Polish books (69,000 pages) published from 1791 to 1998. Focuses on OCR post-correction using NLP methods. A major benchmark for Polish historical document processing.
IMPACT-PSNC
478 pages of ground-truth transcriptions from four Polish digital libraries, produced at 99.95% accuracy. Annotated at the region, line, word, and glyph levels. Covers Gothic and Antiqua typefaces.
CodeSOTA Polish
1,000 synthetic and real Polish text images at 5 degradation levels (clean to severe). Tests character-level OCR on diacritics; the synthetic categories are designed to resist training-data contamination. Categories: synth_random (pure character recognition), synth_words (Markov-generated words), real_corpus (Pan Tadeusz, official documents), and wikipedia (potential contamination baseline).
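Character-level OCR quality on diacritics is commonly summarized with the character error rate (CER). The sketch below is not the benchmark's own scoring code: it assumes CER is defined as Levenshtein distance divided by reference length, and the helper `fold_diacritics` is a hypothetical addition that maps Polish letters to their base forms so that the share of errors caused by diacritics alone can be estimated.

```python
# Explicit mapping, because "ł" has no Unicode decomposition and would
# survive a plain NFD-and-strip approach.
_FOLD = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


def fold_diacritics(text: str) -> str:
    """Replace Polish diacritic letters with their base Latin letters."""
    return text.translate(_FOLD)


def diacritic_cer_gap(ref: str, hyp: str) -> float:
    """CER minus CER after folding both strings: errors due to diacritics."""
    return cer(ref, hyp) - cer(fold_diacritics(ref), fold_diacritics(hyp))
```

For example, comparing the reference "zażółć gęślą jaźń" with an OCR output of "zazolc gesla jazn" gives a nonzero CER, while the folded CER is zero, so the whole error is attributable to dropped diacritics.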