General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
General OCR (Optical Character Recognition) converts images of text into machine-readable strings. Modern OCR systems handle printed text in 100+ languages at 99%+ character accuracy; the real differentiation lies in handling degraded scans, complex layouts, mixed scripts, and mathematical notation. PaddleOCR and Surya dominate the open-source space; Google Cloud Vision and Azure lead among cloud APIs.
History
Ray Kurzweil develops the first omni-font OCR machine, reading text in any font — commercialized by Xerox
Tesseract open-sourced (developed at HP from 1985, maintained by Google since 2006); becomes the default free OCR engine for two decades
Deep learning OCR (CRNN: CNN + RNN + CTC loss) surpasses traditional methods on scene text and printed text benchmarks
Attention-based sequence-to-sequence models replace CTC for OCR, better handling variable-length text and complex scripts
PaddleOCR (Baidu) releases a comprehensive open-source OCR toolkit supporting 80+ languages with the PP-OCR pipeline (detect → classify → recognize)
TrOCR (Microsoft) applies transformer encoder-decoder architecture to OCR, matching LSTM-based methods with simpler architecture
Surya OCR (Vik Paruchuri) achieves state-of-the-art multilingual OCR with transformer-based models, supporting 90+ languages
GOT (General OCR Theory) demonstrates OCR as visual generation — a single model handles text, math, tables, sheet music, and molecular formulas
Large VLMs (GPT-4o, Qwen2-VL) perform OCR implicitly — send any image and get text extraction as a byproduct of visual understanding
How General OCR Works
Text Detection
A detection model (EAST, DBNet, CRAFT) finds text regions in the image, outputting bounding boxes or polygons around each text line or word. DBNet uses a differentiable binarization approach that handles curved and rotated text.
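DBNet's differentiable binarization can be illustrated in a few lines of plain Python. This is a toy sketch of the paper's approximate step function B = 1/(1 + e^(-k(P - T))), not real training code; the function name, list-of-lists maps, and sample values are illustrative:

```python
import math

def db_binarize(prob_map, thresh_map, k=50.0):
    """Toy differentiable binarization (DBNet-style): instead of a hard
    threshold, apply a steep sigmoid to (P - T) so the operation stays
    differentiable. Inputs are same-shaped 2D lists of floats in [0, 1]."""
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(prow, trow)]
        for prow, trow in zip(prob_map, thresh_map)
    ]

# Pixels well above the learned threshold saturate toward 1 (text),
# pixels below it saturate toward 0 (background).
binary = db_binarize([[0.9, 0.1]], [[0.3, 0.3]])
```

Because the sigmoid is steep but smooth, a network can learn the threshold map T jointly with the probability map P through ordinary backpropagation, which is what makes curved and rotated text tractable.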
Text Line Extraction
Detected regions are cropped, deskewed, and normalized to fixed height (32-48px) while preserving aspect ratio. Sorting by reading order (top-to-bottom, left-to-right) organizes the text spatially.
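The reading-order step can be sketched as a two-level sort, assuming axis-aligned word boxes `(x, y, w, h)`; the grouping tolerance and helper name are made up for illustration:

```python
def reading_order(boxes, line_tol=0.5):
    """Sort word boxes (x, y, w, h) top-to-bottom, then left-to-right.
    Boxes whose vertical centres fall within line_tol * height of the
    current line are grouped onto the same line before sorting by x."""
    boxes = sorted(boxes, key=lambda b: b[1] + b[3] / 2)  # by y-centre
    lines, current = [], []
    for b in boxes:
        if current and abs((b[1] + b[3] / 2) - (current[0][1] + current[0][3] / 2)) > line_tol * current[0][3]:
            lines.append(sorted(current, key=lambda w: w[0]))  # flush finished line
            current = []
        current.append(b)
    if current:
        lines.append(sorted(current, key=lambda w: w[0]))
    return [w for line in lines for w in line]
```

This naive grouping only handles simple single-column pages; production pipelines use layout analysis for columns and tables.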
Text Recognition
Each cropped text line is processed by a recognition model: a CNN/ViT encoder produces feature sequences, and a decoder (CTC or attention-based) produces character sequences. Modern models (TrOCR, PaddleOCR v4) use ViT encoders for better accuracy.
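For CTC decoders, the standard greedy decode takes the argmax at each timestep, collapses consecutive repeats, and strips the blank symbol. A minimal sketch, assuming one score vector per timestep; the charset and scores here are invented:

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse consecutive
    repeats, then drop blanks. `logits` is a list of per-timestep score
    lists aligned with `charset`; index `blank` is the CTC blank."""
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Timesteps argmax to: h h _ e l _ l o  ->  "hello"
charset = ["_", "h", "e", "l", "o"]
steps = [1, 1, 0, 2, 3, 0, 3, 4]
logits = [[1.0 if i == s else 0.0 for i in range(5)] for s in steps]
print(ctc_greedy_decode(logits, charset))  # prints "hello"
```

Note how the blank between the two `l` timesteps is what lets CTC emit a doubled letter; attention decoders avoid this alignment machinery entirely.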
Language Model Post-Processing
Optional spell-checking, language model rescoring, or dictionary lookup corrects OCR errors. For structured documents, post-processing may include table reconstruction and reading order correction.
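A minimal dictionary-lookup corrector can be written with the standard library's difflib; the vocabulary, cutoff, and function name are illustrative, and production systems typically use confusion-aware edit costs or language-model rescoring instead:

```python
import difflib

def correct_token(token, vocab, cutoff=0.8):
    """Dictionary-lookup post-processing: snap an OCR token to the closest
    vocabulary word if it is similar enough, otherwise keep it unchanged.
    `cutoff` is the minimum difflib similarity ratio to accept a match."""
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

vocab = ["invoice", "total", "amount"]
print(correct_token("inv0ice", vocab))  # prints "invoice" (0 -> o fixed)
print(correct_token("qqq", vocab))      # prints "qqq" (no close match)
```

The cutoff matters: too low and valid out-of-vocabulary tokens (names, IDs) get clobbered; too high and common OCR confusions (0/o, 1/l) slip through.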
Evaluation
Character Error Rate (CER) and Word Error Rate (WER) are the primary metrics. Printed English achieves <1% CER; handwriting and degraded scans range 5-20% CER. Benchmarks include ICDAR datasets, SROIE (receipts), and multilingual text datasets.
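CER is just edit distance normalized by reference length; a self-contained sketch (WER is the same computation over word tokens instead of characters):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions),
    computed row by row to keep memory linear in len(b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("hello world", "helo world"))  # one deletion over 11 chars, ~0.0909
```

Note that CER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why some benchmarks clip or report normalized scores.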
Current Landscape
General OCR in 2025 is bifurcated between two paradigms: specialized OCR pipelines (PaddleOCR, Surya, Tesseract) that are fast, cheap, and well-understood, and large VLMs (GPT-4o, Qwen2-VL) that perform OCR as an emergent capability alongside deeper understanding. For high-throughput, well-defined tasks (scanning thousands of invoices), specialized OCR is still the right choice. For complex, diverse, or low-volume documents, VLMs offer better accuracy and flexibility with no pipeline engineering. PaddleOCR dominates the open-source space for production use, while Surya leads on multilingual accuracy. Cloud APIs (Google, Azure, AWS) remain the default for enterprises that don't want to self-host.
Key Challenges
Handwritten text — unconstrained handwriting recognition remains 5-10× worse than printed text OCR, with CER of 5-20% depending on script and quality
Multilingual and mixed-script text — documents mixing Latin, Arabic, CJK, and Devanagari require per-script detection and recognition models
Degraded quality — old documents, faxes, photocopies, and low-resolution images produce OCR errors that compound in downstream processing
Mathematical notation and special symbols — formulas, chemical structures, and musical notation require specialized models beyond standard text OCR
Layout-dependent reading order — multi-column text, tables, and documents with complex spatial arrangements need correct ordering of recognized text
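The last challenge above can be made concrete: a naive multi-column ordering clusters boxes by x-position before reading each column top-to-bottom. A rough sketch with an invented gap threshold; real systems use trained layout-analysis models:

```python
def column_order(boxes, gap=40):
    """Group boxes (x, y, w, h) into columns: a horizontal jump larger than
    `gap` starts a new column. Each column is then read top-to-bottom.
    A naive sketch of multi-column reading order, not a production method."""
    cols = []
    for b in sorted(boxes, key=lambda w: w[0]):
        if cols and b[0] - cols[-1][-1][0] <= gap:
            cols[-1].append(b)   # close enough in x: same column
        else:
            cols.append([b])     # large x jump: start a new column
    return [w for col in cols for w in sorted(col, key=lambda w: w[1])]
```

This breaks on staggered or nested layouts (sidebars, tables, figures with captions), which is exactly why reading order remains an open problem rather than a solved sorting exercise.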
Quick Recommendations
Best open-source general OCR
PaddleOCR v4 (PP-OCRv4)
Best accuracy-speed tradeoff across 80+ languages; highly optimized for production with mobile support
Best multilingual accuracy
Surya OCR
SOTA on multilingual text recognition benchmarks; handles 90+ languages including low-resource scripts
Cloud API (highest accuracy)
Google Cloud Vision API or Azure AI Vision
99%+ accuracy on printed text; handles complex layouts, tables, and forms; SLA-backed for enterprise
Document-specific OCR
Donut or TrOCR-Large
Transformer-based end-to-end models; Donut parses documents without a separate detection stage, while TrOCR excels at printed-text line recognition
Math / scientific notation
Mathpix or LaTeX-OCR (Lukas Blecher)
Specialized for equation recognition; converts images of math to LaTeX at 90%+ accuracy
What's Next
OCR as a standalone task is being subsumed by document understanding — models that read, understand, and reason about text simultaneously. The remaining hard problems are handwriting (especially historical and medical), low-resource languages (scripts with <100K training samples), and real-time OCR for AR/camera applications. Video OCR (tracking and reading text in moving scenes) is an emerging frontier. Within 2-3 years, most OCR will be performed implicitly by VLMs rather than dedicated OCR engines.
Benchmarks & SOTA
OCRBench v2
Tests 8 core OCR capabilities across 23 tasks. Evaluates large multimodal models on text recognition, referring, and extraction.
State of the Art
Seed1.6-vision
ByteDance
62.2
overall-en-private
CC-OCR
Comprehensive Challenge OCR
Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.
State of the Art
Gemini 1.5 Pro
83.25
multi-scene-f1
MME-VideoOCR
MME Video OCR Benchmark
1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
State of the Art
Gemini 2.5 Pro
73.7
total-accuracy
reVISION
reVISION Polish Vision-Language Benchmark
Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.
No results tracked yet