General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
General OCR (Optical Character Recognition) converts images of text into machine-readable strings. Modern OCR systems handle printed text in 100+ languages at 99%+ character accuracy; the real differentiation lies in handling degraded scans, complex layouts, mixed scripts, and mathematical notation. PaddleOCR and Surya dominate the open-source space; Google Cloud Vision and Azure lead among cloud APIs.
History
Ray Kurzweil develops the first omni-font OCR machine, reading text in any font — commercialized by Xerox
Tesseract open-sourced (developed at HP from 1985, maintained by Google since 2006); becomes the default free OCR engine for two decades
Deep learning OCR (CRNN: CNN + RNN + CTC loss) surpasses traditional methods on scene text and printed text benchmarks
Attention-based sequence-to-sequence models replace CTC for OCR, better handling variable-length text and complex scripts
PaddleOCR (Baidu) releases a comprehensive open-source OCR toolkit supporting 80+ languages with the PP-OCR pipeline (detect → classify → recognize)
TrOCR (Microsoft) applies transformer encoder-decoder architecture to OCR, matching LSTM-based methods with simpler architecture
Surya OCR (Vik Paruchuri) achieves state-of-the-art multilingual OCR with transformer-based models, supporting 90+ languages
GOT (General OCR Theory) demonstrates OCR as visual generation — a single model handles text, math, tables, sheet music, and molecular formulas
Large VLMs (GPT-4o, Qwen2-VL) perform OCR implicitly — send any image and get text extraction as a byproduct of visual understanding
How General OCR Works
Text Detection
A detection model (EAST, DBNet, CRAFT) finds text regions in the image, outputting bounding boxes or polygons around each text line or word. DBNet uses a differentiable binarization approach that handles curved and rotated text.
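DBNet's differentiable binarization can be illustrated in a few lines of plain Python. This is a toy sketch of the paper's approximate step function B = 1/(1 + e^(-k(P - T))), not real training code; the function name, list-of-lists maps, and sample values are illustrative:

```python
import math

def db_binarize(prob_map, thresh_map, k=50.0):
    """Toy differentiable binarization (DBNet-style): instead of a hard
    threshold, apply a steep sigmoid to (P - T) so the operation stays
    differentiable. Inputs are same-shaped 2D lists of floats in [0, 1]."""
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(prow, trow)]
        for prow, trow in zip(prob_map, thresh_map)
    ]

# Pixels well above the learned threshold saturate toward 1 (text),
# pixels below it saturate toward 0 (background).
binary = db_binarize([[0.9, 0.1]], [[0.3, 0.3]])
```

Because the sigmoid is steep but smooth, a network can learn the threshold map T jointly with the probability map P through ordinary backpropagation, which is what makes curved and rotated text tractable.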
Text Line Extraction
Detected regions are cropped, deskewed, and normalized to fixed height (32-48px) while preserving aspect ratio. Sorting by reading order (top-to-bottom, left-to-right) organizes the text spatially.
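The reading-order step can be sketched as a two-level sort, assuming axis-aligned word boxes `(x, y, w, h)`; the grouping tolerance and helper name are made up for illustration:

```python
def reading_order(boxes, line_tol=0.5):
    """Sort word boxes (x, y, w, h) top-to-bottom, then left-to-right.
    Boxes whose vertical centres fall within line_tol * height of the
    current line are grouped onto the same line before sorting by x."""
    boxes = sorted(boxes, key=lambda b: b[1] + b[3] / 2)  # by y-centre
    lines, current = [], []
    for b in boxes:
        if current and abs((b[1] + b[3] / 2) - (current[0][1] + current[0][3] / 2)) > line_tol * current[0][3]:
            lines.append(sorted(current, key=lambda w: w[0]))  # flush finished line
            current = []
        current.append(b)
    if current:
        lines.append(sorted(current, key=lambda w: w[0]))
    return [w for line in lines for w in line]
```

This naive grouping only handles simple single-column pages; production pipelines use layout analysis for columns and tables.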
Text Recognition
Each cropped text line is processed by a recognition model: a CNN/ViT encoder produces feature sequences, and a decoder (CTC or attention-based) produces character sequences. Modern models (TrOCR, PaddleOCR v4) use ViT encoders for better accuracy.
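For CTC decoders, the standard greedy decode takes the argmax at each timestep, collapses consecutive repeats, and strips the blank symbol. A minimal sketch, assuming one score vector per timestep; the charset and scores here are invented:

```python
def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse consecutive
    repeats, then drop blanks. `logits` is a list of per-timestep score
    lists aligned with `charset`; index `blank` is the CTC blank."""
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Timesteps argmax to: h h _ e l _ l o  ->  "hello"
charset = ["_", "h", "e", "l", "o"]
steps = [1, 1, 0, 2, 3, 0, 3, 4]
logits = [[1.0 if i == s else 0.0 for i in range(5)] for s in steps]
print(ctc_greedy_decode(logits, charset))  # prints "hello"
```

Note how the blank between the two `l` timesteps is what lets CTC emit a doubled letter; attention decoders avoid this alignment machinery entirely.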
Language Model Post-Processing
Optional spell-checking, language model rescoring, or dictionary lookup corrects OCR errors. For structured documents, post-processing may include table reconstruction and reading order correction.
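A minimal dictionary-lookup corrector can be written with the standard library's difflib; the vocabulary, cutoff, and function name are illustrative, and production systems typically use confusion-aware edit costs or language-model rescoring instead:

```python
import difflib

def correct_token(token, vocab, cutoff=0.8):
    """Dictionary-lookup post-processing: snap an OCR token to the closest
    vocabulary word if it is similar enough, otherwise keep it unchanged.
    `cutoff` is the minimum difflib similarity ratio to accept a match."""
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

vocab = ["invoice", "total", "amount"]
print(correct_token("inv0ice", vocab))  # prints "invoice" (0 -> o fixed)
print(correct_token("qqq", vocab))      # prints "qqq" (no close match)
```

The cutoff matters: too low and valid out-of-vocabulary tokens (names, IDs) get clobbered; too high and common OCR confusions (0/o, 1/l) slip through.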
Evaluation
Character Error Rate (CER) and Word Error Rate (WER) are the primary metrics. Printed English achieves <1% CER; handwriting and degraded scans range 5-20% CER. Benchmarks include ICDAR datasets, SROIE (receipts), and multilingual text datasets.
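CER is just edit distance normalized by reference length; a self-contained sketch (WER is the same computation over word tokens instead of characters):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions),
    computed row by row to keep memory linear in len(b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("hello world", "helo world"))  # one deletion over 11 chars, ~0.0909
```

Note that CER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why some benchmarks clip or report normalized scores.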
Current Landscape
General OCR in 2025 is bifurcated between two paradigms: specialized OCR pipelines (PaddleOCR, Surya, Tesseract) that are fast, cheap, and well-understood, and large VLMs (GPT-4o, Qwen2-VL) that perform OCR as an emergent capability alongside deeper understanding. For high-throughput, well-defined tasks (scanning thousands of invoices), specialized OCR is still the right choice. For complex, diverse, or low-volume documents, VLMs offer better accuracy and flexibility with no pipeline engineering. PaddleOCR dominates the open-source space for production use, while Surya leads on multilingual accuracy. Cloud APIs (Google, Azure, AWS) remain the default for enterprises that don't want to self-host.
Key Challenges
Handwritten text — unconstrained handwriting recognition remains 5-10× worse than printed text OCR, with CER of 5-20% depending on script and quality
Multilingual and mixed-script text — documents mixing Latin, Arabic, CJK, and Devanagari require per-script detection and recognition models
Degraded quality — old documents, faxes, photocopies, and low-resolution images produce OCR errors that compound in downstream processing
Mathematical notation and special symbols — formulas, chemical structures, and musical notation require specialized models beyond standard text OCR
Layout-dependent reading order — multi-column text, tables, and documents with complex spatial arrangements need correct ordering of recognized text
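The last challenge above can be made concrete: a naive multi-column ordering clusters boxes by x-position before reading each column top-to-bottom. A rough sketch with an invented gap threshold; real systems use trained layout-analysis models:

```python
def column_order(boxes, gap=40):
    """Group boxes (x, y, w, h) into columns: a horizontal jump larger than
    `gap` starts a new column. Each column is then read top-to-bottom.
    A naive sketch of multi-column reading order, not a production method."""
    cols = []
    for b in sorted(boxes, key=lambda w: w[0]):
        if cols and b[0] - cols[-1][-1][0] <= gap:
            cols[-1].append(b)   # close enough in x: same column
        else:
            cols.append([b])     # large x jump: start a new column
    return [w for col in cols for w in sorted(col, key=lambda w: w[1])]
```

This breaks on staggered or nested layouts (sidebars, tables, figures with captions), which is exactly why reading order remains an open problem rather than a solved sorting exercise.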
Quick Recommendations
Best open-source general OCR
PaddleOCR v4 (PP-OCRv4)
Best accuracy-speed tradeoff across 80+ languages; highly optimized for production with mobile support
Best multilingual accuracy
Surya OCR
SOTA on multilingual text recognition benchmarks; handles 90+ languages including low-resource scripts
Cloud API (highest accuracy)
Google Cloud Vision API or Azure AI Vision
99%+ accuracy on printed text; handles complex layouts, tables, and forms; SLA-backed for enterprise
Document-specific OCR
Donut or TrOCR-Large
Transformer-based end-to-end models; Donut parses documents without a separate detection stage, while TrOCR excels at printed-text line recognition
Math / scientific notation
Mathpix or LaTeX-OCR (Lukas Blecher)
Specialized for equation recognition; converts images of math to LaTeX at 90%+ accuracy
What's Next
OCR as a standalone task is being subsumed by document understanding — models that read, understand, and reason about text simultaneously. The remaining hard problems are handwriting (especially historical and medical), low-resource languages (scripts with <100K training samples), and real-time OCR for AR/camera applications. Video OCR (tracking and reading text in moving scenes) is an emerging frontier. Within 2-3 years, most OCR will be performed implicitly by VLMs rather than dedicated OCR engines.
Benchmarks & SOTA
OCRBench v2
Tests 8 core OCR capabilities across 23 tasks. Evaluates large multimodal models on text recognition, referring, and extraction.
State of the Art
Seed1.6-vision
ByteDance
62.2
overall-en-private
CC-OCR
Comprehensive Challenge OCR
Multi-scene text reading, key information extraction, multilingual text, and document parsing benchmark.
State of the Art
Gemini 1.5 Pro
83.25
multi-scene-f1
MME-VideoOCR
MME Video OCR Benchmark
1,464 videos with 2,000 QA pairs across 25 tasks. Tests OCR capabilities in video content.
State of the Art
Gemini 2.5 Pro
73.7
total-accuracy
reVISION
reVISION Polish Vision-Language Benchmark
Polish benchmark for vision-language models including OCR evaluation on educational exam materials. Covers middle school, high school, and professional exams.
No results tracked yet