Optical Character Recognition
Detect and read text in images and documents. Core for document intake, receipts, and scene text search.
How OCR Works
Optical Character Recognition transforms images of text into machine-readable characters. From ancient manuscripts to street signs, OCR bridges the gap between visual and textual information.
The OCR Pipeline
Picture an assembly line where each station transforms the image one step closer to text. Raw pixels enter on one end, structured text emerges from the other.
Preprocessing Matters
Garbage in, garbage out. Good preprocessing can improve accuracy by 20-30%. Think of it as cleaning your glasses before reading.
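To make the preprocessing step concrete, here is a minimal, numpy-only sketch of Otsu binarization, the same idea behind OpenCV's `THRESH_OTSU` flag used later in this page's code examples. The synthetic "page" array and the stroke coordinates are illustrative, not from any real scan:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_count = np.cumsum(hist)                    # pixels at or below each level
    cum_sum = np.cumsum(hist * np.arange(256))     # intensity mass at or below each level
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum_count[t - 1]                      # background pixel count
        w1 = total - w0                            # foreground pixel count
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t - 1] / w0                  # background mean intensity
        mu1 = (cum_sum[255] - cum_sum[t - 1]) / w1 # foreground mean intensity
        between = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Synthetic "scanned page": light background with one dark text stroke
page = np.full((64, 64), 220, dtype=np.uint8)
page[20:30, 10:50] = 30
t = otsu_threshold(page)
binary = (page > t).astype(np.uint8) * 255  # white background, black text
```

The point of binarization is that it removes background shading and sensor noise before the engine ever sees the image, which is where much of that 20-30% accuracy gain comes from.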
OCR Challenges
Not all text is created equal. A clean printed document is trivial; cursive handwriting on a crumpled receipt is a different beast entirely.
Document OCR
Difficulty: Easy. Clean printed text on a white background.
Key Challenges
- Low-quality scans
- Faded text
- Complex layouts
Visual Comparison
Detection vs Recognition: Two Distinct Problems
Think of it like reading a book in a messy room. First you find the book (detection), then you read the words (recognition). Most OCR systems solve both, but understanding the distinction clarifies why some succeed where others fail.
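The two-problem split can be expressed as two interfaces that compose into a full OCR system. The stubs below stand in for real models (e.g. CRAFT for detection, CRNN for recognition); the class and function names are illustrative, not from any library:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TextBox:
    x: int  # left edge of the detected region
    y: int  # top edge
    w: int  # width
    h: int  # height

# Detection: full image -> where the text is (no reading yet)
Detector = Callable[[object], List[TextBox]]
# Recognition: one cropped region -> the characters it contains
Recognizer = Callable[[object, TextBox], str]

def run_ocr(image, detect: Detector, recognize: Recognizer) -> List[str]:
    """Compose the two stages: find the regions, then read each one."""
    return [recognize(image, box) for box in detect(image)]

# Stub stages standing in for trained models
def stub_detect(image) -> List[TextBox]:
    return [TextBox(0, 0, 40, 10), TextBox(0, 20, 40, 10)]

def stub_recognize(image, box) -> str:
    return {0: "Hello", 20: "World"}[box.y]

print(run_ocr(None, stub_detect, stub_recognize))  # ['Hello', 'World']
```

Keeping the interfaces separate also clarifies failure modes: a missed word is a detection failure, a misspelled word is a recognition failure.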
Text Detection
Text Recognition
Recognition Decoders: CTC vs Attention
How do we go from a sequence of visual features to a sequence of characters? Two approaches dominate, each with distinct tradeoffs.
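CTC is the simpler of the two to illustrate. The recognizer emits one label per visual frame (including a special "blank"), and decoding collapses repeated labels and drops blanks. Below is a greedy CTC decode in plain Python; the alphabet and frame values are made up for the example:

```python
def ctc_greedy_decode(frame_labels, blank=0,
                      alphabet="-abcdefghijklmnopqrstuvwxyz"):
    """CTC decoding rule: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(alphabet[label])
        prev = label
    return "".join(out)

# Per-frame argmax labels from a recognizer (0 = blank, 1 = 'a', ..., 26 = 'z').
# Frames "cc-aa-tt" decode to "cat"; a blank between repeats keeps "tt" distinct.
frames = [3, 3, 0, 1, 1, 0, 20, 20]
print(ctc_greedy_decode(frames))  # "cat"
```

Attention decoders avoid the blank symbol entirely: they emit one character per step, conditioned on the previous output, which lets them model language context at the cost of slower, sequential decoding.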
Architecture Evolution
From hand-crafted features to vision transformers. Each generation brought new capabilities and new use cases.
CRNN Architecture (2015)
The workhorse of modern OCR. Still used in production systems today.
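The key idea of CRNN is turning a 2D image into a 1D sequence for the RNN. A shapes-only numpy sketch (the strides and channel counts are typical values, not a fixed specification; real CRNNs vary):

```python
import numpy as np

# CRNN in three stages, tracked by array shape only
H, W, C = 32, 128, 512           # input crop height/width, CNN feature channels
feat = np.zeros((C, 1, W // 4))  # CNN collapses height to 1, width stride of 4
seq = feat.squeeze(1).T          # -> (T, C): one feature vector per column
n_classes = 37                   # e.g. 26 letters + 10 digits + CTC blank
logits = np.zeros((seq.shape[0], n_classes))  # BiLSTM + linear head, per step
print(seq.shape, logits.shape)   # (32, 512) (32, 37)
```

The BiLSTM reads the 32 column vectors left to right (and right to left), and the CTC loss aligns the 32 per-step predictions to the ground-truth string during training.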
TrOCR Architecture (2021)
Transformers take over. Pre-trained vision encoder meets pre-trained language decoder.
The Multimodal Revolution (2024)
Large multimodal models like GPT-4V and Gemini can now perform OCR as a byproduct of their general vision-language capabilities. A single model handles detection, recognition, and even semantic understanding. The question becomes: when do you need a specialized OCR model versus a general-purpose multimodal model?
OCR Engines Compared
Open source vs cloud APIs. Speed vs accuracy. The right choice depends on your constraints.
| Engine | Type | Languages | Speed | Accuracy |
|---|---|---|---|---|
| Tesseract | Open Source | 100+ | Medium | Good |
| PaddleOCR | Open Source | 80+ | Fast | Excellent |
| EasyOCR | Open Source | 80+ | Slow | Good |
| Google Vision | Cloud API | 100+ | Fast | Excellent |
| AWS Textract | Cloud API | Limited | Fast | Excellent |
| Azure AI Vision | Cloud API | 100+ | Fast | Excellent |
Choose Open Source When:
- Privacy/offline is required
- High volume (cost matters)
- You can handle preprocessing
- Document OCR (clean images)
Choose Cloud APIs When:
- Maximum accuracy is needed
- Handwriting recognition
- Complex document layouts
- Quick prototyping
Consider Multimodal LLMs When:
- You need understanding, not just text
- Complex reasoning is required
- Handling diverse document types
- OCR is part of a larger pipeline
Code Examples
Get started with OCR in Python. Each library has its strengths.
import pytesseract
from PIL import Image

# Basic OCR
image = Image.open('document.png')
text = pytesseract.image_to_string(image)
print(text)

# With language specification
text_de = pytesseract.image_to_string(image, lang='deu')

# Get bounding boxes for each character
boxes = pytesseract.image_to_boxes(image)

# Get detailed data with confidence scores
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    conf = int(data['conf'][i])  # older pytesseract versions return conf as str
    if conf > 60:  # filter low-confidence words
        print(f"{word} (confidence: {conf}%)")

# Preprocessing helps accuracy
import cv2
img = cv2.imread('document.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text = pytesseract.image_to_string(thresh)

Quick Reference
- Tesseract (free, offline)
- PaddleOCR (production-ready)
- AWS Textract (forms/tables)
- EasyOCR (simple API)
- PaddleOCR (fast)
- Google Vision (best accuracy)
- TrOCR (transformer-based)
- Google Vision (handwritten)
- Azure AI Vision (Read API)
The Bottom Line
OCR has matured dramatically. For clean documents, any modern engine achieves 99%+ accuracy. The hard problems remain: degraded historical documents, unusual fonts, complex layouts, and handwriting. Choose your tool based on your specific challenge, not the benchmark numbers. Preprocessing often matters more than the engine itself.
Use Cases
- ✓ Invoice/receipt ingestion
- ✓ Scene text search
- ✓ ID card digitization
- ✓ Video subtitle extraction
Architectural Patterns
Detector + Recognizer
Find text regions, then recognize each line (DBNet/CRAFT + CRNN/SAR).
Transformer OCR
End-to-end transformer decoders (TrOCR, Florence) on cropped text.
Layout-Aware OCR
Preserve layout for downstream extraction (DocTr, Docling OCR modules).
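A small part of layout awareness can be sketched without any model: recovering reading order from detected word boxes by grouping them into lines. This is a simplified heuristic (a vertical tolerance groups boxes into lines; real layout models handle columns, tables, and rotation); the box tuples and tolerance value are illustrative:

```python
def reading_order(boxes, line_tol=10):
    """Group (x, y, text) word boxes into lines top-to-bottom,
    then sort left-to-right within each line."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):      # scan by top y
        for line in lines:
            if abs(line[0][1] - box[1]) <= line_tol:   # close enough: same line
                line.append(box)
                break
        else:
            lines.append([box])                         # start a new line
    return [sorted(line, key=lambda b: b[0]) for line in lines]

# Word positions arriving in arbitrary detector order
words = [(50, 0, "world"), (0, 22, "bar"), (0, 1, "hello"), (40, 20, "baz")]
ordered = reading_order(words)
print([[t for _, _, t in line] for line in ordered])  # [['hello', 'world'], ['bar', 'baz']]
```

Getting reading order right is what makes the extracted text usable downstream; a box-level transcript in the wrong order scrambles every sentence.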
Quick Facts
- Input: Image
- Output: Text
- Implementations: 3 open source, 0 API
- Patterns: 3 approaches