Image -> Text

Optical Character Recognition

Detect and read text in images and documents. Core for document intake, receipts, and scene text search.

How OCR Works

Optical Character Recognition transforms images of text into machine-readable characters. From ancient manuscripts to street signs, OCR bridges the gap between visual and textual information.

1. The OCR Pipeline

Picture an assembly line where each station transforms the image one step closer to text. Raw pixels enter on one end, structured text emerges from the other.

  • Preprocessing: prepare the image for recognition
  • Text Detection: find where text exists
  • Recognition: convert pixels to characters
  • Post-processing: clean and structure the output

Watch the Pipeline in Action

Input Image ("Hello", noisy and rotated) -> Preprocessed (clean, aligned) -> Detected (bounding box found) -> Recognized ("Hello" extracted as text)

Preprocessing Matters

Garbage in, garbage out. Good preprocessing can improve accuracy by 20-30%. Think of it as cleaning your glasses before reading.

  • Binarization: convert to black and white; removes noise, enhances contrast
  • Deskewing: correct rotation; aligns text horizontally
  • Denoising: remove artifacts; cleaner character edges
  • Resizing: scale to the optimal size; better feature extraction
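
To make these steps concrete, here is a minimal OpenCV sketch covering all four (the file name, target height, and deskew recipe are illustrative assumptions, not a canonical pipeline):

import cv2
import numpy as np

def preprocess(path, target_height=64):  # target height is an illustrative choice
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Denoising: smooth sensor/scanner artifacts before thresholding
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Binarization: Otsu's method picks the black/white threshold automatically
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskewing: estimate the dominant angle of the text pixels.
    # Note: minAreaRect's angle convention changed across OpenCV versions,
    # so the sign may need flipping for your install.
    coords = np.column_stack(np.where(img == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # Resizing: scale to a height the recognizer expects
    scale = target_height / h
    return cv2.resize(img, (int(w * scale), target_height))

clean = preprocess('document.png')  # 'document.png' is a placeholder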

2. OCR Challenges

Not all text is created equal. A clean printed document is trivial; cursive handwriting on a crumpled receipt is a different beast entirely.

Document OCR

Difficulty: Easy

Clean printed text on a white background.

Examples: Scanned PDFs, forms, books
Typical accuracy: 99%+

Key Challenges

  • Low-quality scans
  • Faded text
  • Complex layouts

Visual Comparison

  • Document ("Document Text"): clean, uniform
  • Scene ("STOP"): perspective, lighting
  • Handwritten ("Hello world"): personal style
  • Multi-lingual ("Mix and blend"): mixed scripts

3. Detection vs Recognition: Two Distinct Problems

Think of it like reading a book in a messy room. First you find the book (detection), then you read the words (recognition). Most OCR systems solve both, but understanding the distinction clarifies why some succeed where others fail.

Text Detection

WHERE is the text?

For example, a storefront photo with a sign reading "CAFE OPEN 24/7" yields one box per piece of text.

Output: Bounding boxes / polygons
Models: CRAFT, EAST, DBNet, TextFuseNet
Challenge: Curved text, arbitrary orientation

Text Recognition

WHAT does it say?

Given the cropped "CAFE" region, the recognizer outputs the string "CAFE".

Output: Character sequence
Models: CRNN, TrOCR, SVTR, ABINet
Challenge: Variable length, large vocabulary
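
A quick way to see both stages in code is EasyOCR, which bundles a detector and a recognizer and returns one (box, text, confidence) triple per region. A minimal sketch, with sign.png as a placeholder image:

import easyocr

# Loads both a detector (CRAFT) and a recognizer under the hood
reader = easyocr.Reader(['en'])

# Each result pairs the WHERE (polygon corners) with the WHAT (string + confidence)
for box, text, conf in reader.readtext('sign.png'):
    print(f"{text!r} at {box} (confidence {conf:.2f})")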

Recognition Decoders: CTC vs Attention

How do we go from a sequence of visual features to a sequence of characters? Two approaches dominate, each with distinct tradeoffs.

CTC (Connectionist Temporal Classification)

  + Fast inference
  + Simple training
  + No autoregressive decoding
  - Conditional independence assumption
  - Struggles with long sequences

Best for: real-time, simple text

Attention-based

  + Handles long sequences
  + Better for complex scripts
  + Can model dependencies
  - Slower (autoregressive)
  - Attention drift issues

Best for: complex layouts, varied fonts
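
To make the CTC side concrete, here is a minimal greedy CTC decoder: take the per-frame argmax, collapse repeats, then drop the blank symbol (the frame predictions below are a made-up example):

# Greedy CTC decoding: per-frame argmax -> collapse repeats -> remove blanks
BLANK = '-'

def ctc_greedy_decode(frame_labels):
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return ''.join(out)

# 8 frames of per-timestep predictions for the word "cafe"
frames = ['c', 'c', '-', 'a', 'f', 'f', '-', 'e']
print(ctc_greedy_decode(frames))  # -> "cafe"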

4. Architecture Evolution

From hand-crafted features to vision transformers. Each generation brought new capabilities and new use cases.

  • 2006: Tesseract v3 (traditional)
  • 2015: CRNN (deep learning)
  • 2017: Attention OCR (deep learning)
  • 2018: Tesseract v4 (hybrid)
  • 2019: CRAFT (detection)
  • 2021: TrOCR (transformer)
  • 2022: PaddleOCR v3 (production)
  • 2024: GOT-OCR (foundation)

CRNN Architecture (2015)

The workhorse of modern OCR. Still used in production systems today.

CNN -> BiLSTM -> CTC

The CNN extracts visual features, the BiLSTM captures sequence context, and CTC decodes the result to text.
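
A minimal PyTorch sketch of this shape (layer sizes are illustrative, not the exact configuration from the paper):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # CNN: collapse height to 1 so each remaining column is one time step
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # (B, 256, 1, W')
        )
        # BiLSTM: sequence context across the width (time) dimension
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        # Per-timestep class scores, plus one extra class for the CTC blank
        self.fc = nn.Linear(256, num_classes + 1)

    def forward(self, x):                          # x: (B, 1, H, W)
        feats = self.cnn(x)                        # (B, 256, 1, W')
        feats = feats.squeeze(2).transpose(1, 2)   # (B, W', 256)
        seq, _ = self.rnn(feats)                   # (B, W', 256)
        # (B, T, C); transpose to (T, B, C) before feeding nn.CTCLoss
        return self.fc(seq).log_softmax(-1)

model = CRNN(num_classes=36)  # e.g. a-z + 0-9
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # (2, 32, 37): 32 time steps, 37 classes incl. blank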

TrOCR Architecture (2021)

Transformers take over. Pre-trained vision encoder meets pre-trained language decoder.

ViT/DeiT -> Cross-Attention -> GPT-2

A Vision Transformer encodes the image; a GPT-2 style decoder attends to those features and generates the text autoregressively.
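
TrOCR is available through the Hugging Face transformers library; a minimal usage sketch (line.png is a placeholder for a cropped text line):

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

# TrOCR expects a cropped text line, not a full page
image = Image.open('line.png').convert('RGB')
pixel_values = processor(images=image, return_tensors='pt').pixel_values

# The decoder generates the transcription token by token (autoregressively)
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)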

The Multimodal Revolution (2024)

Large multimodal models like GPT-4V and Gemini can now perform OCR as a byproduct of their general vision-language capabilities. A single model handles detection, recognition, and even semantic understanding. The question becomes: when do you need a specialized OCR model versus a general-purpose multimodal model?
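
As a sketch of what this looks like in practice, here is an OCR request through the OpenAI Python client (the model name, prompt, and file name are illustrative; any vision-capable model with a similar API works):

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open('receipt.png', 'rb') as f:  # 'receipt.png' is a placeholder
    b64 = base64.b64encode(f.read()).decode()

# The same request can ask for plain transcription or semantic extraction
response = client.chat.completions.create(
    model='gpt-4o',  # illustrative choice of vision-capable model
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Transcribe all text in this image.'},
            {'type': 'image_url',
             'image_url': {'url': f'data:image/png;base64,{b64}'}},
        ],
    }],
)
print(response.choices[0].message.content)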

5. OCR Engines Compared

Open source vs cloud APIs. Speed vs accuracy. The right choice depends on your constraints.

Engine           Type          Languages  Speed   Accuracy
Tesseract        Open Source   100+       Medium  Good
PaddleOCR        Open Source   80+        Fast    Excellent
EasyOCR          Open Source   80+        Slow    Good
Google Vision    Cloud API     100+       Fast    Excellent
AWS Textract     Cloud API     Limited    Fast    Excellent
Azure AI Vision  Cloud API     100+       Fast    Excellent

Choose Open Source When:

  • Privacy/offline is required
  • High volume (cost matters)
  • You can handle preprocessing
  • Document OCR (clean images)

Choose Cloud APIs When:

  • Maximum accuracy needed
  • Handwriting recognition
  • Complex document layouts
  • Quick prototyping

Consider Multimodal LLMs When:

  • You need understanding, not just text
  • Complex reasoning required
  • Handling diverse document types
  • OCR is part of a larger pipeline

6. Code Examples

Get started with OCR in Python. Each library has its strengths.

Tesseract (classic): pip install pytesseract

import pytesseract
from PIL import Image

# Basic OCR
image = Image.open('document.png')
text = pytesseract.image_to_string(image)
print(text)

# With language specification
text_de = pytesseract.image_to_string(image, lang='deu')

# Get bounding boxes for each character
boxes = pytesseract.image_to_boxes(image)

# Get detailed data with confidence scores
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    conf = data['conf'][i]
    if conf > 60:  # Filter low confidence
        print(f"{word} (confidence: {conf}%)")

# Preprocessing helps accuracy
import cv2
img = cv2.imread('document.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text = pytesseract.image_to_string(thresh)
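
For comparison, a minimal PaddleOCR sketch (the interface below follows the PaddleOCR 2.x API, which may differ in other versions):

from paddleocr import PaddleOCR

# Downloads detection + recognition models on first run
ocr = PaddleOCR(use_angle_cls=True, lang='en')

result = ocr.ocr('document.png', cls=True)
for line in result[0]:
    box, (text, conf) = line  # polygon corners, recognized string, confidence
    print(f"{text} ({conf:.2f})")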

Quick Reference

For Documents
  • Tesseract (free, offline)
  • PaddleOCR (production-ready)
  • AWS Textract (forms/tables)

For Scene Text
  • EasyOCR (simple API)
  • PaddleOCR (fast)
  • Google Vision (best accuracy)

For Handwriting
  • TrOCR (transformer-based)
  • Google Vision (handwritten)
  • Azure AI Vision (Read API)

The Bottom Line

OCR has matured dramatically. For clean documents, any modern engine achieves 99%+ accuracy. The hard problems remain: degraded historical documents, unusual fonts, complex layouts, and handwriting. Choose your tool based on your specific challenge, not the benchmark numbers. Preprocessing often matters more than the engine itself.

Use Cases

  • Invoice/receipt ingestion
  • Scene text search
  • ID card digitization
  • Video subtitle extraction

Architectural Patterns

Detector + Recognizer

Find text regions, then recognize each line (DBNet/CRAFT + CRNN/SAR).

Transformer OCR

End-to-end transformer decoders (TrOCR, Florence) on cropped text.

Layout-Aware OCR

Preserve layout for downstream extraction (DocTr, Docling OCR modules).

Implementations

Open Source

PaddleOCR

Apache 2.0
Open Source

Strong multilingual OCR with detection + recognition.

TrOCR

MIT
Open Source

Transformer-based OCR. Good on scanned docs.

Tesseract

Apache 2.0
Open Source

Lightweight classic OCR. Good for Latin scripts.

Benchmarks

Quick Facts

Input: Image
Output: Text
Implementations: 3 open source, 0 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for optical character recognition.

Submit Results