Image -> Text

Optical Character Recognition

Detect and read text in images and documents. Core for document intake, receipts, and scene text search.

How OCR Works

Optical Character Recognition transforms images of text into machine-readable characters. From ancient manuscripts to street signs, OCR bridges the gap between visual and textual information.

1. The OCR Pipeline

Picture an assembly line where each station transforms the image one step closer to text. Raw pixels enter on one end, structured text emerges from the other.

  • Preprocessing: prepare the image for recognition
  • Text Detection: find where text exists
  • Recognition: convert pixels to characters
  • Post-processing: clean and structure the output

Watch the Pipeline in Action

Input Image ("Hello", noisy and rotated) -> Preprocessed (clean, aligned) -> Detected (bounding box found) -> Recognized ("Hello" extracted as text)

Preprocessing Matters

Garbage in, garbage out. Good preprocessing can improve accuracy by 20-30%. Think of it as cleaning your glasses before reading.

  • Binarization: convert to black and white; removes noise, enhances contrast
  • Deskewing: correct rotation; aligns text horizontally
  • Denoising: remove artifacts; cleaner character edges
  • Resizing: scale to the optimal size; better feature extraction
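
To make these steps concrete, here is a minimal OpenCV sketch covering all four (the file name, target height, and deskew recipe are illustrative assumptions, not a canonical pipeline):

import cv2
import numpy as np

def preprocess(path, target_height=64):  # target height is an illustrative choice
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Denoising: smooth sensor/scanner artifacts before thresholding
    img = cv2.fastNlMeansDenoising(img, h=10)

    # Binarization: Otsu's method picks the black/white threshold automatically
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskewing: estimate the dominant angle of the text pixels.
    # Note: minAreaRect's angle convention changed across OpenCV versions,
    # so the sign may need flipping for your install.
    coords = np.column_stack(np.where(img == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # Resizing: scale to a height the recognizer expects
    scale = target_height / h
    return cv2.resize(img, (int(w * scale), target_height))

clean = preprocess('document.png')  # 'document.png' is a placeholder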

2. OCR Challenges

Not all text is created equal. A clean printed document is trivial; cursive handwriting on a crumpled receipt is a different beast entirely.

Document OCR

Difficulty: Easy

Clean printed text on a white background.

Examples: Scanned PDFs, forms, books
Typical accuracy: 99%+

Key Challenges

  • Low-quality scans
  • Faded text
  • Complex layouts

Visual Comparison

  • Document ("Document Text"): clean, uniform
  • Scene ("STOP"): perspective, lighting
  • Handwritten ("Hello world"): personal style
  • Multi-lingual ("Mix and blend"): mixed scripts

3. Detection vs Recognition: Two Distinct Problems

Think of it like reading a book in a messy room. First you find the book (detection), then you read the words (recognition). Most OCR systems solve both, but understanding the distinction clarifies why some succeed where others fail.

Text Detection

WHERE is the text?

For example, a storefront photo with a sign reading "CAFE OPEN 24/7" yields one box per piece of text.

Output: Bounding boxes / polygons
Models: CRAFT, EAST, DBNet, TextFuseNet
Challenge: Curved text, arbitrary orientation

Text Recognition

WHAT does it say?

Given the cropped "CAFE" region, the recognizer outputs the string "CAFE".

Output: Character sequence
Models: CRNN, TrOCR, SVTR, ABINet
Challenge: Variable length, large vocabulary
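
A quick way to see both stages in code is EasyOCR, which bundles a detector and a recognizer and returns one (box, text, confidence) triple per region. A minimal sketch, with sign.png as a placeholder image:

import easyocr

# Loads both a detector (CRAFT) and a recognizer under the hood
reader = easyocr.Reader(['en'])

# Each result pairs the WHERE (polygon corners) with the WHAT (string + confidence)
for box, text, conf in reader.readtext('sign.png'):
    print(f"{text!r} at {box} (confidence {conf:.2f})")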

Recognition Decoders: CTC vs Attention

How do we go from a sequence of visual features to a sequence of characters? Two approaches dominate, each with distinct tradeoffs.

CTC (Connectionist Temporal Classification)

  + Fast inference
  + Simple training
  + No autoregressive decoding
  - Conditional independence assumption
  - Struggles with long sequences

Best for: real-time, simple text

Attention-based

  + Handles long sequences
  + Better for complex scripts
  + Can model dependencies
  - Slower (autoregressive)
  - Attention drift issues

Best for: complex layouts, varied fonts
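
To make the CTC side concrete, here is a minimal greedy CTC decoder: take the per-frame argmax, collapse repeats, then drop the blank symbol (the frame predictions below are a made-up example):

# Greedy CTC decoding: per-frame argmax -> collapse repeats -> remove blanks
BLANK = '-'

def ctc_greedy_decode(frame_labels):
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return ''.join(out)

# 8 frames of per-timestep predictions for the word "cafe"
frames = ['c', 'c', '-', 'a', 'f', 'f', '-', 'e']
print(ctc_greedy_decode(frames))  # -> "cafe"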

4. Architecture Evolution

From hand-crafted features to vision transformers. Each generation brought new capabilities and new use cases.

  • 2006: Tesseract v3 (traditional)
  • 2015: CRNN (deep learning)
  • 2017: Attention OCR (deep learning)
  • 2018: Tesseract v4 (hybrid)
  • 2019: CRAFT (detection)
  • 2021: TrOCR (transformer)
  • 2022: PaddleOCR v3 (production)
  • 2024: GOT-OCR (foundation)

CRNN Architecture (2015)

The workhorse of modern OCR. Still used in production systems today.

CNN -> BiLSTM -> CTC

The CNN extracts visual features, the BiLSTM captures sequence context, and CTC decodes the result to text.
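
A minimal PyTorch sketch of this shape (layer sizes are illustrative, not the exact configuration from the paper):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # CNN: collapse height to 1 so each remaining column is one time step
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # (B, 256, 1, W')
        )
        # BiLSTM: sequence context across the width (time) dimension
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        # Per-timestep class scores, plus one extra class for the CTC blank
        self.fc = nn.Linear(256, num_classes + 1)

    def forward(self, x):                          # x: (B, 1, H, W)
        feats = self.cnn(x)                        # (B, 256, 1, W')
        feats = feats.squeeze(2).transpose(1, 2)   # (B, W', 256)
        seq, _ = self.rnn(feats)                   # (B, W', 256)
        # (B, T, C); transpose to (T, B, C) before feeding nn.CTCLoss
        return self.fc(seq).log_softmax(-1)

model = CRNN(num_classes=36)  # e.g. a-z + 0-9
logits = model(torch.randn(2, 1, 32, 128))
print(logits.shape)  # (2, 32, 37): 32 time steps, 37 classes incl. blank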

TrOCR Architecture (2021)

Transformers take over. Pre-trained vision encoder meets pre-trained language decoder.

ViT/DeiT -> Cross-Attention -> GPT-2

A Vision Transformer encodes the image; a GPT-2 style decoder attends to those features and generates the text autoregressively.
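
TrOCR is available through the Hugging Face transformers library; a minimal usage sketch (line.png is a placeholder for a cropped text line):

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

# TrOCR expects a cropped text line, not a full page
image = Image.open('line.png').convert('RGB')
pixel_values = processor(images=image, return_tensors='pt').pixel_values

# The decoder generates the transcription token by token (autoregressively)
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)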

The Multimodal Revolution (2024)

Large multimodal models like GPT-4V and Gemini can now perform OCR as a byproduct of their general vision-language capabilities. A single model handles detection, recognition, and even semantic understanding. The question becomes: when do you need a specialized OCR model versus a general-purpose multimodal model?
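
As a sketch of what this looks like in practice, here is an OCR request through the OpenAI Python client (the model name, prompt, and file name are illustrative; any vision-capable model with a similar API works):

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open('receipt.png', 'rb') as f:  # 'receipt.png' is a placeholder
    b64 = base64.b64encode(f.read()).decode()

# The same request can ask for plain transcription or semantic extraction
response = client.chat.completions.create(
    model='gpt-4o',  # illustrative choice of vision-capable model
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Transcribe all text in this image.'},
            {'type': 'image_url',
             'image_url': {'url': f'data:image/png;base64,{b64}'}},
        ],
    }],
)
print(response.choices[0].message.content)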

5. OCR Engines Compared

Open source vs cloud APIs. Speed vs accuracy. The right choice depends on your constraints.

Engine           Type          Languages  Speed   Accuracy
Tesseract        Open Source   100+       Medium  Good
PaddleOCR        Open Source   80+        Fast    Excellent
EasyOCR          Open Source   80+        Slow    Good
Google Vision    Cloud API     100+       Fast    Excellent
AWS Textract     Cloud API     Limited    Fast    Excellent
Azure AI Vision  Cloud API     100+       Fast    Excellent

Choose Open Source When:

  • Privacy/offline is required
  • High volume (cost matters)
  • You can handle preprocessing
  • Document OCR (clean images)

Choose Cloud APIs When:

  • Maximum accuracy needed
  • Handwriting recognition
  • Complex document layouts
  • Quick prototyping

Consider Multimodal LLMs When:

  • You need understanding, not just text
  • Complex reasoning required
  • Handling diverse document types
  • OCR is part of a larger pipeline

6. Code Examples

Get started with OCR in Python. Each library has its strengths.

Tesseract (classic): pip install pytesseract

import pytesseract
from PIL import Image

# Basic OCR
image = Image.open('document.png')
text = pytesseract.image_to_string(image)
print(text)

# With language specification
text_de = pytesseract.image_to_string(image, lang='deu')

# Get bounding boxes for each character
boxes = pytesseract.image_to_boxes(image)

# Get detailed data with confidence scores
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data['text']):
    conf = data['conf'][i]
    if conf > 60:  # Filter low confidence
        print(f"{word} (confidence: {conf}%)")

# Preprocessing helps accuracy
import cv2
img = cv2.imread('document.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
text = pytesseract.image_to_string(thresh)
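
For comparison, a minimal PaddleOCR sketch (the interface below follows the PaddleOCR 2.x API, which may differ in other versions):

from paddleocr import PaddleOCR

# Downloads detection + recognition models on first run
ocr = PaddleOCR(use_angle_cls=True, lang='en')

result = ocr.ocr('document.png', cls=True)
for line in result[0]:
    box, (text, conf) = line  # polygon corners, recognized string, confidence
    print(f"{text} ({conf:.2f})")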

Quick Reference

For Documents
  • Tesseract (free, offline)
  • PaddleOCR (production-ready)
  • AWS Textract (forms/tables)

For Scene Text
  • EasyOCR (simple API)
  • PaddleOCR (fast)
  • Google Vision (best accuracy)

For Handwriting
  • TrOCR (transformer-based)
  • Google Vision (handwritten)
  • Azure AI Vision (Read API)

The Bottom Line

OCR has matured dramatically. For clean documents, any modern engine achieves 99%+ accuracy. The hard problems remain: degraded historical documents, unusual fonts, complex layouts, and handwriting. Choose your tool based on your specific challenge, not the benchmark numbers. Preprocessing often matters more than the engine itself.

Use Cases

  • Invoice/receipt ingestion
  • Scene text search
  • ID card digitization
  • Video subtitle extraction

Architectural Patterns

Detector + Recognizer

Find text regions, then recognize each line (DBNet/CRAFT + CRNN/SAR).

Transformer OCR

End-to-end transformer decoders (TrOCR, Florence) on cropped text.

Layout-Aware OCR

Preserve layout for downstream extraction (DocTr, Docling OCR modules).

Implementations

Open Source

PaddleOCR

Apache 2.0
Open Source

Strong multilingual OCR with detection + recognition.

TrOCR

MIT
Open Source

Transformer-based OCR. Good on scanned docs.

Tesseract

Apache 2.0
Open Source

Lightweight classic OCR. Good for Latin scripts.

Benchmarks

Quick Facts

Input: Image
Output: Text
Implementations: 3 open source, 0 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for optical character recognition.

Submit Results