Build a Document Scanner
Detect document edges, correct perspective, and enhance scanned images. Interactive demo below.
Try It
Upload a photo of a document (receipt, page, ID card). The scanner will detect the edges, let you adjust them, and transform the image to a flat, rectangular scan.
The demo walks through each stage interactively: edge detection (Canny), contour detection and 4-point polygon selection, perspective transform (homography), enhancement, OCR with Tesseract.js, and structured data extraction with regex.
How It Works
Document scanning involves four steps:
- Edge detection: Find where the document boundaries are using Canny edge detection
- Contour finding: Extract the document outline as a 4-point polygon
- Perspective transform: Warp the tilted document into a flat rectangle
- Enhancement: Improve contrast and optionally convert to black-and-white
Step 1: Edge Detection
The Canny algorithm finds edges by looking for rapid changes in pixel intensity. We first convert to grayscale and blur to reduce noise:
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)
The values 50 and 150 are the low and high thresholds. Pixels with a gradient above 150 are kept as strong edges; pixels between 50 and 150 are kept only if they connect to a strong edge.
Step 2: Find the Document Contour
We find all contours (closed shapes) in the edge image, then look for the largest one that approximates to a 4-point polygon:
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for contour in sorted(contours, key=cv2.contourArea, reverse=True):
    peri = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
    if len(approx) == 4:
        doc_contour = approx
        break
The approxPolyDP function simplifies the contour. The epsilon value (0.02 * perimeter) controls how aggressively it simplifies; larger values produce polygons with fewer vertices.
Step 3: Perspective Transform
Now we have 4 corner points of the tilted document. We want to map these to a rectangle. This is a homography transformation:
# Source points (corners of tilted document)
src_pts = np.array([[x1,y1], [x2,y2], [x3,y3], [x4,y4]], dtype=np.float32)
# Destination points (rectangle)
dst_pts = np.array([[0,0], [width,0], [width,height], [0,height]], dtype=np.float32)
# Get transform matrix and apply
M = cv2.getPerspectiveTransform(src_pts, dst_pts)
result = cv2.warpPerspective(img, M, (width, height))

The order of corners matters. They must be listed in the same order (e.g., clockwise starting from top-left) in both arrays.
Step 4: Enhancement
For text documents, adaptive thresholding produces clean black-and-white output:
gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
enhanced = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    11, 2  # block size, constant
)

Unlike global thresholding, adaptive thresholding calculates the threshold for each pixel based on its neighbors. This handles uneven lighting across the document.
Complete Python Code
import cv2
import numpy as np

def scan_document(image_path: str, output_path: str) -> None:
    """
    Scan a document: detect edges, correct perspective, enhance.
    """
    # Load image
    img = cv2.imread(image_path)
    orig = img.copy()

    # Resize for processing (keep aspect ratio)
    height, width = img.shape[:2]
    scale = 500 / max(height, width)
    img = cv2.resize(img, None, fx=scale, fy=scale)

    # Convert to grayscale and blur
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Edge detection
    edges = cv2.Canny(blurred, 50, 150)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))

    # Find contours
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)

    # Find the document contour (largest 4-sided polygon)
    doc_contour = None
    for contour in contours:
        peri = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * peri, True)
        if len(approx) == 4:
            doc_contour = approx
            break

    if doc_contour is None:
        raise ValueError("Could not detect document edges")

    # Scale contour back to original image size
    doc_contour = (doc_contour / scale).astype(np.float32)

    # Order corners: top-left, top-right, bottom-right, bottom-left
    pts = doc_contour.reshape(4, 2)
    rect = order_corners(pts)

    # Calculate output dimensions
    width_top = np.linalg.norm(rect[1] - rect[0])
    width_bottom = np.linalg.norm(rect[2] - rect[3])
    height_left = np.linalg.norm(rect[3] - rect[0])
    height_right = np.linalg.norm(rect[2] - rect[1])
    max_width = int(max(width_top, width_bottom))
    max_height = int(max(height_left, height_right))

    # Perspective transform
    dst = np.array([
        [0, 0],
        [max_width - 1, 0],
        [max_width - 1, max_height - 1],
        [0, max_height - 1]
    ], dtype=np.float32)
    M = cv2.getPerspectiveTransform(rect, dst)
    scanned = cv2.warpPerspective(orig, M, (max_width, max_height))

    # Enhance: convert to grayscale and apply adaptive threshold
    scanned_gray = cv2.cvtColor(scanned, cv2.COLOR_BGR2GRAY)
    scanned_enhanced = cv2.adaptiveThreshold(
        scanned_gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    cv2.imwrite(output_path, scanned_enhanced)
    print(f"Saved: {output_path}")

def order_corners(pts: np.ndarray) -> np.ndarray:
    """Order corners: top-left, top-right, bottom-right, bottom-left."""
    rect = np.zeros((4, 2), dtype=np.float32)
    # Top-left has smallest sum, bottom-right has largest
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]
    rect[2] = pts[np.argmax(s)]
    # Top-right has smallest diff, bottom-left has largest
    d = np.diff(pts, axis=1)
    rect[1] = pts[np.argmin(d)]
    rect[3] = pts[np.argmax(d)]
    return rect

if __name__ == "__main__":
    scan_document("photo.jpg", "scanned.png")
Install: pip install opencv-python numpy
Minimal Version (15 lines)
If you know the document will be detected correctly, here's the minimal version:
import cv2
import numpy as np

# Load and detect edges
img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(cv2.GaussianBlur(gray, (5,5), 0), 50, 150)

# Find largest 4-sided contour
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
for c in sorted(contours, key=cv2.contourArea, reverse=True):
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    if len(approx) == 4:
        pts = approx.reshape(4, 2).astype(np.float32)
        break

# Transform to rectangle (assumes the detected corners already match dst's order:
# top-left, top-right, bottom-right, bottom-left)
w, h = 800, 1000
dst = np.array([[0,0], [w,0], [w,h], [0,h]], dtype=np.float32)
M = cv2.getPerspectiveTransform(pts, dst)
result = cv2.warpPerspective(img, M, (w, h))
cv2.imwrite("scanned.png", result)

When Edge Detection Fails
Auto-detection fails when:
- Document is on a similar-colored background (white paper on white desk)
- Part of the document is cut off in the photo
- Strong shadows or reflections break the edge
- Multiple documents in the frame
For these cases, let users manually select the 4 corners (like in the demo above). Many apps show the auto-detected corners but allow adjustment before transforming.
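A minimal sketch of this fallback, assuming the four corners come from a click handler in your UI: the manual_corners coordinates below are invented for illustration, and order_corners is the helper from the complete script above.

import cv2
import numpy as np

# Hypothetical corners clicked by the user, in arbitrary order
manual_corners = np.array([[112, 80], [1430, 95], [1465, 1880], [90, 1840]], dtype=np.float32)

img = cv2.imread("photo.jpg")
rect = order_corners(manual_corners)  # reuse order_corners() from the complete script

# Derive output size from the selected quad, same as the automatic path
w = int(max(np.linalg.norm(rect[1] - rect[0]), np.linalg.norm(rect[2] - rect[3])))
h = int(max(np.linalg.norm(rect[3] - rect[0]), np.linalg.norm(rect[2] - rect[1])))
dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=np.float32)

M = cv2.getPerspectiveTransform(rect, dst)
cv2.imwrite("scanned_manual.png", cv2.warpPerspective(img, M, (w, h)))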
Adding OCR
Once you have a clean scan, run OCR to extract text. See Getting Started with OCR for how to use PaddleOCR or GPT-4o on your scanned documents.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
result = ocr.predict('scanned.png')
for item in result:
    for text in item.get('rec_texts', []):
        print(text)

Why Benchmarks Don't Tell the Whole Story
OCR benchmarks typically measure Character Error Rate (CER) and Word Error Rate (WER). These metrics count how many characters or words the model got wrong:
# CER = (insertions + deletions + substitutions) / total_characters
# WER = (insertions + deletions + substitutions) / total_words
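As a concrete sketch, both metrics reduce to an edit-distance calculation. The snippet below is a minimal, illustrative implementation (production benchmarks typically use a library such as jiwer):

def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance; works on strings or lists of words
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h),    # substitution (free if the characters match)
            )
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

print(cer("Invoice Total: $1,234.56", "Invoice Total: $1,234.56"))  # 0.0
print(wer("Invoice Total: $1,234.56", "Invoice Totel: $1,234.56"))  # ~0.33 (1 of 3 words wrong)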
Ground truth: "Invoice Total: $1,234.56"
OCR output: "Invoice Total: $1,234.56"
CER = 0%, WER = 0%  # Perfect!

But for invoices and tables, perfect character accuracy doesn't mean correct extraction. Consider this invoice table:
Ground truth (what you want):

┌─────────────────────┬─────┬─────────┬───────────┐
│ Description         │ Qty │ Price   │ Total     │
├─────────────────────┼─────┼─────────┼───────────┤
│ Web Development     │ 40  │ $150.00 │ $6,000.00 │
│ UI/UX Design        │ 20  │ $125.00 │ $2,500.00 │
└─────────────────────┴─────┴─────────┴───────────┘

OCR output (what you get):

Web Development 40 $150.00 $6,000.00
UI/UX Design 20 $125.00 $2,500.00
CER and WER are both 0%; every character is correct. But the table structure is completely lost. You can't programmatically answer "what's the price of UI/UX Design?" without reconstructing which numbers belong to which row.
What Benchmarks Actually Measure
| Metric | Measures | Misses |
|---|---|---|
| CER | Character-level accuracy | Word boundaries, structure, semantics |
| WER | Word-level accuracy | Line order, table structure, relationships |
| TEDS | Table structure (edit distance on HTML) | Cell content accuracy, merged cells |
| F1 (field extraction) | Correct key-value pairs extracted | Best for invoices, but schema-dependent |
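For invoices, field-level F1 is usually the most informative of these. A minimal sketch of how it might be computed over extracted key-value pairs (the field names and values are invented for illustration):

def field_f1(predicted: dict, ground_truth: dict) -> float:
    # A field counts as correct only if both the key and the value match exactly
    true_positives = sum(1 for k, v in predicted.items() if ground_truth.get(k) == v)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = {"invoice_number": "INV-1042", "total": "8500.00", "due_date": "2024-07-01"}
pred = {"invoice_number": "INV-1042", "total": "8500.00", "due_date": "2024-07-10"}
print(field_f1(pred, truth))  # ~0.67: two of three fields correct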
The Real Pipeline for Tabular Data
For invoices and forms, text extraction is just step 1. You also need:
# Step 1: Preprocess (what we covered above)
scan_document("invoice_photo.jpg", "invoice_scanned.png")

# Step 2: OCR - extract text with bounding boxes
ocr = PaddleOCR(lang='en')
result = ocr.predict("invoice_scanned.png")
# Returns: [{"text": "Web Development", "bbox": [x1,y1,x2,y2], "confidence": 0.98}, ...]
# Step 3: Table detection - find table regions
# (Requires separate model or heuristics)
# Step 4: Cell assignment - which text belongs to which cell
# (Spatial clustering based on bounding box positions)
# Step 5: Structure reconstruction - row/column relationships
# (Graph-based or rule-based assignment)
# Step 6: Field extraction - map to your schema
# {"line_items": [{"description": "Web Development", "qty": 40, "price": 150.00}]}

Traditional OCR engines (PaddleOCR, Tesseract) only do Step 2. You're responsible for Steps 3-6, which is where most of the complexity lies.
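Steps 4 and 5 usually come down to spatial clustering over the OCR bounding boxes. Here is a minimal, illustrative sketch of row reconstruction, assuming each OCR item carries a bbox of [x1, y1, x2, y2] pixel coordinates and that a fixed y_tolerance (a tunable guess) separates rows:

def group_into_rows(ocr_items, y_tolerance=15):
    """Cluster OCR boxes into rows by vertical center, then sort each row left-to-right."""
    items = sorted(ocr_items, key=lambda it: (it["bbox"][1] + it["bbox"][3]) / 2)
    rows = []
    for item in items:
        center_y = (item["bbox"][1] + item["bbox"][3]) / 2
        if rows and abs(center_y - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": center_y, "items": [item]})
    return [sorted(r["items"], key=lambda it: it["bbox"][0]) for r in rows]

# Hypothetical OCR output for the two line items above
ocr_items = [
    {"text": "Web Development", "bbox": [40, 300, 260, 330]},
    {"text": "40", "bbox": [300, 302, 330, 328]},
    {"text": "$150.00", "bbox": [380, 301, 460, 329]},
    {"text": "$6,000.00", "bbox": [500, 300, 600, 330]},
    {"text": "UI/UX Design", "bbox": [40, 345, 220, 375]},
    {"text": "20", "bbox": [300, 347, 330, 373]},
]
for row in group_into_rows(ocr_items):
    print([it["text"] for it in row])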
Why GPT-4o Changes This
Vision-language models like GPT-4o collapse Steps 2-6 into a single prompt:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": """Extract line items from this invoice as JSON:
{"line_items": [{"description": str, "qty": int, "price": float, "total": float}]}"""},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }],
    response_format={"type": "json_object"}
)
# Returns structured JSON directly - no table detection needed

GPT-4o understands that "40" in the Qty column relates to "Web Development" in the Description column, even though they're spatially separated. This is document understanding, not just text extraction.
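A short follow-up sketch for consuming that response, assuming the model returned JSON matching the requested schema (worth validating, since conformance isn't guaranteed):

import json

data = json.loads(response.choices[0].message.content)
for item in data.get("line_items", []):
    print(f'{item["description"]}: {item["qty"]} x ${item["price"]:.2f} = ${item["total"]:.2f}')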
When to Use What
- PaddleOCR + Custom Logic: High volume, consistent layouts, cost-sensitive. Build your own table parser for your specific document format.
- GPT-4o: Variable layouts, complex tables, need semantic understanding. ~$0.015/image but handles edge cases automatically.
- Specialized Document AI: AWS Textract, Google Document AI, Azure Form Recognizer. Middle ground: structured output without LLM costs.
See our OCR benchmarks comparison for how different models perform on various document types, including tables and forms.
Browser vs Server
| Approach | Pros | Cons |
|---|---|---|
| OpenCV.js (browser) | No server needed, instant preview, privacy (images stay local) | 5MB download, slower than native, limited to what JS can do |
| Python/OpenCV (server) | Fast processing, full OpenCV features, can chain with OCR | Requires backend, upload latency, server costs |
| GPT-4o Vision | Can extract text directly without scanning, handles messy images | ~$0.01-0.02/image, requires API call |
For simple scanning (receipts, documents), the browser approach works well. For high-volume processing or integration with OCR pipelines, use server-side Python.