Codesota · OCR · Failure Modes

OCR failure modes.

Systematic analysis of when and why OCR fails in production.

No marketing fluff. Real failure modes with code examples for mitigation.

· 7 failure categories
· Code examples included
· Model recommendations
§ 01 · Handwriting

Handwriting edge cases.

When models struggle with cursive, mixed print/cursive, and individual writing styles.

What Fails

  • Cursive connections: Letters blend together, making segmentation impossible
  • Mixed print/cursive: Same word uses both styles (common in forms)
  • Baseline drift: Text curves or tilts across the line
  • Letter variations: Same person writes 'a' three different ways
  • Crossed-out text: Corrections and strikethroughs read as content

Why It Fails

Traditional OCR relies on template matching against known character shapes. Handwriting has infinite variability - there's no template to match.

Character segmentation assumes clear boundaries between letters. Cursive writing is inherently connected, breaking this assumption.

Training data is biased toward printed text. Most OCR models see 100x more typed characters than handwritten ones during training.

Solutions

Model Selection

  • GPT-5.4: Best for mixed print/cursive, uses context to resolve ambiguity
  • Gemini 2.5 Pro: Strong on diverse handwriting styles
  • Avoid: Tesseract, PaddleOCR basic (not designed for handwriting)

Preprocessing

  • Apply binarization to increase contrast
  • Use deskewing to correct baseline drift
  • Increase resolution to 400+ DPI for small text
  • Consider word-level rather than character-level recognition
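
A minimal sketch of the resolution and binarization advice above, using Pillow. The letter-width page, the 400 DPI target, and the fixed 128 threshold are all illustrative assumptions, and `prep_handwriting` is a hypothetical helper; Otsu or adaptive thresholding (shown in § 06) usually beats a fixed cutoff on uneven ink.

```python
from PIL import Image

def prep_handwriting(path, target_dpi=400, assumed_width_in=8.5):
    """Upscale a handwriting scan to ~400 DPI and binarize it.
    Assumes a letter-width page; adjust for other paper sizes."""
    img = Image.open(path).convert("L")  # grayscale
    scale = (target_dpi * assumed_width_in) / img.width
    if scale > 1:
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    # Fixed global threshold for simplicity; adaptive methods cope
    # better with uneven ink density
    return img.point(lambda p: 255 if p > 128 else 0)
```
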
§ 02 · Resolution

Low resolution documents.

DPI thresholds, when upscaling helps, and when it makes things worse.

  • 300+ DPI: Optimal for most OCR. No preprocessing needed.
  • 200-300 DPI: Marginal. Small fonts may fail. Consider upscaling.
  • <200 DPI: High failure rate. Upscaling required.

What Fails

  • Small fonts: 8pt and below become unreadable below 300 DPI
  • Thin strokes: Fonts like Arial Narrow lose critical detail
  • Similar characters: c/e, o/0, I/l/1 become indistinguishable
  • Diacritics: Accent marks merge with base letters

Why It Fails

OCR models learn from high-resolution training data. At low resolution, features disappear - the subtle curves that distinguish 'c' from 'e' are lost.

Aliasing artifacts from poor sampling create false features that confuse recognition.

Neural networks are sensitive to input distribution shift. Low-res images are out-of-distribution for most models.

Solution: Intelligent Upscaling

Use Lanczos interpolation for upscaling. Avoid bicubic for text - it introduces blur. For severely degraded images, consider AI upscalers like Real-ESRGAN.

import cv2

def upscale_for_ocr(image_path, target_dpi=300):
    """Upscale low-resolution images for better OCR."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)

    # Estimate current DPI from the shorter side,
    # assuming a letter-size page (8.5 inches wide)
    height, width = img.shape[:2]
    current_dpi = min(width, height) / 8.5

    if current_dpi < target_dpi:
        scale = target_dpi / current_dpi
        new_size = (int(width * scale), int(height * scale))

        # INTER_LANCZOS4 preserves stroke edges better than bicubic
        return cv2.resize(img, new_size,
                          interpolation=cv2.INTER_LANCZOS4)
    return img

Warning: Upscaling cannot recover information that was never captured. If the original scan was 72 DPI, upscaling to 300 DPI creates interpolated pixels, not real detail. It can help with some models but may introduce artifacts.

§ 03 · Tables

Complex table failures.

Merged cells, nested tables, spanning headers, and why linear reading breaks.

Failure Patterns

  • Merged cells: OCR reads across the merge, mixing unrelated data
  • Nested tables: Inner table structure is completely lost
  • Spanning headers: Column associations break when headers span multiple columns
  • Borderless tables: Without visual separators, alignment is guessed
  • Multi-line cells: Line breaks within cells create phantom rows

Benchmark Reality

  • PaddleOCR-VL: 88.56 TEDS
  • dots.ocr 3B: 86.8 TEDS
  • Mistral OCR 3: 70.9 TEDS
  • clearOCR: 0.8 TEDS

TEDS = Tree Edit Distance Score (higher is better). Measures structural accuracy of table extraction.

Solutions

Model Selection

Use VLM-based models with explicit table understanding:

  • PaddleOCR-VL: Best table TEDS, outputs HTML/Markdown
  • dots.ocr 3B: Compact model with strong table handling
  • Docling: Dedicated table extraction pipeline

Output Format Strategy

Request structured output to preserve table semantics:

  • HTML tables: Preserves colspan/rowspan
  • Markdown: Simpler but loses merged cells
  • JSON: Best for downstream processing
  • Avoid: Plain text extraction for tables
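
To use HTML table output downstream, the markup can be expanded into a rectangular grid. A stdlib-only sketch that duplicates cell text across colspan (rowspan handling omitted for brevity; `TableFlattener` is a hypothetical helper, not a library API):

```python
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    """Flatten an OCR-produced HTML table into rows of cell text,
    repeating a cell across its colspan so columns stay aligned."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._span = [], None, None, 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []
            self._span = int(dict(attrs).get("colspan", 1))

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.extend(["".join(self._cell).strip()] * self._span)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

parser = TableFlattener()
parser.feed("<table><tr><th colspan='2'>Q1</th><th>Q2</th></tr>"
            "<tr><td>10</td><td>20</td><td>30</td></tr></table>")
# parser.rows -> [['Q1', 'Q1', 'Q2'], ['10', '20', '30']]
```
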
§ 04 · Languages

Mixed language confusion.

Code-switching, embedded formulas, and when the language model gets confused.

What Fails

  • Code-switching: German text with English product names
  • Embedded formulas: LaTeX or math notation within text
  • Script mixing: Latin + Cyrillic + Greek in same document
  • Transliteration: Names written in multiple scripts
  • CJK + Latin: Japanese/Chinese with English terminology

Why It Fails

Most OCR models use language-specific decoders. When you select "German", it expects German vocabulary and grammar patterns.

Embedded English words get force-fit into German vocabulary, creating nonsense like "Softwerr" for "Software".

Math formulas use symbols that look like letters but have different meanings, causing semantic confusion.

Solutions

Model Selection

  • Gemini 2.5 Pro: Best multilingual handling, no language selection needed
  • Chandra OCR: 40+ languages, handles mixed content
  • GPT-5.4: Uses context to resolve language ambiguity

Strategies

  • Use auto-detect mode instead of forcing language
  • For formulas, use specialized extractors (LaTeX-OCR, Mathpix)
  • Post-process with spell-check that handles multiple dictionaries
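
The multi-dictionary idea can be sketched with stdlib `difflib` standing in for a real spell-checker: accept a token if any language's wordlist contains it, otherwise substitute the closest match across all lists. The wordlists, the 0.7 similarity cutoff, and `spellcheck_mixed` itself are illustrative stand-ins for a proper Hunspell setup.

```python
import difflib

def spellcheck_mixed(tokens, wordlists):
    """Accept a token if ANY language's wordlist contains it;
    otherwise substitute the closest match across all lists."""
    vocab = set().union(*wordlists.values())
    out = []
    for tok in tokens:
        if tok.lower() in vocab:
            out.append(tok)
        else:
            close = difflib.get_close_matches(tok.lower(), vocab,
                                              n=1, cutoff=0.7)
            out.append(close[0] if close else tok)
    return out

wordlists = {"de": {"die", "software", "ist", "neu"},
             "en": {"software", "update"}}
print(spellcheck_mixed(["Die", "Softwerr", "ist", "neu"], wordlists))
# ['Die', 'software', 'ist', 'neu']
```
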
§ 05 · Layout

Layout complexity.

Multi-column, footnotes, marginalia, sidebars, and reading order disasters.

What Fails

  • Column bleed: Two columns read as alternating lines
  • Footnotes: Mixed with main text, destroying flow
  • Marginalia: Margin notes read as part of body text
  • Floating boxes: Callouts and sidebars interrupt reading order
  • Wrapped text: Text flowing around images gets fragmented

Reading Order Benchmark

  • dots.ocr 3B: 95.0%
  • Mistral OCR 3: 91.6%
  • clearOCR: 86.0%

Reading order accuracy from OmniDocBench. Measures correct sequencing of document elements.

Solutions

Model Selection

  • GPT-5.4: Best at understanding reading order from layout context
  • PaddleOCR-VL: Explicit layout detection module
  • Docling: Academic paper specialist

Preprocessing

  • Use layout detection before OCR (YOLO, LayoutLM)
  • Extract regions separately then reassemble
  • For PDFs, try PDF parsing before OCR (PyMuPDF)
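
Reassembling extracted regions into reading order can be as simple as a column-aware sort, assuming the column count is known. This is a sketch: real pages need column detection rather than a fixed `n_cols`, and `reading_order` is a hypothetical helper.

```python
def reading_order(boxes, page_width, n_cols=2):
    """Sort (x, y, w, h) text boxes into reading order on a
    multi-column page: bucket each box into a column by its
    horizontal centre, then read columns left to right,
    each column top to bottom."""
    col_w = page_width / n_cols
    def key(box):
        x, y, w, h = box
        return (int((x + w / 2) // col_w), y)
    return sorted(boxes, key=key)

boxes = [(320, 40, 250, 20),   # right column, top
         (20, 300, 250, 20),   # left column, bottom
         (20, 40, 250, 20)]    # left column, top
print(reading_order(boxes, page_width=600))
# [(20, 40, 250, 20), (20, 300, 250, 20), (320, 40, 250, 20)]
```
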
§ 06 · Image Quality

Image quality issues.

Shadows, folds, reflections, skew, blur, and physical document damage.

Shadows & Lighting

  • Page curl shadows near spine
  • Finger shadows from holding
  • Uneven lighting across page

Physical Damage

  • Creases and fold marks
  • Water damage / staining
  • Torn edges, punch holes

Capture Issues

  • Skewed/rotated capture
  • Motion blur
  • Reflections from glossy paper

Deskewing

Correct rotation using Hough transform or minimum area rectangle detection. Skew angles above 5 degrees significantly impact OCR accuracy.

import cv2
import numpy as np

def deskew_image(image_path):
    """Correct document skew using minimum area rectangle detection."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Threshold so text pixels become foreground
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Fit a minimum area rectangle around all text pixels
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]

    # minAreaRect reports angles in (-90, 0] (pre-4.5 OpenCV);
    # convert to the rotation that levels the text
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate about the image centre
    (h, w) = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(img, M, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    return rotated

Shadow Removal

Remove uneven lighting and shadows using background subtraction. Particularly important for book scans and camera captures.

import cv2
import numpy as np

def remove_shadows(image_path):
    """Remove shadows from document images."""
    img = cv2.imread(image_path)
    rgb_planes = cv2.split(img)

    result_planes = []
    for plane in rgb_planes:
        # Dilate to get background
        dilated = cv2.dilate(plane, np.ones((7, 7), np.uint8))
        bg = cv2.medianBlur(dilated, 21)

        # Subtract background and normalize
        diff = 255 - cv2.absdiff(plane, bg)
        norm = cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX)
        result_planes.append(norm)

    result = cv2.merge(result_planes)
    return result

Full Preprocessing Pipeline

Combine denoising, contrast enhancement, and binarization for optimal OCR input.

import cv2
import numpy as np

def preprocess_for_ocr(image_path):
    """Full preprocessing pipeline for OCR."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # 1. Denoise
    denoised = cv2.fastNlMeansDenoising(img, h=10)

    # 2. Increase contrast (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    enhanced = clahe.apply(denoised)

    # 3. Binarization (adaptive threshold)
    binary = cv2.adaptiveThreshold(
        enhanced, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # 4. Remove small noise (morphological opening)
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    return cleaned
§ 07 · Mitigation

Mitigation strategies.

Systematic approaches to improving OCR accuracy across all failure modes.

Preprocessing Pipeline

  1. Deskew: Correct rotation before anything else
  2. Shadow removal: Normalize lighting across the image
  3. Upscale: If below 300 DPI, upscale to the target resolution
  4. Denoise: Remove scanner noise and artifacts
  5. Binarize: Optional; helps some engines

Post-Processing

  • Spell check: Language-aware correction (Hunspell)
  • Format validation: Regex for dates, numbers, IDs
  • Confidence filtering: Flag low-confidence regions for review
  • Multi-engine voting: Run 2-3 engines, take consensus
  • LLM correction: Use GPT/Claude to fix obvious errors
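
Format validation and confidence filtering combine naturally into a triage step. A sketch, assuming per-field confidence scores from the OCR engine; the field kinds, regex patterns, and 0.85 threshold are all illustrative.

```python
import re

PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\d+\.\d{2}"),
}

def triage(fields, conf_threshold=0.85):
    """Route OCR fields: 'ok' when confident AND well-formed,
    otherwise queue for human review. Field = (kind, text, conf)."""
    ok, review = [], []
    for kind, text, conf in fields:
        pattern = PATTERNS.get(kind)
        valid = bool(pattern.fullmatch(text)) if pattern else True
        (ok if valid and conf >= conf_threshold else review).append((kind, text))
    return ok, review

fields = [("date", "2024-03-15", 0.97),
          ("amount", "12.5O", 0.91),    # OCR read a zero as the letter O
          ("date", "2024-07-01", 0.60)] # well-formed but low confidence
ok, review = triage(fields)
# ok -> [('date', '2024-03-15')]; review -> the other two fields
```
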

Quality Assurance

  • Sample-based auditing: Manually verify 1-5% of output
  • Known-document testing: Include test documents with ground truth
  • Confidence thresholds: Route low-confidence to human review
  • Error tracking: Log and categorize failures for improvement
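
Known-document testing can be automated by diffing output against ground truth. A sketch using `difflib` similarity as a cheap accuracy proxy (the 0.98 floor is an illustrative threshold; production pipelines typically use character error rate from edit distance instead):

```python
import difflib

def score_known_docs(outputs, ground_truth, floor=0.98):
    """Compare OCR output for planted test documents against their
    ground truth; return documents whose similarity falls below floor."""
    failures = []
    for doc_id, text in outputs.items():
        truth = ground_truth.get(doc_id)
        if truth is None:
            continue
        acc = difflib.SequenceMatcher(None, text, truth).ratio()
        if acc < floor:
            failures.append((doc_id, round(acc, 3)))
    return failures

outputs = {"invoice-01": "hello world", "invoice-02": "he1lo world"}
truth = {"invoice-01": "hello world", "invoice-02": "hello world"}
print(score_known_docs(outputs, truth))
# [('invoice-02', 0.909)]
```
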

When to Accept Failure

  • Severely damaged documents: 40%+ text obscured
  • Extremely low resolution: Sub-100 DPI, no upscaling helps
  • Artistic/decorative fonts: Designed to be hard to read
  • Cost exceeds value: Manual entry cheaper than fixing OCR
§ 08 · Model Selection

Model selection by failure mode.

Choose your model based on your primary failure mode, not just overall accuracy.

Failure Mode       | Best Choice            | Alternative                | Avoid
Handwriting        | GPT-5.4                | Gemini 2.5 Pro             | Tesseract, PaddleOCR basic
Low Resolution     | Chandra OCR            | GPT-5.4 with preprocessing | Any without upscaling
Complex Tables     | PaddleOCR-VL, dots.ocr | Mistral OCR 3              | clearOCR, Tesseract
Mixed Languages    | Gemini 2.5 Pro         | Chandra OCR                | Single-language engines
Complex Layout     | GPT-5.4, dots.ocr      | PaddleOCR-VL               | Traditional OCR
Poor Image Quality | Chandra OCR            | GPT-5.4                    | Engines without preprocessing

Need Help with Specific Failures?

We offer private evaluations on your actual documents. Find out which models fail on your specific document types.


About This Analysis

This failure mode analysis is based on our internal testing across thousands of documents, combined with publicly available benchmark data from OmniDocBench and OCRBench v2. Model recommendations are based on demonstrated performance, not vendor claims.
