
Document Extraction and Parsing

Extract structured text from PDFs, scans, and complex layouts. The DOCUMENT modality is critical for enterprise RAG.

Five Decades of Teaching Machines to Read

Document parsing is one of the oldest problems in computing. The challenge sounds simple: take a document — printed, scanned, photographed — and extract its text and structure in machine-readable form. In practice, this is a cascade of hard problems: character recognition, layout understanding, table detection, reading order inference, and semantic classification.

Understanding how we got from 1980s template-matching OCR to today's vision-language models is not just background reading. It explains why different tools fail on different documents, what "layout understanding" actually means, and where the field still falls short.

Era I: Template Matching & Rule-Based OCR
1974

Kurzweil's Omni-Font OCR

Ray Kurzweil founded Kurzweil Computer Products and built the first commercial system that could recognize text printed in virtually any font. Previous OCR systems required specific fonts or pre-defined templates. Kurzweil's system used pattern recognition to generalize across typefaces — a reading machine originally designed to help blind people read printed material. Xerox acquired the technology in 1980.

The core technique was template matching: compare each character image against a library of known character shapes, pick the closest match. It worked well for clean printed text but collapsed on noise, skew, handwriting, or unusual layouts.
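The idea fits in a few lines of Python: store a bitmap per known character, count mismatched pixels against each template, and pick the closest. The 3×3 glyphs below are toy illustrations, not real font data, but the failure mode is already visible: any noise, skew, or font variation shifts pixels and erodes the match.

```python
# Hypothetical 3x3 glyph bitmaps (illustration only); 1 = ink
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
}

def hamming(a, b):
    """Count mismatched pixels between two bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def match_character(glyph):
    """Template matching: return the label of the closest template."""
    return min(TEMPLATES, key=lambda c: hamming(TEMPLATES[c], glyph))

noisy_i = ((0, 1, 0), (0, 1, 0), (0, 1, 1))  # one flipped pixel
print(match_character(noisy_i))  # still closest to "I"
```

One flipped pixel is survivable; a skewed scan flips many, which is why this approach collapsed outside clean print.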

1985–2005

Tesseract: From HP Labs to Google

Hewlett-Packard Labs developed Tesseract between 1985 and 1994. It used a pipeline of binarization, connected component analysis, line and word segmentation, and finally character recognition via polygonal approximation of character outlines. HP shelved the project. In 2005, HP released Tesseract as open source, and Google adopted it for their book-scanning project.

# Tesseract OCR pipeline (simplified)
image → binarize → find connected components → segment lines
→ segment words → classify characters → assemble text

# The pipeline is rigid: each stage depends on the previous
# If binarization fails (low contrast scan), everything downstream fails

Tesseract dominated open-source OCR for two decades. Even today, pytesseract remains the most-installed OCR library on PyPI. But its pipeline architecture — character-by-character, no layout understanding — makes it fundamentally limited for modern document AI.

1993–2000s

PDF Text Extraction: The "Easy" Case

Adobe's PDF format (1993) stores text as positioned character strings with font metadata. Libraries like PDFMiner (2004), PyMuPDF/fitz, and pdfplumber extract this text without OCR. But "digital-native" PDFs present their own challenge: characters are placed at absolute coordinates with no semantic structure. A two-column layout becomes jumbled text. Tables become meaningless character soup. Headers and footers mix into body text.

The PDF paradox

A PDF is a rendering format, not a semantic format. It tells a printer where to place each glyph, not what the document means. Extracting text is easy. Extracting structure — paragraphs, headings, tables, reading order — requires understanding layout. This is the core problem that separates naive extraction from document AI.
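A crude version of the fix can be sketched from the word boxes a library like PyMuPDF emits (tuples whose first five fields are x0, y0, x1, y1, text). This toy version assumes exactly two columns split at the page midline, which real layout analysis has to infer rather than assume:

```python
# Words as (x0, y0, x1, y1, text) tuples, the shape of the first
# five fields of PyMuPDF's page.get_text("words") output.
def column_aware_text(words, page_width):
    """Naive two-column fix: split at the page midline, read the left
    column top-to-bottom, then the right column. Assumes exactly two
    columns -- real layout analysis must detect column boundaries."""
    mid = page_width / 2
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]
    ordered = sorted(left, key=lambda w: (w[1], w[0])) + \
              sorted(right, key=lambda w: (w[1], w[0]))
    return " ".join(w[4] for w in ordered)

words = [
    (20, 10, 60, 20, "Left1"), (320, 10, 360, 20, "Right1"),
    (20, 30, 60, 40, "Left2"), (320, 30, 360, 40, "Right2"),
]
print(column_aware_text(words, page_width=600))
# "Left1 Left2 Right1 Right2" -- a plain y-sort would interleave columns
```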

Era II: Deep Learning Transforms OCR
2015–2017

CNN + RNN: Neural OCR

The deep learning revolution hit OCR when Baoguang Shi, Xiang Bai, and Cong Yao proposed CRNN (2015) — a convolutional neural network for feature extraction followed by a recurrent network for sequence modeling, trained end-to-end with CTC loss (Connectionist Temporal Classification). Instead of segmenting individual characters, the model learned to read entire text lines at once.

Google integrated LSTM-based recognition into Tesseract 4.0 (2018), dramatically improving accuracy on degraded text, unusual fonts, and partially occluded characters. The era of hand-crafted feature pipelines was ending.

Shi, B. et al. (2015). An End-to-End Trainable Neural Network for Image-based Sequence Recognition. TPAMI.

2019

Scene Text Detection: CRAFT & DBNet

Text detection — finding where text appears in an image — advanced rapidly with models like CRAFT (Baek et al., 2019) and DBNet (Liao et al., 2020). These models could detect text at arbitrary angles, on curved surfaces, and in cluttered scenes. For document parsing, accurate text detection meant that OCR could finally work reliably on photographs of documents, not just clean scans.

2020

LayoutLM: Text + Layout + Image

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou at Microsoft Research published LayoutLM — the paper that redefined document understanding. The key insight: pre-train a transformer that jointly models text content, 2D position (bounding box coordinates), and visual features from the document image.

# LayoutLM input representation (simplified)
# Each token gets THREE types of embeddings, summed:
token_emb = text_embedding[token_id]          # What the text says
layout_emb = position_embedding[x1, y1, x2, y2]  # Where on the page
image_emb = CNN_features[region]              # What it looks like

input = token_emb + layout_emb + image_emb   # Fused representation
# Pre-train with masked language model + masked visual-region model

Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD. 1,500+ citations.

LayoutLM and its successors (LayoutLMv2, LayoutLMv3) set new state-of-the-art on every document AI benchmark: form understanding (FUNSD), receipt extraction (CORD), document classification (RVL-CDIP). The key lesson: position matters as much as content. "$42.99" means nothing without knowing it's in the "Total" column of an invoice.

Era III: End-to-End Document Models
2022

Donut: OCR-Free Document Understanding

Geewook Kim, Teakgyu Hong, Moonbin Yim et al. at NAVER AI Lab asked: what if we skip OCR entirely? Donut (Document Understanding Transformer) takes a document image as input and directly generates structured output (JSON, text) using a Swin Transformer encoder and a BART decoder. No OCR pipeline, no bounding boxes, no text detection stage.

This was a paradigm shift. Instead of a multi-stage pipeline where errors cascade (bad OCR breaks downstream extraction), Donut trained a single model end-to-end to "read" documents. It matched or exceeded LayoutLM-based approaches on receipt and form extraction benchmarks.

Kim, G. et al. (2022). OCR-Free Document Understanding Transformer. ECCV.

2023

Nougat: Academic Papers to Markdown

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic at Meta AI built Nougat (Neural Optical Understanding for Academic Documents) specifically for scientific papers. Using a Donut-like architecture fine-tuned on paired PDF-to-LaTeX/Markdown data, Nougat could convert rendered PDF pages directly into structured Markdown with LaTeX equations preserved.

This was transformative for academic knowledge bases. Previous approaches mangled equations, lost citation structure, and couldn't handle the complex two-column layouts typical of conference papers. Nougat showed that end-to-end models could handle even the hardest document types.

Blecher, L. et al. (2023). Nougat: Neural Optical Understanding for Academic Documents. arXiv.

2024–2026

Vision-Language Models: General-Purpose Document Reading

The latest paradigm uses large vision-language models (VLMs) that weren't trained specifically for documents but can "read" them as a side effect of their general visual understanding. GPT-4o, Claude, Gemini, and specialized models like Mistral OCR and Qwen2-VL accept document images and produce structured text, tables, and even JSON extraction with a simple prompt.

The accuracy is remarkable — Mistral OCR leads OmniDocBench at 79.75% composite score — but the approach trades compute cost for simplicity. No pipeline, no fine-tuning, no training data preparation. Just send the image and ask.

The throughline: 1974 → 2026

Five decades. The same goal, with each generation eliminating one more bottleneck:

1974–1993 · Template OCR: Match character shapes against known patterns (Kurzweil, Tesseract)
1993–2015 · PDF extraction: Pull positioned characters from digital PDFs (PDFMiner, PyMuPDF)
2015–2019 · Neural OCR: End-to-end character recognition with CNNs + RNNs (CRNN, Tesseract 4)
2020–2022 · Layout-aware: Joint text + position + image understanding (LayoutLM, Donut)
2023–now · VLM reading: General vision-language models that read documents as a capability (GPT-4o, Mistral OCR)

Each generation solved one limitation of the last. Each preserved the core challenge: extracting structure, not just characters.

The Document Parsing Pipeline

Whether you use a traditional tool or a modern AI model, document parsing is a sequence of distinct sub-problems. Understanding each stage helps you diagnose where your pipeline is failing and which tool to reach for.

PDF / Image (input document) → Layout Detection (find regions) → OCR (recognize text) → Structure (tables, headings) → Structured Output (JSON / Markdown / HTML)

The five-stage document parsing pipeline. Errors in early stages cascade downstream.

Stage 1: Image Acquisition / Rendering

Before any analysis, the document must become pixels. For scanned documents, this is the scanner output. For digital PDFs, each page is rendered to an image at a target DPI (typically 200–300 DPI for OCR). This stage also includes preprocessing: deskewing rotated scans, denoising, and contrast enhancement.

import fitz  # PyMuPDF

# Render PDF page to high-resolution image for OCR
doc = fitz.open("contract.pdf")
page = doc[0]
# 300 DPI: multiply default 72 DPI by ~4.17
pix = page.get_pixmap(matrix=fitz.Matrix(4.17, 4.17))
pix.save("page_0_300dpi.png")

Stage 2: Text Detection & OCR

Find where text appears (bounding boxes) and recognize what it says. For digital PDFs, this is trivial — extract the embedded text. For scanned documents, this requires OCR. Modern tools use neural text detection (CRAFT, DBNet) followed by recognition (CRNN, TrOCR).

import pytesseract
from PIL import Image

# Traditional OCR with Tesseract
img = Image.open("scanned_invoice.png")
# Get text with bounding box coordinates
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for i, text in enumerate(data['text']):
    if text.strip():
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]
        conf = data['conf'][i]
        print(f"[{x},{y},{x+w},{y+h}] conf={conf}%: {text}")

Stage 3: Layout Analysis

Group text regions into semantic blocks: titles, paragraphs, tables, figures, captions, headers, footers. Determine reading order (critical for multi-column layouts). This is the stage where most naive tools fail — they extract text left-to-right, top-to-bottom, jumbling two-column academic papers into nonsense.


Layout analysis detects semantic regions with bounding boxes: headers, paragraphs, tables, figures, and footers.

# Layout detection with a YOLO-based model (conceptual)
# Models like DocLayout-YOLO detect document regions:
#   - Title, Section Header, Paragraph, Table, Figure
#   - List, Caption, Page Header, Page Footer, Footnote

# Each region has: class_label, bounding_box, confidence
regions = layout_model.detect("page_image.png")
# Sort by reading order (not just top-to-bottom!)
ordered = reading_order_sort(regions)  # Handles multi-column
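A minimal version of such a reading-order sort can be sketched as below, assuming each detected region is a dict with a `bbox` of page coordinates. Real systems use learned models or recursive XY-cut; the column-gap threshold here is arbitrary:

```python
def reading_order_sort(regions, gap=40):
    """Minimal reading-order sort: cluster regions into columns by
    left edge, then read each column top-to-bottom, columns
    left-to-right. `gap` is an illustrative column threshold."""
    cols = []
    for r in sorted(regions, key=lambda r: r["bbox"][0]):
        x0 = r["bbox"][0]
        for col in cols:
            # Join an existing column if the left edges are close
            if abs(col[0]["bbox"][0] - x0) < gap:
                col.append(r)
                break
        else:
            cols.append([r])
    out = []
    for col in cols:
        out.extend(sorted(col, key=lambda r: r["bbox"][1]))
    return out

regions = [
    {"label": "para_right_top",   "bbox": (310, 50, 590, 200)},
    {"label": "para_left_bottom", "bbox": (20, 300, 290, 500)},
    {"label": "para_left_top",    "bbox": (20, 50, 290, 280)},
]
print([r["label"] for r in reading_order_sort(regions)])
# left column first (top then bottom), then the right column
```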

Stage 4: Table Extraction

Tables are the hardest structure to parse. They require detecting rows, columns, spanning cells, and headers — then mapping OCR text into the correct cell. Borderless tables (common in financial documents) are especially challenging because there are no visual delimiters.


Table extraction: detect the table region, recognize row/column structure, and output as structured data.

# Table extraction approaches:
# 1. Rule-based: detect horizontal/vertical lines, infer grid
# 2. ML-based: models like TableTransformer (Microsoft)
#    detect table structure from visual features
# 3. VLM-based: prompt a vision model to output markdown table

# Example: pdfplumber (rule-based, works on bordered tables)
import pdfplumber

pdf = pdfplumber.open("financial_report.pdf")
page = pdf.pages[0]
tables = page.extract_tables()

for table in tables:
    for row in table:
        print(" | ".join(cell or "" for cell in row))

Stage 5: Structured Output

Assemble everything into a usable format: Markdown (most common for RAG), JSON (for structured extraction), HTML (for layout preservation), or custom schemas. This stage decides what metadata to preserve — headings hierarchy, table structure, image references, page numbers.

# Final output: structured Markdown
# # Contract Agreement           <- heading from font size
#
# **Parties:** Acme Corp and Beta LLC
# **Effective Date:** March 1, 2026
#
# ## Section 1: Terms
#
# The agreement shall remain in effect for...
#
# | Deliverable | Due Date | Amount |    <- table preserved
# |-------------|----------|--------|
# | Phase 1     | Q1 2026  | $50K   |
# | Phase 2     | Q2 2026  | $75K   |
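The "heading from font size" trick noted in the comments above can be sketched by mapping per-span font sizes (the kind PyMuPDF reports for each text span) to Markdown heading levels. The span dicts, size thresholds, and `bold` flag here are illustrative, not any tool's actual schema:

```python
def spans_to_markdown(spans, body_size=11.0):
    """Heuristic Stage-5 assembly: font size relative to body text
    decides heading level. Thresholds are illustrative."""
    lines = []
    for span in spans:
        size, text = span["size"], span["text"]
        if size >= body_size * 1.8:
            lines.append("# " + text)       # much larger -> title
        elif size >= body_size * 1.3:
            lines.append("## " + text)      # larger -> section heading
        elif span.get("bold"):
            lines.append("**" + text + "**")
        else:
            lines.append(text)
    return "\n".join(lines)

spans = [
    {"size": 22.0, "text": "Contract Agreement"},
    {"size": 15.0, "text": "Section 1: Terms"},
    {"size": 11.0, "text": "The agreement shall remain in effect..."},
]
print(spans_to_markdown(spans))
```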

Warning: Pipeline vs End-to-End

Traditional tools (Tesseract, PyMuPDF, pdfplumber) implement each stage separately — errors cascade. Modern tools (Docling, Marker, Unstructured) integrate multiple stages with ML models. VLMs (Mistral OCR, GPT-4o) collapse the entire pipeline into a single forward pass. The trade-off is always: control vs accuracy vs cost.

Rule-Based vs ML-Based Parsing

Before choosing a tool, understand the fundamental architectural divide. Each approach has scenarios where it dominates.

Rule-Based

PyMuPDF, pdfplumber, PDFMiner, Camelot

  • + Deterministic: same input always produces same output
  • + Zero dependencies, fast, no GPU needed
  • + Perfect for clean, digital-native PDFs with simple layouts
  • - Fails on scanned documents (no OCR)
  • - Cannot understand layout semantics (what is a heading vs body text)
  • - Multi-column text is extracted in wrong reading order

ML-Based

Docling, Marker, Unstructured, Nougat, VLMs

  • + Handles scanned documents, photos, complex layouts
  • + Understands semantic structure (headings, tables, captions)
  • + Correct reading order on multi-column documents
  • - Non-deterministic: results may vary between runs
  • - Requires more compute (GPU for best speed, VLMs cost per page)
  • - Can hallucinate text that doesn't exist in the document

When to use which

Rule-based first. If your PDFs are digital-native with simple layouts (single column, standard tables with borders), PyMuPDF or pdfplumber will be faster, cheaper, and more reliable. Escalate to ML-based tools only when you hit scanned documents, complex layouts, or need semantic structure. Most production pipelines use a hybrid: rule-based for the easy cases, ML-based for the hard ones.
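One way to sketch that escalation gate is a cheap text-quality heuristic: accept the rule-based output only if it looks like readable prose, otherwise fall back to the expensive parser. The thresholds below are purely illustrative and should be tuned on your own corpus:

```python
def looks_clean(text: str) -> bool:
    """Heuristic gate for a hybrid pipeline. Thresholds are
    illustrative, not tuned values from any benchmark."""
    if len(text) < 100:
        return False  # suspiciously little text extracted
    letters = sum(c.isalpha() for c in text)
    if letters / len(text) < 0.6:
        return False  # too much non-letter noise (table soup, glyph junk)
    words = text.split()
    avg_len = sum(map(len, words)) / max(len(words), 1)
    return 2.5 <= avg_len <= 12  # real prose, not single chars or blobs

def parse(pdf_path, fast_parser, fallback_parser):
    """Try the cheap rule-based parser first; escalate on failure."""
    text = fast_parser(pdf_path)
    return text if looks_clean(text) else fallback_parser(pdf_path)
```

Wire `fast_parser` to PyMuPDF and `fallback_parser` to Docling or a VLM, and most documents never pay the ML cost.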

Enterprise Use Cases

Invoice Processing

Extract vendor names, line items, amounts, and dates from invoices. Automate accounts payable workflows. Semi-structured documents with consistent fields but varying layouts across vendors.

Key challenge: borderless tables, varying invoice formats, handwritten annotations

Contract Analysis

Extract clauses, parties, dates, and obligations from legal documents. Long documents (50–200 pages) with dense text, footnotes, cross-references, and appendices.

Key challenge: preserving section hierarchy, cross-references, defined terms

Research Paper Ingestion

Parse academic papers for RAG: preserve sections, citations, tables, equations, and figures. Build searchable research knowledge bases across thousands of papers.

Key challenge: two-column layout, LaTeX equations, citation parsing, figure captions

Medical Records

Extract diagnoses, medications, lab results, and treatment plans from clinical notes. Mix of typed and handwritten content, checkboxes, and structured forms.

Key challenge: handwriting recognition, privacy compliance, domain-specific terminology

Document Parsing Tools in Depth

Five tools cover the majority of production use cases. Each has a different philosophy and sweet spot.

PyMuPDF (fitz)

Rule-based

The fastest option for digital-native PDFs. Directly accesses the PDF's internal text stream without OCR. Supports text extraction, table detection (since v1.23), image extraction, and metadata reading. Written in C with Python bindings — processes thousands of pages per second.

import fitz  # pip install PyMuPDF

doc = fitz.open("report.pdf")
for page_num, page in enumerate(doc):
    # Basic text extraction (fast, no layout awareness)
    text = page.get_text()

    # Structured extraction with layout awareness
    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        if block["type"] == 0:  # Text block
            for line in block["lines"]:
                for span in line["spans"]:
                    print(f"Font: {span['font']}, Size: {span['size']}")
                    print(f"Text: {span['text']}")

    # Table extraction (PyMuPDF 1.23+)
    # find_tables() returns a TableFinder; the tables live in .tables
    tables = page.find_tables()
    for table in tables.tables:
        df = table.to_pandas()
        print(df)

Best for: digital-native PDFs, high-volume batch processing, simple layouts. Not for scanned documents.

Docling (IBM)

ML-based

IBM's open-source document converter. Uses a pipeline of ML models for layout detection (DocLayNet), table structure recognition (TableFormer), and OCR (EasyOCR or Tesseract). Outputs clean Markdown, JSON, or a structured document model. Handles PDFs, DOCX, PPTX, images, and HTML.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("contract.pdf")

# Export to Markdown (most common for RAG)
markdown = result.document.export_to_markdown()
print(markdown)

# Export to structured JSON (for programmatic access)
doc_dict = result.document.export_to_dict()
for element in doc_dict["body"]:
    print(f"Type: {element['label']}")

# Iterate over tables specifically
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

Best for: production pipelines needing layout-aware parsing. Good balance of accuracy and speed. Runs on CPU (GPU optional).

Unstructured.io

ML-based + API

The most format-versatile option. Handles PDFs, DOCX, PPTX, XLSX, HTML, images, emails, and more. Partitions documents into typed elements (Title, NarrativeText, Table, Image, ListItem) with metadata including coordinates, page number, and parent hierarchy. Available as open-source library or hosted API.

from unstructured.partition.auto import partition

# Auto-detect format and partition into elements
elements = partition("report.pdf")

for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text[:100]}...")
    print(f"Page: {element.metadata.page_number}")
    print()

# Filter by element type
from unstructured.documents.elements import (
    Table, Title, NarrativeText
)
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
body = [e for e in elements if isinstance(e, NarrativeText)]

Best for: enterprise pipelines with mixed document types. Strong element classification. Hosted API for scale.

Marker

ML-based

Purpose-built for converting books, papers, and long documents to Markdown. Uses a pipeline of ML models: layout detection, OCR, table recognition, and equation conversion. Particularly strong on academic content — preserves LaTeX equations, handles two-column layouts, and extracts inline images with references.

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

models = create_model_dict()
converter = PdfConverter(artifact_dict=models)

# Convert PDF to markdown with extracted images
rendered = converter("research_paper.pdf")
markdown_text = rendered.markdown

# Images are saved separately with references in markdown
# Tables are converted to markdown tables
# Equations are preserved as LaTeX
print(markdown_text)

Best for: academic papers, textbooks, documents with equations and figures. Strong on long documents.

VLM-Based Extraction (Mistral OCR, GPT-4o)

API / VLM

The newest approach: send a document page image to a vision-language model and prompt it to extract text, tables, or structured data. No pipeline, no fine-tuning. Works on any document type including handwriting, diagrams, and damaged scans. Mistral OCR is purpose-built for this; GPT-4o, Claude, and Gemini handle it as a general vision task.


Traditional OCR chains 4+ models where errors cascade. VLMs collapse the entire pipeline into a single forward pass.

# Mistral OCR (purpose-built for document extraction)
from mistralai import Mistral

client = Mistral(api_key="...")
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/doc.pdf"
    }
)
for page in response.pages:
    print(page.markdown)  # Structured markdown output

# ---

# GPT-4o (general VLM approach)
import base64
from openai import OpenAI

client = OpenAI()
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Extract all text as markdown. "
                        "Preserve table structure."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "data:image/png;base64," + image_b64
                }
            }
        ]
    }]
)
print(response.choices[0].message.content)

Best for: maximum accuracy on complex/damaged documents, handwriting, or when you need structured extraction without training. Cost: $0.01–0.05 per page.

OmniDocBench Results

OmniDocBench is the comprehensive benchmark for document parsing, testing text extraction, table recognition, formula conversion, and layout understanding across diverse document types.

The benchmark uses edit-distance-based metrics normalized across document types. Higher composite scores indicate better overall document understanding, not just raw OCR accuracy.
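For intuition, a normalized edit-distance similarity can be computed as below. This is a simplified illustration of the idea, not OmniDocBench's actual scoring code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via rolling-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(prediction: str, reference: str) -> float:
    """1 - normalized edit distance; 1.0 is a perfect match."""
    if not reference and not prediction:
        return 1.0
    return 1 - edit_distance(prediction, reference) / max(len(prediction), len(reference))

print(round(similarity("Total: $42.99", "Total: $42.98"), 3))  # 0.923
```

Because the score is normalized by length, a one-character error in a short field costs far more than in a long paragraph, which is why benchmarks report per-type breakdowns (text, tables, formulas) rather than a single number.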

Mistral OCR (VLM): 79.75%
GPT-4o (VLM): 73%
Gemini 2.0 Flash (VLM): 71.5%
Docling, IBM (ML pipeline): 68%
Marker (ML pipeline): 65%
MinerU (ML pipeline): 63%
PyMuPDF + heuristics (rule-based): 55%
Tesseract 5 (traditional OCR): 48%

OmniDocBench composite score (edit-distance normalized). Higher is better. VLM-based approaches lead, but local ML pipelines are closing the gap.

Key Insight: Accuracy vs Cost

VLMs like Mistral OCR (79.75%) and GPT-4o (~73%) achieve the best accuracy but cost $0.01–0.05 per page via API. For 100,000 documents, that's $1,000–5,000. Local tools like Docling and Marker run for free after setup and handle most documents well.

Production strategy: Use Docling/Marker for the 90% of documents with standard layouts. Route the hard 10% (scanned, damaged, handwritten) to a VLM. This typically cuts cost by 80% while maintaining quality.

Choosing the Right Tool

The decision tree is simpler than it looks. Start with your document type and volume:

Digital PDFs, Simple Layout, High Volume

Use PyMuPDF or pdfplumber. Thousands of pages per second. No ML models to load.

Example: extracting text from SEC filings, legal briefs, standard reports.

Mixed Formats, Enterprise Pipeline

Use Unstructured.io. Handles PDFs, DOCX, PPTX, XLSX, images, emails. Strong element classification.

Example: ingesting a knowledge base with documents in 10+ formats.

Research Papers, Books, Complex Layouts

Use Marker or Docling. Strong on two-column layouts, equations, and figure extraction.

Example: building a searchable corpus of 10,000 arXiv papers.

Maximum Accuracy, Complex/Damaged Documents

Use Mistral OCR or GPT-4o. Handles handwriting, damaged scans, unusual layouts without any fine-tuning.

Example: processing historical archives, handwritten forms, or documents where accuracy is non-negotiable.

Structured Field Extraction (Key-Value)

Use a VLM with JSON mode or Docling + LLM. When you need specific fields (invoice number, total, date) not just full text.

Example: extracting structured data from invoices, receipts, or tax forms for automated processing.

Document-to-RAG Pipeline

Document parsing is typically the first step in a RAG pipeline. Here are two production-ready examples — one using a local ML pipeline, one using a VLM with structured extraction.

Pipeline A: Docling + Chunking + Embeddings (Local)

from docling.document_converter import DocumentConverter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb

# Step 1: Parse document to markdown (preserves structure)
converter = DocumentConverter()
result = converter.convert("contract.pdf")
markdown = result.document.export_to_markdown()

# Step 2: Chunk with markdown-aware splitting
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]
)
chunks = splitter.split_text(markdown)

# Step 3: Generate embeddings
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(chunks)

# Step 4: Store in vector database
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("contracts")
collection.add(
    documents=chunks,
    embeddings=embeddings.tolist(),
    ids=[f"chunk_{i}" for i in range(len(chunks))],
)

print(f"Indexed {len(chunks)} chunks from contract.pdf")

Pipeline B: VLM Structured Extraction (API)

import base64, json
from openai import OpenAI

client = OpenAI()

def extract_invoice_fields(image_path: str) -> dict:
    """Extract structured fields from an invoice image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": """Extract these fields:
- vendor_name (str)
- invoice_number (str)
- date (YYYY-MM-DD)
- line_items (list of description, qty, price)
- subtotal, tax, total (float)
Return as JSON."""},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64," + img_b64
                    }
                }
            ]
        }]
    )
    return json.loads(response.choices[0].message.content)

# Process invoice
invoice = extract_invoice_fields("invoice_scan.png")
vendor = invoice["vendor_name"]
total = invoice["total"]
print(f"Vendor: {vendor}, Total: {total}")

Quality Check: Always Validate Extraction

No document parser is 100% accurate. For production pipelines, add validation: checksum totals on invoices, regex patterns for dates and IDs, confidence scores from OCR, and human-in-the-loop review for low-confidence extractions. A 95% accurate parser with validation is better than a 99% parser without it — because the 1% of errors will be silent and catastrophic.
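As a sketch, a validator for the invoice fields used in Pipeline B above might look like the following. The specific checks, field names, and tolerance are illustrative; real pipelines add many more:

```python
import re

def validate_invoice(inv: dict, tol: float = 0.01) -> list[str]:
    """Cheap consistency checks on extracted invoice fields.
    Empty list = passed; otherwise route to human review."""
    errors = []
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", inv.get("date", "")):
        errors.append("date is not YYYY-MM-DD")
    # Checksum: line items must sum to the subtotal
    line_sum = sum(it["qty"] * it["price"] for it in inv.get("line_items", []))
    if abs(line_sum - inv.get("subtotal", 0)) > tol:
        errors.append("line items do not sum to subtotal")
    # Checksum: subtotal + tax must equal the total
    if abs(inv.get("subtotal", 0) + inv.get("tax", 0) - inv.get("total", 0)) > tol:
        errors.append("subtotal + tax != total")
    return errors

inv = {"date": "2026-03-01", "subtotal": 100.0, "tax": 8.0, "total": 108.0,
       "line_items": [{"qty": 4, "price": 25.0}]}
print(validate_invoice(inv))  # [] -> passes all checks
```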

Common Failure Modes

Understanding how parsers fail helps you design robust pipelines. These are the most frequent failure modes ranked by how often they appear in production:

1. Table Mangling

The most common failure. Borderless tables become paragraph text. Multi-line cells get split into separate rows. Merged cells lose their span information. Nested tables are almost never handled correctly by any tool.

Mitigation: Use pdfplumber for bordered tables, VLMs for borderless tables. Always validate row/column counts against expected structure.

2. Reading Order Confusion

Two-column layouts extracted left-to-right across columns instead of column-by-column. Sidebars mixed into body text. Footnotes injected mid-paragraph. This makes extracted text unintelligible for downstream LLMs.

Mitigation: Use layout-aware tools (Docling, Marker) that detect column boundaries. Verify by checking if extracted sentences are grammatically coherent.

3. Header/Footer Contamination

Running headers, page numbers, and footers mixed into body text across every page. "Page 14 of 82" appearing between paragraphs. Copyright notices repeated hundreds of times in a long document.

Mitigation: Most ML-based tools detect and filter headers/footers. For rule-based tools, use position heuristics (top/bottom 10% of page) to filter.
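That position heuristic can be sketched as follows, assuming each extracted block carries a `bbox` in page coordinates (the layout here is illustrative):

```python
def strip_headers_footers(blocks, page_height, margin=0.10):
    """Drop blocks whose vertical center falls in the top or
    bottom 10% of the page -- a crude header/footer filter."""
    top, bottom = page_height * margin, page_height * (1 - margin)
    kept = []
    for block in blocks:
        y0, y1 = block["bbox"][1], block["bbox"][3]
        if top <= (y0 + y1) / 2 <= bottom:
            kept.append(block)
    return kept

blocks = [
    {"text": "ACME Corp Annual Report", "bbox": (0, 10, 600, 30)},    # header
    {"text": "Body paragraph...",       "bbox": (0, 200, 600, 400)},
    {"text": "Page 14 of 82",           "bbox": (0, 820, 600, 835)},  # footer
]
print([b["text"] for b in strip_headers_footers(blocks, page_height=842)])
# only the body paragraph survives
```

The obvious failure case is a document whose real content starts high on the page, so production filters usually also require the text to repeat across pages before dropping it.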

4. VLM Hallucination

Vision-language models sometimes "read" text that isn't in the document, especially on low-quality scans or when the document has visual artifacts. They may also "correct" spelling errors in the original document, which changes the legal meaning of contracts.

Mitigation: Cross-validate VLM output against rule-based extraction when both are available. For legal documents, use conservative (rule-based) extraction.

Key Takeaways

  1. The pipeline has five stages: image acquisition, OCR, layout analysis, table extraction, and structured output. Modern tools collapse multiple stages, but understanding each helps you diagnose failures.
  2. Structure matters more than text: Extracting characters is solved. Extracting meaning — reading order, table structure, heading hierarchy — is what separates good parsers from bad ones.
  3. VLMs lead the benchmarks but cost more: Mistral OCR (79.75%) and GPT-4o (~73%) outperform local tools, but at $0.01–0.05 per page. Use a hybrid strategy: local for easy documents, VLMs for the hard ones.
  4. Always validate extraction: No parser is 100% accurate. Build validation into your pipeline: checksums, regex patterns, confidence thresholds, and human review for critical documents.
  5. The field moves fast: LayoutLM (2020) was state-of-the-art three years ago. Donut and Nougat arrived in 2022–2023. VLMs dominate 2024–2026. The tools you choose today may be superseded within a year. Design your pipeline to be modular.

Practice Exercises

Hands-on exercises to build intuition for document parsing trade-offs:

  1. Compare rule-based vs ML-based on a table-heavy PDF. Install PyMuPDF (pip install pymupdf) and Docling (pip install docling). Parse the same financial report with both. Compare the table output quality.
  2. Test reading order on a two-column paper. Download any arXiv paper PDF. Extract text with PyMuPDF and with Marker. Check if the two-column content is in the correct reading order. Count jumbled sentences.
  3. Build a hybrid pipeline. Implement the production strategy: try PyMuPDF first, check if the output has good structure (paragraph breaks, no jumbled text), and fall back to Docling or a VLM for pages that fail quality checks.
  4. VLM structured extraction challenge. Take a photo of a receipt with your phone. Use GPT-4o (or Claude) to extract line items, tax, and total as JSON. Compare against manual transcription. Note any hallucinated or missed fields.
