Document Extraction
Extract structured information from documents like PDFs, invoices, forms, and contracts.
How Structured Output Extraction Works
Turn unstructured documents into typed, validated data structures. From Pydantic schemas to LLM extraction with Instructor.
Why Structured Output Matters
Raw text is hard to process programmatically. Structured output gives you typed, validated data that integrates directly with your codebase.
INVOICE #INV-2024-0892
Date: December 15, 2024
Bill To: Acme Corp
123 Business Ave
New York, NY 10001
Items:
- Widget Pro x5 @ $29.99 = $149.95
- Service Fee = $25.00
Subtotal: $174.95
Tax (8%): $13.99
Total: $188.94
Payment Due: January 15, 2025
{
  "invoice_number": "INV-2024-0892",
  "date": "2024-12-15",
  "customer": {
    "name": "Acme Corp",
    "address": "123 Business Ave, New York, NY 10001"
  },
  "items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 29.99,
      "total": 149.95
    },
    {
      "description": "Service Fee",
      "quantity": 1,
      "unit_price": 25,
      "total": 25
    }
  ],
  "subtotal": 174.95,
  "tax": 13.99,
  "total": 188.94,
  "due_date": "2025-01-15"
}
- Catch errors at parse time, not runtime; IDE autocomplete works.
- Ensure data meets constraints: required fields, value ranges, formats.
- Directly usable in databases, APIs, and analytics pipelines.
Pydantic: The Schema Language
Pydantic is Python's most popular data validation library. Define schemas using type hints, get automatic validation, serialization, and JSON Schema generation.
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True
How It Works
Pydantic uses Python type hints to define the schema. Default values make fields optional.
Key Features
- Automatic type coercion (str "42" to int 42)
- Rich error messages on validation failure
- Generates JSON Schema automatically
- Built-in serialization to dict/JSON
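A short sketch of these behaviors, redefining the User model from above (the field values are illustrative):

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int
    email: str
    is_active: bool = True  # default makes the field optional

# Coercion: the string "42" is converted to the int 42
user = User(name="Alice", age="42", email="alice@example.com")
print(user.age)           # 42
print(user.model_dump())  # plain dict, ready to serialize

# A value that cannot be coerced raises a rich ValidationError
try:
    User(name="Bob", age="not a number", email="bob@example.com")
except ValidationError as e:
    print(e)  # points at the 'age' field and explains the failure
```

Note that coercion is Pydantic's default "lax" mode; strict mode (which rejects `"42"` for an `int` field) is also available.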
JSON Schema: The Bridge to LLMs
Pydantic models automatically generate JSON Schema. This schema is what LLMs use to understand the expected output format. It's the contract between your code and the model.
Python to JSON Schema Type Mapping
| Python Type | JSON Schema | Example Value |
|---|---|---|
| str | "type": "string" | "hello" |
| int | "type": "integer" | 42 |
| float | "type": "number" | 3.14 |
| bool | "type": "boolean" | true |
| List[str] | "type": "array", "items": {"type": "string"} | ["a", "b"] |
| Optional[str] | "anyOf": [{"type": "string"}, {"type": "null"}] | "text" or null |
| Literal["a", "b"] | "enum": ["a", "b"] | "a" |
| datetime | "type": "string", "format": "date-time" | "2024-01-01T00:00:00Z" |
from pydantic import BaseModel
from typing import List, Optional

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None
{
  "type": "object",
  "properties": {
    "invoice_id": {"type": "string"},
    "amount": {"type": "number"},
    "items": {
      "type": "array",
      "items": {"type": "string"}
    },
    "paid": {"type": "boolean"},
    "notes": {
      "anyOf": [
        {"type": "string"},
        {"type": "null"}
      ]
    }
  },
  "required": ["invoice_id", "amount", "items", "paid"]
}
How LLMs Use JSON Schema
When you provide a JSON Schema to an LLM (via function calling or a structured output mode), the model constrains its token generation to produce only valid JSON matching the schema. This happens at the logit level: tokens that would break schema validity are masked out during sampling.
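Even when a provider does not support constrained decoding, the same Pydantic model can validate whatever JSON comes back. A minimal sketch, with the LLM response simulated as a plain string:

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str
    amount: float
    items: List[str]
    paid: bool
    notes: Optional[str] = None

# Simulated LLM response; in practice this is the model's raw JSON output
llm_output = '{"invoice_id": "INV-1", "amount": 174.95, "items": ["Widget Pro"], "paid": false}'

invoice = Invoice.model_validate_json(llm_output)
print(invoice.amount)  # 174.95

# Malformed output fails loudly instead of corrupting downstream data
try:
    Invoice.model_validate_json('{"invoice_id": "INV-2", "amount": "n/a", "items": [], "paid": true}')
except ValidationError as e:
    print(e.error_count(), "validation error(s)")
```

Validating at the boundary like this is the fallback contract: either you get a typed `Invoice`, or you get an exception you can handle.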
Instructor: LLM + Pydantic
Instructor is the glue between Pydantic and LLMs. It patches OpenAI/Anthropic clients to accept a response_model argument and handles validation, retries, and streaming.
How Instructor Works
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    total: float
    items: List[LineItem]

# Patch OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# Extract structured data from text
invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[{
        "role": "user",
        "content": """
        Invoice #12345
        Widget Pro x5 @ $29.99 = $149.95
        Service Fee = $25.00
        Total: $174.95
        """
    }]
)

print(invoice.invoice_number)        # "12345"
print(invoice.total)                 # 174.95
print(invoice.items[0].description)  # "Widget Pro"
Automatic Retries
If the LLM output fails Pydantic validation, Instructor automatically retries with the error message included in the prompt.
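The mechanism is ordinary Pydantic validation: any validator's error message becomes part of the retry prompt. A sketch with a hypothetical business rule (the validator below is illustrative, not part of Instructor):

```python
from typing import List
from pydantic import BaseModel, ValidationError, field_validator

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    invoice_number: str
    total: float
    items: List[LineItem]

    @field_validator("total")
    @classmethod
    def total_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("total must be a positive amount")
        return v

# A bad candidate output fails validation; Instructor would resend the
# prompt with this error message appended, up to max_retries times.
try:
    Invoice(invoice_number="12345", total=-1.0, items=[])
except ValidationError as e:
    print(e)  # includes "total must be a positive amount"
```

Writing the error message for the model (clear, actionable wording) tends to make retries converge faster.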
client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    max_retries=3,  # Retry up to 3 times
    messages=[...]
)
Streaming Support
Get partial objects as they stream in. Useful for long extractions where you want to show progress.
for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=Invoice,
    messages=[...]
):
    print(partial)  # partially populated Invoice
Extraction Methods Compared
There are multiple approaches to getting structured output from LLMs. Each has tradeoffs in reliability, speed, and flexibility.
When to Use What
- Instructor: best for production apps with OpenAI/Anthropic. Type-safe, battle-tested.
- Native OpenAI structured outputs: best for simple schemas when you only use OpenAI. No extra dependencies.
- Outlines: best for local/open-source models. 100% guaranteed valid output.
- LangChain structured output: best if you are already using LangChain. Works with any model.
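For the OpenAI-only route, the request carries the Pydantic-generated JSON Schema directly in the response_format parameter, with no extra dependency. A sketch of building that payload; note that OpenAI's strict mode imposes extra schema requirements (e.g. additionalProperties: false) beyond what model_json_schema() emits, so real use may need a tightened schema or the SDK's parse helper:

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total: float

# Shape of OpenAI's structured-output response_format: a named JSON Schema.
# "strict": True asks the API to constrain generation to this schema,
# but strict mode has additional requirements not shown here.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": Invoice.model_json_schema(),
        "strict": True,
    },
}

# This dict is what you would pass as response_format to
# client.chat.completions.create(...); the call itself is omitted here.
print(response_format["json_schema"]["schema"]["required"])
```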
Full Document Extraction Pipeline
Real-world document extraction combines OCR, layout analysis, and LLM extraction. Here's how the pieces fit together.
import instructor
from openai import OpenAI
from pydantic import BaseModel
from typing import List
from docling.document_converter import DocumentConverter  # Or PyMuPDF, pdfplumber, etc.

# 1. Define the schema
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    date: str
    items: List[LineItem]
    subtotal: float
    tax: float
    total: float

# 2. Extract text from the PDF
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
text = result.document.export_to_markdown()

# 3. Use an LLM to extract structured data
client = instructor.from_openai(OpenAI())
invoice = client.chat.completions.create(
    model="gpt-4o",
    response_model=Invoice,
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the following document."
        },
        {
            "role": "user",
            "content": text
        }
    ]
)

# 4. Use the typed, validated data
print(f"Invoice: {invoice.invoice_number}")
print(f"Total: ${invoice.total:.2f}")
for item in invoice.items:
    print(f"  - {item.description}: {item.quantity} x ${item.unit_price}")
OCR / Text Extraction
- Docling - Layout-aware
- PyMuPDF - Fast, native PDF
- Tesseract - Scanned docs
- Azure Doc AI - Pre-built
LLM Extraction
- GPT-4o - Best accuracy
- Claude 3.5 - Long docs
- Gemini 1.5 - 1M context
- Llama 3.1 - Self-host
Structured Output
- Instructor - Production
- Outlines - Local models
- Marvin - Lightweight
- BAML - Type-first
The Complete Picture
Structured output extraction turns messy documents into clean, typed data. Pydantic defines the contract, JSON Schema bridges to LLMs, and Instructor handles the plumbing. The result: reliable, production-ready document processing.
Use Cases
- ✓ Invoice processing
- ✓ Resume parsing
- ✓ Contract analysis
- ✓ Form digitization
Architectural Patterns
Layout-Aware OCR + LLM
Use document OCR (preserving layout) then LLM for extraction.
- + Handles complex layouts
- + Flexible schemas
- + Good accuracy
- - Multi-step
- - LLM cost for extraction
End-to-End Document VLM
Vision-language models that directly process document images.
- + Single model
- + Handles visual elements
- - May miss fine text
- - Fixed context window
Template-Based Extraction
Define zones/templates for known document types.
- + Very fast
- + High accuracy for known formats
- - Breaks on new formats
- - Maintenance overhead
Implementations
API Services
Azure Document Intelligence
Microsoft
Pre-built and custom extractors. Good for invoices and receipts.
Google Document AI
Google
Strong OCR + extraction. Pre-built processors.
Quick Facts
- Input: Document
- Output: Structured Data
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches