Document Parsing
Parsing document structure and content
Document parsing converts unstructured documents (PDFs, scans, photos) into structured, machine-readable formats (JSON, Markdown, HTML) — extracting text, tables, figures, and their relationships. It's the full pipeline: layout analysis + OCR + structure extraction + semantic understanding. Tools like Docling, Marker, and MinerU have made this practical for enterprise document processing.
History
Apache Tika and PDFMiner provide basic text extraction from digital PDFs, but lose formatting, tables, and spatial structure
Tabula and Camelot focus specifically on table extraction from PDFs, filling a critical gap in document parsing
LayoutLM combines OCR output with spatial position embeddings, enabling layout-aware document understanding for the first time
Donut (Kim et al.) introduces end-to-end document parsing without OCR — encoder reads the image, decoder generates structured text
Nougat (Meta) parses academic papers end-to-end from images to LaTeX/Markdown, handling equations and figures
Docling (IBM) and Marker (Surya-based) provide open-source production-quality PDF → Markdown/JSON pipelines
MinerU and Unstructured.io combine multiple models (layout detection, OCR, table extraction) into robust parsing pipelines
GPT-4o and Claude 3.5 Sonnet demonstrate that VLMs can parse documents directly from images with impressive accuracy, no specialized pipeline needed
Docling v2, Reducto, and SmolDocling combine specialized components with VLM-based verification; hybrid approaches dominate production
How Document Parsing Works
Document Ingestion
PDFs are rendered to images at 150-300 DPI. Digital-born PDFs may also have extractable text/structure via pdfminer/pdfplumber, but layout fidelity varies. Images (scans, photos) go directly to the vision pipeline.
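The routing decision can be sketched as a simple heuristic, assuming a per-page count of embedded text characters (e.g. as reported by a library like pdfplumber) is already available; real pipelines also check fonts and text-layer quality:

```python
def choose_pipeline(embedded_chars_per_page, min_chars=50):
    """Route each page: extract embedded text directly, or render the
    page to an image (150-300 DPI) for the OCR/vision pipeline.
    A scanned page exposes few or no embedded characters."""
    return [
        "embedded-text" if n >= min_chars else "vision"
        for n in embedded_chars_per_page
    ]

# Page 2 here is a scan with no text layer.
print(choose_pipeline([1200, 0, 830]))
# → ['embedded-text', 'vision', 'embedded-text']
```

The `min_chars` threshold is an assumption; in practice it is tuned per corpus, since some digital PDFs carry a broken or partial text layer.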
Layout Detection
A document layout model (DINO-DETR, YOLO-Doc) segments the page into regions: text blocks, tables, figures, headings, lists, headers/footers. Reading order is determined.
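Reading-order determination for a multi-column page can be sketched as a naive geometric sort, assuming each detected region is a `(x0, y0, x1, y1, label)` tuple in pixels. Production systems use learned reading-order models, but the core idea is the same: group regions into columns, then read each column top to bottom:

```python
def reading_order(regions, column_gap=40):
    """Sort layout regions (x0, y0, x1, y1, label) into reading order:
    columns left to right, then top to bottom within each column.
    Naive column split: a region whose left edge is more than
    `column_gap` px right of the current column starts a new column."""
    by_x = sorted(regions, key=lambda r: r[0])
    columns, col_x = [], None
    for r in by_x:
        if col_x is None or r[0] - col_x > column_gap:
            columns.append([])
            col_x = r[0]
        columns[-1].append(r)
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda r: r[1]))
    return ordered

# Two-column page: left column holds a heading above a table.
page = [
    (320, 80, 600, 200, "text"),
    (20, 300, 300, 400, "table"),
    (20, 60, 300, 280, "heading"),
]
print([r[4] for r in reading_order(page)])
# → ['heading', 'table', 'text']
```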
Text Extraction (OCR)
Each text region is OCR'd (Tesseract, PaddleOCR, Surya, or cloud APIs). For digital-born PDFs, embedded text is extracted directly. OCR produces text strings with bounding box coordinates.
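OCR engines typically emit word-level results; a post-processing step groups them into lines. A minimal sketch, assuming each word is a `(text, x0, y0, x1, y1)` tuple and that words on one line share roughly the same top edge:

```python
def words_to_lines(words, y_tol=5):
    """Group word-level OCR results (text, x0, y0, x1, y1) into lines:
    consecutive words whose top edges are within `y_tol` px form a
    line, and each line is read left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(word[2] - lines[-1][-1][2]) <= y_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
        for line in lines
    ]

words = [
    ("world", 60, 10, 100, 24),
    ("Hello", 10, 12, 55, 26),
    ("Next", 10, 40, 45, 54),
]
print(words_to_lines(words))
# → ['Hello world', 'Next']
```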
Table Extraction
Table regions are processed by specialized table recognition models (Table Transformer, TableFormer) that detect rows, columns, and cells, then extract cell content with OCR. The result is structured table data (HTML, CSV, or JSON).
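Once the table model has recovered a cell grid, serializing it is straightforward. A sketch that renders a simple grid (no merged cells, first row as header) to a Markdown table:

```python
def cells_to_markdown(grid):
    """Render a recovered cell grid (list of rows, first row = header)
    as a Markdown table, a common output of table-extraction stages."""
    header, *body = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(cells_to_markdown([["Item", "Qty"], ["Widget", "3"]]))
```

Merged cells and spanning headers are exactly where this simple serialization breaks down, which is why they dominate the table-quality challenges below.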
Assembly & Output
All extracted elements (text, tables, figures) are assembled into a structured document following the detected reading order. Output formats include Markdown (for LLM consumption), HTML (preserving layout), JSON (for APIs), or DocX. Metadata (page numbers, fonts, confidence scores) is optionally included.
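The assembly step can be sketched as a serializer over elements already sorted into reading order; the element schema here (dicts with `type` and `content` keys) is illustrative, not any particular library's format:

```python
def assemble_markdown(elements):
    """Serialize parsed elements (already in reading order) as
    Markdown, the format most often fed to LLMs."""
    parts = []
    for el in elements:
        kind, content = el["type"], el["content"]
        if kind == "heading":
            parts.append("## " + content)
        elif kind == "table":
            parts.append(content)  # an already-rendered Markdown table
        elif kind == "figure":
            parts.append(f"![{content}](figure)")  # caption as alt text
        else:
            parts.append(content)  # plain text block
    return "\n\n".join(parts)

doc = [
    {"type": "heading", "content": "Results"},
    {"type": "text", "content": "Revenue grew."},
]
print(assemble_markdown(doc))
```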
Current Landscape
Document parsing in 2025 is a hybrid field. The best production pipelines (Docling, MinerU, Unstructured) compose specialized models — layout detectors, OCR engines, table extractors — into orchestrated workflows. Meanwhile, large VLMs (GPT-4o, Claude) can parse documents end-to-end from images but are expensive at scale and inconsistent on complex layouts. The practical sweet spot is using specialized parsers for the 90% case and VLMs for verification or edge cases. The field is increasingly driven by the RAG revolution — every enterprise wants to parse their documents into chunks suitable for LLM retrieval.
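The specialized-first, VLM-fallback pattern described above can be sketched as a confidence-gated router; both parser callables are hypothetical and assumed to return a `(result, confidence)` pair:

```python
def parse_with_fallback(page, fast_parser, vlm_parser, min_confidence=0.9):
    """Hybrid routing: run the cheap specialized pipeline first and
    escalate only low-confidence pages to an expensive VLM."""
    result, confidence = fast_parser(page)
    if confidence >= min_confidence:
        return result
    return vlm_parser(page)[0]

# Confident specialized parse: the VLM is never called.
print(parse_with_fallback("page-1",
                          lambda p: ("fast result", 0.95),
                          lambda p: ("vlm result", 1.0)))
# → fast result
```

The economics follow directly: if 90% of pages clear the threshold, VLM cost is paid on only the remaining 10%.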
Key Challenges
Table extraction quality — complex tables with merged cells, spanning headers, and nested structures remain the hardest parsing problem; accuracy drops sharply beyond simple grid tables
Figure understanding — parsing pipelines can detect figures but rarely extract meaningful information from charts, graphs, or diagrams
Multi-column layout — correctly threading text across columns, especially with footnotes, sidebars, and interrupting figures, causes frequent ordering errors
Mathematical equations — rendering equations correctly from scanned documents requires specialized math OCR (Nougat, im2latex) that most general parsers lack
Cross-page continuity — tables, paragraphs, and lists that span page boundaries need to be merged, which requires understanding that a sentence or row continues on the next page
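The cross-page case can be sketched as a naive merge heuristic, assuming each page is a list of element dicts where tables carry their rows; real systems also compare column widths and header text before merging:

```python
def merge_cross_page_tables(pages):
    """Merge a table that continues onto the next page: if one page
    ends with a table and the next begins with a table of the same
    column count, treat the second as a continuation (naive)."""
    merged = []
    for elements in pages:
        elements = list(elements)
        if (merged and elements
                and merged[-1]["type"] == "table"
                and elements[0]["type"] == "table"
                and len(merged[-1]["rows"][0]) == len(elements[0]["rows"][0])):
            merged[-1]["rows"].extend(elements[0]["rows"])
            elements = elements[1:]
        merged.extend(elements)
    return merged

pages = [
    [{"type": "table", "rows": [["a", "b"], ["1", "2"]]}],
    [{"type": "table", "rows": [["3", "4"]]},
     {"type": "text", "content": "After the table."}],
]
result = merge_cross_page_tables(pages)
print(len(result), result[0]["rows"])
```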
Quick Recommendations
General-purpose PDF parsing
Docling v2 or MinerU
Best open-source pipelines combining layout detection + OCR + table extraction; Docling handles most document types reliably
Academic papers
Nougat or Marker
Nougat handles equations and LaTeX natively; Marker produces clean Markdown from academic PDFs
High-accuracy commercial
Azure Document Intelligence or AWS Textract
Enterprise-grade with SLA, prebuilt models for invoices/receipts/IDs, and custom model training
LLM-powered parsing
GPT-4o or Claude Sonnet + structured output
Send document images to a VLM and request structured JSON output; handles edge cases that rule-based parsers miss
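VLM responses still need validation before downstream use. A sketch of the post-processing side, with a hypothetical schema (`title` and `tables` keys) standing in for whatever structure you request in the prompt:

```python
import json

def parse_vlm_output(raw):
    """Validate structured output requested from a VLM: strip a
    Markdown code fence if present, parse the JSON, and check
    required keys (hypothetical schema: 'title' and 'tables')."""
    text = raw.strip()
    if text.startswith("```"):
        # Models often wrap JSON in a fenced block despite instructions.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    doc = json.loads(text)
    missing = {"title", "tables"} - doc.keys()
    if missing:
        raise ValueError(f"VLM response missing keys: {missing}")
    return doc

raw = '```json\n{"title": "Q3 Report", "tables": []}\n```'
print(parse_vlm_output(raw)["title"])
# → Q3 Report
```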
RAG pipeline preparation
Docling + chunking strategy
Parse documents to Markdown, chunk by section/heading, and embed for retrieval-augmented generation
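A minimal sketch of heading-based chunking over the parsed Markdown, so each embedded chunk stays topically coherent; production chunkers additionally enforce token limits and overlap:

```python
def chunk_by_heading(markdown):
    """Split parsed Markdown into retrieval chunks, one per heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

md = "# Intro\nSome text.\n# Methods\nMore text."
print(chunk_by_heading(md))
# → ['# Intro\nSome text.', '# Methods\nMore text.']
```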
What's Next
The field is converging toward end-to-end parsing models that replace multi-stage pipelines with a single model that reads a document image and outputs structured text. SmolDocling and similar initiatives aim to make this practical at scale. Longer-term: real-time parsing from camera feeds (phone captures), cross-document understanding (linking references across document collections), and multi-modal document understanding that extracts information from charts and diagrams as fluently as from text.
Benchmarks & SOTA
olmOCR-Bench
7,010 unit tests across 1,402 PDF documents. Tests parsing of tables, math, multi-column layouts, old scans, and more.
State of the Art: Chandra v0.1.0 (datalab-to), 99.9 (base)
OmniDocBench v1.5
981 annotated PDF pages across 9 document categories. Tests end-to-end document parsing including text, tables, and formulas.
State of the Art: MinerU 2.5 (OpenDataLab), 97.5 (layout-map)