Computer Vision

Document Parsing

Parsing document structure and content


Document parsing converts unstructured documents (PDFs, scans, photos) into structured, machine-readable formats (JSON, Markdown, HTML) — extracting text, tables, figures, and their relationships. It's the full pipeline: layout analysis + OCR + structure extraction + semantic understanding. Tools like Docling, Marker, and MinerU have made this practical for enterprise document processing.

History

2010

Apache Tika and PDFMiner provide basic text extraction from digital PDFs, but lose formatting, tables, and spatial structure

2015

Tabula and Camelot focus specifically on table extraction from PDFs, filling a critical gap in document parsing

2019

LayoutLM combines OCR output with spatial position embeddings, enabling layout-aware document understanding for the first time

2021

Donut (Kim et al.) introduces end-to-end document parsing without OCR — encoder reads the image, decoder generates structured text

2023

Nougat (Meta) parses academic papers end-to-end from page images to a Markdown-like format with embedded LaTeX, handling equations and tables

2023

Docling (IBM) and Marker (Surya-based) provide open-source production-quality PDF → Markdown/JSON pipelines

2024

MinerU and Unstructured.io combine multiple models (layout detection, OCR, table extraction) into robust parsing pipelines

2024

GPT-4o and Claude 3.5 Sonnet demonstrate that VLMs can parse documents directly from images with impressive accuracy, no specialized pipeline needed

2025

Docling v2, Reducto, and SmolDocling combine specialized components with VLM-based verification; hybrid approaches dominate production

How Document Parsing Works

1

Document Ingestion

PDFs are rendered to images at 150-300 DPI. Digital-born PDFs may also have extractable text/structure via pdfminer/pdfplumber, but layout fidelity varies. Images (scans, photos) go directly to the vision pipeline.

2

Layout Detection

A document layout model (e.g., a DETR variant such as DINO, or DocLayout-YOLO) segments the page into regions: text blocks, tables, figures, headings, lists, and headers/footers. A reading order over these regions is then determined.
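Reading-order determination can be approximated with a simple geometric sort over the detected regions. A minimal stdlib-only sketch (real layout models learn this; assigning columns by the block's left edge is a deliberate simplification):

```python
# Minimal reading-order sketch: assign detected regions to columns by
# horizontal position, then read each column top-to-bottom.
# Real pipelines use learned models; this is a geometric approximation.

def reading_order(regions, page_width, n_columns=2):
    """regions: list of dicts with 'bbox' = (x0, y0, x1, y1), origin top-left."""
    col_width = page_width / n_columns
    def key(r):
        x0, y0, _, _ = r["bbox"]
        col = int(x0 // col_width)   # which column the block starts in
        return (col, y0, x0)         # column first, then top-to-bottom
    return sorted(regions, key=key)

regions = [
    {"id": "fig1",  "bbox": (320, 100, 600, 300)},  # right column, top
    {"id": "para2", "bbox": (10, 400, 300, 500)},   # left column, bottom
    {"id": "head1", "bbox": (10, 50, 300, 80)},     # left column, top
]
order = [r["id"] for r in reading_order(regions, page_width=620)]
# left column first (head1, para2), then right column (fig1)
```

A two-column sort like this already handles most academic-paper pages; footnotes, sidebars, and full-width figures are where learned reading-order models earn their keep.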

3

Text Extraction (OCR)

Each text region is OCR'd (Tesseract, PaddleOCR, Surya, or cloud APIs). For digital-born PDFs, embedded text is extracted directly. OCR produces text strings with bounding box coordinates.
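OCR engines typically emit word-level boxes; downstream stages group them into lines before assembly. A stdlib-only sketch of that grouping (the word tuples below stand in for real engine output, e.g. Tesseract's TSV):

```python
# Group word-level OCR output (text, bbox) into text lines by vertical
# overlap, then sort words left-to-right within each line.
# The word tuples stand in for real OCR engine output.

def words_to_lines(words, y_tol=5):
    """words: list of (text, (x0, y0, x1, y1)); returns list of line strings."""
    lines = []  # each entry: (y_center, [(x0, text), ...])
    for text, (x0, y0, x1, y1) in sorted(words, key=lambda w: w[1][1]):
        yc = (y0 + y1) / 2
        for line in lines:
            if abs(line[0] - yc) <= y_tol:   # same visual line
                line[1].append((x0, text))
                break
        else:
            lines.append((yc, [(x0, text)]))
    return [" ".join(t for _, t in sorted(ws)) for _, ws in sorted(lines)]

words = [
    ("world", (60, 10, 110, 22)),
    ("Hello", (0, 11, 50, 23)),
    ("Line2", (0, 40, 50, 52)),
]
print(words_to_lines(words))   # ['Hello world', 'Line2']
```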

4

Table Extraction

Table regions are processed by specialized table recognition models (Table Transformer, TableFormer) that detect rows, columns, and cells, then extract cell content with OCR. The result is structured table data (HTML, CSV, or JSON).
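Once cell structure and content are recovered, serialization is the easy part. A minimal sketch, assuming the recognition stage already yielded (row, col, text) triples (merged and spanning cells, the hard cases, are ignored here):

```python
# Serialize recognized table cells to a Markdown table.
# Input: (row, col, text) triples, as a table-structure model + OCR might
# yield. Merged/spanning cells are not handled in this sketch.

def cells_to_markdown(cells):
    n_rows = max(r for r, _, _ in cells) + 1
    n_cols = max(c for _, c, _ in cells) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for r, c, text in cells:
        grid[r][c] = text
    lines = ["| " + " | ".join(grid[0]) + " |",
             "|" + "---|" * n_cols]
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)

cells = [(0, 0, "Metric"), (0, 1, "Value"),
         (1, 0, "Precision"), (1, 1, "0.91")]
print(cells_to_markdown(cells))
```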

5

Assembly & Output

All extracted elements (text, tables, figures) are assembled into a structured document following the detected reading order. Output formats include Markdown (for LLM consumption), HTML (preserving layout), JSON (for APIs), or DocX. Metadata (page numbers, fonts, confidence scores) is optionally included.
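Assembly can be as simple as walking the ordered elements and emitting the right serialization per type. A toy sketch of the Markdown path (the element schema here is illustrative, not any specific tool's format):

```python
# Assemble parsed elements into Markdown, following detected reading order.
# The element dicts are an illustrative schema, not a real tool's output.

def to_markdown(elements):
    parts = []
    for el in sorted(elements, key=lambda e: e["order"]):
        if el["type"] == "heading":
            parts.append("#" * el.get("level", 1) + " " + el["text"])
        elif el["type"] == "table":
            parts.append(el["markdown"])    # pre-serialized by table stage
        elif el["type"] == "figure":
            parts.append(f"![{el.get('caption', 'figure')}]({el['path']})")
        else:                               # plain text block
            parts.append(el["text"])
    return "\n\n".join(parts)

doc = [
    {"order": 1, "type": "text", "text": "Results were strong."},
    {"order": 0, "type": "heading", "level": 2, "text": "Results"},
]
print(to_markdown(doc))   # heading first, then the paragraph
```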

Current Landscape

Document parsing in 2025 is a hybrid field. The best production pipelines (Docling, MinerU, Unstructured) compose specialized models — layout detectors, OCR engines, table extractors — into orchestrated workflows. Meanwhile, large VLMs (GPT-4o, Claude) can parse documents end-to-end from images but are expensive at scale and inconsistent on complex layouts. The practical sweet spot is using specialized parsers for the 90% case and VLMs for verification or edge cases. The field is increasingly driven by the RAG revolution — every enterprise wants to parse their documents into chunks suitable for LLM retrieval.

Key Challenges

Table extraction quality — complex tables with merged cells, spanning headers, and nested structures remain the hardest parsing problem; accuracy drops sharply beyond simple grid tables

Figure understanding — parsing pipelines can detect figures but rarely extract meaningful information from charts, graphs, or diagrams

Multi-column layout — correctly threading text across columns, especially with footnotes, sidebars, and interrupting figures, causes frequent ordering errors

Mathematical equations — converting equations from scanned documents into LaTeX requires specialized math OCR (Nougat, im2latex-style models) that most general parsers lack

Cross-page continuity — tables, paragraphs, and lists that span page boundaries need to be merged, which requires understanding that a sentence or row continues on the next page

Quick Recommendations

General-purpose PDF parsing

Docling v2 or MinerU

Best open-source pipelines combining layout detection + OCR + table extraction; Docling handles most document types reliably

Academic papers

Nougat or Marker

Nougat handles equations and LaTeX natively; Marker produces clean Markdown from academic PDFs

High-accuracy commercial

Azure Document Intelligence or AWS Textract

Enterprise-grade with SLA, prebuilt models for invoices/receipts/IDs, and custom model training

LLM-powered parsing

GPT-4o or Claude Sonnet + structured output

Send document images to a VLM and request structured JSON output; handles edge cases that rule-based parsers miss

RAG pipeline preparation

Docling + chunking strategy

Parse documents to Markdown, chunk by section/heading, and embed for retrieval-augmented generation
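The chunking step can key off Markdown headings directly. A stdlib-only sketch (real RAG pipelines also enforce token limits and overlap, omitted here):

```python
# Split parsed Markdown into retrieval chunks at heading boundaries.
# Real pipelines also enforce token limits and chunk overlap.
import re

def chunk_by_heading(markdown, max_level=2):
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$", re.MULTILINE)
    chunks, last, title = [], 0, "preamble"
    for m in pattern.finditer(markdown):
        body = markdown[last:m.start()].strip()
        if body:
            chunks.append({"title": title, "text": body})
        title, last = m.group(2), m.end()
    tail = markdown[last:].strip()
    if tail:
        chunks.append({"title": title, "text": tail})
    return chunks

md = "# Intro\nSome text.\n## Method\nDetails here.\n"
for c in chunk_by_heading(md):
    print(c["title"], "->", c["text"])
```

Keeping the heading with each chunk matters in practice: the heading text is often the strongest retrieval signal for the chunk beneath it.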

What's Next

The field is converging toward end-to-end parsing models that replace multi-stage pipelines with a single model that reads a document image and outputs structured text. SmolDocling and similar initiatives aim to make this practical at scale. Longer-term: real-time parsing from camera feeds (phone captures), cross-document understanding (linking references across document collections), and multi-modal document understanding that extracts information from charts and diagrams as fluently as from text.
