Document Layout Analysis

Analyzing the layout structure of documents

5 datasets, 126 results

Document layout analysis detects and classifies the structural elements of a page — paragraphs, tables, figures, headers, footers, captions, lists — as a prerequisite to extraction. It's the critical preprocessing step that tells downstream models where to look. DINO-based detectors on DocLayNet achieve 80%+ mAP, and foundation models like Florence-2 are making task-specific training optional.

History

2007

PRIMA dataset and ICDAR layout analysis competitions establish the task with traditional methods (connected components, Voronoi, rule-based)

2017

DeepDeSRT applies Faster R-CNN to table detection, showing deep learning works for document layout

2019

PubLayNet (IBM) creates a large-scale dataset (360K+ document images from PubMed) with 5 layout categories — becomes the standard benchmark

2020

LayoutParser provides a unified toolkit for deep learning-based layout analysis with Detectron2 models

2021

DiT (Document Image Transformer) applies self-supervised pretraining to document images, improving layout detection and segmentation

2022

DocLayNet (IBM) introduces 80K pages across 6 document types with 11 fine-grained categories — more diverse and challenging than PubLayNet

2023

DINO-DETR and Grounding DINO adapted for document layout achieve 80%+ mAP on DocLayNet, making transformer detectors the SOTA approach

2024

DocLayout-YOLO and RT-DETR for documents enable real-time layout detection; Docling (IBM) integrates layout analysis into production pipelines

2025

Florence-2 and Qwen2-VL handle layout analysis zero-shot via visual grounding; end-to-end document parsing subsumes layout as an intermediate step

How Document Layout Analysis Works

Document Layout Analysis Pipeline: (1) Document Preprocessing, (2) Object Detection, (3) Reading Order Prediction, (4) Post-Processing, (5) Evaluation

1. Document Preprocessing

Scanned documents are deskewed, denoised, and binarized if needed. Digital-born PDFs can be rendered at 150-300 DPI. Page images are resized to detector input resolution (typically 800-1333px on the long side).
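
As a rough illustration of the rendering and resizing arithmetic above, the sketch below computes page pixel dimensions at a given DPI and then scales them so the long side matches a detector input size. The function names, the US Letter page size, and the choice of 200 DPI and 1333 px are illustrative assumptions, not settings prescribed by any particular detector.

```python
def render_size_px(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def resize_to_long_side(width: int, height: int, long_side: int = 1333) -> tuple[int, int, float]:
    """Scale (width, height) so the longer side equals `long_side`, keeping aspect ratio."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale), scale


def box_to_original(box: tuple[float, float, float, float], scale: float) -> tuple[float, ...]:
    """Map a box predicted on the resized image back to original pixel coordinates."""
    return tuple(v / scale for v in box)


# Assumed example: a US Letter page (8.5 x 11 in) rendered at 200 DPI.
w, h = render_size_px(8.5, 11.0, 200)            # (1700, 2200)
new_w, new_h, scale = resize_to_long_side(w, h)  # (1030, 1333), scale is about 0.606
```

Keeping the scale factor around matters in practice: boxes predicted on the resized image have to be mapped back to the original resolution before cropping regions for OCR.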

2. Object Detection

A detection model (DINO-DETR, Faster R-CNN, YOLO) trained on document layout data predicts bounding boxes and class labels for each layout element. Classes include: paragraph, table, figure, heading, list, caption, header, footer, footnote, page number.
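
To make the detector output concrete, here is a minimal sketch of decoding raw predictions into labeled pixel boxes. It assumes the DETR-style convention of normalized (center x, center y, width, height) boxes; the raw tuple layout, the class index order, and the 0.5 score cutoff are assumptions for illustration, and real checkpoints define their own label maps.

```python
# Class names taken from the list above; the index order is an assumption.
LAYOUT_CLASSES = [
    "paragraph", "table", "figure", "heading", "list",
    "caption", "header", "footer", "footnote", "page number",
]


def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def decode(raw, img_w, img_h, min_score=0.5):
    """Keep predictions above `min_score` and attach class names and pixel boxes."""
    results = []
    for class_id, score, box in raw:
        if score >= min_score:
            results.append({
                "label": LAYOUT_CLASSES[class_id],
                "score": score,
                "box": cxcywh_to_xyxy(box, img_w, img_h),
            })
    return results


# Hypothetical raw output for a 1000 x 1400 px page: (class_id, score, box).
raw = [
    (1, 0.97, (0.5, 0.25, 0.8, 0.3)),   # confident table
    (5, 0.31, (0.5, 0.45, 0.6, 0.04)),  # low-confidence caption, dropped
]
detections = decode(raw, img_w=1000, img_h=1400)
```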

3. Reading Order Prediction

Detected regions are sorted into reading order — left-to-right, top-to-bottom for Western documents, with column detection for multi-column layouts. This is non-trivial for complex layouts with sidebars, callouts, and floating figures.
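
The column-aware ordering described above can be sketched with a simple heuristic: group regions into columns by horizontal overlap, read columns left to right, and read each column top to bottom. This is only one possible approach, and it assumes no region spans multiple columns (a full-width title or a sidebar would need extra handling). The `Region` type and the 50% overlap rule are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(regions: list[Region], min_overlap: float = 0.5) -> list[Region]:
    """Order regions column by column (left to right), top to bottom within a column."""
    columns: list[list[Region]] = []
    for r in sorted(regions, key=lambda r: r.x0):
        for col in columns:
            col_x0 = min(c.x0 for c in col)
            col_x1 = max(c.x1 for c in col)
            overlap = min(r.x1, col_x1) - max(r.x0, col_x0)
            narrower = min(r.x1 - r.x0, col_x1 - col_x0)
            # Same column if the horizontal overlap covers enough of the narrower extent.
            if overlap > min_overlap * narrower:
                col.append(r)
                break
        else:
            columns.append([r])
    columns.sort(key=lambda col: min(c.x0 for c in col))
    return [r for col in columns for r in sorted(col, key=lambda r: r.y0)]


# Two-column page: A and B on the left, C and D on the right.
page = [
    Region("C", 310, 100, 550, 250),
    Region("B", 50, 320, 290, 500),
    Region("D", 310, 270, 550, 500),
    Region("A", 50, 100, 290, 300),
]
ordered = [r.label for r in reading_order(page)]  # ['A', 'B', 'C', 'D']
naive = [r.label for r in sorted(page, key=lambda r: (r.y0, r.x0))]  # ['A', 'C', 'D', 'B']
```

The naive top-to-bottom sort interleaves the two columns, which is the failure mode that column detection is meant to avoid.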

4. Post-Processing

Overlapping detections are resolved, small fragments are merged, and layout hierarchy is constructed (e.g., table cells belong to tables, section headers scope subsequent paragraphs). The output is a structured layout tree.
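
A minimal sketch of two of these steps, assuming greedy class-agnostic non-maximum suppression for overlap resolution and a containment test for building the hierarchy. The `Det` type, the `table_cell` label, and both thresholds are hypothetical; fragment merging is omitted, and production pipelines typically apply more careful, class-aware rules.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Det:
    label: str
    box: Box
    score: float
    children: list["Det"] = field(default_factory=list)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(dets: list[Det], iou_thresh: float = 0.5) -> list[Det]:
    """Greedy NMS: keep the highest-scoring box, drop others that overlap it heavily."""
    kept: list[Det] = []
    for d in sorted(dets, key=lambda d: d.score, reverse=True):
        if all(iou(d.box, k.box) < iou_thresh for k in kept):
            kept.append(d)
    return kept


def attach_cells(dets: list[Det], min_cover: float = 0.9) -> list[Det]:
    """Nest each table cell under the table that covers at least `min_cover` of it."""
    tables = [d for d in dets if d.label == "table"]
    roots: list[Det] = []
    for d in dets:
        if d.label == "table_cell" and area(d.box) > 0:
            owner = next(
                (t for t in tables if intersection(d.box, t.box) / area(d.box) >= min_cover),
                None,
            )
            if owner is not None:
                owner.children.append(d)
                continue
        roots.append(d)
    return roots


dets = [
    Det("table", (50, 100, 550, 400), 0.95),
    Det("table", (55, 105, 545, 395), 0.60),      # near-duplicate of the first table
    Det("table_cell", (60, 110, 200, 150), 0.90),
    Det("paragraph", (50, 420, 550, 600), 0.88),
]
kept = suppress_overlaps(dets)  # the 0.60 table is suppressed
tree = attach_cells(kept)       # top level: table (with one cell), paragraph
```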

5. Evaluation

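COCO-style mAP, averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05, is the standard metric on PubLayNet (5 classes) and DocLayNet (11 classes); mAP at IoU 0.50 alone is sometimes reported as a more lenient figure. Per-class AP reveals that tables and figures are easier to detect than footnotes and captions. Reading order accuracy is evaluated separately.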
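
To show how the IoU threshold enters the metric, the sketch below computes average precision for one class on one page, first at IoU 0.50 and then averaged over thresholds from 0.50 to 0.95. It is a simplification under stated assumptions: it uses all-point interpolation of the precision-recall curve rather than the 101-point sampling of the official COCO tooling, and it does not pool detections across images or classes. The boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thresh):
    """AP for one class. preds: list of (box, score); gts: list of boxes."""
    if not gts:
        return 0.0
    matched = set()
    hits = []
    # Visit predictions from most to least confident; each ground truth matches once.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best_iou:
                best_iou, best_j = iou(box, gt), j
        is_hit = best_j is not None and best_iou >= iou_thresh
        if is_hit:
            matched.add(best_j)
        hits.append(is_hit)

    precisions, recalls, tp = [], [], 0
    for rank, is_hit in enumerate(hits, start=1):
        tp += is_hit
        precisions.append(tp / rank)
        recalls.append(tp / len(gts))

    # Area under the precision-recall curve, using the monotone precision envelope.
    ap, prev_recall = 0.0, 0.0
    for i, recall in enumerate(recalls):
        ap += (recall - prev_recall) * max(precisions[i:])
        prev_recall = recall
    return ap


def ap_over_thresholds(preds, gts):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


gts = [(0, 0, 100, 100), (200, 0, 300, 100)]
preds = [
    ((0, 0, 100, 100), 0.9),        # exact match, IoU 1.0
    ((210, 0, 310, 100), 0.8),      # shifted by 10 px, IoU about 0.82
    ((400, 400, 500, 500), 0.7),    # false positive
]
ap50 = average_precision(preds, gts, 0.50)  # 1.0
ap_avg = ap_over_thresholds(preds, gts)     # about 0.85
```

The shifted box counts as correct at thresholds up to 0.80 and as a miss from 0.85 upward, which is why the threshold-averaged score is lower than AP at 0.50 and rewards tighter localization.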

Current Landscape

Document layout analysis in 2025 is being absorbed into end-to-end document parsing pipelines. Standalone layout detection (PubLayNet, DocLayNet benchmarks) still matters for research, but practitioners increasingly use integrated tools (Docling, Marker, Surya) that handle layout → OCR → extraction in one pass. Transformer-based detectors (DINO-DETR) dominate accuracy, while YOLO variants serve the speed-sensitive segment. The biggest disruption is large VLMs that understand document layout implicitly — why detect layout regions separately when GPT-4o or Qwen2-VL can directly answer questions about a document image?

Key Challenges

Layout diversity — documents range from simple single-column text to complex multi-column layouts with nested tables, figures spanning columns, and margin annotations

Small and overlapping elements — footnotes, page numbers, and margin notes are small and frequently confused with body text

Multi-language and multi-script — Arabic (right-to-left), CJK (vertical possible), and mixed-script documents require layout models to handle diverse reading orders

Degraded documents — historical documents, faxes, and poorly scanned pages have noise, bleed-through, and damaged regions that confuse detectors

Reading order — detecting layout regions is only half the problem; determining the correct reading order for complex multi-column layouts is equally critical and under-studied

Quick Recommendations

Best accuracy

DINO-DETR (Swin-L backbone) fine-tuned on DocLayNet + PubLayNet

82%+ mAP on DocLayNet; transformer attention handles complex layouts well

Real-time / production

DocLayout-YOLO or RT-DETR for documents

70%+ mAP at 50+ FPS; practical for high-throughput document processing pipelines

Integrated pipeline

Docling (IBM) or Marker

End-to-end PDF/image → structured content; handles layout detection, OCR, table extraction, and reading order together

Zero-shot / no training

Florence-2-Large or Qwen2-VL

Visual grounding capability detects layout elements from text descriptions without document-specific training

Historical documents

Kraken OCR or eScriptorium

Specifically designed for degraded and historical documents with specialized layout segmentation

What's Next

Layout analysis is converging into document foundation models that don't need explicit layout detection as a separate step. End-to-end models will learn layout understanding as an implicit capability. Active research: multi-page layout understanding (headers/footers that repeat across pages), 3D document reconstruction (unfolding curved pages from phone captures), and real-time mobile document analysis for accessibility applications.
