Computer Vision

Document Layout Analysis

Analyzing the layout structure of documents

5 datasets · 133 results

Document layout analysis detects and classifies the structural elements of a page — paragraphs, tables, figures, headers, footers, captions, lists — as a prerequisite to extraction. It's the critical preprocessing step that tells downstream models where to look. DINO-based detectors on DocLayNet achieve 80%+ mAP, and foundation models like Florence-2 are making task-specific training optional.

History

2007

PRImA dataset and ICDAR layout analysis competitions establish the task with traditional methods (connected components, Voronoi, rule-based)

2017

DeepDeSRT applies Faster R-CNN to table detection, showing deep learning works for document layout

2019

PubLayNet (IBM) creates a large-scale dataset (360K+ document images from PubMed Central) with 5 layout categories and becomes the standard benchmark

2020

LayoutParser provides a unified toolkit for deep learning-based layout analysis with Detectron2 models

2021

DiT (Document Image Transformer) applies self-supervised pretraining to document images, improving layout detection and segmentation

2022

DocLayNet (IBM) introduces 80K pages across 6 document types with 11 fine-grained categories — more diverse and challenging than PubLayNet

2023

DINO-DETR and Grounding DINO adapted for document layout achieve 80%+ mAP on DocLayNet, making transformer detectors the SOTA approach

2024

DocLayout-YOLO and RT-DETR for documents enable real-time layout detection; Docling (IBM) integrates layout analysis into production pipelines

2025

Florence-2 and Qwen2-VL handle layout analysis zero-shot via visual grounding; end-to-end document parsing subsumes layout as an intermediate step

How Document Layout Analysis Works

Document Preprocessing

Scanned documents are deskewed, denoised, and binarized if needed. Digital-born PDFs can be rendered at 150-300 DPI. Page images are resized to detector input resolution (typically 800-1333px on the long side).
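The rendering and resizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not any particular toolkit's preprocessing: the function names `render_size_px` and `fit_long_side` are hypothetical, and the 1333 px cap is simply the upper end of the range quoted above.

```python
def render_size_px(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def fit_long_side(width: int, height: int, max_long_side: int = 1333) -> tuple[int, int]:
    """Shrink (width, height) so the long side is at most `max_long_side`.

    Aspect ratio is preserved and images that already fit are left unchanged
    (this sketch never upscales).
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return round(width * scale), round(height * scale)


# US Letter (8.5 x 11 in) rendered at 300 DPI, then resized for the detector.
page = render_size_px(8.5, 11, 300)   # (2550, 3300)
print(fit_long_side(*page))           # (1030, 1333)
```

Bounding boxes predicted on the resized image have to be divided by the same scale factor to map them back to the original page coordinates.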

Object Detection

A detection model (DINO-DETR, Faster R-CNN, YOLO) trained on document layout data predicts bounding boxes and class labels for each layout element. Classes include: paragraph, table, figure, heading, list, caption, header, footer, footnote, page number.
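Whatever detector is used, its output per page reduces to a list of (class label, confidence score, bounding box) records. A minimal sketch of that structure follows; the `Detection` class, the `keep_confident` helper, and the 0.5 score cut-off are illustrative assumptions rather than part of any specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str                               # e.g. "table", "figure", "footnote"
    score: float                             # detector confidence in [0, 1]
    box: tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels


def keep_confident(detections: list[Detection], min_score: float = 0.5) -> list[Detection]:
    """Drop detections whose confidence is below `min_score`."""
    return [d for d in detections if d.score >= min_score]


page_detections = [
    Detection("table", 0.92, (60.0, 400.0, 540.0, 620.0)),
    Detection("footnote", 0.31, (60.0, 760.0, 300.0, 780.0)),
]
print([d.label for d in keep_confident(page_detections)])  # ['table']
```

The threshold trades precision against recall, so a single cut-off for all classes is rarely ideal; small classes such as footnotes often need a lower one.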

Reading Order Prediction

Detected regions are sorted into reading order — left-to-right, top-to-bottom for Western documents, with column detection for multi-column layouts. This is non-trivial for complex layouts with sidebars, callouts, and floating figures.
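A simple heuristic for the column-aware sort described above is to group boxes into columns by horizontal overlap, read columns left to right, and read each column top to bottom. The sketch below assumes boxes are `(x0, y0, x1, y1)` tuples and that columns are separated by a clear horizontal gap; `reading_order` is a hypothetical name, and the heuristic deliberately ignores the hard cases (full-width titles, sidebars, floating figures).

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def reading_order(boxes: list[Box]) -> list[Box]:
    """Order boxes column by column (left to right), top to bottom within a column."""
    columns: list[list[Box]] = []
    right_edges: list[float] = []  # running right edge of each column

    # Sweep boxes by left edge; a box starting before the current column's
    # right edge overlaps it horizontally and joins that column.
    for box in sorted(boxes, key=lambda b: b[0]):
        if columns and box[0] < right_edges[-1]:
            columns[-1].append(box)
            right_edges[-1] = max(right_edges[-1], box[2])
        else:
            columns.append([box])
            right_edges.append(box[2])

    ordered: list[Box] = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered


two_column_page = [
    (320, 320, 570, 700),  # right column, lower block
    (50, 420, 300, 700),   # left column, lower block
    (320, 100, 570, 300),  # right column, upper block
    (50, 100, 300, 400),   # left column, upper block
]
for box in reading_order(two_column_page):
    print(box)
```

A heading that spans both columns would merge everything into one column under this heuristic, which is one reason learned reading-order models exist.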

Post-Processing

Overlapping detections are resolved, small fragments are merged, and layout hierarchy is constructed (e.g., table cells belong to tables, section headers scope subsequent paragraphs). The output is a structured layout tree.
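Resolving overlapping detections is commonly done with greedy non-maximum suppression: keep the highest-scoring box, discard any lower-scoring box that overlaps it too much, and repeat. The sketch below shows that one step only (not fragment merging or hierarchy construction); the `(score, box)` tuple layout and the 0.5 IoU threshold are assumptions made for illustration.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(
    detections: list[tuple[float, Box]], iou_threshold: float = 0.5
) -> list[tuple[float, Box]]:
    """Greedy NMS: visit detections by descending score, keep those that do not
    overlap an already-kept box by `iou_threshold` or more."""
    kept: list[tuple[float, Box]] = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 100, 100)),
    (0.6, (5, 5, 105, 105)),      # near-duplicate of the first box
    (0.8, (200, 200, 300, 300)),
]
print(suppress_overlaps(detections))
# [(0.9, (0, 0, 100, 100)), (0.8, (200, 200, 300, 300))]
```

In layout analysis, suppression is often applied across classes as well, since a region should not be both a paragraph and a list; legitimately nested elements (a caption inside a figure region, cells inside a table) need to be exempted from that rule.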

Evaluation

COCO-style mAP, averaged over IoU thresholds from 0.50 to 0.95, is the standard metric on PubLayNet (5 classes) and DocLayNet (11 classes); mAP at IoU 0.50 is sometimes reported alongside it. Per-class AP reveals that tables and figures are easier to detect than footnotes and captions. Reading order accuracy is evaluated separately.
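The IoU threshold decides whether a prediction counts as correct at all, which is why the same detector scores differently at a loose and a strict threshold. The sketch below shows only the matching step for one class on one page (a full AP computation additionally sweeps the score threshold and integrates the precision-recall curve); the function names and the `(score, box)` layout are illustrative assumptions.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def true_positives(
    predictions: list[tuple[float, Box]], ground_truth: list[Box], iou_threshold: float
) -> int:
    """Greedily match predictions (highest score first) to unmatched ground-truth
    boxes; a match requires IoU >= iou_threshold."""
    matched: set[int] = set()
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_index, best_iou = None, iou_threshold
        for index, gt_box in enumerate(ground_truth):
            if index in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None:
            matched.add(best_index)
    return len(matched)


ground_truth = [(0, 0, 100, 100)]
predictions = [(0.9, (10, 10, 110, 110))]  # IoU with the ground truth is about 0.68

print(true_positives(predictions, ground_truth, 0.50))  # 1
print(true_positives(predictions, ground_truth, 0.75))  # 0
```

Precision is then the true-positive count divided by the number of predictions, and recall is the count divided by the number of ground-truth boxes.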

Current Landscape

Document layout analysis in 2025 is being absorbed into end-to-end document parsing pipelines. Standalone layout detection (PubLayNet, DocLayNet benchmarks) still matters for research, but practitioners increasingly use integrated tools (Docling, Marker, Surya) that handle layout → OCR → extraction in one pass. Transformer-based detectors (DINO-DETR) dominate accuracy, while YOLO variants serve the speed-sensitive segment. The biggest disruption is large VLMs that understand document layout implicitly — why detect layout regions separately when GPT-4o or Qwen2-VL can directly answer questions about a document image?

Key Challenges

Layout diversity — documents range from simple single-column text to complex multi-column layouts with nested tables, figures spanning columns, and margin annotations

Small and overlapping elements — footnotes, page numbers, and margin notes are small and frequently confused with body text

Multi-language and multi-script — Arabic (right-to-left), CJK (vertical possible), and mixed-script documents require layout models to handle diverse reading orders

Degraded documents — historical documents, faxes, and poorly scanned pages have noise, bleed-through, and damaged regions that confuse detectors

Reading order — detecting layout regions is only half the problem; determining the correct reading order for complex multi-column layouts is equally critical and under-studied

Quick Recommendations

Best accuracy

DINO-DETR (Swin-L backbone) fine-tuned on DocLayNet + PubLayNet

82%+ mAP on DocLayNet; transformer attention handles complex layouts well

Real-time / production

DocLayout-YOLO or RT-DETR for documents

70%+ mAP at 50+ FPS; practical for high-throughput document processing pipelines

Integrated pipeline

Docling (IBM) or Marker

End-to-end PDF/image → structured content; handles layout detection, OCR, table extraction, and reading order together

Zero-shot / no training

Florence-2-Large or Qwen2-VL

Visual grounding capability detects layout elements from text descriptions without document-specific training

Historical documents

Kraken OCR or eScriptorium

Specifically designed for degraded and historical documents with specialized layout segmentation

What's Next

Layout analysis is converging into document foundation models that don't need explicit layout detection as a separate step. End-to-end models will learn layout understanding as an implicit capability. Active research: multi-page layout understanding (headers/footers that repeat across pages), 3D document reconstruction (unfolding curved pages from phone captures), and real-time mobile document analysis for accessibility applications.

Benchmarks & SOTA

publaynet-val

2020 · 92 results

Dataset from Papers With Code

State of the Art

Hybrid DLA (Shehzadi et al.)

DFKI / TU Kaiserslautern

0.986

Table

document-layout-recognition-challenge-test

ICDAR 2019 Recognition of Documents with Complex Layouts (RDCL2019) - Test Set

2019 · 18 results

The RDCL2019 test set from the ICDAR 2019 Competition on Recognition of Documents with Complex Layouts. Comprises 85 scanned page images from contemporary magazines and technical/scientific publications (PRImA Layout Analysis Dataset). Evaluation measures region segmentation and classification using Weighted F1-score across layout classes. A continuous competition allowing post-2019 submissions via the Aletheia evaluation tool.

State of the Art

fglihai

0.970

figure

document-layout-recognition-challenge-mini-dev

2020 · 12 results

Dataset from Papers With Code

State of the Art

fglihai

table

u-diads-bib

2020 · 8 results

Dataset from Papers With Code

State of the Art

CV-Group

83.4

class-average-iou

d4la

2020 · 3 results

Dataset from Papers With Code

State of the Art

DoPTA

70.72

map

Related Tasks

Open-Vocabulary Object Detection

Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.

Video segmentation

Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.

Object counting

Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.

Image editing

Image editing is the process of altering and improving images, whether digital or traditional, using specialized tools and software to enhance their quality, appearance, and functionality. This can involve simple tasks like cropping and color correction or complex techniques such as layering, retouching to remove blemishes, and creating new composite images. The goal of image editing is to make images more aesthetically pleasing, correct flaws, or achieve a desired artistic effect.
