Document Layout Analysis
Analyzing the layout structure of documents
Document layout analysis detects and classifies the structural elements of a page — paragraphs, tables, figures, headers, footers, captions, lists — as a prerequisite to extraction. It's the critical preprocessing step that tells downstream models where to look. DINO-based detectors on DocLayNet achieve 80%+ mAP, and foundation models like Florence-2 are making task-specific training optional.
History
PRIMA dataset and ICDAR layout analysis competitions establish the task with traditional methods (connected components, Voronoi, rule-based)
DeepDeSRT applies Faster R-CNN to table detection, showing deep learning works for document layout
PubLayNet (IBM) creates a large-scale dataset (360K+ document images from PubMed Central) with 5 layout categories, which becomes the standard benchmark
LayoutParser provides a unified toolkit for deep learning-based layout analysis with Detectron2 models
DiT (Document Image Transformer) applies self-supervised pretraining to document images, improving layout detection and segmentation
DocLayNet (IBM) introduces 80K pages across 6 document types with 11 fine-grained categories — more diverse and challenging than PubLayNet
DINO-DETR and Grounding DINO adapted for document layout achieve 80%+ mAP on DocLayNet, making transformer detectors the SOTA approach
DocLayout-YOLO and RT-DETR for documents enable real-time layout detection; Docling (IBM) integrates layout analysis into production pipelines
Florence-2 and Qwen2-VL handle layout analysis zero-shot via visual grounding; end-to-end document parsing subsumes layout as an intermediate step
How Document Layout Analysis Works
Document Preprocessing
Scanned documents are deskewed, denoised, and binarized if needed. Digital-born PDFs can be rendered at 150-300 DPI. Page images are resized to detector input resolution (typically 800-1333px on the long side).
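The sizing arithmetic above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the function names, the US Letter page size, and the 200 DPI and 1333px values are assumptions chosen from within the ranges mentioned.

```python
def render_size_px(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def resize_to_long_side(width: int, height: int, long_side: int) -> tuple[int, int, float]:
    """Scale (width, height) so the longer side equals `long_side`, keeping aspect ratio.

    Returns the new width, new height, and the scale factor that was applied.
    """
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale), scale


# Assumed example: a US Letter page (8.5 x 11 in) rendered at 200 DPI,
# then resized so the long side matches a 1333px detector input.
page_w, page_h = render_size_px(8.5, 11.0, dpi=200)
new_w, new_h, scale = resize_to_long_side(page_w, page_h, long_side=1333)
print(page_w, page_h, new_w, new_h)
```

Keeping the scale factor around is useful later: predicted boxes live in resized-image coordinates and need to be divided by the same factor to land back on the rendered page.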
Object Detection
A detection model (DINO-DETR, Faster R-CNN, YOLO) trained on document layout data predicts bounding boxes and class labels for each layout element. Classes include: paragraph, table, figure, heading, list, caption, header, footer, footnote, page number.
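Whatever detector is used, its output can be treated as a list of labeled, scored boxes. The sketch below assumes a hypothetical `Detection` record and two hypothetical helpers (`keep_confident`, `to_page_coords`); it is not the output format of any specific library, and the 0.5 confidence threshold is an arbitrary illustrative choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One predicted layout element (hypothetical record, not a library type)."""
    label: str                               # e.g. "paragraph", "table", "figure"
    score: float                             # detector confidence in [0, 1]
    box: tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels


def keep_confident(dets: list[Detection], threshold: float = 0.5) -> list[Detection]:
    """Drop predictions whose confidence is below `threshold`."""
    return [d for d in dets if d.score >= threshold]


def to_page_coords(det: Detection, scale: float) -> Detection:
    """Map a box from resized-image pixels back to the original page image.

    `scale` is the factor the page was multiplied by before detection.
    """
    x0, y0, x1, y1 = det.box
    return Detection(det.label, det.score, (x0 / scale, y0 / scale, x1 / scale, y1 / scale))


# Made-up predictions on a page that was downscaled by 0.5 before detection.
raw = [
    Detection("table", 0.92, (10.0, 20.0, 110.0, 220.0)),
    Detection("footnote", 0.31, (10.0, 400.0, 200.0, 420.0)),
    Detection("heading", 0.77, (10.0, 5.0, 200.0, 15.0)),
]
kept = [to_page_coords(d, scale=0.5) for d in keep_confident(raw)]
```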
Reading Order Prediction
Detected regions are sorted into reading order — left-to-right, top-to-bottom for Western documents, with column detection for multi-column layouts. This is non-trivial for complex layouts with sidebars, callouts, and floating figures.
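One simple heuristic for left-to-right, top-to-bottom documents is sketched below, under stated assumptions: regions wider than a fraction of the page (`span_ratio`, an assumed parameter) are treated as full-width separators, and the regions between them are grouped into columns by horizontal position. It is a rule-based illustration only and will not handle sidebars, callouts, or floating figures correctly.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def order_columns(indices: list[int], boxes: list[Box]) -> list[int]:
    """Group regions into columns by horizontal position, then read each column top-down."""
    columns: list[dict] = []
    for i in sorted(indices, key=lambda i: boxes[i][0]):
        x0, _, x1, _ = boxes[i]
        center = (x0 + x1) / 2
        for col in columns:
            if col["x0"] <= center <= col["x1"]:
                col["items"].append(i)
                col["x0"], col["x1"] = min(col["x0"], x0), max(col["x1"], x1)
                break
        else:
            # No existing column contains this region's center: start a new one.
            columns.append({"x0": x0, "x1": x1, "items": [i]})
    ordered: list[int] = []
    for col in columns:  # columns were created in left-to-right order
        ordered.extend(sorted(col["items"], key=lambda i: boxes[i][1]))
    return ordered


def reading_order(boxes: list[Box], page_width: float, span_ratio: float = 0.6) -> list[int]:
    """Return indices of `boxes` in an approximate reading order."""
    order: list[int] = []
    band: list[int] = []  # regions seen since the last full-width region
    for i in sorted(range(len(boxes)), key=lambda i: boxes[i][1]):
        x0, _, x1, _ = boxes[i]
        if (x1 - x0) >= span_ratio * page_width:
            # A full-width region closes the current band of columns.
            order.extend(order_columns(band, boxes))
            band = []
            order.append(i)
        else:
            band.append(i)
    order.extend(order_columns(band, boxes))
    return order


# Made-up two-column page: title, two columns of two paragraphs each, footer.
boxes = [
    (50, 40, 550, 80),     # 0: full-width title
    (310, 100, 550, 300),  # 1: right column, top
    (50, 100, 290, 250),   # 2: left column, top
    (50, 260, 290, 500),   # 3: left column, bottom
    (310, 310, 550, 500),  # 4: right column, bottom
    (50, 520, 550, 560),   # 5: full-width footer
]
order = reading_order(boxes, page_width=600)
```

For this page the left column is read in full before the right one, which a plain sort by vertical position would get wrong.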
Post-Processing
Overlapping detections are resolved, small fragments are merged, and layout hierarchy is constructed (e.g., table cells belong to tables, section headers scope subsequent paragraphs). The output is a structured layout tree.
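Two of these steps can be sketched compactly: resolving overlaps with greedy IoU-based suppression, and scoping regions under the nearest preceding heading. This is an illustrative sketch under assumptions: detections are plain `(label, score, box)` tuples, the 0.5 IoU threshold is arbitrary, and the `"heading"` label and dictionary-based tree are hypothetical choices rather than the output of a specific tool.

```python
Box = tuple[float, float, float, float]   # (x0, y0, x1, y1)
Det = tuple[str, float, Box]              # (label, score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(dets: list[Det], iou_threshold: float = 0.5) -> list[Det]:
    """Greedy suppression: keep the highest-scoring box among heavily overlapping ones."""
    kept: list[Det] = []
    for det in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(det[2], k[2]) < iou_threshold for k in kept):
            kept.append(det)
    return kept


def group_by_heading(ordered: list[Det]) -> list[dict]:
    """Build a shallow tree: each heading scopes the regions that follow it."""
    tree: list[dict] = []
    for det in ordered:
        if det[0] == "heading":
            tree.append({"heading": det, "children": []})
        else:
            if not tree:  # content before the first heading
                tree.append({"heading": None, "children": []})
            tree[-1]["children"].append(det)
    return tree


# Made-up single-column page with one duplicated paragraph detection.
dets = [
    ("heading", 0.90, (50, 40, 550, 80)),
    ("paragraph", 0.95, (50, 100, 550, 300)),
    ("paragraph", 0.40, (55, 105, 545, 295)),  # near-duplicate of the box above
    ("table", 0.80, (50, 320, 550, 500)),
]
kept = suppress_overlaps(dets)
# Single column, so top-to-bottom order stands in for reading order here.
tree = group_by_heading(sorted(kept, key=lambda d: d[2][1]))
```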
Evaluation
COCO-style mAP, averaged over IoU thresholds from 0.50 to 0.95, is the standard metric on PubLayNet (5 classes) and DocLayNet (11 classes); AP at IoU 0.50 is sometimes reported alongside it. Per-class AP reveals that tables and figures are easier to detect than footnotes and captions. Reading order accuracy is evaluated separately.
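To make the metric concrete, the sketch below computes average precision for one class on one page at a single IoU threshold, then averages it over thresholds from 0.50 to 0.95. It is a simplified illustration: it uses all-point interpolation rather than the exact sampling of the official COCO tooling, and the function names and example boxes are made up.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: list[tuple[float, Box]], gts: list[Box], iou_thr: float) -> float:
    """AP for one class: `preds` are (score, box) pairs, `gts` are ground-truth boxes."""
    if not gts:
        return 0.0
    matched: set[int] = set()
    hits: list[int] = []  # 1 = true positive, 0 = false positive, in score order
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)  # each ground-truth box can be matched once
        hits.append(1 if best_j >= 0 else 0)

    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(gts))

    # Area under the precision-recall curve, with precision made non-increasing.
    ap, prev_recall = 0.0, 0.0
    for k in range(len(hits)):
        ap += (recalls[k] - prev_recall) * max(precisions[k:])
        prev_recall = recalls[k]
    return ap


gts = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [
    (0.9, (0, 0, 100, 100)),      # exact match
    (0.8, (500, 500, 600, 600)),  # false positive
    (0.7, (200, 200, 300, 288)),  # loose match, IoU 0.88
]
ap50 = average_precision(preds, gts, iou_thr=0.5)
thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
ap_avg = sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
```

The loose match counts as correct at IoU 0.50 but not at 0.90 or 0.95, so the threshold-averaged score is lower than `ap50`. A full mAP would additionally average this value over classes and pool predictions across all pages.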
Current Landscape
Document layout analysis in 2025 is being absorbed into end-to-end document parsing pipelines. Standalone layout detection (PubLayNet, DocLayNet benchmarks) still matters for research, but practitioners increasingly use integrated tools (Docling, Marker, Surya) that handle layout → OCR → extraction in one pass. Transformer-based detectors (DINO-DETR) dominate accuracy, while YOLO variants serve the speed-sensitive segment. The biggest disruption is large VLMs that understand document layout implicitly — why detect layout regions separately when GPT-4o or Qwen2-VL can directly answer questions about a document image?
Key Challenges
Layout diversity — documents range from simple single-column text to complex multi-column layouts with nested tables, figures spanning columns, and margin annotations
Small and overlapping elements — footnotes, page numbers, and margin notes are small and frequently confused with body text
Multi-language and multi-script — Arabic (right-to-left), CJK (vertical possible), and mixed-script documents require layout models to handle diverse reading orders
Degraded documents — historical documents, faxes, and poorly scanned pages have noise, bleed-through, and damaged regions that confuse detectors
Reading order — detecting layout regions is only half the problem; determining the correct reading order for complex multi-column layouts is equally critical and under-studied
Quick Recommendations
Best accuracy
DINO-DETR (Swin-L backbone) fine-tuned on DocLayNet + PubLayNet
82%+ mAP on DocLayNet; transformer attention handles complex layouts well
Real-time / production
DocLayout-YOLO or RT-DETR for documents
70%+ mAP at 50+ FPS; practical for high-throughput document processing pipelines
Integrated pipeline
Docling (IBM) or Marker
End-to-end PDF/image → structured content; handles layout detection, OCR, table extraction, and reading order together
Zero-shot / no training
Florence-2-Large or Qwen2-VL
Visual grounding capability detects layout elements from text descriptions without document-specific training
Historical documents
Kraken OCR or eScriptorium
Specifically designed for degraded and historical documents with specialized layout segmentation
What's Next
Layout analysis is converging into document foundation models that don't need explicit layout detection as a separate step. End-to-end models will learn layout understanding as an implicit capability. Active research: multi-page layout understanding (headers/footers that repeat across pages), 3D document reconstruction (unfolding curved pages from phone captures), and real-time mobile document analysis for accessibility applications.
Benchmarks & SOTA
State of the art per dataset (datasets from Papers With Code), listed as dataset: model, score (metric).
publaynet-val: DETR, 0.981 (table)
document-layout-recognition-challenge-test: fglihai, 0.970 (figure)
document-layout-recognition-challenge-mini-dev: fglihai, 1 (table)
u-diads-bib: CV-Group, 83.4 (class-average-iou)
d4la: DoPTA, 70.72 (map)