Codesota · OCR · Task: Layout
Task Brief · page → regions

Before a model can read a page, it has to know what order to read it in.

Layout analysis decomposes a page into regions — title, paragraph, list, table, figure, caption, header, footer — with bounding boxes and a reading order. It is the task that sits upstream of OCR, and the one that quietly determines whether the text you extract makes sense.

[Figure: example page with regions numbered in reading order: 1 title, 2 text (left column), 3 figure, 4 table, 5 text (right column)]
Output is not text — it is typed regions, boxes, and a sequence. The numbers are the product.
Why this is not just OCR

Plain OCR emits a flat character stream. On a two-column research paper, a reader that ignores layout splices the left and right columns line by line — “We propose a method  the dataset contains  that improves  120k pages …” — producing text that is individually correct and collectively meaningless. Layout analysis is what prevents that. Get the order wrong and every downstream step inherits the error.
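The splice can be reproduced on a toy page. The sketch below is illustrative only: the two column lists are made-up lines built around the example quoted above, and both function names are hypothetical.

```python
# Two columns of a toy page, each a list of text lines from top to bottom.
# The contents are invented for illustration.
left_col = ["We propose a method", "that improves", "reading order."]
right_col = ["the dataset contains", "120k pages", "of scanned reports."]


def read_ignoring_layout(left, right):
    """Layout-blind read: take line i of each column in turn, row by row."""
    out = []
    for l_line, r_line in zip(left, right):
        out.extend([l_line, r_line])
    return " ".join(out)


def read_with_layout(left, right):
    """Layout-aware read: finish the left column before starting the right."""
    return " ".join(left + right)


spliced = read_ignoring_layout(left_col, right_col)
ordered = read_with_layout(left_col, right_col)
```

Every word in `spliced` is recognised correctly; only the sequence is wrong, which is why character-level OCR accuracy alone does not catch this failure.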

§ 01 · The Output Contract

Three things, not one.

1 · Regions

A bounding box for every block. Detection-style output: [x0,y0,x1,y1] per element.

2 · Classes

A type per region — Title, Text, List, Table, Figure, Caption, Page-header, Page-footer. DocLayNet uses 11 labels.

3 · Reading order

The sequence regions should be consumed in. The hardest part, and the one most benchmarks under-measure.
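One way to hold that three-part contract in code is a small record type. This is a minimal sketch, not a standard schema: the `Region` class, its field names, and `in_reading_order` are assumptions made for illustration. The label tuple lists the 11 DocLayNet classes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The 11 DocLayNet class labels.
DOCLAYNET_LABELS = (
    "Caption", "Footnote", "Formula", "List-item", "Page-footer", "Page-header",
    "Picture", "Section-header", "Table", "Text", "Title",
)


@dataclass(frozen=True)
class Region:
    """Hypothetical container for one layout region."""
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1
    category: str                            # one of DOCLAYNET_LABELS
    order: int                               # 0-based position in reading order

    def __post_init__(self):
        x0, y0, x1, y1 = self.bbox
        if not (x0 < x1 and y0 < y1):
            raise ValueError(f"degenerate box: {self.bbox}")
        if self.category not in DOCLAYNET_LABELS:
            raise ValueError(f"unknown category: {self.category}")


def in_reading_order(regions: List[Region]) -> List[Region]:
    """Return regions sorted by their reading-order index."""
    return sorted(regions, key=lambda r: r.order)
```

Keeping `order` as an explicit field, rather than relying on list position, makes it harder to lose the sequence when regions are filtered or regrouped later.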

§ 02 · Benchmarks & Metrics

mAP and layout F1 on PubLayNet and DocLayNet.

Layout is scored like an object-detection problem. mAP (mean average precision, averaged COCO-style over IoU thresholds from 0.50 to 0.95) measures how well predicted boxes match ground-truth regions; layout F1 is the harmonic mean of precision and recall, reported per class. PubLayNet (~360k clean scientific pages) is near-saturated; DocLayNet (80,863 manually annotated pages across finance, law, manuals, patents) is the harder, more representative test.
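Both metrics rest on intersection-over-union between a predicted and a ground-truth box. The sketch below shows IoU and a single-class F1 at one fixed threshold. It is a simplification under stated assumptions: real mAP also ranks predictions by confidence and averages over several thresholds, and the greedy matching used here is an illustrative choice, not the official evaluation code of either benchmark.

```python
def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds, truths, threshold=0.5):
    """Precision, recall and F1 for one class, using greedy one-to-one matching."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        # Pick the still-unmatched ground-truth box that overlaps p the most.
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, one correct box plus one spurious box against a single ground-truth region gives precision 0.5 and recall 1.0, so F1 is about 0.67.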

PubLayNet                      mAP
Hybrid DETR (Shehzadi '24)     97.3
RoDLA (CVPR '24)               96.0
DiT-large                      94.5

DocLayNet                      mAP
DiT-large                      79.5
LayoutLMv3                     76.8
YOLOv8                         73.2

Sources: Shehzadi et al. arXiv:2404.17888; RoDLA (CVPR 2024) arXiv:2403.14442; DocLayNet (IBM). The ~18-point gap between the best PubLayNet score (97.3) and the best DocLayNet score (79.5) is the cost of leaving clean scientific PDFs for real-world documents.

§ 03 · Two Paradigms

Detector first, or VLM that emits layout.

A · Dedicated detector

A YOLO / DETR / DiT model trained only to find and classify regions. Fast, cheap, runs on CPU, gives clean boxes. You then crop each region and route it to the right reader (text → OCR, table → table model). Reading order is computed separately.

Best when: high volume, need explicit boxes, want a modular pipeline.
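The crop-and-route step of a modular pipeline can be sketched as a dispatch table keyed on category. Everything named here is hypothetical: `route_regions`, the stub readers, and the `"content"` key stand in for whatever OCR and table models you actually run. A real pipeline would crop the page image to each box before calling the reader; the stubs below only receive the box.

```python
def ocr_stub(bbox):
    """Placeholder for a text reader; a real one would OCR the cropped region."""
    return f"<text from {bbox}>"


def table_stub(bbox):
    """Placeholder for a table-structure model."""
    return f"<cells from {bbox}>"


def route_regions(regions, readers, fallback):
    """Send each region to the reader registered for its category."""
    results = []
    for region in regions:
        reader = readers.get(region["category"], fallback)
        results.append({**region, "content": reader(region["bbox"])})
    return results


# Assumed routing table: tables go to the table model, everything else to OCR.
READERS = {"Table": table_stub}

routed = route_regions(
    [
        {"bbox": [50, 100, 480, 600], "category": "Text"},
        {"bbox": [520, 100, 950, 400], "category": "Table"},
    ],
    READERS,
    fallback=ocr_stub,
)
```

Because the routing key is the predicted class, a misclassified region is silently sent to the wrong reader, so it is worth logging the category alongside each result.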

B · VLM that emits layout

Document VLMs like dots.ocr and PaddleOCR-VL emit categories, bounding boxes and text in one pass, with reading order baked in. Fewer moving parts; the layout and the read stay consistent because one model produced both.

Best when: you want one model for read + layout, GPU available.
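With a VLM, the integration work shifts from ordering boxes to parsing the model's output. The sketch below assumes the model returns a JSON array of objects with `bbox`, `category` and `text` keys, already in reading order. That schema is an assumption for illustration; check the output format of the model you actually use. `parse_vlm_layout` is a hypothetical helper.

```python
import json


def parse_vlm_layout(raw: str):
    """Parse an assumed JSON layout dump into region dicts with an explicit order."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of regions")
    regions = []
    for order, item in enumerate(items):
        bbox = item["bbox"]
        if len(bbox) != 4:
            raise ValueError(f"region {order}: bbox must have 4 values")
        regions.append({
            "bbox": [float(v) for v in bbox],   # x0, y0, x1, y1
            "category": item["category"],
            "text": item.get("text", ""),       # figures may carry no text
            "order": order,                     # array position = reading order
        })
    return regions


# Made-up sample output in the assumed schema.
sample = json.dumps([
    {"bbox": [50, 20, 950, 80], "category": "Title", "text": "A Toy Paper"},
    {"bbox": [50, 100, 480, 600], "category": "Text", "text": "We propose a method"},
    {"bbox": [520, 100, 950, 400], "category": "Picture"},
])
parsed = parse_vlm_layout(sample)
```

Validating the box length and storing `order` explicitly gives you the same contract as the detector path, so downstream code does not need to know which paradigm produced the regions.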

Detector + explicit reading order

from huggingface_hub import hf_hub_download
from ultralytics import YOLO

# A layout detector returns regions + classes, NOT text.
model = YOLO(hf_hub_download("DILHTWD/yolov8-doclaynet", "yolov8-doclaynet.pt"))

def detect_regions(image_path: str):
    result = model(image_path)[0]
    regions = []
    for box, cls, conf in zip(result.boxes.xyxy, result.boxes.cls, result.boxes.conf):
        regions.append({
            "bbox": [round(v) for v in box.tolist()],   # x0,y0,x1,y1
            "category": result.names[int(cls)],          # Title, Text, Table, ...
            "confidence": round(float(conf), 3),
        })
    # Reading order is your job: sort by column, then top-to-bottom.
    return order_regions(regions)


def order_regions(regions, page_width_mid=None):
    """Naive two-column reading order: split by x-midpoint, then top-to-bottom."""
    if not regions:
        return []  # nothing detected; min()/max() below would fail on an empty list

    def center_x(r):
        return (r["bbox"][0] + r["bbox"][2]) / 2

    if page_width_mid is None:
        # Estimated from region centres; pass the true page midpoint when known.
        xs = [center_x(r) for r in regions]
        page_width_mid = (min(xs) + max(xs)) / 2
    left = sorted([r for r in regions if center_x(r) < page_width_mid],
                  key=lambda r: r["bbox"][1])
    right = sorted([r for r in regions if center_x(r) >= page_width_mid],
                   key=lambda r: r["bbox"][1])
    return left + right  # read the whole left column, then the right
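
The midpoint split above puts a full-width title, whose centre sits on the midpoint, into the right column, so it is read after the whole left column. One possible refinement, sketched here under assumptions, treats any region wider than a chosen share of the page as a break between horizontal bands. The function name, the `page_width` argument and the `span_ratio` default of 0.6 are illustrative choices, not part of the snippet above.

```python
def order_with_full_width(regions, page_width, span_ratio=0.6):
    """Two-column reading order that reads full-width blocks in place."""
    mid = page_width / 2
    ordered, left, right = [], [], []

    def flush():
        # Emit the current band: whole left column, then whole right column.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    # Walk regions top to bottom, so each column list stays sorted by y.
    for r in sorted(regions, key=lambda r: r["bbox"][1]):
        x0, _, x1, _ = r["bbox"]
        if (x1 - x0) >= span_ratio * page_width:
            flush()              # finish the columns above the wide block
            ordered.append(r)    # then read the wide block itself
        elif (x0 + x1) / 2 < mid:
            left.append(r)
        else:
            right.append(r)
    flush()
    return ordered


page = [
    {"bbox": [520, 100, 950, 600], "category": "Text"},     # right column
    {"bbox": [50, 620, 480, 900], "category": "Picture"},   # left column, lower
    {"bbox": [50, 20, 950, 80], "category": "Title"},       # full width
    {"bbox": [50, 100, 480, 600], "category": "Text"},      # left column
]
reading = order_with_full_width(page, page_width=1000)
```

This still assumes at most two columns and axis-aligned boxes; pages with three columns, sidebars or rotated text need a more general method.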
§ 04 · The Cascade

Layout errors don't stay layout errors.

  • A missed column boundary scrambles every line of reading on that page.
  • A table mis-labelled as Text never reaches the table extractor, so its grid is lost.
  • A header/footer not separated leaks page numbers and running titles into the body and into your embeddings.
  • A caption attached to the wrong figure misleads any downstream document QA.

§ 06 · Adjacent Tasks

Layout feeds everything downstream.

Task · Read
image / PDF → text
The step that consumes your reading order. Right boxes, wrong order = wrong text.

Task · Tables
table image → cells
A region the layout step must isolate before structure recovery can begin.