
How Docling Works

The architecture, design decisions, and trade-offs behind IBM's document AI toolkit.

The Key Insight

Traditional OCR works character-by-character, recognizing individual letters and assembling them into words. This is slow and error-prone.

Docling takes a different approach: treat the document as an image and use computer vision to understand its structure directly. This sidesteps OCR entirely for native PDFs and dramatically improves accuracy for scanned documents.

Two Processing Pipelines

1. Standard Pipeline (Default)

Uses specialized models for each task. Fast and efficient for most documents.

PDF Parser → Layout Model → TableFormer → Reading Order → Export

Best for: Native PDFs, well-structured documents, high-volume processing
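
The stage sequence above can be pictured as a chain of functions, each enriching the output of the previous one. The sketch below is purely illustrative: the stage names and document structure are hypothetical stand-ins, not Docling's actual API.

```python
from functools import reduce

# Illustrative stand-ins for the standard pipeline's stages.
# Each stage takes the document state and returns an enriched copy.
def parse_pdf(doc):
    return {**doc, "text_cells": ["cell1", "cell2"]}

def detect_layout(doc):
    return {**doc, "layout": ["text-block", "table"]}

def recognize_tables(doc):
    return {**doc, "tables": [{"rows": 2, "cols": 3}]}

def order_reading(doc):
    return {**doc, "reading_order": [0, 1]}

def export(doc):
    return f"exported {len(doc['text_cells'])} cells, {len(doc['tables'])} table(s)"

PIPELINE = [parse_pdf, detect_layout, recognize_tables, order_reading, export]

# Run the document through every stage in order.
result = reduce(lambda doc, stage: stage(doc), PIPELINE, {"path": "report.pdf"})
print(result)  # exported 2 cells, 1 table(s)
```

Because each stage is a specialized model with a narrow job, stages can be swapped or skipped independently, which is what makes this pipeline fast for high-volume processing.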

2. VLM Pipeline (SmolDocling/Granite-Docling)

End-to-end vision-language model. Handles complex layouts in a single pass.

Image → VLM (256M params) → DocTags → Export

Best for: Scanned documents, unusual layouts, formulas, handwriting

Why Not Just Use OCR?

OCR (Optical Character Recognition) has been the standard approach for decades. It works by:

  1. Converting the image to grayscale
  2. Detecting individual characters
  3. Recognizing each character against a trained alphabet
  4. Assembling characters into words based on spacing

This process is inherently sequential and error-prone: each character recognition can introduce an error, and those errors compound across a page.
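
The compounding effect is easy to quantify. Assuming a hypothetical 99% per-character accuracy and a 2,000-character page (illustrative figures, not benchmarks), the chance of a fully clean page is vanishingly small:

```python
# Hypothetical per-character recognition accuracy, for illustration only.
char_accuracy = 0.99
chars_per_page = 2000

# Probability that every character on the page is recognized correctly.
clean_page_prob = char_accuracy ** chars_per_page
print(f"{clean_page_prob:.2e}")  # 1.86e-09

# Expected number of character errors on the page.
expected_errors = chars_per_page * (1 - char_accuracy)
print(f"{expected_errors:.1f}")  # 20.0
```

Even at 99% per-character accuracy, a typical page ends up with about twenty errors, which is why page-level approaches that avoid per-character recognition can be more robust.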

The Docling Alternative

For native PDFs, text is already embedded in the file. Docling extracts it directly without any recognition step. For scanned PDFs, the VLM "sees" the entire page at once, understanding structure and text together.

IBM Research reports this approach is 30x faster and more accurate than traditional OCR for document understanding tasks.

The Models Behind Docling

DocLayNet Layout Model

Trained on ~81,000 manually labeled document pages. Uses object detection to identify:

  • Text blocks
  • Headers
  • Tables
  • Figures
  • Lists
  • Captions
  • Code
  • Formulas

Achieves within 5 percentage points of human accuracy on element classification.
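
Conceptually, a layout model's output is a list of detected regions, each with an element class, a bounding box, and a confidence score. The structure below is an illustrative sketch, not Docling's actual data model:

```python
# Illustrative detections: element class, bounding box
# (x0, y0, x1, y1 in page coordinates), and confidence score.
detections = [
    {"label": "section-header", "bbox": (72, 60, 540, 90), "score": 0.98},
    {"label": "text", "bbox": (72, 100, 540, 300), "score": 0.95},
    {"label": "table", "bbox": (72, 320, 540, 560), "score": 0.91},
    {"label": "caption", "bbox": (72, 570, 540, 590), "score": 0.40},
]

def confident(dets, threshold=0.5):
    """Keep only detections above a confidence threshold."""
    return [d for d in dets if d["score"] >= threshold]

kept = confident(detections)
print([d["label"] for d in kept])  # ['section-header', 'text', 'table']
```

Downstream stages (table recognition, reading order) then operate on these typed, positioned regions rather than on raw pixels.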

TableFormer

Specialized model for table structure recognition. Handles:

  • Column and row boundaries
  • Merged cells (colspan, rowspan)
  • Header rows and columns
  • Cell content assignment
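
What "merged cells" means in practice: the recognized structure has to be expanded into a rectangular grid in which a cell with a rowspan or colspan covers several grid positions. A minimal sketch of that expansion (the cell format here is hypothetical, not TableFormer's actual output):

```python
def expand_to_grid(cells, n_rows, n_cols):
    """Expand cells with rowspan/colspan into a rectangular grid.

    Each cell is (row, col, rowspan, colspan, text); the cell's text is
    written into every grid position it covers.
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid

# A 2x3 table whose first cell spans both rows.
cells = [
    (0, 0, 2, 1, "Region"),  # merged down the first column
    (0, 1, 1, 1, "Q1"),
    (0, 2, 1, 1, "Q2"),
    (1, 1, 1, 1, "12"),
    (1, 2, 1, 1, "15"),
]
grid = expand_to_grid(cells, n_rows=2, n_cols=3)
print(grid)  # [['Region', 'Q1', 'Q2'], ['Region', '12', '15']]
```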

SmolDocling / Granite-Docling VLM

A 256M parameter vision-language model that processes entire pages end-to-end.

Architecture:

  • Vision Encoder: SigLIP (93M params)
  • Language Model: SmolLM-2 (135M params)
  • Connector: Pixel shuffle projector

Training Data:

  • SynthCodeNet: 9.33M samples
  • DoclingMatix: 1.27M samples
  • SynthChartNet: 1.98M samples

The DocTags Format

Docling introduces DocTags, a markup format that captures document structure with spatial information. Unlike Markdown, DocTags preserves:

  • Bounding boxes for every element
  • Reading order relationships
  • Element type classification
  • Page and position metadata

This enables lossless round-trip conversion. You can convert to DocTags, modify it, and convert back without losing structural information.
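
A lossless round trip means parsing and re-serializing reproduces the input exactly. The toy markup below illustrates the idea with a made-up tag syntax that keeps element type and bounding box together; the real DocTags vocabulary and serialization differ.

```python
import re

def parse(tagged):
    """Parse '<type x0 y0 x1 y1>text</type>' spans into element dicts.

    Illustrative only: real DocTags uses its own tag vocabulary.
    """
    pattern = r"<(\w+) (\d+) (\d+) (\d+) (\d+)>(.*?)</\1>"
    return [
        {"type": t, "bbox": (int(x0), int(y0), int(x1), int(y1)), "text": text}
        for t, x0, y0, x1, y1, text in re.findall(pattern, tagged)
    ]

def serialize(elements):
    """Write element dicts back out in the same toy markup."""
    return "".join(
        "<{t} {b[0]} {b[1]} {b[2]} {b[3]}>{x}</{t}>".format(
            t=e["type"], b=e["bbox"], x=e["text"]
        )
        for e in elements
    )

doc = (
    "<title 72 60 540 90>How Docling Works</title>"
    "<text 72 100 540 300>Two pipelines.</text>"
)
assert serialize(parse(doc)) == doc  # lossless round trip
```

Because nothing (type, position, text) is discarded during parsing, any edit made on the parsed elements can be serialized back without structural loss, which plain Markdown cannot guarantee.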

When to Use What

| Scenario | Recommendation | Why |
|---|---|---|
| Native PDF, simple layout | Standard pipeline | Fastest, text already embedded |
| Scanned document | VLM pipeline | Better accuracy than OCR |
| Complex tables | Standard + TableFormer | Specialized model excels |
| Math formulas | VLM pipeline | Native LaTeX output |
| High-volume batch | Standard pipeline | Lower compute cost |
| Unusual layout | VLM pipeline | End-to-end understanding |
| Privacy-sensitive | Either (both run locally) | No cloud dependency |
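
The guidance above can be encoded as a simple selection rule. This is a sketch of the decision logic only, not a Docling API:

```python
def choose_pipeline(scanned: bool, has_formulas: bool = False,
                    unusual_layout: bool = False, high_volume: bool = False) -> str:
    """Pick a pipeline following the 'When to Use What' guidance."""
    if scanned or has_formulas or unusual_layout:
        return "vlm"       # end-to-end understanding wins on hard inputs
    return "standard"      # native PDF: text is already embedded, lower cost

print(choose_pipeline(scanned=False, high_volume=True))   # standard
print(choose_pipeline(scanned=True))                      # vlm
print(choose_pipeline(scanned=False, has_formulas=True))  # vlm
```

Note that privacy does not enter the decision: both pipelines run locally, so neither forces a cloud dependency.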

Trade-offs to Consider

Advantages

  • Much faster than traditional OCR
  • Native table structure preservation
  • Runs locally (privacy, no API costs)
  • Compact model (256M vs 7B+ params)
  • Open source (Apache 2.0)

Limitations

  • English-primary (multilingual support is experimental)
  • First run downloads models (~500MB)
  • GPU recommended for the VLM pipeline
  • Not optimized for handwriting (yet)
  • Newer, less battle-tested than Tesseract

Key Takeaways

  1. Docling uses computer vision instead of OCR - this is fundamentally different and faster.
  2. Two pipelines: Standard (fast, specialized models) vs VLM (end-to-end, complex docs).
  3. SmolDocling is a 256M VLM that competes with 7B+ models for document understanding.
  4. Choose your pipeline based on document type, not just "always use VLM".