How Docling Works
The architecture, design decisions, and trade-offs behind IBM's document AI toolkit.
The Key Insight
Traditional OCR works character-by-character, recognizing individual letters and assembling them into words. This is slow and error-prone.
Docling takes a different approach: treat the document as an image and use computer vision to understand its structure directly. This sidesteps OCR entirely for native PDFs and dramatically improves accuracy for scanned documents.
Two Processing Pipelines
1. Standard Pipeline (Default)
Uses specialized models for each task. Fast and efficient for most documents.
Best for: Native PDFs, well-structured documents, high-volume processing
2. VLM Pipeline (SmolDocling/Granite-Docling)
End-to-end vision-language model. Handles complex layouts in a single pass.
Best for: Scanned documents, unusual layouts, formulas, handwriting
Why Not Just Use OCR?
OCR (Optical Character Recognition) has been the standard approach for decades. It works by:
- Converting the image to grayscale
- Detecting individual characters
- Recognizing each character against a trained alphabet
- Assembling characters into words based on spacing
This process is inherently sequential, and each recognition step can introduce an error. Those errors compound across a page: even a high per-character accuracy leaves many pages with at least one mistake.
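The compounding effect is easy to quantify. If each character is recognized correctly with probability p and recognitions are independent, a page of n characters is error-free with probability p^n. A tiny sketch (the numbers are illustrative, not benchmarks of any specific OCR engine):

```python
def page_accuracy(per_char_accuracy: float, chars_per_page: int) -> float:
    """Probability that a page has zero recognition errors,
    assuming each character is recognized independently."""
    return per_char_accuracy ** chars_per_page

# Even 99.9% per-character accuracy collapses over a full page:
print(round(page_accuracy(0.999, 2000), 3))  # ~0.135: only ~13.5% of pages come out error-free
print(page_accuracy(1.0, 2000))              # 1.0: perfect recognition never compounds
```

This is why avoiding per-character recognition altogether, as described next, pays off so quickly.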
The Docling Alternative
For native PDFs, text is already embedded in the file. Docling extracts it directly without any recognition step. For scanned PDFs, the VLM "sees" the entire page at once, understanding structure and text together.
IBM Research reports this approach is 30x faster and more accurate than traditional OCR for document understanding tasks.
The Models Behind Docling
DocLayNet Layout Model
Trained on ~81,000 manually labeled document pages. Uses object detection to identify:
- Text blocks
- Headers
- Tables
- Figures
- Lists
- Captions
- Code
- Formulas
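Conceptually, an object-detection layout model emits one record per element: a class label from the taxonomy above, a bounding box, and a confidence score. A minimal sketch of that output shape (the types and the `keep_confident` post-processing step are hypothetical illustrations, not Docling's actual classes):

```python
from dataclasses import dataclass

# Element classes from the DocLayNet taxonomy listed above.
LAYOUT_CLASSES = {"text", "header", "table", "figure", "list", "caption", "code", "formula"}

@dataclass
class LayoutElement:
    """One detection: an element class plus its bounding box on the page.
    (Hypothetical structure for illustration, not Docling's actual types.)"""
    label: str
    bbox: tuple       # (x0, y0, x1, y1) in page coordinates
    confidence: float

def keep_confident(detections, threshold=0.5):
    """Discard low-confidence detections, as a detector's post-process would."""
    return [d for d in detections if d.confidence >= threshold and d.label in LAYOUT_CLASSES]

page = [
    LayoutElement("header", (72, 40, 540, 70), 0.97),
    LayoutElement("table",  (72, 100, 540, 380), 0.91),
    LayoutElement("text",   (72, 400, 540, 700), 0.35),  # too uncertain to keep
]
print([d.label for d in keep_confident(page)])  # ['header', 'table']
```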
Achieves within 5 percentage points of human accuracy on element classification.
TableFormer
Specialized model for table structure recognition. Handles:
- Column and row boundaries
- Merged cells (colspan, rowspan)
- Header rows and columns
- Cell content assignment
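The hard part of that list is merged cells: a structure model must map each spanned cell onto every grid position it covers. A sketch of that expansion step, with illustrative types (not Docling's actual output classes):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """One table cell with spans, the structure a model like TableFormer
    must recover. (Illustrative types only, not Docling's API.)"""
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1

def to_grid(cells, n_rows, n_cols):
    """Expand spanned cells into a rectangular grid of text,
    writing a merged cell's text into every position it covers."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[r][k] = c.text
    return grid

cells = [
    Cell("Quarter", 0, 0, colspan=2),   # header merged across two columns
    Cell("Q1", 1, 0), Cell("Q2", 1, 1),
]
print(to_grid(cells, 2, 2))  # [['Quarter', 'Quarter'], ['Q1', 'Q2']]
```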
SmolDocling / Granite-Docling VLM
A 256M parameter vision-language model that processes entire pages end-to-end.
Architecture:
- Vision Encoder: SigLIP (93M params)
- Language Model: SmolLM-2 (135M params)
- Connector: Pixel shuffle projector
Training Data:
- SynthCodeNet: 9.33M samples
- DoclingMatix: 1.27M samples
- SynthChartNet: 1.98M samples
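The pixel shuffle connector exists to shrink the vision encoder's output before it reaches the language model: each r x r block of visual tokens is folded into a single, wider token, cutting the token count by r squared. A simplified sketch of that rearrangement (the real connector also applies a learned projection, which is omitted here):

```python
def pixel_shuffle(grid, r=2):
    """Merge each r x r block of visual tokens into one token by
    concatenating their feature vectors: an (h, w, d) token grid
    becomes (h/r, w/r, d*r*r), reducing token count by r*r.
    (Simplified sketch; a real connector also projects the result.)"""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

# A 4x4 grid of 3-dim tokens (16 tokens) becomes a 2x2 grid of 12-dim tokens (4 tokens).
tokens = [[[float(i * 4 + j)] * 3 for j in range(4)] for i in range(4)]
shuffled = pixel_shuffle(tokens, r=2)
print(len(shuffled) * len(shuffled[0]), len(shuffled[0][0]))  # 4 12
```

Fewer visual tokens means a shorter sequence for the 135M-parameter language model to attend over, which is part of how the model stays small and fast.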
The DocTags Format
Docling introduces DocTags, a markup format that captures document structure with spatial information. Unlike Markdown, DocTags preserves:
- Bounding boxes for every element
- Reading order relationships
- Element type classification
- Page and position metadata
This enables lossless round-trip conversion. You can convert to DocTags, modify it, and convert back without losing structural information.
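To make "structure plus spatial information" concrete, here is a toy serializer and parser for a DocTags-like format, showing why carrying the bounding box inside the markup makes the round trip lossless. The tag syntax below is a hypothetical simplification for illustration; real DocTags differs in detail:

```python
import re

def to_tags(elements):
    """Serialize elements as tags carrying type + bounding box.
    (Hypothetical syntax for illustration; real DocTags differs.)"""
    return "\n".join(
        f'<{e["type"]} bbox="{",".join(map(str, e["bbox"]))}">{e["text"]}</{e["type"]}>'
        for e in elements
    )

def from_tags(text):
    """Parse the tags back; nothing is lost in the round trip."""
    pattern = r'<(\w+) bbox="([\d,]+)">(.*?)</\1>'
    return [
        {"type": t, "bbox": tuple(int(v) for v in b.split(",")), "text": s}
        for t, b, s in re.findall(pattern, text)
    ]

doc = [
    {"type": "title", "bbox": (72, 40, 540, 80), "text": "How Docling Works"},
    {"type": "text", "bbox": (72, 100, 540, 160), "text": "The key insight..."},
]
assert from_tags(to_tags(doc)) == doc  # lossless round trip
```

Markdown, by contrast, would discard the bounding boxes at serialization time, so the reverse conversion could never recover them.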
When to Use What
| Scenario | Recommendation | Why |
|---|---|---|
| Native PDF, simple layout | Standard pipeline | Fastest, text already embedded |
| Scanned document | VLM pipeline | Better accuracy than OCR |
| Complex tables | Standard + TableFormer | Specialized model excels |
| Math formulas | VLM pipeline | Native LaTeX output |
| High volume batch | Standard pipeline | Lower compute cost |
| Unusual layout | VLM pipeline | End-to-end understanding |
| Privacy-sensitive | Either (both local) | No cloud dependency |
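The table's rules can be condensed into a small decision helper. This function and its flags are hypothetical, written only to mirror the recommendations above; it is not part of Docling's API:

```python
def choose_pipeline(scanned=False, complex_layout=False, has_formulas=False,
                    complex_tables=False, high_volume=False):
    """Pick a pipeline following the recommendations in the table above.
    (Hypothetical helper for illustration, not Docling's API.)"""
    if scanned or complex_layout or has_formulas:
        return "vlm"                    # end-to-end understanding wins
    if complex_tables:
        return "standard+tableformer"   # specialized table model excels
    return "standard"                   # fastest, lowest compute cost

print(choose_pipeline(scanned=True))         # vlm
print(choose_pipeline(complex_tables=True))  # standard+tableformer
print(choose_pipeline(high_volume=True))     # standard
```

Note the ordering: scanned pages and unusual layouts trump everything else, because there the standard pipeline has no embedded text to fall back on.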
Trade-offs to Consider
Advantages
- Much faster than traditional OCR
- Native table structure preservation
- Runs locally (privacy, no API costs)
- Compact model (256M vs 7B+ params)
- Open source (Apache 2.0)
Limitations
- English-primary (multilingual experimental)
- First run downloads models (~500MB)
- GPU recommended for VLM pipeline
- Not optimized for handwriting (yet)
- Newer, less battle-tested than Tesseract
Key Takeaways
1. Docling uses computer vision instead of OCR - this is fundamentally different and faster.
2. Two pipelines: Standard (fast, specialized models) vs VLM (end-to-end, complex docs).
3. SmolDocling is a 256M VLM that competes with 7B+ models for document understanding.
4. Choose your pipeline based on document type, not just "always use VLM".