Docling Tutorial: Convert PDF to Markdown
Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.
Verified Tutorial
All code in this tutorial was executed on December 17, 2025. Outputs shown are real results from processing the Docling paper (arxiv:2408.09869).
What You'll Learn
- 1. Install Docling and set up your environment
- 2. Convert a PDF to Markdown with 3 lines of code
- 3. Extract tables as structured data (CSV/DataFrame)
- 4. Use the VLM pipeline for complex documents
- 5. Batch process multiple documents
1. Installation
Create a new project directory and install Docling:
# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas
# Or use pip
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install docling pandas
Docling works on macOS, Linux, and Windows. Python 3.9-3.12 supported. First run downloads ~1GB of model weights.
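A quick way to confirm that the install succeeded is to query the installed package versions from Python. The following is a small sketch that uses only the standard library; `installed_version` is a hypothetical helper introduced here for illustration and is not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed in this step.
for name in ("docling", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", the active interpreter is probably not the one from the project's virtual environment.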
2. Your First Conversion
Convert a PDF to Markdown with three lines of code:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert a PDF file
result = converter.convert("document.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Console output:
2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.
Actual output (truncated):
## Docling Technical Report
## Version 1.0
Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...
AI4K Group, IBM Research Ruschlikon, Switzerland
## Abstract
This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...
## 1 Introduction
Converting PDF documents back into a machine-processable
format has been a major challenge for decades...
Full output: 33,201 characters from a 10-page PDF in 34.95 seconds
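Printing is fine for a first look, but in practice the Markdown usually needs to be saved to a file. Here is a minimal sketch that reuses the same `DocumentConverter` calls shown above. The helper names `save_markdown` and `convert_and_save`, and any file paths passed to them, are assumptions made for illustration and are not part of Docling.

```python
from pathlib import Path


def save_markdown(markdown: str, out_path: str) -> int:
    """Write Markdown text to out_path as UTF-8 and return its character count."""
    path = Path(out_path)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return len(markdown)


def convert_and_save(pdf_path: str, out_path: str) -> int:
    """Convert a PDF with Docling and save the Markdown export (hypothetical helper)."""
    # Imported lazily so save_markdown stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return save_markdown(result.document.export_to_markdown(), out_path)
```

Calling `convert_and_save("document.pdf", "output/document.md")` would then write the export to disk and return its length in characters, which is the same kind of figure as the character count reported above.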
Performance (from our test run)
Test: Docling paper (arxiv:2408.09869) on Apple Silicon with MPS acceleration
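The figures reported for this run (10 pages, 33,201 characters, 34.95 seconds) can be turned into rough throughput numbers. The short calculation below is only arithmetic on those reported values, and the variable names are illustrative. The log timestamps suggest the 34.95 seconds includes pipeline initialization, so later documents in the same session may behave differently.

```python
# Figures reported in this tutorial's test run.
pages = 10
characters = 33_201
seconds = 34.95

seconds_per_page = seconds / pages        # about 3.5 seconds per page
chars_per_second = characters / seconds   # about 950 characters per second

print(f"{seconds_per_page:.1f} s/page")
print(f"{chars_per_second:.0f} chars/s")
```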
Next Steps