Tutorial | Learning-oriented
Docling Tutorial: Convert PDF to Markdown
Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.
Time: 30 minutes | Level: Beginner | Prerequisites: Python 3.9+
Verified Tutorial
All code in this tutorial was executed on December 17, 2025. Outputs shown are real results from processing the Docling paper (arxiv:2408.09869).
What You'll Learn
1. Install Docling and set up your environment
2. Convert a PDF to Markdown with 3 lines of code
3. Extract tables as structured data (CSV/DataFrame)
4. Use the VLM pipeline for complex documents
5. Batch process multiple documents
1. Installation
Create a new project directory and install Docling:
# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas
# Or use pip
python -m venv .venv
source .venv/bin/activate  # macOS/Linux; on Windows run .venv\Scripts\activate
pip install docling pandas
Docling works on macOS, Linux, and Windows and supports Python 3.9-3.12. The first run downloads about 1 GB of model weights.
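Before moving on, it can help to confirm that the environment matches the requirements listed above. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and not part of Docling.

```python
import sys
from importlib import metadata


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is in the 3.9-3.12 range stated above."""
    return (3, 9) <= tuple(version_info[:2]) <= (3, 12)


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    for name in ("docling", "pandas"):
        print(name, "->", installed_version(name) or "not installed")
```

If either package is reported as not installed, the virtual environment is probably not activated in the current shell.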
2. Your First Conversion
Convert a PDF to Markdown with three lines of code:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert a PDF file
result = converter.convert("sample.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Console output:
2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.
Actual output (truncated):
## Docling Technical Report
## Version 1.0
Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...
AI4K Group, IBM Research Ruschlikon, Switzerland
## Abstract
This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...
## 1 Introduction
Converting PDF documents back into a machine-processable
format has been a major challenge for decades...
Full output: 33,201 characters from a 10-page PDF in 34.95 seconds.
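The conversion snippet above prints the Markdown to the console. To keep the result on disk instead, one option is to write it to a file named after the source PDF. This is a sketch under stated assumptions: the helper names (`markdown_path_for`, `save_markdown`, `convert_and_save`) and the `output` directory are hypothetical, and only `DocumentConverter`, `convert`, and `export_to_markdown` come from the tutorial's own code.

```python
from pathlib import Path


def markdown_path_for(pdf_path, out_dir="output"):
    """Map 'some/dir/report.pdf' to '<out_dir>/report.md'."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def save_markdown(markdown, pdf_path, out_dir="output"):
    """Write Markdown text to a file named after the source PDF; return the path."""
    target = markdown_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target


def convert_and_save(pdf_path, out_dir="output"):
    """Convert one PDF with Docling and save the Markdown export."""
    # Imported lazily so the helpers above also work where Docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return save_markdown(result.document.export_to_markdown(), pdf_path, out_dir)
```

Calling `convert_and_save("sample.pdf")` would then write `output/sample.md`, assuming the PDF exists in the working directory.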
Performance (from our test run)
- Conversion time: 34.95 s
- Pages processed: 10
- Tables extracted: 3
- Markdown output: 33 KB
Test: Docling paper (arxiv:2408.09869) on Apple Silicon with MPS acceleration
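To compare these totals with your own runs, it can be useful to normalize them per page. The sketch below derives per-page figures from the numbers reported above; the function name `throughput` is illustrative, and the results describe only this one document on this one machine.

```python
def throughput(seconds, pages, characters):
    """Derive per-page figures from a conversion's totals."""
    return {
        "seconds_per_page": seconds / pages,
        "pages_per_minute": pages / seconds * 60,
        "characters_per_page": characters / pages,
    }


# Totals taken from the test run reported above.
stats = throughput(seconds=34.95, pages=10, characters=33_201)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

For this run that works out to roughly 3.5 seconds per page, or about 17 pages per minute. The first run also includes model loading, so later conversions in the same process may be faster.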