Tutorial | Learning-oriented

Docling Tutorial: Convert PDF to Markdown

Learn to extract structured text, tables, and formulas from PDF documents using IBM Docling.

Time: 30 minutes | Level: Beginner | Prerequisites: Python 3.9+

Verified Tutorial

All code in this tutorial was executed on December 17, 2025. The outputs shown are real results from processing the Docling paper (arXiv:2408.09869).

What You'll Learn

1. Install Docling and set up your environment
2. Convert a PDF to Markdown with 3 lines of code
3. Extract tables as structured data (CSV/DataFrame)
4. Use the VLM pipeline for complex documents
5. Batch process multiple documents

1. Installation

Create a new project directory and install Docling:

# Create project with uv (recommended)
uv init docling-tutorial
cd docling-tutorial
uv add docling pandas

# Or use pip
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install docling pandas

Docling works on macOS, Linux, and Windows, and supports Python 3.9 through 3.12. The first run downloads about 1 GB of model weights.
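Before moving on, it can help to confirm that the interpreter and packages match the requirements above. The following is a minimal sketch that uses only the standard library; the helper names (`python_supported`, `installed_version`) are illustrative and not part of Docling, and the version range simply mirrors the one stated in this tutorial.

```python
import sys
from importlib import metadata

# Supported range as stated in this tutorial (3.9 through 3.12).
MIN_PYTHON = (3, 9)
MAX_PYTHON = (3, 12)


def python_supported(version=None):
    """Return True if (major, minor) falls inside the tutorial's stated range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on the interpreter and the two packages installed above.
print("Python in supported range:", python_supported())
for name in ("docling", "pandas"):
    print(name, "->", installed_version(name) or "not installed")
```

If either package is reported as not installed, re-run the install command inside the activated environment.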

2. Your First Conversion

Convert a PDF to Markdown with three lines of code:

from docling.document_converter import DocumentConverter

# Create converter instance
converter = DocumentConverter()

# Convert a PDF file (use your own path; the run logged below used sample.pdf)
result = converter.convert("document.pdf")

# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
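Printing is fine for a first look, but in practice you will usually want the Markdown in a file. The sketch below wraps the same three calls shown above and writes the result to disk; the function names and the output naming scheme (`sample.pdf` becomes `sample.md`) are assumptions made for illustration, not Docling conventions.

```python
from pathlib import Path


def markdown_path_for(pdf_path, out_dir="."):
    """Map e.g. 'papers/sample.pdf' to '<out_dir>/sample.md'."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def convert_to_markdown_file(pdf_path, out_dir="."):
    """Convert one PDF with Docling and write the Markdown into out_dir."""
    # Imported lazily so the path helper above works without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    markdown = result.document.export_to_markdown()

    out_path = markdown_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

Calling `convert_to_markdown_file("document.pdf", "output")` would then write `output/document.md` and return its path.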

Console output:

2025-12-17 00:10:30 - INFO - Initializing pipeline for StandardPdfPipeline
2025-12-17 00:10:40 - INFO - Auto OCR model selected ocrmac.
2025-12-17 00:10:40 - INFO - Accelerator device: 'mps'
2025-12-17 00:10:57 - INFO - Processing document sample.pdf
2025-12-17 00:11:04 - INFO - Finished converting document sample.pdf in 34.95 sec.
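Docling reports progress through Python's standard `logging` module, so INFO lines like these generally appear only when the calling script configures logging. The sketch below shows one way to do that. The format string is an assumption inferred from the layout of the lines above (timestamp, level, message), not a setting confirmed by this tutorial, and `enable_info_logging` is a hypothetical helper name.

```python
import logging

# Assumed layout, inferred from the console output: "<timestamp> - <LEVEL> - <message>"
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def enable_info_logging(stream=None):
    """Attach an INFO-level handler to the root logger and return it.

    With stream=None the handler writes to stderr, the logging default.
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    return handler
```

Call `enable_info_logging()` once, before creating the converter, so that messages emitted during pipeline initialization are captured as well.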

Actual output (truncated):

## Docling Technical Report

## Version 1.0

Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi ...

AI4K Group, IBM Research Ruschlikon, Switzerland

## Abstract

This technical report introduces Docling, an easy to use,
self-contained, MIT-licensed open-source package for PDF
document conversion. It is powered by state-of-the-art
specialized AI models for layout analysis (DocLayNet) and
table structure recognition (TableFormer)...

## 1 Introduction

Converting PDF documents back into a machine-processable
format has been a major challenge for decades...

Full output: 33,201 characters from a 10-page PDF in 34.95 seconds.
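A quick way to sanity-check an export like the one above is to list its headings and count its characters. The following is a rough sketch using a regular expression. The `list_headings` helper is illustrative only, and it does not skip `#` lines inside fenced code blocks, so treat its result as an approximation of the document outline.

```python
import re

# Markdown ATX headings: one to six '#' characters, a space, then the title.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def list_headings(markdown):
    """Return (level, title) pairs for each heading line in the text."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(markdown)]


# Short excerpt shaped like the output shown above.
sample = """## Docling Technical Report

## Version 1.0

## Abstract

This technical report introduces Docling...

## 1 Introduction
"""

for level, title in list_headings(sample):
    print("  " * (level - 1) + title)
print(len(sample), "characters")
```

Run against the full `markdown` string from the conversion step, the same function gives a compact outline of the converted paper.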

Performance (from our test run)

Conversion time: 34.95s
Pages processed: 10
Tables extracted: (count not shown)
Markdown output: 33KB

Test: Docling paper (arXiv:2408.09869) on Apple Silicon with MPS acceleration
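To compare these numbers with your own hardware, it helps to normalize them per page. The sketch below derives per-page figures from the reported run (34.95 seconds, 10 pages, 33,201 characters) and includes a small timing wrapper; both helper names are illustrative. Note that the reported time appears to include pipeline initialization, so steady-state throughput on later documents may differ.

```python
import time


def throughput(seconds, pages, characters):
    """Derive per-page figures from one conversion run."""
    return {
        "seconds_per_page": seconds / pages,
        "pages_per_second": pages / seconds,
        "characters_per_page": characters / pages,
    }


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    value = func(*args, **kwargs)
    return value, time.perf_counter() - start


# Figures from the test run reported in this tutorial.
stats = throughput(seconds=34.95, pages=10, characters=33_201)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

For your own measurement, `result, elapsed = timed(converter.convert, "document.pdf")` wraps the conversion call from the earlier step without changing it.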

Next Steps

How-To Guides

Configure OCR engines, extract invoices, integrate with RAG

Reference

API docs, configuration options, model specs

Explanation

How Docling works, architecture, design decisions