Convert PDFs to Clean Markdown or JSON

Extract text, tables, and formulas from PDFs locally. No cloud APIs, works offline. Open-source Python library from IBM Research.

Open Source | 256M Parameters | Apache 2.0 | IBM Research

Quick Install

pip install docling

Python 3.9-3.14 | macOS, Linux, Windows | Apache 2.0 License

Documentation

Tutorial

Learning-oriented. Take your first steps with Docling by converting a PDF to structured Markdown.

Start here →

How-To Guides

Problem-oriented. Solve specific tasks: extract tables, configure OCR engines, batch process documents.

Solve problems →

Reference

Information-oriented. API documentation, configuration options, export formats, model specifications.

Look it up →

Explanation

Understanding-oriented. How Docling works under the hood, architecture decisions, when to use what.

Understand →

When to Use Docling

Processing research papers or technical documents

Extract tables, equations (LaTeX), and structured content while preserving formatting. Works offline, no API costs.
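
A minimal sketch of the table extraction described above: convert a PDF, then write each detected table to its own CSV file. It assumes pandas is installed next to Docling. The `tables_to_csv` and `table_csv_name` helpers and the output naming scheme are illustrative, not part of Docling.

```python
from pathlib import Path


def table_csv_name(stem: str, index: int) -> str:
    """Build a file name such as 'paper-table-1.csv' (hypothetical naming scheme)."""
    return f"{stem}-table-{index}.csv"


def tables_to_csv(pdf_path: str, out_dir: str = "tables") -> list:
    """Convert one PDF and save every detected table as a CSV file."""
    # Imported lazily so table_csv_name stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for i, table in enumerate(result.document.tables, start=1):
        # Returns a pandas DataFrame. Recent releases accept doc=...;
        # older ones may expect the call without arguments.
        df = table.export_to_dataframe(doc=result.document)
        target = out / table_csv_name(Path(pdf_path).stem, i)
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

Calling `tables_to_csv("paper.pdf")` would leave files like `tables/paper-table-1.csv`, one per table found.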

Building RAG systems or document search

Convert PDFs to clean Markdown for embeddings. Preserves document structure (headings, lists) better than plain OCR.
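
One way to use the preserved headings for embeddings is to split the exported Markdown into one chunk per heading. The sketch below is plain Python and does not depend on Docling. `split_by_heading` is a hypothetical helper, and it assumes headings use `#` syntax and do not appear inside code blocks.

```python
import re


def split_by_heading(markdown: str) -> list:
    """Split Markdown into chunks, one per heading, keeping the heading as metadata."""
    chunks = []
    current = {"heading": "", "text": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the previous chunk if it collected any body text.
            if current["text"]:
                chunks.append(current)
            current = {"heading": line.lstrip("#").strip(), "text": []}
        elif line.strip():
            current["text"].append(line)
    if current["text"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["text"])} for c in chunks]


sample = "# Intro\nDocling converts PDFs.\n\n## Tables\n- rows\n- columns\n"
chunks = split_by_heading(sample)
for chunk in chunks:
    print(chunk["heading"], "->", repr(chunk["text"]))
```

In practice the input would be the string returned by `result.document.export_to_markdown()`. Docling also ships its own chunkers, which may be a better fit than this hand-rolled splitter.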

Handling sensitive documents

Runs entirely on your machine. No data is sent to cloud APIs, which simplifies GDPR and HIPAA compliance; compliance itself still depends on how you store and handle the documents.
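
Fully offline runs need the model weights on disk first, since Docling otherwise fetches them on first use. A sketch of pointing the PDF pipeline at a local model directory, assuming the models were downloaded ahead of time. `build_offline_converter` and the example path are hypothetical.

```python
import os

# Ask Hugging Face libraries not to reach the network (a huggingface_hub setting).
os.environ.setdefault("HF_HUB_OFFLINE", "1")


def build_offline_converter(models_dir: str):
    """Return a DocumentConverter that loads PDF models from a local directory."""
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # artifacts_path tells the pipeline where the prefetched models live.
    pipeline_options = PdfPipelineOptions(artifacts_path=models_dir)
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Usage would look like `build_offline_converter("/path/to/models").convert("contract.pdf")`, with the directory filled on a connected machine beforehand.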

Batch processing thousands of documents

Processes a page in roughly 0.35s on GPU and 2-3s on CPU. No rate limits or API quotas to worry about.
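
To put the per-page figures in context, 10,000 pages at 0.35s each is about 1 hour on GPU, and roughly 5.6 to 8.3 hours at 2-3s per page on CPU. The sketch below does that arithmetic and shows one possible batch loop. `estimate_hours` and `convert_folder` are illustrative helpers, and the loop assumes a flat folder of PDFs processed sequentially.

```python
from pathlib import Path


def estimate_hours(pages: int, seconds_per_page: float) -> float:
    """Wall-clock hours for a sequential run at a fixed per-page cost."""
    return pages * seconds_per_page / 3600


def convert_folder(pdf_dir: str, out_dir: str) -> int:
    """Convert every PDF in pdf_dir to Markdown; return how many succeeded."""
    # Imported lazily so estimate_hours works without Docling installed.
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))

    converter = DocumentConverter()  # reuse one converter so models load once
    succeeded = 0
    # raises_on_error=False lets the batch continue past a file that fails.
    for result in converter.convert_all(pdfs, raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            target = out / f"{result.input.file.stem}.md"
            target.write_text(result.document.export_to_markdown(), encoding="utf-8")
            succeeded += 1
    return succeeded


print(f"GPU: {estimate_hours(10_000, 0.35):.1f} h")
print(f"CPU: {estimate_hours(10_000, 2.0):.1f}-{estimate_hours(10_000, 3.0):.1f} h")
```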

Minimal Example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
print(result.document.export_to_markdown())

# Or JSON, HTML, plain text
result.document.export_to_dict()
result.document.export_to_html()
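
The export calls return strings or dicts, so writing them to disk is left to you. A small sketch of one way to do it: `save_exports` is a hypothetical helper that accepts any object exposing `export_to_markdown()` and `export_to_dict()`, such as `result.document` from the example above.

```python
import json
from pathlib import Path


def save_exports(document, out_dir: str, stem: str) -> dict:
    """Write Markdown and JSON exports of a converted document to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    # export_to_dict() is expected to return JSON-serializable data.
    json_path.write_text(
        json.dumps(document.export_to_dict(), indent=2), encoding="utf-8"
    )
    return {"markdown": md_path, "json": json_path}
```

For example, `save_exports(result.document, "out", "document")` would produce `out/document.md` and `out/document.json`.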

Choose the Right Tool

Your Situation	Docling	Tesseract	AWS Textract	GPT-4o Vision
Extract tables to CSV/Excel	Built-in	Manual parsing	Built-in	Via prompt
Convert math formulas	LaTeX export	Not supported	Not supported	Via prompt
Process 10,000 pages	Free, local	Free, local	~$15-$150 (per-page fees)	$100+ cost
Sensitive/confidential docs	Offline	Offline	Cloud upload	Cloud upload
No internet access	Works offline	Works offline	Requires internet	Requires internet
Processing speed needed	Fast (~0.35s/page on GPU)	Slow (2-5s/page)	Medium + latency	Slow + latency

Supported Formats

PDF

Native + scanned

DOCX

Word documents

PPTX

PowerPoint

XLSX

Excel

HTML

Web pages

Images

PNG, JPG, TIFF

Audio

WAV, MP3 (ASR)

VTT

Subtitles
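
When a folder mixes file types, it can help to filter by extension before handing paths to the converter. A sketch using only the formats listed above: the `SUPPORTED` set and `supported_files` helper are illustrative, and treating `.jpeg` and `.tif` as variants of JPG and TIFF is an assumption.

```python
from pathlib import Path

# Extensions taken from the list above; .jpeg and .tif are assumed variants.
SUPPORTED = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif",
    ".wav", ".mp3", ".vtt",
}


def supported_files(paths):
    """Keep only paths whose extension appears in SUPPORTED (case-insensitive)."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED]


candidates = ["report.PDF", "slides.pptx", "notes.txt", "scan.tiff", "talk.mp3"]
print([p.name for p in supported_files(candidates)])
```

Audio inputs go through speech recognition (ASR), so they may need extra dependencies beyond the base install.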

Resources

GitHub Repository →

Official Documentation →

Granite-Docling Model →

SmolDocling Paper →