
Convert PDFs to Clean Markdown or JSON

Extract text, tables, and formulas from PDFs locally. No cloud APIs, works offline. Open-source Python library from IBM Research.

Open Source | 256M Parameters | Apache 2.0 | IBM Research

Quick Install

pip install docling

Python 3.9-3.14 | macOS, Linux, Windows | Apache 2.0 License
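A quick check after installing can catch an unsupported interpreter before the first conversion fails. The sketch below is illustrative only: `python_supported` and `docling_installed` are hypothetical helper names, and the version range is simply copied from the line above.

```python
import importlib.util
import sys

# Version range copied from the install note above (3.9 through 3.14).
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 14)


def python_supported(version=None):
    """Return True if the (major, minor) version is inside the stated range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


def docling_installed():
    """Return True if a top-level 'docling' package can be found."""
    return importlib.util.find_spec("docling") is not None


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("docling importable:", docling_installed())
```

Running the file prints two booleans. If the second one is False, the install most likely went into a different environment than the one running the script.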


When to Use Docling

1. Processing research papers or technical documents

Extract tables, equations (LaTeX), and structured content while preserving formatting. Works offline, no API costs.

2. Building RAG systems or document search

Convert PDFs to clean Markdown for embeddings. Preserves document structure (headings, lists) better than plain OCR.

3. Handling sensitive documents

Runs entirely on your machine, so no data is sent to cloud APIs. This simplifies GDPR and HIPAA compliance, although compliance still depends on how you store and handle the documents.

4. Batch processing thousands of documents

Processes pages in about 0.35 s each on a GPU and 2-3 s each on a CPU. No rate limits or API quotas to worry about.
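To turn the per-page figures into a planning number, a back-of-the-envelope estimate helps. The sketch below assumes a single worker, no model-loading overhead, and a 10,000-page batch chosen purely for illustration. `estimate_hours` is a hypothetical helper, not part of Docling.

```python
def estimate_hours(pages, seconds_per_page):
    """Convert a page count and per-page latency into wall-clock hours."""
    return pages * seconds_per_page / 3600


PAGES = 10_000  # example batch size, chosen for illustration

gpu_hours = estimate_hours(PAGES, 0.35)      # GPU figure quoted above
cpu_low_hours = estimate_hours(PAGES, 2.0)   # lower CPU figure quoted above
cpu_high_hours = estimate_hours(PAGES, 3.0)  # upper CPU figure quoted above

print(f"GPU: about {gpu_hours:.1f} h")
print(f"CPU: about {cpu_low_hours:.1f} to {cpu_high_hours:.1f} h")
```

Under these assumptions the batch takes roughly one hour on a GPU and between about 5.6 and 8.3 hours on a CPU. Real throughput will vary with page complexity and hardware.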

Minimal Example

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown
print(result.document.export_to_markdown())

# Or JSON, HTML, plain text
result.document.export_to_dict()
result.document.export_to_html()
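For the RAG use case, the Markdown export usually needs to be cut into chunks before embedding. The sketch below shows one simple approach, assuming the export marks headings with leading `#` characters. `split_by_headings` is a hypothetical helper written for this illustration and uses only the standard library, so it runs without Docling installed.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into (heading, body) chunks, one per heading."""
    chunks = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if title or any(l.strip() for l in lines):
                chunks.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush whatever follows the last heading.
    if title or any(l.strip() for l in lines):
        chunks.append((title, "\n".join(lines).strip()))
    return chunks


sample = "# Intro\nDocling converts PDFs.\n\n## Tables\n| a | b |\n"
for heading, body in split_by_headings(sample):
    print(repr(heading), "->", repr(body))
```

In practice the input would be the string returned by `result.document.export_to_markdown()`. This naive splitter treats any line starting with `#` as a heading, including lines inside code blocks, so a production pipeline would likely need something more careful.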

Choose the Right Tool

Your Situation              | Docling           | Tesseract        | AWS Textract      | GPT-4o Vision
Extract tables to CSV/Excel | Built-in          | Manual parsing   | Built-in          | Via prompt
Convert math formulas       | LaTeX export      | Not supported    | Not supported     | Via prompt
Process 10,000 pages        | Free, local       | Free, local      | $15,000 cost      | $100+ cost
Sensitive/confidential docs | Offline           | Offline          | Cloud upload      | Cloud upload
No internet access          | Works offline     | Works offline    | Requires internet | Requires internet
Processing speed needed     | Fast (0.35s/page) | Slow (2-5s/page) | Medium + latency  | Slow + latency
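The "Built-in" table export in the comparison can be sketched as follows. This assumes the converted document exposes a `tables` collection whose items offer `export_to_dataframe()` returning a pandas DataFrame, as in Docling's published examples. Newer releases may prefer that the parent document be passed as an argument, so treat the exact call as an assumption to check against your installed version. `save_tables_as_csv` is a hypothetical helper name.

```python
from pathlib import Path


def save_tables_as_csv(document, out_dir):
    """Write each table in a converted document to its own CSV file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(document.tables, start=1):
        # Assumed API: returns a pandas DataFrame for this table.
        frame = table.export_to_dataframe()
        target = out / f"table-{index}.csv"
        frame.to_csv(target, index=False)
        written.append(target)
    return written
```

Passing `result.document` from the minimal example and an output folder should produce files named `table-1.csv`, `table-2.csv`, and so on. Swapping `to_csv` for `to_excel` would give Excel output, provided an Excel writer such as openpyxl is installed.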

Supported Formats

PDF: Native + scanned
DOCX: Word documents
PPTX: PowerPoint
XLSX: Excel
HTML: Web pages
Images: PNG, JPG, TIFF
Audio: WAV, MP3 (ASR)
VTT: Subtitles
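When converting a mixed folder, it can help to filter by extension first and then reuse the same converter for every file. The sketch below is one possible arrangement, not Docling's own batch API. The extension set is derived from the list above, with `.jpeg`, `.tif` and `.htm` added as assumed aliases. `find_convertible` and `convert_folder` are hypothetical helpers, and the `to_markdown` parameter exists only so the logic can be exercised without Docling installed.

```python
from pathlib import Path

# Extensions derived from the format list above; ".jpeg", ".tif" and ".htm"
# are assumed aliases, not taken from the source.
SUPPORTED = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".htm",
    ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".wav", ".mp3", ".vtt",
}


def find_convertible(folder):
    """Return files under folder whose extension is in SUPPORTED, sorted."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def convert_folder(folder, out_dir, to_markdown=None):
    """Convert every supported file to Markdown and return the output paths."""
    if to_markdown is None:
        # Imported lazily so the helpers above work without Docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

        def to_markdown(path):
            return converter.convert(path).document.export_to_markdown()

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source in find_convertible(folder):
        target = out / (source.stem + ".md")
        target.write_text(to_markdown(source), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("inbox", "markdown")` would write one `.md` file per supported input. Files that share a stem (for example `report.pdf` and `report.docx`) would overwrite each other in this sketch, and audio inputs may need extra dependencies beyond the base install.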
