Mistral OCR 3 Tutorial: Extract Text from PDFs

Learn to extract structured text from PDF documents using the Mistral OCR 3 API. Get clean markdown output in minutes.

Time: 5 minutes | Level: Beginner | Prerequisites: Python 3.8+, Mistral API key

What You Will Learn

1. Install the Mistral Python SDK
2. Authenticate with your API key
3. Extract text from a PDF file
4. Process documents from URLs
5. Handle the markdown output

1. Installation and Setup

Install the Mistral Python SDK and set up your API key:

# Install the Mistral Python SDK
pip install mistralai

# Set your API key as an environment variable
export MISTRAL_API_KEY="your-api-key-here"

Get your API key: Sign up at console.mistral.ai and create an API key in the dashboard. The free tier includes limited usage.
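If the environment variable is missing, `os.environ["MISTRAL_API_KEY"]` fails with a bare `KeyError`. A minimal sketch of a friendlier check follows; `load_api_key` is a hypothetical helper name used only for illustration, not part of the Mistral SDK.

```python
import os


def load_api_key(env=os.environ, name="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    `load_api_key` is an illustrative helper, not an SDK function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the script."
        )
    return key
```

The result can then be passed to the client constructor, for example `Mistral(api_key=load_api_key())`.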

2. Your First OCR Request

Process a local PDF file and extract text as markdown:

from mistralai import Mistral
import base64
import os

# Initialize the client
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

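# Load your PDF document and base64-encode it
with open("invoice.pdf", "rb") as f:
    document_data = base64.b64encode(f.read()).decode()

# Process with Mistral OCR 3; a local PDF is sent as a base64 data URL
response = client.ocr.process(
    model="mistral-ocr-2512",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{document_data}",
    },
)

# Print the extracted markdown, one page at a time
for page in response.pages:
    print(page.markdown)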
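Printing is fine for a first run, but you will usually want the result on disk. The sketch below assumes you have already collected the extracted markdown as a list of strings (for example, one string per page); `save_markdown` is a hypothetical helper name, not an SDK function.

```python
from pathlib import Path


def save_markdown(page_texts, out_path):
    """Join markdown strings with blank lines and write them as UTF-8.

    Empty or whitespace-only entries are skipped.
    """
    combined = "\n\n".join(text.strip() for text in page_texts if text.strip())
    path = Path(out_path)
    path.write_text(combined + "\n", encoding="utf-8")
    return path
```

Writing with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters for invoices containing currency symbols or accented names.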

3. Understanding the Output

Mistral OCR 3 outputs clean markdown with HTML tables. Here is an example from an invoice:

# Invoice

**Invoice #:** INV-2025-001
**Date:** December 19, 2025

## Bill To
John Smith
123 Main Street
San Francisco, CA 94102

<table>
  <tr>
    <th>Description</th>
    <th>Qty</th>
    <th>Price</th>
    <th>Total</th>
  </tr>
  <tr>
    <td>Web Development Services</td>
    <td>40</td>
    <td>$150.00</td>
    <td>$6,000.00</td>
  </tr>
  <tr>
    <td colspan="3"><strong>Total</strong></td>
    <td><strong>$6,000.00</strong></td>
  </tr>
</table>

Text Extraction

Headers, paragraphs, and addresses are extracted as markdown with proper formatting.
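Because labels such as `**Invoice #:**` arrive as bold markdown, simple fields can be pulled out with a regular expression. This is a minimal sketch that assumes each field sits on its own line in the `**Label:** value` shape shown in the example above; `extract_fields` is a hypothetical helper name.

```python
import re

# Matches lines like "**Invoice #:** INV-2025-001"
FIELD_PATTERN = re.compile(r"^\*\*(.+?):\*\*[ \t]*(.+)$", re.MULTILINE)


def extract_fields(markdown):
    """Return a dict of bold 'Label:' fields found at the start of lines."""
    return {
        label.strip(): value.strip()
        for label, value in FIELD_PATTERN.findall(markdown)
    }


sample = "# Invoice\n\n**Invoice #:** INV-2025-001\n**Date:** December 19, 2025\n"
print(extract_fields(sample))
# {'Invoice #': 'INV-2025-001', 'Date': 'December 19, 2025'}
```

Real documents vary in layout, so treat the pattern as a starting point rather than a general parser.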

Table Handling

Tables are output as HTML for better structure, including colspan and rowspan support.
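Since tables arrive as HTML, the standard library's `html.parser` is enough to turn them into rows of cell text. The following is a sketch under simple assumptions: tables are not nested, and `colspan`/`rowspan` attributes are ignored rather than expanded. `TableRows` and `table_rows` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Applied to the invoice table above, the first row would hold the four header labels and the last row the two cells `Total` and `$6,000.00`, because the spanning cell is counted once.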

4. Processing Documents from URLs

You can also process documents directly from URLs without downloading them first:

from mistralai import Mistral
import os

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

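# Process a document from a public URL
response = client.ocr.process(
    model="mistral-ocr-2512",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2408.09869.pdf"
    }
)

# Join the per-page markdown and preview the first 2000 characters
markdown = "\n\n".join(page.markdown for page in response.pages)
print(markdown[:2000])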
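When you have several URLs, one failing document should not stop the rest. The sketch below keeps the network call out of the loop logic: `ocr_fn` is a hypothetical callable that you supply, wrapping the `client.ocr.process` call from the example above and returning the extracted text. `process_urls` is likewise an illustrative name.

```python
def process_urls(urls, ocr_fn):
    """Run ocr_fn(url) for each URL and return (results, errors) keyed by URL."""
    results, errors = {}, {}
    for url in urls:
        # Skip anything that is not a plain http(s) link before calling the API
        if not url.lower().startswith(("http://", "https://")):
            errors[url] = "not an http(s) URL"
            continue
        try:
            results[url] = ocr_fn(url)
        except Exception as exc:  # record the failure and keep going
            errors[url] = str(exc)
    return results, errors
```

Passing the OCR call in as a function also makes the loop easy to test with a stub, without spending API quota.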