Mistral OCR 3 Tutorial: Extract Text from PDFs
Learn to extract structured text from PDF documents using the Mistral OCR 3 API. Get clean markdown output in minutes.
What You Will Learn
1. Install the Mistral Python SDK
2. Authenticate with your API key
3. Extract text from a PDF file
4. Process documents from URLs
5. Handle the markdown output
1. Installation and Setup
Install the Mistral Python SDK and set up your API key:
# Install the Mistral Python SDK
pip install mistralai
# Set your API key as an environment variable
export MISTRAL_API_KEY="your-api-key-here"

Get your API key: Sign up at console.mistral.ai and create an API key in the dashboard. Free tier includes limited usage.
2. Your First OCR Request
Process a local PDF file and extract text as markdown:
from mistralai import Mistral
import base64
import os

# Initialize the client
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Load your PDF document and base64-encode it
with open("invoice.pdf", "rb") as f:
    document_data = base64.b64encode(f.read()).decode()

# Process with Mistral OCR 3 (the PDF is passed as a base64 data URI)
response = client.ocr.process(
    model="mistral-ocr-2512",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{document_data}",
    },
)

# Print the extracted markdown, one page at a time
for page in response.pages:
    print(page.markdown)

3. Understanding the Output
Mistral OCR 3 outputs clean markdown with HTML tables. Here is an example from an invoice:
# Invoice
**Invoice #:** INV-2025-001
**Date:** December 19, 2025
## Bill To
John Smith
123 Main Street
San Francisco, CA 94102
<table>
<tr>
<th>Description</th>
<th>Qty</th>
<th>Price</th>
<th>Total</th>
</tr>
<tr>
<td>Web Development Services</td>
<td>40</td>
<td>$150.00</td>
<td>$6,000.00</td>
</tr>
<tr>
<td colspan="3"><strong>Total</strong></td>
<td><strong>$6,000.00</strong></td>
</tr>
</table>
Text Extraction
Headers, paragraphs, and addresses are extracted as markdown with proper formatting.
Table Handling
Tables are output as HTML for better structure, including colspan and rowspan support.
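Because tables arrive as HTML embedded in the markdown, it can help to pull them out into plain Python lists before further processing. The following is a minimal sketch using only the standard library; the names `TableExtractor` and `extract_tables` are illustrative and not part of the Mistral SDK. It assumes tables are not nested, and it does not expand `colspan` or `rowspan`, so a spanning cell appears once in its row.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in the input as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table being parsed, or None
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (headings, paragraphs) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(markdown_text):
    """Return all HTML tables found in OCR markdown as nested lists."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables


sample = """# Invoice
<table>
<tr><th>Description</th><th>Qty</th></tr>
<tr><td>Web Development Services</td><td>40</td></tr>
</table>"""

tables = extract_tables(sample)
print(tables[0][1])  # ['Web Development Services', '40']
```

The first row of each table holds the header cells, so `tables[0][0]` can serve as column names when loading the rows into a CSV writer or a dataframe.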
4. Processing Documents from URLs
You can also process documents directly from URLs without downloading them first:
from mistralai import Mistral
import os

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Process a document from URL
response = client.ocr.process(
    model="mistral-ocr-2512",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2408.09869.pdf",
    },
)

# Join the per-page markdown and print the first 2000 characters
markdown = "\n\n".join(page.markdown for page in response.pages)
print(markdown[:2000])

What to Expect
Based on CodeSOTA verified benchmarks (December 2025)
Common Issues
API Key Not Found
Ensure MISTRAL_API_KEY is set in your environment. Check with: echo $MISTRAL_API_KEY
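The same check can be done inside the script, so that a missing key produces a readable message instead of a bare `KeyError`. This is a small sketch; `require_api_key` is a hypothetical helper name, not something the SDK provides.

```python
import os


def require_api_key(env=None, name="MISTRAL_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the script."
        )
    return key
```

The returned value can then be passed as `api_key` when constructing the client.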
File Too Large
PDFs over 50MB may need to be split. Consider processing pages individually for large documents.
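One way to plan this is to check the file size up front and divide the page count into fixed-size batches. The sketch below covers only that planning step. It assumes the 50MB figure means 50 × 1024 × 1024 bytes, and the helper names `exceeds_limit` and `page_batches` are illustrative. Actually splitting the PDF would need a PDF library, which is not shown here.

```python
import os

# Assumption: the 50MB threshold is interpreted as binary megabytes.
MAX_BYTES = 50 * 1024 * 1024


def exceeds_limit(path, max_bytes=MAX_BYTES):
    """Return True if the file at `path` is larger than the size threshold."""
    return os.path.getsize(path) > max_bytes


def page_batches(total_pages, batch_size):
    """Split a page count into zero-based (start, end) ranges, end exclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size, total_pages))
        for start in range(0, total_pages, batch_size)
    ]


print(page_batches(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each range can be written out as its own smaller PDF and processed separately, with the per-batch markdown concatenated in page order afterwards.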
Rate Limits
Free tier has rate limits. For production, use the Batch API (50% cheaper, async processing).
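For interactive use on the free tier, a common way to cope with rate limits is to retry with exponential backoff. The sketch below is generic and makes no claim about which exception the SDK raises on a rate-limit response: the caller passes the exception types to retry on through the hypothetical `retry_on` parameter.

```python
import random
import time


def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Jitter spreads out retries from clients that failed together.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A request can be wrapped as `call_with_backoff(lambda: client.ocr.process(...), retry_on=(SomeRateLimitError,))`, where `SomeRateLimitError` stands in for whatever exception your SDK version raises.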