“Turn this page into text” hides three different jobs. A born-digital PDF already contains perfect text — running OCR on it adds errors. A scan contains no text at all. And most real files are a mix of both. The first decision is not which model; it is whether to OCR at all.
A PDF is not an image. Roughly two-thirds of PDFs in the wild are born-digital — they carry an embedded text layer that is character-perfect. The correct tool there is a parser (PyMuPDF, pdfplumber, pdfminer), not an OCR model. OCR-ing a digital PDF throws away ground-truth text and replaces it with a best-guess transcription. OCR is for the pages that have no text layer.
Rule of thumb: never feed a whole PDF to an OCR API without checking for a text layer first.
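A minimal sketch of that check as a triage step, assuming PyMuPDF is the parser and reusing a 12-character threshold as the cut-off between "has a text layer" and "scan". The helper names (`classify_pages`, `classify_document`, `page_char_counts`) are hypothetical, and the threshold is a heuristic to tune on your own files.

```python
from collections import Counter


def classify_pages(char_counts, min_chars=12):
    """Label each page 'text' or 'scan' from its extracted character count."""
    return ["text" if n >= min_chars else "scan" for n in char_counts]


def classify_document(char_counts, min_chars=12):
    """Return 'born-digital', 'scanned', or 'mixed' for a whole file."""
    labels = Counter(classify_pages(char_counts, min_chars))
    if labels["scan"] == 0:
        return "born-digital"
    if labels["text"] == 0:
        return "scanned"
    return "mixed"


def page_char_counts(path):
    """Count extracted characters per page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF; only needed when reading a real file

    with fitz.open(path) as doc:
        return [len(page.get_text("text").strip()) for page in doc]


# Three pages: two with a text layer, one scan in the middle.
print(classify_document([1840, 0, 2210]))  # mixed
```

On a real file, `classify_document(page_char_counts("report.pdf"))` would tell you whether to parse, OCR, or route page by page; `report.pdf` is a placeholder path.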
CER / WER are edit distance over characters or words, normalised by reference length. 1% CER ≈ one wrong character per hundred. Lower is better. They reward raw transcription accuracy but say nothing about structure or order.
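To make the definition concrete, here is a small sketch that computes both metrics with a plain Levenshtein distance. The function names are illustrative, and real evaluation harnesses usually normalise whitespace, case, and punctuation before scoring, which this sketch does not.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the quick brown fox"
hypothesis = "the qu1ck brown fax"
print(round(cer(reference, hypothesis), 3))  # 0.105 (2 edits / 19 characters)
print(wer(reference, hypothesis))            # 0.5   (2 wrong words / 4)
```

The same two character slips cost about 10% CER but 50% WER, which is why the two numbers should never be compared with each other.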
OmniDocBench reports a text edit distance over the whole page (lower is better), so reading-order mistakes are penalised, not just glyph errors. OCRBench uses a 0–1000 composite over short tasks; OCRBench v2 adds private EN/ZH splits to resist contamination.
A model can win OCRBench (clean crops) and lose OmniDocBench (full pages) if its reading order is weak. Test on your own layout.
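A toy illustration of that gap, assuming a two-column page whose blocks are transcribed perfectly but emitted in the wrong order. The normalisation used here (distance divided by the longer string) is one common choice and an assumption, not necessarily the exact formula either benchmark applies.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def normalised_distance(ref: str, hyp: str) -> float:
    """Edit distance scaled to the 0..1 range by the longer string."""
    longest = max(len(ref), len(hyp))
    return edit_distance(ref, hyp) / longest if longest else 0.0


# Ground truth in reading order: left column top to bottom, then right column.
blocks = [
    "Left column, first paragraph.",
    "Left column, second paragraph.",
    "Right column, first paragraph.",
]
# A model that reads across the page emits the blocks in the order 0, 2, 1.
predicted = [blocks[0], blocks[2], blocks[1]]

# Crop-style scoring: each block is matched to its own reference.
block_score = max(normalised_distance(b, b) for b in blocks)

# Page-style scoring: the whole page is compared as one string.
page_score = normalised_distance("\n".join(blocks), "\n".join(predicted))

print(block_score)           # 0.0, every crop is perfect
print(round(page_score, 3))  # greater than zero, the ordering is penalised
```

Every glyph is correct in both cases; only the page-level score notices that the paragraphs arrive in the wrong order.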
| Rank | Model | Type | Composite ↑ | Text edit dist. ↓ |
|---|---|---|---|---|
| ★ | GLM-OCR | Expert VLM | 94.62 | — |
| 2 | Qianfan-OCR | Expert VLM | 93.12 | — |
| 3 | PaddleOCR-VL | Expert VLM (0.9B) | 92.56 | — |
| · | dots.ocr | Expert VLM (1.7B) | — | 0.125 |
| · | DeepSeek-OCR | Expert VLM | — | 0.123 |
| · | PP-StructureV3 | Pipeline | — | 0.145 |
Source: OmniDocBench (live registry). Composite is higher-is-better; text edit distance is lower-is-better. On OCRBench's 0–1000 scale, frontier VLMs (Qwen2.5-VL 72B ≈ 885) now cluster near the top.
The headline: open-weight expert OCR models now lead the closed VLMs on full-page reading. For the closed-API view and the full live tables, see the open-weight leaderboard and the OmniDocBench page.
    import fitz  # PyMuPDF


    def has_text_layer(page, min_chars: int = 12) -> bool:
        """A born-digital page returns real characters; a scan returns ~nothing."""
        return len(page.get_text("text").strip()) >= min_chars


    def read_pdf(path: str) -> list[str]:
        """Parse where there is a text layer, OCR only where there isn't."""
        out = []
        with fitz.open(path) as doc:  # closes the file when the loop is done
            for page in doc:
                if has_text_layer(page):
                    # Perfect, lossless text: never OCR this.
                    out.append(page.get_text("text"))
                else:
                    # Scanned page: rasterise and hand to an OCR model
                    # (ocr_image is the OCR helper defined separately).
                    pix = page.get_pixmap(dpi=200)
                    out.append(ocr_image(pix.tobytes("png")))
        return out

    import base64
    import replicate
    def ocr_image(png_bytes: bytes) -> str:
        """Scanned page → Markdown. Reading order + structure preserved."""
        b64 = base64.b64encode(png_bytes).decode()
        output = replicate.run(
            "rednote-hilab/dots.ocr",
            input={
                "image": f"data:image/png;base64,{b64}",
                "prompt": "Convert this page to Markdown. Preserve reading order, "
                          "headings, lists and tables. Do not invent text.",
                "temperature": 0.0,
            },
        )
        # Join the returned chunks into one string.
        return "".join(output)