Frontier VLMs (GPT-5, Claude Opus 4.7, Gemini 3) now hold the top of the IAM leaderboard, with GPT-5 at ~1.22% CER. Specialized HTR models like TrOCR and DTrOCR — once SOTA — are now best positioned as open-source fine-tune baselines, not ceilings.
GPT-5 leads IAM at ~1.22% CER, with Claude Opus 4.7 (~1.31%) and Gemini 3 (~1.44%) close behind. The 1.69% CER benchmark from GPT-4o (arXiv 2503.15195, March 2025) marked the moment VLMs dethroned specialized HTR models — the gap has only widened since. Among specialized (non-VLM) models, DTrOCR remains the leader at 2.38% CER (WACV 2024), and TrOCR-Large at 2.89% is still the most practical open-weight baseline for fine-tuning on domain data. For enterprise use with bounding boxes, Azure Document Intelligence v4.0 (~1.8% CER) still offers the best combination of accuracy and structured output.
No — not since ~2023. TrOCR (Microsoft, 2021) hit SOTA on IAM with 2.89% CER and was a genuine breakthrough at the time. It has since been surpassed by DTrOCR (2.38%, WACV 2024) among specialized models, and by frontier VLMs by a wide margin. TrOCR is still a strong pick when you need local, privacy-preserving OCR on pre-segmented lines, or a cheap fine-tuning base for domain-specific handwriting (medical notes, historical scripts, forms). It is not the model to reach for if you simply want the lowest error rate on a new image — that's a frontier VLM call now.
Methodology: CER/WER results on the IAM Handwriting Database (13,353 text lines, 657 writers). Academic numbers from published papers; commercial numbers from vendor benchmarks and independent evaluations. All results writer-independent unless noted.
Character Error Rate on IAM Handwriting Database. Below 2.5% is excellent, above 10% is poor.
| # | Model | CER | WER | Cost/1K pg | Type | Best For |
|---|---|---|---|---|---|---|
| ★ | GPT-5 | ~1.22% | ~2.8% | ~$12 | Frontier VLM | Maximum accuracy |
| 2 | Claude Opus 4.7 | ~1.31% | ~2.9% | ~$15 | Frontier VLM | Long docs, reasoning |
| 3 | Gemini 3 | ~1.44% | ~3.1% | ~$8 | Frontier VLM | Multilingual, word boxes |
| 4 | GPT-5-mini | ~1.52% | -- | ~$2 | VLM | Cost-efficient accuracy |
| 5 | GPT-4o (prior SOTA) | 1.69% | 3.66% | ~$10 | VLM | Legacy integrations |
| 6 | Azure Doc Intel v4.0 | ~1.8% | -- | $15 | Cloud API | Enterprise, bounding boxes |
| 7 | Mistral OCR 3 | ~2.1% | -- | $2 | Cloud API | Best value, cursive |
| 8 | DTrOCR (Best Specialized) | 2.38% | -- | $0 | Open (WACV 2024) | Research, fine-tuning |
| 9 | TrOCR-Large (fine-tune base) | 2.89% | -- | $0 | Open (Microsoft) | Local, privacy, fine-tune |
| 10 | Transkribus | 2.95% | -- | $8 | Cloud + Desktop | Historical documents |
| 11 | GOT-OCR 2.0 | ~3.4% | -- | $0 | Open VLM | Multi-format OCR |
| 12 | Qwen2.5-VL | ~3.8% | -- | $0 | Open VLM | Multilingual handwriting |
| 13 | PaddleOCR | 5.8% | -- | $0 | Open | CJK, budget |
| 14 | Tesseract 5 | 12.5% | ~35% | $0 | Open | Not recommended |
CER below 2.5% is excellent. 2.5-5% is good. 5-10% is acceptable for clean handwriting. Above 10% indicates systematic failures.
[Figure: the same IAM sample line, ground truth "A Move to Stop Mr. Gaitskell from", as transcribed by each model, with errors highlighted in colour. Most models reproduce the line exactly; one output substitutes "trom" for "from", and the weakest reads "A Mave to Stap Mr. Galtskell tram".]
Not all models return spatial coordinates. If you need word-level or character-level bounding boxes for downstream processing, your options are limited.
| Model | Character | Word | Line | Paragraph | Page |
|---|---|---|---|---|---|
| Azure Doc Intel v4 | — | ✓ | ✓ | ✓ | ✓ |
| GPT-5 | — | — | — | — | ✓ |
| Claude Opus 4.7 | — | — | — | — | ✓ |
| Gemini 3 | — | ✓ | ✓ | — | ✓ |
| Mistral OCR 3 | — | — | — | — | ✓ |
| GOT-OCR 2.0 | ✓ | ✓ | ✓ | — | ✓ |
| TrOCR-Large | — | — | ✓ | — | — |
| Tesseract 5 | ✓ | ✓ | ✓ | ✓ | ✓ |
| PaddleOCR | ✓ | ✓ | ✓ | — | ✓ |
| Qwen2.5-VL | — | — | — | — | ✓ |
| DTrOCR | — | — | ✓ | — | — |
| Transkribus | — | ✓ | ✓ | ✓ | ✓ |
Azure Document Intelligence v4 and Tesseract provide the most comprehensive bounding box output. GOT-OCR 2.0 is notable as an open VLM that returns character-level coordinates.
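As a quick illustration of full-coverage spatial output, here is a hedged pytesseract sketch that dumps word-level boxes. The file name handwriting.png is a placeholder, and per the table above, Tesseract's recognition accuracy on handwriting remains poor even though its box output is complete.
from PIL import Image
import pytesseract

# image_to_data returns per-word text, confidence, and pixel coordinates
data = pytesseract.image_to_data(Image.open("handwriting.png"),
                                 output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip():
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        print(word, box, data["conf"][i])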
Expected CER varies dramatically with handwriting quality. Models that excel on clean IAM samples may fail on real-world messy handwriting.
Cursive letters connect. Where does 'm' end and 'a' begin? Traditional OCR assumes isolated characters. Modern models use sequence-to-sequence architectures (encoder-decoder transformers) to avoid explicit segmentation entirely.
Writer-dependent accuracy can reach 97.8%. Writer-independent drops to ~80%. The IAM benchmark tests generalization across 657 different writers, which is why VLMs with massive pretraining data now dominate.
Frontier VLM results come from language understanding. When GPT-5 or Claude sees "Q4 budget $45,0__" it infers "00" from context. Pure vision models can't do this, which is why VLMs now beat specialized handwriting models by a widening margin.
Paper quality, ink bleed-through, smudges, scanning artifacts. Historical documents: 12-30% CER even with the best models. Transkribus specializes here with crowd-sourced training data.
Decoder-only transformer for OCR. Achieves 2.38% CER on IAM — best among specialized (non-VLM) models. Uses GPT-2 architecture adapted for vision, proving decoder-only works for HTR.
Applies dynamic LoRA to TrOCR for efficient fine-tuning on new handwriting styles. Dramatically reduces training cost while maintaining or improving on base TrOCR accuracy.
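A minimal sketch of what LoRA adaptation of TrOCR looks like with the Hugging Face peft library. The rank, alpha, and target modules below are illustrative assumptions, not the paper's configuration, and peft applies static LoRA rather than the dynamic variant described above.
from peft import LoraConfig, get_peft_model
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Suffix-matched module names: "query"/"value" hit the ViT encoder,
    # "q_proj"/"v_proj" hit the TrOCR decoder attention
    target_modules=["query", "value", "q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of weights train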
General OCR Theory model. Handles handwriting, sheet music, math, and scene text in a single model. Returns fine-grained bounding boxes including character-level, rare for VLMs.
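A hedged sketch of GOT-OCR 2.0 inference via its custom Hugging Face code path; the repo id and chat() signature follow the public model card and may change between releases. The image file names are placeholders.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stepfun-ai/GOT-OCR2_0", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "stepfun-ai/GOT-OCR2_0", trust_remote_code=True,
    low_cpu_mem_usage=True, device_map="cuda", use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id
).eval()

plain = model.chat(tokenizer, "handwriting.png", ocr_type="ocr")  # plain text
formatted = model.chat(tokenizer, "sheet.png", ocr_type="format")  # markdown/LaTeX for tables, math, music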
Alibaba's vision-language model with strong OCR capabilities. Particularly competitive on multilingual handwriting (CJK, Arabic) where Western-trained models struggle.
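For local inference, a minimal sketch using Qwen2.5-VL through transformers. The 7B-Instruct checkpoint, the file name handwriting.png, and the prompt are assumptions; qwen_vl_utils is a separate pip package.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "handwriting.png"},
        {"type": "text", "text": "Transcribe all handwritten text exactly as written."}
    ]
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Trim the prompt tokens before decoding
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])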
CER = (Insertions + Deletions + Substitutions) / Total Characters
A 1.22% CER means roughly 1 character error per 100 characters. It is the primary metric for handwriting OCR quality.
WER = (Word Insertions + Deletions + Substitutions) / Total Words
WER is typically higher than CER because one wrong character makes an entire word wrong. Frontier VLMs hold roughly constant WER:CER ratios of ~2.3x — worse than specialized HTR models on average, but the raw numbers are low enough that it doesn't matter in practice.
Vision-language approach. No bounding boxes, but highest raw accuracy.
import base64
from openai import OpenAI
client = OpenAI()
def recognize_handwriting(image_path: str) -> str:
with open(image_path, "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="gpt-5", # or "gpt-5-mini" for ~5x cheaper at ~1.52% CER
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe all handwritten text exactly as written. Preserve line breaks."},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
]
}],
max_tokens=1000
)
    return response.choices[0].message.content
Strongest on long multi-page handwriting and reasoning over transcribed content.
import base64
import anthropic
client = anthropic.Anthropic()
def recognize_handwriting(image_path: str) -> str:
with open(image_path, "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
message = client.messages.create(
model="claude-opus-4-7", # ~1.31% CER on IAM
max_tokens=1000,
messages=[{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
{"type": "text", "text": "Transcribe all handwritten text exactly as written. Preserve line breaks."}
]
}]
)
    return message.content[0].text
Returns word and line-level polygons with handwriting style detection.
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
client = DocumentIntelligenceClient(
endpoint="https://your-endpoint.cognitiveservices.azure.com/",
credential=AzureKeyCredential("your-key")
)
def recognize_handwriting(image_path: str) -> dict:
with open(image_path, "rb") as f:
poller = client.begin_analyze_document(
"prebuilt-read", # v4.0 — supports handwriting + bounding boxes
f.read(),
content_type="application/octet-stream"
)
result = poller.result()
    lines = []
    for page in result.pages:
        for line in page.lines:
            lines.append({
                "text": line.content,
                "polygon": line.polygon  # line-level polygon; word boxes are on page.words
            })
    # Handwriting detection lives on result.styles, not on individual lines
    handwritten = any(s.is_handwritten for s in (result.styles or []))
    return {"text": "\n".join(l["text"] for l in lines), "lines": lines, "handwritten": handwritten}
Best value for high-volume handwriting processing.
from mistralai import Mistral
import base64
client = Mistral(api_key="your-api-key")
def recognize_handwriting(image_path: str) -> str:
with open(image_path, "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
    response = client.ocr.process(
        model="mistral-ocr-2512",  # OCR 3
        document={
            "type": "image_url",
            "image_url": f"data:image/png;base64,{img_b64}"
        }
    )
    return response.pages[0].markdown
Best established open-source option. Works on single pre-segmented lines.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
# TrOCR-large fine-tuned on IAM handwriting
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
def recognize_line(image_path: str) -> str:
"""Recognize a single line of handwriting."""
image = Image.open(image_path).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# Note: TrOCR works best on pre-segmented single lines
# For full pages, use line detection first (e.g., with CRAFT)
Don't trust self-reported confidence scores. Calculate CER/WER against ground truth:
def levenshtein_distance(s1, s2) -> int:
    """Calculate edit distance between two sequences (strings or word lists)."""
if len(s1) < len(s2):
return levenshtein_distance(s2, s1)
if len(s2) == 0:
return len(s1)
previous_row = range(len(s2) + 1)
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1
deletions = current_row[j] + 1
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
previous_row = current_row
return previous_row[-1]
def calculate_cer(ground_truth: str, prediction: str) -> float:
"""CER = edit_distance / len(ground_truth)"""
if len(ground_truth) == 0:
return 0.0 if len(prediction) == 0 else 1.0
return levenshtein_distance(ground_truth, prediction) / len(ground_truth)
def calculate_wer(ground_truth: str, prediction: str) -> float:
    """WER = word_edit_distance / word_count(ground_truth)"""
    gt_words = ground_truth.split()
    pred_words = prediction.split()
    if len(gt_words) == 0:
        return 0.0 if len(pred_words) == 0 else 1.0
    # Compare word lists, not joined strings, so the edit distance
    # counts whole-word operations rather than character edits
    return levenshtein_distance(gt_words, pred_words) / len(gt_words)
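For instance, the single-character error from the sample comparison above produces a much larger WER than CER:
gt = "A Move to Stop Mr. Gaitskell from"
pred = "A Move to Stop Mr. Gaitskell trom"  # "from" -> "trom", one substituted character
print(f"CER: {calculate_cer(gt, pred):.3f}")  # 1 edit / 33 chars ≈ 0.030
print(f"WER: {calculate_wer(gt, pred):.3f}")  # 1 wrong word / 7 words ≈ 0.143
The same single error is about 3% at character level but 14% at word level, which is why WER always runs above CER.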