The register of Polish document recognition
Updated · March 2026

§ 00 · Polish OCR

Nine diacritics, two centuries of print.

Polish OCR is tracked here on three panels: PolEval 2021 measures NLP post-correction on historical books, IMPACT-PSNC supplies the ground truth, and our own CodeSOTA Polish panel separates models that read characters from models that read the dictionary.

0 models with at least one Polish benchmark result. Shaded rows mark the best current CER. Everything on this page is computed from the registry JSON — nothing invented.


§ 01 · Leaderboard

Character error rate, three panels.

CodeSOTA Polish, PolEval 2021 and IMPACT-PSNC report CER directly. Lower is better. Rows are ranked by combined CER across available panels.

Metric: CER · lower is better
Models: 0 with Polish results
Leader: —

Ranked by combined CER · March 2026

Shaded row marks current best

#	Model	Vendor	Type	CodeSOTA CER	PolEval CER	IMPACT CER	Trend

Fig 1 · CER across the three Polish panels. Blanks are honest — we do not impute missing benchmarks.
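The page does not define how "combined CER" is computed. The sketch below assumes it is the plain mean over whichever panels a model has results for, with blanks left out rather than imputed. The row layout, the panel keys and the `model-a` style names are hypothetical placeholders, not registry entries.

```python
from statistics import mean
from typing import Optional

# Hypothetical panel keys; the real registry JSON schema is not shown on this page.
PANELS = ("codesota", "poleval", "impact")


def combined_cer(row: dict) -> Optional[float]:
    """Mean CER over the panels that have a value; None if every panel is blank."""
    scores = [row[p] for p in PANELS if row.get(p) is not None]
    return mean(scores) if scores else None


def rank(rows: list) -> list:
    """Model names ordered by combined CER, lowest first; unscored models are dropped."""
    scored = [(combined_cer(r), r["model"]) for r in rows]
    return [name for score, name in sorted(s for s in scored if s[0] is not None)]


# Placeholder rows, invented purely to exercise the functions.
rows = [
    {"model": "model-a", "codesota": 10.0, "poleval": None, "impact": 20.0},
    {"model": "model-b", "codesota": 12.0, "poleval": None, "impact": None},
    {"model": "model-c", "codesota": None, "poleval": None, "impact": None},
]
print(rank(rows))
```

Averaging over available panels favours a model that was only scored on an easy panel, and post-correction CER is not directly comparable to raw OCR CER, so a combined figure of this kind is best read next to the per-panel columns.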

§ 02 · Task

Three reasons Polish OCR is hard.

First, the nine diacritics: ą, ć, ę, ł, ń, ó, ś, ź, ż. Standard OCR engines trained on English corpora routinely confuse ą with a, ł with l, ó with o. A single dropped diacritic can change the meaning of a Polish sentence: kąt (angle) becomes kat (executioner).
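One way to see how much of an OCR error budget is spent on diacritics is to fold both strings to their base letters and compare. This is a minimal sketch, not part of the CodeSOTA tooling; the function names are illustrative, and it only handles the same-length case where no characters were inserted or deleted.

```python
import unicodedata
from typing import Optional

# The nine Polish diacritics (lower and upper case) and their base letters.
_ACCENTED = unicodedata.normalize("NFC", "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
_BASE = "acelnoszzACELNOSZZ"
FOLD = str.maketrans(_ACCENTED, _BASE)


def fold_diacritics(text: str) -> str:
    """Replace Polish diacritics with their base letters."""
    return unicodedata.normalize("NFC", text).translate(FOLD)


def diacritic_only_errors(reference: str, hypothesis: str) -> Optional[int]:
    """Count positions that differ only by a diacritic.

    Returns None when the strings also differ in some other way
    (different length, or different base letters).
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if len(ref) != len(hyp) or fold_diacritics(ref) != fold_diacritics(hyp):
        return None
    return sum(r != h for r, h in zip(ref, hyp))


print(diacritic_only_errors("łąka", "laka"))  # both ł and ą were flattened
print(diacritic_only_errors("kąt", "kot"))    # a base-letter error, not a diacritic one
```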

Second, the historical stock. Polish documents from 1791 to the early twentieth century frequently use gothic or fraktur typefaces. Modern OCR is trained on modern fonts; historical print requires either a dedicated model or a post-correction step.

Third, the PolEval 2021 task is deliberately not raw OCR — it is OCR post-correction. Given noisy OCR output, a Polish NLP model must recover the correct text. This rewards language understanding, not vision.
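To make the post-correction setting concrete, here is a toy corrector that restores diacritics by lexicon lookup. It is an illustration only, not the PolEval 2021 baseline or any submitted system; the tiny lexicon and the function names are assumptions made for the example.

```python
import unicodedata
from collections import defaultdict

_ACCENTED = unicodedata.normalize("NFC", "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
FOLD = str.maketrans(_ACCENTED, "acelnoszzACELNOSZZ")


def fold(word: str) -> str:
    """Strip Polish diacritics so noisy and clean spellings share a key."""
    return unicodedata.normalize("NFC", word).translate(FOLD)


def build_index(lexicon):
    """Map each folded spelling to the set of lexicon words that produce it."""
    index = defaultdict(set)
    for word in lexicon:
        index[fold(word)].add(word)
    return index


def restore(tokens, index):
    """Replace a token only when exactly one lexicon word matches its folded form."""
    out = []
    for tok in tokens:
        candidates = index.get(fold(tok), set())
        out.append(next(iter(candidates)) if len(candidates) == 1 else tok)
    return out


# Illustrative lexicon: meadow, turtle, executioner, angle.
index = build_index(["łąka", "żółw", "kat", "kąt"])
print(restore(["laka", "zolw", "kat"], index))
```

The token `kat` is left untouched because both `kat` and `kąt` fold to the same key. Resolving that ambiguity needs sentence context, which is why the task rewards language understanding rather than vision.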

§ 03 · CodeSOTA Polish

1,000 images, four categories.

Our own panel, built to separate character recognition from dictionary assistance. Tesseract 5.5.1 is the published baseline; overall CER is 26.3%.

Images: 1,000
Baseline: Tesseract 5.5.1 · 26.3% CER
Degradation: Augraphy · five levels

Tesseract 5.5.1 baseline by category

Lower CER = better

Category	Description	CER
Wikipedia	Polish Wikipedia excerpts (potential contamination baseline)	5.2%
Real Corpus	Pan Tadeusz, official documents	7.3%
Synth Random	Random Polish characters (pure OCR)	40.6%
Synth Words	Markov-generated words (no dictionary)	52.1%
Overall	All 1,000 images	26.3%

Fig 2 · A roughly seven-fold gap between Real Corpus (7.3% CER) and Markov-generated words (52.1%), and a ten-fold gap against Wikipedia (5.2%), is evidence that the Tesseract baseline leans heavily on the language model, not the image.
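The ratios between categories can be read straight off the table above. The short check below also shows that the overall figure equals the unweighted mean of the four categories, which is consistent with equally sized categories; the page does not state the per-category image counts, so that part is an assumption.

```python
# Tesseract 5.5.1 baseline CER per category, in percent, copied from the table.
cer = {
    "Wikipedia": 5.2,
    "Real Corpus": 7.3,
    "Synth Random": 40.6,
    "Synth Words": 52.1,
}

ratio_vs_real = cer["Synth Words"] / cer["Real Corpus"]
ratio_vs_wiki = cer["Synth Words"] / cer["Wikipedia"]

# Unweighted mean; matches the reported overall CER only if categories are equal in size.
unweighted_mean = sum(cer.values()) / len(cer)

print(f"Synth Words / Real Corpus: {ratio_vs_real:.1f}x")
print(f"Synth Words / Wikipedia:   {ratio_vs_wiki:.1f}x")
print(f"Unweighted mean CER:       {unweighted_mean:.1f}%")
```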

Degradation levels · Augraphy

Level	Artifact
Clean	No artifacts
Light	Subtle noise
Medium	Roller marks
Heavy	Ink bleed
Severe	Bad photocopy

§ 04 · Datasets

The panels, each with its metric.

Every Polish-language OCR dataset currently in the Codesota registry, with its task, sample count and source link.

Dataset	Task	Pages / Samples	Year	Source
PolEval 2021 OCR	document-ocr	69,000	2021	paper →
IMPACT-PSNC	document-ocr	478	2012	paper →
reVISION	ocr-capabilities	—	2025	paper →
Polish EMNIST Extension	handwriting-recognition	—	2020	—
CodeSOTA Polish	document-ocr	1,000	2025	dataset →

Fig 3 · Each row carries its metric direction via the linked source. PolEval 2021 is post-correction, not raw OCR; scores are not comparable to IMPACT-PSNC.

§ 05 · Methodology

What CER does and does not say.

Character Error Rate is the Levenshtein edit distance between the OCR output and a known reference, divided by the length of the reference. 2.1% means roughly two errors per 100 characters; it does not say where those errors are. On Polish documents, a stripped ogonek counts as a single character error but often produces a different word.
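A minimal sketch of the metric, assuming CER is edit distance divided by reference length and that both strings are normalised to NFC first, so that a precomposed ą and an a followed by a combining ogonek compare as equal. The normalisation step is an assumption; the page does not say how the registry handles it.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference, as a fraction."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One stripped ogonek: a single character error, but "angle" has become "executioner".
print(cer("kąt", "kat"))
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many spurious insertions.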

Our CodeSOTA Polish panel deliberately tests both ends. The Wikipedia and Real Corpus categories let a model lean on a Polish language prior — a dictionary, n-gram statistics, fine-tuning on Polish text. The Synth Random and Synth Words categories deny that prior; the model has only the image.

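The gap between the two groups is the signal we track. For the Tesseract baseline it is roughly seven-fold between Real Corpus (7.3%) and Synth Words (52.1%), and ten-fold between Wikipedia (5.2%) and Synth Words. A model that reads characters, not the dictionary, is the one that will hold up on the next out-of-distribution document you hand it.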

Neighbouring registers.

Cross-links to the rest of Codesota.

OCR · register →

General-purpose document OCR leaderboards.

Polish LLMs →

Bielik, PLLuM and the five Polish-language panels.

CPTU-Bench →

Complex Polish text understanding.

Methodology →

How scores are admitted and retracted.