Codesota · OCR · Polish
The register of Polish document recognition
Updated · March 2026
§ 00 · Polish OCR

Nine diacritics, two centuries of print.

Polish OCR sits between two hard regimes: nine diacritics and two centuries of historical print. Three panels cover them: PolEval 2021 measures NLP post-correction on historical books, IMPACT-PSNC supplies the ground truth, and our own CodeSOTA Polish panel separates models that read characters from models that read the dictionary.

0 models with at least one Polish benchmark result. Shaded rows mark the best current CER. Everything on this page is computed from the registry JSON — nothing invented.

§ 01 · Leaderboard

Character error rate, three panels.

CodeSOTA Polish, PolEval 2021 and IMPACT-PSNC report CER directly. Lower is better. Rows are ranked by combined CER across available panels.


Metric: CER · lower is better
Models: 0 with Polish results
Leader: none (no models listed)

Ranked by combined CER · March 2026. Shaded row marks current best.

# | Model | Vendor | Type | CodeSOTA CER | PolEval CER | IMPACT CER | Trend
Fig 1 · CER across the three Polish panels. Blanks are honest — we do not impute missing benchmarks.
§ 02 · Task

Three reasons Polish OCR is hard.

First, the nine diacritics: ą, ć, ę, ł, ń, ó, ś, ź, ż. Standard OCR engines trained on English corpora routinely confuse ą with a, ł with l, ó with o. A single missed diacritic can change the meaning of a Polish sentence: kąt (corner) becomes kat (executioner).
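To make the confusion concrete, here is a minimal sketch that strips Polish diacritics the way an English-trained engine might and counts how many characters were lost. The function names and the one-to-one mapping are assumptions for illustration, not a model of any specific engine.

```python
# Hypothetical mapping: each Polish diacritic letter and the ASCII letter
# an English-trained OCR engine is likely to emit in its place.
POLISH_TO_ASCII = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def strip_diacritics(text: str) -> str:
    """Return text with every Polish diacritic replaced by its base letter."""
    return text.translate(POLISH_TO_ASCII)


def count_lost_diacritics(text: str) -> int:
    """Count the characters that change when diacritics are stripped."""
    stripped = strip_diacritics(text)
    return sum(1 for original, plain in zip(text, stripped) if original != plain)


if __name__ == "__main__":
    sample = "zażółć gęślą jaźń"  # pangram that uses all nine diacritics
    print(strip_diacritics(sample))
    print(count_lost_diacritics(sample))
```

The mapping is one character to one character, so the stripped string keeps its length and the two strings can be compared position by position.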

Second, the historical stock. Polish documents from 1791 to the early twentieth century frequently use gothic or fraktur typefaces. Modern OCR is trained on modern fonts; historical print requires either a dedicated model or a post-correction step.

Third, the PolEval 2021 task is deliberately not raw OCR — it is OCR post-correction. Given noisy OCR output, a Polish NLP model must recover the correct text. This rewards language understanding, not vision.
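As a toy illustration of what post-correction means, the sketch below restores diacritics by looking each stripped token up in a small word list. This is not how PolEval 2021 systems work; the lexicon, the function names and the lookup approach are all assumptions made for illustration.

```python
# Hypothetical mapping from Polish diacritic letters to their base letters.
POLISH_TO_ASCII = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def build_restoration_table(lexicon):
    """Map each stripped spelling to the set of lexicon words that produce it."""
    table = {}
    for word in lexicon:
        table.setdefault(word.translate(POLISH_TO_ASCII), set()).add(word)
    return table


def restore(token, table):
    """Restore diacritics only when exactly one lexicon word matches."""
    candidates = table.get(token, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Unknown or ambiguous token: leave it for a context-aware model.
    return token


if __name__ == "__main__":
    table = build_restoration_table(["żółć", "kąt", "kat"])
    print(restore("zolc", table))
    print(restore("kat", table))
```

The lookup gives up on "kat", which could be either kat or kąt. Resolving that needs sentence context, which is the language understanding the task rewards.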

§ 03 · CodeSOTA Polish

1,000 images, four categories.

Our own panel, built to separate character recognition from dictionary assistance. Tesseract 5.5.1 is the published baseline; overall CER is 26.3%.


Images: 1,000
Baseline: Tesseract 5.5.1 · 26.3% CER
Degradation: Augraphy · five levels
Tesseract 5.5.1 baseline by category (lower CER = better)

Category | Description | CER
Wikipedia | Polish Wikipedia excerpts (potential contamination baseline) | 5.2%
Real Corpus | Pan Tadeusz, official documents | 7.3%
Synth Random | Random Polish characters (pure OCR) | 40.6%
Synth Words | Markov-generated words (no dictionary) | 52.1%
Overall | All 1,000 images | 26.3%
Fig 2 · A roughly seven-fold gap between real corpus (7.3% CER) and Markov-generated words (52.1%), ten-fold against Wikipedia (5.2%), is evidence that modern OCR leans heavily on the language model, not the image.
Degradation levels · Augraphy

Clean: No artifacts
Light: Subtle noise
Medium: Roller marks
Heavy: Ink bleed
Severe: Bad photocopy
§ 04 · Datasets

The panels, each with its metric.

Every Polish-language OCR dataset currently in the Codesota registry, with its task, sample count and source link.

Dataset | Task | Pages / Samples | Year | Source
PolEval 2021 OCR | document-ocr | 69,000 | 2021 | paper
IMPACT-PSNC | document-ocr | 478 | 2012 | paper
reVISION | ocr-capabilities | | 2025 | paper
Polish EMNIST Extension | handwriting-recognition | | 2020 |
CodeSOTA Polish | document-ocr | 1,000 | 2025 | dataset
Fig 3 · Each row carries its metric direction via the linked source. PolEval 2021 is post-correction, not raw OCR; scores are not comparable to IMPACT-PSNC.
§ 05 · Methodology

What CER does and does not say.

Character Error Rate is the Levenshtein edit distance between the OCR output and a known reference, divided by the length of the reference. 2.1% means roughly two errors per 100 characters; it does not say where those errors are. On Polish documents, a stripped ogonek counts as a single character error but often produces a different word.
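The definition can be sketched in a few lines. This is a minimal reference implementation under our own assumptions, not the scoring code behind any of the panels: the function names are hypothetical, and normalising both strings to Unicode NFC is our choice, made so that a precomposed ą and a decomposed a plus combining ogonek are not scored as a mismatch.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character edits turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: compare NFC-normalised text so encoding differences
    # are not counted as recognition errors.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One stripped ogonek in a three-letter word: one error in three characters.
    print(cer("kąt", "kat"))
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many spurious insertions.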

Our CodeSOTA Polish panel deliberately tests both ends. The Wikipedia and Real Corpus categories let a model lean on a Polish language prior — a dictionary, n-gram statistics, fine-tuning on Polish text. The Synth Random and Synth Words categories deny that prior; the model has only the image.

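The gap between the two is the signal we track: for the Tesseract baseline, Synth Words CER is about seven times Real Corpus CER and ten times Wikipedia CER. A model that reads characters, not the dictionary, is the one that will hold up on the next out-of-distribution document you hand it.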
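The gap can be computed directly from the per-category figures in Fig 2. The sketch below is illustrative arithmetic only; the helper names are hypothetical, and the unweighted mean is our own check rather than a documented scoring rule (it happens to reproduce the published overall figure).

```python
# Per-category CER (%) for the Tesseract 5.5.1 baseline, copied from Fig 2.
BASELINE_CER = {
    "Wikipedia": 5.2,
    "Real Corpus": 7.3,
    "Synth Random": 40.6,
    "Synth Words": 52.1,
}


def gap(cer_by_category, with_prior, without_prior):
    """Ratio of CER without a language prior to CER with one."""
    return cer_by_category[without_prior] / cer_by_category[with_prior]


def unweighted_mean(cer_by_category):
    """Plain average over categories, ignoring category sizes."""
    return sum(cer_by_category.values()) / len(cer_by_category)


if __name__ == "__main__":
    print(round(gap(BASELINE_CER, "Real Corpus", "Synth Words"), 1))  # 7.1
    print(round(gap(BASELINE_CER, "Wikipedia", "Synth Words"), 1))    # 10.0
    print(round(unweighted_mean(BASELINE_CER), 1))                    # 26.3
```

A ratio close to 1 would indicate a model whose accuracy does not depend on recognising familiar words.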

Related

Neighbouring registers.

Cross-links to the rest of Codesota.

OCR · register: General-purpose document OCR leaderboards.
Polish LLMs: Bielik, PLLuM and the five Polish-language panels.
CPTU-Bench: Complex Polish text understanding.
Methodology: How scores are admitted and retracted.