First, the nine letters with diacritics: ą, ć, ę, ł, ń, ó, ś, ź, ż. Standard OCR engines trained on English corpora routinely confuse ą with a, ł with l, and ó with o. A single missed ogonek can change the meaning of a word: sąd means "court", while sad means "orchard".
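To make the confusion concrete, here is a minimal Python sketch that simulates the error by mapping each of the nine letters to the ASCII letter an English-trained engine is likely to emit. The `CONFUSIONS` table and the function name are illustrative assumptions, not part of any OCR library.

```python
# The nine Polish letters with diacritics, each mapped to the base letter
# an English-trained OCR engine is assumed to output instead.
CONFUSIONS = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}

# Build one translation table covering lowercase and uppercase forms.
_TABLE = str.maketrans(
    {**CONFUSIONS, **{k.upper(): v.upper() for k, v in CONFUSIONS.items()}}
)


def strip_polish_diacritics(text: str) -> str:
    """Simulate the OCR error: replace each diacritic letter with its base letter."""
    return text.translate(_TABLE)


# Distinct words that collapse into one another once diacritics are lost.
pairs = [("łaska", "laska"), ("sąd", "sad"), ("kąt", "kat")]
for with_diacritic, without in pairs:
    assert strip_polish_diacritics(with_diacritic) == without
```

Each pair is two genuine Polish words (grace and cane, court and orchard, corner and executioner), so a spell checker alone cannot flag the damaged form as an error.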
Second, the historical stock. Polish documents from 1791 to the early twentieth century frequently use blackletter (Gothic) typefaces such as Fraktur. Most modern OCR models are trained on modern fonts, so historical print requires either a dedicated model or a post-correction step.
Third, the PolEval 2021 task is deliberately not raw OCR but OCR post-correction. Given noisy OCR output, a Polish NLP model must recover the correct text. This rewards language understanding, not vision.
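The shape of the post-correction problem can be sketched in a few lines of Python. This is a toy, context-free baseline under stated assumptions, not the PolEval reference method: `VARIANTS`, `LEXICON`, and its frequency counts are hypothetical and invented for illustration, and only lowercase diacritic restoration is handled.

```python
from itertools import product

# Base letter -> letters the original may have contained (itself plus variants).
VARIANTS = {
    "a": "aą", "c": "cć", "e": "eę", "l": "lł", "n": "nń",
    "o": "oó", "s": "sś", "z": "zźż",
}

# Hypothetical toy lexicon with made-up frequency counts (illustration only).
LEXICON = {"żółw": 120, "sąd": 900, "sad": 300, "jest": 5000}


def candidates(token: str):
    """Yield every spelling obtained by optionally restoring diacritics."""
    options = [VARIANTS.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def correct_token(token: str) -> str:
    """Return the most frequent lexicon word among the candidates, else the token."""
    known = [c for c in candidates(token) if c in LEXICON]
    return max(known, key=LEXICON.get) if known else token


def correct(text: str) -> str:
    """Correct a whitespace-separated string token by token."""
    return " ".join(correct_token(t) for t in text.split())


print(correct_token("zolw"))  # żółw
print(correct_token("sad"))   # sąd
```

The second call shows the limit of this approach: with these counts, every "sad" becomes "sąd", including cases where the orchard was meant. Choosing correctly requires the surrounding sentence, which is why the task favors models with real language understanding. Note also that the candidate set grows exponentially with the number of ambiguous letters, so a practical system would prune or score candidates incrementally.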