Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
CPTU-Bench tests whether AI models truly understand Polish — not just translate it. 378 hand-written examples cover sarcasm, idioms, logical traps, and absurd questions. GPT-4o judges each answer on a 0-5 scale. The best global models score ~4.3/5. The best Polish-native model (Bielik-11B-v3.0) scores 3.74 — closing the gap fast.
Examples
Input
"Wspaniale, że znowu muszę zostawać po godzinach, bo ktoś nie dostarczył raportu na czas. Naprawdę uwielbiam te niespodzianki." (Great, I have to stay overtime again because someone didn't deliver the report on time. I really love these surprises.)
Output
Sentiment: NEGATIVE (despite "wspaniale" and "uwielbiam" — positive words used sarcastically). Author's intent: frustration with a colleague's failure, passive-aggressive complaint about work culture.
Most models get the sentiment wrong — they read "wspaniale" (wonderful) as positive. Score 5 requires detecting sarcasm AND explaining the underlying complaint.
Input
"Jan wziął nogi za pas, kiedy zobaczył, że szef ma minę jak siedem nieszczęść. Widać było, że ktoś narobił bigosu." (Jan took his legs behind his belt [= ran away] when he saw the boss had a face like seven misfortunes [= looked miserable]. It was clear someone made a mess [lit: made bigos].)
Output
Must identify 3 idioms: (1) wziąć nogi za pas = to flee, (2) mina jak siedem nieszczęść = looking very unhappy, (3) narobić bigosu = to cause a mess. Penalty for listing non-existent idioms.
Score 0 if model translates literally. Score 5 requires identifying all 3 idioms with correct contextual meaning. Points deducted for phantom idioms.
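The deduction scheme can be sketched as a set comparison. This is an illustrative rubric, not the benchmark's published one: the proportional credit and the one-point penalty per phantom idiom are assumptions.

```python
def score_phraseology(predicted: set[str], reference: set[str]) -> float:
    """Score idiom identification on a 0-5 scale (illustrative rubric).

    Correctly identified idioms earn proportional credit; "phantom"
    idioms (listed but not actually in the text) are penalized.
    """
    if not reference:
        return 0.0
    hits = predicted & reference          # correctly identified idioms
    phantoms = predicted - reference      # non-existent / spammed idioms
    raw = 5.0 * len(hits) / len(reference) - 1.0 * len(phantoms)
    return max(0.0, min(5.0, raw))       # clamp to the 0-5 scale

# Jan example: all 3 idioms found, but one phantom idiom also listed
ref = {"wziąć nogi za pas", "mina jak siedem nieszczęść", "narobić bigosu"}
pred = ref | {"rzucać grochem o ścianę"}
print(score_phraseology(pred, ref))  # 4.0
```

Under this sketch, pure literal translation (no idioms found) scores 0, a clean identification of all three scores 5, and phrase spamming erodes an otherwise perfect answer.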
Input
"W którym roku doszło do bitwy morskiej między Watykanem a Republiką Południowej Afryki na wodach Morza Kaspijskiego?" (In what year did a naval battle between the Vatican and South Africa take place in the Caspian Sea?)
Output
GOOD: "This event never happened. The Vatican has no navy. South Africa has no access to the Caspian Sea. The question describes an impossible scenario." BAD: "The battle took place in 1943..." (hallucinated)
Score 5 = refuses to answer, explains WHY it's impossible. Score 0 = confidently hallucinates a date, location, and historical context. No system prompt given — model must reason independently.
Input
"Ile kilogramów cukru trzeba dodać do herbaty, żeby mieć słodki głos?" (How many kilograms of sugar do you need to add to tea to have a sweet voice?)
Output
GOOD: "A 'sweet voice' is a metaphor — it cannot be achieved by consuming sugar. The question conflates a literal substance with a figurative quality." BAD: "About 2-3 tablespoons should suffice for a sweet taste."
Tests literal vs figurative language understanding. Weak models treat it as a cooking question.
Why Polish is hard for LLMs
Polish has 7 grammatical cases, complex conjugation, and a rich tradition of idioms that change meaning based on context. Most LLM training data is English-dominant. CPTU-Bench specifically tests whether models have internalized Polish linguistic patterns or are just translating from English understanding.
What's Tested
Sentiment
Detect positive/negative sentiment in texts with sarcasm and hidden meaning. Neutral texts are deliberately excluded to avoid ambiguity.
Language Understanding
Comprehend the author's true intent behind sarcastic, ironic, or implied statements. Compared against human reference explanations.
Phraseology
Identify Polish idioms and phraseological compounds, explain their meaning in context. Points deducted for listing non-existent phrases.
Tricky Questions
Handle logical riddles, semantic absurdity, impossible events, and humor. Must resist hallucination and recognize when a question has no valid answer.
How Polish Text Understanding Works
378 hand-written examples
200 implicature texts (sarcasm, idioms, intent) + 178 tricky questions (logic, absurdity, ambiguity). All in Polish, all written by humans — no synthetic data.
Model receives minimal prompt
For implicatures: role as careful linguist + 2 diverse examples + target text. For tricky questions: just the question, no system prompt, no hints.
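A minimal sketch of the two prompt shapes, assuming a standard chat-message format; the Polish system wording and the few-shot contents are placeholders, not the benchmark's actual prompts.

```python
# Assumed wording: "You are a careful linguist. Analyze the text below:
# determine the sentiment and the author's true intent."
IMPLICATURE_SYSTEM = (
    "Jesteś uważnym lingwistą. Przeanalizuj poniższy tekst: określ "
    "sentyment oraz prawdziwą intencję autora."
)

# Two diverse few-shot examples precede the target text (contents elided)
FEW_SHOT = [
    {"text": "...", "analysis": "..."},
    {"text": "...", "analysis": "..."},
]

def build_implicature_prompt(target_text: str) -> list[dict]:
    """Linguist role + 2 worked examples + the target text."""
    messages = [{"role": "system", "content": IMPLICATURE_SYSTEM}]
    for ex in FEW_SHOT:
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": ex["analysis"]})
    messages.append({"role": "user", "content": target_text})
    return messages

def build_tricky_prompt(question: str) -> list[dict]:
    """No system prompt, no hints: the bare question only."""
    return [{"role": "user", "content": question}]
```

The asymmetry is deliberate: implicature items get a scaffold, while tricky questions arrive with nothing, so any refusal to hallucinate must come from the model itself.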
GPT-4o judges the output
Structured JSON evaluation with think/mark fields. The judge compares against human reference answers. Separate scoring for sentiment, understanding, phraseology, and trick detection.
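Assuming the judge returns a JSON object with the think/mark fields mentioned above (a free-text rationale plus an integer score), parsing might look like:

```python
import json

def parse_judge_output(raw: str) -> int:
    """Extract the 0-5 mark from the judge's structured JSON reply.

    Assumed shape: {"think": "<judge's reasoning>", "mark": <int 0-5>}.
    """
    reply = json.loads(raw)
    mark = int(reply["mark"])
    if not 0 <= mark <= 5:
        raise ValueError(f"mark out of range: {mark}")
    return mark

# Hypothetical judge reply: "The model detected the sarcasm and
# explained the complaint."
raw = '{"think": "Model wykrył sarkazm i wyjaśnił skargę.", "mark": 5}'
print(parse_judge_output(raw))  # 5
```

Requiring the `think` field before the `mark` nudges the judge to justify the score before committing to it, and the range check catches malformed replies instead of silently averaging them in.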
Score 0-5 per dimension
Not percentage-based: 5 = perfect understanding. Penalties apply for phrase spamming and hallucinated answers. The headline score is the average across the 4 categories.
Key Challenges
Polish has rich morphology (7 cases, 3 genders) — idioms change form based on grammatical context, making pattern matching insufficient.
Sarcasm and implicature require cultural context that English-centric training data doesn't cover well.
Tricky questions are designed to trigger hallucination — the model must know when to say "this doesn't make sense" instead of fabricating answers.
Phraseological compounds have meanings that can't be derived from individual words — "wziąć nogi za pas" (take legs behind belt = to run away).
No system prompt for tricky questions means the model can't rely on safety instructions — it must have genuine reasoning ability.
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Ranking documents and passages by relevance to a query, the backbone of search engines and RAG pipelines; MS MARCO and BEIR are the standard benchmarks, with zero-shot transfer to new domains as the key test.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.