Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
CPTU-Bench tests whether AI models truly understand Polish — not just translate it. 378 hand-written examples cover sarcasm, idioms, logical traps, and absurd questions. GPT-4o judges each answer on a 0-5 scale. The best global models score ~4.3/5. The best Polish-native model (Bielik-11B-v3.0) scores 3.74 — closing the gap fast.
Examples
Input
"Wspaniale, że znowu muszę zostawać po godzinach, bo ktoś nie dostarczył raportu na czas. Naprawdę uwielbiam te niespodzianki." (Great, I have to stay overtime again because someone didn't deliver the report on time. I really love these surprises.)
Output
Sentiment: NEGATIVE (despite "wspaniale" and "uwielbiam" — positive words used sarcastically). Author's intent: frustration with a colleague's failure, passive-aggressive complaint about work culture.
Most models get the sentiment wrong — they read "wspaniale" (wonderful) as positive. Score 5 requires detecting sarcasm AND explaining the underlying complaint.
Input
"Jan wziął nogi za pas, kiedy zobaczył, że szef ma minę jak siedem nieszczęść. Widać było, że ktoś narobił bigosu." (Jan took his legs behind his belt [= ran away] when he saw the boss had a face like seven misfortunes [= looked miserable]. It was clear someone made a mess [lit: made bigos].)
Output
Must identify 3 idioms: (1) wziąć nogi za pas = to flee, (2) mina jak siedem nieszczęść = looking very unhappy, (3) narobić bigosu = to cause a mess. Penalty for listing non-existent idioms.
Score 0 if model translates literally. Score 5 requires identifying all 3 idioms with correct contextual meaning. Points deducted for phantom idioms.
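The deduction scheme can be sketched as a set comparison. This is an illustrative rubric, not the benchmark's published one: the proportional credit and the one-point penalty per phantom idiom are assumptions.

```python
def score_phraseology(predicted: set[str], reference: set[str]) -> float:
    """Score idiom identification on a 0-5 scale (illustrative rubric).

    Correctly identified idioms earn proportional credit; "phantom"
    idioms (listed but not actually in the text) are penalized.
    """
    if not reference:
        return 0.0
    hits = predicted & reference          # correctly identified idioms
    phantoms = predicted - reference      # non-existent / spammed idioms
    raw = 5.0 * len(hits) / len(reference) - 1.0 * len(phantoms)
    return max(0.0, min(5.0, raw))       # clamp to the 0-5 scale

# Jan example: all 3 idioms found, but one phantom idiom also listed
ref = {"wziąć nogi za pas", "mina jak siedem nieszczęść", "narobić bigosu"}
pred = ref | {"rzucać grochem o ścianę"}
print(score_phraseology(pred, ref))  # 4.0
```

Under this sketch, pure literal translation (no idioms found) scores 0, a clean identification of all three scores 5, and phrase spamming erodes an otherwise perfect answer.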
Input
"W którym roku doszło do bitwy morskiej między Watykanem a Republiką Południowej Afryki na wodach Morza Kaspijskiego?" (In what year did a naval battle between the Vatican and South Africa take place in the Caspian Sea?)
Output
GOOD: "This event never happened. The Vatican has no navy. South Africa has no access to the Caspian Sea. The question describes an impossible scenario." BAD: "The battle took place in 1943..." (hallucinated)
Score 5 = refuses to answer, explains WHY it's impossible. Score 0 = confidently hallucinates a date, location, and historical context. No system prompt given — model must reason independently.
Input
"Ile kilogramów cukru trzeba dodać do herbaty, żeby mieć słodki głos?" (How many kilograms of sugar do you need to add to tea to have a sweet voice?)
Output
GOOD: "A 'sweet voice' is a metaphor — it cannot be achieved by consuming sugar. The question conflates a literal substance with a figurative quality." BAD: "About 2-3 tablespoons should suffice for a sweet taste."
Tests literal vs figurative language understanding. Weak models treat it as a cooking question.
Why Polish is hard for LLMs
Polish has 7 grammatical cases, complex conjugation, and a rich tradition of idioms that change meaning based on context. Most LLM training data is English-dominant. CPTU-Bench specifically tests whether models have internalized Polish linguistic patterns or are just translating from English understanding.
What's Tested
Sentiment
Detect positive/negative sentiment in texts with sarcasm and hidden meaning. Neutral texts are deliberately excluded to avoid ambiguity.
Language Understanding
Comprehend the author's true intent behind sarcastic, ironic, or implied statements. Compared against human reference explanations.
Phraseology
Identify Polish idioms and phraseological compounds, explain their meaning in context. Points deducted for listing non-existent phrases.
Tricky Questions
Handle logical riddles, semantic absurdity, impossible events, and humor. Must resist hallucination and recognize when a question has no valid answer.
How Polish Text Understanding Works
378 hand-written examples
200 implicature texts (sarcasm, idioms, intent) + 178 tricky questions (logic, absurdity, ambiguity). All in Polish, all written by humans — no synthetic data.
Model receives minimal prompt
For implicatures: role as careful linguist + 2 diverse examples + target text. For tricky questions: just the question, no system prompt, no hints.
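A minimal sketch of the two prompt shapes, assuming a standard chat-message format; the Polish system wording and the few-shot contents are placeholders, not the benchmark's actual prompts.

```python
# Assumed wording: "You are a careful linguist. Analyze the text below:
# determine the sentiment and the author's true intent."
IMPLICATURE_SYSTEM = (
    "Jesteś uważnym lingwistą. Przeanalizuj poniższy tekst: określ "
    "sentyment oraz prawdziwą intencję autora."
)

# Two diverse few-shot examples precede the target text (contents elided)
FEW_SHOT = [
    {"text": "...", "analysis": "..."},
    {"text": "...", "analysis": "..."},
]

def build_implicature_prompt(target_text: str) -> list[dict]:
    """Linguist role + 2 worked examples + the target text."""
    messages = [{"role": "system", "content": IMPLICATURE_SYSTEM}]
    for ex in FEW_SHOT:
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": ex["analysis"]})
    messages.append({"role": "user", "content": target_text})
    return messages

def build_tricky_prompt(question: str) -> list[dict]:
    """No system prompt, no hints: the bare question only."""
    return [{"role": "user", "content": question}]
```

The asymmetry is deliberate: implicature items get a scaffold, while tricky questions arrive with nothing, so any refusal to hallucinate must come from the model itself.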
GPT-4o judges the output
Structured JSON evaluation with think/mark fields. The judge compares against human reference answers. Separate scoring for sentiment, understanding, phraseology, and trick detection.
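Assuming the judge returns a JSON object with the think/mark fields mentioned above (a free-text rationale plus an integer score), parsing might look like:

```python
import json

def parse_judge_output(raw: str) -> int:
    """Extract the 0-5 mark from the judge's structured JSON reply.

    Assumed shape: {"think": "<judge's reasoning>", "mark": <int 0-5>}.
    """
    reply = json.loads(raw)
    mark = int(reply["mark"])
    if not 0 <= mark <= 5:
        raise ValueError(f"mark out of range: {mark}")
    return mark

# Hypothetical judge reply: "The model detected the sarcasm and
# explained the complaint."
raw = '{"think": "Model wykrył sarkazm i wyjaśnił skargę.", "mark": 5}'
print(parse_judge_output(raw))  # 5
```

Requiring the `think` field before the `mark` nudges the judge to justify the score before committing to it, and the range check catches malformed replies instead of silently averaging them in.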
Score 0-5 per dimension
Not percentage-based: 5 = perfect understanding. Penalties apply for phrase spamming and hallucinated answers. The headline score is the average across the 4 categories.
Key Challenges
Polish has rich morphology (7 cases, 3 genders) — idioms change form based on grammatical context, making pattern matching insufficient.
Sarcasm and implicature require cultural context that English-centric training data doesn't cover well.
Tricky questions are designed to trigger hallucination — the model must know when to say "this doesn't make sense" instead of fabricating answers.
Phraseological compounds have meanings that can't be derived from individual words — "wziąć nogi za pas" (take legs behind belt = to run away).
No system prompt for tricky questions means the model can't rely on safety instructions — it must have genuine reasoning ability.
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Ranking documents and passages by relevance to a query, the backbone of search engines and RAG pipelines; MS MARCO and BEIR are the standard benchmarks, with zero-shot transfer to new domains as the key test.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.