CPTU-Bench evaluates how well LLMs understand Polish text across four dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Each category is scored on a scale from 0 to 5. Created by SpeakLeash/Spichlerz.
Tricky Questions is one of the evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score (0-5) | Year |
|---|---|---|---|---|
| 01 | Qwen/Qwen3.5-35B-A3B thinking (API) | verified | 4.70 | 2026 |
| 02 | Qwen/Qwen3.5-27B thinking (API) | verified | 4.61 | 2026 |
| 03 | Qwen/Qwen3.5-27B non-thinking (API) | verified | 4.43 | 2026 |
| 04 | deepseek-ai/DeepSeek-V3.2 (API) | verified | 4.20 | 2026 |
| 05 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | verified | 4.19 | 2026 |
| 06 | deepseek-ai/DeepSeek-R1 (API) | verified | 4.12 | 2026 |
| 07 | deepseek-ai/DeepSeek-V3-0324 (API) | verified | 4.02 | 2026 |
| 08 | gemini-2.0-flash-001 | verified | 3.99 | 2026 |
| 09 | deepseek-ai/DeepSeek-V3 (API) | verified | 3.99 | 2026 |
| 10 | moonshotai/Kimi-K2-Instruct-0905 (API) | verified | 3.93 | 2026 |
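The ordering in the table can be reproduced with a short sketch. The scores below are copied from the Tricky Questions table; the function name `rank_models` and the tie-handling rule (tied models keep their listed order) are assumptions made for illustration, not a description of how Codesota or CPTU-Bench actually builds the ranking.

```python
# Tricky Questions scores copied from the table above (0-5 scale, higher is better).
SCORES = {
    "Qwen/Qwen3.5-35B-A3B thinking (API)": 4.70,
    "Qwen/Qwen3.5-27B thinking (API)": 4.61,
    "Qwen/Qwen3.5-27B non-thinking (API)": 4.43,
    "deepseek-ai/DeepSeek-V3.2 (API)": 4.20,
    "Qwen/Qwen3.5-35B-A3B non-thinking (API)": 4.19,
    "deepseek-ai/DeepSeek-R1 (API)": 4.12,
    "deepseek-ai/DeepSeek-V3-0324 (API)": 4.02,
    "gemini-2.0-flash-001": 3.99,
    "deepseek-ai/DeepSeek-V3 (API)": 3.99,
    "moonshotai/Kimi-K2-Instruct-0905 (API)": 3.93,
}


def rank_models(scores, low=0.0, high=5.0):
    """Return (model, score) pairs sorted best-first.

    Hypothetical helper: rejects scores outside the 0-5 range and relies on
    Python's stable sort, so tied models keep their input order (an assumed
    tie-breaking rule).
    """
    for model, score in scores.items():
        if not low <= score <= high:
            raise ValueError(f"{model}: score {score} is outside [{low}, {high}]")
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(SCORES)
for position, (model, score) in enumerate(ranking, start=1):
    print(f"{position:02d}  {score:.2f}  {model}")
```

Because the sort is stable, the two models tied at 3.99 stay in the order they were listed, which matches ranks 08 and 09 in the table.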
Sentiment, Language Understanding, Phraseology, and Average are the other evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on each of these metrics so readers can compare state-of-the-art results across sources and model families.
Higher is better