CPTU-Bench.

Evaluates LLMs on understanding Polish text across four dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Each category is scored from 0 to 5. Created by SpeakLeash/Spichlerz.

§ 01 · Leaderboard

Results by metric.

Tricky Questions

Tricky Questions is one of the reported evaluation metrics for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Tricky Questions: verified, paper, vendor, community, unverified

Sentiment

Sentiment is one of the reported evaluation metrics for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Sentiment: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
| --- | --- | --- | --- | --- |
| 01 | gemini-2.0-flash-001 | verified | 4.52 | 2026 |
| 02 | deepseek-ai/DeepSeek-R1 (API) | verified | 4.49 | 2026 |
| 03 | deepseek-ai/DeepSeek-V3.2 (API) | verified | 4.46 | 2026 |
| 04 | deepseek-ai/DeepSeek-V3.1 (API) | verified | 4.42 | 2026 |
| 05 | Qwen/Qwen3.5-27B thinking (API) | verified | 4.42 | 2026 |
| 06 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) | verified | 4.39 | 2026 |
| 07 | moonshotai/Kimi-K2-Instruct-0905 (API) | verified | 4.39 | 2026 |
| 08 | CYFRAGOVPL/pllum-12b-nc-chat-250715 | verified | 4.36 | 2026 |
| 09 | deepseek-ai/DeepSeek-V3 (API) | verified | 4.36 | 2026 |
| 10 | deepseek-ai/DeepSeek-V3-0324 (API) | verified | 4.36 | 2026 |
| 11 | mistralai/Mistral-Large-Instruct-2411 | verified | 4.33 | 2026 |
| 12 | meta-llama/Meta-Llama-3.1-70B-Instruct | verified | 4.33 | 2026 |
| 13 | meta-llama/Llama-3.3-70B-Instruct | verified | 4.29 | 2026 |
| 14 | Qwen/Qwen3.5-27B non-thinking (API) | verified | 4.29 | 2026 |
| 15 | gemini-2.0-flash-lite-001 | verified | 4.23 | 2026 |
| 16 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | verified | 4.23 | 2026 |
| 17 | mistralai/Mistral-Large-Instruct-2407 | verified | 4.23 | 2026 |
| 18 | Qwen/Qwen3-235B-A22B non-thinking (API) | verified | 4.17 | 2026 |
| 19 | Qwen/Qwen3-32B non-thinking (API) | verified | 4.13 | 2026 |
| 20 | meta-llama/Meta-Llama-3-70B-Instruct | verified | 4.13 | 2026 |
| 21 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) | verified | 4.13 | 2026 |
| 22 | speakleash/Bielik-11B-v2.6-Instruct | verified | 4.10 | 2026 |
| 23 | Qwen/Qwen3.5-35B-A3B thinking (API) | verified | 4.10 | 2026 |
| 24 | meta-llama/Llama-4-Scout-17B-16E-Instruct (API) | verified | 4.10 | 2026 |
| 25 | Qwen/Qwen2.5-72B-Instruct | verified | 4.08 | 2026 |
| 26 | speakleash/Bielik-11B-v2.5-Instruct | verified | 4.01 | 2026 |
| 27 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) | verified | 4.01 | 2026 |
| 28 | speakleash/Bielik-11B-v2.3-Instruct | verified | 3.97 | 2026 |
| 29 | meta-llama/Meta-Llama-3.1-8B-Instruct | verified | 3.97 | 2026 |
| 30 | speakleash/Bielik-11B-v2.0-Instruct | verified | 3.97 | 2026 |
| 31 | speakleash/Bielik-11B-v2.1-Instruct | verified | 3.96 | 2026 |
| 32 | openai/gpt-oss-120b (API) | verified | 3.94 | 2026 |
| 33 | CYFRAGOVPL/Llama-PLLuM-70B-chat | verified | 3.94 | 2026 |
| 34 | mistralai/Mistral-Small-24B-Instruct-2501 | verified | 3.91 | 2026 |
| 35 | CYFRAGOVPL/pllum-12b-nc-instruct-250715 | verified | 3.91 | 2026 |
| 36 | Qwen/Qwen2.5-14B-Instruct | verified | 3.91 | 2026 |
| 37 | Qwen/Qwen3-14B non-thinking (API) | verified | 3.91 | 2026 |
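
The ranking above follows the "higher is better" rule, and several models share the same score. As a rough illustration, the sketch below orders a handful of Sentiment rows by score, highest first. The `Row` type and `rank_rows` function are hypothetical names introduced here, and the tie-breaking rule (tied rows keep their input order) is an assumption, since the page does not state how Codesota orders ties.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure, not Codesota's schema)."""
    model: str
    score: float


def rank_rows(rows: list[Row]) -> list[tuple[int, str, float]]:
    """Return (rank, model, score) tuples, highest score first.

    Python's sort is stable, also with reverse=True, so rows with equal
    scores keep their input order. That tie rule is an assumption.
    """
    ordered = sorted(rows, key=lambda r: r.score, reverse=True)
    return [(rank, r.model, r.score) for rank, r in enumerate(ordered, start=1)]


# Four rows taken from the Sentiment results, deliberately out of order.
SAMPLE_ROWS = [
    Row("deepseek-ai/DeepSeek-V3.1 (API)", 4.42),
    Row("gemini-2.0-flash-001", 4.52),
    Row("Qwen/Qwen3.5-27B thinking (API)", 4.42),
    Row("deepseek-ai/DeepSeek-R1 (API)", 4.49),
]

if __name__ == "__main__":
    for rank, model, score in rank_rows(SAMPLE_ROWS):
        print(f"{rank:02d}  {model}  {score:.2f}")
```

Sequential ranks are used here because the leaderboard itself numbers tied models consecutively (for example, ranks 04 and 05 both score 4.42).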

Language Understanding

Language Understanding is one of the reported evaluation metrics for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Language Understanding: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
| --- | --- | --- | --- | --- |
| 01 | deepseek-ai/DeepSeek-V3.2 (API) | verified | 4.36 | 2026 |
| 02 | deepseek-ai/DeepSeek-R1 (API) | verified | 4.34 | 2026 |
| 03 | deepseek-ai/DeepSeek-V3.1 (API) | verified | 4.33 | 2026 |
| 04 | gemini-2.0-flash-001 | verified | 4.32 | 2026 |
| 05 | deepseek-ai/DeepSeek-V3 (API) | verified | 4.22 | 2026 |
| 06 | Qwen/Qwen3.5-27B thinking (API) | verified | 4.21 | 2026 |
| 07 | deepseek-ai/DeepSeek-V3-0324 (API) | verified | 4.20 | 2026 |
| 08 | moonshotai/Kimi-K2-Instruct-0905 (API) | verified | 4.18 | 2026 |
| 09 | Qwen/Qwen3.5-27B non-thinking (API) | verified | 4.17 | 2026 |
| 10 | Qwen/Qwen3-235B-A22B non-thinking (API) | verified | 4.16 | 2026 |
| 11 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) | verified | 4.11 | 2026 |
| 12 | gemini-2.0-flash-lite-001 | verified | 4.05 | 2026 |
| 13 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | verified | 4.05 | 2026 |
| 14 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) | verified | 4.00 | 2026 |
| 15 | mistralai/Mistral-Large-Instruct-2407 | verified | 4.00 | 2026 |
| 16 | mistralai/Mistral-Large-Instruct-2411 | verified | 3.98 | 2026 |
| 17 | openai/gpt-oss-120b (API) | verified | 3.97 | 2026 |
| 18 | Qwen/Qwen2.5-72B-Instruct | verified | 3.97 | 2026 |
| 19 | CYFRAGOVPL/pllum-12b-nc-chat-250715 | verified | 3.96 | 2026 |
| 20 | speakleash/Bielik-11B-v2.6-Instruct | verified | 3.94 | 2026 |
| 21 | Qwen/Qwen3.5-35B-A3B thinking (API) | verified | 3.94 | 2026 |
| 22 | speakleash/Bielik-11B-v2.1-Instruct | verified | 3.92 | 2026 |

Phraseology

Phraseology is one of the reported evaluation metrics for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Phraseology: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
| --- | --- | --- | --- | --- |
| 01 | gemini-2.0-flash-001 | verified | 4.34 | 2026 |
| 02 | gemini-2.0-flash-lite-001 | verified | 4.24 | 2026 |
| 03 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | verified | 4.23 | 2026 |
| 04 | alpindale/WizardLM-2-8x22B (API) | verified | 4.22 | 2026 |
| 05 | Qwen/Qwen3.5-27B non-thinking (API) | verified | 4.20 | 2026 |
| 06 | mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8) | verified | 4.15 | 2026 |
| 07 | Qwen/Qwen3.5-35B-A3B thinking (API) | verified | 4.15 | 2026 |
| 08 | Qwen/Qwen3.5-27B thinking (API) | verified | 4.11 | 2026 |
| 09 | Qwen/Qwen2.5-32B-Instruct | verified | 4.04 | 2026 |
| 10 | google/gemma-3-27b-it (API) | verified | 4.03 | 2026 |
| 11 | mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8) | verified | 4.00 | 2026 |
| 12 | mistralai/Mistral-Large-Instruct-2411 | verified | 3.99 | 2026 |
| 13 | speakleash/Bielik-11B-v3.0-Instruct | verified | 3.96 | 2026 |
| 14 | Qwen/Qwen2.5-72B-Instruct | verified | 3.93 | 2026 |

Average

Average is one of the reported evaluation metrics for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Average: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
| --- | --- | --- | --- | --- |
| 01 | Qwen/Qwen3.5-27B thinking (API) | verified | 4.34 | 2026 |
| 02 | gemini-2.0-flash-001 | verified | 4.29 | 2026 |
| 03 | Qwen/Qwen3.5-27B non-thinking (API) | verified | 4.27 | 2026 |
| 04 | Qwen/Qwen3.5-35B-A3B thinking (API) | verified | 4.22 | 2026 |
| 05 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | verified | 4.18 | 2026 |
| 06 | deepseek-ai/DeepSeek-V3.2 (API) | verified | 4.14 | 2026 |
| 07 | deepseek-ai/DeepSeek-R1 (API) | verified | 4.14 | 2026 |
| 08 | gemini-2.0-flash-lite-001 | verified | 4.09 | 2026 |
| 09 | deepseek-ai/DeepSeek-V3-0324 (API) | verified | 4.03 | 2026 |
| 10 | deepseek-ai/DeepSeek-V3.1 (API) | verified | 4.03 | 2026 |
| 11 | deepseek-ai/DeepSeek-V3 (API) | verified | 4.02 | 2026 |
| 12 | mistralai/Mistral-Large-Instruct-2411 | verified | 4.00 | 2026 |
| 13 | moonshotai/Kimi-K2-Instruct-0905 (API) | verified | 3.98 | 2026 |
| 14 | Qwen/Qwen2.5-72B-Instruct | verified | 3.95 | 2026 |
| 15 | mistralai/Mistral-Large-Instruct-2407 | verified | 3.93 | 2026 |
| 16 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) | verified | 3.93 | 2026 |
| 17 | Qwen/Qwen3-235B-A22B non-thinking (API) | verified | 3.91 | 2026 |
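
The page does not say how the Average metric is derived. One plausible reading is an unweighted mean of the four category scores, each on the 0-5 scale. The sketch below shows that reading only; the `CATEGORIES` keys, the `average_score` function, and the example values are all hypothetical and are not taken from the benchmark's own code or results.

```python
from statistics import fmean

# Hypothetical keys for the four CPTU-Bench categories named on this page.
CATEGORIES = (
    "sentiment",
    "language_understanding",
    "phraseology",
    "tricky_questions",
)


def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the four category scores, rounded to 2 decimals.

    Assumption: equal weighting. The real Average may be computed
    differently (for example, per question rather than per category).
    """
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for category in CATEGORIES:
        if not 0.0 <= scores[category] <= 5.0:
            raise ValueError(f"{category} score must be within 0-5")
    return round(fmean(scores[c] for c in CATEGORIES), 2)


if __name__ == "__main__":
    # Illustrative values only, not real leaderboard data.
    example = {
        "sentiment": 4.0,
        "language_understanding": 3.5,
        "phraseology": 4.5,
        "tricky_questions": 4.0,
    }
    print(average_score(example))
```

The range check mirrors the 0-5 score range stated in the benchmark description, so an out-of-range or missing category fails loudly instead of skewing the mean.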