CPTU-Bench.

CPTU-Bench evaluates LLMs on understanding Polish text across four dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Each category is scored from 0 to 5. Created by SpeakLeash/Spichlerz.

§ 01 · Leaderboard

Results by metric.

Tricky Questions

Tricky Questions is one of the evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Tricky Questions: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year
01	Qwen/Qwen3.5-35B-A3B thinking (API)	verified	4.70	2026
02	Qwen/Qwen3.5-27B thinking (API)	verified	4.61	2026
03	Qwen/Qwen3.5-27B non-thinking (API)	verified	4.43	2026
04	deepseek-ai/DeepSeek-V3.2 (API)	verified	4.20	2026
05	Qwen/Qwen3.5-35B-A3B non-thinking (API)	verified	4.19	2026
06	deepseek-ai/DeepSeek-R1 (API)	verified	4.12	2026
07	deepseek-ai/DeepSeek-V3-0324 (API)	verified	4.02	2026
08	gemini-2.0-flash-001	verified	3.99	2026
09	deepseek-ai/DeepSeek-V3 (API)	verified	3.99	2026
10	moonshotai/Kimi-K2-Instruct-0905 (API)	verified	3.93	2026
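
The muting rule stated above the table can be read as a simple predicate over (score, year) pairs. The sketch below is one literal reading of that sentence, not Codesota's actual implementation; the `Result` class and `is_muted` function are hypothetical names introduced for illustration, and the three rows are copied from the top of the Tricky Questions table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float
    year: int


def is_muted(row: Result, rows: list[Result]) -> bool:
    """Assumed rule: muted if an earlier or same-year result scored strictly higher."""
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
    )


# Top three Tricky Questions rows from the table above.
rows = [
    Result("Qwen/Qwen3.5-35B-A3B thinking (API)", 4.70, 2026),
    Result("Qwen/Qwen3.5-27B thinking (API)", 4.61, 2026),
    Result("Qwen/Qwen3.5-27B non-thinking (API)", 4.43, 2026),
]

muted = [r.model for r in rows if is_muted(r, rows)]
```

Under this reading, every row that shares a year with a higher-scoring row would be muted, so only the top 2026 result would stay unmuted. The site may apply the rule at a finer date granularity than the year shown here.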

Sentiment

Sentiment is one of the evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Sentiment: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	gemini-2.0-flash-001	verified	4.52	2026
02	deepseek-ai/DeepSeek-R1 (API)	verified	4.49	2026
03	deepseek-ai/DeepSeek-V3.2 (API)	verified	4.46	2026
04	deepseek-ai/DeepSeek-V3.1 (API)	verified	4.42	2026
05	Qwen/Qwen3.5-27B thinking (API)	verified	4.42	2026
06	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)	verified	4.39	2026
07	moonshotai/Kimi-K2-Instruct-0905 (API)	verified	4.39	2026
08	CYFRAGOVPL/pllum-12b-nc-chat-250715	verified	4.36	2026
09	deepseek-ai/DeepSeek-V3 (API)	verified	4.36	2026
10	deepseek-ai/DeepSeek-V3-0324 (API)	verified	4.36	2026
11	mistralai/Mistral-Large-Instruct-2411	verified	4.33	2026
12	meta-llama/Meta-Llama-3.1-70B-Instruct	verified	4.33	2026
13	meta-llama/Llama-3.3-70B-Instruct	verified	4.29	2026
14	Qwen/Qwen3.5-27B non-thinking (API)	verified	4.29	2026
15	gemini-2.0-flash-lite-001	verified	4.23	2026
16	Qwen/Qwen3.5-35B-A3B non-thinking (API)	verified	4.23	2026
17	mistralai/Mistral-Large-Instruct-2407	verified	4.23	2026
18	Qwen/Qwen3-235B-A22B non-thinking (API)	verified	4.17	2026
19	Qwen/Qwen3-32B non-thinking (API)	verified	4.13	2026
20	meta-llama/Meta-Llama-3-70B-Instruct	verified	4.13	2026
21	mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8)	verified	4.13	2026
22	speakleash/Bielik-11B-v2.6-Instruct	verified	4.10	2026
23	Qwen/Qwen3.5-35B-A3B thinking (API)	verified	4.10	2026
24	meta-llama/Llama-4-Scout-17B-16E-Instruct (API)	verified	4.10	2026
25	Qwen/Qwen2.5-72B-Instruct	verified	4.08	2026
26	speakleash/Bielik-11B-v2.5-Instruct	verified	4.01	2026
27	mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8)	verified	4.01	2026
28	speakleash/Bielik-11B-v2.3-Instruct	verified	3.97	2026
29	meta-llama/Meta-Llama-3.1-8B-Instruct	verified	3.97	2026
30	speakleash/Bielik-11B-v2.0-Instruct	verified	3.97	2026
31	speakleash/Bielik-11B-v2.1-Instruct	verified	3.96	2026
32	openai/gpt-oss-120b (API)	verified	3.94	2026
33	CYFRAGOVPL/Llama-PLLuM-70B-chat	verified	3.94	2026
34	mistralai/Mistral-Small-24B-Instruct-2501	verified	3.91	2026
35	CYFRAGOVPL/pllum-12b-nc-instruct-250715	verified	3.91	2026
36	Qwen/Qwen2.5-14B-Instruct	verified	3.91	2026
37	Qwen/Qwen3-14B non-thinking (API)	verified	3.91	2026

Language Understanding

Language Understanding is one of the evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Language Understanding: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	deepseek-ai/DeepSeek-V3.2 (API)	verified	4.36	2026
02	deepseek-ai/DeepSeek-R1 (API)	verified	4.34	2026
03	deepseek-ai/DeepSeek-V3.1 (API)	verified	4.33	2026
04	gemini-2.0-flash-001	verified	4.32	2026
05	deepseek-ai/DeepSeek-V3 (API)	verified	4.22	2026
06	Qwen/Qwen3.5-27B thinking (API)	verified	4.21	2026
07	deepseek-ai/DeepSeek-V3-0324 (API)	verified	4.20	2026
08	moonshotai/Kimi-K2-Instruct-0905 (API)	verified	4.18	2026
09	Qwen/Qwen3.5-27B non-thinking (API)	verified	4.17	2026
10	Qwen/Qwen3-235B-A22B non-thinking (API)	verified	4.16	2026
11	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)	verified	4.11	2026
12	gemini-2.0-flash-lite-001	verified	4.05	2026
13	Qwen/Qwen3.5-35B-A3B non-thinking (API)	verified	4.05	2026
14	mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8)	verified	4.00	2026
15	mistralai/Mistral-Large-Instruct-2407	verified	4.00	2026
16	mistralai/Mistral-Large-Instruct-2411	verified	3.98	2026
17	openai/gpt-oss-120b (API)	verified	3.97	2026
18	Qwen/Qwen2.5-72B-Instruct	verified	3.97	2026
19	CYFRAGOVPL/pllum-12b-nc-chat-250715	verified	3.96	2026
20	speakleash/Bielik-11B-v2.6-Instruct	verified	3.94	2026
21	Qwen/Qwen3.5-35B-A3B thinking (API)	verified	3.94	2026
22	speakleash/Bielik-11B-v2.1-Instruct	verified	3.92	2026

Phraseology

Phraseology is one of the evaluation metrics reported for CPTU-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Phraseology: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	gemini-2.0-flash-001	verified	4.34	2026
02	gemini-2.0-flash-lite-001	verified	4.24	2026
03	Qwen/Qwen3.5-35B-A3B non-thinking (API)	verified	4.23	2026
04	alpindale/WizardLM-2-8x22B (API)	verified	4.22	2026
05	Qwen/Qwen3.5-27B non-thinking (API)	verified	4.20	2026
06	mistralai/Mistral-Small-3.1-24B-Instruct-2503 (API FP8)	verified	4.15	2026
07	Qwen/Qwen3.5-35B-A3B thinking (API)	verified	4.15	2026
08	Qwen/Qwen3.5-27B thinking (API)	verified	4.11	2026
09	Qwen/Qwen2.5-32B-Instruct	verified	4.04	2026
10	google/gemma-3-27b-it (API)	verified	4.03	2026
11	mistralai/Mistral-Small-3.2-24B-Instruct-2506 (API FP8)	verified	4.00	2026
12	mistralai/Mistral-Large-Instruct-2411	verified	3.99	2026
13	speakleash/Bielik-11B-v3.0-Instruct	verified	3.96	2026
14	Qwen/Qwen2.5-72B-Instruct	verified	3.93	2026

Average

Average is the overall score reported for CPTU-Bench alongside the four category scores. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Average: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	Qwen/Qwen3.5-27B thinking (API)	verified	4.34	2026
02	gemini-2.0-flash-001	verified	4.29	2026
03	Qwen/Qwen3.5-27B non-thinking (API)	verified	4.27	2026
04	Qwen/Qwen3.5-35B-A3B thinking (API)	verified	4.22	2026
05	Qwen/Qwen3.5-35B-A3B non-thinking (API)	verified	4.18	2026
06	deepseek-ai/DeepSeek-V3.2 (API)	verified	4.14	2026
07	deepseek-ai/DeepSeek-R1 (API)	verified	4.14	2026
08	gemini-2.0-flash-lite-001	verified	4.09	2026
09	deepseek-ai/DeepSeek-V3-0324 (API)	verified	4.03	2026
10	deepseek-ai/DeepSeek-V3.1 (API)	verified	4.03	2026
11	deepseek-ai/DeepSeek-V3 (API)	verified	4.02	2026
12	mistralai/Mistral-Large-Instruct-2411	verified	4.00	2026
13	moonshotai/Kimi-K2-Instruct-0905 (API)	verified	3.98	2026
14	Qwen/Qwen2.5-72B-Instruct	verified	3.95	2026
15	mistralai/Mistral-Large-Instruct-2407	verified	3.93	2026
16	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)	verified	3.93	2026
17	Qwen/Qwen3-235B-A22B non-thinking (API)	verified	3.91	2026
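
The page does not say how Average is computed. For the five models that appear in all four category tables, the listed Average is consistent with an unweighted mean of the four category scores. This is an inference from the rows shown here, not a documented formula. The sketch below checks it; the variable names and the 0.01 tolerance (chosen to absorb two-decimal rounding of the inputs) are assumptions introduced for illustration.

```python
# Category scores copied from the four tables above, in the order:
# Tricky Questions, Sentiment, Language Understanding, Phraseology.
category_scores = {
    "Qwen/Qwen3.5-27B thinking (API)": (4.61, 4.42, 4.21, 4.11),
    "gemini-2.0-flash-001": (3.99, 4.52, 4.32, 4.34),
    "Qwen/Qwen3.5-27B non-thinking (API)": (4.43, 4.29, 4.17, 4.20),
    "Qwen/Qwen3.5-35B-A3B thinking (API)": (4.70, 4.10, 3.94, 4.15),
    "Qwen/Qwen3.5-35B-A3B non-thinking (API)": (4.19, 4.23, 4.05, 4.23),
}

# Average scores copied from the Average table above.
listed_average = {
    "Qwen/Qwen3.5-27B thinking (API)": 4.34,
    "gemini-2.0-flash-001": 4.29,
    "Qwen/Qwen3.5-27B non-thinking (API)": 4.27,
    "Qwen/Qwen3.5-35B-A3B thinking (API)": 4.22,
    "Qwen/Qwen3.5-35B-A3B non-thinking (API)": 4.18,
}

TOLERANCE = 0.01  # assumed slack for rounding of the published scores


def unweighted_mean(scores: tuple[float, ...]) -> float:
    """Arithmetic mean of the category scores, with equal weights."""
    return sum(scores) / len(scores)


# For each model, does the mean of the four categories match the listed Average?
consistent = {
    model: abs(unweighted_mean(scores) - listed_average[model]) <= TOLERANCE
    for model, scores in category_scores.items()
}

for model, ok in consistent.items():
    print(f"{model}: mean={unweighted_mean(category_scores[model]):.4f} "
          f"listed={listed_average[model]:.2f} consistent={ok}")
```

Models missing from one or more category tables, such as deepseek-ai/DeepSeek-V3.2, cannot be checked this way from the rows shown.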
