Codesota · LLMs · Polish
The register of Polish-language evaluation
Live · May 2026
§ 00 · Polish LLMs

Polish, measured honestly.

Five benchmarks, two home-grown model families. We compare language models on PLCC, CPTU-Bench, the Open PL LLM Leaderboard (which carries the KLEJ NER and PolEval tasks), Polish MT-Bench and Polish EQ-Bench, and track every version of Bielik and PLLuM across the whole panel.

All scores are live from the Codesota registry. Shaded rows mark the current state of the art on each benchmark. Models are not retracted silently; when a score moves, the previous entry is kept.
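
The "previous entry is kept" rule can be pictured as an append-only log. The sketch below is only an illustration of that idea, not the registry's actual storage: the `ScoreRegister` class, its method names, and the placeholder model, benchmark and score values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreRegister:
    """Hypothetical append-only score log: a changed score adds an entry, it never overwrites one."""
    entries: list = field(default_factory=list)

    def history(self, model, benchmark):
        # Every score ever recorded for this (model, benchmark) pair, oldest first.
        return [s for m, b, s in self.entries if (m, b) == (model, benchmark)]

    def current(self, model, benchmark):
        past = self.history(model, benchmark)
        return past[-1] if past else None

    def record(self, model, benchmark, score):
        # Skip exact repeats; anything else is appended so the old value survives.
        if self.current(model, benchmark) == score:
            return
        self.entries.append((model, benchmark, score))


# Placeholder values, not real registry data.
register = ScoreRegister()
register.record("model-a", "bench-x", 1.0)
register.record("model-a", "bench-x", 2.0)
print(register.history("model-a", "bench-x"))
print(register.current("model-a", "bench-x"))
```

Reading the current score is then just taking the last entry, while the earlier entries stay available for anyone who wants to see how a score moved.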

§ 01 · Benchmarks

Five panels, different angles.

Polish-language evaluation is not one task. The five panels below each probe a different competence — cultural knowledge, text understanding, conversation, emotion, general reasoning.

| Benchmark | What it tests | Models | SOTA | Source |
| --- | --- | --- | --- | --- |
| PLCC | Polish Linguistic and Cultural Competency: grammar, idioms, cultural references. | 165 | 97 | leaderboard → |
| CPTU-Bench | Complex Polish Text Understanding: comprehension of nuanced, multi-layered Polish. | 93 | 4.34 | leaderboard → |
| Open PL LLM Leaderboard | Multi-task Polish evaluation: reasoning, knowledge and language understanding. | 660 | 69.84 | leaderboard → |
| Polish MT-Bench | Multi-turn conversation quality: dialogue coherence and context retention in Polish. | 50 | 9.28 | leaderboard → |
| Polish EQ-Bench | Emotional intelligence in Polish: understanding of emotions and social nuance. | 101 | 78.07 | leaderboard → |
Fig 1 · Five Polish-language panels. Model counts reflect current tracker coverage.
§ 01b · Polish filters

The named checks buyers ask for.

KLEJ, PolEval and Bielik should not be buried under one average. The table below maps each request to the concrete Codesota route and search filter.

| Request | Status | Route | Notes |
| --- | --- | --- | --- |
| KLEJ | 979 tracked rows | Tracked as KLEJ NER inside Open PL | Useful when the buyer cares about named-entity recognition rather than generic reasoning. |
| PolEval | 56 tracked rows | PolEval 2018 Task 3 in Open PL; PolEval 2021 on Polish OCR | The LLM page links the Open PL PolEval row; the OCR register covers PolEval 2021 post-correction. |
| Bielik bench | routing | Bielik family tracked across PLCC, CPTU, Open PL, MT-Bench PL, EQ-Bench PL | There is no single honest “Bielik bench” row; the page shows Bielik versions across the Polish dataset panel. |
| Dataset/language filter | routing | Search now accepts language=pl and dataset filters | Use search chips for language PL plus a specific dataset such as PLCC, CPTU, Open PL, or PolEval OCR. |
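
As a rough sketch of how a `language=pl` filter might be combined with a dataset filter in a query string: the base URL and the `dataset` parameter name below are assumptions for illustration, since the page names only `language=pl` and does not document the search endpoint.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the real search endpoint is not given on this page.
SEARCH_BASE = "https://example.invalid/search"


def build_search_url(dataset=None, language="pl"):
    """Return a search URL with a language filter and an optional dataset filter."""
    params = {"language": language}
    if dataset is not None:
        # The parameter name "dataset" is an assumption, not a documented key.
        params["dataset"] = dataset
    # urlencode escapes spaces and other reserved characters in the values.
    return f"{SEARCH_BASE}?{urlencode(params)}"


print(build_search_url("PLCC"))
print(build_search_url("PolEval OCR"))
```

Dataset names containing spaces, such as PolEval OCR, need escaping, which is why the sketch goes through `urlencode` rather than string concatenation.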

§ 02 · Task

What “Polish-ready” actually means.

A model that scores well on English MMLU can still fail a Polish morphology puzzle. Polish has seven cases, relatively free word order, and centuries of literature that carry oblique cultural references. English-first frontier models often appear to translate internally; the loss is invisible on English benchmarks but legible in Polish conversation.

PLCC probes cultural and linguistic competence directly — idioms, registers, references to national authors. CPTU-Bench scores comprehension of nuanced literary and scientific prose. The Open PL LLM Leaderboard is the multi-task catch-all. MT-Bench and EQ-Bench in their Polish variants test conversation and emotional inference respectively.

The two home-grown families — Bielik from SpeakLeash, PLLuM from OPI — are Polish-first by design. They do not always top the boards, but they are the cleanest control for how much of a model's Polish performance is earned by training and how much is a translation trick.

§ 03 · Bielik tracker

SpeakLeash, every version.

Bielik — Polish for white-tailed eagle — is developed by SpeakLeash with a custom APT4 tokeniser and trained on 292B+ tokens of Polish text. Apache 2.0 licensed.

Model | PLCC | CPTU-Bench | Open PL LLM Leaderboard | Polish MT-Bench | Polish EQ-Bench
Bielik-0.137
Bielik-1.5B-v1.0-DPO-001-L245.11
Bielik-1.5B-v1.0-DPO-001-L374
Bielik-1.5B-v1.0-DPO-001-L3-copy43.68
Bielik-1.5B-v1.0-m344.93
Bielik-1.5B-v1.0-m3b41.06
Bielik-1.5B-v1.0-m440.77
Bielik-1.5B-v319.23
Bielik-1.5B-v3.0-Instruct27
Bielik-1.5B-v3.0-Instruct-RC04202569.40
Bielik-1.5B-v3.0-Instruct-SFT-RC04202521.24
Bielik-11B-v260.42
Bielik-11B-v2.0-Instruct5.50
Bielik-11B-v2.1-Instruct9.13
Bielik-11B-v2.2-Instruct8.12
Bielik-11B-v2.2-M-1.286.15
Bielik-11B-v2.3-Instruct9.50
Bielik-11B-v2.3-Instruct-AWQ65.22
Bielik-11B-v2.3-Instruct-GPTQ46.75
Bielik-11B-v2.3-Instruct.IQ1_M.gguf.IQ44.02
Bielik-11B-v2.3-Instruct.IQ2_XXS.gguf.IQ61.34
Bielik-11B-v2.3-Instruct.IQ3_XXS.gguf.IQ54.62
Bielik-11B-v2.3-Instruct.Q4_K_M.gguf52.48
Bielik-11B-v2.3-Instruct.Q4_K_M.gguf.IQ54.52
Bielik-11B-v2.3-Instruct.Q6_K.gguf91.13
Bielik-11B-v2.3-Instruct.Q8_0.gguf65.76
Bielik-11B-v2.4-Instruct-MS65.51
Bielik-11B-v2.4-Instruct-SL92.31
Bielik-11B-v2.4-Instruct-TI79.30
Bielik-11B-v2.5-Instruct-D-GRPO_H_07064.57
Bielik-11B-v2.5-Instruct-GRPO_01071.20
Bielik-11B-v2.5-Instruct-GRPO_02061.58
Bielik-11B-v2.5-Instruct-GRPO_03087.33
Bielik-11B-v2.5-Instruct-GRPO_04035.24
Bielik-11B-v2.5-Instruct-GRPO_05064.18
Bielik-11B-v2.5-Instruct-GRPO_06091.04
Bielik-11B-v2.5-Instruct-GRPO_H_01084.21
Bielik-11B-v2.5-Instruct-GRPO_H_03084.07
Bielik-11B-v3-Base-2025073033.24
Bielik-11B-v3.0-Instruct57
Bielik-11B-v3.0-Instruct-FP8-Dynamic65.32
Bielik-11B-v3.0-Instruct.Q4_K_M.gguf65.09
Bielik-11B-v3.0-Instruct.Q6_K.gguf63.61
Bielik-11B-v3.0-Instruct.Q8_0.gguf80.20
Bielik-2.168
Bielik-2.262
Bielik-2.362.17
Bielik-2.562
Bielik-2.655
Bielik-4.5B-v384.78
Bielik-4.5B-v3.0-Instruct42.33
Bielik-4.5B-v3.0-Instruct-SFT-RC04202540.67
Bielik-7B-Instruct-v0.17.85
Bielik-7B-Instruct-v0.1-GPTQ55.62
Bielik-7B-v0.134.34
Bielik-Minitron-7B-v3.0-Instruct57
Bielik-PL-11B-v3.0-Instruct57.92
Bielik-PL-Minitron-7B-v3.0-Instruct52.87
Bielik-SOLAR-LIKE-10.7B-Instruct-v0.169.2234.17
minitron-Bielik-7B-v3.0-Instruct-GGUF.Q4_K_M.gguf79
minitron-Bielik-7B-v3.0-Instruct-GGUF.Q6_K.gguf58.89
minitron-Bielik-7B-v3.0-Instruct-GGUF.Q8_0.gguf82.69
MSH-Lite-7B-v1-Bielik-v2.3-Instruct-Llama-Prune39.36
MSH-v1-Bielik-v2.3-Instruct-MedIT-merge37.76
speakleash/Bielik-1.5B-v3.0-Instruct1.2241.36
speakleash/Bielik-11B-v2.0-Instruct3.2673.6168.24
speakleash/Bielik-11B-v2.1-Instruct3.9683.6460.07
speakleash/Bielik-11B-v2.2-Instruct3.7356.7769.05
speakleash/Bielik-11B-v2.3-Instruct3.2253.1170.86
speakleash/Bielik-11B-v2.5-Instruct3.1363.9572.00
speakleash/Bielik-11B-v2.6-Instruct4.1061.8973.70
speakleash/Bielik-11B-v3.0-Instruct3.1943.5271.20
speakleash/Bielik-4.5B-v3.0-Instruct2.4673.4153.58
speakleash/Bielik-7B-Instruct-v0.12.882331.26
speakleash/Bielik-Minitron-7B-v3.0-Instruct2.74
Raw scores. Each column uses a different metric and scale — compare within a column, not across.
§ 04 · PLLuM tracker

OPI, 8B to 70B.

PLLuM (Polish Large Language Universal Model) is developed by OPI, the National Information Processing Institute, as part of a government-backed initiative to build open Polish AI infrastructure. Models range from 8B to 70B parameters.

PLLuM project page →
Model | PLCC | CPTU-Bench | Open PL LLM Leaderboard | Polish MT-Bench | Polish EQ-Bench
CYFRAGOVPL/Llama-PLLuM-70B-chat3.9472.56
CYFRAGOVPL/Llama-PLLuM-70B-instruct3.3369.99
CYFRAGOVPL/Llama-PLLuM-8B-chat3.1346.20
CYFRAGOVPL/Llama-PLLuM-8B-instruct1.66
CYFRAGOVPL/PLLuM-12B-chat2.5952.26
CYFRAGOVPL/PLLuM-12B-instruct3.0936.21
CYFRAGOVPL/PLLuM-12B-nc-chat3.22
CYFRAGOVPL/pllum-12b-nc-chat-2507153.9655.17
CYFRAGOVPL/PLLuM-12B-nc-instruct3.31
CYFRAGOVPL/pllum-12b-nc-instruct-2507153.91
CYFRAGOVPL/PLLuM-8x7B-chat3.4545.22
CYFRAGOVPL/PLLuM-8x7B-instruct3.4639.55
CYFRAGOVPL/PLLuM-8x7B-nc-chat3.4847.29
CYFRAGOVPL/PLLuM-8x7B-nc-instruct1.7641.75
Llama-PLLuM-70B-chat508.05
Llama-PLLuM-70B-chat-25080154
Llama-PLLuM-8B-chat38.509.50
PLLuM-12B-chat376.55
PLLuM-12B-nc-chat414.55
PLLuM-12B-nc-chat-25071552
PLLuM-8x7B-chat607.10
PLLuM-8x7B-nc-chat68.174.95
Raw scores. Each column uses a different metric and scale — compare within a column, not across.
§ 05 · Leaderboards

Three panels, top ten each.

PLCC, CPTU-Bench and the Open PL LLM Leaderboard have the deepest coverage. The other two are shown in-line via the sidebar SOTA ticker.

PLCC · top 10
Shaded row marks current SOTA
| # | Model | Score |
| --- | --- | --- |
| 01 | Gemini-3.1-Pro-Preview | 97 |
| 02 | Gemini-3.0-Pro-Preview | 95.83 |
| 03 | GPT-5.4-2026-03-05 (high reasoning) | 92.17 |
| 04 | Gemini-2.5-Pro-Preview-06-05 | 92.17 |
| 05 | Gemini-3-Flash-Preview | 91.67 |
| 06 | GPT-5-Pro-2025-10-06 (high reasoning) | 91 |
| 07 | GPT-5.4-2026-03-05 (low reasoning) | 90.50 |
| 08 | Grok 4 | 90.50 |
| 09 | Gemini-2.5-Pro-Exp-03-25 | 89.50 |
| 10 | GPT-5-2025-08-07 | 89.50 |
CPTU-Bench · top 10
Shaded row marks current SOTA
| # | Model | Score |
| --- | --- | --- |
| 01 | Qwen/Qwen3.5-27B thinking (API) | 4.34 |
| 02 | gemini-2.0-flash-001 | 4.29 |
| 03 | Qwen/Qwen3.5-27B non-thinking (API) | 4.27 |
| 04 | Qwen/Qwen3.5-35B-A3B thinking (API) | 4.22 |
| 05 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | 4.18 |
| 06 | deepseek-ai/DeepSeek-V3.2 (API) | 4.14 |
| 07 | deepseek-ai/DeepSeek-R1 (API) | 4.14 |
| 08 | gemini-2.0-flash-lite-001 | 4.09 |
| 09 | deepseek-ai/DeepSeek-V3-0324 (API) | 4.03 |
| 10 | deepseek-ai/DeepSeek-V3.1 (API) | 4.03 |
Open PL LLM Leaderboard · top 10
Shaded row marks current SOTA
| # | Model | Score |
| --- | --- | --- |
| 01 | mistralai/Mistral-Large-Instruct-2411 | 69.84 |
| 02 | Meta-Llama-3.1-405B-Instruct-FP8 | 69.44 |
| 03 | mistralai/Mistral-Large-Instruct-2407 | 69.11 |
| 04 | Qwen/Qwen2.5-72B-Instruct | 67.92 |
| 05 | Qwen2.5-72B | 67.38 |
| 06 | QwQ-32B-Preview | 67.01 |
| 07 | Qwen2.5-32B | 66.73 |
| 08 | meta-llama/Llama-3.3-70B-Instruct | 66.40 |
| 09 | Qwen2-72B | 66.02 |
| 10 | remek/v3/rl-instruct/110k | 65.99 |
§ 06 · Methodology

How the Polish register is kept.

Scores are pulled live from the Codesota benchmark database; the page has no hand-written numbers. When the underlying benchmark owners update their tables, our page updates on the next request.

Bielik and PLLuM are the two Polish-first families worth cross-tracking explicitly. The tables in § 03 and § 04 show every version we find, across every benchmark they have been evaluated on. Gaps are gaps — we do not impute.
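
One way to picture "gaps are gaps" is a pivot that leaves missing cells empty instead of filling them. This is a minimal sketch under assumed names (`pivot_scores`, `render_row`) with made-up placeholder scores; it is not the code that produces the tables on this page.

```python
def pivot_scores(records, benchmarks):
    """Build {model: [score or None per benchmark]} without filling missing cells."""
    table = {}
    for model, benchmark, score in records:
        table.setdefault(model, {})[benchmark] = score
    return {
        model: [row.get(b) for b in benchmarks]  # None marks "never evaluated"
        for model, row in sorted(table.items())
    }


def render_row(model, cells):
    # A missing evaluation renders as an empty cell, not as 0 or an average.
    return " | ".join([model] + ["" if c is None else f"{c:g}" for c in cells])


# Placeholder records, not real registry data.
records = [
    ("model-a", "PLCC", 50.0),
    ("model-a", "CPTU-Bench", 3.0),
    ("model-b", "PLCC", 40.0),
]
for model, cells in pivot_scores(records, ["PLCC", "CPTU-Bench"]).items():
    print(render_row(model, cells))
```

Keeping `None` distinct from `0` matters here: a zero would read as a failed evaluation, whereas an empty cell says only that no evaluation was found.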

PLCC and CPTU-Bench use different scoring conventions; comparing raw numbers across benchmarks is meaningless. The SOTA tickers group by benchmark for a reason.
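
Grouping by benchmark before ranking can be sketched as follows. The function name `rank_within_benchmark` is hypothetical, the sketch assumes higher scores are better on every panel, and the four records are taken from the PLCC and CPTU-Bench top-ten lists in § 05.

```python
from collections import defaultdict


def rank_within_benchmark(records):
    """Return {benchmark: [(rank, model, score), ...]}, ranking only against the same benchmark."""
    by_benchmark = defaultdict(list)
    for model, benchmark, score in records:
        by_benchmark[benchmark].append((model, score))

    ranked = {}
    for benchmark, rows in by_benchmark.items():
        # Assumption: higher is better on every panel.
        rows.sort(key=lambda row: row[1], reverse=True)
        ranked[benchmark] = [(i, m, s) for i, (m, s) in enumerate(rows, start=1)]
    return ranked


records = [
    ("Grok 4", "PLCC", 90.50),
    ("Gemini-3.1-Pro-Preview", "PLCC", 97.0),
    ("gemini-2.0-flash-001", "CPTU-Bench", 4.29),
    ("Qwen/Qwen3.5-27B thinking (API)", "CPTU-Bench", 4.34),
]
for benchmark, rows in rank_within_benchmark(records).items():
    print(benchmark, rows)
```

A raw 4.34 on CPTU-Bench and a raw 97 on PLCC both come out as rank 1 on their own panel, which is the only sense in which the two numbers are comparable.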

Related

Neighbouring registers.

Cross-links to the rest of Codesota.

ZusWaveBench: Polish bureaucracy, tax & ZUS reasoning, with Bielik & PLLuM vs frontier LLMs.
LLMs · register: Frontier English-first LLM benchmarks.
Polish OCR: Polish document recognition and post-correction.
All tasks: Every modality Codesota tracks.
Methodology: How scores are admitted and retracted.