Polish LLM Benchmarks
Real Performance Data
Compare 74 language models on Polish benchmarks: PLCC, CPTU, MT-Bench-PL, EQ-Bench-PL, and Open PL LLM Leaderboard.
Bielik: The Best Open Polish LLM Under 15B
Bielik-3.0-11B demonstrates exceptional Polish language understanding with only 11.2B parameters. It achieves 71.8 PLCC, significantly outperforming GPT-3.5-turbo (43.3 PLCC) and competing with models 6-60x its size.
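The margin over GPT-3.5-turbo follows directly from the two PLCC scores quoted above; a quick sanity check:

```python
# PLCC scores as reported on the leaderboard below
bielik_plcc = 71.8  # Bielik-3.0-11B
gpt35_plcc = 43.3   # GPT-3.5-turbo

# Relative improvement: (71.8 - 43.3) / 43.3
relative_gain = (bielik_plcc - gpt35_plcc) / gpt35_plcc * 100
print(f"{relative_gain:.1f}%")  # 65.8%
```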
Polish LLM Leaderboard
74 models ranked by PLCC (Polish Linguistic and Cultural Competency) score.
| # | Model | Provider | Params | PLCC | CPTU | License |
|---|---|---|---|---|---|---|
| 1 | Gemini-3.0-Pro-Preview | Google | - | 95.8 | - | Proprietary |
| 2 | Gemini-2.5-Pro-Preview-06-05 | Google | - | 92.2 | - | Proprietary |
| 3 | GPT-5-Pro | OpenAI | - | 91.0 | - | Proprietary |
| 4 | Grok-4 | xAI | - | 90.5 | - | Proprietary |
| 5 | O1-2024-12-17 | OpenAI | - | 89.2 | - | Proprietary |
| 6 | GPT-4.5-preview-2025-02-27 | OpenAI | - | 86.5 | - | Proprietary |
| 7 | Claude-3.5-Sonnet-20241022 | Anthropic | - | 82.7 | - | Proprietary |
| 8 | GPT-4o-2024-05-13 | OpenAI | - | 82.3 | - | Proprietary |
| 9 | Claude-3.7-Sonnet-Thinking | Anthropic | - | 82.2 | - | Proprietary |
| 10 | Claude-3.7-Sonnet | Anthropic | - | 81.5 | - | Proprietary |
| 11 | GPT-4o-2024-11-20 | OpenAI | - | 81.3 | - | Proprietary |
| 12 | GPT-4o-2024-08-06 | OpenAI | - | 81.3 | - | Proprietary |
| 13 | Claude-3.5-Sonnet-20240620 | Anthropic | - | 80.7 | - | Proprietary |
| 14 | DeepSeek-R1 | DeepSeek | 671B | 76.0 | 4.10 | Open |
| 15 | Gemini-2.0-Flash-Thinking | Google | - | 74.8 | - | Proprietary |
| 16 | Gemini-2.0-Flash | Google | - | 74.2 | 4.40 | Proprietary |
| 17 | Claude-3-Opus | Anthropic | - | 73.8 | - | Proprietary |
| 18 | Bielik-3.0-11B | SpeakLeash | 11.2B | 71.8 | 3.82 | Open |
| 19 | DeepSeek-v3.2 | DeepSeek | 685B | 71.7 | - | Open |
| 20 | DeepSeek-v3.1 (no thinking) | DeepSeek | 671B | 71.0 | - | Open |
| 21 | Mistral-Large-2512 | Mistral | 675B | 70.7 | - | Open |
| 22 | PLLuM-12B-nc-chat-250715 | PLLuM | 12.2B | 69.7 | 3.90 | Open |
| 23 | Gemini-Pro-1.5 | Google | - | 69.7 | - | Proprietary |
| 24 | DeepSeek-v3 | DeepSeek | 671B | 69.2 | 4.00 | Open |
| 25 | PLLuM-8x7B-nc-chat | PLLuM | 46.7B | 68.2 | 3.40 | Open |
| 26 | GPT-4-turbo | OpenAI | - | 67.0 | - | Proprietary |
| 27 | Grok-2-1212 | xAI | - | 66.0 | - | Proprietary |
| 28 | Bielik-2.6 | SpeakLeash | 11.2B | 65.5 | 3.80 | Open |
| 29 | Bielik-2.2 | SpeakLeash | 11.2B | 63.0 | 3.60 | Open |
| 30 | Bielik-2.3 | SpeakLeash | 11.2B | 62.2 | 3.80 | Open |
| 31 | Bielik-2.5 | SpeakLeash | 11.2B | 62.0 | 3.70 | Open |
| 32 | Kimi-K2 | Moonshot | 1000B | 62.0 | - | Open |
| 33 | Bielik-2.1 | SpeakLeash | 11.2B | 61.0 | 3.70 | Open |
| 34 | Llama-3.1-405B | Meta | 405B | 60.0 | - | Open |
| 35 | PLLuM-12B-nc-chat | PLLuM | 12.2B | 59.5 | 3.30 | Open |
| 36 | GPT-4 | OpenAI | - | 59.5 | - | Proprietary |
| 37 | O3-mini-2025-01-31 | OpenAI | - | 59.3 | - | Proprietary |
| 38 | Llama-PLLuM-70B-chat | PLLuM | 70.6B | 58.5 | 3.60 | Open |
| 39 | Llama-4-Maverick | Meta | 402B | 58.2 | 4.00 | Open |
| 40 | Claude-3.5-Haiku-20241022 | Anthropic | - | 57.8 | - | Proprietary |
| 41 | GPT-4o-mini-2024-07-18 | OpenAI | - | 56.8 | - | Proprietary |
| 42 | Claude-3.0-Sonnet | Anthropic | - | 56.5 | - | Proprietary |
| 43 | Command-A-03-2025 | Cohere | 111B | 56.2 | - | Open |
| 44 | Qwen3-235B-A22B | Alibaba | 235B | 55.0 | - | Open |
| 45 | Mistral-Large-2407 | Mistral | 123B | 54.2 | 4.00 | Open |
| 46 | PLLuM-8x7B-chat | PLLuM | 46.7B | 54.2 | 3.40 | Open |
| 47 | Mistral-Large-2411 | Mistral | 123B | 52.0 | - | Open |
| 48 | O1-mini-2024-09-12 | OpenAI | - | 51.7 | - | Proprietary |
| 49 | WizardLM-2-8x22b | Microsoft | 141B | 51.5 | 3.90 | Open |
| 50 | Mixtral-8x22b | Mistral | 141B | 49.8 | 3.70 | Open |
| 51 | Llama-3.3-70B | Meta | 70.6B | 48.8 | 3.70 | Open |
| 52 | Llama-3.1-70B | Meta | 70.6B | 47.8 | 3.80 | Open |
| 53 | Gemma-3-27b | Google | 27B | 47.3 | 3.90 | Open |
| 54 | PLLuM-12B-chat | PLLuM | 12.2B | 47.0 | 3.30 | Open |
| 55 | Bielik-0.1 | SpeakLeash | 7.24B | 46.7 | 3.10 | Open |
| 56 | Gemini-Flash-1.5 | Google | - | 46.5 | - | Proprietary |
| 57 | Mistral-Small-3.2-24B-2506 | Mistral | 23.6B | 46.2 | - | Open |
| 58 | GPT-3.5-turbo | OpenAI | - | 43.3 | - | Proprietary |
| 59 | Llama-3.0-70B | Meta | 70.6B | 43.0 | 3.80 | Open |
| 60 | Gemma-2-27b | Google | 27B | 42.7 | - | Open |
| 61 | Bielik-3.0-4.5B | SpeakLeash | 4.8B | 42.3 | 3.70 | Open |
| 62 | Llama-4-Scout | Meta | 109B | 41.5 | 3.90 | Open |
| 63 | EuroLLM-9B | UTTER | 9B | 41.0 | - | Open |
| 64 | Qwen-2.5-72b | Alibaba | 72.7B | 39.2 | 4.00 | Open |
| 65 | Mistral-Small-24B-2501 | Mistral | 23.6B | 39.0 | 3.80 | Open |
| 66 | Llama-PLLuM-8B-chat | PLLuM | 8.03B | 38.5 | 3.10 | Open |
| 67 | Mixtral-8x7b | Mistral | 46.7B | 35.3 | 3.00 | Open |
| 68 | Qwen-2.5-32b | Alibaba | 32.8B | 30.5 | 3.80 | Open |
| 69 | Phi-4 | Microsoft | 14.7B | 29.2 | 3.50 | Open |
| 70 | Qwen-2.5-14b | Alibaba | 14.8B | 26.7 | 3.60 | Open |
| 71 | Mistral-Nemo | Mistral | 12.2B | 23.0 | - | Open |
| 72 | Llama-3.1-8B | Meta | 8.03B | 22.7 | 3.30 | Open |
| 73 | Mistral-7b-v0.3 | Mistral | 7.25B | 21.8 | 3.00 | Open |
| 74 | Qwen-2.5-7b | Alibaba | 7.62B | 17.7 | 3.20 | Open |
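The "best open model under 15B" claim is easy to verify mechanically. A minimal sketch with a hand-copied subset of the rows above (scores taken from the table; the filtering logic is illustrative, not part of the leaderboard tooling):

```python
# (name, params_in_billions, plcc, license) copied from the leaderboard
models = [
    ("Bielik-3.0-11B",            11.2, 71.8, "Open"),
    ("PLLuM-12B-nc-chat-250715",  12.2, 69.7, "Open"),
    ("Phi-4",                     14.7, 29.2, "Open"),
    ("Qwen-2.5-14b",              14.8, 26.7, "Open"),
    ("DeepSeek-R1",              671.0, 76.0, "Open"),
]

# Open models under 15B parameters, best PLCC first
small_open = sorted(
    (m for m in models if m[3] == "Open" and m[1] < 15),
    key=lambda m: m[2],
    reverse=True,
)
print(small_open[0][0])  # Bielik-3.0-11B
```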
Polish Benchmark Landscape
Polish has developed its own evaluation ecosystem. These benchmarks test real Polish language competence, not just translation quality.
Core Polish Evaluation
Primary benchmarks for Polish language competency
PLCC
Polish Linguistic and Cultural Competency
Open PL LLM Leaderboard
Multi-task Polish evaluation
MT-Bench-PL
Multi-turn conversation quality
Complex Understanding
Implicatures, idioms, cultural references, and nuanced Polish
CPTU
Complex Polish Text Understanding Benchmark
EQ-Bench-PL
Emotional intelligence in Polish
Benchmark Definitions
- PLCC
- Polish Linguistic and Cultural Competency - tests understanding of Polish grammar, idioms, and cultural references.
- CPTU
- Complex Polish Text Understanding Benchmark - measures comprehension of nuanced, multi-layered Polish texts.
- MT-Bench-PL
- Multi-turn conversation benchmark adapted for Polish - tests dialogue coherence and context retention.
- EQ-Bench-PL
- Emotional intelligence benchmark for Polish - evaluates understanding of emotions and social nuances.
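Note that the benchmarks use different scales: PLCC is reported on 0-100, CPTU on roughly 0-5. If you want a single number for comparing models, you have to rescale first. The sketch below shows one hypothetical way to do that; the leaderboard itself ranks by PLCC alone and publishes no official composite metric:

```python
def composite(plcc: float, cptu: float) -> float:
    """Average of PLCC and CPTU after rescaling CPTU (0-5) to 0-100.

    Illustrative only -- not an official leaderboard metric.
    """
    return round((plcc + cptu * 20) / 2, 2)

print(composite(76.0, 4.10))  # DeepSeek-R1 -> 79.0
print(composite(71.8, 3.82))  # Bielik-3.0-11B -> 74.1
```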
Bielik Model Family
Bielik (Polish for "White-tailed Eagle") is developed by SpeakLeash. These models are specifically optimized for Polish language tasks and consistently outperform much larger multilingual models.
Bielik-3.0-11B
latest: Best Polish open model under 15B. Outperforms GPT-3.5-turbo by 65.8% on PLCC
Bielik-2.6
stable: Strong EQ-Bench-PL performance (73.7)
Bielik-2.3
stable: Best MT-Bench-PL score among Bielik versions (8.6)
Bielik-3.0-4.5B
latest: Edge-deployment ready. Strong CPTU for its size (3.7)
Bielik-0.1
legacy: Original release; baseline for Polish LLM development
Training Highlights
- 292B tokens for v3 models / 198B for v2
- APT4 tokenizer - custom Polish tokenizer
- 303M documents of diverse Polish text
- Apache 2.0 - fully open weights
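A language-specific tokenizer like APT4 matters because tokenizer efficiency is commonly summarized as fertility: tokens emitted per word, where lower means cheaper inference on Polish text. The counts below are hypothetical, chosen only to show the metric; APT4's measured figures are in the Bielik v3 technical report:

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokens emitted per whitespace-delimited word; lower is better
    for inference cost on Polish text."""
    return num_tokens / num_words

# Hypothetical counts for the same Polish sentence -- NOT measured
# APT4 numbers; see the Bielik v3 technical report for real data.
sentence_words = 12
generic_bpe_tokens = 30   # hypothetical multilingual tokenizer
polish_tokens = 18        # hypothetical Polish-specific tokenizer

print(fertility(generic_bpe_tokens, sentence_words))  # 2.5
print(fertility(polish_tokens, sentence_words))       # 1.5
```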
Model Recommendations
Best Open Source (Under 15B)
Bielik-3.0-11B: 71.8 PLCC, 3.82 CPTU. Outperforms GPT-3.5-turbo by 65.8% on PLCC. Apache 2.0 license, self-hostable.
View on HuggingFace
Best for Edge/Mobile
Bielik-3.0-4.5B: only 4.8B params, yet 42.3 PLCC and 3.7 CPTU. Runs on consumer hardware.
View SpeakLeash models
Best for Complex Polish
DeepSeek-R1: 76.0 PLCC, 4.10 CPTU. Best open model for complex reasoning in Polish. 671B params.
View on HuggingFace
Best Commercial API
Gemini-3.0-Pro-Preview / GPT-5-Pro: 95.8 / 91.0 PLCC. Top overall Polish performance with enterprise SLAs.
API access via Google/OpenAI
Resources & Links
Open PL LLM Leaderboard
Official leaderboard on HuggingFace Spaces
Bielik 11B v2 Technical Report
Full methodology and benchmark details
Bielik v3 Small Technical Report
APT4 tokenizer & efficiency innovations
SpeakLeash on HuggingFace
All Bielik model weights and documentation
PLCC Benchmark
Polish Linguistic and Cultural Competency
PLLuM Project
Polish Large Language Model by OPI
Explore More Benchmarks
See how Polish OCR models compare, or explore our broader LLM benchmark tracking.