74 Models Benchmarked

Polish LLM Benchmarks
Real Performance Data

Compare 74 language models on Polish benchmarks: PLCC, CPTU, MT-Bench-PL, EQ-Bench-PL, and Open PL LLM Leaderboard.

Bielik: The Best Open Polish LLM Under 15B

Bielik-3.0-11B demonstrates exceptional Polish language understanding with only 11.2B parameters. It achieves 71.8 PLCC, significantly outperforming GPT-3.5-turbo (43.3 PLCC) and competing with models 6-60x its size.

  • +65.8% vs GPT-3.5 on PLCC
  • #1 among open models under 15B
  • 71.8 PLCC score
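
The headline percentage follows directly from the two PLCC scores quoted above. A minimal sketch of the arithmetic, assuming the figure is a plain relative gain over the GPT-3.5-turbo baseline (the helper name `relative_gain` is illustrative, not part of any benchmark tooling):

```python
def relative_gain(score: float, baseline: float) -> float:
    """Return the percentage by which `score` exceeds `baseline`."""
    return (score - baseline) / baseline * 100.0


bielik_plcc = 71.8  # Bielik-3.0-11B, PLCC
gpt35_plcc = 43.3   # GPT-3.5-turbo, PLCC

gain = relative_gain(bielik_plcc, gpt35_plcc)
points = bielik_plcc - gpt35_plcc

print(f"+{gain:.1f}% relative, +{points:.1f} points absolute")
```

The relative figure (about 65.8%) is larger than the absolute gap (28.5 points) because it is measured against the lower baseline score.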

Polish LLM Leaderboard

74 models ranked by PLCC (Polish Linguistic and Cultural Competency) score. The top 30 are listed below.

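| # | Model | Provider | Params | PLCC | CPTU | Licence |
|---|-------|----------|--------|------|------|---------|
| 1 | Gemini-3.0-Pro-Preview | Google | - | 95.8 | - | Proprietary |
| 2 | Gemini-2.5-Pro-Preview-06-05 | Google | - | 92.2 | - | Proprietary |
| 3 | GPT-5-Pro | OpenAI | - | 91.0 | - | Proprietary |
| 4 | Grok-4 | xAI | - | 90.5 | - | Proprietary |
| 5 | O1-2024-12-17 | OpenAI | - | 89.2 | - | Proprietary |
| 6 | GPT-4.5-preview-2025-02-27 | OpenAI | - | 86.5 | - | Proprietary |
| 7 | Claude-3.5-Sonnet-20241022 | Anthropic | - | 82.7 | - | Proprietary |
| 8 | GPT-4o-2024-05-13 | OpenAI | - | 82.3 | - | Proprietary |
| 9 | Claude-3.7-Sonnet-Thinking | Anthropic | - | 82.2 | - | Proprietary |
| 10 | Claude-3.7-Sonnet | Anthropic | - | 81.5 | - | Proprietary |
| 11 | GPT-4o-2024-11-20 | OpenAI | - | 81.3 | - | Proprietary |
| 12 | GPT-4o-2024-08-06 | OpenAI | - | 81.3 | - | Proprietary |
| 13 | Claude-3.5-Sonnet-20240620 | Anthropic | - | 80.7 | - | Proprietary |
| 14 | DeepSeek-R1 | DeepSeek | 671B | 76.0 | 4.10 | Open |
| 15 | Gemini-2.0-Flash-Thinking | Google | - | 74.8 | - | Proprietary |
| 16 | Gemini-2.0-Flash | Google | - | 74.2 | 4.40 | Proprietary |
| 17 | Claude-3-Opus | Anthropic | - | 73.8 | - | Proprietary |
| 18 | Bielik-3.0-11B (Polish) | SpeakLeash | 11.2B | 71.8 | 3.82 | Open |
| 19 | DeepSeek-v3.2 | DeepSeek | 685B | 71.7 | - | Open |
| 20 | DeepSeek-v3.1 (no thinking) | DeepSeek | 671B | 71.0 | - | Open |
| 21 | Mistral-Large-2512 | Mistral | 675B | 70.7 | - | Open |
| 22 | PLLuM-12B-nc-chat-250715 (Polish) | PLLuM | 12.2B | 69.7 | 3.90 | Open |
| 23 | Gemini-Pro-1.5 | Google | - | 69.7 | - | Proprietary |
| 24 | DeepSeek-v3 | DeepSeek | 671B | 69.2 | 4.00 | Open |
| 25 | PLLuM-8x7B-nc-chat (Polish) | PLLuM | 46.7B | 68.2 | 3.40 | Open |
| 26 | GPT-4-turbo | OpenAI | - | 67.0 | - | Proprietary |
| 27 | Grok-2-1212 | xAI | - | 66.0 | - | Proprietary |
| 28 | Bielik-2.6 (Polish) | SpeakLeash | 11.2B | 65.5 | 3.80 | Open |
| 29 | Bielik-2.2 (Polish) | SpeakLeash | 11.2B | 63.0 | 3.60 | Open |
| 30 | Bielik-2.3 (Polish) | SpeakLeash | 11.2B | 62.2 | 3.80 | Open |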
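
The "#1 among open models under 15B" claim can be checked by filtering the leaderboard on licence and parameter count, then sorting by PLCC. The sketch below hard-codes a handful of rows from the table above; the `Row` type and `open_under` helper are illustrative names, and the site's own ranking logic is not documented here.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    params_b: Optional[float]  # parameters in billions; None if undisclosed
    plcc: float
    licence: str


# A subset of the leaderboard rows, copied from the table.
ROWS = [
    Row("Gemini-3.0-Pro-Preview", None, 95.8, "Proprietary"),
    Row("DeepSeek-R1", 671.0, 76.0, "Open"),
    Row("Bielik-3.0-11B", 11.2, 71.8, "Open"),
    Row("DeepSeek-v3.2", 685.0, 71.7, "Open"),
    Row("PLLuM-12B-nc-chat-250715", 12.2, 69.7, "Open"),
    Row("PLLuM-8x7B-nc-chat", 46.7, 68.2, "Open"),
    Row("Bielik-2.6", 11.2, 65.5, "Open"),
]


def open_under(rows: List[Row], max_params_b: float) -> List[Row]:
    """Open-licence rows below a parameter budget, best PLCC first."""
    eligible = [
        r for r in rows
        if r.licence == "Open"
        and r.params_b is not None
        and r.params_b < max_params_b
    ]
    return sorted(eligible, key=lambda r: r.plcc, reverse=True)


small_open = open_under(ROWS, 15.0)
for rank, row in enumerate(small_open, start=1):
    print(rank, row.model, row.plcc)
```

Rows with undisclosed parameter counts are excluded rather than guessed, which is why the proprietary models never enter the size-limited ranking.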

Polish Benchmark Landscape

The Polish NLP community has developed its own evaluation ecosystem. These benchmarks test competence in Polish itself, not just translation quality.

Core Polish Evaluation

Primary benchmarks for Polish language competency

PLCC

Polish Linguistic and Cultural Competency

Accuracy: 95.8%
Leader: Gemini-3.0-Pro

Open PL LLM Leaderboard

Multi-task Polish evaluation

Average Score: 69.4%
Leader: Llama-3.1-405B

MT-Bench-PL

Multi-turn conversation quality

Score (1-10): 9.3
Leader: Gemma-3-27b

Complex Understanding

Implicatures, idioms, cultural references, and nuanced Polish

CPTU

Complex Polish Text Understanding Benchmark

Score (1-5): 4.4
Leader: Gemini-2.0-Flash

EQ-Bench-PL

Emotional intelligence in Polish

Score: 78.1
Leader: Mistral-Large-2407

Benchmark Definitions

PLCC
Polish Linguistic and Cultural Competency - tests understanding of Polish grammar, idioms, and cultural references.
CPTU
Complex Polish Text Understanding Benchmark - measures comprehension of nuanced, multi-layered Polish texts.
MT-Bench-PL
Multi-turn conversation benchmark adapted for Polish - tests dialogue coherence and context retention.
EQ-Bench-PL
Emotional intelligence benchmark for Polish - evaluates understanding of emotions and social nuances.
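
These benchmarks report on different scales (PLCC as a percentage, CPTU on 1-5, MT-Bench-PL on 1-10), so raw scores are not directly comparable. One possible way to view them side by side is min-max rescaling to the 0-1 range. This is an illustrative choice, not a method used by the benchmarks themselves, and the 0-100 range for PLCC is an assumption based on it being reported as an accuracy percentage.

```python
# Score ranges per benchmark. CPTU and MT-Bench-PL ranges are stated on
# this page; the PLCC range is an assumption (accuracy in percent).
SCALES = {
    "PLCC": (0.0, 100.0),
    "CPTU": (1.0, 5.0),
    "MT-Bench-PL": (1.0, 10.0),
}


def min_max(score: float, lo: float, hi: float) -> float:
    """Rescale `score` from the range [lo, hi] to [0, 1]."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} outside range [{lo}, {hi}]")
    return (score - lo) / (hi - lo)


# Bielik-3.0-11B scores as listed on this page.
bielik = {"PLCC": 71.8, "CPTU": 3.82}

normalized = {
    name: min_max(value, *SCALES[name]) for name, value in bielik.items()
}

for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```

Rescaled values only put the numbers on a common axis. They say nothing about the relative difficulty of the benchmarks, so a 0.7 on one is not equivalent to a 0.7 on another.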

Bielik Model Family

Bielik (Polish for "white-tailed eagle") is developed by SpeakLeash. The models are optimized specifically for Polish, and on PLCC the latest 11B release scores on par with or above much larger multilingual models such as DeepSeek-v3.2 (685B) and Mistral-Large-2512 (675B).

Bielik-3.0-11B

latest

Best Polish open model under 15B. Outperforms GPT-3.5-turbo by 65.8% on PLCC.

Params: 11.2B
PLCC: 71.8
CPTU: 3.82
Open PL: 65.93

Bielik-2.6

stable

Strong EQ-Bench performance (73.7)

Params: 11.2B
PLCC: 65.5
CPTU: 3.8
Open PL: 64.3

Bielik-2.3

stable

Best MT-Bench-PL among Bielik versions (8.6)

Params: 11.2B
PLCC: 62.2
CPTU: 3.8
Open PL: 64

Bielik-3.0-4.5B

latest

Edge deployment ready. Strong CPTU for size (3.7)

Params: 4.8B
PLCC: 42.3
CPTU: 3.7
Open PL: 56.1

Bielik-0.1

legacy

Original release, baseline for Polish LLM development

Params: 7.24B
PLCC: 46.7
CPTU: 3.1
Open PL: 44.7

Training Highlights

  • 292B tokens for v3 models / 198B for v2
  • APT4 Tokenizer - custom Polish tokenizer
  • 303M documents of diverse Polish text
  • Apache 2.0 - fully open weights

Model Recommendations

Best Open Source (Under 15B)

Bielik-3.0-11B

71.8 PLCC, 3.82 CPTU. Outperforms GPT-3.5-turbo by 65.8% on PLCC. Apache 2.0 licence, self-hostable.

View on HuggingFace

Best for Edge/Mobile

Bielik-3.0-4.5B

Only 4.8B params but 42.3 PLCC and 3.7 CPTU. Runs on consumer hardware.

View SpeakLeash models
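
Whether a model is "self-hostable" or "runs on consumer hardware" depends largely on how much memory its weights need. A rough back-of-the-envelope sketch, assuming memory is parameter count times bytes per parameter and ignoring activations, KV cache, and runtime overhead (the precision table and function name are illustrative):

```python
# Approximate storage cost per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate weight storage in GB (1 GB = 10**9 bytes)."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9


# Parameter counts as listed on this page.
models = {"Bielik-3.0-11B": 11.2, "Bielik-3.0-4.5B": 4.8}

for name, params_b in models.items():
    for precision in ("fp16", "int4"):
        gb = weight_memory_gb(params_b, precision)
        print(f"{name} @ {precision}: ~{gb:.1f} GB")
```

Under these assumptions the 4.8B model needs roughly 9.6 GB at fp16 and about 2.4 GB at 4-bit, while the 11.2B model needs roughly 22.4 GB at fp16. Actual requirements will be higher once context length and runtime overhead are included.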

Best for Complex Polish

DeepSeek-R1

76.0 PLCC, 4.1 CPTU. Best open model for complex reasoning in Polish. 671B params.

View on HuggingFace

Best Commercial API

Gemini-3.0-Pro / GPT-5-Pro

95.8 / 91.0 PLCC. Top overall Polish performance with enterprise SLAs.

API access via Google/OpenAI
