
Polish Conversation Quality.

Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.

1 dataset · 450 results · Canonical metric: pl-score
§ 02 · Canonical benchmark

The reference dataset.

Polish MT-Bench

Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.

Primary metric: pl-score
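
As a rough illustration of how per-category judge scores on the 1-10 scale could be rolled up into a single number, the sketch below takes an unweighted mean over the 8 categories. This page does not say how pl-score is actually computed, so the unweighted mean, the function name `overall_score`, and the sample values are all assumptions made for illustration, not leaderboard data.

```python
from statistics import mean

# The 8 categories listed for Polish MT-Bench.
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def overall_score(category_scores: dict[str, float]) -> float:
    """Unweighted mean of per-category judge scores.

    ASSUMPTION: the real pl-score aggregation is not documented here;
    this only shows one plausible way to combine category scores.
    """
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for name in CATEGORIES:
        score = category_scores[name]
        # Judge scores are stated to lie on a 1-10 scale.
        if not 1.0 <= score <= 10.0:
            raise ValueError(f"{name} score {score} is outside the 1-10 scale")
    return round(mean(category_scores[c] for c in CATEGORIES), 2)


# Made-up scores for illustration only; these are not real results.
example = {
    "coding": 6.0, "extraction": 7.0, "humanities": 9.0, "math": 5.0,
    "reasoning": 6.0, "roleplay": 8.0, "stem": 8.0, "writing": 7.0,
}
print(overall_score(example))
```

A single strong category, such as humanities here, moves the mean only slightly, which is one reason to read per-category columns alongside any overall score.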
§ 03 · Top 10

Leading models.

Leading models on Polish MT-Bench.

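| # | Model | humanities | Year | Source |
|---|-------|------------|------|--------|
| 1 | gemma-3-12b-it | 10.0 | 2026 | paper |
| 2 | aya-expanse-32b | 10.0 | 2026 | paper |
| 3 | Gemma 3 (27B, IT) | 10.0 | 2026 | paper |
| 4 | Mistral-Small-3.1-24B-Instruct-2503 | 10.0 | 2026 | paper |
| 5 | Mistral-Small-Instruct-2409 | 10.0 | 2026 | paper |
| 6 | gemma-3-12b-it | 10.0 | 2026 | paper |
| 7 | Phi-4 | 10.0 | 2026 | paper |
| 8 | Gemma-2-27b-it | 10.0 | 2026 | paper |
| 9 | Gemma 3 (27B, IT) | 9.95 | 2026 | paper |
| 10 | Gemma 3 (27B, IT) | 9.95 | 2026 | paper |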


§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

Polish MT-Bench (canonical) · 450 results · pl-score · Top: Phi-4 10.0
§ 05 · Related tasks

Other tasks in Natural Language Processing.

Feature Extraction · Fill-Mask · Named Entity Recognition · Natural Language Inference · Polish Cultural Competency · Polish Emotional Intelligence · Polish LLM General · Polish Text Understanding