Polish Conversation Quality.

Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.

Datasets: 1 · Results: 450 · Canonical metric: pl-score

§ 02 · Canonical benchmark

The reference dataset.

Polish MT-Bench

A Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.

Primary metric: pl-score
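
As a rough illustration of how per-category judge scores like those above could be rolled up, here is a minimal sketch. It assumes each judged response is a (category, turn, score) record and that the overall figure is the plain mean of all judged scores, as in the original MT-Bench. This page does not state how pl-score is actually computed, so the function name `aggregate_scores`, the record layout, and the averaging rule are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# The eight categories named in the benchmark description (lowercased here by assumption).
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def aggregate_scores(judgments):
    """Aggregate judge scores into per-category means and an overall mean.

    judgments: iterable of (category, turn, score) tuples, score on the 1-10 scale.
    The overall value is the unweighted mean of all scores (an assumption,
    not a documented definition of pl-score).
    """
    by_category = defaultdict(list)
    for category, _turn, score in judgments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range 1-10: {score!r}")
        by_category[category].append(score)

    per_category = {name: mean(scores) for name, scores in by_category.items()}
    all_scores = [s for scores in by_category.values() for s in scores]
    overall = mean(all_scores)
    return per_category, overall


# Hypothetical judgments for one model: two turns in each of two categories.
sample = [
    ("humanities", 1, 10), ("humanities", 2, 9),
    ("math", 1, 6), ("math", 2, 7),
]
per_category, overall = aggregate_scores(sample)
print(per_category, overall)
```

A per-category breakdown of this kind is what makes a single-category column, such as the humanities score in the table below, possible alongside an overall score.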

§ 03 · Top 10

Leading models.

Leading models on Polish MT-Bench.

#	Model	Score (humanities)	Year	Source
★	gemma-3-12b-it ✓	10.0	2026	paper
2	aya-expanse-32b ✓	10.0	2026	paper
3	Gemma 3 (27B, IT) ✓	10.0	2026	paper
4	Mistral-Small-3.1-24B-Instruct-2503 ✓	10.0	2026	paper
5	Mistral-Small-Instruct-2409 ✓	10.0	2026	paper
6	gemma-3-12b-it ✓	10.0	2026	paper
7	Phi-4 ✓	10.0	2026	paper
8	Gemma-2-27b-it ✓	10.0	2026	paper
9	Gemma 3 (27B, IT) ✓	9.95	2026	paper
10	Gemma 3 (27B, IT) ✓	9.95	2026	paper

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

Polish MT-Bench

CANONICAL

450 results · pl-score

Top: Phi-4 — 10.0

§ 05 · Related tasks

Other tasks in Natural Language Processing.

Feature Extraction · Fill-Mask · Named Entity Recognition · Natural Language Inference · Polish Cultural Competency · Polish Emotional Intelligence · Polish LLM General · Polish Text Understanding
