Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.
Polish MT-Bench is a Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across eight categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.
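As a rough illustration of how per-category scores on a 1-10 scale can be summarized, the sketch below averages judge scores within each of the eight categories and then takes an unweighted mean across categories. The function name `summarize`, the input layout, and the choice of an unweighted category mean are assumptions made for this example; the source does not specify how Polish MT-Bench aggregates its scores.

```python
from statistics import mean

# The eight categories named in the benchmark description.
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def summarize(scores: dict[str, list[float]]) -> dict[str, float]:
    """Average judge scores per category and add an overall mean.

    `scores` maps each category name to a list of 1-10 judge scores
    (hypothetical layout, one score per judged turn).
    """
    missing = [c for c in CATEGORIES if c not in scores or not scores[c]]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")

    for category in CATEGORIES:
        for value in scores[category]:
            if not 1 <= value <= 10:
                raise ValueError(f"{category}: score {value} is outside 1-10")

    summary = {c: round(mean(scores[c]), 2) for c in CATEGORIES}
    # Assumption: the overall score is an unweighted mean over categories.
    summary["overall"] = round(mean([summary[c] for c in CATEGORIES]), 2)
    return summary
```

With an equal number of judged turns per category, the unweighted category mean matches the mean over all individual scores; with unequal counts the two differ, so the aggregation choice should be stated alongside any reported overall number.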
Leading models on Polish MT-Bench, ranked by humanities score.
| # | Model | Humanities (1-10) | Year | Source |
|---|---|---|---|---|
| 1 | gemma-3-12b-it | 10.0 | 2026 | paper |
| 2 | aya-expanse-32b | 10.0 | 2026 | paper |
| 3 | Gemma 3 (27B, IT) | 10.0 | 2026 | paper |
| 4 | Mistral-Small-3.1-24B-Instruct-2503 | 10.0 | 2026 | paper |
| 5 | Mistral-Small-Instruct-2409 | 10.0 | 2026 | paper |
| 6 | gemma-3-12b-it | 10.0 | 2026 | paper |
| 7 | Phi-4 | 10.0 | 2026 | paper |
| 8 | Gemma-2-27b-it | 10.0 | 2026 | paper |
| 9 | Gemma 3 (27B, IT) | 9.95 | 2026 | paper |
| 10 | Gemma 3 (27B, IT) | 9.95 | 2026 | paper |