Polish MT-Bench is a Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.
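As a rough illustration of how per-category scores on a 1-10 scale could be aggregated, the sketch below averages judge scores within each category and then across categories. The function names, the input format, and the unweighted mean are assumptions made for illustration; the page does not state how Polish MT-Bench combines its scores, and the sample judgments are invented.

```python
from statistics import mean

# The eight categories named in the benchmark description.
CATEGORIES = [
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
]


def category_means(judgments):
    """Average judge scores per category.

    `judgments` is a list of (category, score) pairs (hypothetical format),
    where each score is on the 1-10 scale.
    """
    by_category = {c: [] for c in CATEGORIES}
    for category, score in judgments:
        if category not in by_category:
            raise ValueError(f"unknown category: {category}")
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        by_category[category].append(score)
    # Skip categories with no judgments instead of reporting a fake zero.
    return {c: mean(s) for c, s in by_category.items() if s}


def overall_score(per_category):
    """Unweighted mean of category means (an assumed aggregation rule)."""
    return mean(list(per_category.values()))


# Invented sample judgments, for illustration only.
sample = [("coding", 8), ("coding", 6), ("math", 10)]
per_category = category_means(sample)
print(per_category, overall_score(per_category))
```

Averaging category means rather than raw scores keeps a category with many questions from dominating the total; whether the benchmark does this is not confirmed by the source.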
STEM is one of the evaluation metrics reported for Polish MT-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
Humanities is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
Roleplay is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Gemma 3 (27B, IT) | verified | 9.95 | 2026 |
| 02 | aya-expanse-32b | verified | 9.70 | 2026 |
| 03 | gemma-3-4b-it | verified | 9.45 | 2026 |
| 04 | gemma-3-12b-it | verified | 9.45 | 2026 |
| 05 | Bielik-11B-v2.1-Instruct | verified | 9.45 | 2026 |
| 06 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.40 | 2026 |
| 07 | aya-expanse-8b | verified | 9.25 | 2026 |
| 08 | Phi-4 | verified | 9.20 | 2026 |
| 09 | Qwen2-72B-Instruct | verified | 9.20 | 2026 |
| 10 | Mixtral-8x22b | verified | 9.05 | 2026 |
| 11 | Mistral-Small-24B-Instruct-2501 | verified | 9.05 | 2026 |
| 12 | Bielik-11B-v2.2-Instruct | verified | 9.03 | 2026 |
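The Roleplay table gives consecutive ranks to models with identical scores (three models share 9.45, for example). A minimal sketch of standard competition ranking, in which tied models share a rank and the next rank is skipped accordingly, is shown below. The function name is hypothetical, and this is one common tie-handling convention rather than the method the leaderboard uses.

```python
def competition_rank(rows):
    """Rank (model, score) pairs so that equal scores share a rank.

    Returns a list of (rank, model, score) tuples, highest score first.
    """
    # sorted() is stable, so tied models keep their original relative order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]   # tie: reuse the previous rank
        else:
            rank = position        # no tie: rank equals 1-based position
        ranked.append((rank, model, score))
    return ranked


# First six rows of the Roleplay table.
roleplay = [
    ("Gemma 3 (27B, IT)", 9.95),
    ("aya-expanse-32b", 9.70),
    ("gemma-3-4b-it", 9.45),
    ("gemma-3-12b-it", 9.45),
    ("Bielik-11B-v2.1-Instruct", 9.45),
    ("Mistral-Small-3.1-24B-Instruct-2503", 9.40),
]

for rank, model, score in competition_rank(roleplay):
    print(f"{rank:02d}  {model}  {score:.2f}")
```

Under this convention the three models at 9.45 all receive rank 03 and the next model receives rank 06.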
Extraction is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
Writing is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Gemma 3 (27B, IT) | verified | 9.70 | 2026 |
| 02 | aya-expanse-32b | verified | 9.60 | 2026 |
| 03 | Bielik-11B-v2.1-Instruct | verified | 9.50 | 2026 |
| 04 | Bielik-11B-v2.3-Instruct | verified | 9.50 | 2026 |
| 05 | Bielik-11B-v2.2-Instruct | verified | 9.35 | 2026 |
| 06 | Mixtral-8x7b | verified | 9.35 | 2026 |
| 07 | aya-expanse-8b | verified | 9.30 | 2026 |
| 08 | gemma-3-12b-it | verified | 9.30 | 2026 |
| 09 | gemma-3-4b-it | verified | 9.30 | 2026 |
| 10 | Phi-4 | verified | 9.25 | 2026 |
| 11 | Mixtral-8x22b | verified | 9.25 | 2026 |
| 12 | Meta-Llama-3.1-405B-Instruct | verified | 9.20 | 2026 |
| 13 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.15 | 2026 |
| 14 | GPT-3.5-turbo | verified | 9.10 | 2026 |
| 15 | Meta-Llama-3.1-70B-Instruct | verified | 9.10 | 2026 |
Reasoning is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Phi-4 | verified | 9.55 | 2026 |
| 02 | Qwen2.5-32B-Instruct | verified | 9.10 | 2026 |
| 03 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.00 | 2026 |
Pl Score is one of the evaluation metrics reported for Polish MT-Bench.
Higher is better
| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Gemma 3 (27B, IT) | verified | 9.28 | 2026 |
| 02 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.18 | 2026 |
| 03 | Phi-4 | verified | 9.07 | 2026 |
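Because each metric has its own table, comparing one model across metrics means reading several tables side by side. The sketch below pivots the scores of the three models in the Pl Score table into one record per model and reports each model's strongest listed category. The dictionary layout and function names are hypothetical, and Pl Score is left out of the category comparison on the assumption that it is an aggregate rather than a category; the page does not say so explicitly.

```python
# Scores copied from the Roleplay, Writing, Reasoning, and Pl Score tables.
# A model missing from a metric is simply not listed in that table here;
# it must not be read as a score of zero.
SCORES = {
    "roleplay": {
        "Gemma 3 (27B, IT)": 9.95,
        "Mistral-Small-3.1-24B-Instruct-2503": 9.40,
        "Phi-4": 9.20,
    },
    "writing": {
        "Gemma 3 (27B, IT)": 9.70,
        "Mistral-Small-3.1-24B-Instruct-2503": 9.15,
        "Phi-4": 9.25,
    },
    "reasoning": {
        "Mistral-Small-3.1-24B-Instruct-2503": 9.00,
        "Phi-4": 9.55,
    },
    "pl_score": {
        "Gemma 3 (27B, IT)": 9.28,
        "Mistral-Small-3.1-24B-Instruct-2503": 9.18,
        "Phi-4": 9.07,
    },
}


def pivot_by_model(scores):
    """Turn {metric: {model: score}} into {model: {metric: score}}."""
    by_model = {}
    for metric, rows in scores.items():
        for model, score in rows.items():
            by_model.setdefault(model, {})[metric] = score
    return by_model


def best_category(model_scores, exclude=("pl_score",)):
    """Return the highest-scoring metric for one model, skipping `exclude`."""
    candidates = {m: s for m, s in model_scores.items() if m not in exclude}
    return max(candidates, key=candidates.get)


by_model = pivot_by_model(SCORES)
for model, metrics in by_model.items():
    print(f"{model}: best listed category = {best_category(metrics)}")
```

With these values, Phi-4's strongest listed category is reasoning, while the other two models score highest on roleplay.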