Codesota · Benchmark · Polish MT-Bench

Polish MT-Bench.

Polish MT-Bench is a Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across eight categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. The benchmark was created by SpeakLeash.
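As a rough illustration of how eight per-category scores on a 1-10 scale could be combined into one number, the sketch below takes an unweighted mean. The averaging rule, the `overall_score` helper, and the example values are assumptions made for illustration; they are not taken from the benchmark's documentation or from any row on this page.

```python
from statistics import mean

# The eight categories named in the benchmark description.
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def overall_score(category_scores: dict[str, float]) -> float:
    """Unweighted mean of the eight category scores (assumed aggregation rule)."""
    missing = [c for c in CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for name in CATEGORIES:
        score = category_scores[name]
        # Judge scores are reported on a 1-10 scale.
        if not 1 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 1-10 scale")
    return round(mean(category_scores[c] for c in CATEGORIES), 2)


# Hypothetical scores, not taken from the leaderboard.
example = {
    "coding": 8.0, "extraction": 9.0, "humanities": 10.0, "math": 7.0,
    "reasoning": 8.0, "roleplay": 9.0, "stem": 10.0, "writing": 9.0,
}
print(overall_score(example))  # 8.75
```

A model that scores near 10 in humanities and STEM can still land well below that overall if coding or math scores are lower, which is one reason the per-category tables are worth reading alongside any aggregate.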

§ 01 · Leaderboard

Results by metric.

STEM

STEM is one of the category scores reported for Polish MT-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for STEM: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
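The muting rule can be read as a simple check per row: a row is muted if some other row from the same or an earlier year has a strictly higher score. The sketch below is one possible reading of that sentence, not Codesota's actual implementation; the `Row` type and `muted_rows` function are hypothetical names, and treating tied scores as not muted is an assumption. The sample rows are the top three STEM entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


def muted_rows(rows: list[Row]) -> list[str]:
    """Return the models whose rows would be muted under the assumed rule."""
    muted = []
    for row in rows:
        # Beaten by an earlier or same-year result with a strictly higher score.
        beaten = any(
            other.year <= row.year and other.score > row.score
            for other in rows
        )
        if beaten:
            muted.append(row.model)
    return muted


rows = [
    Row("Phi-4", 10.0, 2026),
    Row("gemma-3-12b-it", 10.0, 2026),
    Row("Gemma 3 (27B, IT)", 9.95, 2026),
]
print(muted_rows(rows))  # ['Gemma 3 (27B, IT)']
```

Because every row on this page carries the same year, only the top-scoring rows in each table would stay unmuted under this reading.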

Rank	Model	Trust	Score	Year
01	Phi-4	verified	10	2026
02	gemma-3-12b-it	verified	10	2026
03	Gemma 3 (27B, IT)	verified	9.95	2026
04	aya-expanse-32b	verified	9.95	2026
05	Mistral-Small-3.1-24B-Instruct-2503	verified	9.90	2026
06	Gemma-2-27b-it	verified	9.80	2026
07	aya-expanse-8b	verified	9.75	2026
08	Qwen2.5-32B-Instruct	verified	9.70	2026
09	Mistral-Small-Instruct-2409	verified	9.65	2026
10	gemma-3-4b-it	verified	9.65	2026
11	Qwen2.5-14B-Instruct	verified	9.60	2026
12	Qwen2-72B-Instruct	verified	9.55	2026
13	Meta-Llama-3.1-70B-Instruct	verified	9.55	2026
14	Mistral-Small-24B-Instruct-2501	verified	9.50	2026
15	Bielik-11B-v2.2-Instruct	verified	9.45	2026
16	Mistral-Large-Instruct-2407	verified	9.35	2026
17	GPT-3.5-turbo	verified	9.25	2026
18	Mixtral-8x22b	verified	9.25	2026
19	PLLuM-12B-nc-chat	verified	9.10	2026
20	Bielik-11B-v2.3-Instruct	verified	8.97	2026

Humanities

Humanities is one of the category scores reported for Polish MT-Bench.

Higher is better.

Trust tiers for Humanities: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Mistral-Small-Instruct-2409	verified	10	2026
02	Gemma-2-27b-it	verified	10	2026
03	Mistral-Small-3.1-24B-Instruct-2503	verified	10	2026
04	aya-expanse-32b	verified	10	2026
05	Gemma 3 (27B, IT)	verified	10	2026
06	gemma-3-12b-it	verified	10	2026
07	Phi-4	verified	9.95	2026
08	gemma-3-4b-it	verified	9.90	2026
09	Qwen2-72B-Instruct	verified	9.75	2026
10	GPT-3.5-turbo	verified	9.75	2026
11	Mistral-Small-24B-Instruct-2501	verified	9.70	2026
12	Qwen2.5-32B-Instruct	verified	9.65	2026
13	Meta-Llama-3.1-405B-Instruct	verified	9.65	2026
14	aya-expanse-8b	verified	9.65	2026
15	Bielik-11B-v2.3-Instruct	verified	9.50	2026
16	Mistral-Nemo-Instruct-2407	verified	9.50	2026
17	Llama-PLLuM-8B-chat	verified	9.50	2026
18	Meta-Llama-3.1-70B-Instruct	verified	9.50	2026
19	PLLuM-12B-nc-chat	verified	9.50	2026
20	Mixtral-8x7b	verified	9.45	2026
21	Bielik-11B-v2.0-Instruct	verified	9.43	2026
22	Mistral-Large-Instruct-2407	verified	9.40	2026
23	Bielik-11B-v2.2-Instruct	verified	9.40	2026
24	openchat-3.5-0106	verified	9.30	2026
25	PLLuM-12B-chat	verified	9.30	2026
26	Bielik-11B-v2.1-Instruct	verified	9.20	2026
27	Qwen2.5-14B-Instruct	verified	9.18	2026
28	Mixtral-8x22b	verified	9.10	2026

Roleplay

Roleplay is one of the category scores reported for Polish MT-Bench.

Higher is better.

Trust tiers for Roleplay: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Gemma 3 (27B, IT)	verified	9.95	2026
02	aya-expanse-32b	verified	9.70	2026
03	gemma-3-4b-it	verified	9.45	2026
04	gemma-3-12b-it	verified	9.45	2026
05	Bielik-11B-v2.1-Instruct	verified	9.45	2026
06	Mistral-Small-3.1-24B-Instruct-2503	verified	9.40	2026
07	aya-expanse-8b	verified	9.25	2026
08	Phi-4	verified	9.20	2026
09	Qwen2-72B-Instruct	verified	9.20	2026
10	Mixtral-8x22b	verified	9.05	2026
11	Mistral-Small-24B-Instruct-2501	verified	9.05	2026
12	Bielik-11B-v2.2-Instruct	verified	9.03	2026

Extraction

Extraction is one of the category scores reported for Polish MT-Bench.

Higher is better.

Trust tiers for Extraction: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Mistral-Small-24B-Instruct-2501	verified	9.90	2026
02	Gemma 3 (27B, IT)	verified	9.90	2026
03	Qwen2.5-32B-Instruct	verified	9.90	2026
04	Mistral-Large-Instruct-2407	verified	9.90	2026
05	Meta-Llama-3.1-70B-Instruct	verified	9.85	2026
06	Meta-Llama-3.1-405B-Instruct	verified	9.85	2026
07	Mistral-Small-3.1-24B-Instruct-2503	verified	9.80	2026
08	Qwen2-72B-Instruct	verified	9.80	2026
09	Gemma-2-27b-it	verified	9.60	2026
10	gemma-3-12b-it	verified	9.55	2026
11	Mixtral-8x22b	verified	9.55	2026
12	Llama-PLLuM-70B-chat	verified	9.45	2026
13	Bielik-11B-v2.3-Instruct	verified	9.43	2026
14	Phi-4	verified	9.30	2026
15	Bielik-11B-v2.2-Instruct	verified	9.30	2026
16	Qwen2.5-14B-Instruct	verified	9.25	2026
17	Mistral-Small-Instruct-2409	verified	9.15	2026
18	Bielik-11B-v2.1-Instruct	verified	9.13	2026
19	Meta-Llama-3.1-8B-Instruct	verified	9.10	2026

Writing

Writing is one of the category scores reported for Polish MT-Bench.

Higher is better.

Trust tiers for Writing: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Gemma 3 (27B, IT)	verified	9.70	2026
02	aya-expanse-32b	verified	9.60	2026
03	Bielik-11B-v2.1-Instruct	verified	9.50	2026
04	Bielik-11B-v2.3-Instruct	verified	9.50	2026
05	Bielik-11B-v2.2-Instruct	verified	9.35	2026
06	Mixtral-8x7b	verified	9.35	2026
07	aya-expanse-8b	verified	9.30	2026
08	gemma-3-12b-it	verified	9.30	2026
09	gemma-3-4b-it	verified	9.30	2026
10	Phi-4	verified	9.25	2026
11	Mixtral-8x22b	verified	9.25	2026
12	Meta-Llama-3.1-405B-Instruct	verified	9.20	2026
13	Mistral-Small-3.1-24B-Instruct-2503	verified	9.15	2026
14	GPT-3.5-turbo	verified	9.10	2026
15	Meta-Llama-3.1-70B-Instruct	verified	9.10	2026

Reasoning

Reasoning is one of the category scores reported for Polish MT-Bench.

Higher is better.

Trust tiers for Reasoning: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Phi-4	verified	9.55	2026
02	Qwen2.5-32B-Instruct	verified	9.10	2026
03	Mistral-Small-3.1-24B-Instruct-2503	verified	9.00	2026

PL Score

PL Score is one of the metrics reported for Polish MT-Bench.

Higher is better.

Trust tiers for PL Score: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	Gemma 3 (27B, IT)	verified	9.28	2026
02	Mistral-Small-3.1-24B-Instruct-2503	verified	9.18	2026
03	Phi-4	verified	9.07	2026
