Codesota · Benchmark · Polish MT-Bench

Polish MT-Bench.

Polish MT-Bench is a Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Scores are on a 1-10 scale and are judged by GPT-4. The benchmark was created by SpeakLeash.
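The page does not say how the individual GPT-4 ratings are combined into a category score. In the original MT-Bench, each category score is the mean of the judge's 1-10 ratings over that category's questions and turns. Assuming the Polish adaptation keeps that scheme, the aggregation could be sketched as follows. The `category_score` helper and the ratings are hypothetical and are not taken from SpeakLeash's code or data.

```python
from statistics import mean
from typing import Sequence


def category_score(ratings: Sequence[int]) -> float:
    """Average judge ratings (each on the 1-10 scale) into one category score.

    Assumption: an unweighted mean over all rated turns, as in the
    original MT-Bench. The page itself does not confirm this.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be on the 1-10 scale")
    return round(mean(ratings), 2)


# Hypothetical ratings: 20 judged turns, 19 rated 10 and one rated 9.
hypothetical_ratings = [10] * 19 + [9]
score = category_score(hypothetical_ratings)
print(score)
```

Under this assumption, a category with 20 judged turns can only produce scores in steps of 0.05, which is one way to sanity-check a reported number.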

§ 01 · Leaderboard

Results by metric.


STEM

STEM is one of the reported evaluation metrics for Polish MT-Bench. Codesota tracks published model scores on each metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for STEM: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Phi-4 | verified | 10 | 2026 |
| 02 | gemma-3-12b-it | verified | 10 | 2026 |
| 03 | Gemma 3 (27B, IT) | verified | 9.95 | 2026 |
| 04 | aya-expanse-32b | verified | 9.95 | 2026 |
| 05 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.90 | 2026 |
| 06 | Gemma-2-27b-it | verified | 9.80 | 2026 |
| 07 | aya-expanse-8b | verified | 9.75 | 2026 |
| 08 | Qwen2.5-32B-Instruct | verified | 9.70 | 2026 |
| 09 | Mistral-Small-Instruct-2409 | verified | 9.65 | 2026 |
| 10 | gemma-3-4b-it | verified | 9.65 | 2026 |
| 11 | Qwen2.5-14B-Instruct | verified | 9.60 | 2026 |
| 12 | Qwen2-72B-Instruct | verified | 9.55 | 2026 |
| 13 | Meta-Llama-3.1-70B-Instruct | verified | 9.55 | 2026 |
| 14 | Mistral-Small-24B-Instruct-2501 | verified | 9.50 | 2026 |
| 15 | Bielik-11B-v2.2-Instruct | verified | 9.45 | 2026 |
| 16 | Mistral-Large-Instruct-2407 | verified | 9.35 | 2026 |
| 17 | GPT-3.5-turbo | verified | 9.25 | 2026 |
| 18 | Mixtral-8x22b | verified | 9.25 | 2026 |
| 19 | PLLuM-12B-nc-chat | verified | 9.10 | 2026 |
| 20 | Bielik-11B-v2.3-Instruct | verified | 8.97 | 2026 |
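Because higher is better on every metric here, ordering a leaderboard reduces to a descending sort on score. The sketch below uses four rows copied from the STEM results above. The `rank_rows` helper is hypothetical and is not Codesota's code. Giving tied scores sequential ranks in input order is an assumption that mirrors how the page numbers its ties.

```python
from typing import List, Tuple

Row = Tuple[str, float]  # (model, score)


def rank_rows(rows: List[Row]) -> List[Tuple[int, str, float]]:
    """Sort rows by score, highest first, and assign sequential ranks.

    Python's sort is stable, so models with equal scores keep their
    input order (an assumed tie-breaking rule, not a documented one).
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(rank, model, score)
            for rank, (model, score) in enumerate(ordered, start=1)]


# A few STEM rows from the leaderboard, deliberately out of order.
stem_rows = [
    ("Qwen2.5-32B-Instruct", 9.70),
    ("Phi-4", 10.0),
    ("Gemma 3 (27B, IT)", 9.95),
    ("gemma-3-12b-it", 10.0),
]

ranked = rank_rows(stem_rows)
for rank, model, score in ranked:
    print(f"{rank:02d} {model} {score:.2f}")
```

Sequential ranks hide ties, so readers comparing models at the top of a table should look at the score column rather than the rank alone.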

Humanities

Humanities is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Humanities: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Mistral-Small-Instruct-2409 | verified | 10 | 2026 |
| 02 | Gemma-2-27b-it | verified | 10 | 2026 |
| 03 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 10 | 2026 |
| 04 | aya-expanse-32b | verified | 10 | 2026 |
| 05 | Gemma 3 (27B, IT) | verified | 10 | 2026 |
| 06 | gemma-3-12b-it | verified | 10 | 2026 |
| 07 | Phi-4 | verified | 9.95 | 2026 |
| 08 | gemma-3-4b-it | verified | 9.90 | 2026 |
| 09 | Qwen2-72B-Instruct | verified | 9.75 | 2026 |
| 10 | GPT-3.5-turbo | verified | 9.75 | 2026 |
| 11 | Mistral-Small-24B-Instruct-2501 | verified | 9.70 | 2026 |
| 12 | Qwen2.5-32B-Instruct | verified | 9.65 | 2026 |
| 13 | Meta-Llama-3.1-405B-Instruct | verified | 9.65 | 2026 |
| 14 | aya-expanse-8b | verified | 9.65 | 2026 |
| 15 | Bielik-11B-v2.3-Instruct | verified | 9.50 | 2026 |
| 16 | Mistral-Nemo-Instruct-2407 | verified | 9.50 | 2026 |
| 17 | Llama-PLLuM-8B-chat | verified | 9.50 | 2026 |
| 18 | Meta-Llama-3.1-70B-Instruct | verified | 9.50 | 2026 |
| 19 | PLLuM-12B-nc-chat | verified | 9.50 | 2026 |
| 20 | Mixtral-8x7b | verified | 9.45 | 2026 |
| 21 | Bielik-11B-v2.0-Instruct | verified | 9.43 | 2026 |
| 22 | Mistral-Large-Instruct-2407 | verified | 9.40 | 2026 |
| 23 | Bielik-11B-v2.2-Instruct | verified | 9.40 | 2026 |
| 24 | openchat-3.5-0106 | verified | 9.30 | 2026 |
| 25 | PLLuM-12B-chat | verified | 9.30 | 2026 |
| 26 | Bielik-11B-v2.1-Instruct | verified | 9.20 | 2026 |
| 27 | Qwen2.5-14B-Instruct | verified | 9.18 | 2026 |
| 28 | Mixtral-8x22b | verified | 9.10 | 2026 |

Roleplay

Roleplay is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Roleplay: verified, paper, vendor, community, unverified

Extraction

Extraction is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Extraction: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Mistral-Small-24B-Instruct-2501 | verified | 9.90 | 2026 |
| 02 | Gemma 3 (27B, IT) | verified | 9.90 | 2026 |
| 03 | Qwen2.5-32B-Instruct | verified | 9.90 | 2026 |
| 04 | Mistral-Large-Instruct-2407 | verified | 9.90 | 2026 |
| 05 | Meta-Llama-3.1-70B-Instruct | verified | 9.85 | 2026 |
| 06 | Meta-Llama-3.1-405B-Instruct | verified | 9.85 | 2026 |
| 07 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.80 | 2026 |
| 08 | Qwen2-72B-Instruct | verified | 9.80 | 2026 |
| 09 | Gemma-2-27b-it | verified | 9.60 | 2026 |
| 10 | gemma-3-12b-it | verified | 9.55 | 2026 |
| 11 | Mixtral-8x22b | verified | 9.55 | 2026 |
| 12 | Llama-PLLuM-70B-chat | verified | 9.45 | 2026 |
| 13 | Bielik-11B-v2.3-Instruct | verified | 9.43 | 2026 |
| 14 | Phi-4 | verified | 9.30 | 2026 |
| 15 | Bielik-11B-v2.2-Instruct | verified | 9.30 | 2026 |
| 16 | Qwen2.5-14B-Instruct | verified | 9.25 | 2026 |
| 17 | Mistral-Small-Instruct-2409 | verified | 9.15 | 2026 |
| 18 | Bielik-11B-v2.1-Instruct | verified | 9.13 | 2026 |
| 19 | Meta-Llama-3.1-8B-Instruct | verified | 9.10 | 2026 |

Writing

Writing is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Writing: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Gemma 3 (27B, IT) | verified | 9.70 | 2026 |
| 02 | aya-expanse-32b | verified | 9.60 | 2026 |
| 03 | Bielik-11B-v2.1-Instruct | verified | 9.50 | 2026 |
| 04 | Bielik-11B-v2.3-Instruct | verified | 9.50 | 2026 |
| 05 | Bielik-11B-v2.2-Instruct | verified | 9.35 | 2026 |
| 06 | Mixtral-8x7b | verified | 9.35 | 2026 |
| 07 | aya-expanse-8b | verified | 9.30 | 2026 |
| 08 | gemma-3-12b-it | verified | 9.30 | 2026 |
| 09 | gemma-3-4b-it | verified | 9.30 | 2026 |
| 10 | Phi-4 | verified | 9.25 | 2026 |
| 11 | Mixtral-8x22b | verified | 9.25 | 2026 |
| 12 | Meta-Llama-3.1-405B-Instruct | verified | 9.20 | 2026 |
| 13 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.15 | 2026 |
| 14 | GPT-3.5-turbo | verified | 9.10 | 2026 |
| 15 | Meta-Llama-3.1-70B-Instruct | verified | 9.10 | 2026 |

Reasoning

Reasoning is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Reasoning: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Phi-4 | verified | 9.55 | 2026 |
| 02 | Qwen2.5-32B-Instruct | verified | 9.10 | 2026 |
| 03 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.00 | 2026 |

Pl Score

Pl Score is one of the reported evaluation metrics for Polish MT-Bench.

Higher is better

Trust tiers for Pl Score: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Gemma 3 (27B, IT) | verified | 9.28 | 2026 |
| 02 | Mistral-Small-3.1-24B-Instruct-2503 | verified | 9.18 | 2026 |
| 03 | Phi-4 | verified | 9.07 | 2026 |
§ 04 · Submit a result

Add to the leaderboard.
