Polish Conversation Quality

Evaluates language models on multi-turn conversation quality in Polish across eight categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.

Datasets: 1
Results: 0
Canonical metric: pl-score

Canonical Benchmark

Polish MT-Bench

Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.

Primary metric: pl-score
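
This page does not say how pl-score is computed from the individual judge ratings. As an illustration only, the sketch below assumes the simplest aggregation: an unweighted mean of the 1-10 ratings, reported per category and overall. The function name aggregate_scores and the (category, turn, score) input layout are hypothetical and are not taken from the Polish MT-Bench tooling.

```python
from collections import defaultdict
from statistics import mean

# The eight categories named in the benchmark description.
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def aggregate_scores(ratings):
    """Average judge ratings per category and overall.

    ratings: iterable of (category, turn, score) tuples, where score is a
    judge rating on the 1-10 scale. ASSUMPTION: a plain unweighted mean;
    the actual pl-score definition is not given on this page.
    """
    by_category = defaultdict(list)
    for category, turn, score in ratings:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range 1-10: {score!r}")
        by_category[category].append(score)

    # Mean rating within each category that has at least one rating.
    per_category = {c: mean(s) for c, s in by_category.items()}
    # Mean over every individual rating, regardless of category.
    overall = mean(s for scores in by_category.values() for s in scores)
    return per_category, overall
```

Averaging over every rating weights each category by how many ratings it has. If categories should count equally, the overall value would instead be the mean of the per-category means. An empty input raises statistics.StatisticsError.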

Top 10

Leading models on Polish MT-Bench.

No results yet. Be the first to contribute.

All datasets

1 dataset tracked for this task.
