Polish Conversation Quality
Evaluates language models on multi-turn conversation quality in Polish across eight categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.
Datasets: 1
Results: 0
Canonical metric: pl-score
Canonical Benchmark
Polish MT-Bench
A Polish adaptation of MT-Bench that evaluates LLMs on multi-turn conversation quality across 8 categories: coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing. Responses are scored on a 1-10 scale by a GPT-4 judge. Created by SpeakLeash.
Primary metric: pl-score
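The page does not say how pl-score is computed from the individual judge ratings. As a rough illustration only, the sketch below assumes each response receives a 1-10 rating and that the overall score is the unweighted mean of the per-category means. The function name `aggregate_scores`, the tuple layout, and the sample ratings are hypothetical and are not taken from the benchmark's own tooling.

```python
from collections import defaultdict
from statistics import mean

# The eight categories named in the benchmark description.
CATEGORIES = (
    "coding", "extraction", "humanities", "math",
    "reasoning", "roleplay", "stem", "writing",
)


def aggregate_scores(judgments):
    """Aggregate judge ratings into per-category means and an overall mean.

    `judgments` is an iterable of (category, turn, score) tuples.
    ASSUMPTION: the overall score is the unweighted mean of the
    per-category means; the real pl-score may be defined differently.
    """
    per_category = defaultdict(list)
    for category, _turn, score in judgments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if not 1 <= score <= 10:
            raise ValueError(f"score out of the 1-10 range: {score!r}")
        per_category[category].append(score)

    if not per_category:
        raise ValueError("no judgments supplied")

    category_means = {c: mean(s) for c, s in per_category.items()}
    overall = mean(list(category_means.values()))
    return category_means, overall


if __name__ == "__main__":
    # Hypothetical ratings for two categories, two turns each.
    sample = [
        ("coding", 1, 8), ("coding", 2, 6),
        ("math", 1, 4), ("math", 2, 6),
    ]
    means, overall = aggregate_scores(sample)
    print(means, overall)
```

Averaging per category first keeps a category with more questions from dominating the total. If the benchmark instead averages over all rated turns directly, the two results differ whenever categories have unequal numbers of ratings.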
Top 10
Leading models on Polish MT-Bench.
No results yet.
All datasets
1 dataset tracked for this task.