MT-Bench (also written MT-bench) is a multi-turn benchmark for evaluating the conversational and instruction-following abilities of large language model (LLM) chat assistants. It was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685). MT-Bench is a collection of open-ended, multi-turn questions designed to probe coherence, context maintenance, reasoning, and helpfulness in dialogue. The benchmark is commonly evaluated with an “LLM-as-a-judge” methodology, in which a strong LLM such as GPT-4 scores or ranks responses; the authors show this approach achieves high agreement with human preferences. Public Hugging Face mirrors of the MT-Bench data (e.g., philschmid/mt-bench and lighteval/mt-bench) expose the standard 80-item multi-turn set that is widely used for reporting a numeric MT-Bench score.
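The sketch below shows how such an evaluation might be wired up: loading the 80 prompts from a public mirror and filling a single-answer grading template for the judge model. This is a minimal illustration, not the paper's harness; the dataset split and field names ("train", "question_id", "category", "turns") are assumptions about the philschmid/mt-bench mirror, and the judge prompt is an abbreviated paraphrase of the style used in the paper.

```python
# Minimal sketch of MT-Bench-style single-answer grading.
# ASSUMPTIONS: the philschmid/mt-bench mirror exposes a "train" split with
# "question_id", "category", and "turns" fields; verify before relying on this.
from datasets import load_dataset

questions = load_dataset("philschmid/mt-bench", split="train")  # 80 multi-turn prompts

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the "
    "assistant's answer to the user question below on a scale of 1 to 10. "
    "Output your rating in the format [[rating]].\n\n"
    "Question: {question}\n\nAssistant's answer: {answer}"
)

def judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template for one (question, answer) pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

# Example: first turn of the first benchmark question, with a placeholder answer.
first = questions[0]
print(first["question_id"], first["category"])
print(judge_prompt(first["turns"][0], "<model answer goes here>"))
```

The resulting prompt is then sent to the judge model (e.g., GPT-4), once per turn, and the verdicts are aggregated into the reported score.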
Score (1–10) is the reported evaluation metric for MT-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
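For concreteness, here is a minimal sketch of how per-turn judge verdicts could be reduced to the single 1–10 number: parse the `[[rating]]` token the judge was asked to emit, then average across all graded turns. The parsing pattern mirrors the format requested in the judge prompt above and is an assumption, not the official scoring script.

```python
# Sketch: aggregate judge verdicts into one MT-Bench-style score (mean of
# per-turn ratings). Verdicts that contain no parseable rating are skipped.
import re
import statistics

def extract_rating(verdict: str) -> float | None:
    """Pull the numeric rating out of a judge verdict like '... [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

def mt_bench_score(verdicts: list[str]) -> float:
    """Average the parsed 1-10 ratings over all graded turns."""
    ratings = [r for v in verdicts if (r := extract_rating(v)) is not None]
    return statistics.mean(ratings)

print(mt_bench_score(["Good answer. [[8]]", "Weak. [[4]]", "[[9]]"]))  # 7.0
```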
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | paper | 9.35 | N/A | Source ↗ |