MT-Bench (also written MT-bench) is a multi-turn benchmark for evaluating the conversational and instruction-following abilities of large language model (LLM) chat assistants. It was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685). MT-Bench is a collection of open-ended, multi-turn questions designed to probe coherence, context maintenance, reasoning, and helpfulness in dialogue. The benchmark is commonly evaluated with an “LLM-as-a-judge” methodology, in which a strong LLM such as GPT-4 scores or ranks responses; the authors show this approach achieves high agreement with human preferences. Public Hugging Face mirrors of the MT-Bench data (e.g., philschmid/mt-bench and lighteval/mt-bench) expose the standard 80-item multi-turn set that is widely used for reporting a numeric MT-Bench score.
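The sketch below shows how such an evaluation might be wired up: loading the 80 prompts from a public mirror and filling a single-answer grading template for the judge model. This is a minimal illustration, not the paper's harness; the dataset split and field names ("train", "question_id", "category", "turns") are assumptions about the philschmid/mt-bench mirror, and the judge prompt is an abbreviated paraphrase of the style used in the paper.

```python
# Minimal sketch of MT-Bench-style single-answer grading.
# ASSUMPTIONS: the philschmid/mt-bench mirror exposes a "train" split with
# "question_id", "category", and "turns" fields; verify before relying on this.
from datasets import load_dataset

questions = load_dataset("philschmid/mt-bench", split="train")  # 80 multi-turn prompts

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the "
    "assistant's answer to the user question below on a scale of 1 to 10. "
    "Output your rating in the format [[rating]].\n\n"
    "Question: {question}\n\nAssistant's answer: {answer}"
)

def judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template for one (question, answer) pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

# Example: first turn of the first benchmark question, with a placeholder answer.
first = questions[0]
print(first["question_id"], first["category"])
print(judge_prompt(first["turns"][0], "<model answer goes here>"))
```

The resulting prompt is then sent to the judge model (e.g., GPT-4), once per turn, and the verdicts are aggregated into the reported score.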
Score (1–10) is the reported evaluation metric for MT-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
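For concreteness, here is a minimal sketch of how per-turn judge verdicts could be reduced to the single 1–10 number: parse the `[[rating]]` token the judge was asked to emit, then average across all graded turns. The parsing pattern mirrors the format requested in the judge prompt above and is an assumption, not the official scoring script.

```python
# Sketch: aggregate judge verdicts into one MT-Bench-style score (mean of
# per-turn ratings). Verdicts that contain no parseable rating are skipped.
import re
import statistics

def extract_rating(verdict: str) -> float | None:
    """Pull the numeric rating out of a judge verdict like '... [[8]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

def mt_bench_score(verdicts: list[str]) -> float:
    """Average the parsed 1-10 ratings over all graded turns."""
    ratings = [r for v in verdicts if (r := extract_rating(v)) is not None]
    return statistics.mean(ratings)

print(mt_bench_score(["Good answer. [[8]]", "Weak. [[4]]", "[[9]]"]))  # 7.0
```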
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Qwen2.5-72B-Instruct | paper | 9.35 | N/A | Source ↗ |