
MT-Bench.

MT-Bench is a multi-turn benchmark for evaluating the conversational and instruction-following abilities of large language model (LLM) chat assistants. It was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685). MT-Bench is a collection of open-ended, multi-turn questions designed to probe coherence, context maintenance, reasoning, and helpfulness in dialogue. The benchmark is commonly evaluated with an “LLM-as-a-judge” methodology, in which a strong LLM such as GPT-4 scores or ranks responses; the authors show this approach can achieve high agreement with human preferences. Public Hugging Face mirrors of the MT-Bench data (e.g., philschmid/mt-bench and lighteval/mt-bench) commonly expose the 80-item multi-turn question set that is widely used for reporting a numeric MT-Bench score.
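As a minimal sketch of working with the benchmark data, the snippet below loads one of the public mirrors with the Hugging Face datasets library and walks its multi-turn prompts. The split name ("train") and the field names (question_id, category, turns) are assumptions based on the upstream MT-Bench question format, not guaranteed by this page.

    from datasets import load_dataset

    # Load the 80-question MT-Bench set from a public mirror
    # (split and field names are assumptions; check the dataset card).
    questions = load_dataset("philschmid/mt-bench", split="train")

    for q in questions:
        # Each record pairs an id/category with a list of per-turn prompts.
        print(q["question_id"], q["category"])
        for turn_idx, prompt in enumerate(q["turns"], start=1):
            print(f"  turn {turn_idx}: {prompt[:60]}...")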

§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only one model is currently listed on this benchmark.
Help build the community leaderboard by submitting your model results.

Score (1–10)

Score (1–10) is the reported evaluation metric for MT-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
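In the reference judging scripts released with the MT-Bench paper (FastChat), the judge model is prompted to emit its 1–10 rating in the format "[[rating]]", and per-turn ratings are averaged into the overall score. The sketch below illustrates that parse-and-average step; parse_rating and the sample judgments are hypothetical, not drawn from this page.

    import re
    from statistics import mean

    def parse_rating(judgment):
        """Extract a 1-10 rating written as [[N]] from judge output."""
        match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
        return float(match.group(1)) if match else None

    # Toy judge outputs for the two turns of one question (illustrative only).
    judgments = [
        "The answer is accurate and well structured. Rating: [[9]]",
        "Misses the follow-up constraint. Rating: [[6]]",
    ]

    # Average all parsed per-turn ratings into a single score.
    ratings = [r for j in judgments if (r := parse_rating(j)) is not None]
    print(f"MT-Bench score: {mean(ratings):.2f}")  # -> 7.50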

Trust tiers for Score (1–10): verified · paper · vendor · community · unverified
Rank  Model                                              Trust  Score  Year  Source
01    Qwen2.5-72B-Instruct (dataset: MTbench; task: 5)   paper  9.35   N/A   Source ↗
§ 04 · Submit a result

Add to the leaderboard.
