A comprehensive multi-dimensional benchmark for evaluating large language models' alignment capabilities in Chinese. AlignBench contains 683 high-quality samples curated through a human-in-the-loop data curation pipeline across 8 main categories: Fundamental Language Ability (68 samples), Chinese Advanced Understanding (58), Open-ended Questions (38), Writing Ability (75), Logical Reasoning (92), Mathematics (112), Task-oriented Role Play (116), and Professional Knowledge (124). Each sample includes a task-oriented query, a high-quality reference answer supported by evidence from reliable web sources, and a category label.

The benchmark uses a multi-dimensional, rule-calibrated LLM-as-Judge approach with Chain-of-Thought to generate explanations and ratings on a 1-10 scale, employing GPT-4 or the dedicated CritiqueLLM evaluator (which recovers 95% of GPT-4's evaluation ability). The evaluation ensures high reliability and interpretability through point-wise grading, Chain-of-Thought reasoning, and rule-calibrated referencing. Since release, AlignBench has been adopted by top Chinese LLMs including ChatGLM, Qwen, DeepSeek, Yi, Baichuan, and Abab.
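To make the point-wise grading concrete, here is a minimal Python sketch of the judging loop's two halves: assembling a reference-grounded prompt and parsing the judge's final score. The `Rating: [[n]]` format and the prompt wording are illustrative assumptions, a common LLM-as-Judge convention rather than AlignBench's exact schema; `build_judge_prompt` and `parse_rating` are hypothetical helper names.

```python
import re

def build_judge_prompt(query: str, reference: str, answer: str) -> str:
    """Assemble a point-wise grading prompt (illustrative wording).

    Asks the judge to reason step by step (Chain-of-Thought) against the
    reference answer before committing to a single 1-10 score.
    """
    return (
        "You are grading a model's answer to a Chinese-language query.\n"
        f"[Query]\n{query}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Model answer]\n{answer}\n\n"
        "Compare the model answer against the reference, explain your "
        "reasoning step by step, then end with a final score written as "
        "'Rating: [[n]]', where n is an integer from 1 to 10."
    )

def parse_rating(judge_output: str) -> int:
    """Extract the final 1-10 rating from a judge's response.

    Assumes the judge ends its explanation with e.g. 'Rating: [[7]]'.
    """
    match = re.search(r"\[\[(\d+)\]\]", judge_output)
    if match is None:
        raise ValueError("no rating found in judge output")
    rating = int(match.group(1))
    if not 1 <= rating <= 10:
        raise ValueError(f"rating {rating} is outside the 1-10 scale")
    return rating

# Example: parsing a judge response.
sample = "The answer is mostly correct but skips one step.\nRating: [[7]]"
print(parse_rating(sample))  # → 7
```

In a full harness, the prompt would be sent to GPT-4 or CritiqueLLM and the parsed score averaged per category to produce the leaderboard numbers.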
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.