A comprehensive multi-dimensional benchmark for evaluating large language models' alignment capabilities in Chinese. AlignBench contains 683 high-quality samples curated through a human-in-the-loop data curation pipeline across 8 main categories: Fundamental Language Ability (68 samples), Chinese Advanced Understanding (58), Open-ended Questions (38), Writing Ability (75), Logical Reasoning (92), Mathematics (112), Task-oriented Role Play (116), and Professional Knowledge (124). Each sample includes a task-oriented query, a high-quality reference answer supported by evidence from reliable web sources, and a category label.

The benchmark uses a multi-dimensional, rule-calibrated LLM-as-Judge approach with Chain-of-Thought to generate explanations and ratings on a 1-10 scale, employing GPT-4 or the dedicated CritiqueLLM evaluator (which recovers 95% of GPT-4's evaluation ability). The evaluation ensures high reliability and interpretability through point-wise grading, Chain-of-Thought reasoning, and rule-calibrated referencing. Since release, AlignBench has been adopted by top Chinese LLMs including ChatGLM, Qwen, DeepSeek, Yi, Baichuan, and Abab.
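To make the point-wise grading concrete, here is a minimal Python sketch of the judging loop's two halves: assembling a reference-grounded prompt and parsing the judge's final score. The `Rating: [[n]]` format and the prompt wording are illustrative assumptions, a common LLM-as-Judge convention rather than AlignBench's exact schema; `build_judge_prompt` and `parse_rating` are hypothetical helper names.

```python
import re

def build_judge_prompt(query: str, reference: str, answer: str) -> str:
    """Assemble a point-wise grading prompt (illustrative wording).

    Asks the judge to reason step by step (Chain-of-Thought) against the
    reference answer before committing to a single 1-10 score.
    """
    return (
        "You are grading a model's answer to a Chinese-language query.\n"
        f"[Query]\n{query}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Model answer]\n{answer}\n\n"
        "Compare the model answer against the reference, explain your "
        "reasoning step by step, then end with a final score written as "
        "'Rating: [[n]]', where n is an integer from 1 to 10."
    )

def parse_rating(judge_output: str) -> int:
    """Extract the final 1-10 rating from a judge's response.

    Assumes the judge ends its explanation with e.g. 'Rating: [[7]]'.
    """
    match = re.search(r"\[\[(\d+)\]\]", judge_output)
    if match is None:
        raise ValueError("no rating found in judge output")
    rating = int(match.group(1))
    if not 1 <= rating <= 10:
        raise ValueError(f"rating {rating} is outside the 1-10 scale")
    return rating

# Example: parsing a judge response.
sample = "The answer is mostly correct but skips one step.\nRating: [[7]]"
print(parse_rating(sample))  # → 7
```

In a full harness, the prompt would be sent to GPT-4 or CritiqueLLM and the parsed score averaged per category to produce the leaderboard numbers.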
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.