| Benchmark | Full title | Metric | Qwen2.5-Plus | Qwen2.5-72B-Instruct |
|---|---|---|---|---|
| ARC | AI2 Reasoning Challenge | | | |
| AlignBench | AlignBench: Benchmarking Chinese Alignment of Large Language Models | Accuracy | 81.4 | |
| AutoLogi | AutoLogi: Automated Logic Puzzle Benchmark | | | |
| BIRD-SQL (dev) | BIRD-SQL: BIg Bench for Large-Scale Database-Grounded Text-to-SQLs | | | |
| C-Eval | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | | | |
| DROP | Discrete Reasoning Over Paragraphs | | | |
| ECLeKTic | ECLeKTic: A Multi-Lingual Knowledge Testing Dataset | | | |
| GPQA | GPQA: Graduate-Level Google-Proof Q&A Benchmark | Accuracy | 49.7 | |
| HellaSwag | HellaSwag: Can a Machine Really Finish Your Sentence? | | | |
| IFEval | Instruction-Following Evaluation | Accuracy | 86.3 | |
| INCLUDE | INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge | | | |
| LV-Eval | LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Accuracy | 54.6 | 60.4 |
| LongBench v2 | LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | | | |
| LongBench-Chat | LongBench-Chat: Long Context Instruction-Following Benchmark | Score (1-10) | | 8.72 |
| MATH | Measuring Mathematical Problem Solving (MATH Dataset) | Accuracy | 84.7 | |
| MGSM | Multilingual Grade School Math | Accuracy | 72.5 | 88.16 |
| MMLU-Redux | MMLU-Redux: Massive Multitask Language Understanding Redux | Accuracy | | 86.8 |
| MRCR v2 (1M) | Multi-Round Co-reference Resolution, 1M-token context | | | |
| MT-Bench | MT-Bench: Multi-Turn Benchmark | Score (1-10) | | 9.35 |
| Multi-IF | Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | | | |
| MultiChallenge | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | | | |
| OpenBookQA | Open Book Question Answering | | | |
| RULER | RULER: What’s the Real Context Size of Your Long-Context Language Models? | Accuracy | | 95.1 |
| SafetyBench | SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions | | | |
| SuperGPQA | SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines | | | |
| TriviaQA | TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension | | | |
| WinoGrande | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | | | |
| WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | Accuracy | | 79.97 |