MMLU
Massive Multitask Language Understanding — the definitive benchmark for measuring broad AI knowledge across 57 academic subjects from STEM to humanities. 14,042 multiple-choice questions.
Current SOTA: 92.4% (5-shot)
Total Questions: 14,042 (57 subjects)
Subject Areas: 57 (4 categories)
Random Baseline: 25% (4-choice MCQ)
Human Expert: 89.8% (specialist average)
What is MMLU?
MMLU (Massive Multitask Language Understanding) tests a model's knowledge and reasoning across 57 academic subjects, from abstract algebra to world religions. Each question is multiple-choice with four options.
Created by Dan Hendrycks et al. at UC Berkeley, MMLU has become the most widely reported benchmark for comparing large language models. It covers professional-level questions in medicine, law, engineering, and more — making it a proxy for “how much does this model know?”
The standard evaluation uses 5-shot prompting (5 examples before the question). Since 2024, MMLU-Pro offers a harder variant with 10-choice questions and chain-of-thought reasoning.
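The 5-shot setup can be sketched as follows. This is a minimal illustration of the prompt format popularized by Hendrycks et al.'s evaluation code (a subject header, five answered example questions, then the test question ending in a bare "Answer:"); the example records and field names (`question`, `choices`, `answer` as a 0-3 index) mirror the common MMLU data layout but are assumptions here, not an official API.

```python
# Sketch of a standard MMLU 5-shot prompt builder. Questions are dicts with
# "question" (str), "choices" (list of 4 strings), and "answer" (index 0-3).

CHOICE_LETTERS = ["A", "B", "C", "D"]

def format_question(q: dict, include_answer: bool = True) -> str:
    """Render one question as text; optionally append the gold answer letter."""
    lines = [q["question"]]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(CHOICE_LETTERS, q["choices"])]
    answer = f" {CHOICE_LETTERS[q['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)

def build_5shot_prompt(subject: str, dev_examples: list[dict], test_q: dict) -> str:
    """Five answered dev examples, then the test question with no answer."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    shots = "\n\n".join(format_question(ex) for ex in dev_examples[:5])
    return header + shots + "\n\n" + format_question(test_q, include_answer=False)
```

The model's next token after the trailing "Answer:" is then compared against the gold letter.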
Subject Categories
STEM: Physics, Chemistry, Math, CS, Engineering, Biology
Humanities: History, Philosophy, Law, Literature
Social Sciences: Psychology, Economics, Political Science, Sociology
Other (Professional): Professional Medicine, Accounting, Clinical Knowledge
SOTA Progress: 43.9% → 92.4%
MMLU 5-shot accuracy over time. From barely above random to above human-expert level in six years.
Leaderboard — MMLU 5-shot
Top models by accuracy. Updated March 2026.
| # | Model | Org | Score | Type | Params | Date |
|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 92.4% | API | Unknown | 2026-02 |
| 2 | Claude Opus 4.5 | Anthropic | 91.8% | API | Unknown | 2026-01 |
| 3 | Gemini 3 Pro | Google | 91.4% | API | Unknown | 2026-01 |
| 4 | Claude Opus 4.6 | Anthropic | 91.2% | API | Unknown | 2026-03 |
| 5 | GPT-5 | OpenAI | 90.8% | API | Unknown | 2025-09 |
| 6 | Claude Sonnet 4.5 | Anthropic | 90.4% | API | Unknown | 2025-12 |
| 7 | Gemini 3 Flash | Google | 89.6% | API | Unknown | 2026-01 |
| 8 | Qwen 3 72B | Alibaba | 88.7% | Open | 72B | 2025-11 |
| 9 | DeepSeek V3.5 | DeepSeek | 88.2% | Open | 685B MoE | 2025-10 |
| 10 | Llama 4 405B | Meta | 87.8% | Open | 405B | 2025-09 |
| 11 | Mistral Large 3 | Mistral | 87.1% | Open | 123B | 2025-10 |
| 12 | MiniMax M2.5 | MiniMax | 86.5% | Open | Unknown | 2026-01 |
| 13 | Kimi K2.5 | Moonshot AI | 86.0% | API | Unknown | 2025-12 |
| 14 | Qwen 3 14B | Alibaba | 84.3% | Open | 14B | 2025-11 |
| 15 | Phi-4 14B | Microsoft | 83.9% | Open | 14B | 2025-08 |
Key Insights
Improvement since launch
From 43.9% (GPT-3, 2020) to 92.4% (GPT-5.2, 2026). Models now surpass average human expert performance (89.8%).
Benchmark ceiling approaching
Top models score 90%+, nearing the ceiling. MMLU-Pro (10-choice, harder) is the recommended successor for differentiating frontier models.
Qwen 3 72B at 88.7%
Open-weight models are within 4 percentage points of the best proprietary systems, with Qwen 3, DeepSeek V3.5, and Llama 4 leading the pack.
MMLU Variants
MMLU (Active, 14,042 questions)
Original 4-choice MCQ across 57 subjects. Standard 5-shot evaluation.
MMLU-Pro (Recommended, 12,000 questions)
10-choice MCQ, harder questions, chain-of-thought. Better discriminates frontier models.
MMLU-Redux (Validation, 3,000 questions)
Error-corrected subset. Fixes annotation noise and ambiguous questions.
Related Benchmarks
| Benchmark | Focus | Questions | Saturated? |
|---|---|---|---|
| MMLU | Broad knowledge (57 subjects) | 14,042 | Approaching |
| MMLU-Pro | Harder knowledge (10-choice) | 12,000 | No |
| ARC-Challenge | Science reasoning (grade school) | 2,590 | Yes (99%+) |
| HellaSwag | Commonsense completion | 10,042 | Yes (95%+) |
| WinoGrande | Coreference resolution | 1,767 | Approaching |
| GPQA | Graduate-level QA | 448 | No |
Access the Benchmark
MMLU is fully open-source. Evaluate any model locally.
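Once you have model predictions, scoring is simple exact-match against the gold answer index. Below is a minimal local scorer sketch; the record layout (a `subject` string and an `answer` index 0-3, matching the common Hugging Face distribution of MMLU) and the function name `score` are illustrative assumptions, not part of any official harness.

```python
# Minimal MMLU scorer sketch: compare predicted answer letters against gold
# answer indices (0-3) and report per-subject accuracy.
from collections import defaultdict

CHOICE_LETTERS = "ABCD"

def score(records: list[dict], predictions: list[str]) -> dict[str, float]:
    """records: [{'subject': str, 'answer': int}]; predictions: letters like 'B'."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["subject"]] += 1
        # Take the first character so "B." or "b) ..." still counts as "B".
        if pred.strip().upper()[:1] == CHOICE_LETTERS[rec["answer"]]:
            correct[rec["subject"]] += 1
    return {subj: correct[subj] / total[subj] for subj in total}
```

Averaging the per-subject accuracies (rather than pooling all questions) is the convention most papers report, since subjects vary widely in size.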