ICLR 2021 · UC Berkeley · Gold-Standard Benchmark

MMLU

Massive Multitask Language Understanding — the definitive benchmark for measuring broad AI knowledge across 57 academic subjects from STEM to humanities. 14,042 multiple-choice questions.

- 92.4% · Current SOTA (5-shot)
- 14,042 · Total Questions (57 subjects)
- 57 · Subject Areas (4 categories)
- 25% · Random Baseline (4-choice MCQ)
- 89.8% · Human Expert (specialist average)

What is MMLU?

MMLU (Massive Multitask Language Understanding) tests a model's knowledge and reasoning across 57 academic subjects, from abstract algebra to world religions. Each question is multiple-choice with four options.

Created by Dan Hendrycks et al. at UC Berkeley, MMLU has become the most widely reported benchmark for comparing large language models. It covers professional-level questions in medicine, law, engineering, and more — making it a proxy for “how much does this model know?”

The standard evaluation uses 5-shot prompting: five worked examples precede each test question. Since 2024, MMLU-Pro has offered a harder variant with 10-choice questions and chain-of-thought reasoning.
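To make the 5-shot setup concrete, here is a minimal Python sketch of the prompt format popularized by the original Hendrycks et al. evaluation code. The helper names are ours, and the `question`/`choices`/`answer` field names assume the public Hugging Face release of the dataset; treat them as assumptions and check the dataset card before relying on them.

```python
def format_example(item: dict, include_answer: bool = True) -> str:
    """Render one MMLU item: question, lettered choices, then 'Answer: X'."""
    letters = "ABCD"
    text = item["question"] + "\n"
    for letter, choice in zip(letters, item["choices"]):
        text += f"{letter}. {choice}\n"
    text += "Answer:"
    if include_answer:
        text += f" {letters[item['answer']]}\n\n"  # 'answer' is an index 0-3
    return text

def build_five_shot_prompt(dev_examples: list[dict], test_item: dict, subject: str) -> str:
    """Five worked dev-set examples, then the test question ending at 'Answer:'."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    shots = "".join(format_example(ex) for ex in dev_examples[:5])
    return header + shots + format_example(test_item, include_answer=False)
```

The model's prediction is then read off after the trailing "Answer:", typically by comparing the next-token likelihoods of " A" through " D".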

Subject Categories

- STEM · 18 subjects: Physics, Chemistry, Math, CS, Engineering, Biology
- Humanities · 13 subjects: History, Philosophy, Law, Literature
- Social Sciences · 12 subjects: Psychology, Economics, Political Science, Sociology
- Other · 14 subjects: Professional Medicine, Accounting, Clinical Knowledge

SOTA Progress: 43.9% → 92.4%

MMLU 5-shot accuracy over time. From barely above random to superhuman in 5 years.

| Date | Score | Model | Milestone |
|---|---|---|---|
| 2020-09 | 43.9% | GPT-3 175B | MMLU launch (Hendrycks et al.) |
| 2022-03 | 67.5% | Chinchilla 70B | |
| 2023-03 | 86.4% | GPT-4 | First model above 85% |
| 2023-12 | 90.0% | Gemini Ultra | First model above 90% |
| 2024-06 | 88.7% | Claude 3.5 Sonnet | |
| 2024-09 | 88.7% | GPT-4o | |
| 2025-03 | 89.2% | Claude Opus 4 | |
| 2025-09 | 90.8% | GPT-5 | New SOTA |
| 2026-02 | 92.4% | GPT-5.2 | Breaking 92% |

Leaderboard — MMLU 5-shot

Top models by accuracy. Updated March 2026.

| # | Model | Org | Score | Type | Params | Date |
|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 92.4% | API | Unknown | 2026-02 |
| 2 | Claude Opus 4.5 | Anthropic | 91.8% | API | Unknown | 2026-01 |
| 3 | Gemini 3 Pro | Google | 91.4% | API | Unknown | 2026-01 |
| 4 | Claude Opus 4.6 | Anthropic | 91.2% | API | Unknown | 2026-03 |
| 5 | GPT-5 | OpenAI | 90.8% | API | Unknown | 2025-09 |
| 6 | Claude Sonnet 4.5 | Anthropic | 90.4% | API | Unknown | 2025-12 |
| 7 | Gemini 3 Flash | Google | 89.6% | API | Unknown | 2026-01 |
| 8 | Qwen 3 72B | Alibaba | 88.7% | Open | 72B | 2025-11 |
| 9 | DeepSeek V3.5 | DeepSeek | 88.2% | Open | 685B MoE | 2025-10 |
| 10 | Llama 4 405B | Meta | 87.8% | Open | 405B | 2025-09 |
| 11 | Mistral Large 3 | Mistral | 87.1% | Open | 123B | 2025-10 |
| 12 | MiniMax M2.5 | MiniMax | 86.5% | Open | Unknown | 2026-01 |
| 13 | Kimi K2.5 | Moonshot AI | 86.0% | API | Unknown | 2025-12 |
| 14 | Qwen 3 14B | Alibaba | 84.3% | Open | 14B | 2025-11 |
| 15 | Phi-4 14B | Microsoft | 83.9% | Open | 14B | 2025-08 |

Key Insights

2.1× · Improvement since launch
From 43.9% (GPT-3, 2020) to 92.4% (GPT-5.2, 2026). Models now surpass average human expert performance (89.8%).

Saturation? · Benchmark ceiling approaching
Top models score 90%+, nearing the ceiling. MMLU-Pro (10-choice, harder) is the recommended successor for differentiating frontier models.

Open catching up · Qwen 3 72B at 88.7%
Open-weight models are within four percentage points of the best proprietary systems, with Qwen 3, DeepSeek V3.5, and Llama 4 leading the pack.

MMLU Variants

MMLU · Active · 14,042 questions
Original 4-choice MCQ across 57 subjects. Standard 5-shot evaluation.

MMLU-Pro · Recommended · 12,000 questions
10-choice MCQ, harder questions, chain-of-thought. Better discriminates frontier models.

MMLU-Redux · Validation · 3,000 questions
Error-corrected subset. Fixes annotation noise and ambiguous questions.

Key Papers

Measuring Massive Multitask Language Understanding
Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt · ICLR 2021 · 4,200 citations

MMLU-Redux: Evaluating Data Quality in MMLU
Gema et al. · arXiv 2024 · 95 citations

Related Benchmarks

| Benchmark | Focus | Questions | Saturated? |
|---|---|---|---|
| MMLU | Broad knowledge (57 subjects) | 14,042 | Approaching |
| MMLU-Pro | Harder knowledge (10-choice) | 12,000 | No |
| ARC-Challenge | Science reasoning (grade school) | 2,590 | Yes (99%+) |
| HellaSwag | Commonsense completion | 10,042 | Yes (95%+) |
| WinoGrande | Coreference resolution | 1,767 | Approaching |
| GPQA | Graduate-level QA | 448 | No |

Access the Benchmark

MMLU is fully open-source. Evaluate any model locally.
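As a starting point, the sketch below wires the prompt builder from earlier into a complete local evaluation loop. It assumes the `cais/mmlu` mirror on the Hugging Face Hub (with `dev` and `test` splits and `question`/`subject`/`choices`/`answer` fields), and `predict` is a hypothetical stand-in for whatever model you want to score; both are assumptions to verify against the dataset card, not a fixed API.

```python
from collections import defaultdict
from datasets import load_dataset  # pip install datasets

# "all" concatenates the 57 subject configs; the small dev split
# supplies the five in-context examples for each subject.
mmlu = load_dataset("cais/mmlu", "all")
dev_by_subject: dict[str, list[dict]] = defaultdict(list)
for example in mmlu["dev"]:
    dev_by_subject[example["subject"]].append(example)

def predict(prompt: str) -> str:
    """Hypothetical model call: should return one of 'A', 'B', 'C', 'D'."""
    raise NotImplementedError("plug in your model here")

correct = 0
test = mmlu["test"]
for item in test:
    prompt = build_five_shot_prompt(
        dev_by_subject[item["subject"]], item, item["subject"]
    )
    if predict(prompt) == "ABCD"[item["answer"]]:
        correct += 1

print(f"MMLU 5-shot accuracy: {correct / len(test):.1%}")
```

A full pass over all 14,042 test questions is the apples-to-apples comparison with the 5-shot leaderboard above.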

Track every AI benchmark in one place

CodeSOTA tracks state-of-the-art results across 200+ benchmarks in reasoning, NLP, computer vision, code, and more.