Who leads the MMLU benchmark?

o3 currently leads MMLU with a score of 92.90 on accuracy.

What is the state-of-the-art score on MMLU?

The state-of-the-art result on MMLU is 92.90 (accuracy), achieved by o3 as of 2026.

How many models are tracked on MMLU?

Codesota tracks 63 models on MMLU.

When was the MMLU leaderboard last updated?

The MMLU leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2023.

ICLR 2021UC BerkeleyGold-Standard Benchmark

MMLU

Name: Massive Multitask Language Understanding Benchmark Results
Creator: Codesota
Published: 2023-01-01
License: https://creativecommons.org/licenses/by/4.0/

Massive Multitask Language Understanding — the definitive benchmark for measuring broad AI knowledge across 57 academic subjects from STEM to humanities. 14,042 multiple-choice questions.

92.4%

Current SOTA

5-shot

14,042

Total Questions

57 subjects

Subject Areas

4 categories

25%

Random Baseline

4-choice MCQ

89.8%

Human Expert

Specialist avg

What is MMLU?

MMLU (Massive Multitask Language Understanding) tests a model's knowledge and reasoning across 57 academic subjects, from abstract algebra to world religions. Each question is multiple-choice with four options.

Created by Dan Hendrycks et al. at UC Berkeley, MMLU has become the most widely reported benchmark for comparing large language models. It covers professional-level questions in medicine, law, engineering, and more — making it a proxy for “how much does this model know?”

The standard evaluation uses 5-shot prompting (5 examples before the question). Since 2024, MMLU-Pro offers a harder variant with 10-choice questions and chain-of-thought reasoning.

Subject Categories

STEM18 subjects

Physics, Chemistry, Math, CS, Engineering, Biology

Humanities13 subjects

History, Philosophy, Law, Literature

Social Sciences12 subjects

Psychology, Economics, Political Science, Sociology

Other14 subjects

Professional Medicine, Accounting, Clinical Knowledge

SOTA Progress: 43.9% → 92.4%

MMLU 5-shot accuracy over time. From barely above random to superhuman in 5 years.

2020-09

43.9%

GPT-3 175B← MMLU launch (Hendrycks et al.)

2022-03

67.5%

Chinchilla 70B

2023-03

86.4%

GPT-4← First model above 85%

2023-12

90%

Gemini Ultra← First model above 90%

2024-06

88.7%

Claude 3.5 Sonnet

2024-09

88.7%

GPT-4o

2025-03

89.2%

Claude Opus 4

2025-09

90.8%

GPT-5← Breaking 91%

2026-02

92.4%

GPT-5.2← Breaking 92%

Leaderboard — MMLU 5-shot

Top models by accuracy. Updated March 2026.

#	Model	Score	Type	Params	Date
1	GPT-5.2OpenAI	92.4%	API	Unknown	2026-02
2	Claude Opus 4.5Anthropic	91.8%	API	Unknown	2026-01
3	Gemini 3 ProGoogle	91.4%	API	Unknown	2026-01
4	Claude Opus 4.6Anthropic	91.2%	API	Unknown	2026-03
5	GPT-5OpenAI	90.8%	API	Unknown	2025-09
6	Claude Sonnet 4.5Anthropic	90.4%	API	Unknown	2025-12
7	Gemini 3 FlashGoogle	89.6%	API	Unknown	2026-01
8	Qwen 3 72BAlibaba	88.7%	Open	72B	2025-11
9	DeepSeek V3.5DeepSeek	88.2%	Open	685B MoE	2025-10
10	Llama 4 405BMeta	87.8%	Open	405B	2025-09
11	Mistral Large 3Mistral	87.1%	Open	123B	2025-10
12	MiniMax M2.5MiniMax	86.5%	Open	Unknown	2026-01
13	Kimi K2.5Moonshot AI	86%	API	Unknown	2025-12
14	Qwen 3 14BAlibaba	84.3%	Open	14B	2025-11
15	Phi-4 14BMicrosoft	83.9%	Open	14B	2025-08

Key Insights

2.1×

Improvement since launch

From 43.9% (GPT-3, 2020) to 92.4% (GPT-5.2, 2026). Models now surpass average human expert performance (89.8%).

Saturation?

Benchmark ceiling approaching

Top models score 90%+, nearing the ceiling. MMLU-Pro (10-choice, harder) is the recommended successor for differentiating frontier models.

Open catching up

Qwen 3 72B at 88.7%

Open-weight models are within 4% of the best proprietary systems, with Qwen 3, DeepSeek V3.5, and Llama 4 leading the pack.

MMLU Variants

MMLU

Active

14,042

Original 4-choice MCQ across 57 subjects. Standard 5-shot evaluation.

MMLU-Pro

Recommended

12,000

10-choice MCQ, harder questions, chain-of-thought. Better discriminates frontier models.

MMLU-Redux

Validation

3,000

Error-corrected subset. Fixes annotation noise and ambiguous questions.

Key Papers

Measuring Massive Multitask Language Understanding

Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt·ICLR 2021·4200 citations

Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Wang, Ma, Feng et al.·NeurIPS 2024·280 citations

Paper

MMLU-Redux: Evaluating Data Quality in MMLU

Gema et al.·arXiv 2024·95 citations

Paper

Related Benchmarks

Benchmark	Focus	Questions	Saturated?
MMLU	Broad knowledge (57 subjects)	14,042	Approaching
MMLU-Pro	Harder knowledge (10-choice)	12,000	No
ARC-Challenge	Science reasoning (grade school)	2,590	Yes (99%+)
HellaSwag	Commonsense completion	10,042	Yes (95%+)
WinoGrande	Coreference resolution	1,767	Approaching
GPQA	Graduate-level QA	448	No

Access the Benchmark

MMLU is fully open-source. Evaluate any model locally.

GitHub Repository Read the Paper HuggingFace Dataset

Track every AI benchmark in one place

CodeSOTA tracks state-of-the-art results across 200+ benchmarks in reasoning, NLP, computer vision, code, and more.

Browse Reasoning Benchmarks Explore All Areas