Which Language Model Should You Use?
Compare GPT-5, Claude Opus 4.6, Llama 4, Gemini 2.5, and other 2026 LLMs across standard benchmarks, from language understanding (MMLU) to mathematical reasoning (GSM8K, MATH) to code generation (HumanEval, SWE-bench).
What Makes a Good Language Model?
LLMs are evaluated across multiple dimensions. No single benchmark tells the full story. Here's what we track:
General Knowledge
How well does the model understand the world? Tested via MMLU (57 academic subjects), ARC (science), and HellaSwag (common sense).
Reasoning Ability
Can the model think through complex problems? Measured via GSM8K (grade-school math), MATH (competition problems), and GPQA (expert reasoning).
Code Generation
Programming proficiency via HumanEval (function synthesis), MBPP (Python basics), and SWE-bench (real-world debugging).
Multimodal Understanding
Vision capabilities tested via MMMU (college-level reasoning), MathVista (visual math), and ChartQA (data interpretation).
Key Benchmarks
Language Understanding
General knowledge, reading comprehension, and language tasks
MMLU
SOTA: 92.3%. 57 subjects from STEM to humanities
HellaSwag
SOTA: 96.2%. Commonsense reasoning
ARC
SOTA: 97.1%. Grade-school science questions
TruthfulQA
SOTA: 91.0%. Factual accuracy and truthfulness
Reasoning & Math
Mathematical problem solving and logical reasoning
GSM8K
SOTA: 97.8%. Grade-school math word problems
MATH
SOTA: 96.4%. Competition mathematics
GPQA
SOTA: 79.8%. Graduate-level science questions
BBH
SOTA: 94.1%. Big-Bench Hard reasoning tasks
Code Generation
Programming ability and software engineering
HumanEval
SOTA: 94.1%. Python function synthesis
MBPP
SOTA: 91.2%. Basic Python programming
SWE-bench
SOTA: 80.9%. Real GitHub issue resolution
LiveCodeBench
SOTA: 62.3%. Recent coding problems
Multimodal
Vision, image understanding, and cross-modal tasks
MMMU
SOTA: 74.2%. College-level multimodal understanding
MathVista
SOTA: 72.8%. Visual mathematical reasoning
AI2D
SOTA: 95.1%. Diagram understanding
ChartQA
SOTA: 87.6%. Chart and graph comprehension
Model Families
The major LLM providers and their model series. Each family targets different use cases and price points.
GPT Series
OpenAI
GPT-4o, GPT-5, o1, o3, o4-mini
Claude
Anthropic
Claude Sonnet 4.6, Claude Opus 4.6, Claude Haiku 4.5
Llama
Meta
Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick
Gemini
Google
Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3 Flash
Mistral
Mistral AI
Mistral Large 3, Mistral Small 3.1, Codestral 25.01
DeepSeek
DeepSeek
DeepSeek V3, DeepSeek R1
Qwen
Alibaba
Qwen 2.5, Qwen 3, QwQ-32B
Grok
xAI
Grok 2, Grok 3, Grok 3 Mini
Understanding the Metrics
MMLU (Massive Multitask Language Understanding)
Multiple-choice questions across 57 subjects from elementary math to professional law. The most comprehensive test of general knowledge.
GSM8K (Grade School Math 8K)
8,500 grade-school level math word problems requiring multi-step reasoning. Tests basic mathematical reasoning ability.
HumanEval
164 Python programming problems. Model must generate a function that passes unit tests. The standard for measuring code generation ability.
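A HumanEval-style check executes the generated function against the task's unit tests; pass@1 is then the fraction of problems whose first sample passes. Here is a toy sketch of that loop; the candidate and test strings are invented examples, and real harnesses run this inside a sandbox.

```python
def passes(candidate_code: str, test_code: str) -> bool:
    """Exec the candidate and its unit tests in a fresh namespace.
    Real harnesses sandbox this step; never exec untrusted code directly."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function
        exec(test_code, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False

# Toy problem in the HumanEval shape: a generated completion plus its tests.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes(candidate, tests)]   # one sample per problem
pass_at_1 = sum(results) / len(results)
print(pass_at_1)  # 1.0
```

Reported scores often use pass@k with multiple samples per problem, which rewards models that get it right within k attempts rather than on the first try.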
GPQA (Graduate-Level Google-Proof Q&A)
Expert-written questions in biology, physics, and chemistry designed to be difficult even for domain experts. Tests deep reasoning.
Explore Related Benchmarks
Why Our LLM Benchmarks Are Different
Verified Results Only
No marketing claims. We cite published papers, official leaderboards, and third-party evaluations. Every number is traceable.
Consistent Methodology
Same evaluation protocol for all models. We note when vendors use different prompting strategies or few-shot examples.

Regular Updates
New models release weekly. We track the latest results as they're published and maintain historical trends.
Frequently Asked Questions
What's the difference between GPT-4 and GPT-4o?
GPT-4o is the "omni" version with native multimodal capabilities (vision, audio). It's faster and cheaper than GPT-4 while maintaining similar benchmark scores on text tasks. GPT-4o is now the default model at OpenAI.
Are open-source LLMs competitive with GPT-4?
In 2026, open-source models have largely closed the gap. Llama 4 Maverick, Qwen 3, and DeepSeek R1 achieve competitive performance across MMLU, HumanEval, and reasoning tasks. The trade-off is now primarily hosting complexity versus API convenience.
Why do benchmark scores vary across sources?
Different evaluation setups: 0-shot vs few-shot prompting, exact wording of system prompts, temperature settings, and whether chain-of-thought is used. We document these differences in our benchmark methodology notes.
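The 0-shot versus few-shot distinction alone can move scores by several points, because the prompt the model actually sees differs. A minimal sketch of how an MMLU-style prompt changes between the two setups (the layout and helper names are illustrative, not any vendor's exact template):

```python
def format_question(q: str, choices: list[str]) -> str:
    """Lay out one multiple-choice question with lettered options."""
    lines = [q] + [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    return "\n".join(lines) + "\nAnswer:"

def build_prompt(q: str, choices: list[str], shots: list = ()) -> str:
    """0-shot when shots is empty; few-shot when worked examples are prepended."""
    parts = [format_question(sq, sc) + f" {sa}\n" for sq, sc, sa in shots]
    parts.append(format_question(q, choices))
    return "\n".join(parts)

zero_shot = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
few_shot = build_prompt(
    "What is 2 + 2?", ["3", "4", "5", "6"],
    shots=[("What is 1 + 1?", ["1", "2", "3", "4"], "B")],
)
print(zero_shot.endswith("Answer:"))  # True: the model fills in the letter
```

Sampling temperature and whether chain-of-thought is requested before the final answer add further variance on top of the prompt format, which is why comparing numbers across papers requires checking the setup first.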
Which benchmark best predicts real-world performance?
It depends on your use case. For coding assistants, check SWE-bench. For tutoring and Q&A, MMLU matters most. For math-heavy analysis, look at GSM8K and MATH. No single metric captures everything, which is why we track multiple benchmarks.
Explore LLM Benchmark Data Tracked Across 50+ Models
Compare MMLU, GSM8K, HumanEval, SWE-bench, and more across every major model family, updated for 2026.