DB-backed registry · MMLU family · Saturation risk

MMLU and MMLU-Pro

Broad-knowledge benchmark results from the CodeSOTA database. MMLU is the original 57-subject standard; MMLU-Pro is the harder 10-choice successor used to separate current frontier models.

MMLU SOTA: 92.9% (o3)
MMLU-Pro SOTA: 91.0% (Gemini 3.1 Pro)
MMLU Rows: 20 (deduped models)
MMLU-Pro Rows: 20 (deduped models)
Registry Update: Apr 2026 (latest source access)

Benchmark Records

MMLU

Massive Multitask Language Understanding

trust B

Broad multi-task language-understanding benchmark with 57 subjects spanning STEM, humanities, social sciences, and professional knowledge. It uses the original 4-choice multiple-choice format. Scores are now saturated enough that differences among top frontier models should be read as a cluster rather than a strict ranking.

accuracy · higher is better · 14,042 items · 2021
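One way to see why small gaps near the top carry little information is to estimate the sampling error of a single accuracy score. The sketch below assumes a model was scored on all 14,042 items and treats each item as an independent trial; both are simplifications, and `accuracy_margin` is a hypothetical helper, not part of the registry.

```python
import math


def accuracy_margin(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error using the normal approximation
    to the binomial (assumes independent, equally weighted items)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)


N_ITEMS = 14_042    # MMLU item count listed in the benchmark record
top = 0.929         # o3, rank 1 on the MMLU leaderboard
runner_up = 0.924   # GPT-5.2, rank 2 on the MMLU leaderboard

margin = accuracy_margin(top, N_ITEMS)
gap = top - runner_up
print(f"margin: +/-{margin * 100:.2f} pp, gap: {gap * 100:.2f} pp")
```

Under these assumptions the margin is roughly 0.4 percentage points per score, about the same size as the gap between the top two rows.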

MMLU-Pro

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

trust B

Harder successor to MMLU with roughly 12k 10-choice questions, stronger distractors, and more reasoning-heavy items. Use it as the preferred MMLU-family benchmark for current frontier LLMs.

accuracy · higher is better · 12,032 items · 2024
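Moving from 4 to 10 choices lowers the random-guess baseline from 25% to 10%, so raw accuracies on the two benchmarks are not on the same scale. The following sketch rescales a score so that guessing maps to 0 and a perfect score to 1. It treats every MMLU-Pro item as having exactly 10 options, which may not hold for every question, and `chance_corrected` is a hypothetical helper used only for illustration.

```python
def chance_corrected(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so random guessing is 0.0 and a perfect score is 1.0."""
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


# SOTA values shown on this page.
mmlu_sota = chance_corrected(0.929, n_choices=4)       # 4-choice MMLU
mmlu_pro_sota = chance_corrected(0.910, n_choices=10)  # 10-choice MMLU-Pro

print(f"MMLU:     {mmlu_sota:.3f}")
print(f"MMLU-Pro: {mmlu_pro_sota:.3f}")
```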

MMLU Leaderboard

Best available accuracy row per model from the registry. MMLU is largely saturated, so top ranks should be read as a cluster.

| # | Model | Provider · details | Accuracy | Source | Date |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI · api | 92.9% | openai-simple-evals | Apr 2025 |
| 2 | GPT-5.2 | OpenAI · api | 92.4% | codesota-shadow-mmlu | Feb 2026 |
| 3 | Claude Opus 4.5 | Anthropic · api | 91.8% | codesota-shadow-mmlu | Jan 2026 |
| 4 | o1 | OpenAI · api | 91.8% | openai-simple-evals | Dec 2024 |
| 5 | Claude Opus 4.5 | Anthropic · Undisclosed · api | 91.6% | anthropic-model-card | Nov 2025 |
| 6 | Gemini 3 Pro | Google · Undisclosed · api | 91.4% | codesota-shadow-mmlu | Jan 2026 |
| 7 | Claude Opus 4.6 | Anthropic · api | 91.2% | codesota-shadow-mmlu | Mar 2026 |
| 8 | DeepSeek R1 | DeepSeek · 671B MoE · open-source | 90.8% | arxiv | Jan 2025 |
| 9 | GPT-4.5 Preview | OpenAI · api | 90.8% | openai-simple-evals | Feb 2025 |
| 10 | GPT-5 | OpenAI · api | 90.8% | codesota-shadow-mmlu | Sep 2025 |
| 11 | o1-preview | OpenAI · Undisclosed · api | 90.8% | openai-simple-evals | Sep 2024 |
| 12 | Claude Sonnet 4.5 | Anthropic · api | 90.4% | codesota-shadow-mmlu | Dec 2025 |
| 13 | GPT-4.1 | OpenAI · api | 90.2% | openai-simple-evals | Apr 2025 |
| 14 | Claude Sonnet 4 | Anthropic · api | 90.1% | anthropic-model-card | Mar 2026 |
| 15 | o4-mini | OpenAI · api | 90% | openai-simple-evals | Apr 2025 |
| 16 | Gemini 2.5 Pro | Google · api | 89.8% | google-technical-report | Jun 2025 |
| 17 | Gemini 3 Flash | Google · Undisclosed · api | 89.6% | codesota-shadow-mmlu | Jan 2026 |
| 18 | Llama-4-Maverick | Meta · 400B total / 17B active (128 experts) · open-source | 89.4% | meta-blog | Mar 2026 |
| 19 | Claude Opus 4 | Anthropic · Undisclosed · api | 88.8% | anthropic-announcement | Apr 2026 |
| 20 | Qwen 3 72B | Alibaba · 72B · open-source | 88.7% | codesota-shadow-mmlu | Nov 2025 |
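Reading the top of the leaderboard as a cluster can be made concrete by grouping rows whose scores sit within a fixed margin of the best score in their group. This is a minimal sketch, not the registry's method: `cluster_by_margin` is a hypothetical helper and the 0.6-point margin is an arbitrary illustrative threshold.

```python
def cluster_by_margin(rows, margin_pp):
    """Group (model, accuracy_percent) rows into clusters.

    Rows are sorted by accuracy, highest first. A row joins the current
    cluster when it is within margin_pp of that cluster's top score;
    otherwise it starts a new cluster.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    clusters = []
    for model, score in ordered:
        if clusters and clusters[-1][0][1] - score <= margin_pp:
            clusters[-1].append((model, score))
        else:
            clusters.append([(model, score)])
    return clusters


# A few rows taken from the MMLU leaderboard above.
rows = [
    ("o3", 92.9),
    ("GPT-5.2", 92.4),
    ("Claude Opus 4.5", 91.8),
    ("o1", 91.8),
    ("Gemini 3 Pro", 91.4),
    ("DeepSeek R1", 90.8),
]

clusters = cluster_by_margin(rows, margin_pp=0.6)  # threshold is an assumption
for index, cluster in enumerate(clusters, start=1):
    print(index, [model for model, _ in cluster])
```

Anchoring each cluster to its top score keeps a long chain of small gaps from merging the whole table into one group.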

MMLU-Pro Leaderboard

The preferred MMLU-family successor for frontier models: 10 choices, harder distractors, and more reasoning-heavy items.

| # | Model | Provider · details | Accuracy | Source | Date |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google · api | 91.0% | pricepertoken | Apr 2026 |
| 2 | Gemini 3 Pro | Google · Undisclosed · api | 89.8% | pricepertoken | Apr 2026 |
| 3 | Claude Opus 4.5 | Anthropic · Undisclosed · api | 89.5% | pricepertoken | Apr 2026 |
| 4 | Gemini 3 Flash | Google · Undisclosed · api | 89% | pricepertoken | Apr 2026 |
| 5 | Qwen3.6 Plus | Alibaba Cloud | 88.5% | llm-stats | Apr 2026 |
| 6 | Claude Opus 4.1 | Anthropic | 88% | pricepertoken | Apr 2026 |
| 7 | MiniMax M2.1 | MiniMax · api | 88% | pricepertoken | Apr 2026 |
| 8 | Qwen3.5-397B-A17B | Alibaba Cloud | 87.8% | llm-stats | Apr 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic · Undisclosed · api | 87.5% | pricepertoken | Apr 2026 |
| 10 | GPT-5.2 | OpenAI · Undisclosed · api | 87.4% | pricepertoken | Apr 2026 |
| 11 | GPT-5 | OpenAI · api | 87.1% | pricepertoken | Apr 2026 |
| 12 | Kimi K2.5 | Moonshot AI · Undisclosed · api | 87.1% | llm-stats | Apr 2026 |
| 13 | GPT-5.1 | OpenAI · api | 87% | pricepertoken | Apr 2026 |
| 14 | Grok 4 | xAI · api | 86.6% | pricepertoken | Apr 2026 |
| 15 | DeepSeek V3.2 | DeepSeek · api | 86.2% | pricepertoken | Apr 2026 |
| 16 | Claude 3.7 Sonnet | Anthropic | 85.1% | anthropic-announcement | Apr 2026 |
| 17 | DeepSeek-R1-0528 | DeepSeek · open-source | 85% | llm-stats | Apr 2026 |
| 18 | GLM-4.5 | Zhipu AI | 84.6% | llm-stats | Apr 2026 |
| 19 | Kimi K2-Thinking-0905 | Moonshot AI · open-source | 84.6% | llm-stats | Apr 2026 |
| 20 | GPT-4o | OpenAI · Undisclosed · api | 72.6% | artificial-analysis | Apr 2026 |

Why This Page Changed

MMLU Subject Coverage

STEM (18 subjects): Physics, Chemistry, Math, CS, Engineering, Biology

Humanities (13 subjects): History, Philosophy, Law, Literature

Social Sciences (12 subjects): Psychology, Economics, Political Science, Sociology

Professional (14 subjects): Medicine, Law, Accounting, Clinical Knowledge

Use The Registry, Not The Old Static Snapshot

This page now reads from benchmark_results, datasets, and model metadata. The static March snapshot is no longer the source of truth.
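As a rough illustration of what reading from the registry could look like, the sketch below builds a tiny in-memory SQLite database and selects the best accuracy row per model for one dataset. Only the table names `benchmark_results` and `datasets` come from this page; the `models` table, every column name, and the `best_rows` helper are assumptions. The two Claude Opus 4.5 MMLU rows from the leaderboard are treated as one model here purely to show the deduplication step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: column names are illustrative assumptions.
    CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE benchmark_results (
        model_id INTEGER REFERENCES models(id),
        dataset_id INTEGER REFERENCES datasets(id),
        accuracy REAL,
        source TEXT
    );
    INSERT INTO datasets VALUES (1, 'MMLU'), (2, 'MMLU-Pro');
    INSERT INTO models VALUES (1, 'o3'), (2, 'Claude Opus 4.5');
    INSERT INTO benchmark_results VALUES
        (1, 1, 92.9, 'openai-simple-evals'),
        (2, 1, 91.8, 'codesota-shadow-mmlu'),
        (2, 1, 91.6, 'anthropic-model-card'),
        (2, 2, 89.5, 'pricepertoken');
    """
)


def best_rows(connection, dataset_name):
    """Return (model, accuracy, source) for the best row per model.

    Relies on SQLite's documented behavior that bare columns in a query
    with a single MAX() aggregate come from the row holding the maximum.
    """
    query = """
        SELECT m.name, MAX(r.accuracy) AS accuracy, r.source
        FROM benchmark_results AS r
        JOIN models AS m ON m.id = r.model_id
        JOIN datasets AS d ON d.id = r.dataset_id
        WHERE d.name = ?
        GROUP BY m.id
        ORDER BY accuracy DESC
    """
    return connection.execute(query, (dataset_name,)).fetchall()


for row in best_rows(conn, "MMLU"):
    print(row)
```

The bare-column behavior is specific to SQLite; on another database the same selection would typically need a window function or a correlated subquery.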