MMLU and MMLU-Pro
Broad-knowledge benchmark results from the CodeSOTA database. MMLU is the original 57-subject standard; MMLU-Pro is the harder 10-choice successor used to separate current frontier models.
MMLU SOTA: 92.9% (o3)
MMLU-Pro SOTA: 91.0% (Gemini 3.1 Pro)
MMLU Rows: 20 (deduped models)
MMLU-Pro Rows: 20 (deduped models)
Registry Update: Apr 2026 (latest source access)
Benchmark Records
MMLU
Massive Multitask Language Understanding
Broad multi-task language-understanding benchmark with 57 subjects spanning STEM, humanities, social sciences, and professional knowledge. Original 4-choice MCQ format; now saturated enough that top-frontier deltas should be read as a cluster rather than a strict ranking.
MMLU-Pro
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Harder successor to MMLU with roughly 12k 10-choice questions, stronger distractors, and more reasoning-heavy items. Use it as the preferred MMLU-family benchmark for current frontier LLMs.
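Because the two benchmarks differ in the number of answer choices, their random-guess floors differ too: 25% for 4-choice MMLU and 10% for 10-choice MMLU-Pro. The sketch below shows one common way to rescale accuracy so that guessing maps to 0 and a perfect score to 1. The function name `chance_corrected` is illustrative and is not something the registry itself reports.

```python
def chance_corrected(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    `accuracy` is a fraction in [0, 1]; `n_choices` is the number of answer
    options per question (4 for MMLU, 10 for MMLU-Pro).
    """
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


# Illustrative inputs: a 4-choice score and a 10-choice score.
mmlu_adjusted = chance_corrected(0.929, n_choices=4)
mmlu_pro_adjusted = chance_corrected(0.910, n_choices=10)
```

This is only a rough normalization. It does not account for the harder distractors or the more reasoning-heavy items in MMLU-Pro, so the adjusted numbers are still not directly comparable across the two benchmarks.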
MMLU Leaderboard
Best available accuracy row per model from the registry. MMLU is largely saturated, so top ranks should be read as a cluster.
| # | Model | Provider · Details | Accuracy | Source | Date |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI · api | 92.9% | openai-simple-evals | Apr 2025 |
| 2 | GPT-5.2 | OpenAI · api | 92.4% | codesota-shadow-mmlu | Feb 2026 |
| 3 | Claude Opus 4.5 | Anthropic · api | 91.8% | codesota-shadow-mmlu | Jan 2026 |
| 4 | o1 | OpenAI · api | 91.8% | openai-simple-evals | Dec 2024 |
| 5 | Claude Opus 4.5 | Anthropic · Undisclosed · api | 91.6% | anthropic-model-card | Nov 2025 |
| 6 | Gemini 3 Pro | Google · Undisclosed · api | 91.4% | codesota-shadow-mmlu | Jan 2026 |
| 7 | Claude Opus 4.6 | Anthropic · api | 91.2% | codesota-shadow-mmlu | Mar 2026 |
| 8 | DeepSeek R1 | DeepSeek · 671B MoE · open-source | 90.8% | arxiv | Jan 2025 |
| 9 | GPT-4.5 Preview | OpenAI · api | 90.8% | openai-simple-evals | Feb 2025 |
| 10 | GPT-5 | OpenAI · api | 90.8% | codesota-shadow-mmlu | Sep 2025 |
| 11 | o1-preview | OpenAI · Undisclosed · api | 90.8% | openai-simple-evals | Sep 2024 |
| 12 | Claude Sonnet 4.5 | Anthropic · api | 90.4% | codesota-shadow-mmlu | Dec 2025 |
| 13 | GPT-4.1 | OpenAI · api | 90.2% | openai-simple-evals | Apr 2025 |
| 14 | Claude Sonnet 4 | Anthropic · api | 90.1% | anthropic-model-card | Mar 2026 |
| 15 | o4-mini | OpenAI · api | 90.0% | openai-simple-evals | Apr 2025 |
| 16 | Gemini 2.5 Pro | Google · api | 89.8% | google-technical-report | Jun 2025 |
| 17 | Gemini 3 Flash | Google · Undisclosed · api | 89.6% | codesota-shadow-mmlu | Jan 2026 |
| 18 | Llama-4-Maverick | Meta · 400B total / 17B active (128 experts) · open-source | 89.4% | meta-blog | Mar 2026 |
| 19 | Claude Opus 4 | Anthropic · Undisclosed · api | 88.8% | anthropic-announcement | Apr 2026 |
| 20 | Qwen 3 72B | Alibaba · 72B · open-source | 88.7% | codesota-shadow-mmlu | Nov 2025 |
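The "best available accuracy row per model" rule can be sketched as a simple reduction over result rows. This is a minimal illustration and not the registry's actual code: the `ResultRow` type, the `model_id` key, and the sample rows (hypothetical models `model-a` and `model-b` with made-up scores) are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRow:
    model_id: str    # hypothetical registry key identifying a model
    accuracy: float  # percent, e.g. 92.0
    source: str
    date: str


def best_row_per_model(rows):
    """Keep the highest-accuracy row for each model_id, ranked descending."""
    best = {}
    for row in rows:
        current = best.get(row.model_id)
        if current is None or row.accuracy > current.accuracy:
            best[row.model_id] = row
    # Sort by accuracy, highest first; ties fall back to model_id for a stable order.
    return sorted(best.values(), key=lambda r: (-r.accuracy, r.model_id))


# Made-up rows for illustration only; these are not registry data.
rows = [
    ResultRow("model-a", 92.0, "source-x", "Jan 2026"),
    ResultRow("model-a", 90.5, "source-y", "Feb 2026"),
    ResultRow("model-b", 91.0, "source-x", "Jan 2026"),
]
leaderboard = best_row_per_model(rows)
```

Keying on a model identifier rather than a display name means two records that share a name but differ in metadata would remain separate rows under this sketch.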
MMLU-Pro Leaderboard
The preferred MMLU-family successor for frontier models: 10 choices, harder distractors, and more reasoning-heavy items.
| # | Model | Provider · Details | Accuracy | Source | Date |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google · api | 91.0% | pricepertoken | Apr 2026 |
| 2 | Gemini 3 Pro | Google · Undisclosed · api | 89.8% | pricepertoken | Apr 2026 |
| 3 | Claude Opus 4.5 | Anthropic · Undisclosed · api | 89.5% | pricepertoken | Apr 2026 |
| 4 | Gemini 3 Flash | Google · Undisclosed · api | 89.0% | pricepertoken | Apr 2026 |
| 5 | Qwen3.6 Plus | Alibaba Cloud | 88.5% | llm-stats | Apr 2026 |
| 6 | Claude Opus 4.1 | Anthropic | 88.0% | pricepertoken | Apr 2026 |
| 7 | MiniMax M2.1 | MiniMax · api | 88.0% | pricepertoken | Apr 2026 |
| 8 | Qwen3.5-397B-A17B | Alibaba Cloud | 87.8% | llm-stats | Apr 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic · Undisclosed · api | 87.5% | pricepertoken | Apr 2026 |
| 10 | GPT-5.2 | OpenAI · Undisclosed · api | 87.4% | pricepertoken | Apr 2026 |
| 11 | GPT-5 | OpenAI · api | 87.1% | pricepertoken | Apr 2026 |
| 12 | Kimi K2.5 | Moonshot AI · Undisclosed · api | 87.1% | llm-stats | Apr 2026 |
| 13 | GPT-5.1 | OpenAI · api | 87.0% | pricepertoken | Apr 2026 |
| 14 | Grok 4 | xAI · api | 86.6% | pricepertoken | Apr 2026 |
| 15 | DeepSeek V3.2 | DeepSeek · api | 86.2% | pricepertoken | Apr 2026 |
| 16 | Claude 3.7 Sonnet | Anthropic | 85.1% | anthropic-announcement | Apr 2026 |
| 17 | DeepSeek-R1-0528 | DeepSeek · open-source | 85.0% | llm-stats | Apr 2026 |
| 18 | GLM-4.5 | Zhipu AI | 84.6% | llm-stats | Apr 2026 |
| 19 | Kimi K2-Thinking-0905 | Moonshot AI · open-source | 84.6% | llm-stats | Apr 2026 |
| 20 | GPT-4o | OpenAI · Undisclosed · api | 72.6% | artificial-analysis | Apr 2026 |
Why This Page Changed
MMLU
14k · Active, saturated
Original 4-choice benchmark across 57 subjects. Still useful as a broad-knowledge receipt, but frontier models now cluster tightly.
MMLU-Pro
12k · Preferred successor
10-choice successor with harder distractors and more reasoning-heavy items. Better current signal for frontier broad knowledge.
HLE / GPQA
harder · Next frontier
For top-model separation, read MMLU-family scores alongside GPQA Diamond and Humanity’s Last Exam.
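One way to see why tightly clustered MMLU scores should not be read as a strict ranking is to estimate the sampling error of an accuracy figure. The sketch below uses the binomial standard error and treats two scores as independent, which is a simplification since both models answer the same questions. The function names are illustrative, and the question count of 14,000 is taken from the approximate "14k" figure on this page.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate (both as fractions)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def is_separable(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """True if the gap between two accuracies exceeds z standard errors.

    Assumes independent samples of n_questions items for each model,
    which is a simplifying assumption for this sketch.
    """
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n_questions) ** 2
        + accuracy_standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


# A 0.5-point gap near the top of the MMLU range, on roughly 14k questions.
close_pair = is_separable(0.929, 0.924, 14_000)  # False: within sampling noise
# A 4.2-point gap over the same number of questions.
wide_pair = is_separable(0.929, 0.887, 14_000)   # True
```

Sampling error is only a lower bound on the uncertainty here. Rows on this page come from different sources and evaluation setups, which can shift scores by more than the binomial noise alone.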
MMLU Subject Coverage
STEM: Physics, Chemistry, Math, CS, Engineering, Biology
Humanities: History, Philosophy, Law, Literature
Social Sciences: Psychology, Economics, Political Science, Sociology
Professional: Medicine, Law, Accounting, Clinical Knowledge
Use The Registry, Not The Old Static Snapshot
This page now reads from benchmark_results, datasets, and model metadata. The static March snapshot is no longer the source of truth.