Three benchmarks are covered here: graduate-level knowledge (GPQA Diamond), broad multi-subject reasoning (MMLU-Pro), and extreme frontier difficulty (HLE). Together they help separate world knowledge from genuine scientific reasoning.
GPQA Diamond: 198 expert-authored, graduate-level questions in biology, chemistry, and physics, designed to be "Google-proof" (hard to answer even with web search). PhD specialists score about 65% in their own field. The skilled non-specialist human baseline is 34%.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Gemini 3 Pro | | 91.9% | Apr 2026 |
| 2 | Claude Opus 4.6 | Anthropic | 91.3% | Apr 2026 |
| 3 | Kimi K2.6 | | 90.5% | Apr 2026 |
| 4 | Gemini 3 Flash | | 90.4% | Apr 2026 |
| 5 | DeepSeek-V4-Pro Max | DeepSeek | 90.1% | Apr 2026 |
| 6 | Claude Sonnet 4.6 | Anthropic | 89.9% | Apr 2026 |
| 7 | GPT-5 | OpenAI | 89% | Apr 2026 |
| 8 | Qwen3.5-397B-A17B | Alibaba | 88.4% | Feb 2026 |
| 9 | DeepSeek-V4-Flash Max | DeepSeek | 88.1% | Apr 2026 |
| 10 | Grok 4 | xAI | 88% | Apr 2026 |
| 11 | Qwen3.6-27B | | 87.8% | Apr 2026 |
| 12 | Kimi-K2.5 | Moonshot.AI | 87.6% | Feb 2026 |
| 13 | Qwen3.5-122B-A10B | Alibaba | 86.6% | Feb 2026 |
| 14 | Gemini 2.5 Pro | | 86.4% | Jul 2025 |
| 15 | GLM-5.1 | | 86.2% | Feb 2026 |
| 16 | Qwen3.6-35B-A3B | | 86% | Apr 2026 |
| 17 | GLM-5 | Zhipu AI | 86% | Feb 2026 |
| 18 | GLM-4.7 | Zhipu AI | 85.7% | Aug 2025 |
| 19 | DeepSeek-V3.2-Speciale | DeepSeek | 85.7% | Dec 2025 |
| 20 | Qwen3.5-27B | Alibaba | 85.5% | Feb 2026 |
| 21 | MiniMax-M2.5 | MiniMaxAI | 85.2% | Feb 2026 |
| 22 | Step-3.5-Flash PaCoRe | | 85% | Feb 2026 |
| 23 | Gemma 4 31B | | 84.3% | Apr 2026 |
| 24 | Qwen3.5-35B-A3B | Alibaba | 84.2% | Feb 2026 |
| 25 | Gemini 2.5 Pro | | 84% | Mar 2026 |
| 26 | Qwen3.5-Omni-Plus | | 83.9% | Apr 2026 |
| 27 | Step-3.5-Flash | | 83.5% | Feb 2026 |
| 28 | Gemini 2.5 Flash | | 82.8% | Apr 2026 |
| 29 | o3 | OpenAI | 82.8% | Mar 2026 |
| 30 | Gemini 2.5 Flash | | 82.8% | Jul 2025 |
| 31 | DeepSeek-V3.2 | DeepSeek | 82.4% | Dec 2025 |
| 32 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 79.23% | Dec 2025 |
| 33 | GLM-4.5 | Zhipu AI | 79.1% | Aug 2025 |
| 34 | o4-mini | OpenAI | 77.6% | Mar 2026 |
| 35 | Qwen3-VL-235B-A22B-Thinking | Qwen | 77.1% | Nov 2025 |
| 36 | Claude Opus 4 | Anthropic | 76.7% | Mar 2026 |
| 37 | o1 | OpenAI | 75.7% | Mar 2026 |
| 38 | GLM-4.5-Air | Zhipu AI | 75% | Aug 2025 |
| 39 | Claude Opus 4.5 | Anthropic | 74.9% | Mar 2026 |
| 40 | o3-mini | OpenAI | 74.9% | Mar 2026 |
| 41 | Qwen3-Coder-Next | Qwen | 74.49% | Feb 2026 |
| 42 | Qwen3-VL-235B-A22B-Instruct | Qwen | 74.3% | Nov 2025 |
| 43 | o1-preview | OpenAI | 73.3% | Mar 2026 |
| 44 | Qwen3-Omni-Flash-Thinking | | 73.1% | Sep 2025 |
| 45 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 73% | Dec 2025 |
| 46 | DeepSeek R1 | DeepSeek | 71.5% | Mar 2026 |
| 47 | Qwen3-235B-A22B | Alibaba | 71.1% | Apr 2026 |
| 48 | Qwen3-235B-A22B | Alibaba | 71.1% | May 2025 |
| 49 | ZAYA1-8B | Z.ai | 71% | May 2026 |
| 50 | Claude Sonnet 4 | Anthropic | 70% | Mar 2026 |
| 51 | Llama 4 Maverick | Meta | 69.8% | Mar 2026 |
| 52 | GPT-4.5 Preview | OpenAI | 69.5% | Mar 2026 |
| 53 | MiMo-V2.5-Pro | | 66.7% | Apr 2026 |
| 54 | GPT-4.1 mini | OpenAI | 66.4% | Apr 2026 |
| 55 | GPT-4.1 | OpenAI | 66.3% | Mar 2026 |
| 56 | Trinity Large Preview | Arcee AI | 63.32% | Feb 2026 |
| 57 | o1-mini | OpenAI | 60% | Mar 2026 |
| 58 | Claude 3.5 Sonnet | Anthropic | 59.4% | Mar 2026 |
| 59 | Grok 2 | xAI | 56% | Mar 2026 |
| 60 | MiniMax-Text-01 | MiniMax | 54.4% | Jan 2025 |
| 61 | Llama 3 (405B, Instruct) | Meta | 51.1% | Jul 2024 |
| 62 | Llama 3.1 405B | Meta | 50.7% | Mar 2026 |
| 63 | Claude 3 Opus | Anthropic | 50.4% | Mar 2026 |
| 64 | GPT-4o | OpenAI | 49.9% | Mar 2026 |
| 65 | Qwen2.5-Plus | | 49.7% | Dec 2024 |
| 66 | GPT-4 Turbo | OpenAI | 49.3% | Mar 2026 |
| 67 | Qwen2.5-VL-72B | | 49% | Feb 2025 |
| 68 | Qwen2.5-72B-Instruct | Alibaba | 49% | Mar 2026 |
| 69 | Gemini 1.5 Pro | | 46.2% | Mar 2026 |
| 70 | Gemma 3 (27B, IT) | | 42.4% | Mar 2025 |
| 71 | Step-3.5-Flash Base | | 41.7% | Feb 2026 |
| 72 | Llama 3.1 70B | Meta | 41.7% | Mar 2026 |
| 73 | GPT-4o mini | OpenAI | 40.2% | Mar 2026 |
| 74 | Qwen3-VL-8B-Instruct | Qwen | 34.7% | Nov 2025 |
Source: arXiv:2311.12022 · 198-question Diamond set.
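Because the Diamond set contains only 198 questions, small gaps between adjacent rows may fall within sampling noise. The sketch below is one way to gauge that, assuming each question is an independent pass/fail trial and each score comes from a single run (the leaderboard does not say how many runs a score averages). The function name `accuracy_interval` is illustrative and not part of any benchmark tooling.

```python
import math

def accuracy_interval(accuracy_pct: float, n_questions: int = 198, z: float = 1.96):
    """Return a (low, high) normal-approximation interval in percent.

    Assumption: questions are independent and the score is from one run.
    z = 1.96 corresponds to a roughly 95% interval.
    """
    p = accuracy_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_questions)  # binomial standard error
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return low * 100.0, high * 100.0

# Top two rows of the GPQA Diamond table: 91.9% and 91.3%.
first = accuracy_interval(91.9)
second = accuracy_interval(91.3)

# Two intervals overlap when each starts before the other ends.
overlap = first[0] <= second[1] and second[0] <= first[1]
print(f"first:  {first[0]:.1f} to {first[1]:.1f}")
print(f"second: {second[0]:.1f} to {second[1]:.1f}")
print("intervals overlap:", overlap)
```

Under these assumptions the interval spans a few percentage points on either side of a score near 90%, so sub-point differences between neighboring models should be read cautiously.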
MMLU-Pro: a harder version of MMLU, with 10-choice multiple-choice questions and engineered distractors across 14 subject areas (about 12,000 questions). The larger option set reduces surface pattern-matching relative to the original 4-choice format. It remains useful for broad capability comparison.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Claude 3.7 Sonnet | Anthropic | 85.1% | Feb 2025 |
| 2 | Gemini 2.5 Pro | | 83.7% | Mar 2025 |
| 3 | o3-mini (high) | OpenAI | 79.3% | Feb 2025 |
| 4 | Claude 3.5 Sonnet | Anthropic | 76.1% | Jun 2024 |
| 5 | GPT-4o | OpenAI | 72.6% | May 2024 |
| 6 | Gemini 1.5 Pro | | 69% | May 2024 |
| 7 | Claude 3 Opus | Anthropic | 68.5% | Mar 2024 |
| 8 | GPT-4 Turbo | OpenAI | 63.7% | Nov 2023 |
| 9 | Llama 3 70B | Meta | 56.2% | Apr 2024 |
| 10 | DeepSeek V2 Chat | DeepSeek | 54.8% | May 2024 |
Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought.
Humanity's Last Exam (HLE): 3,000 expert-contributed questions spanning math, science, law, and humanities, designed to remain unsaturated for years. No tools allowed. Most models score below 40%; only the top five entries in the table below exceed it.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Kimi K2.6 | | 54% | Apr 2026 |
| 2 | MiMo-V2.5-Pro | | 48% | Apr 2026 |
| 3 | Gemini 3.1 Pro | | 46.44% | May 2026 |
| 4 | GPT-5.4 Pro | OpenAI | 44.32% | May 2026 |
| 5 | Muse Spark | Meta | 40.56% | May 2026 |
| 6 | Gemini 3 Pro | | 38.3% | |
| 7 | DeepSeek-V4-Pro Max | DeepSeek | 37.7% | Apr 2026 |
| 8 | Gemini 3 Pro Preview | | 37.52% | May 2026 |
| 9 | GPT-5.4 | OpenAI | 36.24% | May 2026 |
| 10 | Claude Opus 4.7 | Anthropic | 36.2% | May 2026 |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | 34.8% | Apr 2026 |
| 12 | Claude Opus 4.6 | Anthropic | 34.44% | May 2026 |
| 13 | GPT-5 Pro | OpenAI | 31.64% | May 2026 |
| 14 | GLM-5.1 | | 31% | Feb 2026 |
| 15 | DeepSeek-V3.2-Speciale | DeepSeek | 30.6% | Dec 2025 |
| 16 | GLM-5 | Zhipu AI | 30.5% | Feb 2026 |
| 17 | Kimi-K2.5 | Moonshot.AI | 30.1% | Feb 2026 |
| 18 | Qwen3.5-397B-A17B | Alibaba | 28.7% | Feb 2026 |
| 19 | Step-3.5-Flash PaCoRe | | 27.9% | Feb 2026 |
| 20 | GPT-5.2 | OpenAI | 27.8% | May 2026 |
| 21 | Gemma 4 31B | | 26.5% | Apr 2026 |
| 22 | GPT-5 | OpenAI | 25.32% | May 2026 |
| 23 | GPT-5 | OpenAI | 25.3% | |
| 24 | Claude Opus 4.5 | Anthropic | 25.2% | May 2026 |
| 25 | DeepSeek-V3.2 | DeepSeek | 25.1% | Dec 2025 |
| 26 | GLM-4.7 | Zhipu AI | 24.8% | Aug 2025 |
| 27 | Grok 4 | xAI | 24.5% | |
| 28 | Kimi K2.5 | Moonshot AI | 24.37% | May 2026 |
| 29 | Qwen3.6-27B | | 24% | Apr 2026 |
| 30 | GPT-5.1 | OpenAI | 23.68% | May 2026 |
| 31 | Step-3.5-Flash | | 23.1% | Feb 2026 |
| 32 | Gemini 2.5 Pro | | 21.64% | May 2026 |
| 33 | Gemini 2.5 Pro | | 21.6% | |
| 34 | Gemini 2.5 Pro | | 21.6% | Jul 2025 |
| 35 | Qwen3.6-35B-A3B | | 21.4% | Apr 2026 |
| 36 | o3 | OpenAI | 20.32% | May 2026 |
| 37 | GPT-5 mini | OpenAI | 19.44% | May 2026 |
| 38 | GPT-5 mini | OpenAI | 19.4% | |
| 39 | MiniMax-M2.5 | MiniMaxAI | 19.4% | Feb 2026 |
| 40 | Claude Opus 4.6 | Anthropic | 19% | Apr 2026 |
| 41 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 18.26% | Dec 2025 |
| 42 | o4-mini | OpenAI | 18.08% | May 2026 |
| 43 | GLM-4.5 | Zhipu AI | 14.4% | Aug 2025 |
| 44 | Claude Sonnet 4.5 | Anthropic | 13.72% | May 2026 |
| 45 | Claude 4.5 Sonnet | Anthropic | 13.7% | |
| 46 | Claude Sonnet 4.6 | Anthropic | 13.2% | Apr 2026 |
| 47 | Gemini 2.5 Flash | | 12.1% | |
| 48 | Gemini 2.5 Flash | | 12.08% | May 2026 |
| 49 | Claude Opus 4.1 | Anthropic | 11.52% | May 2026 |
| 50 | Gemini 2.5 Flash | | 11% | Jul 2025 |
| 51 | Claude Opus 4 | Anthropic | 10.72% | May 2026 |
| 52 | GLM-4.5-Air | Zhipu AI | 10.6% | Aug 2025 |
| 53 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 10.6% | Dec 2025 |
| 54 | Gemini 3.1 Flash-Lite | | 8.64% | May 2026 |
| 55 | DeepSeek R1 | DeepSeek | 8.5% | |
| 56 | GLM-4.5 | Zhipu AI | 8.32% | May 2026 |
| 57 | o1 Pro | OpenAI | 8.12% | May 2026 |
| 58 | GLM-4.5-Air | Zhipu AI | 8.12% | May 2026 |
| 59 | Claude 3.7 Sonnet | Anthropic | 8.04% | May 2026 |
| 60 | o1 | OpenAI | 8% | |
| 61 | o1 | OpenAI | 7.96% | May 2026 |
| 62 | Claude Sonnet 4 | Anthropic | 7.76% | May 2026 |
| 63 | Gemini 2.0 Flash Thinking | | 6.56% | May 2026 |
| 64 | Llama 4 Maverick | Meta | 5.68% | May 2026 |
| 65 | GPT-4.5 Preview | OpenAI | 5.44% | May 2026 |
| 66 | GPT-4.1 | OpenAI | 5.4% | May 2026 |
| 67 | GPT-4.1 mini | OpenAI | 4.6% | Apr 2026 |
| 68 | Gemini 1.5 Pro | | 4.6% | May 2026 |
| 69 | Mistral-Medium-3 | Mistral | 4.52% | May 2026 |
| 70 | Nova Pro | Amazon | 4.4% | May 2026 |
| 71 | Claude 3.5 Sonnet | Anthropic | 4.08% | May 2026 |
| 72 | Nova Lite | Amazon | 3.64% | May 2026 |
| 73 | GPT-4o | OpenAI | 2.72% | May 2026 |
| 74 | GPT-4o | OpenAI | 2.7% | |
Source: agi.safe.ai · No-tools variant.
GPQA Diamond consists of 198 graduate-level science questions designed to stump non-expert PhD holders. The questions were written by domain experts, who also provided misleading distractors. It measures depth of scientific understanding rather than pattern matching.
The original 4-choice MMLU has a random-guess baseline of about 25%, and models often get answers right by eliminating obviously wrong options. MMLU-Pro's 10-choice format with engineered distractors lowers the guessing baseline to 10% and demands more genuine understanding.
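The effect of the option count on guessing can be made concrete with a small calculation. This is a minimal sketch, assuming uniform random guessing over the options. The chance-corrected rescaling shown is a common convention and not something either benchmark reports, and the function names are hypothetical.

```python
def chance_baseline(n_choices: int) -> float:
    """Expected accuracy from uniform random guessing."""
    return 1.0 / n_choices

def chance_corrected(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so that 0.0 is chance level and 1.0 is perfect.

    accuracy is a fraction in [0, 1]; values below chance come out negative.
    """
    base = chance_baseline(n_choices)
    return (accuracy - base) / (1.0 - base)

# The same raw score means more on a 10-choice test than on a 4-choice one.
raw = 0.50
mmlu_style = chance_corrected(raw, n_choices=4)       # original MMLU format
mmlu_pro_style = chance_corrected(raw, n_choices=10)  # MMLU-Pro format
print(f"4 choices:  baseline {chance_baseline(4):.2f}, corrected {mmlu_style:.3f}")
print(f"10 choices: baseline {chance_baseline(10):.2f}, corrected {mmlu_pro_style:.3f}")
```

A raw score of 50% sits one third of the way from chance to perfect on a 4-choice test, but four ninths of the way on a 10-choice test, which is one reason raw MMLU and MMLU-Pro numbers are not directly comparable.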
HLE covers a much broader domain (math, science, law, humanities, linguistics) and is far harder: most frontier models still score below 40%. GPQA focuses specifically on biology, chemistry, and physics, and frontier models now exceed 80% on it.