Graduate-level knowledge (GPQA Diamond), broad multi-subject reasoning (MMLU-Pro), and extreme frontier difficulty (HLE). Together, these benchmarks separate memorized world knowledge from genuine scientific reasoning.
198 expert-authored graduate-level questions in biology, chemistry, and physics, designed to be Google-proof. PhD specialists score ~65% in their own field; skilled non-experts (PhDs in other domains, even with unrestricted web access) score only 34%. A minimal scoring sketch follows the table.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Gemini 3 Pro | Google | 91.9% | Apr 2026 |
| 2 | Claude Opus 4.6 | Anthropic | 91.3% | Apr 2026 |
| 3 | Gemini 3 Flash | Google | 90.4% | Apr 2026 |
| 4 | Claude Sonnet 4.6 | Anthropic | 89.9% | Apr 2026 |
| 5 | GPT-5 | OpenAI | 89% | Apr 2026 |
| 6 | Grok 4 | xAI | 88% | Apr 2026 |
| 7 | Gemini 2.5 Pro | Google | 84% | Mar 2026 |
| 8 | o3 | OpenAI | 82.8% | Mar 2026 |
| 9 | Gemini 2.5 Flash | Google | 82.8% | Apr 2026 |
| 10 | o4-mini | OpenAI | 77.6% | Mar 2026 |
| 11 | Claude Opus 4 | Anthropic | 76.7% | Mar 2026 |
| 12 | o1 | OpenAI | 75.7% | Mar 2026 |
| 13 | Claude Opus 4.5 | Anthropic | 74.9% | Mar 2026 |
| 14 | o3-mini | OpenAI | 74.9% | Mar 2026 |
| 15 | o1-preview | OpenAI | 73.3% | Mar 2026 |
| 16 | DeepSeek R1 | DeepSeek | 71.5% | Mar 2026 |
| 17 | Qwen3-235B-A22B | Alibaba | 71.1% | Apr 2026 |
| 18 | Claude Sonnet 4 | Anthropic | 70% | Mar 2026 |
| 19 | Llama-4-Maverick | Meta | 69.8% | Mar 2026 |
| 20 | GPT-4.5 Preview | OpenAI | 69.5% | Mar 2026 |
| 21 | GPT-4.1 mini | OpenAI | 66.4% | Apr 2026 |
| 22 | GPT-4.1 | OpenAI | 66.3% | Mar 2026 |
| 23 | o1-mini | OpenAI | 60% | Mar 2026 |
| 24 | Claude 3.5 Sonnet | Anthropic | 59.4% | Mar 2026 |
| 25 | Grok 2 | xAI | 56% | Mar 2026 |
| 26 | Llama 3.1 405B | Meta | 50.7% | Mar 2026 |
| 27 | Claude 3 Opus | Anthropic | 50.4% | Mar 2026 |
| 28 | GPT-4o | OpenAI | 49.9% | Mar 2026 |
| 29 | GPT-4 Turbo | OpenAI | 49.3% | Mar 2026 |
| 30 | Qwen2.5-72B-Instruct | Alibaba | 49% | Mar 2026 |
| 31 | Gemini 1.5 Pro | Google | 46.2% | Mar 2026 |
| 32 | Llama 3.1 70B | Meta | 41.7% | Mar 2026 |
| 33 | GPT-4o mini | OpenAI | 40.2% | Mar 2026 |
Source: arXiv:2311.12022 · 198-question Diamond set.
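For reproducibility, here is a minimal sketch of how a GPQA Diamond run is typically scored. Assumptions not in the source: the gated Hugging Face mirror `Idavidrein/gpqa` (config `gpqa_diamond`, split `train`) with its column names, and a hypothetical `ask_model()` standing in for whatever API client you actually use; the paper's official harness differs in prompt details.

```python
# Minimal GPQA Diamond scoring loop (sketch, not the official harness).
import random

from datasets import load_dataset


def ask_model(prompt: str) -> str:
    """Hypothetical model call -- replace with a real client.
    Expected to return a single letter A-D."""
    raise NotImplementedError


def score_gpqa_diamond(seed: int = 0) -> float:
    rng = random.Random(seed)
    rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
    correct = 0
    for row in rows:
        # One gold answer plus three expert-written distractors; shuffle
        # so the gold letter is not positionally biased.
        choices = [
            row["Correct Answer"],
            row["Incorrect Answer 1"],
            row["Incorrect Answer 2"],
            row["Incorrect Answer 3"],
        ]
        rng.shuffle(choices)
        gold = "ABCD"[choices.index(row["Correct Answer"])]
        options = "\n".join(
            f"{letter}) {text}" for letter, text in zip("ABCD", choices)
        )
        prompt = (
            f"{row['Question']}\n\n{options}\n\n"
            "Answer with a single letter (A-D)."
        )
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(rows)  # 198 questions in the Diamond split
```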
A harder version of MMLU: 10-choice multiple-choice questions with engineered distractors across 14 subject areas (~12,000 questions). The wider option set curbs surface pattern-matching relative to the original 4-choice format, and the benchmark remains useful for broad capability comparison. An answer-extraction sketch follows the table.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Gemini 3.1 Pro | Google | 90.99% | Apr 2026 |
| 2 | Gemini 3 Pro | Google | 89.8% | Apr 2026 |
| 3 | Claude Opus 4.5 | Anthropic | 89.5% | Apr 2026 |
| 4 | Gemini 3 Flash | Google | 89% | Apr 2026 |
| 5 | Qwen3.6 Plus | Alibaba Cloud | 88.5% | Apr 2026 |
| 6 | Claude Opus 4.1 | Anthropic | 88% | Apr 2026 |
| 7 | MiniMax M2.1 | MiniMax | 88% | Apr 2026 |
| 8 | Qwen3.5-397B-A17B | Alibaba Cloud | 87.8% | Apr 2026 |
| 9 | Claude Sonnet 4.5 | Anthropic | 87.5% | Apr 2026 |
| 10 | GPT-5.2 | OpenAI | 87.4% | Apr 2026 |
| 11 | Kimi K2.5 | Moonshot AI | 87.1% | Apr 2026 |
| 12 | GPT-5 | OpenAI | 87.1% | Apr 2026 |
| 13 | GPT-5.1 | OpenAI | 87% | Apr 2026 |
| 14 | Grok 4 | xAI | 86.6% | Apr 2026 |
| 15 | DeepSeek V3.2 | DeepSeek | 86.2% | Apr 2026 |
| 16 | Claude 3.7 Sonnet | Anthropic | 85.1% | Apr 2026 |
| 17 | DeepSeek-R1-0528 | DeepSeek | 85% | Apr 2026 |
| 18 | Kimi K2-Thinking-0905 | Moonshot AI | 84.6% | Apr 2026 |
| 19 | GLM-4.5 | Zhipu AI | 84.6% | Apr 2026 |
| 20 | GPT-4o | OpenAI | 72.6% | Apr 2026 |
Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought.
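Under the 5-shot chain-of-thought protocol, only the final letter is graded, so the fragile step is answer extraction. Below is a sketch of that step, assuming responses end with a phrase like "the answer is (C)" as in the TIGER-AI-Lab reference harness; the exact regex used there may differ.

```python
# Answer extraction for 10-option (A-J) chain-of-thought grading (sketch).
import re

ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)


def extract_choice(response: str) -> str | None:
    """Return the last A-J letter claimed as the answer, if any."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def grade(response: str, gold: str) -> bool:
    # Unparseable responses count as wrong rather than being re-prompted.
    return extract_choice(response) == gold


assert grade("Comparing the options step by step... the answer is (C).", "C")
```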
2,500 expert-contributed questions spanning math, science, law, and the humanities, designed to remain unsaturated for years. No tools are allowed in this variant. Even the best model scores below 40%. A grading sketch follows the table.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | Gemini 3 Pro | Google | 38.3% | |
| 2 | GPT-5 | OpenAI | 25.3% | |
| 3 | Grok 4 | xAI | 24.5% | |
| 4 | Gemini 2.5 Pro | Google | 21.6% | |
| 5 | GPT-5-mini | OpenAI | 19.4% | |
| 6 | Claude Opus 4.6 | Anthropic | 19% | Apr 2026 |
| 7 | Claude Sonnet 4.5 | Anthropic | 13.7% | |
| 8 | Claude Sonnet 4.6 | Anthropic | 13.2% | Apr 2026 |
| 9 | Gemini 2.5 Flash | Google | 12.1% | |
| 10 | DeepSeek R1 | DeepSeek | 8.5% | |
| 11 | o1 | OpenAI | 8% | |
| 12 | GPT-4.1 mini | OpenAI | 4.6% | Apr 2026 |
| 13 | GPT-4o | OpenAI | 2.7% | |
Source: agi.safe.ai · No-tools variant.
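Because HLE answers are mostly short free-form strings rather than lettered choices, the official harness relies on an LLM judge for most grading. The sketch below only illustrates the simpler normalized exact-match idea; `normalize()` is a hypothetical helper, not the agi.safe.ai implementation.

```python
# Normalized exact-match grading for short free-form answers (sketch).
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


assert exact_match(" Photosynthesis. ", "photosynthesis")
```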
198 graduate-level science questions designed to stump even PhD holders outside the question's field. Domain experts wrote both the questions and the misleading distractors, so it measures depth of scientific understanding rather than pattern matching.
With only 4 choices, the original MMLU has a 25% random-guess base rate, and models often get answers right simply by eliminating obviously wrong options. The 10-choice format with engineered distractors cuts the guess rate to 10% and forces genuine understanding.
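The arithmetic behind that claim is simple: guessing after eliminating k of n options succeeds with probability 1/(n - k), so partial elimination is worth far less under 10 options.

```python
# Expected accuracy from pure guessing after eliminating k of n options.
def guess_accuracy(n_options: int, n_eliminated: int) -> float:
    return 1 / (n_options - n_eliminated)


# 4-choice MMLU: ruling out two weak distractors already yields 50%.
assert guess_accuracy(4, 0) == 0.25
assert guess_accuracy(4, 2) == 0.50
# 10-choice MMLU-Pro: the same elimination reaches only 12.5%.
assert guess_accuracy(10, 0) == 0.10
assert guess_accuracy(10, 2) == 0.125
```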
HLE covers much broader ground (math, science, law, humanities, linguistics) and is far harder: even frontier models score below 40%. GPQA focuses specifically on biology, chemistry, and physics, where the best frontier models now exceed 90%.