A harder version of MMLU: roughly 12,000 ten-choice multiple-choice questions across 14 subject areas (the 57-subject figure belongs to the original MMLU). It reduces sensitivity to prompt format and increases reasoning difficulty.
20 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | Accuracy (%) |
|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro (API) | Google | Apr 2026 | pricepertoken | 90.99 |
| 02 | Gemini 3 Pro (API) | Google | Apr 2026 | pricepertoken | 89.80 |
| 03 | Claude Opus 4.5 (API) | Anthropic | Apr 2026 | pricepertoken | 89.50 |
| 04 | Gemini 3 Flash (API) | Google | Apr 2026 | pricepertoken | 89.00 |
| 05 | Qwen3.6 Plus | Alibaba Cloud | Apr 2026 | llm-stats | 88.50 |
| 06 | Claude Opus 4.1 | Anthropic | Apr 2026 | pricepertoken | 88.00 |
| 07 | MiniMax M2.1 (API) | MiniMax | Apr 2026 | pricepertoken | 88.00 |
| 08 | Qwen3.5-397B-A17B | Alibaba Cloud | Apr 2026 | llm-stats | 87.80 |
| 09 | Claude Sonnet 4.5 (API) | Anthropic | Apr 2026 | pricepertoken | 87.50 |
| 10 | GPT-5.2 (API) | OpenAI | Apr 2026 | pricepertoken | 87.40 |
| 11 | Kimi K2.5 (API) | Moonshot AI | Apr 2026 | llm-stats | 87.10 |
| 12 | GPT-5 (API) | OpenAI | Apr 2026 | pricepertoken | 87.10 |
| 13 | GPT-5.1 (API) | OpenAI | Apr 2026 | pricepertoken | 87.00 |
| 14 | Grok 4 (API) | xAI | Apr 2026 | pricepertoken | 86.60 |
| 15 | DeepSeek V3.2 (API) | DeepSeek | Apr 2026 | pricepertoken | 86.20 |
| 16 | Claude 3.7 Sonnet | Anthropic | Apr 2026 | anthropic-announcement | 85.10 |
| 17 | DeepSeek-R1-0528 (OSS) | DeepSeek | Apr 2026 | llm-stats | 85.00 |
| 18 | Kimi K2-Thinking-0905 (OSS) | Moonshot AI | Apr 2026 | llm-stats | 84.60 |
| 19 | GLM-4.5 | Zhipu AI | Apr 2026 | llm-stats | 84.60 |
| 20 | GPT-4o (API) | OpenAI | Apr 2026 | artificial-analysis | 72.60 |
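The ordering rule stated above the table (highest accuracy first, ties broken by submission date) can be sketched as a two-key sort. This is only an illustration: the row data, the `rank` helper, and the assumption that the earlier submission wins a tie are hypothetical, since the table lists dates at month granularity only.

```python
from datetime import date

# Hypothetical rows: (model, accuracy, submission date).
# The exact dates are invented for illustration.
rows = [
    ("model-a", 88.0, date(2026, 4, 12)),
    ("model-b", 88.0, date(2026, 4, 3)),
    ("model-c", 90.5, date(2026, 4, 20)),
]

def rank(rows):
    # Primary key: accuracy, descending (negated for ascending sort).
    # Secondary key: submission date, earlier first (assumed tie-break direction).
    return sorted(rows, key=lambda r: (-r[1], r[2]))

for position, (model, acc, submitted) in enumerate(rank(rows), start=1):
    print(f"{position:02d} {model} {acc:.2f} {submitted.isoformat()}")
```

Under these assumptions `model-b` is placed ahead of `model-a` because it was submitted earlier at the same score.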
Illustrative items from this benchmark, shown in the exact format the model sees. Sourced from primary distribution — see citation at the bottom of the section.
The symmetric group $S_n$ has $n!$ elements, hence it is not true that $S_{10}$ has 10 elements. Find the characteristic of the ring 2Z.
Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) ∈ A}?
Where do most short-period comets come from and how do we know?
Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
Higher scores win. Each subsequent entry improved upon the previous best.
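One way to derive the record-setting entries from a chronological list of submissions is a running maximum. The sketch below is an assumption about how such a list could be computed, not the site's actual method; `record_setters` and the sample history are hypothetical, and ties are treated as not setting a new record.

```python
def record_setters(submissions):
    """Return the entries that strictly beat the best score seen so far.

    `submissions` is an iterable of (model, accuracy) pairs in
    chronological order.
    """
    best = float("-inf")
    records = []
    for model, accuracy in submissions:
        # Only a strictly higher score counts as a new record (assumption).
        if accuracy > best:
            best = accuracy
            records.append((model, accuracy))
    return records

# Hypothetical submission history, oldest first.
history = [("m1", 72.6), ("m2", 85.1), ("m3", 84.6), ("m4", 85.1), ("m5", 89.5)]
print(record_setters(history))
```

Here `m3` and `m4` are skipped because neither exceeds the 85.1 set by `m2`.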
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.