8,500 grade-school word problems requiring 2-8 arithmetic steps, introduced by OpenAI (Cobbe et al., 2021). Largely saturated at the frontier, but still useful for evaluating smaller models.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 3 | Gemini 2.5 Pro | Google DeepMind | 99% | Mar 2026 |
| 4 | o4-mini | OpenAI | 99% | Mar 2026 |
| 5 | o3 | OpenAI | 99% | Mar 2026 |
| 6 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 7 | Llama-4-Maverick | Meta | 98.7% | Mar 2026 |
| 8 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 9 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 10 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 11 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 12 | o1 | OpenAI | 97.8% | Apr 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 15 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 16 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 17 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 18 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 19 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 20 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 21 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 22 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 23 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 24 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 25 | Gemini 1.5 Pro | Google DeepMind | 91.7% | Dec 2025 |
| 26 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 27 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 28 | PaLM 540B (Self-Consistency) | Google | 74% | Apr 2026 |
| 29 | PaLM 540B (CoT) | Google | 58% | Apr 2026 |
| 30 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |
Source: openai/grade-school-math · Chain-of-thought, maj@1.
500 representative problems from the MATH dataset (Hendrycks et al., 2021), covering algebra, geometry, number theory, and precalculus at difficulty levels 1-5. Reasoning models have recently surpassed 90%.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 7 | Gemini 2.5 Pro | Google DeepMind | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 10 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama-4-Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 25 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 26 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 27 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 28 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 29 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 30 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 31 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 32 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 33 | Gemini 1.5 Pro | Google DeepMind | 67.7% | Mar 2026 |
| 34 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
Source: hendrycks/math · MATH-500 representative subset.
30 integer-answer problems from the 2024 American Invitational Mathematics Examination (15 each on AIME I and II). Qualifying human competitors average roughly 3 of 15 per exam; top students score 10+. This is the sharpest differentiator among frontier models: only reasoning models break 70%.
| # | Model | Provider | % Correct | Date |
|---|---|---|---|---|
| ★ | o3 | OpenAI | 96.7% | Mar 2026 |
| 2 | o4-mini | OpenAI | 93.4% | Mar 2026 |
| 3 | Gemini 2.5 Pro | Google DeepMind | 92% | Mar 2026 |
| 4 | o1-preview | OpenAI | 83.3% | Dec 2025 |
| 5 | Claude 3.7 Sonnet | Anthropic | 80% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 79.8% | Mar 2026 |
| 7 | Claude 3.5 Opus | Anthropic | 16% | Dec 2025 |
| 8 | GPT-4o | OpenAI | 13.4% | Dec 2025 |
AIME 2024 I & II combined. Human AMC/AIME competitor baseline: ~20-30%.
Models like o3 and DeepSeek-R1 use extended chain-of-thought with self-verification before committing to an answer. This internal search lets them explore multiple solution paths and backtrack from errors, which is critical for multi-step solutions.
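The internals of proprietary reasoning models are not published, but the related self-consistency technique (the "Self-Consistency" PaLM row above, from Wang et al.) is simple to sketch: sample several chains at nonzero temperature, then majority-vote on the final answers. Here `sample_chain` is a hypothetical callable standing in for a model API:

```python
from collections import Counter

def self_consistency(sample_chain, n_samples: int = 16):
    """Self-consistency decoding: sample several reasoning chains, then
    majority-vote on their final answers. `sample_chain` is any callable
    returning a (chain_text, final_answer) pair."""
    answers = [sample_chain()[1] for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples  # chosen answer and agreement rate
```

The agreement rate doubles as a cheap confidence signal: low agreement across chains usually flags the problems a model is getting wrong.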
For frontier models GSM8K is effectively saturated, with scores above 94%. It remains useful for comparing smaller models (7B-13B), where accuracy still varies widely, roughly 60-90%.
AIME 2024 remains meaningful at the frontier since it requires novel combinatorial and algebraic insight that can't be memorized. HLE (Humanity's Last Exam) includes math problems even harder than AIME, with frontier models scoring below 40%.