8,500 grade-school math word problems requiring 2-8 arithmetic steps. Chain-of-thought prompting unlocks near-perfect accuracy in frontier models. Accuracy measured on the 1,319-problem test set.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | MiMo-V2.5-Pro | | 99.6% | Apr 2026 |
| 3 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 4 | o3 | OpenAI | 99% | Mar 2026 |
| 5 | Gemini 2.5 Pro | | 99% | Mar 2026 |
| 6 | o4-mini | OpenAI | 99% | Mar 2026 |
| 7 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 8 | Llama 4 Maverick | Meta | 98.7% | Mar 2026 |
| 9 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 10 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 11 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 12 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | o1 | OpenAI | 97.8% | Apr 2026 |
| 15 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 16 | o1 | OpenAI | 97.8% | Apr 2026 |
| 17 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 18 | Llama 3 (405B, Instruct) | Meta | 96.8% | Jul 2024 |
| 19 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 20 | Qwen2.5-Plus | | 96% | Dec 2024 |
| 21 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 22 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 23 | Qwen2.5-VL-72B | | 95.3% | Feb 2025 |
| 24 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 25 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 26 | MiniMax-Text-01 | MiniMax | 94.8% | Jan 2025 |
| 27 | MiniCPM-o 4.5-Instruct | | 94.5% | Apr 2026 |
| 28 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 29 | Qwen3-235B-A22B | Alibaba | 94.39% | May 2025 |
| 30 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 31 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 32 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 33 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 34 | Gemini 1.5 Pro | | 91.7% | Dec 2025 |
| 35 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 36 | Step-3.5-Flash Base | | 88.2% | Feb 2026 |
| 37 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 38 | HRM-Text-1B | | 84.7% | May 2026 |
| 39 | Apertus-70B-Instruct | | 77.6% | Sep 2025 |
| 40 | PaLM 540B (Self-Consistency) | | 74% | Apr 2026 |
| 41 | LLaMA-65B | | 69.7% | Feb 2023 |
| 42 | Chameleon 34B | | 61.4% | May 2024 |
| 43 | BitNet b1.58 2B4T | | 58.38% | Apr 2025 |
| 44 | PaLM 540B (CoT) | | 58% | Apr 2026 |
| 45 | Llama 2 70B (5-shot) | | 56.8% | Jul 2023 |
| 46 | Code Llama - Python 34B | | 34.42% | Aug 2023 |
| 47 | SmolLM2 (1.7B) | | 31.1% | Feb 2025 |
| 48 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |
Source: openai/grade-school-math · Chain-of-thought, maj@1.
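The "maj@1" label can be read as majority voting over k sampled answers with k = 1. The sketch below shows one way such a score might be computed. It relies on the fact that GSM8K reference solutions end with a `#### <number>` line, and it assumes (hypothetically) that model outputs are prompted to end the same way. The function names are illustrative and not taken from the benchmark repository.

```python
import re
from collections import Counter

# Matches the number after a "####" marker, e.g. "#### 1,234" or "#### -7.5".
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text):
    """Return the last '#### <number>' value as a float, or None if absent."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def majority_answer(samples):
    """Pick the most common extracted answer among k sampled completions."""
    answers = [a for a in map(extract_answer, samples) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(problems):
    """problems: list of (samples, reference_solution) pairs.

    With one sample per problem this reduces to plain single-sample accuracy.
    """
    correct = 0
    for samples, reference in problems:
        predicted = majority_answer(samples)
        if predicted is not None and predicted == extract_answer(reference):
            correct += 1
    return correct / len(problems)


if __name__ == "__main__":
    demo = [
        (["... so she earns 18 dollars. #### 18"], "16 - 3 - 4 = 9 ... #### 18"),
        (["... the total is 20. #### 20"], "... #### 25"),
    ]
    print(maj_at_k_accuracy(demo))
```

Passing several samples per problem turns the same function into a self-consistency style score, which is the setting the "PaLM 540B (Self-Consistency)" row refers to.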
12,500 competition problems rated at difficulty levels 1-5 (AMC/AIME level), covering algebra, counting and probability, geometry, number theory, and precalculus. The MATH-500 subset (500 representative problems) is the standard evaluation split.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | Gemini 2.5 Pro | | 97.3% | Mar 2026 |
| 7 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 10 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama 4 Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-Plus | | 84.7% | Dec 2024 |
| 25 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 26 | Qwen2.5-VL-72B | | 83% | Feb 2025 |
| 27 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 28 | MiniMax-Text-01 | MiniMax | 77.4% | Jan 2025 |
| 29 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 30 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 31 | Llama 3 (405B, Instruct) | Meta | 73.8% | Jul 2024 |
| 32 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 33 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 34 | Qwen3-235B-A22B | Alibaba | 71.84% | May 2025 |
| 35 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 36 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 37 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 38 | Gemini 1.5 Pro | | 67.7% | Mar 2026 |
| 39 | Step-3.5-Flash Base | | 66.8% | Feb 2026 |
| 40 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
| 41 | HRM-Text-1B | | 56.5% | May 2026 |
| 42 | Aria | | 50.8% | Oct 2024 |
| 43 | Apertus-70B-Instruct | | 30.8% | Sep 2025 |
| 44 | Chameleon 34B | | 22.5% | May 2024 |
| 45 | LLaMA-65B | | 20.5% | Feb 2023 |
| 46 | SmolLM2 (1.7B) | | 11.6% | Feb 2025 |
Source: hendrycks/math · MATH-500 subset, chain-of-thought.
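MATH reference solutions mark the final answer with `\boxed{...}`, so a scorer first has to pull that expression out of the solution text. The following is a minimal sketch of one way to do this with brace matching. The function names are hypothetical, and the exact-string comparison is a simplifying assumption: real graders typically also normalize or symbolically compare expressions such as `0.5` and `\frac{1}{2}`.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth so nested LaTeX such as \\frac{3}{4} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(ch)
    return None  # unbalanced braces: treat as no answer


def is_correct(model_output, reference_solution):
    """Exact match on the boxed answers after removing spaces (an assumption)."""
    predicted = extract_last_boxed(model_output)
    expected = extract_last_boxed(reference_solution)
    if predicted is None or expected is None:
        return False
    return predicted.replace(" ", "") == expected.replace(" ", "")


if __name__ == "__main__":
    output = r"Adding the cases gives $\boxed{\frac{3}{4}}$."
    reference = r"Hence the probability is $\boxed{\frac{3}{4}}$."
    print(is_correct(output, reference))
```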
Grade School Math 8K: 8,500 word problems requiring 2-8 steps of arithmetic reasoning, created by OpenAI in 2021. Chain-of-thought prompting revealed a step change in model capability. The benchmark is now saturated at the frontier, with the top models scoring above 99%.
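To make "saturated" concrete, it can help to convert the reported percentages back into problem counts on the 1,319-problem test set. The sketch below does that and also computes a rough binomial standard error. Both helper names are illustrative, and the normal-approximation standard error is an assumption that becomes crude this close to 100%.

```python
import math


def implied_errors(accuracy, n_problems):
    """Approximate number of problems missed at a given accuracy."""
    return round((1 - accuracy) * n_problems)


def standard_error(accuracy, n_problems):
    """Binomial standard error of an accuracy estimate (normal approximation)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_problems)


if __name__ == "__main__":
    n_test = 1319  # GSM8K test set size quoted above
    for name, acc in [("ERNIE 5.0", 0.997), ("MiMo-V2.5-Pro", 0.996), ("GPT-5", 0.992)]:
        missed = implied_errors(acc, n_test)
        se_points = 100 * standard_error(acc, n_test)
        print(f"{name}: ~{missed} missed, +/- {se_points:.2f} points")
```

Under these assumptions the top two entries differ by roughly one problem, and that 0.1-point gap is smaller than the standard error of about 0.15 points, so the ordering at the top of the table is best read as a tie.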
MATH problems require domain knowledge (e.g., modular arithmetic, geometric proofs), not just arithmetic. Difficulty 5 problems (AIME-level) stump most people. The benchmark was designed to take years to saturate, yet reasoning models like o3 now score above 95%.
Reasoning models (o3, DeepSeek-R1) use extended chain-of-thought with self-verification before committing to an answer. This search-like inference process is especially effective on math, where step-by-step verification catches errors.
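As a toy illustration of why step-by-step verification catches errors, the sketch below re-checks each arithmetic step written in a chain of thought. It is an external checker built on stated assumptions, not a description of how o3 or DeepSeek-R1 verify their work internally. It only recognizes steps of the form `a op b = c` with a single binary operation, so chained expressions like `16 - 3 - 4 = 9` are outside its scope.

```python
import re
from fractions import Fraction

NUMBER = r"-?\d+(?:\.\d+)?"
# One binary operation followed by a claimed result, e.g. "48 / 2 = 24".
# The lookbehind stops a match from starting in the middle of a number.
STEP_RE = re.compile(
    rf"(?<![\d.])({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_steps(reasoning):
    """Return (step_text, is_valid) for every 'a op b = c' step found."""
    results = []
    for match in STEP_RE.finditer(reasoning):
        a, op, b, claimed = match.groups()
        # Fractions keep the arithmetic exact, avoiding float rounding issues.
        a, b, claimed = Fraction(a), Fraction(b), Fraction(claimed)
        if op == "/" and b == 0:
            results.append((match.group(0), False))
            continue
        results.append((match.group(0), OPS[op](a, b) == claimed))
    return results


if __name__ == "__main__":
    chain = (
        "She sold 48 / 2 = 24 clips in May. "
        "Altogether she sold 48 + 24 = 73 clips."
    )
    for step, ok in check_steps(chain):
        print(("ok   " if ok else "WRONG"), step)
```

Here the second step is flagged because 48 + 24 is 72, which is the kind of local slip that a verification pass can catch before a final answer is committed.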