8,500 grade-school math word problems requiring 2-8 arithmetic steps. Chain-of-thought prompting unlocks near-perfect accuracy in frontier models. Accuracy measured on the 1,319-problem test set.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 3 | Gemini 2.5 Pro | Google DeepMind | 99% | Mar 2026 |
| 4 | o4-mini | OpenAI | 99% | Mar 2026 |
| 5 | o3 | OpenAI | 99% | Mar 2026 |
| 6 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 7 | Llama-4-Maverick | Meta | 98.7% | Mar 2026 |
| 8 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 9 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 10 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 11 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 12 | o1 | OpenAI | 97.8% | Apr 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 15 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 16 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 17 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 18 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 19 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 20 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 21 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 22 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 23 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 24 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 25 | Gemini 1.5 Pro | Google DeepMind | 91.7% | Dec 2025 |
| 26 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 27 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 28 | PaLM 540B (Self-Consistency) | Google | 74% | Apr 2026 |
| 29 | PaLM 540B (CoT) | Google | 58% | Apr 2026 |
| 30 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |
Source: openai/grade-school-math · Chain-of-thought, maj@1.
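For reference, a minimal sketch of how a chain-of-thought, maj@1 score is typically computed on this split: reference solutions end with a `#### <number>` line, and a single sampled completion is marked correct if its final number matches. The Hugging Face dataset ID (`gsm8k`, config `main`) and the `generate` callable are assumptions about tooling, not part of any official harness.

```python
# A sketch of chain-of-thought GSM8K scoring at maj@1 (one sampled completion per problem).
# Assumptions: the Hugging Face hub copy of the dataset (ID "gsm8k", config "main") and a
# user-supplied `generate(question)` that returns the model's chain-of-thought text.
import re
from datasets import load_dataset

def reference_answer(solution: str) -> str:
    """GSM8K reference solutions end with a line of the form '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(completion: str) -> str | None:
    """Take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return numbers[-1].replace(",", "").rstrip(".") if numbers else None

def gsm8k_accuracy(generate) -> float:
    test = load_dataset("gsm8k", "main", split="test")  # the 1,319-problem test split
    correct = sum(
        predicted_answer(generate(row["question"])) == reference_answer(row["answer"])
        for row in test
    )
    return correct / len(test)
```

Taking the last number in the completion is the common convention when the prompt does not force a `####` suffix; harnesses that do force it can reuse `reference_answer` on predictions as well.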
12,500 competition problems rated at difficulty levels 1-5 (up to AMC/AIME level), covering algebra, counting and probability, geometry, number theory, and precalculus. MATH-500, a 500-problem representative subset of the test set, is the standard evaluation split.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 7 | Gemini 2.5 Pro | Google DeepMind | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 10 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama-4-Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 25 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 26 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 27 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 28 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 29 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 30 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 31 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 32 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 33 | Gemini 1.5 Pro | Google DeepMind | 67.7% | Mar 2026 |
| 34 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
Source: hendrycks/math · MATH-500 subset, chain-of-thought.
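MATH scoring hinges on extracting the last `\boxed{...}` expression from both the reference solution and the model's chain of thought. The sketch below is a simplified, string-level checker; production harnesses typically add fuller LaTeX normalization and symbolic equivalence (e.g., via sympy), so treat the `normalize` rules here as illustrative assumptions.

```python
# A simplified MATH / MATH-500 answer checker. Reference solutions wrap the final answer in
# \boxed{...}; the model's chain of thought is scored by comparing its last boxed expression
# against the reference one. Real harnesses add LaTeX normalization and symbolic equivalence
# checks; the string-level `normalize` below is a deliberate simplification.
def last_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution, handling nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(text) and depth > 0:
        ch = text[i]
        depth += (ch == "{") - (ch == "}")
        if depth > 0:
            out.append(ch)
        i += 1
    return "".join(out)

def normalize(expr: str) -> str:
    """Crude normalization: drop spaces, \\left/\\right, and trailing punctuation."""
    return expr.replace(r"\left", "").replace(r"\right", "").replace(" ", "").rstrip(".")

def is_correct(model_output: str, reference_solution: str) -> bool:
    pred, gold = last_boxed(model_output), last_boxed(reference_solution)
    return pred is not None and gold is not None and normalize(pred) == normalize(gold)
```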
Grade School Math 8K — 8,500 word problems requiring 2-8 step arithmetic reasoning, created by OpenAI in 2021. Chain-of-thought prompting revealed a step-change in model capability. Now saturated at the frontier.
MATH problems require domain knowledge (e.g., modular arithmetic, geometric proofs), not just arithmetic. Difficulty 5 problems (AIME-level) stump most people. The benchmark was designed to take years to saturate, yet reasoning models like o3 are now above 95%.
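As an illustration of the kind of domain knowledge involved, here is a short number-theory solution in the benchmark's LaTeX conventions; the problem is made up for exposition and is not drawn from the dataset.

```latex
% Illustrative, benchmark-style problem: find the remainder when 2^{100} is divided by 7.
% Uses the fact that 2^3 = 8 \equiv 1 \pmod{7}.
\begin{align*}
  2^{100} = 2^{3 \cdot 33 + 1} = \left(2^{3}\right)^{33} \cdot 2
          \equiv 1^{33} \cdot 2 \equiv \boxed{2} \pmod{7}.
\end{align*}
```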
Reasoning models (o3, DeepSeek-R1) use extended chain-of-thought with self-verification before committing to an answer. This search-like inference process is especially effective on math, where step-by-step verification catches errors.
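The internals of o3 and DeepSeek-R1 are not public, but the closest published analogue that can be sketched is self-consistency decoding (Wang et al., 2022), the method behind the "PaLM 540B (Self-Consistency)" row above: sample several chains of thought and majority-vote on their final answers, so agreement across independent chains plays the role of verification. The sampling and answer-extraction callables below are placeholders, not any specific model's API.

```python
# Self-consistency decoding: sample k chains of thought at non-zero temperature and
# majority-vote on their extracted final answers. `sample_chain` and `extract_answer`
# are placeholders for a model call and an answer parser (e.g., last number / last \boxed{}).
from collections import Counter

def self_consistency(question: str, sample_chain, extract_answer, k: int = 16) -> str | None:
    """Return the most common final answer across k independently sampled chains of thought."""
    answers = []
    for _ in range(k):
        chain = sample_chain(question)        # sampled at temperature > 0 so chains differ
        answer = extract_answer(chain)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Agreement across independent chains acts as a cheap verification signal.
    return Counter(answers).most_common(1)[0][0]
```

maj@1 in the GSM8K table above is the degenerate case k = 1; reasoning models fold a similar verify-then-answer loop into a single long chain of thought instead of voting across separate samples.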