Mathematical Reasoning · 2021 · en
Mathematics Aptitude Test of Heuristics
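12,500 competition mathematics problems (7,500 train, 5,000 test) drawn from AMC, AIME, and other sources, covering algebra, geometry, number theory, and more. Considerably harder than GSM8K, which targets grade-school word problems. Modern evaluations typically report results on MATH-500, a 500-problem representative subset of the test set.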
Current State of the Art
o4-mini (high)
OpenAI
98.2
accuracy
MATH — accuracy
34 results · 1 SOTA advances · higher is better
accuracy Progress Over Time
Showing 3 breakthroughs from Dec 2024 to Mar 2026
Key Milestones
Dec 2024
DeepSeek-V3
MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024).
90.2
Mar 2026
o4-mini (high) · Current SOTA
MATH-500, zero-shot CoT, pass@1. High reasoning effort.
98.2
+0.9%
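The milestone entries report MATH-500 accuracy as pass@1 under zero-shot chain-of-thought prompting. As a rough sketch of what that metric measures, assuming one sampled answer per problem and exact string match after light normalization, accuracy could be computed as follows. The function names and sample data are hypothetical, and real graders typically use more elaborate answer-equivalence checks.

```python
def normalize(answer: str) -> str:
    """Hypothetical light normalization: trim, drop a trailing period and spaces."""
    return answer.strip().rstrip(".").replace(" ", "")


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single sampled answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one problem")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Illustrative data only; not taken from MATH-500.
preds = ["42", " 3/4", "x = 2", "7."]
refs = ["42", "3/4", "x=2", "8"]
score = pass_at_1(preds, refs)
print(f"pass@1 accuracy: {score:.1f}")
```

With a single sample per problem, pass@1 reduces to plain accuracy, which is why the leaderboard can label the metric simply as "accuracy".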
Total Improvement
8.9%
Time Span
1y 4m
Breakthroughs
3
Current SOTA
98.2
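The 8.9% total improvement appears to be a relative gain over the first milestone rather than a difference in percentage points (98.2 minus 90.2 is 8.0 points). This reading is inferred from the numbers on this page, not from a documented definition. A minimal check of the arithmetic, with hypothetical variable names:

```python
def relative_improvement(first: float, last: float) -> float:
    """Relative gain of `last` over `first`, in percent."""
    return 100.0 * (last - first) / first


first_sota = 90.2    # DeepSeek-V3, Dec 2024 milestone
current_sota = 98.2  # o4-mini (high), Mar 2026 milestone

# Gain in percentage points versus gain relative to the starting score.
absolute_gain = round(current_sota - first_sota, 1)
relative_gain = round(relative_improvement(first_sota, current_sota), 1)

print(f"absolute: {absolute_gain} points, relative: {relative_gain}%")
```

Under this assumption the relative gain rounds to 8.9%, matching the figure shown, while the absolute gain is 8.0 points.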
Top Models Performance Comparison
Top 10 models ranked by accuracy
Best Score
98.2
Top Model
o4-mini (high)
Models Compared
10
Score Range
2.0
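The summary figures above can be reproduced from the ten highest scores in the results table. The short sketch below assumes that "Score Range" means the best score minus the tenth-best score; the variable names are illustrative.

```python
# Scores of ranks 1-10 as listed in the results table on this page.
top10_scores = [98.2, 98.1, 97.9, 97.8, 97.5, 97.3, 97.3, 96.4, 96.2, 96.2]

best_score = max(top10_scores)
models_compared = len(top10_scores)
# Round to one decimal to avoid floating-point noise in the subtraction.
score_range = round(best_score - min(top10_scores), 1)

print(best_score, models_compared, score_range)
```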
accuracy (Primary)
| # | Model | Access | Organization | Score | Date |
|---|---|---|---|---|---|
| 1 | o4-mini (high) | API | OpenAI | 98.2 | Mar 2026 |
| 2 | o3 (high) | API | OpenAI | 98.1 | Mar 2026 |
| 3 | o3-mini | API | OpenAI | 97.9 | Mar 2026 |
| 4 | o3 | API | OpenAI | 97.8 | Mar 2026 |
| 5 | o4-mini | API | OpenAI | 97.5 | Mar 2026 |
| 6 | DeepSeek-R1 | Open Source | DeepSeek | 97.3 | Mar 2026 |
| 7 | Gemini 2.5 Pro | API | Google | 97.3 | Mar 2026 |
| 8 | o1 | API | OpenAI | 96.4 | Mar 2026 |
| 9 | Kimi k1.5 | API | Moonshot AI | 96.2 | Mar 2026 |
| 10 | Claude 3.7 Sonnet | API | Anthropic | 96.2 | Mar 2026 |
| 11 | DeepSeek-R1-Zero | Open Source | DeepSeek | 95.9 | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | Open Source | DeepSeek | 94.5 | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | Open Source | DeepSeek | 94.3 | Mar 2026 |
| 14 | DeepSeek-v3-0324 | Open Source | DeepSeek | 94 | Mar 2026 |
| 15 | Claude Opus 4.5 | API | Anthropic | 90.7 | Mar 2026 |
| 16 | QwQ-32B | Open Source | Alibaba/Qwen | 90.6 | Mar 2026 |
| 17 | DeepSeek-V3 | Open Source | DeepSeek | 90.2 | Mar 2026 |
| 18 | o1-mini | API | OpenAI | 90 | Mar 2026 |
| 19 | Llama-4-Maverick | Open Source | Meta | 89.4 | Mar 2026 |
| 20 | Claude Opus 4 | API | Anthropic | 89.2 | Mar 2026 |
| 21 | Claude Sonnet 4 | API | Anthropic | 88.9 | Mar 2026 |
| 22 | GPT-4.5 Preview | API | OpenAI | 87.1 | Mar 2026 |
| 23 | o1-preview | | OpenAI | 85.5 | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Open Source | Alibaba | 83.1 | Mar 2026 |
| 25 | GPT-4.1 | API | OpenAI | 82.1 | Mar 2026 |
| 26 | GPT-4o | API | OpenAI | 76.6 | Mar 2026 |
| 27 | Grok 2 | API | xAI | 76.1 | Mar 2026 |
| 28 | Llama 3.1 405B | Open Source | Meta | 73.8 | Mar 2026 |
| 29 | GPT-4 Turbo | API | OpenAI | 73.4 | Mar 2026 |
| 30 | Claude 3.5 Sonnet | API | Anthropic | 71.1 | Mar 2026 |
| 31 | GPT-4o mini | | OpenAI | 70.2 | Mar 2026 |
| 32 | Llama 3.1 70B | Open Source | Meta | 68 | Mar 2026 |
| 33 | Gemini 1.5 Pro | API | Google | 67.7 | Mar 2026 |
| 34 | Claude 3 Opus | API | Anthropic | 60.1 | Mar 2026 |