Mathematical Reasoning · 2021 · en
MATH (Mathematics Aptitude Test of Heuristics)
12,500 competition mathematics problems (7,500 training, 5,000 test) drawn from AMC, AIME, and other sources, covering algebra, geometry, number theory, and more. Harder than GSM8K. Modern evaluations typically report results on MATH-500, a representative 500-problem subset of the test set.
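The scores on this page are accuracies, i.e. the percentage of problems whose final answer is judged correct. The page does not describe its grading procedure, so the following is only a minimal sketch under stated assumptions: final answers are marked with `\boxed{...}` (as in the MATH reference solutions), and two answers count as equal only if their strings match exactly. The function names `extract_boxed` and `accuracy` are hypothetical, and real evaluation harnesses usually also normalize equivalent forms such as `0.5` and `\frac{1}{2}`.

```python
from typing import Optional, Sequence


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose boxed answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = 0
    for pred, ref in zip(predictions, references):
        answer = extract_boxed(pred)
        # Assumption: exact string match, no mathematical normalization.
        if answer is not None and answer == extract_boxed(ref):
            correct += 1
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = ["so the result is \\boxed{4}", "\\boxed{\\frac{1}{2}}", "no final answer"]
    refs = ["\\boxed{4}", "\\boxed{\\frac{1}{2}}", "\\boxed{7}"]
    print(f"{accuracy(preds, refs):.1f}")
```

Exact matching is the strictest possible choice and will undercount correct answers written in a different but equivalent form, which is why it is labeled an assumption here.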
Current State of the Art: o3-mini (OpenAI), 97.9 accuracy
Accuracy Progress Over Time
Showing 2 breakthroughs from Jan 2025 to Mar 2026
Key Milestones
Total Improvement: 0.6%
Time Span: 1y 2m
Breakthroughs: 2
Current SOTA: 97.9
Top Models Performance Comparison
Top 10 models ranked by accuracy
Best Score: 97.9
Top Model: o3-mini
Models Compared: 10
Score Range: 12.4
Accuracy (primary metric)
| # | Model | Access | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o3-mini | API | OpenAI | 97.9 | | Mar 2026 |
| 2 | o3 | API | OpenAI | 97.8 | | Mar 2026 |
| 3 | o4-mini | API | OpenAI | 97.5 | | Mar 2026 |
| 4 | DeepSeek-R1 | Open Source | DeepSeek | 97.3 | | Mar 2026 |
| 5 | o1 | API | OpenAI | 96.4 | | Mar 2026 |
| 6 | Claude 3.7 Sonnet | API | Anthropic | 96.2 | | Mar 2026 |
| 7 | DeepSeek V3 | Open Source | DeepSeek | 90.2 | | Mar 2026 |
| 8 | o1-mini | API | OpenAI | 90 | | Mar 2026 |
| 9 | GPT-4.5 Preview | API | OpenAI | 87.1 | | Mar 2026 |
| 10 | o1-preview | | OpenAI | 85.5 | | Mar 2026 |
| 11 | GPT-4.1 | API | OpenAI | 82.1 | | Mar 2026 |
| 12 | GPT-4o | API | OpenAI | 76.6 | | Mar 2026 |
| 13 | Grok 2 | API | xAI | 76.1 | | Mar 2026 |
| 14 | Llama 3.1 405B | Open Source | Meta | 73.8 | | Mar 2026 |
| 15 | GPT-4 Turbo | API | OpenAI | 73.4 | | Mar 2026 |
| 16 | Claude 3.5 Sonnet | API | Anthropic | 71.1 | | Mar 2026 |
| 17 | GPT-4o Mini | | OpenAI | 70.2 | | Mar 2026 |
| 18 | Llama 3.1 70B | Open Source | Meta | 68 | | Mar 2026 |
| 19 | Gemini 1.5 Pro | API | Google | 67.7 | | Mar 2026 |
| 20 | Claude 3 Opus | API | Anthropic | 60.1 | | Mar 2026 |
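The comparison figures reported earlier on the page (Best Score 97.9, Top Model o3-mini, Models Compared 10, Score Range 12.4) appear to be derived from the ten highest-scoring rows of this table. The sketch below reproduces that arithmetic. It assumes the score range is the best score minus the tenth-best score; the page does not state this, and the helper name `summarize` is hypothetical.

```python
# (model, accuracy) pairs copied from the table above.
ROWS = [
    ("o3-mini", 97.9), ("o3", 97.8), ("o4-mini", 97.5), ("DeepSeek-R1", 97.3),
    ("o1", 96.4), ("Claude 3.7 Sonnet", 96.2), ("DeepSeek V3", 90.2),
    ("o1-mini", 90.0), ("GPT-4.5 Preview", 87.1), ("o1-preview", 85.5),
    ("GPT-4.1", 82.1), ("GPT-4o", 76.6), ("Grok 2", 76.1),
    ("Llama 3.1 405B", 73.8), ("GPT-4 Turbo", 73.4),
    ("Claude 3.5 Sonnet", 71.1), ("GPT-4o Mini", 70.2),
    ("Llama 3.1 70B", 68.0), ("Gemini 1.5 Pro", 67.7), ("Claude 3 Opus", 60.1),
]


def summarize(rows, k=10):
    """Summarize the k highest-scoring models.

    Assumption: "Score Range" means best score minus k-th best score.
    """
    top = sorted(rows, key=lambda row: row[1], reverse=True)[:k]
    best_model, best_score = top[0]
    return {
        "top_model": best_model,
        "best_score": best_score,
        "models_compared": len(top),
        # Round to one decimal to avoid floating-point noise in the difference.
        "score_range": round(best_score - top[-1][1], 1),
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

With `k=10` the result matches the page's summary panel; passing `k=len(ROWS)` would instead give the spread across all 20 listed models.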