Mathematical Reasoning · 2021 · en

Mathematics Aptitude Test of Heuristics

12,500 competition mathematics problems (7,500 training, 5,000 test) drawn from AMC, AIME, and other sources, covering algebra, geometry, number theory, and more. The problems are harder than the grade-school word problems in GSM8K. Modern evaluations typically report results on MATH-500, a representative 500-problem subset.

Metrics: accuracy
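As a rough illustration of the metric, accuracy here is the percentage of problems whose final answer matches the reference answer. The sketch below assumes simple string comparison after light cleanup. The `normalize` helper is hypothetical and far cruder than the answer-equivalence checks real evaluation harnesses apply, so treat it as a sketch rather than the leaderboard's actual grader.

```python
def normalize(answer: str) -> str:
    """Hypothetical cleanup: trim whitespace, a trailing period,
    surrounding dollar signs, and internal spaces."""
    cleaned = answer.strip().rstrip(".").strip("$")
    return cleaned.replace(" ", "")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    preds = ["1/2", " 3 ", "7"]
    refs = ["1/2", "3", "8"]
    print(f"{accuracy(preds, refs):.1f}")  # two of three answers match
```

On a 500-problem subset each problem is worth 0.2 points, so a single evaluation pass scoring 98.2 would correspond to 491 correct answers. Scores averaged over several samples per problem need not land on a multiple of 0.2.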
Current State of the Art

o4-mini (high), OpenAI: 98.2 accuracy

MATH — accuracy

34 results · 1 SOTA advance · higher is better

[Chart: accuracy (y-axis, 60 to 100) by date, showing all results and the SOTA frontier; labeled point: o4-mini (high)]

accuracy Progress Over Time

Showing 3 breakthroughs from Dec 2024 to Mar 2026

[Chart: accuracy (y-axis, 89.4 to 99.0) by date (x-axis, Dec 2024 to Mar 2026)]

Key Milestones

| Date | Model | Score | Change | Notes |
|---|---|---|---|---|
| Dec 2024 | DeepSeek-V3 | 90.2 | | MATH-500. Non-reasoning base model. From the DeepSeek-V3 technical report (Dec 2024). |
| Jan 2025 | DeepSeek-R1 | 97.3 | +7.9% | MATH-500, pass@1. From the official DeepSeek-R1 paper (Jan 2025). |
| Mar 2026 | o4-mini (high) (current SOTA) | 98.2 | +0.9% | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |

Total Improvement: 8.9%
Time Span: 1y 4m
Breakthroughs: 3
Current SOTA: 98.2
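The percentage changes listed for the milestones appear to be relative gains over the previous milestone's score, not differences in percentage points. That reading is an inference from the numbers, not something the page states. A minimal sketch that reproduces the figures under that assumption (the function name `relative_gain` is illustrative):

```python
# Milestone scores as listed on the page: (date, model, accuracy).
milestones = [
    ("Dec 2024", "DeepSeek-V3", 90.2),
    ("Jan 2025", "DeepSeek-R1", 97.3),
    ("Mar 2026", "o4-mini (high)", 98.2),
]


def relative_gain(previous: float, current: float) -> float:
    """Gain of `current` over `previous`, as a percentage of `previous`."""
    return 100.0 * (current - previous) / previous


if __name__ == "__main__":
    # Compare each milestone with the one before it.
    for (_, _, prev), (date, model, curr) in zip(milestones, milestones[1:]):
        points = curr - prev
        print(f"{date} {model}: +{points:.1f} points, "
              f"+{relative_gain(prev, curr):.1f}% relative")

    total = relative_gain(milestones[0][2], milestones[-1][2])
    print(f"Total: +{total:.1f}% relative")
```

Under this reading, the step from 90.2 to 97.3 is 7.1 percentage points but a 7.9% relative gain, and the total of 8.9% corresponds to an 8.0-point rise from 90.2 to 98.2.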

Top Models Performance Comparison

Top 10 models ranked by accuracy

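| Rank | Model | Accuracy | % of best |
|---|---|---|---|
| 1 | o4-mini (high) | 98.2 | 100.0% |
| 2 | o3 (high) | 98.1 | 99.9% |
| 3 | o3-mini | 97.9 | 99.7% |
| 4 | o3 | 97.8 | 99.6% |
| 5 | o4-mini | 97.5 | 99.3% |
| 6 | DeepSeek-R1 | 97.3 | 99.1% |
| 7 | Gemini 2.5 Pro | 97.3 | 99.1% |
| 8 | o1 | 96.4 | 98.2% |
| 9 | Kimi k1.5 | 96.2 | 98.0% |
| 10 | Claude 3.7 Sonnet | 96.2 | 98.0% |

Best Score: 98.2
Top Model: o4-mini (high)
Models Compared: 10
Score Range: 2.0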
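The "% of best" and "Score Range" figures for the top 10 can be recomputed from the listed scores. The sketch below assumes "% of best" means each score divided by the top score, and "Score Range" means the top score minus the lowest of the ten. Both definitions are inferred from the numbers rather than documented on the page.

```python
# Top-10 accuracy scores as listed on the page.
top10 = {
    "o4-mini (high)": 98.2,
    "o3 (high)": 98.1,
    "o3-mini": 97.9,
    "o3": 97.8,
    "o4-mini": 97.5,
    "DeepSeek-R1": 97.3,
    "Gemini 2.5 Pro": 97.3,
    "o1": 96.4,
    "Kimi k1.5": 96.2,
    "Claude 3.7 Sonnet": 96.2,
}

best = max(top10.values())

# Each model's score as a percentage of the best score, to one decimal.
percent_of_best = {
    model: round(100.0 * score / best, 1) for model, score in top10.items()
}

# Spread between the best and the lowest score among the ten.
score_range = round(best - min(top10.values()), 1)

if __name__ == "__main__":
    for model, pct in percent_of_best.items():
        print(f"{model}: {top10[model]} ({pct}% of best)")
    print(f"Score range: {score_range}")
```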

accuracy (primary metric)

| # | Model | Access | Organization | Score | Date |
|---|---|---|---|---|---|
| 1 | o4-mini (high) | API | OpenAI | 98.2 | Mar 2026 |
| 2 | o3 (high) | API | OpenAI | 98.1 | Mar 2026 |
| 3 | o3-mini | API | OpenAI | 97.9 | Mar 2026 |
| 4 | o3 | API | OpenAI | 97.8 | Mar 2026 |
| 5 | o4-mini | API | OpenAI | 97.5 | Mar 2026 |
| 6 | DeepSeek-R1 | Open Source | DeepSeek | 97.3 | Mar 2026 |
| 7 | Gemini 2.5 Pro | API | Google | 97.3 | Mar 2026 |
| 8 | o1 | API | OpenAI | 96.4 | Mar 2026 |
| 9 | Kimi k1.5 | API | Moonshot AI | 96.2 | Mar 2026 |
| 10 | Claude 3.7 Sonnet | API | Anthropic | 96.2 | Mar 2026 |
| 11 | DeepSeek-R1-Zero | Open Source | DeepSeek | 95.9 | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | Open Source | DeepSeek | 94.5 | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | Open Source | DeepSeek | 94.3 | Mar 2026 |
| 14 | DeepSeek-v3-0324 | Open Source | DeepSeek | 94 | Mar 2026 |
| 15 | Claude Opus 4.5 | API | Anthropic | 90.7 | Mar 2026 |
| 16 | QwQ-32B | Open Source | Alibaba/Qwen | 90.6 | Mar 2026 |
| 17 | DeepSeek-V3 | Open Source | DeepSeek | 90.2 | Mar 2026 |
| 18 | o1-mini | API | OpenAI | 90 | Mar 2026 |
| 19 | Llama-4-Maverick | Open Source | Meta | 89.4 | Mar 2026 |
| 20 | Claude Opus 4 | API | Anthropic | 89.2 | Mar 2026 |
| 21 | Claude Sonnet 4 | API | Anthropic | 88.9 | Mar 2026 |
| 22 | GPT-4.5 Preview | API | OpenAI | 87.1 | Mar 2026 |
| 23 | o1-preview | | OpenAI | 85.5 | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Open Source | Alibaba | 83.1 | Mar 2026 |
| 25 | GPT-4.1 | API | OpenAI | 82.1 | Mar 2026 |
| 26 | GPT-4o | API | OpenAI | 76.6 | Mar 2026 |
| 27 | Grok 2 | API | xAI | 76.1 | Mar 2026 |
| 28 | Llama 3.1 405B | Open Source | Meta | 73.8 | Mar 2026 |
| 29 | GPT-4 Turbo | API | OpenAI | 73.4 | Mar 2026 |
| 30 | Claude 3.5 Sonnet | API | Anthropic | 71.1 | Mar 2026 |
| 31 | GPT-4o mini | | OpenAI | 70.2 | Mar 2026 |
| 32 | Llama 3.1 70B | Open Source | Meta | 68 | Mar 2026 |
| 33 | Gemini 1.5 Pro | API | Google | 67.7 | Mar 2026 |
| 34 | Claude 3 Opus | API | Anthropic | 60.1 | Mar 2026 |

MATH Benchmark - Mathematical Reasoning | CodeSOTA