Mathematical Reasoning · 2021 · en

Mathematics Aptitude Test of Heuristics

12,500 competition mathematics problems drawn from AMC, AIME, and other sources. Harder than GSM8K, which covers grade-school word problems.

Metrics: accuracy
Current State of the Art

o1-preview (OpenAI): 94.8 accuracy

Top Models Performance Comparison

Top 5 models ranked by accuracy

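Chart: accuracy, with each bar scaled as % of best (axis 0% to 100%)

1. o1-preview: 94.8 (100.0% of best)
2. DeepSeek V3: 90.2 (95.1% of best)
3. GPT-4o: 76.6 (80.8% of best)
4. Claude 3.5 Sonnet: 71.1 (75.0% of best)
5. Gemini 1.5 Pro: 67.7 (71.4% of best)
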
Best Score: 94.8
Top Model: o1-preview
Models Compared: 5
Score Range: 27.1
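
The summary figures can be reproduced from the five listed scores. The sketch below is a minimal illustration, not the site's actual code: it assumes that "% of best" means each score divided by the top score, and that "Score Range" is the top score minus the lowest, both rounded to one decimal place. The function name `summarize` and the dictionary layout are hypothetical.

```python
# Accuracy scores copied from the leaderboard on this page (in percent).
scores = {
    "o1-preview": 94.8,
    "DeepSeek V3": 90.2,
    "GPT-4o": 76.6,
    "Claude 3.5 Sonnet": 71.1,
    "Gemini 1.5 Pro": 67.7,
}


def summarize(scores):
    """Compute leaderboard summary statistics (hypothetical helper).

    Assumes '% of best' = 100 * score / best score and
    'score range' = best score - lowest score, rounded to 1 decimal.
    """
    top_model = max(scores, key=scores.get)
    best = scores[top_model]
    lowest = min(scores.values())
    return {
        "top_model": top_model,
        "best_score": best,
        "models_compared": len(scores),
        "score_range": round(best - lowest, 1),
        "pct_of_best": {m: round(100 * s / best, 1) for m, s in scores.items()},
    }


summary = summarize(scores)
for model, pct in summary["pct_of_best"].items():
    print(f"{model}: {scores[model]} ({pct}% of best)")
print("Score range:", summary["score_range"])
```

Under these assumptions the computed values agree with the figures shown on the page, which suggests the page uses the same or a closely similar definition.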

accuracy (primary metric)

#  Model                      Organization  Score  Date
1  o1-preview                 OpenAI        94.8   Dec 2025
2  DeepSeek V3 (Open Source)  DeepSeek      90.2   Dec 2025
3  GPT-4o (API)               OpenAI        76.6   Dec 2025
4  Claude 3.5 Sonnet (API)    Anthropic     71.1   Dec 2025
5  Gemini 1.5 Pro (API)       Google        67.7   Dec 2025

MATH Benchmark - Mathematical Reasoning | CodeSOTA