Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
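To make the "inference-time compute" idea concrete, below is a minimal sketch of one common recipe, self-consistency: sample several solutions and take a majority vote over the extracted final answers. This is an illustrative approximation, not the (unpublished) o1/o3 method, and `sample_solution` is a hypothetical stand-in for any model call that returns a chain of thought ending in a \boxed{...} answer.

```python
# Minimal self-consistency sketch: sample N solutions, majority-vote on the
# final answer. Illustrative only; `sample_solution` is a hypothetical
# placeholder for a real model API call.
import re
from collections import Counter
from typing import Callable

def extract_boxed(solution: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a model solution (flat braces only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def majority_vote_answer(
    problem: str,
    sample_solution: Callable[[str], str],  # hypothetical model call
    n_samples: int = 16,
) -> str | None:
    """Sample n_samples solutions and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_boxed(sample_solution(problem))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy stand-in model so the example runs end to end.
    fake_model = lambda problem: r"Some reasoning... so the answer is \boxed{42}."
    print(majority_vote_answer("What is 6 * 7?", fake_model))  # -> "42"
```

Verifier-based search (e.g. reranking samples with a learned reward model) follows the same outer loop, replacing the majority vote with a scoring step.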

4 datasets · 37 results · Canonical metric: accuracy

Canonical benchmark

MATH

12,500 competition mathematics problems (5,000 test) from AMC, AIME, and other sources covering algebra, geometry, number theory, and more. Harder than GSM8K. Modern evaluations typically use the MATH-500 representative subset.

Primary metric: accuracy
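Accuracy on MATH is the fraction of problems where the model's extracted final answer matches the reference. Below is a minimal sketch of that scoring step under the assumption that answers have already been extracted; production harnesses typically add symbolic equivalence checks (e.g. via sympy) on top of string normalization.

```python
# Sketch of exact-match accuracy as typically scored on MATH / MATH-500.
# Assumes final answers are already extracted; real graders usually also
# check symbolic equivalence rather than relying on string matching alone.
def normalize(answer: str) -> str:
    """Trim whitespace and a few common LaTeX wrappers before comparing."""
    answer = answer.strip().rstrip(".")
    answer = answer.replace(r"\left", "").replace(r"\right", "")
    answer = answer.replace(" ", "")
    return answer

def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of problems whose predicted answer matches the reference."""
    correct = sum(
        1 for p, r in zip(predicted, reference) if normalize(p) == normalize(r)
    )
    return correct / max(len(reference), 1)

if __name__ == "__main__":
    preds = ["42", r"\frac{3}{4}", "12"]
    refs = ["42", r"\frac{3}{4}", "13"]
    print(exact_match_accuracy(preds, refs))  # -> 0.666...
```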

Top 10

Leading models on MATH.

Rank  Model              Accuracy  Year  Source
1     o4-mini (high)     98.2      2026  paper
2     o3 (high)          98.1      2026  paper
3     o3-mini            97.9      2026  paper
4     o3                 97.8      2026  paper
5     o4-mini            97.5      2026  paper
6     DeepSeek-R1        97.3      2026  paper
7     Gemini 2.5 Pro     97.3      2026  paper
8     o1                 96.4      2026  paper
9     Claude 3.7 Sonnet  96.2      2026  paper
10    Kimi k1.5          96.2      2026  paper

All datasets

4 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.