Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
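To make the "inference-time compute" idea concrete, below is a minimal sketch of one common recipe, self-consistency: sample several solutions and take a majority vote over the extracted final answers. This is an illustrative approximation, not the (unpublished) o1/o3 method, and `sample_solution` is a hypothetical stand-in for any model call that returns a chain of thought ending in a \boxed{...} answer.

```python
# Minimal self-consistency sketch: sample N solutions, majority-vote on the
# final answer. Illustrative only; `sample_solution` is a hypothetical
# placeholder for a real model API call.
import re
from collections import Counter
from typing import Callable

def extract_boxed(solution: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a model solution (flat braces only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def majority_vote_answer(
    problem: str,
    sample_solution: Callable[[str], str],  # hypothetical model call
    n_samples: int = 16,
) -> str | None:
    """Sample n_samples solutions and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_boxed(sample_solution(problem))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy stand-in model so the example runs end to end.
    fake_model = lambda problem: r"Some reasoning... so the answer is \boxed{42}."
    print(majority_vote_answer("What is 6 * 7?", fake_model))  # -> "42"
```

Verifier-based search (e.g. reranking samples with a learned reward model) follows the same outer loop, replacing the majority vote with a scoring step.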

4 datasets · 37 results · Canonical metric: accuracy

Canonical benchmark

MATH

12,500 competition mathematics problems (5,000 test) from AMC, AIME, and other sources covering algebra, geometry, number theory, and more. Harder than GSM8K. Modern evaluations typically use the MATH-500 representative subset.

Primary metric: accuracy
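Accuracy on MATH is the fraction of problems where the model's extracted final answer matches the reference. Below is a minimal sketch of that scoring step under the assumption that answers have already been extracted; production harnesses typically add symbolic equivalence checks (e.g. via sympy) on top of string normalization.

```python
# Sketch of exact-match accuracy as typically scored on MATH / MATH-500.
# Assumes final answers are already extracted; real graders usually also
# check symbolic equivalence rather than relying on string matching alone.
def normalize(answer: str) -> str:
    """Trim whitespace and a few common LaTeX wrappers before comparing."""
    answer = answer.strip().rstrip(".")
    answer = answer.replace(r"\left", "").replace(r"\right", "")
    answer = answer.replace(" ", "")
    return answer

def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Fraction of problems whose predicted answer matches the reference."""
    correct = sum(
        1 for p, r in zip(predicted, reference) if normalize(p) == normalize(r)
    )
    return correct / max(len(reference), 1)

if __name__ == "__main__":
    preds = ["42", r"\frac{3}{4}", "12"]
    refs = ["42", r"\frac{3}{4}", "13"]
    print(exact_match_accuracy(preds, refs))  # -> 0.666...
```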

Top 10

Leading models on MATH.

Rank  Model              Accuracy  Year  Source
1     o4-mini (high)     98.2      2026  paper
2     o3 (high)          98.1      2026  paper
3     o3-mini            97.9      2026  paper
4     o3                 97.8      2026  paper
5     o4-mini            97.5      2026  paper
6     DeepSeek-R1        97.3      2026  paper
7     Gemini 2.5 Pro     97.3      2026  paper
8     o1                 96.4      2026  paper
9     Claude 3.7 Sonnet  96.2      2026  paper
10    Kimi k1.5          96.2      2026  paper

All datasets

4 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.