Reasoning

Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. Scores on the MATH benchmark rose from ~50% (GPT-4, early 2023) to over 90% (o1, late 2024) in under two years, but research-level problems (FrontierMath), Putnam-style competitions, and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.


Mathematical reasoning encompasses everything from grade-school word problems to competition-level olympiad questions and formal theorem proving. The MATH benchmark remains the key discriminator, with frontier models reaching 90%+ through chain-of-thought and reinforcement learning on reasoning traces.

History

2018

Neural theorem provers first achieve non-trivial results on Metamath

2021

MATH benchmark released — 12.5K competition-level problems across 7 categories

2021

GSM8K released as a grade-school math benchmark (now largely saturated)

2022

Minerva (PaLM 540B) scores 50.3% on MATH through math-specific pretraining

2022

Chain-of-thought prompting enables step-by-step mathematical reasoning in LLMs

2024

AlphaGeometry (DeepMind) solves olympiad-level geometry problems, trained entirely on synthetic data rather than human demonstrations

2024

AlphaProof and AlphaGeometry 2 (DeepMind) together solve 4 of 6 IMO 2024 problems, with solutions expressed as Lean formal proofs

2024

DeepSeek-Math-7B reaches 51.7% on MATH through GRPO reinforcement learning

2024

FrontierMath benchmark introduced (Epoch AI) — research-level problems on which frontier models initially score <2%

2025

OpenAI o3 scores 96.7% on AIME 2024; Claude 3.7 Sonnet with extended thinking scores above 90% on MATH

How Mathematical Reasoning Works

Mathematical Reasoning Pipeline
1

Problem Understanding

Parse the mathematical problem, identifying given information, constraints, and what needs to be proven or computed.

2

Strategy Selection

Choose an approach — algebraic manipulation, geometric construction, induction, contradiction, or computational search.

3

Step-by-Step Solution

Execute the chosen strategy with rigorous intermediate steps, each following from the previous by a valid mathematical operation.

4

Verification

Check the solution by substitution, dimensional analysis, or alternative solution paths. Advanced systems use formal proof verification (Lean, Isabelle).

5

Answer Formatting

Extract the final numerical, symbolic, or proof-based answer in the required format.
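The five steps above can be sketched on a toy problem. This is a minimal illustration, not any production system's API — the `understand`/`solve_algebraic`/`verify` decomposition and the linear-equation format are assumptions chosen to keep the example self-contained:

```python
import re
from fractions import Fraction

def understand(problem: str):
    """Step 1: parse 'ax + b = c' into coefficients (the given information)."""
    m = re.fullmatch(r"(-?\d+)x\s*\+\s*(-?\d+)\s*=\s*(-?\d+)", problem)
    a, b, c = (Fraction(g) for g in m.groups())
    return a, b, c

def solve_algebraic(a, b, c):
    """Steps 2-3: select algebraic manipulation and execute it: x = (c - b) / a."""
    return (c - b) / a

def verify(a, b, c, x):
    """Step 4: check the candidate by substitution into the original equation."""
    return a * x + b == c

def answer(problem: str) -> str:
    """Step 5: run the pipeline and format the final answer."""
    a, b, c = understand(problem)
    x = solve_algebraic(a, b, c)
    assert verify(a, b, c, x)
    return f"x = {x}"

print(answer("3x + 4 = 19"))  # x = 5
```

Real systems replace each hand-written stage with a learned component, but the same parse → strategize → execute → verify → format loop applies.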

Current Landscape

Mathematical reasoning is the crown jewel of LLM evaluation in 2025. Standard benchmarks (GSM8K, MATH) are nearly saturated by frontier models using chain-of-thought and reinforcement learning. The field has split into informal reasoning (getting the right answer with plausible steps) and formal reasoning (producing machine-verifiable proofs). DeepMind's AlphaProof demonstrated the formal approach can solve olympiad problems, while FrontierMath showed that research-level math remains far beyond current capabilities.

Key Challenges

Competition-level problems require creative insight and non-obvious strategy selection, not just execution

Formal proof generation requires translating intuitive reasoning into machine-verifiable steps — a fundamentally different skill

Error propagation — a single wrong step in a 15-step proof invalidates everything downstream

FrontierMath results show models still cannot tackle research-level mathematics

Evaluation is hard — numerical answers can be checked automatically, but proof correctness requires formal verification

Quick Recommendations

Production math assistance

OpenAI o3 / Claude 3.7 Sonnet with extended thinking

90%+ on MATH benchmark, excellent step-by-step reasoning

Cost-efficient math

DeepSeek-Math-7B / Qwen2.5-Math-72B

Best open-source math reasoning per parameter

Formal theorem proving

AlphaProof / LeanDojo

Only approaches that produce machine-verified proofs

Education

GPT-4o with code interpreter

Can show work, visualize problems, and verify numerical answers via computation

What's Next

Two converging frontiers: (1) scaling formal proof generation to work on increasingly complex theorems, with LLMs guiding proof search in Lean/Isabelle, and (2) extending informal reasoning to research-level problems through better search, longer reasoning chains, and mathematical tool use. Expect math reasoning to remain the primary benchmark for measuring general intelligence progress.

Related Tasks

Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
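The tool-augmented approach can be sketched with a tiny whitelisted expression evaluator — a hypothetical calculator "tool" the model would call instead of computing digit-by-digit in its own tokens (the function names are illustrative, not any framework's API):

```python
import ast
import operator

# Whitelisted operations for a minimal, safe calculator tool.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str):
    """Evaluate an arithmetic expression exactly, offloading the
    computation the model would otherwise do unreliably in-token."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calculator("48271 * 3917"))  # exact multi-digit product, no rounding
```

Parsing with `ast` rather than calling `eval` keeps the tool restricted to arithmetic, which is why frameworks expose calculators this way rather than raw code execution.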

Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
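The neurosymbolic idea — deciding validity by exhaustive symbolic checking rather than pattern-matching — can be illustrated with a truth-table entailment checker, a brute-force stand-in for the SAT-solver call (the `entails` helper is hypothetical, written for this sketch):

```python
from itertools import product

def entails(premises, conclusion, variables) -> bool:
    """Check premises ⊨ conclusion by enumerating every truth assignment —
    a brute-force stand-in for compiling the query to a SAT solver."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample assignment
    return True

# Modus ponens: (p → q), p ⊨ q — valid
valid = entails(
    [lambda e: (not e["p"]) or e["q"], lambda e: e["p"]],
    lambda e: e["q"], ["p", "q"],
)
# Affirming the consequent: (p → q), q ⊨ p — invalid
invalid = entails(
    [lambda e: (not e["p"]) or e["q"], lambda e: e["q"]],
    lambda e: e["p"], ["p", "q"],
)
print(valid, invalid)  # True False
```

Unlike an LLM's sampled answer, this check is exhaustive over the 2^n assignments, which is exactly the reliability gap the related-task blurb describes.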

Multi-step Reasoning

Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
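The error-accumulation arithmetic above follows from a simple independence assumption — if each step is correct with probability p, a chain of n steps is correct with probability p^n:

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability an entire reasoning chain is correct,
    assuming errors at each step are independent."""
    return per_step ** steps

for n in (1, 5, 10, 15):
    print(f"{n:2d} steps: {chain_accuracy(0.95, n):.1%}")
```

At 95% per-step accuracy, 10 steps yields about 60% and 15 steps under 50% — which is why verification and backtracking, not just better single steps, dominate multi-step reasoning research.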

Commonsense Reasoning

Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated the early benchmarks (models reached HellaSwag's ~95% human-performance ceiling by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
