Mathematical Reasoning
Solving math word problems (GSM8K, MATH, Minerva).
Mathematical reasoning is a core reasoning task: models must solve math problems that require multi-step reasoning to reach a correct final answer. The standard benchmarks used to evaluate models are listed below, each with its current state-of-the-art result.
Benchmarks & SOTA
MATH
Mathematics Aptitude Test of Heuristics
12,500 competition mathematics problems (7,500 train, 5,000 test) from AMC, AIME, and other sources, covering algebra, geometry, number theory, and more. The problems are harder than those in GSM8K. Modern evaluations typically use MATH-500, a representative 500-problem subset of the test set.
State of the Art: o3-mini (OpenAI), 97.9% accuracy
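Accuracy on MATH is usually computed by comparing a model's final answer with the reference answer, which the dataset's solutions wrap in \boxed{...}. The following is a minimal sketch of that scoring step, not an official grader: the function names are hypothetical, and the comparison is a plain string match after removing spaces, whereas real evaluation harnesses normalize LaTeX much more aggressively.

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in text, or None.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def math_accuracy(outputs: Sequence[str], answers: Sequence[str]) -> float:
    # Percentage of outputs whose boxed answer matches the reference.
    # Assumption: exact match after stripping spaces (illustrative only).
    correct = 0
    for output, answer in zip(outputs, answers):
        predicted = extract_boxed(output)
        if predicted is not None and predicted.replace(" ", "") == answer.replace(" ", ""):
            correct += 1
    return 100.0 * correct / len(answers)
```

Matching braces by depth, rather than with a regular expression, keeps nested LaTeX such as \frac{3}{4} intact inside the box.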
GSM8K
Grade School Math 8K
8,500 grade school math word problems requiring multi-step reasoning. One of the most widely used math reasoning benchmarks.
State of the Art: o1-preview (OpenAI), 97.8% accuracy
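GSM8K reference solutions end with a line of the form "#### 18", so accuracy is typically an exact match on that final number. The sketch below illustrates the idea under stated assumptions: the function names are hypothetical, and taking the last number in the model output is a common heuristic rather than a fixed standard.

```python
import re
from typing import Optional, Sequence


def reference_answer(solution: str) -> str:
    # Final answer from a GSM8K reference solution: the text after '####'.
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(output: str) -> Optional[str]:
    # Heuristic (an assumption): treat the last number in the output as the answer.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return numbers[-1].replace(",", "") if numbers else None


def gsm8k_accuracy(outputs: Sequence[str], solutions: Sequence[str]) -> float:
    # Percentage of outputs whose final number equals the reference answer.
    correct = 0
    for output, solution in zip(outputs, solutions):
        predicted = predicted_answer(output)
        if predicted is not None and float(predicted) == float(reference_answer(solution)):
            correct += 1
    return 100.0 * correct / len(solutions)
```

Comparing the values as floats means "18" and "18.0" count as the same answer, and thousands separators such as "1,250" are removed before comparison.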
AIME 2024
American Invitational Mathematics Examination 2024
30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.
State of the Art: o1-preview (OpenAI), 83.3% accuracy
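Because the benchmark has only 30 problems, each problem is worth about 3.3 accuracy points, and every AIME answer is an integer from 000 to 999. The short sketch below makes that arithmetic explicit; the function names are hypothetical, and it assumes the score is a simple fraction of problems solved in a single run (scores averaged over several samples need not correspond to a whole number of problems).

```python
AIME_2024_PROBLEMS = 30  # AIME I and AIME II, 15 problems each


def aime_accuracy(num_correct: int, total: int = AIME_2024_PROBLEMS) -> float:
    # Accuracy in percent, rounded to one decimal place.
    if not 0 <= num_correct <= total:
        raise ValueError("num_correct must be between 0 and total")
    return round(100 * num_correct / total, 1)


def is_valid_aime_answer(answer: str) -> bool:
    # AIME answers are integers from 000 to 999, so at most three digits.
    return 1 <= len(answer) <= 3 and answer.isascii() and answer.isdigit()


print(aime_accuracy(25))  # 83.3
print(aime_accuracy(1))   # 3.3
```

Under that single-run assumption, a score of 83.3 is consistent with 25 of 30 problems answered correctly.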
Related Tasks
Commonsense Reasoning
Reasoning about everyday situations (CommonsenseQA, HellaSwag).
Logical Reasoning
Solving logic puzzles and deductive problems.
Multi-step Reasoning
Complex reasoning requiring multiple inference steps (HotpotQA).
Arithmetic Reasoning
Performing arithmetic calculations and solving equations.