Mathematical Reasoning
Solving math word problems (GSM8K, MATH, Minerva).
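Mathematical reasoning tests whether a model can solve quantitative problems that take several steps, from grade-school word problems to competition mathematics. Below are the standard benchmarks used to evaluate models, along with reported state-of-the-art results.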
Benchmarks & SOTA
GSM8K
Grade School Math 8K
About 8,500 grade-school math word problems, each requiring multi-step reasoning. One of the most widely used math reasoning benchmarks.
State of the Art
o1-preview
OpenAI
97.8
accuracy
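As a rough illustration of how an accuracy figure like the one above is typically computed: GSM8K reference solutions end with a line of the form `#### <number>`, so a common approach is to extract that final number from both the reference and the model output and count exact matches. The sketch below assumes the model is prompted to use the same `####` convention; the function names are hypothetical and not taken from any particular evaluation harness.

```python
import re
from typing import Optional, Sequence


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number following the '####' marker, with thousands commas removed."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def gsm8k_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose final answer exactly matches the reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        pred_answer = extract_final_answer(pred)
        # A prediction with no parsable answer counts as wrong.
        if pred_answer is not None and pred_answer == extract_final_answer(ref):
            correct += 1
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    refs = ["48 / 2 = 24, and 48 + 24 = 72\n#### 72", "50 * 25 = 1,250\n#### 1,250"]
    preds = ["She sold 72 clips in total.\n#### 72", "The total is 1200.\n#### 1200"]
    print(gsm8k_accuracy(preds, refs))
```

Exact string matching is deliberately strict here: answers such as `72` and `72.0` would not match, so real evaluation scripts often normalize numbers before comparing.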
MATH
Mathematics Aptitude Test of Heuristics
12,500 competition mathematics problems from AMC, AIME, and other sources. Harder than GSM8K.
State of the Art
o1
OpenAI
94.8
accuracy
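Solutions in the MATH dataset mark the final answer with `\boxed{...}`, so scoring usually starts by pulling out the last boxed expression from a model's output. The sketch below shows one way to do that with brace matching, since answers such as `\frac{1}{2}` contain nested braces that a simple regular expression would truncate. The function name is hypothetical, and this covers extraction only; real graders also normalize equivalent forms before comparing.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    contents = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(contents)
        contents.append(ch)
    return None  # braces never closed


if __name__ == "__main__":
    output = r"Simplifying gives $\boxed{\frac{1}{2}}$."
    print(last_boxed(output))
```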
AIME 2024
American Invitational Mathematics Examination 2024
30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.
State of the Art
o1
OpenAI
83.3
accuracy
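With only 30 problems, each AIME question moves the score by about 3.3 percentage points, which is worth keeping in mind when comparing models. The sketch below converts a count of correct answers into a percentage. It assumes the score comes from a single graded pass over all 30 problems; scores averaged over several samples or based on majority voting are computed differently. The function name is hypothetical.

```python
def aime_accuracy(num_correct: int, num_problems: int = 30) -> float:
    """Percentage of problems answered correctly, rounded to one decimal place."""
    if not 0 <= num_correct <= num_problems:
        raise ValueError("num_correct must be between 0 and num_problems")
    return round(100.0 * num_correct / num_problems, 1)


if __name__ == "__main__":
    # Under the single-pass assumption, 25 correct answers out of 30 gives 83.3.
    print(aime_accuracy(25))
    # One additional correct answer changes the score by roughly 3.3 points.
    print(aime_accuracy(26))
```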
Related Tasks
Commonsense Reasoning
Reasoning about everyday situations (CommonsenseQA, HellaSwag).
Logical Reasoning
Solving logic puzzles and deductive problems.
Multi-step Reasoning
Complex reasoning requiring multiple inference steps (HotpotQA).
Arithmetic Reasoning
Performing arithmetic calculations and solving equations.