GSM8K Benchmark
The gold standard for evaluating multi-step mathematical reasoning in Large Language Models. It comprises 8,500 high-quality grade-school math word problems, each requiring 2-8 steps of basic arithmetic to solve.
The Reasoning Frontier
GSM8K (Grade School Math 8K) was designed to address a critical gap in AI evaluation: the ability to perform multi-step logical deduction. While previous benchmarks focused on single-turn facts or simple pattern matching, GSM8K requires models to maintain a coherent "chain of thought" across multiple arithmetic operations.
Why it matters:
1. Reveals failures in even massive models despite the simplicity of the math (middle school level).
2. Standardized the use of Chain-of-Thought (CoT) prompting as a primary evaluation method.
3. Drives innovation in verifiers: models that check the work of other models step by step.
"Natalia sold 48 clips in April and 60 clips in May. If she sold half as many clips in June as she did in April and May combined, how many clips did she sell in total across the three months?"
# Reasoning Chain (CoT)
1. April + May = 48 + 60 = 108 clips.
2. June = 108 / 2 = 54 clips.
3. Total = 108 + 54 = 162 clips.
#### 162
Evolution of Performance
Tracking the rapid saturation of GSM8K from 2021 to 2026.
- **GPT-3 (base):** Dataset launch; raw prompting baseline.
- **PaLM 540B (CoT):** Chain-of-Thought prompting introduced.
- **PaLM (Self-Consistency):** Majority voting over reasoning paths.
- **GPT-4:** Massive scale and RLHF integration.
- **o1 (OpenAI):** Internal RL for deliberate reasoning.
- **GPT-5:** Process-based verification RL.
- **ERNIE 5.0:** Benchmark saturation.
Official Leaderboard
Top performing models on the GSM8K test set (Exact Match accuracy).
| Rank | Model | Vendor | Date | Accuracy | Notes |
|---|---|---|---|---|---|
| #01 | Claude 4 | Anthropic | 2025-05 | 98.9% | Constitutional AI refinement |
| #02 | Llama 4 Behemoth 2T | Meta | 2025-04 | 98.5% | MoE Architecture |
| #03 | GPT-4.5 | OpenAI | 2025-03 | 98.2% | Iterative refinement |
| #04 | o1 (OpenAI) | OpenAI | 2024-09 | 97.8% | Test-time compute scaling |
| #05 | Claude 3.5 Sonnet | Anthropic | 2024-07 | 95.0% | Optimized CoT decoding |
| #06 | Claude 3 Opus | Anthropic | 2024-03 | 95.0% | Top-tier reasoning |
| #07 | Gemini Ultra | Google | 2024-02 | 94.4% | Multimodal foundation |
| #08 | GPT-4 (Original) | OpenAI | 2023-03 | 92.0% | RLHF integration |
| #09 | Claude 3 Haiku | Anthropic | 2024-03 | 88.9% | Efficient reasoning |
| #10 | Mixtral 8x22B | Mistral AI | 2024-04 | 88.0% | Open-weight SOTA |
Exact Match Scoring
Unlike benchmarks that use fuzzy matching, GSM8K requires the final numeric answer to be exact. This forces models to not only reason correctly but also execute arithmetic with 100% precision.
CoT Saturation
While Chain-of-Thought (CoT) provided a 50%+ boost in 2022, modern frontier models are saturating the benchmark. The focus has shifted to "Self-Consistency" and "Verifier" training to squeeze out the final 2%.
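Self-consistency itself is simple to sketch: sample several reasoning chains at nonzero temperature, extract each chain's final answer, and keep the majority answer. The sampled votes below are illustrative:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers from independently sampled
    reasoning paths; ties resolve to the first-seen answer."""
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled chains for the Natalia problem; one chain slipped.
votes = ["162", "162", "108", "162", "162"]
consensus = self_consistency(votes)  # "162"
```

The vote discards the reasoning and keeps only the answers, which is why it helps on GSM8K: arithmetic slips are largely uncorrelated across samples, while the correct path is reached many ways.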
The Meta-Reasoning Gap
New variants like MR-GSM8K show that models which score 95% on GSM8K often fail to identify errors in *other* solutions, suggesting they may still rely on pattern matching over true logic.
Dataset Variants & Extensions
| Variant | Size | Description |
|---|---|---|
| Original GSM8K | 8,500 problems | Human-authored, 2-8 steps per problem. |
| TinyGSM | 12.3M problems | Synthetic data for training small models. |
| MR-GSM8K | N/A | Meta-reasoning: finding errors in solutions. |
| GSM8K-Platinum | 8,281 problems | Cleaned version removing ambiguities. |
| GSM8K-V | N/A | Visual math problems for VLMs. |
Verifier Heatmap Analysis
Modern SOTA models use Process-based Reward Models (PRMs) to score each step of a reasoning chain. This heatmap visualizes how a verifier pinpoints the exact moment a model makes a logical slip.
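A real PRM is a trained model that emits a learned score per step; as a rough illustration of the idea only, here is a rule-based stand-in that checks each step's arithmetic (the step format and the 1.0/0.5/0.0 scores are our assumptions):

```python
import re

# Matches steps of the form "<num> <op> <num> = <num>".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b}

def score_steps(chain: list[str]) -> list[float]:
    """Give each step 1.0 if its arithmetic checks out, 0.0 if it is
    provably wrong, and 0.5 if no equation is found. (A real PRM
    outputs a learned probability, not a hand-coded rule.)"""
    scores = []
    for step in chain:
        m = STEP.search(step)
        if m is None:
            scores.append(0.5)
            continue
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        claimed = float(m.group(4))
        if op == "/" and b == 0:
            scores.append(0.0)
            continue
        scores.append(1.0 if abs(OPS[op](a, b) - claimed) < 1e-9 else 0.0)
    return scores
```

Run against the Natalia chain with a deliberately corrupted last step, this flags exactly one step, which is the per-step localization the heatmap is depicting.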
Key Research
Code & Implementation
- Official dataset and baseline code.
- Industry-standard evaluation framework.
- Meta-reasoning benchmark extension.
Related Benchmarks
| Benchmark | Focus | Key Difference |
|---|---|---|
| MATH | High School Competition Math | Significantly harder; spans algebra, geometry, number theory, and precalculus. |
| SVAMP | Robustness to Phrasing | Tests if models are fooled by irrelevant info or word order changes. |
| GSM-Symbolic | Symbolic Generalization | Replaces numbers with variables to test if models truly understand the logic. |
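GSM-Symbolic's templating idea can be illustrated with the Natalia problem from earlier: turn the concrete numbers into variables, resample them, and recompute the gold answer. The template and sampling ranges below are our own toy example, not the benchmark's actual templates:

```python
import random

TEMPLATE = ("Natalia sold {a} clips in April and {b} clips in May. If she sold "
            "half as many clips in June as she did in April and May combined, "
            "how many clips did she sell in total across the three months?")

def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw fresh numbers and recompute the gold answer for the
    resampled problem; a model that truly understands the logic
    should solve every variant."""
    a = rng.randrange(10, 100, 2)  # even values keep a + b even,
    b = rng.randrange(10, 100, 2)  # so June stays an integer
    total = (a + b) + (a + b) // 2  # April + May + June
    return TEMPLATE.format(a=a, b=b), total

# The original instance is recovered with a=48, b=60 -> answer 162.
```

A model that only pattern-matched the surface form of the training distribution will see its accuracy drop on these resampled variants, which is precisely the failure mode GSM-Symbolic measures.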
Ready to benchmark your model?
Access the official GSM8K dataset on Hugging Face or GitHub to evaluate your model's mathematical reasoning capabilities.