Reasoning · arXiv 2021 · OpenAI

GSM8K Benchmark

The gold standard for evaluating multi-step mathematical reasoning in Large Language Models, featuring 8,500 high-quality grade school math word problems that each require 2-8 steps of basic arithmetic to solve.

  • SOTA Accuracy: 99.7%
  • Dataset Size: 8.5K
  • Complexity: 2-8 Steps
  • Metric: Exact Match

The Reasoning Frontier

GSM8K (Grade School Math 8K) was designed to address a critical gap in AI evaluation: the ability to perform multi-step logical deduction. While previous benchmarks focused on single-turn facts or simple pattern matching, GSM8K requires models to maintain a coherent "chain of thought" across multiple arithmetic operations.

Why it matters:

  1. Reveals failures in even massive models despite the simplicity of the math (grade school level).
  2. Standardized the use of Chain-of-Thought (CoT) prompting as a primary evaluation method.
  3. Drives innovation in verifiers — models that check the work of other models step by step.
SAMPLE_PROBLEM_042 | DIFFICULTY: MEDIUM

"Natalia sold 48 clips in April and 60 clips in May. If she sold half as many clips in June as she did in April and May combined, how many clips did she sell in total across the three months?"

# Reasoning Chain (CoT)

1. April + May = 48 + 60 = 108 clips.

2. June = 108 / 2 = 54 clips.

3. Total = 108 + 54 = 162 clips.

#### 162
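The three steps above can be checked mechanically; a minimal Python sketch:

```python
# Verify the sample reasoning chain step by step.
april, may = 48, 60

april_may = april + may    # Step 1: April + May
assert april_may == 108

june = april_may // 2      # Step 2: half of the April + May total
assert june == 54

total = april_may + june   # Step 3: sum across all three months
print(total)  # 162
```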

Evolution of Performance

Tracking the rapid saturation of GSM8K from 2021 to 2026.

| Accuracy | Date | Model | Milestone |
|---|---|---|---|
| 8% | Nov 2021 | GPT-3 (base) | Dataset launch; raw prompting baseline. |
| 58% | Jan 2022 | PaLM 540B (CoT) | Chain-of-Thought prompting introduced. |
| 74% | Jan 2022 | PaLM (Self-Consistency) | Majority voting over reasoning paths. |
| 92% | Mar 2023 | GPT-4 | Massive scale and RLHF integration. |
| 97.8% | Sep 2024 | o1 (OpenAI) | Internal RL for deliberate reasoning. |
| 99.2% | Aug 2025 | GPT-5 | Process-based verification RL. |
| 99.7% | Mar 2026 | ERNIE 5.0 | Benchmark saturation. |

Official Leaderboard

Top performing models on the GSM8K test set (Exact Match accuracy).

| Rank | Model | Vendor | Date | Accuracy | Notes |
|---|---|---|---|---|---|
| #01 | Claude 4 | Anthropic | 2025-05 | 98.9% | Constitutional AI refinement |
| #02 | Llama 4 Behemoth 2T | Meta | 2025-04 | 98.5% | MoE architecture |
| #03 | GPT-4.5 | OpenAI | 2025-03 | 98.2% | Iterative refinement |
| #04 | o1 | OpenAI | 2024-09 | 97.8% | Test-time compute scaling |
| #05 | Claude 3.5 Sonnet | Anthropic | 2024-07 | 95% | Optimized CoT decoding |
| #06 | Claude 3 Opus | Anthropic | 2024-03 | 95% | Top-tier reasoning |
| #07 | Gemini Ultra | Google | 2024-02 | 94.4% | Multimodal foundation |
| #08 | GPT-4 (Original) | OpenAI | 2023-03 | 92% | RLHF integration |
| #09 | Claude 3 Haiku | Anthropic | 2024-03 | 88.9% | Efficient reasoning |
| #10 | Mixtral 8x22B | Mistral AI | 2024-04 | 88% | Open-weight SOTA |

Exact Match Scoring

Unlike benchmarks that use fuzzy matching, GSM8K requires the final numeric answer to be exact. This forces models to not only reason correctly but also execute arithmetic with 100% precision.

CoT Saturation

While Chain-of-Thought (CoT) provided a 50%+ boost in 2022, modern frontier models are saturating the benchmark. The focus has shifted to "Self-Consistency" and "Verifier" training to squeeze out the final 2%.
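Self-consistency samples several independent reasoning chains and keeps the majority final answer. A minimal sketch, assuming answers have already been extracted from each sampled chain:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over final answers from independently sampled
    chain-of-thought completions."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled chains; one slipped on the division step.
sampled = ["162", "162", "108", "162", "162"]
print(self_consistency(sampled))  # 162
```

The intuition is that there are many ways to reason to the right answer but wrong chains tend to disagree with each other, so the correct answer dominates the vote.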

The Meta-Reasoning Gap

New variants like MR-GSM8K show that models which score 95% on GSM8K often fail to identify errors in *other* solutions, suggesting they may still rely on pattern matching over true logic.

Dataset Variants & Extensions

| Category | Variant | Size | Description |
|---|---|---|---|
| Standard | Original GSM8K | 8,500 problems | Human-authored, 2-8 steps per problem. |
| Research | TinyGSM | 12.3M problems | Synthetic data for training small models. |
| Diagnostic | MR-GSM8K | N/A | Meta-reasoning: finding errors in solutions. |
| Refined | GSM8K-Platinum | 8,281 problems | Cleaned version removing ambiguities. |
| Multimodal | GSM8K-V | N/A | Visual math problems for VLMs. |

Verifier Heatmap Analysis

Modern SOTA models use "Process-based Reward Models" (PRMs) to score every single token in a reasoning chain. This heatmap visualizes how a verifier identifies the exact moment a model makes a logical slip.
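As an illustration of step-level verification (not an actual learned PRM, whose scores come from a trained reward model), a toy verifier can recheck each arithmetic step in a chain and flag the first one that fails:

```python
import re

def check_step(step: str) -> bool:
    """Re-evaluate a step of the form 'a <op> b = c'; steps with no
    such pattern pass by default. Illustrative only."""
    m = re.search(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)", step)
    if not m:
        return True
    a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    results = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}
    return results[op] == c

chain = [
    "April + May = 48 + 60 = 108",
    "June = 108 / 2 = 54",
    "Total = 108 + 54 = 163",  # injected slip
]
scores = [check_step(s) for s in chain]
first_error = next((i for i, ok in enumerate(scores) if not ok), None)
print(first_error)  # 2 (the third step is the "logical slip")
```

A real PRM generalizes this idea beyond arithmetic: it assigns a learned score to every step (or token), so the error surfaces even when the mistake is conceptual rather than computational.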

[Heatmap legend: Start of Solution → Error Detected (Step 4) → Final Answer]


Code & Implementation

  • openai/grade-school-math (★ 1.4k) — Official dataset and baseline code.
  • EleutherAI/lm-evaluation-harness (★ 11.6k) — Industry-standard evaluation framework.
  • JIA-Lab-research/MR-GSM8K (★ 51) — Meta-reasoning benchmark extension.

Related Benchmarks

| Benchmark | Focus | Key Difference |
|---|---|---|
| MATH | High School Competition Math | Significantly harder; requires calculus, geometry, and number theory. |
| SVAMP | Robustness to Phrasing | Tests if models are fooled by irrelevant info or word order changes. |
| GSM-Symbolic | Symbolic Generalization | Replaces numbers with variables to test if models truly understand the logic. |

Ready to benchmark your model?

Access the official GSM8K dataset on Hugging Face or GitHub to evaluate your model's mathematical reasoning capabilities.
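Each record in the official release is a `question`/`answer` pair whose answer ends with a `#### <number>` line. A minimal parsing sketch over an inline sample in that JSONL format (the field names match the official `grade-school-math` repo):

```python
import json

# One record in the GSM8K JSONL format, inlined for illustration.
sample_jsonl = json.dumps({
    "question": "Natalia sold 48 clips in April and 60 clips in May. "
                "If she sold half as many clips in June as in April and "
                "May combined, how many did she sell in total?",
    "answer": "April + May = 48 + 60 = 108 clips.\n"
              "June = 108 / 2 = 54 clips.\n"
              "Total = 108 + 54 = 162 clips.\n#### 162",
})

record = json.loads(sample_jsonl)
final = record["answer"].split("####")[-1].strip()
print(final)  # 162
```

The same split-on-`####` convention is what evaluation harnesses apply to both the reference answer and the model's completion before exact-match comparison.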