GSM8K Benchmark
The gold standard for evaluating multi-step mathematical reasoning in Large Language Models. It comprises 8,500 high-quality grade-school math word problems, each requiring 2-8 steps of basic arithmetic to solve.
The Reasoning Frontier
GSM8K (Grade School Math 8K) was designed to address a critical gap in AI evaluation: the ability to perform multi-step logical deduction. While previous benchmarks focused on single-turn facts or simple pattern matching, GSM8K requires models to maintain a coherent "chain of thought" across multiple arithmetic operations.
Why it matters:
1. Reveals failures in even massive models despite the simplicity of the math (middle school level).
2. Standardized the use of Chain-of-Thought (CoT) prompting as a primary evaluation method.
3. Drives innovation in verifiers: models that check the work of other models step by step.
"Natalia sold 48 clips in April and 60 clips in May. If she sold half as many clips in June as she did in April and May combined, how many clips did she sell in total across the three months?"
# Reasoning Chain (CoT)
1. April + May = 48 + 60 = 108 clips.
2. June = 108 / 2 = 54 clips.
3. Total = 108 + 54 = 162 clips.
#### 162
Evolution of Performance
Tracking the rapid saturation of GSM8K from 2021 to 2026.
- **GPT-3 (base):** Dataset launch; raw prompting baseline.
- **PaLM 540B (CoT):** Chain-of-Thought prompting introduced.
- **PaLM (Self-Consistency):** Majority voting over reasoning paths.
- **GPT-4:** Massive scale and RLHF integration.
- **o1 (OpenAI):** Internal RL for deliberate reasoning.
- **GPT-5:** Process-based verification RL.
- **ERNIE 5.0:** Benchmark saturation.
Official Leaderboard
Top performing models on the GSM8K test set (Exact Match accuracy).
| Rank | Model | Vendor | Date | Accuracy | Notes |
|---|---|---|---|---|---|
| #01 | Claude 4 | Anthropic | 2025-05 | 98.9% | Constitutional AI refinement |
| #02 | Llama 4 Behemoth 2T | Meta | 2025-04 | 98.5% | MoE Architecture |
| #03 | GPT-4.5 | OpenAI | 2025-03 | 98.2% | Iterative refinement |
| #04 | o1 (OpenAI) | OpenAI | 2024-09 | 97.8% | Test-time compute scaling |
| #05 | Claude 3.5 Sonnet | Anthropic | 2024-07 | 95.0% | Optimized CoT decoding |
| #06 | Claude 3 Opus | Anthropic | 2024-03 | 95.0% | Top-tier reasoning |
| #07 | Gemini Ultra | Google | 2024-02 | 94.4% | Multimodal foundation |
| #08 | GPT-4 (Original) | OpenAI | 2023-03 | 92.0% | RLHF integration |
| #09 | Claude 3 Haiku | Anthropic | 2024-03 | 88.9% | Efficient reasoning |
| #10 | Mixtral 8x22B | Mistral AI | 2024-04 | 88.0% | Open-weight SOTA |
Exact Match Scoring
Unlike benchmarks that use fuzzy matching, GSM8K requires the final numeric answer to be exact. This forces models to not only reason correctly but also execute arithmetic with 100% precision.
CoT Saturation
While Chain-of-Thought (CoT) provided a 50%+ boost in 2022, modern frontier models are saturating the benchmark. The focus has shifted to "Self-Consistency" and "Verifier" training to squeeze out the final 2%.
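Self-consistency itself is simple to sketch: sample several reasoning chains at nonzero temperature, extract each chain's final answer, and keep the majority answer. The sampled votes below are illustrative:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over final answers from independently sampled
    reasoning paths; ties resolve to the first-seen answer."""
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled chains for the Natalia problem; one chain slipped.
votes = ["162", "162", "108", "162", "162"]
consensus = self_consistency(votes)  # "162"
```

The vote discards the reasoning and keeps only the answers, which is why it helps on GSM8K: arithmetic slips are largely uncorrelated across samples, while the correct path is reached many ways.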
The Meta-Reasoning Gap
New variants like MR-GSM8K show that models which score 95% on GSM8K often fail to identify errors in *other* solutions, suggesting they may still rely on pattern matching over true logic.
Dataset Variants & Extensions
| Variant | Size | Description |
|---|---|---|
| Original GSM8K | 8,500 problems | Human-authored, 2-8 steps per problem. |
| TinyGSM | 12.3M problems | Synthetic data for training small models. |
| MR-GSM8K | N/A | Meta-reasoning: finding errors in solutions. |
| GSM8K-Platinum | 8,281 problems | Cleaned version removing ambiguities. |
| GSM8K-V | N/A | Visual math problems for VLMs. |
Verifier Heatmap Analysis
Modern SOTA models use Process-based Reward Models (PRMs) to score each step of a reasoning chain. This heatmap visualizes how a verifier pinpoints the exact moment a model makes a logical slip.
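A real PRM is a trained model that emits a learned score per step; as a rough illustration of the idea only, here is a rule-based stand-in that checks each step's arithmetic (the step format and the 1.0/0.5/0.0 scores are our assumptions):

```python
import re

# Matches steps of the form "<num> <op> <num> = <num>".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b,
       "/": lambda a, b: a / b}

def score_steps(chain: list[str]) -> list[float]:
    """Give each step 1.0 if its arithmetic checks out, 0.0 if it is
    provably wrong, and 0.5 if no equation is found. (A real PRM
    outputs a learned probability, not a hand-coded rule.)"""
    scores = []
    for step in chain:
        m = STEP.search(step)
        if m is None:
            scores.append(0.5)
            continue
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        claimed = float(m.group(4))
        if op == "/" and b == 0:
            scores.append(0.0)
            continue
        scores.append(1.0 if abs(OPS[op](a, b) - claimed) < 1e-9 else 0.0)
    return scores
```

Run against the Natalia chain with a deliberately corrupted last step, this flags exactly one step, which is the per-step localization the heatmap is depicting.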
Key Research
Code & Implementation
- Official dataset and baseline code.
- Industry-standard evaluation framework.
- Meta-reasoning benchmark extension.
Related Benchmarks
| Benchmark | Focus | Key Difference |
|---|---|---|
| MATH | High School Competition Math | Significantly harder; spans algebra, geometry, number theory, and precalculus. |
| SVAMP | Robustness to Phrasing | Tests if models are fooled by irrelevant info or word order changes. |
| GSM-Symbolic | Symbolic Generalization | Replaces numbers with variables to test if models truly understand the logic. |
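GSM-Symbolic's templating idea can be illustrated with the Natalia problem from earlier: turn the concrete numbers into variables, resample them, and recompute the gold answer. The template and sampling ranges below are our own toy example, not the benchmark's actual templates:

```python
import random

TEMPLATE = ("Natalia sold {a} clips in April and {b} clips in May. If she sold "
            "half as many clips in June as she did in April and May combined, "
            "how many clips did she sell in total across the three months?")

def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw fresh numbers and recompute the gold answer for the
    resampled problem; a model that truly understands the logic
    should solve every variant."""
    a = rng.randrange(10, 100, 2)  # even values keep a + b even,
    b = rng.randrange(10, 100, 2)  # so June stays an integer
    total = (a + b) + (a + b) // 2  # April + May + June
    return TEMPLATE.format(a=a, b=b), total

# The original instance is recovered with a=48, b=60 -> answer 162.
```

A model that only pattern-matched the surface form of the training distribution will see its accuracy drop on these resampled variants, which is precisely the failure mode GSM-Symbolic measures.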
Ready to benchmark your model?
Access the official GSM8K dataset on Hugging Face or GitHub to evaluate your model's mathematical reasoning capabilities.