8,500 grade-school math word problems requiring 2-8 arithmetic steps. Chain-of-thought prompting unlocks near-perfect accuracy in frontier models. Accuracy measured on the 1,319-problem test set.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | ERNIE 5.0 | Baidu | 99.7% | Apr 2026 |
| 2 | GPT-5 | OpenAI | 99.2% | Apr 2026 |
| 3 | Gemini 2.5 Pro | Google DeepMind | 99% | Mar 2026 |
| 4 | o4-mini | OpenAI | 99% | Mar 2026 |
| 5 | o3 | OpenAI | 99% | Mar 2026 |
| 6 | Claude 4 | Anthropic | 98.9% | Apr 2026 |
| 7 | Llama-4-Maverick | Meta | 98.7% | Mar 2026 |
| 8 | Claude Opus 4.5 | Anthropic | 98.6% | Mar 2026 |
| 9 | Llama 4 Behemoth 2T | Meta | 98.5% | Apr 2026 |
| 10 | GPT-4.5 | OpenAI | 98.2% | Apr 2026 |
| 11 | Claude Opus 4 | Anthropic | 98% | Mar 2026 |
| 12 | o1 | OpenAI | 97.8% | Apr 2026 |
| 13 | o1-preview | OpenAI | 97.8% | Dec 2025 |
| 14 | Claude Sonnet 4 | Anthropic | 97.8% | Mar 2026 |
| 15 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 16 | Claude 3.5 Sonnet | Anthropic | 96.4% | Dec 2025 |
| 17 | Qwen2.5-72B-Instruct | Alibaba | 95.8% | Mar 2026 |
| 18 | DeepSeek-V3 | DeepSeek | 95.8% | Mar 2026 |
| 19 | Claude 3.5 Sonnet | Anthropic | 95% | Apr 2026 |
| 20 | Claude 3 Opus | Anthropic | 95% | Apr 2026 |
| 21 | Gemini Ultra | Google DeepMind | 94.4% | Apr 2026 |
| 22 | Llama 3 70B | Meta | 93% | Dec 2025 |
| 23 | GPT-4 | OpenAI | 92% | Apr 2026 |
| 24 | GPT-4o | OpenAI | 92% | Dec 2025 |
| 25 | Gemini 1.5 Pro | Google DeepMind | 91.7% | Dec 2025 |
| 26 | Claude 3 Haiku | Anthropic | 88.9% | Apr 2026 |
| 27 | Mixtral-8x22b | Mistral | 88% | Apr 2026 |
| 28 | PaLM 540B (Self-Consistency) | Google | 74% | Apr 2026 |
| 29 | PaLM 540B (CoT) | Google | 58% | Apr 2026 |
| 30 | GPT-3 (base) | OpenAI | 8% | Apr 2026 |
Source: openai/grade-school-math · Chain-of-thought, maj@1.
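For reference, a minimal sketch of how a chain-of-thought, maj@1 score is typically computed on this split: reference solutions end with a `#### <number>` line, and a single sampled completion is marked correct if its final number matches. The Hugging Face dataset ID (`gsm8k`, config `main`) and the `generate` callable are assumptions about tooling, not part of any official harness.

```python
# A sketch of chain-of-thought GSM8K scoring at maj@1 (one sampled completion per problem).
# Assumptions: the Hugging Face hub copy of the dataset (ID "gsm8k", config "main") and a
# user-supplied `generate(question)` that returns the model's chain-of-thought text.
import re
from datasets import load_dataset

def reference_answer(solution: str) -> str:
    """GSM8K reference solutions end with a line of the form '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(completion: str) -> str | None:
    """Take the last number in the completion as the model's final answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", completion)
    return numbers[-1].replace(",", "").rstrip(".") if numbers else None

def gsm8k_accuracy(generate) -> float:
    test = load_dataset("gsm8k", "main", split="test")  # the 1,319-problem test split
    correct = sum(
        predicted_answer(generate(row["question"])) == reference_answer(row["answer"])
        for row in test
    )
    return correct / len(test)
```

Taking the last number in the completion is the common convention when the prompt does not force a `####` suffix; harnesses that do force it can reuse `reference_answer` on predictions as well.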
12,500 competition problems rated at difficulty levels 1-5 (up to AMC/AIME level), covering algebra, counting and probability, geometry, number theory, and precalculus. MATH-500, a 500-problem representative subset of the test set, is the standard evaluation split.
| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| ★ | o4-mini (high) | OpenAI | 98.2% | Mar 2026 |
| 2 | o3 (high) | OpenAI | 98.1% | Mar 2026 |
| 3 | o3-mini | OpenAI | 97.9% | Mar 2026 |
| 4 | o3 | OpenAI | 97.8% | Mar 2026 |
| 5 | o4-mini | OpenAI | 97.5% | Mar 2026 |
| 6 | DeepSeek R1 | DeepSeek | 97.3% | Mar 2026 |
| 7 | Gemini 2.5 Pro | Google DeepMind | 97.3% | Mar 2026 |
| 8 | o1 | OpenAI | 96.4% | Mar 2026 |
| 9 | Kimi k1.5 | Moonshot AI | 96.2% | Mar 2026 |
| 10 | Claude 3.7 Sonnet | Anthropic | 96.2% | Mar 2026 |
| 11 | DeepSeek-R1-Zero | DeepSeek | 95.9% | Mar 2026 |
| 12 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 94.5% | Mar 2026 |
| 13 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 94.3% | Mar 2026 |
| 14 | DeepSeek-v3-0324 | DeepSeek | 94% | Mar 2026 |
| 15 | Claude Opus 4.5 | Anthropic | 90.7% | Mar 2026 |
| 16 | QwQ-32B | Alibaba/Qwen | 90.6% | Mar 2026 |
| 17 | DeepSeek-V3 | DeepSeek | 90.2% | Mar 2026 |
| 18 | o1-mini | OpenAI | 90% | Mar 2026 |
| 19 | Llama-4-Maverick | Meta | 89.4% | Mar 2026 |
| 20 | Claude Opus 4 | Anthropic | 89.2% | Mar 2026 |
| 21 | Claude Sonnet 4 | Anthropic | 88.9% | Mar 2026 |
| 22 | GPT-4.5 Preview | OpenAI | 87.1% | Mar 2026 |
| 23 | o1-preview | OpenAI | 85.5% | Mar 2026 |
| 24 | Qwen2.5-72B-Instruct | Alibaba | 83.1% | Mar 2026 |
| 25 | GPT-4.1 | OpenAI | 82.1% | Mar 2026 |
| 26 | GPT-4o | OpenAI | 76.6% | Mar 2026 |
| 27 | Grok 2 | xAI | 76.1% | Mar 2026 |
| 28 | Llama 3.1 405B | Meta | 73.8% | Mar 2026 |
| 29 | GPT-4 Turbo | OpenAI | 73.4% | Mar 2026 |
| 30 | Claude 3.5 Sonnet | Anthropic | 71.1% | Mar 2026 |
| 31 | GPT-4o mini | OpenAI | 70.2% | Mar 2026 |
| 32 | Llama 3.1 70B | Meta | 68% | Mar 2026 |
| 33 | Gemini 1.5 Pro | Google DeepMind | 67.7% | Mar 2026 |
| 34 | Claude 3 Opus | Anthropic | 60.1% | Mar 2026 |
Source: hendrycks/math · MATH-500 subset, chain-of-thought.
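MATH scoring hinges on extracting the last `\boxed{...}` expression from both the reference solution and the model's chain of thought. The sketch below is a simplified, string-level checker; production harnesses typically add fuller LaTeX normalization and symbolic equivalence (e.g., via sympy), so treat the `normalize` rules here as illustrative assumptions.

```python
# A simplified MATH / MATH-500 answer checker. Reference solutions wrap the final answer in
# \boxed{...}; the model's chain of thought is scored by comparing its last boxed expression
# against the reference one. Real harnesses add LaTeX normalization and symbolic equivalence
# checks; the string-level `normalize` below is a deliberate simplification.
def last_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a solution, handling nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len(r"\boxed{"), 1, []
    while i < len(text) and depth > 0:
        ch = text[i]
        depth += (ch == "{") - (ch == "}")
        if depth > 0:
            out.append(ch)
        i += 1
    return "".join(out)

def normalize(expr: str) -> str:
    """Crude normalization: drop spaces, \\left/\\right, and trailing punctuation."""
    return expr.replace(r"\left", "").replace(r"\right", "").replace(" ", "").rstrip(".")

def is_correct(model_output: str, reference_solution: str) -> bool:
    pred, gold = last_boxed(model_output), last_boxed(reference_solution)
    return pred is not None and gold is not None and normalize(pred) == normalize(gold)
```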
Grade School Math 8K — 8,500 word problems requiring 2-8 step arithmetic reasoning, created by OpenAI in 2021. Chain-of-thought prompting revealed a step-change in model capability. Now saturated at the frontier.
MATH problems require domain knowledge (e.g., modular arithmetic, geometric proofs), not just arithmetic. Difficulty 5 problems (AIME-level) stump most people. The benchmark was designed to take years to saturate, yet reasoning models like o3 are now above 95%.
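As an illustration of the kind of domain knowledge involved, here is a short number-theory solution in the benchmark's LaTeX conventions; the problem is made up for exposition and is not drawn from the dataset.

```latex
% Illustrative, benchmark-style problem: find the remainder when 2^{100} is divided by 7.
% Uses the fact that 2^3 = 8 \equiv 1 \pmod{7}.
\begin{align*}
  2^{100} = 2^{3 \cdot 33 + 1} = \left(2^{3}\right)^{33} \cdot 2
          \equiv 1^{33} \cdot 2 \equiv \boxed{2} \pmod{7}.
\end{align*}
```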
Reasoning models (o3, DeepSeek-R1) use extended chain-of-thought with self-verification before committing to an answer. This search-like inference process is especially effective on math, where step-by-step verification catches errors.
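The internals of o3 and DeepSeek-R1 are not public, but the closest published analogue that can be sketched is self-consistency decoding (Wang et al., 2022), the method behind the "PaLM 540B (Self-Consistency)" row above: sample several chains of thought and majority-vote on their final answers, so agreement across independent chains plays the role of verification. The sampling and answer-extraction callables below are placeholders, not any specific model's API.

```python
# Self-consistency decoding: sample k chains of thought at non-zero temperature and
# majority-vote on their extracted final answers. `sample_chain` and `extract_answer`
# are placeholders for a model call and an answer parser (e.g., last number / last \boxed{}).
from collections import Counter

def self_consistency(question: str, sample_chain, extract_answer, k: int = 16) -> str | None:
    """Return the most common final answer across k independently sampled chains of thought."""
    answers = []
    for _ in range(k):
        chain = sample_chain(question)        # sampled at temperature > 0 so chains differ
        answer = extract_answer(chain)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Agreement across independent chains acts as a cheap verification signal.
    return Counter(answers).most_common(1)[0][0]
```

maj@1 in the GSM8K table above is the degenerate case k = 1; reasoning models fold a similar verify-then-answer loop into a single long chain of thought instead of voting across separate samples.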