MATH
The MATH benchmark contains 12,500 competition mathematics problems (with a 5,000-problem test set) drawn from AMC, AIME, and other competitions, covering algebra, geometry, number theory, and more. It is substantially harder than GSM8K. Modern evaluations typically report on the MATH-500 representative subset.
SOTA History

Metric: accuracy (higher is better). Note that the scores below mix MATH-500 and full-test-set evaluations, so entries are not strictly comparable.
| Rank | Model | Notes | Score | Paper / Source |
|---|---|---|---|---|
| 1 | o3-mini | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | 97.9 | openai-simple-evals |
| 2 | o3 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.8 | openai-simple-evals |
| 3 | o4-mini | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.5 | openai-simple-evals |
| 4 | deepseek-r1 | MATH-500, from official DeepSeek-R1 paper. On par with OpenAI o1. | 97.3 | deepseek-paper |
| 5 | o1 | MATH-500, zero-shot CoT, pass@1. | 96.4 | openai-simple-evals |
| 6 | claude-37-sonnet | MATH-500 with extended thinking enabled. | 96.2 | anthropic-blog |
| 7 | deepseek-v3 | Non-reasoning base model. | 90.2 | deepseek-blog |
| 8 | o1-mini | MATH-500, zero-shot CoT, pass@1. | 90.0 | openai-simple-evals |
| 9 | gpt-45-preview | Full MATH test set, zero-shot CoT. | 87.1 | openai-simple-evals |
| 10 | o1-preview | MATH-500, zero-shot CoT, pass@1. | 85.5 | openai-simple-evals |
| 11 | gpt-41 | Full MATH test set, zero-shot CoT. | 82.1 | openai-simple-evals |
| 12 | gpt-4o | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | 76.6 | openai-simple-evals |
| 13 | grok-2 | Full MATH test set. | 76.1 | openai-simple-evals |
| 14 | llama-31-405b | Full MATH test set. | 73.8 | openai-simple-evals |
| 15 | gpt-4-turbo | Full MATH test set, zero-shot CoT. | 73.4 | openai-simple-evals |
| 16 | claude-35-sonnet | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | 71.1 | openai-simple-evals |
| 17 | gpt-4o-mini | Full MATH test set, zero-shot CoT. | 70.2 | openai-simple-evals |
| 18 | llama-31-70b | Full MATH test set. | 68.0 | openai-simple-evals |
| 19 | gemini-15-pro | From Google's official evaluation. | 67.7 | google-blog |
| 20 | claude-3-opus | Full MATH test set. | 60.1 | openai-simple-evals |