Codesota · Benchmark · MATH

MATH.

12,500 competition mathematics problems (5,000 test) from AMC, AIME, and other sources covering algebra, geometry, number theory, and more. Harder than GSM8K. Modern evaluations typically use the MATH-500 representative subset.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy is the reported evaluation metric for MATH. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
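
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of problems whose single predicted final answer matches the reference, which is what a pass@1 score amounts to when one answer is sampled per problem. The `normalize` helper and the sample answers are hypothetical; real MATH graders typically apply more elaborate answer normalization or symbolic equivalence checks, and the entries on this leaderboard mix protocols (MATH-500 vs. full test set, zero-shot CoT vs. 4-shot), so this is not the scoring code behind any listed result.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Hypothetical normalizer: trims whitespace, drops a trailing period and inner spaces."""
    return answer.strip().rstrip(".").replace(" ", "")


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose single predicted answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty problem set")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up answers for illustration only: three of the four match after normalization.
preds = ["\\frac{1}{2}", "42", "x = 3", "7"]
refs = ["\\frac{1}{2}", "42", "x=3", "8"]
score = accuracy(preds, refs)
print(f"{score:.1f}")  # prints 75.0
```

Scores in the table are on the same 0 to 100 scale, so a value such as 98.2 on MATH-500 corresponds to 491 of 500 problems answered correctly.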

Trust tiers for accuracy: verified, paper, vendor, community, unverified.
| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | o4-mini (high) | paper | 98.2 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 02 | o3 (high) | unverified | 98.1 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 03 | o3-mini | paper | 97.9 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 04 | o3 | unverified | 97.8 | 2026 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 05 | o4-mini | unverified | 97.5 | 2026 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 06 | Gemini 2.5 Pro | paper | 97.3 | 2026 | MATH-500, pass@1. Gemini 2.5 Pro (Mar 2025). |
| 07 | DeepSeek R1 | paper | 97.3 | 2026 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). |
| 08 | DeepSeek-R1 | paper | 97.3 | 2026 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). |
| 09 | o1 | paper | 96.4 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 10 | Kimi k1.5 | paper | 96.2 | 2026 | MATH-500, long-CoT variant. From official Kimi k1.5 paper (Jan 2025). |
| 11 | Claude 3.7 Sonnet | paper | 96.2 | 2026 | MATH-500 with extended thinking enabled. |
| 12 | DeepSeek-R1-Zero | paper | 95.9 | 2026 | MATH-500, pass@1. DeepSeek-R1-Zero (pure RL, no SFT). From R1 paper (Jan 2025). |
| 13 | DeepSeek-R1-Distill-Llama-70B | paper | 94.5 | 2026 | MATH-500, pass@1. Distilled from DeepSeek-R1 into Llama-3.1-70B. From R1 paper (Jan 2025). |
| 14 | DeepSeek-R1-Distill-Qwen-32B | paper | 94.3 | 2026 | MATH-500, pass@1. Distilled from DeepSeek-R1 into Qwen-2.5-32B. From R1 paper (Jan 2025). |
| 15 | DeepSeek-v3-0324 | unverified | 94 | 2026 | MATH-500. DeepSeek-V3-0324 updated model (Mar 2025). Non-reasoning base model. |
| 16 | Claude Opus 4.5 | verified | 90.7 | 2026 | 4-shot. Source: Claude Opus 4.5 model card, Anthropic (2025). |
| 17 | QwQ-32B | unverified | 90.6 | 2026 | MATH-500, pass@1. QwQ-32B reasoning model by Alibaba/Qwen (Mar 2025). |
| 18 | DeepSeek-V3 | paper | 90.2 | 2026 | MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024). |
| 19 | o1-mini | paper | 90 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 20 | Llama 4 Maverick | verified | 89.4 | 2026 | 4-shot. Source: Meta Llama 4 model card (April 2025). |
| 21 | Claude Opus 4 | verified | 89.2 | 2026 | 4-shot. Source: Claude Opus 4 model card, Anthropic (2025). |
| 22 | Claude Sonnet 4 | verified | 88.9 | 2026 | 4-shot. Source: Claude Sonnet 4 model card, Anthropic (2025). |
| 23 | GPT-4.5 Preview | paper | 87.1 | 2026 | Full MATH test set, zero-shot CoT. |
| 24 | o1-preview | paper | 85.5 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 25 | Qwen2.5-Plus | unverified | 84.7 | 2024 | |
| 26 | Qwen2.5-72B-Instruct | verified | 83.1 | 2026 | Qwen2.5-72B-Instruct. Table 6 in Qwen2.5 Technical Report. |
| 27 | Qwen2.5-VL-72B | unverified | 83 | 2025 | |
| 28 | GPT-4.1 | paper | 82.1 | 2026 | Full MATH test set, zero-shot CoT. |
| 29 | MiniMax-Text-01 | unverified | 77.4 | 2025 | |
| 30 | gpt-4o | paper | 76.6 | 2026 | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. |
| 31 | Grok 2 | paper | 76.1 | 2026 | Full MATH test set. |
| 32 | Llama 3 (405B, Instruct) | unverified | 73.8 | 2024 | |
| 33 | Llama 3.1 405B | paper | 73.8 | 2026 | Full MATH test set. |
| 34 | GPT-4 Turbo | paper | 73.4 | 2026 | Full MATH test set, zero-shot CoT. |
| 35 | Qwen3-235B-A22B | unverified | 71.84 | 2025 | |
| 36 | claude-35-sonnet | paper | 71.1 | 2026 | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). |
| 37 | Claude 3.5 Sonnet | unverified | 71.1 | 2026 | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). |
| 38 | gpt-4o-mini | paper | 70.2 | 2026 | Full MATH test set, zero-shot CoT. |
| 39 | GPT-4o mini | unverified | 70.2 | 2026 | Full MATH test set, zero-shot CoT. |
| 40 | Llama 3.1 70B | unverified | 68 | 2026 | Full MATH test set. |
| 41 | gemini-15-pro | paper | 67.7 | 2026 | From Google's official evaluation. |
| 42 | Gemini 1.5 Pro | unverified | 67.7 | 2026 | From Google's official evaluation. |
| 43 | Step-3.5-Flash Base | unverified | 66.8 | 2026 | |
| 44 | Claude 3 Opus | unverified | 60.1 | 2026 | Full MATH test set. |
| 45 | HRM-Text-1B | unverified | 56.5 | 2026 | |
| 46 | Aria | unverified | 50.8 | 2024 | |
| 47 | Apertus-70B-Instruct | unverified | 30.8 | 2025 | |
| 48 | Chameleon 34B | unverified | 22.5 | 2024 | |
| 49 | LLaMA-65B | unverified | 20.5 | 2023 | |
| 50 | SmoLM2 (1.7B) | unverified | 11.6 | 2025 | |

§ 03 · Lineage

MATH in context.

This benchmark (1)
MATH · saturating · 2021-11
Successors (2)