Codesota · Benchmark · MATH

MATH.

MATH contains 12,500 competition mathematics problems (7,500 training, 5,000 test) drawn from AMC, AIME, and other competitions, covering algebra, geometry, number theory, and more. It is harder than GSM8K, which targets grade-school word problems. Modern evaluations typically report on MATH-500, a 500-problem subset of the test set.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy is the reported evaluation metric for MATH. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
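As a rough illustration of how an accuracy figure like the ones below could be computed, here is a minimal sketch. It assumes each problem has a single reference answer and that one model answer per problem is scored (as in the pass@1 runs noted in the table). The `accuracy` function and the sample answers are hypothetical. Comparing stripped strings is a simplification: real MATH graders normally normalize LaTeX answers before checking equivalence.

```python
def accuracy(predictions, references):
    """Percentage of problems whose predicted final answer matches the reference.

    Assumption: answers are compared as whitespace-stripped strings, which is
    stricter than the answer-equivalence checks used in real evaluations.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty problem set")
    correct = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers for four problems (illustrative only, not MATH data).
preds = ["12", "x+1", "7", "\\frac{1}{2}"]
refs = ["12", "x+1", "8", "\\frac{1}{2}"]
score = accuracy(preds, refs)
print(f"{score:.1f}")  # 3 of 4 correct -> 75.0
```

Scores computed on MATH-500 and on the full 5,000-problem test set use the same formula but different problem sets, so they are not directly comparable.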

Trust tiers for accuracy: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | o4-mini (high) | paper | 98.2 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 02 | o3 (high) | unverified | 98.1 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 03 | o3-mini | paper | 97.9 | 2026 | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 04 | o3 | unverified | 97.8 | 2026 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 05 | o4-mini | unverified | 97.5 | 2026 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 06 | Gemini 2.5 Pro | paper | 97.3 | 2026 | MATH-500, pass@1. Gemini 2.5 Pro (Mar 2025). |
| 07 | DeepSeek R1 | paper | 97.3 | 2026 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). |
| 08 | DeepSeek-R1 | paper | 97.3 | 2026 | MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). |
| 09 | o1 | unverified | 96.4 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 10 | Claude 3.7 Sonnet | unverified | 96.2 | 2026 | MATH-500 with extended thinking enabled. |
| 11 | Kimi k1.5 | paper | 96.2 | 2026 | MATH-500, long-CoT variant. From official Kimi k1.5 paper (Jan 2025). |
| 12 | DeepSeek-R1-Zero | paper | 95.9 | 2026 | MATH-500, pass@1. DeepSeek-R1-Zero (pure RL, no SFT). From R1 paper (Jan 2025). |
| 13 | DeepSeek-R1-Distill-Llama-70B | paper | 94.5 | 2026 | MATH-500, pass@1. Distilled from DeepSeek-R1 into Llama-3.1-70B. From R1 paper (Jan 2025). |
| 14 | DeepSeek-R1-Distill-Qwen-32B | paper | 94.3 | 2026 | MATH-500, pass@1. Distilled from DeepSeek-R1 into Qwen-2.5-32B. From R1 paper (Jan 2025). |
| 15 | DeepSeek-v3-0324 | unverified | 94 | 2026 | MATH-500. DeepSeek-V3-0324 updated model (Mar 2025). Non-reasoning base model. |
| 16 | Claude Opus 4.5 | verified | 90.7 | 2026 | 4-shot. Source: Claude Opus 4.5 model card, Anthropic (2025). |
| 17 | QwQ-32B | unverified | 90.6 | 2026 | MATH-500, pass@1. QwQ-32B reasoning model by Alibaba/Qwen (Mar 2025). |
| 18 | DeepSeek-V3 | paper | 90.2 | 2026 | MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024). |
| 19 | o1-mini | paper | 90 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 20 | Llama-4-Maverick | verified | 89.4 | 2026 | 4-shot. Source: Meta Llama 4 model card (April 2025). |
| 21 | Claude Opus 4 | verified | 89.2 | 2026 | 4-shot. Source: Claude Opus 4 model card, Anthropic (2025). |
| 22 | Claude Sonnet 4 | verified | 88.9 | 2026 | 4-shot. Source: Claude Sonnet 4 model card, Anthropic (2025). |
| 23 | GPT-4.5 Preview | paper | 87.1 | 2026 | Full MATH test set, zero-shot CoT. |
| 24 | o1-preview | paper | 85.5 | 2026 | MATH-500, zero-shot CoT, pass@1. |
| 25 | Qwen2.5-Plus | unverified | 84.7 | 2024 | |
| 26 | Qwen2.5-72B-Instruct | verified | 83.1 | 2026 | Qwen2.5-72B-Instruct. Table 6 in Qwen2.5 Technical Report. |
| 27 | Qwen2.5-VL-72B | unverified | 83 | 2025 | |
| 28 | GPT-4.1 | paper | 82.1 | 2026 | Full MATH test set, zero-shot CoT. |
| 29 | MiniMax-Text-01 | unverified | 77.4 | 2025 | |
| 30 | gpt-4o | paper | 76.6 | 2026 | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. |
| 31 | Grok 2 | paper | 76.1 | 2026 | Full MATH test set. |
| 32 | Llama 3 (405B, Instruct) | unverified | 73.8 | 2024 | |
| 33 | Llama 3.1 405B | paper | 73.8 | 2026 | Full MATH test set. |
| 34 | GPT-4 Turbo | paper | 73.4 | 2026 | Full MATH test set, zero-shot CoT. |
| 35 | Qwen3-235B-A22B | unverified | 71.84 | 2025 | |
| 36 | claude-35-sonnet | paper | 71.1 | 2026 | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). |
| 37 | Claude 3.5 Sonnet | unverified | 71.1 | 2026 | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). |
| 38 | gpt-4o-mini | paper | 70.2 | 2026 | Full MATH test set, zero-shot CoT. |
| 39 | GPT-4o mini | unverified | 70.2 | 2026 | Full MATH test set, zero-shot CoT. |
| 40 | Llama 3.1 70B | unverified | 68 | 2026 | Full MATH test set. |
| 41 | gemini-15-pro | paper | 67.7 | 2026 | From Google's official evaluation. |
| 42 | Gemini 1.5 Pro | unverified | 67.7 | 2026 | From Google's official evaluation. |
| 43 | Step-3.5-Flash Base | unverified | 66.8 | 2026 | |
| 44 | Claude 3 Opus | unverified | 60.1 | 2026 | Full MATH test set. |
| 45 | HRM-Text-1B | unverified | 56.5 | 2026 | |
| 46 | Aria | unverified | 50.8 | 2024 | |
| 47 | Apertus-70B-Instruct | unverified | 30.8 | 2025 | |
| 48 | Chameleon 34B | unverified | 22.5 | 2024 | |
| 49 | LLaMA-65B | unverified | 20.5 | 2023 | |
| 50 | SmoLM2 (1.7B) | unverified | 11.6 | 2025 | |

§ 03 · Lineage

MATH in context.

This benchmark (1)
saturating · 2021-11
MATH
Successors (2)
§ 04 · Submit a result

Add to the leaderboard.
