Codesota · Benchmark · AIME 2024

AIME 2024.

30 problems from the 2024 American Invitational Mathematics Examination (AIME I and II, 15 problems each). Tests advanced mathematical reasoning.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy is the reported evaluation metric for AIME 2024. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
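The leaderboard notes mention both pass@1 and consensus (majority-vote) scoring, but the page does not spell out how each source computed its number. The following is a minimal sketch of the two ideas, assuming exact-match grading on integer answers; the function names and the toy data are illustrative and not taken from Codesota or any vendor.

```python
from collections import Counter


def accuracy(predictions, answers):
    """Percentage of problems whose predicted answer equals the reference."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def consensus_answer(samples):
    """Majority vote over several sampled answers to one problem.

    Ties go to the answer that appeared first in the sample list.
    """
    return Counter(samples).most_common(1)[0][0]


# Hypothetical toy data: 3 problems, 4 samples each (not real AIME answers).
answers = [204, 25, 113]
samples = [
    [204, 204, 17, 204],  # clear majority, correct
    [25, 30, 30, 30],     # majority is wrong
    [113, 113, 9, 9],     # tie, first-seen answer wins
]

voted = [consensus_answer(s) for s in samples]
score = accuracy(voted, answers)
print(voted, round(score, 1))
```

With 30 problems, each one is worth about 3.3 points, so 29 correct out of 30 gives roughly 96.7 and 24 out of 30 gives 80.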

Trust tiers for accuracy: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Links | Notes |
|------|-------|-------|-------|------|-------|-------|
| 01 | o3 | verified | 96.7 | 2026 | Source | Average over AIME 2024 I+II. Pass@1 consensus. Source: OpenAI o3 system card (Dec 2024). |
| 02 | o4-mini | verified | 93.4 | 2026 | Source | Average over AIME 2024 I+II. Source: OpenAI o4-mini system card (April 2025). |
| 03 | Gemini 2.5 Pro | verified | 92 | 2026 | Source | Average over AIME 2024 I+II. Source: Gemini 2.5 Pro technical report (April 2025). |
| 04 | GLM-4.5-Air | unverified | 89.4 | 2025 | Paper, Code, Source | |
| 05 | Qwen3-Coder-Next | unverified | 89.01 | 2026 | Paper, Code | |
| 06 | Qwen3-235B-A22B | unverified | 85.7 | 2025 | Paper, Code | |
| 07 | o1-preview | paper | 83.3 | 2025 | Source | American Invitational Mathematics Examination. Elite competition math. |
| 08 | Claude 3.7 Sonnet | verified | 80 | 2026 | Source | Average over AIME 2024 I+II. Source: Claude 3.7 Sonnet model card, Anthropic (Feb 2025). |
| 09 | DeepSeek R1 | verified | 79.8 | 2026 | Source | Average over AIME 2024 I+II (consensus @ 64 samples). Source: DeepSeek-R1 paper, arXiv:2501.12948 (Jan 2025). |
| 10 | Claude 3.5 Opus | unverified | 16 | 2025 | Source | |
| 11 | claude-35-opus | paper | 16 | 2025 | Source | |
| 12 | GPT-4o | unverified | 13.4 | 2025 | Source | Significant gap between o1 and GPT-4o on competition math. |
§ 03 · Lineage

AIME 2024 in context.

This benchmark (1): AIME 2024 (active, 2024-03)
Successors (1): FrontierMath (active, 2024-11)
AIME problems are finite and increasingly contaminated as training sets grow. FrontierMath sources unpublished research-frontier problems, so contamination is meant to be impossible by design. It marks the step change from competition math to research math.
