Who leads the AIME 2024 benchmark?

o3 currently leads AIME 2024 with an accuracy score of 96.7.

What is the state-of-the-art score on AIME 2024?

The state-of-the-art result on AIME 2024 is 96.7 (accuracy), achieved by o3 as of 2026.

How many models are tracked on AIME 2024?

Codesota tracks 12 models on AIME 2024.

When was the AIME 2024 leaderboard last updated?

The AIME 2024 leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2025.

Codesota · Benchmark · AIME 2024

AIME 2024.

Name: AIME 2024 Benchmark Results
Creator: Unknown
Published: 2025-01-01
License: https://creativecommons.org/licenses/by/4.0/

30 challenging math problems from the 2024 AIME competition. Tests advanced mathematical reasoning.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

accuracy

Accuracy is the reported evaluation metric for AIME 2024. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
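As a rough illustration of how an accuracy figure on a 30-problem set maps to a leaderboard score, here is a minimal sketch. The helper name `accuracy` and the one-decimal rounding are assumptions made for this example, not Codesota's documented method.

```python
# Hypothetical helper: converts a count of solved problems into the
# percentage-style accuracy shown on the leaderboard.
AIME_2024_PROBLEMS = 30  # problem count stated in the benchmark description


def accuracy(correct: int, total: int = AIME_2024_PROBLEMS) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100.0 * correct / total, 1)


if __name__ == "__main__":
    for solved in (29, 25, 24):
        print(solved, accuracy(solved))
```

Under this reading, each problem is worth about 3.3 points, so 29 of 30 solved gives 96.7 and 25 of 30 gives 83.3. Scores that are not multiples of 1/30 are presumably averages over several sampled runs, so a single solved-count does not always reproduce a listed value.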

Trust tiers for accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	o3	Average over AIME 2024 I+II. Pass@1 consensus. Source: OpenAI o3 system card (Dec 2024).	verified	96.7	2026
02	o4-mini	Average over AIME 2024 I+II. Source: OpenAI o4-mini system card (April 2025).	verified	93.4	2026
03	Gemini 2.5 Pro	Average over AIME 2024 I+II. Source: Gemini 2.5 Pro technical report (April 2025).	verified	92	2026
04	GLM-4.5-Air		unverified	89.4	2025
05	Qwen3-Coder-Next		unverified	89.01	2026
06	Qwen3-235B-A22B		unverified	85.7	2025
07	o1-preview	American Invitational Mathematics Examination. Elite competition math.	paper	83.3	2025
08	Claude 3.7 Sonnet	Average AIME 2024 I+II. Source: Claude 3.7 Sonnet model card, Anthropic (Feb 2025).	verified	80	2026
09	DeepSeek R1	Average AIME 2024 I+II (consensus @ 64 samples). Source: DeepSeek-R1 paper, arxiv:2501.12948 (Jan 2025).	verified	79.8	2026
10	Claude 3.5 Opus		unverified	16	2025
11	claude-35-opus		paper	16	2025
12	GPT-4o	Significant gap between o1 and GPT-4o on competition math.	unverified	13.4	2025
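The muting rule stated above the table (a row is muted when an earlier or same-year result already scored better) can be sketched as follows. This is one literal reading of that sentence, using only the Score and Year columns from the table; the `Row` class and the `is_muted` and `sota_rows` helpers are hypothetical names, and the site's actual logic may use finer-grained dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


# Score and Year values copied from the leaderboard table.
ROWS = [
    Row("o3", 96.7, 2026),
    Row("o4-mini", 93.4, 2026),
    Row("Gemini 2.5 Pro", 92.0, 2026),
    Row("GLM-4.5-Air", 89.4, 2025),
    Row("Qwen3-Coder-Next", 89.01, 2026),
    Row("Qwen3-235B-A22B", 85.7, 2025),
    Row("o1-preview", 83.3, 2025),
    Row("Claude 3.7 Sonnet", 80.0, 2026),
    Row("DeepSeek R1", 79.8, 2026),
    Row("Claude 3.5 Opus", 16.0, 2025),
    Row("claude-35-opus", 16.0, 2025),
    Row("GPT-4o", 13.4, 2025),
]


def is_muted(row: Row, rows: list[Row]) -> bool:
    """Assumed rule: muted if any earlier or same-year row scored strictly higher."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


def sota_rows(rows: list[Row]) -> list[Row]:
    """Rows that were state of the art for their year under the assumed rule."""
    return [r for r in rows if not is_muted(r, rows)]


unmuted = [r.model for r in sota_rows(ROWS)]
print(unmuted)  # ['o3', 'GLM-4.5-Air']
```

With year-level granularity, only the best 2025 result and the best overall 2026 result stay unmuted. Ties are not muted against each other because the comparison is strict.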

§ 03 · Lineage

AIME 2024 in context.

See full mathematical reasoning benchmarks lineage →

Predecessors (1)

saturating · 2021-11

MATH

AIME did not originate as an AI benchmark: it is the human competition that feeds into USAMO/IMO. It became an AI frontier benchmark when o1 started scoring competitively. Because a fresh set of problems is released every year, it provides contamination control that MATH, a fixed dataset, cannot.

This benchmark (1)

active · 2024-03

AIME 2024

Successors (1)

active · 2024-11

FrontierMath

AIME problems are finite and increasingly contaminated as training sets grow. FrontierMath sources unpublished research-frontier problems, which makes contamination impossible by design. It marks the step change from competition math to research math.
