Mathematical Reasoning · benchmark dataset · 2025 · EN

American Invitational Mathematics Examination 2025.

Olympiad-style short-answer math benchmark used by reasoning-model releases. Small test set, so score swings should be read with caution.
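To make the caution about the small test set concrete, the sketch below computes a 95% Wilson score interval for a single-run accuracy. It assumes the benchmark has 30 problems (AIME I and AIME II, 15 each), which this page does not state, and it treats each problem as an independent pass/fail trial. The function name `wilson_interval` and the example count of 27 correct answers are illustrative only.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumption: 30 problems in total (AIME I + AIME II, 15 each).
N_PROBLEMS = 30

# One problem moves a single-run score by this many accuracy points.
points_per_problem = 100 / N_PROBLEMS

# Illustrative case: 27 of 30 correct, i.e. a reported accuracy of 90.
low, high = wilson_interval(27, N_PROBLEMS)

print(f"one problem = {points_per_problem:.2f} points")
print(f"27/30 correct: 95% interval {100 * low:.1f} to {100 * high:.1f}")
```

Under these assumptions one problem is worth about 3.3 points, and the interval for a score of 90 runs from roughly 74 to 97, so gaps of a few points between neighbouring rows are within noise for a single run. Fractional scores on this page, such as 83.07, suggest that some sources average over repeated samples, which narrows the interval; the single-run model above is only a rough guide for those rows.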

§ 01 · Leaderboard

Best published scores.

22 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 22 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | Step-3.5-Flash PaCoRe | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 99.90 |
| 02 | Step-3.5-Flash | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 97.30 |
| 03 | Kimi-K2.5 (Open) | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 96.10 |
| 04 | DeepSeek-V3.2-Speciale (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 96 |
| 05 | SU-01 | | May 2026 | Achieving Gold-Medal-Level Olympiad Reasoning via Simple… · code | 94.60 |
| 06 | Intern-S1-Pro | Shanghai AI Lab | Mar 2026 | Intern-S1-Pro: Scientific Multimodal Foundation Model at… | 93.10 |
| 07 | DeepSeek-V3.2 (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 93.10 |
| 08 | o4-mini | OpenAI | Mar 2026 | openai-system-card | 92.70 |
| 09 | Qwen3-VL-235B-A22B-Thinking | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 89.70 |
| 10 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | Dec 2025 | Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybr… · code | 89.10 |
| 11 | Gemini 2.5 Pro | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 88 |
| 12 | Gemini 2.5 Pro (API) | Google | Mar 2026 | google-technical-report | 86.70 |
| 13 | o3 | OpenAI | Mar 2026 | openai-system-card | 86.70 |
| 14 | Qwen3-Coder-Next | Qwen | Feb 2026 | Qwen3-Coder-Next Technical Report · code | 83.07 |
| 15 | Qwen3-235B-A22B (Open) | Alibaba | May 2025 | Qwen3 Technical Report · code | 81.50 |
| 16 | Claude Opus 4.5 (API) | Anthropic | Mar 2026 | anthropic-model-card | 80 |
| 17 | Qwen3-VL-235B-A22B-Instruct | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 74.70 |
| 18 | Qwen3-Omni-Flash-Thinking | | Sep 2025 | Qwen3-Omni Technical Report · code | 74 |
| 19 | Gemini 2.5 Flash | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 72 |
| 20 | DeepSeek R1 (Open) | DeepSeek | Mar 2026 | arxiv | 72 |
| 21 | Qwen3-VL-8B-Instruct | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 45.90 |
| 22 | Trinity Large Preview | Arcee AI | Feb 2026 | Arcee Trinity Large Technical Report · code | 24.36 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

6 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. May 14, 2025 · Qwen3-235B-A22B · Alibaba · 81.50
  2. Jul 7, 2025 · Gemini 2.5 Pro · 88
  3. Nov 26, 2025 · Qwen3-VL-235B-A22B-Thinking · Qwen · 89.70
  4. Dec 2, 2025 · DeepSeek-V3.2-Speciale · DeepSeek · 96
  5. Feb 2, 2026 · Kimi-K2.5 · Moonshot.AI · 96.10
  6. Feb 11, 2026 · Step-3.5-Flash PaCoRe · 99.90
Fig 3 · SOTA-setting models only. 6 entries span May 2025 to Feb 2026.
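The record-setting rule described in this section amounts to a running maximum over dated results. The sketch below is one way to compute it, not Codesota's actual pipeline: it sorts rows by date and keeps an entry only when its score strictly exceeds the best seen so far. The six dated rows are copied from the SOTA line; the row named `hypothetical-model` is invented purely to show a submission that does not set a record, and the strict-improvement treatment of ties is an assumption.

```python
from datetime import date

# (date, model, accuracy). The first six rows come from the SOTA line;
# "hypothetical-model" is made up to illustrate a non-record submission.
entries = [
    (date(2025, 5, 14), "Qwen3-235B-A22B", 81.50),
    (date(2025, 7, 7), "Gemini 2.5 Pro", 88.0),
    (date(2025, 8, 1), "hypothetical-model", 85.0),
    (date(2025, 11, 26), "Qwen3-VL-235B-A22B-Thinking", 89.70),
    (date(2025, 12, 2), "DeepSeek-V3.2-Speciale", 96.0),
    (date(2026, 2, 2), "Kimi-K2.5", 96.10),
    (date(2026, 2, 11), "Step-3.5-Flash PaCoRe", 99.90),
]


def sota_steps(rows):
    """Return the rows that strictly beat every earlier score, in date order."""
    best = float("-inf")
    steps = []
    for when, model, score in sorted(rows, key=lambda r: r[0]):
        if score > best:  # assumption: a tie does not count as a new record
            best = score
            steps.append((when, model, score))
    return steps


steps = sota_steps(entries)
for i, (when, model, score) in enumerate(steps, start=1):
    print(f"{i}. {when.isoformat()}  {model}  {score:.2f}")
```

With these inputs the hypothetical 85.0 row is dropped because 88 was already on record, and the six rows from the SOTA line remain.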
§ 04 · Literature

12 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
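As a rough illustration of items 02 and 03, a reproduction script could record its commit, seed, and environment in a small manifest alongside the score. The sketch below uses only the Python standard library. The field names, the `build_manifest` helper, and the placeholder commit string are hypothetical; this page does not specify a manifest format, so treat this as one possible shape rather than the required one.

```python
import json
import platform
import random
from importlib import metadata


def build_manifest(commit: str, seed: int, packages: list[str]) -> dict:
    """Collect the commit, seed, and environment details for a run (hypothetical schema)."""
    deps = {}
    for name in packages:
        try:
            deps[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            deps[name] = None  # package is not installed in this environment
    return {
        "commit": commit,
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "dependencies": deps,
    }


SEED = 1234
random.seed(SEED)  # fix the seed before any sampling happens

manifest = build_manifest(
    commit="<full commit hash of the evaluation code>",  # placeholder, fill in per run
    seed=SEED,
    packages=["pip"],  # list the packages the evaluation actually imports
)
print(json.dumps(manifest, indent=2))
```

Writing the manifest next to the reported score gives whoever reruns the evaluation a concrete record to compare against when results differ.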