| 01 | o4-mini (high): MATH-500, zero-shot CoT, pass@1. High reasoning effort. | paper | 98.2 | 2026 |
| 02 | o3 (high): MATH-500, zero-shot CoT, pass@1. High reasoning effort. | unverified | 98.1 | 2026 |
| 03 | o3-mini: MATH-500, zero-shot CoT, pass@1. High reasoning effort. | paper | 97.9 | 2026 |
| 04 | o3: MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | unverified | 97.8 | 2026 |
| 05 | o4-mini: MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | unverified | 97.5 | 2026 |
| 06 | Gemini 2.5 Pro: MATH-500, pass@1 (Mar 2025). | paper | 97.3 | 2026 |
| 07 | DeepSeek R1: MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). | paper | 97.3 | 2026 |
| 08 | DeepSeek-R1: MATH-500, pass@1. From official DeepSeek-R1 paper (Jan 2025). | paper | 97.3 | 2026 |
| 09 | o1: MATH-500, zero-shot CoT, pass@1. | unverified | 96.4 | 2026 |
| 10 | Claude 3.7 Sonnet: MATH-500 with extended thinking enabled. | unverified | 96.2 | 2026 |
| 11 | Kimi k1.5: MATH-500, long-CoT variant. From official Kimi k1.5 paper (Jan 2025). | paper | 96.2 | 2026 |
| 12 | DeepSeek-R1-Zero: MATH-500, pass@1. Pure RL, no SFT. From R1 paper (Jan 2025). | paper | 95.9 | 2026 |
| 13 | DeepSeek-R1-Distill-Llama-70B: MATH-500, pass@1. Distilled from DeepSeek-R1 into Llama-3.1-70B. From R1 paper (Jan 2025). | paper | 94.5 | 2026 |
| 14 | DeepSeek-R1-Distill-Qwen-32B: MATH-500, pass@1. Distilled from DeepSeek-R1 into Qwen-2.5-32B. From R1 paper (Jan 2025). | paper | 94.3 | 2026 |
| 15 | DeepSeek-V3-0324: MATH-500. Updated model (Mar 2025). Non-reasoning base model. | unverified | 94 | 2026 |
| 16 | Claude Opus 4.5: 4-shot. Source: Claude Opus 4.5 model card, Anthropic (2025). | verified | 90.7 | 2026 |
| 17 | QwQ-32B: MATH-500, pass@1. Reasoning model by Alibaba/Qwen (Mar 2025). | unverified | 90.6 | 2026 |
| 18 | DeepSeek-V3: MATH-500. Non-reasoning base model. From DeepSeek-V3 technical report (Dec 2024). | paper | 90.2 | 2026 |
| 19 | o1-mini: MATH-500, zero-shot CoT, pass@1. | paper | 90 | 2026 |
| 20 | Llama-4-Maverick: 4-shot. Source: Meta Llama 4 model card (April 2025). | verified | 89.4 | 2026 |
| 21 | Claude Opus 4: 4-shot. Source: Claude Opus 4 model card, Anthropic (2025). | verified | 89.2 | 2026 |
| 22 | Claude Sonnet 4: 4-shot. Source: Claude Sonnet 4 model card, Anthropic (2025). | verified | 88.9 | 2026 |
| 23 | GPT-4.5 Preview: Full MATH test set, zero-shot CoT. | paper | 87.1 | 2026 |
| 24 | o1-preview: MATH-500, zero-shot CoT, pass@1. | paper | 85.5 | 2026 |
| 25 | Qwen2.5-Plus | unverified | 84.7 | 2024 |
| 26 | Qwen2.5-72B-Instruct: Table 6 in Qwen2.5 Technical Report. | verified | 83.1 | 2026 |
| 27 | Qwen2.5-VL-72B | unverified | 83 | 2025 |
| 28 | GPT-4.1: Full MATH test set, zero-shot CoT. | paper | 82.1 | 2026 |
| 29 | MiniMax-Text-01 | unverified | 77.4 | 2025 |
| 30 | gpt-4o: Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | paper | 76.6 | 2026 |
| 31 | Grok 2: Full MATH test set. | paper | 76.1 | 2026 |
| 32 | Llama 3 (405B, Instruct) | unverified | 73.8 | 2024 |
| 33 | Llama 3.1 405B: Full MATH test set. | paper | 73.8 | 2026 |
| 34 | GPT-4 Turbo: Full MATH test set, zero-shot CoT. | paper | 73.4 | 2026 |
| 35 | Qwen3-235B-A22B | unverified | 71.84 | 2025 |
| 36 | claude-35-sonnet: Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | paper | 71.1 | 2026 |
| 37 | Claude 3.5 Sonnet: Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | unverified | 71.1 | 2026 |
| 38 | gpt-4o-mini: Full MATH test set, zero-shot CoT. | paper | 70.2 | 2026 |
| 39 | GPT-4o mini: Full MATH test set, zero-shot CoT. | unverified | 70.2 | 2026 |
| 40 | Llama 3.1 70B: Full MATH test set. | unverified | 68 | 2026 |
| 41 | gemini-15-pro: From Google's official evaluation. | paper | 67.7 | 2026 |
| 42 | Gemini 1.5 Pro: From Google's official evaluation. | unverified | 67.7 | 2026 |
| 43 | Step-3.5-Flash Base | unverified | 66.8 | 2026 |
| 44 | Claude 3 Opus: Full MATH test set. | unverified | 60.1 | 2026 |
| 45 | HRM-Text-1B | unverified | 56.5 | 2026 |
| 46 | Aria | unverified | 50.8 | 2024 |
| 47 | Apertus-70B-Instruct | unverified | 30.8 | 2025 |
| 48 | Chameleon 34B | unverified | 22.5 | 2024 |
| 49 | LLaMA-65B | unverified | 20.5 | 2023 |
| 50 | SmolLM2 (1.7B) | unverified | 11.6 | 2025 |