MATH
The MATH benchmark contains 12,500 competition mathematics problems (with a 5,000-problem test set) drawn from AMC, AIME, and other competitions, covering algebra, geometry, number theory, and more. It is substantially harder than GSM8K. Modern evaluations typically report on the MATH-500 representative subset.
SOTA History

Metric: accuracy (higher is better). Note that the scores below mix MATH-500 and full-test-set evaluations, so entries are not strictly comparable.
| Rank | Model | Notes | Score | Paper / Source |
|---|---|---|---|---|
| 1 | o3-mini | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | 97.9 | openai-simple-evals |
| 2 | o3 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.8 | openai-simple-evals |
| 3 | o4-mini | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.5 | openai-simple-evals |
| 4 | deepseek-r1 | MATH-500, from official DeepSeek-R1 paper. On par with OpenAI o1. | 97.3 | deepseek-paper |
| 5 | o1 | MATH-500, zero-shot CoT, pass@1. | 96.4 | openai-simple-evals |
| 6 | claude-37-sonnet | MATH-500 with extended thinking enabled. | 96.2 | anthropic-blog |
| 7 | deepseek-v3 | Non-reasoning base model. | 90.2 | deepseek-blog |
| 8 | o1-mini | MATH-500, zero-shot CoT, pass@1. | 90.0 | openai-simple-evals |
| 9 | gpt-45-preview | Full MATH test set, zero-shot CoT. | 87.1 | openai-simple-evals |
| 10 | o1-preview | MATH-500, zero-shot CoT, pass@1. | 85.5 | openai-simple-evals |
| 11 | gpt-41 | Full MATH test set, zero-shot CoT. | 82.1 | openai-simple-evals |
| 12 | gpt-4o | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | 76.6 | openai-simple-evals |
| 13 | grok-2 | Full MATH test set. | 76.1 | openai-simple-evals |
| 14 | llama-31-405b | Full MATH test set. | 73.8 | openai-simple-evals |
| 15 | gpt-4-turbo | Full MATH test set, zero-shot CoT. | 73.4 | openai-simple-evals |
| 16 | claude-35-sonnet | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | 71.1 | openai-simple-evals |
| 17 | gpt-4o-mini | Full MATH test set, zero-shot CoT. | 70.2 | openai-simple-evals |
| 18 | llama-31-70b | Full MATH test set. | 68.0 | openai-simple-evals |
| 19 | gemini-15-pro | From Google's official evaluation. | 67.7 | google-blog |
| 20 | claude-3-opus | Full MATH test set. | 60.1 | openai-simple-evals |