MATH


12,500 competition mathematics problems (7,500 train, 5,000 test) drawn from AMC, AIME, and other contests, covering algebra, geometry, number theory, and more. Substantially harder than GSM8K. Modern evaluations typically use the MATH-500 representative subset of the test split.

Benchmark Stats

Models: 20 · Papers: 20 · Metrics: 1


Metric: accuracy (higher is better)
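Most scores below are reported as pass@1 accuracy: each problem is answered once, the answer is graded against the reference, and the score is the fraction of problems answered correctly. A minimal sketch (the grading data here is illustrative, not from any actual evaluation):

```python
def pass_at_1(graded: list[bool]) -> float:
    """Pass@1 accuracy: the fraction of problems whose single
    sampled answer was graded correct."""
    if not graded:
        raise ValueError("no graded samples")
    return sum(graded) / len(graded)

# Hypothetical grading results for four problems: three correct, one wrong.
graded = [True, True, False, True]
print(f"{100 * pass_at_1(graded):.1f}")  # -> 75.0
```

In practice, graders for MATH normalize answer formats (fractions, radicals, LaTeX) before comparing, so the boolean judgments above hide most of the evaluation complexity.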

| Rank | Model | Code | Score | Paper / Source | Notes |
|------|-------|------|-------|----------------|-------|
| 1 | o3-mini | – | 97.9 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. High reasoning effort. |
| 2 | o3 | – | 97.8 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 3 | o4-mini | – | 97.5 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. |
| 4 | deepseek-r1 | – | 97.3 | deepseek-paper | MATH-500, from the official DeepSeek-R1 paper. On par with OpenAI o1. |
| 5 | o1 | – | 96.4 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. |
| 6 | claude-37-sonnet | – | 96.2 | anthropic-blog | MATH-500 with extended thinking enabled. |
| 7 | deepseek-v3 | – | 90.2 | deepseek-blog | Non-reasoning base model. |
| 8 | o1-mini | – | 90.0 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. |
| 9 | gpt-45-preview | – | 87.1 | openai-simple-evals | Full MATH test set, zero-shot CoT. |
| 10 | o1-preview | – | 85.5 | openai-simple-evals | MATH-500, zero-shot CoT, pass@1. |
| 11 | gpt-41 | – | 82.1 | openai-simple-evals | Full MATH test set, zero-shot CoT. |
| 12 | gpt-4o | – | 76.6 | openai-simple-evals | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. |
| 13 | grok-2 | – | 76.1 | openai-simple-evals | Full MATH test set. |
| 14 | llama-31-405b | – | 73.8 | openai-simple-evals | Full MATH test set. |
| 15 | gpt-4-turbo | – | 73.4 | openai-simple-evals | Full MATH test set, zero-shot CoT. |
| 16 | claude-35-sonnet | – | 71.1 | openai-simple-evals | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). |
| 17 | gpt-4o-mini | – | 70.2 | openai-simple-evals | Full MATH test set, zero-shot CoT. |
| 18 | llama-31-70b | – | 68.0 | openai-simple-evals | Full MATH test set. |
| 19 | gemini-15-pro | – | 67.7 | google-blog | From Google's official evaluation. |
| 20 | claude-3-opus | – | 60.1 | openai-simple-evals | Full MATH test set. |