# MATH benchmark

Model accuracy (%) on the MATH benchmark, reported either on the full test set or on the MATH-500 subset (see the notes per row; scores on the two sets are not directly comparable). Higher is better.
| Rank | Model | Notes | Score (%) | Source |
|---|---|---|---|---|
| 1 | o3-mini | MATH-500, zero-shot CoT, pass@1. High reasoning effort. | 97.9 | openai-simple-evals |
| 2 | o3 | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.8 | openai-simple-evals |
| 3 | o4-mini | MATH-500, zero-shot CoT, pass@1. Default reasoning effort. | 97.5 | openai-simple-evals |
| 4 | deepseek-r1 | MATH-500, from the official DeepSeek-R1 paper. On par with OpenAI o1. | 97.3 | deepseek-paper |
| 5 | o1 | MATH-500, zero-shot CoT, pass@1. | 96.4 | openai-simple-evals |
| 6 | claude-37-sonnet | MATH-500, with extended thinking enabled. | 96.2 | anthropic-blog |
| 7 | deepseek-v3 | Non-reasoning base model. | 90.2 | deepseek-blog |
| 8 | o1-mini | MATH-500, zero-shot CoT, pass@1. | 90.0 | openai-simple-evals |
| 9 | gpt-45-preview | Full MATH test set, zero-shot CoT. | 87.1 | openai-simple-evals |
| 10 | o1-preview | MATH-500, zero-shot CoT, pass@1. | 85.5 | openai-simple-evals |
| 11 | gpt-41 | Full MATH test set, zero-shot CoT. | 82.1 | openai-simple-evals |
| 12 | gpt-4o | Full MATH test set, zero-shot CoT. gpt-4o-2024-05-13. | 76.6 | openai-simple-evals |
| 13 | grok-2 | Full MATH test set. | 76.1 | openai-simple-evals |
| 14 | llama-31-405b | Full MATH test set. | 73.8 | openai-simple-evals |
| 15 | gpt-4-turbo | Full MATH test set, zero-shot CoT. | 73.4 | openai-simple-evals |
| 16 | claude-35-sonnet | Full MATH test set. Original Claude 3.5 Sonnet (June 2024). | 71.1 | openai-simple-evals |
| 17 | gpt-4o-mini | Full MATH test set, zero-shot CoT. | 70.2 | openai-simple-evals |
| 18 | llama-31-70b | Full MATH test set. | 68.0 | openai-simple-evals |
| 19 | gemini-15-pro | From Google's official evaluation. | 67.7 | google-blog |
| 20 | claude-3-opus | Full MATH test set. | 60.1 | openai-simple-evals |
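Most rows above report pass@1: each problem is attempted once (or several sampled attempts are averaged) and the score is the fraction answered correctly. As an illustration only (this is not the grading code used by any of the cited sources), a minimal sketch of the standard unbiased pass@k estimator, which reduces to plain mean accuracy at k = 1:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn for a problem,
    c = samples graded correct, k = attempts allowed.
    Computes 1 - C(n-c, k) / C(n, k); for k=1 this is simply c / n."""
    if n - c < k:
        return 1.0  # guaranteed at least one correct sample among k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem grades with one sample each (k = 1):
graded = [True, True, False, True]
score = 100 * sum(pass_at_k(1, int(ok), 1) for ok in graded) / len(graded)
print(f"pass@1 accuracy: {score:.1f}")  # → pass@1 accuracy: 75.0
```

With a single sample per problem the estimator is just average correctness; drawing more samples per problem (larger n) reduces the variance of the pass@1 estimate without changing its expectation.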