Multi-step Reasoning · 2024 · en
Graduate-Level Google-Proof Q&A
448 expert-level questions in biology, physics, and chemistry. Designed to be unsearchable.
Metrics: accuracy
Current State of the Art
o3 (OpenAI): 82.8 accuracy
Top Models Performance Comparison
Top 10 models ranked by accuracy
Best Score: 82.8
Top Model: o3
Models Compared: 10
Score Range: 26.8
accuracy (primary metric)
| # | Model | Access | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o3 | API | OpenAI | 82.8 | | Mar 2026 |
| 2 | o4-mini | API | OpenAI | 77.6 | | Mar 2026 |
| 3 | o1 | API | OpenAI | 75.7 | | Mar 2026 |
| 4 | o3-mini | API | OpenAI | 74.9 | | Mar 2026 |
| 5 | o1-preview | | OpenAI | 73.3 | | Mar 2026 |
| 6 | GPT-4.5 Preview | API | OpenAI | 69.5 | | Mar 2026 |
| 7 | GPT-4.1 | API | OpenAI | 66.3 | | Mar 2026 |
| 8 | o1-mini | API | OpenAI | 60 | | Mar 2026 |
| 9 | Claude 3.5 Sonnet | API | Anthropic | 59.4 | | Mar 2026 |
| 10 | Grok 2 | API | xAI | 56 | | Mar 2026 |
| 11 | Llama 3.1 405B | Open Source | Meta | 50.7 | | Mar 2026 |
| 12 | Claude 3 Opus | API | Anthropic | 50.4 | | Mar 2026 |
| 13 | GPT-4o | API | OpenAI | 49.9 | | Mar 2026 |
| 14 | GPT-4 Turbo | API | OpenAI | 49.3 | | Mar 2026 |
| 15 | Gemini 1.5 Pro | API | Google | 46.2 | | Mar 2026 |
| 16 | Llama 3.1 70B | Open Source | Meta | 41.7 | | Mar 2026 |
| 17 | GPT-4o Mini | | OpenAI | 40.2 | | Mar 2026 |
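
The summary figures reported for the comparison (best score, top model, models compared, score range) can be re-derived from the table. The following is a minimal Python sketch, assuming the score range is the gap between the 1st- and 10th-ranked scores, since only ten models are compared. The names `leaderboard` and `summarize` are illustrative and not part of the leaderboard itself.

```python
# (model, accuracy) pairs copied from the table above, in rank order.
leaderboard = [
    ("o3", 82.8), ("o4-mini", 77.6), ("o1", 75.7), ("o3-mini", 74.9),
    ("o1-preview", 73.3), ("GPT-4.5 Preview", 69.5), ("GPT-4.1", 66.3),
    ("o1-mini", 60.0), ("Claude 3.5 Sonnet", 59.4), ("Grok 2", 56.0),
    ("Llama 3.1 405B", 50.7), ("Claude 3 Opus", 50.4), ("GPT-4o", 49.9),
    ("GPT-4 Turbo", 49.3), ("Gemini 1.5 Pro", 46.2),
    ("Llama 3.1 70B", 41.7), ("GPT-4o Mini", 40.2),
]


def summarize(rows, top_n=10):
    """Summarize the top_n rows by score (assumed: range = best - worst of top_n)."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]
    top_model, best_score = ranked[0]
    worst_score = ranked[-1][1]
    return {
        "best_score": best_score,
        "top_model": top_model,
        "models_compared": len(ranked),
        # Round to one decimal to avoid floating-point noise in the difference.
        "score_range": round(best_score - worst_score, 1),
    }


summary = summarize(leaderboard)
print(summary)
```

Under that assumption the computed range matches the reported 26.8 (82.8 minus 56). Passing `top_n=len(leaderboard)` would instead give the spread across all 17 listed models.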