Multi-step Reasoning (2024, English)

Graduate-Level Google-Proof Q&A

448 expert-level questions in biology, physics, and chemistry, written so that correct answers cannot be found through web search ("Google-proof").

Metrics: accuracy
Current State of the Art: o3 (OpenAI), 82.8 accuracy

Top Models Performance Comparison

[Bar chart: top 10 models ranked by accuracy, from o3 (82.8, 100.0% of best) down to Grok 2 (56.0, 67.6% of best); full scores in the leaderboard table below.]
Best Score: 82.8
Top Model: o3
Models Compared: 10
Score Range: 26.8
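The summary statistics and the "% of best" values in the chart are simple derivations from the raw accuracy scores. A minimal sketch (model names and scores taken from the top-10 list on this page):

```python
# Derive the page's summary stats from the raw accuracy scores (top 10).
scores = {
    "o3": 82.8, "o4-mini": 77.6, "o1": 75.7, "o3-mini": 74.9,
    "o1-preview": 73.3, "GPT-4.5 Preview": 69.5, "GPT-4.1": 66.3,
    "o1-mini": 60.0, "Claude 3.5 Sonnet": 59.4, "Grok 2": 56.0,
}

best = max(scores.values())                           # best score: 82.8
score_range = round(best - min(scores.values()), 1)   # range: 26.8
# "% of best" normalizes each score against the top model.
pct_of_best = {m: round(100 * s / best, 1) for m, s in scores.items()}

print(best, score_range)      # 82.8 26.8
print(pct_of_best["Grok 2"])  # 67.6
```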

accuracy (primary metric)

 #  Model              Access       Organization  Score  Date
 1  o3                 API          OpenAI        82.8   Mar 2026
 2  o4-mini            API          OpenAI        77.6   Mar 2026
 3  o1                 API          OpenAI        75.7   Mar 2026
 4  o3-mini            API          OpenAI        74.9   Mar 2026
 5  o1-preview                      OpenAI        73.3   Mar 2026
 6  GPT-4.5 Preview    API          OpenAI        69.5   Mar 2026
 7  GPT-4.1            API          OpenAI        66.3   Mar 2026
 8  o1-mini            API          OpenAI        60.0   Mar 2026
 9  Claude 3.5 Sonnet  API          Anthropic     59.4   Mar 2026
10  Grok 2             API          xAI           56.0   Mar 2026
11  Llama 3.1 405B     Open Source  Meta          50.7   Mar 2026
12  Claude 3 Opus      API          Anthropic     50.4   Mar 2026
13  GPT-4o             API          OpenAI        49.9   Mar 2026
14  GPT-4 Turbo        API          OpenAI        49.3   Mar 2026
15  Gemini 1.5 Pro     API          Google        46.2   Mar 2026
16  Llama 3.1 70B      Open Source  Meta          41.7   Mar 2026
17  GPT-4o Mini                     OpenAI        40.2   Mar 2026

Other Multi-step Reasoning Datasets