Multi-step Reasoning / 2024 / en

Graduate-Level Google-Proof Q&A

448 expert-level questions in biology, physics, and chemistry, designed to be "Google-proof": hard to answer correctly even with unrestricted web search.

Metrics: accuracy
Current State of the Art

o1-preview (OpenAI): 78 accuracy

Top Models Performance Comparison

Top 4 models ranked by accuracy

Rank  Model              Accuracy  % of best
1     o1-preview         78.0      100.0%
2     Claude 3.5 Sonnet  59.4      76.2%
3     GPT-4o             53.6      68.7%
4     Gemini 1.5 Pro     46.2      59.2%

Best Score: 78.0
Top Model: o1-preview
Models Compared: 4
Score Range: 31.8
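
The derived figures above (percent of best, score range) follow from the four accuracy scores by simple arithmetic. The sketch below reproduces them, assuming "% of best" means each score divided by the top score and "Score Range" means the top score minus the lowest score, both rounded to one decimal place. The function name `summarize` and the dictionary layout are illustrative and not part of the leaderboard.

```python
# Accuracy scores copied from the leaderboard (in percent).
scores = {
    "o1-preview": 78.0,
    "Claude 3.5 Sonnet": 59.4,
    "GPT-4o": 53.6,
    "Gemini 1.5 Pro": 46.2,
}


def summarize(scores):
    """Compute the summary figures under the assumptions stated above.

    Assumption: "% of best" = 100 * score / best score,
    and "Score Range" = best score - lowest score.
    """
    best_model = max(scores, key=scores.get)
    best = scores[best_model]
    worst = min(scores.values())
    return {
        "best_model": best_model,
        "best_score": best,
        "models_compared": len(scores),
        "score_range": round(best - worst, 1),
        "pct_of_best": {m: round(100 * s / best, 1) for m, s in scores.items()},
    }


summary = summarize(scores)
for model, pct in summary["pct_of_best"].items():
    print(f"{model}: {scores[model]} ({pct}% of best)")
print("Score range:", summary["score_range"])
```

Under these assumptions the computed values match the page: a score range of 31.8, and 76.2%, 68.7%, and 59.2% of the best score for the second through fourth entries.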

accuracy (Primary)

#  Model                    Organization  Score  Paper / Code  Date
1  o1-preview               OpenAI        78     -             Dec 2025
2  Claude 3.5 Sonnet (API)  Anthropic     59.4   -             Dec 2025
3  GPT-4o (API)             OpenAI        53.6   -             Dec 2025
4  Gemini 1.5 Pro (API)     Google        46.2   -             Dec 2025

Other Multi-step Reasoning Datasets