Multi-step Reasoning · 2024 · en
Graduate-Level Google-Proof Q&A
448 expert-level questions in biology, physics, and chemistry. Designed to be "Google-proof": the answers should not be findable through a web search.
Metrics: accuracy
Current State of the Art
o1-preview (OpenAI), accuracy: 78.0
Top Models Performance Comparison
Top 4 models ranked by accuracy
Best Score: 78.0
Top Model: o1-preview
Models Compared: 4
Score Range: 31.8
Primary metric: accuracy
| # | Model | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|
| 1 | o1-preview | OpenAI | 78.0 | | Dec 2025 |
| 2 | Claude 3.5 Sonnet (API) | Anthropic | 59.4 | | Dec 2025 |
| 3 | GPT-4o (API) | OpenAI | 53.6 | | Dec 2025 |
| 4 | Gemini 1.5 Pro (API) | Google | 46.2 | | Dec 2025 |
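
The summary figures (best score, top model, models compared, score range) follow directly from the four table rows. The sketch below recomputes them. It assumes that "Score Range" means the highest score minus the lowest; the `summarize` helper and the `scores` dictionary are illustrative names, not part of the leaderboard site.

```python
# Accuracy scores copied from the leaderboard table.
scores = {
    "o1-preview": 78.0,
    "Claude 3.5 Sonnet": 59.4,
    "GPT-4o": 53.6,
    "Gemini 1.5 Pro": 46.2,
}


def summarize(scores):
    """Recompute the leaderboard summary from a {model: accuracy} mapping.

    Assumption: the score range is the best score minus the lowest score.
    """
    # Rank models from highest to lowest accuracy.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_model, best_score = ranked[0]
    lowest_score = ranked[-1][1]
    return {
        "best_score": best_score,
        "top_model": top_model,
        "models_compared": len(ranked),
        # Round to one decimal place to avoid floating-point noise.
        "score_range": round(best_score - lowest_score, 1),
    }


summary = summarize(scores)
print(summary)
```

Under that assumption the range comes out as 78.0 minus 46.2, which is 31.8 and matches the reported figure.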