Multi-step Reasoning · 2018 · en
HotpotQA
113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
Current State of the Art
GPT-4o (OpenAI): 71.3 F1 (primary metric)
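The page does not say how the F1 score is computed. As a rough sketch, assuming the SQuAD-style token-overlap F1 commonly used for extractive QA answers, the metric could look like the code below. The function names `normalize_answer`, `token_f1`, and `corpus_f1` are illustrative, and the official HotpotQA evaluation script may differ in details (for example, in how it treats yes/no answers).

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Assumption: two empty answers count as a match, one empty answer as a miss.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_f1(predictions, golds) -> float:
    """Mean per-question F1, scaled to 0-100 like the scores on this page."""
    scores = [token_f1(p, g) for p, g in zip(predictions, golds)]
    return 100.0 * sum(scores) / len(scores)
```

Under this definition, a prediction of "Paris, France" against a gold answer of "Paris" gets partial credit (precision 0.5, recall 1.0), which is why F1 scores sit above exact-match scores on the same predictions.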
| # | Model | Score (F1) | Paper / Code | Date |
|---|---|---|---|---|
| 1 | GPT-4o (API, OpenAI) | 71.3 | | Dec 2025 |
| 2 | Claude 3.5 Sonnet (API, Anthropic) | 68.5 | | Dec 2025 |