Multi-step Reasoning
Complex reasoning that requires multiple inference steps (e.g., HotpotQA).
3 datasets, 8 results
Multi-step reasoning is a core reasoning task. The standard benchmarks used to evaluate models are listed below, along with the current state-of-the-art result on each.
Benchmarks & SOTA
GPQA
Graduate-Level Google-Proof Q&A
2024, 4 results
448 expert-level questions in biology, physics, and chemistry. Designed to be unsearchable.
State of the Art
o1-preview
OpenAI
78
accuracy
HotpotQA
2018, 2 results
113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
State of the Art
GPT-4o
OpenAI
71.3
f1
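The page does not say how the f1 score is computed. A common convention for extractive QA benchmarks of this kind is a token-overlap F1 between the predicted and gold answer strings, averaged over questions. The sketch below illustrates that convention under stated assumptions: the function names `normalize_answer` and `token_f1` and the exact normalization steps (lowercasing, stripping punctuation and English articles) are illustrative choices, not taken from this page or from any official evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: this mirrors typical QA answer normalization; the real
    evaluation script for a given benchmark may differ in details.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall for one answer."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme a prediction that contains the gold answer plus extra tokens keeps full recall but loses precision, so it earns partial credit rather than zero. A dataset-level score would be the mean of `token_f1` over all questions, usually reported on a 0 to 100 scale.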
StrategyQA
2021, 2 results
2,780 yes/no questions requiring implicit multi-step reasoning to answer.
State of the Art
GPT-4o
OpenAI
82.1
accuracy