Multi-step Reasoning

Multi-step reasoning — maintaining a coherent inference chain across five or more sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the gap between single-step and multi-step performance remains one of the widest failure modes of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but errors compound across steps: 95% per-step accuracy yields only about 60% end-to-end accuracy over 10 steps — a fundamental scaling challenge.
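The compounding claim above follows from a simple model: if each step succeeds independently with probability p, the whole n-step chain succeeds with probability p^n. A minimal sketch (the independence assumption is a simplification; real step errors are often correlated):

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step in an n-step chain is correct,
    assuming steps succeed independently with probability per_step."""
    return per_step ** steps

# 95% per-step accuracy over 10 steps ≈ 59.9% end-to-end.
print(f"{chain_accuracy(0.95, 10):.3f}")  # → 0.599
```

The same model also shows why long chains are so punishing: even 99% per-step accuracy drops below 37% by step 100.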

Datasets: 4
Results: 21
Canonical metric: accuracy

Canonical benchmark

GPQA

448 expert-level questions in biology, physics, and chemistry, written to be "Google-proof": not answerable by web search alone.

Primary metric: accuracy

Top 10

Leading models on GPQA.

Rank  Model             Accuracy  Year  Source
1     o3                82.8      2026  paper
2     o4-mini           77.6      2026  paper
3     o1                75.7      2026  paper
4     o3-mini           74.9      2026  paper
5     o1-preview        73.3      2026  paper
6     gpt-45-preview    69.5      2026  paper
7     gpt-41            66.3      2026  paper
8     o1-mini           60.0      2026  paper
9     claude-35-sonnet  59.4      2026  paper
10    grok-2            56.0      2026  paper

All datasets

4 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.