Multi-step Reasoning

Multi-step reasoning (maintaining a coherent inference chain across five or more sequential steps) is the meta-capability that determines whether a model can solve complex real-world problems or only answer one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the gap between single-step and multi-step performance remains one of the widest failure modes of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but errors accumulate across steps: 95% per-step accuracy yields only about 60% end-to-end accuracy over 10 steps, a fundamental scaling challenge.
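The compounding figure above follows from an idealized model that treats each step as an independent pass/fail event, so chain success is simply per-step accuracy raised to the number of steps. A minimal sketch (the function name is illustrative, not from any benchmark's tooling):

```python
# Probability that an n-step reasoning chain succeeds end to end,
# assuming each step succeeds independently with probability p.
# This is an idealized model, not empirical benchmark data.
def chain_accuracy(p: float, n: int) -> float:
    return p ** n

# 95% per-step accuracy over 10 steps compounds to roughly 60%.
print(round(chain_accuracy(0.95, 10), 3))  # 0.599
```

Real chains are not fully independent (an early error often dooms every later step), but the independence assumption already illustrates why long chains degrade so sharply.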

Datasets: 5 · Results: 21 · Canonical metric: accuracy

Canonical Benchmark

GPQA

448 expert-level multiple-choice questions in biology, physics, and chemistry, written by domain PhDs. Designed to be "Google-proof": hard to answer correctly even with unrestricted web search.

Primary metric: accuracy
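Accuracy here is the plain fraction of questions answered correctly, usually reported as a percentage. A minimal sketch under that assumption (the example predictions and labels are illustrative placeholders, not GPQA data):

```python
# Accuracy for a multiple-choice benchmark: percentage of questions
# whose predicted choice exactly matches the gold choice.
def accuracy(predictions: list[str], gold: list[str]) -> float:
    assert len(predictions) == len(gold), "prediction/label length mismatch"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 75.0
```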

Top 10

Leading models on GPQA.

Rank  Model             Accuracy  Year  Source
1     o3                    82.8  2026  paper
2     o4-mini               77.6  2026  paper
3     o1                    75.7  2026  paper
4     o3-mini               74.9  2026  paper
5     o1-preview            73.3  2026  paper
6     gpt-45-preview        69.5  2026  paper
7     gpt-41                66.3  2026  paper
8     o1-mini               60.0  2026  paper
9     claude-35-sonnet      59.4  2026  paper
10    grok-2                56.0  2026  paper


All datasets

5 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.
