Multi-step Reasoning.

Multi-step reasoning, the ability to maintain a coherent inference chain across five or more sequential steps, is the meta-capability that determines whether a model can solve complex real-world problems or only answer one-hop questions. Benchmarks such as StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the most pronounced failure mode of current LLMs. Techniques such as chain-of-thought, tree-of-thought, and iterative refinement help, but errors accumulate across steps: if each step is correct 95% of the time and errors are independent, a 10-step chain is fully correct only about 60% of the time (0.95^10 ≈ 0.599). This is a fundamental scaling challenge.
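The compounding effect can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes every step must be correct for the chain to count as correct and that step errors are independent, which real models may violate (errors can be correlated, and some methods self-correct). The function names `chain_accuracy` and `required_step_accuracy` are hypothetical helpers, not part of any benchmark toolkit.

```python
def chain_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps are correct, assuming independent step errors."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


def required_step_accuracy(target_chain_accuracy: float, num_steps: int) -> float:
    """Per-step accuracy needed to reach a target chain accuracy (same assumptions)."""
    if not 0.0 <= target_chain_accuracy <= 1.0:
        raise ValueError("target_chain_accuracy must be in [0, 1]")
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return target_chain_accuracy ** (1.0 / num_steps)


if __name__ == "__main__":
    # Show how a fixed 95% per-step accuracy decays as the chain grows.
    for steps in (1, 5, 10, 20):
        print(f"{steps:>2} steps: {chain_accuracy(0.95, steps):.3f}")

    # Per-step accuracy needed for a 10-step chain to be right 90% of the time.
    print(f"needed per step: {required_step_accuracy(0.90, 10):.4f}")
```

Read the other way around, the same formula shows why small per-step gains matter: under these assumptions, a 10-step chain needs roughly 99% per-step accuracy to be correct 90% of the time.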

Datasets: 4 · Results: 53 · Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

GPQA Diamond

Graduate-level science QA benchmark designed to be difficult for non-experts and resistant to simple web lookup. GPQA Diamond is the common frontier reporting split.

Primary metric: accuracy
§ 03 · Top 10

Leading models.

Leading models on GPQA Diamond.

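| # | Model | Accuracy (%) | Year | Source |
|---|-------|--------------|------|--------|
| 1 | Gemini 3 Pro | 91.9 | 2026 | paper |
| 2 | Claude Opus 4.6 | 91.3 | 2026 | paper |
| 3 | Gemini 3 Flash | 90.4 | 2026 | paper |
| 4 | Claude Sonnet 4.6 | 89.9 | 2026 | paper |
| 5 | GPT-5 | 89.0 | 2026 | paper |
| 6 | Grok 4 | 88.0 | 2026 | paper |
| 7 | Gemini 2.5 Pro | 84.0 | 2026 | paper |
| 8 | o3 | 82.8 | 2026 | paper |
| 9 | Gemini 2.5 Flash | 82.8 | 2026 | paper |
| 10 | o4-mini | 77.6 | 2026 | paper |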

§ 04 · All datasets

Tracked datasets.

4 datasets tracked for this task.

| Dataset | Results | Metric | Top model | Top score |
|---------|---------|--------|-----------|-----------|
| GPQA Diamond (canonical) | 33 | accuracy | Gemini 3 Pro | 91.9 |
| HLE | 13 | accuracy | Gemini 3 Pro | 38.3 |
| BIG-Bench Hard | 5 | accuracy | Claude 3.5 Sonnet | 93.1 |
| StrategyQA | 2 | accuracy | GPT-4o | 82.1 |

§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Commonsense Reasoning · Logical Reasoning · Mathematical Reasoning