Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
Beyond raw chain length, multi-step benchmarks probe whether a model can integrate different reasoning types — arithmetic, logic, commonsense, and retrieval — within a single inference chain. MMLU-Pro, BIG-Bench Hard, and MuSR target this capability directly, and even frontier models degrade systematically as step count grows.
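The compounding-error arithmetic above is easy to check directly. A minimal sketch, assuming each step succeeds independently with a fixed probability (a simplification — real per-step error rates vary with step difficulty):

```python
def chain_accuracy(p_step: float, n_steps: int) -> float:
    # Probability an n-step chain is fully correct, assuming
    # independent per-step success probability p_step.
    return p_step ** n_steps

def required_step_accuracy(p_chain: float, n_steps: int) -> float:
    # Inverse view: per-step accuracy needed to hit a target chain accuracy.
    return p_chain ** (1.0 / n_steps)

print(round(chain_accuracy(0.95, 10), 3))          # 0.599 -> the ~60% figure above
print(round(required_step_accuracy(0.90, 20), 3))  # ~0.995 per step for a 20-step chain
```

The inverse calculation explains why 20+ step reasoning is so hard: hitting 90% chain accuracy over 20 steps demands roughly 99.5% reliability at every single step.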
History
2018: HotpotQA released — multi-hop question answering requiring reasoning over multiple documents
2021: StrategyQA introduced — questions requiring implicit multi-step decomposition
2022: BIG-Bench (200+ tasks) includes many multi-step reasoning challenges
2022: Least-to-most prompting (Zhou et al.) decomposes complex problems into sequential sub-problems
2023: Tree-of-Thought (Yao et al.) enables systematic exploration of reasoning paths
2023: MuSR benchmark specifically tests multi-step soft reasoning
2024: MMLU-Pro extends MMLU with harder, multi-step versions of standard questions
2024: OpenAI o1 introduces extended thinking for sustained multi-step reasoning
2024: Claude 3.5 achieves strong multi-step performance through systematic chain-of-thought
2024-2025: o3 and DeepSeek-R1 push multi-step reasoning to new highs via RL on reasoning traces
How Multi-step Reasoning Works
Problem Decomposition
The complex question is broken into a dependency graph of simpler sub-questions, each answerable in one step.
Sub-Problem Solving
Each sub-question is solved independently, potentially using different reasoning types (arithmetic, retrieval, logic).
State Tracking
Intermediate results are maintained in a working memory, tracking which sub-problems are solved and their answers.
Synthesis
Answers to sub-problems are composed — sometimes requiring additional reasoning about how intermediate results combine.
Consistency Verification
The final answer is checked against known constraints and intermediate results for coherence.
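The five stages above can be sketched as a single loop over a working-memory structure. Everything here is illustrative: `decompose`, `solve_step`, and `verify` stand in for model calls and are stubbed to handle one toy question deterministically.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    """Working memory: the question, its sub-questions, and intermediate answers."""
    question: str
    steps: list = field(default_factory=list)    # ordered sub-questions
    answers: dict = field(default_factory=dict)  # solved sub-question -> answer

def decompose(question: str) -> list:
    # Stage 1 (stub): break the question into one-hop sub-questions.
    return ["price of 3 apples at $2 each", "price plus $4 shipping"]

def solve_step(sub_q: str, answers: dict) -> int:
    # Stage 2 (stub): solve one sub-question, reading earlier results from memory.
    if "3 apples" in sub_q:
        return 3 * 2
    return answers["price of 3 apples at $2 each"] + 4

def verify(final: int, state: ReasoningState) -> bool:
    # Stage 5 (stub): cheap consistency check against intermediate results.
    return final >= max(state.answers.values())

def run(question: str):
    state = ReasoningState(question, steps=decompose(question))
    for sub_q in state.steps:                    # Stages 2-3: solve and track state
        state.answers[sub_q] = solve_step(sub_q, state.answers)
    final = state.answers[state.steps[-1]]       # Stage 4: synthesis (stub: last answer)
    return final if verify(final, state) else None

print(run("What do 3 apples at $2 each cost with $4 shipping?"))  # 10
```

In a real system each stub is an LLM call, and the explicit `answers` dict is what separates this architecture from free-form chain-of-thought: intermediate results survive even when the surrounding text drifts.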
Current Landscape
Multi-step reasoning is the meta-capability that separates frontier models from the rest in 2025. Extended thinking models (o1/o3, Claude 3.5, DeepSeek-R1) have made significant progress by learning to sustain reasoning over dozens of steps through reinforcement learning. However, reliability degrades noticeably beyond 8-10 steps, and problems requiring creative decomposition remain challenging. Tool-augmented approaches that offload sub-problems to specialized solvers are increasingly popular in production.
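A tool-augmented chain of the kind described above can be sketched as a dispatcher that routes sub-problems to specialized solvers. The tool registry and request format below are illustrative, not any particular provider's function-calling API:

```python
import ast
import operator

def calculator(expression: str) -> str:
    # Specialized solver: exact arithmetic instead of token prediction.
    # Safely evaluates +, -, *, / over numeric literals via the AST.
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

TOOLS = {"calculator": calculator}  # could also hold search, database lookups, ...

def dispatch(tool_call: dict) -> str:
    # Mid-chain: the model emits a tool request instead of computing
    # in-text, and the runtime fills in the exact result.
    return TOOLS[tool_call["name"]](tool_call["arguments"])

print(dispatch({"name": "calculator", "arguments": "1847 * 392"}))  # 724024
```

The design point: the model's job shrinks to deciding *when* to call a tool and how to phrase the sub-problem; correctness of the intermediate result comes from the solver, not from sampling.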
Key Challenges
Compounding errors — each reasoning step has a failure probability, and errors cascade across a 10-step chain
Working memory limits — models lose track of intermediate results in long reasoning chains, even with scratchpads
Decomposition quality — incorrect problem decomposition dooms the entire reasoning chain regardless of execution quality
Reasoning type switching — real-world problems require fluidly mixing arithmetic, logic, retrieval, and commonsense within one chain
Evaluation complexity — partial credit for mostly-correct reasoning chains is hard to automate
Quick Recommendations
Complex question answering
OpenAI o3 / Claude 3.5 extended thinking
Best sustained multi-step reasoning with explicit intermediate steps
Tool-augmented reasoning
GPT-4o + function calling
Can offload sub-problems to calculators, search, and databases mid-chain
Research baseline
Llama 3.1 70B + tree-of-thought
Open-source baseline with systematic reasoning path exploration
Verified multi-step
LATS (Language Agent Tree Search)
Combines LLM reasoning with Monte Carlo tree search for more reliable multi-step chains
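The tree-of-thought and LATS entries above share a common skeleton: propose candidate next steps, score each partial chain, and keep only the most promising ones. A minimal beam-search sketch, with the LLM proposal and scoring calls stubbed by a toy search for a target sum:

```python
import heapq

def expand(chain):
    # Propose candidate next steps (stub: in a real system, an LLM call).
    return [chain + [step] for step in (0, 1, 2, 3)]

def score(chain, target=7):
    # Value a partial chain (stub: closeness of its sum to the target;
    # in a real system, an LLM- or verifier-based evaluation).
    return -abs(target - sum(chain))

def tree_of_thought(depth=4, beam_width=2):
    beam = [[]]
    for _ in range(depth):
        candidates = [c for chain in beam for c in expand(chain)]
        beam = heapq.nlargest(beam_width, candidates, key=score)  # prune to top-k
    return max(beam, key=score)

print(tree_of_thought())  # [3, 3, 1, 0] -- a 4-step chain hitting the target sum 7
```

LATS extends this skeleton with Monte Carlo tree search: instead of pruning greedily at each depth, it backs up value estimates from rollouts, which is what makes its chains more reliable at the cost of more inference-time compute.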
What's Next
The frontier is reliable 20+ step reasoning with automatic error detection and correction. Expect advances in: (1) learned decomposition strategies, (2) verification modules that catch errors mid-chain, (3) hybrid systems that combine LLM reasoning with symbolic execution for guaranteed intermediate correctness.
Benchmarks & SOTA
GPQA (Graduate-Level Google-Proof Q&A)
448 expert-level questions in biology, physics, and chemistry. Designed to be unsearchable.
State of the Art: Gemini 2.5 Pro (84% accuracy)
BIG-Bench Hard (BBH)
BIG-Bench Hard is a curated subset of 23 challenging tasks from BIG-Bench that require multi-step reasoning, where chain-of-thought prompting significantly helps performance. Tasks include algorithmic reasoning, logical deduction, causal judgment, and more. By 2024–2025, frontier models were approaching saturation (>90%) on BBH, prompting the creation of the harder BBEH variant.
State of the Art: Claude 3.5 Sonnet, Anthropic (93.1% accuracy)
HotpotQA
113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
State of the Art: GPT-4o, OpenAI (71.3 F1)
StrategyQA
2,780 yes/no questions requiring implicit multi-step reasoning to answer.
State of the Art: GPT-4o, OpenAI (82.1% accuracy)
Related Tasks
Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated the early benchmarks (models reached the ~95% human ceiling on HellaSwag by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.