Logical Reasoning

Logical reasoning (formal deduction, constraint satisfaction, and syllogistic inference) exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks such as LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought prompting and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models often fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable when logical correctness must be guaranteed.

Logical reasoning measures a model's ability to perform deductive, inductive, and abductive inference over structured premises. While frontier LLMs handle simple syllogisms well, they still struggle with multi-hop deduction, negation, and formal logic puzzles that require systematic search.
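Self-consistency decoding, mentioned above, can be sketched as a majority vote over the final answers of several independently sampled reasoning chains. The snippet below is a minimal illustration, not any particular library's implementation: the function name `self_consistency_vote` and the sample answers are hypothetical, and the sampling step itself (calling a model several times at non-zero temperature) is left out.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalize case and whitespace so "True" and "true " count as one answer.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical final answers extracted from five sampled reasoning chains.
sampled = ["True", "true", "False", "True ", "Unknown"]
answer, agreement = self_consistency_vote(sampled)
print(answer, agreement)  # true 0.6
```

The agreement ratio is a rough confidence signal: a low ratio suggests the chains disagree and the answer may deserve symbolic verification.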

History

2015

bAbI tasks (Weston et al.) test basic logical reasoning in neural networks

2019

ReClor benchmark released — reading comprehension requiring logical reasoning from standardized tests

2020

LogiQA extracted from Chinese civil service exams for logical reasoning evaluation

2020

RuleTaker (Clark et al.) shows transformers can learn forward-chaining over explicit rules

2021

ProofWriter extends RuleTaker with proof generation for up to 5 reasoning hops

2022

FOLIO benchmark tests first-order logic reasoning with natural language premises

2023

GPT-4 achieves ~80% on LogiQA but drops significantly on problems requiring 4+ reasoning hops

2023

Logic-LM framework combines LLMs with symbolic solvers for verified logical reasoning

2025

DeepSeek-R1 shows significant improvement on logical reasoning through reinforcement learning

2025

OpenAI o3 and Claude 3.7 Sonnet with extended thinking push multi-hop reasoning to new highs

How Logical Reasoning Works

Figure: Logical Reasoning Pipeline (Premise Extraction, Logical Formalization, Inference Chain Construction, Consistency Checking, Conclusion Extraction)
1

Premise Extraction

The model identifies stated facts, rules, and constraints from the natural language input.

2

Logical Formalization

Implicit logical structure is mapped — identifying conditionals, quantifiers, negations, and conjunctions.

3

Inference Chain Construction

The model builds a chain of deductive steps, applying modus ponens, modus tollens, or other inference rules to derive new conclusions.

4

Consistency Checking

Each step is checked for contradictions with established premises and previously derived facts.

5

Conclusion Extraction

The final conclusion is stated, ideally with the reasoning trace that justifies it.
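Steps 3 to 5 of the pipeline can be sketched symbolically. The example below assumes the premises have already been extracted and formalized as string literals and if-then rules, and it only illustrates forward chaining with modus ponens, a simple contradiction check, and a reasoning trace. The `~` prefix for negation, the predicate strings, and all function names are assumptions made for this sketch, not a description of how any LLM performs these steps internally.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): if every premise holds, the conclusion holds.
Rule = Tuple[FrozenSet[str], str]


def negate(literal: str) -> str:
    """Negation is encoded with a leading '~' (an assumption of this sketch)."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def forward_chain(facts: Set[str], rules: List[Rule]) -> Tuple[Set[str], List[str]]:
    """Apply modus ponens until no new fact appears; return facts and a trace."""
    known = set(facts)
    trace: List[str] = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                # Consistency check: refuse to derive a literal whose negation is known.
                if negate(conclusion) in known:
                    raise ValueError(f"contradiction: {conclusion} vs {negate(conclusion)}")
                known.add(conclusion)
                trace.append(f"{' & '.join(sorted(premises))} -> {conclusion}")
                changed = True
    return known, trace


facts = {"cat(tom)"}
rules = [
    (frozenset({"cat(tom)"}), "mammal(tom)"),
    (frozenset({"mammal(tom)"}), "animal(tom)"),
    (frozenset({"animal(tom)", "cat(tom)"}), "~plant(tom)"),
]
derived, trace = forward_chain(facts, rules)
print("animal(tom)" in derived)  # True
for step in trace:
    print(step)
```

The trace plays the role of the reasoning chain in step 5: each line records which premises justified which derived fact, so a reader or a checker can audit every hop.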

Current Landscape

Logical reasoning in 2025 is a tale of two regimes: frontier LLMs reliably handle simple syllogisms and 1-2 hop reasoning, while complex multi-hop deduction, especially involving negation and quantifiers, remains challenging. The most promising approaches combine LLMs (for natural language understanding and heuristic search) with symbolic solvers (for soundness guarantees). Extended-thinking models such as o3 and DeepSeek-R1, trained with reinforcement learning on reasoning tasks, have significantly improved multi-hop performance.

Key Challenges

Hop degradation — accuracy drops 10-20% per additional reasoning hop beyond 3

Negation blindness — models frequently mishandle double negation, contraposition, and 'none/not all' quantifiers

Distractor sensitivity — irrelevant premises cause models to hallucinate invalid reasoning paths

Formal logic gap — LLMs approximate logical reasoning probabilistically rather than performing sound deduction

Proof faithfulness — generated reasoning chains often contain subtle logical errors even when reaching the correct answer

Quick Recommendations

Production logical QA

OpenAI o3 / Claude 3.7 Sonnet with extended thinking

Best performance on multi-hop deduction with explicit reasoning traces

Verified reasoning

Logic-LM (LLM + Z3/Prover9)

Combines LLM natural language understanding with sound symbolic verification

Research baseline

Llama 3.1 70B + chain-of-thought

Strong open-source baseline for studying logical reasoning capabilities and failures

Formal proofs

Lean 4 + LLM copilot

For applications requiring machine-verifiable logical correctness
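To illustrate what "sound symbolic verification" means in the recommendations above, the sketch below checks whether a conclusion follows from a set of premises by enumerating every truth assignment. It is a pure-Python stand-in for a real solver such as Z3 or Prover9 and does not use their APIs. The names `entails` and `implies` and the rain/wet example are hypothetical, and exhaustive enumeration only scales to a handful of variables.

```python
from itertools import product
from typing import Callable, Dict, List

Formula = Callable[[Dict[str, bool]], bool]


def entails(variables: List[str], premises: List[Formula], conclusion: Formula) -> bool:
    """True if the conclusion holds in every assignment that satisfies all premises."""
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False  # found a counterexample
    return True


def implies(a: bool, b: bool) -> bool:
    """Material implication: false only when a is true and b is false."""
    return (not a) or b


rain_implies_wet: Formula = lambda m: implies(m["rain"], m["wet"])

# Modus tollens: from rain -> wet and not wet, conclude not rain (valid).
valid = entails(["rain", "wet"], [rain_implies_wet, lambda m: not m["wet"]], lambda m: not m["rain"])

# Affirming the consequent: from rain -> wet and wet, conclude rain (invalid).
invalid = entails(["rain", "wet"], [rain_implies_wet, lambda m: m["wet"]], lambda m: m["rain"])

print(valid, invalid)  # True False
```

In a verified-reasoning setup, the LLM would propose the formalization and the candidate conclusion, and a check of this kind would accept or reject it regardless of how persuasive the generated reasoning text looks.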

What's Next

The frontier is neurosymbolic integration — LLMs that can dynamically invoke symbolic reasoners and verify their own deductive steps. Expect to see more tool-augmented reasoning pipelines and training approaches that reward logically sound intermediate steps, not just correct final answers.

Benchmarks & SOTA

Related Tasks

Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.

Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.

Multi-step Reasoning

Multi-step reasoning, the ability to maintain coherent inference chains across 5+ sequential steps, is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains one of the largest weaknesses of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but errors accumulate across steps: if each step is independently correct 95% of the time, a 10-step chain is fully correct only about 60% of the time (0.95^10 ≈ 0.60), a fundamental scaling challenge.
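The error-accumulation figure for multi-step reasoning can be reproduced with a few lines of arithmetic. This sketch assumes that step errors are independent and that a single wrong step invalidates the whole chain. Both are simplifications, and the function name `chain_accuracy` is hypothetical.

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every step is right, assuming independent step errors."""
    return per_step ** steps


for steps in (1, 5, 10, 20):
    print(steps, round(chain_accuracy(0.95, steps), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under these assumptions the decay is geometric, which is why small gains in per-step reliability matter more as chains get longer.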

Commonsense Reasoning

Commonsense reasoning, meaning answering questions that require everyday knowledge about how the physical and social world works, is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (on HellaSwag, top models reached roughly 95%, near the human ceiling, by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
