Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
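The contrast between pattern-matching and proving can be made concrete. The sketch below is a minimal, illustrative symbolic entailment check in the spirit of the neurosymbolic approaches mentioned above: rather than trusting a sampled answer, it verifies a propositional inference exhaustively over the truth table. All function and variable names are illustrative, not drawn from any benchmark's codebase.

```python
# Brute-force propositional entailment check (illustrative sketch).
# A SAT solver or proof assistant scales far better; enumeration is
# used here only to keep the example self-contained.
from itertools import product

def entails(premises, conclusion, variables):
    """Return True iff every assignment satisfying all premises
    also satisfies the conclusion (truth-table enumeration).
    premises and conclusion map an assignment dict to bool."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(p(assignment) for p in premises) and not conclusion(assignment):
            return False  # found a countermodel
    return True

# Valid modus ponens: "rain -> wet, rain |= wet"
premises = [lambda a: (not a["rain"]) or a["wet"],
            lambda a: a["rain"]]
assert entails(premises, lambda a: a["wet"], ["rain", "wet"])

# Affirming the consequent is invalid: "rain -> wet, wet |/= rain"
assert not entails([lambda a: (not a["rain"]) or a["wet"],
                    lambda a: a["wet"]],
                   lambda a: a["rain"], ["rain", "wet"])
```

Unlike a language model's answer, the check either exhibits a countermodel or certifies validity, which is why solver-backed pipelines remain more reliable on multi-step deduction.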

4 datasets · 4 results · canonical metric: accuracy

Canonical benchmark

LogiQA

8,678 logical reasoning questions drawn from the National Civil Servants Examination of China.

Primary metric: accuracy

Top 10

Leading models on LogiQA.

| Rank | Model | Accuracy | Year | Source |
|------|-------------------|----------|------|--------|
| 1 | gpt-4o | 56.3 | 2025 | paper |
| 2 | claude-35-sonnet | 53.8 | 2025 | paper |

All datasets

4 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.