Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
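The neurosymbolic point above can be made concrete: compiling a syllogism into propositional logic and checking it exhaustively gives a correctness guarantee that sampled chains of thought cannot. Below is a minimal sketch (the `entails` helper and the lambda encoding are illustrative, not from any benchmark's tooling) that checks semantic entailment by brute-force truth-table enumeration, which is what a SAT solver does efficiently at scale.

```python
from itertools import product

def entails(premises, conclusion, symbols):
    """Return True iff every assignment satisfying all premises
    also satisfies the conclusion (brute-force model checking)."""
    for assignment in product([False, True], repeat=len(symbols)):
        env = dict(zip(symbols, assignment))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found
    return True

# Modus ponens: (p -> q), p |= q  — valid
valid = entails(
    [lambda e: (not e["p"]) or e["q"], lambda e: e["p"]],
    lambda e: e["q"],
    ["p", "q"],
)
print(valid)  # True

# Affirming the consequent: (p -> q), q |= p  — invalid
invalid = entails(
    [lambda e: (not e["p"]) or e["q"], lambda e: e["q"]],
    lambda e: e["p"],
    ["p", "q"],
)
print(invalid)  # False
```

Unlike self-consistency voting over sampled chains, this check cannot be fooled by a plausible-sounding but invalid inference; its cost is the 2^n assignment space, which is why real systems hand the formula to a SAT solver or proof assistant instead.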

Datasets: 4 · Results: 4 · Canonical metric: accuracy · Canonical benchmark: LogiQA

LogiQA

8,678 logical reasoning questions drawn from the National Civil Servants Examination of China.

Primary metric: accuracy
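Accuracy here is simply the fraction of questions answered with the gold option. A minimal sketch (the `accuracy` helper and the sample predictions are hypothetical):

```python
def accuracy(preds, golds):
    """Fraction of exact matches between predicted and gold answer options."""
    assert len(preds) == len(golds), "prediction/gold length mismatch"
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy example over four multiple-choice answers
acc = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])
print(acc)  # 0.75
```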

Top 10

Leading models on LogiQA.

Rank  Model             Accuracy  Year  Source
1     gpt-4o            56.3      2025  paper
2     claude-35-sonnet  53.8      2025  paper

What were you looking for on Logical Reasoning?

Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.

All datasets

4 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.
