Codesota · Tasks · Logical Reasoning

Logical Reasoning.

Logical reasoning (formal deduction, constraint satisfaction, and syllogistic inference) exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks such as LogiQA, FOLIO, and the ReClor reading-comprehension test probe deductive rigor, and performance on them improves substantially with chain-of-thought prompting and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
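
To make the contrast between pattern-matching and proving concrete, here is a minimal sketch of a symbolic validity check. It is a brute-force truth-table test in plain Python, used as a stand-in for a real SAT solver or proof assistant; the function name `entails`, the variable names, and the rain/wet example are all hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, List

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]


def entails(variables: List[str], premises: List[Formula], conclusion: Formula) -> bool:
    """Return True if every truth assignment that satisfies all premises
    also satisfies the conclusion (checked by exhaustive enumeration)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


# Hypothetical toy problem (an assumption, not taken from any benchmark).
variables = ["rain", "wet"]
rain_implies_wet = lambda e: implies(e["rain"], e["wet"])

# Modus ponens: (rain -> wet), rain  |=  wet
valid = entails(variables, [rain_implies_wet, lambda e: e["rain"]], lambda e: e["wet"])

# Affirming the consequent: (rain -> wet), wet  |=  rain  (does not hold)
invalid = entails(variables, [rain_implies_wet, lambda e: e["wet"]], lambda e: e["rain"])
```

Exhaustive enumeration is exponential in the number of variables, so this only works for toy problems. The point is that the answer is correct by construction instead of being a likely continuation.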

Datasets: 4
Results: 12
Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

LogiQA

8,678 logical reasoning questions drawn from the National Civil Servants Examination of China.

Primary metric: accuracy
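
As a rough illustration of the metric, the sketch below computes accuracy for a multiple-choice benchmark as the percentage of questions answered correctly. It assumes the leaderboard scores are percentages and that answers are option labels such as "A" to "D"; the function name and the sample data are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical predictions and gold answers for four questions.
preds = ["A", "C", "B", "D"]
gold = ["A", "B", "B", "D"]
score = accuracy(preds, gold)  # 3 of 4 correct
```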
§ 03 · Top 10

Leading models.

Leading models on LogiQA.

#   Model              Accuracy  Year  Source
1   GPT-4o             56.3      2025  paper
2   Claude 3.5 Sonnet  53.8      2025  paper

§ 04 · All datasets

Tracked datasets.

4 datasets tracked for this task.

LogiQA (canonical) · 2 results · accuracy · Top: GPT-4o 56.3
ARC-AGI-1 · 5 results · accuracy · Top: o3 87.5
ARC-AGI-2 · 3 results · accuracy · Top: Gemini 2.5 Pro 5.00
ReClor · 2 results · accuracy · Top: GPT-4o 72.4
§ 05 · Related tasks

Other tasks in Reasoning.

Arithmetic Reasoning · Commonsense Reasoning · Mathematical Reasoning · Multi-step Reasoning