Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical, social, and temporal world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated these early benchmarks (GPT-4 pushed HellaSwag past 95%, near the human ceiling, by 2023), forcing a shift toward harder compositional and embodied tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
History
2011: Winograd Schema Challenge proposed as an alternative to the Turing test
2018: SWAG benchmark released for grounded commonsense inference; BERT achieves 86%
2019: HellaSwag published, adversarially filtered to be hard for BERT-era models
2019: WinoGrande scales up Winograd schemas to 44K examples
2020: GPT-3 achieves 79.3% on HellaSwag via few-shot prompting
2020: PIQA (Physical Interaction: Question Answering) tests understanding of everyday physical interactions
2023: GPT-4 scores 95.3% on HellaSwag, effectively saturating it
2024: Llama 3.1 405B matches GPT-4 on commonsense benchmarks
2024: ARC-AGI challenge highlights gaps in abstract pattern reasoning
2025: Frontier models exceed 95% on most commonsense benchmarks; focus shifts to embodied and causal reasoning
How Commonsense Reasoning Works
Scenario Presentation
The model receives a partially described situation — a sentence to complete, a pronoun to resolve, or a physical scenario to evaluate.
Knowledge Retrieval
The model draws on implicit world knowledge encoded in its parameters — physics, social norms, temporal sequences, object affordances.
Candidate Evaluation
Multiple possible continuations or answers are scored based on their plausibility given the context.
Inference
The model selects the most plausible answer by combining linguistic cues with world knowledge, ruling out physically or socially impossible options.
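The candidate-evaluation and inference steps above are typically implemented as length-normalized log-likelihood scoring: the language model assigns a conditional log-probability to each candidate ending, and the highest-scoring ending wins. A minimal sketch of that scoring logic, with invented per-token log-probs standing in for a real model's output:

```python
def score_candidate(token_logprobs):
    """Length-normalized log-likelihood: sum of per-token log-probs
    divided by token count, so longer endings aren't penalized."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_most_plausible(candidates):
    """candidates: {ending_text: [per-token log-probs under the LM]}.
    Returns the ending the model finds most plausible."""
    return max(candidates, key=lambda c: score_candidate(candidates[c]))

# Toy numbers standing in for an LM's conditional log-probs of each
# ending given the context "She put the kettle on the stove and ..."
# Note the single very unlikely token in the implausible ending.
candidates = {
    "waited for the water to boil.": [-1.2, -0.8, -0.5, -0.9, -0.4, -0.6],
    "waited for the water to freeze.": [-1.2, -0.8, -0.5, -0.9, -6.3, -0.6],
}
print(pick_most_plausible(candidates))  # -> "waited for the water to boil."
```

This is how HellaSwag-style multiple-choice benchmarks are usually scored in practice; the normalization constant (token count vs. character count) varies by evaluation harness.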
Current Landscape
Standard commonsense reasoning benchmarks are effectively solved by 2025 frontier models. The field is bifurcating: applied work uses commonsense as a component of larger reasoning chains, while research pushes toward harder tasks — abstract reasoning (ARC-AGI), causal reasoning, and embodied commonsense in robotic and simulation settings. The key insight is that scale alone solved the easy cases, but genuine understanding of physical and causal mechanisms remains elusive.
Key Challenges
Benchmark saturation — HellaSwag and PIQA exceed 95% for frontier models, with WinoGrande close behind
Surface pattern exploitation — models may use statistical shortcuts rather than genuine understanding
Physical reasoning gap — models still struggle with multi-step physical cause-and-effect chains
Cultural bias — commonsense is culturally dependent, and benchmarks reflect Western/English-speaking norms
Embodied grounding — textual models lack the sensorimotor experience that grounds human commonsense
Quick Recommendations
General commonsense QA
GPT-4o / Claude 3.5 Sonnet
At or near ceiling on standard commonsense benchmarks; reliable for production use
Open-source deployment
Llama 3.1 70B
Competitive with proprietary models on commonsense tasks at much lower cost
Research on harder commonsense
ARC-AGI evaluation suite
Tests abstract pattern reasoning that current models still fail at
Physical reasoning research
Multimodal models (GPT-4V, Gemini) + simulation
Vision grounding improves physical commonsense over text-only approaches
What's Next
The frontier is moving toward embodied commonsense — testing whether models can predict the consequences of physical actions in novel scenarios. Expect new benchmarks combining vision, language, and physics simulation, and a growing focus on causal reasoning that cannot be solved by pattern matching alone.
Benchmarks & SOTA
MMLU (Massive Multitask Language Understanding): 15,908 multiple-choice questions across 57 subjects, from elementary to professional level.
State of the Art: o3 (OpenAI), 92.9% accuracy

ARC-Challenge (AI2 Reasoning Challenge): 7,787 science questions requiring reasoning; the Challenge set contains the harder questions that retrieval-based methods fail on.
State of the Art: o3 (OpenAI), 98.1% accuracy

HellaSwag: 70K sentence-completion problems testing commonsense natural language inference.
State of the Art: GPT-4o (OpenAI), 95.3% accuracy

CommonsenseQA: 12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
State of the Art: GPT-4o (OpenAI), 85.4% accuracy

WinoGrande: 44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.
State of the Art: GPT-4o (OpenAI), 87.5% accuracy
Related Tasks
Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
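The tool-augmented approach mentioned above amounts to a small interception loop: the model emits a marker for each calculation, and the harness computes it exactly instead of trusting the model's arithmetic. A minimal sketch — the `CALC(...)` marker and the stub model are invented for illustration, not any specific API:

```python
import re

def run_with_calculator(model_step, prompt):
    """Minimal tool loop (a sketch): wherever the model's output contains
    CALC(<expr>), compute the expression exactly and splice the result in."""
    text = model_step(prompt)

    def substitute(match):
        expr = match.group(1)
        # Only plain arithmetic is allowed before eval(), nothing else.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            raise ValueError(f"unsupported expression: {expr!r}")
        return str(eval(expr))

    return re.sub(r"CALC\(([^)]*)\)", substitute, text)

# Stub standing in for an LM that defers multi-digit multiplication
# to the tool rather than guessing the product.
def stub_model(prompt):
    return "48 crates of 36 apples is CALC(48 * 36) apples."

print(run_with_calculator(stub_model, "How many apples?"))
# -> "48 crates of 36 apples is 1728 apples."
```

Real systems (function calling, code interpreters) add structured tool schemas and multi-turn loops, but the division of labor is the same: the model plans, the tool computes.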
Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
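The self-consistency decoding mentioned above reduces to a simple idea: sample several independent reasoning chains, extract each chain's final answer, and take the majority vote. A minimal sketch with hypothetical chain outputs:

```python
from collections import Counter

def self_consistency(final_answers):
    """Self-consistency: majority vote over the final answers extracted
    from independently sampled chain-of-thought completions."""
    return Counter(final_answers).most_common(1)[0][0]

# Five hypothetical chains: three converge on the same (correct) answer
# even though two individual chains went wrong along the way.
chains = ["valid", "invalid", "valid", "valid", "invalid"]
print(self_consistency(chains))  # -> "valid"
```

The vote suppresses errors in individual chains only when mistakes are uncorrelated; on problems where every chain shares the same flawed shortcut, more samples do not help.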
Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
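The error-accumulation figure in the paragraph above follows directly from assuming independent per-step errors, where chain accuracy is the per-step accuracy raised to the number of steps:

```python
def chain_accuracy(per_step_accuracy, steps):
    """Probability that a multi-step chain is fully correct,
    assuming independent errors: p_step ** steps."""
    return per_step_accuracy ** steps

# The compounding claimed in the text: 95% per step over 10 steps.
print(round(chain_accuracy(0.95, 10), 2))  # -> 0.6
```

The independence assumption is a simplification — real chains can self-correct or compound errors — but it captures why long-horizon reasoning degrades so much faster than single-step accuracy suggests.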