Reasoning

Commonsense Reasoning

Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated the early benchmarks (HellaSwag climbed from under 50% for BERT-era models at release to over 95% by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.


Commonsense reasoning evaluates whether models understand everyday physical, social, and temporal knowledge that humans take for granted. Benchmarks like HellaSwag, WinoGrande, and PIQA are nearly saturated by frontier LLMs, pushing the field toward harder compositional and embodied commonsense tasks.

History

2011

Winograd Schema Challenge proposed as an alternative to the Turing test

2018

SWAG benchmark released for grounded commonsense inference; BERT achieves 86%

2019

HellaSwag published — adversarially filtered to be hard for BERT-era models

2019

WinoGrande scales up Winograd schemas to 44K examples

2020

GPT-3 achieves 79.3% on HellaSwag via few-shot prompting

2020

PIQA (Physical Interaction: Question Answering) tests understanding of everyday physical interactions

2023

GPT-4 scores 95.3% on HellaSwag, effectively saturating it

2024

Llama 3.1 405B matches GPT-4 on commonsense benchmarks

2024

ARC-AGI challenge highlights gaps in abstract commonsense pattern reasoning

2025

Frontier models exceed 95% on most commonsense benchmarks; focus shifts to embodied and causal reasoning
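The adversarial filtering that made HellaSwag hard for its contemporaries (see the 2019 entry above) can be caricatured in a few lines: candidate wrong endings that a discriminator can already reject are discarded, leaving only distractors that fool it. The sketch below is a toy under loud assumptions — `GIVEAWAYS` and the word-match discriminator are invented stand-ins; the real pipeline (Zellers et al., 2019) retrains a BERT-scale discriminator at each round.

```python
# Toy sketch of adversarial filtering (the construction behind HellaSwag):
# endings a discriminator can already reject are thrown away, keeping only
# distractors that fool it. `discriminator_rejects` is a trivial stand-in;
# a real pipeline retrains a learned model each filtering round.

GIVEAWAYS = {"suddenly", "magically"}  # hypothetical tell-tale tokens

def discriminator_rejects(ending):
    """Stand-in discriminator: True if it can tell the ending is fake."""
    return any(tok in GIVEAWAYS for tok in ending.lower().split())

def adversarial_filter(candidate_endings):
    # Keep only endings the discriminator fails to reject -- the "hard" ones.
    return [e for e in candidate_endings if not discriminator_rejects(e)]

pool = [
    "magically the cake bakes itself",      # easy negative: filtered out
    "she slides the tray into the oven",    # hard negative: kept
    "suddenly a dragon appears",            # easy negative: filtered out
    "she sets a timer for forty minutes",   # hard negative: kept
]
print(adversarial_filter(pool))
```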

How Commonsense Reasoning Works

Commonsense Reasoning Pipeline
1

Scenario Presentation

The model receives a partially described situation — a sentence to complete, a pronoun to resolve, or a physical scenario to evaluate.

2

Knowledge Retrieval

The model draws on implicit world knowledge encoded in its parameters — physics, social norms, temporal sequences, object affordances.

3

Candidate Evaluation

Multiple possible continuations or answers are scored based on their plausibility given the context.

4

Inference

The model selects the most plausible answer by combining linguistic cues with world knowledge, ruling out physically or socially impossible options.
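The candidate-evaluation and inference steps above are commonly implemented as length-normalized log-likelihood scoring: each candidate ending is scored by the average per-token log-probability an LM assigns it given the context, and the highest-scoring candidate wins. The sketch below shows the mechanics with a toy `token_logprobs` stand-in in place of a real model call.

```python
# Sketch of multiple-choice scoring as used on benchmarks like HellaSwag and
# PIQA: score each candidate by the mean log-probability of its tokens given
# the context, then pick the argmax. `token_logprobs` is a toy stand-in for
# querying a real language model.

import math

def token_logprobs(context, candidate):
    """Toy heuristic scorer: tokens also present in the context count as
    'expected'. A real implementation would ask an LM for per-token logprobs."""
    ctx = set(context.lower().split())
    return [math.log(0.5) if tok.lower() in ctx else math.log(0.1)
            for tok in candidate.split()]

def score_candidate(context, candidate):
    # Length normalization: average (not sum) the token log-probs so longer
    # candidates are not penalized merely for having more tokens.
    lps = token_logprobs(context, candidate)
    return sum(lps) / len(lps)

def pick_answer(context, candidates):
    return max(candidates, key=lambda c: score_candidate(context, c))

context = "She put the kettle on the stove and waited for the water to"
candidates = ["boil on the stove", "freeze into a glacier"]
print(pick_answer(context, candidates))  # toy heuristic favors the first
```

Length normalization matters in practice: without it, short endings win by default, which is one of the surface artifacts adversarial filtering tries to remove.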

Current Landscape

Standard commonsense reasoning benchmarks are effectively solved by 2025 frontier models. The field is bifurcating: applied work uses commonsense as a component of larger reasoning chains, while research pushes toward harder tasks — abstract reasoning (ARC-AGI), causal reasoning, and embodied commonsense in robotic and simulation settings. The key insight is that scale alone solved the easy cases, but genuine understanding of physical and causal mechanisms remains elusive.

Key Challenges

Benchmark saturation — HellaSwag, WinoGrande, and PIQA are all above 95% for frontier models

Surface pattern exploitation — models may use statistical shortcuts rather than genuine understanding

Physical reasoning gap — models still struggle with multi-step physical cause-and-effect chains

Cultural bias — commonsense is culturally dependent, and benchmarks reflect Western/English-speaking norms

Embodied grounding — textual models lack the sensorimotor experience that grounds human commonsense

Quick Recommendations

General commonsense QA

GPT-4o / Claude 3.5 Sonnet

95%+ on all standard benchmarks, reliable for production use

Open-source deployment

Llama 3.1 70B

Competitive with proprietary models on commonsense tasks at much lower cost

Research on harder commonsense

ARC-AGI evaluation suite

Tests abstract pattern reasoning on which current models still fall well short of humans

Physical reasoning research

Multimodal models (GPT-4V, Gemini) + simulation

Vision grounding improves physical commonsense over text-only approaches

What's Next

The frontier is moving toward embodied commonsense — testing whether models can predict the consequences of physical actions in novel scenarios. Expect new benchmarks combining vision, language, and physics simulation, and a growing focus on causal reasoning that cannot be solved by pattern matching alone.

Benchmarks & SOTA

Related Tasks

Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
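The tool-augmented approach mentioned above can be sketched minimally: extract arithmetic expressions from the model's chain of thought and evaluate them exactly instead of trusting the model's own multiplication. The `<<expr>>` marker is loosely modeled on GSM8K-style calculator annotations, but the exact format and regex here are illustrative assumptions.

```python
# Minimal sketch of tool-augmented arithmetic: replace arithmetic expressions
# in a model's reasoning with exact results from a safe "calculator".
# The <<...>> marker convention is an illustrative assumption.

import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr):
    """Safely evaluate a +,-,*,/ arithmetic expression via the AST
    (no eval(), so no arbitrary code execution)."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

def fill_calculator_calls(chain_of_thought):
    # Replace every <<expr>> marker with the exact result of expr.
    return re.sub(r"<<([0-9+\-*/ ]+)>>",
                  lambda m: str(calc(m.group(1))), chain_of_thought)

cot = "Each crate holds 144 eggs, so 17 crates hold <<144 * 17>> eggs."
print(fill_calculator_calls(cot))
```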

Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
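The neurosymbolic idea can be shown in miniature: instead of asking an LM whether a conclusion follows, compile premises and conclusion to propositional formulas and check entailment exhaustively. Truth-table enumeration below is a brute-force stand-in for a real SAT solver, but the verdicts it returns are exact.

```python
# Miniature neurosymbolic check: entailment over propositional formulas by
# truth-table enumeration (a brute-force stand-in for a SAT solver).

from itertools import product

def entails(premises, conclusion, variables):
    """True iff the conclusion holds in every assignment satisfying all
    premises; False as soon as a countermodel is found."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # countermodel: premises true, conclusion false
    return True

# Valid: "if it rains the ground is wet; it rains" |= "the ground is wet"
premises = [lambda e: (not e["rain"]) or e["wet"],  # rain -> wet
            lambda e: e["rain"]]
print(entails(premises, lambda e: e["wet"], ["rain", "wet"]))    # True

# Invalid (affirming the consequent): "rain -> wet; wet" does NOT entail "rain"
premises2 = [lambda e: (not e["rain"]) or e["wet"],
             lambda e: e["wet"]]
print(entails(premises2, lambda e: e["rain"], ["rain", "wet"]))  # False
```

This is why compiled approaches stay reliable where LMs drift: the checker cannot pattern-match its way into affirming the consequent.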

Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
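The inference-time-compute recipe can be caricatured as best-of-n sampling with a verifier plus majority vote (self-consistency). The sketch below uses a fixed list of pretend model samples and a trivial verifier, both invented for illustration; real systems sample full chains of thought and use learned or symbolic verifiers.

```python
# Caricature of inference-time scaling: sample many candidate answers,
# discard ones a cheap verifier rejects, majority-vote the rest.
# The sample list and verifier are illustrative stand-ins.

from collections import Counter

def verify(answer):
    """Stand-in verifier: reject obviously malformed answers."""
    return answer >= 0

def best_of_n(samples):
    kept = [a for a in samples if verify(a)]
    winner, _ = Counter(kept).most_common(1)[0]
    return winner

draws = [42, 41, 42, -7, 42, 52, 42]  # pretend these came from 7 model samples
print(best_of_n(draws))  # majority of verified samples -> 42
```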

Multi-step Reasoning

Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
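The compounding claim at the end of that paragraph is simple to verify: if steps are independent, a chain of k steps at per-step accuracy p succeeds with probability p**k.

```python
# Error accumulation over independent reasoning steps: per-step accuracy p
# compounds to p**k over a k-step chain.

def chain_accuracy(p, k):
    return p ** k

print(round(chain_accuracy(0.95, 10), 3))  # 0.599 -- the ~60% from the text
print(round(chain_accuracy(0.95, 20), 3))  # 0.358 -- it halves again by 20 steps
```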
