Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (HellaSwag went from 95% to near-ceiling by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
MMLU
15,908 multiple choice questions across 57 subjects from elementary to professional level.
Top 10
Leading models on MMLU.
What were you looking for on Commonsense Reasoning?
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
All datasets
6 datasets tracked for this task.
Related tasks
Other tasks in Reasoning.
Didn't find what you came for?
Still looking for something on Commonsense Reasoning? A missing model, a stale score, a benchmark we should cover — drop it here and we'll handle it.
Real humans read every message. We track what people are asking for and prioritize accordingly.