Commonsense Reasoning

Commonsense reasoning — answering questions that require everyday knowledge of how the physical and social world works — is measured by benchmarks such as CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated these early benchmarks (HellaSwag scores reached roughly 95%, near the human ceiling, by 2023), forcing a shift to harder tests such as ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.

Datasets: 5 · Results: 33 · Canonical metric: accuracy

Canonical benchmark

MMLU

15,908 multiple-choice questions across 57 subjects, from elementary to professional level.

Primary metric: accuracy
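Accuracy here is simply the fraction of multiple-choice questions answered correctly. A minimal sketch, using hypothetical predictions and gold labels (answer-choice indices, not from any real benchmark run):

```python
def accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/label length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical example: 5 questions, 4 answered correctly.
preds = [2, 0, 1, 3, 2]   # model's chosen option per question
gold  = [2, 0, 3, 3, 2]   # reference answers
print(f"{accuracy(preds, gold):.1%}")  # → 80.0%
```

Leaderboard scores like those below are this value over the full test split, reported as a percentage.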

Top 10

Leading models on MMLU.

| Rank | Model            | Accuracy | Year | Source |
|------|------------------|----------|------|--------|
| 1    | o3               | 92.9     | 2026 | paper  |
| 2    | o1               | 91.8     | 2026 | paper  |
| 3    | gpt-45-preview   | 90.8     | 2026 | paper  |
| 4    | o1-preview       | 90.8     | 2026 | paper  |
| 5    | gpt-41           | 90.2     | 2026 | paper  |
| 6    | o4-mini          | 90.0     | 2026 | paper  |
| 7    | llama-31-405b    | 88.6     | 2026 | paper  |
| 8    | deepseek-v3      | 88.5     | 2026 | paper  |
| 9    | claude-35-sonnet | 88.3     | 2026 | paper  |
| 10   | grok-2           | 87.5     | 2026 | paper  |

All datasets

5 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.