Commonsense Reasoning

Commonsense reasoning — answering questions that require everyday knowledge of how the physical and social world works — is measured by benchmarks such as CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated these early benchmarks (HellaSwag scores reached roughly 95%, near the human ceiling, by 2023), forcing a shift to harder tests such as ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.

Datasets: 5 · Results: 33 · Canonical metric: accuracy

Canonical benchmark

MMLU

15,908 multiple-choice questions across 57 subjects, from elementary to professional level.

Primary metric: accuracy
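Accuracy here is simply the fraction of multiple-choice questions answered correctly. A minimal sketch, using hypothetical predictions and gold labels (answer-choice indices, not from any real benchmark run):

```python
def accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/label length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical example: 5 questions, 4 answered correctly.
preds = [2, 0, 1, 3, 2]   # model's chosen option per question
gold  = [2, 0, 3, 3, 2]   # reference answers
print(f"{accuracy(preds, gold):.1%}")  # → 80.0%
```

Leaderboard scores like those below are this value over the full test split, reported as a percentage.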

Top 10

Leading models on MMLU.

| Rank | Model            | Accuracy | Year | Source |
|------|------------------|----------|------|--------|
| 1    | o3               | 92.9     | 2026 | paper  |
| 2    | o1               | 91.8     | 2026 | paper  |
| 3    | gpt-45-preview   | 90.8     | 2026 | paper  |
| 4    | o1-preview       | 90.8     | 2026 | paper  |
| 5    | gpt-41           | 90.2     | 2026 | paper  |
| 6    | o4-mini          | 90.0     | 2026 | paper  |
| 7    | llama-31-405b    | 88.6     | 2026 | paper  |
| 8    | deepseek-v3      | 88.5     | 2026 | paper  |
| 9    | claude-35-sonnet | 88.3     | 2026 | paper  |
| 10   | grok-2           | 87.5     | 2026 | paper  |

All datasets

5 datasets tracked for this task.

Related tasks

Other tasks in Reasoning.