Commonsense Reasoning
Reasoning about everyday situations (CommonsenseQA, HellaSwag).
Below are the standard benchmarks used to evaluate models on commonsense reasoning, along with current state-of-the-art results.
Benchmarks & SOTA
MMLU (Massive Multitask Language Understanding)
15,908 multiple-choice questions across 57 subjects, from elementary to professional level.
State of the art: o1-preview (OpenAI), 92.3% accuracy
ARC-Challenge (AI2 Reasoning Challenge)
7,787 science questions requiring reasoning; the Challenge set contains harder questions that retrieval-based methods fail on.
State of the art: Claude 3.5 Sonnet (Anthropic), 96.7% accuracy
HellaSwag
70K sentence-completion problems testing commonsense natural language inference.
State of the art: GPT-4o (OpenAI), 95.3% accuracy
CommonsenseQA
12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
State of the art: GPT-4o (OpenAI), 85.4% accuracy
WinoGrande
44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.
State of the art: GPT-4o (OpenAI), 87.5% accuracy