Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical, social, and temporal world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated these early benchmarks (GPT-4 pushed HellaSwag past 95%, near the human ceiling, by 2023), forcing a shift toward harder compositional and embodied tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, yet adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
History
2011: Winograd Schema Challenge proposed as an alternative to the Turing test
2018: SWAG benchmark released for grounded commonsense inference; BERT achieves 86%
2019: HellaSwag published, adversarially filtered to be hard for BERT-era models
2019: WinoGrande scales up Winograd schemas to 44K examples
2020: GPT-3 achieves 79.3% on HellaSwag via few-shot prompting
2020: PIQA (Physical Interaction: Question Answering) tests understanding of everyday physical interactions
2023: GPT-4 scores 95.3% on HellaSwag, effectively saturating it
2024: Llama 3.1 405B matches GPT-4 on commonsense benchmarks
2024: ARC-AGI challenge highlights gaps in abstract pattern reasoning
2025: Frontier models exceed 95% on most commonsense benchmarks; focus shifts to embodied and causal reasoning
How Commonsense Reasoning Works
Scenario Presentation
The model receives a partially described situation — a sentence to complete, a pronoun to resolve, or a physical scenario to evaluate.
Knowledge Retrieval
The model draws on implicit world knowledge encoded in its parameters — physics, social norms, temporal sequences, object affordances.
Candidate Evaluation
Multiple possible continuations or answers are scored based on their plausibility given the context.
Inference
The model selects the most plausible answer by combining linguistic cues with world knowledge, ruling out physically or socially impossible options.
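The candidate-evaluation and inference steps above are typically implemented as length-normalized log-likelihood scoring: the language model assigns a conditional log-probability to each candidate ending, and the highest-scoring ending wins. A minimal sketch of that scoring logic, with invented per-token log-probs standing in for a real model's output:

```python
def score_candidate(token_logprobs):
    """Length-normalized log-likelihood: sum of per-token log-probs
    divided by token count, so longer endings aren't penalized."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_most_plausible(candidates):
    """candidates: {ending_text: [per-token log-probs under the LM]}.
    Returns the ending the model finds most plausible."""
    return max(candidates, key=lambda c: score_candidate(candidates[c]))

# Toy numbers standing in for an LM's conditional log-probs of each
# ending given the context "She put the kettle on the stove and ..."
# Note the single very unlikely token in the implausible ending.
candidates = {
    "waited for the water to boil.": [-1.2, -0.8, -0.5, -0.9, -0.4, -0.6],
    "waited for the water to freeze.": [-1.2, -0.8, -0.5, -0.9, -6.3, -0.6],
}
print(pick_most_plausible(candidates))  # -> "waited for the water to boil."
```

This is how HellaSwag-style multiple-choice benchmarks are usually scored in practice; the normalization constant (token count vs. character count) varies by evaluation harness.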
Current Landscape
Standard commonsense reasoning benchmarks are effectively solved by 2025 frontier models. The field is bifurcating: applied work uses commonsense as a component of larger reasoning chains, while research pushes toward harder tasks — abstract reasoning (ARC-AGI), causal reasoning, and embodied commonsense in robotic and simulation settings. The key insight is that scale alone solved the easy cases, but genuine understanding of physical and causal mechanisms remains elusive.
Key Challenges
Benchmark saturation — HellaSwag and PIQA exceed 95% for frontier models, with WinoGrande close behind
Surface pattern exploitation — models may use statistical shortcuts rather than genuine understanding
Physical reasoning gap — models still struggle with multi-step physical cause-and-effect chains
Cultural bias — commonsense is culturally dependent, and benchmarks reflect Western/English-speaking norms
Embodied grounding — textual models lack the sensorimotor experience that grounds human commonsense
Quick Recommendations
General commonsense QA
GPT-4o / Claude 3.5 Sonnet
At or near ceiling on standard commonsense benchmarks; reliable for production use
Open-source deployment
Llama 3.1 70B
Competitive with proprietary models on commonsense tasks at much lower cost
Research on harder commonsense
ARC-AGI evaluation suite
Tests abstract pattern reasoning that current models still fail at
Physical reasoning research
Multimodal models (GPT-4V, Gemini) + simulation
Vision grounding improves physical commonsense over text-only approaches
What's Next
The frontier is moving toward embodied commonsense — testing whether models can predict the consequences of physical actions in novel scenarios. Expect new benchmarks combining vision, language, and physics simulation, and a growing focus on causal reasoning that cannot be solved by pattern matching alone.
Benchmarks & SOTA
MMLU (Massive Multitask Language Understanding): 15,908 multiple-choice questions across 57 subjects, from elementary to professional level.
State of the Art: o3 (OpenAI), 92.9% accuracy

ARC-Challenge (AI2 Reasoning Challenge): 7,787 science questions requiring reasoning; the Challenge set contains the harder questions that retrieval-based methods fail on.
State of the Art: o3 (OpenAI), 98.1% accuracy

HellaSwag: 70K sentence-completion problems testing commonsense natural language inference.
State of the Art: GPT-4o (OpenAI), 95.3% accuracy

CommonsenseQA: 12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
State of the Art: GPT-4o (OpenAI), 85.4% accuracy

WinoGrande: 44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.
State of the Art: GPT-4o (OpenAI), 87.5% accuracy
Related Tasks
Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
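The tool-augmented approach mentioned above amounts to a small interception loop: the model emits a marker for each calculation, and the harness computes it exactly instead of trusting the model's arithmetic. A minimal sketch — the `CALC(...)` marker and the stub model are invented for illustration, not any specific API:

```python
import re

def run_with_calculator(model_step, prompt):
    """Minimal tool loop (a sketch): wherever the model's output contains
    CALC(<expr>), compute the expression exactly and splice the result in."""
    text = model_step(prompt)

    def substitute(match):
        expr = match.group(1)
        # Only plain arithmetic is allowed before eval(), nothing else.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            raise ValueError(f"unsupported expression: {expr!r}")
        return str(eval(expr))

    return re.sub(r"CALC\(([^)]*)\)", substitute, text)

# Stub standing in for an LM that defers multi-digit multiplication
# to the tool rather than guessing the product.
def stub_model(prompt):
    return "48 crates of 36 apples is CALC(48 * 36) apples."

print(run_with_calculator(stub_model, "How many apples?"))
# -> "48 crates of 36 apples is 1728 apples."
```

Real systems (function calling, code interpreters) add structured tool schemas and multi-turn loops, but the division of labor is the same: the model plans, the tool computes.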
Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
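The self-consistency decoding mentioned above reduces to a simple idea: sample several independent reasoning chains, extract each chain's final answer, and take the majority vote. A minimal sketch with hypothetical chain outputs:

```python
from collections import Counter

def self_consistency(final_answers):
    """Self-consistency: majority vote over the final answers extracted
    from independently sampled chain-of-thought completions."""
    return Counter(final_answers).most_common(1)[0][0]

# Five hypothetical chains: three converge on the same (correct) answer
# even though two individual chains went wrong along the way.
chains = ["valid", "invalid", "valid", "valid", "invalid"]
print(self_consistency(chains))  # -> "valid"
```

The vote suppresses errors in individual chains only when mistakes are uncorrelated; on problems where every chain shares the same flawed shortcut, more samples do not help.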
Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
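The error-accumulation figure in the paragraph above follows directly from assuming independent per-step errors, where chain accuracy is the per-step accuracy raised to the number of steps:

```python
def chain_accuracy(per_step_accuracy, steps):
    """Probability that a multi-step chain is fully correct,
    assuming independent errors: p_step ** steps."""
    return per_step_accuracy ** steps

# The compounding claimed in the text: 95% per step over 10 steps.
print(round(chain_accuracy(0.95, 10), 2))  # -> 0.6
```

The independence assumption is a simplification — real chains can self-correct or compound errors — but it captures why long-horizon reasoning degrades so much faster than single-step accuracy suggests.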