Commonsense Reasoning · 2021 · English

Massive Multitask Language Understanding

15,908 multiple-choice questions across 57 subjects, ranging in difficulty from elementary to advanced professional level.

Metrics: accuracy
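Accuracy here is simply the fraction of questions for which a model selects the gold answer choice. A minimal sketch of that scoring, assuming a hypothetical `predict(question, choices)` function that returns the index of the model's chosen option (the `question`/`choices`/`answer` field names are illustrative, not a fixed API):

```python
from typing import Callable, Dict, List

def mmlu_accuracy(examples: List[Dict], predict: Callable[[str, List[str]], int]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    correct = 0
    for ex in examples:
        # Each item holds a question, its answer choices, and the gold answer index.
        if predict(ex["question"], ex["choices"]) == ex["answer"]:
            correct += 1
    return correct / len(examples)

# Toy usage with a single made-up question; a real run would iterate over the full test set.
sample = [{
    "question": "Which gas makes up most of Earth's atmosphere?",
    "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "answer": 1,
}]
print(mmlu_accuracy(sample, lambda q, c: 1))  # 1.0
```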
Current State of the Art

o1-preview (OpenAI): 92.3 accuracy

Top Models Performance Comparison

Top 6 models ranked by accuracy

| # | Model | Accuracy | % of best |
|---|-------|----------|-----------|
| 1 | o1-preview | 92.3 | 100.0% |
| 2 | GPT-4o | 88.7 | 96.1% |
| 3 | Claude 3.5 Sonnet | 88.7 | 96.1% |
| 4 | DeepSeek V3 | 88.5 | 95.9% |
| 5 | Gemini 1.5 Pro | 85.9 | 93.1% |
| 6 | Llama 3 70B | 82.0 | 88.8% |
Best Score: 92.3
Top Model: o1-preview
Models Compared: 6
Score Range: 10.3
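The summary figures and the "% of best" column follow directly from the six listed accuracies; a small sketch of that arithmetic, with the model names and scores copied from the table above:

```python
# Scores copied from the leaderboard above (accuracy, higher is better).
scores = {
    "o1-preview": 92.3,
    "GPT-4o": 88.7,
    "Claude 3.5 Sonnet": 88.7,
    "DeepSeek V3": 88.5,
    "Gemini 1.5 Pro": 85.9,
    "Llama 3 70B": 82.0,
}

best, worst = max(scores.values()), min(scores.values())
print(f"Best score: {best}")               # 92.3
print(f"Score range: {best - worst:.1f}")  # 92.3 - 82.0 = 10.3
for model, score in scores.items():
    # "% of best" normalises each score against the top entry.
    print(f"{model}: {score / best:.1%} of best")  # e.g. Llama 3 70B: 88.8% of best
```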

Primary metric: accuracy

| # | Model | Organization | Access | Score | Date |
|---|-------|--------------|--------|-------|------|
| 1 | o1-preview | OpenAI | | 92.3 | Dec 2025 |
| 2 | GPT-4o | OpenAI | API | 88.7 | Dec 2025 |
| 3 | Claude 3.5 Sonnet | Anthropic | API | 88.7 | Dec 2025 |
| 4 | DeepSeek V3 | DeepSeek | Open Source | 88.5 | Dec 2025 |
| 5 | Gemini 1.5 Pro | Google | API | 85.9 | Dec 2025 |
| 6 | Llama 3 70B | Meta | Open Source | 82.0 | Dec 2025 |

