Commonsense Reasoning · 2018 · en

AI2 Reasoning Challenge

7,787 grade-school science questions that require reasoning. The Challenge set contains the harder questions that retrieval-based methods fail on.

Metrics: accuracy
Current State of the Art

Claude 3.5 Sonnet

Anthropic

96.7

accuracy

Top Models Performance Comparison

Top 4 models ranked by accuracy

| # | Model | accuracy | % of best |
|---|-------|----------|-----------|
| 1 | Claude 3.5 Sonnet | 96.7 | 100.0% |
| 2 | GPT-4o | 96.4 | 99.7% |
| 3 | Gemini 1.5 Pro | 94.8 | 98.0% |
| 4 | Llama 3 70B | 93.0 | 96.2% |
Best Score: 96.7
Top Model: Claude 3.5 Sonnet
Models Compared: 4
Score Range: 3.7
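
The "% of best" and "Score Range" figures above are simple arithmetic on the four listed accuracies. The sketch below shows one way to reproduce them, assuming "% of best" means each score divided by the top score and "Score Range" means the top score minus the lowest, both rounded to one decimal. The `summarize` helper and the `scores` dictionary are hypothetical names introduced for illustration; they are not part of the leaderboard site.

```python
# Accuracies copied from the leaderboard above.
scores = {
    "Claude 3.5 Sonnet": 96.7,
    "GPT-4o": 96.4,
    "Gemini 1.5 Pro": 94.8,
    "Llama 3 70B": 93.0,
}


def summarize(scores):
    """Return (top model, best score, % of best per model, score range).

    Assumes '% of best' = 100 * score / best score and
    'score range' = best score - lowest score, rounded to one decimal.
    """
    best_model = max(scores, key=scores.get)
    best = scores[best_model]
    pct_of_best = {m: round(100 * s / best, 1) for m, s in scores.items()}
    score_range = round(best - min(scores.values()), 1)
    return best_model, best, pct_of_best, score_range


best_model, best, pct_of_best, score_range = summarize(scores)
for model, pct in pct_of_best.items():
    print(f"{model}: {scores[model]} ({pct}% of best)")
print(f"Score range: {score_range}")
```

Under these assumptions the computed values match the page: GPT-4o comes out at 99.7% of best and the range at 3.7.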

accuracy (Primary)

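| # | Model | Score | Paper / Code | Date |
|---|-------|-------|--------------|------|
| 1 | Claude 3.5 Sonnet (Anthropic, API) | 96.7 | | Dec 2025 |
| 2 | GPT-4o (OpenAI, API) | 96.4 | | Dec 2025 |
| 3 | Gemini 1.5 Pro (Google, API) | 94.8 | | Dec 2025 |
| 4 | Llama 3 70B (Meta, Open Source) | 93.0 | | Dec 2025 |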

Other Commonsense Reasoning Datasets