Commonsense Reasoning · 2018 · en

AI2 Reasoning Challenge

7,787 grade-school science questions that require reasoning. The Challenge set contains the harder questions that retrieval-based methods fail on.

Metrics: accuracy
Current State of the Art

o3

OpenAI

98.1

accuracy

ARC-Challenge — accuracy

10 results · 2 SOTA advances · higher is better

[Chart: accuracy by year (2025 to 2027), showing all results and the SOTA frontier; labeled points: Claude 3.5 Sonnet, o3]

accuracy Progress Over Time

Showing 2 breakthroughs from Jan 2025 to Mar 2026

[Chart: accuracy (axis 97.0 to 98.2) by date, Jan 2025 to Mar 2026]

Key Milestones

Jan 2025
DeepSeek-R1

0-shot. Source: DeepSeek-R1 paper Table 3, arxiv:2501.12948 (Jan 2025).

97.1
Mar 2026
o3 (Current SOTA)

0-shot. Source: OpenAI simple-evals (2025).

98.1
+1.0 percentage points over the previous milestone

Total Improvement: 1.0 percentage points
Time Span: 1y 3m
Breakthroughs: 2
Current SOTA: 98.1

Top Models Performance Comparison

Top 10 models ranked by accuracy

Rank  Model              accuracy  % of best
1     o3                 98.1      100.0%
2     Gemini 2.5 Pro     97.8      99.7%
3     Llama-4-Maverick   97.4      99.3%
4     o4-mini            97.3      99.2%
5     DeepSeek-R1        97.1      99.0%
6     Llama 3.1 405B     96.9      98.8%
7     Claude 3.5 Sonnet  96.7      98.6%
8     GPT-4o             96.4      98.3%
9     Gemini 1.5 Pro     94.8      96.6%
10    Llama 3 70B        93.0      94.8%

Best Score: 98.1
Top Model: o3
Models Compared: 10
Score Range: 5.1
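
The summary figures above follow from the scores by simple arithmetic. The sketch below recomputes them, assuming that "% of best" means each score divided by the top score and that "Score Range" means best minus worst; the page does not state either definition, and the function name `summarize` is illustrative only.

```python
# Scores copied from the leaderboard on this page (accuracy, in percent).
scores = {
    "o3": 98.1,
    "Gemini 2.5 Pro": 97.8,
    "Llama-4-Maverick": 97.4,
    "o4-mini": 97.3,
    "DeepSeek-R1": 97.1,
    "Llama 3.1 405B": 96.9,
    "Claude 3.5 Sonnet": 96.7,
    "GPT-4o": 96.4,
    "Gemini 1.5 Pro": 94.8,
    "Llama 3 70B": 93.0,
}


def summarize(scores):
    """Return (top model, best score, score range, % of best per model).

    Assumed definitions: % of best = 100 * score / best score,
    score range = best score - worst score, both rounded to one decimal.
    """
    best_model = max(scores, key=scores.get)
    best = scores[best_model]
    pct_of_best = {m: round(100 * s / best, 1) for m, s in scores.items()}
    score_range = round(best - min(scores.values()), 1)
    return best_model, best, score_range, pct_of_best


best_model, best, score_range, pct_of_best = summarize(scores)
print(best_model, best, score_range)
for model, pct in pct_of_best.items():
    print(f"{model}: {pct}%")
```

Under these assumed definitions the recomputed values agree with the listed ones, for example 99.7% for Gemini 2.5 Pro and a score range of 5.1.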

accuracy (Primary)

#   Model              Access       Organization  Score  Date
1   o3                 API          OpenAI        98.1   Mar 2026
2   Gemini 2.5 Pro     API          Google        97.8   Mar 2026
3   Llama-4-Maverick   Open Source  Meta          97.4   Mar 2026
4   o4-mini            API          OpenAI        97.3   Mar 2026
5   DeepSeek-R1        Open Source  DeepSeek      97.1   Mar 2026
6   Llama 3.1 405B     Open Source  Meta          96.9   Mar 2026
7   Claude 3.5 Sonnet  API          Anthropic     96.7   Dec 2025
8   GPT-4o             API          OpenAI        96.4   Dec 2025
9   Gemini 1.5 Pro     API          Google        94.8   Dec 2025
10  Llama 3 70B        Open Source  Meta          93.0   Dec 2025
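
One way to derive a "SOTA frontier" from dated results like these is a running maximum over time. The sketch below assumes a SOTA advance is any result that strictly exceeds every earlier-dated score, with only the top score of a given month eligible; the page does not state its exact rule, and `Result` and `sota_frontier` are hypothetical names. Dates are the ones listed in the table, written as `YYYY-MM`.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    score: float  # accuracy, in percent
    date: str     # "YYYY-MM", as listed in the table


# Rows copied from the leaderboard table on this page.
results = [
    Result("o3", 98.1, "2026-03"),
    Result("Gemini 2.5 Pro", 97.8, "2026-03"),
    Result("Llama-4-Maverick", 97.4, "2026-03"),
    Result("o4-mini", 97.3, "2026-03"),
    Result("DeepSeek-R1", 97.1, "2026-03"),
    Result("Llama 3.1 405B", 96.9, "2026-03"),
    Result("Claude 3.5 Sonnet", 96.7, "2025-12"),
    Result("GPT-4o", 96.4, "2025-12"),
    Result("Gemini 1.5 Pro", 94.8, "2025-12"),
    Result("Llama 3 70B", 93.0, "2025-12"),
]


def sota_frontier(results):
    """Return the results that set a new best score, oldest first."""
    frontier = []
    best = float("-inf")
    # Oldest month first; within a month, highest score first, so only
    # that month's top result can advance the frontier.
    for r in sorted(results, key=lambda r: (r.date, -r.score)):
        if r.score > best:
            frontier.append(r)
            best = r.score
    return frontier


for r in sota_frontier(results):
    print(r.date, r.model, r.score)
```

Applied to the ten rows above, this rule selects Claude 3.5 Sonnet (Dec 2025) and then o3 (Mar 2026), which is consistent with the "2 SOTA advances" count shown for this metric.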

Other Commonsense Reasoning Datasets