Commonsense Reasoning · 2018 · en
AI2 Reasoning Challenge
7,787 grade-school science questions that require reasoning. The Challenge set contains the harder questions, those that retrieval-based methods fail to answer correctly.
Current State of the Art
o3 (OpenAI): 98.1 accuracy
ARC-Challenge — accuracy
10 results · 2 SOTA advances · higher is better
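The page does not spell out how accuracy is computed. A common reading for multiple-choice benchmarks is the percentage of questions whose predicted answer label matches the answer key. The following is a minimal sketch under that assumption; the function name `accuracy_percent` and the sample labels are hypothetical and are not taken from the leaderboard.

```python
def accuracy_percent(predictions, answer_keys):
    """Return the percentage of predictions that match the answer keys.

    Assumption: one predicted label per question, scored by exact match.
    """
    if len(predictions) != len(answer_keys):
        raise ValueError("predictions and answer_keys must have the same length")
    if not answer_keys:
        raise ValueError("at least one question is required")
    correct = sum(p == k for p, k in zip(predictions, answer_keys))
    return 100.0 * correct / len(answer_keys)


# Made-up example: four questions, three answered correctly.
predicted = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]
score = accuracy_percent(predicted, gold)
```

Under this reading, a score of 98.1 would mean that roughly 98 of every 100 questions were answered correctly.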
Accuracy Progress Over Time
Showing 2 breakthroughs from Jan 2025 to Mar 2026
Key Milestones
Total Improvement: 1.0%
Time Span: 1y 3m
Breakthroughs: 2
Current SOTA: 98.1
Top Models Performance Comparison
Top 10 models ranked by accuracy
Best Score: 98.1
Top Model: o3
Models Compared: 10
Score Range: 5.1
Metric: accuracy (primary)
| # | Model | Access | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | o3 | API | OpenAI | 98.1 | | Mar 2026 |
| 2 | Gemini 2.5 Pro | API | Google | 97.8 | | Mar 2026 |
| 3 | Llama-4-Maverick | Open Source | Meta | 97.4 | | Mar 2026 |
| 4 | o4-mini | API | OpenAI | 97.3 | | Mar 2026 |
| 5 | DeepSeek-R1 | Open Source | DeepSeek | 97.1 | | Mar 2026 |
| 6 | Llama 3.1 405B | Open Source | Meta | 96.9 | | Mar 2026 |
| 7 | Claude 3.5 Sonnet | API | Anthropic | 96.7 | | Dec 2025 |
| 8 | GPT-4o | API | OpenAI | 96.4 | | Dec 2025 |
| 9 | Gemini 1.5 Pro | API | Google | 94.8 | | Dec 2025 |
| 10 | Llama 3 70B | Open Source | Meta | 93 | | Dec 2025 |
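
The summary figures reported on this page (best score, top model, models compared, score range) can be rechecked from the table rows. The sketch below assumes that "Score Range" means the highest score minus the lowest score; the variable names are illustrative only.

```python
# (model, accuracy) pairs copied from the leaderboard table.
rows = [
    ("o3", 98.1),
    ("Gemini 2.5 Pro", 97.8),
    ("Llama-4-Maverick", 97.4),
    ("o4-mini", 97.3),
    ("DeepSeek-R1", 97.1),
    ("Llama 3.1 405B", 96.9),
    ("Claude 3.5 Sonnet", 96.7),
    ("GPT-4o", 96.4),
    ("Gemini 1.5 Pro", 94.8),
    ("Llama 3 70B", 93.0),
]

# Top entry is the row with the highest accuracy.
best_model, best_score = max(rows, key=lambda row: row[1])
lowest_score = min(score for _, score in rows)

# Round to one decimal place to avoid floating-point noise in the difference.
score_range = round(best_score - lowest_score, 1)
models_compared = len(rows)

print(f"Best Score: {best_score}")
print(f"Top Model: {best_model}")
print(f"Models Compared: {models_compared}")
print(f"Score Range: {score_range}")
```

With the rows as listed, the recomputed values agree with the figures shown in the comparison summary.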