Commonsense Reasoning · 2018 · en
AI2 Reasoning Challenge
7,787 grade-school, multiple-choice science questions that require reasoning. The Challenge set contains the harder questions, those that retrieval-based methods fail to answer.
Current State of the Art
Claude 3.5 Sonnet (Anthropic): 96.7 accuracy
Top Models Performance Comparison
Top 4 models ranked by accuracy
Best Score: 96.7
Top Model: Claude 3.5 Sonnet
Models Compared: 4
Score Range: 3.7
Accuracy (primary metric)
| # | Model | Access | Organization | Accuracy | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | API | Anthropic | 96.7 | | Dec 2025 |
| 2 | GPT-4o | API | OpenAI | 96.4 | | Dec 2025 |
| 3 | Gemini 1.5 Pro | API | Google | 94.8 | | Dec 2025 |
| 4 | Llama 3 70B | Open Source | Meta | 93 | | Dec 2025 |
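
The summary figures (best score, top model, models compared, score range) follow directly from the table rows. A minimal Python sketch of that arithmetic is shown here; the rows are transcribed by hand from the table, and `summarize` is a hypothetical helper name, not part of any leaderboard tooling.

```python
# Leaderboard rows transcribed from the table: (model, accuracy as listed).
scores = [
    ("Claude 3.5 Sonnet", 96.7),
    ("GPT-4o", 96.4),
    ("Gemini 1.5 Pro", 94.8),
    ("Llama 3 70B", 93.0),
]


def summarize(rows):
    """Recompute the summary-card values from (model, score) pairs.

    `summarize` is an illustrative helper, assumed for this example only.
    """
    best_model, best_score = max(rows, key=lambda row: row[1])
    worst_score = min(score for _, score in rows)
    return {
        "best_score": best_score,
        "top_model": best_model,
        "models_compared": len(rows),
        # Round to one decimal to avoid floating-point noise in the difference.
        "score_range": round(best_score - worst_score, 1),
    }


summary = summarize(scores)
print(summary)
```

Rounding the range to one decimal place matches the precision of the listed scores; without it, the subtraction of two floats would carry a small binary representation error.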