Commonsense Reasoning · 2019 · en
HellaSwag
70K sentence-completion problems testing commonsense natural language inference.
Current State of the Art
GPT-4o (OpenAI): 95.3 accuracy
HellaSwag accuracy: 5 results · 1 SOTA advance · higher is better
Top Models Performance Comparison
Top 5 models ranked by accuracy
Best Score: 95.3
Top Model: GPT-4o
Models Compared: 5
Score Range: 7.3
Metric: accuracy (primary)
| # | Model | Access | Organization | Score | Date |
|---|---|---|---|---|---|
| 1 | GPT-4o | API | OpenAI | 95.3 | Dec 2025 |
| 2 | Gemini 1.5 Pro | API | Google | 92.5 | Dec 2025 |
| 3 | Claude 3.5 Sonnet | API | Anthropic | 89 | Dec 2025 |
| 4 | Llama 3.1 405B | Open Source | Meta | 89 | Mar 2026 |
| 5 | Llama 3 70B | Open Source | Meta | 88 | Dec 2025 |
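
The summary figures (best score, top model, models compared, score range) can be recomputed from the table rows. The sketch below is illustrative only: the `scores` dictionary and the `summarize` helper are hypothetical names, and it assumes "Score Range" means the highest score minus the lowest, which matches the listed 7.3.

```python
# Accuracy scores copied from the leaderboard table.
scores = {
    "GPT-4o": 95.3,
    "Gemini 1.5 Pro": 92.5,
    "Claude 3.5 Sonnet": 89.0,
    "Llama 3.1 405B": 89.0,
    "Llama 3 70B": 88.0,
}


def summarize(scores):
    """Recompute the leaderboard summary figures (hypothetical helper)."""
    top_model = max(scores, key=scores.get)  # higher is better
    best = scores[top_model]
    worst = min(scores.values())
    return {
        "best_score": best,
        "top_model": top_model,
        "models_compared": len(scores),
        # Assumption: range = best minus worst, rounded to one decimal
        # to avoid floating-point noise.
        "score_range": round(best - worst, 1),
    }


summary = summarize(scores)
print(summary)
```

Rounding the range to one decimal keeps it comparable with the one-decimal scores shown in the table.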