Commonsense Reasoning | 2019 | en

HellaSwag

70K sentence completion problems testing commonsense natural language inference.

Metrics: accuracy
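As a minimal sketch of the accuracy metric named above, assume each problem offers a fixed set of candidate endings and a model picks one by index. The function name `accuracy` and the toy data are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of items whose predicted ending index equals the gold index."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical toy data: chosen ending index vs. gold ending index for five items.
toy_predictions = [0, 2, 1, 3, 3]
toy_labels = [0, 2, 1, 1, 3]
toy_score = accuracy(toy_predictions, toy_labels)  # 4 of 5 items match
```

Scores on this page are reported on the same 0 to 100 scale, so a value such as 95.3 means 95.3% of problems answered correctly.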
Current State of the Art

GPT-4o (OpenAI): 95.3 accuracy

Top Models Performance Comparison

Top 4 models ranked by accuracy

Rank  Model              Accuracy  % of best
1     GPT-4o             95.3      100.0%
2     Gemini 1.5 Pro     92.5      97.1%
3     Claude 3.5 Sonnet  89.0      93.4%
4     Llama 3 70B        88.0      92.3%

Best Score: 95.3
Top Model: GPT-4o
Models Compared: 4
Score Range: 7.3
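
The summary figures above follow from the four listed scores by simple arithmetic. The sketch below reproduces them, assuming "% of best" is each score divided by the top score and "Score Range" is the top score minus the lowest. The function name `summarize` and the dictionary layout are hypothetical.

```python
# Accuracy scores as listed on this page.
scores = {
    "GPT-4o": 95.3,
    "Gemini 1.5 Pro": 92.5,
    "Claude 3.5 Sonnet": 89.0,
    "Llama 3 70B": 88.0,
}


def summarize(scores: dict) -> dict:
    """Rank models by score and derive the leaderboard summary figures."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_model, best = ranked[0]
    lowest = ranked[-1][1]
    rows = [
        # (rank, model, score, percent of the best score rounded to one decimal)
        (rank, name, score, round(100 * score / best, 1))
        for rank, (name, score) in enumerate(ranked, start=1)
    ]
    return {
        "best_score": best,
        "top_model": top_model,
        "models_compared": len(ranked),
        "score_range": round(best - lowest, 1),
        "rows": rows,
    }


summary = summarize(scores)
```

Rounding to one decimal place matches the precision shown on the page; the unrounded difference 95.3 - 88.0 is subject to ordinary floating-point error.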

accuracy (primary metric)

#  Model                          Organization  Score  Date
1  GPT-4o (API)                   OpenAI        95.3   Dec 2025
2  Gemini 1.5 Pro (API)           Google        92.5   Dec 2025
3  Claude 3.5 Sonnet (API)        Anthropic     89.0   Dec 2025
4  Llama 3 70B (Open Source)      Meta          88.0   Dec 2025

Other Commonsense Reasoning Datasets