Commonsense Reasoning · 2019 · English

HellaSwag

70K sentence completion problems testing commonsense natural language inference.

Metrics: accuracy
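HellaSwag is a four-way multiple-choice task: given a context, the model must pick the most plausible ending, and accuracy is the fraction of items where the top-scoring ending matches the label. A minimal sketch of that evaluation loop, with illustrative field names and a toy scorer standing in for a real language-model likelihood scorer (this is not the official harness):

```python
# Sketch of HellaSwag-style multiple-choice scoring. Field names
# ('ctx', 'endings', 'label') and the scorer are illustrative assumptions.

def evaluate(items, score_fn):
    """items: list of dicts with 'ctx', 'endings', 'label'.
    score_fn(ctx, ending) -> float; higher means more plausible."""
    correct = 0
    for item in items:
        scores = [score_fn(item["ctx"], e) for e in item["endings"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += pred == item["label"]
    return correct / len(items)

# Toy scorer: prefers the longer ending (stand-in for an LM log-likelihood).
toy_items = [
    {"ctx": "A", "endings": ["x", "longer ending", "y", "z"], "label": 1},
    {"ctx": "B", "endings": ["short", "s", "t", "u"], "label": 0},
]
print(evaluate(toy_items, lambda ctx, e: len(e)))  # → 1.0
```

In practice the scorer is typically the model's (length-normalized) log-likelihood of each ending given the context; the argmax-then-compare step is the same.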
Current State of the Art

GPT-4o (OpenAI) — 95.3 accuracy

HellaSwag — accuracy

5 results · 1 SOTA advance · higher is better

[Chart: accuracy over time, 2025–2027; current SOTA: GPT-4o]

Top Models Performance Comparison

Top 5 models ranked by accuracy (% of best in parentheses):

1. GPT-4o — 95.3 (100.0%)
2. Gemini 1.5 Pro — 92.5 (97.1%)
3. Claude 3.5 Sonnet — 89.0 (93.4%)
4. Llama 3.1 405B — 89.0 (93.4%)
5. Llama 3 70B — 88.0 (92.3%)
Best Score: 95.3 · Top Model: GPT-4o · Models Compared: 5 · Score Range: 7.3
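The summary statistics above follow directly from the raw leaderboard scores. A small sketch of that derivation (scores copied from the table; variable names are mine):

```python
# Derive "% of best" and "Score Range" from the leaderboard scores.
scores = {
    "GPT-4o": 95.3,
    "Gemini 1.5 Pro": 92.5,
    "Claude 3.5 Sonnet": 89.0,
    "Llama 3.1 405B": 89.0,
    "Llama 3 70B": 88.0,
}
best = max(scores.values())  # 95.3
pct_of_best = {m: round(100 * s / best, 1) for m, s in scores.items()}
score_range = round(best - min(scores.values()), 1)

print(pct_of_best["Gemini 1.5 Pro"])  # → 97.1
print(score_range)                    # → 7.3
```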


# | Model | Access | Organization | Score | Date
1 | GPT-4o | API | OpenAI | 95.3 | Dec 2025
2 | Gemini 1.5 Pro | API | Google | 92.5 | Dec 2025
3 | Claude 3.5 Sonnet | API | Anthropic | 89.0 | Dec 2025
4 | Llama 3.1 405B | Open Source | Meta | 89.0 | Mar 2026
5 | Llama 3 70B | Open Source | Meta | 88.0 | Dec 2025
