Codesota · Benchmark · HellaSwag

HellaSwag.

70K sentence completion problems testing commonsense natural language inference.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

accuracy

Accuracy is the reported evaluation metric for HellaSwag. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
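
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of items where the highest-scoring candidate ending matches the gold ending. It assumes the standard HellaSwag setup of four candidate endings per item. The names `pick_ending`, `accuracy`, `example_scores`, and `example_labels` are hypothetical, and the scores are made-up values rather than output from any listed model. The leaderboard does not state how each source scored endings (for example, shot count or length normalization), so this is not a reproduction of any entry's evaluation.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending (ties go to the first)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of items whose predicted index equals the gold index."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical per-ending scores (e.g. log-likelihoods) for three items,
# four candidate endings each. These values are invented for illustration.
example_scores = [
    [-4.1, -2.3, -5.0, -3.7],
    [-1.2, -3.3, -2.8, -4.0],
    [-3.0, -2.9, -1.5, -2.2],
]
example_labels = [1, 0, 3]  # hypothetical gold ending indices

predictions = [pick_ending(s) for s in example_scores]
print(f"{accuracy(predictions, example_labels):.1f}")
```

In this toy run two of the three predictions match their labels. Scores reported by different sources may not be directly comparable if they differ in prompting or scoring details.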

Trust tiers for accuracy: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Links | Notes |
|------|-------|-------|-------|------|-------|-------|
| 01 | gpt-4o | paper | 95.3 | 2025 | Source | Commonsense NLI. Models now exceed human performance (95.6%). |
| 02 | Gemini 1.5 Pro | unverified | 92.5 | 2025 | Source | |
| 03 | gemini-15-pro | paper | 92.5 | 2025 | Source | |
| 04 | Step-3.5-Flash Base | unverified | 90.2 | 2026 | Paper, Code | |
| 05 | Trinity Large Base (5-shot) | unverified | 90.11 | 2026 | Paper, Code | |
| 06 | Llama 3.1 405B | verified | 89 | 2026 | Source | Llama 3.1 405B Instruct. Official Meta model card evaluation. |
| 07 | claude-35-sonnet | paper | 89 | 2025 | Source | |
| 08 | Claude 3.5 Sonnet | unverified | 89 | 2025 | Source | |
| 09 | Llama 3 70B | unverified | 88 | 2025 | Source | |
| 10 | llama-3-70b | paper | 88 | 2025 | Source | |
| 11 | LLaMA-65B | unverified | 84.2 | 2023 | Paper, Code | |
| 12 | Chameleon 34B | unverified | 82.7 | 2024 | Paper, Code | |
| 13 | BLT-Entropy 8B | unverified | 80.6 | 2024 | Paper, Code | |
| 14 | Apertus-70B-Instruct | unverified | 78.1 | 2025 | Paper, Code | |
| 15 | Helium | unverified | 76.3 | 2024 | Paper, Code | |
| 16 | SmoLM2 (1.7B) | unverified | 68.7 | 2025 | Paper, Code | |
| 17 | BitNet b1.58 2B4T | unverified | 68.44 | 2025 | Paper, Code | |
| 18 | Apertus-70B | unverified | 64 | 2025 | Paper, Code | |
| 19 | HRM-Text-1B | unverified | 63.4 | 2026 | Paper, Code | |
| 20 | OLMo-2-7B-1124 (olmOCR-peS2o) | unverified | 62.6 | 2025 | Paper, Code | |
