Commonsense Reasoning · 2019 · en

WinoGrande

44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.

Metrics: accuracy
Current State of the Art

GPT-4o (OpenAI): 87.5 accuracy

Top Models Performance Comparison

Top 3 models ranked by accuracy

Rank  Model              Accuracy  % of best
1     GPT-4o             87.5      100.0%
2     Claude 3.5 Sonnet  85.4      97.6%
3     Llama 3 70B        85.3      97.5%
Best Score: 87.5
Top Model: GPT-4o
Models Compared: 3
Score Range: 2.2
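
The summary figures above follow directly from the three listed scores. As a minimal sketch, assuming "% of best" means each score divided by the top score and "Score Range" means top score minus lowest score (the page does not state either definition, and the function name `summarize` is hypothetical), they can be recomputed like this:

```python
# Accuracy scores copied from the leaderboard on this page.
scores = {
    "GPT-4o": 87.5,
    "Claude 3.5 Sonnet": 85.4,
    "Llama 3 70B": 85.3,
}


def summarize(scores):
    """Rank models by score and derive the summary statistics.

    Assumed definitions (not confirmed by the page):
      % of best   = 100 * score / best score
      score range = best score - lowest score
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_model, best_score = ranked[0]
    lowest_score = ranked[-1][1]
    rows = [
        (rank, model, score, round(100 * score / best_score, 1))
        for rank, (model, score) in enumerate(ranked, start=1)
    ]
    return {
        "best_model": best_model,
        "best_score": best_score,
        "models_compared": len(ranked),
        "score_range": round(best_score - lowest_score, 1),
        "rows": rows,
    }


summary = summarize(scores)
for rank, model, score, pct in summary["rows"]:
    print(f"{rank}  {model:<18} {score:>5}  {pct:>5}%")
print("Score range:", summary["score_range"])
```

Under these assumptions the recomputed values match the page: 97.6% and 97.5% of best for the second and third models, and a score range of 2.2.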

accuracy (primary metric)

#  Model              Type         Organization  Score  Date
1  GPT-4o             API          OpenAI        87.5   Dec 2025
2  Claude 3.5 Sonnet  API          Anthropic     85.4   Dec 2025
3  Llama 3 70B        Open Source  Meta          85.3   Dec 2025

Other Commonsense Reasoning Datasets