Who leads the WinoGrande benchmark?

GPT-4o currently leads WinoGrande with an accuracy score of 87.50.

What is the state-of-the-art score on WinoGrande?

The state-of-the-art result on WinoGrande is 87.50 (accuracy), achieved by GPT-4o as of 2026.

How many models are tracked on WinoGrande?

Codesota tracks 13 models on WinoGrande.

When was the WinoGrande leaderboard last updated?

The WinoGrande leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2023.


Commonsense Reasoning · benchmark dataset · 2019 · EN

WinoGrande.

Name: WinoGrande Benchmark Results
Creator: Codesota
Published: 2023-01-01
License: https://creativecommons.org/licenses/by/4.0/

44K Winograd-style problems requiring commonsense reasoning to resolve pronoun references.


§ 01 · Leaderboard

Best published scores.

13 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: accuracy · higher is better


#	Model	Org	Submitted	Paper / code	accuracy
01	GPT-4o (API)	OpenAI	Dec 2025	openai-blog	87.50
02	Claude 3.5 Sonnet (API)	Anthropic	Dec 2025	anthropic-blog	85.40
03	Llama 3 70B (Open)	Meta	Dec 2025	meta-blog	85.30
04	Trinity Large Base (5-shot)	—	Feb 2026	Arcee Trinity Large Technical Report · code	80.82
05	Step-3.5-Flash Base	—	Feb 2026	Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code	79.10
06	Chameleon 34B	—	May 2024	Chameleon: Mixed-Modal Early-Fusion Foundation Models · code	78.50
07	LLaMA-65B	—	Feb 2023	LLaMA: Open and Efficient Foundation Language Models · code	77
08	Apertus-70B	—	Sep 2025	Apertus: Democratizing Open and Compliant LLMs for Globa… · code	73.30
09	HRM-Text-1B	—	May 2026	pwc-dump · code	72.40
10	BitNet b1.58 2B4T	—	Apr 2025	BitNet b1.58 2B4T Technical Report · code	71.90
11	Helium	—	Sep 2024	Moshi: a speech-text foundation model for real-time dial… · code	70
12	SmolLM2 (1.7B)	—	Feb 2025	SmolLM2: When Smol Goes Big -- Data-Centric Training of … · code	59.40
13	OLMo-2-7B-1124 (olmOCR-peS2o)	—	Feb 2025	olmOCR: Unlocking Trillions of Tokens in PDFs with Visio… · code	58

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
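The ordering rule described above (sort by score, break ties by submission date) can be sketched in a few lines of Python. This is a minimal illustration, not Codesota's implementation: the `Row` type and `rank` function are hypothetical names, the rows are a handful copied from the table, and the tie-break direction (earlier submission ranks higher) is an assumption, since the page only says that ties are broken by submission date.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    model: str
    submitted: Tuple[int, int]  # (year, month), the granularity shown in the table
    accuracy: float


def rank(rows: List[Row]) -> List[Row]:
    # Higher accuracy ranks first. On equal accuracy the earlier submission
    # ranks first; that direction is an assumption, not stated on the page.
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


# A few rows copied from the table, deliberately out of order.
rows = [
    Row("LLaMA-65B", (2023, 2), 77.0),
    Row("Claude 3.5 Sonnet", (2025, 12), 85.40),
    Row("Chameleon 34B", (2024, 5), 78.50),
    Row("GPT-4o", (2025, 12), 87.50),
]

for position, row in enumerate(rank(rows), start=1):
    print(f"{position:02d}  {row.model:<20} {row.accuracy:.2f}")
```

Negating the score in the sort key keeps a single ascending sort, so the date component can stay in its natural order for the tie-break.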

§ 03 · Progress

3 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy

Feb 27, 2023	LLaMA-65B	—	77
May 16, 2024	Chameleon 34B	—	78.50
Dec 17, 2025	GPT-4o	OpenAI	87.50

Fig 3 · SOTA-setting models only. 3 entries span Feb 2023 → Dec 2025.
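The SOTA line can be derived from the leaderboard by walking the entries in date order and keeping each one that beats the running best. The sketch below assumes that reading of "broke the previous record"; `sota_steps` and `ENTRIES` are hypothetical names, and the data is a subset of leaderboard rows at month granularity. Because several rows share a month, entries within the same month are processed best-first, which is an assumption made here so that a lower same-month score is not counted as a step.

```python
from typing import List, Tuple

Entry = Tuple[Tuple[int, int], str, float]  # ((year, month), model, accuracy)

# A subset of leaderboard rows, in no particular order.
ENTRIES: List[Entry] = [
    ((2025, 12), "Claude 3.5 Sonnet", 85.40),
    ((2023, 2), "LLaMA-65B", 77.0),
    ((2026, 2), "Trinity Large Base (5-shot)", 80.82),
    ((2024, 9), "Helium", 70.0),
    ((2025, 12), "GPT-4o", 87.50),
    ((2024, 5), "Chameleon 34B", 78.50),
    ((2025, 9), "Apertus-70B", 73.30),
    ((2025, 12), "Llama 3 70B", 85.30),
]


def sota_steps(entries: List[Entry]) -> List[Entry]:
    steps: List[Entry] = []
    best = float("-inf")
    # Chronological order; within one month the highest score comes first
    # (assumption, since the table does not give the day for every row).
    for entry in sorted(entries, key=lambda e: (e[0], -e[2])):
        if entry[2] > best:
            steps.append(entry)
            best = entry[2]
    return steps


for (year, month), model, accuracy in sota_steps(ENTRIES):
    print(f"{year}-{month:02d}  {model:<16} {accuracy:.2f}")
```

On this subset the function keeps LLaMA-65B, Chameleon 34B, and GPT-4o, matching the three steps listed above.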

§ 04 · Literature

9 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

Arcee Trinity Large Technical Report
Feb 2026 · Trinity Large Base (5-shot) · arXiv · Code

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Feb 2026 · Step-3.5-Flash Base · arXiv · Code

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Sep 2025 · Apertus-70B · arXiv · Code

BitNet b1.58 2B4T Technical Report
Apr 2025 · BitNet b1.58 2B4T · arXiv · Code

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Feb 2025 · OLMo-2-7B-1124 (olmOCR-peS2o) · arXiv · Code

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Feb 2025 · SmolLM2 (1.7B) · arXiv · Code

Moshi: a speech-text foundation model for real-time dialogue
Sep 2024 · Helium · arXiv · Code

Chameleon: Mixed-Modal Early-Fusion Foundation Models
May 2024 · Chameleon 34B · arXiv · Code

LLaMA: Open and Efficient Foundation Language Models
Feb 2023 · LLaMA-65B · arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.


What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
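
Before submitting, the five items above can be checked mechanically. The sketch below is only an illustration of that checklist: the submission guide is not reproduced on this page, so every field name (`checkpoint_or_endpoint`, `repro_script`, `environment`, `scores`, `contact`) is hypothetical and not Codesota's actual schema, and the example values are placeholders.

```python
from typing import Any, Dict, List

DATASET_METRICS = ("accuracy",)  # the one metric this page declares


def missing_items(submission: Dict[str, Any]) -> List[str]:
    """Return the checklist items a submission record does not satisfy."""
    problems: List[str] = []

    if not submission.get("checkpoint_or_endpoint"):
        problems.append("01 public checkpoint or API endpoint")

    # A seed of 0 is valid, so test for None rather than truthiness.
    script = submission.get("repro_script") or {}
    if any(script.get(key) is None for key in ("path", "commit", "seed")):
        problems.append("02 reproduction script with frozen commit and seed")

    env = submission.get("environment") or {}
    if not env.get("python") or env.get("deps") is None:
        problems.append("03 declared evaluation environment")

    scores = submission.get("scores") or {}
    if any(metric not in scores for metric in DATASET_METRICS):
        problems.append("04 one row per declared metric")

    if not submission.get("contact"):
        problems.append("05 contact for follow-up")

    return problems


# Hypothetical record; every value here is a placeholder, not a real result.
example = {
    "checkpoint_or_endpoint": "https://example.com/model",
    "repro_script": {"path": "eval.py", "commit": "0000000", "seed": 0},
    "environment": {"python": "3.11", "deps": ["example-package==1.0"]},
    "scores": {"accuracy": 80.0},
    "contact": "someone@example.com",
}

print(missing_items(example))
```

Returning the list of unmet items, rather than a single pass/fail flag, tells the submitter exactly which requirement to fix.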
