WinoGrande is a set of 44K Winograd-style problems that require commonsense reasoning to resolve pronoun references.
Accuracy is the reported evaluation metric for WinoGrande. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better.
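As a rough illustration of the metric, accuracy is the share of problems where the predicted option matches the gold answer. The sketch below assumes each problem has two candidate options labeled 1 and 2 and that scores are reported as percentages, as in the table. The function name `accuracy` and the sample labels are hypothetical and are not taken from any Codesota tooling or from the cited papers.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of predictions that match the gold labels.

    Assumption: each entry is the index (1 or 2) of the chosen option
    for one problem, and both sequences are in the same problem order.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count the problems where the chosen option equals the gold option.
    correct = sum(1 for p, g in zip(predictions, labels) if p == g)
    return 100.0 * correct / len(labels)


# Hypothetical example: 3 of 4 problems answered correctly.
example_score = accuracy([1, 2, 1, 1], [1, 2, 2, 1])
```

Because each problem has only two options, random guessing would be expected to score about 50 under this definition, which is useful context when reading the lower rows of the table.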
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | gpt-4o | paper | 87.5 | 2025 | Source ↗ |
| 02 | Claude 3.5 Sonnet | unverified | 85.4 | 2025 | Source ↗ |
| 03 | claude-35-sonnet | paper | 85.4 | 2025 | Source ↗ |
| 04 | llama-3-70b | paper | 85.3 | 2025 | Source ↗ |
| 05 | Llama 3 70B | unverified | 85.3 | 2025 | Source ↗ |
| 06 | Trinity Large Base (5-shot) | unverified | 80.82 | 2026 | Paper ↗ Code ↗ |
| 07 | Step-3.5-Flash Base | unverified | 79.1 | 2026 | Paper ↗ Code ↗ |
| 08 | Chameleon 34B | unverified | 78.5 | 2024 | Paper ↗ Code ↗ |
| 09 | LLaMA-65B | unverified | 77 | 2023 | Paper ↗ Code ↗ |
| 10 | Apertus-70B | unverified | 73.3 | 2025 | Paper ↗ Code ↗ |
| 11 | HRM-Text-1B | unverified | 72.4 | 2026 | Paper ↗ Code ↗ |
| 12 | BitNet b1.58 2B4T | unverified | 71.9 | 2025 | Paper ↗ Code ↗ |
| 13 | Helium | unverified | 70 | 2024 | Paper ↗ Code ↗ |
| 14 | SmoLM2 (1.7B) | unverified | 59.4 | 2025 | Paper ↗ Code ↗ |
| 15 | OLMo-2-7B-1124 (olmOCR-peS2o) | unverified | 58 | 2025 | Paper ↗ Code ↗ |