12,247 multiple-choice questions requiring commonsense reasoning about everyday concepts.
Accuracy is the reported evaluation metric for CommonsenseQA. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
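As a rough illustration of the metric, accuracy is the share of questions for which the predicted answer matches the gold answer. The sketch below is a minimal example and not Codesota's or any paper's evaluation code: the function name `accuracy`, the letter-label answer format, and the 0-100 percentage scale (chosen to match how the scores in the table appear to be reported) are all assumptions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match the gold labels.

    Hypothetical helper: labels are assumed to be answer letters such as "A".."E".
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and gold answer labels.
    correct = sum(p == g for p, g in zip(predictions, gold))
    # Scale to 0-100 (an assumption about the reporting convention).
    return 100.0 * correct / len(gold)


# Toy data, not real model output: three of four answers match.
toy_predictions = ["A", "C", "E", "B"]
toy_gold = ["A", "C", "E", "D"]
toy_score = accuracy(toy_predictions, toy_gold)
```

With this toy input, `toy_score` is 75.0. Published scores can still differ for the same model depending on prompt format, number of few-shot examples, and which data split is used, so numbers from different sources are not always directly comparable.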
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | gpt-4o | paper | 85.4 | 2025 | Source ↗ |
| 02 | Claude 3.5 Sonnet | unverified | 83.2 | 2025 | Source ↗ |
| 03 | claude-35-sonnet | paper | 83.2 | 2025 | Source ↗ |
| 04 | llama-3-70b | paper | 80.9 | 2025 | Source ↗ |
| 05 | Llama 3 70B | unverified | 80.9 | 2025 | Source ↗ |
| 06 | BitNet b1.58 2B4T | unverified | 71.58 | 2025 | Paper ↗, Code ↗ |
| 07 | SmoLM2 (1.7B) | unverified | 43.6 | 2025 | Paper ↗, Code ↗ |