SNLI consists of 570k human-written English sentence pairs, manually labeled for balanced classification with the labels entailment, contradiction, and neutral.
Accuracy is the reported evaluation metric for SNLI. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
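As a rough illustration of the metric, accuracy is the share of sentence pairs whose predicted label matches the gold label. The sketch below assumes scores are reported as percentages, which the table values suggest. The function name `accuracy` and the toy label lists are hypothetical and are not taken from any published evaluation script.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the percentage of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact label matches, pair by pair.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Toy data (made up for illustration): 3 of 4 predictions are correct.
gold = ["entailment", "neutral", "contradiction", "neutral"]
predicted = ["entailment", "neutral", "neutral", "neutral"]
print(accuracy(gold, predicted))  # 75.0
```

Because the three classes are roughly balanced, plain accuracy is a reasonable summary here. On a skewed label distribution, a per-class metric would likely be more informative.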
| Rank | Model | Trust | Accuracy (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-4o | verified | 92.6 | 2023 | Paper, Source |
| 02 | DeBERTa-v3-large | verified | 92.2 | 2021 | Paper, Source |
| 03 | Gemini Ultra | verified | 91.9 | 2023 | Paper |
| 04 | Claude 3.5 Sonnet | verified | 91.8 | 2024 | Paper |
| 05 | Llama 3.1 405B | verified | 91.2 | 2024 | Paper |
| 06 | Qwen2 72B | verified | 90.1 | 2024 | Paper |
| 07 | Llama 3 70B | verified | 89.7 | 2024 | Paper |
| 08 | Mistral 7B | verified | 85.6 | 2023 | Paper |