SimpleQA is a short-form factuality benchmark of single-answer, fact-seeking questions designed to expose hallucination and calibration failures.
Accuracy is the reported evaluation metric for SimpleQA. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher scores are better.
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Gemini 2.5 Pro | unverified | 54 | 2025 | Paper |
| 02 | Step-3.5-Flash Base | unverified | 31.6 | 2026 | Paper, Code |
| 03 | Gemini 2.5 Flash | unverified | 26.9 | 2025 | Paper |
| 04 | GLM-4.5 | unverified | 26.4 | 2025 | Paper, Code |
| 05 | Trinity Large Preview | unverified | 23.92 | 2026 | Paper, Code |
| 06 | MiniMax-Text-01 | unverified | 23.7 | 2025 | Paper, Code |
| 07 | Gemma 3 (27B, IT) | unverified | 10 | 2025 | Paper, Code |
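
As a rough illustration of the accuracy metric described above, the sketch below computes a score from per-question grades. It rests on assumptions the page does not state: that the scores in the table are percentages of questions answered correctly, that every question counts in the denominator, and that grades use the hypothetical labels `correct`, `incorrect`, and `not_attempted`. The function name `accuracy` is also illustrative.

```python
def accuracy(grades):
    """Return the percentage of questions graded "correct".

    Assumption: every question counts in the denominator, including
    ones the model declined to answer. The grade labels are hypothetical.
    """
    if not grades:
        raise ValueError("grades must not be empty")
    correct = sum(1 for g in grades if g == "correct")
    return 100.0 * correct / len(grades)


# Four graded answers, two of them correct.
grades = ["correct", "incorrect", "not_attempted", "correct"]
print(accuracy(grades))  # 50.0
```

Individual papers may grade or aggregate answers differently, which is one reason to treat unverified scores from different sources as only roughly comparable.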