Open-domain QA benchmark built from real Google search queries with answers annotated from Wikipedia pages.
Accuracy is the reported evaluation metric for Natural Questions. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher scores are better.
Muted rows were not state of the art when published: an earlier or same-year result already scored better.
| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | LLaMA-65B | unverified | 39.9 | 2023 |
| 02 | Llama 2 70B (5-shot) | unverified | 33 | 2023 |
| 03 | OLMo-2-7B-1124 (olmOCR-peS2o) | unverified | 29.1 | 2025 |
| 04 | Helium | unverified | 23.3 | 2024 |
| 05 | SmoLM2 (1.7B) | unverified | 8.70 | 2025 |
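
The muting rule described above the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the site's actual implementation: the `Result` class and the `muted_models` function are hypothetical names, and "already scored better" is read here as a strictly higher score from the same or an earlier year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float
    year: int


def muted_models(results):
    """Return the names of models that were not state of the art when published.

    Assumption: a row is muted when some other row from the same or an
    earlier year has a strictly higher score.
    """
    return {
        r.model
        for r in results
        if any(o.year <= r.year and o.score > r.score for o in results)
    }


# Rows copied from the table above.
ROWS = [
    Result("LLaMA-65B", 39.9, 2023),
    Result("Llama 2 70B (5-shot)", 33.0, 2023),
    Result("OLMo-2-7B-1124 (olmOCR-peS2o)", 29.1, 2025),
    Result("Helium", 23.3, 2024),
    Result("SmoLM2 (1.7B)", 8.70, 2025),
]

MUTED = muted_models(ROWS)
```

Under this reading, every row except LLaMA-65B would be muted, because its 2023 score is higher than every other score listed for the same or a later year.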