Open-domain QA benchmark built from real Google search queries with answers annotated from Wikipedia pages.
5 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | Accuracy |
|---|---|---|---|---|---|
| 01 | LLaMA-65B | — | Feb 2023 | LLaMA: Open and Efficient Foundation Language Models · code | 39.90 |
| 02 | Llama 2 70B (5-shot) | — | Jul 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models · code | 33.00 |
| 03 | OLMo-2-7B-1124 (olmOCR-peS2o) | — | Feb 2025 | olmOCR: Unlocking Trillions of Tokens in PDFs with Visio… · code | 29.10 |
| 04 | Helium | — | Sep 2024 | Moshi: a speech-text foundation model for real-time dial… · code | 23.30 |
| 05 | SmolLM2 (1.7B) | — | Feb 2025 | SmolLM2: When Smol Goes Big -- Data-Centric Training of … · code | 8.70 |
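The ranking rule stated above the table (highest accuracy first, ties broken by submission date) can be sketched in a few lines. This is only an illustration, not the site's actual code: the `rank` and `parse_month` helpers are hypothetical names, and the direction of the tie-break (the earlier submission wins) is an assumption, since the page does not say which way ties go.

```python
from datetime import date

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def parse_month(text: str) -> date:
    """Turn a 'Feb 2023' style label into the first day of that month."""
    month, year = text.split()
    return date(int(year), MONTHS[month], 1)


# (model, submitted, accuracy) taken from the leaderboard table.
ROWS = [
    ("LLaMA-65B", "Feb 2023", 39.90),
    ("Llama 2 70B (5-shot)", "Jul 2023", 33.0),
    ("OLMo-2-7B-1124 (olmOCR-peS2o)", "Feb 2025", 29.10),
    ("Helium", "Sep 2024", 23.30),
    ("SmolLM2 (1.7B)", "Feb 2025", 8.70),
]


def rank(rows):
    """Sort by accuracy, descending.

    Assumption: when two rows have the same accuracy, the earlier
    submission is placed first.
    """
    return sorted(rows, key=lambda r: (-r[2], parse_month(r[1])))


ranked = rank(ROWS)
sota_model = ranked[0][0]  # the row that would be shaded as current SOTA

for position, (model, submitted, accuracy) in enumerate(ranked, start=1):
    print(f"{position:02d}  {model:<32} {submitted}  {accuracy:.2f}")
```

Sorting on the tuple `(-accuracy, submission_date)` keeps the rule in a single key, so the tie-break only matters when two accuracies are exactly equal.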
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run the script, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.