Large-scale QA benchmark with trivia questions and independently gathered evidence documents.
4 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
| # | Model | Org | Submitted | Paper / code | Accuracy (%) |
|---|---|---|---|---|---|
| 01 | Llama 2 70B (5-shot) | — | Jul 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models · code | 85 |
| 02 | LLaMA-65B | — | Feb 2023 | LLaMA: Open and Efficient Foundation Language Models · code | 73 |
| 03 | SmolLM2 (1.7B) | — | Feb 2025 | SmolLM2: When Smol Goes Big -- Data-Centric Training of … · code | 36.70 |
| 04 | BitNet b1.58 2B4T | — | Apr 2025 | BitNet b1.58 2B4T Technical Report · code | 33.57 |
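The ordering rule stated above the table (highest accuracy first, ties broken by submission date) can be sketched as follows. This is only an illustration, not the site's actual ranking code. The `rank` helper and the tuple layout are hypothetical. The page does not say which direction the tie-break runs, so the sketch assumes the earlier submission wins. Only month and year are listed, so each date uses the first of the month as a placeholder.

```python
from datetime import date

# Rows copied from the table above: (model, submitted, accuracy).
# The day of month is a placeholder; the table lists only month and year.
rows = [
    ("Llama 2 70B (5-shot)", date(2023, 7, 1), 85.0),
    ("LLaMA-65B", date(2023, 2, 1), 73.0),
    ("SmolLM2 (1.7B)", date(2025, 2, 1), 36.70),
    ("BitNet b1.58 2B4T", date(2025, 4, 1), 33.57),
]


def rank(entries):
    """Sort by accuracy, highest first.

    Assumption: on equal accuracy, the earlier submission ranks higher.
    """
    return sorted(entries, key=lambda r: (-r[2], r[1]))


ranked = rank(rows)
sota = ranked[0]  # the row that would be shaded as current SOTA

for position, (model, submitted, accuracy) in enumerate(ranked, start=1):
    print(f"{position:02d}  {model:<22} {submitted:%b %Y}  {accuracy}")
```

Negating the accuracy inside the sort key keeps the date ascending while the score sorts descending, which avoids a second sorting pass.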
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.