The dataset is SimpleQA-Verified, and its task is language modeling. It is a reliable factuality benchmark for measuring a model's parametric knowledge. The original set of 4,326 samples was reduced to 1,000 samples over several processing stages. Problems vary in string length, topic (e.g., Sports, Politics, Art, History, Geography, Music, Other), and answer type (e.g., Person, Other). Some problems may require reasoning or multiple steps.
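To get a feel for how problems are spread across topics and answer types, a tally like the following may help. This is only a minimal sketch: the records are made-up placeholders, and the field names `problem`, `topic`, and `answer_type` are assumptions for illustration rather than the dataset's confirmed schema.

```python
from collections import Counter

# Placeholder records for illustration only. The field names "problem",
# "topic", and "answer_type" are assumed, not taken from the real schema.
records = [
    {"problem": "Example question A?", "topic": "Sports", "answer_type": "Person"},
    {"problem": "Example question B?", "topic": "History", "answer_type": "Other"},
    {"problem": "Example question C, longer?", "topic": "Sports", "answer_type": "Other"},
]


def summarize(rows):
    """Count rows per topic and answer type, and report the problem-length range."""
    lengths = [len(r["problem"]) for r in rows]
    return {
        "topics": dict(Counter(r["topic"] for r in rows)),
        "answer_types": dict(Counter(r["answer_type"] for r in rows)),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }


summary = summarize(records)
print(summary)
```

Swapping the placeholder list for the real records (with whatever field names they actually use) would give the per-topic and per-answer-type counts for the full 1,000-sample set.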
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate that step on the progress chart with your name.