GPQA Diamond is a question-answering benchmark for evaluating language models [1]. It is the 198-question subset of GPQA, whose main set contains 448 expert-written and expert-validated multiple-choice questions in biology, physics, and chemistry. It is designed as a challenging test of advanced AI reasoning and is used to study scalable oversight and structured problem-solving [2].
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
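For a multiple-choice benchmark like this one, the score a reproduction script reports is usually plain accuracy. The sketch below is illustrative only and is not this leaderboard's harness: the `Question` record, the `predict` callback, and the four-option A to D layout are assumptions made for the example. It shuffles the answer order with a fixed seed so that a model cannot benefit from the correct answer always sitting in the same position, and so that reruns give the same number.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCD"  # assumption: four options per question


@dataclass
class Question:
    """Hypothetical record layout, not the dataset's actual schema."""
    prompt: str
    correct: str
    distractors: list  # three incorrect answer strings


def shuffled_choices(question, rng):
    """Return the options in random order plus the letter of the correct one."""
    choices = [question.correct, *question.distractors]
    rng.shuffle(choices)
    return choices, LETTERS[choices.index(question.correct)]


def score(questions, predict, seed=0):
    """Accuracy of `predict(prompt, choices) -> letter` over all questions."""
    if not questions:
        raise ValueError("no questions to score")
    rng = random.Random(seed)  # fixed seed keeps the option order reproducible
    hits = 0
    for question in questions:
        choices, gold = shuffled_choices(question, rng)
        if predict(question.prompt, choices) == gold:
            hits += 1
    return hits / len(questions)
```

Here `predict` stands in for whatever wraps the submitted checkpoint: it receives the prompt and the shuffled options and returns one of the letters. A model that guesses uniformly at random would be expected to score about 0.25 under this four-option assumption.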