Codesota · Reasoning · Multi-step Reasoning · GPQA Diamond
Multi-step Reasoning · benchmark dataset · 2023 · EN

Graduate-Level Google-Proof Q&A Diamond.

Graduate-level science QA benchmark designed to be difficult for non-experts and resistant to simple web lookup. GPQA Diamond is the common frontier reporting split.

§ 01 · Leaderboard

Best published scores.

74 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 74 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | Gemini 3 Pro · API | Google | Apr 2026 | google-blog | 91.90 |
| 02 | Claude Opus 4.6 | Anthropic | Apr 2026 | anthropic-opus-4-6-announcement | 91.30 |
| 03 | Kimi K2.6 | | Apr 2026 | pwc-dump | 90.50 |
| 04 | Gemini 3 Flash · API | Google | Apr 2026 | google-blog | 90.40 |
| 05 | DeepSeek-V4-Pro Max | DeepSeek | Apr 2026 | pwc-dump · code | 90.10 |
| 06 | Claude Sonnet 4.6 · API | Anthropic | Apr 2026 | anthropic-sonnet-4-6-system-card | 89.90 |
| 07 | GPT-5 | OpenAI | Apr 2026 | openai-gpt-5-launch | 89 |
| 08 | Qwen3.5-397B-A17B · Open | Alibaba | Feb 2026 | pwc-dump · code | 88.40 |
| 09 | DeepSeek-V4-Flash Max | DeepSeek | Apr 2026 | pwc-dump · code | 88.10 |
| 10 | Grok 4 · API | xAI | Apr 2026 | xai-grok-4-announcement | 88 |
| 11 | Qwen3.6-27B | | Apr 2026 | pwc-dump · code | 87.80 |
| 12 | Kimi-K2.5 · Open | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 87.60 |
| 13 | Qwen3.5-122B-A10B · Open | Alibaba | Feb 2026 | pwc-dump · code | 86.60 |
| 14 | Gemini 2.5 Pro | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 86.40 |
| 15 | GLM-5.1 | | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 86.20 |
| 16 | Qwen3.6-35B-A3B | | Apr 2026 | pwc-dump · code | 86 |
| 17 | GLM-5 · Open | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 86 |
| 18 | GLM-4.7 · Open | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 85.70 |
| 19 | DeepSeek-V3.2-Speciale · Open | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 85.70 |
| 20 | Qwen3.5-27B · Open | Alibaba | Feb 2026 | pwc-dump · code | 85.50 |
| 21 | MiniMax-M2.5 · Open | MiniMaxAI | Feb 2026 | pwc-dump · code | 85.20 |
| 22 | Step-3.5-Flash PaCoRe | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 85 |
| 23 | Gemma 4 31B | Google | Apr 2026 | pwc-dump | 84.30 |
| 24 | Qwen3.5-35B-A3B · Open | Alibaba | Feb 2026 | pwc-dump · code | 84.20 |
| 25 | Gemini 2.5 Pro · API | Google | Mar 2026 | google-technical-report | 84 |
| 26 | Qwen3.5-Omni-Plus | | Apr 2026 | Qwen3.5-Omni Technical Report | 83.90 |
| 27 | Step-3.5-Flash | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 83.50 |
| 28 | Gemini 2.5 Flash | Google | Apr 2026 | google-model-card | 82.80 |
| 29 | o3 | OpenAI | Mar 2026 | openai-simple-evals | 82.80 |
| 30 | Gemini 2.5 Flash | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 82.80 |
| 31 | DeepSeek-V3.2 · Open | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 82.40 |
| 32 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 79.23 |
| 33 | GLM-4.5 · Open | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 79.10 |
| 34 | o4-mini | OpenAI | Mar 2026 | openai-simple-evals | 77.60 |
| 35 | Qwen3-VL-235B-A22B-Thinking | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 77.10 |
| 36 | Claude Opus 4 | Anthropic | Mar 2026 | anthropic-model-card | 76.70 |
| 37 | o1 · API | OpenAI | Mar 2026 | openai-simple-evals | 75.70 |
| 38 | GLM-4.5-Air · Open | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 75 |
| 39 | Claude Opus 4.5 · API | Anthropic | Mar 2026 | anthropic-model-card | 74.90 |
| 40 | o3-mini · API | OpenAI | Mar 2026 | openai-simple-evals | 74.90 |
| 41 | Qwen3-Coder-Next | Qwen | Feb 2026 | Qwen3-Coder-Next Technical Report · code | 74.49 |
| 42 | Qwen3-VL-235B-A22B-Instruct | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 74.30 |
| 43 | o1-preview · API | OpenAI | Mar 2026 | openai-simple-evals | 73.30 |
| 44 | Qwen3-Omni-Flash-Thinking | | Sep 2025 | Qwen3-Omni Technical Report · code | 73.10 |
| 45 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | Dec 2025 | Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybr… · code | 73 |
| 46 | DeepSeek R1 · Open | DeepSeek | Mar 2026 | arxiv | 71.50 |
| 47 | Qwen3-235B-A22B · Open | Alibaba | Apr 2026 | qwen-model-card | 71.10 |
| 48 | Qwen3-235B-A22B · Open | Alibaba | May 2025 | Qwen3 Technical Report · code | 71.10 |
| 49 | ZAYA1-8B | Z.ai | May 2026 | ZAYA1-8B Technical Report | 71 |
| 50 | Claude Sonnet 4 | Anthropic | Mar 2026 | anthropic-model-card | 70 |
| 51 | Llama 4 Maverick · Open | Meta | Mar 2026 | meta-blog | 69.80 |
| 52 | GPT-4.5 Preview · API | OpenAI | Mar 2026 | openai-simple-evals | 69.50 |
| 53 | MiMo-V2.5-Pro | | Apr 2026 | pwc-dump | 66.70 |
| 54 | GPT-4.1 mini · API | OpenAI | Apr 2026 | pricepertoken-leaderboard | 66.40 |
| 55 | GPT-4.1 | OpenAI | Mar 2026 | openai-simple-evals | 66.30 |
| 56 | Trinity Large Preview | Arcee AI | Feb 2026 | Arcee Trinity Large Technical Report · code | 63.32 |
| 57 | o1-mini · API | OpenAI | Mar 2026 | openai-simple-evals | 60 |
| 58 | Claude 3.5 Sonnet · API | Anthropic | Mar 2026 | openai-simple-evals | 59.40 |
| 59 | Grok 2 · API | xAI | Mar 2026 | openai-simple-evals | 56 |
| 60 | MiniMax-Text-01 | MiniMax | Jan 2025 | MiniMax-01: Scaling Foundation Models with Lightning Att… · code | 54.40 |
| 61 | Llama 3 (405B, Instruct) | Meta | Jul 2024 | The Llama 3 Herd of Models · code | 51.10 |
| 62 | Llama 3.1 405B · Open | Meta | Mar 2026 | openai-simple-evals | 50.70 |
| 63 | Claude 3 Opus · API | Anthropic | Mar 2026 | openai-simple-evals | 50.40 |
| 64 | GPT-4o · API | OpenAI | Mar 2026 | openai-simple-evals | 49.90 |
| 65 | Qwen2.5-Plus | | Dec 2024 | Qwen2.5 Technical Report · code | 49.70 |
| 66 | GPT-4 Turbo · API | OpenAI | Mar 2026 | openai-simple-evals | 49.30 |
| 67 | Qwen2.5-VL-72B | | Feb 2025 | Qwen2.5-VL Technical Report · code | 49 |
| 68 | Qwen2.5-72B-Instruct · Open | Alibaba | Mar 2026 | qwen25-tech-report | 49 |
| 69 | Gemini 1.5 Pro · API | Google | Mar 2026 | openai-simple-evals | 46.20 |
| 70 | Gemma 3 (27B, IT) | | Mar 2025 | Gemma 3 Technical Report · code | 42.40 |
| 71 | Step-3.5-Flash Base | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 41.70 |
| 72 | Llama 3.1 70B · Open | Meta | Mar 2026 | openai-simple-evals | 41.70 |
| 73 | GPT-4o mini | OpenAI | Mar 2026 | openai-simple-evals | 40.20 |
| 74 | Qwen3-VL-8B-Instruct | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 34.70 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
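
The ordering rule described for this table (sort by score, break ties by submission date) can be sketched in a few lines of Python. This is only an illustration: the rows are hypothetical, the field names are invented for the example, and the direction of the tie-break (earlier submission first) is an assumption, since the page does not state it.

```python
from datetime import date

# Hypothetical rows for illustration; these are not leaderboard entries.
rows = [
    {"model": "model-a", "accuracy": 80.0, "submitted": date(2026, 3, 1)},
    {"model": "model-b", "accuracy": 82.5, "submitted": date(2026, 4, 1)},
    {"model": "model-c", "accuracy": 82.5, "submitted": date(2026, 2, 1)},
]


def sort_rows(rows):
    """Highest accuracy first; equal scores ordered by earlier submission date.

    The "earlier first" direction is an assumption made for this sketch.
    """
    return sorted(rows, key=lambda r: (-r["accuracy"], r["submitted"]))


ranked = sort_rows(rows)
sota = ranked[0]  # the row that would be shaded as current SOTA

for position, row in enumerate(ranked, start=1):
    marker = "*" if row is sota else " "
    print(f"{position:02d}{marker} {row['model']:<10} {row['accuracy']:.2f}")
```

Negating the score inside the sort key keeps a single ascending sort while still placing higher accuracy first, so the date component can stay in its natural order.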
§ 03 · Progress

7 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Jul 31, 2024 · Llama 3 (405B, Instruct) · Meta · 51.10
  2. Jan 14, 2025 · MiniMax-Text-01 · MiniMax · 54.40
  3. May 14, 2025 · Qwen3-235B-A22B · Alibaba · 71.10
  4. Jul 7, 2025 · Gemini 2.5 Pro · 86.40
  5. Feb 2, 2026 · Kimi-K2.5 · Moonshot.AI · 87.60
  6. Feb 16, 2026 · Qwen3.5-397B-A17B · Alibaba · 88.40
  7. Apr 12, 2026 · Gemini 3 Pro · Google · 91.90
Fig 3 · SOTA-setting models only. 7 entries span Jul 2024 to Apr 2026.
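
One way to derive a SOTA line like this from dated results is a running maximum: walk the entries in date order and keep each one that strictly beats every earlier score. The sketch below is an assumption about the method, not Codesota's actual pipeline. The seven record-setting entries use the dates listed above; the three other entries come from the leaderboard, which gives only a month, so their day of month is a placeholder.

```python
from datetime import date


def sota_steps(results):
    """Return the entries that strictly beat all earlier scores, in date order."""
    steps = []
    best = float("-inf")
    for when, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            steps.append((when, model, score))
            best = score
    return steps


results = [
    (date(2024, 7, 31), "Llama 3 (405B, Instruct)", 51.10),
    (date(2024, 12, 1), "Qwen2.5-Plus", 49.70),        # day is a placeholder
    (date(2025, 1, 14), "MiniMax-Text-01", 54.40),
    (date(2025, 3, 1), "Gemma 3 (27B, IT)", 42.40),    # day is a placeholder
    (date(2025, 5, 14), "Qwen3-235B-A22B", 71.10),
    (date(2025, 7, 7), "Gemini 2.5 Pro", 86.40),
    (date(2025, 8, 1), "GLM-4.5", 79.10),              # day is a placeholder
    (date(2026, 2, 2), "Kimi-K2.5", 87.60),
    (date(2026, 2, 16), "Qwen3.5-397B-A17B", 88.40),
    (date(2026, 4, 12), "Gemini 3 Pro", 91.90),
]

steps = sota_steps(results)
for index, (when, model, score) in enumerate(steps, start=1):
    print(f"{index}. {when:%b %d, %Y}  {model}  {score:.2f}")
```

The strict comparison means a later result that only equals the current best does not add a new step, which matches the statement that each entry improved upon the previous best.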
§ 04 · Literature

20 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
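
Before submitting, the five requirements above could be checked locally with a small manifest. The sketch below is hypothetical: the `Submission` class, its field names, and the `missing_items` helper are invented for illustration and do not reflect any schema published by Codesota. All values in the example draft are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Submission:
    """Hypothetical manifest; field names are illustrative, not an official schema."""
    checkpoint_or_endpoint: str = ""                 # item 01
    repro_script: str = ""                           # item 02: script
    commit: str = ""                                 # item 02: frozen commit
    seed: Optional[int] = None                       # item 02: seed
    environment: dict = field(default_factory=dict)  # item 03
    scores: dict = field(default_factory=dict)       # item 04
    contact: str = ""                                # item 05


def missing_items(sub, declared_metrics=("accuracy",)):
    """Return the checklist items that the manifest does not yet satisfy."""
    problems = []
    if not sub.checkpoint_or_endpoint:
        problems.append("01 checkpoint or API endpoint")
    if not (sub.repro_script and sub.commit and sub.seed is not None):
        problems.append("02 reproduction script with frozen commit and seed")
    if not (sub.environment.get("python") and "deps" in sub.environment):
        problems.append("03 declared evaluation environment")
    if any(metric not in sub.scores for metric in declared_metrics):
        problems.append("04 one row per declared metric")
    if not sub.contact:
        problems.append("05 contact")
    return problems


draft = Submission(
    checkpoint_or_endpoint="https://example.org/checkpoint",  # placeholder URL
    repro_script="run_eval.py",                               # placeholder name
    commit="0123abc",                                         # placeholder hash
    seed=0,
    environment={"python": "3.11", "deps": ["numpy"]},
    scores={"accuracy": 50.0},                                # placeholder score
)
print(missing_items(draft))
```

The draft above leaves `contact` empty, so the check reports item 05 as the only gap. This dataset declares a single metric, which is why `declared_metrics` defaults to `("accuracy",)`.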