Codesota · Benchmark · GPQA

GPQA.

448 expert-level questions in biology, physics, and chemistry. Designed to be unsearchable.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


accuracy

Accuracy is the reported evaluation metric for GPQA. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
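
As a minimal sketch of how an accuracy score of this kind could be computed, the snippet below assumes each question has exactly one correct choice and reports the percentage of questions answered correctly. The function name `accuracy_percent` and the toy data are hypothetical illustrations, not Codesota's or the GPQA authors' evaluation code.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of questions where the prediction matches the answer key.

    Assumption: one predicted choice and one correct choice per question.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Toy data (hypothetical): 4 questions, 3 answered correctly.
toy_answers = ["A", "C", "B", "D"]
toy_predictions = ["A", "C", "D", "D"]
toy_score = accuracy_percent(toy_predictions, toy_answers)
```

Scores on the leaderboard are on this 0 to 100 scale, so a larger value means more questions answered correctly.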

Trust tiers for accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
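
The muting rule can be sketched as follows, assuming results are compared at year granularity and that "scored better" means strictly higher. The `Row` type and the `is_muted` and `muted_models` helpers are hypothetical names for illustration, not Codesota's implementation. The sample uses four rows copied from the leaderboard, so the flags reflect only that subset.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    year: int


def is_muted(row, rows):
    """True if some result from the same or an earlier year scored strictly higher."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


def muted_models(rows):
    """Return the models that would be muted, in input order."""
    return [r.model for r in rows if is_muted(r, rows)]


# Four-row subset of the leaderboard (model, score, year).
sample_rows = [
    Row("Gemini 3 Pro", 91.9, 2026),
    Row("Claude Opus 4.6", 91.3, 2026),
    Row("Gemini 2.5 Pro", 86.4, 2025),
    Row("GLM-4.7", 85.7, 2025),
]
sample_muted = muted_models(sample_rows)
```

Within this subset, GLM-4.7 is muted by the same-year Gemini 2.5 Pro result and Claude Opus 4.6 by the same-year Gemini 3 Pro result, while the top score of each year stays unmuted.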

| Rank | Model | Trust | Score | Year | Links | Notes |
|------|-------|-------|-------|------|-------|-------|
| 01 | Gemini 3 Pro | unverified | 91.9 | 2026 | Source | |
| 02 | Claude Opus 4.6 | unverified | 91.3 | 2026 | Source | |
| 03 | Kimi K2.6 | unverified | 90.5 | 2026 | Paper | |
| 04 | Gemini 3 Flash | unverified | 90.4 | 2026 | Source | |
| 05 | DeepSeek-V4-Pro Max | unverified | 90.1 | 2026 | Paper, Code | |
| 06 | Claude Sonnet 4.6 | unverified | 89.9 | 2026 | Source | |
| 07 | GPT-5 | unverified | 89 | 2026 | Source | |
| 08 | Qwen3.5-397B-A17B | unverified | 88.4 | 2026 | Paper, Code | |
| 09 | DeepSeek-V4-Flash Max | unverified | 88.1 | 2026 | Paper, Code | |
| 10 | Grok 4 | unverified | 88 | 2026 | Source | |
| 11 | Qwen3.6-27B | unverified | 87.8 | 2026 | Paper, Code | |
| 12 | Kimi-K2.5 | unverified | 87.6 | 2026 | Paper, Code | |
| 13 | Qwen3.5-122B-A10B | unverified | 86.6 | 2026 | Paper, Code, Source | |
| 14 | Gemini 2.5 Pro | unverified | 86.4 | 2025 | Paper | |
| 15 | GLM-5.1 | unverified | 86.2 | 2026 | Paper, Code | |
| 16 | Qwen3.6-35B-A3B | unverified | 86 | 2026 | Paper, Code | |
| 17 | GLM-5 | unverified | 86 | 2026 | Paper, Code, Source | |
| 18 | GLM-4.7 | unverified | 85.7 | 2025 | Paper, Code, Source | |
| 19 | DeepSeek-V3.2-Speciale | unverified | 85.7 | 2025 | Paper, Source | |
| 20 | Qwen3.5-27B | unverified | 85.5 | 2026 | Paper, Code, Source | |
| 21 | MiniMax-M2.5 | unverified | 85.2 | 2026 | Paper, Code | |
| 22 | Step-3.5-Flash PaCoRe | unverified | 85 | 2026 | Paper, Code | |
| 23 | Gemma 4 31B | unverified | 84.3 | 2026 | Paper | |
| 24 | Qwen3.5-35B-A3B | unverified | 84.2 | 2026 | Paper, Code, Source | |
| 25 | Qwen3.5-Omni-Plus | unverified | 83.9 | 2026 | Paper | |
| 26 | Step-3.5-Flash | unverified | 83.5 | 2026 | Paper, Code | |
| 27 | o3 | paper | 82.8 | 2026 | Source | |
| 28 | Gemini 2.5 Flash | unverified | 82.8 | 2026 | Source | |
| 29 | DeepSeek-V3.2 | unverified | 82.4 | 2025 | Paper, Source | |
| 30 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | unverified | 79.23 | 2025 | Paper, Source | |
| 31 | GLM-4.5 | unverified | 79.1 | 2025 | Paper, Code | |
| 32 | o4-mini | paper | 77.6 | 2026 | Source | |
| 33 | Qwen3-VL-235B-A22B-Thinking | unverified | 77.1 | 2025 | Paper, Code | |
| 34 | Claude Opus 4 | verified | 76.7 | 2026 | Source | GPQA Diamond, 0-shot CoT. Source: Claude Opus 4 model card, Anthropic (2025). |
| 35 | o1 | paper | 75.7 | 2026 | Source | |
| 36 | GLM-4.5-Air | unverified | 75 | 2025 | Paper, Code, Source | |
| 37 | o3-mini | unverified | 74.9 | 2026 | Source | Zero-shot CoT, pass@1. Default reasoning effort. |
| 38 | Claude Opus 4.5 | verified | 74.9 | 2026 | Source | GPQA Diamond, 0-shot CoT. Source: Claude Opus 4.5 model card, Anthropic (2025). |
| 39 | Qwen3-Coder-Next | unverified | 74.49 | 2026 | Paper, Code | |
| 40 | Qwen3-VL-235B-A22B-Instruct | unverified | 74.3 | 2025 | Paper, Code | |
| 41 | o1-preview | paper | 73.3 | 2026 | Source | |
| 42 | Qwen3-Omni-Flash-Thinking | unverified | 73.1 | 2025 | Paper, Code | |
| 43 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | unverified | 73 | 2025 | Paper, Code, Source | |
| 44 | DeepSeek R1 | verified | 71.5 | 2026 | Source | GPQA Diamond, 0-shot CoT. Source: DeepSeek-R1 paper Table 3, arxiv:2501.12948 (Jan 2025). |
| 45 | Qwen3-235B-A22B | unverified | 71.1 | 2025 | Paper, Code | |
| 46 | ZAYA1-8B | unverified | 71 | 2026 | Paper, Source | |
| 47 | Claude Sonnet 4 | verified | 70 | 2026 | Source | GPQA Diamond, 0-shot CoT. Source: Claude Sonnet 4 model card, Anthropic (2025). |
| 48 | Llama 4 Maverick | verified | 69.8 | 2026 | Source | GPQA Diamond, 0-shot CoT. Source: Meta Llama 4 blog post (April 2025). |
| 49 | gpt-45-preview | paper | 69.5 | 2026 | Source | |
| 50 | GPT-4.5 Preview | unverified | 69.5 | 2026 | Source | Zero-shot CoT. |
| 51 | MiMo-V2.5-Pro | unverified | 66.7 | 2026 | Paper | |
| 52 | GPT-4.1 mini | unverified | 66.4 | 2026 | Source | |
| 53 | gpt-41 | paper | 66.3 | 2026 | Source | |
| 54 | GPT-4.1 | unverified | 66.3 | 2026 | Source | Zero-shot CoT. |
| 55 | Trinity Large Preview | unverified | 63.32 | 2026 | Paper, Code | |
| 56 | o1-mini | unverified | 60 | 2026 | Source | Zero-shot CoT, pass@1. |
| 57 | claude-35-sonnet | paper | 59.4 | 2026 | Source | |
| 58 | Claude 3.5 Sonnet | unverified | 59.4 | 2026 | Source | Third-party reported. |
| 59 | grok-2 | paper | 56 | 2026 | Source | |
| 60 | Grok 2 | unverified | 56 | 2026 | Source | Third-party reported. |
| 61 | MiniMax-Text-01 | unverified | 54.4 | 2025 | Paper, Code | |
| 62 | Llama 3 (405B, Instruct) | unverified | 51.1 | 2024 | Paper, Code | |
| 63 | llama-31-405b | paper | 50.7 | 2026 | Source | |
| 64 | Llama 3.1 405B | unverified | 50.7 | 2026 | Source | Third-party reported. |
| 65 | Claude 3 Opus | unverified | 50.4 | 2026 | Source | Third-party reported. |
| 66 | claude-3-opus | paper | 50.4 | 2026 | Source | |
| 67 | GPT-4o | unverified | 49.9 | 2026 | Source | Zero-shot CoT. gpt-4o-2024-05-13. |
| 68 | Qwen2.5-Plus | unverified | 49.7 | 2024 | Paper, Code | |
| 69 | GPT-4 Turbo | unverified | 49.3 | 2026 | Source | Zero-shot CoT. |
| 70 | gpt-4-turbo | paper | 49.3 | 2026 | Source | |
| 71 | Qwen2.5-72B-Instruct | verified | 49 | 2026 | Source | Qwen2.5-72B-Instruct. GPQA Diamond. Table 6 in Qwen2.5 Technical Report. |
| 72 | Qwen2.5-VL-72B | unverified | 49 | 2025 | Paper, Code | |
| 73 | Gemini 1.5 Pro | unverified | 46.2 | 2026 | Source | From Google blog. |
| 74 | gemini-15-pro | paper | 46.2 | 2026 | Source | |
| 75 | Gemma 3 (27B, IT) | unverified | 42.4 | 2025 | Paper, Code | |
| 76 | llama-31-70b | paper | 41.7 | 2026 | Source | |
| 77 | Step-3.5-Flash Base | unverified | 41.7 | 2026 | Paper, Code | |
| 78 | Llama 3.1 70B | unverified | 41.7 | 2026 | Source | Third-party reported. |
| 79 | GPT-4o mini | unverified | 40.2 | 2026 | Source | Zero-shot CoT. |
| 80 | gpt-4o-mini | paper | 40.2 | 2026 | Source | |
| 81 | Qwen3-VL-8B-Instruct | unverified | 34.7 | 2025 | Paper, Code | |

§ 03 · Lineage

GPQA in context.

This benchmark (1)
GPQA (active, 2023-11)
§ 04 · Submit a result

Add to the leaderboard.
