GPQA
Multi-step Reasoning · benchmark dataset · 2024 · EN

Graduate-Level Google-Proof Q&A.

448 expert-level questions in biology, physics, and chemistry, designed to be Google-proof: hard to answer correctly even with unrestricted web search.
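The single reported metric is plain accuracy over this multiple-choice set. For orientation, a minimal scoring sketch in Python; the `ask_model` callable and the record fields are illustrative assumptions, not an official harness or schema:

```python
# Minimal accuracy sketch for a GPQA-style run.
# `ask_model` and the record fields are illustrative assumptions,
# not an official harness or schema.
from typing import Callable

def gpqa_accuracy(questions: list[dict],
                  ask_model: Callable[[str, list[str]], str]) -> float:
    """Fraction of questions where the model picks the correct option."""
    correct = 0
    for q in questions:
        # Each record is assumed to carry the prompt, the answer
        # options, and the letter of the correct option ("A".."D").
        choice = ask_model(q["question"], q["options"])
        if choice == q["answer"]:
            correct += 1
    return correct / len(questions)
```

Leaderboard scores are this fraction times 100.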

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

33 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy · higher is better · 33 rows
| # | Model | Access | Org | Submitted | Paper / code | accuracy |
|----|----------------------|-----|-----------|----------|----------------------------------|-------|
| 01 | Gemini 3 Pro | API | Google | Apr 2026 | google-blog | 91.90 |
| 02 | Claude Opus 4.6 | API | Anthropic | Apr 2026 | anthropic-opus-4-6-announcement | 91.30 |
| 03 | Gemini 3 Flash | API | Google | Apr 2026 | google-blog | 90.40 |
| 04 | Claude Sonnet 4.6 | API | Anthropic | Apr 2026 | anthropic-sonnet-4-6-system-card | 89.90 |
| 05 | GPT-5 | API | OpenAI | Apr 2026 | openai-gpt-5-launch | 89.00 |
| 06 | Grok 4 | API | xAI | Apr 2026 | xai-grok-4-announcement | 88.00 |
| 07 | Gemini 2.5 Pro | API | Google | Mar 2026 | google-technical-report | 84.00 |
| 08 | o3 | API | OpenAI | Mar 2026 | openai-simple-evals | 82.80 |
| 09 | Gemini 2.5 Flash | | Google | Apr 2026 | google-model-card | 82.80 |
| 10 | o4-mini | API | OpenAI | Mar 2026 | openai-simple-evals | 77.60 |
| 11 | Claude Opus 4 | API | Anthropic | Mar 2026 | anthropic-model-card | 76.70 |
| 12 | o1 | API | OpenAI | Mar 2026 | openai-simple-evals | 75.70 |
| 13 | Claude Opus 4.5 | API | Anthropic | Mar 2026 | anthropic-model-card | 74.90 |
| 14 | o3-mini | API | OpenAI | Mar 2026 | openai-simple-evals | 74.90 |
| 15 | o1-preview | API | OpenAI | Mar 2026 | openai-simple-evals | 73.30 |
| 16 | DeepSeek R1 | OSS | DeepSeek | Mar 2026 | arxiv | 71.50 |
| 17 | Qwen3-235B-A22B | | Alibaba | Apr 2026 | qwen-model-card | 71.10 |
| 18 | Claude Sonnet 4 | API | Anthropic | Mar 2026 | anthropic-model-card | 70.00 |
| 19 | Llama-4-Maverick | OSS | Meta | Mar 2026 | meta-blog | 69.80 |
| 20 | GPT-4.5 Preview | API | OpenAI | Mar 2026 | openai-simple-evals | 69.50 |
| 21 | GPT-4.1 mini | API | OpenAI | Apr 2026 | pricepertoken-leaderboard | 66.40 |
| 22 | GPT-4.1 | API | OpenAI | Mar 2026 | openai-simple-evals | 66.30 |
| 23 | o1-mini | API | OpenAI | Mar 2026 | openai-simple-evals | 60.00 |
| 24 | Claude 3.5 Sonnet | API | Anthropic | Mar 2026 | openai-simple-evals | 59.40 |
| 25 | Grok 2 | API | xAI | Mar 2026 | openai-simple-evals | 56.00 |
| 26 | Llama 3.1 405B | OSS | Meta | Mar 2026 | openai-simple-evals | 50.70 |
| 27 | Claude 3 Opus | API | Anthropic | Mar 2026 | openai-simple-evals | 50.40 |
| 28 | GPT-4o | API | OpenAI | Mar 2026 | openai-simple-evals | 49.90 |
| 29 | GPT-4 Turbo | API | OpenAI | Mar 2026 | openai-simple-evals | 49.30 |
| 30 | Qwen2.5-72B-Instruct | OSS | Alibaba | Mar 2026 | qwen25-tech-report | 49.00 |
| 31 | Gemini 1.5 Pro | API | Google | Mar 2026 | openai-simple-evals | 46.20 |
| 32 | Llama 3.1 70B | OSS | Meta | Mar 2026 | openai-simple-evals | 41.70 |
| 33 | GPT-4o mini | | OpenAI | Mar 2026 | openai-simple-evals | 40.20 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
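The ordering rule in Fig 2 (score descending, ties broken by submission date) is easy to state in code. A minimal sketch with illustrative rows, assuming the earlier submission ranks first on a tie:

```python
from datetime import date

# Illustrative rows: (model, accuracy, submitted). Not the full table.
rows = [
    ("Claude Opus 4.5", 74.90, date(2026, 3, 20)),
    ("o3-mini",         74.90, date(2026, 3, 5)),
    ("Gemini 3 Pro",    91.90, date(2026, 4, 12)),
]

# Higher accuracy first; on a tie, the earlier submission date wins.
ranked = sorted(rows, key=lambda r: (-r[1], r[2]))
for i, (model, score, submitted) in enumerate(ranked, start=1):
    print(f"{i:02d} {model} {score:.2f} {submitted}")
```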
§ 03 · Progress

3 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Mar 3, 2026 · o3 · OpenAI · 82.80
  2. Mar 27, 2026 · Gemini 2.5 Pro · Google · 84.00
  3. Apr 12, 2026 · Gemini 3 Pro · Google · 91.90
Fig 3 · SOTA-setting models only. 3 entries span Mar 2026 to Apr 2026.
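The progress list is a running-maximum filter over the leaderboard in chronological order. A sketch; the three SOTA steps and their dates come from Fig 3, while the intermediate entry is illustrative:

```python
from datetime import date

# (submitted, model, accuracy); the Claude Opus 4 row is an
# illustrative intermediate entry, not a SOTA step.
results = [
    (date(2026, 3, 3),  "o3",             82.80),
    (date(2026, 3, 20), "Claude Opus 4",  76.70),
    (date(2026, 3, 27), "Gemini 2.5 Pro", 84.00),
    (date(2026, 4, 12), "Gemini 3 Pro",   91.90),
]

best = float("-inf")
sota_line = []
for submitted, model, acc in sorted(results):  # oldest first
    if acc > best:  # only a strict improvement sets a new step
        best = acc
        sota_line.append((submitted, model, acc))

# -> o3 (82.80), Gemini 2.5 Pro (84.00), Gemini 3 Pro (91.90)
```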
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit and seed (see the sketch after this list)
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
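For orientation, a minimal shape such a reproduction script might take. Every name below (constants, output format) is a hypothetical example, not an official Codesota interface, and the evaluation body is left as a stub:

```python
#!/usr/bin/env python3
# Minimal reproduction-script skeleton. All names here are
# hypothetical examples, not an official Codesota interface.
import json
import platform
import random
import sys

SEED = 1234          # frozen seed (requirement 02)
COMMIT = "abc1234"   # pin the exact evaluated commit (requirement 02)

def declared_environment() -> dict:
    """Requirement 03: record the evaluation environment."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "commit": COMMIT,
        "seed": SEED,
    }

def run_eval() -> float:
    random.seed(SEED)
    # ... load the checkpoint or call the API endpoint,
    # score all 448 questions, return accuracy ...
    raise NotImplementedError("model-specific evaluation goes here")

if __name__ == "__main__":
    env = declared_environment()
    accuracy = run_eval()
    # Requirement 04: one row per metric declared by the dataset.
    json.dump({"metric": "accuracy", "value": accuracy, "env": env}, sys.stdout)
```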
GPQA · Multi-step Reasoning benchmark · Codesota