Codesota · LLM · Reasoning Benchmarks
Reasoning · updated April 2026

LLM Reasoning Benchmarks.

Graduate-level knowledge (GPQA Diamond), broad multi-subject reasoning (MMLU-Pro), and extreme frontier difficulty (HLE). These benchmarks separate world-knowledge from genuine scientific reasoning.

§ 01 · GPQA Diamond

Graduate science, Google-proof.

198 expert-authored graduate-level questions in biology, chemistry, and physics, written to be "Google-proof": hard to answer even with unrestricted web access. In the GPQA paper, PhD-level experts score about 65% in their own field, while highly skilled non-expert validators score 34%.

| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 91.9% | Apr 2026 |
| 2 | Claude Opus 4.6 | Anthropic | 91.3% | Apr 2026 |
| 3 | Kimi K2.6 | | 90.5% | Apr 2026 |
| 4 | Gemini 3 Flash | Google | 90.4% | Apr 2026 |
| 5 | DeepSeek-V4-Pro Max | DeepSeek | 90.1% | Apr 2026 |
| 6 | Claude Sonnet 4.6 | Anthropic | 89.9% | Apr 2026 |
| 7 | GPT-5 | OpenAI | 89% | Apr 2026 |
| 8 | Qwen3.5-397B-A17B | Alibaba | 88.4% | Feb 2026 |
| 9 | DeepSeek-V4-Flash Max | DeepSeek | 88.1% | Apr 2026 |
| 10 | Grok 4 | xAI | 88% | Apr 2026 |
| 11 | Qwen3.6-27B | | 87.8% | Apr 2026 |
| 12 | Kimi-K2.5 | Moonshot.AI | 87.6% | Feb 2026 |
| 13 | Qwen3.5-122B-A10B | Alibaba | 86.6% | Feb 2026 |
| 14 | Gemini 2.5 Pro | | 86.4% | Jul 2025 |
| 15 | GLM-5.1 | | 86.2% | Feb 2026 |
| 16 | Qwen3.6-35B-A3B | | 86% | Apr 2026 |
| 17 | GLM-5 | Zhipu AI | 86% | Feb 2026 |
| 18 | GLM-4.7 | Zhipu AI | 85.7% | Aug 2025 |
| 19 | DeepSeek-V3.2-Speciale | DeepSeek | 85.7% | Dec 2025 |
| 20 | Qwen3.5-27B | Alibaba | 85.5% | Feb 2026 |
| 21 | MiniMax-M2.5 | MiniMaxAI | 85.2% | Feb 2026 |
| 22 | Step-3.5-Flash PaCoRe | | 85% | Feb 2026 |
| 23 | Gemma 4 31B | Google | 84.3% | Apr 2026 |
| 24 | Qwen3.5-35B-A3B | Alibaba | 84.2% | Feb 2026 |
| 25 | Gemini 2.5 Pro | Google | 84% | Mar 2026 |
| 26 | Qwen3.5-Omni-Plus | | 83.9% | Apr 2026 |
| 27 | Step-3.5-Flash | | 83.5% | Feb 2026 |
| 28 | Gemini 2.5 Flash | Google | 82.8% | Apr 2026 |
| 29 | o3 | OpenAI | 82.8% | Mar 2026 |
| 30 | Gemini 2.5 Flash | | 82.8% | Jul 2025 |
| 31 | DeepSeek-V3.2 | DeepSeek | 82.4% | Dec 2025 |
| 32 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 79.23% | Dec 2025 |
| 33 | GLM-4.5 | Zhipu AI | 79.1% | Aug 2025 |
| 34 | o4-mini | OpenAI | 77.6% | Mar 2026 |
| 35 | Qwen3-VL-235B-A22B-Thinking | Qwen | 77.1% | Nov 2025 |
| 36 | Claude Opus 4 | Anthropic | 76.7% | Mar 2026 |
| 37 | o1 | OpenAI | 75.7% | Mar 2026 |
| 38 | GLM-4.5-Air | Zhipu AI | 75% | Aug 2025 |
| 39 | Claude Opus 4.5 | Anthropic | 74.9% | Mar 2026 |
| 40 | o3-mini | OpenAI | 74.9% | Mar 2026 |
| 41 | Qwen3-Coder-Next | Qwen | 74.49% | Feb 2026 |
| 42 | Qwen3-VL-235B-A22B-Instruct | Qwen | 74.3% | Nov 2025 |
| 43 | o1-preview | OpenAI | 73.3% | Mar 2026 |
| 44 | Qwen3-Omni-Flash-Thinking | | 73.1% | Sep 2025 |
| 45 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 73% | Dec 2025 |
| 46 | DeepSeek R1 | DeepSeek | 71.5% | Mar 2026 |
| 47 | Qwen3-235B-A22B | Alibaba | 71.1% | Apr 2026 |
| 48 | Qwen3-235B-A22B | Alibaba | 71.1% | May 2025 |
| 49 | ZAYA1-8B | Z.ai | 71% | May 2026 |
| 50 | Claude Sonnet 4 | Anthropic | 70% | Mar 2026 |
| 51 | Llama 4 Maverick | Meta | 69.8% | Mar 2026 |
| 52 | GPT-4.5 Preview | OpenAI | 69.5% | Mar 2026 |
| 53 | MiMo-V2.5-Pro | | 66.7% | Apr 2026 |
| 54 | GPT-4.1 mini | OpenAI | 66.4% | Apr 2026 |
| 55 | GPT-4.1 | OpenAI | 66.3% | Mar 2026 |
| 56 | Trinity Large Preview | Arcee AI | 63.32% | Feb 2026 |
| 57 | o1-mini | OpenAI | 60% | Mar 2026 |
| 58 | Claude 3.5 Sonnet | Anthropic | 59.4% | Mar 2026 |
| 59 | Grok 2 | xAI | 56% | Mar 2026 |
| 60 | MiniMax-Text-01 | MiniMax | 54.4% | Jan 2025 |
| 61 | Llama 3 (405B, Instruct) | Meta | 51.1% | Jul 2024 |
| 62 | Llama 3.1 405B | Meta | 50.7% | Mar 2026 |
| 63 | Claude 3 Opus | Anthropic | 50.4% | Mar 2026 |
| 64 | GPT-4o | OpenAI | 49.9% | Mar 2026 |
| 65 | Qwen2.5-Plus | | 49.7% | Dec 2024 |
| 66 | GPT-4 Turbo | OpenAI | 49.3% | Mar 2026 |
| 67 | Qwen2.5-VL-72B | | 49% | Feb 2025 |
| 68 | Qwen2.5-72B-Instruct | Alibaba | 49% | Mar 2026 |
| 69 | Gemini 1.5 Pro | Google | 46.2% | Mar 2026 |
| 70 | Gemma 3 (27B, IT) | | 42.4% | Mar 2025 |
| 71 | Step-3.5-Flash Base | | 41.7% | Feb 2026 |
| 72 | Llama 3.1 70B | Meta | 41.7% | Mar 2026 |
| 73 | GPT-4o mini | OpenAI | 40.2% | Mar 2026 |
| 74 | Qwen3-VL-8B-Instruct | Qwen | 34.7% | Nov 2025 |

Source: arXiv:2311.12022 · 198-question Diamond set.
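
Because the Diamond set has only 198 questions, each question is worth about half a percentage point, so small gaps between neighbouring rows may not be meaningful. The sketch below estimates that uncertainty with a Wilson score interval. It assumes each score comes from a single pass over independent questions (many reported scores are averages over several runs, which this does not model). `wilson_interval` is an illustrative helper, not part of any benchmark tooling.

```python
import math

N_DIAMOND = 198  # size of the GPQA Diamond set, per the source line above


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    denom = 1 + z**2 / n
    centre = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Scores taken from the leaderboard above; the binomial model is an assumption.
for name, acc in [("Gemini 3 Pro", 0.919), ("GPT-5", 0.89), ("GPT-4o", 0.499)]:
    low, high = wilson_interval(acc, N_DIAMOND)
    print(f"{name}: {acc:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions the intervals for the first and seventh rows overlap, so the ordering among the top entries should be read with caution.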

§ 02 · MMLU-Pro

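10-choice MCQ, 14 disciplines.

A harder version of MMLU: 10-choice multiple-choice questions with added distractors, covering 14 disciplines (about 12,000 questions) rather than the 57 subjects of the original MMLU. The wider answer set reduces surface pattern-matching relative to the original 4-choice format, and the benchmark remains useful for broad capability comparison.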

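| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| 1 | Claude 3.7 Sonnet | Anthropic | 85.1% | Feb 2025 |
| 2 | Gemini 2.5 Pro | Google | 83.7% | Mar 2025 |
| 3 | o3-mini (high) | OpenAI | 79.3% | Feb 2025 |
| 4 | Claude 3.5 Sonnet | Anthropic | 76.1% | Jun 2024 |
| 5 | GPT-4o | OpenAI | 72.6% | May 2024 |
| 6 | Gemini 1.5 Pro | Google | 69% | May 2024 |
| 7 | Claude 3 Opus | Anthropic | 68.5% | Mar 2024 |
| 8 | GPT-4 Turbo | OpenAI | 63.7% | Nov 2023 |
| 9 | Llama 3 70B | Meta | 56.2% | Apr 2024 |
| 10 | DeepSeek V2 Chat | DeepSeek | 54.8% | May 2024 |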

Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought.
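
In the "5-shot chain-of-thought" setting, each test question is preceded by five worked examples whose answers include the reasoning, and the model's final letter is parsed from its completion. The sketch below shows that flow. The prompt template, the "the answer is (X)" convention, and the helper names are assumptions for illustration and may differ from the official evaluation script.

```python
import re
import string

LETTERS = string.ascii_uppercase[:10]  # "ABCDEFGHIJ", one letter per option

# Assumed convention: the completion states "the answer is (X)" with X in A-J.
# The lookahead rejects ordinary words that merely start with a capital A-J.
ANSWER_RE = re.compile(r"[Aa]nswer is \(?([A-J])(?![A-Za-z])")


def format_question(question, options, rationale=None):
    """Render one question; passing a rationale turns it into a worked exemplar."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    answer = "Answer: Let's think step by step."
    if rationale:
        answer += f" {rationale}"
    lines.append(answer)
    return "\n".join(lines)


def build_prompt(exemplars, question, options):
    """Join worked exemplars (five in the 5-shot setting) with the test question."""
    shots = [format_question(q, opts, r) for q, opts, r in exemplars]
    return "\n\n".join(shots + [format_question(question, options)])


def extract_answer(completion):
    """Return the last stated answer letter, or None if none is found."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice: chain-of-thought completions sometimes state a tentative answer before revising it.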

§ 03 · HLE

Humanity's Last Exam — built to last.

2,500 expert-contributed questions spanning math, science, law, and humanities, designed to remain unsaturated for years. No tools allowed. Most models on the leaderboard score below 40%; the top entry reaches 54%.

| # | Model | Provider | Accuracy | Date |
|---|---|---|---|---|
| 1 | Kimi K2.6 | | 54% | Apr 2026 |
| 2 | MiMo-V2.5-Pro | | 48% | Apr 2026 |
| 3 | Gemini 3.1 Pro | Google | 46.44% | May 2026 |
| 4 | GPT-5.4 Pro | OpenAI | 44.32% | May 2026 |
| 5 | Muse Spark | Meta | 40.56% | May 2026 |
| 6 | Gemini 3 Pro | Google | 38.3% | |
| 7 | DeepSeek-V4-Pro Max | DeepSeek | 37.7% | Apr 2026 |
| 8 | Gemini 3 Pro Preview | Google | 37.52% | May 2026 |
| 9 | GPT-5.4 | OpenAI | 36.24% | May 2026 |
| 10 | Claude Opus 4.7 | Anthropic | 36.2% | May 2026 |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | 34.8% | Apr 2026 |
| 12 | Claude Opus 4.6 | Anthropic | 34.44% | May 2026 |
| 13 | GPT-5 Pro | OpenAI | 31.64% | May 2026 |
| 14 | GLM-5.1 | | 31% | Feb 2026 |
| 15 | DeepSeek-V3.2-Speciale | DeepSeek | 30.6% | Dec 2025 |
| 16 | GLM-5 | Zhipu AI | 30.5% | Feb 2026 |
| 17 | Kimi-K2.5 | Moonshot.AI | 30.1% | Feb 2026 |
| 18 | Qwen3.5-397B-A17B | Alibaba | 28.7% | Feb 2026 |
| 19 | Step-3.5-Flash PaCoRe | | 27.9% | Feb 2026 |
| 20 | GPT-5.2 | OpenAI | 27.8% | May 2026 |
| 21 | Gemma 4 31B | Google | 26.5% | Apr 2026 |
| 22 | GPT-5 | OpenAI | 25.32% | May 2026 |
| 23 | GPT-5 | OpenAI | 25.3% | |
| 24 | Claude Opus 4.5 | Anthropic | 25.2% | May 2026 |
| 25 | DeepSeek-V3.2 | DeepSeek | 25.1% | Dec 2025 |
| 26 | GLM-4.7 | Zhipu AI | 24.8% | Aug 2025 |
| 27 | Grok 4 | xAI | 24.5% | |
| 28 | Kimi K2.5 | Moonshot AI | 24.37% | May 2026 |
| 29 | Qwen3.6-27B | | 24% | Apr 2026 |
| 30 | GPT-5.1 | OpenAI | 23.68% | May 2026 |
| 31 | Step-3.5-Flash | | 23.1% | Feb 2026 |
| 32 | Gemini 2.5 Pro | Google | 21.64% | May 2026 |
| 33 | Gemini 2.5 Pro | Google | 21.6% | |
| 34 | Gemini 2.5 Pro | | 21.6% | Jul 2025 |
| 35 | Qwen3.6-35B-A3B | | 21.4% | Apr 2026 |
| 36 | o3 | OpenAI | 20.32% | May 2026 |
| 37 | GPT-5 mini | OpenAI | 19.44% | May 2026 |
| 38 | GPT-5 mini | OpenAI | 19.4% | |
| 39 | MiniMax-M2.5 | MiniMaxAI | 19.4% | Feb 2026 |
| 40 | Claude Opus 4.6 | Anthropic | 19% | Apr 2026 |
| 41 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | 18.26% | Dec 2025 |
| 42 | o4-mini | OpenAI | 18.08% | May 2026 |
| 43 | GLM-4.5 | Zhipu AI | 14.4% | Aug 2025 |
| 44 | Claude Sonnet 4.5 | Anthropic | 13.72% | May 2026 |
| 45 | Claude 4.5 Sonnet | Anthropic | 13.7% | |
| 46 | Claude Sonnet 4.6 | Anthropic | 13.2% | Apr 2026 |
| 47 | Gemini 2.5 Flash | Google | 12.1% | |
| 48 | Gemini 2.5 Flash | Google | 12.08% | May 2026 |
| 49 | Claude Opus 4.1 | Anthropic | 11.52% | May 2026 |
| 50 | Gemini 2.5 Flash | | 11% | Jul 2025 |
| 51 | Claude Opus 4 | Anthropic | 10.72% | May 2026 |
| 52 | GLM-4.5-Air | Zhipu AI | 10.6% | Aug 2025 |
| 53 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | 10.6% | Dec 2025 |
| 54 | Gemini 3.1 Flash-Lite | Google | 8.64% | May 2026 |
| 55 | DeepSeek R1 | DeepSeek | 8.5% | |
| 56 | GLM-4.5 | Zhipu AI | 8.32% | May 2026 |
| 57 | o1 Pro | OpenAI | 8.12% | May 2026 |
| 58 | GLM-4.5-Air | Zhipu AI | 8.12% | May 2026 |
| 59 | Claude 3.7 Sonnet | Anthropic | 8.04% | May 2026 |
| 60 | o1 | OpenAI | 8% | |
| 61 | o1 | OpenAI | 7.96% | May 2026 |
| 62 | Claude Sonnet 4 | Anthropic | 7.76% | May 2026 |
| 63 | Gemini 2.0 Flash Thinking | Google | 6.56% | May 2026 |
| 64 | Llama 4 Maverick | Meta | 5.68% | May 2026 |
| 65 | GPT-4.5 Preview | OpenAI | 5.44% | May 2026 |
| 66 | GPT-4.1 | OpenAI | 5.4% | May 2026 |
| 67 | GPT-4.1 mini | OpenAI | 4.6% | Apr 2026 |
| 68 | Gemini 1.5 Pro | Google | 4.6% | May 2026 |
| 69 | Mistral-Medium-3 | Mistral | 4.52% | May 2026 |
| 70 | Nova Pro | Amazon | 4.4% | May 2026 |
| 71 | Claude 3.5 Sonnet | Anthropic | 4.08% | May 2026 |
| 72 | Nova Lite | Amazon | 3.64% | May 2026 |
| 73 | GPT-4o | OpenAI | 2.72% | May 2026 |
| 74 | GPT-4o | OpenAI | 2.7% | |

Source: agi.safe.ai · No-tools variant.

§ 04 · Methodology

Frequently asked.

What is GPQA Diamond?

198 graduate-level science questions designed to stump skilled non-experts, including PhD holders working outside the question's field. The questions were written by domain experts, who also supplied the misleading distractors. It measures depth of scientific understanding, not pattern matching.

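Why does MMLU-Pro use 10 choices instead of 4?

With 4 choices, random guessing on the original MMLU already yields about 25% accuracy, and models often reach the right answer by eliminating obviously wrong options. With 10 choices the guessing baseline drops to 10%, and the engineered distractors make elimination far less effective, so a correct answer is stronger evidence of genuine understanding.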
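
The arithmetic behind the guessing baseline can be sketched in a few lines. The helper names are illustrative, and `guess_after_elimination` assumes every eliminated option is in fact wrong and the model guesses uniformly among the rest. The chance-corrected score is one common way to compare formats with different numbers of choices, not something the leaderboard itself reports.

```python
def chance_baseline(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


def guess_after_elimination(num_choices: int, eliminated: int) -> float:
    """Expected accuracy when `eliminated` wrong options are ruled out first."""
    remaining = num_choices - eliminated
    if remaining < 1:
        raise ValueError("cannot eliminate every option")
    return 1.0 / remaining


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and perfect to 1."""
    base = chance_baseline(num_choices)
    return (accuracy - base) / (1.0 - base)


for n in (4, 10):
    print(
        f"{n} choices: guess {chance_baseline(n):.1%}, "
        f"after ruling out two {guess_after_elimination(n, 2):.1%}"
    )
```

Ruling out two wrong options lifts a pure guesser from 25% to 50% on four choices, but only from 10% to 12.5% on ten.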

What makes HLE (Humanity's Last Exam) different from GPQA?

HLE covers a much broader domain (math, science, law, humanities, linguistics) and is far harder: most models on the leaderboard score below 40%, and the top entry reaches 54%. GPQA focuses specifically on biology, chemistry, and physics, and frontier models now exceed 80% on it.

§ 05 · Related

Continue reading.

Math Benchmarks: GSM8K, MATH-500, AIME 2024
Coding Benchmarks: LiveCodeBench, SWE-bench Verified
All LLM Benchmarks: Full leaderboard overview