Reasoning · updated April 2026

LLM Reasoning Benchmarks.

Graduate-level knowledge (GPQA Diamond), broad multi-subject reasoning (MMLU-Pro), and extreme frontier difficulty (HLE). These benchmarks separate memorized world knowledge from genuine scientific reasoning.

§ 01 · GPQA Diamond

Graduate science, Google-proof.

198 expert-authored graduate-level questions in biology, chemistry, and physics, designed to be impossible to answer by searching the web. PhD specialists score ~65% in their own field; skilled non-specialists (PhDs in other fields, with unrestricted web access) reach only ~34%.

# · Model · Provider · Accuracy · Date
1 · Gemini 3 Pro · Google · 91.9% · Apr 2026
2 · Claude Opus 4.6 · Anthropic · 91.3% · Apr 2026
3 · Gemini 3 Flash · Google · 90.4% · Apr 2026
4 · Claude Sonnet 4.6 · Anthropic · 89.9% · Apr 2026
5 · GPT-5 · OpenAI · 89% · Apr 2026
6 · Grok 4 · xAI · 88% · Apr 2026
7 · Gemini 2.5 Pro · Google · 84% · Mar 2026
8 · o3 · OpenAI · 82.8% · Mar 2026
9 · Gemini 2.5 Flash · Google · 82.8% · Apr 2026
10 · o4-mini · OpenAI · 77.6% · Mar 2026
11 · Claude Opus 4 · Anthropic · 76.7% · Mar 2026
12 · o1 · OpenAI · 75.7% · Mar 2026
13 · Claude Opus 4.5 · Anthropic · 74.9% · Mar 2026
14 · o3-mini · OpenAI · 74.9% · Mar 2026
15 · o1-preview · OpenAI · 73.3% · Mar 2026
16 · DeepSeek R1 · DeepSeek · 71.5% · Mar 2026
17 · Qwen3-235B-A22B · Alibaba · 71.1% · Apr 2026
18 · Claude Sonnet 4 · Anthropic · 70% · Mar 2026
19 · Llama-4-Maverick · Meta · 69.8% · Mar 2026
20 · GPT-4.5 Preview · OpenAI · 69.5% · Mar 2026
21 · GPT-4.1 mini · OpenAI · 66.4% · Apr 2026
22 · GPT-4.1 · OpenAI · 66.3% · Mar 2026
23 · o1-mini · OpenAI · 60% · Mar 2026
24 · Claude 3.5 Sonnet · Anthropic · 59.4% · Mar 2026
25 · Grok 2 · xAI · 56% · Mar 2026
26 · Llama 3.1 405B · Meta · 50.7% · Mar 2026
27 · Claude 3 Opus · Anthropic · 50.4% · Mar 2026
28 · GPT-4o · OpenAI · 49.9% · Mar 2026
29 · GPT-4 Turbo · OpenAI · 49.3% · Mar 2026
30 · Qwen2.5-72B-Instruct · Alibaba · 49% · Mar 2026
31 · Gemini 1.5 Pro · Google · 46.2% · Mar 2026
32 · Llama 3.1 70B · Meta · 41.7% · Mar 2026
33 · GPT-4o mini · OpenAI · 40.2% · Mar 2026

Source: arXiv:2311.12022 · 198-question Diamond set.
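For concreteness, here is a minimal sketch of how a GPQA Diamond accuracy figure like those above is typically produced. It assumes the gated Hugging Face copy of the dataset (Idavidrein/gpqa, gpqa_diamond config, with its published column names) and a placeholder ask_model() function standing in for whichever API is being benchmarked; it is an illustration, not any vendor's official evaluation harness.

```python
# Minimal GPQA Diamond scoring sketch. Assumes access to the gated
# Hugging Face dataset "Idavidrein/gpqa"; ask_model() is a placeholder
# for the model API under test.
import random
from datasets import load_dataset

def evaluate_gpqa_diamond(ask_model, seed: int = 0) -> float:
    ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
    rng = random.Random(seed)
    correct = 0
    for row in ds:
        # One correct answer plus three expert-written distractors,
        # shuffled so that answer position carries no signal.
        options = [row["Correct Answer"],
                   row["Incorrect Answer 1"],
                   row["Incorrect Answer 2"],
                   row["Incorrect Answer 3"]]
        rng.shuffle(options)
        gold = "ABCD"[options.index(row["Correct Answer"])]
        prompt = (row["Question"] + "\n" +
                  "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", options)) +
                  "\nAnswer with a single letter.")
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(ds)
```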

§ 02 · MMLU-Pro

10-choice MCQ, 14 disciplines.

A harder version of MMLU: 10-choice MCQ with engineered distractors across 14 disciplines (about 12,000 questions). The format reduces surface pattern-matching relative to the original 4-choice MMLU and remains useful for broad capability comparison.

# · Model · Provider · Accuracy · Date
1 · Gemini 3.1 Pro · Google · 90.99% · Apr 2026
2 · Gemini 3 Pro · Google · 89.8% · Apr 2026
3 · Claude Opus 4.5 · Anthropic · 89.5% · Apr 2026
4 · Gemini 3 Flash · Google · 89% · Apr 2026
5 · Qwen3.6 Plus · Alibaba Cloud · 88.5% · Apr 2026
6 · Claude Opus 4.1 · Anthropic · 88% · Apr 2026
7 · MiniMax M2.1 · MiniMax · 88% · Apr 2026
8 · Qwen3.5-397B-A17B · Alibaba Cloud · 87.8% · Apr 2026
9 · Claude Sonnet 4.5 · Anthropic · 87.5% · Apr 2026
10 · GPT-5.2 · OpenAI · 87.4% · Apr 2026
11 · Kimi K2.5 · Moonshot AI · 87.1% · Apr 2026
12 · GPT-5 · OpenAI · 87.1% · Apr 2026
13 · GPT-5.1 · OpenAI · 87% · Apr 2026
14 · Grok 4 · xAI · 86.6% · Apr 2026
15 · DeepSeek V3.2 · DeepSeek · 86.2% · Apr 2026
16 · Claude 3.7 Sonnet · Anthropic · 85.1% · Apr 2026
17 · DeepSeek-R1-0528 · DeepSeek · 85% · Apr 2026
18 · Kimi K2-Thinking-0905 · Moonshot AI · 84.6% · Apr 2026
19 · GLM-4.5 · Zhipu AI · 84.6% · Apr 2026
20 · GPT-4o · OpenAI · 72.6% · Apr 2026

Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought.
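Under the 5-shot chain-of-thought protocol, the model writes out its reasoning and the harness extracts a final letter from the completion before scoring. The sketch below illustrates that extraction and scoring step; the "The answer is (X)" pattern follows the style of the TIGER-AI-Lab reference prompts, but the exact regex and fallback here are assumptions, not the official harness code.

```python
import re

# Rough sketch of chain-of-thought answer extraction for a 10-choice
# MMLU-Pro item (options labelled A through J). Real harnesses add
# further fallback heuristics for malformed completions.
CHOICES = "ABCDEFGHIJ"

def extract_choice(completion: str) -> str | None:
    match = re.search(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fallback: take the last standalone letter A-J in the completion.
    letters = re.findall(r"\b([A-J])\b", completion)
    return letters[-1] if letters else None

def score(completions: list[str], golds: list[str]) -> float:
    correct = sum(extract_choice(c) == g for c, g in zip(completions, golds))
    return correct / len(golds)
```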

§ 03 · HLE

Humanity's Last Exam — built to last.

3,000 expert-contributed questions spanning math, science, law, and humanities — designed to remain unsaturated for years. No tools allowed. Even the best model scores below 40%.

# · Model · Provider · Accuracy · Date
1 · Gemini 3 Pro · Google · 38.3% · —
2 · GPT-5 · OpenAI · 25.3% · —
3 · Grok 4 · xAI · 24.5% · —
4 · Gemini 2.5 Pro · Google · 21.6% · —
5 · GPT-5-mini · OpenAI · 19.4% · —
6 · Claude Opus 4.6 · Anthropic · 19% · Apr 2026
7 · Claude 4.5 Sonnet · Anthropic · 13.7% · —
8 · Claude Sonnet 4.6 · Anthropic · 13.2% · Apr 2026
9 · Gemini 2.5 Flash · Google · 12.1% · —
10 · DeepSeek R1 · DeepSeek · 8.5% · —
11 · o1 · OpenAI · 8% · —
12 · GPT-4.1 mini · OpenAI · 4.6% · Apr 2026
13 · GPT-4o · OpenAI · 2.7% · —

Source: agi.safe.ai · No-tools variant.
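Because HLE answers are free-form (short exact answers plus some multiple choice), scoring cannot rely on letter matching alone; the official grading uses a model-based judge. The sketch below is only a simplified normalized exact-match stand-in to illustrate the scoring idea, not the official procedure.

```python
import string

# Simplified stand-in for HLE-style free-form scoring: lowercase,
# strip punctuation and surrounding whitespace, then compare.
def normalize(text: str) -> str:
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def exact_match_accuracy(predictions: list[str], golds: list[str]) -> float:
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```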

§ 04 · Methodology

Frequently asked.

What is GPQA Diamond?

198 graduate-level science questions designed to stump non-expert PhD holders. The questions were written by domain experts, who also authored the plausible-but-wrong answer options. It measures depth of scientific understanding rather than pattern matching.

Why does MMLU-Pro use 10 choices instead of 4?

The original 4-choice MMLU has a ~25% random-guess baseline, and models often get answers right simply by eliminating obviously wrong options. The 10-choice format with engineered distractors lowers that floor to 10% and forces genuine understanding.
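The arithmetic behind that floor is straightforward; the trivial snippet below just prints the random-guess baselines for the two formats.

```python
# Expected accuracy from pure random guessing is 1 / number_of_choices.
for n_choices in (4, 10):
    print(f"{n_choices}-choice random-guess baseline: {1 / n_choices:.0%}")
# 4-choice random-guess baseline: 25%
# 10-choice random-guess baseline: 10%
```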

What makes HLE (Humanity's Last Exam) different from GPQA?

HLE covers a much broader range of domains (math, science, law, humanities, linguistics) and is far harder: even frontier models score below 40%. GPQA focuses specifically on biology, chemistry, and physics, where frontier models now exceed 90%.

§ 05 · Related

Continue reading.

Math Benchmarks · GSM8K, MATH-500, AIME 2024
Coding Benchmarks · LiveCodeBench, SWE-bench Verified
All LLM Benchmarks · Full leaderboard overview