Multi-step Reasoning

Humanity's Last Exam

3,000 expert-level questions designed to be the hardest public benchmark. Questions sourced from domain experts across mathematics, sciences, humanities, and more. Frontier difficulty — most models score below 10%.

Samples: 3,000
Metrics: accuracy
Current State of the Art: Gemini 3 Pro (Google), 38.3 accuracy

HLE — accuracy

13 results · 1 SOTA advance · higher is better

[Chart: accuracy over time, showing all results and the SOTA frontier; Gemini 3 Pro is the current SOTA.]

Top Models Performance Comparison

Top 10 models ranked by accuracy:

Rank  Model              Accuracy  % of best
1     Gemini 3 Pro       38.3      100.0%
2     GPT-5              25.3      66.1%
3     Grok 4             24.5      64.0%
4     Gemini 2.5 Pro     21.6      56.4%
5     GPT-5-mini         19.4      50.7%
6     Claude Opus 4.6    19.0      49.6%
7     Claude 4.5 Sonnet  13.7      35.8%
8     Claude Sonnet 4.6  13.2      34.5%
9     Gemini 2.5 Flash   12.1      31.6%
10    DeepSeek-R1        8.5       22.2%
Best Score: 38.3
Top Model: Gemini 3 Pro
Models Compared: 10
Score Range: 29.8
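The summary figures above follow directly from the top-10 scores: the "% of best" column is each score divided by the best score, and the score range is the spread between the best and worst of the ten. A minimal sketch, using the score values copied from this page (the `scores` dict is ours, not part of the benchmark's tooling):

```python
# Derive the summary stats from the top-10 HLE accuracy scores listed above.
scores = {
    "Gemini 3 Pro": 38.3, "GPT-5": 25.3, "Grok 4": 24.5,
    "Gemini 2.5 Pro": 21.6, "GPT-5-mini": 19.4, "Claude Opus 4.6": 19.0,
    "Claude 4.5 Sonnet": 13.7, "Claude Sonnet 4.6": 13.2,
    "Gemini 2.5 Flash": 12.1, "DeepSeek-R1": 8.5,
}

best = max(scores.values())                           # 38.3
score_range = round(best - min(scores.values()), 1)   # 38.3 - 8.5 = 29.8
pct_of_best = {m: round(100 * s / best, 1) for m, s in scores.items()}

print(best, score_range, pct_of_best["GPT-5"])  # 38.3 29.8 66.1
```

Note that the range is computed over the top 10 models only; including the full 13-result table (worst score 2.7) would give a wider spread.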

accuracy (primary metric)

#   Model                      Org        Score  Paper / Code  Date
1   Gemini 3 Pro               Google     38.3   -             -
2   GPT-5 (API)                OpenAI     25.3   -             -
3   Grok 4 (API)               xAI        24.5   -             -
4   Gemini 2.5 Pro             Google     21.6   -             -
5   GPT-5-mini                 OpenAI     19.4   -             -
6   Claude Opus 4.6 (API)      Anthropic  19.0   -             Apr 2026
7   Claude 4.5 Sonnet          Anthropic  13.7   -             -
8   Claude Sonnet 4.6 (API)    Anthropic  13.2   -             Apr 2026
9   Gemini 2.5 Flash           Google     12.1   -             -
10  DeepSeek-R1 (Open Source)  DeepSeek   8.5    -             -
11  o1 (API)                   OpenAI     8.0    -             -
12  GPT-4.1 mini (API)         OpenAI     4.6    -             Apr 2026
13  GPT-4o (API)               OpenAI     2.7    -             -

Other Multi-step Reasoning Datasets