Multi-step Reasoning · 2025 · en
Humanity's Last Exam
A set of 3,000 expert-level questions designed to be the hardest public benchmark. Questions are sourced from domain experts across mathematics, the sciences, the humanities, and more. Difficulty sits at the frontier: the best listed model reaches 38.3% accuracy, and four of the 13 listed models score below 10%.
Current State of the Art
Gemini 3 Pro: 38.3 accuracy
HLE — accuracy
13 results · 1 SOTA advance · higher is better
Top Models Performance Comparison
Top 10 models ranked by accuracy
Best Score: 38.3
Top Model: Gemini 3 Pro
Models Compared: 10
Score Range: 29.8 (38.3 for the top model minus 8.5 for the 10th-ranked model)
Primary metric: accuracy
| # | Model | Organization | Access | Score (accuracy, %) | Paper / Code | Date |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | - | 38.3 | - | - |
| 2 | GPT-5 | OpenAI | API | 25.3 | - | - |
| 3 | Grok 4 | xAI | API | 24.5 | - | - |
| 4 | Gemini 2.5 Pro | Google | - | 21.6 | - | - |
| 5 | GPT-5-mini | OpenAI | - | 19.4 | - | - |
| 6 | Claude Opus 4.6 | Anthropic | API | 19 | - | Apr 2026 |
| 7 | Claude 4.5 Sonnet | Anthropic | - | 13.7 | - | - |
| 8 | Claude Sonnet 4.6 | Anthropic | API | 13.2 | - | Apr 2026 |
| 9 | Gemini 2.5 Flash | Google | - | 12.1 | - | - |
| 10 | DeepSeek-R1 | DeepSeek | Open Source | 8.5 | - | - |
| 11 | o1 | OpenAI | API | 8 | - | - |
| 12 | GPT-4.1 mini | OpenAI | API | 4.6 | - | Apr 2026 |
| 13 | GPT-4o | OpenAI | API | 2.7 | - | - |
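
The summary figures for the top 10 models (best score, top model, models compared, score range) can be reproduced from the scores in the table above. The following is a minimal sketch, assuming the score range is the best score minus the 10th-ranked score. The page does not document how it computes these figures, and the names `scores`, `summarize`, and `top_n` are illustrative, not part of any leaderboard API.

```python
# Scores copied from the results table above (accuracy, higher is better).
scores = {
    "Gemini 3 Pro": 38.3,
    "GPT-5": 25.3,
    "Grok 4": 24.5,
    "Gemini 2.5 Pro": 21.6,
    "GPT-5-mini": 19.4,
    "Claude Opus 4.6": 19,
    "Claude 4.5 Sonnet": 13.7,
    "Claude Sonnet 4.6": 13.2,
    "Gemini 2.5 Flash": 12.1,
    "DeepSeek-R1": 8.5,
    "o1": 8,
    "GPT-4.1 mini": 4.6,
    "GPT-4o": 2.7,
}


def summarize(scores, top_n=10):
    """Summarize the top_n entries, ranked by score in descending order.

    Assumption: "score range" means best score minus the lowest score
    among the top_n entries.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    top_model, best_score = ranked[0]
    lowest_score = ranked[-1][1]
    return {
        "top_model": top_model,
        "best_score": best_score,
        "models_compared": len(ranked),
        # Round to one decimal place to match the precision of the table.
        "score_range": round(best_score - lowest_score, 1),
    }


summary = summarize(scores)
print(summary)
```

Under that assumption, the computed values agree with the summary figures shown on the page (best score 38.3, 10 models compared, score range 29.8).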