Codesota · Benchmark · Humanity's Last Exam
Reasoning lineage · contamination-resistant · Jan 2025

Humanity's Last Exam.

3,000 questions hand-written by domain experts, deliberately chosen to be unsolvable by the frontier models of the time, and many still are. This is where reasoning evaluation moved once MMLU broke 90%, then MMLU-Pro broke 90%, then GPQA Diamond approached 85%: a benchmark designed to last a few years rather than a few months.

Lineage status · Active · still discriminative, top model at 54.0%

Released by the Center for AI Safety + Scale AI. Deliberately heterogeneous: math, physics, chemistry, biology, history, classical languages, law. Many questions require multimodal grounding; text-only is the standard reported number.

§ 01 · Lineage

Reasoning eval, made hard again.

Each of these benchmarks held the frontier for one to two years before saturation set in. HLE's explicit design goal was to produce a benchmark that cannot be solved by any current model — buying a multi-year window of discrimination between systems.

MMLU
Sep 2020
Saturated

57-subject multiple choice. Frontier models broke 90% by 2024 — discriminative power gone.

MMLU-Pro
Jun 2024
Saturating

12k harder multiple-choice questions with 10 answer options each, reasoning-heavy. Top models are now at ~91%, so the saturation curve is repeating.

GPQA Diamond
Nov 2023
Saturating

198 PhD-written questions on which skilled non-experts spend 30+ minutes with web access and still mostly fail. The frontier is closing on ~85%.

Humanity's Last Exam
Jan 2025
Active

3,000 expert-written questions explicitly designed to remain unsaturated. Top model is at 54.0%.

Fig 2 · Reasoning evaluation, attention-path. Text-only accuracy on each benchmark's standard split. Curated coding lineage at /lineage/coding for comparison.
§ 02 · SOTA

9.1% → 54.0%, two tracks.

15 months of frontier reasoning models against a benchmark designed to outlast them. Closed/API leads by 23.4 points; open-weight reasoning models (DeepSeek R1, Qwen3-Max-Thinking, GLM-5) narrowed the gap quickly in 2026.


API · latest: Apr 2026 · Kimi K2.6 · 54.0%
Open · latest: Dec 2025 · DeepSeek-V3.2-Speciale · 30.6%
Frontier gap: 23.4pp
Headroom: ~46pp to ceiling
[Chart: HLE accuracy (y-axis ticks 0% to 50%) over 2025 to 2026. Closed / API records: 9.1, 20.3, 21.6, 24.8, 27.6, 31.0, 33.1, 54.0 (Kimi K2.6). Open weight records: 8.6, 24.8, 30.6 (DeepSeek-V3.2-Speciale).]
Fig 3 · HLE accuracy by record-setting model, split by license. Y-axis ticks run to 50%; the top score is 54.0%, leaving about 46pp of headroom.
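The frontier gap and headroom follow directly from the two latest records. A minimal sketch of that arithmetic, assuming 100% is the ceiling and using `Decimal` to avoid floating-point noise in the subtraction; the variable names are illustrative, not part of any Codesota tooling:

```python
from decimal import Decimal

# Latest record on each track, as listed in the stat cards above.
api_latest = Decimal("54.0")   # Kimi K2.6, Apr 2026
open_latest = Decimal("30.6")  # DeepSeek-V3.2-Speciale, Dec 2025
ceiling = Decimal("100")       # assumption: a perfect score is the ceiling

# Gap between the best closed/API score and the best open-weight score.
frontier_gap = api_latest - open_latest

# Points left between the best score and the assumed ceiling.
headroom = ceiling - api_latest

print(f"frontier gap: {frontier_gap}pp, headroom: {headroom}pp")
```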
§ 03 · Leaderboard

Best published scores.

Accuracy on HLE's public text-only split. Multimodal sub-scores tend to track lower; vendors that report multimodal separately are noted in the source link. Shaded row marks SOTA.


Metric: accuracy · higher is better
Rows: 25
Source: live · benchmark_results
# | Model | Vendor | Type | Submitted | Source | accuracy %
01 | Kimi K2.6 |  | API | Apr 2026 | source | 54.0
02 | MiMo-V2.5-Pro |  | API | Apr 2026 | source | 48.0
03 | Gemini 3.1 Pro | Google | API |  | source | 46.4
04 | GPT-5.4 Pro | OpenAI | API |  | source | 44.3
05 | Muse Spark | Meta | API |  | source | 40.6
06 | Gemini 3 Pro | Google | API |  | lastexam.ai | 38.3
07 | DeepSeek-V4-Pro Max | DeepSeek | API | Apr 2026 | source | 37.7
08 | Gemini 3 Pro Preview | Google | API |  | source | 37.5
09 | GPT-5.4 | OpenAI | API |  | source | 36.2
10 | Claude Opus 4.7 | Anthropic | API |  | source | 36.2
11 | DeepSeek-V4-Flash Max | DeepSeek | API | Apr 2026 | source | 34.8
12 | Claude Opus 4.6 | Anthropic | API |  | source | 34.4
13 | GPT-5 Pro | OpenAI | API |  | source | 31.6
14 | GLM-5.1 |  | API | Feb 2026 | GLM-5: from Vibe Coding to Agentic E… | 31.0
15 | DeepSeek-V3.2-Speciale | DeepSeek | OSS | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier … | 30.6
16 | GLM-5 | Zhipu AI | OSS | Feb 2026 | GLM-5: from Vibe Coding to Agentic E… | 30.5
17 | Kimi-K2.5 | Moonshot.AI | OSS | Feb 2026 | Kimi K2.5: Visual Agentic Intelligen… | 30.1
18 | Qwen3.5-397B-A17B | Alibaba | OSS | Feb 2026 | source | 28.7
19 | Step-3.5-Flash PaCoRe |  | API | Feb 2026 | Step 3.5 Flash: Open Frontier-Level … | 27.9
20 | GPT-5.2 | OpenAI | API |  | source | 27.8
21 | Gemma 4 31B | Google | API | Apr 2026 | source | 26.5
22 | GPT-5 | OpenAI | API |  | source | 25.3
23 | GPT-5 | OpenAI | API |  | lastexam.ai | 25.3
24 | Claude Opus 4.5 | Anthropic | API |  | source | 25.2
25 | DeepSeek-V3.2 | DeepSeek | OSS | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier … | 25.1
Fig 4 · Vendor-reported HLE accuracy. Open-weight reasoning models (DeepSeek R1, Qwen3-Max-Thinking, GLM-5) closed the gap fast — frontier gap is now 23.4 points.
§ 04 · Open vs closed

The gap is 23.4 points.

Reasoning chains matter more on HLE than on any other benchmark on Codesota. Models without extended thinking modes (or with limited budgets) drop sharply. The open-weight catch-up is real but slower than on coding benchmarks.

Open-weight avg: 29.0% · 5 models · top: DeepSeek-V3.2-Speciale · 30.6%
API/closed avg: 35.5% · 20 models · top: Kimi K2.6 · 54.0%
Frontier gap: 23.4pp · Kimi K2.6 − DeepSeek-V3.2-Speciale
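The per-track averages can be reproduced from the leaderboard rows. The sketch below assumes a plain arithmetic mean over every row of each type (including both GPT-5 rows) and half-up rounding to one decimal place; the page does not state its rounding rule, so that part is a guess that happens to match the published figures. The names `API_SCORES`, `OSS_SCORES`, and `mean_1dp` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the leaderboard in § 03, grouped by the Type column.
API_SCORES = [
    "54.0", "48.0", "46.4", "44.3", "40.6", "38.3", "37.7", "37.5", "36.2", "36.2",
    "34.8", "34.4", "31.6", "31.0", "27.9", "27.8", "26.5", "25.3", "25.3", "25.2",
]
OSS_SCORES = ["30.6", "30.5", "30.1", "28.7", "25.1"]


def mean_1dp(scores):
    """Arithmetic mean, rounded half-up to one decimal place (assumed rule)."""
    values = [Decimal(s) for s in scores]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


api_avg = mean_1dp(API_SCORES)
oss_avg = mean_1dp(OSS_SCORES)

# Frontier gap: best closed/API score minus best open-weight score.
gap = max(Decimal(s) for s in API_SCORES) - max(Decimal(s) for s in OSS_SCORES)

print(f"API/closed avg {api_avg}% over {len(API_SCORES)} models")
print(f"Open-weight avg {oss_avg}% over {len(OSS_SCORES)} models")
print(f"Frontier gap {gap}pp")
```

Note that the averages differ by only about 6.5 points while the frontier gap is 23.4: the gap compares the single best model on each track, so it is dominated by the top of the closed/API list.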
§ 05 · Methodology

Why HLE is built to last.

Expert-authored

Every question was written by a domain expert (PhD or working professional) under instructions to make it adversarial: solvable from a textbook reference, not from common training data.

Heterogeneous by design

Math, physics, chemistry, biology, classical languages, law, history. No single capability covers more than a slim fraction of the test — narrow models can't shortcut their way up.

Auto-graded against keys

Each item has a short reference answer; an LLM-as-judge (GPT-4o) checks for semantic match to the gold key. Low judge variance reported in the launch paper.
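The grading loop described above can be sketched as accuracy over a pluggable judge. This is an illustration only, not the HLE grading code: the `Item` type, the `Judge` signature, and `exact_match_judge` are hypothetical, and the stand-in judge uses normalized string equality where the real pipeline would call an LLM to decide semantic match.

```python
from typing import Callable, Iterable, NamedTuple


class Item(NamedTuple):
    question: str
    gold: str        # short reference answer (the key)
    prediction: str  # model's final answer


# A judge takes (question, gold, prediction) and returns True on a match.
Judge = Callable[[str, str, str], bool]


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, drop trailing periods."""
    return " ".join(text.lower().split()).rstrip(".")


def exact_match_judge(question: str, gold: str, prediction: str) -> bool:
    """Offline stand-in for an LLM judge: normalized string equality only."""
    return normalize(gold) == normalize(prediction)


def accuracy(items: Iterable[Item], judge: Judge) -> float:
    """Percentage of items the judge marks as matching the gold key."""
    items = list(items)
    if not items:
        raise ValueError("no items to grade")
    correct = sum(1 for it in items if judge(it.question, it.gold, it.prediction))
    return 100.0 * correct / len(items)


demo = [
    Item("Capital of France?", "Paris", "  paris. "),
    Item("2 + 2?", "4", "4"),
    Item("Symbol for gold?", "Au", "Ag"),
    Item("Largest planet?", "Jupiter", "JUPITER"),
]
print(accuracy(demo, exact_match_judge))
```

Keeping the judge behind a callable means the scoring code stays the same whether the match decision comes from string comparison or from a model call.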

Held-out test set

A private split exists alongside the public 3,000 questions to detect overfitting. Frontier vendors are encouraged to report on both; large gaps signal contamination.
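One way to judge whether a public/private gap is "large" is to compare it with sampling noise. The sketch below uses a standard two-proportion z-score; the function names, the threshold of 3.0, and the private-split size in the example are assumptions for illustration, since the page gives neither a threshold nor the size of the private set.

```python
import math


def split_gap(public_correct, public_total, private_correct, private_total):
    """Return (gap in percentage points, z-score) for public minus private accuracy."""
    p_pub = public_correct / public_total
    p_priv = private_correct / private_total
    gap_pp = 100.0 * (p_pub - p_priv)
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_pub * (1 - p_pub) / public_total
        + p_priv * (1 - p_priv) / private_total
    )
    z = (p_pub - p_priv) / se if se > 0 else 0.0
    return gap_pp, z


def looks_contaminated(public_correct, public_total,
                       private_correct, private_total,
                       z_threshold=3.0):
    """Flag a model whose public score exceeds its private score by more than noise.

    z_threshold is a hypothetical cut-off, not a value from the benchmark.
    """
    _, z = split_gap(public_correct, public_total, private_correct, private_total)
    return z > z_threshold


# Hypothetical numbers: 40% on 3,000 public items vs 30% on 500 private items.
gap_pp, z = split_gap(1200, 3000, 150, 500)
print(f"gap {gap_pp:.1f}pp, z = {z:.2f}")
```

A small private split widens the standard error, so the same gap in points is weaker evidence than it would be on two equally large sets.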
