Codesota · Benchmark · Humanity's Last Exam

Reasoning lineage · contamination-resistant · Jan 2025

Humanity's Last Exam.

3,000 questions hand-written by domain experts, deliberately chosen to be unsolvable by current frontier models, and most of them still are. This is where reasoning evaluation moved once MMLU broke 90%, then MMLU-Pro broke 90%, then GPQA Diamond approached 85%. It is a benchmark designed to last a few years rather than a few months.

Lineage status · Active · still discriminative, top model at 54.0%

Released by the Center for AI Safety + Scale AI. Deliberately heterogeneous: math, physics, chemistry, biology, history, classical languages, law. Many questions require multimodal grounding; text-only is the standard reported number.


§ 01 · Lineage

Reasoning eval, made hard again.

Each of these benchmarks held the frontier for one to two years before saturation set in. HLE's explicit design goal was to produce a benchmark that cannot be solved by any current model — buying a multi-year window of discrimination between systems.

MMLU

Sep 2020

Saturated

57-subject multiple choice. Frontier models broke 90% by 2024 — discriminative power gone.

MMLU-Pro

Jun 2024

Saturating

12k harder multiple-choice questions with 10 answer options each, reasoning-heavy. Top models now ~91%, so the saturation curve is repeating.

GPQA Diamond

Nov 2023

Saturating

198 PhD-written questions on which skilled non-experts spend 30+ minutes with web access and still mostly fail. The frontier is closing in on ~85%.

Humanity's Last Exam

Jan 2025

Active

3,000 expert-written questions explicitly designed to remain unsaturated. Top model at 54.0% as of Apr 2026.


Fig 2 · Reasoning evaluation, attention-path. Text-only accuracy on each benchmark's standard split. Curated coding lineage at /lineage/coding for comparison.

§ 02 · SOTA

9.1% → 54.0%, two tracks.

15 months of frontier reasoning models against a benchmark designed to outlast them. Closed/API models lead by about 23 points at the top of each track; open-weight reasoning models (DeepSeek R1, Qwen3-Max-Thinking, GLM-5) closed the gap fast in 2026.

API · latest: Apr 2026 · Kimi K2.6 · 54.0%
Open · latest: Dec 2025 · DeepSeek-V3.2-Speciale · 30.6%
Frontier gap: 23.4pp
Headroom: ~46pp to ceiling
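The two summary figures above are simple differences, and a minimal sketch of the arithmetic follows. It assumes the ceiling is a perfect 100% score and takes the track leaders from the two "latest" lines; the variable names are illustrative only.

```python
# Track leaders as listed above (accuracy in percent).
api_top = ("Kimi K2.6", 54.0)
open_top = ("DeepSeek-V3.2-Speciale", 30.6)

CEILING = 100.0  # assumption: a perfect score is the ceiling

# Frontier gap: best closed/API score minus best open-weight score.
frontier_gap = round(api_top[1] - open_top[1], 1)

# Headroom: distance from the overall leader to the ceiling.
headroom = round(CEILING - api_top[1], 1)

print(f"Frontier gap: {frontier_gap}pp")
print(f"Headroom: {headroom}pp")
```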

Legend: Closed / API · Open weight

Fig 3 · HLE accuracy by record-setting model, split by license. The top score on the leaderboard is 54.0%, which leaves ~46pp of headroom.

§ 03 · Leaderboard

Best published scores.

Accuracy on HLE's public text-only split. Multimodal sub-scores tend to track lower; vendors that report multimodal separately are noted in the source link. Shaded row marks SOTA.

Metric: accuracy · higher is better
Rows: 25
Source: live · benchmark_results

#	Model	Vendor	Type	Submitted	Source	Accuracy (%)
01	Kimi K2.6	—	API	Apr 2026	source	54.0
02	MiMo-V2.5-Pro	—	API	Apr 2026	source	48.0
03	Gemini 3.1 Pro	Google	API	—	source	46.4
04	GPT-5.4 Pro	OpenAI	API	—	source	44.3
05	Muse Spark	Meta	API	—	source	40.6
06	Gemini 3 Pro	Google	API	—	lastexam.ai	38.3
07	DeepSeek-V4-Pro Max	DeepSeek	API	Apr 2026	source	37.7
08	Gemini 3 Pro Preview	Google	API	—	source	37.5
09	GPT-5.4	OpenAI	API	—	source	36.2
10	Claude Opus 4.7	Anthropic	API	—	source	36.2
11	DeepSeek-V4-Flash Max	DeepSeek	API	Apr 2026	source	34.8
12	Claude Opus 4.6	Anthropic	API	—	source	34.4
13	GPT-5 Pro	OpenAI	API	—	source	31.6
14	GLM-5.1	—	API	Feb 2026	GLM-5: from Vibe Coding to Agentic E…	31.0
15	DeepSeek-V3.2-Speciale	DeepSeek	OSS	Dec 2025	DeepSeek-V3.2: Pushing the Frontier …	30.6
16	GLM-5	Zhipu AI	OSS	Feb 2026	GLM-5: from Vibe Coding to Agentic E…	30.5
17	Kimi-K2.5	Moonshot.AI	OSS	Feb 2026	Kimi K2.5: Visual Agentic Intelligen…	30.1
18	Qwen3.5-397B-A17B	Alibaba	OSS	Feb 2026	source	28.7
19	Step-3.5-Flash PaCoRe	—	API	Feb 2026	Step 3.5 Flash: Open Frontier-Level …	27.9
20	GPT-5.2	OpenAI	API	—	source	27.8
21	Gemma 4 31B	Google	API	Apr 2026	source	26.5
22	GPT-5	OpenAI	API	—	source	25.3
23	GPT-5	OpenAI	API	—	lastexam.ai	25.3
24	Claude Opus 4.5	Anthropic	API	—	source	25.2
25	DeepSeek-V3.2	DeepSeek	OSS	Dec 2025	DeepSeek-V3.2: Pushing the Frontier …	25.1

Fig 4 · Vendor-reported HLE accuracy. Open-weight reasoning models (DeepSeek R1, Qwen3-Max-Thinking, GLM-5) closed the gap fast — frontier gap is now 23.4 points.
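Accuracy over a finite question set carries sampling noise, which matters when reading closely spaced rows. The sketch below uses a normal-approximation interval and assumes, purely for illustration, that every score was computed over all 3,000 public questions; vendors that report on the text-only subset use fewer items, which would widen the interval. The helper `accuracy_interval` is a hypothetical name, not part of any HLE tooling.

```python
import math

def accuracy_interval(acc_pct: float, n_questions: int, z: float = 1.96):
    """Normal-approximation interval for an accuracy given in percent."""
    p = acc_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n_questions)  # binomial standard error
    half = z * se * 100.0                        # half-width in percentage points
    return acc_pct - half, acc_pct + half

N = 3000  # assumption: scores computed over the full public set

for name, acc in [("Kimi K2.6", 54.0), ("GPT-5.4", 36.2), ("Claude Opus 4.6", 34.4)]:
    lo, hi = accuracy_interval(acc, N)
    print(f"{name}: {acc:.1f}% (95% interval {lo:.1f} to {hi:.1f})")
```

Under these assumptions the half-width is a little under 2 points, so differences of a point or two between neighbouring rows may not reflect a real ordering.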

§ 04 · Open vs closed

The gap is 23.4 points.

Reasoning chains matter more on HLE than on any other benchmark on Codesota. Models without extended thinking modes (or with limited budgets) drop sharply. The open-weight catch-up is real but slower than on coding benchmarks.

Open-weight avg

29.0%

5 models · top: DeepSeek-V3.2-Speciale · 30.6%

API/closed avg

35.5%

20 models · top: Kimi K2.6 · 54.0%

Frontier gap

23.4pp

Kimi K2.6 − DeepSeek-V3.2-Speciale
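The track averages can be reproduced from the leaderboard rows. The sketch below copies the 25 (model, type, accuracy) triples from the table in § 03, including both GPT-5 rows, since the page counts 20 API entries; `summarize` is an illustrative helper.

```python
from statistics import mean

# (model, type, accuracy %) copied from the leaderboard table.
ROWS = [
    ("Kimi K2.6", "API", 54.0), ("MiMo-V2.5-Pro", "API", 48.0),
    ("Gemini 3.1 Pro", "API", 46.4), ("GPT-5.4 Pro", "API", 44.3),
    ("Muse Spark", "API", 40.6), ("Gemini 3 Pro", "API", 38.3),
    ("DeepSeek-V4-Pro Max", "API", 37.7), ("Gemini 3 Pro Preview", "API", 37.5),
    ("GPT-5.4", "API", 36.2), ("Claude Opus 4.7", "API", 36.2),
    ("DeepSeek-V4-Flash Max", "API", 34.8), ("Claude Opus 4.6", "API", 34.4),
    ("GPT-5 Pro", "API", 31.6), ("GLM-5.1", "API", 31.0),
    ("DeepSeek-V3.2-Speciale", "OSS", 30.6), ("GLM-5", "OSS", 30.5),
    ("Kimi-K2.5", "OSS", 30.1), ("Qwen3.5-397B-A17B", "OSS", 28.7),
    ("Step-3.5-Flash PaCoRe", "API", 27.9), ("GPT-5.2", "API", 27.8),
    ("Gemma 4 31B", "API", 26.5), ("GPT-5", "API", 25.3),
    ("GPT-5", "API", 25.3), ("Claude Opus 4.5", "API", 25.2),
    ("DeepSeek-V3.2", "OSS", 25.1),
]

def summarize(rows, kind):
    """Count, mean accuracy, and top entry for one license type."""
    scores = [(acc, name) for name, typ, acc in rows if typ == kind]
    top_acc, top_name = max(scores)
    return {"n": len(scores), "avg": mean(a for a, _ in scores),
            "top": top_name, "top_acc": top_acc}

oss = summarize(ROWS, "OSS")
api = summarize(ROWS, "API")
gap = api["top_acc"] - oss["top_acc"]

print(oss["n"], f"{oss['avg']:.2f}", oss["top"])
print(api["n"], f"{api['avg']:.2f}", api["top"])
print(f"gap: {gap:.1f}pp")
```

The open-weight mean is 29.0 and the API mean is 35.45, which the page displays as 35.5%.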

§ 05 · Methodology

Why HLE is built to last.

Expert-authored

Every question was written by a domain expert (PhD or working professional) under instructions to make it adversarial: solvable from a textbook reference, not from common training data.

Heterogeneous by design

Math, physics, chemistry, biology, classical languages, law, history. No single capability covers more than a slim fraction of the test — narrow models can't shortcut their way up.

Auto-graded against keys

Each item has a short reference answer; an LLM-as-judge (GPT-4o) checks whether the model's response semantically matches the gold key. The launch paper reports low judge variance.
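The grading loop described above can be sketched with a pluggable judge. This is not the official harness: the item keys (`question`, `answer`, `response`), the `Judge` signature, and the sample items are assumptions for illustration, and `exact_match_judge` is an offline stand-in where a real setup would call the LLM judge.

```python
from typing import Callable, Iterable

# (question, gold answer, model response) -> True if judged correct
Judge = Callable[[str, str, str], bool]

def normalize(text: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match_judge(question: str, gold: str, response: str) -> bool:
    """Offline stand-in for the LLM judge: normalized string equality."""
    return normalize(gold) == normalize(response)

def grade(items: Iterable[dict], judge: Judge) -> float:
    """Return accuracy in percent over the graded items."""
    verdicts = [judge(it["question"], it["answer"], it["response"]) for it in items]
    return 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0

# Hypothetical items, not drawn from HLE.
sample = [
    {"question": "What is 2 + 2?", "answer": "4", "response": " 4. "},
    {"question": "Capital of France?", "answer": "Paris", "response": "paris"},
    {"question": "Chemical symbol for gold?", "answer": "Au", "response": "Ag"},
]
score = grade(sample, exact_match_judge)
print(f"accuracy: {score:.1f}%")
```

Keeping the judge as a plain callable means a semantic-match judge can be swapped in without touching the accuracy computation.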

Held-out test set

A private split exists alongside the public 3,000 questions to detect overfitting. Frontier vendors are encouraged to report on both; large gaps signal contamination.
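One way to read "large gaps" is to compare the public/private difference with its sampling error. The sketch below uses a two-proportion z-score; the counts, the private-split size, and the idea of flagging above roughly 2 standard errors are all hypothetical, since the page does not state them.

```python
import math

def split_gap(pub_correct: int, pub_total: int, priv_correct: int, priv_total: int):
    """Return (gap in percentage points, z-score) for public minus private accuracy."""
    p_pub = pub_correct / pub_total
    p_priv = priv_correct / priv_total
    gap = p_pub - p_priv
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_pub * (1 - p_pub) / pub_total
                   + p_priv * (1 - p_priv) / priv_total)
    z = gap / se if se > 0 else 0.0
    return gap * 100.0, z

# Hypothetical counts: 34% on 3,000 public items, 26% on 500 private items.
gap_pp, z = split_gap(1020, 3000, 130, 500)
print(f"gap: {gap_pp:.1f}pp, z = {z:.2f}")
```

A gap that is several standard errors wide is hard to explain by sampling alone, although a difficulty difference between the two splits is another possible cause.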

§ 06 · Resources

Papers and code.

Key papers

Humanity's Last Exam — launch paper

Phan et al. (CAIS · Scale AI) · arXiv 2501.14249

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, Hou, Stickland, Petty, Pang, Dirani, Michael, Bowman · COLM 2024

MMLU-Pro: A More Robust and Challenging MMLU

Wang et al. · NeurIPS 2024

Measuring Massive Multitask Language Understanding (MMLU)

Hendrycks et al. · ICLR 2021

Repositories

centerforaisafety/hle · 1.4k★

Official HLE evaluation harness + question loader

lastexam.ai

Submission portal · live leaderboard · auto-grader

huggingface.co/datasets/cais/hle

Public 3,000-question split on Hugging Face
