3,000 questions hand-written by domain experts, deliberately chosen to be unsolvable by current frontier models, and most still are. Where reasoning evaluation moved once MMLU broke 90%, then MMLU-Pro broke 90%, then GPQA Diamond approached 85%. The benchmark is designed to last a few years rather than a few months.
Released by the Center for AI Safety and Scale AI. Deliberately heterogeneous: math, physics, chemistry, biology, history, classical languages, law. Many questions require multimodal grounding; the text-only score is the number usually reported.
Each of the earlier benchmarks held the frontier for one to two years before saturation set in. HLE's explicit design goal was a benchmark that no current model can solve, buying a multi-year window of discrimination between systems.
MMLU: 57-subject multiple choice. Frontier models broke 90% by 2024; discriminative power is gone.
MMLU-Pro: about 12k harder multiple-choice questions with up to 10 answer options, reasoning-heavy. Top models now ~91%; the saturation curve is repeating.
GPQA Diamond: 198 PhD-written questions on which domain experts spend 30+ minutes. Frontier models are closing in on ~85%.
HLE: 3,000 expert-written questions explicitly designed to remain unsaturated. Top model still under 35%.
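One rough way to read these four figures is as headroom: the points left between the top reported score and 100%. The sketch below uses only the approximate scores quoted in this section, and treats HLE's "under 35%" as an upper bound of 35. The `TOP_SCORES` table and the `headroom` helper are illustrative names, not part of any benchmark tooling.

```python
# Approximate top scores (percent) as quoted in this section.
# HLE is stated as "under 35%", so 35 is an upper bound, not a measured value.
TOP_SCORES = {
    "MMLU": 90.0,
    "MMLU-Pro": 91.0,
    "GPQA Diamond": 85.0,
    "HLE": 35.0,
}


def headroom(top_score_pct: float) -> float:
    """Points remaining between the best reported score and a perfect 100%."""
    if not 0.0 <= top_score_pct <= 100.0:
        raise ValueError("score must be a percentage between 0 and 100")
    return 100.0 - top_score_pct


if __name__ == "__main__":
    for name, score in TOP_SCORES.items():
        print(f"{name:<13} top ~{score:.0f}%  headroom ~{headroom(score):.0f} points")
```

On these numbers HLE leaves at least 65 points of headroom, against roughly 10 to 15 for the other three, which is what makes it useful for separating frontier systems.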
15 months of frontier reasoning models against a benchmark designed to outlast them. Closed API models lead by ~10 points; open-weight reasoning models (DeepSeek R1, Qwen3-Max-Thinking, GLM-5) narrowed the gap quickly in 2026.
Accuracy on HLE's public text-only split. Multimodal sub-scores tend to track lower; vendors that report multimodal separately are noted in the source link. Shaded row marks SOTA.
Reasoning chains matter more on HLE than on any other benchmark on Codesota. Models without extended thinking modes (or with limited budgets) drop sharply. The open-weight catch-up is real but slower than on coding benchmarks.
Every question was written by a domain expert (PhD or working professional) under instructions to make it adversarial: solvable from a textbook reference, not from common training data.
Math, physics, chemistry, biology, classical languages, law, history. No single capability covers more than a slim fraction of the test — narrow models can't shortcut their way up.
Each item has a short reference answer; an LLM-as-judge (GPT-4o) checks for semantic match to the gold key. Low judge variance reported in the launch paper.
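The grading loop described here can be sketched as follows. This is a minimal illustration, not the official HLE evaluation script: the `Item` fields, the `grade` function, and the `toy_judge` stand-in are all hypothetical. In a real run the `judge` callable would wrap a call to the judge model; here it is a normalized string comparison so the example runs offline. The sample items are toy questions, not HLE questions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Item:
    question: str
    reference: str   # short gold answer
    prediction: str  # model's final answer


# A judge takes (question, reference, prediction) and returns True if they match.
Judge = Callable[[str, str, str], bool]


def toy_judge(question: str, reference: str, prediction: str) -> bool:
    """Offline placeholder: case- and whitespace-insensitive string match.

    An LLM judge would instead decide semantic equivalence, e.g. accept
    "one half" for a reference of "1/2". This stand-in cannot do that.
    """
    def norm(text: str) -> str:
        return " ".join(text.lower().split())

    return norm(reference) == norm(prediction)


def grade(items: Iterable[Item], judge: Judge) -> float:
    """Fraction of items the judge marks as matching the gold answer."""
    items = list(items)
    if not items:
        raise ValueError("no items to grade")
    correct = sum(judge(it.question, it.reference, it.prediction) for it in items)
    return correct / len(items)


# Toy items for illustration only.
SAMPLE = [
    Item("What is 2 + 2?", "4", " 4 "),
    Item("Chemical symbol for gold?", "Au", "au"),
    Item("Chemical symbol for silver?", "Ag", "Au"),
    Item("Largest planet in the Solar System?", "Jupiter", "Saturn"),
]

if __name__ == "__main__":
    print(f"accuracy: {grade(SAMPLE, toy_judge):.1%}")
```

Passing the judge in as a callable keeps the accuracy computation independent of which judge model is used, so the same loop can be rerun with a different judge to check how much the score depends on it.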
A private split exists alongside the public 3,000 questions to detect overfitting. Frontier vendors are encouraged to report on both; large gaps signal contamination.
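The overfitting check amounts to comparing two accuracies. A minimal sketch, assuming a model has been scored on both splits: the `split_gap` and `looks_contaminated` helpers, the 5-point default threshold, and the example figures are illustrative assumptions, not values from the benchmark maintainers.

```python
def split_gap(public_acc: float, private_acc: float) -> float:
    """Public minus private accuracy, in percentage points."""
    for acc in (public_acc, private_acc):
        if not 0.0 <= acc <= 100.0:
            raise ValueError("accuracies must be percentages between 0 and 100")
    return public_acc - private_acc


def looks_contaminated(public_acc: float, private_acc: float,
                       threshold_pts: float = 5.0) -> bool:
    """Flag a model whose public score exceeds its private score by more
    than threshold_pts. The threshold is an arbitrary illustrative choice."""
    return split_gap(public_acc, private_acc) > threshold_pts


if __name__ == "__main__":
    # Hypothetical scores, not real leaderboard entries.
    print(looks_contaminated(30.0, 29.0))  # small gap
    print(looks_contaminated(30.0, 18.0))  # large gap
```

A sensible threshold depends on the size of the private split, since a small held-out set produces noisier accuracy estimates and therefore larger gaps by chance.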