Codesota · Reasoning · Multi-step Reasoning · HLE
Multi-step Reasoning · benchmark dataset · 2025 · EN

Humanity's Last Exam.

3,000 expert-level questions designed to be the hardest public benchmark. Questions are sourced from domain experts across mathematics, the sciences, the humanities, and more. Frontier difficulty: many models score below 10%, although 53 of the 74 rows indexed here score above it.

§ 01 · Leaderboard

Best published scores.

74 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 74 rows
| # | Model | Org | Submitted | Paper / code | Accuracy |
|---|---|---|---|---|---|
| 01 | Kimi K2.6 | | Apr 2026 | pwc-dump | 54 |
| 02 | MiMo-V2.5-Pro | | Apr 2026 | pwc-dump | 48 |
| 03 | Gemini 3.1 Pro | Google | May 2026 | scale-hle-official | 46.44 |
| 04 | GPT-5.4 Pro | OpenAI | May 2026 | scale-hle-official | 44.32 |
| 05 | Muse Spark | Meta | May 2026 | scale-hle-official | 40.56 |
| 06 | Gemini 3 Pro (API) | Google | | | 38.30 |
| 07 | DeepSeek-V4-Pro Max | DeepSeek | Apr 2026 | pwc-dump · code | 37.70 |
| 08 | Gemini 3 Pro Preview | Google | May 2026 | scale-hle-official | 37.52 |
| 09 | GPT-5.4 (API) | OpenAI | May 2026 | scale-hle-official | 36.24 |
| 10 | Claude Opus 4.7 | Anthropic | May 2026 | scale-hle-official | 36.20 |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | Apr 2026 | pwc-dump · code | 34.80 |
| 12 | Claude Opus 4.6 | Anthropic | May 2026 | scale-hle-official | 34.44 |
| 13 | GPT-5 Pro | OpenAI | May 2026 | scale-hle-official | 31.64 |
| 14 | GLM-5.1 | | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 31 |
| 15 | DeepSeek-V3.2-Speciale (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 30.60 |
| 16 | GLM-5 (Open) | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 30.50 |
| 17 | Kimi-K2.5 (Open) | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 30.10 |
| 18 | Qwen3.5-397B-A17B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 28.70 |
| 19 | Step-3.5-Flash PaCoRe | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 27.90 |
| 20 | GPT-5.2 | OpenAI | May 2026 | scale-hle-official | 27.80 |
| 21 | Gemma 4 31B | Google | Apr 2026 | pwc-dump | 26.50 |
| 22 | GPT-5 | OpenAI | May 2026 | scale-hle-official | 25.32 |
| 23 | GPT-5 | OpenAI | | | 25.30 |
| 24 | Claude Opus 4.5 | Anthropic | May 2026 | scale-hle-official | 25.20 |
| 25 | DeepSeek-V3.2 (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 25.10 |
| 26 | GLM-4.7 (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 24.80 |
| 27 | Grok 4 (API) | xAI | | | 24.50 |
| 28 | Kimi K2.5 (Open) | Moonshot AI | May 2026 | scale-hle-official | 24.37 |
| 29 | Qwen3.6-27B | | Apr 2026 | pwc-dump · code | 24 |
| 30 | GPT-5.1 | OpenAI | May 2026 | scale-hle-official | 23.68 |
| 31 | Step-3.5-Flash | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 23.10 |
| 32 | Gemini 2.5 Pro | Google | May 2026 | scale-hle-official | 21.64 |
| 33 | Gemini 2.5 Pro | Google | | | 21.60 |
| 34 | Gemini 2.5 Pro | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 21.60 |
| 35 | Qwen3.6-35B-A3B | | Apr 2026 | pwc-dump · code | 21.40 |
| 36 | o3 | OpenAI | May 2026 | scale-hle-official | 20.32 |
| 37 | GPT-5 mini | OpenAI | May 2026 | scale-hle-official | 19.44 |
| 38 | GPT-5 mini | OpenAI | | | 19.40 |
| 39 | MiniMax-M2.5 (Open) | MiniMaxAI | Feb 2026 | pwc-dump · code | 19.40 |
| 40 | Claude Opus 4.6 | Anthropic | Apr 2026 | scale-ai-leaderboard | 19 |
| 41 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 18.26 |
| 42 | o4-mini | OpenAI | May 2026 | scale-hle-official | 18.08 |
| 43 | GLM-4.5 (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 14.40 |
| 44 | Claude Sonnet 4.5 | Anthropic | May 2026 | scale-hle-official | 13.72 |
| 45 | Claude 4.5 Sonnet | Anthropic | | | 13.70 |
| 46 | Claude Sonnet 4.6 (API) | Anthropic | Apr 2026 | pricepertoken-leaderboard | 13.20 |
| 47 | Gemini 2.5 Flash | Google | | | 12.10 |
| 48 | Gemini 2.5 Flash | Google | May 2026 | scale-hle-official | 12.08 |
| 49 | Claude Opus 4.1 | Anthropic | May 2026 | scale-hle-official | 11.52 |
| 50 | Gemini 2.5 Flash | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 11 |
| 51 | Claude Opus 4 | Anthropic | May 2026 | scale-hle-official | 10.72 |
| 52 | GLM-4.5-Air (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 10.60 |
| 53 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | | Dec 2025 | Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybr… · code | 10.60 |
| 54 | Gemini 3.1 Flash-Lite | Google | May 2026 | scale-hle-official | 8.64 |
| 55 | DeepSeek R1 (Open) | DeepSeek | | | 8.50 |
| 56 | GLM-4.5 (Open) | Zhipu AI | May 2026 | scale-hle-official | 8.32 |
| 57 | o1 Pro | OpenAI | May 2026 | scale-hle-official | 8.12 |
| 58 | GLM-4.5-Air (Open) | Zhipu AI | May 2026 | scale-hle-official | 8.12 |
| 59 | Claude 3.7 Sonnet | Anthropic | May 2026 | scale-hle-official | 8.04 |
| 60 | o1 (API) | OpenAI | | | 8.00 |
| 61 | o1 (API) | OpenAI | May 2026 | scale-hle-official | 7.96 |
| 62 | Claude Sonnet 4 | Anthropic | May 2026 | scale-hle-official | 7.76 |
| 63 | Gemini 2.0 Flash Thinking | Google | May 2026 | scale-hle-official | 6.56 |
| 64 | Llama 4 Maverick (Open) | Meta | May 2026 | scale-hle-official | 5.68 |
| 65 | GPT-4.5 Preview (API) | OpenAI | May 2026 | scale-hle-official | 5.44 |
| 66 | GPT-4.1 | OpenAI | May 2026 | scale-hle-official | 5.40 |
| 67 | GPT-4.1 mini (API) | OpenAI | Apr 2026 | pricepertoken-leaderboard | 4.60 |
| 68 | Gemini 1.5 Pro (API) | Google | May 2026 | scale-hle-official | 4.60 |
| 69 | Mistral-Medium-3 (Open) | Mistral | May 2026 | scale-hle-official | 4.52 |
| 70 | Nova Pro | Amazon | May 2026 | scale-hle-official | 4.40 |
| 71 | Claude 3.5 Sonnet | Anthropic | May 2026 | scale-hle-official | 4.08 |
| 72 | Nova Lite | Amazon | May 2026 | scale-hle-official | 3.64 |
| 73 | GPT-4o (API) | OpenAI | May 2026 | scale-hle-official | 2.72 |
| 74 | GPT-4o (API) | OpenAI | | | 2.70 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
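The ordering rule stated for this table (score first, ties broken by submission date) can be sketched as a two-part sort key. This is a minimal illustration, not Codesota's implementation: the tuple layout, the `rank_rows` name, the use of the first of the month for month-only dates, and the choice to place undated rows ahead of dated ones in a tie are all assumptions.

```python
from datetime import date

# Hypothetical row shape: (model, accuracy, submitted). The table gives only
# month and year, so the day is set to 1 here; undated rows carry None.
rows = [
    ("GPT-5 mini", 19.40, None),
    ("MiniMax-M2.5", 19.40, date(2026, 2, 1)),
    ("o3", 20.32, date(2026, 5, 1)),
    ("Claude Opus 4.6", 19.0, date(2026, 4, 1)),
]

def rank_rows(rows):
    """Sort by accuracy (descending), then by submission date (earliest first).

    Assumption: rows without a date sort ahead of dated rows in a tie.
    """
    def key(row):
        _model, accuracy, submitted = row
        return (-accuracy, submitted or date.min)
    return sorted(rows, key=key)

ranked = rank_rows(rows)
for position, (model, accuracy, _submitted) in enumerate(ranked, start=1):
    print(f"{position:02d} {model} {accuracy:.2f}")
```

Negating the score keeps the whole key ascending, so a single `sorted` call handles both criteria.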
§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Jul 7, 2025 · Gemini 2.5 Pro · 21.60
  2. Aug 8, 2025 · GLM-4.7 · Zhipu AI · 24.80
  3. Dec 2, 2025 · DeepSeek-V3.2-Speciale · DeepSeek · 30.60
  4. Feb 17, 2026 · GLM-5.1 · 31
  5. Apr 20, 2026 · Kimi K2.6 · 54
Fig 3 · SOTA-setting models only. 5 entries span Jul 2025 to Apr 2026.
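A record-setting line like this one is a running maximum over dated entries: an entry is kept only if its score strictly exceeds every earlier score. The sketch below is one way to compute it, not the site's own code. The five record rows use the dates and scores listed above; the four non-record rows use leaderboard scores with an assumed day of the month, and the `sota_steps` name is hypothetical.

```python
from datetime import date

# (date, model, accuracy). Days on the non-record rows are assumed.
entries = [
    (date(2025, 7, 7), "Gemini 2.5 Flash", 11.0),
    (date(2025, 7, 7), "Gemini 2.5 Pro", 21.60),
    (date(2025, 8, 8), "GLM-4.5", 14.40),
    (date(2025, 8, 8), "GLM-4.7", 24.80),
    (date(2025, 12, 2), "DeepSeek-V3.2", 25.10),
    (date(2025, 12, 2), "DeepSeek-V3.2-Speciale", 30.60),
    (date(2026, 2, 17), "GLM-5", 30.50),
    (date(2026, 2, 17), "GLM-5.1", 31.0),
    (date(2026, 4, 20), "Kimi K2.6", 54.0),
]

def sota_steps(entries):
    """Return the entries that strictly beat every earlier score.

    Same-day entries are visited highest score first, so a single date
    contributes at most one step.
    """
    steps, best = [], float("-inf")
    for when, model, score in sorted(entries, key=lambda e: (e[0], -e[2])):
        if score > best:
            steps.append((when, model, score))
            best = score
    return steps

steps = sota_steps(entries)
for when, model, score in steps:
    print(when.isoformat(), model, score)
```

Using a strict comparison means a later entry that only ties the record does not create a new step.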
§ 04 · Literature

8 papers tied to this benchmark.

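Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

  • GLM-5: from Vibe Coding to Agentic Engineering
  • DeepSeek-V3.2: Pushing the Frontier of Open Large Langua…
  • Kimi K2.5: Visual Agentic Intelligence
  • Step 3.5 Flash: Open Frontier-Level Intelligence with 11…
  • GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation…
  • Gemini 2.5: Pushing the Frontier with Advanced Reasoning…
  • NVIDIA Nemotron 3: Efficient and Open Intelligence
  • Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybr…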

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
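Before submitting, the checklist above can be turned into a quick self-check. The sketch below is purely illustrative: the manifest layout, every key name, and the `missing_items` helper are hypothetical and are not taken from the submission guide. The only detail drawn from this page is that the dataset declares a single metric, accuracy.

```python
# Hypothetical manifest keys, one per checklist item (the metric row is
# checked separately). The real submission format may differ.
REQUIRED_KEYS = (
    "checkpoint_or_endpoint",  # 01: public checkpoint or API endpoint
    "repro_script",            # 02: reproduction script
    "commit",                  # 02: frozen commit
    "seed",                    # 02: seed
    "environment",             # 03: Python version and dependencies
    "contact",                 # 05: contact for follow-up
)
DECLARED_METRICS = ("accuracy",)  # 04: one row per declared metric

def missing_items(manifest):
    """Return the checklist entries that are absent or empty in the manifest."""
    missing = [k for k in REQUIRED_KEYS if manifest.get(k) in (None, "", {}, [])]
    metrics = manifest.get("metrics") or {}
    missing += [f"metrics.{m}" for m in DECLARED_METRICS if m not in metrics]
    return missing

example = {
    "checkpoint_or_endpoint": "https://example.org/model",
    "repro_script": "run_eval.py",
    "commit": "0123abc",
    "seed": 0,
    "environment": {"python": "3.11", "deps": "requirements.txt"},
    "metrics": {"accuracy": 12.3},
    "contact": "author@example.org",
}
print(missing_items(example))
```

A seed of 0 counts as present, since only `None` and empty containers or strings are treated as missing.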