Codesota · Large Language Models
The frontier leaderboard, dated & sourced
Snapshot page updated: 2026-05-26
Tracked registry · 7 benchmarks · 130 models · per-benchmark timestamps below

Large language models,
measured honestly.

Frontier model performance across knowledge, reasoning, math, code, and sustained tool-use — every score dated, every source linked, every benchmark described in its own words. No vendor spin, no collapsed averages.

Shaded rows mark leaders within this tracked snapshot. Descriptions in serif; scores in tabular mono; navigation in sans.

§ 01 · Dashboard

Tracked shortlist, not live SOTA.

One row per current frontier model, one column per benchmark. Cells show the published score; a dash means we have no verified result for that pair. GPT-4o and other older systems remain in benchmark histories, but not in this frontier shortlist.


Frontier rows: 35 · Benchmarks: 7 · Results: 304 · Page updated: May 26, 2026
Tracked shortlist · dated evidence
Copper cells mark benchmark leaders
| # | Model | Provider | Humanity's Last Exam (HLE) | GPQA Diamond | AIME 2025 | LiveCodeBench Pro | LiveCodeBench (classic) | Tau2-Bench | MMLU (legacy saturated sanity check) | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro | Google | 46.44% ✓ | — | — | 2887 ✓ | — | — | — | 2/7 |
| 02 | Gemini 3 Flash | Google | — | 90.4% ? | — | — | 90.8% ✓ | — | 89.6% ? | 3/7 |
| 03 | GPT-5.4 Pro | OpenAI | 44.32% ✓ | — | — | — | — | — | — | 1/7 |
| 04 | DeepSeek V3.5 | DeepSeek | — | — | — | — | — | — | 88.2% ? | 1/7 |
| 05 | Mistral Large 3 | Mistral | — | — | — | — | — | — | 87.1% ? | 1/7 |
| 06 | MiniMax M2.5 | MiniMax | — | — | — | — | — | — | 86.5% ? | 1/7 |
| 07 | Kimi K2.5 | Moonshot AI | — | — | — | — | — | — | 86% ? | 1/7 |
| 08 | Claude Opus 4.6 | Anthropic | 34.44% ✓ | 91.3% ? | — | — | — | — | 91.2% ? | 3/7 |
| 09 | Gemini 3 Pro Preview | Google | 37.52% ✓ | — | — | — | 91.7% ✓ | — | — | 2/7 |
| 10 | Gemini 3 Pro | Google | 38.3% ? | 91.9% ? | — | 2439 ✓ | — | 69% ✓ | 91.4% ? | 5/7 |
| 11 | Kimi-K2.5 | Moonshot.AI | 24.37% ✓ | 87.6% ? | 96.1% ? | — | 85% ? | — | — | 4/7 |
| 12 | Qwen3.5-397B-A17B | Alibaba | 28.7% ? | 88.4% ? | — | — | 83.6% ? | 86.7% ? | — | 4/7 |
| 13 | GPT-5 | OpenAI | 25.32% ✓ | 89% ? | — | 2176 ✓ | 85% ? | — | 90.8% ? | 5/7 |
| 14 | Grok 4 | xAI | 24.5% ? | 88% ? | — | — | 79% ? | — | 86.6% ? | 4/7 |
| 15 | Claude Opus 4.5 | Anthropic | 25.2% ✓ | 74.9% ✓ | 80% ✓ | — | — | 79% ✓ | 91.6% ✓ | 5/7 |
| 16 | GPT-5.2 | OpenAI | 27.8% ✓ | — | — | — | — | 73% ✓ | 92.4% ? | 3/7 |
| 17 | DeepSeek-R1-0528 | DeepSeek | — | — | — | — | 73.3% ✓ | — | — | 1/7 |
| 18 | GPT-5.4 | OpenAI | 36.24% ✓ | — | — | — | — | — | — | 1/7 |
| 19 | Claude Opus 4.7 | Anthropic | 36.2% ✓ | — | — | — | — | — | — | 1/7 |
| 20 | o4-mini | OpenAI | 18.08% ✓ | 77.6% ? | 92.7% ✓ | 2092 ✓ | 72.8% ✓ | — | 90% ✓ | 6/7 |
| 21 | Qwen3-235B-A22B | Alibaba | — | 71.1% ? | 81.5% ? | 1673 ✓ | 70.7% ✓ | — | 87.81% ? | 5/7 |
| 22 | Gemini 2.5 Pro | Google | 21.64% ✓ | 84% ✓ | 86.7% ✓ | 1769 ✓ | 75.6% ? | 54% ? | 89.8% ✓ | 7/7 |
| 23 | GPT-5 Pro | OpenAI | 31.64% ✓ | — | — | — | — | — | — | 1/7 |
| 24 | o3 | OpenAI | 20.32% ✓ | 82.8% ? | 86.7% ✓ | 1010 ✓ | 65.3% ✓ | — | 92.9% ✓ | 6/7 |
| 25 | Claude Opus 4 | Anthropic | 10.72% ✓ | 76.7% ✓ | — | — | 57.8% ✓ | — | 88.8% ? | 4/7 |
| 26 | Claude Sonnet 4.6 | Anthropic | 13.2% ? | 89.9% ? | — | — | — | — | — | 2/7 |
| 27 | Claude Sonnet 4 | Anthropic | 7.76% ✓ | 70% ✓ | — | — | 52.8% ✓ | — | 90.1% ? | 4/7 |
| 28 | DeepSeek R1 | DeepSeek | 8.5% ? | 71.5% ✓ | 72% ✓ | 1161 ✓ | 65.9% ✓ | — | 90.8% ✓ | 6/7 |
| 29 | Llama 4 Maverick | Meta | 5.68% ✓ | 69.8% ✓ | — | — | 43.4% ✓ | — | 89.4% ✓ | 4/7 |
| 30 | GPT-5.1 | OpenAI | 23.68% ✓ | — | — | — | — | 59% ? | — | 2/7 |
| 31 | Claude Sonnet 4.5 | Anthropic | 13.72% ✓ | — | — | 1412 ✓ | — | 63% ✓ | 90.4% ? | 4/7 |
| 32 | Gemini 2.5 Flash | Google | 12.08% ✓ | 82.8% ? | — | 1288 ✓ | 63.9% ? | — | — | 4/7 |
| 33 | GPT-5 mini | OpenAI | 19.44% ✓ | — | — | — | — | — | — | 1/7 |
| 34 | Claude Opus 4.1 | Anthropic | 11.52% ✓ | — | — | — | — | — | — | 1/7 |
| 35 | Gemini 3.1 Flash-Lite | Google | 8.64% ✓ | — | — | — | — | — | — | 1/7 |
Fig 2 · This is a tracked shortlist, not a live-SOTA claim or raw coverage ranking. GPT-4o, GPT-4 Turbo, Claude 3 Opus, Gemini 1.5 Pro and similar historical rows stay in the individual benchmark tables below for lineage, but are intentionally excluded here. Ingestion debt: at least one benchmark timestamp is older than the page update date.
§ 02 · Coverage

7 benchmarks. Five axes.

Knowledge, math and reasoning, code, tools, and frontier difficulty, plus one legacy sanity check. Each tile names the benchmark, the top tracked score, and the leading model.

| Benchmark | Top score | Models | Leader | Axis | Verified |
|---|---|---|---|---|---|
| Humanity's Last Exam (HLE) | 54% | 64 | Kimi K2.6 | Frontier Difficulty | 2026-05-26 |
| GPQA Diamond | 91.9% | 73 | Gemini 3 Pro | Knowledge | 2026-05-06 |
| AIME 2025 | 99.9% | 22 | Step-3.5-Flash PaCoRe | Math & Reasoning | 2026-05-13 |
| LiveCodeBench Pro | 2887 Elo | 10 | Gemini 3.1 Pro | Coding | 2026-05-26 |
| LiveCodeBench (classic) | 93.5% | 53 | DeepSeek-V4-Pro Max | Coding | 2026-04-24 |
| Tau2-Bench | 89.7% | 19 | GLM-5 | Agentic & Tools | 2026-05-12 |
| MMLU (legacy saturated sanity check) | 92.9% | 63 | o3 | Legacy Sanity Check | 2026-05-18 |
Fig 3 · Tracked leader per benchmark with count of recorded submissions and latest benchmark timestamp. Click any tile to scroll to its full leaderboard below.
§ 03 · Caveat

Read the numbers with care.

Benchmarks are instruments. Like all instruments, they can be miscalibrated, miscounted, or gamed.

A 2026 Berkeley RDI study found that eight major agent benchmarks — including SWE-bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA, and FieldWorkArena — could be exploited to near-perfect scores without solving any tasks.

Failure modes included leaked reference answers, unsanitized eval(), prompt-injectable LLM judges, and scoring functions that skip correctness checks entirely. A 10-line conftest.py was enough to make every SWE-bench test report as passing.

Treat leaderboard position as a signal, not proof of capability, especially on agentic benchmarks where the evaluation environment is itself part of the attack surface. Evals designed to resist contamination, like HLE and LiveCodeBench Pro, hold up better but are not immune.

Read the full Berkeley RDI analysis →
§ 04 · Frontier

Frontier difficulty.

Designed to remain unsaturated for years. Even the leaders score low.

Humanity's Last Exam (HLE)

3,000 extremely hard questions across math, science, law, and humanities, contributed by domain experts worldwide. No tools allowed in this variant.

| # | Model | Provider | Accuracy |
|---|---|---|---|
| 01 | Kimi K2.6 | Unknown | 54% |
| 02 | MiMo-V2.5-Pro | Unknown | 48% |
| 03 | Gemini 3.1 Pro | Google | 46.44% |
| 04 | GPT-5.4 Pro | OpenAI | 44.32% |
| 05 | Muse Spark | Meta | 40.56% |
| 06 | Gemini 3 Pro | Google | 38.3% |
| 07 | DeepSeek-V4-Pro Max | DeepSeek | 37.7% |
| 08 | Gemini 3 Pro Preview | Google | 37.52% |
| 09 | GPT-5.4 | OpenAI | 36.24% |
| 10 | Claude Opus 4.7 | Anthropic | 36.2% |
| 11 | DeepSeek-V4-Flash Max | DeepSeek | 34.8% |
| 12 | Claude Opus 4.6 | Anthropic | 34.44% |
| 13 | GPT-5 Pro | OpenAI | 31.64% |
| 14 | GLM-5.1 | Unknown | 31% |
| 15 | DeepSeek-V3.2-Speciale | DeepSeek | 30.6% |
| 16 | GLM-5 | Zhipu AI | 30.5% |
| 17 | Qwen3.5-397B-A17B | Alibaba | 28.7% |
| 18 | Step-3.5-Flash PaCoRe | Unknown | 27.9% |
| 19 | GPT-5.2 | OpenAI | 27.8% |
| 20 | Gemma 4 31B | Google | 26.5% |
| 21 | GPT-5 | OpenAI | 25.32% |
| 22 | Claude Opus 4.5 | Anthropic | 25.2% |
| 23 | DeepSeek-V3.2 | DeepSeek | 25.1% |
| 24 | GLM-4.7 | Zhipu AI | 24.8% |
| 25 | Grok 4 | xAI | 24.5% |
| 26 | Kimi K2.5 | Moonshot AI | 24.37% |
| 27 | Qwen3.6-27B | Unknown | 24% |
| 28 | GPT-5.1 | OpenAI | 23.68% |
| 29 | Step-3.5-Flash | Unknown | 23.1% |
| 30 | Gemini 2.5 Pro | Google | 21.64% |
| 31 | Gemini 2.5 Pro | Unknown | 21.6% |
| 32 | Qwen3.6-35B-A3B | Unknown | 21.4% |
| 33 | o3 | OpenAI | 20.32% |
| 34 | GPT-5 mini | OpenAI | 19.44% |
| 35 | MiniMax-M2.5 | MiniMaxAI | 19.4% |
| 36 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | Unknown | 18.26% |
| 37 | o4-mini | OpenAI | 18.08% |
| 38 | Claude Sonnet 4.5 | Anthropic | 13.72% |
| 39 | Claude Sonnet 4.6 | Anthropic | 13.2% |
| 40 | Gemini 2.5 Flash | Google | 12.08% |
| 41 | Claude Opus 4.1 | Anthropic | 11.52% |
| 42 | Gemini 2.5 Flash | Unknown | 11% |
| 43 | Claude Opus 4 | Anthropic | 10.72% |
| 44 | GLM-4.5-Air | Zhipu AI | 10.6% |
| 45 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | Unknown | 10.6% |
| 46 | Gemini 3.1 Flash-Lite | Google | 8.64% |
| 47 | DeepSeek R1 | DeepSeek | 8.5% |
| 48 | GLM-4.5 | Zhipu AI | 8.32% |
| 49 | GLM-4.5-Air | Zhipu AI | 8.12% |
| 50 | o1 Pro | OpenAI | 8.12% |
| 51 | Claude 3.7 Sonnet | Anthropic | 8.04% |
| 52 | o1 | OpenAI | 7.96% |
| 53 | Claude Sonnet 4 | Anthropic | 7.76% |
| 54 | Gemini 2.0 Flash Thinking | Google | 6.56% |
| 55 | Llama 4 Maverick | Meta | 5.68% |
| 56 | GPT-4.5 Preview | OpenAI | 5.44% |
| 57 | GPT-4.1 | OpenAI | 5.4% |
| 58 | Gemini 1.5 Pro | Google | 4.6% |
| 59 | GPT-4.1 mini | OpenAI | 4.6% |
| 60 | Mistral-Medium-3 | Mistral | 4.52% |
| 61 | Nova Pro | Amazon | 4.4% |
| 62 | Claude 3.5 Sonnet | Anthropic | 4.08% |
| 63 | Nova Lite | Amazon | 3.64% |
| 64 | GPT-4o | OpenAI | 2.72% |
Last verified: 2026-05-26 · Source: Scale Labs HLE leaderboard · Official CAIS/Scale leaderboard snapshot. Models are evaluated on all public HLE questions at temperature 0.0 when configurable, unless stated otherwise.
§ 05 · Knowledge

Knowledge.

Breadth across 57+ subjects, graduate-level and multiple-choice.

GPQA Diamond

198 expert-authored graduate-level questions in biology, chemistry, and physics. PhD-level specialists score ~65% in their own field. Designed to be impossible to Google.

| # | Model | Provider | Accuracy |
|---|---|---|---|
| 01 | Gemini 3 Pro | Google | 91.9% |
| 02 | Claude Opus 4.6 | Anthropic | 91.3% |
| 03 | Kimi K2.6 | Unknown | 90.5% |
| 04 | Gemini 3 Flash | Google | 90.4% |
| 05 | DeepSeek-V4-Pro Max | DeepSeek | 90.1% |
| 06 | Claude Sonnet 4.6 | Anthropic | 89.9% |
| 07 | GPT-5 | OpenAI | 89% |
| 08 | Qwen3.5-397B-A17B | Alibaba | 88.4% |
| 09 | DeepSeek-V4-Flash Max | DeepSeek | 88.1% |
| 10 | Grok 4 | xAI | 88% |
| 11 | Qwen3.6-27B | Unknown | 87.8% |
| 12 | Kimi-K2.5 | Moonshot.AI | 87.6% |
| 13 | Qwen3.5-122B-A10B | Alibaba | 86.6% |
| 14 | Gemini 2.5 Pro | Unknown | 86.4% |
| 15 | GLM-5.1 | Unknown | 86.2% |
| 16 | GLM-5 | Zhipu AI | 86% |
| 17 | Qwen3.6-35B-A3B | Unknown | 86% |
| 18 | DeepSeek-V3.2-Speciale | DeepSeek | 85.7% |
| 19 | GLM-4.7 | Zhipu AI | 85.7% |
| 20 | Qwen3.5-27B | Alibaba | 85.5% |
| 21 | MiniMax-M2.5 | MiniMaxAI | 85.2% |
| 22 | Step-3.5-Flash PaCoRe | Unknown | 85% |
| 23 | Gemma 4 31B | Google | 84.3% |
| 24 | Qwen3.5-35B-A3B | Alibaba | 84.2% |
| 25 | Gemini 2.5 Pro | Google | 84% |
| 26 | Qwen3.5-Omni-Plus | Unknown | 83.9% |
| 27 | Step-3.5-Flash | Unknown | 83.5% |
| 28 | Gemini 2.5 Flash | Google | 82.8% |
| 29 | Gemini 2.5 Flash | Unknown | 82.8% |
| 30 | o3 | OpenAI | 82.8% |
| 31 | DeepSeek-V3.2 | DeepSeek | 82.4% |
| 32 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | Unknown | 79.23% |
| 33 | GLM-4.5 | Zhipu AI | 79.1% |
| 34 | o4-mini | OpenAI | 77.6% |
| 35 | Qwen3-VL-235B-A22B-Thinking | Qwen | 77.1% |
| 36 | Claude Opus 4 | Anthropic | 76.7% |
| 37 | o1 | OpenAI | 75.7% |
| 38 | GLM-4.5-Air | Zhipu AI | 75% |
| 39 | o3-mini | OpenAI | 74.9% |
| 40 | Claude Opus 4.5 | Anthropic | 74.9% |
| 41 | Qwen3-Coder-Next | Qwen | 74.49% |
| 42 | Qwen3-VL-235B-A22B-Instruct | Qwen | 74.3% |
| 43 | o1-preview | OpenAI | 73.3% |
| 44 | Qwen3-Omni-Flash-Thinking | Unknown | 73.1% |
| 45 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | Unknown | 73% |
| 46 | DeepSeek R1 | DeepSeek | 71.5% |
| 47 | Qwen3-235B-A22B | Alibaba | 71.1% |
| 48 | ZAYA1-8B | Z.ai | 71% |
| 49 | Claude Sonnet 4 | Anthropic | 70% |
| 50 | Llama 4 Maverick | Meta | 69.8% |
| 51 | GPT-4.5 Preview | OpenAI | 69.5% |
| 52 | MiMo-V2.5-Pro | Unknown | 66.7% |
| 53 | GPT-4.1 mini | OpenAI | 66.4% |
| 54 | GPT-4.1 | OpenAI | 66.3% |
| 55 | Trinity Large Preview | Arcee AI | 63.32% |
| 56 | o1-mini | OpenAI | 60% |
| 57 | Claude 3.5 Sonnet | Anthropic | 59.4% |
| 58 | Grok 2 | xAI | 56% |
| 59 | MiniMax-Text-01 | MiniMax | 54.4% |
| 60 | Llama 3 (405B, Instruct) | Meta | 51.1% |
| 61 | Llama 3.1 405B | Meta | 50.7% |
| 62 | Claude 3 Opus | Anthropic | 50.4% |
| 63 | GPT-4o | OpenAI | 49.9% |
| 64 | Qwen2.5-Plus | Unknown | 49.7% |
| 65 | GPT-4 Turbo | OpenAI | 49.3% |
| 66 | Qwen2.5-72B-Instruct | Alibaba | 49% |
| 67 | Qwen2.5-VL-72B | Unknown | 49% |
| 68 | Gemini 1.5 Pro | Google | 46.2% |
| 69 | Gemma 3 (27B, IT) | Unknown | 42.4% |
| 70 | Llama 3.1 70B | Meta | 41.7% |
| 71 | Step-3.5-Flash Base | Unknown | 41.7% |
| 72 | GPT-4o mini | OpenAI | 40.2% |
| 73 | Qwen3-VL-8B-Instruct | Qwen | 34.7% |
Last verified: 2026-05-06 · Source: arXiv:2311.12022 · Skilled non-expert baseline: 34%. PhD specialist: ~65%.
§ 06 · Math / Reasoning

Math & reasoning.

Olympiad-style short answer, released after model training cutoffs.

AIME 2025

The 2025 American Invitational Mathematics Examination: 30 olympiad-style short-answer problems released after most 2024-era model training cutoffs. A primary frontier-math signal in recent reasoning-model reports.

| # | Model | Provider | Accuracy |
|---|---|---|---|
| 01 | Step-3.5-Flash PaCoRe | Unknown | 99.9% |
| 02 | Step-3.5-Flash | Unknown | 97.3% |
| 03 | Kimi-K2.5 | Moonshot.AI | 96.1% |
| 04 | DeepSeek-V3.2-Speciale | DeepSeek | 96% |
| 05 | SU-01 | Unknown | 94.6% |
| 06 | Intern-S1-Pro | Shanghai AI Lab | 93.1% |
| 07 | DeepSeek-V3.2 | DeepSeek | 93.1% |
| 08 | o4-mini | OpenAI | 92.7% |
| 09 | Qwen3-VL-235B-A22B-Thinking | Qwen | 89.7% |
| 10 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | Unknown | 89.1% |
| 11 | Gemini 2.5 Pro | Unknown | 88% |
| 12 | o3 | OpenAI | 86.7% |
| 13 | Gemini 2.5 Pro | Google | 86.7% |
| 14 | Qwen3-Coder-Next | Qwen | 83.07% |
| 15 | Qwen3-235B-A22B | Alibaba | 81.5% |
| 16 | Claude Opus 4.5 | Anthropic | 80% |
| 17 | Qwen3-VL-235B-A22B-Instruct | Qwen | 74.7% |
| 18 | Qwen3-Omni-Flash-Thinking | Unknown | 74% |
| 19 | DeepSeek R1 | DeepSeek | 72% |
| 20 | Gemini 2.5 Flash | Unknown | 72% |
| 21 | Qwen3-VL-8B-Instruct | Qwen | 45.9% |
| 22 | Trinity Large Preview | Arcee AI | 24.36% |
Last verified: 2026-05-13 · Source: maa.org/aime · Small test set (30 problems): one problem moves a single-run score by ~3.3 points. Numbers above are pass@1 unless otherwise noted.
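
The 30-problem granularity noted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names are hypothetical and not part of any registry tooling. It lists which reported scores sit on the k/30 grid that a single run can produce. Scores off that grid (for example 96.1%) are consistent with averaging over several runs, though the page does not say how each source computed its number.

```python
PROBLEMS = 30  # AIME I + AIME II, 15 problems each


def accuracy(solved: int, problems: int = PROBLEMS) -> float:
    """Percentage score for a single run with `solved` correct answers."""
    return 100.0 * solved / problems


def is_single_run_score(score: float, problems: int = PROBLEMS, tol: float = 0.05) -> bool:
    """True if `score` matches k/problems for some whole k, allowing for rounding."""
    return any(abs(score - accuracy(k, problems)) <= tol for k in range(problems + 1))


# One problem is worth about 3.33 percentage points.
step = accuracy(1)

# 86.7% (26/30) and 80% (24/30) are on the grid; 96.1% is not.
for reported in (86.7, 80.0, 96.1):
    print(reported, is_single_run_score(reported))
```

A gap of one or two grid steps between two models therefore corresponds to one or two problems, which is worth keeping in mind when reading adjacent rows.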
§ 07 · Coding

Code.

Contest-style programming. Elo-rated or pass@1 on held-out problems.

LiveCodeBench Pro

The 2026 Elo-rated successor to classic LiveCodeBench (LCB). Built by Olympiad medalists from continuously updated Codeforces, ICPC, and IOI problems. Each LLM is treated as a virtual Codeforces contestant and fit to a Bayesian maximum a posteriori (MAP) Elo rating on the standard Codeforces scale (~800 novice to ~3800 top human).

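| # | Model | Provider | Elo |
|---|---|---|---|
| 01 | Gemini 3.1 Pro | Google | 2887 |
| 02 | Gemini 3 Pro | Google | 2439 |
| 03 | GPT-5 | OpenAI | 2176 |
| 04 | o4-mini | OpenAI | 2092 |
| 05 | Gemini 2.5 Pro | Google | 1769 |
| 06 | Qwen3-235B-A22B | Alibaba | 1673 |
| 07 | Claude Sonnet 4.5 | Anthropic | 1412 |
| 08 | Gemini 2.5 Flash | Google | 1288 |
| 09 | DeepSeek R1 | DeepSeek | 1161 |
| 10 | o3 | OpenAI | 1010 |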
Last verified: 2026-05-26 · Source: livecodebenchpro.com · Elo ratings are comparable to the Codeforces human scale. Top human contestants sit around 3800; the current tracked leader is Gemini 3.1 Pro at 2887, ahead of the previous best, Gemini 3 Pro at 2439.
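
To make the "Bayesian MAP Elo" idea concrete, here is a rough sketch of how such a fit could work. It is an illustration under stated assumptions, not the benchmark's actual code. The logistic solve-probability curve is the standard Elo form. The Gaussian prior (mean 1500, standard deviation 350), the integer grid search, and the sample outcomes are all hypothetical choices made for this example.

```python
import math


def solve_probability(rating: float, difficulty: float) -> float:
    """Standard Elo logistic: chance a contestant at `rating` solves a problem rated `difficulty`."""
    return 1.0 / (1.0 + 10 ** ((difficulty - rating) / 400.0))


def map_rating(outcomes, prior_mean: float = 1500.0, prior_sd: float = 350.0) -> int:
    """Return the integer rating maximizing log-prior + log-likelihood.

    `outcomes` is a list of (problem_difficulty, solved) pairs.
    The Gaussian prior parameters are assumptions for illustration.
    """
    def log_posterior(rating: int) -> float:
        total = -0.5 * ((rating - prior_mean) / prior_sd) ** 2
        for difficulty, solved in outcomes:
            p = solve_probability(rating, difficulty)
            total += math.log(p if solved else 1.0 - p)
        return total

    # Brute-force search over a generous integer range of ratings.
    return max(range(0, 4001), key=log_posterior)


# Hypothetical record: solved the two easier problems, failed the two harder ones.
sample_outcomes = [(1200, True), (1600, True), (2000, False), (2400, False)]
estimate = map_rating(sample_outcomes)
print(estimate)
```

With these sample outcomes the likelihood alone peaks midway between the solved and unsolved problems (1800), and the prior pulls the estimate down toward 1500. That pull is the practical effect of the "Bayesian" part: with few problems, the rating leans on the prior.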

LiveCodeBench (classic)

Classic pass@1 LiveCodeBench, continuously updated with new contest problems from LeetCode, Codeforces, and AtCoder. Largely superseded by LCB Pro for frontier models, but preserved here for historical comparison across older models.

| # | Model | Provider | pass@1 |
|---|---|---|---|
| 01 | DeepSeek-V4-Pro Max | DeepSeek | 93.5% |
| 02 | Gemini 3 Pro Preview | Google | 91.7% |
| 03 | DeepSeek-V4-Flash Max | DeepSeek | 91.6% |
| 04 | Gemini 3 Flash | Google | 90.8% |
| 05 | Kimi K2.6 | Unknown | 89.6% |
| 06 | DeepSeek-V3.2-Speciale | DeepSeek | 88.7% |
| 07 | Kimi-K2.5 | Moonshot.AI | 85% |
| 08 | GPT-5 | OpenAI | 85% |
| 09 | Qwen3.6-27B | Unknown | 83.9% |
| 10 | Qwen3.5-397B-A17B | Alibaba | 83.6% |
| 11 | DeepSeek-V3.2 | DeepSeek | 83.3% |
| 12 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | Unknown | 81.19% |
| 13 | Qwen3.6-35B-A3B | Unknown | 80.4% |
| 14 | Gemma 4 31B | Google | 80% |
| 15 | Grok 4 | xAI | 79% |
| 16 | Gemini 2.5 Pro | Google | 75.6% |
| 17 | Intern-S1-Pro | Shanghai AI Lab | 74.3% |
| 18 | Gemini 2.5 Pro | Unknown | 74.2% |
| 19 | DeepSeek-R1-0528 | DeepSeek | 73.3% |
| 20 | GLM-4.5 | Zhipu AI | 72.9% |
| 21 | o4-mini | OpenAI | 72.8% |
| 22 | Qwen3-235B-A22B | Alibaba | 70.7% |
| 23 | GLM-4.5-Air | Zhipu AI | 70.7% |
| 24 | Qwen3-VL-235B-A22B-Thinking | Qwen | 70.1% |
| 25 | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | Unknown | 68.3% |
| 26 | o3-mini | OpenAI | 66.9% |
| 27 | DeepSeek R1 | DeepSeek | 65.9% |
| 28 | o3 | OpenAI | 65.3% |
| 29 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | 65.2% |
| 30 | Gemini 2.5 Flash | Google | 63.9% |
| 31 | Kimi k1.5 | Moonshot AI | 62.5% |
| 32 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 62.1% |
| 33 | Gemini 2.5 Flash | Unknown | 59.3% |
| 34 | Qwen3-Coder-Next | Qwen | 58.93% |
| 35 | Claude Opus 4 | Anthropic | 57.8% |
| 36 | Qwen2.5-72B-Instruct | Unknown | 55.5% |
| 37 | GPT-4.1 | OpenAI | 54.4% |
| 38 | Qwen3-VL-235B-A22B-Instruct | Qwen | 54.3% |
| 39 | Claude Sonnet 4 | Anthropic | 52.8% |
| 40 | DeepSeek-v3-0324 | DeepSeek | 49.2% |
| 41 | DeepSeek-V3 | DeepSeek | 49.2% |
| 42 | GPT-4.1 mini | OpenAI | 48.3% |
| 43 | Qwen2.5-Coder 32B | Alibaba | 47.8% |
| 44 | Llama 4 Maverick | Meta | 43.4% |
| 45 | DeepSeek-Coder-V2-Instruct | DeepSeek | 43.4% |
| 46 | GPT-4o | OpenAI | 40.8% |
| 47 | Qwen3-VL-8B-Instruct | Qwen | 39.3% |
| 48 | Gemma-3-27b | Google | 39% |
| 49 | Llama-4-Scout | Meta | 32.8% |
| 50 | Gemma 3 12B IT | Google DeepMind | 32% |
| 51 | Gemma 3 (27B, IT) | Unknown | 29.7% |
| 52 | Codestral 22B | Mistral | 29.5% |
| 53 | Gemma 3 4B IT | Google DeepMind | 23% |
Last verified: 2026-04-24 · Source: livecodebench.github.io · Problems released after model training cutoffs to prevent contamination.
§ 08 · Agentic / Tools

Agentic & tools.

Multi-turn tasks using real tools and databases; pass = full resolution.

Tau2-Bench

Simulates real customer service interactions: agents use tools and databases to resolve tasks in retail and airline domains across multi-turn dialogues. Pass rate = share of tasks fully resolved.

| # | Model | Provider | Accuracy |
|---|---|---|---|
| 01 | GLM-5 | Zhipu AI | 89.7% |
| 02 | Step-3.5-Flash | Unknown | 88.2% |
| 03 | Qwen3.5-397B-A17B | Alibaba | 86.7% |
| 04 | Qwen3.5-35B-A3B | Alibaba | 81.2% |
| 05 | Intern-S1-Pro | Shanghai AI Lab | 80.9% |
| 06 | DeepSeek-V3.2 | DeepSeek | 80.3% |
| 07 | Qwen3.5-122B-A10B | Alibaba | 79.5% |
| 08 | Qwen3.5-27B | Alibaba | 79% |
| 09 | Claude Opus 4.5 | Anthropic | 79% |
| 10 | Ling-2.6-1T | Unknown | 78.36% |
| 11 | SenseNova-U1-A3B-MoT | SenseTime | 75.39% |
| 12 | GPT-5.2 | OpenAI | 73% |
| 13 | Gemini 3 Pro | Google | 69% |
| 14 | Claude Sonnet 4.5 | Anthropic | 63% |
| 15 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | Unknown | 61.15% |
| 16 | GPT-5.1 | OpenAI | 59% |
| 17 | Gemini 2.5 Pro | Google | 54% |
| 18 | Claude 3.7 Sonnet | Anthropic | 47% |
| 19 | GPT-4o | OpenAI | 36% |
Last verified: 2026-05-12 · Source: sierra-research/tau2-bench · Average across 3 seeds per model.
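
The scoring rule described here (a task counts only if fully resolved, and the reported figure is the mean over 3 seeds) can be sketched as below. This is a minimal illustration, not the tau2-bench harness: the function names and the per-seed results are hypothetical.

```python
from statistics import mean


def pass_rate(task_results: list[bool]) -> float:
    """Percentage of tasks fully resolved in one run; partial progress counts as a fail."""
    return 100.0 * sum(task_results) / len(task_results)


def mean_pass_rate(runs_by_seed: dict[int, list[bool]]) -> float:
    """Average the per-run pass rate across seeds."""
    return mean(pass_rate(results) for results in runs_by_seed.values())


# Hypothetical outcomes for 4 tasks under 3 seeds (True = task fully resolved).
runs = {
    0: [True, True, False, True],   # 75%
    1: [True, False, False, True],  # 50%
    2: [True, True, True, True],    # 100%
}
score = mean_pass_rate(runs)
print(score)
```

The spread between seeds in this toy example (50% to 100%) shows why a single-seed agentic score can be misleading, and why averaging is reported.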
§ 09 · Legacy Sanity Check

Legacy sanity check.

Saturated legacy checks retained for continuity, not frontier ranking.

MMLU (legacy saturated sanity check)

The original MMLU: 15,908 four-choice questions across 57 subjects from elementary to professional level. Largely saturated at the frontier — top models cluster above 90%. For a harder variant see MMLU-Pro.

| # | Model | Provider | Accuracy |
|---|---|---|---|
| 01 | o3 | OpenAI | 92.9% |
| 02 | GPT-5.2 | OpenAI | 92.4% |
| 03 | o1 | OpenAI | 91.8% |
| 04 | Claude Opus 4.5 | Anthropic | 91.6% |
| 05 | Gemini 3 Pro | Google | 91.4% |
| 06 | Claude Opus 4.6 | Anthropic | 91.2% |
| 07 | o1-preview | OpenAI | 90.8% |
| 08 | GPT-5 | OpenAI | 90.8% |
| 09 | DeepSeek R1 | DeepSeek | 90.8% |
| 10 | GPT-4.5 Preview | OpenAI | 90.8% |
| 11 | Claude Sonnet 4.5 | Anthropic | 90.4% |
| 12 | GPT-4.1 | OpenAI | 90.2% |
| 13 | Claude Sonnet 4 | Anthropic | 90.1% |
| 14 | o4-mini | OpenAI | 90% |
| 15 | GLM-4.5 | Zhipu AI | 90% |
| 16 | Gemini 2.5 Pro | Google | 89.8% |
| 17 | Gemini 3 Flash | Google | 89.6% |
| 18 | Llama 4 Maverick | Meta | 89.4% |
| 19 | Claude Opus 4 | Anthropic | 88.8% |
| 20 | Qwen 3 72B | Alibaba | 88.7% |
| 21 | Llama 3.1 405B | Meta | 88.6% |
| 22 | DeepSeek-V3 | DeepSeek | 88.5% |
| 23 | MiniMax-Text-01 | MiniMax | 88.5% |
| 24 | Claude 3.5 Sonnet | Anthropic | 88.3% |
| 25 | DeepSeek V3.5 | DeepSeek | 88.2% |
| 26 | Qwen3-235B-A22B | Alibaba | 87.81% |
| 27 | Llama 4 405B | Meta | 87.8% |
| 28 | Qwen3-Coder-Next | Qwen | 87.73% |
| 29 | Grok 2 | xAI | 87.5% |
| 30 | Llama 3 (405B, Instruct) | Meta | 87.3% |
| 31 | Trinity Large Preview | Arcee AI | 87.21% |
| 32 | GPT-4o | OpenAI | 87.2% |
| 33 | Mistral Large 3 | Mistral | 87.1% |
| 34 | LongCat-Flash-Omni | Unknown | 86.81% |
| 35 | Claude 3 Opus | Anthropic | 86.8% |
| 36 | GPT-4 Turbo | OpenAI | 86.7% |
| 37 | Grok 4 | xAI | 86.6% |
| 38 | MiniMax M2.5 | MiniMax | 86.5% |
| 39 | Qwen2.5-72B-Instruct | Alibaba | 86.1% |
| 40 | Kimi K2.5 | Moonshot AI | 86% |
| 41 | o3-mini | OpenAI | 85.9% |
| 42 | Gemini 1.5 Pro | Google | 85.9% |
| 43 | Step-3.5-Flash Base | Unknown | 85.8% |
| 44 | o1-mini | OpenAI | 85.2% |
| 45 | Qwen 3 14B | Alibaba | 84.3% |
| 46 | Phi-4 14B | Microsoft | 83.9% |
| 47 | GPT-4o mini | OpenAI | 82% |
| 48 | Llama 3.1 70B | Meta | 82% |
| 49 | Qwen3-Omni-30B-A3B-Base-202507 | Unknown | 81.69% |
| 50 | Qwen3-VL-8B-Instruct | Qwen | 80.7% |
| 51 | MiniCPM-o 4.5-Instruct | Unknown | 77% |
| 52 | Aria | Unknown | 73.3% |
| 53 | Apertus-70B-Instruct | Unknown | 69.6% |
| 54 | Llama 2 70B (5-shot) | Unknown | 68.9% |
| 55 | Chameleon 34B | Unknown | 65.8% |
| 56 | Apertus-70B | Unknown | 65.2% |
| 57 | LLaMA-65B | Unknown | 63.4% |
| 58 | OLMo-2-7B-1124 (olmOCR-peS2o) | Unknown | 61.1% |
| 59 | HRM-Text-1B | Unknown | 60.7% |
| 60 | BLT-Entropy 8B | Unknown | 57.4% |
| 61 | Helium | Unknown | 54.3% |
| 62 | BitNet b1.58 2B4T | Unknown | 53.17% |
| 63 | Moshi | Kyutai | 49.7% |
Last verified: 2026-05-18 · Source: hendrycks/test (MMLU) · Saturated benchmark. Small score deltas at the top (90–93%) are within noise; treat rankings as a cluster, not a strict order.
§ 10 · Browse

By capability.

A shortcut into deeper leaderboards and per-task pages. All links resolve to live registry pages.

Reasoning: Multi-step, frontier-difficulty, GPQA and HLE.
Math: AIME 2025, olympiad-style short answer.
Code generation: LiveCodeBench, pass@1 on held-out contest problems.
Knowledge: MMLU and MMLU-Pro, breadth across 57 subjects.
Agentic: SWE-bench, Tau2-Bench, tool-use under real constraints.
All LLM datasets: The full index of text-in, text-out tasks.
§ 11 · Deep dives

By benchmark family.

Editorial pages with current rankings, eval methodology, and what the score actually means.

Coding benchmarks: LiveCodeBench, SWE-bench Verified, HumanEval+, MBPP; pass@1 across the coding leaderboards.
HumanEval & MBPP: The two saturating Python micro-benchmarks, what they still tell you and what they don’t.
Math benchmarks: AIME, MATH, Omni-MATH; frontier models on olympiad-style problems.
GSM8K: Grade-school math word problems, the canonical reasoning benchmark.
Reasoning benchmarks: GPQA, HLE, ARC-AGI; what frontier-difficulty actually means.
Open-weight models: Llama, Qwen, DeepSeek, Mistral; the open frontier vs the closed.
§ 12 · Related

Keep reading.

Adjacent sections of the registry.

Agentic: SWE-bench, Terminal-Bench, tool-use and the trust problem in agent evals.
Code generation: The pass@1 era and what comes after.
Guide · Code models: Long-form guide comparing code models in production.
News: Dated editorial notes when a benchmark moves.
Read next

Four places to go from here.

Condensed view · LLM Power Ranking: One ranking by average percentile across MMLU, GPQA, MATH, AIME and more, plus CodeSOTA-verified scores where we ran our own held-out eval.
Sister hub · Code generation: SWE-bench, HumanEval, LiveCodeBench, Aider Polyglot; every code-generation benchmark and the harness behind it.
Sister hub · Agentic AI: Long-horizon agent benchmarks, OpenRouter adoption data, and which models actually show up in production agents.
Reference · Methodology: How scores are sourced, which sources count, and what we exclude. Required reading before quoting numbers.