Codesota · Large Language ModelsThe frontier leaderboard, dated & sourcedIssue: April 22, 2026
Live registry · 8 benchmarks · 69 models ·

Large language models,
measured honestly.

Frontier model performance across knowledge, reasoning, math, code, and sustained tool-use — every score dated, every source linked, every benchmark described in its own words. No vendor spin, no collapsed averages.

Shaded rows mark current state of the art. Descriptions in serif; scores in tabular mono; navigation in sans.

§ 01 · Dashboard

The frontier, side by side.

One row per model, one column per benchmark. Cells show the published score; a dash means we have no verified result for that pair. Rows sorted by breadth of coverage first, then by average normalized score.


Models
69
Benchmarks
8
Results
159
Last update
April 23, 2026
Frontier models · top by coverage
Copper row marks leader-on-most-benchmarks
#ModelVendorMMLUMMLU-ProGPQA DiamondAIME 2025LiveCodeBench ProLiveCodeBench (classic)Tau2-BenchHumanity's Last Exam (HLE)Cov.
01Gemini 2.5 ProGoogle89.8%84%86.7%176975.6%54%21.6%7/8
02Gemini 3 ProGoogle91.4%89.8%91.9%243969%38.3%6/8
03GPT-5OpenAI90.8%87.1%89%217685%25.3%6/8
04DeepSeek R1DeepSeek90.8%71.5%72%116165.9%8.5%6/8
05GPT-4oOpenAI87.2%72.6%49.9%40.8%36%2.7%6/8
06Claude Opus 4.5Anthropic91.8%89.5%74.9%80%79%5/8
07o4-miniOpenAI90%77.6%92.7%209272.8%5/8
08Grok 4xAI86.6%86.6%88%79%24.5%5/8
09o3OpenAI92.9%82.8%86.7%101065.3%5/8
10Gemini 3 FlashGoogle89.6%89%90.4%90.8%4/8
11Claude Sonnet 4.5Anthropic90.4%87.5%141263%4/8
12Gemini 2.5 FlashGoogle82.8%128863.9%12.1%4/8
Fig 2 · Cells in copper mark the benchmark leader. Em-dash means no result on file; this is not evidence of weakness — just an absence of a verified submission in our registry.
§ 02 · Coverage

8 benchmarks. Five axes.

Knowledge, reasoning, math, code, tools, and frontier difficulty. Each tile names the benchmark, the current SOTA score, and the leading model.

MMLU
92.9%41 models
o3
Knowledge
MMLU-Pro
90.99%20 models
Gemini 3.1 Pro
Knowledge
GPQA Diamond
91.9%33 models
Gemini 3 Pro
Knowledge
AIME 2025
92.7%5 models
o4-mini
Math & Reasoning
LiveCodeBench Pro
2439Elo9 models
Gemini 3 Pro
Coding
LiveCodeBench (classic)
91.7%30 models
Gemini 3 Pro Preview
Coding
Tau2-Bench
79%8 models
Claude Opus 4.5
Agentic & Tools
Humanity's Last Exam (HLE)
38.3%13 models
Gemini 3 Pro
Frontier Difficulty
Fig 3 · Current SOTA per benchmark with count of recorded submissions. Click any tile to scroll to its full leaderboard below.
§ 03 · Caveat

Read the numbers with care.

Benchmarks are instruments. Like all instruments they can be mis-calibrated, miscounted, or gamed.

A 2026 Berkeley RDI study found that eight major agent benchmarks — including SWE-bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA, and FieldWorkArena — could be exploited to near-perfect scores without solving any tasks.

Failure modes included leaked reference answers, unsanitized eval(), prompt-injectable LLM judges, and scoring functions that skip correctness checks entirely. A 10-line conftest.py was enough to make every SWE-bench test report as passing.

Treat leaderboard position as a signal, not proof of capability — especially on agentic benchmarks where the evaluation environment is itself part of the attack surface. Held-out, contamination-resistant evals like HLE and LiveCodeBench Pro are more resistant, but not immune.

Read the full Berkeley RDI analysis →
§ 04 · Knowledge

Knowledge.

Breadth across 57+ subjects, graduate-level and multiple-choice.

The original MMLU: 15,908 four-choice questions across 57 subjects from elementary to professional level. Largely saturated at the frontier — top models cluster above 90%. For a harder variant see MMLU-Pro.

#ModelVendoraccuracy
01o3OpenAI92.9%
02GPT-5.2OpenAI92.4%
03o1OpenAI91.8%
04Claude Opus 4.5Anthropic91.8%
05Claude Opus 4.5Anthropic91.6%
06Gemini 3 ProGoogle91.4%
07Claude Opus 4.6Anthropic91.2%
08o1-previewOpenAI90.8%
09GPT-5OpenAI90.8%
10DeepSeek R1DeepSeek90.8%
11GPT-4.5 PreviewOpenAI90.8%
12Claude Sonnet 4.5Anthropic90.4%
13GPT-4.1OpenAI90.2%
14Claude Sonnet 4Anthropic90.1%
15o4-miniOpenAI90%
16Gemini 2.5 ProGoogle89.8%
17Gemini 3 FlashGoogle89.6%
18Llama-4-MaverickMeta89.4%
19Claude Opus 4Anthropic88.8%
20Qwen 3 72BAlibaba88.7%
21Llama 3.1 405BMeta88.6%
22DeepSeek-V3DeepSeek88.5%
23Claude 3.5 SonnetAnthropic88.3%
24DeepSeek V3.5DeepSeek88.2%
25Llama 4 405BMeta87.8%
26Grok 2xAI87.5%
27GPT-4oOpenAI87.2%
28Mistral Large 3Mistral87.1%
29Claude 3 OpusAnthropic86.8%
30GPT-4 TurboOpenAI86.7%
31Grok 4xAI86.6%
32MiniMax M2.5MiniMax86.5%
33Qwen2.5-72B-InstructAlibaba86.1%
34Kimi K2.5Moonshot AI86%
35o3-miniOpenAI85.9%
36Gemini 1.5 ProGoogle85.9%
37o1-miniOpenAI85.2%
38Qwen 3 14BAlibaba84.3%
39Phi-4 14BMicrosoft83.9%
40Llama 3.1 70BMeta82%
41GPT-4o miniOpenAI82%
Source: hendrycks/test (MMLU) · Saturated benchmark. Small score deltas at the top (90–93%) are within noise; treat rankings as a cluster, not a strict order.

A harder, contamination-resistant successor to MMLU: 12,000 questions with 10 answer choices (vs. 4) and reasoning-focused items pulled from advanced STEM and professional sources. Frontier now crests 90% — the benchmark is starting to saturate 18 months after release.

#ModelVendoraccuracy
01Gemini 3.1 ProGoogle90.99%
02Gemini 3 ProGoogle89.8%
03Claude Opus 4.5Anthropic89.5%
04Gemini 3 FlashGoogle89%
05Qwen3.6 PlusAlibaba Cloud88.5%
06MiniMax M2.1MiniMax88%
07Claude Opus 4.1Anthropic88%
08Qwen3.5-397B-A17BAlibaba Cloud87.8%
09Claude Sonnet 4.5Anthropic87.5%
10GPT-5.2OpenAI87.4%
11GPT-5OpenAI87.1%
12Kimi K2.5Moonshot AI87.1%
13GPT-5.1OpenAI87%
14Grok 4xAI86.6%
15DeepSeek V3.2DeepSeek86.2%
16Claude 3.7 SonnetAnthropic85.1%
17DeepSeek-R1-0528DeepSeek85%
18Kimi K2-Thinking-0905Moonshot AI84.6%
19GLM-4.5Zhipu AI84.6%
20GPT-4oOpenAI72.6%
Source: arXiv:2406.01574 · Scores above ~87% cluster within a few points — treat the top cohort as a band, not a strict ranking. Variant rows (thinking vs. non-thinking) collapsed to one entry per model family.

198 expert-authored graduate-level questions in biology, chemistry, and physics. PhD-level specialists score ~65% on their own field. Designed to be impossible to Google.

#ModelVendoraccuracy
01Gemini 3 ProGoogle91.9%
02Claude Opus 4.6Anthropic91.3%
03Gemini 3 FlashGoogle90.4%
04Claude Sonnet 4.6Anthropic89.9%
05GPT-5OpenAI89%
06Grok 4xAI88%
07Gemini 2.5 ProGoogle84%
08Gemini 2.5 FlashGoogle82.8%
09o3OpenAI82.8%
10o4-miniOpenAI77.6%
11Claude Opus 4Anthropic76.7%
12o1OpenAI75.7%
13Claude Opus 4.5Anthropic74.9%
14o3-miniOpenAI74.9%
15o1-previewOpenAI73.3%
16DeepSeek R1DeepSeek71.5%
17Qwen3-235B-A22BAlibaba71.1%
18Claude Sonnet 4Anthropic70%
19Llama-4-MaverickMeta69.8%
20GPT-4.5 PreviewOpenAI69.5%
21GPT-4.1 miniOpenAI66.4%
22GPT-4.1OpenAI66.3%
23o1-miniOpenAI60%
24Claude 3.5 SonnetAnthropic59.4%
25Grok 2xAI56%
26Llama 3.1 405BMeta50.7%
27Claude 3 OpusAnthropic50.4%
28GPT-4oOpenAI49.9%
29GPT-4 TurboOpenAI49.3%
30Qwen2.5-72B-InstructAlibaba49%
31Gemini 1.5 ProGoogle46.2%
32Llama 3.1 70BMeta41.7%
33GPT-4o miniOpenAI40.2%
Source: arXiv:2311.12022 · Human expert baseline (non-specialist): 34%. PhD specialist: ~65%.
§ 05 · Math / Reasoning

Math & reasoning.

Olympiad-style short answer, released after model training cutoffs.

The 2025 American Invitational Mathematics Examination: 30 olympiad-style short-answer problems drawn after most 2024-era model training cutoffs. A primary frontier-math signal in recent reasoning-model reports.

#ModelVendoraccuracy
01o4-miniOpenAI92.7%
02Gemini 2.5 ProGoogle86.7%
03o3OpenAI86.7%
04Claude Opus 4.5Anthropic80%
05DeepSeek R1DeepSeek72%
Source: maa.org/aime · Small test set (30 problems) — a single swing is ~3.3%. Numbers below are pass@1 unless otherwise noted.
§ 06 · Coding

Code.

Contest-style programming. Elo-rated or pass@1 on held-out problems.

The 2026 Elo-rated successor to classic LCB. Built by Olympiad medalists from continuously-updated Codeforces, ICPC and IOI problems. Each LLM is treated as a virtual Codeforces contestant and fit to a Bayesian MAP Elo on the standard Codeforces scale (~800 novice to ~3800 top human).

#ModelVendorElo
01Gemini 3 ProGoogle2439
02GPT-5OpenAI2176
03o4-miniOpenAI2092
04Gemini 2.5 ProGoogle1769
05Qwen3-235B-A22BAlibaba1673
06Claude Sonnet 4.5Anthropic1412
07Gemini 2.5 FlashGoogle1288
08DeepSeek R1DeepSeek1161
09o3OpenAI1010
Source: livecodebenchpro.com · Elo rating comparable to the Codeforces human scale. Top human contestants sit around 3800; the strongest model on the board is Gemini 3 Pro at 2439.

Classic pass@1 LiveCodeBench — continuously updated with new contest problems from LeetCode, Codeforces, and AtCoder. Largely superseded by LCB Pro for frontier models, but preserved here for historical comparison across older models.

#ModelVendorpass@1
01Gemini 3 Pro Preview91.7%
02Gemini 3 FlashGoogle90.8%
03GPT-5OpenAI85%
04Grok 4xAI79%
05Gemini 2.5 ProGoogle75.6%
06DeepSeek-R1-0528DeepSeek73.3%
07o4-miniOpenAI72.8%
08Qwen3-235B-A22BAlibaba70.7%
09o3-miniOpenAI66.9%
10DeepSeek R1DeepSeek65.9%
11o3OpenAI65.3%
12DeepSeek-R1-Distill-Llama-70BDeepSeek65.2%
13Gemini 2.5 FlashGoogle63.9%
14Kimi k1.5Moonshot AI62.5%
15DeepSeek-R1-Distill-Qwen-32BDeepSeek62.1%
16Claude Opus 4Anthropic57.8%
17GPT-4.1OpenAI54.4%
18Claude Sonnet 4Anthropic52.8%
19DeepSeek-v3-0324DeepSeek49.2%
20DeepSeek-V3DeepSeek49.2%
21GPT-4.1 miniOpenAI48.3%
22Qwen2.5-Coder 32BAlibaba47.8%
23Llama-4-MaverickMeta43.4%
24DeepSeek-Coder-V2-InstructDeepSeek43.4%
25GPT-4oOpenAI40.8%
26Gemma-3-27bGoogle39%
27Llama-4-ScoutMeta32.8%
28Gemma 3 12B ITGoogle DeepMind32%
29Codestral 22BMistral29.5%
30Gemma 3 4B ITGoogle DeepMind23%
Source: livecodebench.github.io · Problems released after model training cutoffs to prevent contamination.
§ 07 · Agentic / Tools

Agentic & tools.

Multi-turn tasks using real tools and databases; pass = full resolution.

Simulates real customer service interactions — agents use tools and databases to resolve tasks in retail and airline domains across multi-turn dialogues. Pass rate = task fully resolved.

#ModelVendorpass_rate
01Claude Opus 4.5Anthropic79%
02GPT-5.2OpenAI73%
03Gemini 3 ProGoogle69%
04Claude Sonnet 4.5Anthropic63%
05GPT-5.1OpenAI59%
06Gemini 2.5 ProGoogle54%
07Claude 3.7 SonnetAnthropic47%
08GPT-4oOpenAI36%
Source: sierra-research/tau2-bench · Average across 3 seeds per model.
§ 08 · Frontier

Frontier difficulty.

Designed to remain unsaturated for years. Even the leaders score low.

Humanity's Last Exam (HLE)

3,000 extremely hard questions across math, science, law, and humanities — contributed by domain experts worldwide. Designed to remain unsaturated for years. No tools allowed in this variant.

#ModelVendoraccuracy
01Gemini 3 ProGoogle38.3%
02GPT-5OpenAI25.3%
03Grok 4xAI24.5%
04Gemini 2.5 ProGoogle21.6%
05GPT-5-miniOpenAI19.4%
06Claude Opus 4.6Anthropic19%
07Claude 4.5 SonnetAnthropic13.7%
08Claude Sonnet 4.6Anthropic13.2%
09Gemini 2.5 FlashGoogle12.1%
10DeepSeek R1DeepSeek8.5%
11o1OpenAI8%
12GPT-4.1 miniOpenAI4.6%
13GPT-4oOpenAI2.7%
Source: agi.safe.ai · Leaderboard updated from live source.
§ 09 · Browse

By capability.

A shortcut into deeper leaderboards and per-task pages. All links resolve to live registry pages.

Capability
Reasoning
Multi-step, frontier-difficulty, GPQA and HLE.
Capability
Math
AIME 2025, olympiad-style short answer.
Capability
Code generation
LiveCodeBench, pass@1 on held-out contest problems.
Capability
Knowledge
MMLU and MMLU-Pro — breadth across 57 subjects.
Capability
Agentic
SWE-bench, Tau2-Bench, tool-use under real constraints.
Capability
All LLM datasets
The full index of text-in, text-out tasks.
§ 10 · Deep dives

By benchmark family.

Editorial pages with current rankings, eval methodology, and what the score actually means.

Deep dive
Coding benchmarks
LiveCodeBench, SWE-bench Verified, HumanEval+, MBPP — pass@1 across the coding leaderboards.
Deep dive
HumanEval & MBPP
The two saturating Python micro-benchmarks — what they still tell you and what they don’t.
Deep dive
Math benchmarks
AIME, MATH, Omni-MATH — frontier models on olympiad-style problems.
Deep dive
GSM8K
Grade-school math word problems — the canonical reasoning benchmark.
Deep dive
Reasoning benchmarks
GPQA, HLE, ARC-AGI — what frontier-difficulty actually means.
Deep dive
Open-weight models
Llama, Qwen, DeepSeek, Mistral — the open frontier vs the closed.
§ 11 · Related

Keep reading.

Adjacent sections of the registry.

Section
Agentic
SWE-bench, Terminal-Bench, tool-use and the trust problem in agent evals.
Section
Code generation
The pass@1 era and what comes after.
Section
Guide · Code models
Long-form guide comparing code models in production.
Section
News
Dated editorial notes when a benchmark moves.
Submit a result Read the methodology
Read next

Three places to go from here.

Sister hub
Code generation
SWE-bench, HumanEval, LiveCodeBench, Aider Polyglot — every code-generation benchmark and the harness behind it.
Sister hub
Agentic AI
Long-horizon agent benchmarks, OpenRouter adoption data, and which models actually show up in production agents.
Reference
Methodology
How scores are sourced, which sources count, and what we exclude. Required reading before quoting numbers.