Codesota · Benchmark Openness
Which evals are open, and which still resist contamination
Updated: June 7, 2026
§ 00 · Premise

Every lab quotes the same benchmarks. How many can you train on?

A model release is a wall of percentages. What the table never tells you is how each number was scored — and a fully open test set is, by definition, a trainable one.

So we took the benchmarks the frontier labs actually cite and tagged each by access model and contamination risk. Of 47 benchmarks, 33 are fully open and 1 is gated but still downloadable; only 13 are held-out, private, or live.

§ 01 · The headline

70% of the benchmarks labs use to grade their models are fully open — the questions and the answers are on the internet the models were trained on. The ones that hold a number’s value over time are the 13 that hide the test set: held-out servers, private graders, or live human votes.

A higher score on an open benchmark can mean a better model — or a better-memorized one. The access column is the only way to tell them apart.

§ 02 · Access types

Five ways a benchmark guards its answers.

Openness is a spectrum. At one end the whole test set sits on Hugging Face; at the other the questions never leave the maintainer’s server. Each step up makes the score harder to game and harder to fake.

Copper marks the contamination-resistant tiers — the ones whose numbers age well.

| Access | What it means | Benchmarks |
| --- | --- | --- |
| Open | Test items + answers fully public; trainable, leak-prone | 33 |
| Gated | Public but access-controlled / canary-protected; still downloadable | 1 |
| Held-out | Items public, answers withheld via an eval server or leaderboard | 7 |
| Private | Items hidden, scored only by the maintainer | 1 |
| Live | Continuously refreshed / human-vote; the test set is a moving target | 5 |
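
The five tiers can be read as a small decision rule over what a benchmark exposes. The sketch below is one possible encoding of the definitions in this section, not a description of any Codesota tooling: the `Benchmark` fields, the `access_tier` function, and the order in which the checks run are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    OPEN = "open"
    GATED = "gated"
    HELD_OUT = "held-out"
    PRIVATE = "private"
    LIVE = "live"


@dataclass(frozen=True)
class Benchmark:
    # Hypothetical descriptor fields, chosen to mirror the tier definitions.
    name: str
    items_public: bool               # can anyone read the test questions?
    answers_public: bool             # are the reference answers downloadable?
    access_controlled: bool = False  # agreement, login, or canary string
    refreshed: bool = False          # new items or live human votes over time


def access_tier(b: Benchmark) -> Access:
    """Map what a benchmark exposes to one of the five access tiers."""
    if b.refreshed:
        return Access.LIVE
    if not b.items_public:
        return Access.PRIVATE
    if not b.answers_public:
        return Access.HELD_OUT
    if b.access_controlled:
        return Access.GATED
    return Access.OPEN


# Tiers that hide or keep moving the test set.
RESISTANT = {Access.HELD_OUT, Access.PRIVATE, Access.LIVE}

example = Benchmark("toy-eval", items_public=True, answers_public=False)
print(access_tier(example).value, access_tier(example) in RESISTANT)
# prints: held-out True
```

Checking `refreshed` first is a design choice: a benchmark whose items rotate is treated as live even if past rounds are public.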
§ 03 · The index

Every benchmark, by category and access.

Grouped by what they measure, sorted within each group from most open to most guarded. The contamination column is our read on how likely the test set has already been seen.

Red = high contamination risk · copper = low. “Sat.” flags a near-solved benchmark.

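| Category | Benchmark | What it measures | Access | Contam. | Year | Sat. |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge & reasoning | MMLU | 57-subject multiple-choice knowledge | Open | high | 2020 | sat. |
| Knowledge & reasoning | MMLU-Pro | Harder 10-way MMLU with reasoning | Open | high | 2024 | |
| Knowledge & reasoning | SimpleQA | Short-fact accuracy & calibration | Open | medium | 2024 | |
| Knowledge & reasoning | BIG-Bench Hard | 23 hard reasoning tasks | Open | high | 2022 | sat. |
| Knowledge & reasoning | GPQA Diamond | Google-proof graduate science QA | Gated | medium | 2023 | |
| Knowledge & reasoning | Humanity's Last Exam | Expert frontier QA across 100+ subjects | Held-out | low | 2025 | |
| Knowledge & reasoning | LiveBench | Monthly-refreshed contamination-free suite | Live | low | 2024 | |
| Abstract reasoning | ARC-AGI-2 | Few-shot abstraction & generalization | Held-out | low | 2025 | |
| Math | MATH | Competition math, 7 subjects | Open | high | 2021 | sat. |
| Math | GSM8K | Grade-school word problems | Open | high | 2021 | sat. |
| Math | Omni-MATH | Olympiad-level math, 33 subdomains | Open | medium | 2024 | |
| Math | FrontierMath | Research-level mathematics | Private | low | 2024 | |
| Math | AIME 2025 | Olympiad-qualifier competition math | Live | medium | 2025 | |
| Code | HumanEval | Function synthesis from docstrings | Open | high | 2021 | sat. |
| Code | MBPP | Entry-level Python programming | Open | high | 2021 | sat. |
| Code | SWE-bench Verified | Resolve real GitHub issues with tests | Open | medium | 2024 | |
| Code | Aider Polyglot | Multi-language edit tasks in a real editor | Open | medium | 2024 | |
| Code | LiveCodeBench | Time-windowed competitive programming | Live | low | 2024 | |
| Agentic | Terminal-Bench | Terminal / sysadmin agent tasks | Open | low | 2025 | |
| Agentic | WebArena | Web navigation in self-hosted sites | Open | low | 2023 | |
| Agentic | tau-bench | Tool-use dialog in retail / airline | Open | low | 2024 | |
| Agentic | GAIA | Real-world assistant tasks w/ tools | Held-out | low | 2023 | |
| Agentic | BFCL | Function / tool-calling accuracy | Live | low | 2024 | |
| Chat & preference | Arena-Hard | Hard prompts, LLM-judged vs baseline | Open | medium | 2024 | |
| Chat & preference | MT-Bench | Multi-turn instruction following, LLM-judge | Open | high | 2023 | sat. |
| Chat & preference | AlpacaEval 2.0 | Length-controlled win rate, LLM-judge | Open | high | 2023 | sat. |
| Chat & preference | LMArena | Human pairwise preference (Elo) | Live | low | 2023 | |
| Long context | RULER | Synthetic long-context retrieval & tracing | Open | low | 2024 | |
| Long context | LongBench v2 | Realistic long-context understanding | Open | medium | 2024 | |
| Visual reasoning | MMMU | College-level multimodal understanding | Open | high | 2023 | |
| Visual reasoning | MMMU-Pro | Robust multimodal reasoning (vision-only options) | Open | medium | 2024 | |
| Visual reasoning | VLMsAreBiased | Visual evidence vs memorized priors | Open | low | 2025 | |
| Visual reasoning | MMStar | Vision-indispensable multimodal QA | Open | low | 2024 | |
| Visual reasoning | MathVista | Visual math reasoning | Held-out | medium | 2023 | |
| Document | CharXiv Reasoning | Scientific chart understanding | Open | low | 2024 | |
| Document | OmniDocBench | Diverse PDF parsing (OCR, layout, tables) | Open | low | 2024 | |
| Document | ChartQA | Question answering over charts | Open | high | 2022 | sat. |
| Document | DocVQA | QA over document images | Held-out | medium | 2021 | sat. |
| Screen / GUI | ScreenSpot-Pro | GUI grounding in professional software | Open | low | 2024 | |
| Spatial | ERQA | Grounding objects & spatial concepts physically | Open | low | 2025 | |
| Spatial | CV-Bench | Fundamental 2D/3D spatial reasoning | Open | medium | 2024 | |
| Video | Video-MME | Temporal reasoning, long-context video | Open | medium | 2024 | |
| Video | Video-MMMU | Knowledge acquisition from educational video | Open | low | 2024 | |
| Video | Perception Test | Perception & reasoning in real-world video | Held-out | low | 2023 | |
| Video | EgoSchema | Long-form egocentric video QA | Held-out | low | 2023 | |
| Biomedical | MedXpertQA-MM | Expert clinical reasoning (multimodal) | Open | low | 2025 | |
| Biomedical | VQA-RAD | QA over radiology images | Open | high | 2018 | sat. |
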
Access: open = items + answers public; gated = public behind an agreement / canary; held-out = answers withheld via eval server; private = items never released; live = refreshed or human-voted. Contamination = risk the test set leaked into pretraining, not a measured leakage rate. Curated by CodeSOTA — corrections welcome.
§ 04 · Why it matters

An open test set stops measuring the moment it’s famous.

Once a benchmark is public and widely cited, its questions end up in the next pretraining crawl. Scores climb whether or not the underlying capability moved — the number measures memorization as much as skill. That’s why 26% of the canonical suite carries a high contamination risk, and why the held-out and live tiers are the ones worth betting a decision on.
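
One common way to probe for this kind of leakage is word n-gram overlap between a test item and a candidate training document. The sketch below is a minimal illustration of that idea and nothing more: the contamination column in the index is a risk rating, not a measured leakage rate, and this is not how it was assigned. The function names and the default window of 8 tokens are assumptions.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # If the text has fewer than n tokens the range is empty, so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


item = "What is the capital of France and which river runs through it?"
leaked = ("Quiz dump: what is the capital of France and which river "
          "runs through it? Answer: Paris, the Seine.")
clean = "Paris has been the seat of French government for centuries."

print(overlap_ratio(item, leaked))  # 1.0
print(overlap_ratio(item, clean))   # 0.0
```

A high ratio only shows the text was available to train on; it does not prove a given model memorized it, and paraphrased leaks will slip past an exact n-gram match.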

The fix isn’t a new leaderboard — it’s a hold-out set the model has never seen, with a reward you can verify in code. That is exactly what we build.
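
As a rough illustration of what "a reward you can verify in code" can mean, here is a minimal sketch of exact-match scoring against answers that stay on the grader's side. It is an assumption-laden toy, not our production grader: `normalize`, `reward`, and `score` are hypothetical names, and real tasks usually need a task-specific check (unit tests, numeric tolerance) rather than string equality.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(answer.strip().lower().split())


def reward(prediction: str, reference: str) -> int:
    """Return 1 for an exact match after normalization, else 0."""
    return int(normalize(prediction) == normalize(reference))


def score(predictions: dict, hidden_answers: dict) -> float:
    """Mean reward over the hold-out set; unanswered questions score 0."""
    if not hidden_answers:
        raise ValueError("hold-out set is empty")
    total = 0
    for qid, reference in hidden_answers.items():
        if qid in predictions:
            total += reward(predictions[qid], reference)
    return total / len(hidden_answers)


# The answers live only with the grader; the model submits predictions by id.
hidden = {"q1": "42", "q2": "Paris"}
submitted = {"q1": " 42 ", "q2": "London"}
print(score(submitted, hidden))  # 0.5
```

The submitter only ever sees the aggregate score, which is what keeps the references out of the next training crawl.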

The canonical suite, by access
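| Access | Benchmarks | Share |
| --- | --- | --- |
| Open | 33 | 70% |
| Gated | 1 | 2% |
| Held-out | 7 | 15% |
| Private | 1 | 2% |
| Live | 5 | 11% |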
13 of 47 resist contamination by construction. The rest are useful until they’re cited.
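
The shares quoted on this page follow directly from the tier counts. The short check below reproduces them; the count of 12 high-risk benchmarks is tallied from the rows marked "high" in the index, and the variable names are ours.

```python
counts = {"Open": 33, "Gated": 1, "Held-out": 7, "Private": 1, "Live": 5}
total = sum(counts.values())

# Percentage share per tier, rounded to the nearest whole percent.
shares = {tier: round(100 * n / total) for tier, n in counts.items()}

# Tiers that hide or keep refreshing the test set.
resistant = sum(counts[t] for t in ("Held-out", "Private", "Live"))

# Rows tagged "high" in the contamination column of the index.
high_risk = 12

print(total)      # 47
print(shares)     # {'Open': 70, 'Gated': 2, 'Held-out': 15, 'Private': 2, 'Live': 11}
print(resistant)  # 13
print(round(100 * high_risk / total))  # 26
```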
§ 05 · Work with us

Need a number you can actually trust?

If the public benchmarks for your capability are saturated or contaminated, the leaderboard can’t tell your models apart. We build private, contamination-resistant, verifiable-reward evals on a hold-out set — designed to discriminate where the open ones no longer do.