Codesota · Benchmark Openness

Which evals are open, and which still resist contamination

Updated: June 7, 2026

§ 00 · Premise

Every lab quotes the same benchmarks. How many can you train on?

A model release is a wall of percentages. What the table never tells you is how each number was scored — and a fully open test set is, by definition, a trainable one.

So we took the benchmarks the frontier labs actually cite and tagged each by access model and contamination risk. Of 47 benchmarks, 33 are fully open, 1 is gated, and only 13 are held-out, private, or live.


§ 01 · The headline

70% of the benchmarks labs use to grade their models are fully open — the questions and the answers are on the internet the models were trained on. The ones that hold a number’s value over time are the 13 that hide the test set: held-out servers, private graders, or live human votes.

A higher score on an open benchmark can mean a better model — or a better-memorized one. The access column is the only way to tell them apart.

§ 02 · Access types

Five ways a benchmark guards its answers.

Openness is a spectrum. At one end the whole test set sits on Hugging Face; at the other the questions never leave the maintainer’s server. Each step up makes the score harder to game and harder to fake.

The contamination-resistant tiers (held-out, private, and live) are the ones whose numbers age well.

Open

Test items + answers fully public — trainable, leak-prone

Gated

Public but access-controlled / canary-protected — still downloadable

Held-out

Items public, answers withheld via an eval server or leaderboard

Private

Items hidden, scored only by the maintainer

Live

Continuously refreshed / human-vote — the test set is a moving target
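As a minimal sketch, the five tiers above can be written down as a small Python enum. The names `Access`, `RESISTANT`, and `resists_contamination` are illustrative assumptions, not part of any published tooling, and the numeric order simply follows the order of the list.

```python
from enum import Enum


class Access(Enum):
    """The five access tiers, in the order the list above gives them."""
    OPEN = 1      # test items + answers fully public
    GATED = 2     # public but access-controlled / canary-protected
    HELD_OUT = 3  # items public, answers withheld via an eval server
    PRIVATE = 4   # items hidden, scored only by the maintainer
    LIVE = 5      # continuously refreshed or human-voted


# Assumption: these are the tiers the page counts as contamination-resistant.
RESISTANT = frozenset({Access.HELD_OUT, Access.PRIVATE, Access.LIVE})


def resists_contamination(access: Access) -> bool:
    """Return True when the tier hides or refreshes its test set."""
    return access in RESISTANT


if __name__ == "__main__":
    for tier in Access:
        print(f"{tier.name:<9} resistant={resists_contamination(tier)}")
```

Gated stays on the non-resistant side here because, per its definition, the data is still downloadable.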

§ 03 · The index

Every benchmark, by category and access.

Grouped by what they measure, sorted within each group from most open to most guarded. The contamination column is our read on how likely the test set has already been seen.

In the Contam. column, high means the test set has probably been seen already and low means it probably has not. “Sat.” flags a near-solved benchmark.

Benchmark	What it measures	Access	Contam.	Yr
Knowledge & reasoning
MMLU (sat.)	57-subject multiple-choice knowledge	Open	high	20
MMLU-Pro	Harder 10-way MMLU with reasoning	Open	high	24
SimpleQA	Short-fact accuracy & calibration	Open	medium	24
BIG-Bench Hard (sat.)	23 hard reasoning tasks	Open	high	22
GPQA Diamond	Google-proof graduate science QA	Gated	medium	23
Humanity's Last Exam	Expert frontier QA across 100+ subjects	Held-out	low	25
LiveBench	Monthly-refreshed contamination-free suite	Live	low	24
Abstract reasoning
ARC-AGI-2	Few-shot abstraction & generalization	Held-out	low	25
Math
MATH (sat.)	Competition math, 7 subjects	Open	high	21
GSM8K (sat.)	Grade-school word problems	Open	high	21
Omni-MATH	Olympiad-level math, 33 subdomains	Open	medium	24
FrontierMath	Research-level mathematics	Private	low	24
AIME 2025	Olympiad-qualifier competition math	Live	medium	25
Code
HumanEval (sat.)	Function synthesis from docstrings	Open	high	21
MBPP (sat.)	Entry-level Python programming	Open	high	21
SWE-bench Verified	Resolve real GitHub issues with tests	Open	medium	24
Aider Polyglot	Multi-language edit tasks in a real editor	Open	medium	24
LiveCodeBench	Time-windowed competitive programming	Live	low	24
Agentic
Terminal-Bench	Terminal / sysadmin agent tasks	Open	low	25
WebArena	Web navigation in self-hosted sites	Open	low	23
tau-bench	Tool-use dialog in retail / airline	Open	low	24
GAIA	Real-world assistant tasks w/ tools	Held-out	low	23
BFCL	Function / tool-calling accuracy	Live	low	24
Chat & preference
Arena-Hard	Hard prompts, LLM-judged vs baseline	Open	medium	24
MT-Bench (sat.)	Multi-turn instruction following, LLM-judge	Open	high	23
AlpacaEval 2.0 (sat.)	Length-controlled win rate, LLM-judge	Open	high	23
LMArena	Human pairwise preference (Elo)	Live	low	23
Long context
RULER	Synthetic long-context retrieval & tracing	Open	low	24
LongBench v2	Realistic long-context understanding	Open	medium	24
Visual reasoning
MMMU	College-level multimodal understanding	Open	high	23
MMMU-Pro	Robust multimodal reasoning (vision-only options)	Open	medium	24
VLMsAreBiased	Visual evidence vs memorized priors	Open	low	25
MMStar	Vision-indispensable multimodal QA	Open	low	24
MathVista	Visual math reasoning	Held-out	medium	23
Document
CharXiv Reasoning	Scientific chart understanding	Open	low	24
OmniDocBench	Diverse PDF parsing (OCR, layout, tables)	Open	low	24
ChartQA (sat.)	Question answering over charts	Open	high	22
DocVQA (sat.)	QA over document images	Held-out	medium	21
Screen / GUI
ScreenSpot-Pro	GUI grounding in professional software	Open	low	24
Spatial
ERQA	Grounding objects & spatial concepts physically	Open	low	25
CV-Bench	Fundamental 2D/3D spatial reasoning	Open	medium	24
Video
Video-MME	Temporal reasoning, long-context video	Open	medium	24
Video-MMMU	Knowledge acquisition from educational video	Open	low	24
Perception Test	Perception & reasoning in real-world video	Held-out	low	23
EgoSchema	Long-form egocentric video QA	Held-out	low	23
Biomedical
MedXpertQA-MM	Expert clinical reasoning (multimodal)	Open	low	25
VQA-RAD (sat.)	QA over radiology images	Open	high	18

Access: open = items + answers public; gated = public behind an agreement / canary; held-out = answers withheld via eval server; private = items never released; live = refreshed or human-voted. Contamination = risk the test set leaked into pretraining, not a measured leakage rate. Curated by CodeSOTA — corrections welcome.

§ 04 · Why it matters

An open test set stops measuring the moment it’s famous.

Once a benchmark is public and widely cited, its questions end up in the next pretraining crawl. Scores climb whether or not the underlying capability moved — the number measures memorization as much as skill. That’s why 26% of the canonical suite carries a high contamination risk, and why the held-out and live tiers are the ones worth betting a decision on.

The fix isn’t a new leaderboard — it’s a hold-out set the model has never seen, with a reward you can verify in code. That is exactly what we build.
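To make "a reward you can verify in code" concrete, here is a hedged sketch of the general idea: a grader holds the expected answers, and the reward is a deterministic function of the model's output. Everything in it (`HeldOutItem`, `reward`, `score`, the exact-match rule, the toy model) is a hypothetical illustration and not a description of CodeSOTA's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HeldOutItem:
    prompt: str
    expected: str  # stays on the grader side, never shown to the model


def reward(item: HeldOutItem, answer: str) -> float:
    """Return 1.0 for a case-insensitive exact match, otherwise 0.0."""
    matches = answer.strip().lower() == item.expected.strip().lower()
    return 1.0 if matches else 0.0


def score(model: Callable[[str], str], items: Sequence[HeldOutItem]) -> float:
    """Mean reward of `model` over the held-out items."""
    if not items:
        raise ValueError("need at least one held-out item")
    return sum(reward(item, model(item.prompt)) for item in items) / len(items)


def toy_model(prompt: str) -> str:
    """Stand-in model: gets the arithmetic item right and the other wrong."""
    return "4" if "2 + 2" in prompt else "Lyon"


if __name__ == "__main__":
    items = [
        HeldOutItem("2 + 2 = ?", "4"),
        HeldOutItem("Capital of France?", "Paris"),
    ]
    print(score(toy_model, items))  # 0.5
```

Exact match is only the simplest verifiable reward; unit tests for code or a numeric tolerance for math would slot into `reward` the same way.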


The canonical suite, by access

Access	Benchmarks	Share
Open	33	70%
Gated	1	2%
Held-out	7	15%
Private	1	2%
Live	5	11%

13 of 47 resist contamination by construction. The rest are useful until they’re cited.
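The shares and the 13-of-47 figure follow directly from the five counts. A quick check in Python, taking those counts as given (the variable names are illustrative):

```python
# Counts per access tier, copied from the table above.
counts = {"Open": 33, "Gated": 1, "Held-out": 7, "Private": 1, "Live": 5}

total = sum(counts.values())

# Share of the suite per tier, rounded to the nearest whole percent.
shares = {tier: round(100 * n / total) for tier, n in counts.items()}

# Tiers that hide or refresh the test set.
resistant = counts["Held-out"] + counts["Private"] + counts["Live"]

if __name__ == "__main__":
    print(total)      # 47
    print(shares)     # {'Open': 70, 'Gated': 2, 'Held-out': 15, 'Private': 2, 'Live': 11}
    print(resistant)  # 13
```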

§ 05 · Work with us

Need a number you can actually trust?

If the public benchmarks for your capability are saturated or contaminated, the leaderboard can’t tell your models apart. We build private, contamination-resistant, verifiable-reward evals on a hold-out set — designed to discriminate where the open ones no longer do.
