Every lab quotes the same benchmarks. How many can you train on?
A model release is a wall of percentages. What the table never tells you is how each number was scored — and a fully open test set is, by definition, a trainable one.
So we took the benchmarks the frontier labs actually cite and tagged each by access model and contamination risk. Of the 47 benchmarks, 33 are fully open, 1 is gated, and only 13 are held-out, private, or live.
70% of the benchmarks labs use to grade their models are fully open — the questions and the answers are on the internet the models were trained on. The ones that hold a number’s value over time are the 13 that hide the test set: held-out servers, private graders, or live human votes.
A higher score on an open benchmark can mean a better model — or a better-memorized one. The access column is the only way to tell them apart.
Five ways a benchmark guards its answers.
Openness is a spectrum. At one end the whole test set sits on Hugging Face; at the other the questions never leave the maintainer’s server. Each step up makes the score harder to game and harder to fake.
The contamination-resistant tiers, the ones whose numbers age well, are held-out, private, and live.
Open: test items and answers fully public; trainable and leak-prone
Gated: public but access-controlled or canary-protected; still downloadable
Held-out: items public, answers withheld behind an eval server or leaderboard
Private: items hidden, scored only by the maintainer
Live: continuously refreshed or human-voted; the test set is a moving target
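To make "canary-protected" concrete: a canary is a unique marker string embedded in a benchmark's files so that a training pipeline can filter those files out, and so that a leak can be detected afterwards. The sketch below is a minimal illustration under stated assumptions. The `CANARY` value and the function names are hypothetical, and real benchmarks publish their own marker strings.

```python
from typing import Iterable

# Hypothetical canary marker (an assumption for illustration only);
# a real benchmark publishes its own unique string.
CANARY = "BENCHMARK-CANARY-00000000-0000-0000-0000-000000000000"


def drop_canary_docs(docs: Iterable[str], canary: str = CANARY) -> list[str]:
    """Return only the documents that do not contain the canary marker."""
    return [doc for doc in docs if canary not in doc]


def leaked_docs(docs: Iterable[str], canary: str = CANARY) -> list[int]:
    """Return the indices of documents that contain the canary marker."""
    return [i for i, doc in enumerate(docs) if canary in doc]


corpus = [
    "An ordinary web page about cooking.",
    f"Q: ... A: ... {CANARY}",
    "A forum thread about benchmarks.",
]
clean = drop_canary_docs(corpus)   # the two documents without the marker
flagged = leaked_docs(corpus)      # index of the document carrying it
```

The filter only helps if whoever builds the training corpus chooses to apply it, which is why this tier still counts as downloadable.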
Every benchmark, by category and access.
Grouped by what they measure, sorted within each group from most open to most guarded. The contamination column is our read on how likely the test set has already been seen.
Contamination risk is rated high, medium, or low. “sat.” after a benchmark name flags a near-solved (saturated) one.
| Benchmark | What it measures | Access | Contam. | Yr |
|---|---|---|---|---|
| Knowledge & reasoning | ||||
| MMLU (sat.) | 57-subject multiple-choice knowledge | Open | high | 20 |
| MMLU-Pro | Harder 10-way MMLU with reasoning | Open | high | 24 |
| SimpleQA | Short-fact accuracy & calibration | Open | medium | 24 |
| BIG-Bench Hard (sat.) | 23 hard reasoning tasks | Open | high | 22 |
| GPQA Diamond | Google-proof graduate science QA | Gated | medium | 23 |
| Humanity's Last Exam | Expert frontier QA across 100+ subjects | Held-out | low | 25 |
| LiveBench | Monthly-refreshed contamination-free suite | Live | low | 24 |
| Abstract reasoning | ||||
| ARC-AGI-2 | Few-shot abstraction & generalization | Held-out | low | 25 |
| Math | ||||
| MATH (sat.) | Competition math, 7 subjects | Open | high | 21 |
| GSM8K (sat.) | Grade-school word problems | Open | high | 21 |
| Omni-MATH | Olympiad-level math, 33 subdomains | Open | medium | 24 |
| FrontierMath | Research-level mathematics | Private | low | 24 |
| AIME 2025 | Olympiad-qualifier competition math | Live | medium | 25 |
| Code | ||||
| HumanEval (sat.) | Function synthesis from docstrings | Open | high | 21 |
| MBPP (sat.) | Entry-level Python programming | Open | high | 21 |
| SWE-bench Verified | Resolve real GitHub issues with tests | Open | medium | 24 |
| Aider Polyglot | Multi-language edit tasks in a real editor | Open | medium | 24 |
| LiveCodeBench | Time-windowed competitive programming | Live | low | 24 |
| Agentic | ||||
| Terminal-Bench | Terminal / sysadmin agent tasks | Open | low | 25 |
| WebArena | Web navigation in self-hosted sites | Open | low | 23 |
| tau-bench | Tool-use dialog in retail / airline | Open | low | 24 |
| GAIA | Real-world assistant tasks w/ tools | Held-out | low | 23 |
| BFCL | Function / tool-calling accuracy | Live | low | 24 |
| Chat & preference | ||||
| Arena-Hard | Hard prompts, LLM-judged vs baseline | Open | medium | 24 |
| MT-Bench (sat.) | Multi-turn instruction following, LLM-judge | Open | high | 23 |
| AlpacaEval 2.0 (sat.) | Length-controlled win rate, LLM-judge | Open | high | 23 |
| LMArena | Human pairwise preference (Elo) | Live | low | 23 |
| Long context | ||||
| RULER | Synthetic long-context retrieval & tracing | Open | low | 24 |
| LongBench v2 | Realistic long-context understanding | Open | medium | 24 |
| Visual reasoning | ||||
| MMMU | College-level multimodal understanding | Open | high | 23 |
| MMMU-Pro | Robust multimodal reasoning (vision-only options) | Open | medium | 24 |
| VLMsAreBiased | Visual evidence vs memorized priors | Open | low | 25 |
| MMStar | Vision-indispensable multimodal QA | Open | low | 24 |
| MathVista | Visual math reasoning | Held-out | medium | 23 |
| Document | ||||
| CharXiv Reasoning | Scientific chart understanding | Open | low | 24 |
| OmniDocBench | Diverse PDF parsing (OCR, layout, tables) | Open | low | 24 |
| ChartQA (sat.) | Question answering over charts | Open | high | 22 |
| DocVQA (sat.) | QA over document images | Held-out | medium | 21 |
| Screen / GUI | ||||
| ScreenSpot-Pro | GUI grounding in professional software | Open | low | 24 |
| Spatial | ||||
| ERQA | Grounding objects & spatial concepts physically | Open | low | 25 |
| CV-Bench | Fundamental 2D/3D spatial reasoning | Open | medium | 24 |
| Video | ||||
| Video-MME | Temporal reasoning, long-context video | Open | medium | 24 |
| Video-MMMU | Knowledge acquisition from educational video | Open | low | 24 |
| Perception Test | Perception & reasoning in real-world video | Held-out | low | 23 |
| EgoSchema | Long-form egocentric video QA | Held-out | low | 23 |
| Biomedical | ||||
| MedXpertQA-MM | Expert clinical reasoning (multimodal) | Open | low | 25 |
| VQA-RAD (sat.) | QA over radiology images | Open | high | 18 |
An open test set stops measuring the moment it’s famous.
Once a benchmark is public and widely cited, its questions end up in the next pretraining crawl. Scores climb whether or not the underlying capability moved, so the number measures memorization as much as skill. That’s why 26% of the canonical suite (12 of 47 benchmarks) carries a high contamination risk, and why the held-out and live tiers are the ones worth betting a decision on.
The fix isn’t a new leaderboard — it’s a hold-out set the model has never seen, with a reward you can verify in code. That is exactly what we build.
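As a rough illustration of what "a reward you can verify in code" can mean, the sketch below scores a model on a small hold-out set with a binary exact-match check. It is a minimal sketch under stated assumptions, not a description of any particular product: `HoldOutItem`, `normalize`, `reward`, `evaluate`, and `toy_model` are all hypothetical names, and exact match is only one of many possible checkable rewards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HoldOutItem:
    prompt: str
    expected: str  # reference answer, kept private to the grader


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't change the score."""
    return " ".join(text.lower().split())


def reward(answer: str, item: HoldOutItem) -> float:
    """Binary, code-checkable reward: 1.0 on exact match after normalization, else 0.0."""
    return 1.0 if normalize(answer) == normalize(item.expected) else 0.0


def evaluate(model: Callable[[str], str], items: list[HoldOutItem]) -> float:
    """Mean reward of `model` over the hold-out items."""
    if not items:
        raise ValueError("hold-out set is empty")
    return sum(reward(model(it.prompt), it) for it in items) / len(items)


# Toy stand-in for a model (an assumption); a real one would call an inference API.
def toy_model(prompt: str) -> str:
    return {"2 + 2 = ?": " 4 ", "Capital of France?": "paris"}.get(prompt, "")


holdout = [
    HoldOutItem("2 + 2 = ?", "4"),
    HoldOutItem("Capital of France?", "Paris"),
    HoldOutItem("Largest planet?", "Jupiter"),
]
score = evaluate(toy_model, holdout)  # two of the three items match
```

Because the grader is deterministic code rather than a judge model, anyone holding the items can reproduce the score; the contamination resistance comes from keeping `holdout` out of public crawls.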
| Access | Benchmarks | Share |
|---|---|---|
| Open | 33 | 70% |
| Gated | 1 | 2% |
| Held-out | 7 | 15% |
| Private | 1 | 2% |
| Live | 5 | 11% |
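The percentages in this summary can be reproduced from the tier counts. The short Python sketch below does the arithmetic; the variable names are ours, and the figure of 12 high-risk benchmarks is a count of the rows marked "high" in the benchmark table.

```python
# Tier counts as stated in the summary table.
counts = {"Open": 33, "Gated": 1, "Held-out": 7, "Private": 1, "Live": 5}
total = sum(counts.values())  # 47 benchmarks in all

# Share of each tier, rounded to the nearest whole percent.
shares = {tier: round(100 * n / total) for tier, n in counts.items()}

# Tiers that keep the test set out of reach: held-out, private, live.
guarded = counts["Held-out"] + counts["Private"] + counts["Live"]

# Rows marked "high" in the Contam. column of the benchmark table (counted by hand).
high_risk = 12
high_risk_share = round(100 * high_risk / total)
```

Here `shares` comes out at 70, 2, 15, 2, and 11 percent, `guarded` is 13, and `high_risk_share` is 26, matching the figures quoted in the text.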
Need a number you can actually trust?
If the public benchmarks for your capability are saturated or contaminated, the leaderboard can’t tell your models apart. We build private, contamination-resistant, verifiable-reward evals on a hold-out set — designed to discriminate where the open ones no longer do.