Anyone can list a thousand RL environments. Which ones still separate models?
An environment is only worth training on while it pulls the best and worst models far apart. Once every frontier model clears it — or none can — it stops teaching anything.
So we score every RL / agent environment with public model results by discriminative power — the spread it produces across models, penalised as the leader hits the ceiling. 12 scored; 9 still discriminate, 3 are saturated or floored.
The widest spread right now is on Terminal-Bench 2.0: models land 87 percentage points apart from top to bottom on terminal / SWE-sysadmin agents, although its leader has just reached the 90% saturation line. The environments still doing real work are OSWorld-Verified, DeepSWE, and FrontierSWE.
Counting environments tells you nothing about whether training on them moves a model. This page shows which ones do.
Every scored environment, by discriminative power.
Spread = how far apart the best and worst model land. Saturated environments (leader ≥ 90%) and floored ones (no model clears 15%) are flagged — high or low, but no separation left.
Rows with status alive still discriminate. Rows marked saturated or floored have hit a ceiling or a floor.
| # | Environment | Metric | Domain | Models | Top | Spread | Status | Discriminative |
|---|---|---|---|---|---|---|---|---|
| 01 | Terminal-Bench 2.0 | accuracy | terminal / SWE-sysadmin agents | 7 | 90% | 87% | saturated | 0.86 |
| 02 | OSWorld-Verified | success-rate | desktop computer-use | 16 | 84% | 81% | alive | 0.81 |
| 03 | DeepSWE | pass-rate | long-horizon agentic coding | 12 | 70% | 65% | alive | 0.65 |
| 04 | FrontierSWE | dominance | frontier software engineering | 11 | 83% | 59% | alive | 0.59 |
| 05 | GBA Eval | overall | long-horizon SWE (build a GBA emulator) | 10 | 53% | 53% | alive | 0.53 |
| 06 | SkillsBench | pass-rate (with skills) | agent skills (multi-domain) | 16 | 55% | 43% | alive | 0.43 |
| 07 | Diplomacy Arena | performance | strategy / negotiation (games) | 15 | 60% | 29% | alive | 0.29 |
| 08 | WebBench | success-rate | browser agents (live web) | 5 | 66% | 22% | alive | 0.22 |
| 09 | Cua-Bench | success-rate | GUI computer-use (desktop + mobile) | 6 | 68% | 14% | alive | 0.14 |
| 10 | CompileBench | pass@1 | compile/cross-compile real OSS | 6 | 100% | 27% | saturated | 0.14 |
| 11 | SWE-Bench-Pro | resolve-rate | software engineering (audited) | 4 | 67% | 11% | alive | 0.11 |
| 12 | COBOLBench | pass@4 | legacy enterprise (COBOL), floor effect | 5 | 11% | 2% | floored | 0.02 |
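The Spread and Status columns can be sketched in a few lines. This is a minimal illustration, not the page's actual pipeline: the function name, the model names, and the per-model scores are hypothetical, and treating a leader at exactly 15% as not floored is an assumption. Only the 90% and 15% thresholds come from the text.

```python
from typing import Dict, Tuple

SATURATION_CEILING = 0.90  # leader at or above this is flagged "saturated" (from the text)
FLOOR = 0.15               # no model above this is flagged "floored" (from the text)


def spread_and_status(scores: Dict[str, float]) -> Tuple[float, str]:
    """Return (best-minus-worst spread, status) for one environment.

    `scores` maps a model name to its score in [0, 1].
    """
    if len(scores) < 2:
        raise ValueError("need at least two models to measure spread")
    top = max(scores.values())
    bottom = min(scores.values())
    spread = top - bottom
    if top >= SATURATION_CEILING:
        status = "saturated"
    elif top < FLOOR:  # assumption: a leader at exactly 15% counts as clearing the floor
        status = "floored"
    else:
        status = "alive"
    return spread, status


# Hypothetical per-model scores; only the top (84%) and the spread (81%)
# mirror a row of the table above.
scores = {"model-a": 0.84, "model-b": 0.41, "model-c": 0.03}
spread, status = spread_and_status(scores)
print(f"spread={spread:.2f} status={status}")  # spread=0.81 status=alive
```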
Which environments give you the same signal twice?
The public leaderboards share models. So we can ask whether two environments rank those models the same way. High correlation means running both is redundant; low or negative means each catches what the other misses.
Most redundant: GBA Eval ≈ FrontierSWE (r = 0.94, 6 shared models) — pick one.
Most complementary: OSWorld-Verified vs SkillsBench (r = 0.27) — they measure different things.
| Pearson r | 01 | 02 | 03 | 04 | 05 | 06 |
|---|---|---|---|---|---|---|
| 01 OSWorld-Verified | — | 0.65 | · | 0.84 | 0.27 | · |
| 02 DeepSWE | 0.65 | — | 0.90 | 0.83 | 0.50 | 0.75 |
| 03 FrontierSWE | · | 0.90 | — | 0.94 | 0.86 | 0.71 |
| 04 GBA Eval | 0.84 | 0.83 | 0.94 | — | 0.36 | 0.68 |
| 05 SkillsBench | 0.27 | 0.50 | 0.86 | 0.36 | — | 0.59 |
| 06 COBOLBench | · | 0.75 | 0.71 | 0.68 | 0.59 | — |
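Each cell in the matrix is a Pearson correlation computed over the models two leaderboards have in common. A minimal sketch of that computation follows. The function name and the `min_shared` cutoff are assumptions: the page does not say what the `·` cells mean, and returning `None` when too few models are shared is just one plausible convention.

```python
from math import sqrt
from typing import Dict, Optional


def pearson_on_shared(
    a: Dict[str, float],
    b: Dict[str, float],
    min_shared: int = 3,  # hypothetical cutoff, not stated in the text
) -> Optional[float]:
    """Pearson r between two leaderboards, restricted to models on both."""
    shared = sorted(set(a) & set(b))
    if len(shared) < min_shared:
        return None  # too few shared models to say anything
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    n = len(shared)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined when one side is constant
    return cov / sqrt(var_x * var_y)


# Hypothetical leaderboards: env_b is not on model "m4", so only m1..m3 count.
env_a = {"m1": 0.10, "m2": 0.20, "m3": 0.30, "m4": 0.90}
env_b = {"m1": 0.20, "m2": 0.40, "m3": 0.60}
print(pearson_on_shared(env_a, env_b))
```

With only a handful of shared models, a high r is weak evidence on its own, which is why the redundancy claim above reports the shared-model count next to the coefficient.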
Your model lags on a capability. Which environment separates it?
Grouped by what each environment trains. For each capability we surface the environment with the most discriminative power left — the one most likely to move a model that’s already strong.
Pick the gap, then jump to the sharpest environment for it.
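The per-capability pick reduces to a group-by followed by a maximum. The sketch below assumes a flat list of environments tagged with a capability label. The labels `"coding"` and `"computer-use"` and the `Env` type are hypothetical, since the page's actual grouping is not shown here; the discriminative scores are taken from the table above.

```python
from typing import Dict, List, NamedTuple


class Env(NamedTuple):
    name: str
    capability: str        # hypothetical grouping label
    discriminative: float  # score from the ranking table


def sharpest_per_capability(envs: List[Env]) -> Dict[str, Env]:
    """For each capability, keep the environment with the highest score."""
    best: Dict[str, Env] = {}
    for env in envs:
        current = best.get(env.capability)
        if current is None or env.discriminative > current.discriminative:
            best[env.capability] = env
    return best


ENVS = [
    Env("OSWorld-Verified", "computer-use", 0.81),
    Env("Cua-Bench", "computer-use", 0.14),
    Env("DeepSWE", "coding", 0.65),
    Env("CompileBench", "coding", 0.14),
    Env("SWE-Bench-Pro", "coding", 0.11),
]

for capability, env in sharpest_per_capability(ENVS).items():
    print(f"{capability}: {env.name} ({env.discriminative:.2f})")
```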
We also ship verifiable-reward environments.
Indexing the ecosystem is the map. The product is the environment. audio-verify is a working RLVR environment where the reward is objectively verifiable — no learned judge: synth speech → whisper.cpp ASR → structured-field reward.
It demonstrably discriminates: at a fast 320 wpm speaking rate, structured-field recovery collapses from 1.00 to 0.42 while plain WER only moves from 0.025 to 0.088. That is the signal a frontier lab needs and a generic transcript score misses.
| Metric | Clean | Fast (320 wpm) | Δ (fast − clean) |
|---|---|---|---|
| Structured entity recovery | 1.00 | 0.42 | −0.58 |
| Structured WER | 0.03 | 0.42 | — |
| Plain WER | 0.025 | 0.088 | — |
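To see why a field-level reward can collapse while WER stays low, here is a small self-contained sketch. It is not the audio-verify implementation: the function names, the exact-phrase matching rule, and the example sentence are all assumptions, and the transcript is a plain string standing in for ASR output (no whisper.cpp call is made).

```python
from typing import Dict, List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, split into words."""
    cleaned = "".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def field_recovery(expected: Dict[str, str], transcript: str) -> float:
    """Fraction of expected field values that appear verbatim in the transcript."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    haystack = " " + " ".join(normalize(transcript)) + " "
    hits = 0
    for value in expected.values():
        needle = " " + " ".join(normalize(value)) + " "
        if needle in haystack:
            hits += 1
    return hits / len(expected)


# Hypothetical utterance and a transcript with a single misheard word.
reference = "Please ship order four two seven to Doctor Alice Moreno on Friday."
transcript = "please ship order four to seven to doctor alice moreno on friday"
fields = {"order_id": "four two seven", "recipient": "Alice Moreno", "day": "Friday"}

print(f"WER            = {word_error_rate(reference, transcript):.3f}")  # 0.083
print(f"field recovery = {field_recovery(fields, transcript):.3f}")      # 0.667
```

One substituted word out of twelve costs about 8% WER, but it lands inside the order number, so a third of the structured fields are lost. The reward stays objectively checkable because it only compares strings against known field values.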
20 more, not yet scorable.
Public environments whose scores aren’t machine-retrievable yet, ones with no public scores, plus datasets, tooling, and infra that aren’t capability leaderboards. Each one joins the ranking above the moment it publishes scores.
| Environment | Domain | Status |
|---|---|---|
| KellyBench | long-horizon sequential decision (bankroll) | scores pending |
| SheetBench-50 | spreadsheets | scores pending |
| BrowserBench | browser infrastructure stealth | infra · not capability |
| UI-CUBE | enterprise computer-use | scores pending |
| ClawsBench | enterprise workplace (Gmail/Slack/Drive) | scores pending |
| Westworld | web-app simulators | scores pending |
| LegacySWE | legacy enterprise code | scores pending |
| VideoBench | video / animation generation | scores pending |
| FigmaBench | design-to-code | scores pending |
| BinaryAudit | binary reverse-engineering / security | scores pending |
| Unix-CTF | unix / shell competence | scores pending |
| LOL Arena | humor preference alignment | private |
| OpenThoughts | open reasoning dataset | dataset |
| Evalchemy | eval harness | tooling |
| GEPA | agent-optimization algorithm | tooling |
| Shipd | coding-bounty platform | tooling |
| Dojo | computer-use env hub | tooling |
| SimLab | enterprise simulation platform | tooling |
| OpenReward | RL-env hub (330+ envs) | tooling |
| Bad Cards | humor data-collection game | dataset |
An environment's worth is how far apart it puts models. We take the best-minus-worst score across every model with a public result. Wide spread = the environment is still sorting models, so training on it can still move yours.
Once the top model reaches 90%, the environment stops separating frontier models: they all pass. When no model clears 15%, it only ranks degrees of failure. We flag both, and discriminative power falls as the leader nears the ceiling, however famous the environment is.
This is the analysis CodeSOTA runs on every metric it publishes, including showing that WER predicts human TTS preference at only ρ = 0.13. An environment nobody can fail, or that everybody fails, is not worth a training run.
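The ceiling penalty can be pictured as a discount on raw spread once the leader passes the saturation line. The text does not state the penalty's functional form, so the sketch below uses an illustrative linear ramp. The function name, the ramp, and the `max_penalty` value are assumptions, and it is not meant to reproduce the published Discriminative column.

```python
def discriminative_power(
    top: float,
    spread: float,
    ceiling: float = 0.90,     # saturation line from the text
    max_penalty: float = 0.5,  # hypothetical: fraction of spread removed at a 100% leader
) -> float:
    """Spread, discounted linearly once the leader passes the ceiling.

    All inputs are fractions in [0, 1]. Below the ceiling the spread is
    returned unchanged.
    """
    if not (0.0 <= spread <= top <= 1.0):
        raise ValueError("expected 0 <= spread <= top <= 1")
    if top <= ceiling:
        return spread
    overshoot = (top - ceiling) / (1.0 - ceiling)  # 0 at the ceiling, 1 at a perfect score
    return spread * (1.0 - max_penalty * overshoot)


print(f"{discriminative_power(top=0.84, spread=0.81):.3f}")  # 0.810, no penalty
print(f"{discriminative_power(top=1.00, spread=0.27):.3f}")  # 0.135, spread halved
```

Any monotone discount would serve the same purpose. The design point is that a wide spread under a maxed-out leader mostly reflects weak models at the bottom, not separation among frontier models.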
Need an environment that still separates models?
If the public environments for your capability are saturated, you can’t tell your models apart and you can’t train past them. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.