Codesota · RL Environments
Which environments still separate models
Updated: June 2, 2026
§ 00 · Premise

Anyone can list a thousand RL environments. Which ones still separate models?

An environment is only worth training on while it pulls the best and worst models far apart. Once every frontier model clears it — or none can — it stops teaching anything.

So we score every RL / agent environment with public model results by discriminative power — the spread it produces across models, penalised as the leader hits the ceiling. 12 scored; 9 still discriminate, 3 are saturated or floored.

§ 01 · The headline

The sharpest environment right now is Terminal-Bench 2.0: on terminal / SWE-sysadmin agents it puts 87 percentage points between the best and worst model, although its leader has just reached the 90% saturation line. The unsaturated environments still doing real work are OSWorld-Verified, DeepSWE, and FrontierSWE.

Counting environments tells you nothing about whether training on them moves a model. This page shows which ones do.

§ 02 · Ranking

Every scored environment, by discriminative power.

Spread = how far apart the best and worst model land. Saturated environments (leader ≥ 90%) and floored ones (no model clears 15%) are flagged — high or low, but no separation left.

Copper rows still discriminate. Faded rows have hit a ceiling or a floor.

| # | Environment | Metric | Domain | Models | Top | Spread | Status | Discriminative |
|---|---|---|---|---|---|---|---|---|
| 01 | Terminal-Bench 2.0 | accuracy | terminal / SWE-sysadmin agents | 7 | 90% | 87% | saturated | 0.86 |
| 02 | OSWorld-Verified | success-rate | desktop computer-use | 16 | 84% | 81% | alive | 0.81 |
| 03 | DeepSWE | pass-rate | long-horizon agentic coding | 12 | 70% | 65% | alive | 0.65 |
| 04 | FrontierSWE | dominance | frontier software engineering | 11 | 83% | 59% | alive | 0.59 |
| 05 | GBA Eval | overall | long-horizon SWE (build a GBA emulator) | 10 | 53% | 53% | alive | 0.53 |
| 06 | SkillsBench | pass-rate (with skills) | agent skills (multi-domain) | 16 | 55% | 43% | alive | 0.43 |
| 07 | Diplomacy Arena | performance | strategy / negotiation (games) | 15 | 60% | 29% | alive | 0.29 |
| 08 | WebBench | success-rate | browser agents (live web) | 5 | 66% | 22% | alive | 0.22 |
| 09 | Cua-Bench | success-rate | GUI computer-use (desktop + mobile) | 6 | 68% | 14% | alive | 0.14 |
| 10 | CompileBench | pass@1 | compile/cross-compile real OSS | 6 | 100% | 27% | saturated | 0.14 |
| 11 | SWE-Bench-Pro | resolve-rate | software engineering (audited) | 4 | 67% | 11% | alive | 0.11 |
| 12 | COBOLBench | pass@4 | legacy enterprise (COBOL); floor effect | 5 | 11% | 2% | floored | 0.02 |
Discriminative power = spread, penalised as the top model passes 90% of the ceiling. Environments with ≥ 3 public model scores, normalized 0..1. Floored = even the best model stays under 15% (the environment discriminates only by degree of failure).
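The scoring rule in the caption can be sketched in a few lines. The page does not publish the exact penalty function, so the linear ramp and the `MAX_PENALTY` constant below are assumptions chosen for illustration, and the example scores are made up rather than taken from any leaderboard.

```python
from typing import Sequence

CEILING_START = 0.90  # from the page: the penalty begins once the leader passes 90%
MAX_PENALTY = 0.5     # assumption: share of the spread discounted when the leader is at 100%
MIN_MODELS = 3        # from the page: environments need >= 3 public model scores


def spread(scores: Sequence[float]) -> float:
    """Best-minus-worst score across models, with scores on a 0..1 scale."""
    if len(scores) < MIN_MODELS:
        raise ValueError("need at least 3 public model scores")
    return max(scores) - min(scores)


def discriminative_power(scores: Sequence[float]) -> float:
    """Spread, discounted linearly as the top score moves from 90% to 100% (assumed ramp)."""
    top = max(scores)
    overshoot = max(0.0, top - CEILING_START) / (1.0 - CEILING_START)
    penalty = MAX_PENALTY * min(1.0, overshoot)
    return spread(scores) * (1.0 - penalty)


# Illustrative scores only; not real leaderboard data.
unsaturated = [0.84, 0.40, 0.03]
saturated = [1.00, 0.90, 0.73]
print(discriminative_power(unsaturated))
print(discriminative_power(saturated))
```

Under these assumptions an environment with a leader below 90% keeps its full spread, while one with a perfect leader keeps half of it. The published values may come from unrounded inputs or a different ramp, so treat this as a reading of the rule rather than a reproduction of it.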
§ 03 · Redundant vs unique

Which environments give you the same signal twice?

The public leaderboards share models. So we can ask whether two environments rank those models the same way. High correlation means running both is redundant; low or negative means each catches what the other misses.

Most redundant: GBA Eval vs FrontierSWE (r = 0.94, 6 shared models); pick one.

Most complementary: OSWorld-Verified vs SkillsBench (r = 0.27) — they measure different things.

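| Pearson r | 01 | 02 | 03 | 04 | 05 | 06 |
|---|---|---|---|---|---|---|
| 01 OSWorld-Verified |  | 0.65 | · | 0.84 | 0.27 | · |
| 02 DeepSWE | 0.65 |  | 0.90 | 0.83 | 0.50 | 0.75 |
| 03 FrontierSWE | · | 0.90 |  | 0.94 | 0.86 | 0.71 |
| 04 GBA Eval | 0.84 | 0.83 | 0.94 |  | 0.36 | 0.68 |
| 05 SkillsBench | 0.27 | 0.50 | 0.86 | 0.36 |  | 0.59 |
| 06 COBOLBench | · | 0.75 | 0.71 | 0.68 | 0.59 |  |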
Pearson correlation of per-model scores over the models two environments share (≥ 4, scaffold normalized to base model, best score kept). Copper = ranks models alike (redundant); pale = independent or inverse signal; · = too few shared models to compare. This is the cross-environment analysis no public RL-env index publishes.
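A minimal sketch of the pairwise comparison described in the caption, assuming each environment's results have already been keyed by base model. The model names and scores are hypothetical, and the scaffold-to-base-model mapping is left out because the page does not describe it.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

MIN_SHARED = 4  # from the caption: a pair needs at least 4 shared models


def best_per_model(results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep the best score for each model when it appears more than once."""
    best: Dict[str, float] = {}
    for model, score in results:
        best[model] = max(score, best.get(model, score))
    return best


def pearson_shared(a: Dict[str, float], b: Dict[str, float]) -> Optional[float]:
    """Pearson r over the models two environments share, or None if not comparable."""
    shared = sorted(set(a) & set(b))
    if len(shared) < MIN_SHARED:
        return None
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores for three environments.
env_x = best_per_model([("m1", 0.2), ("m2", 0.4), ("m3", 0.6), ("m4", 0.7), ("m4", 0.8)])
env_y = {"m1": 0.1, "m2": 0.2, "m3": 0.3, "m4": 0.4, "m5": 0.9}
env_z = {"m1": 0.5, "m2": 0.5}

print(pearson_shared(env_x, env_y))  # the two rank models alike
print(pearson_shared(env_x, env_z))  # too few shared models
```

Returning `None` for pairs with fewer than four shared models corresponds to the `·` cells in the matrix.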
§ 04 · By capability gap

Your model lags on a capability. Which environment separates it?

Grouped by what each environment trains. For each capability we surface the environment with the most discriminative power left — the one most likely to move a model that’s already strong.

Pick the gap, then jump to the sharpest environment for it.

| Capability | Environments | Still discriminating | Sharpest | DP | Spread |
|---|---|---|---|---|---|
| Code & software engineering | 7 | 4 | Terminal-Bench 2.0 | 0.86 | 87% |
| Computer use (desktop / GUI) | 2 | 2 | OSWorld-Verified | 0.81 | 81% |
| Browser & web agents | 1 | 1 | WebBench | 0.22 | 22% |
| Games, strategy & decisions | 1 | 1 | Diplomacy Arena | 0.29 | 29% |
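The "sharpest per capability" lookup amounts to a grouped maximum over the ranking table. In the sketch below the discriminative-power values are copied from the ranking, but the capability assigned to each environment is inferred from its domain label, and SkillsBench (multi-domain) is left out; both choices are assumptions.

```python
from typing import Dict, List, Tuple

# (environment, capability, discriminative power). DP values come from the ranking table;
# the capability labels are inferred from each row's domain (an assumption).
ROWS: List[Tuple[str, str, float]] = [
    ("Terminal-Bench 2.0", "code", 0.86),
    ("OSWorld-Verified", "computer-use", 0.81),
    ("DeepSWE", "code", 0.65),
    ("FrontierSWE", "code", 0.59),
    ("GBA Eval", "code", 0.53),
    ("Diplomacy Arena", "games", 0.29),
    ("WebBench", "browser", 0.22),
    ("Cua-Bench", "computer-use", 0.14),
    ("CompileBench", "code", 0.14),
    ("SWE-Bench-Pro", "code", 0.11),
    ("COBOLBench", "code", 0.02),
]


def sharpest_by_capability(rows: List[Tuple[str, str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each capability, keep the environment with the highest discriminative power."""
    best: Dict[str, Tuple[str, float]] = {}
    for env, capability, dp in rows:
        if capability not in best or dp > best[capability][1]:
            best[capability] = (env, dp)
    return best


sharpest = sharpest_by_capability(ROWS)
for capability, (env, dp) in sorted(sharpest.items()):
    print(f"{capability}: {env} (DP {dp:.2f})")
```

Note that the lookup ranks by discriminative power alone, which is why a flagged environment such as Terminal-Bench 2.0 can still come out as the sharpest in its group.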
§ 05 · We build, not just index

We also ship verifiable-reward environments.

Indexing the ecosystem is the map; the environment is the product. Our audio-verify is a working RLVR environment whose reward is objectively verifiable, with no learned judge: synthesized speech → whisper.cpp ASR → structured-field reward.

It demonstrably discriminates: at a fast 320 wpm speaking rate, structured-field recovery collapses from 1.00 to 0.42 while plain WER only moves from 0.025 to 0.088. That is exactly the signal a frontier lab needs and a generic transcript score misses.

audio-verify · discrimination
| Reward | Clean | Fast 320wpm | Δ |
|---|---|---|---|
| Structured entity recovery | 1.00 | 0.42 | +0.58 |
| Structured WER | 0.03 | 0.42 |  |
| Plain WER | 0.025 | 0.088 |  |
Real audio → whisper.cpp ASR → field reward. The structured signal separates; plain WER doesn’t.
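To see why a field-level reward can separate where plain WER does not, here is a minimal sketch of the two scores computed on an ASR transcript. It is not the audio-verify implementation: the function names, the exact-match rule for fields, and the example sentence are all hypothetical, and the transcript is passed in as a string rather than produced by whisper.cpp.

```python
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / len(ref)


def field_recovery(expected: Dict[str, str], transcript: str) -> float:
    """Fraction of expected field values that appear verbatim in the transcript (assumed rule)."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    text = " " + " ".join(transcript.lower().split()) + " "
    hits = 0
    for value in expected.values():
        needle = " " + " ".join(value.lower().split()) + " "
        if needle in text:
            hits += 1
    return hits / len(expected)


# Hypothetical example: one misheard digit pair.
reference = "ship order 4471 to 12 baker street by friday"
transcript = "ship order 4417 to 12 baker street by friday"
fields = {"order_id": "4471", "address": "12 baker street"}

print(word_error_rate(reference, transcript))  # one substitution in nine words
print(field_recovery(fields, transcript))      # one of two fields survives
```

A single wrong token costs only one word out of nine on WER but loses half of the structured fields, which is the kind of gap the table above reports.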
§ 06 · Also indexed

20 more, not yet scorable.

Public environments whose scores aren’t machine-retrievable yet, ones with no public scores, plus datasets, tooling, and infra that aren’t capability leaderboards. Environments move into the ranking above the moment they publish scores.

| Environment | Domain | Status |
|---|---|---|
| KellyBench | long-horizon sequential decision (bankroll) | scores pending |
| SheetBench-50 | spreadsheets | scores pending |
| BrowserBench | browser infrastructure stealth | infra · not capability |
| UI-CUBE | enterprise computer-use | scores pending |
| ClawsBench | enterprise workplace (Gmail/Slack/Drive) | scores pending |
| Westworld | web-app simulators | scores pending |
| LegacySWE | legacy enterprise code | scores pending |
| VideoBench | video / animation generation | scores pending |
| FigmaBench | design-to-code | scores pending |
| BinaryAudit | binary reverse-engineering / security | scores pending |
| Unix-CTF | unix / shell competence | scores pending |
| LOL Arena | humor preference alignment | private |
| OpenThoughts | open reasoning dataset | dataset |
| Evalchemy | eval harness | tooling |
| GEPA | agent-optimization algorithm | tooling |
| Shipd | coding-bounty platform | tooling |
| Dojo | computer-use env hub | tooling |
| SimLab | enterprise simulation platform | tooling |
| OpenReward | RL-env hub (330+ envs) | tooling |
| Bad Cards | humor data-collection game | dataset |
§ 01 · Spread is the signal

An environment's worth is how far apart it puts models. We take the best-minus-worst score across every model with a public result. Wide spread = the environment is still sorting models, so training on it can still move yours.

§ 02 · Ceilings and floors both kill it

Once the top model clears 90%, the environment stops separating frontier models — they all pass. When no model clears 15%, it only ranks degrees of failure. We flag both; discriminative power falls toward the ceiling regardless of fame.
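The two thresholds translate directly into a status flag. The sketch below applies them to the top scores listed in the ranking table; the function name and the use of whole percentages are illustrative choices, not the page's implementation.

```python
from collections import Counter

SATURATED_AT = 90   # from the page: leader at or above 90%
FLOORED_BELOW = 15  # from the page: even the best model stays under 15%


def status(top_pct: int) -> str:
    """Flag an environment from its top model score, given in whole percent."""
    if top_pct >= SATURATED_AT:
        return "saturated"
    if top_pct < FLOORED_BELOW:
        return "floored"
    return "alive"


# Top scores (percent) as listed in the ranking table.
TOP_PCT = {
    "Terminal-Bench 2.0": 90, "OSWorld-Verified": 84, "DeepSWE": 70,
    "FrontierSWE": 83, "GBA Eval": 53, "SkillsBench": 55,
    "Diplomacy Arena": 60, "WebBench": 66, "Cua-Bench": 68,
    "CompileBench": 100, "SWE-Bench-Pro": 67, "COBOLBench": 11,
}

counts = Counter(status(top) for top in TOP_PCT.values())
print(dict(counts))
```

Applied to the twelve scored environments, this gives nine alive, two saturated, and one floored, matching the "9 still discriminate, 3 are saturated or floored" summary in the premise.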

§ 03 · The same lens, everywhere

This is the analysis CodeSOTA runs on every metric it publishes, including the finding that WER predicts human TTS preference at only ρ = 0.13. An environment nobody can fail, or that everybody fails, is not worth a training run.

§ 07 · Work with us

Need an environment that still separates models?

If the public environments for your capability are saturated, you can’t tell your models apart and you can’t train past them. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.