Codesota · RL Environments

Which environments still separate models

Updated: June 7, 2026

§ 00 · Premise

Anyone can list a thousand RL environments. Which ones still separate models?

An environment is only worth training on while it pulls the best and worst models far apart. Once every frontier model clears it — or none can — it stops teaching anything.

So we score every RL / agent environment with public model results by discriminative power — the spread it produces across models, penalised as the leader hits the ceiling. 12 scored; 9 still discriminate, 3 are saturated or floored.


§ 01 · The headline

The sharpest environment right now is Terminal-Bench 2.0: on terminal / SWE-sysadmin agents it puts 87 percentage points between the best and worst model, although its leader has just reached the 90% saturation line. The environments still doing real work below that line are OSWorld-Verified, DeepSWE, and FrontierSWE.

Counting environments tells you nothing about whether training on them moves a model. This page shows which ones do.

§ 02 · Ranking

Every scored environment, by discriminative power.

Spread = how far apart the best and worst model land. Saturated environments (leader ≥ 90%) and floored ones (no model clears 15%) are flagged — high or low, but no separation left.

Copper rows still discriminate. Faded rows have hit a ceiling or a floor.

#	Environment	Metric · domain	Models	Top	Spread	Status	Discriminative power
01	Terminal-Bench 2.0	accuracy · terminal / SWE-sysadmin agents	7	90%	87%	saturated	0.86
02	OSWorld-Verified	success-rate · desktop computer-use	16	84%	81%	alive	0.81
03	DeepSWE	pass-rate · long-horizon agentic coding	12	70%	65%	alive	0.65
04	FrontierSWE	dominance · frontier software engineering	11	83%	59%	alive	0.59
05	GBA Eval	overall · long-horizon SWE (build a GBA emulator)	10	53%	53%	alive	0.53
06	SkillsBench	pass-rate (with skills) · agent skills (multi-domain)	16	55%	43%	alive	0.43
07	Diplomacy Arena	performance · strategy / negotiation (games)	15	60%	29%	alive	0.29
08	WebBench	success-rate · browser agents (live web)	5	66%	22%	alive	0.22
09	Cua-Bench	success-rate · GUI computer-use (desktop + mobile)	6	68%	14%	alive	0.14
10	CompileBench	pass@1 · compile/cross-compile real OSS	6	100%	27%	saturated	0.14
11	SWE-Bench-Pro	resolve-rate · software engineering (audited)	4	67%	11%	alive	0.11
12	COBOLBench	pass@4 · legacy enterprise (COBOL), floor effect	5	11%	2%	floored	0.02

Discriminative power = spread, penalised as the top model passes 90% of the ceiling. Environments with ≥ 3 public model scores, normalized 0..1. Floored = even the best model stays under 15% (the environment discriminates only by degree of failure).
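The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the page's actual formula: the 90% and 15% thresholds and the 3-score minimum come from the text, but the shape of the ceiling penalty is not published, so the linear `ceiling_penalty` below is an assumption, as are the names `EnvScore` and `score_environment` and the example scores.

```python
from dataclasses import dataclass

SATURATION = 0.90  # leader at or above this is flagged saturated (from the text)
FLOOR = 0.15       # best model below this is flagged floored (from the text)


@dataclass
class EnvScore:
    name: str
    top: float
    spread: float
    status: str
    power: float


def score_environment(name, scores, ceiling_penalty=0.5):
    """Score one environment from per-model results in [0, 1].

    ceiling_penalty is an ASSUMPTION: the fraction of spread removed when
    the leader reaches 100%, applied linearly between 90% and 100%.
    """
    if len(scores) < 3:
        raise ValueError("need at least 3 public model scores")
    top, bottom = max(scores), min(scores)
    spread = top - bottom
    if top >= SATURATION:
        status = "saturated"
    elif top < FLOOR:
        status = "floored"
    else:
        status = "alive"
    # Position of the leader between the 90% line and 100%, clamped to [0, 1].
    overshoot = min(1.0, max(0.0, top - SATURATION) / (1.0 - SATURATION))
    power = spread * (1.0 - ceiling_penalty * overshoot)
    return EnvScore(name, top, spread, status, power)


# Hypothetical per-model scores, for illustration only.
for env, scores in {
    "demo-alive": [0.70, 0.40, 0.05],
    "demo-saturated": [1.00, 0.90, 0.73],
    "demo-floored": [0.11, 0.10, 0.09],
}.items():
    print(score_environment(env, scores))
```

The default penalty of 0.5 was picked so that a leader at 100% with a 27-point spread lands near the 0.14 shown for CompileBench. That agreement is by construction and does not confirm the real penalty curve.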

§ 03 · Redundant vs unique

Which environments give you the same signal twice?

The public leaderboards share models. So we can ask whether two environments rank those models the same way. High correlation means running both is redundant; low or negative means each catches what the other misses.

Most redundant: GBA Eval ≈ FrontierSWE (r = 0.94, 6 shared models) — pick one.

Most complementary: OSWorld-Verified vs SkillsBench (r = 0.27) — they measure different things.

#	Environment (Pearson r)	01	02	03	04	05	06
01	OSWorld-Verified	—	0.65	·	0.84	0.27	·
02	DeepSWE	0.65	—	0.90	0.83	0.50	0.75
03	FrontierSWE	·	0.90	—	0.94	0.86	0.71
04	GBA Eval	0.84	0.83	0.94	—	0.36	0.68
05	SkillsBench	0.27	0.50	0.86	0.36	—	0.59
06	COBOLBench	·	0.75	0.71	0.68	0.59	—

Pearson correlation of per-model scores over the models two environments share (at least 4 shared models; scaffolded variants are mapped to their base model and the best score is kept). Copper = ranks models alike (redundant); pale = independent or inverse signal; · = too few shared models to compare. This is the cross-environment analysis no public RL-env index publishes.
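A minimal sketch of the pairwise comparison, assuming each environment's leaderboard has already been reduced to a `{base_model: best_score}` dict. The function name `pearson_shared` and the model names in the example are hypothetical; the 4-model minimum is the one stated above.

```python
import math

MIN_SHARED = 4  # minimum number of shared models, as stated in the text


def pearson_shared(env_a, env_b, min_shared=MIN_SHARED):
    """Pearson r over the models present in both {model: score} dicts.

    Returns None when too few models overlap (shown as "·" in the table)
    or when either side has zero variance, where r is undefined.
    """
    shared = sorted(set(env_a) & set(env_b))
    if len(shared) < min_shared:
        return None
    xs = [env_a[m] for m in shared]
    ys = [env_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical leaderboards; model names and scores are made up.
env_one = {"model-a": 0.10, "model-b": 0.20, "model-c": 0.30, "model-d": 0.40}
env_two = {"model-a": 0.20, "model-b": 0.40, "model-c": 0.60, "model-d": 0.80,
           "model-e": 0.90}
print(pearson_shared(env_one, env_two))
```

With only a handful of shared models, a single outlier can swing r substantially, so values computed this way are better read as a hint than as a precise estimate.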

§ 04 · By capability gap

Your model lags on a capability. Which environment separates it?

Grouped by what each environment trains. For each capability we surface the environment with the most discriminative power left — the one most likely to move a model that’s already strong.

Pick the gap, then jump to the sharpest environment for it.

Long-horizon agents

2 environments · 2 still discriminate

1 environment · 0 still discriminate

Sharpest

Terminal-Bench 2.0

DP 0.86 · spread 87%

Computer use (desktop / GUI)

2 environments · 2 still discriminate

1 environment · 1 still discriminate

Sharpest

WebBench

DP 0.22 · spread 22%

Games, strategy & decisions

1 environment · 1 still discriminate

Sharpest

Diplomacy Arena

DP 0.29 · spread 29%

Generalist & multi-domain

1 environment · 1 still discriminate

Sharpest

SkillsBench

DP 0.43 · spread 43%

Coding & software engineering

4 environments · 2 still discriminate

Sharpest

FrontierSWE

DP 0.59 · spread 59%

§ 05 · Market structure

The supply chain has three layers. None of them measures.

Three layers feed a frontier lab’s RL run — human raters, environment suppliers, and the infrastructure that runs them at scale. Each is racing to ship more: more labeled data, more environments across coding, computer use, long-horizon, finance, model behavior.

What none of them publishes: which of those environments still pulls models apart. That measurement gap is the rest of this page.

Layer 1

Competes on labor throughput

Human data & labeling

Raters, demonstrations, preference labels. Sold by the seat and the hour.

Layer 2

Competes on environment count

Environment supply

Task environments with rewards, fanned out across capabilities. The bulk of the market — and where listings collide.

Layer 3

Competes on integrations

RL infrastructure & hubs

Orchestration, training loops, environment hubs that aggregate everyone else's work.

Our wedge

Layer 0

Measurement & verifiable reward

We don’t add to the pile. We score the pile by discriminative power (which environments still separate models) and build the ones whose reward is verified in code, with no Layer 1 rater in the loop.

§ 06 · We build, not just index

We also ship verifiable-reward environments.

Indexing the ecosystem is the map. The product is the environment. audio-verify is a working RLVR environment where the reward is objectively verifiable — no learned judge: synth speech → whisper.cpp ASR → structured-field reward.

It demonstrably discriminates: at a fast 320 wpm speaking rate, structured-field recovery collapses from 1.00 to 0.42 while plain WER barely moves — exactly the signal a frontier lab needs and a generic transcript score misses.


audio-verify · discrimination

Reward	Clean	Fast (320 wpm)	Δ (clean − fast)
Structured entity recovery	1.00	0.42	+0.58
Structured WER	0.03	0.42	—
Plain WER	0.025	0.088	—

Real audio → whisper.cpp ASR → field reward. The structured signal separates; plain WER doesn’t.
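To make the difference between the two signals concrete, here is a minimal sketch of a code-verified field reward next to plain WER. It assumes the ASR transcript is already available as a string (whisper.cpp is not invoked), and the functions `normalize`, `field_reward`, and `wer`, the field names, and the example sentences are all hypothetical rather than the audio-verify implementation.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def field_reward(reference_fields, transcript):
    """Fraction of reference field values found as whole tokens in the transcript."""
    hyp = f" {normalize(transcript)} "
    hits = sum(
        1 for value in reference_fields.values()
        if f" {normalize(value)} " in hyp
    )
    return hits / len(reference_fields)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterance and transcripts, for illustration only.
reference = "please ship order 4417 to gate 12 on friday"
fields = {"order_id": "4417", "gate": "12"}
fast_transcript = "please ship order 4470 to gate 12 on friday"

print(field_reward(fields, reference))        # every field recovered
print(field_reward(fields, fast_transcript))  # order_id is lost
print(wer(reference, fast_transcript))        # one substituted word out of nine
```

In this toy case a single wrong word costs half the field reward but only one ninth of a WER point, which is the kind of divergence the table above reports.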

§ 07 · Also indexed

20 more, not yet scorable.

Public environments whose scores aren’t machine-retrievable yet, ones with no public scores, plus datasets, tooling, and infra that aren’t capability leaderboards. Any of them joins the ranking above the moment it publishes scores.

Environment	Domain	Status
KellyBench	long-horizon sequential decision (bankroll)	scores pending
SheetBench-50	spreadsheets	scores pending
BrowserBench	browser infrastructure stealth	infra · not capability
UI-CUBE	enterprise computer-use	scores pending
ClawsBench	enterprise workplace (Gmail/Slack/Drive)	scores pending
Westworld	web-app simulators	scores pending
LegacySWE	legacy enterprise code	scores pending
VideoBench	video / animation generation	scores pending
FigmaBench	design-to-code	scores pending
BinaryAudit	binary reverse-engineering / security	scores pending
Unix-CTF	unix / shell competence	scores pending
LOL Arena	humor preference alignment	private
OpenThoughts	open reasoning dataset	dataset
Evalchemy	eval harness	tooling
GEPA	agent-optimization algorithm	tooling
Shipd	coding-bounty platform	tooling
Dojo	computer-use env hub	tooling
SimLab	enterprise simulation platform	tooling
OpenReward	RL-env hub (330+ envs)	tooling
Bad Cards	humor data-collection game	dataset

§ 01

Spread is the signal

An environment's worth is how far apart it puts models. We take the best-minus-worst score across every model with a public result. Wide spread = the environment is still sorting models, so training on it can still move yours.

§ 02

Ceilings and floors both kill it

Once the top model clears 90%, the environment stops separating frontier models — they all pass. When no model clears 15%, it only ranks degrees of failure. We flag both; discriminative power falls toward the ceiling regardless of fame.

§ 03

The same lens, everywhere

This is the analysis CodeSOTA runs on every metric it publishes, including the finding that WER predicts human TTS preference at only ρ=0.13. An environment nobody can fail, or that everybody fails, is not worth a training run.

§ 08 · Work with us

Need an environment that still separates models?

If the public environments for your capability are saturated, you can’t tell your models apart and you can’t train past them. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.
