§ Ranked #01 by discriminative power

Terminal-Bench 2.0.

An environment for terminal / SWE-sysadmin agents. Across 7 models with public results, the gap between the best and worst score is 87 percentage points, but the leader has cleared the ceiling, so the environment no longer separates the strongest models.

§ Public model scores

Who wins Terminal-Bench 2.0.

Best public result per model entry, normalized to 0..1 and shown here as a percentage. The spread between the top and bottom rows is what makes this environment worth, or not worth, a training run.

#    Model                    Accuracy
01   vix+Opus-4.7             90%
02   JJAgent                  87%
03   NexAU+GPT-5.5            85%
04   Codex+GPT-5.5            82%
05   Terminus2+GPT-5-Nano     8%
06   MiniSWE+GPT-OSS-20B      3%
07   Terminus2+GPT-OSS-20B    3%
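
The spread figure can be checked directly from the rows above. The following is a minimal sketch, not the site's own method: the helper names `spread` and `top_k_spread` are illustrative, and `SATURATION_THRESHOLD` is a hypothetical cutoff chosen for this example, since the page does not say where it places the ceiling.

```python
# Scores copied from the table above, expressed as fractions of 1.
scores = {
    "vix+Opus-4.7": 0.90,
    "JJAgent": 0.87,
    "NexAU+GPT-5.5": 0.85,
    "Codex+GPT-5.5": 0.82,
    "Terminus2+GPT-5-Nano": 0.08,
    "MiniSWE+GPT-OSS-20B": 0.03,
    "Terminus2+GPT-OSS-20B": 0.03,
}


def spread(values):
    """Gap between the best and the worst score."""
    values = list(values)
    return max(values) - min(values)


def top_k_spread(values, k):
    """Gap between the best score and the k-th best score."""
    ranked = sorted(values, reverse=True)
    return ranked[0] - ranked[k - 1]


# Hypothetical cutoff (an assumption for this sketch, not taken from the page).
SATURATION_THRESHOLD = 0.90

overall = round(spread(scores.values()), 2)
top4 = round(top_k_spread(scores.values(), 4), 2)
saturated = max(scores.values()) >= SATURATION_THRESHOLD

print(f"overall spread: {overall:.2f}")
print(f"spread across the top 4: {top4:.2f}")
print(f"leader at or above assumed ceiling: {saturated}")
```

Under these assumptions the overall spread is 0.87, while the top four entries sit within 0.08 of each other, which is the sense in which the environment separates weak models from strong ones but not the strongest from one another. The Discriminative score listed in the ranking is not reproduced here because this page does not give its formula.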
§ Nearby in the ranking
#    Environment                                              Spread   Discriminative
01   Terminal-Bench 2.0 (terminal / SWE-sysadmin agents)      87%      0.86
02   OSWorld-Verified (desktop computer-use)                  81%      0.81
03   DeepSWE (long-horizon agentic coding)                    65%      0.65
§ Work with us

Need one that still separates models?

When the public environment for your capability saturates, you can’t tell your models apart and you can’t train past it. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.