Codesota · RL Environments · frontier software engineering
§ Ranked #04 by discriminative power

FrontierSWE.

An environment for frontier software engineering. Across 11 models with public results, the best and worst scores are 59 percentage points apart. It still sorts frontier models, so training on it can still move yours.

§ Public model scores

Who wins FrontierSWE.

Best public result per model entry, normalized to 0..1 and shown here as a percentage. The spread between the top and bottom rows is what makes this environment worth, or not worth, a training run.

#   Model            Dominance
01  Claude-Opus-4.8  83%
02  GPT-5.5          79%
03  Claude-Opus-4.7  69%
04  Claude-Opus-4.6  61%
05  GPT-5.4          59%
06  Gemini-3.1-Pro   44%
07  Composer-2.5     44%
08  DeepSeek-V4-Pro  32%
09  Kimi-K2.6        29%
10  Kimi-K2.5        27%
11  Qwen3.6-Plus     24%

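
The spread can be recomputed from the scores listed above. The sketch below assumes the spread is simply the top score minus the bottom score, which is consistent with the 59% quoted for this environment. The names `SCORES_PCT` and `spread_points` are illustrative and not part of any Codesota tooling.

```python
# Scores copied from the table above, in percent.
# Integers are used so the subtraction is exact.
SCORES_PCT = {
    "Claude-Opus-4.8": 83,
    "GPT-5.5": 79,
    "Claude-Opus-4.7": 69,
    "Claude-Opus-4.6": 61,
    "GPT-5.4": 59,
    "Gemini-3.1-Pro": 44,
    "Composer-2.5": 44,
    "DeepSeek-V4-Pro": 32,
    "Kimi-K2.6": 29,
    "Kimi-K2.5": 27,
    "Qwen3.6-Plus": 24,
}


def spread_points(scores_pct):
    """Return (best model, worst model, gap in percentage points).

    Assumption: spread = top score - bottom score.
    """
    best = max(scores_pct, key=scores_pct.get)
    worst = min(scores_pct, key=scores_pct.get)
    return best, worst, scores_pct[best] - scores_pct[worst]


if __name__ == "__main__":
    best, worst, gap = spread_points(SCORES_PCT)
    print(f"{best} vs {worst}: {gap} points ({gap / 100:.2f} on the 0..1 scale)")
```

For FrontierSWE this gives 83 - 24 = 59 points, or 0.59 on the 0..1 scale, which matches the spread reported for this environment.
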
§ Nearby in the ranking
#   Environment                                              Spread  Discriminative
02  OSWorld-Verified · desktop computer-use                  81%     0.81
03  DeepSWE · long-horizon agentic coding                    65%     0.65
04  FrontierSWE · frontier software engineering              59%     0.59
05  GBA Eval · long-horizon SWE (build a GBA emulator)       53%     0.53
06  SkillsBench · agent skills (multi-domain)                43%     0.43

§ Work with us

Need one that still separates models?

When the public environment for your capability saturates, you can’t tell your models apart and you can’t train past it. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.