§ Ranked #06 by discriminative power
SkillsBench.
An environment for multi-domain agent skills. Across the 16 model entries with public results, the best and worst scores sit 43 percentage points apart (55% vs. 12%). It still sorts frontier models, so training on it can still move yours.
§ Public model scores
Who wins SkillsBench.
Best public result per model entry, shown as a percentage pass rate. The spread between the top and bottom rows is what makes this environment worth, or not worth, a training run.
| # | Model | Pass rate (with skills) |
|---|---|---|
| 01 | OpenHands+GPT-5.5 | 55% |
| 02 | OpenHands+Opus-4.7 | 51% |
| 03 | GeminiCLI+Gemini-3.1-Pro | 49% |
| 04 | OpenHands+Gemini-3.1-Pro | 49% |
| 05 | Codex+GPT-5.5 | 49% |
| 06 | ClaudeCode+Opus-4.7 | 46% |
| 07 | OpenHands+DeepSeek-V4-Pro | 40% |
| 08 | OpenHands+Kimi-K2.6 | 35% |
| 09 | OpenHands+GLM-5.1 | 35% |
| 10 | OpenHands+DeepSeek-V4-Flash | 32% |
| 11 | OpenHands+GPT-5.4-Mini | 26% |
| 12 | OpenHands+MiniMax-M2.7 | 26% |
| 13 | OpenHands+Sonnet-4.6 | 26% |
| 14 | OpenHands+Qwen3.6-Plus | 21% |
| 15 | OpenHands+Grok-4.3 | 20% |
| 16 | OpenHands+Gemini-3.1-Flash-Lite | 12% |
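For readers who want to check the headline number, here is a minimal sketch of how the spread could be computed from the table above. It assumes the spread is simply the best pass rate minus the worst; the page does not state its formula, but that definition reproduces the 43% figure. The names `pass_rates` and `spread` are illustrative only.

```python
# Pass rates (with skills) copied from the table above, as fractions.
pass_rates = {
    "OpenHands+GPT-5.5": 0.55,
    "OpenHands+Opus-4.7": 0.51,
    "GeminiCLI+Gemini-3.1-Pro": 0.49,
    "OpenHands+Gemini-3.1-Pro": 0.49,
    "Codex+GPT-5.5": 0.49,
    "ClaudeCode+Opus-4.7": 0.46,
    "OpenHands+DeepSeek-V4-Pro": 0.40,
    "OpenHands+Kimi-K2.6": 0.35,
    "OpenHands+GLM-5.1": 0.35,
    "OpenHands+DeepSeek-V4-Flash": 0.32,
    "OpenHands+GPT-5.4-Mini": 0.26,
    "OpenHands+MiniMax-M2.7": 0.26,
    "OpenHands+Sonnet-4.6": 0.26,
    "OpenHands+Qwen3.6-Plus": 0.21,
    "OpenHands+Grok-4.3": 0.20,
    "OpenHands+Gemini-3.1-Flash-Lite": 0.12,
}


def spread(scores: dict[str, float]) -> float:
    """Assumed definition: gap between the best and worst score."""
    values = list(scores.values())
    return max(values) - min(values)


best = max(pass_rates, key=pass_rates.get)
worst = min(pass_rates, key=pass_rates.get)
print(f"best:   {best} ({pass_rates[best]:.0%})")
print(f"worst:  {worst} ({pass_rates[worst]:.0%})")
print(f"spread: {spread(pass_rates):.2f}")
```

Under this assumed definition, a spread near zero would mean every entry scores about the same, which is the saturated case where the environment no longer separates models.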
§ Nearby in the ranking
| # | Environment | Domain | Spread | Discriminative power |
|---|---|---|---|---|
| 04 | FrontierSWE | frontier software engineering | 59% | 0.59 |
| 05 | GBA Eval | long-horizon SWE (build a GBA emulator) | 53% | 0.53 |
| 06 | SkillsBench | agent skills (multi-domain) | 43% | 0.43 |
| 07 | Diplomacy Arena | strategy / negotiation (games) | 29% | 0.29 |
| 08 | WebBench | browser agents (live web) | 22% | 0.22 |
§ Work with us
Need one that still separates models?
When the public environment for your capability saturates, you can’t tell your models apart and you can’t train past it. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.