Diplomacy Arena.

An environment for strategy / negotiation (games). Across the 15 models with public results, the best and worst scores are 29 percentage points apart (60% vs. 31%). It still sorts frontier models, so training on it can still move yours.

§ Public model scores

Who wins Diplomacy Arena.

Best public result per model entry, normalized to 0..1 and shown here as a percentage. The spread between the top and bottom rows is what makes this environment worth, or not worth, a training run.

#	Model	Performance
01	Gemini-3-Pro	60%
02	o3	58%
03	Grok-4-Fast	58%
04	Opus-4.5	57%
05	Gemini-2.5-Flash	57%
06	Sonnet-4.5	55%
07	GPT-5-Minimal	54%
08	GLM-4.6	53%
09	Kimi-K2	53%
10	Hermes-4-405b	49%
11	o4-Mini	45%
12	GPT-OSS-120b	41%
13	Llama-4-Maverick	38%
14	GPT-5-Nano	33%
15	AFM-4.5B	31%
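
The 29% spread quoted for this environment can be reproduced from the table above. The snippet below is a minimal sketch, assuming spread means the top score minus the bottom score in percentage points, which matches the published figure. The `scores` dictionary and the `spread` helper are illustrative names, not part of any published tooling.

```python
# Scores copied from the public table above, in percent.
scores = {
    "Gemini-3-Pro": 60,
    "o3": 58,
    "Grok-4-Fast": 58,
    "Opus-4.5": 57,
    "Gemini-2.5-Flash": 57,
    "Sonnet-4.5": 55,
    "GPT-5-Minimal": 54,
    "GLM-4.6": 53,
    "Kimi-K2": 53,
    "Hermes-4-405b": 49,
    "o4-Mini": 45,
    "GPT-OSS-120b": 41,
    "Llama-4-Maverick": 38,
    "GPT-5-Nano": 33,
    "AFM-4.5B": 31,
}


def spread(scores: dict[str, int]) -> int:
    """Assumed definition: best score minus worst score, in percentage points."""
    return max(scores.values()) - min(scores.values())


best = max(scores, key=scores.get)    # top row of the table
worst = min(scores, key=scores.get)   # bottom row of the table
print(f"{best} ({scores[best]}%) vs {worst} ({scores[worst]}%): "
      f"spread = {spread(scores)} points")
```

Under this assumption the result is 29 points, the same value listed for Diplomacy Arena in the ranking excerpt further down the page.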

§ Nearby in the ranking

#	Environment	Capability	Spread	Discriminative
05	GBA Eval	long-horizon SWE (build a GBA emulator)	53%	0.53
06	SkillsBench	agent skills (multi-domain)	43%	0.43
07	Diplomacy Arena	strategy / negotiation (games)	29%	0.29
08	WebBench	browser agents (live web)	22%	0.22
09	Cua-Bench	GUI computer-use (desktop + mobile)	14%	0.14

§ Work with us

Need one that still separates models?

When the public environment for your capability saturates, you can’t tell your models apart and you can’t train past it. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.
