CompileBench.

An environment for compiling and cross-compiling real open-source software. Across the 6 models with public results, the gap between the best and worst score is 27 percentage points, but the leader has reached the 100% ceiling, so the benchmark no longer separates the strongest models.


§ Public model scores

Who wins CompileBench.

Best public result per model entry, reported as pass@1 (normalized 0..1, shown here as a percentage). The spread between the top and bottom rows is what makes this environment worth, or not worth, a training run.

#	Model	pass@1
01	Opus-4.1-Thinking	100%
02	Sonnet-4-Thinking	87%
03	Sonnet-4.5-Thinking	87%
04	Sonnet-4	87%
05	Haiku-4.5	80%
06	DeepSeek-V3.1	73%
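
The 27-point spread follows directly from the rows above: it is the top score minus the bottom score. A minimal sketch of that arithmetic, with the scores copied from the table as fractions; the function names `spread` and `is_saturated` are illustrative, and the ceiling check is an assumed reading of "cleared the ceiling", not a published definition.

```python
# pass@1 scores copied from the table above, expressed as fractions in 0..1.
scores = {
    "Opus-4.1-Thinking": 1.00,
    "Sonnet-4-Thinking": 0.87,
    "Sonnet-4.5-Thinking": 0.87,
    "Sonnet-4": 0.87,
    "Haiku-4.5": 0.80,
    "DeepSeek-V3.1": 0.73,
}


def spread(results):
    """Gap between the best and worst score."""
    values = list(results.values())
    return max(values) - min(values)


def is_saturated(results, ceiling=1.0):
    """Assumed check: True when the top score has reached the ceiling."""
    return max(results.values()) >= ceiling


print(f"spread: {spread(scores):.0%}")
print(f"saturated: {is_saturated(scores)}")
```

A saturated leader means the spread comes entirely from the lower rows, so the benchmark can still rank weaker models but says nothing about how far the top entry could go.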

§ Nearby in the ranking

#	Environment	Description	Spread	Discriminative
08	WebBench	browser agents (live web)	22%	0.22
09	Cua-Bench	GUI computer-use (desktop + mobile)	14%	0.14
10	CompileBench	compile/cross-compile real OSS	27%	0.14
11	SWE-Bench-Pro	software engineering (audited)	11%	0.11
12	COBOLBench	legacy enterprise (COBOL), floor effect	2%	0.02

§ Work with us

Need one that still separates models?

When the public environment for your capability saturates, you can’t tell your models apart and you can’t train past it. We build private, contamination-resistant, verifiable-reward environments and evals on a hold-out set — designed to discriminate where the public ones no longer do.
