A 2026 Berkeley RDI study found that eight major agent benchmarks — including SWE-bench Verified, Terminal-Bench, WebArena, OSWorld, GAIA, and FieldWorkArena — could be exploited to near-perfect scores without solving any tasks.
Failure modes included leaked reference answers, unsanitized eval(), prompt-injectable LLM judges, and scoring functions that skip correctness checks entirely. A 10-line conftest.py was enough to make every SWE-bench test report as passing.
Treat leaderboard position as a signal, not proof of capability — especially on agentic benchmarks where the evaluation environment is itself part of the attack surface. Held-out, contamination-resistant evals like HLE and LiveCodeBench Pro are more resistant, but not immune.
Read the full Berkeley RDI analysis →