Benchmark for autonomous coding/scientific agents reproducing Large Hadron Collider analyses. Public CodeSOTA score is Acc_tau at tau=0.33: the percent of simulation tasks whose relative-L2 error is below 0.33, derived from Table 2 and Eq. 4 of arXiv:2605.13950.
Acc_tau at tau=0.33 is the reported evaluation metric for Collider-Bench. CodeSOTA tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | Codex CLI (GPT-5.5) | verified | 30 | 2026 | Paper ↗ |
| 02 | Claude Code (Opus 4.7) | verified | 20 | 2026 | Paper ↗ |
| 03 | Claude Code (Sonnet 4.6) | verified | 10 | 2026 | Paper ↗ |
| 04 | Claude Code (Haiku 4.5) | verified | 0.00 | 2026 | Paper ↗ |
| 05 | Codex CLI (GPT-5.4-mini) | verified | 0.00 | 2026 | Paper ↗ |
| 06 | ForgeCode (DeepSeek-V4) | verified | 0.00 | 2026 | Paper ↗ |
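
As an illustration of how a score like those in the table could be computed, here is a minimal sketch of Acc_tau as a thresholded pass rate over per-task errors. It assumes the relative-L2 error takes the common form ||predicted - reference||_2 / ||reference||_2; the exact definition is Eq. 4 of arXiv:2605.13950 and may differ. The function names `relative_l2_error` and `acc_tau` and the sample error values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def relative_l2_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Relative L2 error between two equal-length vectors.

    ASSUMPTION: uses ||predicted - reference||_2 / ||reference||_2, which may
    not match the paper's Eq. 4 exactly.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    diff_norm = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    ref_norm = math.sqrt(sum(r ** 2 for r in reference))
    if ref_norm == 0.0:
        # Undefined ratio: treat an exact match as zero error, anything else as infinite.
        return 0.0 if diff_norm == 0.0 else math.inf
    return diff_norm / ref_norm


def acc_tau(errors: Sequence[float], tau: float = 0.33) -> float:
    """Percent of tasks whose error is strictly below tau."""
    if not errors:
        raise ValueError("errors must contain at least one task")
    passed = sum(1 for e in errors if e < tau)
    return 100.0 * passed / len(errors)


# Hypothetical per-task errors, for illustration only.
example_errors = [0.05, 0.20, 0.32, 0.33, 0.41, 0.50, 0.75, 0.90, 1.20, 2.00]
example_score = acc_tau(example_errors)  # three of ten tasks fall below 0.33
```

In this sketch the comparison is strict, so a task with an error of exactly 0.33 does not count as passed, matching the "below 0.33" wording of the metric description.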