Codesota · Papers · 2026-01-01 · arXiv
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

Collider-Bench is a benchmark for evaluating autonomous LLM agents on long-horizon, real-world scientific tasks involving the reproduction of Large Hadron Collider (LHC) experimental analyses.

Agents must turn published papers into executable simulation-and-selection pipelines that predict collision event yields. The predicted yields are then evaluated against quantitative targets.
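
To make the task concrete, here is a minimal sketch of the final step such a pipeline might perform: converting a selection efficiency into an expected event yield and comparing it to a target. The yield relation (cross-section × integrated luminosity × efficiency) is standard collider practice. The function names, the input numbers, and the 20% relative tolerance are assumptions for illustration only. They are not taken from the paper, and this page does not describe the benchmark's actual scoring rule.

```python
def expected_yield(cross_section_pb: float,
                   luminosity_fb_inv: float,
                   n_selected: int,
                   n_generated: int) -> float:
    """Expected event count N = sigma * L * efficiency.

    cross_section_pb  : production cross-section in picobarns
    luminosity_fb_inv : integrated luminosity in inverse femtobarns
    n_selected        : simulated events passing the selection cuts
    n_generated       : total simulated events
    """
    if n_generated <= 0:
        raise ValueError("n_generated must be positive")
    efficiency = n_selected / n_generated
    # 1 pb = 1000 fb, so convert the cross-section before multiplying by fb^-1.
    return cross_section_pb * 1000.0 * luminosity_fb_inv * efficiency


def within_tolerance(predicted: float, target: float, rel_tol: float = 0.2) -> bool:
    """Hypothetical pass/fail check: relative deviation from the target.

    The 20% default is an illustrative assumption, not the benchmark's metric.
    """
    if target == 0:
        return predicted == 0
    return abs(predicted - target) / abs(target) <= rel_tol


if __name__ == "__main__":
    # Made-up inputs, purely for demonstration.
    predicted = expected_yield(0.5, 100.0, n_selected=2500, n_generated=10000)
    print(predicted, within_tolerance(predicted, target=11000.0))
```

In a real reproduction, the efficiency would come from running event generation, detector simulation, and the published selection cuts, which is the long-horizon part of the task.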

§ 01 · Benchmark results

6 results reproduced from this paper.

Results: 6 · SOTA rows: 1 · Models: 6 · Datasets: 1

| # | Model | Vendor | Benchmark | Value | SOTA | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | Codex CLI (GPT-5.5) | OpenAI | Collider-Bench | 30.00 | #1 | | source ↗ |
| 02 | Claude Code (Opus 4.7) | Anthropic | Collider-Bench | 20.00 | | | source ↗ |
| 03 | Claude Code (Sonnet 4.6) | Anthropic | Collider-Bench | 10.00 | | | source ↗ |
| 04 | Claude Code (Haiku 4.5) | Anthropic | Collider-Bench | 0.00 | | | source ↗ |
| 05 | Codex CLI (GPT-5.4-mini) | OpenAI | Collider-Bench | 0.00 | | | source ↗ |
| 06 | ForgeCode (DeepSeek-V4) | DeepSeek | Collider-Bench | 0.00 | | | source ↗ |

§ 02 · Models

6 models from this paper.

evaluates · Claude Code (Haiku 4.5) · Anthropic
evaluates · Claude Code (Opus 4.7) · Anthropic
evaluates · Claude Code (Sonnet 4.6) · Anthropic
evaluates · Codex CLI (GPT-5.4-mini) · OpenAI
evaluates · Codex CLI (GPT-5.5) · OpenAI
evaluates · ForgeCode (DeepSeek-V4) · DeepSeek

§ 03 · Datasets

1 dataset from this paper.

uses · Agentic AI
Collider-Bench
Task agents