RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation: training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of a result, capturing the difference between a mediocre solution and an excellent one. It also revealed a critical finding: as of late 2024, frontier models plateau after roughly 2 hours of autonomous work while human experts keep improving, exposing the "long-horizon reliability" gap in agentic AI.
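To make the plateau pattern concrete, here is a minimal sketch of how a score-versus-time-budget curve could be computed from timestamped intermediate scores; the checkpoint numbers and the best_score_at_budget helper are hypothetical illustrations, not part of RE-Bench itself.

    # Hypothetical illustration: given timestamped intermediate scores from one run,
    # report the best score achieved within each time budget. A curve that stops
    # rising past some budget is the plateau pattern described above.

    def best_score_at_budget(checkpoints, budget_hours):
        # Best score among checkpoints recorded within the time budget.
        within = [score for hours, score in checkpoints if hours <= budget_hours]
        return max(within) if within else 0.0

    # (hours elapsed, score) pairs; made-up numbers for demonstration only
    agent_run = [(0.5, 0.15), (1.0, 0.32), (2.0, 0.41), (4.0, 0.42), (8.0, 0.42)]
    human_run = [(0.5, 0.05), (1.0, 0.18), (2.0, 0.30), (4.0, 0.55), (8.0, 0.78)]

    for budget in (2, 4, 8):
        print(f"{budget}h budget: agent {best_score_at_budget(agent_run, budget):.2f}, "
              f"human {best_score_at_budget(human_run, budget):.2f}")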
7 challenging open-ended ML research engineering tasks requiring multi-hour autonomous work. Agents compete against human researchers on real tasks like implementing new architectures or optimizing training pipelines. Score is normalized against human performance.
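As a rough sketch of what normalizing against human performance can look like, the snippet below uses a min-max style convention in which the task's starting solution maps to 0 and a human reference solution maps to 1; this formula and the normalize_score name are assumptions for illustration, not the official RE-Bench scoring code.

    def normalize_score(raw_score, starting_score, human_reference_score):
        # Assumed min-max convention (not official RE-Bench code): the provided
        # starting solution maps to 0.0, the human reference solution to 1.0,
        # and values above 1.0 mean the agent beat the human reference.
        span = human_reference_score - starting_score
        if span == 0:
            raise ValueError("reference and starting scores must differ")
        return (raw_score - starting_score) / span

    # Example: a result halfway between the starting and reference solutions
    print(normalize_score(raw_score=0.6, starting_score=0.2, human_reference_score=1.0))  # 0.5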
Top 10
Leading models on RE-Bench.
No results tracked yet.
All datasets
1 dataset tracked for this task.
Related tasks
Other tasks in Agentic AI.