RE-Bench

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation: training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.
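
To make the time-horizon comparison concrete, the sketch below shows one way such curves can be computed from per-run score traces: the score at a given time budget is the best score reached within that budget, averaged across runs. This is an illustrative assumption rather than METR's evaluation code; the traces, helper names, and numbers are hypothetical.

from statistics import mean

def score_at_budget(trace, budget_hours):
    """Best normalized score reached within the time budget.
    A trace is a list of (hours_elapsed, normalized_score) checkpoints."""
    return max((s for t, s in trace if t <= budget_hours), default=0.0)

def mean_score_at_budget(traces, budget_hours):
    """Average the best-within-budget score across runs."""
    return mean(score_at_budget(trace, budget_hours) for trace in traces)

# Hypothetical traces: an agent that plateaus after ~2 hours vs. a human
# expert who keeps improving across the full 8-hour budget.
agent_traces = [[(0.5, 0.10), (1.0, 0.30), (2.0, 0.50), (8.0, 0.52)]]
human_traces = [[(1.0, 0.00), (2.0, 0.20), (4.0, 0.60), (8.0, 0.90)]]
for budget in (2.0, 8.0):
    print(budget, mean_score_at_budget(agent_traces, budget),
          mean_score_at_budget(human_traces, budget))

Under these made-up traces the agent's curve is nearly flat between 2 and 8 hours while the human curve keeps rising, which is the shape of the gap described above.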

Canonical Benchmark

RE-Bench

7 challenging open-ended ML research engineering tasks requiring multi-hour autonomous work. Agents compete against human researchers on real tasks like implementing new architectures or optimizing training pipelines. Score is normalized against human performance.

Primary metric: normalized-score
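
As a rough illustration of how a score "normalized against human performance" is typically computed (a minimal sketch: the per-task raw metrics and reference values are defined by RE-Bench itself, and the function and numbers below are assumptions), the normalization can be read as a linear rescaling in which the provided starting solution maps to 0 and the human expert reference solution maps to 1.

def normalized_score(raw, raw_start, raw_reference):
    """Linearly rescale a raw task score so the starting solution maps to 0
    and the human reference solution maps to 1 (assumed convention)."""
    if raw_reference == raw_start:
        raise ValueError("reference and starting scores must be distinct")
    return (raw - raw_start) / (raw_reference - raw_start)

# Hypothetical raw scores for one task, oriented so higher is better.
print(normalized_score(raw=0.62, raw_start=0.40, raw_reference=0.80))  # -> 0.55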

Top 10

Leading models on RE-Bench.

No results yet.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Agentic AI.