
RE-Bench

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after roughly 2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.


RE-Bench (Research Engineering Benchmark) evaluates AI agents on research engineering tasks — implementing ML experiments, reproducing paper results, and extending research codebases. It tests the intersection of coding ability and scientific understanding, where current agents struggle compared to human research engineers.

History

2023

MLAgentBench tests whether agents can run and improve ML experiments

2024

METR (Model Evaluation and Threat Research) releases early versions of research engineering tasks

2024

RE-Bench formalized with 7 hand-crafted research engineering environments

2024

Best agents achieve ~40% of expert human performance on RE-Bench tasks within comparable time

2025

Extended evaluations (up to 8 hours of agent time) show diminishing returns beyond 2 hours

2025

RE-Bench adopted as a key evaluation for measuring AI research capabilities

How RE-Bench Works

RE-Bench Pipeline
1. Task Assignment

The agent receives a research engineering task: implement an algorithm, reproduce a result, debug an experiment, or extend a research codebase.

2. Codebase Understanding

The agent reads existing code, documentation, and paper excerpts to understand the research context and implementation requirements.

3. Implementation

Code is written or modified to complete the research task — often requiring understanding of ML concepts, not just coding patterns.

4. Experimentation

The agent runs experiments, interprets results, and iterates on the implementation based on empirical observations.

5. Scoring

Results are evaluated on a continuous scale comparing agent output quality to expert human output, not just pass/fail.
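The continuous scale above can be sketched as a simple normalization: RE-Bench scores are commonly anchored so that the provided starting solution maps to 0 and the human-expert reference maps to 1. The function below is a minimal illustration of that idea, assuming those two anchors; the names and the loss example are hypothetical, not the benchmark's actual harness code.

```python
def normalized_score(raw_score, start_score, expert_score):
    """Map a raw task metric onto a continuous scale where the provided
    starting solution scores 0.0 and the human-expert reference scores 1.0.
    Values above 1.0 would mean the agent beat the expert reference."""
    if expert_score == start_score:
        raise ValueError("expert and starting scores must differ")
    return (raw_score - start_score) / (expert_score - start_score)

# Hypothetical example: the starting solution has validation loss 2.0,
# the expert reference reaches 1.2, and the agent reaches 1.6.
# For loss, lower is better, so negate before normalizing.
agent_quality = normalized_score(-1.6, -2.0, -1.2)
# agent_quality is about 0.5: halfway between the start and the expert.
```

Unlike a pass/fail check, this kind of score distinguishes a mediocre partial improvement from an excellent near-expert result on the same task.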

Current Landscape

RE-Bench highlights a crucial gap in 2025: agents that excel at well-defined coding tasks (SWE-bench) often struggle with the open-ended nature of research engineering. The benchmark shows agents achieve roughly 40% of expert performance in comparable time, with the gap widening on tasks requiring scientific judgment rather than pure implementation. This makes RE-Bench a key metric for tracking progress toward AI-driven scientific research.

Key Challenges

Requires both coding skill and domain expertise — agents must understand ML concepts to make good implementation decisions

Long-horizon tasks — some RE-Bench tasks take expert humans 2-8 hours, requiring sustained agent effort

Diminishing returns — agents plateau quickly while humans continue improving with more time

Experiment interpretation — agents struggle to diagnose why an experiment failed and what to change

Open-ended evaluation — no single 'correct' solution, requiring nuanced quality assessment
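The time-budget comparison behind the "diminishing returns" finding can be sketched as follows. METR-style evaluations report scores at a fixed total budget, sometimes splitting it into several independent attempts and keeping the best ("best-of-k"). Everything here is illustrative: `run_agent` and the toy saturating agent are hypothetical stand-ins, not RE-Bench code.

```python
import random

def best_of_k_at_budget(run_agent, total_budget_hours, k):
    """Split a total time budget into k independent attempts and keep the
    best normalized score, e.g. eight 1-hour runs vs. one 8-hour run.
    `run_agent(budget_hours)` is a hypothetical callable returning a score."""
    per_run = total_budget_hours / k
    return max(run_agent(per_run) for _ in range(k))

def toy_agent(hours, rng=random.Random(0)):
    """Toy agent whose achievable score saturates with more time,
    plateauing below the expert level of 1.0 (illustrative only)."""
    ceiling = 1 - 0.6 * 2 ** (-hours)
    return rng.uniform(0.5, 1.0) * ceiling

eight_short_runs = best_of_k_at_budget(toy_agent, 8, 8)  # eight 1-hour attempts
one_long_run = best_of_k_at_budget(toy_agent, 8, 1)      # one 8-hour attempt
```

An agent that plateaus quickly can look strong under many short attempts while gaining little from a single long run, which is one way the gap with human experts shows up at longer horizons.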

Quick Recommendations

Evaluating research coding capability

RE-Bench (METR)

Most rigorous evaluation of research engineering, used by leading AI labs

Research assistant agents

Claude 3.5 Sonnet / GPT-4o + Jupyter

Strong combination of coding and scientific reasoning for research support

ML experiment automation

OpenHands + domain-specific prompting

Extensible framework for building research engineering agents

What's Next

The frontier is extending RE-Bench to cover full research cycles — from literature review through experiment design, implementation, analysis, and paper writing. Expect integration with lab automation and scientific computing environments, testing whether agents can be genuine research collaborators.

