Agentic AI

RE-Bench

RE-Bench (Research Engineering Benchmark), from METR, evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation — training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. It surfaced a critical finding: frontier models (as of late 2024) plateau after roughly 2 hours of autonomous work while human experts keep improving, exposing the "long-horizon reliability" gap in agentic AI.


RE-Bench (Research Engineering Benchmark) evaluates AI agents on research engineering tasks — implementing ML experiments, reproducing paper results, and extending research codebases. It tests the intersection of coding ability and scientific understanding, where current agents struggle compared to human research engineers.

History

2023

MLAgentBench tests whether agents can run and improve ML experiments

2024

METR (Model Evaluation and Threat Research) releases early versions of research engineering tasks

2024

RE-Bench formalized with 7 hand-crafted research engineering environments

2024

Best agents achieve ~40% of expert human performance on RE-Bench tasks within comparable time

2025

Extended evaluations (up to 8 hours of agent time) show diminishing returns beyond 2 hours

2025

RE-Bench adopted as a key evaluation for measuring AI research capabilities

How RE-Bench Works

RE-Bench Pipeline
1

Task Assignment

The agent receives a research engineering task: implement an algorithm, reproduce a result, debug an experiment, or extend a research codebase.

2

Codebase Understanding

The agent reads existing code, documentation, and paper excerpts to understand the research context and implementation requirements.

3

Implementation

Code is written or modified to complete the research task — often requiring understanding of ML concepts, not just coding patterns.

4

Experimentation

The agent runs experiments, interprets results, and iterates on the implementation based on empirical observations.
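The run-interpret-iterate loop above can be sketched in miniature. This is an illustrative toy, not METR's harness: `run_experiment` stands in for a real training job (here it just scores a learning rate against a made-up quadratic loss), and the loop keeps a proposed change only when the empirical result improves.

```python
import random

def run_experiment(lr: float) -> float:
    """Toy stand-in for a training run: returns a validation loss.
    In RE-Bench the agent would launch real training jobs instead.
    This toy loss is minimized at lr = 0.01."""
    return (lr - 0.01) ** 2 + 0.1

def iterate(budget: int = 20) -> tuple[float, float]:
    """Run-interpret-iterate: propose a change, run the experiment,
    keep the change if the result improves, otherwise revert."""
    random.seed(0)
    lr = 0.1
    best_loss = run_experiment(lr)
    for _ in range(budget):
        candidate = lr * random.uniform(0.5, 1.5)  # propose a tweak
        loss = run_experiment(candidate)           # run the experiment
        if loss < best_loss:                       # interpret the result
            lr, best_loss = candidate, loss        # adopt the change
    return lr, best_loss
```

The hard part RE-Bench measures is exactly what this toy elides: deciding *which* change to propose next based on a genuine diagnosis of why the last run underperformed.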

5

Scoring

Results are evaluated on a continuous scale comparing agent output quality to expert human output, not just pass/fail.
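The RE-Bench paper describes normalizing each task's raw metric so that the starting solution shipped with the task scores 0 and the human reference solution scores 1; a minimal sketch of that normalization (function name is ours):

```python
def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Map a raw task metric onto a common scale:
    0 = the starting solution provided with the task,
    1 = the human reference solution.
    Values above 1 mean the agent beat the reference;
    values below 0 mean it made the starting solution worse."""
    if reference == starting:
        raise ValueError("reference and starting scores must differ")
    return (raw - starting) / (reference - starting)
```

This is what lets a single continuous number compare agent and expert output across tasks whose raw metrics (loss, accuracy, runtime) live on entirely different scales.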

Current Landscape

RE-Bench highlights a crucial gap in 2025: agents that excel at well-defined coding tasks (SWE-bench) often struggle with the open-ended nature of research engineering. The benchmark shows agents achieve roughly 40% of expert performance in comparable time, with the gap widening on tasks requiring scientific judgment rather than pure implementation. This makes RE-Bench a key metric for tracking progress toward AI-driven scientific research.

Key Challenges

Requires both coding skill and domain expertise — agents must understand ML concepts to make good implementation decisions

Long-horizon tasks — some RE-Bench tasks take expert humans 2-8 hours, requiring sustained agent effort

Diminishing returns — agents plateau quickly while humans continue improving with more time

Experiment interpretation — agents struggle to diagnose why an experiment failed and what to change

Open-ended evaluation — no single 'correct' solution, requiring nuanced quality assessment
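The "diminishing returns" challenge is usually made concrete by reading off the best score reached within each wall-clock budget. A small sketch, with entirely made-up checkpoint numbers chosen to show an early plateau:

```python
def score_at_budget(checkpoints: list[tuple[float, float]],
                    budget_hours: float) -> float:
    """Best score reached within a wall-clock budget, given
    (timestamp_hours, normalized_score) checkpoints from a run."""
    eligible = [score for t, score in checkpoints if t <= budget_hours]
    return max(eligible, default=0.0)

# Illustrative (invented) agent checkpoints: fast early gains, then a plateau.
agent_run = [(0.5, 0.10), (1.5, 0.35), (2.0, 0.40), (6.0, 0.42)]
```

With these numbers, extending the budget from 2 to 8 hours buys only 0.02 of normalized score — the shape of curve RE-Bench reports for current agents, whereas human expert curves keep climbing.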

Quick Recommendations

Evaluating research coding capability

RE-Bench (METR)

Most rigorous evaluation of research engineering, used by leading AI labs

Research assistant agents

Claude 3.5 Sonnet / GPT-4o + Jupyter

Strong combination of coding and scientific reasoning for research support

ML experiment automation

OpenHands + domain-specific prompting

Extensible framework for building research engineering agents

What's Next

The frontier is extending RE-Bench to cover full research cycles — from literature review through experiment design, implementation, analysis, and paper writing. Expect integration with lab automation and scientific computing environments, testing whether agents can be genuine research collaborators.

Benchmarks & SOTA

Related Tasks

HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.

Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
