Agentic AI

RE-Bench

RE-Bench (Research Engineering Benchmark), from METR, evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation — training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. It surfaced a critical finding: frontier models (as of late 2024) plateau after roughly 2 hours of autonomous work while human experts keep improving, exposing the "long-horizon reliability" gap in agentic AI.


RE-Bench (Research Engineering Benchmark) evaluates AI agents on research engineering tasks — implementing ML experiments, reproducing paper results, and extending research codebases. It tests the intersection of coding ability and scientific understanding, where current agents struggle compared to human research engineers.

History

2023

MLAgentBench tests whether agents can run and improve ML experiments

2024

METR (Model Evaluation and Threat Research) releases early versions of research engineering tasks

2024

RE-Bench formalized with 7 hand-crafted research engineering environments

2024

Best agents achieve ~40% of expert human performance on RE-Bench tasks within comparable time

2025

Extended evaluations (up to 8 hours of agent time) show diminishing returns beyond 2 hours

2025

RE-Bench adopted as a key evaluation for measuring AI research capabilities

How RE-Bench Works

RE-Bench Pipeline
1

Task Assignment

The agent receives a research engineering task: implement an algorithm, reproduce a result, debug an experiment, or extend a research codebase.

2

Codebase Understanding

The agent reads existing code, documentation, and paper excerpts to understand the research context and implementation requirements.

3

Implementation

Code is written or modified to complete the research task — often requiring understanding of ML concepts, not just coding patterns.

4

Experimentation

The agent runs experiments, interprets results, and iterates on the implementation based on empirical observations.
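The run-interpret-iterate loop above can be sketched in miniature. This is an illustrative toy, not METR's harness: `run_experiment` stands in for a real training job (here it just scores a learning rate against a made-up quadratic loss), and the loop keeps a proposed change only when the empirical result improves.

```python
import random

def run_experiment(lr: float) -> float:
    """Toy stand-in for a training run: returns a validation loss.
    In RE-Bench the agent would launch real training jobs instead.
    This toy loss is minimized at lr = 0.01."""
    return (lr - 0.01) ** 2 + 0.1

def iterate(budget: int = 20) -> tuple[float, float]:
    """Run-interpret-iterate: propose a change, run the experiment,
    keep the change if the result improves, otherwise revert."""
    random.seed(0)
    lr = 0.1
    best_loss = run_experiment(lr)
    for _ in range(budget):
        candidate = lr * random.uniform(0.5, 1.5)  # propose a tweak
        loss = run_experiment(candidate)           # run the experiment
        if loss < best_loss:                       # interpret the result
            lr, best_loss = candidate, loss        # adopt the change
    return lr, best_loss
```

The hard part RE-Bench measures is exactly what this toy elides: deciding *which* change to propose next based on a genuine diagnosis of why the last run underperformed.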

5

Scoring

Results are evaluated on a continuous scale comparing agent output quality to expert human output, not just pass/fail.
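The RE-Bench paper describes normalizing each task's raw metric so that the starting solution shipped with the task scores 0 and the human reference solution scores 1; a minimal sketch of that normalization (function name is ours):

```python
def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Map a raw task metric onto a common scale:
    0 = the starting solution provided with the task,
    1 = the human reference solution.
    Values above 1 mean the agent beat the reference;
    values below 0 mean it made the starting solution worse."""
    if reference == starting:
        raise ValueError("reference and starting scores must differ")
    return (raw - starting) / (reference - starting)
```

This is what lets a single continuous number compare agent and expert output across tasks whose raw metrics (loss, accuracy, runtime) live on entirely different scales.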

Current Landscape

RE-Bench highlights a crucial gap in 2025: agents that excel at well-defined coding tasks (SWE-bench) often struggle with the open-ended nature of research engineering. The benchmark shows agents achieve roughly 40% of expert performance in comparable time, with the gap widening on tasks requiring scientific judgment rather than pure implementation. This makes RE-Bench a key metric for tracking progress toward AI-driven scientific research.

Key Challenges

Requires both coding skill and domain expertise — agents must understand ML concepts to make good implementation decisions

Long-horizon tasks — some RE-Bench tasks take expert humans 2-8 hours, requiring sustained agent effort

Diminishing returns — agents plateau quickly while humans continue improving with more time

Experiment interpretation — agents struggle to diagnose why an experiment failed and what to change

Open-ended evaluation — no single 'correct' solution, requiring nuanced quality assessment
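The "diminishing returns" challenge is usually made concrete by reading off the best score reached within each wall-clock budget. A small sketch, with entirely made-up checkpoint numbers chosen to show an early plateau:

```python
def score_at_budget(checkpoints: list[tuple[float, float]],
                    budget_hours: float) -> float:
    """Best score reached within a wall-clock budget, given
    (timestamp_hours, normalized_score) checkpoints from a run."""
    eligible = [score for t, score in checkpoints if t <= budget_hours]
    return max(eligible, default=0.0)

# Illustrative (invented) agent checkpoints: fast early gains, then a plateau.
agent_run = [(0.5, 0.10), (1.5, 0.35), (2.0, 0.40), (6.0, 0.42)]
```

With these numbers, extending the budget from 2 to 8 hours buys only 0.02 of normalized score — the shape of curve RE-Bench reports for current agents, whereas human expert curves keep climbing.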

Quick Recommendations

Evaluating research coding capability

RE-Bench (METR)

Most rigorous evaluation of research engineering, used by leading AI labs

Research assistant agents

Claude 3.5 Sonnet / GPT-4o + Jupyter

Strong combination of coding and scientific reasoning for research support

ML experiment automation

OpenHands + domain-specific prompting

Extensible framework for building research engineering agents

What's Next

The frontier is extending RE-Bench to cover full research cycles — from literature review through experiment design, implementation, analysis, and paper writing. Expect integration with lab automation and scientific computing environments, testing whether agents can be genuine research collaborators.

Benchmarks & SOTA

Related Tasks

HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.

Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
