RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks that require genuine experimentation — training models, analyzing data, and iterating on approaches over time horizons of up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures the quality of results, capturing the difference between a mediocre and an excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after roughly 2 hours of autonomous work while human experts keep improving, exposing the "long-horizon reliability" gap in agentic AI.
RE-Bench (Research Engineering Benchmark) evaluates AI agents on research engineering tasks — implementing ML experiments, reproducing paper results, and extending research codebases. It tests the intersection of coding ability and scientific understanding, where current agents struggle compared to human research engineers.
History
MLAgentBench tests whether agents can run and improve ML experiments
METR (Model Evaluation and Threat Research) releases early versions of research engineering tasks
RE-Bench formalized with 7 hand-crafted research engineering environments
Best agents achieve ~40% of expert human performance on RE-Bench tasks within comparable time
Extended evaluations (up to 8 hours of agent time) show diminishing returns beyond 2 hours
RE-Bench adopted as a key evaluation for measuring AI research capabilities
How RE-Bench Works
Task Assignment
The agent receives a research engineering task: implement an algorithm, reproduce a result, debug an experiment, or extend a research codebase.
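As a rough illustration of what such a hand-off can look like (the field names and values below are hypothetical, not METR's actual RE-Bench task format), a task bundles instructions, a starting solution, a time budget, and a scoring entry point:

```python
# Hypothetical sketch of a research engineering task hand-off.
# Field names and values are illustrative only; this is NOT METR's
# actual RE-Bench task format.
task = {
    "name": "speed_up_training",                  # hypothetical task id
    "instructions": (
        "Reduce the wall-clock time needed for the training script in "
        "./workspace to reach the target validation loss. Any code change "
        "is allowed as long as the scoring command still runs."
    ),
    "starting_solution": "./workspace/train.py",  # agents begin from working code
    "time_budget_hours": 8,                       # same budget given to human experts
    "scoring_command": "python score.py --run ./workspace/output",
}
```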
Codebase Understanding
The agent reads existing code, documentation, and paper excerpts to understand the research context and implementation requirements.
Implementation
Code is written or modified to complete the research task — often requiring understanding of ML concepts, not just coding patterns.
Experimentation
The agent runs experiments, interprets results, and iterates on the implementation based on empirical observations.
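A toy sketch of this run / observe / revise loop is below. The "experiment" is a made-up one-parameter function so the example is self-contained and runnable; real RE-Bench tasks involve full ML codebases and long-running training jobs, so treat this only as the shape of the loop, not its substance.

```python
import random
import time

def run_experiment(learning_rate: float) -> float:
    """Stand-in for training + evaluation; returns a scalar score to maximize."""
    return -((learning_rate - 3e-4) ** 2) + random.gauss(0, 1e-9)

def iterate(budget_seconds: float = 1.0) -> tuple[float, float]:
    """Greedy improvement loop under a fixed time budget."""
    best_lr = 1e-3
    best_score = run_experiment(best_lr)
    deadline = time.time() + budget_seconds
    while time.time() < deadline:
        candidate = best_lr * random.choice([0.5, 0.8, 1.25, 2.0])  # propose a change
        score = run_experiment(candidate)                           # run the experiment
        if score > best_score:                                      # keep improvements,
            best_lr, best_score = candidate, score                  # discard regressions
    return best_lr, best_score

if __name__ == "__main__":
    print(iterate())
```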
Scoring
Results are evaluated on a continuous scale comparing agent output quality to expert human output, not just pass/fail.
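A minimal sketch of one such continuous scale, assuming (this is my reading, not a quotation of the paper) that the task's provided starting solution is anchored at 0 and a reference solution at 1:

```python
def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Map a raw task metric onto a scale where the provided starting solution
    scores 0 and the reference (expert) solution scores 1. Values above 1 mean
    the agent beat the reference; below 0 means it made the starting solution
    worse. The exact normalization is an assumption about RE-Bench, not a
    verbatim description of it."""
    return (raw - starting) / (reference - starting)

# e.g. a run that closes 40% of the gap between starting and expert solutions:
# normalized_score(raw=0.58, starting=0.50, reference=0.70) == 0.4
```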
Current Landscape
RE-Bench highlights a crucial gap in 2025: agents that excel at well-defined coding tasks (SWE-bench) often struggle with the open-ended nature of research engineering. The benchmark shows agents achieve roughly 40% of expert performance in comparable time, with the gap widening on tasks requiring scientific judgment rather than pure implementation. This makes RE-Bench a key metric for tracking progress toward AI-driven scientific research.
Key Challenges
Requires both coding skill and domain expertise — agents must understand ML concepts to make good implementation decisions
Long-horizon tasks — some RE-Bench tasks take expert humans 2-8 hours, requiring sustained agent effort
Diminishing returns — agents plateau quickly while humans continue improving with more time
Experiment interpretation — agents struggle to diagnose why an experiment failed and what to change
Open-ended evaluation — no single 'correct' solution, requiring nuanced quality assessment
Quick Recommendations
Evaluating research coding capability
RE-Bench (METR)
Most rigorous evaluation of research engineering, used by leading AI labs
Research assistant agents
Claude 3.5 Sonnet / GPT-4o + Jupyter
Strong combination of coding and scientific reasoning for research support
ML experiment automation
OpenHands + domain-specific prompting
Extensible framework for building research engineering agents
What's Next
The frontier is extending RE-Bench to cover full research cycles — from literature review through experiment design, implementation, analysis, and paper writing. Expect integration with lab automation and scientific computing environments, testing whether agents can be genuine research collaborators.
Related Tasks
Task agents
AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks on behalf of users, acting independently to perceive their environment, make decisions, and take actions without constant human intervention. They use advanced capabilities like reasoning, memory, planning, and learning, often leveraging large language models (LLMs) and other AI tools to interpret information and perform complex workflows across various industries.
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.