Who leads the RE-Bench benchmark?

o3 currently leads RE-Bench with a Normalized Score of 0.38.

What is the state-of-the-art score on RE-Bench?

The state-of-the-art result on RE-Bench is 0.38 (Normalized Score), achieved by o3 as of 2025.

How many models are tracked on RE-Bench?

Codesota tracks 5 models on RE-Bench.

When was the RE-Bench leaderboard last updated?

The RE-Bench leaderboard on Codesota includes results through 2025, with the earliest tracked result from 2024.

RE-Bench.

Name: RE-Bench Benchmark Results
Creator: Unknown
Published: 2024-01-01
License: https://creativecommons.org/licenses/by/4.0/

RE-Bench consists of 7 challenging, open-ended ML research engineering tasks that require multi-hour autonomous work. Agents compete against human researchers on real tasks such as implementing new architectures or optimizing training pipelines. Scores are normalized against human performance.
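The description above says scores are normalized against human performance but does not give the formula. As a minimal sketch, assuming a linear rescaling in which a task's starting solution maps to 0 and a human reference solution maps to 1, the calculation might look like the following. The function name, the parameter names, and the numbers in the example are all hypothetical, and the benchmark's actual per-task normalization may differ.

```python
def normalized_score(raw: float, starting: float, reference: float) -> float:
    """Linearly rescale a raw task score.

    Assumption (not confirmed by this page): the starting solution maps
    to 0.0 and the human reference solution maps to 1.0.
    """
    if reference == starting:
        raise ValueError("reference and starting scores must differ")
    return (raw - starting) / (reference - starting)


# Hypothetical raw scores, chosen only to illustrate the arithmetic.
print(normalized_score(raw=2.0, starting=1.0, reference=3.0))  # prints 0.5
```

Under this assumed scheme, a value of 0.5 would mean the agent closed half of the gap between the starting solution and the human reference on that task.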

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Normalized Score

Normalized Score is the reported evaluation metric for RE-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Normalized Score: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	o3	OpenAI o3 on RE-Bench. 8-hour limit. METR eval, 2025.	verified	0.38	2025
02	Claude 3.7 Sonnet	Claude 3.7 Sonnet on RE-Bench. 8-hour limit extrapolation. METR eval, 2025.	verified	0.29	2025
03	o1	OpenAI o1 on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114.	verified	0.17	2024
04	Claude 3.5 Sonnet	Claude 3.5 Sonnet on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114.	verified	0.12	2024
05	GPT-4 Turbo (2024)	GPT-4 Turbo on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114.	verified	0.07	2024
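The muting rule stated above the table (a row is muted when an earlier or same-year result already scored better) can be sketched in a few lines. This is an illustrative reading of that rule applied to the five rows in the table, not Codesota's actual implementation. The `Result` class and the `is_muted` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float  # Normalized Score, higher is better
    year: int


# Rows copied from the leaderboard table.
RESULTS = [
    Result("o3", 0.38, 2025),
    Result("Claude 3.7 Sonnet", 0.29, 2025),
    Result("o1", 0.17, 2024),
    Result("Claude 3.5 Sonnet", 0.12, 2024),
    Result("GPT-4 Turbo (2024)", 0.07, 2024),
]


def is_muted(row: Result, rows: list[Result]) -> bool:
    """True if an earlier or same-year result scored strictly better."""
    return any(
        other.score > row.score and other.year <= row.year
        for other in rows
        if other is not row
    )


leader = max(RESULTS, key=lambda r: r.score)
muted = [r.model for r in RESULTS if is_muted(r, RESULTS)]

print(leader.model, leader.score)  # prints: o3 0.38
print(muted)  # prints: ['Claude 3.7 Sonnet', 'Claude 3.5 Sonnet', 'GPT-4 Turbo (2024)']
```

Because the table records only a year for each result, this sketch treats every same-year row as a competitor. A finer-grained publication date, if available, would let the rule distinguish results within a year.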