Codesota · Benchmark · RE-Bench

RE-Bench.

Seven challenging, open-ended ML research engineering tasks that require multi-hour autonomous work. Agents are compared against human researchers on real tasks such as implementing new architectures or optimizing training pipelines. Scores are normalized against human performance.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Normalized Score

Normalized Score is the reported evaluation metric for RE-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
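
As a rough illustration of what "normalized against human performance" can mean, the sketch below assumes a linear rescaling in which a task's starting solution maps to 0 and a human reference solution maps to 1. That scheme, the function names, and the numbers are assumptions made for illustration; this page does not spell out the exact formula, so check the paper before relying on it.

```python
def normalize_score(raw: float, start: float, reference: float) -> float:
    """Linearly rescale a raw task score.

    Assumed convention (not confirmed by this page):
    the starting solution maps to 0.0 and the human reference maps to 1.0.
    """
    if reference == start:
        raise ValueError("reference and starting scores must differ")
    return (raw - start) / (reference - start)


def mean_normalized(scores: list[float]) -> float:
    """Average per-task normalized scores into one number (hypothetical aggregate)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Illustrative values only, not real benchmark data.
per_task = [
    normalize_score(raw=12.0, start=10.0, reference=20.0),
    normalize_score(raw=0.5, start=0.0, reference=1.0),
]
overall = mean_normalized(per_task)
```

Under this assumed convention, a score above 1.0 would mean the agent beat the human reference on that task, and a negative score would mean it ended up worse than the starting solution.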

Trust tiers for Normalized Score: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes | Links |
|------|-------|-------|-------|------|-------|-------|
| 01 | o3 | verified | 0.38 | 2025 | OpenAI o3 on RE-Bench. 8-hour limit. METR eval, 2025. | Paper |
| 02 | Claude 3.7 Sonnet | verified | 0.29 | 2025 | Claude 3.7 Sonnet on RE-Bench. 8-hour limit extrapolation. METR eval, 2025. | Paper |
| 03 | o1 | verified | 0.17 | 2024 | OpenAI o1 on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114. | Paper |
| 04 | Claude 3.5 Sonnet | verified | 0.12 | 2024 | Claude 3.5 Sonnet on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114. | Paper |
| 05 | GPT-4 Turbo (2024) | verified | 0.07 | 2024 | GPT-4 Turbo on RE-Bench. 2-hour limit. Table 1, arxiv:2411.15114. | Paper |
