RE-Bench consists of 7 challenging, open-ended ML research engineering tasks that require multi-hour autonomous work. Agents compete against human researchers on real tasks such as implementing new architectures or optimizing training pipelines. Scores are normalized against human performance.
Normalized Score is the reported evaluation metric for RE-Bench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
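The exact normalization formula is not given here. As a rough sketch only, one common way to normalize a raw task score against a human reference is a linear rescaling in which a starting baseline maps to 0 and the human reference maps to 1. The function name `normalized_score`, its parameters, and the numbers below are all hypothetical and are not taken from RE-Bench.

```python
def normalized_score(raw: float, baseline: float, reference: float) -> float:
    """Linearly rescale `raw` so that `baseline` maps to 0.0 and `reference` to 1.0.

    Assumption: a simple linear normalization; the benchmark's actual
    formula may differ.
    """
    if reference == baseline:
        raise ValueError("reference and baseline must differ")
    return (raw - baseline) / (reference - baseline)


# Hypothetical raw scores for one task (illustrative only).
agent = normalized_score(raw=2.0, baseline=1.0, reference=5.0)
print(agent)
```

Under this assumed scheme, a value between 0 and 1 means the agent improved on the starting point but fell short of the human reference, which is consistent with "higher is better".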
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | o3 | verified | 0.38 | 2025 | Paper ↗ |
| 02 | Claude 3.7 Sonnet | verified | 0.29 | 2025 | Paper ↗ |
| 03 | o1 | verified | 0.17 | 2024 | Paper ↗ |
| 04 | Claude 3.5 Sonnet | verified | 0.12 | 2024 | Paper ↗ |
| 05 | GPT-4 Turbo (2024) | verified | 0.07 | 2024 | Paper ↗ |