Defects4J: A Database of Real Faults in Java Programs.

Defects4J is a standard program repair benchmark with 835 real bugs from 17 open-source Java projects. Each bug comes with a fix and a triggering test suite. The primary metric is the number of correctly fixed bugs, that is, bugs for which a generated patch is both plausible (it passes the test suite) and correct.
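As a rough illustration of how the metric separates plausible from correct patches, the sketch below counts both from per-bug repair outcomes. The `PatchResult` record, its field names, and the sample bug IDs are hypothetical and not part of Defects4J; how correctness is judged (typically by comparison with the reference fix) is assumed to happen outside this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchResult:
    """Outcome of one repair attempt (hypothetical record, for illustration only)."""
    bug_id: str          # illustrative identifier such as "Lang-1"
    passes_tests: bool   # plausible: the patched program passes the test suite
    judged_correct: bool # correct: judged equivalent to the reference fix


def count_patches(results):
    """Return (plausible_bugs, correct_bugs), counting each bug at most once."""
    plausible = {r.bug_id for r in results if r.passes_tests}
    # A patch only counts as correct if it is also plausible.
    correct = {r.bug_id for r in results if r.passes_tests and r.judged_correct}
    return len(plausible), len(correct)


if __name__ == "__main__":
    sample = [
        PatchResult("Lang-1", passes_tests=True, judged_correct=True),
        PatchResult("Lang-1", passes_tests=True, judged_correct=False),
        PatchResult("Math-2", passes_tests=True, judged_correct=False),
        PatchResult("Time-3", passes_tests=False, judged_correct=False),
    ]
    print(count_patches(sample))
```

Counting by bug rather than by patch keeps a tool that emits many candidate patches for one bug from inflating its score; the leaderboard's correct-patches column would correspond to the second number under these assumptions.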


§ 01 · Leaderboard

Best published scores.

5 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: correct-patches · higher is better

#	Model	Org	Submitted	Paper / code	correct-patches
01	SRepair	SUTD	Apr 2024	SRepair: Utilizing Multiple LLM Agents for Automated Pro…	101
02	Claude Opus 4 (API)	Anthropic	Mar 2026	official-model-card	89
03	GPT-4o (API)	OpenAI	Apr 2024	SRepair: Utilizing Multiple LLM Agents for Automated Pro…	82
04	ChatRepair	Fudan University	Jan 2024	ChatRepair: A Conversational Approach to Automated Progr…	78
05	AlphaRepair	ETH Zurich	Aug 2022	Less Training, More Repairing Please: Revisiting Automat…	23

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
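The ordering rule described above (higher score first, ties broken by submission date) can be sketched as a single sort. The rows are copied from the table; the tuple layout and the `rank` helper are illustrative, and treating the earlier submission as the winner of a tie is an assumption, since the page does not say which direction the tie-break runs.

```python
# (model, (year, month) submitted, correct-patches), copied from the table above.
ROWS = [
    ("AlphaRepair", (2022, 8), 23),
    ("ChatRepair", (2024, 1), 78),
    ("GPT-4o", (2024, 4), 82),
    ("SRepair", (2024, 4), 101),
    ("Claude Opus 4", (2026, 3), 89),
]


def rank(rows):
    """Sort by score, highest first; on equal scores the earlier date ranks higher (assumed)."""
    return sorted(rows, key=lambda row: (-row[2], row[1]))


if __name__ == "__main__":
    for position, (model, _, score) in enumerate(rank(ROWS), start=1):
        print(f"{position:02d}  {model:<14} {score}")
```

None of the five indexed rows currently tie, so the date key only matters for future submissions.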

§ 03 · Progress

3 steps of state of the art.

Each row below marks a model that broke the previous record on correct-patches. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · correct-patches

Date	Model	Org	correct-patches
Aug 24, 2022	AlphaRepair	ETH Zurich	23
Jan 3, 2024	ChatRepair	Fudan University	78
Apr 18, 2024	SRepair	SUTD	101

Fig 3 · SOTA-setting models only. 3 entries span Aug 2022 → Apr 2024.
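The progression above can be reproduced from the leaderboard rows with a running maximum over submission dates. This is a minimal sketch, not the site's actual procedure: the tuple layout and `sota_steps` helper are illustrative, dates are reduced to the month granularity shown in the leaderboard, and entries sharing a date are assumed to compete as a group so that only the best of them can set a record.

```python
from itertools import groupby

# (model, (year, month) submitted, correct-patches), copied from the leaderboard.
ROWS = [
    ("SRepair", (2024, 4), 101),
    ("Claude Opus 4", (2026, 3), 89),
    ("GPT-4o", (2024, 4), 82),
    ("ChatRepair", (2024, 1), 78),
    ("AlphaRepair", (2022, 8), 23),
]


def sota_steps(rows):
    """Return the entries that strictly beat every earlier score, in date order."""
    steps = []
    best = float("-inf")
    chronological = sorted(rows, key=lambda row: row[1])
    for _, same_date in groupby(chronological, key=lambda row: row[1]):
        # Assumption: among entries with the same date, only the top score counts.
        top = max(same_date, key=lambda row: row[2])
        if top[2] > best:
            best = top[2]
            steps.append(top)
    return steps


if __name__ == "__main__":
    for model, (year, month), score in sota_steps(ROWS):
        print(f"{year}-{month:02d}  {model:<12} {score}")
```

Under these assumptions GPT-4o (82) is not a step because SRepair (101) shares its date, and Claude Opus 4 (89) is not a step because it arrived after the 101 record.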

§ 04 · Literature

3 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

SRepair: Utilizing Multiple LLM Agents for Automated Program Repair
Apr 2024 · SRepair, GPT-4o · arXiv

ChatRepair: A Conversational Approach to Automated Program Repair
Jan 2024 · ChatRepair · arXiv

Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning
Aug 2022 · AlphaRepair · arXiv

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.


What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
