Program Repair
Automatically fixing bugs in code.
Automated program repair (APR) generates patches to fix buggy code, ranging from simple syntax fixes to complex logic repairs. LLMs have dramatically advanced the field, with GPT-4 and Claude 3.5 fixing 50-70% of real-world bugs from curated benchmarks. The modern APR pipeline combines fault localization, LLM-based patch generation, and test-based validation.
History
GenProg uses genetic programming to evolve patches for C programs
Prophet and Angelix use learned program transformations for semantic repair
SequenceR applies seq2seq models to generate patches from buggy code
AlphaRepair combines neural code generation with template-based repair
Codex-based repair achieves 31% on Defects4J, a major improvement
ChatRepair uses conversational LLM prompting with test feedback for iterative repair
SWE-bench shows LLM agents can fix real GitHub issues at a 45-50% rate
Claude 3.5 Sonnet and GPT-4 achieve 60-70% on curated repair benchmarks
Autonomous repair agents (SWE-agent, Claude Code) operate in production environments
How Program Repair Works
Bug Report / Failing Test
The repair process starts with a bug report, failing test case, or error stack trace that localizes the symptom.
Fault Localization
Spectrum-based (Ochiai, Tarantula) or LLM-based methods identify the most likely buggy location in the codebase.
Patch Generation
One or more candidate patches are generated — by the LLM based on the bug context, or by applying learned program transformations.
Patch Validation
Candidates are tested against the failing test (should now pass) and existing tests (should not regress).
Patch Ranking
Multiple valid patches are ranked by naturalness, minimality, and semantic correctness to select the best fix.
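The spectrum-based fault localization step above can be sketched with the Ochiai formula: a statement's suspiciousness is ef / sqrt((ef + nf) * (ef + ep)), where ef and nf are the failing tests that do and do not execute the statement, and ep is the passing tests that execute it. This is a minimal sketch, not a production tool; the `coverage` input format and statement ids are illustrative assumptions.

```python
import math

def ochiai(coverage, failing):
    """Rank statements by Ochiai suspiciousness, highest first.

    coverage: dict mapping test name -> set of statement ids it executes
    failing:  set of test names that fail
    """
    total_failing = len(failing)  # ef + nf is the same for every statement
    statements = set().union(*coverage.values())
    scores = {}
    for s in statements:
        ef = sum(1 for t in failing if s in coverage[t])
        ep = sum(1 for t in coverage if t not in failing and s in coverage[t])
        denom = math.sqrt(total_failing * (ef + ep))
        scores[s] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy spectrum: test t3 fails and only s3 is covered exclusively by it,
# so s3 should rank as most suspicious.
cov = {"t1": {"s1"}, "t2": {"s1", "s2"}, "t3": {"s2", "s3"}}
ranked = ochiai(cov, failing={"t3"})
```

Statements covered only by failing tests score 1.0; statements never covered by a failing test score 0.0, which is why the repair step concentrates patch attempts near the top of this ranking.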
Current Landscape
Program repair in 2025 has been transformed by LLMs. The traditional generate-and-validate pipeline now uses LLMs as the generator, achieving dramatically higher fix rates than template-based or search-based methods. The key insight is iterative repair: LLMs read test failure messages and refine their patches, mimicking human debugging. This approach, embodied in tools like SWE-agent and Claude Code, fixes real GitHub issues at scale. The remaining gap is between benchmark performance (60-70% on curated bugs) and real-world reliability on diverse, complex codebases.
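The iterative repair loop described above can be sketched as follows. The `generate_patch` callable is a hypothetical stand-in for an LLM call; in the usage example it is a stub that returns a fixed patch on its second attempt, mimicking a model that corrects itself after reading test feedback.

```python
def iterative_repair(tests, generate_patch, max_rounds=3):
    """Test-guided repair: propose a patch, run the tests, feed the
    failure messages back to the generator, and retry until all pass."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate_patch(feedback)
        failures = []
        for args, expected in tests:
            try:
                got = candidate(*args)
                if got != expected:
                    failures.append(f"{args}: expected {expected}, got {got}")
            except Exception as e:
                failures.append(f"{args}: raised {e!r}")
        if not failures:
            return candidate              # validated patch: all tests pass
        feedback = "\n".join(failures)    # error messages drive the next attempt
    return None                           # give up after max_rounds

# Stub "LLM": first attempt is still buggy (always returns a),
# second attempt is the real fix.
attempts = iter([
    lambda a, b: a if a > b else a,   # buggy patch
    lambda a, b: a if a > b else b,   # correct patch
])
def stub_llm(feedback):
    return next(attempts)

fixed = iterative_repair([((1, 2), 2), ((3, 1), 3)], stub_llm)
```

The loop terminates either with a patch that passes every test or with `None`, mirroring how agentic tools bound the number of debugging rounds per bug.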
Key Challenges
Overfitting patches — candidates that pass the test suite without fixing the underlying bug (test-suite overfitting)
Fault localization bottleneck — repair quality is bounded by the accuracy of finding the right location to fix
Complex bugs — multi-location, multi-file bugs requiring coordinated changes remain very difficult
Patch quality — generated patches may fix the bug but introduce code smells or degrade readability
Evaluation reliability — benchmarks like Defects4J have a limited number of bugs, leading to high variance
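Test-suite overfitting, the first challenge above, is easy to demonstrate. The sketch below (all function names are illustrative) shows a "patch" that hard-codes the one failing input: it passes the visible test suite used during repair but fails a held-out test, while a genuine fix passes both.

```python
def buggy_floor_div(a, b):
    # Bug: int() truncates toward zero, so negatives round the wrong way
    return int(a / b)

def overfit_patch(a, b):
    # Overfitting "fix": special-cases the one input the tests exercise
    if (a, b) == (-7, 2):
        return -4
    return int(a / b)

def correct_patch(a, b):
    # Genuine fix: true floor division
    return a // b

def passes(fn, tests):
    return all(fn(*args) == expected for args, expected in tests)

visible_tests = [((7, 2), 3), ((-7, 2), -4)]   # suite used during repair
held_out      = [((-9, 2), -5)]                # unseen input exposes overfitting
```

Both patches are "valid" to a generate-and-validate pipeline that only checks `visible_tests`; only the held-out test distinguishes them, which is why overfitting inflates benchmark fix rates.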
Quick Recommendations
Production bug repair
Claude 3.5 Sonnet + test-guided repair loop
Best combination of code understanding, fix generation, and iterative debugging
Autonomous repair agent
SWE-agent / OpenHands + Claude/GPT-4
Full pipeline from bug report to validated patch
CI/CD integration
GitHub Copilot autofix / CodeGuru
Automated fix suggestions in pull request reviews
Research benchmarking
Defects4J / BugsInPy / SWE-bench
Standard bug suites with reproducible evaluation protocols
What's Next
The frontier is proactive repair — fixing bugs before they reach users. Expect CI/CD-integrated repair agents that automatically diagnose and fix failing builds, combined with formal verification to guarantee patch correctness. Multi-file, multi-step repair requiring architectural understanding remains the next capability hurdle.
Benchmarks & SOTA
Related Tasks