Program Repair
Automatically fixing bugs in code.
Automated program repair (APR) generates patches to fix buggy code, ranging from simple syntax fixes to complex logic repairs. LLMs have dramatically advanced the field, with GPT-4 and Claude 3.5 fixing 50-70% of real-world bugs from curated benchmarks. The modern APR pipeline combines fault localization, LLM-based patch generation, and test-based validation.
History
GenProg uses genetic programming to evolve patches for C programs
Prophet and Angelix use learned program transformations for semantic repair
SequenceR applies seq2seq models to generate patches from buggy code
AlphaRepair combines neural code generation with template-based repair
Codex-based repair achieves 31% on Defects4J, a major improvement
ChatRepair uses conversational LLM prompting with test feedback for iterative repair
SWE-bench shows LLM agents can fix real GitHub issues at a 45-50% rate
Claude 3.5 Sonnet and GPT-4 achieve 60-70% on curated repair benchmarks
Autonomous repair agents (SWE-agent, Claude Code) operate in production environments
How Program Repair Works
Bug Report / Failing Test
The repair process starts with a bug report, failing test case, or error stack trace that localizes the symptom.
Fault Localization
Spectrum-based (Ochiai, Tarantula) or LLM-based methods identify the most likely buggy location in the codebase.
Patch Generation
One or more candidate patches are generated — by the LLM based on the bug context, or by applying learned program transformations.
Patch Validation
Candidates are tested against the failing test (should now pass) and existing tests (should not regress).
Patch Ranking
Multiple valid patches are ranked by naturalness, minimality, and semantic correctness to select the best fix.
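The spectrum-based fault localization step above can be sketched with the Ochiai formula: a statement's suspiciousness is ef / sqrt((ef + nf) * (ef + ep)), where ef and nf are the failing tests that do and do not execute the statement, and ep is the passing tests that execute it. This is a minimal sketch, not a production tool; the `coverage` input format and statement ids are illustrative assumptions.

```python
import math

def ochiai(coverage, failing):
    """Rank statements by Ochiai suspiciousness, highest first.

    coverage: dict mapping test name -> set of statement ids it executes
    failing:  set of test names that fail
    """
    total_failing = len(failing)  # ef + nf is the same for every statement
    statements = set().union(*coverage.values())
    scores = {}
    for s in statements:
        ef = sum(1 for t in failing if s in coverage[t])
        ep = sum(1 for t in coverage if t not in failing and s in coverage[t])
        denom = math.sqrt(total_failing * (ef + ep))
        scores[s] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy spectrum: test t3 fails and only s3 is covered exclusively by it,
# so s3 should rank as most suspicious.
cov = {"t1": {"s1"}, "t2": {"s1", "s2"}, "t3": {"s2", "s3"}}
ranked = ochiai(cov, failing={"t3"})
```

Statements covered only by failing tests score 1.0; statements never covered by a failing test score 0.0, which is why the repair step concentrates patch attempts near the top of this ranking.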
Current Landscape
Program repair in 2025 has been transformed by LLMs. The traditional generate-and-validate pipeline now uses LLMs as the generator, achieving dramatically higher fix rates than template-based or search-based methods. The key insight is iterative repair: LLMs read test failure messages and refine their patches, mimicking human debugging. This approach, embodied in tools like SWE-agent and Claude Code, fixes real GitHub issues at scale. The remaining gap is between benchmark performance (60-70% on curated bugs) and real-world reliability on diverse, complex codebases.
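The iterative repair loop described above can be sketched as follows. The `generate_patch` callable is a hypothetical stand-in for an LLM call; in the usage example it is a stub that returns a fixed patch on its second attempt, mimicking a model that corrects itself after reading test feedback.

```python
def iterative_repair(tests, generate_patch, max_rounds=3):
    """Test-guided repair: propose a patch, run the tests, feed the
    failure messages back to the generator, and retry until all pass."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate_patch(feedback)
        failures = []
        for args, expected in tests:
            try:
                got = candidate(*args)
                if got != expected:
                    failures.append(f"{args}: expected {expected}, got {got}")
            except Exception as e:
                failures.append(f"{args}: raised {e!r}")
        if not failures:
            return candidate              # validated patch: all tests pass
        feedback = "\n".join(failures)    # error messages drive the next attempt
    return None                           # give up after max_rounds

# Stub "LLM": first attempt is still buggy (always returns a),
# second attempt is the real fix.
attempts = iter([
    lambda a, b: a if a > b else a,   # buggy patch
    lambda a, b: a if a > b else b,   # correct patch
])
def stub_llm(feedback):
    return next(attempts)

fixed = iterative_repair([((1, 2), 2), ((3, 1), 3)], stub_llm)
```

The loop terminates either with a patch that passes every test or with `None`, mirroring how agentic tools bound the number of debugging rounds per bug.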
Key Challenges
Overfitting patches — candidates that pass the test suite without fixing the underlying bug (test-suite overfitting)
Fault localization bottleneck — repair quality is bounded by the accuracy of finding the right location to fix
Complex bugs — multi-location, multi-file bugs requiring coordinated changes remain very difficult
Patch quality — generated patches may fix the bug but introduce code smells or degrade readability
Evaluation reliability — benchmarks like Defects4J have a limited number of bugs, leading to high variance
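Test-suite overfitting, the first challenge above, is easy to demonstrate. The sketch below (all function names are illustrative) shows a "patch" that hard-codes the one failing input: it passes the visible test suite used during repair but fails a held-out test, while a genuine fix passes both.

```python
def buggy_floor_div(a, b):
    # Bug: int() truncates toward zero, so negatives round the wrong way
    return int(a / b)

def overfit_patch(a, b):
    # Overfitting "fix": special-cases the one input the tests exercise
    if (a, b) == (-7, 2):
        return -4
    return int(a / b)

def correct_patch(a, b):
    # Genuine fix: true floor division
    return a // b

def passes(fn, tests):
    return all(fn(*args) == expected for args, expected in tests)

visible_tests = [((7, 2), 3), ((-7, 2), -4)]   # suite used during repair
held_out      = [((-9, 2), -5)]                # unseen input exposes overfitting
```

Both patches are "valid" to a generate-and-validate pipeline that only checks `visible_tests`; only the held-out test distinguishes them, which is why overfitting inflates benchmark fix rates.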
Quick Recommendations
Production bug repair
Claude 3.5 Sonnet + test-guided repair loop
Best combination of code understanding, fix generation, and iterative debugging
Autonomous repair agent
SWE-agent / OpenHands + Claude/GPT-4
Full pipeline from bug report to validated patch
CI/CD integration
GitHub Copilot autofix / CodeGuru
Automated fix suggestions in pull request reviews
Research benchmarking
Defects4J / BugsInPy / SWE-bench
Standard bug suites with reproducible evaluation protocols
What's Next
The frontier is proactive repair — fixing bugs before they reach users. Expect CI/CD-integrated repair agents that automatically diagnose and fix failing builds, combined with formal verification to guarantee patch correctness. Multi-file, multi-step repair requiring architectural understanding remains the next capability hurdle.
Benchmarks & SOTA
Related Tasks