Bug Detection · 2019 · en
Bugs2Fix: Learning to Rewrite Buggy Code
A bug detection and repair benchmark of roughly 2.4M Java methods mined from GitHub commits labeled as bug fixes. It is widely used to evaluate the bug detection capabilities of LLMs. The primary metric is accuracy (correct bug classification).
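As a rough illustration of the primary metric, the sketch below computes classification accuracy as a percentage. It assumes each method receives a binary buggy / not-buggy prediction that is compared with a ground-truth label. The function name `accuracy` and the sample labels are hypothetical and do not come from the benchmark's own evaluation code.

```python
from typing import Sequence


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Return the percentage of predictions that match the ground-truth labels.

    Assumption: True means "buggy" and False means "not buggy".
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical labels for four methods; three of the four predictions match.
predictions = [True, False, True, True]
labels = [True, False, False, True]
print(f"{accuracy(predictions, labels):.1f}")
```

Under this reading, a score of 78.6 would mean that 78.6% of the evaluated methods were classified correctly.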
Current State of the Art
GPT-4o
OpenAI
78.6
accuracy
Accuracy Progress Over Time
Showing 5 breakthroughs from Feb 2020 to Jul 2024
Key Milestones
Jun 2024
DeepSeek-Coder-V2-Instruct
Bug detection accuracy. DeepSeek-Coder-V2 evaluation.
75.3
+10.4%
Jul 2024
GPT-4o (Current SOTA)
Bug detection accuracy. LLM bug detection evaluation study (arxiv:2407.01511).
78.6
+4.4%
Total Improvement
25.8%
Time Span
4y 6m
Breakthroughs
5
Current SOTA
78.6
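The page does not say how the percentage figures are derived. They are consistent with relative gains over the previous best score, so the sketch below works under that assumption. The helper `relative_gain` is a hypothetical name, and the scores are taken from the leaderboard on this page.

```python
def relative_gain(previous: float, current: float) -> float:
    """Relative improvement of `current` over `previous`, in percent."""
    return 100.0 * (current - previous) / previous


# Scores as listed on this page.
codebert = 62.5
codet5_plus = 68.2
deepseek_coder_v2 = 75.3
gpt_4o = 78.6

# Milestone gains, each measured against the previous best score (assumed).
print(f"DeepSeek-Coder-V2 over CodeT5+: +{relative_gain(codet5_plus, deepseek_coder_v2):.1f}%")
print(f"GPT-4o over DeepSeek-Coder-V2: +{relative_gain(deepseek_coder_v2, gpt_4o):.1f}%")

# Total improvement from the earliest listed score to the current SOTA.
print(f"Total improvement: {relative_gain(codebert, gpt_4o):.1f}%")

# Score range is the absolute gap between the best and worst listed scores.
print(f"Score range: {gpt_4o - codebert:.1f}")
```

Read this way, the gains are relative percentages rather than percentage points: GPT-4o is 3.3 points above DeepSeek-Coder-V2-Instruct, which is about 4.4% of 75.3.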
Top Models Performance Comparison
Top 6 models ranked by accuracy
Best Score
78.6
Top Model
GPT-4o
Models Compared
6
Score Range
16.1
accuracy (Primary)
| # | Model | Access | Organization | Score | Date |
|---|---|---|---|---|---|
| 1 | GPT-4o | API | OpenAI | 78.6 | Mar 2026 |
| 2 | Qwen2.5-Coder-32B-Instruct | Open Source | Alibaba | 76.8 | Sep 2024 |
| 3 | DeepSeek-Coder-V2-Instruct | Open Source | DeepSeek | 75.3 | Jun 2024 |
| 4 | CodeT5+ | Open Source | Salesforce | 68.2 | May 2023 |
| 5 | UniXcoder | Open Source | Microsoft | 66.4 | Mar 2022 |
| 6 | CodeBERT | Open Source | Microsoft | 62.5 | Feb 2020 |
Related Papers (5)
Qwen2.5-Coder Technical Report
Sep 2024 · Models: Qwen2.5-Coder-32B-Instruct
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Jun 2024 · Models: DeepSeek-Coder-V2-Instruct
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
May 2023 · Models: CodeT5+
UniXcoder: Unified Cross-Modal Pre-Training for Code Representation
Mar 2022 · Models: UniXcoder
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Feb 2020 · Models: CodeBERT