Bug Detection | 2019 | en

Bugs2Fix: Learning to Rewrite Buggy Code

A bug detection and repair benchmark of ~2.4M Java methods mined from GitHub commits labeled as bug fixes. It is widely used to evaluate the bug detection capabilities of LLMs. The primary metric is accuracy (correct bug classification).

Samples: 2,400,000
Metrics: accuracy
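
A minimal sketch of how the accuracy metric could be computed, assuming each method carries a binary label (1 = buggy, 0 = not buggy). The page does not specify the label format, so the function name `accuracy` and the toy labels below are illustrative assumptions, not the benchmark's own evaluation script.

```python
def accuracy(predictions, labels):
    """Return the percentage of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy data (hypothetical, not taken from the benchmark):
# 1 = buggy, 0 = not buggy
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 0]
score = accuracy(pred, gold)  # 4 of 5 correct
print(f"accuracy: {score:.1f}")
```

Scores on this page (for example 78.6) are read here as percentages on the same 0 to 100 scale.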
Current State of the Art

GPT-4o (OpenAI): 78.6 accuracy

Accuracy Progress Over Time

Showing 5 breakthroughs from Feb 2020 to Jul 2024

[Line chart: accuracy (y-axis, 60.9 to 80.2) by date (x-axis, Feb 2020 to Jul 2024)]

Key Milestones

Feb 2020 | CodeBERT | 62.5
Bug detection accuracy on the Bugs2Fix test set. CodeBERT paper, Table 4.

Mar 2022 | UniXcoder | 66.4 (+6.2%)
Bug detection accuracy. UniXcoder paper.

May 2023 | CodeT5+ | 68.2 (+2.7%)
Bug detection accuracy. CodeT5+ paper (220M encoder-decoder variant).

Jun 2024 | DeepSeek-Coder-V2-Instruct | 75.3 (+10.4%)
Bug detection accuracy. DeepSeek-Coder-V2 evaluation.

Jul 2024 | GPT-4o (Current SOTA) | 78.6 (+4.4%)
Bug detection accuracy. LLM bug detection evaluation study (arxiv:2407.01511).

Total Improvement: 25.8%
Time Span: 4y 6m
Breakthroughs: 5
Current SOTA: 78.6
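
The percentage gains listed with the milestones are consistent with relative improvement over the previous milestone, and the total with relative improvement from the first milestone to the last. The sketch below reproduces that arithmetic from the scores on this page. The formula is inferred from the numbers rather than stated by the source, and the helper name `relative_gain` is illustrative.

```python
# Milestone scores as listed on this page, in chronological order.
milestones = [
    ("CodeBERT", 62.5),
    ("UniXcoder", 66.4),
    ("CodeT5+", 68.2),
    ("DeepSeek-Coder-V2-Instruct", 75.3),
    ("GPT-4o", 78.6),
]


def relative_gain(previous, current):
    """Relative improvement of `current` over `previous`, in percent."""
    return 100.0 * (current - previous) / previous


# Gain of each milestone over the one before it, rounded to one decimal.
step_gains = [
    round(relative_gain(prev_score, score), 1)
    for (_, prev_score), (_, score) in zip(milestones, milestones[1:])
]

# Gain of the latest milestone over the first one.
total_gain = round(relative_gain(milestones[0][1], milestones[-1][1]), 1)

print(step_gains)  # [6.2, 2.7, 10.4, 4.4]
print(total_gain)  # 25.8
```

Note that these are relative gains. The absolute gain from 62.5 to 78.6 is 16.1 points.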

Top Models Performance Comparison

Top 6 models ranked by accuracy

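[Bar chart: accuracy and % of best, per model]

1. GPT-4o: 78.6 (100.0% of best)
2. Qwen2.5-Coder-32B-Instruct: 76.8 (97.7% of best)
3. DeepSeek-Coder-V2-Instruct: 75.3 (95.8% of best)
4. CodeT5+: 68.2 (86.8% of best)
5. UniXcoder: 66.4 (84.5% of best)
6. CodeBERT: 62.5 (79.5% of best)
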
Best Score: 78.6
Top Model: GPT-4o
Models Compared: 6
Score Range: 16.1
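
The "% of best" figures and the score range can be derived from the six listed scores. The following is a small sketch under the assumption that "% of best" means each score divided by the top score and that the score range is the top score minus the lowest. The variable names are illustrative.

```python
# Scores of the six compared models, as listed on this page.
scores = {
    "GPT-4o": 78.6,
    "Qwen2.5-Coder-32B-Instruct": 76.8,
    "DeepSeek-Coder-V2-Instruct": 75.3,
    "CodeT5+": 68.2,
    "UniXcoder": 66.4,
    "CodeBERT": 62.5,
}

best = max(scores.values())
worst = min(scores.values())

# Each model's score as a percentage of the best score.
percent_of_best = {
    model: round(100.0 * score / best, 1) for model, score in scores.items()
}

# Spread between the best and the worst score, in accuracy points.
score_range = round(best - worst, 1)

for model, pct in percent_of_best.items():
    print(f"{model}: {pct}% of best")
print(f"score range: {score_range}")
```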

accuracy (Primary)

#  Model                       Access       Organization  Score  Date
1  GPT-4o                      API          OpenAI        78.6   Mar 2026
2  Qwen2.5-Coder-32B-Instruct  Open Source  Alibaba       76.8   Sep 2024
3  DeepSeek-Coder-V2-Instruct  Open Source  DeepSeek      75.3   Jun 2024
4  CodeT5+                     Open Source  Salesforce    68.2   May 2023
5  UniXcoder                   Open Source  Microsoft     66.4   Mar 2022
6  CodeBERT                    Open Source  Microsoft     62.5   Feb 2020


Bugs2Fix Benchmark - Bug Detection | CodeSOTA