Codesota · Benchmark · Bugs2Fix

Bugs2Fix.

A bug detection and repair benchmark of ~2.4M Java methods mined from GitHub commits labeled as bug fixes. It is widely used to evaluate the bug detection capabilities of LLMs. The primary metric is Accuracy (correct bug classification).

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Accuracy

Accuracy is the reported evaluation metric for Bugs2Fix. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
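
As a rough illustration of how an accuracy score like the ones on this leaderboard is computed, the sketch below counts the share of examples whose predicted label matches the reference label and reports it as a percentage. The page does not describe the evaluation protocol, so the binary buggy/not-buggy labels, the `accuracy` helper, and the toy data are all assumptions made for illustration, not the benchmark's official scoring script.

```python
def accuracy(predictions, labels):
    """Percentage of examples whose predicted label matches the reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical toy data (not from Bugs2Fix): 1 = buggy, 0 = not buggy.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"{accuracy(predictions, labels):.1f}")  # 6 of 8 correct -> 75.0
```

Under this reading, a listed score of 78.6 would mean roughly 786 correct classifications per 1,000 evaluated examples.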

Trust tiers for Accuracy: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | GPT-4o | verified | 78.6 | 2026 | Source | Bug detection accuracy. LLM bug detection evaluation study (arxiv:2407.01511). |
| 02 | Qwen2.5-Coder 32B | verified | 76.8 | 2024 | Paper, Code | Bug detection accuracy. Qwen2.5-Coder paper. |
| 03 | DeepSeek-Coder-V2-Instruct | verified | 75.3 | 2024 | Paper, Code | Bug detection accuracy. DeepSeek-Coder-V2 evaluation. |
| 04 | CodeT5+ | verified | 68.2 | 2023 | Paper, Code | Bug detection accuracy. CodeT5+ paper (220M encoder-decoder variant). |
| 05 | UniXcoder | verified | 66.4 | 2022 | Paper, Code | Bug detection accuracy. UniXcoder paper. |
| 06 | CodeBERT | verified | 62.5 | 2020 | Paper, Code | Bug detection accuracy on Bugs2Fix test set. CodeBERT paper Table 4. |