Bugs2Fix is a bug detection and repair benchmark of roughly 2.4M Java methods mined from GitHub commits labeled as bug fixes. It is widely used to evaluate the bug detection capabilities of LLMs. The primary metric is accuracy, the proportion of correct bug classifications.
Accuracy is the evaluation metric reported for Bugs2Fix, and higher is better. Codesota tracks published model scores on this metric so that readers can compare state-of-the-art results across sources and model families.
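As a rough illustration of the metric, the sketch below computes accuracy as the percentage of examples whose predicted class matches the reference label. It assumes a binary labeling (1 = buggy, 0 = not buggy) and that scores are reported as percentages. The function name `accuracy` and the sample data are hypothetical, and this is not the official evaluation script for the benchmark.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted class equals the reference class.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical data, assuming 1 = buggy and 0 = not buggy.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"{accuracy(predictions, labels):.1f}")  # prints 75.0
```

Six of the eight hypothetical predictions match their labels, so the sketch reports 75.0, which is on the same scale as the scores in the table.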
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | GPT-4o | verified | 78.6 | 2026 | Source |
| 02 | Qwen2.5-Coder 32B | verified | 76.8 | 2024 | Paper, Code |
| 03 | DeepSeek-Coder-V2-Instruct | verified | 75.3 | 2024 | Paper, Code |
| 04 | CodeT5+ | verified | 68.2 | 2023 | Paper, Code |
| 05 | UniXcoder | verified | 66.4 | 2022 | Paper, Code |
| 06 | CodeBERT | verified | 62.5 | 2020 | Paper, Code |