Bugs2Fix: Learning to Rewrite Buggy Code.

A bug detection and repair benchmark of ~2.4M Java methods mined from GitHub commits labeled as bug fixes. It is widely used to evaluate the bug detection capabilities of LLMs. The primary metric is accuracy (correct bug classification).
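As a rough illustration of the metric, accuracy can be read as the percentage of examples whose predicted label matches the reference label. The sketch below assumes a binary labeling (1 = buggy, 0 = not buggy); the function name and the toy data are hypothetical and are not taken from the benchmark's evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of predictions that equal the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical toy data (assumption: 1 = buggy, 0 = not buggy).
example_labels = [1, 0, 1, 1, 0]
example_predictions = [1, 0, 0, 1, 0]

# Four of the five predictions match the labels.
example_score = accuracy(example_predictions, example_labels)
print(f"accuracy: {example_score:.2f}")
```

Scores are expressed on a 0 to 100 scale here so that they line up with the values shown in the leaderboard.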


§ 01 · Leaderboard

Best published scores.

6 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: accuracy · higher is better

#	Model	Org	Submitted	Paper / code	accuracy
01	GPT-4o (API)	OpenAI	Mar 2026	arxiv	78.60
02	Qwen2.5-Coder 32B (OSS)	Alibaba	Sep 2024	Qwen2.5-Coder Technical Report · code	76.80
03	DeepSeek-Coder-V2-Instruct (OSS)	DeepSeek	Jun 2024	DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source… · code	75.30
04	CodeT5+ (OSS)	Salesforce	May 2023	CodeT5+: Open Code Large Language Models for Code Unders… · code	68.20
05	UniXcoder (OSS)	Microsoft	Mar 2022	UniXcoder: Unified Cross-Modal Pre-Training for Code Rep… · code	66.40
06	CodeBERT (OSS)	Microsoft	Feb 2020	CodeBERT: A Pre-Trained Model for Programming and Natura… · code	62.50

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
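The ordering described above can be sketched as a sort on score with submission date as the tie-breaker. The `Result` type and the `rank` function are illustrative names, and the sketch assumes that the earlier submission wins a tie, which the page does not state. Scores come from the table; exact dates are the ones given on the SOTA line.

```python
from datetime import date
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    submitted: date
    accuracy: float


def rank(results: List[Result]) -> List[Result]:
    """Sort by accuracy, highest first.

    Assumption: on equal accuracy, the earlier submission ranks higher.
    """
    return sorted(results, key=lambda r: (-r.accuracy, r.submitted))


RESULTS = [
    Result("CodeBERT", date(2020, 2, 19), 62.50),
    Result("UniXcoder", date(2022, 3, 7), 66.40),
    Result("CodeT5+", date(2023, 5, 13), 68.20),
    Result("DeepSeek-Coder-V2-Instruct", date(2024, 6, 17), 75.30),
    Result("Qwen2.5-Coder 32B", date(2024, 9, 19), 76.80),
    Result("GPT-4o", date(2026, 3, 27), 78.60),
]

ranked = rank(RESULTS)
for position, r in enumerate(ranked, start=1):
    print(f"{position:02d}  {r.model:<28}{r.accuracy:.2f}")
```

The first element of `ranked` corresponds to the shaded SOTA row.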

§ 03 · Progress

6 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy

Date	Model	Org	accuracy
Feb 19, 2020	CodeBERT	Microsoft	62.50
Mar 7, 2022	UniXcoder	Microsoft	66.40
May 13, 2023	CodeT5+	Salesforce	68.20
Jun 17, 2024	DeepSeek-Coder-V2-Instruct	DeepSeek	75.30
Sep 19, 2024	Qwen2.5-Coder 32B	Alibaba	76.80
Mar 27, 2026	GPT-4o	OpenAI	78.60

Fig 3 · SOTA-setting models only. 6 entries span Feb 2020 → Mar 2026.
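The SOTA line amounts to a running maximum over results in chronological order: an entry is kept only if it beats every earlier score. A minimal sketch follows, where `sota_line` is a hypothetical helper name and strict improvement (no ties) is an assumption.

```python
from datetime import date
from typing import List, Tuple

Entry = Tuple[date, str, float]  # (date, model, accuracy)


def sota_line(results: List[Entry]) -> List[Entry]:
    """Return the entries that strictly improved on the best earlier accuracy."""
    best = float("-inf")
    line: List[Entry] = []
    for entry in sorted(results, key=lambda e: e[0]):  # oldest first
        if entry[2] > best:
            best = entry[2]
            line.append(entry)
    return line


ENTRIES: List[Entry] = [
    (date(2020, 2, 19), "CodeBERT", 62.50),
    (date(2022, 3, 7), "UniXcoder", 66.40),
    (date(2023, 5, 13), "CodeT5+", 68.20),
    (date(2024, 6, 17), "DeepSeek-Coder-V2-Instruct", 75.30),
    (date(2024, 9, 19), "Qwen2.5-Coder 32B", 76.80),
    (date(2026, 3, 27), "GPT-4o", 78.60),
]

steps = sota_line(ENTRIES)
print(f"{len(steps)} SOTA-setting entries")
```

Because every listed result improved on its predecessor, all six entries survive the filter; a later submission scoring below the current best would stay in the leaderboard but be dropped here.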

§ 04 · Literature

5 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

Qwen2.5-Coder Technical Report
Sep 2024 · Qwen2.5-Coder 32B
arXiv · Code

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Jun 2024 · DeepSeek-Coder-V2-Instruct
arXiv · Code

CodeT5+: Open Code Large Language Models for Code Understanding and Generation
May 2023 · CodeT5+
arXiv · Code

UniXcoder: Unified Cross-Modal Pre-Training for Code Representation
Mar 2022 · UniXcoder
arXiv · Code

CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Feb 2020 · CodeBERT
arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.


What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
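One way to check a submission against the five items above before sending it is a small pre-flight script. The sketch below is only an illustration: the `Submission` fields, the `missing_items` helper, and all example values are hypothetical and do not reflect an official submission format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Submission:
    # All field names are hypothetical, chosen to mirror the checklist.
    checkpoint_or_endpoint: str = ""
    repro_commit: str = ""
    seed: Optional[int] = None
    environment: Dict[str, str] = field(default_factory=dict)
    scores: Dict[str, float] = field(default_factory=dict)
    contact: str = ""


def missing_items(sub: Submission,
                  declared_metrics: Sequence[str] = ("accuracy",)) -> List[str]:
    """Return the checklist items that the submission does not yet satisfy."""
    missing: List[str] = []
    if not sub.checkpoint_or_endpoint:
        missing.append("01 public checkpoint or API endpoint")
    if not sub.repro_commit or sub.seed is None:
        missing.append("02 reproduction script with frozen commit + seed")
    if "python" not in sub.environment:
        missing.append("03 declared evaluation environment")
    if any(metric not in sub.scores for metric in declared_metrics):
        missing.append("04 one row per declared metric")
    if not sub.contact:
        missing.append("05 contact")
    return missing


# Placeholder values for illustration only.
draft = Submission(
    checkpoint_or_endpoint="https://example.com/my-checkpoint",
    repro_commit="0123abc",
    seed=42,
    environment={"python": "3.11"},
    scores={"accuracy": 0.0},
    contact="maintainer@example.com",
)
print(missing_items(draft) or "all five items present")
```

This dataset declares a single metric, so `declared_metrics` defaults to accuracy alone.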
