Computer Code

Bug Detection

Identifying bugs and vulnerabilities in code.


Bug detection uses ML models to identify potential defects in source code — from simple syntax issues to complex logic errors, security vulnerabilities, and concurrency bugs. LLMs are increasingly competitive with traditional static analysis, especially for detecting semantic bugs that rule-based tools miss.

History

2018

DeepBugs (Pradel & Sen) learns name-based bug patterns from JavaScript code

2019

CuBERT (Google) applies BERT-style pretraining to bug detection tasks

2020

CodeBERT and GraphCodeBERT capture code semantics for defect prediction

2021

GREAT (Google) augments transformers with program-graph relations to detect variable-misuse bugs

2022

Codex shows that LLMs can identify bugs when given code and asked to review it

2023

GPT-4 demonstrates strong zero-shot bug detection across languages

2024

Amazon CodeGuru and Snyk integrate ML-based vulnerability detection

2024

Claude 3.5 Sonnet achieves high accuracy on CWE-based vulnerability benchmarks

2025

LLM-based code review tools detect logic bugs that static analysis misses

How Bug Detection Works

Bug Detection Pipeline
1

Code Representation

Source code is represented as text (for LLMs), AST nodes (for tree-based models), or graph structures (for GNN-based approaches combining control flow and data flow).
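As an illustration, here is how the same snippet looks as flat text versus a typed AST, using Python's standard `ast` module (the snippet and variable names are invented for this sketch):

```python
import ast

# Toy function with a latent bug: calling len() on a possibly-None default.
source = (
    "def count(items=None):\n"
    "    return len(items)\n"
)

# Text view (what an LLM sees): a flat character/token stream.
print(source.splitlines()[1].strip())

# Tree view (what an AST-based model sees): typed, nested nodes.
tree = ast.parse(source)
node_types = [type(n).__name__ for n in ast.walk(tree)]
print(node_types)
```

Graph-based approaches go one step further, adding control-flow and data-flow edges on top of these AST nodes.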

2

Pattern Learning

Models learn bug patterns from labeled datasets of buggy/fixed code pairs, or through pretraining on large code corpora that implicitly captures correct patterns.
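A minimal sketch of the corpus-statistics idea behind pattern learning: build bigram counts over tokens from presumed-correct code, then score new code by the fraction of bigrams never seen in training. Real systems use neural models over far larger corpora; the tiny corpus and `surprise` scoring here are illustrative only:

```python
from collections import Counter
import io
import tokenize

def tokens(src):
    # Tokenize Python source into a flat list of token strings.
    return [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.string.strip()]

def bigrams(toks):
    return list(zip(toks, toks[1:]))

# "Training" corpus of presumed-correct snippets.
corpus = [
    "for i in range(10):\n    total += i\n",
    "for x in range(n):\n    print(x)\n",
]
counts = Counter(bg for src in corpus for bg in bigrams(tokens(src)))

def surprise(src):
    # Higher score = more bigrams never seen in the corpus = more suspicious.
    bgs = bigrams(tokens(src))
    return sum(1 for bg in bgs if counts[bg] == 0) / max(len(bgs), 1)

print(surprise("for i in range(10):\n    total += i\n"))  # seen verbatim: 0.0
print(surprise("for i in range(10)\n    total += i\n"))   # missing colon: > 0
```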

3

Anomaly Detection

The model identifies code that deviates from learned patterns — unusual variable usage, suspicious control flow, potential null dereferences, or API misuse.
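A toy rule-based analogue of this step: flag self-comparisons such as `a == a`, a classic name-based bug pattern (the checker and example code are invented for illustration):

```python
import ast

class SelfCompare(ast.NodeVisitor):
    """Flag comparisons of an expression against itself, e.g. `x == x`."""
    def __init__(self):
        self.findings = []

    def visit_Compare(self, node):
        left = ast.dump(node.left)
        for right in node.comparators:
            # Structurally identical operands almost always signal a typo.
            if ast.dump(right) == left:
                self.findings.append((node.lineno, ast.unparse(node)))
        self.generic_visit(node)

src = (
    "def pick(a, b):\n"
    "    if a == a:\n"  # bug: likely meant `a == b`
    "        return a\n"
)
checker = SelfCompare()
checker.visit(ast.parse(src))
print(checker.findings)  # [(2, 'a == a')]
```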

4

Localization

The specific lines or expressions likely containing the bug are highlighted, with an explanation of the suspected issue.

5

Fix Suggestion

Advanced systems suggest a correction alongside the bug report, reducing developer effort from diagnosis to resolution.
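The last two steps can be sketched together: a small checker that localizes a known bug pattern (mutable default arguments) by line number and attaches the standard fix suggestion. The rule and message strings are invented for this sketch:

```python
import ast

SUGGESTION = "use `{name}=None` as the default and create the {kind} inside the function"

def check_mutable_defaults(src):
    """Flag mutable default arguments and suggest the conventional fix."""
    findings = []
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Defaults align with the trailing positional arguments.
            for arg, default in zip(node.args.args[-len(node.args.defaults):],
                                    node.args.defaults):
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    kind = {"List": "list", "Dict": "dict",
                            "Set": "set"}[type(default).__name__]
                    findings.append({
                        "line": default.lineno,                         # localization
                        "message": f"mutable default for `{arg.arg}`",  # diagnosis
                        "fix": SUGGESTION.format(name=arg.arg, kind=kind),
                    })
    return findings

report = check_mutable_defaults(
    "def add(item, bucket=[]):\n    bucket.append(item)\n    return bucket\n"
)
for f in report:
    print(f"line {f['line']}: {f['message']} -- {f['fix']}")
```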

Current Landscape

Bug detection in 2025 operates in two paradigms: (1) traditional static analysis tools (Semgrep, SonarQube) with ML enhancements for reducing false positives, and (2) LLM-based code review that understands code semantics. LLMs excel at detecting logic bugs and suggesting fixes but have higher false positive rates than tuned static analyzers. The practical sweet spot is combining both: static analysis for known vulnerability patterns (CWEs) and LLM review for semantic issues. The market includes standalone tools (Snyk, CodeGuru) and IDE-integrated solutions (Copilot code review).
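A minimal sketch of that combination: a cheap rule pass for a known pattern runs first, and a hypothetical `llm_review` stub (standing in for a real model call) is consulted only when the rules find nothing:

```python
import ast

def static_findings(src):
    # Rule pass: flag bare `except:` clauses, a classic known-pattern check.
    return [f"line {n.lineno}: bare except swallows all errors"
            for n in ast.walk(ast.parse(src))
            if isinstance(n, ast.ExceptHandler) and n.type is None]

def llm_review(src):
    # Hypothetical stand-in for a real LLM code-review call.
    return ["(LLM) review requested for semantic issues"]

def review(src):
    findings = static_findings(src)
    # Escalate to the (slower, costlier) LLM only when the rules find nothing.
    return findings if findings else llm_review(src)

print(review("try:\n    run()\nexcept:\n    pass\n"))
print(review("x = 1\n"))
```

Production setups typically run both passes and merge findings; the early-exit here just keeps the sketch short.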

Key Challenges

False positive rate — too many false alarms cause developers to ignore all warnings (alarm fatigue)

Dataset bias — bug detection datasets overrepresent simple bugs, while complex logic errors are underrepresented

Context sensitivity — many bugs are only bugs in specific contexts, requiring deep understanding of intended behavior

Ground truth — labeling code as buggy/correct requires expensive expert review; automated labels are noisy

Cross-project generalization — models trained on one codebase often perform poorly on different codebases

Quick Recommendations

General code review

Claude 3.5 Sonnet / GPT-4o with review prompting

Best at detecting semantic bugs through code understanding

Security vulnerability detection

Snyk Code / Semgrep + LLM

Combines rule-based CWE detection with ML-based pattern matching

CI/CD integration

Amazon CodeGuru / SonarQube

Production-ready tools with low false-positive rates

Research baseline

CodeBERT / UniXcoder fine-tuned on Devign/BigVul

Well-studied models with reproducible benchmarks

What's Next

The frontier is autonomous bug detection and repair — agents that continuously scan codebases, detect bugs, generate fixes, and create pull requests. Expect tighter integration with CI/CD pipelines, formal verification for critical code paths, and LLMs that can reason about concurrency and distributed system bugs.
