SWE-bench

SWE-bench, which tasks models with resolving real GitHub issues from popular Python repositories, became the defining benchmark for AI software engineering after its 2023 release by Princeton. On the Verified subset of 500 curated problems, resolve rates climbed from roughly 4% with raw GPT-4 to over 50% by mid-2025 with agentic scaffolds such as SWE-agent and Amazon Q Developer. What makes it uniquely challenging is the need to navigate large codebases, localize faults across multiple files, and produce patches that pass the repository's test suite: skills that require genuine multi-file reasoning, not just code generation.
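
For a concrete picture of what each task contains, the sketch below loads SWE-bench Verified with the HuggingFace datasets library and prints the key fields. The dataset path and field names follow the public princeton-nlp release on the Hugging Face Hub; treat them as assumptions and check the dataset card before relying on them.

```python
# Minimal sketch: inspect one SWE-bench Verified task instance.
# Assumes the public princeton-nlp/SWE-bench_Verified dataset and its
# documented field names; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["instance_id"])              # "<owner>__<repo>-<number>"-style identifier
print(task["repo"])                     # GitHub repository the issue comes from
print(task["base_commit"])              # commit to check out before patching
print(task["problem_statement"][:300])  # the issue text the agent must resolve

# Evaluation checks out base_commit, applies the model-generated patch, then
# runs the held-out FAIL_TO_PASS tests (which must now pass) and PASS_TO_PASS
# tests (which must keep passing). An instance counts as resolved only if both hold.
```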

Datasets: 1 · Results: 0 · Canonical metric: resolve-rate

SWE-bench Verified

500 manually verified GitHub issues, each confirmed solvable by human engineers. This is the canonical benchmark for software engineering agents; results are tracked per autonomous scaffold, not just per underlying model.

Primary metric: resolve-rate
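
To make the metric concrete, here is a hedged sketch of how a resolve-rate could be computed from per-instance outcomes. The results structure and function below are hypothetical, not the official SWE-bench evaluation harness, which produces its own report format; the counting rule (all FAIL_TO_PASS and PASS_TO_PASS tests green, over all 500 Verified instances) is the part that matters.

```python
# Hypothetical sketch of the resolve-rate counting rule; not the official harness.
from typing import Dict

def resolve_rate(results: Dict[str, Dict[str, bool]], total_instances: int = 500) -> float:
    """results maps instance_id -> {"fail_to_pass_ok": bool, "pass_to_pass_ok": bool}.

    Unattempted or errored instances simply count against the denominator,
    so leaderboards report resolve-rate over all 500 Verified tasks.
    """
    resolved = sum(
        1 for r in results.values() if r["fail_to_pass_ok"] and r["pass_to_pass_ok"]
    )
    return resolved / total_instances

# Example: an agent that fully resolves 260 of the 500 instances scores 52%.
example = {f"inst_{i}": {"fail_to_pass_ok": True, "pass_to_pass_ok": True} for i in range(260)}
print(f"{resolve_rate(example):.1%}")  # -> 52.0%
```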

Top 10

Leading models on SWE-bench Verified.

No results yet. Be the first to contribute.

What were you looking for on SWE-bench?

Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Agentic AI.

