SWE-bench
SWE-bench, which asks models to resolve real GitHub issues from popular Python repositories, became the defining benchmark for AI software engineering after its October 2023 release by Princeton researchers. On the Verified subset (500 curated problems), resolution rates climbed from roughly 4% with raw GPT-4 to over 50% by mid-2025 with agentic scaffolds such as SWE-agent and Amazon Q Developer. What makes it uniquely challenging is the need to navigate large codebases, localize faults, and produce patches that pass the project's own test suite: skills that demand genuine multi-file reasoning, not just code generation.
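To make the task format concrete, here is a minimal sketch of inspecting one task instance, assuming the Hugging Face `datasets` library and the public `princeton-nlp/SWE-bench_Verified` dataset. Field names follow the published schema; verify them against the version you pull.

```python
from datasets import load_dataset

# Load the 500-problem Verified split from the Hugging Face hub.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["base_commit"])        # commit the generated patch must apply to
print(task["problem_statement"])  # the GitHub issue text given to the agent
print(task["FAIL_TO_PASS"])       # tests that must flip from failing to passing
print(task["PASS_TO_PASS"])       # regression tests that must keep passing
```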
SWE-bench Verified
500 GitHub issues manually verified by human engineers as well-specified and solvable. The primary benchmark for software engineering agents; reported results come from autonomous scaffolds, so scores reflect the full agent system rather than raw model capability alone.
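As a sketch of how scaffold results are scored: the official `swebench` harness consumes a JSONL file of predictions, one patch per instance, then rebuilds each repository in Docker and replays its tests. The key names and CLI flags below follow the harness documentation and are assumptions to check against the version you install.

```python
import json

# Hypothetical single prediction: instance_id must match a dataset row,
# and model_patch is the unified diff the harness will try to apply.
predictions = [{
    "instance_id": "astropy__astropy-12907",   # example id from the dataset
    "model_name_or_path": "my-agent-v1",       # label that appears in results
    "model_patch": "diff --git a/astropy/...", # the agent's proposed fix (elided)
}]

with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# The harness then checks out each repo at base_commit, applies the patch,
# and reruns the FAIL_TO_PASS / PASS_TO_PASS tests, e.g.:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl \
#       --run_id demo
```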
Top 10
Leading models on SWE-bench Verified.
No results tracked yet.
All datasets
1 dataset tracked for this task.
Related tasks
Other tasks in Agentic AI.