SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
SWE-bench Verified
500 manually verified GitHub issues confirmed solvable by human engineers. The primary benchmark for software engineering agents. Results tracked from autonomous scaffolds (not just model capability).
Top 10
Leading models on SWE-bench Verified.
No results yet. Be the first to contribute.
All datasets
1 dataset tracked for this task.
Related tasks
Other tasks in Agentic AI.