Autonomous Coding

Autonomous coding (AI systems that write, debug, and ship software without human guidance) is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench). The field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives such as SWE-agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
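
These workflows share one basic shape: read the repository, propose an edit, run the tests, and iterate on the failures. Below is a minimal sketch of that loop in Python with a deliberately stubbed model call; `propose_patch` and its contract are illustrative assumptions, not any particular agent's API.

```python
import subprocess

def propose_patch(repo_dir: str, feedback: str) -> str:
    """Stub for a model call that returns a unified diff.

    A real agent would send repository context plus the latest test
    output to an LLM; this placeholder only marks where that happens.
    """
    raise NotImplementedError("wire up a model here")

def agent_loop(repo_dir: str, max_steps: int = 10) -> bool:
    """Edit-run-repeat: patch the repo, run the tests, feed failures back."""
    feedback = ""
    for _ in range(max_steps):
        patch = propose_patch(repo_dir, feedback)
        # A patch that fails to apply leaves the tree unchanged; the test
        # output below then tells the model that nothing improved.
        subprocess.run(["git", "apply"], cwd=repo_dir,
                       input=patch.encode(), capture_output=True)
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # suite is green: stop editing
        feedback = result.stdout  # failing output steers the next attempt
    return False
```

The loop's only ground-truth signal is the test suite, which is exactly what repository-level benchmarks exploit for scoring.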

Canonical Benchmark

SWE-bench Verified

Human-validated subset of 500 GitHub issues from real Python repositories. Models must produce a patch that passes held-out tests. The standard end-to-end benchmark for autonomous coding agents, exercising repository navigation, code editing, and test execution.
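
Scoring is mechanical: apply the candidate patch at the issue's base commit, then re-run the held-out tests. SWE-bench distinguishes tests the fix must newly pass (FAIL_TO_PASS) from tests that must not regress (PASS_TO_PASS). A minimal sketch of that check, assuming a checked-out repository with git and pytest available; the `Instance` type and function names here are illustrative, not the official harness.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Instance:
    repo_dir: str            # repo checked out at the issue's base commit
    fail_to_pass: list[str]  # tests the fix is expected to make pass
    pass_to_pass: list[str]  # tests that must keep passing (no regressions)

def run_tests(repo_dir: str, tests: list[str]) -> bool:
    """True if every selected pytest test passes."""
    result = subprocess.run(["pytest", "-q", *tests],
                            cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def is_resolved(instance: Instance, patch: str) -> bool:
    """Apply a model-generated patch, then re-run the held-out tests."""
    applied = subprocess.run(["git", "apply"], cwd=instance.repo_dir,
                             input=patch.encode(), capture_output=True)
    if applied.returncode != 0:
        return False  # malformed patches count as unresolved
    return (run_tests(instance.repo_dir, instance.fail_to_pass)
            and run_tests(instance.repo_dir, instance.pass_to_pass))
```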

Primary metric: pct_resolved, the percentage of issues for which the submitted patch passes all held-out tests.
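
The metric itself is a plain success rate over instances; a one-function sketch, reusing the illustrative `is_resolved` helper above:

```python
def pct_resolved(instances, patches) -> float:
    """Percentage of issues whose patch passes all held-out tests."""
    resolved = sum(is_resolved(inst, p) for inst, p in zip(instances, patches))
    return 100.0 * resolved / len(instances)
```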

Top 10

Leading models on SWE-bench Verified.

No results yet.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Agentic AI.