Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
Autonomous coding agents take a natural language task description and produce working code end-to-end — including planning, implementation, testing, and debugging. Devin (Cognition), Claude Code, and Cursor represent different points on the autonomy spectrum, with SWE-bench measuring real-world software engineering capability.
History
GitHub Copilot launches — first widely adopted AI code completion tool
Codex (OpenAI) demonstrates code generation from natural language on HumanEval
GPT-4 achieves 67% on HumanEval, a major jump from GPT-3.5's 48%
SWE-bench released — tests whether agents can resolve real GitHub issues
Devin (Cognition) announced as first AI software engineer; scores 13.86% on SWE-bench full
SWE-agent (Princeton) achieves 12.5% on SWE-bench with open tools
Cursor, Claude Code, and Windsurf popularize agentic coding IDEs
Claude 3.5 Sonnet reaches 49% on SWE-bench Verified with scaffolding
Claude Code and similar tools handle multi-file, multi-step coding tasks in production
OpenAI Codex agent and Google Jules enter the autonomous coding space
How Autonomous Coding Works
Task Understanding
The agent reads a task description (issue, feature request, bug report) and explores the relevant codebase to understand context.
Planning
A plan is formed — which files to modify, what approach to take, what tests to write — potentially iterating through multiple strategies.
Implementation
Code is written or modified across one or more files, using the model's understanding of the codebase architecture.
Testing & Debugging
The agent runs tests, reads error outputs, and iteratively fixes issues until tests pass.
Validation
Final changes are reviewed against the original task description, and a summary or PR description is generated.
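In code, these five stages collapse into a plan-act-observe loop. The sketch below is a minimal, hypothetical version of that loop, assuming a `query_model` callable (any chat-completion wrapper) and a single shell tool; real agents such as SWE-agent or Claude Code add structured file editors, retrieval, and guardrails on top of the same skeleton.

```python
import subprocess

MAX_STEPS = 20  # step budget for a single task


def run_shell(cmd: str, cwd: str) -> str:
    """Execute one shell command inside the repository and capture its output."""
    result = subprocess.run(cmd, shell=True, cwd=cwd,
                            capture_output=True, text=True, timeout=120)
    return (result.stdout + result.stderr)[-4000:]  # keep only the tail within context limits


def solve_task(task: str, repo_dir: str, query_model) -> str:
    """Minimal plan -> act -> observe loop.

    `query_model` is a hypothetical callable: it takes the transcript so far
    and returns the agent's next action as text, either
    'RUN: <shell command>' or 'DONE: <summary of the change>'.
    """
    transcript = [
        "You are a coding agent. Explore the repository, edit files with shell "
        "commands, and run the tests until they pass.",
        f"Task: {task}",
    ]
    for _ in range(MAX_STEPS):
        action = query_model("\n".join(transcript))
        if action.startswith("DONE:"):
            return action  # validation step: the agent reports its summary / PR description
        if action.startswith("RUN:"):
            observation = run_shell(action[len("RUN:"):].strip(), repo_dir)
            transcript += [action, f"Output:\n{observation}"]
    return "DONE: step budget exhausted without passing tests"
```

Everything in the outline above lives somewhere in this loop: understanding and planning happen inside the model call, implementation and debugging happen through the shell tool, and validation is the final DONE message.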
Current Landscape
Autonomous coding in 2025 exists on a spectrum from copilot-style inline suggestions to fully autonomous agents (Devin, Claude Code background tasks). The best agents now resolve roughly 80% of SWE-bench Verified issues, which are real GitHub bugs from popular repositories. The market is evolving rapidly, with Cursor, Claude Code, Windsurf, Cody, and others competing at different autonomy levels. The key differentiator is reliability: developers adopt tools they can trust to produce correct, well-structured code.
Key Challenges
Context window limits — real codebases are far larger than any model's context, requiring intelligent retrieval and exploration (see the sketch after this list)
Test oracle problem — agents need to write meaningful tests, not just tests that pass
Long-horizon planning — complex features require coordinating changes across many files over many steps
Environment interaction — setting up dependencies, running builds, and managing development environments
Evaluation gap — SWE-bench measures bug fixes, but real coding includes design decisions, trade-offs, and code quality
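The first challenge, fitting a large codebase into a limited context window, is usually attacked with retrieval before generation. The snippet below is a deliberately crude illustration that ranks files by keyword overlap with the issue text; production agents use embeddings, ctags, repository maps, or iterative grep-style exploration instead.

```python
from pathlib import Path


def retrieve_context(repo_dir: str, query_terms: list[str], top_k: int = 5) -> list[str]:
    """Rank Python files by how often they mention terms drawn from the issue text."""
    scores: dict[str, int] = {}
    for path in Path(repo_dir).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(term.lower()) for term in query_terms)
        if score:
            scores[str(path)] = score
    # Only the best-matching files are read into the prompt; the rest stay out of context.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Example: shortlist files for a bug report about datetime parsing.
# retrieve_context("path/to/repo", ["parse_datetime", "timezone", "ValueError"])
```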
Quick Recommendations
Daily development assistant
Claude Code / Cursor with Claude 3.5 Sonnet
Best balance of autonomy and developer control for real production work
Fully autonomous bug fixing
SWE-agent + Claude 3.5 Sonnet
Highest open-source SWE-bench performance with reproducible scaffolding
IDE integration
Cursor / Windsurf
Tightest integration with existing development workflows
Research and benchmarking
OpenHands / SWE-agent
Open-source frameworks for studying and improving autonomous coding agents
What's Next
The frontier is extending autonomous coding from single-issue fixes to multi-day feature development. Key advances needed: better codebase understanding via persistent memory, reliable multi-file refactoring, and autonomous CI/CD interaction. Expect convergence toward agents that pair with developers rather than replace them.
Benchmarks & SOTA
Terminal-Bench 2.0
Terminal-agent benchmark for software engineering, machine learning, security, data science, system administration, file operations, and related terminal workflows. Scores measure the agent harness and underlying model as one system.
State of the Art
Codex / GPT-5.5
OpenAI
82
accuracy
SWE-bench Verified (Agentic)
Human-validated subset of 500 GitHub issues from real Python repositories. Models must produce a patch that passes hidden tests. This is the standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing; a simplified sketch of the evaluation flow follows below.
State of the Art
Claude Opus 4.5
Anthropic
80.9
pct_resolved
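Resolution on SWE-bench Verified comes down to three mechanical steps: check out the repository at the issue's base commit, apply the agent's patch, and run the designated tests. The sketch below is a simplified stand-in assuming a plain git checkout and pytest; the official harness runs each instance in its own container and checks both the FAIL_TO_PASS and PASS_TO_PASS test sets.

```python
import subprocess


def evaluate_patch(repo_dir: str, base_commit: str, patch_file: str,
                   test_ids: list[str]) -> bool:
    """Simplified SWE-bench-style check: the instance counts as resolved only if
    the designated tests pass after the model's patch is applied."""
    def git(*args: str) -> None:
        subprocess.run(["git", *args], cwd=repo_dir, check=True)

    git("checkout", "--force", base_commit)  # reset the repo to its buggy state
    git("apply", patch_file)                 # apply the agent's proposed fix
    result = subprocess.run(["python", "-m", "pytest", *test_ids],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0
```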
Related Tasks
Task agents
AI agents are autonomous software systems that pursue goals and complete tasks on behalf of users: they perceive their environment, make decisions, and act without constant human intervention. They draw on reasoning, memory, planning, and learning, and are typically built around large language models (LLMs) and other tools to interpret information and carry out complex workflows.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.
Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpreting biological results.