
Autonomous Coding.

Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.

Canonical metric: pct_resolved

§ 02 · Canonical benchmark

The reference dataset.

SWE-bench Verified

A human-validated subset of 500 GitHub issues drawn from real Python repositories. For each issue, a model must produce a patch that passes hidden tests. It is the standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.

Primary metric: pct_resolved
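
As a rough illustration of how a score like pct_resolved could be computed, the sketch below assumes the metric is the percentage of benchmark instances whose submitted patch passes the hidden tests. The record shape, function name, and sample data are hypothetical and are not taken from the benchmark's own harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome for one benchmark issue (hypothetical record shape)."""
    instance_id: str
    tests_passed: bool  # True if the submitted patch passed the hidden tests


def pct_resolved(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved, rounded to one decimal place.

    Assumption: an instance counts as resolved only when its patch
    passes the hidden tests.
    """
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return round(100.0 * resolved / len(results), 1)


# Illustrative data only: 4 of 5 made-up instances are resolved.
sample = [
    InstanceResult("demo-1", True),
    InstanceResult("demo-2", True),
    InstanceResult("demo-3", False),
    InstanceResult("demo-4", True),
    InstanceResult("demo-5", True),
]

print(pct_resolved(sample))  # 80.0
```

Under this reading, a leaderboard entry of 80.9 would mean that roughly four out of five issues in the dataset were resolved.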


§ 03 · Top 10

Leading models.

Leading models on SWE-bench Verified.

#	Model	pct_resolved	Year	Source
★	Claude Opus 4.5	80.9	2026	paper ↗
2	Gemini 3 Pro	78.8	2026	paper ↗
3	GPT-5 Codex	74.9	2026	paper ↗


§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

SWE-bench Verified

CANONICAL

3 results · pct_resolved

Top: Claude Opus 4.5 — 80.9

Terminal-Bench 2.0

20 results · accuracy

Top: Codex / GPT-5.5 — 82.0

§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Task agents · Time Horizon · Tool Use
