SWE-bench
Can AI agents resolve real-world GitHub issues? The definitive benchmark for evaluating autonomous coding agents on 2,294 software engineering tasks drawn from 12 popular Python repositories.
80.9%
Current SOTA
Verified
2,294
Total Tasks
Full set
500
Verified Tasks
Human-checked
12
Source Repos
Python
1.96%
First Result
Oct 2023
What is SWE-bench?
SWE-bench is a benchmark that tests whether language models can solve real software engineering problems. Each task is a GitHub issue from a popular open-source Python project, paired with the human-written pull request that fixed it.
To "resolve" a task, an AI agent must produce a code patch that passes the project's test suite — including the specific tests added by the original fix. This means the agent must understand the codebase, locate the bug, write working code, and satisfy existing tests without breaking anything.
Unlike synthetic coding benchmarks (HumanEval, MBPP), SWE-bench uses real bugs from production codebases — Django, scikit-learn, SymPy, matplotlib — making it the closest proxy to actual developer work.
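Each task ships as a structured record. A minimal sketch of that record, with field names following the published dataset card (treat the exact schema, and the example `instance_id`, as assumptions to verify against the dataset itself):

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchTask:
    """One SWE-bench instance. Field names follow the public dataset
    card; check the actual schema before relying on them."""
    instance_id: str        # e.g. "django__django-11099" (hypothetical example)
    repo: str               # "owner/name" of the source GitHub project
    base_commit: str        # pre-fix commit the agent starts from
    problem_statement: str  # the GitHub issue text shown to the agent
    patch: str              # gold human-written fix (hidden from the agent)
    test_patch: str         # tests added by the original fix PR
    fail_to_pass: list = field(default_factory=list)  # tests the patch must flip to passing
    pass_to_pass: list = field(default_factory=list)  # tests that must stay green
```

The agent only sees `problem_statement` and the checkout at `base_commit`; `patch`, `test_patch`, and the two test lists are reserved for grading.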
How evaluation works
1. Agent receives the issue description and the codebase at the pre-fix commit
2. Agent explores the codebase, identifies affected files, and writes a patch
3. Patch is applied and the full test suite runs (including tests from the fix PR)
4. Task is "resolved" only if all relevant tests pass and no regressions occur
Dataset Variants
SWE-bench Full
Superseded · 2,294 tasks
Original complete benchmark from 12 Python repos.
SWE-bench Verified
Primary · 500 tasks
Human-validated subset (68.3% of candidates filtered out for quality). The standard evaluation split.
SWE-bench Lite
Active · 300 tasks
Smaller subset for cost-effective evaluation and rapid iteration.
SWE-bench Multimodal
New · 517 tasks
Issues with screenshots, diagrams, and visual elements.
SOTA Progress: 1.96% → 80.9%
SWE-bench Verified resolve rate over time. From barely functional to near-human in 28 months.
Leaderboard — SWE-bench Verified
Top models by resolve rate on the 500-task verified split. Updated February 2026.
| # | Model | Agent / Scaffold | Resolve % | Type | Date |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 (Anthropic) | Anthropic Internal | 80.9% | API | 2026-02 |
| 2 | Claude Opus 4.6 (Anthropic) | Anthropic Internal | 80.8% | API | 2026-02 |
| 3 | MiniMax M2.5 (MiniMax) | MiniMax Agent | 80.2% | Open | 2026-01 |
| 4 | GPT-5.2 (OpenAI) | OpenAI Internal | 80.0% | API | 2026-02 |
| 5 | Sonar Foundation (SonarSource) | Sonar Agent | 79.2% | API | 2026-01 |
| 6 | Claude Opus 4.5 (Anthropic) | Live-SWE-agent | 79.2% | API | 2026-01 |
| 7 | GLM-5 (Zhipu AI) | Zhipu Agent | 77.8% | Open | 2026-01 |
| 8 | Gemini 3 Pro (Google) | Live-SWE-agent | 77.4% | API | 2026-01 |
| 9 | Claude Sonnet 4.5 (Anthropic) | Anthropic Internal | 77.2% | API | 2025-12 |
| 10 | Kimi K2.5 (Moonshot AI) | Moonshot Agent | 76.8% | API | 2026-01 |
| 11 | Claude Opus 4.5 (Anthropic) | mini-SWE-agent v2 | 76.8% | API | 2026-02 |
| 12 | Gemini 3 Pro (Google) | Google Internal | 76.2% | API | 2025-12 |
| 13 | Gemini 3 Flash (Google) | mini-SWE-agent v2 | 75.8% | API | 2026-02 |
| 14 | DeepSeek V3.5 (DeepSeek) | DeepSeek Agent | 74.6% | Open | 2025-11 |
| 15 | Qwen 3 72B (Alibaba) | Qwen Agent | 72.4% | Open | 2025-10 |
Key Insights
Improvement since launch
From 1.96% (Claude 2, Oct 2023) to 80.9% (Claude Opus 4.5, Feb 2026) in just 28 months.
Scaffolding matters
The same model scores very differently depending on the agent scaffold. Claude Opus 4.5 ranges from 76.8% to 80.9% depending on the agent framework used.
MiniMax M2.5 at 80.2%
Open-weight models are now within one percentage point of the best proprietary systems (80.2% vs 80.9%), making deployment without an API dependency viable for enterprises.
Source Repositories
SWE-bench tasks are drawn from real issues in these 12 Python projects: astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray.
Evaluation Pipeline
Issue + Codebase
Agent gets issue text & repo at the pre-fix commit
Exploration
Agent navigates files, reads code, identifies relevant modules
Patch Generation
Agent writes a unified diff patch to fix the issue
Test Execution
Patch applied in Docker, full test suite runs (fail-to-pass + pass-to-pass)
Resolved?
All new tests pass & no regressions → task resolved
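The final grading rule reduces to a simple predicate. A sketch, assuming test results arrive as test-id → passed mappings (the mapping shape is illustrative, not the harness's actual output format):

```python
def is_resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """A task is resolved only if every fail-to-pass test now passes
    AND every pass-to-pass test still passes (no regressions).
    Inputs map test id -> bool (True = passed after the patch)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# A single regression is enough to fail the task:
is_resolved({"t_new": True}, {"t_old_1": True, "t_old_2": False})  # → False
```

Note the asymmetry: a patch that fixes the bug but breaks one unrelated test scores exactly the same as no patch at all.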
Why SWE-bench is Hard
Real codebases, not toys
Django alone has 500k+ lines of code. Agents must navigate complex module structures, understand framework patterns, and modify the right files among thousands.
Strict test validation
It's not enough to "look right." Patches must make fail-to-pass tests pass while keeping all pass-to-pass tests green. A single regression means failure.
Multi-file changes
Many issues require changes across multiple files — models, views, tests, migrations. Agents must reason about dependencies across the codebase.
Under-specified issues
Real GitHub issues are often vague. The agent must infer intent, reproduce the bug, and figure out the correct fix — just like a human developer would.
Key Papers
Foundational papers that define SWE-bench and the leading agent architectures.
Key GitHub Repositories
Open-source agents and frameworks that define the SWE-bench ecosystem.
SWE-bench/SWE-bench
Official benchmark framework & evaluation harness
princeton-nlp/SWE-agent
Agent-computer interface for SWE tasks
All-Hands-AI/OpenHands
Open platform for AI software developers
Aider-AI/aider
AI pair programming in your terminal
nus-apr/auto-code-rover
Autonomous program improvement
OpenAutoCoder/Agentless
Agentless approach to SWE tasks
cognition-labs/devin
AI software engineer (website/waitlist)
Metrics
Resolve Rate (%)
Primary metric. Percentage of tasks where the generated patch passes all fail-to-pass tests without introducing regressions.
Apply Rate (%)
Percentage of patches that cleanly apply to the codebase. A patch that fails to apply counts as unresolved.
Cost ($)
Total API cost per evaluation run. Important for practical deployment — some agents cost $300+ per full evaluation.
Avg. API Calls
Mean number of LLM API calls per task. Indicates agent efficiency and latency characteristics.
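The four metrics above fall out of simple aggregation over per-task records. A sketch, assuming a hypothetical record shape of `{"applied", "resolved", "api_calls"}` per task:

```python
def summarize(runs: list) -> dict:
    """Aggregate per-task records into leaderboard metrics.
    Each record: {"applied": bool, "resolved": bool, "api_calls": int}
    (the record shape is an assumption for illustration)."""
    n = len(runs)
    return {
        "resolve_rate_pct": 100.0 * sum(r["resolved"] for r in runs) / n,
        "apply_rate_pct": 100.0 * sum(r["applied"] for r in runs) / n,
        "avg_api_calls": sum(r["api_calls"] for r in runs) / n,
    }

runs = [
    {"applied": True,  "resolved": True,  "api_calls": 12},
    {"applied": True,  "resolved": False, "api_calls": 30},
    {"applied": False, "resolved": False, "api_calls": 8},
    {"applied": True,  "resolved": True,  "api_calls": 10},
]
summarize(runs)  # → {"resolve_rate_pct": 50.0, "apply_rate_pct": 75.0, "avg_api_calls": 15.0}
```

Because an unapplied patch counts as unresolved, resolve rate can never exceed apply rate.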
Related Benchmarks
| Benchmark | Focus | Tasks | Real code? |
|---|---|---|---|
| SWE-bench Verified | Full SE tasks | 500 | Yes |
| SWE-bench Pro | Harder SE tasks | Private | Yes |
| HumanEval | Function synthesis | 164 | No (synthetic) |
| MBPP | Basic Python tasks | 974 | No (synthetic) |
| LiveCodeBench | Competitive coding | Rolling | Semi (LeetCode-style) |
| RE-Bench | Research engineering | 7 | Yes |
| HCAST | Security + AI R&D | 90 | Yes |
Access the Benchmark
SWE-bench is fully open-source. Run evaluations with Docker locally or in the cloud.
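A local run typically looks like the following sketch (flag names are taken from the SWE-bench README; verify them against your installed version, and note that Docker must be running):

```shell
# Install the official evaluation harness
pip install swebench

# Evaluate a predictions file against the Verified split inside Docker.
# --run_id names the output directory; my_preds.jsonl is your file.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ./my_preds.jsonl \
    --max_workers 4 \
    --run_id my-first-run
```

Each prediction record pairs an `instance_id` with the model's patch; the harness builds per-task Docker images, applies patches, and reports resolve/apply rates.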
Track every AI benchmark in one place
CodeSOTA tracks state-of-the-art results across 200+ benchmarks in agentic AI, NLP, computer vision, code, and more.