ICLR 2024 · Princeton NLP · Active Benchmark

SWE-bench

Can AI agents resolve real-world GitHub issues? The definitive benchmark for evaluating autonomous coding agents on 2,294 software engineering tasks drawn from 12 popular Python repositories.

79.2%

Current SOTA

Verified

2,294

Total Tasks

Full set

500

Verified Tasks

Human-checked

12

Source Repos

Python

1.96%

First Result

Oct 2023

What is SWE-bench?

SWE-bench is a benchmark that tests whether language models can solve real software engineering problems. Each task is a GitHub issue from a popular open-source Python project, paired with the human-written pull request that fixed it.

To "resolve" a task, an AI agent must produce a code patch that passes the project's test suite — including the specific tests added by the original fix. This means the agent must understand the codebase, locate the bug, write working code, and satisfy existing tests without breaking anything.

Unlike synthetic coding benchmarks (HumanEval, MBPP), SWE-bench uses real bugs from production codebases — Django, scikit-learn, SymPy, matplotlib — making it the closest proxy to actual developer work.
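Each benchmark record pairs the issue with the gold fix and its tests. A minimal sketch of a task record is below; the field names follow the published Hugging Face dataset schema, but the values here are invented for illustration, not a real task.

```python
from dataclasses import dataclass

@dataclass
class SWEBenchTask:
    """Sketch of the core fields of a SWE-bench task record (simplified)."""
    instance_id: str        # e.g. "django__django-00000" (invented here)
    repo: str               # source repository, e.g. "django/django"
    base_commit: str        # commit the agent starts from (pre-fix)
    problem_statement: str  # GitHub issue text shown to the agent
    patch: str              # human-written gold patch (hidden from the agent)
    test_patch: str         # tests added by the fixing PR
    FAIL_TO_PASS: list      # tests that must flip from failing to passing
    PASS_TO_PASS: list      # tests that must stay green (no regressions)

# Illustrative instance with made-up values:
task = SWEBenchTask(
    instance_id="django__django-00000",
    repo="django/django",
    base_commit="abc123",
    problem_statement="QuerySet.count() raises TypeError when ...",
    patch="diff --git a/django/db/models/query.py ...",
    test_patch="diff --git a/tests/queries/tests.py ...",
    FAIL_TO_PASS=["tests.queries.tests.CountTests.test_count"],
    PASS_TO_PASS=["tests.queries.tests.QuerySetTests.test_filter"],
)
print(task.instance_id, "from", task.repo)
```

The agent sees only `problem_statement` and the repository at `base_commit`; `patch` and `test_patch` are reserved for evaluation.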

How evaluation works

  1. Agent receives the issue description and codebase at the pre-fix commit
  2. Agent explores the codebase, identifies affected files, and writes a patch
  3. Patch is applied and the full test suite runs (including tests from the fix PR)
  4. Task is "resolved" only if all relevant tests pass and no regressions occur
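The resolution criterion in step 4 can be sketched as a pure function over test outcomes. This is a simplification of the official harness, which parses real test-runner logs; here test results are just a name-to-boolean map.

```python
def is_resolved(results: dict,
                fail_to_pass: list,
                pass_to_pass: list) -> bool:
    """Return True iff every FAIL_TO_PASS test now passes and every
    PASS_TO_PASS test still passes. A missing result counts as a failure."""
    new_tests_ok = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return new_tests_ok and no_regressions

# A single regression is enough to fail the task:
results = {"test_fix": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_fix"],
                  ["test_existing_a", "test_existing_b"]))  # False
```

Note the asymmetry: the patch must both flip the new tests and avoid breaking anything that already worked, which is why "looks plausible" patches often still score zero.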

Dataset Variants

SWE-bench Full

Superseded

2,294

Original complete benchmark from 12 Python repos.

SWE-bench Verified

Primary

500

Human-validated subset; 68.3% of candidate tasks were filtered out for quality issues. The standard evaluation split.

SWE-bench Lite

Active

300

Smaller subset for cost-effective evaluation and rapid iteration.

SWE-bench Multimodal

New

517

Issues with screenshots, diagrams, and visual elements.

SOTA Progress: 1.96% → 79.2%

SWE-bench Verified resolve rate over time. Current CodeSOTA coverage follows the official all-agent leaderboard.

2023-10    1.96%   Claude 2 + SWE-agent (SWE-bench launch)
2024-03   12.5%    SWE-agent + GPT-4 (first scaffold)
2024-05   13.8%    Devin (first "AI developer")
2024-06   19%      AutoCodeRover (open-source catches up)
2024-08   27%      OpenHands + Claude 3.5
2024-10   36.2%    Amazon Q Developer Agent (enterprise agents enter)
2024-11   38.4%    Agentless + GPT-4o
2024-12   49%      Claude Sonnet 3.5 v2 (approaching the 50% barrier)
2025-03   55.2%    Claude Opus 4 + Aider
2025-06   62%      GPT-4.5 + Codex (breaking 60%)
2025-09   70.8%    Claude Sonnet 4.5 (breaking 70%)
2025-12   78%      Claude Opus 4.5
2025-12   79.2%    Claude Opus 4.5 + live-SWE-agent / Sonar (current official top)

Leaderboard — SWE-bench Verified

Top official all-agent results on the 500-task verified split. Updated May 2026.

 #  Model                     Org / Team             Agent / Scaffold        Resolve %  Type  Date
 1  Claude Opus 4.5 (medium)  Anthropic / UIUC       live-SWE-agent          79.2%      API   2025-12
 2  Claude Opus 4.5           Anthropic / Sonar      Sonar Foundation Agent  79.2%      API   2025-12
 3  Doubao-Seed-Code          ByteDance              TRAE                    78.8%      API   2025-09
 4  Gemini 3 Pro Preview      Google / UIUC          live-SWE-agent          77.4%      API   2025-11
 5  Claude Sonnet 4 + GPT-5   Atlassian              Rovo Dev                76.8%      API   2025-09
 6  Claude Sonnet 4           EPAM                   AI/Run Developer Agent  76.8%      API   2025-08
 7  Claude Opus 4.5 (high)    Anthropic / SWE-agent  mini-SWE-agent v2       76.8%      API   2026-02
 8  Mixed frontier models     ACoder                 ACoder                  76.4%      API   2025-08
 9  Gemini 3 Flash (high)     Google / SWE-agent     mini-SWE-agent v2       75.8%      API   2026-02
10  MiniMax M2.5 (high)       MiniMax / SWE-agent    mini-SWE-agent v2       75.8%      API   2026-02
11  Warp mixed models         Warp                   Warp                    75.6%      API   2025-09
12  Claude Opus 4.6           Anthropic / SWE-agent  mini-SWE-agent v2       75.6%      API   2026-02

Key Insights

~40×

Improvement since launch

From 1.96% (Claude 2, Oct 2023) to 79.2% on the official all-agent Verified leaderboard.

Agent > Model

Scaffolding matters

The same model scores differently depending on the agent scaffold: Claude Opus 4.5 reaches 79.2% with live-SWE-agent or the Sonar Foundation Agent, but 76.8% in the bash-only mini-SWE-agent v2 setting.

New entrants

Doubao, Gemini, MiniMax

Recent official rows include Doubao-Seed-Code at 78.8%, Gemini 3 Pro Preview at 77.4%, and MiniMax M2.5 in the mini-SWE-agent slice at 75.8%.

Source Repositories

SWE-bench tasks are drawn from real issues in these 12 Python projects: astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray.

Evaluation Pipeline

📋Step 1

Issue + Codebase

Agent gets issue text & repo at the pre-fix commit

🔍Step 2

Exploration

Agent navigates files, reads code, identifies relevant modules

🛠Step 3

Patch Generation

Agent writes a unified diff patch to fix the issue

Step 4

Test Execution

Patch applied in Docker, full test suite runs (fail-to-pass + pass-to-pass)

🏁Step 5

Resolved?

All new tests pass & no regressions → task resolved

Why SWE-bench is Hard

Real codebases, not toys

Django alone has 500k+ lines of code. Agents must navigate complex module structures, understand framework patterns, and modify the right files among thousands.

Strict test validation

It's not enough to "look right." Patches must make fail-to-pass tests pass while keeping all pass-to-pass tests green. A single regression means failure.

Multi-file changes

Many issues require changes across multiple files — models, views, tests, migrations. Agents must reason about dependencies across the codebase.

Under-specified issues

Real GitHub issues are often vague. The agent must infer intent, reproduce the bug, and figure out the correct fix — just like a human developer would.

Key Papers

Foundational papers that define SWE-bench and the leading agent architectures.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan · ICLR 2024 · 850 citations

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, Jimenez, Wettig, Lieret, Yao, Pei, Press, Narasimhan · NeurIPS 2024 · 420 citations

Agentless: Demystifying LLM-based Software Engineering Agents
Xia, Wen, Deng, Kang, Zou, Zhang · ICSE 2025 · 310 citations

AutoCodeRover: Autonomous Program Improvement
Zhang, Chen, Wen, Cao, Chen, Xia, Noller · ISSTA 2024 · 190 citations

SWE-bench Verified
OpenAI, SWE-bench Team · Blog Post 2024

SWE-bench Multimodal: Can AI Agents Handle Visual Issues?
Yang, Jimenez et al. · arXiv 2024 · 45 citations

Key GitHub Repositories

Open-source agents and frameworks that define the SWE-bench ecosystem.

Metrics

Resolve Rate (%)

Primary metric. Percentage of tasks where the generated patch passes all fail-to-pass tests without introducing regressions.

Apply Rate (%)

Percentage of patches that cleanly apply to the codebase. A patch that fails to apply counts as unresolved.

Cost ($)

Total API cost per evaluation run. Important for practical deployment — some agents cost $300+ per full evaluation.

Avg. API Calls

Mean number of LLM API calls per task. Indicates agent efficiency and latency characteristics.
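The four metrics above can all be computed from per-task run records. A small sketch over hypothetical records follows; the field names (`applied`, `resolved`, `api_calls`, `cost_usd`) are illustrative, not the official report format.

```python
# Hypothetical per-task records, as an evaluation harness might log them.
runs = [
    {"instance_id": "django__django-1", "applied": True,  "resolved": True,  "api_calls": 14, "cost_usd": 0.42},
    {"instance_id": "sympy__sympy-2",   "applied": True,  "resolved": False, "api_calls": 31, "cost_usd": 0.97},
    {"instance_id": "flask__flask-3",   "applied": False, "resolved": False, "api_calls": 9,  "cost_usd": 0.12},
    {"instance_id": "pytest__pytest-4", "applied": True,  "resolved": True,  "api_calls": 22, "cost_usd": 0.63},
]

n = len(runs)
resolve_rate = 100 * sum(r["resolved"] for r in runs) / n  # primary metric
apply_rate   = 100 * sum(r["applied"] for r in runs) / n   # unapplied => unresolved
total_cost   = sum(r["cost_usd"] for r in runs)            # whole-run API spend
avg_calls    = sum(r["api_calls"] for r in runs) / n       # efficiency proxy

print(f"resolve {resolve_rate:.1f}%  apply {apply_rate:.1f}%  "
      f"cost ${total_cost:.2f}  avg calls {avg_calls:.1f}")
# -> resolve 50.0%  apply 75.0%  cost $2.14  avg calls 19.0
```

Note that the apply rate upper-bounds the resolve rate: a patch that never applies cannot resolve anything, so a large gap between the two usually points at patch-formatting problems rather than reasoning failures.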

Related Benchmarks

BenchmarkFocusTasksReal code?
SWE-bench VerifiedFull SE tasks500Yes
SWE-bench ProHarder SE tasksPrivateYes
HumanEvalFunction synthesis164No (synthetic)
MBPPBasic Python tasks974No (synthetic)
LiveCodeBenchCompetitive codingRollingSemi (LeetCode-style)
RE-BenchResearch engineering7Yes
HCASTSecurity + AI R&D90Yes

Access the Benchmark

SWE-bench is fully open-source. Run evaluations with Docker locally or in the cloud.
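A sketch of a local evaluation run with the official harness (Docker required). The flags below follow the SWE-bench repository README at the time of writing; verify against the current docs before running. Passing `gold` as the predictions path evaluates the human-written reference patches.

```shell
# Install the harness, then evaluate the gold patches on the Lite split.
pip install swebench
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path gold \
    --max_workers 8 \
    --run_id validate-gold
```

Running the gold patches first is a useful sanity check: if they do not all resolve, the environment (not the agent) is misconfigured.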
