ICLR 2024 · Princeton NLP · Active Benchmark

SWE-bench

Can AI agents resolve real-world GitHub issues? The definitive benchmark for evaluating autonomous coding agents on 2,294 software engineering tasks drawn from 12 popular Python repositories.

80.9% · Current SOTA (Verified)

2,294 · Total Tasks (full set)

500 · Verified Tasks (human-checked)

12 · Source Repos (Python)

1.96% · First Result (Oct 2023)

What is SWE-bench?

SWE-bench is a benchmark that tests whether language models can solve real software engineering problems. Each task is a GitHub issue from a popular open-source Python project, paired with the human-written pull request that fixed it.

To "resolve" a task, an AI agent must produce a code patch that passes the project's test suite — including the specific tests added by the original fix. This means the agent must understand the codebase, locate the bug, write working code, and satisfy existing tests without breaking anything.

Unlike synthetic coding benchmarks (HumanEval, MBPP), SWE-bench uses real bugs from production codebases — Django, scikit-learn, SymPy, matplotlib — making it the closest proxy to actual developer work.

How evaluation works

  1. Agent receives the issue description and the codebase at the pre-fix commit
  2. Agent explores the codebase, identifies the affected files, and writes a patch
  3. The patch is applied and the full test suite runs (including tests from the fix PR)
  4. The task is "resolved" only if all relevant tests pass and no regressions occur
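The steps above can be sketched as a per-task loop. This is illustrative only, not the official harness: the real evaluation runs inside Docker, and `run_tests`, `evaluate_task`, and the task field names here are assumptions for the sketch.

```python
"""Sketch of a SWE-bench-style per-task evaluation loop (illustrative)."""
import subprocess

def decide_resolved(fail_to_pass, pass_to_pass, passed):
    """Resolved = every fail-to-pass test now passes AND no pass-to-pass regression."""
    return set(fail_to_pass) <= set(passed) and set(pass_to_pass) <= set(passed)

def evaluate_task(repo_dir, task, run_tests):
    # Step 1: check out the codebase at the pre-fix commit.
    subprocess.run(["git", "-C", repo_dir, "checkout", task["base_commit"]], check=True)
    # Steps 2-3: the agent's patch is an input here; one that fails to apply is unresolved.
    applied = subprocess.run(
        ["git", "-C", repo_dir, "apply", task["patch_file"]]
    ).returncode == 0
    if not applied:
        return {"applied": False, "resolved": False}
    # Step 4: run the suite and check both test groups.
    passed = run_tests(repo_dir)  # returns the set of test IDs that passed
    return {
        "applied": True,
        "resolved": decide_resolved(task["FAIL_TO_PASS"], task["PASS_TO_PASS"], passed),
    }
```

The key design point is the two-sided check in `decide_resolved`: fixing the bug is not enough; the patch must also leave every previously passing test green.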

Dataset Variants

SWE-bench Full · Superseded · 2,294 tasks
Original complete benchmark drawn from 12 Python repos.

SWE-bench Verified · Primary · 500 tasks
Human-validated subset; 68.3% of candidate tasks were filtered out for quality. The standard evaluation split.

SWE-bench Lite · Active · 300 tasks
Smaller subset for cost-effective evaluation and rapid iteration.

SWE-bench Multimodal · New · 517 tasks
Issues with screenshots, diagrams, and other visual elements.

SOTA Progress: 1.96% → 80.9%

SWE-bench Verified resolve rate over time. From barely functional to near-human in 28 months.

2023-10 · 1.96% · Claude 2 + SWE-agent (SWE-bench launch)
2024-03 · 12.5% · SWE-agent + GPT-4 (first scaffold)
2024-05 · 13.8% · Devin (first "AI developer")
2024-06 · 19% · AutoCodeRover (open source catches up)
2024-08 · 27% · OpenHands + Claude 3.5
2024-10 · 36.2% · Amazon Q Developer Agent (enterprise agents enter)
2024-11 · 38.4% · Agentless + GPT-4o
2024-12 · 49% · Claude Sonnet 3.5 v2 (approaching the 50% barrier)
2025-03 · 55.2% · Claude Opus 4 + Aider
2025-06 · 62% · GPT-4.5 + Codex (breaking 60%)
2025-09 · 70.8% · Claude Sonnet 4.5 (breaking 70%)
2025-12 · 78% · Claude Opus 4.5
2026-02 · 80.9% · Claude Opus 4.5 (breaking 80%)

Leaderboard — SWE-bench Verified

Top models by resolve rate on the 500-task verified split. Updated February 2026.

Official site →
# | Model | Org | Agent / Scaffold | Resolve % | Type | Date
1 | Claude Opus 4.5 | Anthropic | Anthropic Internal | 80.9% | API | 2026-02
2 | Claude Opus 4.6 | Anthropic | Anthropic Internal | 80.8% | API | 2026-02
3 | MiniMax M2.5 | MiniMax | MiniMax Agent | 80.2% | Open | 2026-01
4 | GPT-5.2 | OpenAI | OpenAI Internal | 80% | API | 2026-02
5 | Sonar Foundation | SonarSource | Sonar Agent | 79.2% | API | 2026-01
6 | Claude Opus 4.5 | Anthropic | Live-SWE-agent | 79.2% | API | 2026-01
7 | GLM-5 | Zhipu AI | Zhipu Agent | 77.8% | Open | 2026-01
8 | Gemini 3 Pro | Google | Live-SWE-agent | 77.4% | API | 2026-01
9 | Claude Sonnet 4.5 | Anthropic | Anthropic Internal | 77.2% | API | 2025-12
10 | Kimi K2.5 | Moonshot AI | Moonshot Agent | 76.8% | API | 2026-01
11 | Claude Opus 4.5 | Anthropic | mini-SWE-agent v2 | 76.8% | API | 2026-02
12 | Gemini 3 Pro | Google | Google Internal | 76.2% | API | 2025-12
13 | Gemini 3 Flash | Google | mini-SWE-agent v2 | 75.8% | API | 2026-02
14 | DeepSeek V3.5 | DeepSeek | DeepSeek Agent | 74.6% | Open | 2025-11
15 | Qwen 3 72B | Alibaba | Qwen Agent | 72.4% | Open | 2025-10

Key Insights

41×

Improvement since launch

From 1.96% (Claude 2, Oct 2023) to 80.9% (Claude Opus 4.5, Feb 2026) in just 28 months.
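The headline multiple is simple arithmetic on the two scores above, shown here as a quick check:

```python
# Sanity check of the 41x claim: latest SOTA over the launch result.
first, latest = 1.96, 80.9
factor = latest / first
print(round(factor, 1))  # prints 41.3
```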

Agent > Model

Scaffolding matters

The same model scores very differently depending on the agent scaffold. Claude Opus 4.5 ranges from 76.8% to 80.9% depending on the agent framework used.

Open catching up

MiniMax M2.5 at 80.2%

Open-weight models are now within 1% of the best proprietary systems, making enterprise deployment without API dependency viable.

Source Repositories

SWE-bench tasks are drawn from real issues in these 12 Python projects: astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray.

Evaluation Pipeline

📋 Step 1 · Issue + Codebase
Agent gets the issue text and the repo at the pre-fix commit.

🔍 Step 2 · Exploration
Agent navigates files, reads code, and identifies the relevant modules.

🛠 Step 3 · Patch Generation
Agent writes a unified diff patch to fix the issue.

🧪 Step 4 · Test Execution
Patch is applied in Docker; the full test suite runs (fail-to-pass + pass-to-pass).

🏁 Step 5 · Resolved?
All new tests pass and no regressions occur → task resolved.

Why SWE-bench is Hard

Real codebases, not toys

Django alone has 500k+ lines of code. Agents must navigate complex module structures, understand framework patterns, and modify the right files among thousands.

Strict test validation

It's not enough to "look right." Patches must make fail-to-pass tests pass while keeping all pass-to-pass tests green. A single regression means failure.

Multi-file changes

Many issues require changes across multiple files — models, views, tests, migrations. Agents must reason about dependencies across the codebase.

Under-specified issues

Real GitHub issues are often vague. The agent must infer intent, reproduce the bug, and figure out the correct fix — just like a human developer would.

Key Papers

Foundational papers that define SWE-bench and the leading agent architectures.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan · ICLR 2024 · 850 citations

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, Jimenez, Wettig, Lieret, Yao, Pei, Press, Narasimhan · NeurIPS 2024 · 420 citations

Agentless: Demystifying LLM-based Software Engineering Agents
Xia, Wen, Deng, Kang, Zou, Zhang · ICSE 2025 · 310 citations

AutoCodeRover: Autonomous Program Improvement
Zhang, Chen, Wen, Cao, Chen, Xia, Noller · ISSTA 2024 · 190 citations

SWE-bench Verified
OpenAI, SWE-bench Team · Blog Post, 2024

SWE-bench Multimodal: Can AI Agents Handle Visual Issues?
Yang, Jimenez et al. · arXiv 2024 · 45 citations

Key GitHub Repositories

Open-source agents and frameworks that define the SWE-bench ecosystem.

Metrics

Resolve Rate (%)

Primary metric. Percentage of tasks where the generated patch passes all fail-to-pass tests without introducing regressions.

Apply Rate (%)

Percentage of patches that cleanly apply to the codebase. A patch that fails to apply counts as unresolved.

Cost ($)

Total API cost per evaluation run. Important for practical deployment — some agents cost $300+ per full evaluation.

Avg. API Calls

Mean number of LLM API calls per task. Indicates agent efficiency and latency characteristics.
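The four metrics above can be aggregated from per-task records; the sketch below uses illustrative field names, not the official harness schema:

```python
def summarize(results):
    """Aggregate per-task records into benchmark-level metrics (illustrative)."""
    n = len(results)
    return {
        "resolve_rate_pct": 100 * sum(r["resolved"] for r in results) / n,
        "apply_rate_pct": 100 * sum(r["applied"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "avg_api_calls": sum(r["api_calls"] for r in results) / n,
    }

# Two toy records: one resolved task, one patch that failed to apply.
runs = [
    {"resolved": True,  "applied": True,  "cost_usd": 1.50, "api_calls": 30},
    {"resolved": False, "applied": False, "cost_usd": 0.25, "api_calls": 12},
]
print(summarize(runs))  # resolve 50.0%, apply 50.0%, $1.75 total, 21.0 avg calls
```

Note that a patch that fails to apply drags down both the apply rate and the resolve rate, which is why the two are reported separately.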

Related Benchmarks

Benchmark | Focus | Tasks | Real code?
SWE-bench Verified | Full SE tasks | 500 | Yes
SWE-bench Pro | Harder SE tasks | Private | Yes
HumanEval | Function synthesis | 164 | No (synthetic)
MBPP | Basic Python tasks | 974 | No (synthetic)
LiveCodeBench | Competitive coding | Rolling | Semi (LeetCode-style)
RE-Bench | Research engineering | 7 | Yes
HCAST | Security + AI R&D | 90 | Yes

Access the Benchmark

SWE-bench is fully open-source. Run evaluations with Docker locally or in the cloud.
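As a minimal starting point, the Verified split can be pulled from the Hugging Face Hub (requires `pip install datasets`). The field names below follow the published dataset schema, but verify them against the release you actually download:

```python
def load_verified(split="test"):
    """Fetch SWE-bench Verified from the Hugging Face Hub (needs `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the sketch has no hard dependency
    return load_dataset("princeton-nlp/SWE-bench_Verified", split=split)

if __name__ == "__main__":
    task = load_verified()[0]
    print(task["instance_id"], task["repo"], task["base_commit"])
    print(task["problem_statement"][:200])  # the GitHub issue text
    print(task["FAIL_TO_PASS"])             # tests the patch must make pass
```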
