SWE-bench
Can AI agents resolve real-world GitHub issues? The definitive benchmark for evaluating autonomous coding agents on 2,294 software engineering tasks drawn from 12 popular Python repositories.
80.9%
Current SOTA
Verified
2,294
Total Tasks
Full set
500
Verified Tasks
Human-checked
12
Source Repos
Python
1.96%
First Result
Oct 2023
What is SWE-bench?
SWE-bench is a benchmark that tests whether language models can solve real software engineering problems. Each task is a GitHub issue from a popular open-source Python project, paired with the human-written pull request that fixed it.
To "resolve" a task, an AI agent must produce a code patch that passes the project's test suite — including the specific tests added by the original fix. This means the agent must understand the codebase, locate the bug, write working code, and satisfy existing tests without breaking anything.
Unlike synthetic coding benchmarks (HumanEval, MBPP), SWE-bench uses real bugs from production codebases — Django, scikit-learn, SymPy, matplotlib — making it the closest proxy to actual developer work.
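Each task ships as a structured record. A minimal sketch of that record, with field names following the published dataset card (treat the exact schema, and the example `instance_id`, as assumptions to verify against the dataset itself):

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchTask:
    """One SWE-bench instance. Field names follow the public dataset
    card; check the actual schema before relying on them."""
    instance_id: str        # e.g. "django__django-11099" (hypothetical example)
    repo: str               # "owner/name" of the source GitHub project
    base_commit: str        # pre-fix commit the agent starts from
    problem_statement: str  # the GitHub issue text shown to the agent
    patch: str              # gold human-written fix (hidden from the agent)
    test_patch: str         # tests added by the original fix PR
    fail_to_pass: list = field(default_factory=list)  # tests the patch must flip to passing
    pass_to_pass: list = field(default_factory=list)  # tests that must stay green
```

The agent only sees `problem_statement` and the checkout at `base_commit`; `patch`, `test_patch`, and the two test lists are reserved for grading.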
How evaluation works
1. Agent receives the issue description and the codebase at the pre-fix commit
2. Agent explores the codebase, identifies affected files, and writes a patch
3. Patch is applied and the full test suite runs (including tests from the fix PR)
4. Task is "resolved" only if all relevant tests pass and no regressions occur
Dataset Variants
SWE-bench Full
Superseded · 2,294 tasks
Original complete benchmark from 12 Python repos.
SWE-bench Verified
Primary · 500 tasks
Human-validated subset (68.3% of candidates filtered out for quality). The standard evaluation split.
SWE-bench Lite
Active · 300 tasks
Smaller subset for cost-effective evaluation and rapid iteration.
SWE-bench Multimodal
New · 517 tasks
Issues with screenshots, diagrams, and visual elements.
SOTA Progress: 1.96% → 80.9%
SWE-bench Verified resolve rate over time. From barely functional to near-human in 28 months.
Leaderboard — SWE-bench Verified
Top models by resolve rate on the 500-task verified split. Updated February 2026.
| # | Model | Agent / Scaffold | Resolve % | Type | Date |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 (Anthropic) | Anthropic Internal | 80.9% | API | 2026-02 |
| 2 | Claude Opus 4.6 (Anthropic) | Anthropic Internal | 80.8% | API | 2026-02 |
| 3 | MiniMax M2.5 (MiniMax) | MiniMax Agent | 80.2% | Open | 2026-01 |
| 4 | GPT-5.2 (OpenAI) | OpenAI Internal | 80.0% | API | 2026-02 |
| 5 | Sonar Foundation (SonarSource) | Sonar Agent | 79.2% | API | 2026-01 |
| 6 | Claude Opus 4.5 (Anthropic) | Live-SWE-agent | 79.2% | API | 2026-01 |
| 7 | GLM-5 (Zhipu AI) | Zhipu Agent | 77.8% | Open | 2026-01 |
| 8 | Gemini 3 Pro (Google) | Live-SWE-agent | 77.4% | API | 2026-01 |
| 9 | Claude Sonnet 4.5 (Anthropic) | Anthropic Internal | 77.2% | API | 2025-12 |
| 10 | Kimi K2.5 (Moonshot AI) | Moonshot Agent | 76.8% | API | 2026-01 |
| 11 | Claude Opus 4.5 (Anthropic) | mini-SWE-agent v2 | 76.8% | API | 2026-02 |
| 12 | Gemini 3 Pro (Google) | Google Internal | 76.2% | API | 2025-12 |
| 13 | Gemini 3 Flash (Google) | mini-SWE-agent v2 | 75.8% | API | 2026-02 |
| 14 | DeepSeek V3.5 (DeepSeek) | DeepSeek Agent | 74.6% | Open | 2025-11 |
| 15 | Qwen 3 72B (Alibaba) | Qwen Agent | 72.4% | Open | 2025-10 |
Key Insights
Improvement since launch
From 1.96% (Claude 2, Oct 2023) to 80.9% (Claude Opus 4.5, Feb 2026) in just 28 months.
Scaffolding matters
The same model scores very differently depending on the agent scaffold. Claude Opus 4.5 ranges from 76.8% to 80.9% depending on the agent framework used.
MiniMax M2.5 at 80.2%
Open-weight models are now within one percentage point of the best proprietary systems (80.2% vs 80.9%), making deployment without an API dependency viable for enterprises.
Source Repositories
SWE-bench tasks are drawn from real issues in these 12 Python projects: astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray.
Evaluation Pipeline
Issue + Codebase
Agent gets issue text & repo at the pre-fix commit
Exploration
Agent navigates files, reads code, identifies relevant modules
Patch Generation
Agent writes a unified diff patch to fix the issue
Test Execution
Patch applied in Docker, full test suite runs (fail-to-pass + pass-to-pass)
Resolved?
All new tests pass & no regressions → task resolved
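The final grading rule reduces to a simple predicate. A sketch, assuming test results arrive as test-id → passed mappings (the mapping shape is illustrative, not the harness's actual output format):

```python
def is_resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """A task is resolved only if every fail-to-pass test now passes
    AND every pass-to-pass test still passes (no regressions).
    Inputs map test id -> bool (True = passed after the patch)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# A single regression is enough to fail the task:
is_resolved({"t_new": True}, {"t_old_1": True, "t_old_2": False})  # → False
```

Note the asymmetry: a patch that fixes the bug but breaks one unrelated test scores exactly the same as no patch at all.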
Why SWE-bench is Hard
Real codebases, not toys
Django alone has 500k+ lines of code. Agents must navigate complex module structures, understand framework patterns, and modify the right files among thousands.
Strict test validation
It's not enough to "look right." Patches must make fail-to-pass tests pass while keeping all pass-to-pass tests green. A single regression means failure.
Multi-file changes
Many issues require changes across multiple files — models, views, tests, migrations. Agents must reason about dependencies across the codebase.
Under-specified issues
Real GitHub issues are often vague. The agent must infer intent, reproduce the bug, and figure out the correct fix — just like a human developer would.
Key Papers
Foundational papers that define SWE-bench and the leading agent architectures.
Key GitHub Repositories
Open-source agents and frameworks that define the SWE-bench ecosystem.
SWE-bench/SWE-bench
Official benchmark framework & evaluation harness
princeton-nlp/SWE-agent
Agent-computer interface for SWE tasks
All-Hands-AI/OpenHands
Open platform for AI software developers
Aider-AI/aider
AI pair programming in your terminal
nus-apr/auto-code-rover
Autonomous program improvement
OpenAutoCoder/Agentless
Agentless approach to SWE tasks
cognition-labs/devin
AI software engineer (website/waitlist)
Metrics
Resolve Rate (%)
Primary metric. Percentage of tasks where the generated patch passes all fail-to-pass tests without introducing regressions.
Apply Rate (%)
Percentage of patches that cleanly apply to the codebase. A patch that fails to apply counts as unresolved.
Cost ($)
Total API cost per evaluation run. Important for practical deployment — some agents cost $300+ per full evaluation.
Avg. API Calls
Mean number of LLM API calls per task. Indicates agent efficiency and latency characteristics.
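The four metrics above fall out of simple aggregation over per-task records. A sketch, assuming a hypothetical record shape of `{"applied", "resolved", "api_calls"}` per task:

```python
def summarize(runs: list) -> dict:
    """Aggregate per-task records into leaderboard metrics.
    Each record: {"applied": bool, "resolved": bool, "api_calls": int}
    (the record shape is an assumption for illustration)."""
    n = len(runs)
    return {
        "resolve_rate_pct": 100.0 * sum(r["resolved"] for r in runs) / n,
        "apply_rate_pct": 100.0 * sum(r["applied"] for r in runs) / n,
        "avg_api_calls": sum(r["api_calls"] for r in runs) / n,
    }

runs = [
    {"applied": True,  "resolved": True,  "api_calls": 12},
    {"applied": True,  "resolved": False, "api_calls": 30},
    {"applied": False, "resolved": False, "api_calls": 8},
    {"applied": True,  "resolved": True,  "api_calls": 10},
]
summarize(runs)  # → {"resolve_rate_pct": 50.0, "apply_rate_pct": 75.0, "avg_api_calls": 15.0}
```

Because an unapplied patch counts as unresolved, resolve rate can never exceed apply rate.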
Related Benchmarks
| Benchmark | Focus | Tasks | Real code? |
|---|---|---|---|
| SWE-bench Verified | Full SE tasks | 500 | Yes |
| SWE-bench Pro | Harder SE tasks | Private | Yes |
| HumanEval | Function synthesis | 164 | No (synthetic) |
| MBPP | Basic Python tasks | 974 | No (synthetic) |
| LiveCodeBench | Competitive coding | Rolling | Semi (LeetCode-style) |
| RE-Bench | Research engineering | 7 | Yes |
| HCAST | Security + AI R&D | 90 | Yes |
Access the Benchmark
SWE-bench is fully open-source. Run evaluations with Docker locally or in the cloud.
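A local run typically looks like the following sketch (flag names are taken from the SWE-bench README; verify them against your installed version, and note that Docker must be running):

```shell
# Install the official evaluation harness
pip install swebench

# Evaluate a predictions file against the Verified split inside Docker.
# --run_id names the output directory; my_preds.jsonl is your file.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path ./my_preds.jsonl \
    --max_workers 4 \
    --run_id my-first-run
```

Each prediction record pairs an `instance_id` with the model's patch; the harness builds per-task Docker images, applies patches, and reports resolve/apply rates.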
Track every AI benchmark in one place
CodeSOTA tracks state-of-the-art results across 200+ benchmarks in agentic AI, NLP, computer vision, code, and more.