SWE-bench
Can AI agents resolve real-world GitHub issues? The definitive benchmark for evaluating autonomous coding agents on 2,294 software engineering tasks drawn from 12 popular Python repositories.
79.2%
Current SOTA
Verified
2,294
Total Tasks
Full set
500
Verified Tasks
Human-checked
12
Source Repos
Python
1.96%
First Result
Oct 2023
What is SWE-bench?
SWE-bench is a benchmark that tests whether language models can solve real software engineering problems. Each task is a GitHub issue from a popular open-source Python project, paired with the human-written pull request that fixed it.
To "resolve" a task, an AI agent must produce a code patch that passes the project's test suite — including the specific tests added by the original fix. This means the agent must understand the codebase, locate the bug, write working code, and satisfy existing tests without breaking anything.
Unlike synthetic coding benchmarks (HumanEval, MBPP), SWE-bench uses real bugs from production codebases — Django, scikit-learn, SymPy, matplotlib — making it the closest proxy to actual developer work.
How evaluation works
1. Agent receives the issue description and the codebase at the pre-fix commit
2. Agent explores the codebase, identifies affected files, and writes a patch
3. The patch is applied and the full test suite runs (including tests from the fix PR)
4. The task is "resolved" only if all relevant tests pass and no regressions occur
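The steps above reduce to a simple predicate over the two test groups SWE-bench tracks. The sketch below is illustrative; the function and record names are not the harness's actual API.

```python
def is_resolved(results: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True iff every fail-to-pass test now passes and no
    pass-to-pass test regressed. `results` maps test id -> status."""
    new_tests_pass = all(results.get(t) == "PASSED" for t in fail_to_pass)
    no_regressions = all(results.get(t) == "PASSED" for t in pass_to_pass)
    return new_tests_pass and no_regressions

# Example: the patch fixes the new test but breaks an existing one,
# so the task counts as unresolved.
results = {"test_fix": "PASSED", "test_existing": "FAILED"}
print(is_resolved(results, ["test_fix"], ["test_existing"]))  # False
```

The key design point is the conjunction: fixing the issue and preserving existing behavior are both required, which is what makes partial credit impossible.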
Dataset Variants
SWE-bench Full
Superseded · 2,294 tasks
Original complete benchmark from 12 Python repos.
SWE-bench Verified
Primary · 500 tasks
Human-validated subset. 68.3% filtered for quality. The standard evaluation.
SWE-bench Lite
Active · 300 tasks
Smaller subset for cost-effective evaluation and rapid iteration.
SWE-bench Multimodal
New · 517 tasks
Issues with screenshots, diagrams, and visual elements.
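The variants above are distributed as Hugging Face datasets. The ids below are the commonly used ones at the time of writing, but they are an assumption here; verify them against the official SWE-bench repository before relying on them.

```python
# Hugging Face dataset ids for each variant (assumed; confirm in the
# SWE-bench/SWE-bench README before use).
VARIANTS = {
    "full": "princeton-nlp/SWE-bench",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "multimodal": "princeton-nlp/SWE-bench_Multimodal",
}

def load_variant(name: str):
    # Lazy import so the id mapping is usable without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(VARIANTS[name], split="test")

print(VARIANTS["verified"])
```

`load_variant("verified")` then yields the 500 human-validated task instances used by the standard leaderboard.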
SOTA Progress: 1.96% → 79.2%
SWE-bench Verified resolve rate over time. CodeSOTA's coverage follows the official all-agent leaderboard.
Leaderboard — SWE-bench Verified
Top official all-agent results on the 500-task verified split. Updated May 2026.
| # | Model | Org / Submitter | Agent / Scaffold | Resolve % | Type | Date |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 (medium) | Anthropic / UIUC | live-SWE-agent | 79.2% | API | 2025-12 |
| 2 | Claude Opus 4.5 | Anthropic / Sonar | Sonar Foundation Agent | 79.2% | API | 2025-12 |
| 3 | Doubao-Seed-Code | ByteDance | TRAE | 78.8% | API | 2025-09 |
| 4 | Gemini 3 Pro Preview | Google / UIUC | live-SWE-agent | 77.4% | API | 2025-11 |
| 5 | Claude Sonnet 4 + GPT-5 | Atlassian | Rovo Dev | 76.8% | API | 2025-09 |
| 6 | Claude Sonnet 4 | EPAM | AI/Run Developer Agent | 76.8% | API | 2025-08 |
| 7 | Claude Opus 4.5 (high) | Anthropic / SWE-agent | mini-SWE-agent v2 | 76.8% | API | 2026-02 |
| 8 | Mixed frontier models | ACoder | ACoder | 76.4% | API | 2025-08 |
| 9 | Gemini 3 Flash (high) | Google / SWE-agent | mini-SWE-agent v2 | 75.8% | API | 2026-02 |
| 10 | MiniMax M2.5 (high) | MiniMax / SWE-agent | mini-SWE-agent v2 | 75.8% | API | 2026-02 |
| 11 | Warp (mixed models) | Warp | Warp | 75.6% | API | 2025-09 |
| 12 | Claude Opus 4.6 | Anthropic / SWE-agent | mini-SWE-agent v2 | 75.6% | API | 2026-02 |
Key Insights
Improvement since launch
From 1.96% (Claude 2, Oct 2023) to 79.2% on the official all-agent Verified leaderboard.
Scaffolding matters
The same model scores differently depending on the agent scaffold. Claude Opus 4.5 scores 79.2% with live-SWE-agent or the Sonar Foundation Agent, but 76.8% in the mini-SWE-agent v2 bash-only slice.
Doubao, Gemini, MiniMax
Recent official rows include Doubao-Seed-Code at 78.8%, Gemini 3 Pro Preview at 77.4%, and MiniMax M2.5 in the mini-SWE-agent slice at 75.8%.
Source Repositories
SWE-bench tasks are drawn from real issues in these 12 Python projects: astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray.
Evaluation Pipeline
Issue + Codebase
Agent gets issue text & repo at the pre-fix commit
Exploration
Agent navigates files, reads code, identifies relevant modules
Patch Generation
Agent writes a unified diff patch to fix the issue
Test Execution
Patch applied in Docker, full test suite runs (fail-to-pass + pass-to-pass)
Resolved?
All new tests pass & no regressions → task resolved
Why SWE-bench is Hard
Real codebases, not toys
Django alone has 500k+ lines of code. Agents must navigate complex module structures, understand framework patterns, and modify the right files among thousands.
Strict test validation
It's not enough to "look right." Patches must make fail-to-pass tests pass while keeping all pass-to-pass tests green. A single regression means failure.
Multi-file changes
Many issues require changes across multiple files — models, views, tests, migrations. Agents must reason about dependencies across the codebase.
Under-specified issues
Real GitHub issues are often vague. The agent must infer intent, reproduce the bug, and figure out the correct fix — just like a human developer would.
Key Papers
Foundational papers that define SWE-bench and the leading agent architectures.
Key GitHub Repositories
Open-source agents and frameworks that define the SWE-bench ecosystem.
SWE-bench/SWE-bench
Official benchmark framework & evaluation harness
princeton-nlp/SWE-agent
Agent-computer interface for SWE tasks
All-Hands-AI/OpenHands
Open platform for AI software developers
Aider-AI/aider
AI pair programming in your terminal
nus-apr/auto-code-rover
Autonomous program improvement
OpenAutoCoder/Agentless
Agentless approach to SWE tasks
cognition-labs/devin
AI software engineer (website/waitlist)
Metrics
Resolve Rate (%)
Primary metric. Percentage of tasks where the generated patch passes all fail-to-pass tests without introducing regressions.
Apply Rate (%)
Percentage of patches that cleanly apply to the codebase. A patch that fails to apply counts as unresolved.
Cost ($)
Total API cost per evaluation run. Important for practical deployment — some agents cost $300+ per full evaluation.
Avg. API Calls
Mean number of LLM API calls per task. Indicates agent efficiency and latency characteristics.
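The four metrics above can be aggregated from per-task run records. The record fields here are hypothetical, chosen to mirror the metric definitions, not a real harness output format.

```python
def summarize(runs: list[dict]) -> dict:
    """Aggregate per-task records with keys: resolved (bool),
    applied (bool), cost_usd (float), api_calls (int)."""
    n = len(runs)
    return {
        "resolve_rate_pct": 100 * sum(r["resolved"] for r in runs) / n,
        "apply_rate_pct": 100 * sum(r["applied"] for r in runs) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in runs),
        "avg_api_calls": sum(r["api_calls"] for r in runs) / n,
    }

runs = [
    {"resolved": True,  "applied": True,  "cost_usd": 1.20, "api_calls": 14},
    {"resolved": False, "applied": True,  "cost_usd": 2.05, "api_calls": 31},
    {"resolved": False, "applied": False, "cost_usd": 0.40, "api_calls": 6},
]
print(summarize(runs))  # resolve_rate_pct ≈ 33.3, apply_rate_pct ≈ 66.7
```

Note that apply rate is always an upper bound on resolve rate: a patch that fails to apply can never pass the tests.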
Related Benchmarks
| Benchmark | Focus | Tasks | Real code? |
|---|---|---|---|
| SWE-bench Verified | Full SE tasks | 500 | Yes |
| SWE-bench Pro | Harder SE tasks | Private | Yes |
| HumanEval | Function synthesis | 164 | No (synthetic) |
| MBPP | Basic Python tasks | 974 | No (synthetic) |
| LiveCodeBench | Competitive coding | Rolling | Semi (LeetCode-style) |
| RE-Bench | Research engineering | 7 | Yes |
| HCAST | Security + AI R&D | 90 | Yes |
Access the Benchmark
SWE-bench is fully open-source. Run evaluations with Docker locally or in the cloud.
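The evaluation harness consumes a predictions file with one patch per task. The three keys below follow the predictions format described in the SWE-bench README, but field names can change between harness versions, so treat this as a sketch and confirm against the current repository docs.

```python
import json

# One prediction per task instance. Keys follow the harness's documented
# predictions format (instance_id, model_name_or_path, model_patch) —
# verify against the current SWE-bench/SWE-bench README.
predictions = [
    {
        "instance_id": "django__django-11099",
        "model_name_or_path": "my-agent-v1",  # hypothetical agent name
        "model_patch": "diff --git a/... b/...\n",  # unified diff text
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

In current versions of the harness, a file like this is then passed to `python -m swebench.harness.run_evaluation`, which applies each patch inside Docker and reports resolve rates; see the repository README for the exact flags.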
Track every AI benchmark in one place
CodeSOTA tracks state-of-the-art results across 200+ benchmarks in agentic AI, NLP, computer vision, code, and more.