SWE-bench for Code Generation
How well do code models actually write software? SWE-bench isolates the raw code generation capability of LLMs on 2,294 real GitHub issues — separating model intelligence from agent scaffolding.
- 82.1% · Code Model SOTA (Claude Sonnet 5)
- 80.2% · Open-Source SOTA (MiniMax M2.5)
- 20+ · Code Models Tracked
- 1.9% · Gap: Open vs Closed (narrowing fast)
- 1.96% · First Baseline (Oct 2023)
SWE-bench as a Code Generation Benchmark
Most code benchmarks — HumanEval, MBPP, LiveCodeBench — test whether a model can write a single function from a description. SWE-bench is fundamentally different: it tests whether a code model can generate production-quality patches that fix real bugs in real codebases.
This page focuses on the underlying code model, not the agent scaffold wrapping it. When Claude Opus 4.5 scores 80.9% via one agent and 76.8% via another, the difference is scaffolding. We care about the model's raw ability to understand code, navigate repositories, and generate correct multi-file patches.
The standardized harness (mini-SWE-agent) evaluates all models with the same 100-line Python scaffold, isolating model capability. This is what makes SWE-bench the most meaningful code generation benchmark in 2026.
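The harness idea is simple enough to sketch. The following is a minimal illustration of the observe-act loop that mini-SWE-agent implements, not the actual harness code: the model proposes one shell command per turn, the harness executes it, and the output becomes the next observation. `query_model` and `stub_model` are hypothetical stand-ins for a real model API call.

```python
import subprocess

def run_agent(query_model, task, max_steps=30):
    """Minimal observe-act loop in the spirit of mini-SWE-agent:
    the model proposes one shell command per turn, the harness runs it,
    and the command's output becomes the next observation."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        command = query_model(transcript)   # model returns a shell command
        if command.strip() == "submit":     # model signals it is done
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return transcript

# Stub "model" for illustration: runs one command, then submits.
def stub_model(transcript):
    return "echo patched" if len(transcript) == 1 else "submit"
```

Because every model sees the same loop, the same tools, and the same prompt, score differences reflect the model rather than the scaffold.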
What code generation skills matter
1. Repository comprehension: parse 500k+ LOC codebases and find the relevant 50 lines
2. Multi-file patch generation: edit 1.7 files across 3 functions on average per task
3. Test-aware code writing: generate code that passes existing and new tests without regressions
4. Framework-specific patterns: Django ORM, pytest fixtures, matplotlib internals, SymPy
5. Bug root cause analysis: infer the real problem from often-vague issue descriptions
Why HumanEval and MBPP Are Not Enough
Top models score 95-98% on HumanEval — but that tells us almost nothing about real code generation ability. Here is why.
HumanEval is saturated
Most frontier models score 95-98%. The benchmark can no longer differentiate model quality — a 1% difference is noise, not signal.
Single-function scope
HumanEval/MBPP test isolated function generation. Real software engineering requires understanding codebases of thousands of files and generating patches across multiple modules.
Real issues, real tests
SWE-bench uses actual GitHub issues from Django, scikit-learn, SymPy — validated by the project's own test suite, not synthetic unit tests.
Coding Capabilities Tested by Benchmark

[Figure: coding capabilities by benchmark] SWE-bench tests dramatically more code engineering skills than HumanEval or LiveCodeBench.
Benchmark Comparison
How SWE-bench compares to other code generation benchmarks across key dimensions.
| Benchmark | Focus | Tasks | Scope | Real Code? | Top Score | Validation |
|---|---|---|---|---|---|---|
| SWE-bench Verified | Full SE: navigate, edit, test | 500 | Multi-file | Yes | 82.1% | Project test suites |
| HumanEval | Function synthesis | 164 | Single function | No | ~98% | Unit tests (simple) |
| MBPP | Basic Python tasks | 974 | Single function | No | ~95% | Unit tests (simple) |
| LiveCodeBench | Competitive coding | Rolling | Single file | No | ~70% | I/O matching |
| SWE-bench Pro | Hard multi-file SE | Private | Multi-file | Yes | 57% | Extended test suites |
| Aider Polyglot | Multi-language edits | 225 | Single file | No | ~88% | Edit validation |
Coding Capabilities: SWE-bench vs Others
Skill intensity scores (1-10) for each benchmark. SWE-bench uniquely tests production-grade coding skills.
| Skill | SWE-bench | HumanEval | LiveCode | Description |
|---|---|---|---|---|
| Code Navigation | 9 | 2 | 4 | Locating relevant files and functions across large repos (500k+ LOC) |
| Multi-file Editing | 9 | 1 | 2 | Coordinated changes across models, views, tests, and configs |
| Debugging | 8 | 2 | 5 | Reproducing bugs from vague issue descriptions and fixing root causes |
| Test Comprehension | 8 | 1 | 3 | Understanding project test suites, fail-to-pass + pass-to-pass validation |
| Dependency Resolution | 7 | 1 | 2 | Handling imports, framework patterns, version-specific API usage |
| API Usage | 8 | 3 | 4 | Correct usage of Django ORM, matplotlib internals, pytest fixtures, etc. |
Code Model SOTA: 1.96% → 82.1%
How raw code model performance has evolved on SWE-bench Verified. Each entry represents a new record by a code model (standardized evaluation).

[Figure: code model SOTA progression on SWE-bench Verified, 1.96% (Oct 2023) to 82.1% (Feb 2026)]
Code Model Leaderboard — SWE-bench Verified
Top models ranked by resolve rate. Standardized harness evaluation to isolate model capability. Updated March 2026.
| # | Model | Organization | Params | Resolve % | Type | Date |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 5 | Anthropic | Undisclosed | 82.1% | API | 2026-02 |
| 2 | Claude Opus 4.5 | Anthropic | Undisclosed | 80.9% | API | 2026-02 |
| 3 | MiniMax M2.5 | MiniMax | 229B | 80.2% | Open | 2026-01 |
| 4 | GPT-5.2 | OpenAI | Undisclosed | 80.0% | API | 2026-02 |
| 5 | Claude Opus 4.6 | Anthropic | Undisclosed | 79.8% | API | 2026-02 |
| 6 | GLM-5 | Zhipu AI | 130B | 77.8% | Open | 2026-01 |
| 7 | Gemini 3 Pro | Google | Undisclosed | 77.4% | API | 2026-01 |
| 8 | Claude Sonnet 4.5 | Anthropic | Undisclosed | 77.2% | API | 2025-12 |
| 9 | Kimi K2.5 | Moonshot AI | Undisclosed | 76.8% | API | 2026-01 |
| 10 | DeepSeek R1 | DeepSeek | 671B MoE | 76.3% | Open | 2025-12 |
| 11 | Gemini 3 Flash | Google | Undisclosed | 75.8% | API | 2026-02 |
| 12 | Qwen3-Max-Thinking | Alibaba | MoE | 75.3% | Open | 2026-02 |
| 13 | DeepSeek V3.5 | DeepSeek | 685B MoE | 74.6% | Open | 2025-11 |
| 14 | Step-3.5-Flash | StepFun | Unknown | 74.4% | Open | 2026-01 |
| 15 | Qwen3 72B | Alibaba | 72B | 72.4% | Open | 2025-10 |
| 16 | DeepSeek-Coder V2.5 | DeepSeek | 236B MoE | 68.2% | Open | 2025-08 |
| 17 | Qwen2.5-Coder 32B | Alibaba | 32B | 55.4% | Open | 2025-06 |
| 18 | CodeLlama 70B | Meta | 70B | 29.8% | Open | 2024-12 |
| 19 | StarCoder2 15B | BigCode | 15B | 18.3% | Open | 2024-10 |
| 20 | DeepSeek-Coder 33B | DeepSeek | 33B | 15.6% | Open | 2024-06 |
Open-Source vs Proprietary: The Gap Is Closing
Open-weight models now compete head-to-head with proprietary APIs on real code generation tasks.

[Figure: open-weight vs proprietary score comparison]
Avg. Open-Weight Score: 59.9% (mean of the 12 open-weight models in the leaderboard above)
12 open-weight models tracked. Led by MiniMax M2.5 at 80.2%, with DeepSeek R1 (76.3%) and Qwen3-Max (75.3%) close behind.
Avg. Proprietary Score: 78.8% (mean of the 8 API models in the leaderboard above)
8 API models tracked. Claude and GPT families dominate, but Gemini 3 Pro (77.4%) and Kimi K2.5 (76.8%) compete strongly.
Gap at the Top
MiniMax M2.5 (80.2%, open) is only 1.9% behind Claude Sonnet 5 (82.1%, API). In 2024, the gap was 30%+. Enterprise self-hosting is now viable.
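The 1.9% figure can be recomputed from any leaderboard snapshot. A minimal sketch, with `entries` excerpted from the table above:

```python
# Leaderboard excerpt from the table above: (model, score, type).
entries = [
    ("Claude Sonnet 5", 82.1, "API"),
    ("Claude Opus 4.5", 80.9, "API"),
    ("MiniMax M2.5",    80.2, "Open"),
    ("GPT-5.2",         80.0, "API"),
    ("GLM-5",           77.8, "Open"),
]

def top_gap(entries):
    """Gap between the best proprietary and the best open-weight score."""
    best_api  = max(score for _, score, typ in entries if typ == "API")
    best_open = max(score for _, score, typ in entries if typ == "Open")
    return round(best_api - best_open, 1)

print(top_gap(entries))  # 1.9
```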
Key Takeaways for Code Generation
Open-source advantages:
- Self-hosting eliminates API costs ($300+ per full SWE-bench evaluation)
- Full control over inference: fine-tuning, quantization, custom prompting
- DeepSeek and Qwen families offer code-specialized variants with focused training
- No rate limits or vendor lock-in for production deployment
Proprietary advantages:
- Claude and GPT still lead on the hardest tasks (complex multi-file patches)
- Better instruction following and context utilization at extreme lengths
- Faster iteration — new capabilities ship weekly without infrastructure cost
- Claude Sonnet 5 set the 82.1% record with no public indication of saturation
Code Model Family Profiles
DeepSeek-Coder Family
Leading open-source · DeepSeek · Best SWE-bench: 76.3% (R1)
Models: DeepSeek-Coder 33B, V2.5 (236B MoE), R1 (671B MoE), V3.5 (685B MoE)
Pioneered MoE architecture for code. DeepSeek-Coder 33B was the first open code model to meaningfully score on SWE-bench (15.6%). R1 with reasoning chains pushed to 76.3%.
Qwen-Coder Family
Fast-rising · Alibaba · Best SWE-bench: 75.3% (Qwen3-Max)
Models: Qwen2.5-Coder 32B, Qwen3 72B, Qwen3-Max-Thinking
Qwen2.5-Coder specialized for code with strong multi-language support. Qwen3-Max-Thinking uses extended reasoning to approach frontier performance at 75.3%.
CodeLlama / Meta
Foundation layer · Meta · Best SWE-bench: 29.8% (70B)
Models: CodeLlama 7B/13B/34B/70B
Based on Llama 2. Code-specialized with fill-in-the-middle and long context. At 29.8%, it showed open models could meaningfully participate. Now superseded by newer families.
StarCoder / BigCode
Training data pioneer · BigCode (open research) · Best SWE-bench: 18.3% (StarCoder2 15B)
Models: StarCoder (15B), StarCoder2 (3B/7B/15B)
Built on The Stack v2 — the largest open code training dataset. Lower SWE-bench scores reflect smaller model sizes, but influential for the open-source code LLM ecosystem.
Why SWE-bench Is Hard for Code Models
Massive context requirements
Django has 500k+ lines of code. The model must process the issue, navigate the codebase, and generate a patch — all while maintaining coherence across extreme context lengths. Models with shorter context windows or weaker retrieval fail dramatically.
Ambiguous specifications
Unlike HumanEval's clear docstrings, GitHub issues are often vague: "X doesn't work when Y." The model must infer the complete specification, reproduce the bug mentally, and determine the correct fix from incomplete information.
Multi-file coordination
Average task requires editing 1.7 files, 3.0 functions, and 32.8 lines. A model must understand how changes in one module affect others — imports, class hierarchies, test expectations — and keep everything consistent.
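Those per-task averages come from measuring gold patches. A sketch of how such stats can be counted from a unified diff; the example patch below is invented for illustration:

```python
def patch_stats(patch: str):
    """Count files touched and lines added/removed in a unified diff."""
    files, added, removed = set(), 0, 0
    for line in patch.splitlines():
        if line.startswith("+++ b/"):          # header names an edited file
            files.add(line[6:])
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"files": len(files), "added": added, "removed": removed}

# Invented two-file patch in the style of a SWE-bench gold patch.
example = """\
--- a/django/db/models/query.py
+++ b/django/db/models/query.py
@@ -10,2 +10,2 @@
-        return self._clone()
+        return self._chain()
--- a/tests/queries/tests.py
+++ b/tests/queries/tests.py
@@ -5,1 +5,2 @@
+        self.assertIs(qs._chain(), qs)
"""
print(patch_stats(example))  # {'files': 2, 'added': 2, 'removed': 1}
```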
Zero tolerance for regressions
A patch must pass all fail-to-pass tests AND keep all pass-to-pass tests green. A single regression = failure. This means the code model cannot just "approximately fix" the issue — it must be precisely correct.
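The resolve criterion described above reduces to a strict conjunction. A minimal sketch, where each dict maps a test name to whether it passed after the patch was applied:

```python
def is_resolved(fail_to_pass, pass_to_pass):
    """SWE-bench resolve criterion: every previously-failing test must now
    pass, AND every previously-passing test must still pass."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# One regression is enough to fail the whole instance:
f2p = {"test_issue_repro": True}
p2p = {"test_existing_a": True, "test_existing_b": False}
print(is_resolved(f2p, p2p))  # False
```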
Key Papers
Foundational papers on code generation models and the SWE-bench evaluation framework.
Key GitHub Repositories
Code model repos and evaluation frameworks central to SWE-bench performance.
SWE-bench/SWE-bench
Official benchmark: 2,294 real GitHub issues for LLM evaluation
deepseek-ai/DeepSeek-Coder
Open-source code LLM family (1.3B to 33B parameters)
meta-llama/codellama
Code Llama: open foundation models for code (7B-70B)
QwenLM/Qwen2.5-Coder
Qwen code generation model family (1.5B-32B)
bigcode/starcoder2
StarCoder 2: open code LLMs trained on The Stack v2
OpenAutoCoder/Agentless
Agentless approach: raw model patching without scaffolding
microsoft/SWE-bench-Live
Live benchmark: contamination-free monthly evaluation
SWE-bench/experiments
Open-sourced predictions, logs, and results from SWE-bench runs
A Note on Benchmark Contamination
In February 2026, OpenAI published an analysis arguing that SWE-bench Verified is "increasingly contaminated" — frontier models may have memorized solutions during training. Their analysis found 59.4% of the hardest tasks had flawed or insufficient tests. This has led to growing adoption of SWE-bench Pro (by Scale AI) and SWE-bench Live (by Microsoft) as contamination-resistant alternatives. The scores on this page should be interpreted with this context: they remain the most comprehensive cross-model comparison available, but may overstate absolute capability for models trained on post-2024 data.
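The core idea behind contamination-resistant variants like SWE-bench Live is simple: only evaluate on issues created after the model's training cutoff. A sketch of that filter; the instance dicts and cutoff date are hypothetical:

```python
from datetime import date

def post_cutoff(instances, cutoff=date(2024, 1, 1)):
    """Keep only issues created after a model's training cutoff,
    the filtering idea behind SWE-bench Live."""
    return [inst for inst in instances if inst["created_at"] > cutoff]

# Hypothetical instances for illustration.
instances = [
    {"id": "django-12345", "created_at": date(2019, 6, 1)},  # likely in training data
    {"id": "live-00017",   "created_at": date(2025, 3, 12)}, # post-cutoff
]
print([inst["id"] for inst in post_cutoff(instances)])  # ['live-00017']
```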
Related Code Generation Benchmarks
| Benchmark | Focus | Tasks | Key Difference from SWE-bench |
|---|---|---|---|
| SWE-bench Pro | Hard SE tasks | Private | Uncontaminated, multi-file focus, harder tasks selected by Scale AI |
| SWE-bench Live | Live SE evaluation | 1,319+ | Monthly-updated from 93 repos, post-2024 issues only |
| HumanEval | Function synthesis | 164 | Single function only — saturated at ~98% |
| MBPP | Basic Python | 974 | Simple problems, no codebase context |
| LiveCodeBench | Competitive coding | Rolling | LeetCode-style, single file, algorithmic focus |
| Aider Polyglot | Code editing | 225 | Multi-language but single-file edits |
| RE-Bench | Research engineering | 7 | Much harder, longer tasks (hours vs minutes) |
Evaluate Your Code Model
SWE-bench is fully open-source. Run evaluations on your own models with Docker locally or in the cloud. The mini-SWE-agent harness makes standardized evaluation accessible in 100 lines of Python.
Track every code generation benchmark
CodeSOTA tracks state-of-the-art results across 200+ benchmarks including HumanEval, MBPP, LiveCodeBench, SWE-bench, and more. Compare open-source and proprietary models in one place.