
SWE-bench for Code Generation

How well do code models actually write software? SWE-bench isolates the raw code generation capability of LLMs on 2,294 real GitHub issues — separating model intelligence from agent scaffolding.

  • 82.1%: Code Model SOTA (Claude Sonnet 5)
  • 80.2%: Open-Source SOTA (MiniMax M2.5)
  • 20+: Code Models Tracked
  • 1.9%: Gap, Open vs Closed (narrowing fast)
  • 1.96%: First Baseline (Oct 2023)

SWE-bench as a Code Generation Benchmark

Most code benchmarks — HumanEval, MBPP, LiveCodeBench — test whether a model can write a single function from a description. SWE-bench is fundamentally different: it tests whether a code model can generate production-quality patches that fix real bugs in real codebases.

This page focuses on the underlying code model, not the agent scaffold wrapping it. When Claude Opus 4.5 scores 80.9% via one agent and 76.8% via another, the difference is scaffolding. We care about the model's raw ability to understand code, navigate repositories, and generate correct multi-file patches.

The standardized harness (mini-SWE-agent) evaluates all models with the same 100-line Python scaffold, isolating model capability. This is what makes SWE-bench the most meaningful code generation benchmark in 2026.
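The idea behind that scaffold is a bare command loop: the model emits one shell command per turn, the harness runs it and feeds back the output, until the model signals it is done. A minimal sketch of such a loop, where `query_model` is a hypothetical stand-in for the real LLM API call and the sentinel string is illustrative:

```python
import subprocess

def query_model(history):
    # Hypothetical stand-in for the real LLM call; a real harness sends
    # `history` to the model and expects exactly one shell command back.
    return 'echo "AGENT_FINAL_OUTPUT"'

def run_shell(command):
    # Every action is a plain subprocess call: no custom tools, no browser,
    # which is what keeps the scaffold identical across models.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def solve(issue_text, max_steps=50):
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_steps):
        command = query_model(history)
        observation = run_shell(command)
        history.append({"role": "assistant", "content": command})
        history.append({"role": "user", "content": observation})
        if "AGENT_FINAL_OUTPUT" in observation:
            break  # model signalled completion; the harness then diffs the repo
    return history
```

Because the loop is this small, differences in resolve rate are attributable to the model, not the scaffold.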

What code generation skills matter

  1. Repository comprehension: parse 500k+ LOC codebases and find the relevant 50 lines
  2. Multi-file patch generation: edit 1.7 files across 3 functions on average per task
  3. Test-aware code writing: generate code that passes existing and new tests without regressions
  4. Framework-specific patterns: Django ORM, pytest fixtures, matplotlib internals, SymPy
  5. Bug root-cause analysis: infer the real problem from often-vague issue descriptions

Why HumanEval and MBPP Are Not Enough

Top models score 95-98% on HumanEval — but that tells us almost nothing about real code generation ability. Here is why.

  • ~98%: HumanEval is saturated. Most frontier models score 95-98%, so the benchmark can no longer differentiate model quality; a 1% difference is noise, not signal.
  • 1 file: Single-function scope. HumanEval and MBPP test isolated function generation. Real software engineering requires understanding codebases of thousands of files and generating patches across multiple modules.
  • 2,294: Real issues, real tests. SWE-bench uses actual GitHub issues from Django, scikit-learn, and SymPy, validated by each project's own test suite rather than synthetic unit tests.
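Each of those instances follows a fixed schema. The field names below match the published dataset (loadable via Hugging Face `datasets`), but the example values are illustrative placeholders, not a real instance:

```python
import json

# Illustrative record in the shape of a SWE-bench instance.
# Real instances are loaded e.g. via:
#   from datasets import load_dataset
#   ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
instance = {
    "instance_id": "django__django-00000",        # placeholder ID
    "repo": "django/django",
    "base_commit": "deadbeef",                    # placeholder commit SHA
    "problem_statement": "QuerySet.union() crashes when ...",
    "FAIL_TO_PASS": json.dumps(["tests.queries.test_union"]),
    "PASS_TO_PASS": json.dumps(["tests.queries.test_filter"]),
}

def target_tests(inst):
    """Tests a candidate patch must flip from failing to passing,
    plus tests it must not break."""
    return json.loads(inst["FAIL_TO_PASS"]), json.loads(inst["PASS_TO_PASS"])

f2p, p2p = target_tests(instance)
```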

Coding Capabilities Tested by Benchmark

[Radar chart comparing coding capabilities tested by SWE-bench vs HumanEval vs LiveCodeBench]

SWE-bench (green) tests dramatically more code-engineering skills than HumanEval (red) or LiveCodeBench (amber).

Benchmark Comparison

How SWE-bench compares to other code generation benchmarks across key dimensions.

| Benchmark | Focus | Tasks | Scope | Real Code? | Top Score | Validation |
|---|---|---|---|---|---|---|
| SWE-bench Verified | Full SE: navigate, edit, test | 500 | Multi-file | Yes | 82.1% | Project test suites |
| HumanEval | Function synthesis | 164 | Single function | No | ~98% | Unit tests (simple) |
| MBPP | Basic Python tasks | 974 | Single function | No | ~95% | Unit tests (simple) |
| LiveCodeBench | Competitive coding | Rolling | Single file | No | ~70% | I/O matching |
| SWE-bench Pro | Hard multi-file SE | Private | Multi-file | Yes | 57% | Extended test suites |
| Aider Polyglot | Multi-language edits | 225 | Single file | No | ~88% | Edit validation |

Coding Capabilities: SWE-bench vs Others

Skill intensity scores (1-10) for each benchmark. SWE-bench uniquely tests production-grade coding skills.

| Skill | SWE-bench | HumanEval | LiveCode | Description |
|---|---|---|---|---|
| Code Navigation | 9 | 2 | 4 | Locating relevant files and functions across large repos (500k+ LOC) |
| Multi-file Editing | 9 | 1 | 2 | Coordinated changes across models, views, tests, and configs |
| Debugging | 8 | 2 | 5 | Reproducing bugs from vague issue descriptions and fixing root causes |
| Test Comprehension | 8 | 1 | 3 | Understanding project test suites, fail-to-pass + pass-to-pass validation |
| Dependency Resolution | 7 | 1 | 2 | Handling imports, framework patterns, version-specific API usage |
| API Usage | 8 | 3 | 4 | Correct usage of Django ORM, matplotlib internals, pytest fixtures, etc. |

Code Model SOTA: 1.96% → 82.1%

How raw code model performance has evolved on SWE-bench Verified. Each entry represents a new record by a code model (standardized evaluation).

[SWE-bench Verified SOTA progression chart, 2023 to 2026]

| Date | Score | Model |
|---|---|---|
| 2023-10 | 1.96% | Claude 2 |
| 2024-03 | 12.5% | GPT-4 Turbo |
| 2024-06 | 19% | GPT-4o |
| 2024-08 | 27% | Claude 3.5 Sonnet |
| 2024-10 | 36.2% | o1-preview |
| 2024-12 | 49% | Claude 3.5 Sonnet v2 |
| 2025-03 | 55.2% | Claude Opus 4 |
| 2025-06 | 62% | GPT-4.5 |
| 2025-09 | 70.8% | Claude Sonnet 4.5 |
| 2025-12 | 78% | Claude Opus 4.5 |
| 2026-01 | 80.2% | MiniMax M2.5 (open) |
| 2026-02 | 82.1% | Claude Sonnet 5 |

Code Model Leaderboard — SWE-bench Verified

Top models ranked by resolve rate. Standardized harness evaluation to isolate model capability. Updated March 2026.

| # | Model | Organization | Params | Resolve % | Type | Date |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 5 | Anthropic | Undisclosed | 82.1% | API | 2026-02 |
| 2 | Claude Opus 4.5 | Anthropic | Undisclosed | 80.9% | API | 2026-02 |
| 3 | MiniMax M2.5 | MiniMax | 229B | 80.2% | Open | 2026-01 |
| 4 | GPT-5.2 | OpenAI | Undisclosed | 80% | API | 2026-02 |
| 5 | Claude Opus 4.6 | Anthropic | Undisclosed | 79.8% | API | 2026-02 |
| 6 | GLM-5 | Zhipu AI | 130B | 77.8% | Open | 2026-01 |
| 7 | Gemini 3 Pro | Google | Undisclosed | 77.4% | API | 2026-01 |
| 8 | Claude Sonnet 4.5 | Anthropic | Undisclosed | 77.2% | API | 2025-12 |
| 9 | Kimi K2.5 | Moonshot AI | Undisclosed | 76.8% | API | 2026-01 |
| 10 | DeepSeek R1 | DeepSeek | 671B MoE | 76.3% | Open | 2025-12 |
| 11 | Gemini 3 Flash | Google | Undisclosed | 75.8% | API | 2026-02 |
| 12 | Qwen3-Max-Thinking | Alibaba | MoE | 75.3% | Open | 2026-02 |
| 13 | DeepSeek V3.5 | DeepSeek | 685B MoE | 74.6% | Open | 2025-11 |
| 14 | Step-3.5-Flash | StepFun | Unknown | 74.4% | Open | 2026-01 |
| 15 | Qwen3 72B | Alibaba | 72B | 72.4% | Open | 2025-10 |
| 16 | DeepSeek-Coder V2.5 | DeepSeek | 236B MoE | 68.2% | Open | 2025-08 |
| 17 | Qwen2.5-Coder 32B | Alibaba | 32B | 55.4% | Open | 2025-06 |
| 18 | CodeLlama 70B | Meta | 70B | 29.8% | Open | 2024-12 |
| 19 | StarCoder2 15B | BigCode | 15B | 18.3% | Open | 2024-10 |
| 20 | DeepSeek-Coder 33B | DeepSeek | 33B | 15.6% | Open | 2024-06 |

Open-Source vs Proprietary: The Gap Is Closing

Open-weight models now compete head-to-head with proprietary APIs on real code generation tasks.

[Chart: open-source vs proprietary model comparison on SWE-bench Verified]

  • 59.9% average open-weight score: 12 open-weight models tracked, led by MiniMax M2.5 at 80.2%, with DeepSeek R1 (76.3%) and Qwen3-Max (75.3%) close behind.
  • 78.8% average proprietary score: 8 API models tracked. The Claude and GPT families dominate, but Gemini 3 Pro (77.4%) and Kimi K2.5 (76.8%) compete strongly.
  • 1.9% gap at the top: MiniMax M2.5 (80.2%, open) trails Claude Sonnet 5 (82.1%, API) by only 1.9 points. In 2024 the gap was 30+ points. Enterprise self-hosting is now viable.

Key Takeaways for Code Generation

Open-source advantages:

  • Self-hosting eliminates API costs ($300+ per full SWE-bench evaluation)
  • Full control over inference: fine-tuning, quantization, custom prompting
  • DeepSeek and Qwen families offer code-specialized variants with focused training
  • No rate limits or vendor lock-in for production deployment

Proprietary advantages:

  • Claude and GPT still lead on the hardest tasks (complex multi-file patches)
  • Better instruction following and context utilization at extreme lengths
  • Faster iteration — new capabilities ship weekly without infrastructure cost
  • Claude Sonnet 5 set the 82.1% record with no public indication of saturation

Code Model Family Profiles

DeepSeek-Coder Family

Leading open-source

DeepSeek · Best SWE-bench: 76.3% (R1)

Models: DeepSeek-Coder 33B, V2.5 (236B MoE), R1 (671B MoE), V3.5 (685B MoE)

Pioneered MoE architecture for code. DeepSeek-Coder 33B was the first open code model to meaningfully score on SWE-bench (15.6%). R1 with reasoning chains pushed to 76.3%.

Qwen-Coder Family

Fast-rising

Alibaba · Best SWE-bench: 75.3% (Qwen3-Max)

Models: Qwen2.5-Coder 32B, Qwen3 72B, Qwen3-Max-Thinking

Qwen2.5-Coder specialized for code with strong multi-language support. Qwen3-Max-Thinking uses extended reasoning to approach frontier performance at 75.3%.

CodeLlama / Meta

Foundation layer

Meta · Best SWE-bench: 29.8% (70B)

Models: CodeLlama 7B/13B/34B/70B

Based on Llama 2. Code-specialized with fill-in-the-middle and long context. At 29.8%, it showed open models could meaningfully participate. Now superseded by newer families.

StarCoder / BigCode

Training data pioneer

BigCode (open research) · Best SWE-bench: 18.3% (StarCoder2 15B)

Models: StarCoder (15B), StarCoder2 (3B/7B/15B)

Built on The Stack v2 — the largest open code training dataset. Lower SWE-bench scores reflect smaller model sizes, but influential for the open-source code LLM ecosystem.

Why SWE-bench Is Hard for Code Models

Massive context requirements

Django has 500k+ lines of code. The model must process the issue, navigate the codebase, and generate a patch — all while maintaining coherence across extreme context lengths. Models with shorter context windows or weaker retrieval fail dramatically.
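One common mitigation is a retrieval pass that ranks files by lexical overlap with the issue before the model reads anything. A toy sketch of that idea (real pipelines use BM25 or embeddings; the file contents here are invented):

```python
def rank_files(issue_text, files):
    """Rank repo files by keyword overlap with the issue text.
    `files` maps path -> contents; returns paths, best match first."""
    issue_words = set(issue_text.lower().split())
    def score(item):
        path, text = item
        return len(issue_words & set(text.lower().split()))
    return [p for p, _ in sorted(files.items(), key=score, reverse=True)]

files = {
    "db/models/query.py": "class queryset union combinator sql",
    "forms/widgets.py": "render widget html media",
}
top = rank_files("QuerySet.union() generates wrong SQL", files)
```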

Ambiguous specifications

Unlike HumanEval's clear docstrings, GitHub issues are often vague: "X doesn't work when Y." The model must infer the complete specification, reproduce the bug mentally, and determine the correct fix from incomplete information.

Multi-file coordination

Average task requires editing 1.7 files, 3.0 functions, and 32.8 lines. A model must understand how changes in one module affect others — imports, class hierarchies, test expectations — and keep everything consistent.
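Those per-task averages come from the gold patches, and the underlying counting is simple: walk the unified diff and tally file headers and changed lines. A simplified sketch that ignores renames and binary files:

```python
def diff_stats(patch_text):
    """Count files touched and lines added/removed in a unified diff."""
    files, added, removed = 0, 0, 0
    for line in patch_text.splitlines():
        if line.startswith("diff --git"):
            files += 1
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return files, added, removed

patch = """diff --git a/app/models.py b/app/models.py
--- a/app/models.py
+++ b/app/models.py
@@ -1,2 +1,2 @@
-old_line
+new_line
"""
```

Running `diff_stats(patch)` on the example above yields one file, one line added, one removed.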

Zero tolerance for regressions

A patch must pass all fail-to-pass tests AND keep all pass-to-pass tests green. A single regression = failure. This means the code model cannot just "approximately fix" the issue — it must be precisely correct.
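The grading rule reduces to a conjunction over the two test sets. A minimal sketch (test names and statuses below are illustrative):

```python
def is_resolved(statuses, fail_to_pass, pass_to_pass):
    """A task is resolved only if every FAIL_TO_PASS test now passes
    AND every PASS_TO_PASS test still passes; one regression fails it."""
    return (all(statuses.get(t) == "PASSED" for t in fail_to_pass)
            and all(statuses.get(t) == "PASSED" for t in pass_to_pass))

# A patch that fixes the bug but breaks one existing test scores zero:
statuses = {"test_bugfix": "PASSED", "test_existing": "FAILED"}
```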

Key Papers

Foundational papers on code generation models and the SWE-bench evaluation framework.

  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan. ICLR 2024. 1,612 citations.
  • Code Llama: Open Foundation Models for Code. Roziere, Lachaux, Chanussot, Lample, et al. Meta AI, 2023. 2,945 citations.
  • DeepSeek-Coder: When the Large Language Model Meets Programming. Guo, Zhu, Cong, et al. arXiv, 2024. 820 citations.
  • StarCoder 2 and The Stack v2: The Next Generation. Lozhkov, Li, Allal, et al. (BigCode). arXiv, 2024. 589 citations.
  • Qwen2.5-Coder Technical Report. Hui, Yang, Cui, et al. (Alibaba). arXiv, 2024. 310 citations.
  • Agentless: Demystifying LLM-based Software Engineering Agents. Xia, Wen, Deng, Kang, Zou, Zhang. ICSE 2025. 420 citations.
  • SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Yang, Jimenez, Wettig, et al. NeurIPS 2024. 510 citations.

Key GitHub Repositories

Code model repos and evaluation frameworks central to SWE-bench performance.

A Note on Benchmark Contamination

In February 2026, OpenAI published an analysis arguing that SWE-bench Verified is "increasingly contaminated" — frontier models may have memorized solutions during training. Their analysis found 59.4% of the hardest tasks had flawed or insufficient tests. This has led to growing adoption of SWE-bench Pro (by Scale AI) and SWE-bench Live (by Microsoft) as contamination-resistant alternatives. The scores on this page should be interpreted with this context: they remain the most comprehensive cross-model comparison available, but may overstate absolute capability for models trained on post-2024 data.

Related Code Generation Benchmarks

| Benchmark | Focus | Tasks | Key Difference from SWE-bench |
|---|---|---|---|
| SWE-bench Pro | Hard SE tasks | Private | Uncontaminated, multi-file focus, harder tasks selected by Scale AI |
| SWE-bench Live | Live SE evaluation | 1,319+ | Monthly-updated from 93 repos, post-2024 issues only |
| HumanEval | Function synthesis | 164 | Single function only; saturated at ~98% |
| MBPP | Basic Python | 974 | Simple problems, no codebase context |
| LiveCodeBench | Competitive coding | Rolling | LeetCode-style, single file, algorithmic focus |
| Aider Polyglot | Code editing | 225 | Multi-language but single-file edits |
| RE-Bench | Research engineering | 7 | Much harder, longer tasks (hours vs minutes) |

Evaluate Your Code Model

SWE-bench is fully open-source. Run evaluations on your own models with Docker locally or in the cloud. The mini-SWE-agent harness makes standardized evaluation accessible in 100 lines of Python.
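The harness consumes a JSONL predictions file with one record per instance. The three-field shape below follows the documented format; the instance ID, model name, and patch content are placeholders:

```python
import json
import pathlib
import tempfile

predictions = [{
    "instance_id": "django__django-00000",   # placeholder ID
    "model_name_or_path": "my-code-model",   # your model's label
    "model_patch": "diff --git a/foo.py b/foo.py\n...",  # unified diff
}]

path = pathlib.Path(tempfile.mkdtemp()) / "preds.jsonl"
with path.open("w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# The evaluation is then run against this file, roughly:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl --run_id demo
```

Check the harness's own docs for the current flag names before running; the command above is a sketch of the invocation shape, not a guaranteed CLI.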

Track every code generation benchmark

CodeSOTA tracks state-of-the-art results across 200+ benchmarks including HumanEval, MBPP, LiveCodeBench, SWE-bench, and more. Compare open-source and proprietary models in one place.