SWE-Bench Explained
A benchmark that measures whether an LLM agent can fix real GitHub bugs: what the number actually represents, how it's scored, which variant to believe, and why leaderboard scores are probably 5-15 points too generous.
What SWE-Bench actually is
The task
Each SWE-Bench instance gives an agent: (1) a GitHub issue describing a bug, and (2) the full source repo checked out at the commit just before that bug was fixed. The agent must produce a patch that makes the formerly-failing tests pass without breaking any previously-passing tests; a sketch after the list below shows how to load an instance.
- 2,294 issues in the original (Oct 2023)
- 500 human-verified in SWE-Bench Verified (Aug 2024)
- Drawn from 12 mature Python repos: django, scikit-learn, sympy, flask, requests, matplotlib, pytest, astropy, pylint, pydata/xarray, mwaskom/seaborn, sphinx
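To see what an instance actually contains, pull one from the public dataset. A minimal sketch, assuming the HuggingFace datasets library and the published princeton-nlp/SWE-bench_Verified schema (FAIL_TO_PASS and PASS_TO_PASS are stored as JSON-encoded lists of test IDs):

    from datasets import load_dataset

    # Load the 500-task Verified split and grab the instance used as the
    # worked example later in this piece.
    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    inst = next(r for r in ds if r["instance_id"] == "django__django-11099")

    print(inst["problem_statement"][:300])  # the GitHub issue text the agent sees
    print(inst["base_commit"])              # repo state just before the real fix
    print(inst["FAIL_TO_PASS"])             # tests the patch must make pass
    print(inst["PASS_TO_PASS"])             # tests the patch must not break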
The scoring
Binary per task. An agent resolves an issue iff, after its patch is applied, every fail-to-pass test passes and every pass-to-pass test still passes. The reported metric is the % of issues resolved, pass@1 by convention; the verdict rule is sketched in code after the list below.
- No partial credit. "Mostly works" scores 0.
- No test creativity: agents are graded against the tests the PR author chose.
- Harness: one Docker container per instance, bit-for-bit reproducible.
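The verdict fits in a few lines. A minimal sketch, assuming per-test outcomes have already been collected into a dict of test ID → pass/fail:

    def resolved(fail_to_pass, pass_to_pass, outcomes):
        # outcomes: {test_id: True if the test passed after the patch}.
        # Both conditions are mandatory; there is no partial credit.
        return (all(outcomes.get(t, False) for t in fail_to_pass) and
                all(outcomes.get(t, False) for t in pass_to_pass))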
Scoring pipeline
Every SWE-Bench submission flows through the same harness.
[Diagram: SWE-Bench scoring flow. Issue + repo → agent patch → apply → tests → resolved / unresolved]
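The real harness builds a per-instance Docker image and drives each repo's own test runner; stripped of that machinery, the core apply-and-test step looks roughly like this (pytest here is an illustrative stand-in, not what the harness uses for every repo):

    import subprocess

    def apply_and_test(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
        # Apply the agent's patch to the checked-out base commit.
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
        # Run the graded tests; exit code 0 means every selected test passed.
        result = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
        return result.returncode == 0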
One concrete example
Real instance: django__django-11099. A URL name that is itself a Python reserved word broke reverse(). One-liner fix.
    @@ -341,6 +341,10 @@ class URLResolver:
         def _populate(self):
             lookups = MultiValueDict()
    -        if name.isidentifier():
    +        if name.isidentifier() and not keyword.iskeyword(name):
                 self.apps[name] = app_namespaces
    @@ +131,7 @@ class ResolverMatchTests(...):
    +    def test_keyword_reserved_name_rejected(self):
    +        with self.assertRaises(ImproperlyConfigured):
    +            resolve("/with/class/view/", urlconf="class_urls")
The agent succeeds iff, after applying its patch, the new test passes and no pre-existing Django URL resolver test regresses.
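The one-liner works because str.isidentifier() happily accepts reserved words; the stdlib keyword module is what rules them out:

    import keyword

    print("class".isidentifier())      # True  -- why the original check let it through
    print(keyword.iskeyword("class"))  # True  -- the condition the fix adds
    print(keyword.iskeyword("cls"))    # False -- ordinary names are unaffected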
What an agent actually does
Stylized call trace for the above instance under Claude Code + Opus 4.5.
[Figure: agent trajectory for one SWE-Bench instance (django__django-11099, Opus 4.5 + Claude Code)]
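In outline, the trace looks like the illustrative (not recorded) tool-call sequence below; file paths and commands are plausible guesses for this instance, not a transcript:

    # Stylized trajectory: (tool, action) pairs, hypothetical throughout.
    trajectory = [
        ("grep", "search for 'isidentifier' -> django/urls/resolvers.py"),
        ("read", "django/urls/resolvers.py, URLResolver._populate()"),
        ("edit", "guard the namespace check with keyword.iskeyword(name)"),
        ("edit", "add 'import keyword' at the top of resolvers.py"),
        ("bash", "run the URL resolver test module; confirm green"),
        ("done", "emit `git diff` as the submitted patch"),
    ]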
The family of variants
"SWE-Bench" in 2026 means a family, not a single benchmark. Each variant exists to patch a specific weakness of the original.
[Figure: SWE-Bench variants by size, coverage, and purpose. Circle radius scales with task count (non-linear for readability).]
Progress, 2023 → 2026
How the headline number has moved. Note the logarithmic shape, fast early gains then a flattening top: the classic benchmark saturation curve.
[Chart: SWE-Bench Verified SOTA over time]
The contamination problem
All 2,294 original issues predate every major 2024+ training cutoff. The fixes are on GitHub, visible in discussions, blogged about. Models may have memorized them.
[Figure: contamination risk. Why SWE-Bench scores may be inflated: train-set overlap with test-set PRs; the scored fixes existed publicly before model cutoffs.]
The OpenAI Feb 2026 audit found that 59.4% of the hardest Verified tasks had tests that wouldn't actually catch the intended bug, meaning an agent can "pass" without truly resolving the issue. Contamination plus weak tests inflate Verified scores by an estimated 5-15 points for post-2023 models. For clean numbers, use SWE-Bench Live or SWE-Bench Pro.
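You can screen for weak tests yourself: every FAIL_TO_PASS test should fail on the unpatched base commit, because a test that already passes can never certify the fix. A sketch, assuming a checkout at base_commit and pytest-style test IDs:

    import subprocess

    def is_discriminating(repo_dir: str, fail_to_pass: list[str]) -> bool:
        # Run the graded tests *before* any patch is applied.
        pre = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
        # A nonzero exit code pre-patch means the tests can detect the bug.
        return pre.returncode != 0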
Running SWE-Bench yourself
A full Verified run costs $100-$500 in API calls for frontier models and takes 4-12 hours on a single machine.
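The reference harness ships as the swebench package on PyPI. A sketch of a minimal run; the predictions format and CLI flags follow the project's README at the time of writing and may have changed:

    import json, subprocess

    # One prediction per instance: the patch your agent produced
    # ("patch.diff" is a hypothetical file name).
    preds = [{
        "instance_id": "django__django-11099",
        "model_name_or_path": "my-agent",
        "model_patch": open("patch.diff").read(),
    }]
    with open("preds.json", "w") as f:
        json.dump(preds, f)

    # Invoke the dockerized evaluation harness (pip install swebench).
    subprocess.run([
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",
        "--predictions_path", "preds.json",
        "--max_workers", "8",
        "--run_id", "verified-check",
    ], check=True)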
What SWE-Bench does not measure
- Design: no architectural decisions. Every task is "fix this bug", not "design this system".
- Stakeholder negotiation: no ambiguity about what the bug actually is; the linked PR defines ground truth.
- Frontend: Python-first (until Multilingual). No React, no visual correctness.
- Production ops: no deployment, no monitoring, no on-call debugging of live systems.
- Code review: agents generate patches; they do not review or critique human code.
- Cross-service: single-repo only. No microservices, no distributed debugging.