SWE-Bench Explained
A benchmark that measures whether an LLM agent can fix real GitHub bugs: what the number actually represents, how it's scored, which variant to believe, and why leaderboard scores are probably 5-15 points too generous.
What SWE-Bench actually is
The task
Each SWE-Bench instance gives an agent: (1) a GitHub issue describing a bug, and (2) the full source repo checked out at the commit just before that bug was fixed. The agent must produce a patch that makes the formerly-failing tests pass without breaking any previously-passing tests; a sketch after the list below shows how to load an instance.
- 2,294 issues in the original (Oct 2023)
- 500 human-verified in SWE-Bench Verified (Aug 2024)
- Drawn from 12 mature Python repos: django, scikit-learn, sympy, flask, requests, matplotlib, pytest, astropy, pylint, pydata/xarray, mwaskom/seaborn, sphinx
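To see what an instance actually contains, pull one from the public dataset. A minimal sketch, assuming the HuggingFace datasets library and the published princeton-nlp/SWE-bench_Verified schema (FAIL_TO_PASS and PASS_TO_PASS are stored as JSON-encoded lists of test IDs):

    from datasets import load_dataset

    # Load the 500-task Verified split and grab the instance used as the
    # worked example later in this piece.
    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    inst = next(r for r in ds if r["instance_id"] == "django__django-11099")

    print(inst["problem_statement"][:300])  # the GitHub issue text the agent sees
    print(inst["base_commit"])              # repo state just before the real fix
    print(inst["FAIL_TO_PASS"])             # tests the patch must make pass
    print(inst["PASS_TO_PASS"])             # tests the patch must not break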
The scoring
Binary per task. An agent resolves an issue iff, after its patch is applied, every fail-to-pass test passes and every pass-to-pass test still passes. The reported metric is the % of issues resolved, pass@1 by convention; the verdict rule is sketched in code after the list below.
- No partial credit. "Mostly works" scores 0.
- No test creativity: agents are graded against the tests the PR author chose.
- Harness: one Docker container per instance, bit-for-bit reproducible.
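The verdict fits in a few lines. A minimal sketch, assuming per-test outcomes have already been collected into a dict of test ID → pass/fail:

    def resolved(fail_to_pass, pass_to_pass, outcomes):
        # outcomes: {test_id: True if the test passed after the patch}.
        # Both conditions are mandatory; there is no partial credit.
        return (all(outcomes.get(t, False) for t in fail_to_pass) and
                all(outcomes.get(t, False) for t in pass_to_pass))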
Scoring pipeline
Every SWE-Bench submission flows through the same harness.
[Diagram: SWE-Bench scoring flow. Issue + repo → agent patch → apply → tests → resolved / unresolved]
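The real harness builds a per-instance Docker image and drives each repo's own test runner; stripped of that machinery, the core apply-and-test step looks roughly like this (pytest here is an illustrative stand-in, not what the harness uses for every repo):

    import subprocess

    def apply_and_test(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
        # Apply the agent's patch to the checked-out base commit.
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
        # Run the graded tests; exit code 0 means every selected test passed.
        result = subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo_dir)
        return result.returncode == 0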
One concrete example
Real instance: django__django-11099. A URL name that is itself a Python reserved word broke reverse(). One-liner fix.
    @@ -341,6 +341,10 @@ class URLResolver:
         def _populate(self):
             lookups = MultiValueDict()
    -        if name.isidentifier():
    +        if name.isidentifier() and not keyword.iskeyword(name):
                 self.apps[name] = app_namespaces
    @@ +131,7 @@ class ResolverMatchTests(...):
    +    def test_keyword_reserved_name_rejected(self):
    +        with self.assertRaises(ImproperlyConfigured):
    +            resolve("/with/class/view/", urlconf="class_urls")
The agent succeeds iff, after applying its patch, the new test passes and no pre-existing Django URL resolver test regresses.
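The one-liner works because str.isidentifier() happily accepts reserved words; the stdlib keyword module is what rules them out:

    import keyword

    print("class".isidentifier())      # True  -- why the original check let it through
    print(keyword.iskeyword("class"))  # True  -- the condition the fix adds
    print(keyword.iskeyword("cls"))    # False -- ordinary names are unaffected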
What an agent actually does
Stylized call trace for the above instance under Claude Code + Opus 4.5.
[Figure: agent trajectory for one SWE-Bench instance (django__django-11099, Opus 4.5 + Claude Code)]
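In outline, the trace looks like the illustrative (not recorded) tool-call sequence below; file paths and commands are plausible guesses for this instance, not a transcript:

    # Stylized trajectory: (tool, action) pairs, hypothetical throughout.
    trajectory = [
        ("grep", "search for 'isidentifier' -> django/urls/resolvers.py"),
        ("read", "django/urls/resolvers.py, URLResolver._populate()"),
        ("edit", "guard the namespace check with keyword.iskeyword(name)"),
        ("edit", "add 'import keyword' at the top of resolvers.py"),
        ("bash", "run the URL resolver test module; confirm green"),
        ("done", "emit `git diff` as the submitted patch"),
    ]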
The family of variants
"SWE-Bench" in 2026 means a family, not a single benchmark. Each variant exists to patch a specific weakness of the original.
[Figure: SWE-Bench variants by size, coverage, and purpose. Circle radius scales with task count (non-linear for readability).]
Progress, 2023 → 2026
How the headline number has moved. Note the logarithmic shape, fast early gains then a flattening top: the classic benchmark saturation curve.
[Chart: SWE-Bench Verified SOTA over time]
The contamination problem
All 2,294 original issues predate every major 2024+ training cutoff. The fixes are on GitHub, visible in discussions, blogged about. Models may have memorized them.
[Figure: contamination risk. Why SWE-Bench scores may be inflated: train-set overlap with test-set PRs; the scored fixes existed publicly before model cutoffs.]
The OpenAI Feb 2026 audit found that 59.4% of the hardest Verified tasks had tests that wouldn't actually catch the intended bug, meaning an agent can "pass" without truly resolving the issue. Contamination plus weak tests inflate Verified scores by an estimated 5-15 points for post-2023 models. For clean numbers, use SWE-Bench Live or SWE-Bench Pro.
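You can screen for weak tests yourself: every FAIL_TO_PASS test should fail on the unpatched base commit, because a test that already passes can never certify the fix. A sketch, assuming a checkout at base_commit and pytest-style test IDs:

    import subprocess

    def is_discriminating(repo_dir: str, fail_to_pass: list[str]) -> bool:
        # Run the graded tests *before* any patch is applied.
        pre = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
        # A nonzero exit code pre-patch means the tests can detect the bug.
        return pre.returncode != 0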
Running SWE-Bench yourself
A full Verified run costs $100-$500 in API calls for frontier models and takes 4-12 hours on a single machine.
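The reference harness ships as the swebench package on PyPI. A sketch of a minimal run; the predictions format and CLI flags follow the project's README at the time of writing and may have changed:

    import json, subprocess

    # One prediction per instance: the patch your agent produced
    # ("patch.diff" is a hypothetical file name).
    preds = [{
        "instance_id": "django__django-11099",
        "model_name_or_path": "my-agent",
        "model_patch": open("patch.diff").read(),
    }]
    with open("preds.json", "w") as f:
        json.dump(preds, f)

    # Invoke the dockerized evaluation harness (pip install swebench).
    subprocess.run([
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",
        "--predictions_path", "preds.json",
        "--max_workers", "8",
        "--run_id", "verified-check",
    ], check=True)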
What SWE-Bench does not measure
- Design: no architectural decisions. Every task is "fix this bug", not "design this system".
- Stakeholder negotiation: no ambiguity about what the bug actually is; the linked PR defines ground truth.
- Frontend: Python-first (until Multilingual). No React, no visual correctness.
- Production ops: no deployment, no monitoring, no on-call debugging of live systems.
- Code review: agents generate patches; they do not review or critique human code.
- Cross-service: single-repo only. No microservices, no distributed debugging.