
Claude Opus 4.5 Hits 80.9% on SWE-bench Verified

Anthropic's flagship model resolves 4 out of 5 real GitHub issues on the industry's most demanding software engineering benchmark. With HumanEval reaching 97.8% under Opus 4.6, raw model capability is plateauing — and agent scaffolding has become the true differentiator.

  • 80.9% on SWE-bench Verified
  • 97.8% on HumanEval (Opus 4.6)
  • 4/5 issues resolved
  • #1 on the SWE-bench leaderboard

Anthropic has pushed the SWE-bench Verified benchmark past 80% for the first time. Claude Opus 4.5, released in March 2026, achieves 80.9% on the benchmark that tests models against real GitHub issues from popular open-source repositories. The result represents a 4.7 percentage point lead over GPT-5 (76.2%) and a 6.1 point gap over Gemini 3 Pro (74.8%).

This is not just an incremental gain. Crossing the 80% threshold means the model can reliably resolve the majority of real-world software engineering tasks — from bug fixes and feature additions to complex refactoring across multi-file repositories. Combined with Claude Opus 4.6's 97.8% on HumanEval, Anthropic now holds the top position on both of the most widely cited coding benchmarks.

SWE-bench Verified Leaderboard

| Model | Score |
|---|---|
| Claude Opus 4.5 (SOTA) | 80.9% |
| GPT-5 | 76.2% |
| Gemini 3 Pro | 74.8% |
| DeepSeek V3.2 | 73.1% |

SWE-bench Verified uses human-validated test cases from real GitHub issues across 12 popular Python repositories including Django, Flask, scikit-learn, and sympy.
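At its core, the scoring is mechanical: apply the model's patch, run the issue's previously failing tests, and count the instance as resolved only if they now pass. A minimal sketch of that scoring step (the real harness runs each instance in a containerized environment; the `Task` type and instance ids below are illustrative, not the harness's actual API):

```python
from dataclasses import dataclass

@dataclass
class Task:
    instance_id: str
    tests_passed: bool  # did the failing tests pass after applying the patch?

def resolution_rate(tasks: list[Task]) -> float:
    """Fraction of instances whose previously failing tests now pass."""
    if not tasks:
        return 0.0
    return sum(t.tests_passed for t in tasks) / len(tasks)

# Hypothetical results for 5 instances: 4 resolved, 1 not.
tasks = [Task(f"django__django-{i}", i != 3) for i in range(5)]
print(f"{resolution_rate(tasks):.0%}")  # -> 80%
```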

HumanEval Leaderboard

| Model | Score |
|---|---|
| Claude Opus 4.6 (SOTA) | 97.8% |
| Claude Opus 4.5 | 96.4% |
| GPT-5 | 95.1% |
| Gemini 3 Pro | 93.7% |
| DeepSeek V3.2 | 92.4% |

HumanEval measures function-level code generation across 164 programming problems. At 97.8%, the benchmark is approaching saturation.

Full Comparison: Frontier Coding Models

How Claude Opus 4.5 stacks up against the current generation of frontier models across key coding benchmarks and pricing:

| Model | SWE-bench | HumanEval | Cost |
|---|---|---|---|
| Claude Opus 4.5 (SOTA) | 80.9% | 96.4% | $15.00/1M |
| GPT-5 | 76.2% | 95.1% | $10.00/1M |
| Gemini 3 Pro | 74.8% | 93.7% | $7.00/1M |
| DeepSeek V3.2 | 73.1% | 92.4% | $0.27/1M |

Cost-performance trade-off: Claude Opus 4.5 commands a premium at $15/1M tokens, but its 80.9% SWE-bench score justifies the cost for teams where resolution rate directly impacts engineering velocity. DeepSeek V3.2 remains the best value at $0.27/1M but trails by 7.8 points on SWE-bench.
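One rough way to put price and resolution rate on the same axis is cost per unit of resolved work: divide the per-token price by the fraction of tasks resolved. This is a heuristic, not a billing model — it assumes comparable token usage per attempt, and uses the prices and scores from the table above:

```python
# (price per 1M tokens in USD, SWE-bench Verified resolution rate)
models = {
    "Claude Opus 4.5": (15.00, 0.809),
    "GPT-5":           (10.00, 0.762),
    "Gemini 3 Pro":    (7.00,  0.748),
    "DeepSeek V3.2":   (0.27,  0.731),
}

for name, (price, rate) in models.items():
    # Effective price once unresolved attempts are factored in.
    print(f"{name}: ${price / rate:.2f} per 1M tokens of resolved work")
```

By this measure the gap narrows but does not close: Claude Opus 4.5 works out to roughly $18.54 in effective terms versus about $0.37 for DeepSeek V3.2, so the premium only pays off where resolution rate itself is the bottleneck.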

Why 80% on SWE-bench Matters

The Reliability Threshold

At 80.9%, Claude Opus 4.5 resolves approximately 4 out of every 5 real GitHub issues it encounters. This crosses a critical threshold where AI-assisted development becomes reliable enough for production workflows. Engineers can delegate routine bug fixes and feature implementations with high confidence that the model will produce correct, mergeable patches.

Historical Trajectory

  • GPT-4 (Mar 2023): 33.2%
  • Claude 3.5 Sonnet (Oct 2024): 49.0%
  • DeepSeek V3.2 (Sep 2025): 73.1%
  • Claude Opus 4.5 (Mar 2026): 80.9%

Agent Scaffolding: The New Differentiator

With HumanEval nearing 98%, the era of raw model capability as the primary differentiator is ending. The models that win SWE-bench today do so not just through better weights, but through superior agent scaffolding — the tooling, retry logic, file navigation, and multi-step planning that wraps around the base model.

Claude Opus 4.5's SWE-bench submission uses Anthropic's extended thinking mode with iterative tool use: the model reads repository structure, identifies relevant files, forms a hypothesis about the bug, writes a patch, and validates against test cases — all within a single agentic loop. The gap between Opus 4.5's 80.9% and GPT-5's 76.2% likely reflects differences in agent architecture as much as model capability.
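The loop described above can be sketched as a simple control flow. Everything here is a placeholder interface — `Action`, `model.next_action`, and the `repo` methods are illustrative, not Anthropic's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "read" or "patch"
    path: str = ""  # file to read (for "read" actions)
    diff: str = ""  # candidate patch (for "patch" actions)

def agentic_fix(model, repo, issue, max_steps=10):
    """Sketch of an agentic repair loop: navigate, hypothesize, patch, validate."""
    context = [f"Issue: {issue}", f"Files: {repo.list_files()}"]
    for _ in range(max_steps):
        action = model.next_action(context)
        if action.kind == "read":
            context.append(repo.read(action.path))  # gather repository context
        elif action.kind == "patch":
            repo.apply(action.diff)                 # write the candidate fix
            ok, log = repo.run_tests()              # validate against the test suite
            if ok:
                return action.diff                  # mergeable patch found
            context.append(log)                     # self-correct from real feedback
    return None                                     # step budget exhausted
```

The key design property is that test output feeds back into the context on failure, which is what "iterative tool use" buys over single-shot patch generation.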

Extended Thinking

Multi-step reasoning chains allow the model to plan complex patches before writing code, reducing errors on multi-file changes.

Iterative Tool Use

File reading, search, and test execution are interleaved with reasoning, allowing the model to self-correct based on real feedback.

Repository Context

Large context windows (200K+ tokens) enable the model to ingest entire repository structures and understand cross-file dependencies.

HumanEval at 97.8%: Benchmark Saturation

Claude Opus 4.6 reaches 97.8% on HumanEval, leaving only 3-4 problems unsolved out of 164. At this level, HumanEval has effectively been saturated as a discriminative benchmark. The remaining failures are edge cases involving unusual Python idioms or ambiguous problem specifications rather than fundamental capability gaps.
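HumanEval results are conventionally reported as pass@k, estimated from n samples per problem of which c pass; the unbiased estimator is 1 − C(n−c, k)/C(n, k). A short sketch, including the back-of-envelope count behind "3-4 problems unsolved":

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw; always passes
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 7 of 10 samples passing gives pass@1 = 0.7
print(pass_at_k(10, 7, 1))

# 97.8% across 164 problems ~= 160 solved, i.e. 3-4 left unsolved
print(round(0.978 * 164))  # -> 160
```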

This saturation is why the industry's attention has shifted to SWE-bench, which tests the full pipeline of software engineering — understanding issue descriptions, navigating codebases, writing patches, and ensuring tests pass. SWE-bench Verified remains far from saturated at 80.9%, giving it at least another 1-2 years of useful signal before the same ceiling problem emerges.

What This Means for Engineering Teams

Ready for Production Use

  • Automated bug triage and patch generation
  • Code review acceleration with high-confidence suggestions
  • Test generation from issue descriptions
  • Routine refactoring and migration tasks
  • Documentation generation from code changes

Still Requires Human Review

  • Architecture decisions and system design
  • Security-critical code paths
  • Performance optimization in latency-sensitive systems
  • Novel algorithm implementation
  • Cross-team coordination and API design

Conclusion

Claude Opus 4.5's 80.9% on SWE-bench Verified is a landmark result. For the first time, a model reliably resolves the majority of real-world software engineering tasks drawn from production open-source repositories. Combined with HumanEval nearing saturation at 97.8% under Opus 4.6, Anthropic has established a clear lead in coding benchmarks.

But the more important story is the shift in what drives performance. The gap between 80.9% and 76.2% is not explained by model scale alone. Agent scaffolding — extended thinking, iterative tool use, repository-scale context — is now the primary lever for improvement. Teams evaluating coding AI should focus as much on the agent framework as on the underlying model.

Track the latest SWE-bench and HumanEval results on CodeSOTA's code generation benchmarks page.
