Codesota · Agentic · Five benchmarks · thirty-plus models · one registry · Issue: April 22, 2026
Editorial · Agentic coding

Coding agents,
measured.

Autocomplete is a solved parlour trick. The harder question is how a model behaves when it is given tools, a repo, and an hour to fix a real bug. Five benchmarks on this page attempt to answer it, each from a different angle.

What follows is the registry as of April 2026 — software engineering, security analysis, observability instrumentation, autonomous task horizon, long-horizon startup planning, and now agent-memory coverage.

§ 01 · Software engineering

SWE-bench Verified.

500 hand-verified GitHub issues drawn from twelve popular Python repositories. This table uses the official all-agent Verified leaderboard; the mini-SWE-agent v2 bash-only slice is lower and should be read separately.


Metric: Resolve rate · higher is better
Suite: Verified subset · 500 tasks
Source: swebench.com
# · Model · Provider · Agent / Scaffold · Date · Resolve
01 · Claude Opus 4.5 medium · Anthropic / UIUC · live-SWE-agent · Dec 2025 · 79.2%
02 · Claude Opus 4.5 · Anthropic / Sonar · Sonar Foundation Agent · Dec 2025 · 79.2%
03 · Doubao-Seed-Code · ByteDance · TRAE · Sep 2025 · 78.8%
04 · Gemini 3 Pro Preview · Google / UIUC · live-SWE-agent · Nov 2025 · 77.4%
05 · Claude Sonnet 4 + GPT-5 · Atlassian · Rovo Dev · Sep 2025 · 76.8%
06 · Claude Sonnet 4 · EPAM · AI/Run Developer Agent · Aug 2025 · 76.8%
07 · Claude Opus 4.5 high · Anthropic / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 76.8%
08 · Mixed frontier models · ACoder · ACoder · Aug 2025 · 76.4%
09 · Gemini 3 Flash high · Google / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.8%
10 · MiniMax M2.5 high · MiniMax / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.8%
11 · Claude Opus 4.6 · Anthropic / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.6%
Fig 1 · Resolve rate on the official SWE-bench Verified all-agent leaderboard. Copper rows mark the current top score. The mini-SWE-agent v2 bash-only slice currently tops out at 76.8%, so scaffold labels are part of the result.
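
To see what one of those 500 tasks actually contains, the Verified split is public on Hugging Face. A minimal inspection sketch in Python, assuming the `datasets` library is installed; the field names follow the dataset's published schema.

```python
# Load the SWE-bench Verified split and look at a single task instance.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 hand-verified instances

task = ds[0]
print(task["instance_id"])               # unique task identifier
print(task["repo"])                      # the Python repository the issue comes from
print(task["problem_statement"][:300])   # the GitHub issue text the agent is given
# "patch" holds the maintainer's fix and "test_patch" the grading tests;
# an agent is scored on whether its own patch makes those tests pass.
```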
§ 02 · Security

BinaryAudit, reverse-engineered.

33 tasks testing whether an agent can spot a backdoor or time-bomb planted inside a roughly 40 MB compiled binary. Tools on the table: Ghidra, radare2, patience.


Metric: Detection rate · higher is better
Caveat: False-positive column is adjacent; read them together
Source: QuesmaOrg/binaryaudit
# · Model · Provider · Detect · False +
01 · Gemini 3.1 Pro Preview · Google · 49% · 12%
02 · Claude Opus 4.6 · Anthropic · 49% · 8%
03 · GPT-5.2 Codex XHigh · OpenAI · 46% · 14%
04 · Gemini 3 Pro Preview · Google · 44% · 9%
05 · GPT-5.3 Codex XHigh · OpenAI · 42% · 11%
06 · Claude Sonnet 4.6 · Anthropic · 31% · 7%
07 · DeepSeek v3.2 · DeepSeek · 18% · 22%
08 · Grok 4.1-Fast · xAI · 12% · 86%
Fig 2 · Detection and false-positive rates on BinaryAudit. A false-positive rate that dwarfs the detection rate — Grok 4.1-Fast's 12% / 86% — is an agent flagging everything as suspicious rather than reading the binary.
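
The scores above come from the benchmark's own harness; the sketch below only shows the flavour of a first triage pass an agent might run, using radare2's Python bindings. A hypothetical sketch, assuming `r2pipe` is installed; the binary path and the list of suspicious imports are illustrative, not part of BinaryAudit.

```python
# Hypothetical first-pass triage of a compiled binary with radare2 via r2pipe.
import r2pipe

r2 = r2pipe.open("target.bin")      # illustrative path, not a benchmark artifact
r2.cmd("aaa")                       # full analysis: functions, xrefs, strings
funcs = r2.cmdj("aflj") or []       # recovered functions as JSON
print(f"{len(funcs)} functions recovered")

# Cross-reference a few classically suspicious imports.
for imp in ("sym.imp.system", "sym.imp.execve", "sym.imp.popen"):
    xrefs = r2.cmd(f"axt @ {imp}")  # call sites that reach the import
    if xrefs.strip():
        print(f"{imp} referenced from:\n{xrefs}")

r2.quit()
# An agent that stops at this point flags everything that shells out;
# the false-positive column above punishes exactly that shortcut.
```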
§ 03 · Observability

OTelBench, across eleven languages.

23 tasks asking an agent to add distributed tracing, metrics, and logging to a real codebase using OpenTelemetry SDKs. Eleven languages on the table; the overall field average sits at 14%.


Metric: Pass rate · higher is better
Field avg.: 14% across all tested models
Source: QuesmaOrg/otel-bench
# · Model · Provider · Pass
01 · claude-opus-4.5 · Anthropic · 29%
02 · gpt-5.2 · OpenAI · 26%
03 · claude-sonnet-4.5 · Anthropic · 22%
04 · gemini-3-flash-preview · Google · 19%
05 · gemini-3-pro-preview · Google · 16%
06 · gpt-5.2-codex · OpenAI · 16%
07 · gpt-5.1 · OpenAI · 14%
08 · glm-4.7 · Z.ai · 13%
09 · deepseek-v3.2 · DeepSeek · 12%
10 · gpt-5.1-codex-max · OpenAI · 12%
11 · kimi-k2-thinking · Moonshot AI · 7%
12 · claude-haiku-4.5 · Anthropic · 6%
13 · grok-4 · xAI · 4%
14 · grok-4.1-fast · xAI · 3%
Fig 3 · OTelBench pass rates, fourteen models ranked. The distance between first (29%) and last (3%) is wider than the spread frontier models now show on saturated knowledge benchmarks like MMLU — agentic workloads still separate the field.
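
For a sense of scale, the kind of change a single OTelBench task expects looks roughly like the sketch below: wire a tracer provider into the service, then wrap the hot path in a span. A minimal Python sketch, assuming the `opentelemetry-sdk` package; the service and span names are placeholders, and real tasks also cover metrics and logs across the other ten languages.

```python
# Minimal manual tracing with the OpenTelemetry Python SDK.
# "checkout-service" and charge_card() are placeholder names.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str, amount_cents: int) -> None:
    # Each unit of work becomes a span; attributes carry the context a trace needs.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("charge.amount_cents", amount_cents)
        # existing business logic stays untouched

charge_card("ord-42", 1999)
```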
§ 04 · Autonomy

METR Time Horizon.

How long can an agent work on its own before it fails or asks for help? The 50% time horizon is the length of task, measured by how long it takes a skilled human, at which the agent succeeds half the time. The record has roughly doubled every 4.3 months since 2023.


Metric: 50% time horizon · longer is better
Suite: TH 1.1 · 228-task HCAST · Inspect framework
Source: metr.org/time-horizons
# · Model · Provider · Date · TH-50
01 · Claude Opus 4.6 · Anthropic · Feb 2026 · ~12 hr
02 · GPT-5.3-Codex · OpenAI · Feb 2026 · 350 min
03 · GPT-5.2 · OpenAI · Dec 2025 · 352 min
04 · Claude Opus 4.5 · Anthropic · Nov 2025 · 293 min
05 · Gemini 3 Pro · Google · Nov 2025 · 224 min
06 · GPT-5.1-Codex-Max · OpenAI · Nov 2025 · 224 min
07 · GPT-5 · OpenAI · Aug 2025 · 203 min
08 · o3 · OpenAI · Apr 2025 · 120 min
09 · Claude Opus 4 · Anthropic · 2025 · 101 min
10 · Claude 3.7 Sonnet · Anthropic · Feb 2025 · 60 min
11 · o1 · OpenAI · Dec 2024 · 39 min
Fig 4 · 50% time horizon on METR TH 1.1. The December-2024 top (o1, 39 min) is now the bottom of the table — in fourteen months the horizon has extended from tens of minutes to roughly twelve hours.
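
The doubling claim is easy to sanity-check from any two rows of the table. A quick arithmetic sketch in Python using the o1 and Claude Opus 4.6 rows; note that the 4.3-month figure in the text is METR's fit over a longer window back to 2023, so a two-point estimate will not land on it exactly.

```python
# Implied doubling time between two table rows:
# o1 at 39 min (Dec 2024) and Claude Opus 4.6 at ~12 hr ≈ 720 min (Feb 2026).
import math

months_elapsed = 14                      # Dec 2024 to Feb 2026
horizon_start_min = 39
horizon_end_min = 12 * 60

doublings = math.log2(horizon_end_min / horizon_start_min)   # ~4.2 doublings
doubling_time_months = months_elapsed / doublings            # ~3.3 months

print(f"{doublings:.1f} doublings in {months_elapsed} months")
print(f"implied doubling time: {doubling_time_months:.1f} months")
```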
§ 05 · Long-horizon planning

YC-Bench, a simulated year.

The agent is handed $200K and twelve months. It hires, fires, picks contracts, and handles adversarial clients in a partially observable world. Scores are averaged across three seeds. Only three models ended the year above $1M; half the field finished below its starting capital.


Metric: Ending net worth · higher is better
Seeds: 3 · bankruptcy column counts failed seeds
Source: collinear-ai/yc-bench
# · Model · Provider · Net worth · Bankrupt
01 · Claude Opus 4.6 · Anthropic · $1.27M · 0/3
02 · GLM-5 · Zhipu AI · $1.21M · 0/3
03 · GPT-5.4 · OpenAI · $1.00M · 0/3
04 · Kimi-K2.5 · Moonshot AI · $409K · 1/3
05 · Gemini 3 Flash · Google · $394K · 0/3
06 · Gemini 3.1 Flash Lite · Google · $203K · 1/3
07 · GPT-5.4 Mini · OpenAI · $138K · 1/3
08 · Claude Sonnet 4.6 · Anthropic · $104K · 2/3
09 · Qwen 3.5-397B · Alibaba · $91K · 1/3
10 · Gemini 3.1 Pro · Google · $66K · 1/3
11 · GPT-5.4 Nano · OpenAI · $39K · 1/3
12 · Grok 4.20 Beta · xAI · $25K · 2/3
Fig 5 · Ending net worth averaged over three seeds. Bankruptcies are tallied separately so that a high average driven by one lucky run is visible as such.
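
To make the averaging point concrete: with only three seeds, one strong run can prop up the mean even when the other two fail. A toy sketch; the per-seed figures are hypothetical, chosen only so the average lands on the Claude Sonnet 4.6 row above.

```python
# Hypothetical per-seed outcomes: two bankruptcies and one good run still
# average to $104K, the same headline figure a steadier model could post.
seeds = [0, 0, 312_000]   # illustrative numbers, not published per-seed data

mean_net_worth = sum(seeds) / len(seeds)
bankruptcies = sum(1 for s in seeds if s <= 0)

print(f"mean net worth: ${mean_net_worth:,.0f}")        # $104,000
print(f"bankrupt seeds: {bankruptcies}/{len(seeds)}")   # 2/3
```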
§ 06 · Memory

Agent memory, before the run.

Long-running agents fail when they repeat mistakes they have already solved, preserve stale beliefs, ignore deletes, or cannot show why a preflight decision was made. Memory benchmarks belong in this agentic registry, but local regression artifacts are not the same thing as an official leaderboard score.


Track: Agent Memory Benchmark + preflight-memory artifacts
Gate: No Audrey score until an official AMB run exists
Source: Evilander/Audrey
# · Artifact · Scope · Status · Source
01 · Agent Memory Benchmark (AMB) · Provider harness · Track for official scores · vectorize-io/agent-memory-benchmark
02 · Audrey memory artifacts · Local deterministic evidence · Evidence only; no leaderboard claim · HF report + raw artifacts
03 · Audrey AMB provider request · Evaluation route · Pending official harness run · AMB issue #11
Fig 6 · Agent-memory coverage queue. The Audrey rows are local deterministic regression/performance evidence; CodeSOTA should promote them to scored leaderboard rows only after the AMB harness produces comparable results.
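
What "local deterministic evidence" means in practice is a regression test over the memory layer itself rather than a scored benchmark run. A hypothetical sketch of one such check, targeting the ignore-deletes failure mode named above; the `MemoryStore` interface is invented for illustration and is not the AMB harness or the Audrey artifact format.

```python
# Hypothetical regression check: a deleted fact must not resurface in recall.
# MemoryStore, remember(), forget() and recall() are illustrative names only.
class MemoryStore:
    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value

    def forget(self, key: str) -> None:
        self._facts.pop(key, None)

    def recall(self, query: str) -> list[str]:
        return [v for k, v in self._facts.items() if query in k]

def test_deletes_are_respected() -> None:
    store = MemoryStore()
    store.remember("db.password.rotation", "rotated on 2026-03-01")
    store.forget("db.password.rotation")
    assert store.recall("db.password") == [], "stale belief survived a delete"

test_deletes_are_respected()
```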
§ 07 · Commentary

Why agentic is not code completion.

A code-completion benchmark asks: given this prefix, what is the next token? An agentic benchmark asks: given a goal, a shell, and an hour of wall-clock, what does the model do? The two metrics measure different things, and the scores do not transfer.

On HumanEval, a strong 2023 model clears 90% pass@1. On SWE-bench Verified the same class of model struggles to clear 50% — because solving a real issue requires reading the repo, running tests, interpreting a stack trace, and revising a patch. The failure modes are not compilation errors. They are bad plans.

That is also why scaffolds matter. mini-SWE-agent v2, SWE-agent, Aider, Cline, claude-code — each shapes the model's environment differently. A number without its scaffold is not a meaningful number; the tables above keep them together for exactly that reason.

The scored benchmarks on this page are the closest thing we have to a real job description: fix bugs, audit binaries, instrument services, work alone for a shift, plan a year. The memory queue adds another axis: can the agent avoid repeating itself when the facts change?

§ 08 · Related

What to read next.

Cross-linked · April 2026

Three places to go from here.

Adoption data · OpenRouter models
Inverted view of OpenRouter — every model in the catalog, every agent that uses it, ranked by spend, volume, and adoption.

Adoption data · OpenRouter trends
Vendor share over time. Where the dollar shifted month-over-month and what flipped on the chart.

Sister hub · LLM benchmarks
The full register of frontier LLM benchmarks. Reasoning, code, multimodal, and the rest.