Why agentic evaluation is not code completion.
A code-completion benchmark asks: given this prefix, what is the next token? An agentic benchmark asks: given a goal, a shell, and an hour of wall-clock time, what does the model do? The two metrics measure different things, and the scores do not transfer.
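As a rough sketch of that contract (the loop, function names, and budget below are illustrative, not any particular harness), the agentic setting hands the model a goal and a shell, then lets it act until it stops or time runs out:

```python
import subprocess
import time


def run_episode(goal: str, propose_command, time_budget_s: float = 3600.0) -> list[dict]:
    """One agentic episode: the model sees the goal plus every prior shell result
    and keeps issuing commands until it stops or the wall-clock budget expires.
    `propose_command` stands in for the model; it returns a shell command or None."""
    transcript: list[dict] = []
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        command = propose_command(goal, transcript)
        if command is None:  # the model decides it is done
            break
        try:
            result = subprocess.run(
                command, shell=True, capture_output=True, text=True,
                timeout=max(1.0, deadline - time.monotonic()),
            )
        except subprocess.TimeoutExpired:
            break  # budget exhausted mid-command
        transcript.append({
            "cmd": command,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "exit": result.returncode,
        })
    return transcript
```

Nothing in that loop looks like next-token prediction; the quality of the run lives in which commands get proposed and in what order.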
On HumanEval, a strong 2023 model clears 90% pass@1. On SWE-bench Verified the same architecture struggles to get past 50%, because solving a real issue requires reading the repo, running tests, interpreting a stack trace, and revising a patch. The failure modes are not compilation errors. They are bad plans.
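For readers comparing the two numbers, pass@1 on HumanEval is conventionally the unbiased pass@k estimator from the Codex paper (Chen et al., 2021), evaluated at k = 1:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per problem, 185 correct -> pass@1 = 185/200
print(pass_at_k(200, 185, 1))  # 0.925
```

SWE-bench's resolved rate is scored per issue against the repository's own test suite, so the two percentages are not even computed the same way.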
That is also why scaffolds matter. mini-SWE-agent v2, SWE-agent, Aider, Cline, claude-code — each shapes the model's environment differently. A number without its scaffold is not a meaningful number; the tables above keep them together for exactly that reason.
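That pairing is easy to enforce mechanically. A minimal sketch (field names are illustrative, not the schema behind the tables) is to make the scaffold part of the result record itself:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgenticResult:
    """A benchmark score is only comparable alongside the harness that produced it."""
    benchmark: str  # e.g. "SWE-bench Verified"
    model: str
    scaffold: str   # e.g. "mini-SWE-agent v2", "SWE-agent", "Aider"
    score: float    # resolved rate, 0..1

    def label(self) -> str:
        # Render the score with its scaffold so the two never travel apart.
        return f"{self.benchmark}: {self.score:.1%} ({self.model} + {self.scaffold})"
```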
The scored benchmarks on this page are the closest thing we have to a real job description: fix bugs, audit binaries, instrument services, work alone for a shift, plan a year. The memory queue adds another axis: can the agent avoid repeating itself when the facts change?