Code benchmarks come in three sizes. The smallest — HumanEval — asks the model to write a single function from a docstring: sort a list, de-duplicate, solve a small algorithmic puzzle. Pass@1 means the model gets one attempt per problem, and its completion must pass every unit test. Frontier models now saturate this tier.
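To make the format concrete, here is a minimal sketch of what a HumanEval-style item and a pass@1 check look like. The problem, the completion, and the tests are illustrative stand-ins, not drawn from the actual benchmark.

```python
# A HumanEval-style item: the model sees only the signature and docstring
# and must produce the function body. Problem and tests are illustrative.
PROMPT = '''
def dedupe(xs: list[int]) -> list[int]:
    """Return xs with duplicates removed, preserving first-occurrence order."""
'''

# A candidate completion, as the model might return it.
COMPLETION = '''
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
'''

def passes(prompt: str, completion: str) -> bool:
    """Pass@1 check: one attempt, and every unit test must pass."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)   # assemble and define the function
        candidate = namespace["dedupe"]
        assert candidate([1, 2, 2, 3, 1]) == [1, 2, 3]
        assert candidate([]) == []
        assert candidate([5, 5, 5]) == [5]
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION))  # True for this completion
```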
The middle tier — LiveCodeBench — uses fresh competitive-programming problems posted after the model's training cutoff, which defeats memorisation. The score is less flattering but more honest.
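The contamination control is simple in spirit: only problems published after the model's training cutoff are scored, so a memorised solution cannot help. The cutoff date, problem records, and field names below are hypothetical, not the benchmark's actual data.

```python
from datetime import date

# Illustrative post-cutoff filter: score only problems the model
# could not have seen during training. All values are made up.
TRAINING_CUTOFF = date(2024, 4, 1)

problems = [
    {"id": "weekly-contest-one", "published": date(2024, 2, 10)},
    {"id": "weekly-contest-two", "published": date(2024, 6, 5)},
]

eligible = [p for p in problems if p["published"] > TRAINING_CUTOFF]
print([p["id"] for p in eligible])  # only the post-cutoff problem survives
```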
The largest — SWE-bench Verified — hands the model a real GitHub issue and a real repository. The model must read the bug report, navigate files it has never seen, reproduce the failure, and write a patch that passes the project's own test suite. The task cannot be solved from a single prompt: it takes an agent loop that explores the repository, edits files, and runs tests. This is the benchmark that correlates with shipping.
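Sketched below is the kind of loop such a harness runs. The model call, tool names, and step budget are assumptions rather than SWE-bench's actual scaffolding; the shape is the point: observe, act, run the tests, repeat.

```python
import subprocess

# Minimal agent-loop sketch for a SWE-bench-style task. ask_model and the
# two tools are hypothetical stand-ins for a real LLM API and scaffolding.

def run_tests(repo_dir: str) -> bool:
    """Run the project's own test suite inside the repository."""
    result = subprocess.run(["python", "-m", "pytest", "-x"],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def ask_model(issue: str, observations: list[str]) -> dict:
    """Hypothetical LLM call. Returns the next action, e.g.
    {"tool": "read_file", "path": "src/foo.py"} or
    {"tool": "apply_patch", "diff": "..."}."""
    raise NotImplementedError("wire up a real model here")

def solve_issue(issue: str, repo_dir: str, max_steps: int = 30) -> bool:
    """Read, patch, and test until the suite passes or the step budget runs out."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = ask_model(issue, observations)
        if action["tool"] == "read_file":
            with open(f"{repo_dir}/{action['path']}") as f:
                observations.append(f.read())
        elif action["tool"] == "apply_patch":
            subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                           input=action["diff"], text=True)
            if run_tests(repo_dir):   # success only if the project's tests pass
                return True
            observations.append("tests still failing after patch")
    return False
```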