AI That Writes Software
From completing single functions (HumanEval) to resolving real GitHub issues (SWE-bench), code generation is the most practically impactful frontier of LLM capability.
From Snippets to Agents
Pass@1 (Function Level)
The model gets one try to write a single function (e.g., "sort this list"). If it passes unit tests, it wins. This is what HumanEval measures.
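Pass@1 is the k = 1 case of the pass@k metric: sample n completions per problem, count how many pass the unit tests, and estimate the probability that at least one of k attempts succeeds. The sketch below implements the standard unbiased estimator from the HumanEval paper; the sample counts in the usage line are made up purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, the HumanEval paper).

    n: completions sampled for a problem
    c: completions that passed every unit test
    k: attempt budget being scored
    """
    if n - c < k:
        # Every size-k subset of the samples contains a correct one.
        return 1.0
    # 1 minus the probability that a random size-k subset is all-incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples, 37 correct -> pass@1 = 0.185
print(round(pass_at_k(n=200, c=37, k=1), 3))
```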
Repo-Level Resolution
The model is given a real GitHub issue (bug report) and must navigate multiple files, reproduce the bug, and write a patch. This is SWE-bench.
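In outline, a SWE-bench-style evaluation resets the repository to the commit the issue was filed against, applies the model's patch, and reruns the tests that the fix is supposed to make pass. The sketch below is a simplified, assumption-laden version of that loop: the names (`base_commit`, `model_patch`, `fail_to_pass`) mirror the published dataset fields, but the official harness also re-runs previously passing tests and isolates each run in its own container.

```python
import subprocess

def evaluate_instance(repo_dir: str, base_commit: str, model_patch: str,
                      fail_to_pass: list[str]) -> bool:
    """Simplified SWE-bench-style check: does the model's patch make the
    issue's failing tests pass? (Sketch only; not the official harness.)"""
    # Reset the repository to the commit the issue was filed against.
    subprocess.run(["git", "checkout", "-f", base_commit], cwd=repo_dir, check=True)
    # Apply the model-generated patch from stdin.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch does not even apply
    # Run the tests the issue resolution is expected to fix (FAIL_TO_PASS).
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```

The snippet that follows shows the much smaller function-level case: a docstring specification and a model-generated body that must pass the tests.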
def solve_problem(input_list):
    """
    Sorts list and removes duplicates.

    >>> solve_problem([3, 1, 2, 1])
    [1, 2, 3]
    """
    # Model generated code:
    return sorted(list(set(input_list)))

Coding Proficiency Leaderboard
Comparing top models on standard function synthesis (HumanEval) and real-world engineering (SWE-bench Verified).
| Rank | Model | HumanEval (Pass@1) | MBPP (Pass@1) | SWE-bench Verified |
|---|---|---|---|---|
| #1 | Claude 3.5 Sonnet (Anthropic) | 92.0% | 89.2% | 49.0% |
| #2 | GPT-4o (OpenAI) | 90.2% | 87.8% | 41.2% |
| #3 | o1-preview (OpenAI) | 92.4% | - | - |
| #4 | DeepSeek V3 (DeepSeek) | 82.6% | - | - |
| #5 | Llama 3 70B (Meta) | 81.7% | - | - |
| #6 | DeepSeek V2.5 (DeepSeek) | - | - | 37.0% |
*SWE-bench Verified scores shown where available. Scores may vary by prompt strategy (e.g., 0-shot vs few-shot).
The Benchmarks
HumanEval
2021 · 164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. The standard benchmark for function-level code generation (a sketch of its task format follows this list).
MBPP
2021 · 974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard-library usage.
HumanEval+
2023 · Extends HumanEval with 80x more test cases to probe code robustness and edge-case handling.
MBPP+
2023 · Extends MBPP with additional test cases, using 399 hand-verified problems from MBPP-sanitized.
APPS
2021 · 10,000 coding problems from Codewars, AtCoder, Kattis, and Codeforces, ranging from introductory to competition level.
CodeContests
2022 · 13,610 competitive programming problems from Codeforces, with ~200 private test cases per problem and support for 12+ programming languages.
SWE-Bench
2023 · 2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software-engineering tasks.
SWE-Bench Verified
2024 · 500 manually verified GitHub issues confirmed solvable by human engineers; a high-quality subset of SWE-bench.
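For concreteness, here is a sketch of what a single HumanEval-style task looks like and how a completion is scored. The field names (`task_id`, `prompt`, `test`, `entry_point`) follow the public HumanEval release, but the toy problem and completion below are invented for illustration, and the real harness runs untrusted completions in a sandboxed worker rather than calling `exec` directly.

```python
# A single HumanEval-style record (fields follow the public HumanEval release;
# the actual dataset ships 164 such entries as JSONL).
task = {
    "task_id": "Example/0",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "entry_point": "add",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
}

completion = "    return a + b\n"  # the text the model is asked to produce

# Pass@1 check: execute prompt + completion, then the hidden tests.
namespace: dict = {}
exec(task["prompt"] + completion, namespace)   # defines the candidate function
exec(task["test"], namespace)                  # defines check()
namespace["check"](namespace[task["entry_point"]])
print("passed")
```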