Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Code generation is a core task for coding models. Below are the standard benchmarks used to evaluate it, along with current state-of-the-art results.
Benchmarks & SOTA
HumanEval
HumanEval: Hand-Written Evaluation Set
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
State of the Art: o1-preview (OpenAI), 92.4% pass@1
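The pass@k metric used for HumanEval and MBPP follows the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c samples that pass all unit tests, and average 1 - C(n-c, k)/C(n, k) over problems (with a single greedy sample, pass@1 is simply the fraction of problems solved). A minimal sketch of the numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for one problem, 124 pass the unit tests.
print(pass_at_k(n=200, c=124, k=1))   # ~0.62 (equals c / n when k = 1)
print(pass_at_k(n=200, c=124, k=10))  # close to 1.0
```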
SWE-bench Verified
SWE-bench Verified Subset
500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.
State of the Art: Claude 3.5 Sonnet (Anthropic), 49% resolve rate
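Resolve rate is the fraction of instances whose issue is actually fixed: the model's patch, applied at the issue's base commit, must make the previously failing tests pass without breaking the existing ones. A rough sketch of what one instance exposes, assuming the Hugging Face datasets release (dataset id and field names follow the published SWE-bench artifacts; verify against the current docs):

```python
# Sketch: inspect one SWE-bench Verified instance (requires the `datasets` package).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")  # assumed dataset id
inst = ds[0]

print(inst["repo"])               # source repository, e.g. "astropy/astropy"
print(inst["base_commit"])        # commit the model's patch is applied to
print(inst["problem_statement"])  # the GitHub issue text shown to the model
print(inst["FAIL_TO_PASS"])       # tests that must pass after the patch
print(inst["PASS_TO_PASS"])       # tests that must keep passing

# An instance is "resolved" only if every FAIL_TO_PASS test passes and every
# PASS_TO_PASS test still passes; resolve rate = resolved / total instances.
```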
MBPP
Mostly Basic Python Problems
974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.
State of the Art: Claude 3.5 Sonnet (Anthropic), 89.2% pass@1
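Each MBPP task pairs a one-sentence natural-language prompt with three assert-based test cases, and a generated solution counts as correct only if every assert passes. The task below is a hypothetical illustration of that format, not an entry from the dataset:

```python
# Hypothetical MBPP-style task (illustrative, not a real dataset entry).
# Prompt: "Write a function to find the smaller of two numbers."

def min_of_two(a, b):
    """A model-generated candidate solution."""
    return a if a < b else b

# MBPP-style check: the candidate is correct only if every assert holds.
assert min_of_two(1, 2) == 1
assert min_of_two(-5, -2) == -5
assert min_of_two(3, 3) == 3
print("candidate passes all three tests")
```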
HumanEval+
HumanEval+ Extended Version
Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.
No results tracked yet
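The point of the extra tests is that a solution passing HumanEval's sparse hand-written tests can still hide bugs; HumanEval+ (from the EvalPlus project) adds generated edge-case tests to catch them. A hypothetical illustration, not an actual HumanEval task:

```python
# Hypothetical example of a bug that sparse tests miss (not a HumanEval task).

def max_elem(xs):
    """Buggy candidate: starting from 0 fails for all-negative inputs."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

# A sparse, HumanEval-style test lets the bug through.
assert max_elem([1, 5, 3]) == 5

# An extended, HumanEval+-style edge case catches it.
try:
    assert max_elem([-3, -7, -1]) == -1
except AssertionError:
    print("extended test caught the bug")
```

The EvalPlus toolkit distributes the extended test suites together with its own evaluation harness.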
APPS
Automated Programming Progress Standard
10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Ranges from introductory to competition level.
No results tracked yet
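Unlike the function-level benchmarks above, many APPS problems are graded competitive-programming style: the candidate program reads stdin and must reproduce the expected stdout on hidden input/output pairs. A generic sketch of that style of check (not the official APPS harness; the file name and cases are placeholders):

```python
# Generic stdin/stdout check in the style of APPS grading (not the official harness).
import subprocess

def passes_case(program: str, stdin_text: str, expected: str, timeout_s: float = 4.0) -> bool:
    """Run a candidate program on one hidden input and compare trimmed output."""
    try:
        result = subprocess.run(
            ["python", program],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected.strip()

# A problem is solved only if every hidden case passes (placeholder file and cases):
# cases = [("3\n1 2 3\n", "6"), ("2\n10 -4\n", "6")]
# solved = all(passes_case("candidate.py", i, o) for i, o in cases)
```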
MBPP+
MBPP+ Extended Version
Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
No results tracked yet
SWE-bench
SWE-bench: Software Engineering Benchmark
2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software engineering tasks.
No results tracked yet
CodeContests
CodeContests Competitive Programming
13,610 competitive programming problems from CodeForces. ~200 private test cases per problem. 12+ programming languages.
No results tracked yet