Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
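Benchmarks in this family are usually scored with the pass@k metric: k samples are drawn per problem, and the problem counts as solved if any sample passes the unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval (the function name `pass_at_k` is our own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate, given n generated samples
    of which c passed the unit tests."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5.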
Benchmarks & Datasets
HumanEval
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
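A hypothetical problem in the HumanEval format (not an actual task from the dataset): the model is given the signature and docstring, and its completion is run against a hidden `check` function of assertions.

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.
    >>> is_palindrome("level")
    True
    >>> is_palindrome("python")
    False
    """
    # The model-generated completion starts here
    return s == s[::-1]

# Unit tests in the style of the benchmark's check function
def check(candidate):
    assert candidate("level") is True
    assert candidate("python") is False
    assert candidate("") is True

check(is_palindrome)
```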
MBPP
974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and the standard library.
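MBPP tasks pair a one-sentence natural-language description with three assert statements that serve as the test suite. A hypothetical task in that format (not from the dataset):

```python
# Task description: "Write a function to find the second smallest
# number in a list."

def second_smallest(nums):
    # Deduplicate, sort, and take the second element
    return sorted(set(nums))[1]

# MBPP-style test suite: three assert statements
assert second_smallest([1, 2, 3]) == 2
assert second_smallest([5, 5, 1, 3]) == 3
assert second_smallest([-2, 0, -2, 7]) == 0
```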
HumanEval+
Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.
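A hypothetical illustration (not an actual HumanEval task) of why the extra tests matter: a buggy solution can pass a sparse base test yet fail an edge-case test of the kind HumanEval+ adds.

```python
def median(nums):
    nums = sorted(nums)
    return nums[len(nums) // 2]  # wrong for even-length input

# Base-style test: odd-length happy path, so the bug slips through
assert median([3, 1, 2]) == 2

# Extended-style edge case: an even-length list exposes the bug
assert median([1, 2, 3, 4]) != 2.5  # a correct median would return 2.5
```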
MBPP+
Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
APPS
10,000 coding problems from Codewars, AtCoder, Kattis, and Codeforces. Ranges from introductory to competition level.
CodeContests
13,610 competitive programming problems from Codeforces. ~200 private test cases per problem. 12+ programming languages.
SWE-Bench
2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software engineering tasks.
SWE-Bench Verified
500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-Bench.