Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
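Benchmarks in this family are usually scored with the pass@k metric: k samples are drawn per problem, and the problem counts as solved if any sample passes the unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval (the function name `pass_at_k` is our own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate, given n generated samples
    of which c passed the unit tests."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5.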
Benchmarks & Datasets
HumanEval
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
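A hypothetical problem in the HumanEval format (not an actual task from the dataset): the model is given the signature and docstring, and its completion is run against a hidden `check` function of assertions.

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.
    >>> is_palindrome("level")
    True
    >>> is_palindrome("python")
    False
    """
    # The model-generated completion starts here
    return s == s[::-1]

# Unit tests in the style of the benchmark's check function
def check(candidate):
    assert candidate("level") is True
    assert candidate("python") is False
    assert candidate("") is True

check(is_palindrome)
```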
MBPP
974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and the standard library.
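MBPP tasks pair a one-sentence natural-language description with three assert statements that serve as the test suite. A hypothetical task in that format (not from the dataset):

```python
# Task description: "Write a function to find the second smallest
# number in a list."

def second_smallest(nums):
    # Deduplicate, sort, and take the second element
    return sorted(set(nums))[1]

# MBPP-style test suite: three assert statements
assert second_smallest([1, 2, 3]) == 2
assert second_smallest([5, 5, 1, 3]) == 3
assert second_smallest([-2, 0, -2, 7]) == 0
```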
HumanEval+
Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.
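A hypothetical illustration (not an actual HumanEval task) of why the extra tests matter: a buggy solution can pass a sparse base test yet fail an edge-case test of the kind HumanEval+ adds.

```python
def median(nums):
    nums = sorted(nums)
    return nums[len(nums) // 2]  # wrong for even-length input

# Base-style test: odd-length happy path, so the bug slips through
assert median([3, 1, 2]) == 2

# Extended-style edge case: an even-length list exposes the bug
assert median([1, 2, 3, 4]) != 2.5  # a correct median would return 2.5
```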
MBPP+
Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
APPS
10,000 coding problems from Codewars, AtCoder, Kattis, and Codeforces. Ranges from introductory to competition level.
CodeContests
13,610 competitive programming problems from Codeforces. ~200 private test cases per problem. 12+ programming languages.
SWE-Bench
2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software engineering tasks.
SWE-Bench Verified
500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-Bench.