Computer Code

Code Generation

Generating code from natural language descriptions (HumanEval, MBPP).

9 datasets · 112 results · View full task mapping →

Code generation translates natural language descriptions into executable code. GPT-4, Claude 3.5 Sonnet, and DeepSeek-Coder lead on HumanEval and MBPP benchmarks, with pass@1 rates exceeding 90%. The frontier has shifted from function-level to repository-level code generation and multi-file project scaffolding.
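The pass@1 numbers quoted throughout this page come from the unbiased pass@k estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass all unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval-style).

    n: total samples generated per problem
    c: number of samples that passed all unit tests
    k: number of samples drawn per problem
    """
    if n - c < k:  # every possible draw of k samples contains a correct one
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples, 50 passing: pass@1 = 50/200 = 0.25
print(pass_at_k(200, 50, 1))
```

For k = 1 this reduces to the simple pass fraction c/n; the combinatorial form matters only when reporting pass@k for k > 1 from a fixed sample budget.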

History

2021

Codex (OpenAI) — first large-scale code generation model, powers GitHub Copilot

2021

HumanEval benchmark released — 164 Python programming problems

2022

AlphaCode (DeepMind) generates competitive programming solutions via massive sampling

2023

CodeLlama (Meta) — open-source code LLM up to 34B parameters

2023

StarCoder (BigCode) trained on permissively-licensed code from The Stack

2023

GPT-4 achieves 67% pass@1 on HumanEval

2024

DeepSeek-Coder-V2 reaches 90%+ on HumanEval with 236B MoE architecture

2024

Claude 3.5 Sonnet achieves 92% on HumanEval, excelling at complex multi-function tasks

2024

Qwen2.5-Coder-32B matches GPT-4 level at open-source scale

2025

HumanEval effectively saturated; EvalPlus, SWE-bench, and LiveCodeBench replace it as discriminating benchmarks

How Code Generation Works

Code Generation Pipeline
1. Intent Understanding

The model parses the natural language description, docstring, or function signature to understand what the code should do.

2. Context Analysis

Available context — imports, existing code, type hints, test cases — constrains and guides the generation.

3. Code Synthesis

The LLM generates code token by token, drawing on patterns learned from millions of open-source repositories during pretraining.

4. Self-Verification

Advanced methods generate multiple candidates and filter them by executing test cases (pass@k) or by model self-review.

5. Iterative Refinement

If tests fail, the model reads the error messages and refines the code — this debug loop is what makes agentic code generation effective.

Current Landscape

Code generation in 2025 is a commodity capability — every frontier LLM can write correct Python functions for standard tasks. The differentiation is in harder settings: multi-file generation, repository-aware code, and agentic coding that includes debugging. HumanEval has been superseded by EvalPlus (harder test cases), LiveCodeBench (real coding contest problems), and SWE-bench (real repository issues). Open-source models (DeepSeek-Coder, Qwen-Coder) have closed the gap with proprietary models, democratizing access.

Key Challenges

Benchmark saturation — HumanEval is nearly solved; real-world code generation is far harder

Repository context — generating code that fits into an existing codebase requires understanding thousands of files

Specification ambiguity — natural language descriptions are inherently imprecise, leading to correct-but-wrong implementations

Security vulnerabilities — generated code often contains security issues (SQL injection, XSS, buffer overflows)

License compliance — models trained on open-source code may reproduce copyrighted or licensed snippets
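To make the security challenge concrete, here is the kind of SQL injection bug that generated code frequently contains, next to the parameterized fix. The sketch uses Python's built-in sqlite3 module; the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name: str):
    # Typical generated bug: string interpolation lets input rewrite the query.
    return conn.execute(f"SELECT role FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the value strictly as data.
    return conn.execute("SELECT role FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(find_user_unsafe(payload))  # injection succeeds: every row is returned
print(find_user_safe(payload))    # no rows: the payload matches no name
```

Benchmarks like HumanEval score only functional correctness, so a model can reach high pass@1 while routinely emitting the unsafe variant.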

Quick Recommendations

Production code generation

Claude 3.5 Sonnet / GPT-4o

Highest reliability for real-world code with good instruction following

Open-source deployment

DeepSeek-Coder-V2 / Qwen2.5-Coder-32B

GPT-4-class code generation at open-source model costs

IDE integration

Copilot (GPT-4) / Continue (any model)

Inline completion integrated into development workflow

Competitive programming

AlphaCode2 / OpenAI o3

Best at hard algorithmic problems requiring creative solutions

What's Next

The frontier is autonomous software development — generating not just functions but entire features, with tests, documentation, and CI/CD integration. Expect code generation to merge with software engineering agents, where the distinction between 'write this function' and 'implement this feature' disappears.

Benchmarks & SOTA

SWE-Bench Verified

SWE-bench Verified Subset

2024 · 38 results

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

State of the Art

Claude Opus 4.5

Anthropic

80.9

resolve-rate

HumanEval

HumanEval: Hand-Written Evaluation Set

2021 · 30 results

164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
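For reference, each HumanEval item supplies a signature and docstring as the prompt and withholds unit tests for scoring. A hypothetical problem in that style (not an actual benchmark item):

```python
# Prompt given to the model (signature + docstring):
def running_max(nums: list[int]) -> list[int]:
    """Return a list where element i is the maximum of nums[:i+1]."""
    # --- the model's completion would go here; one correct solution: ---
    out, cur = [], float("-inf")
    for x in nums:
        cur = max(cur, x)
        out.append(cur)
    return out

# Hidden unit tests used to compute pass@1:
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_max([]) == []
```

A completion counts as a pass only if every hidden assertion succeeds, which is why EvalPlus's 80x larger test suites expose solutions that merely fit the visible examples.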

State of the Art

o4-mini

OpenAI

97.3

pass@1

LiveCodeBench

LiveCodeBench

2024 · 25 results

Contamination-free coding benchmark collecting new problems from LeetCode, AtCoder, and CodeForces after model knowledge cutoffs. Updated continuously with fresh problems. Primary metric is pass@1 on the full test set.

State of the Art

DeepSeek R1-0528

DeepSeek

73.3

pass@1

MBPP

Mostly Basic Python Problems

2021 · 19 results

974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and standard library.

State of the Art

o4-mini

OpenAI

94.9

pass@1

HumanEval+

HumanEval+ Extended Version

2023 · 0 results

Extended HumanEval with 80x more test cases. Tests code robustness and edge case handling.

No results tracked yet

APPS

Automated Programming Progress Standard

2021 · 0 results

10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces. Ranges from introductory to competition level.

No results tracked yet

MBPP+

MBPP+ Extended Version

2023 · 0 results

Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.

No results tracked yet

SWE-Bench

SWE-bench: Software Engineering Benchmark

2023 · 0 results

2,294 real GitHub issues from popular Python repositories. Tests ability to resolve real-world software engineering tasks.

No results tracked yet

CodeContests

CodeContests Competitive Programming

2022 · 0 results

13,610 competitive programming problems from CodeForces. ~200 private test cases per problem. 12+ programming languages.

No results tracked yet
