Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Code generation translates natural language descriptions into executable code. GPT-4, Claude 3.5 Sonnet, and DeepSeek-Coder lead on HumanEval and MBPP benchmarks, with pass@1 rates exceeding 90%. The frontier has shifted from function-level to repository-level code generation and multi-file project scaffolding.
History
2021: Codex (OpenAI) — first large-scale code generation model; powers GitHub Copilot
2021: HumanEval benchmark released — 164 Python programming problems
2022: AlphaCode (DeepMind) generates competitive programming solutions via massive sampling
2023: Code Llama (Meta) — open-source code LLMs up to 34B parameters
2023: StarCoder (BigCode) trained on permissively licensed code from The Stack
2023: GPT-4 achieves 67% pass@1 on HumanEval
2024: DeepSeek-Coder-V2 reaches 90%+ on HumanEval with a 236B-parameter MoE architecture
2024: Claude 3.5 Sonnet achieves 92% on HumanEval, excelling at complex multi-function tasks
2024: Qwen2.5-Coder-32B matches GPT-4-level performance at open-source scale
2024–2025: HumanEval effectively saturated; EvalPlus, SWE-bench, and LiveCodeBench replace it as discriminating benchmarks
How Code Generation Works
Intent Understanding
The model parses the natural language description, docstring, or function signature to understand what code should do.
Context Analysis
Available context — imports, existing code, type hints, test cases — constrains and guides the generation.
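Concretely, this context is often assembled into a single prompt string. A minimal sketch of that assembly step, where the helper name, fields, and layout are illustrative rather than any specific system's format (real pipelines also splice in imports, neighboring code, and type stubs):

```python
def build_prompt(signature: str, docstring: str,
                 examples: list[tuple[str, str]]) -> str:
    """Assemble a code-generation prompt from the available context.

    Illustrative only: signature + docstring + doctest-style examples,
    formatted the way the model saw such code during pretraining.
    """
    lines = [signature, f'    """{docstring}']
    for call, result in examples:  # worked examples constrain the output
        lines.append(f"    >>> {call}")
        lines.append(f"    {result}")
    lines.append('    """')
    return "\n".join(lines)

prompt = build_prompt(
    "def median(xs: list[float]) -> float:",
    "Return the median of a non-empty list.",
    [("median([3.0, 1.0, 2.0])", "2.0")],
)
```

The completed prompt looks like an unfinished function body, which is exactly the shape a code model is trained to continue.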
Code Synthesis
The LLM generates code token by token, drawing on patterns learned from millions of open-source repositories during pretraining.
Self-Verification
Advanced methods generate multiple candidates and filter them by executing each against test cases (the procedure behind the pass@k metric) or by model self-review.
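pass@k is usually computed with the unbiased estimator from the Codex paper: generate n samples, count the c that pass, and estimate the probability that a random size-k subset contains at least one passing sample:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: sample budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to c/n = 0.3.
score = pass_at_k(10, 3, 1)
```

Computing the ratio of binomial coefficients directly (rather than averaging over random subsets) keeps the estimate exact and numerically stable.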
Iterative Refinement
If tests fail, the model reads error messages and refines the code — this debug loop is what makes agentic code generation effective.
Current Landscape
Code generation in 2025 is a commodity capability — every frontier LLM can write correct Python functions for standard tasks. The differentiation is in harder settings: multi-file generation, repository-aware code, and agentic coding that includes debugging. HumanEval has been superseded by EvalPlus (harder test cases), LiveCodeBench (real coding contest problems), and SWE-bench (real repository issues). Open-source models (DeepSeek-Coder, Qwen-Coder) have closed the gap with proprietary models, democratizing access.
Key Challenges
Benchmark saturation — HumanEval is nearly solved; real-world code generation is far harder
Repository context — generating code that fits into an existing codebase requires understanding thousands of files
Specification ambiguity — natural language descriptions are inherently imprecise, leading to correct-but-wrong implementations
Security vulnerabilities — generated code often contains security issues (SQL injection, XSS, buffer overflows)
License compliance — models trained on open-source code may reproduce copyrighted or licensed snippets
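The security challenge is concrete: generated code frequently splices user input into query strings, when the parameterized form is what production code needs. A minimal sqlite3 illustration of the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name: str):
    # Vulnerable: user input is spliced directly into the SQL string.
    return conn.execute(
        f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the value as data, not SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)).fetchall()

# A classic injection payload leaks every row from the unsafe version...
leaked = find_user_unsafe("' OR '1'='1")
# ...while the parameterized query matches it as a literal (no rows).
safe = find_user_safe("' OR '1'='1")
```

Benchmarks like HumanEval never test for this, which is one reason pass@1 scores overstate production readiness.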
Quick Recommendations
Production code generation
Claude 3.5 Sonnet / GPT-4o
Highest reliability for real-world code with good instruction following
Open-source deployment
DeepSeek-Coder-V2 / Qwen2.5-Coder-32B
GPT-4-class code generation at open-source model costs
IDE integration
Copilot (GPT-4) / Continue (any model)
Inline completion integrated into development workflow
Competitive programming
AlphaCode2 / OpenAI o3
Best at hard algorithmic problems requiring creative solutions
What's Next
The frontier is autonomous software development — generating not just functions but entire features, with tests, documentation, and CI/CD integration. Expect code generation to merge with software engineering agents, where the distinction between 'write this function' and 'implement this feature' disappears.
Benchmarks & SOTA
SWE-bench Verified
500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.
State of the Art: Claude Opus 4.5 (Anthropic), 80.9% resolve rate
HumanEval (Hand-Written Evaluation Set)
164 hand-crafted Python programming problems with function signatures, docstrings, and unit tests. Standard benchmark for code generation.
State of the Art: o4-mini (OpenAI), 97.3% pass@1
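A HumanEval task supplies a signature plus docstring as the prompt and holds out unit tests for scoring. This self-contained sketch mimics that format, modeled loosely on the benchmark's first task:

```python
# Prompt given to the model: signature + docstring.
PROMPT = '''\
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers are closer to each other than threshold."""
'''

# A candidate completion (what the model would return).
COMPLETION = """\
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
"""

# Scoring: execute prompt + completion, then run the held-out unit tests.
ns: dict = {}
exec(PROMPT + COMPLETION, ns)
f = ns["has_close_elements"]
assert f([1.0, 2.0, 3.9], 0.3) is False
assert f([1.0, 2.0, 2.1], 0.3) is True
```

A problem counts as solved only if every held-out assertion passes, which is why EvalPlus's 80x larger test suites expose failures HumanEval misses.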
LiveCodeBench
Contamination-free coding benchmark collecting new problems from LeetCode, AtCoder, and CodeForces after model knowledge cutoffs. Updated continuously with fresh problems. Primary metric is pass@1 on the full test set.
State of the Art: DeepSeek R1-0528 (DeepSeek), 73.3% pass@1
MBPP (Mostly Basic Python Problems)
974 crowd-sourced Python programming problems suitable for beginners. Covers programming fundamentals and the standard library.
State of the Art: o4-mini (OpenAI), 94.9% pass@1
HumanEval+
Extended HumanEval with 80x more test cases. Tests code robustness and edge-case handling.
No results tracked yet.
APPS (Automated Programming Progress Standard)
10,000 coding problems from Codewars, AtCoder, Kattis, and CodeForces, ranging from introductory to competition level.
No results tracked yet.
MBPP+
Extended MBPP with additional test cases. Uses 399 hand-verified problems from MBPP-sanitized.
No results tracked yet.
SWE-bench (Software Engineering Benchmark)
2,294 real GitHub issues from popular Python repositories. Tests the ability to resolve real-world software engineering tasks.
No results tracked yet.
CodeContests
13,610 competitive programming problems from CodeForces, with ~200 private test cases per problem and 12+ programming languages.
No results tracked yet.