Code benchmarks come in three sizes. The smallest — HumanEval — asks the model to write a single function from a docstring: sort a list, de-duplicate, solve a small algorithmic puzzle. Pass@1 means the model gets one attempt per problem, and its completion must pass every unit test. Frontier models now saturate this tier.
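To make the format concrete, here is a minimal sketch of what a HumanEval-style item and a pass@1 check look like. The problem, the completion, and the tests are illustrative stand-ins, not drawn from the actual benchmark.

```python
# A HumanEval-style item: the model sees only the signature and docstring
# and must produce the function body. Problem and tests are illustrative.
PROMPT = '''
def dedupe(xs: list[int]) -> list[int]:
    """Return xs with duplicates removed, preserving first-occurrence order."""
'''

# A candidate completion, as the model might return it.
COMPLETION = '''
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
'''

def passes(prompt: str, completion: str) -> bool:
    """Pass@1 check: one attempt, and every unit test must pass."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)   # assemble and define the function
        candidate = namespace["dedupe"]
        assert candidate([1, 2, 2, 3, 1]) == [1, 2, 3]
        assert candidate([]) == []
        assert candidate([5, 5, 5]) == [5]
        return True
    except Exception:
        return False

print(passes(PROMPT, COMPLETION))  # True for this completion
```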
The middle tier — LiveCodeBench — uses fresh competitive-programming problems posted after the model's training cutoff, which defeats memorisation. The score is less flattering but more honest.
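The contamination control is simple in spirit: only problems published after the model's training cutoff are scored, so a memorised solution cannot help. The cutoff date, problem records, and field names below are hypothetical, not the benchmark's actual data.

```python
from datetime import date

# Illustrative post-cutoff filter: score only problems the model
# could not have seen during training. All values are made up.
TRAINING_CUTOFF = date(2024, 4, 1)

problems = [
    {"id": "weekly-contest-one", "published": date(2024, 2, 10)},
    {"id": "weekly-contest-two", "published": date(2024, 6, 5)},
]

eligible = [p for p in problems if p["published"] > TRAINING_CUTOFF]
print([p["id"] for p in eligible])  # only the post-cutoff problem survives
```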
The largest — SWE-bench Verified — hands the model a real GitHub issue and a real repository. The model must read the bug report, navigate files it has never seen, reproduce the failure, and write a patch that passes the project's own test suite. The task cannot be solved from a single prompt: it takes an agent loop that explores the repository, edits files, and runs tests. This is the benchmark that correlates with shipping.
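Sketched below is the kind of loop such a harness runs. The model call, tool names, and step budget are assumptions rather than SWE-bench's actual scaffolding; the shape is the point: observe, act, run the tests, repeat.

```python
import subprocess

# Minimal agent-loop sketch for a SWE-bench-style task. ask_model and the
# two tools are hypothetical stand-ins for a real LLM API and scaffolding.

def run_tests(repo_dir: str) -> bool:
    """Run the project's own test suite inside the repository."""
    result = subprocess.run(["python", "-m", "pytest", "-x"],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def ask_model(issue: str, observations: list[str]) -> dict:
    """Hypothetical LLM call. Returns the next action, e.g.
    {"tool": "read_file", "path": "src/foo.py"} or
    {"tool": "apply_patch", "diff": "..."}."""
    raise NotImplementedError("wire up a real model here")

def solve_issue(issue: str, repo_dir: str, max_steps: int = 30) -> bool:
    """Read, patch, and test until the suite passes or the step budget runs out."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = ask_model(issue, observations)
        if action["tool"] == "read_file":
            with open(f"{repo_dir}/{action['path']}") as f:
                observations.append(f.read())
        elif action["tool"] == "apply_patch":
            subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                           input=action["diff"], text=True)
            if run_tests(repo_dir):   # success only if the project's tests pass
                return True
            observations.append("tests still failing after patch")
    return False
```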