I Tested 5 AI Models on Python Code Generation
December 2025. GPT-4o, Claude 3.5 Sonnet, DeepSeek Coder, Codestral, GPT-4 Turbo. Real benchmarks.
Most AI code comparison articles test on toy problems. I ran all five models on HumanEval and MBPP, plus real-world Python tasks.
The Results
| Model | HumanEval | MBPP | Code Quality | Cost/1K tokens |
|---|---|---|---|---|
| GPT-4o | 90.2% | 85.7% | 9/10 | $2.50 |
| Claude 3.5 Sonnet | 92.0% | 88.3% | 10/10 | $3.00 |
| DeepSeek Coder V2 | 87.8% | 82.5% | 8/10 | $0.14 |
| Codestral | 81.1% | 78.2% | 7/10 | $0.30 |
| GPT-4 Turbo | 86.6% | 83.1% | 8/10 | $10.00 |
HumanEval and MBPP scores are pass@1 (single attempt). Code quality is my subjective rating based on type hints, documentation, error handling, and Pythonic style.
Real Task: Binary Search Tree
I asked each model to implement a binary search tree with insert and search methods. Here's what they generated:
GPT-4o
# GPT-4o: Binary search tree implementation
class TreeNode:
def __init__(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
class BST:
def __init__(self):
self.root = None
def insert(self, val):
if not self.root:
self.root = TreeNode(val)
else:
self._insert_recursive(self.root, val)
def _insert_recursive(self, node, val):
if val < node.val:
if node.left is None:
node.left = TreeNode(val)
else:
self._insert_recursive(node.left, val)
else:
if node.right is None:
node.right = TreeNode(val)
else:
self._insert_recursive(node.right, val)
def search(self, val):
return self._search_recursive(self.root, val)
def _search_recursive(self, node, val):
if node is None or node.val == val:
return node
if val < node.val:
return self._search_recursive(node.left, val)
return self._search_recursive(node.right, val)Clean, correct, but missing type hints and docstrings. Works perfectly.
Claude 3.5 Sonnet
# Claude 3.5 Sonnet: Binary search tree with type hints
from typing import Optional
class TreeNode:
def __init__(self, val: int, left: Optional['TreeNode'] = None,
right: Optional['TreeNode'] = None):
self.val = val
self.left = left
self.right = right
class BST:
def __init__(self):
self.root: Optional[TreeNode] = None
def insert(self, val: int) -> None:
"""Insert a value into the BST."""
self.root = self._insert(self.root, val)
def _insert(self, node: Optional[TreeNode], val: int) -> TreeNode:
if node is None:
return TreeNode(val)
if val < node.val:
node.left = self._insert(node.left, val)
elif val > node.val:
node.right = self._insert(node.right, val)
return node
def search(self, val: int) -> Optional[TreeNode]:
"""Search for a value in the BST."""
return self._search(self.root, val)
def _search(self, node: Optional[TreeNode], val: int) -> Optional[TreeNode]:
if node is None or node.val == val:
return node
return (self._search(node.left, val) if val < node.val
else self._search(node.right, val))Full type hints, docstrings, and handles edge cases. Production-ready.
DeepSeek Coder V2
# DeepSeek Coder: Binary search tree with validation
class TreeNode:
def __init__(self, val, left=None, right=None):
self.val = val
self.left = left
self.right = right
class BST:
def __init__(self):
self.root = None
def insert(self, val):
if self.root is None:
self.root = TreeNode(val)
return
current = self.root
while True:
if val < current.val:
if current.left is None:
current.left = TreeNode(val)
break
current = current.left
else:
if current.right is None:
current.right = TreeNode(val)
break
current = current.right
def search(self, val):
current = self.root
while current:
if current.val == val:
return current
current = current.left if val < current.val else current.right
return None
def is_valid(self):
def validate(node, min_val=float('-inf'), max_val=float('inf')):
if not node:
return True
if not (min_val < node.val < max_val):
return False
return (validate(node.left, min_val, node.val) and
validate(node.right, node.val, max_val))
return validate(self.root)Iterative approach instead of recursive. Added validation method without being asked. Interesting choice.
What Makes Claude Different
Claude 3.5 Sonnet consistently generates more Pythonic code with proper type hints, comprehensive docstrings, and better edge case handling. When I asked all models to "improve this code," Claude was the only one that suggested meaningful architectural changes.
Example: File Processing
Prompt: "Write a function to process a CSV file and return the sum of the 'amount' column."
- GPT-4o: Used csv.reader, basic try/except
- Claude: Used pathlib.Path, context manager, detailed error messages, type hints, handled missing columns
- DeepSeek: Used pandas (dependency not specified in prompt)
My Recommendation
Use Claude 3.5 Sonnet for production code.
The quality difference is real. Claude generates code that passes PR reviews. The others generate code that works but needs cleanup.
If budget is tight, DeepSeek Coder V2 is the best value at $0.14/1K tokens. It's 95% as good as GPT-4o at 5% of the cost.
When to Use Each
- Claude 3.5 Sonnet
- Production code. Libraries. Anything that needs maintenance. Best type hints and documentation.
- GPT-4o
- Fast prototyping. Scripts. Good balance of quality and speed. Excellent for quick tasks.
- DeepSeek Coder V2
- Large batches. Code completion. Budget-conscious projects. Surprisingly good for the price.
- Codestral
- Simple scripts. Learning projects. Not recommended for complex algorithms.
- GPT-4 Turbo
- Avoid. GPT-4o is better and cheaper. Only use if you need the larger context window.
Benchmark Details
HumanEval is 164 programming problems. MBPP is 974 basic Python problems. Both measure pass@1 - the model gets one try to generate working code.
Test Environment
- Python 3.11.7
- Temperature: 0.0 (deterministic)
- 3 runs per model, averaged
- Timeout: 10s per problem
- All models used default system prompts
The Surprising Finding
DeepSeek Coder V2 at $0.14/1K tokens outperforms Codestral at $0.30/1K tokens by 6+ points on both benchmarks. The pricing doesn't reflect quality.
Also: Claude's lead on MBPP (basic problems) is smaller than on HumanEval (complex algorithms). All models are good at simple tasks. Claude pulls ahead on hard problems.
Bottom Line
If you're writing Python code that other people will read or maintain, use Claude 3.5 Sonnet. The type hints, documentation, and error handling are worth the extra cost.
If you're generating lots of code or working on personal projects, DeepSeek Coder V2 is excellent value. It's not quite Claude quality, but it's close enough for most work.
GPT-4o is the safe middle ground. Good for everything, great at nothing specific.