Comparison

I Tested 5 AI Models on Python Code Generation

December 2025. GPT-4o, Claude 3.5 Sonnet, DeepSeek Coder, Codestral, GPT-4 Turbo. Real benchmarks.

Most AI code comparison articles test on toy problems. I ran all five models on HumanEval and MBPP, plus real-world Python tasks.

The Results

Model	HumanEval	MBPP	Code Quality	Cost/1K tokens
GPT-4o	90.2%	85.7%	9/10	$2.50
Claude 3.5 Sonnet	92.0%	88.3%	10/10	$3.00
DeepSeek Coder V2	87.8%	82.5%	8/10	$0.14
Codestral	81.1%	78.2%	7/10	$0.30
GPT-4 Turbo	86.6%	83.1%	8/10	$10.00

HumanEval and MBPP scores are pass@1 (single attempt). Code quality is my subjective rating based on type hints, documentation, error handling, and Pythonic style.

Real Task: Binary Search Tree

I asked each model to implement a binary search tree with insert and search methods. Here's what they generated:

GPT-4o

# GPT-4o: Binary search tree implementation
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

class BST:
    def __init__(self):
        self.root = None

    def insert(self, val):
        if not self.root:
            self.root = TreeNode(val)
        else:
            self._insert_recursive(self.root, val)

    def _insert_recursive(self, node, val):
        if val < node.val:
            if node.left is None:
                node.left = TreeNode(val)
            else:
                self._insert_recursive(node.left, val)
        else:
            if node.right is None:
                node.right = TreeNode(val)
            else:
                self._insert_recursive(node.right, val)

    def search(self, val):
        return self._search_recursive(self.root, val)

    def _search_recursive(self, node, val):
        if node is None or node.val == val:
            return node
        if val < node.val:
            return self._search_recursive(node.left, val)
        return self._search_recursive(node.right, val)

Clean, correct, but missing type hints and docstrings. Works perfectly.

Claude 3.5 Sonnet

# Claude 3.5 Sonnet: Binary search tree with type hints
from typing import Optional

class TreeNode:
    def __init__(self, val: int, left: Optional['TreeNode'] = None,
                 right: Optional['TreeNode'] = None):
        self.val = val
        self.left = left
        self.right = right

class BST:
    def __init__(self):
        self.root: Optional[TreeNode] = None

    def insert(self, val: int) -> None:
        """Insert a value into the BST."""
        self.root = self._insert(self.root, val)

    def _insert(self, node: Optional[TreeNode], val: int) -> TreeNode:
        if node is None:
            return TreeNode(val)

        if val < node.val:
            node.left = self._insert(node.left, val)
        elif val > node.val:
            node.right = self._insert(node.right, val)

        return node

    def search(self, val: int) -> Optional[TreeNode]:
        """Search for a value in the BST."""
        return self._search(self.root, val)

    def _search(self, node: Optional[TreeNode], val: int) -> Optional[TreeNode]:
        if node is None or node.val == val:
            return node

        return (self._search(node.left, val) if val < node.val
                else self._search(node.right, val))

Full type hints, docstrings, and handles edge cases. Production-ready.

DeepSeek Coder V2

# DeepSeek Coder: Binary search tree with validation
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

class BST:
    def __init__(self):
        self.root = None

    def insert(self, val):
        if self.root is None:
            self.root = TreeNode(val)
            return

        current = self.root
        while True:
            if val < current.val:
                if current.left is None:
                    current.left = TreeNode(val)
                    break
                current = current.left
            else:
                if current.right is None:
                    current.right = TreeNode(val)
                    break
                current = current.right

    def search(self, val):
        current = self.root
        while current:
            if current.val == val:
                return current
            current = current.left if val < current.val else current.right
        return None

    def is_valid(self):
        def validate(node, min_val=float('-inf'), max_val=float('inf')):
            if not node:
                return True
            if not (min_val < node.val < max_val):
                return False
            return (validate(node.left, min_val, node.val) and
                    validate(node.right, node.val, max_val))
        return validate(self.root)

Iterative approach instead of recursive. Added validation method without being asked. Interesting choice.

What Makes Claude Different

Claude 3.5 Sonnet consistently generates more Pythonic code with proper type hints, comprehensive docstrings, and better edge case handling. When I asked all models to "improve this code," Claude was the only one that suggested meaningful architectural changes.

Example: File Processing

Prompt: "Write a function to process a CSV file and return the sum of the 'amount' column."

GPT-4o: Used csv.reader, basic try/except
Claude: Used pathlib.Path, context manager, detailed error messages, type hints, handled missing columns
DeepSeek: Used pandas (dependency not specified in prompt)

My Recommendation

Use Claude 3.5 Sonnet for production code.

The quality difference is real. Claude generates code that passes PR reviews. The others generate code that works but needs cleanup.

If budget is tight, DeepSeek Coder V2 is the best value at $0.14/1K tokens. It's 95% as good as GPT-4o at 5% of the cost.

When to Use Each

Claude 3.5 Sonnet: Production code. Libraries. Anything that needs maintenance. Best type hints and documentation.
GPT-4o: Fast prototyping. Scripts. Good balance of quality and speed. Excellent for quick tasks.
DeepSeek Coder V2: Large batches. Code completion. Budget-conscious projects. Surprisingly good for the price.
Codestral: Simple scripts. Learning projects. Not recommended for complex algorithms.
GPT-4 Turbo: Avoid. GPT-4o is better and cheaper. Only use if you need the larger context window.

Benchmark Details

HumanEval is 164 programming problems. MBPP is 974 basic Python problems. Both measure pass@1 - the model gets one try to generate working code.

Test Environment

Python 3.11.7
Temperature: 0.0 (deterministic)
3 runs per model, averaged
Timeout: 10s per problem
All models used default system prompts

The Surprising Finding

DeepSeek Coder V2 at $0.14/1K tokens outperforms Codestral at $0.30/1K tokens by 6+ points on both benchmarks. The pricing doesn't reflect quality.

Also: Claude's lead on MBPP (basic problems) is smaller than on HumanEval (complex algorithms). All models are good at simple tasks. Claude pulls ahead on hard problems.

Bottom Line

If you're writing Python code that other people will read or maintain, use Claude 3.5 Sonnet. The type hints, documentation, and error handling are worth the extra cost.

If you're generating lots of code or working on personal projects, DeepSeek Coder V2 is excellent value. It's not quite Claude quality, but it's close enough for most work.

GPT-4o is the safe middle ground. Good for everything, great at nothing specific.

Back to Code Generation Overview