
Published Mar 28, 2026

Guide · Code generation

Six models, four benchmarks, one decision.

Claude Opus 4, GPT-5, Gemini 2.5 Pro, DeepSeek-V3, Qwen2.5-Coder-32B, and Codestral — benchmarked head-to-head. Real numbers, real pricing, production code.

Claude Opus 4 leads real-world engineering. GPT-5 is the best generalist. Gemini leads algorithmic reasoning. Qwen2.5-Coder is the open-source sweet spot. Cost varies one hundred-fold between them.

§ 01 · Findings

Six headlines, March 2026.

  1. Claude Opus 4 dominates real-world coding.
     79.4% on SWE-bench Verified, 3+ points ahead. If you are building a coding agent, this is the model.
  2. GPT-5 is the best generalist.
     96.3% HumanEval, strong across every benchmark, $0.015/request. The default choice for most teams.
  3. Gemini 2.5 Pro leads algorithmic reasoning.
     70.4% on LiveCodeBench with a 1M context window. Best for competitive programming and codebase analysis.
  4. Open-source is viable for code completion.
     Qwen2.5-Coder-32B hits 92.7% HumanEval on a single GPU. Fine-tune on your codebase for even better results.
  5. Cost varies 100× between models.
     From $0.001/request (Qwen self-hosted) to $0.105/request (Claude Opus 4). Most teams should use a tiered approach.
  6. SWE-bench is the benchmark that matters.
     HumanEval is saturated — top models all 90%+. Real-world engineering tasks reveal true differences.
§ 02 · Benchmarks

Six models, head to head.

Scores are pass@1 unless noted. $/request estimates a ~2K-input, 1K-output coding task.

| Model | HumanEval | HE+ | MBPP | SWE-bench V. | LiveCodeBench | Context | $/req | Type |
|---|---|---|---|---|---|---|---|---|
| Claude Opus 4 (Anthropic) | 93.7% | 89.2% | 91.4% | 79.4% | 67.8% | 200K | $0.105 | API |
| GPT-5 (OpenAI) | 96.3% | 91.8% | 93.1% | 76.2% | 68.1% | 256K | $0.015 | API |
| Gemini 2.5 Pro (Google) | 93.2% | 87.4% | 91.8% | 63.8% | 70.4% | 1M | $0.013 | API |
| DeepSeek-V3 (DeepSeek, open source) | 82.6% | 75.3% | 82.4% | 42.0% | 65.4% | 128K | $0.0016 | Open |
| Qwen2.5-Coder-32B (Alibaba, open source) | 92.7% | 87.6% | 90.2% | 33.4% | 55.2% | 128K | $0.0010 | Open |
| Codestral 25.01 (Mistral AI) | 87.3% | 82.1% | 87.6% | 28.6% | 48.3% | 256K | $0.0015 | API |
* SWE-bench Verified scores reflect agent scaffolding performance (model + tool use). Raw model capability may differ.
* LiveCodeBench scores from the latest available evaluation period (contamination-free problems only).
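
The $/req column falls straight out of the per-token prices above. A quick sanity check in Python, assuming the ~2K-input, 1K-output task described in the note:

def per_request(input_per_m: float, output_per_m: float,
                in_tokens: int = 2000, out_tokens: int = 1000) -> float:
    """Per-request cost from per-million-token prices."""
    return in_tokens / 1e6 * input_per_m + out_tokens / 1e6 * output_per_m

assert round(per_request(15.00, 75.00), 3) == 0.105  # Claude Opus 4
assert round(per_request(2.50, 10.00), 3) == 0.015   # GPT-5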
§ 03 · Deep dives

Six models, one page each.

Pricing, strengths, weaknesses, best-for.

Claude Opus 4 · Anthropic · Jan 2026 · Undisclosed · 200K ctx
SWE-bench 79.4% · Input $15/M · Output $75/M · Cached $1.50/M · Per request $0.105 · Latency 3-8s
Strengths
  • Highest SWE-bench Verified (79.4%) — best at real-world engineering
  • Superior instruction following and tool use for agents
  • 200K context handles entire codebases
  • Extended thinking mode for complex debugging
  • Excellent at multi-file refactoring and architecture
Weaknesses
  • Most expensive API option ($15/$75 per M tokens)
  • Slower generation speed (3-8s first token)
  • Overkill for simple code completion tasks
  • No open-source or self-hosted option
Best for · Agentic coding, complex refactoring, and production-grade software engineering
GPT-5 · OpenAI · Dec 2025 · Undisclosed · 256K ctx
SWE-bench 76.2% · Input $2.50/M · Output $10/M · Cached $0.25/M · Per request $0.015 · Latency 2-5s
Strengths
  • Highest HumanEval score (96.3%) — best function-level synthesis
  • 256K context window with strong long-range coherence
  • Excellent structured output / JSON mode
  • Good balance of speed and quality
  • Competitive pricing for a frontier model
Weaknesses
  • Falls behind Claude Opus 4 on SWE-bench (−3.2 points)
  • Less reliable at multi-step tool use
  • Occasional instruction-following failures on complex prompts
  • Rate limits on Tier 1-3 accounts
Best for · General-purpose code generation, function synthesis, and IDE autocomplete backends
Gemini 2.5 Pro · Google · Mar 2025 · Undisclosed (MoE) · 1M ctx
SWE-bench 63.8% · Input $1.25/M · Output $10/M · Cached $0.315/M · Per request $0.013 · Latency 1-4s
Strengths
  • Best LiveCodeBench score (70.4%) — strong algorithmic reasoning
  • 1M token context window — largest available
  • Native code execution for verification
  • Thinking mode for step-by-step solutions
  • Strong multimodal coding (diagram to code)
Weaknesses
  • Lower SWE-bench than Claude/GPT for real-world tasks
  • Inconsistent on multi-file refactoring
  • Output formatting less predictable
  • Google Cloud ecosystem lock-in for some features
Best for · Competitive programming, algorithmic challenges, and large-codebase analysis
DeepSeek-V3 · DeepSeek (Open Source) · Dec 2024 · 671B MoE (37B active) · 128K ctx
SWE-bench 42.0% · Input $0.27/M · Output $1.10/M · Cached $0.07/M · Per request $0.0016 · Latency 2-6s (API) / 5-15s (self-hosted)
Strengths
  • Open-source (MIT license) with full weights available
  • Extremely cost-effective via DeepSeek API ($0.27/$1.10 per M)
  • Strong for its price point — competitive with far larger proprietary models on coding
  • MoE architecture for efficient inference
  • Self-hostable for complete data privacy
Weaknesses
  • Significant gap to frontier models on SWE-bench (42.0%)
  • Requires 8× H100 for self-hosting at full precision
  • Weaker at complex multi-step reasoning
  • Less reliable instruction following than proprietary models
Best for · Cost-sensitive teams, privacy-first deployments, and high-volume code generation
Qwen2.5-Coder-32B · Alibaba (Open Source) · Nov 2024 · 32.5B · 128K ctx
SWE-bench 33.4% · Input Free · Output Free · Cached N/A · Per request $0.0010 · Latency 1-4s (self-hosted)
Strengths
  • Best code-specialised open-source model at its size
  • 92.7% HumanEval — competitive with frontier models
  • Runs on a single A100 or 2× A6000 (32B params)
  • Apache 2.0 license — full commercial use
  • Excellent for fine-tuning on proprietary codebases
Weaknesses
  • Weak on real-world engineering tasks (33.4% SWE-bench)
  • Limited general reasoning outside of code
  • Struggles with complex multi-file changes
  • No built-in tool use capability
Best for · Self-hosted code completion, IDE integration, and fine-tuning on internal codebases
Codestral 25.01 · Mistral AI · Jan 2025 · 25B · 256K ctx
SWE-bench 28.6% · Input $0.30/M · Output $0.90/M · Cached $0.10/M · Per request $0.0015 · Latency 0.5-2s
Strengths
  • Fastest inference speed — ideal for real-time autocomplete
  • 80+ language support including rare languages
  • 256K context window at budget pricing
  • Fill-in-the-Middle (FIM) support for code completion
  • Available via Mistral API and self-hosted
Weaknesses
  • Significantly behind frontier models on all benchmarks
  • Poor on complex software engineering tasks (28.6% SWE-bench)
  • Less reliable for multi-step reasoning
  • Weaker at code review and debugging
Best for · Low-latency autocomplete, FIM code completion, and multilingual code support
§ 04 · Method

Understanding the benchmarks.

HumanEval is saturated. SWE-bench tests real engineering. LiveCodeBench prevents contamination. Each measures a different skill.

HumanEval · github.com/openai/human-eval

164 hand-crafted Python programming problems testing function-level code synthesis from docstrings

Metric · pass@1 (% of problems solved on first attempt)
Leader · GPT-5 (96.3%)

Standard baseline for code generation ability. Widely used but increasingly saturated — top models all score 90%+.
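
For readers reproducing scores, pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal version of that formula:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c passed (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With a single sample (n=1), pass@1 reduces to the raw success rate:
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(1, 0, 1) == 0.0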

MBPP · github.com/google-research/google-research/tree/master/mbpp

Mostly Basic Python Problems — 974 entry-level programming challenges; scores use the sanitised subset (427 problems)

Metric · pass@1
Leader · GPT-5 (93.1%)

Broader than HumanEval with more edge cases. MBPP+ adds stricter test cases, revealing true reliability.

SWE-bench Verified · www.swebench.com/

500 real-world GitHub issues from 12 popular Python repositories (Django, Flask, scikit-learn, etc.)

Metric · % of issues resolved (verified by human reviewers)
Leader · Claude Opus 4 (79.4%)

Gold standard for real-world software engineering. Tests multi-file changes, debugging, and understanding existing codebases.

LiveCodeBench · livecodebench.github.io/

Continuously updated competitive programming problems from Codeforces, LeetCode, and AtCoder (post-training cutoff)

Metric · pass@1 on contamination-free problems
Leader · Gemini 2.5 Pro (70.4%)

Best for measuring algorithmic reasoning without data contamination. Problems are released after model training cutoffs.

Benchmark limitations
  • HumanEval is saturated. Top models all score 90%+. A 93% vs 96% gap is less meaningful than SWE-bench differences.
  • SWE-bench scores depend on scaffolding. The same model can score 10–20 points higher with better agent setup.
  • Data contamination is real. LiveCodeBench mitigates with post-cutoff problems; HumanEval/MBPP are potentially contaminated.
  • Your task is not the benchmark. Always validate on YOUR codebase before committing.
§ 05 · Code

Six models, six idiomatic calls.

Production patterns only — the features worth reaching for in each API.

Claude Opus 4 — Extended thinking

pip install anthropic · Best for: agentic coding, multi-file changes

import anthropic

client = anthropic.Anthropic()

def generate_code(task: str, codebase_context: str = "") -> str:
    """Generate code using Claude Opus 4 with extended thinking."""
    messages = [
        {
            "role": "user",
            "content": f"""You are a senior software engineer.

Context from the codebase:
{codebase_context}

Task: {task}

Requirements:
- Write production-quality code with error handling
- Follow existing code patterns from the context
- Include type hints and docstrings
- Add inline comments for non-obvious logic"""
        }
    ]

    # Use extended thinking for complex tasks
    response = client.messages.create(
        model="claude-opus-4-20250115",
        max_tokens=16000,
        thinking={
            "type": "enabled",
            "budget_tokens": 10000  # Let the model reason deeply
        },
        messages=messages
    )

    # Extract the text response (thinking is internal)
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""

GPT-5 — Structured outputs

pip install openai pydantic · Best for: structured analysis + generation

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CodeResponse(BaseModel):
    code: str
    language: str
    explanation: str
    complexity: str  # O(n), O(n log n), etc.
    edge_cases: list[str]

response = client.beta.chat.completions.parse(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "Generate code and analysis in the required format."},
        {"role": "user", "content": "Implement a thread-safe LRU cache with TTL support in Python"}
    ],
    response_format=CodeResponse,
)

result = response.choices[0].message.parsed
print(f"Complexity: {result.complexity}")
print(f"Edge cases: {result.edge_cases}")
print(result.code)

Gemini 2.5 Pro — 1M context

pip install google-generativeai · Best for: algorithmic reasoning, whole-codebase analysis

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def generate_code(task: str, thinking: bool = True) -> str:
    """Generate code using Gemini 2.5 Pro with thinking mode."""
    model = genai.GenerativeModel("gemini-2.5-pro-preview-03-25")

    config = genai.GenerationConfig(
        temperature=0,
        max_output_tokens=8192,
    )
    if thinking:
        config.thinking_config = {"thinking_budget": 8000}

    response = model.generate_content(
        f"""Solve this programming problem step by step.

{task}

Provide:
1. Your approach and reasoning
2. Clean, optimized code
3. Time and space complexity analysis
4. Test cases covering edge cases""",
        generation_config=config
    )
    return response.text

# Leverage 1M context for whole-codebase analysis
def analyze_codebase(files: dict[str, str], question: str) -> str:
    model = genai.GenerativeModel("gemini-2.5-pro-preview-03-25")
    context = "\n\n".join(f"--- {p} ---\n{c}" for p, c in files.items())
    response = model.generate_content(
        f"Codebase:\n\n{context}\n\nQuestion: {question}",
        generation_config=genai.GenerationConfig(temperature=0)
    )
    return response.text

DeepSeek-V3 — API + self-hosted

pip install openai · Best for: budget, privacy-first deployments

from openai import OpenAI

# DeepSeek uses an OpenAI-compatible API
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def generate_code(task: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-chat",  # Points to DeepSeek-V3
        messages=[
            {"role": "system", "content": "You are an expert programmer."},
            {"role": "user", "content": task}
        ],
        temperature=0,
        max_tokens=4096,
    )
    return response.choices[0].message.content

# Self-hosted via vLLM for full privacy:
#   vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8
def generate_code_self_hosted(task: str) -> str:
    local = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")
    return local.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=[{"role": "user", "content": task}],
        temperature=0, max_tokens=4096
    ).choices[0].message.content

Qwen2.5-Coder-32B — self-hosted

pip install transformers torch accelerate · Best for: self-hosted autocomplete, fine-tuning

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

def generate_code(task: str) -> str:
    messages = [
        {"role": "system", "content": "You are an expert programmer."},
        {"role": "user", "content": task},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False, num_beams=1)
    return tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# For production: serve with vLLM
# vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --tensor-parallel-size 2
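
Qwen2.5-Coder also ships documented fill-in-the-middle tokens, so the same weights can back an autocomplete endpoint. A minimal sketch reusing the tokenizer and model loaded above (the base, non-Instruct variant is generally recommended for FIM; the snippet to complete is illustrative):

fim_prompt = (
    "<|fim_prefix|>def add(a: int, b: int) -> int:\n    "
    "<|fim_suffix|>\n    return result<|fim_middle|>"
)
fim_inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    fim_out = model.generate(**fim_inputs, max_new_tokens=64, do_sample=False)
middle = tokenizer.decode(fim_out[0][fim_inputs.input_ids.shape[1]:], skip_special_tokens=True)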

Codestral 25.01 — Fill-in-the-Middle

pip install mistralai · Best for: real-time autocomplete, IDE integration

from mistralai import Mistral

client = Mistral(api_key="YOUR_MISTRAL_KEY")

# Fill-in-the-Middle — Codestral's killer feature
def code_completion(prefix: str, suffix: str) -> str:
    response = client.fim.complete(
        model="codestral-latest",
        prompt=prefix,
        suffix=suffix,
        temperature=0,
        max_tokens=512,
    )
    return response.choices[0].message.content

prefix = '''def binary_search(arr: list[int], target: int) -> int:
    """Find target in sorted array. Returns index or -1."""
    left, right = 0, len(arr) - 1
    while left <= right:
'''
suffix = '''
    return -1

assert binary_search([1, 3, 5, 7, 9], 5) == 2
assert binary_search([1, 3, 5, 7, 9], 4) == -1
'''
middle = code_completion(prefix, suffix)
§ 06 · Pricing

Cost per million, cost per month.

Monthly spend at 100 / 1K / 10K requests per day. Prices current March 2026.

| Model | Input/1M | Output/1M | Cached/1M | 100/day | 1K/day | 10K/day |
|---|---|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | $1.50 | $315 | $3,150 | $31,500 |
| GPT-5 | $2.50 | $10.00 | $0.25 | $45 | $450 | $4,500 |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.32 | $41 | $405 | $4,050 |
| DeepSeek-V3 | $0.27 | $1.10 | $0.07 | $5 | $48 | $480 |
| Qwen2.5-Coder-32B | Free* | Free* | N/A | $3 | $30 | $300 |
| Codestral 25.01 | $0.30 | $0.90 | $0.10 | $5 | $45 | $450 |
* Open-source models are free to download. Self-hosting costs depend on infrastructure: single A100 ~$1.50–2.00/hr, 8× H100 ~$25–30/hr. Per-request cost assumes amortised GPU compute at typical utilisation.
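
The quoted self-hosted figure implies a throughput assumption worth making explicit. Working backwards from the $0.0010/request estimate for Qwen2.5-Coder-32B (the GPU price is the midpoint of the range above; the throughput is inferred, not measured):

a100_hourly = 1.75                       # midpoint of the $1.50-2.00/hr range above
per_request = 0.0010                     # quoted amortised cost
implied_rph = a100_hourly / per_request  # = 1,750 requests/hour (~0.5 req/s sustained)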
Tiered model routing

Route 80–90% of simple requests to a cheap model, escalate complex tasks to a frontier model.

Simple completion · Codestral · $0.0015
Standard generation · GPT-5 · $0.015
Complex engineering · Opus 4 · $0.105
Blended (80/15/5) ≈ $0.0087/req
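
In code, such a router is often just a cheap heuristic in front of three clients. A minimal sketch (the keyword rules, length threshold, and model IDs below are illustrative, not a recommendation):

def route(task: str) -> tuple[str, float]:
    """Map a task to (model, estimated $/request) using crude heuristics."""
    text = task.lower()
    if any(k in text for k in ("refactor", "multi-file", "debug", "architecture")):
        return ("claude-opus-4-20250115", 0.105)  # escalate real engineering work
    if len(task) < 200:
        return ("codestral-latest", 0.0015)       # autocomplete-style requests
    return ("gpt-5", 0.015)                       # sensible default tier

# The blended figure above is the weighted sum of the three tiers:
blended = 0.80 * 0.0015 + 0.15 * 0.015 + 0.05 * 0.105  # = $0.0087/request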
Prompt caching

Reuse system prompts and codebase context across requests. Saves 80–90% on input tokens.

Claude Opus 4 · $15 → $1.50/M cached
GPT-5 · $2.50 → $0.25/M cached
Gemini 2.5 Pro · $1.25 → $0.315/M cached
Typical savings · 40–60% of total cost
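
On the Anthropic API this is a one-line change: mark the stable prefix (system prompt plus codebase context) as cacheable and keep only the task in the user turn. A sketch, reusing the model ID from § 05:

import anthropic

client = anthropic.Anthropic()

def cached_generate(task: str, codebase_context: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-20250115",
        max_tokens=4096,
        system=[
            {"type": "text", "text": "You are a senior software engineer."},
            {"type": "text", "text": codebase_context,
             "cache_control": {"type": "ephemeral"}},  # billed at the cached rate after the first request
        ],
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text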
§ 07 · Live

Benchmark rankings, auto-updated.

From the CodeSOTA database. These numbers refresh every deploy.

Leaderboard · pass@1
| # | Model | Score |
|---|---|---|
| 1 | o4-mini | 97.30 |
| 2 | o3-mini | 96.30 |
| 3 | Claude Opus 4.6 | 96.30 |
| 4 | GPT-5 | 95.10 |
| 5 | o3 | 94.80 |
§ 08 · Decide

Pick the right model for your situation.

Startup / Solo Developer · <500 requests/day · <$100/month
Requirements: Fast iteration · Multi-language · IDE integration
Recommended: GPT-5
Best all-around code generation at $0.015/request. Strong in every language, excellent IDE tool support via Copilot.
Alternative · Codestral for real-time autocomplete at $0.0015/request
AI-Powered Coding Agent · Variable (agent-driven) · $200–$2,000/month
Requirements: Instruction following · Tool use · Multi-file edits
Recommended: Claude Opus 4
79.4% SWE-bench Verified — best at understanding real codebases, tool use, and multi-step engineering tasks.
Alternative · GPT-5 for cost savings with moderate agent capability (76.2% SWE-bench)
Enterprise Engineering Team · 1,000–10,000 requests/day · $500–$5,000/month
Requirements: Reliability · Security · Audit trail
Recommended: GPT-5 + Claude Opus 4 tiered
Use GPT-5 for routine generation ($0.015/req), escalate complex tasks to Claude Opus 4 ($0.105/req). Typical 90/10 split ≈ $0.024/req average.
Alternative · Gemini 2.5 Pro for teams already on Google Cloud
Privacy-First / Air-Gapped · Any volume · Infrastructure costs
Requirements: No external API calls · Full data control · On-premise
Recommended: Qwen2.5-Coder-32B
92.7% HumanEval on a single GPU. Apache 2.0 license. Fine-tune on your codebase for even better results.
Alternative · DeepSeek-V3 if you have 8× H100s and need stronger general reasoning
Competitive Programming / Education · Moderate · <$200/month
Requirements: Algorithmic reasoning · Step-by-step explanations · Multiple approaches
Recommended: Gemini 2.5 Pro
Highest LiveCodeBench (70.4%). Thinking mode shows reasoning steps. 1M context handles complex problem statements.
Alternative · GPT-5 for broader problem coverage
Quick flowchart
  Q1. Do you need data to stay on your infrastructure?
      Yes → Qwen2.5-Coder-32B (single GPU) or DeepSeek-V3 (8× GPU). No → Q2.
  Q2. Are you building an autonomous coding agent?
      Yes → Claude Opus 4 — best tool use and SWE-bench. No → Q3.
  Q3. Is latency critical (real-time autocomplete)?
      Yes → Codestral — fastest inference, FIM support. No → Q4.
  Q4. Do you need to analyse very large codebases (>500K tokens)?
      Yes → Gemini 2.5 Pro — 1M context. No → Q5.
  Q5. Default for everything else.
      GPT-5 — best balance of quality, speed, and cost at $0.015/req.
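
The same flowchart as a function, if you want to encode the default in tooling (inputs are self-assessed; model names as used throughout this guide):

def pick_model(on_prem: bool, agentic: bool, latency_critical: bool,
               huge_context: bool, single_gpu: bool = True) -> str:
    """Q1-Q5 from the flowchart above, checked in order."""
    if on_prem:            # Q1: data must stay on your infrastructure
        return "Qwen2.5-Coder-32B" if single_gpu else "DeepSeek-V3"
    if agentic:            # Q2: autonomous coding agent
        return "Claude Opus 4"
    if latency_critical:   # Q3: real-time autocomplete
        return "Codestral 25.01"
    if huge_context:       # Q4: codebases beyond ~500K tokens
        return "Gemini 2.5 Pro"
    return "GPT-5"         # Q5: default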
§ 09 · FAQ

Five questions we get the most.

Should I use a code-specific model or a general-purpose LLM?
For most tasks, general-purpose frontier models (Claude Opus 4, GPT-5, Gemini 2.5 Pro) outperform code-specific ones. Code-specific models like Qwen2.5-Coder and Codestral excel at completion/autocomplete and are much cheaper to self-host, but lack the reasoning depth for complex engineering. Use code-specific for IDE autocomplete and fine-tuning, frontier models for agent-driven development.
How do I evaluate models on my specific codebase?
Create a test suite of 20–50 representative tasks: bug fixes, feature implementations, refactoring. Run each through your candidate models and have engineers blind-rate the outputs. One to two days of work, far more reliable signal than public benchmarks. Track acceptance rate, edit distance, and time-to-usable-code.
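A minimal harness for the blind-rating step might look like this (call_model stands in for your provider clients, and the task list comes from your own backlog; both are placeholders):

import random

def blind_outputs(tasks: list[str], models: list[str], call_model) -> list[dict]:
    """Collect outputs with model identity hidden so raters cannot play favourites."""
    rows = []
    for task in tasks:
        outputs = [(m, call_model(m, task)) for m in models]
        random.shuffle(outputs)  # anonymise before review
        rows.extend(
            {"task": task, "slot": i, "model": m, "output": out}
            for i, (m, out) in enumerate(outputs)
        )
    return rows  # rate blind, then join ratings back on (task, slot)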
Can I fine-tune these models for better performance on my code?
Only open-source models support fine-tuning. Qwen2.5-Coder-32B is the most practical — it runs on a single A100 and can be fine-tuned with LoRA at 4-bit. Fine-tuning on 10–50K examples typically improves task-specific performance by 10–30% while maintaining general capability.
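For Qwen2.5-Coder-32B, the 4-bit LoRA setup mentioned above is a standard QLoRA recipe. A sketch using peft and bitsandbytes (the rank, target modules, and dtype choices are illustrative starting points, not tuned values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    quantization_config=bnb, device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters train; the 4-bit base stays frozen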
What about reasoning models like o3 and DeepSeek-R1?
Reasoning models use chain-of-thought internally and excel at complex algorithmic problems. But they are slower (10–60s per response) and more expensive. For most coding tasks, standard models are faster and cheaper. Use reasoning models for competitive programming, complex debugging, and algorithmic design.
How reliable are these benchmark scores?
Take them with appropriate skepticism. HumanEval and MBPP may be contaminated in training data. SWE-bench Verified is the most trustworthy — human-verified, real-world — but scores depend on agent scaffolding. LiveCodeBench is contamination-free by design. Always validate with your own evaluation before committing.
§ 10 · Methodology

How we read these numbers.

Benchmark scores are collected from official model releases, published papers, and verified third-party evaluations. Where multiple scores exist, we use the most recent evaluation with standard settings (pass@1, temperature 0, no majority voting). SWE-bench Verified scores reflect the best known agent scaffolding for each model.

Pricing is based on official API pricing pages as of March 2026. Per-request costs assume a typical coding task: ~2,000 input tokens and ~1,000 output tokens. Self-hosted per-request costs assume amortised GPU compute at 60% utilisation on cloud GPU instances.

Model parameters, context windows, and release dates are sourced from official documentation. This guide is updated monthly. Last update: March 28, 2026.
