
Code · two markets buyers conflate

Code Generation & AI Coding Assistants.

On one side, IDE-integrated developer products (Copilot, Cursor, Cody, Tabnine, Codeium, Augment) priced per seat and optimised for inline completion, chat, and multi-file edits. On the other, raw code-capable LLMs (Claude Opus 4.7, GPT-5, Gemini 3 Pro, DeepSeek, Qwen3-Coder) priced per token — the intelligence layer that powers the first category, and that agentic CLIs like Aider and Codex call directly.

Below: 13 products and models, compared on the axes that actually decide the purchase.


§ 01 · The matrix

13 coding tools & LLMs, side by side.

IDE products (per seat / month) · frontier LLM APIs (per 1M tokens) · open-weights LLMs. Pricing units differ by tier — read the Cost column accordingly.

Provider / Product	Tier	License	Cost	IDE integrations	Context	Agent	SWE-bench Verified
GitHub Copilot: Copilot · Copilot Chat · Copilot Workspace	IDE	Proprietary product	$10–19 / seat / mo	VSCode · JetBrains · Vim/Neovim · Visual Studio · Xcode	Repo-aware (workspace indexing)	✓	~55% (Workspace agent)
Cursor: Cursor Editor · Composer · Agent	IDE	Proprietary product	$20–40 / seat / mo	Cursor (VSCode fork) only	Repo-wide embedding index · multi-file edits	✓	~50–60% (Agent mode)
Sourcegraph Cody: Cody · Cody Enterprise	IDE	Proprietary product	$9–19 / seat / mo	VSCode · JetBrains · Neovim · Web	Graph-based code intelligence across the monorepo	✓	Not self-reported
Tabnine: Tabnine Pro · Enterprise	IDE	Proprietary product	$9–39 / seat / mo	VSCode · JetBrains · Vim/Neovim · Eclipse · many more	Local + repo-aware	✓	Not self-reported
Codeium: Codeium · Windsurf Editor	IDE	Proprietary product	Free · $15 / seat / mo	VSCode · JetBrains · Vim · 40+ editors · Windsurf (own IDE)	Repo-wide	✓	Not self-reported
Augment Code: Augment Code · Remote Agents	IDE	Proprietary product	$30 / seat / mo	VSCode · JetBrains · Vim · CLI · Remote Agents	200K+ token context engine across large monorepos	✓	~65% (Remote Agent, self-reported)
Continue: Continue.dev	IDE	Open source	Free · BYO model	VSCode · JetBrains	Configurable: local, repo, or custom retrievers	✓	Harness-dependent
Aider: Aider CLI	IDE	Open source	Free · BYO model	CLI (works alongside any editor)	Repo map + git-aware edits	✓	Aider Polyglot leaderboard
Anthropic: Claude Opus 4.7 · Sonnet 4.6	Frontier LLM	Proprietary API	$3 / $15 – $15 / $75 per 1M	Via Copilot · Cursor · Cody · Continue · Aider · Claude Code CLI	200K–1M tokens	✓	SOTA on SWE-bench Verified (~70%+ harnessed)
OpenAI: GPT-5 · GPT-5 Codex · o-series	Frontier LLM	Proprietary API	~$10 / $30 per 1M (GPT-5 typical)	Via Copilot · Cursor · Cody · Continue · Aider · Codex CLI	200K–400K tokens typical	✓	~65–70% (Codex-style harnesses)
Google: Gemini 3 Pro · Gemini 3 Ultra	Frontier LLM	Proprietary API	$1.25 / $5 per 1M (Pro)	Via Copilot · Cursor · Cody · Continue · Aider · Gemini CLI	1M–2M tokens	✓	Top of LiveCodeBench (~91.7%)
DeepSeek: DeepSeek V3.2 · V3.1 · Coder V2	Open LLM	Open weights	$0.27 / $1.10 per 1M (hosted)	Via Continue · Aider · any OpenAI-compatible client	128K tokens	✓	~55–60% (best open result)
Alibaba / Qwen: Qwen3-Coder · Qwen3-Coder-Plus	Open LLM	Open weights	Self-host · ~$1–3 per 1M (hosted)	Via Continue · Aider · vLLM / SGLang / Ollama	256K–1M tokens (YaRN)	✓	~50–55% (best Apache-licensed result)

Pricing as of 2026-04. IDE products are priced per seat / month; LLMs are priced per 1M tokens (input / output). SWE-bench Verified scores are harness-dependent: the same model can swing 20 points across harnesses, which is why we report approximate ranges and note the harness where relevant.
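Because the Cost column mixes per-seat and per-token pricing, one way to put the two on the same footing is to convert an assumed monthly token volume into dollars. The sketch below does that with the hosted DeepSeek and Gemini 3 Pro prices from the matrix. The usage figures (40M input and 4M output tokens per developer per month) are illustrative assumptions, not measurements, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per 1M tokens, as listed in the matrix (input / output)."""
    input_per_m: float
    output_per_m: float


def monthly_api_cost(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Monthly spend in USD for a given token volume."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


# Prices copied from the matrix above.
DEEPSEEK_HOSTED = TokenPrice(0.27, 1.10)
GEMINI_3_PRO = TokenPrice(1.25, 5.00)

# Assumed usage for one developer per month (illustrative only).
USAGE = {"input_tokens": 40_000_000, "output_tokens": 4_000_000}

for name, price in [("DeepSeek (hosted)", DEEPSEEK_HOSTED), ("Gemini 3 Pro", GEMINI_3_PRO)]:
    print(f"{name}: ${monthly_api_cost(price, **USAGE):.2f} / month")
# DeepSeek (hosted): $15.20 / month
# Gemini 3 Pro: $70.00 / month
```

Under these assumed volumes, the result can be read directly against the $10–40 per seat range of the IDE products; a heavier or lighter agent workload shifts the comparison in either direction.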

§ 02 · Decision shortcuts

Which should I use?

The buyer question for an IDE tool (“Cursor or Copilot?”) is about ergonomics and multi-file context. The buyer question for a raw LLM (“Claude Opus 4.7 or GPT-5?”) is about SWE-bench numbers and cost per task. They’re not the same decision.

Solo developer · best bang-for-buck

GitHub Copilot · Cursor Pro

$10–20/mo. Copilot if you live in VSCode or JetBrains; Cursor if you're willing to switch editors for a better multi-file agent loop.

Autonomous / agentic coding (CLI)

Claude Code · Aider + Claude Opus 4.7 · Codex CLI

Terminal-driven loops that read errors, run tests, and commit their own edits. Claude Opus 4.7 is the default model; GPT-5 and Gemini 3 Pro are credible alternatives.

Cheapest credible tokens

Gemini 3 Pro · DeepSeek V3.2

Gemini 3 Pro at $1.25/$5 per 1M is the cheapest frontier model. DeepSeek V3.2 at $0.27/$1.10 delivers roughly 80% of the quality at about one-fifth of Gemini's price.
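The price ratio between the two can be checked directly from the listed figures. This is a small arithmetic sketch using only the prices quoted above; it says nothing about the quality side of the claim.

```python
# (input, output) prices in USD per 1M tokens, copied from the text above.
GEMINI_3_PRO_PRICE = (1.25, 5.00)
DEEPSEEK_V32_PRICE = (0.27, 1.10)


def price_ratio(cheaper: tuple[float, float], pricier: tuple[float, float]) -> tuple[float, float]:
    """Return (input ratio, output ratio) of the cheaper model's price to the pricier one."""
    return cheaper[0] / pricier[0], cheaper[1] / pricier[1]


input_ratio, output_ratio = price_ratio(DEEPSEEK_V32_PRICE, GEMINI_3_PRO_PRICE)
print(f"input: {input_ratio:.1%} of Gemini's price, output: {output_ratio:.1%}")
# input: 21.6% of Gemini's price, output: 22.0%
```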

Large monorepo · repo-scale context

Sourcegraph Cody · Augment Code · Gemini 3 Pro

Cody indexes the graph; Augment's context engine is tuned for 200K+ token codebases; Gemini's 2M window lets you just paste the whole repo.

Air-gapped / on-prem / regulated

Tabnine Enterprise · Continue + self-hosted Qwen3-Coder

No code leaves your VPC. Tabnine is the productised path; Continue + Qwen3-Coder is the build-it-yourself path on Apache-licensed weights.

Open-source-only stack

Continue · Aider · Qwen3-Coder · DeepSeek V3.2

Apache / OSI-approved from editor to weights. Pair Continue or Aider with Qwen3-Coder locally via vLLM for a fully reproducible setup.

Raw LLM for an agent framework

Claude Opus 4.7 · GPT-5 · Gemini 3 Pro

Building your own agent loop? Opus 4.7 currently tops SWE-bench Verified with the right harness; Gemini 3 Pro tops LiveCodeBench on fresh problems.

§ 03 · Methodology

What to actually test (vendor demos lie).

HumanEval-style single-function completion is a solved problem: every model on this page clears it. The interesting failure modes are the ones that surface when you point a tool at a real codebase. Build your own 6-task eval covering the six areas listed below.

Run the same tasks through 2–3 candidates blind and score on finished PRs, not tokens generated. A tool that writes confident code that doesn’t compile is worse than one that asks a clarifying question.

Multi-file refactor

“Rename this interface across 5+ files, update callers, update tests.” Single-function completion is solved; multi-file edits are where Cursor / Augment / Claude Code pull ahead of Copilot-style autocomplete.

Long-context recall

Does it remember the convention you established 400 lines earlier? Or does it reinvent a parallel pattern? This is the difference between a repo-aware tool and a glorified autocompleter.

Library version awareness

Ask it to write code against a library that had a major API change in the last 6 months. Most models hallucinate the old API — the good ones either ask or use your lockfile.

Agentic execution loop

Can it run the tests, read the failures, edit the code, and re-run? This is the qualitative gap between ‘fancy autocomplete’ and ‘junior engineer you can dispatch a ticket to.’

Code review (not generation)

Paste a PR with a subtle bug. Can the tool catch it? Writing new code is easier than finding bugs in existing code — and review is where coding assistants earn their keep in teams.

Domain-specific languages

SQL, Terraform, GLSL, Solidity, Zig. Benchmarks are Python + TypeScript heavy. If your production code is 60% DSL, that's where the real evaluation happens.
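A blind comparison across the six task types above can be organised with very little tooling. The following is a minimal sketch, assuming reviewers see only anonymised labels and record a yes/no verdict per (task, tool) pair for "would this PR be merged?". The function names, label scheme, and candidate names are hypothetical.

```python
import random

TASKS = [
    "multi-file refactor",
    "long-context recall",
    "library version awareness",
    "agentic execution loop",
    "code review",
    "domain-specific languages",
]


def blind_assignments(tasks, candidates, seed=0):
    """Map anonymised labels to candidates and return review items in shuffled order."""
    rng = random.Random(seed)
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    # Reviewers only ever see "tool-A", "tool-B", ... and never the real name.
    labels = {f"tool-{chr(ord('A') + i)}": name for i, name in enumerate(shuffled)}
    items = [(task, label) for task in tasks for label in labels]
    rng.shuffle(items)
    return labels, items


def score(labels, verdicts):
    """Count merge-ready PRs per candidate.

    verdicts maps (task, blind_label) -> True if the finished PR was acceptable.
    """
    totals = {name: 0 for name in labels.values()}
    for (_task, label), merged in verdicts.items():
        totals[labels[label]] += int(merged)
    return totals


labels, items = blind_assignments(TASKS, ["candidate-1", "candidate-2", "candidate-3"])
print(len(items), "review items")  # 6 tasks x 3 candidates
```

Keeping the label map away from reviewers until scoring is done is the whole point; the score is deliberately a count of finished PRs rather than anything token-based.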

§ 04 · Metrics

Why HumanEval / MBPP scores stopped being meaningful.

HumanEval (2021) and MBPP (2021) are the original code-gen benchmarks — 164 and 974 short Python problems respectively. They’re both saturated and both contamination-prone: the problems are on the public internet, which means every model trained after ~2022 has seen them. Reporting 95% on HumanEval tells you the model was trained; it doesn’t tell you the model is good.

A 2-point delta on HumanEval is entirely noise. Worse, it’s often training-data leakage dressed as capability.
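The noise claim can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses a normal approximation to the binomial, which is crude this close to 100% and treats the 164 problems as independent samples, so it should be read as an order-of-magnitude estimate rather than a rigorous interval.

```python
import math


def pass_rate_margin(pass_rate: float, n_problems: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a benchmark pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_problems)


# HumanEval has 164 problems; assume a model scoring 95%.
margin = pass_rate_margin(0.95, 164)
print(f"+/- {margin * 100:.1f} points")
# +/- 3.3 points
```

With a sampling margin of roughly three points at this benchmark size, a 2-point gap between two models sits inside the uncertainty of either score.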

The evals that still discriminate in 2026 are SWE-bench Verified (real GitHub issues, agentic), LiveCodeBench (time-stamped competition problems, contamination-resistant by construction), Aider Polyglot (6 languages, edit-based), and BigCodeBench (function-level with real library calls).
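The idea behind time-stamped contamination resistance is simple: only score a model on problems published after its training cutoff. The sketch below illustrates that filter in general terms. It is not LiveCodeBench's actual implementation, and the problem IDs, dates, and cutoff are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem set and cutoff.
problems = [
    Problem("contest-001", date(2025, 3, 1)),
    Problem("contest-002", date(2025, 9, 15)),
    Problem("contest-003", date(2026, 1, 20)),
]
fresh = uncontaminated(problems, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in fresh])
# ['contest-002', 'contest-003']
```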

Even these have a harness problem — the same base model can post wildly different SWE-bench numbers depending on the scaffold. Agent harness > model: a great model inside a bad harness loses to a worse model with a good one.
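To make "harness" concrete: at its core it is a loop that runs the tests, feeds the failure output back, applies an edit, and retries. The following is a minimal sketch of that loop, assuming a caller-supplied `apply_edit` callback (hypothetical) that would ask a model for a patch and write it to disk. Real harnesses add sandboxing, patch validation, and context management on top.

```python
import subprocess
from typing import Callable, Sequence


def run_tests(cmd: Sequence[str], cwd: str = ".") -> tuple[bool, str]:
    """Run the test command and return (passed, combined stdout and stderr)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def repair_loop(
    test_cmd: Sequence[str],
    apply_edit: Callable[[str], None],
    max_attempts: int = 3,
    cwd: str = ".",
) -> bool:
    """Run tests, pass failures to apply_edit, and retry until green or out of attempts."""
    for _ in range(max_attempts):
        passed, output = run_tests(test_cmd, cwd)
        if passed:
            return True
        # Hypothetical hook: send the failure output to a model and apply its patch.
        apply_edit(output)
    # Final check after the last edit.
    passed, _ = run_tests(test_cmd, cwd)
    return passed
```

A call might look like `repair_loop(["pytest", "-q"], apply_edit=my_model_patch)`, where `my_model_patch` is whatever function wraps your model of choice. The test-execution and feedback steps are exactly the pieces a bad scaffold omits.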

§ 05 · Reference benchmarks

The boards that matter.

Six datasets that show up on every coding leaderboard. The first two are saturated legacy; the next three are the ones you should actually read when comparing models in 2026; the last is the emerging function-level standard.

HumanEval

164 problems · hand-written Python · 2021

The pioneer. OpenAI's 2021 release that defined the category. Every model on this page clears 90%+. Saturated; contamination-prone. Reported for historical comparability only.

MBPP

974 basic Python problems · 2021

“Mostly Basic Python Programming.” Pairs naturally with HumanEval as the entry-level benchmark suite. Same saturation story — useful only for checking that a model can code at all.

LiveCodeBench

Rolling · competition-style · time-stamped · 2024

Contest problems from LeetCode, AtCoder, and Codeforces, time-stamped to filter out contamination. Updates continuously. Gemini 3 Pro leads at ~91.7% on recent slices; still the cleanest signal on raw algorithmic ability.

SWE-bench Verified

500 real GitHub issues · Python · 2024

Human-verified subset of the original SWE-bench. Each task is a real bug-fix PR from a popular OSS project. Requires editing multiple files, running tests, and iterating — the reference agentic benchmark. Harness-dependent.

Aider Polyglot

225 problems · 6 languages · 2024

Edit-based eval across Python, JavaScript, Go, Rust, C++, Java. Run through Aider's CLI harness (the same harness you'd use in production), which makes the numbers honest and reproducible.

BigCodeBench

1,140 problems · real library calls · 2024

Function-level generation that forces the model to use real libraries (requests, numpy, pandas, etc). Discriminates between models that memorised docs and models that can actually compose APIs.

§ 06 · Practical tips

Five rules for picking a coding stack in 2026.

Stop looking at HumanEval. It’s contamination-prone and saturated — every frontier model clears 90%+ and the ranking tells you more about training data than capability. Use LiveCodeBench (for algorithmic problems) or SWE-bench Verified (for agentic, real-codebase work) instead.

IDE tools and raw APIs are different purchases. IDE products matter for ergonomics — completion latency, keybindings, diff UX, multi-file context in the editor. Raw LLM APIs matter for autonomous workflows — an agent dispatched to a backlog ticket doesn’t care how pretty the diff view is. Don’t pick one and shoehorn the other.

Open-weights have closed most of the gap. Qwen3-Coder and DeepSeek V3.2 land ~80% of frontier quality at roughly 1/30th the cost per token. If your workload is high-volume or your procurement team rejects anything that can’t run on-prem, this is the pragmatic path — especially paired with an open agent harness like Continue or Aider.

Agent harness beats model. A great model inside a bad scaffold (no test execution, no error feedback loop, no retry on failure) loses to a weaker model with a well-engineered harness. SWE-bench numbers are mostly a harness story — Anthropic, Aider, and Cognition all post different numbers for the same underlying Claude. If you’re picking a tool, the harness matters more than the model brand.

Cache your prompts on long codebases. Anthropic and OpenAI both offer prompt caching with a 50–90% discount on cached input. If you’re running a coding agent that re-sends the same system prompt + repo context on every turn, caching is the highest-leverage cost lever available, easily worth 50–80% off the monthly bill.
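The size of the caching effect is easy to estimate for an agent that re-sends a fixed context on every turn. The sketch below is a simplified model under stated assumptions: the first turn pays full price, every later turn gets the discount on the repeated context, and cache-write surcharges, cache expiry, new per-turn tokens, and output tokens are all ignored. The turn count, context size, and the $3 per 1M input price are illustrative inputs.

```python
def input_cost(turns: int, context_tokens: int, price_per_m: float,
               cached_discount: float = 0.0) -> float:
    """Input-token cost in USD when the same context is re-sent on every turn."""
    full_price = context_tokens * price_per_m / 1_000_000
    # Turn 1 pays full price; turns 2..n pay the discounted rate on the repeated context.
    return full_price + (turns - 1) * full_price * (1 - cached_discount)


# Assumed session: 50 turns, 100K tokens of repeated context, $3 per 1M input tokens.
uncached = input_cost(50, 100_000, 3.00)
cached = input_cost(50, 100_000, 3.00, cached_discount=0.90)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
# uncached: $15.00, cached: $1.77
```

The saving on input tokens in this toy session is close to the discount itself; the saving on the total bill is smaller because output tokens and uncached new input are not discounted.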

For vendors

Run an AI coding product? Claim your listing.

CodeSOTA’s code-generation comparison is read by engineering leaders picking a coding assistant or LLM for production. If you represent one of the vendors above — or a product we missed — claim the listing to submit verified pricing, SWE-bench results, harness details, and a demo link. Free; credibility-gated, not pay-to-play.
