On one side, IDE-integrated developer products (Copilot, Cursor, Cody, Tabnine, Codeium, Augment) priced per seat and optimised for inline completion, chat, and multi-file edits. On the other, raw code-capable LLMs (Claude Opus 4.7, GPT-5, Gemini 3 Pro, DeepSeek, Qwen3-Coder) priced per token — the intelligence layer that powers the first category, and that agentic CLIs like Aider and Codex call directly.
Below: 13 products and models, compared on the axes that actually decide the choice.
IDE products (per seat / month) · frontier LLM APIs (per 1M tokens) · open-weights LLMs. Pricing units differ by tier — read the Cost column accordingly.
| Provider / Product | Tier | License | Cost | IDE integrations | Context | Agent | SWE-bench Verified |
|---|---|---|---|---|---|---|---|
| GitHub Copilot | IDE | Proprietary product | $10–19 / seat / mo | VSCode · JetBrains · Vim/Neovim · Visual Studio · Xcode | Repo-aware (workspace indexing) | ✓ | ~55% (Workspace agent) |
| Cursor | IDE | Proprietary product | $20–40 / seat / mo | Cursor (VSCode fork) only | Repo-wide embedding index · multi-file edits | ✓ | ~50–60% (Agent mode) |
| Sourcegraph Cody | IDE | Proprietary product | $9–19 / seat / mo | VSCode · JetBrains · Neovim · Web | Graph-based code intelligence across the monorepo | ✓ | Not self-reported |
| Tabnine | IDE | Proprietary product | $9–39 / seat / mo | VSCode · JetBrains · Vim/Neovim · Eclipse · many more | Local + repo-aware | ✓ | Not self-reported |
| Codeium | IDE | Proprietary product | Free · $15 / seat / mo | VSCode · JetBrains · Vim · 40+ editors · Windsurf (own IDE) | Repo-wide | ✓ | Not self-reported |
| Augment Code | IDE | Proprietary product | $30 / seat / mo | VSCode · JetBrains · Vim · CLI · Remote Agents | 200K+ token context engine across large monorepos | ✓ | ~65% (Remote Agent, self-reported) |
| Continue | IDE | Open source | Free · BYO model | VSCode · JetBrains | Configurable: local, repo, or custom retrievers | ✓ | Harness-dependent |
| Aider | IDE | Open source | Free · BYO model | CLI (works alongside any editor) | Repo map + git-aware edits | ✓ | Aider Polyglot leaderboard |
| Anthropic Claude | Frontier LLM | Proprietary API | $3 / $15 – $15 / $75 per 1M | Via Copilot · Cursor · Cody · Continue · Aider · Claude Code CLI | 200K–1M tokens | ✓ | SOTA on SWE-bench Verified (~70%+ harnessed) |
| OpenAI GPT-5 | Frontier LLM | Proprietary API | ~$10 / $30 per 1M (GPT-5 typical) | Via Copilot · Cursor · Cody · Continue · Aider · Codex CLI | 200K–400K tokens typical | ✓ | ~65–70% (Codex-style harnesses) |
| Google Gemini | Frontier LLM | Proprietary API | $1.25 / $5 per 1M (Pro) | Via Copilot · Cursor · Cody · Continue · Aider · Gemini CLI | 1M–2M tokens | ✓ | Top of LiveCodeBench (~91.7%) |
| DeepSeek | Open LLM | Open weights | $0.27 / $1.10 per 1M (hosted) | Via Continue · Aider · any OpenAI-compatible client | 128K tokens | ✓ | ~55–60% (best open result) |
| Qwen3-Coder | Open LLM | Open weights | Self-host · ~$1–3 per 1M (hosted) | Via Continue · Aider · vLLM / SGLang / Ollama | 256K–1M tokens (YaRN) | ✓ | ~50–55% (best Apache-licensed result) |
Pricing as of 2026-04. IDE products are priced per seat / month; LLMs are priced per 1M tokens (input / output). SWE-bench Verified scores are harness-dependent: the same model can swing 20 points across harnesses, which is why we report approximate ranges and note the harness where relevant.
The buyer question for an IDE tool (“Cursor or Copilot?”) is about ergonomics and multi-file context. The buyer question for a raw LLM (“Claude Opus 4.7 or GPT-5?”) is about SWE-bench numbers and cost per task. They’re not the same decision.
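To make "cost per task" concrete, here is a minimal sketch that converts per-1M-token prices into a dollar figure for one agent run. The prices are copied from the Cost column above; the task size (400K input tokens, 20K output tokens) is a hypothetical figure chosen for illustration, not a measurement, and the "Claude (top tier)" label simply takes the upper end of the Claude price range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per 1M tokens (input, output), as listed in the Cost column."""
    input_per_m: float
    output_per_m: float


# Prices copied from the comparison table (as of 2026-04).
PRICES = {
    "Claude (top tier)": TokenPrice(15.00, 75.00),
    "GPT-5": TokenPrice(10.00, 30.00),
    "Gemini 3 Pro": TokenPrice(1.25, 5.00),
    "DeepSeek V3.2": TokenPrice(0.27, 1.10),
}


def cost_per_task(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given its total input and output tokens."""
    return (
        input_tokens * price.input_per_m + output_tokens * price.output_per_m
    ) / 1_000_000


# Hypothetical task size: an agent run that re-reads context over many turns.
INPUT_TOKENS = 400_000
OUTPUT_TOKENS = 20_000

for name, price in PRICES.items():
    cost = cost_per_task(price, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{name:18s} ${cost:.2f} per task")
```

Under these assumptions the same task costs about $7.50 at the top Claude price and about $0.13 on DeepSeek. Whether the cheaper model is the better buy depends on how many attempts it needs to finish the task, which is what the SWE-bench column tries to capture.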
Solo developer · best bang-for-buck
GitHub Copilot · Cursor Pro
$10–20/mo. Copilot if you live in VSCode or JetBrains; Cursor if you're willing to switch editors for a better multi-file agent loop.
Autonomous / agentic coding (CLI)
Claude Code · Aider + Claude Opus 4.7 · Codex CLI
Terminal-driven loops that read errors, run tests, and commit their own edits. Claude Opus 4.7 is the default model; GPT-5 and Gemini 3 Pro are credible alternatives.
Cheapest credible tokens
Gemini 3 Pro · DeepSeek V3.2
Gemini 3 Pro at $1.25/$5 per 1M is the cheapest frontier model. DeepSeek V3.2 at $0.27/$1.10 offers roughly 80% of the quality at about a fifth of Gemini's price.
Large monorepo · repo-scale context
Sourcegraph Cody · Augment Code · Gemini 3 Pro
Cody indexes the graph; Augment's context engine is tuned for 200K+ token codebases; Gemini's 2M window lets you just paste the whole repo.
Air-gapped / on-prem / regulated
Tabnine Enterprise · Continue + self-hosted Qwen3-Coder
No code leaves your VPC. Tabnine is the productised path; Continue + Qwen3-Coder is the build-it-yourself path on Apache-licensed weights.
Open-source-only stack
Continue · Aider · Qwen3-Coder · DeepSeek V3.2
Apache / OSI-approved from editor to weights. Pair Continue or Aider with Qwen3-Coder locally via vLLM for a fully reproducible setup.
Raw LLM for an agent framework
Claude Opus 4.7 · GPT-5 · Gemini 3 Pro
Building your own agent loop? Opus 4.7 currently tops SWE-bench Verified with the right harness; Gemini 3 Pro tops LiveCodeBench on fresh problems.
HumanEval-style single-function completion is a solved problem: every model on this page clears it. The interesting failure modes are the ones that surface when you point a tool at a real codebase. Build your own 6-task eval covering these:

1. “Rename this interface across 5+ files, update callers, update tests.” Single-function completion is solved; multi-file edits are where Cursor / Augment / Claude Code pull ahead of Copilot-style autocomplete.
2. Does it remember the convention you established 400 lines earlier? Or does it reinvent a parallel pattern? This is the difference between a repo-aware tool and a glorified autocompleter.
3. Ask it to write code against a library that had a major API change in the last 6 months. Most models hallucinate the old API; the good ones either ask or use your lockfile.
4. Can it run the tests, read the failures, edit the code, and re-run? This is the qualitative gap between ‘fancy autocomplete’ and ‘junior engineer you can dispatch a ticket to.’
5. Paste a PR with a subtle bug. Can the tool catch it? Writing new code is easier than finding bugs in existing code, and review is where coding assistants earn their keep in teams.
6. SQL, Terraform, GLSL, Solidity, Zig. Benchmarks are Python + TypeScript heavy. If your production code is 60% DSL, that's where the real evaluation happens.

Run the same tasks through 2–3 candidates blind and score on finished PRs, not tokens generated. A tool that writes confident code that doesn’t compile is worse than one that asks a clarifying question.
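One way to keep such an eval blind is to hide tool names behind shuffled labels and score only the outcome of each task. The sketch below is a rough illustration under stated assumptions: the outcome categories and the partial-credit weights are hypothetical choices, picked only so that a clarifying question ranks above code that doesn't compile.

```python
import random
from collections import defaultdict
from enum import Enum


class Outcome(Enum):
    """Per-task outcomes; the weights are illustrative assumptions."""
    FINISHED_PR = 1.0           # PR compiles, passes tests, survives review
    ASKED_CLARIFICATION = 0.25  # no PR, but the tool flagged the ambiguity
    BROKEN_CODE = 0.0           # confident output that doesn't compile


def blind_labels(candidates: list[str], seed: int) -> dict[str, str]:
    """Map anonymous labels ("Tool A", ...) to shuffled candidate names."""
    rng = random.Random(seed)
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    return {f"Tool {chr(ord('A') + i)}": name for i, name in enumerate(shuffled)}


def score(results: list[tuple[str, str, Outcome]]) -> dict[str, float]:
    """Average outcome weight per blind label over (label, task, outcome) rows."""
    per_label = defaultdict(list)
    for label, _task, outcome in results:
        per_label[label].append(outcome.value)
    return {label: sum(vals) / len(vals) for label, vals in per_label.items()}


# Reviewers see only the labels; the mapping is revealed after scoring.
key = blind_labels(["candidate-1", "candidate-2"], seed=7)
results = [
    ("Tool A", "multi-file rename", Outcome.FINISHED_PR),
    ("Tool A", "recent API change", Outcome.BROKEN_CODE),
    ("Tool B", "multi-file rename", Outcome.FINISHED_PR),
    ("Tool B", "recent API change", Outcome.ASKED_CLARIFICATION),
]
print(score(results))
```

Keeping the label-to-name mapping away from whoever reviews the PRs is the part that matters; the exact weights are a judgment call for your team.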
HumanEval (2021) and MBPP (2021) are the original code-gen benchmarks — 164 and 974 short Python problems respectively. They’re both saturated and both contamination-prone: the problems are on the public internet, which means every model trained after ~2022 has seen them. Reporting 95% on HumanEval tells you the model was trained; it doesn’t tell you the model is good.
A 2-point delta on HumanEval is entirely noise. Worse, it’s often training-data leakage dressed as capability.
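A back-of-the-envelope check on that noise claim: treating each of HumanEval's 164 problems as an independent pass/fail trial (a simplifying assumption), the sampling error at a 95% pass rate can be estimated with the binomial standard error. The 0.95 pass rate is just the example figure used above.

```python
import math


def standard_error(pass_rate: float, n_problems: int) -> float:
    """Binomial standard error of a pass rate measured on n_problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


N_HUMANEVAL = 164  # problem count cited above
PASS_RATE = 0.95   # example score from the text

# How many percentage points a single problem is worth.
points_per_problem = 100.0 / N_HUMANEVAL

# Standard error of one model's score, in percentage points.
se_points = 100.0 * standard_error(PASS_RATE, N_HUMANEVAL)

# Standard error of the gap between two models scored independently.
se_of_difference = math.sqrt(2.0) * se_points

print(f"one problem      = {points_per_problem:.2f} points")
print(f"SE of a score    = {se_points:.2f} points")
print(f"SE of a 2-model gap = {se_of_difference:.2f} points")
```

Under these assumptions a single score carries a standard error of about 1.7 points and the gap between two models about 2.4 points, so a 2-point delta is smaller than one standard error of the difference.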
The evals that still discriminate in 2026 are SWE-bench Verified (real GitHub issues, agentic), LiveCodeBench (time-stamped competition problems, contamination-resistant by construction), Aider Polyglot (6 languages, edit-based), and BigCodeBench (function-level with real library calls).
Even these have a harness problem — the same base model can post wildly different SWE-bench numbers depending on the scaffold. Agent harness > model: a great model inside a bad harness loses to a worse model with a good one.
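To illustrate why the scaffold matters, here is a minimal sketch of the feedback loop a harness wraps around a model: propose a patch, apply it, run the tests, and feed failures back in. Every name here (`run_agent`, `RunResult`, the stub functions) is hypothetical; a real harness would call an LLM API and a test runner where the stubs sit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool
    log: str  # test output fed back to the model on failure


def run_agent(
    task: str,
    propose_patch: Callable[[str, str], str],
    apply_patch: Callable[[str], None],
    run_tests: Callable[[], RunResult],
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the attempt number that passed the tests, or None if none did."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(task, feedback)
        apply_patch(patch)
        result = run_tests()
        if result.passed:
            return attempt
        feedback = result.log  # error feedback loop: next attempt sees failures
    return None


# Stub "model" that only gets it right once it has seen a failing test log.
state = {"fixed": False}


def stub_model(task: str, feedback: str) -> str:
    return "good-patch" if feedback else "bad-patch"


def stub_apply(patch: str) -> None:
    state["fixed"] = patch == "good-patch"


def stub_tests() -> RunResult:
    return RunResult(state["fixed"], "" if state["fixed"] else "AssertionError")


with_loop = run_agent("fix bug", stub_model, stub_apply, stub_tests)
single_shot = run_agent("fix bug", stub_model, stub_apply, stub_tests, max_attempts=1)
print(with_loop, single_shot)
```

The same stub model succeeds on the second attempt when it gets test feedback and fails outright in single-shot mode, which is the harness effect in miniature.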
Six datasets that show up on every coding leaderboard. The first two are saturated legacy; the next three are the ones you should actually read when comparing models in 2026; the last is the emerging function-level standard.
HumanEval: The pioneer. OpenAI's 2021 release that defined the category. Every model on this page clears 90%+. Saturated; contamination-prone. Reported for historical comparability only.
MBPP: “Mostly Basic Python Programming.” Pairs naturally with HumanEval as the entry-level benchmark suite. Same saturation story; useful only for checking that a model can code at all.
LiveCodeBench: Contest problems from LeetCode, AtCoder, and Codeforces, time-stamped to filter out contamination. Updates continuously. Gemini 3 Pro leads at ~91.7% on recent slices; still the cleanest signal on raw algorithmic ability.
SWE-bench Verified: Human-verified subset of the original SWE-bench. Each task is a real bug-fix PR from a popular OSS project. Requires editing multiple files, running tests, and iterating, which makes it the reference agentic benchmark. Harness-dependent.
Aider Polyglot: Edit-based eval across Python, JavaScript, Go, Rust, C++, Java. Run through Aider's CLI harness (the same harness you'd use in production), which makes the numbers honest and reproducible.
BigCodeBench: Function-level generation that forces the model to use real libraries (requests, numpy, pandas, etc). Discriminates between models that memorised docs and models that can actually compose APIs.
Stop looking at HumanEval. It’s contamination-prone and saturated: every frontier model clears 90%+ and the ranking tells you more about training data than capability. Use LiveCodeBench (for algorithmic problems) or SWE-bench Verified (for agentic, real-codebase work) instead.
IDE tools and raw APIs are different purchases. IDE products matter for ergonomics — completion latency, keybindings, diff UX, multi-file context in the editor. Raw LLM APIs matter for autonomous workflows — an agent dispatched to a backlog ticket doesn’t care how pretty the diff view is. Don’t pick one and shoehorn the other.
Open-weights have closed most of the gap. Qwen3-Coder and DeepSeek V3.2 land ~80% of frontier quality at a fraction of the cost per token: at the prices in the table above, DeepSeek's hosted rate is about 1/5th of Gemini 3 Pro's and roughly 1/30th of GPT-5's. If your workload is high-volume or your procurement team rejects anything that can’t run on-prem, this is the pragmatic path, especially paired with an open agent harness like Continue or Aider.
Agent harness beats model. A great model inside a bad scaffold (no test execution, no error feedback loop, no retry on failure) loses to a weaker model with a well-engineered harness. SWE-bench numbers are mostly a harness story — Anthropic, Aider, and Cognition all post different numbers for the same underlying Claude. If you’re picking a tool, the harness matters more than the model brand.
Cache your prompts on long codebases. Anthropic and OpenAI both offer prompt caching with 50–90% discount on cached input. If you’re running a coding agent that re-sends the same system prompt + repo context on every turn, caching is the highest-leverage cost lever available — easily 50–80% off the monthly bill.
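A rough sketch of the caching arithmetic for one agent session. All inputs are illustrative assumptions: a 150K-token system prompt plus repo context re-sent on each of 40 turns, 5K fresh tokens per turn, a $3 per 1M input price, and a 90% discount on cached input. It ignores output tokens, any cache-write surcharge, and cache expiry, so treat the result as an upper bound on input-side savings.

```python
def session_input_cost(
    turns: int,
    prefix_tokens: int,
    fresh_tokens_per_turn: int,
    input_price_per_m: float,
    cache_discount: float,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in USD for input tokens only.

    Assumes the first turn pays full price for the prefix and every later
    turn reads the prefix from cache at (1 - cache_discount) of the price.
    """
    per_token = input_price_per_m / 1_000_000
    uncached = turns * (prefix_tokens + fresh_tokens_per_turn) * per_token

    first_turn = (prefix_tokens + fresh_tokens_per_turn) * per_token
    later_turn = (
        prefix_tokens * (1.0 - cache_discount) + fresh_tokens_per_turn
    ) * per_token
    cached = first_turn + (turns - 1) * later_turn
    return uncached, cached


uncached, cached = session_input_cost(
    turns=40,
    prefix_tokens=150_000,
    fresh_tokens_per_turn=5_000,
    input_price_per_m=3.00,
    cache_discount=0.90,
)
savings = 1.0 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, savings {savings:.0%}")
```

With these numbers the input bill drops from about $18.60 to about $2.81 per session. At a 50% discount the same session saves a little under half, which is why the achievable reduction depends heavily on the provider's cached-input rate and on how much of each request is a repeated prefix.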
CodeSOTA’s code-generation comparison is read by engineering leaders picking a coding assistant or LLM for production. If you represent one of the vendors above — or a product we missed — claim the listing to submit verified pricing, SWE-bench results, harness details, and a demo link. Free; credibility-gated, not pay-to-play.