Codesota · Research method · Inference economics
Three simulations · one owned dataset · 100k draws
Published June 2026
Editorial · Inference economics

The subscription giveaway proves the token margin.

A $200/month ChatGPT Pro plan can be driven to roughly $14,000/month of API-equivalent token consumption before its limits bind. No business hands out 70× its price unless the list price is almost entirely margin. We run three simulations to put numbers on “almost entirely.”

Method: subscription-arbitrage floors from SemiAnalysis’s June 2026 plan-exhaustion experiment; a 100,000-draw Monte Carlo of frontier serving cost anchored to DeepSeek’s published inference-system disclosure; a market cross-check against Chinese frontier pricing from CodeSOTA Intelligence, our own OpenRouter dataset; and a reconciliation against the 33–40% blended gross margins reported for OpenAI and Anthropic. Simulation code is in the methodology note at the end.

§ 01 · The measurement

Six plans, one experiment.

SemiAnalysis bought every Anthropic and OpenAI subscription tier and ran long-horizon coding tasks until the weekly limits bound. The resulting ratios are the cleanest public probe of what these companies think a token costs.

The experiment is simple: buy a plan, drive it with agentic coding workloads until it stops, and price the consumed tokens at API list rates. Every tier delivers a multiple of its price — and the multiple grows with the tier. The most expensive plan is the most heavily subsidized per dollar.

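| Plan | Price /mo | Max API-equiv. spend | Ratio | Margin floor* |
| --- | --- | --- | --- | --- |
| claude-pro | $20 | $400 | 20× | 95.0% |
| claude-max-5x | $100 | $2,000 | 20× | 95.0% |
| claude-max-20x | $200 | $8,000 | 40× | 97.5% |
| chatgpt-plus | $20 | $700 | 35× | 97.1% |
| chatgpt-pro-5x | $100 | $3,500 | 35× | 97.1% |
| chatgpt-pro-20x | $200 | $14,000 | 70× | 98.6% |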

* Floor on API gross margin under the conservative assumption that a fully maxed-out subscriber is merely break-even for the vendor. If maxed subscribers are profitable, true margins are higher still. Max-spend figures: SemiAnalysis, June 2026.

The arithmetic of the floor is one line. If $200 of subscription revenue covers the serving cost of $14,000 worth of list-price tokens, then serving cost is at most 200⁄14,000 = 1.4% of list revenue — a 98.6% gross margin on tokens sold at list. The Anthropic tiers imply 95.0–97.5% by the same logic. These are floors under an assumption, not estimates: vendors may well lose money on the heaviest users, the way gyms lose money on members who actually show up daily. The next section asks what tokens cost from first principles instead.
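The floor arithmetic can be sketched in a few lines of Python. This is an illustrative sketch, not the published script: the plan figures are copied from the table above, and the names `PLANS`, `margin_floor`, and `floors` are our own for this example.

```python
# Plan data copied from the table above: (monthly price, max API-equivalent spend), USD.
PLANS = {
    "claude-pro": (20, 400),
    "claude-max-5x": (100, 2_000),
    "claude-max-20x": (200, 8_000),
    "chatgpt-plus": (20, 700),
    "chatgpt-pro-5x": (100, 3_500),
    "chatgpt-pro-20x": (200, 14_000),
}


def margin_floor(price, max_spend):
    """Floor on API gross margin, assuming a maxed-out subscriber is exactly break-even."""
    return 1.0 - price / max_spend


def floors(plans=PLANS):
    """Map each plan to (ratio of max spend to price, margin floor)."""
    return {
        name: (max_spend / price, margin_floor(price, max_spend))
        for name, (price, max_spend) in plans.items()
    }


if __name__ == "__main__":
    for name, (ratio, floor) in floors().items():
        print(f"{name:16s} {ratio:4.0f}x  {floor:6.1%}")
```

The break-even assumption carries the whole result: if vendors lose money on maxed-out subscribers, the true floor is lower than the one computed here.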

One more structural observation: OpenAI’s ratios run roughly 75% richer than Anthropic’s at every tier (35× vs 20×, 70× vs 40×). Either OpenAI serves tokens meaningfully cheaper, or it is buying developer market share more aggressively — most likely both, given GPT-tier list prices are also lower.

§ 02 · The simulation

What a frontier token costs to serve.

DeepSeek is the only frontier-scale lab that has published its inference economics. We use its disclosure as the anchor and Monte Carlo the uncertainty between its stack and a Western frontier lab's.

In February 2025 DeepSeek published a day in the life of its V3/R1 inference fleet: an average of 226.75 eight-GPU H800 nodes rented at $2/GPU-hour ($87,072/day), serving 608 billion input tokens (56.3% from cache) and 168 billion output tokens in 24 hours. At its R1 list prices that throughput was worth $562,027/day — a theoretical cost-profit ratio of 545%, i.e. an 84.5% gross margin, on rented, export-restricted hardware.

Two anchor numbers fall out of the disclosure. Decode throughput of ~14,800 output tokens/second per node at $16/node-hour prices an output token at $0.30 per million. And dividing the whole day’s cost by all 776 billion processed tokens gives a blended all-token cost of $0.11 per million — input tokens are nearly free next to decode.
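Both anchors, and the margin quoted above, follow directly from the disclosed figures. The sketch below reproduces that arithmetic; the constants are the numbers quoted in the text, and the function and key names are ours.

```python
# Figures quoted from DeepSeek's disclosure, as cited in the text above.
NODES = 226.75                      # average 8-GPU H800 nodes in service
GPUS_PER_NODE = 8
GPU_HOUR_USD = 2.0                  # rental price per GPU-hour
INPUT_TOKENS = 608e9                # input tokens served in 24 hours
OUTPUT_TOKENS = 168e9               # output tokens served in 24 hours
LIST_VALUE_PER_DAY = 562_027.0      # throughput valued at R1 list prices
DECODE_TOK_PER_S_PER_NODE = 14_800  # approximate decode throughput per node


def deepseek_anchors():
    """Recompute the daily cost, margin, and per-token anchor costs."""
    node_hour = GPUS_PER_NODE * GPU_HOUR_USD              # $16 per node-hour
    daily_cost = NODES * node_hour * 24                   # fleet rental per day
    return {
        "daily_cost": daily_cost,
        "cost_profit_ratio": (LIST_VALUE_PER_DAY - daily_cost) / daily_cost,
        "gross_margin": 1.0 - daily_cost / LIST_VALUE_PER_DAY,
        # $ per million output tokens if a node only decodes at the stated rate
        "output_cost_per_m": node_hour / (DECODE_TOK_PER_S_PER_NODE * 3600) * 1e6,
        # $ per million tokens, spreading the whole day's cost over all tokens
        "blended_cost_per_m": daily_cost / (INPUT_TOKENS + OUTPUT_TOKENS) * 1e6,
    }


if __name__ == "__main__":
    for key, value in deepseek_anchors().items():
        print(f"{key:20s} {value:,.4f}")
```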

A Western frontier model is not DeepSeek-V3. We model the gap with four uncertain multipliers and run 100,000 draws: active compute per token (1.5–10× V3’s 37B active parameters, mode 4×), effective GPU cost (1–4.5× the $2 H800 rental — owned H100/B200 fleets amortize near the bottom of that range), a hardware-efficiency credit (H100/B200 with full-bandwidth NVLink against export-nerfed H800s, 1.5–4×), and fleet utilization (35–80%). The resulting distribution of output-token serving cost:

Simulated serving cost, $ per million output tokens

| p10 | p25 | median | p75 | p90 |
| --- | --- | --- | --- | --- |
| $1.17 | $1.66 | $2.41 | $3.47 | $4.73 |

Fig 1 · 100,000 draws, triangular priors, seed 20260611. Prefill (input) cost simulates to a median $0.14/M (p10–p90: $0.06–$0.30), against $3–10/M list input prices and ~$0.30–1.00/M cache reads.
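For readers who want the shape of the simulation without opening the script, here is a minimal standard-library sketch. It is not the published code and is not expected to reproduce the percentiles above. The ranges and the 4× compute mode come from the text; everything else is an assumption made for illustration: the other three priors use the midpoint of their range as the mode, and cost is assumed to scale as compute × GPU cost ÷ hardware credit ÷ utilization from the $0.30/M anchor.

```python
import random
import statistics

ANCHOR_OUT_COST = 0.30  # $/M output tokens, the DeepSeek anchor from the text


def simulate(draws=100_000, seed=20260611):
    """Draw output-token serving costs ($/M) under assumed triangular priors."""
    rng = random.Random(seed)
    costs = []
    for _ in range(draws):
        compute = rng.triangular(1.5, 10.0, 4.0)   # range and mode from the text
        gpu_cost = rng.triangular(1.0, 4.5)        # ASSUMED mode: midpoint
        hw_credit = rng.triangular(1.5, 4.0)       # ASSUMED mode: midpoint
        utilization = rng.triangular(0.35, 0.80)   # ASSUMED mode: midpoint
        # ASSUMED functional form: multiplicative scaling of the anchor cost.
        costs.append(ANCHOR_OUT_COST * compute * gpu_cost / hw_credit / utilization)
    return costs


def percentiles(costs, points=(10, 25, 50, 75, 90)):
    """Return the requested percentiles of the simulated cost distribution."""
    cuts = statistics.quantiles(costs, n=100)  # 99 cut points; cuts[p - 1] is p-th
    return {p: cuts[p - 1] for p in points}


if __name__ == "__main__":
    for p, value in percentiles(simulate()).items():
        print(f"p{p}: ${value:.2f}/M")
```

Fixing the seed makes a run repeatable, which is what lets a reader vary one prior at a time and see how far the percentiles move.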

Setting the cost distribution against June 2026 list prices — Anthropic charges $25 per million output tokens at Opus tier, $15 at Sonnet tier, $5 at Haiku tier; OpenAI’s flagship tier sits near $10 — gives the implied gross margin on output tokens:

| Price tier | If costs are high (p90) | Median | If costs are low (p10) |
| --- | --- | --- | --- |
| Opus-tier $25/M out | 81.1% | 90.4% | 95.3% |
| Sonnet-tier $15/M out | 68.4% | 83.9% | 92.2% |
| GPT-5.x-tier $10/M out | 52.7% | 75.9% | 88.3% |
| Haiku-tier $5/M out | 5.3% | 51.8% | 76.5% |
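Each cell in the table is one minus simulated cost over list price. The sketch below recomputes it from the rounded percentiles in Fig 1; because those inputs are rounded to the cent, a few cells can differ from the table in the last digit. The names are ours.

```python
# Rounded percentiles of simulated output-token serving cost, $/M (from Fig 1).
COST_PER_M = {"p90": 4.73, "median": 2.41, "p10": 1.17}

# List output prices per million tokens, as quoted in the text.
LIST_OUT_PRICE = {
    "Opus-tier": 25.0,
    "Sonnet-tier": 15.0,
    "GPT-5.x-tier": 10.0,
    "Haiku-tier": 5.0,
}


def implied_margin(price, cost):
    """Gross margin on a token sold at `price` that costs `cost` to serve."""
    return 1.0 - cost / price


def margin_table():
    """Implied gross margin for every price tier and cost percentile."""
    return {
        tier: {label: implied_margin(price, cost) for label, cost in COST_PER_M.items()}
        for tier, price in LIST_OUT_PRICE.items()
    }


if __name__ == "__main__":
    for tier, row in margin_table().items():
        cells = "  ".join(f"{label}={margin:6.1%}" for label, margin in row.items())
        print(f"{tier:14s} {cells}")
```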

The headline: at Opus-tier pricing, even the pessimistic tail of our cost distribution leaves an 81% gross margin, and the median is 90%. Input tokens are more lopsided still: a median $0.14/M serving cost against $3–10/M list prices is roughly a 95–99% margin, and cached input (90% off list, ~$0.30–1.00/M) remains comfortably profitable. The economy tiers are the only place list pricing approaches cost: a $5/M Haiku-class token at the high end of our cost range is nearly margin-free, which is consistent with small-model APIs being treated as traffic acquisition rather than profit centers.

This is the quantitative core of the “DeepSeek-level compute efficiency” observation: the subscription ratios are only survivable if OpenAI and Anthropic serve tokens at a cost structure comparable to the one DeepSeek published — and the first-principles math says they plausibly do, even with substantially larger models, once you credit better hardware and the overtraining-for-inference trade frontier labs explicitly make.

§ 03 · The market check

What the market already charges.

A simulation is only as good as its priors — so we check it against prices set by labs that have to live with them. Data: CodeSOTA Intelligence, our own daily OpenRouter dataset (~30T tokens/week tracked, 580+ days of history).

Chinese frontier labs are the natural experiment. They publish open weights, compete almost purely on price, and — per DeepSeek’s disclosure — serve profitably on rented, export-restricted hardware. Whatever they charge is a market-validated ceiling on what frontier-class inference costs. Over the last seven days on OpenRouter:

| Origin | Tokens / 7d | Volume share | Blended $/M | Output $/M | Annualized spend |
| --- | --- | --- | --- | --- | --- |
| Chinese-origin labs | 18.3T | 58.6% | $0.25 | $0.72 | $234M |
| Western labs | 11.3T | 36.0% | $2.53 | $5.69 | $1,488M |

CodeSOTA Intelligence, 7 days to 2026-06-11. Remainder (~5%) is stealth/unattributed providers. Chinese-origin models overtook Western ones in volume share on 2026-03-21 and now carry ~60% of tokens for ~14% of revenue.
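The spend column and the revenue split can be approximated from the token volumes and blended prices in the table. The sketch below does so under two assumptions of ours: spend is annualized at 365/7 weeks per year, and the unattributed remainder carries negligible spend. Because the blended prices are rounded to the cent, the results land within a few percent of the table rather than matching it exactly.

```python
WEEKS_PER_YEAR = 365 / 7  # ASSUMED annualization factor

# Seven-day token volume and blended $/M, copied from the table above.
ROWS = {
    "Chinese-origin labs": {"tokens_7d": 18.3e12, "blended_per_m": 0.25},
    "Western labs": {"tokens_7d": 11.3e12, "blended_per_m": 2.53},
}


def annualized_spend(tokens_7d, blended_per_m):
    """Annualized USD spend implied by one week of volume at a blended price."""
    return tokens_7d / 1e6 * blended_per_m * WEEKS_PER_YEAR


def revenue_shares(rows=ROWS):
    """Share of attributed spend by origin (unattributed providers excluded)."""
    spend = {origin: annualized_spend(**row) for origin, row in rows.items()}
    total = sum(spend.values())
    return {origin: value / total for origin, value in spend.items()}


if __name__ == "__main__":
    for origin, row in ROWS.items():
        print(f"{origin:20s} ${annualized_spend(**row) / 1e6:,.0f}M per year")
    for origin, share in revenue_shares().items():
        print(f"{origin:20s} {share:.1%} of attributed spend")
```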

The blended gap is 10.1× ($2.53 vs $0.25 per million tokens); on output tokens it is 7.9× ($5.69 vs $0.72). At the flagship level the spread is wider still:

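| Model | Origin | Input $/M | Output $/M | Tokens / 7d |
| --- | --- | --- | --- | --- |
| Claude Opus 4.8 | WEST | $5.00 | $25.00 | 1.11T |
| GPT-5.5 | WEST | $5.00 | $30.00 | 0.39T |
| Gemini 3.1 Pro Preview | WEST | $2.00 | $12.00 | 0.21T |
| Claude Sonnet 4.6 | WEST | $3.00 | $15.00 | 1.55T |
| Kimi K2.6 | CHINA | $0.68 | $3.41 | 0.30T |
| GLM 5.1 | CHINA | $0.98 | $3.08 | 0.26T |
| MiniMax M3 | CHINA | $0.30 | $1.20 | 2.71T |
| DeepSeek V4 Pro | CHINA | $0.43 | $0.87 | 1.66T |
| DeepSeek V4 Flash | CHINA | $0.10 | $0.20 | 3.68T |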

Here is the point of the table: Kimi K2.6 at $3.41, GLM 5.1 at $3.08, DeepSeek V4 Pro at $0.87 per million output tokens all sit inside or below our simulated Western serving-cost range of $1.17–$4.73 — and these are prices, not costs, charged by labs that need positive margin on rented GPUs. The market independently confirms the simulation’s ceiling. Set Claude Opus 4.8’s $25/M or GPT-5.5’s $30/M against it, and the Western flagship premium of 7–35× over frontier-class Chinese pricing reads as what it is: margin plus a capability/trust premium, not cost.
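The premium range and the ceiling check are simple ratios over the output prices in the table. The sketch below recomputes them for the five models named in the paragraph above; which models count as the comparison set is taken from that paragraph, and the helper names are ours.

```python
# Simulated Western serving-cost range, $/M output tokens (p10, p90 from Fig 1).
SIM_RANGE = (1.17, 4.73)

# Output $/M for the models named in the paragraph above.
OUTPUT_PRICE = {
    "Claude Opus 4.8": 25.00,
    "GPT-5.5": 30.00,
    "Kimi K2.6": 3.41,
    "GLM 5.1": 3.08,
    "DeepSeek V4 Pro": 0.87,
}
WEST = ("Claude Opus 4.8", "GPT-5.5")
CHINA = ("Kimi K2.6", "GLM 5.1", "DeepSeek V4 Pro")


def premium_range():
    """Smallest and largest Western-to-Chinese output price ratio."""
    ratios = [OUTPUT_PRICE[w] / OUTPUT_PRICE[c] for w in WEST for c in CHINA]
    return min(ratios), max(ratios)


def inside_or_below(price, ceiling=SIM_RANGE[1]):
    """True if a market price sits inside or below the simulated cost range."""
    return price <= ceiling


if __name__ == "__main__":
    low, high = premium_range()
    print(f"Western flagship premium: {low:.1f}x to {high:.1f}x")
    for model in CHINA:
        print(f"{model:16s} inside or below range: {inside_or_below(OUTPUT_PRICE[model])}")
```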

The revenue asymmetry makes the same point from the demand side. Western labs collect 86% of OpenRouter spend on 36% of volume; Anthropic alone books $1.3B annualized through this one router, roughly three-quarters of all spend on it. Buyers who route by price have already migrated; the tokens still sold at $15–30/M are sold to buyers paying for the frontier, and that is precisely the demand a 90%-margin price is designed to harvest.

§ 04 · The reconciliation

Why the reported margins look so much worse.

The Information reports OpenAI's 2025 gross margin at ~33% (vs a 46% target) and Anthropic's near 40%. Both numbers are compatible with ~90% token margins — in fact they quietly require them.

If the API runs at ~90% gross margin, where do 33–40% blended margins come from? From everything that isn’t the API: subscriptions delivering 20–70× their price in tokens, and free-tier serving that generates no revenue at all. We model a 40×-ratio plan (claude-max-20x) with a skewed utilization distribution — most subscribers light, a heavy tail near the cap — and ask what average cap-utilization reproduces the reported blended numbers, assuming 40% of revenue is API at a 92% margin.

| Mean cap-utilization | Subscription gross margin | Blended margin (40% API rev) |
| --- | --- | --- |
| 5% | 83.9% | 87.2% |
| 10% | 67.9% | 77.5% |
| 20% | 35.9% | 58.4% |
| 35% | −12.1% | 29.5% |
| 50% | −59.2% | 1.3% |

The 35% row is the one closest to the regime consistent with reported blended margins. A 40× plan breaks even at ~31% mean cap-utilization; above that, every marginal heavy user is served at a loss.
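The structure of the table can be approximated in closed form. The sketch below is a simplification of ours, not the published script: it assumes a list-priced token costs (1 − API margin) of its list price to serve, so average subscription cost depends only on mean cap-utilization and the shape of the per-user distribution drops out. The published sweep samples a skewed per-user distribution, so its figures differ slightly from these.

```python
def subscription_margin(mean_util, ratio=40.0, api_margin=0.92):
    """Gross margin on subscription revenue at a given mean cap-utilization."""
    # Serving cost per $1 of subscription revenue:
    # utilization x (list value per $1 of price) x (cost share of list price).
    serving_cost = mean_util * ratio * (1.0 - api_margin)
    return 1.0 - serving_cost


def blended_margin(mean_util, api_share=0.40, ratio=40.0, api_margin=0.92):
    """Revenue-weighted margin across API and subscription revenue."""
    sub = subscription_margin(mean_util, ratio, api_margin)
    return api_share * api_margin + (1.0 - api_share) * sub


def breakeven_util(ratio=40.0, api_margin=0.92):
    """Mean cap-utilization at which the subscription margin is zero."""
    return 1.0 / (ratio * (1.0 - api_margin))


def util_for_blended(target, api_share=0.40, ratio=40.0, api_margin=0.92):
    """Invert blended_margin: the mean utilization that yields `target`."""
    sub = (target - api_share * api_margin) / (1.0 - api_share)
    return (1.0 - sub) / (ratio * (1.0 - api_margin))


if __name__ == "__main__":
    for u in (0.05, 0.10, 0.20, 0.35, 0.50):
        print(f"{u:4.0%}  sub={subscription_margin(u):7.1%}  blended={blended_margin(u):6.1%}")
    print(f"break-even utilization: {breakeven_util():.1%}")
    for target in (0.40, 0.33):
        print(f"blended {target:.0%} -> mean utilization {util_for_blended(target):.1%}")
```

Under this simplification the reported 33–40% blended margins correspond to a mean cap-utilization of roughly 30–33%, just either side of the break-even point.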

The reported 33–40% blended margins land between the 20% and 35% utilization rows, close to the latter. Interpolating between those rows, the average subscriber consumes roughly 30–33% of what their plan allows, and the subscription business as a whole runs near zero or negative gross margin. SemiAnalysis’s separate finding that Anthropic’s inference margins improved from 38% to 70% over the past year fits the same picture: serving efficiency is compounding (hardware generations, speculative decoding, batching, caching) faster than prices are being cut.

Read carefully, the low blended margin is the strongest evidence for the high token margin, not against it. You cannot give away $8,000–14,000 of list-price compute against a $200 subscription, absorb a free tier of hundreds of millions of users, and still post any positive gross margin — unless the marginal token costs you a few percent of what you charge for it at the API window.

§ 05 · Implications

What this means if you buy tokens.

Three practical conclusions, in decreasing order of confidence.

1 · API list prices have enormous headroom to fall — and will, selectively. A 90% gross margin is not a stable equilibrium in a market with three credible frontier vendors and an open-weight chaser publishing its cost structure. But the cuts arrive as engineering products, not price-tag changes: batch APIs at 50% off, cache reads at 90% off, and cheaper mid-tier models that cannibalize the flagship. The list price of the flagship output token is the last thing to move, because it prices desperation, not cost.

2 · Subscriptions are the cheapest tokens money can buy, while they last. At measured ratios of 20–70×, a maxed plan is a 95–98.6% discount to list. The vendors know this — surcharges, credit metering, and per-agent billing introduced through 2026 are exactly what closing an arbitrage looks like. Routing agentic workloads through subscription seats (as the OpenClaw episode demonstrated at scale) is the kind of free lunch that gets eaten by the people serving it.

3 · “Compute-constrained” and “90% margins” are the same fact. High per-token margins are how a lab rations scarce inference capacity toward training and the highest bidders while its inference costs grow 3–4× a year (OpenAI: ~$8.4B in 2025; Anthropic: ~$2.7B). The margin is not evidence of gouging so much as a queue-pricing mechanism — which is why it coexists with blended margins that miss their own targets. For buyers the actionable read is unchanged: the spread between what a token costs to serve and what you pay at list is wide enough that every layer of the stack — caching, batching, routing, model right-sizing — is worth engineering against.
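To make the last point concrete, here is a rough sketch of how caching and batching move the effective price of a workload. It uses the discounts quoted in this piece (cache reads at about 0.1× list input price, batch at 50% off) and makes simplifying assumptions of ours: the batch discount applies uniformly and stacks with the cache discount, and cache writes carry no surcharge. Actual vendor billing rules may differ, so treat the output as an estimate.

```python
def effective_cost_per_m(in_price, out_price, input_share, cache_hit_rate,
                         batch_share=0.0, cache_read_factor=0.10, batch_factor=0.50):
    """Blended $ per million tokens for a workload.

    in_price, out_price: list $/M for input and output tokens.
    input_share: fraction of all tokens that are input tokens.
    cache_hit_rate: fraction of input tokens served from cache.
    batch_share: fraction of traffic sent through a batch API.
    """
    for name, value in (("input_share", input_share),
                        ("cache_hit_rate", cache_hit_rate),
                        ("batch_share", batch_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Cached input pays the read factor; the rest pays full list price.
    input_price = in_price * (cache_hit_rate * cache_read_factor + (1.0 - cache_hit_rate))
    blended = input_share * input_price + (1.0 - input_share) * out_price
    # ASSUMED: batch discount applies to the whole blended price.
    return blended * (batch_share * batch_factor + (1.0 - batch_share))


if __name__ == "__main__":
    # Opus-tier list prices from the sources section: $5 in, $25 out per million.
    baseline = effective_cost_per_m(5.0, 25.0, input_share=0.9, cache_hit_rate=0.0)
    cached = effective_cost_per_m(5.0, 25.0, input_share=0.9, cache_hit_rate=0.8)
    print(f"no caching:  ${baseline:.2f}/M")
    print(f"80% cached:  ${cached:.2f}/M")
```

For an input-heavy agentic workload, the cache hit rate moves the effective price more than any plausible list-price cut would, which is the sense in which the discounts arrive as engineering products.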

§ 06 · Method

Methodology and sources.

Reproducible: one Python script, seeded, no external dependencies.

All simulation outputs on this page come from scripts/token-margins-sim.py (seed 20260611, 100,000 draws, triangular priors). Part 1 is arithmetic on SemiAnalysis’s measured max-spend table. Part 2 anchors to DeepSeek’s disclosed fleet economics and scales by four uncertain multipliers covering model size, hardware cost, hardware efficiency, and utilization. Part 3 sweeps mean cap-utilization under a skewed (beta) per-user distribution with API share of revenue fixed at 40% and API margin at 92%. The honest caveats: frontier model sizes are unknown (we cover 1.5–10× DeepSeek’s active compute), revenue mix is approximated, and “max possible spend” reflects limits as measured in early June 2026 — both vendors tune limits frequently.

  • SemiAnalysis — subscription plan-exhaustion experiment (max possible spend per plan), June 2026, via @teortaxesTex.
  • DeepSeek — V3/R1 inference system overview (fleet size, costs, throughput, 545% theoretical cost-profit ratio), Feb 2025; coverage by CNBC.
  • The Information (via secondary coverage) — OpenAI 2025 gross margin ~33% vs 46% target; Anthropic ~40%, ten points under target; OpenAI 2025 inference cost ~$8.4B, Anthropic ~$2.7B.
  • SemiAnalysis — Anthropic inference margins improving 38% → 70% year over year.
  • CodeSOTA Intelligence — our own daily OpenRouter dataset (~30T tokens/week, 751 models, 580+ days incl. Internet Archive backfill). East–west split, blended prices, and flagship table queried from the raw data for the 7 days to 2026-06-11; origin classified by lab country.
  • Vendor list pricing, June 2026 — Anthropic ($5/$25 Opus-tier, $3/$15 Sonnet-tier, $1/$5 Haiku-tier per M in/out; batch 50% off; cache reads ~0.1×) and OpenAI published rates.

If you want this analysis re-run against a specific workload — your input/output mix, cache hit rate, and model tier — that is exactly the kind of custom benchmark engagement we do. Request a benchmark.
