Agentic · Market structure · Snapshot 2026-04-27

The AI inference market isn't choosing a winner. It's stratifying into three lanes.

Benchmark leaders don't lose revenue — they lose the long tail of workloads they used to own. When you look at what AI apps actually route through OpenRouter and match it to live pricing, a clear three-tier structure appears: SOTA models absorbing the dollar spend, cost-effective mid-tier handling daily workflows, and commodity models batching the scale workloads at pennies per million tokens. Here's how the market actually splits, with data.

Tier | Blended price | Models | Monthly cost | Tokens | % of spend | % of tokens
SOTA · Premium | ≥ $5/M | 16 | $54.38M | 6.84T | 72% | 21%
Cost-effective | $0.50–$5/M | 43 | $18.89M | 14.79T | 25% | 46%
Commodity · Scale | < $0.50/M | 40 | $2.74M | 10.33T | 4% | 32%

The upside-down insight: SOTA models are 72% of spend but only 21% of tokens. Commodity models are 32% of tokens but only 4% of spend. The market pays premium prices for a narrow slice of quality-critical work and routes everything else to whoever's cheapest.
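The inversion falls straight out of the tier totals. Here is a minimal Python sketch that recomputes both shares and the spend-to-token ratio per tier. The totals are copied from this snapshot. The `shares` helper is ours, for illustration only.

```python
# Tier totals from the 2026-04-27 snapshot: (monthly cost in USD, monthly tokens).
TIERS = {
    "SOTA": (54.38e6, 6.84e12),
    "Cost-effective": (18.89e6, 14.79e12),
    "Commodity": (2.74e6, 10.33e12),
}


def shares(tiers):
    """Return {tier: (share of spend, share of tokens)} as fractions of the totals."""
    total_cost = sum(cost for cost, _ in tiers.values())
    total_tokens = sum(tokens for _, tokens in tiers.values())
    return {
        name: (cost / total_cost, tokens / total_tokens)
        for name, (cost, tokens) in tiers.items()
    }


for name, (spend, tokens) in shares(TIERS).items():
    # Ratio above 1: the tier captures more of the money than of the traffic.
    print(f"{name:15s} spend {spend:4.0%}  tokens {tokens:4.0%}  ratio {spend / tokens:.2f}")
```

The ratio comes out near 3.3 for SOTA and near 0.11 for commodity. In other words, a premium token is worth about thirty commodity tokens in this snapshot's spend.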

The data behind the tiers

This isn't an editorial choice. It's a distribution.

If you plot all 99 priced models in this snapshot by blended price, the distribution is bimodal. There is one dense peak in the commodity band, another in the cost-effective band, and then a sparse premium tail above $5/M. The same data is shown two ways below: first by how many dollars flow through each price band (the economic shape), then by how many models sit in it.
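The tiers are plain cutoffs on blended price. Below is a minimal sketch of the bucketing. It assumes a model is SOTA at ≥ $5/M, commodity below $0.50/M, and cost-effective in between. The sample rows are copied from the tier tables on this page. `tier_for`, `log_bucket` and the four-buckets-per-decade width are hypothetical choices, not necessarily how the charts were binned.

```python
import math
from collections import defaultdict

SOTA_CUTOFF = 5.00        # $/M blended, inclusive lower bound of the SOTA tier
COMMODITY_CUTOFF = 0.50   # $/M blended, exclusive upper bound of the commodity tier


def tier_for(blended):
    """Map a blended $/M price onto one of the three tiers."""
    if blended >= SOTA_CUTOFF:
        return "SOTA"
    if blended >= COMMODITY_CUTOFF:
        return "Cost-effective"
    return "Commodity"


def log_bucket(blended, per_decade=4):
    """Index of a log-scale price bucket (assumed width; price must be > 0)."""
    return math.floor(math.log10(blended) * per_decade)


# (model, blended $/M, monthly tokens) sampled from the tier tables on this page.
ROWS = [
    ("Claude Opus 4.6", 10.60, 2.13e12),
    ("GPT-5.3-Codex", 5.18, 276.8e9),
    ("MiMo-V2-Pro", 1.56, 4.92e12),
    ("MiniMax M2.7", 0.55, 1.91e12),
    ("MiniMax M2.5", 0.43, 3.02e12),
    ("Step 3.5 Flash", 0.16, 2.73e12),
]

count = defaultdict(int)
spend = defaultdict(float)
for name, blended, tokens in ROWS:
    tier = tier_for(blended)
    count[tier] += 1
    spend[tier] += tokens / 1e6 * blended   # monthly USD = millions of tokens x $/M

for tier in ("SOTA", "Cost-effective", "Commodity"):
    print(f"{tier:15s} {count[tier]} models  ${spend[tier] / 1e6:,.2f}M/mo")
```

Counting per tier gives the model-count view, and summing tokens × price per tier gives the dollar view. Free (zero-price) variants would need separate handling, because a log scale has no bucket for $0.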

Dollar spend by price band

Where the money actually flows · log-scale buckets

[Chart: monthly dollar spend per log-scale price bucket. X-axis: blended price per million tokens, $0.05 to $25, with tier cutoffs marked at $0.50 and $5. The two tallest bars ($24.49M and $28.38M) sit above the $5 cutoff.]

The real shape of the market. The SOTA tail above $5/M is narrow in bins but tall in dollars: premium pricing multiplies modest token volumes into huge spend. Meanwhile, commodity models sit near zero by dollars even though they carry about a third (32%) of all tokens. Same models, opposite shape depending on the metric.

Model count by price band

How many priced models live at each price · 99 total

[Chart: model count per log-scale price bucket on the same x-axis ($0.05 to $25 per million tokens), shaded Commodity / Cost-effective / SOTA with cutoffs at $0.50 and $5.]

Bimodal distribution. The commodity band holds 47 models, the cost-effective band 39, and SOTA is a sparse tail of 13. Two density peaks plus a long tail: the shape is closer to two clouds than three clusters. The premium tier is a handful of models that dominate the dollar chart above and barely register here.

Price vs monthly token volume

Each dot is a model · log-log · sized by monthly cost

[Chart: x-axis blended price per million tokens, $0.1/M to $25/M (log); y-axis monthly tokens, 1B to 10T (log).]

What to see: the cloud tilts negative, with higher price meaning lower token volume. SOTA-tier dots sit on the right and are drawn as big circles because of their cost, but they anchor low on the y-axis. The cheaper tiers toward the left carry the highest volumes: together, nearly 80% of all tokens. This is the anti-correlation: outside the quality-critical slice, the market routes tokens to cheaper models.

Concrete comparison: 1B tokens/month reference workload

This comparison uses the average prices of the models within each tier, not cherry-picked extremes. Blended prices assume a 72/28 input/output token split. Scale the figures by your actual volume to see what tier choice costs you.

Tier | Cost for 1B tokens/month | Avg $/M in | Avg $/M out | Avg blended $/M | Annual
Tier 1 · SOTA | $18K/mo | $8.11 | $43.88 | $18.12 | $217K
Tier 2 · Cost-effective | $2K/mo | $0.85 | $4.11 | $1.76 | $21K
Tier 3 · Commodity | $230/mo | $0.13 | $0.48 | $0.23 | $3K

What to see: the same 1B tokens that cost $18K/month on SOTA cost $230/month on commodity, a 79× difference. At 10B tokens/month the gap is $179K/month. At that scale, tier choice can decide whether a product is profitable at all.
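The card figures come from two formulas. Blended price = 0.72 × input price + 0.28 × output price, and monthly cost = monthly tokens in millions × blended price. Here is a minimal sketch that reproduces the reference workload and the 10B-token gap from the tier averages above. The function names are ours, for illustration only.

```python
INPUT_SHARE = 0.72   # 72/28 input/output token split used throughout this page

# Average $/M (input, output) per tier, from the reference-workload comparison.
TIER_AVG = {
    "SOTA": (8.11, 43.88),
    "Cost-effective": (0.85, 4.11),
    "Commodity": (0.13, 0.48),
}


def blended(price_in, price_out, input_share=INPUT_SHARE):
    """Weighted $/M price for a workload with the given input-token share."""
    return input_share * price_in + (1 - input_share) * price_out


def monthly_cost(tokens_per_month, price_in, price_out):
    """USD per month: millions of tokens times the blended $/M price."""
    return tokens_per_month / 1e6 * blended(price_in, price_out)


for volume in (1e9, 10e9):
    costs = {tier: monthly_cost(volume, *p) for tier, p in TIER_AVG.items()}
    gap = costs["SOTA"] - costs["Commodity"]
    ratio = costs["SOTA"] / costs["Commodity"]
    print(f"{volume / 1e9:.0f}B tokens/mo: "
          + ", ".join(f"{t} ${c:,.0f}" for t, c in costs.items())
          + f" | SOTA vs commodity: {ratio:.0f}x, ${gap:,.0f}/mo apart")
```

With unrounded blended prices, the commodity cost at 1B tokens is $228 rather than the rounded $230. The 79× ratio and the roughly $179K/month gap at 10B tokens hold either way.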

Tier 1 · SOTA

The premium lane — use when correctness is the constraint

These are the models you reach for when a wrong answer costs more than an extra dollar of compute: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro. They top the agentic benchmarks (BinaryAudit, SWE-bench, OTelBench), and they're slower. The models in the table below are priced at $1.75–$30 per million input tokens and $14–$150 per million output tokens. In exchange, they do the thing you actually needed done.

Use when

  • Financial decisions, legal review, compliance flags
  • Critical-path reasoning in agentic workflows where each wrong step compounds
  • Final-answer generation after cheaper models have pre-filtered
  • Customer-facing output where quality is visible

Don't use for

  • Batch processing millions of documents
  • Bulk classification, OCR post-processing, or log parsing
  • Tasks where a 2× quality bump isn't worth 20× cost
# | Model | Provider | $/M in | $/M out | Blended $/M | Tokens | Monthly cost | Apps
1 | Anthropic: Claude Opus 4.6 | anthropic | $5.00 | $25.00 | $10.60 | 2.13T | $22.58M | 22
2 | Anthropic: Claude Sonnet 4.6 | anthropic | $3.00 | $15.00 | $6.36 | 3.09T | $19.65M | 23
3 | Anthropic: Claude Opus 4.7 | anthropic | $5.00 | $25.00 | $10.60 | 510.0B | $5.41M | 13
4 | OpenAI: GPT-5.4 | openai | $2.50 | $15.00 | $6.00 | 520.4B | $3.12M | 15
5 | Anthropic: Claude Sonnet 4.5 | anthropic | $3.00 | $15.00 | $6.36 | 267.7B | $1.70M | 15
6 | OpenAI: GPT-5.3-Codex | openai | $1.75 | $14.00 | $5.18 | 276.8B | $1.43M | 9
7 | Anthropic: Claude Opus 4.5 | anthropic | $5.00 | $25.00 | $10.60 | 37.5B | $397K | 6
8 | OpenAI: GPT-5.5 | openai | $5.00 | $30.00 | $12.00 | 2.9B | $35K | 3
9 | Anthropic: Claude Opus 4.6 (Fast) | anthropic | $30.00 | $150.00 | $63.60 | 188.5M | $12K | 1
10 | OpenAI: GPT-5.2 | openai | $1.75 | $14.00 | $5.18 | 2.3B | $12K | 3
+ 6 more in this tier

Tier 2 · Cost-effective

The workhorse lane — where most real work actually runs

The sweet spot between quality and cost: Gemini 3 Flash, Qwen3.6 Plus, MiMo-V2-Pro, MiniMax M2.7, GPT-5.4 Mini. They're priced at $0.30–$2 input / $1–$5 output per million tokens and often deliver 80–90% of SOTA quality on most tasks at 5–20% of the cost. According to the OpenRouter data, this is the tier winning on tokens (46% of all volume), and the vendors gaining share are all here.

Use when

  • Production AI features where cost sensitivity matters
  • Daily agentic work — code assistants, research, drafting
  • Tool-use and function-calling at reasonable volume
  • Anywhere SOTA is overkill but commodity is too weak

Don't use for

  • Adversarial reasoning or high-stakes decisions
  • Tasks where you've verified a specific SOTA model has a meaningful quality lead
  • Extreme scale batch jobs where commodity pricing wins
# | Model | Provider | $/M in | $/M out | Blended $/M | Tokens | Monthly cost | Apps
1 | Xiaomi: MiMo-V2-Pro | xiaomi | $1.00 | $3.00 | $1.56 | 4.92T | $7.68M | 15
2 | Qwen: Qwen3.6 Plus | qwen | $0.33 | $1.95 | $0.78 | 3.25T | $2.53M | 28
3 | MiniMax: MiniMax M2.7 | minimax | $0.30 | $1.20 | $0.55 | 1.91T | $1.05M | 17
4 | Z.ai: GLM 5 Turbo | z-ai | $1.20 | $4.00 | $1.98 | 1.56T | $3.09M | 3
5 | Google: Gemini 3 Flash Preview | google | $0.50 | $3.00 | $1.20 | 891.1B | $1.07M | 20
6 | MoonshotAI: Kimi K2.5 | moonshotai | $0.44 | $2.00 | $0.88 | 413.8B | $363K | 14
7 | Google: Gemini 2.5 Flash | google | $0.30 | $2.50 | $0.92 | 318.9B | $292K | 9
8 | Anthropic: Claude Haiku 4.5 | anthropic | $1.00 | $5.00 | $2.12 | 306.0B | $649K | 9
9 | Xiaomi: MiMo-V2-Omni | xiaomi | $0.40 | $2.00 | $0.85 | 257.2B | $218K | 2
10 | Z.ai: GLM 5.1 | z-ai | $1.05 | $3.50 | $1.74 | 255.9B | $444K | 17
+ 33 more in this tier

Tier 3 · Commodity scale

The scale lane — for workloads measured in billions of tokens

Priced at < $0.50/M blended. MiMo-V2-Flash, Step 3.5 Flash, Trinity Large, free tiers of larger open-weight models. These are the tools when the question is no longer "is this the best?" but "can we afford to run this against the whole corpus?". Quality varies wildly — some of these match mid-tier on narrow tasks, some are only useful for trivial classification.

Use when

  • Batch processing at scale — billions of tokens per month
  • Pre-filtering, triage, and cheap first-pass classification
  • Post-processing and format conversion (where rules + small LLM beat a big LLM)
  • Offline enrichment where latency doesn't matter

Don't use for

  • Agentic decisions that affect downstream state
  • Customer-facing text generation
  • Anywhere you haven't verified the specific model actually works on your task
# | Model | Provider | $/M in | $/M out | Blended $/M | Tokens | Monthly cost | Apps
1 | MiniMax: MiniMax M2.5 | minimax | $0.15 | $1.15 | $0.43 | 3.02T | $1.30M | 12
2 | StepFun: Step 3.5 Flash | stepfun | $0.10 | $0.30 | $0.16 | 2.73T | $426K | 15
3 | NVIDIA: Nemotron 3 Super | nvidia | $0.09 | $0.45 | $0.19 | 1.43T | $274K | 12
4 | DeepSeek: DeepSeek V3.2 | deepseek | $0.25 | $0.38 | $0.29 | 1.23T | $355K | 20
5 | Google: Gemini 2.5 Flash Lite | google | $0.10 | $0.40 | $0.18 | 643.7B | $118K | 6
6 | Arcee AI: Trinity Large Preview | arcee-ai | $0.15 | $0.45 | $0.23 | 356.9B | $84K | 6
7 | Xiaomi: MiMo-V2-Flash | xiaomi | $0.09 | $0.29 | $0.15 | 214.2B | $31K | 4
8 | Mistral: Mistral Nemo | mistralai | $0.02 | $0.04 | $0.03 | 184.6B | $5K | 1
9 | DeepSeek: DeepSeek V3 0324 | deepseek | $0.20 | $0.77 | $0.36 | 86.8B | $31K | 3
10 | DeepSeek: DeepSeek V4 Flash | deepseek | $0.14 | $0.28 | $0.18 | 73.5B | $13K | 4
+ 30 more in this tier

A decision framework in four questions

The mistake most teams make is picking a single model for every workload. The market's behavior tells you to route by tier instead.

  1. Is a wrong answer expensive?

    If a single bad call costs more than $10 in downstream impact (disputed invoice, broken migration, misrouted support ticket), stay in Tier 1 SOTA for that step. The pricing premium is insurance.

  2. Can you split the workload?

    Most agentic workflows have a critical step buried in ten ancillary ones. Route only the critical step to SOTA; run the rest on Tier 2 cost-effective. This is how the apps winning on unit economics are set up.

  3. What's your token volume?

    Under 100M tokens/month, tier choice matters less: even at the SOTA average of $18.12/M blended, 100M tokens cost about $1,800 a month, so pick on quality. Over 1B tokens/month, you cannot afford a blanket Tier 1, because the bill will eat your margin. This is where Tier 3 commodity for bulk paths becomes non-negotiable.

  4. Is this task quality-verified on that model?

    A $0.10/M model that hallucinates 30% of the time on your task is more expensive than a $10/M model that doesn't. Price is only half the equation. Verify against a held-out set before committing any tier — especially commodity.
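The four questions can be sketched as a per-step routing function plus the expected-cost arithmetic behind question 4. This is a hypothetical helper, not part of any routing API. The $10 error-cost threshold and the 1B-token volume come from the questions above. The 2,000 tokens per call and the $10 cost of a wrong answer in the usage lines are assumed numbers for illustration.

```python
SOTA = "Tier 1 SOTA"
COST_EFFECTIVE = "Tier 2 cost-effective"
COMMODITY = "Tier 3 commodity"


def pick_tier(error_cost_usd, monthly_tokens):
    """Hypothetical per-step router for questions 1 and 3.

    Call it once per workflow step (question 2), not once per product.
    Question 4 is not modelled: the result is the tier to evaluate first
    on a held-out set, not the tier to ship.
    """
    if error_cost_usd > 10:        # Q1: a bad call costs more than $10 downstream
        return SOTA
    if monthly_tokens > 1e9:       # Q3: bulk paths above 1B tokens/month
        return COMMODITY
    return COST_EFFECTIVE          # default workhorse lane


def expected_cost_per_call(price_per_m, tokens_per_call, error_rate, error_cost_usd):
    """Token spend for one call plus the expected downstream cost of a wrong answer."""
    return tokens_per_call / 1e6 * price_per_m + error_rate * error_cost_usd


def split_workflow_price(steps):
    """Effective $/M for a workflow, given [(blended $/M, share of the workflow's tokens)]."""
    return sum(price * share for price, share in steps)


# Q4 with assumed numbers: 2,000 tokens per call, $10 lost per wrong answer.
cheap = expected_cost_per_call(0.10, 2_000, error_rate=0.30, error_cost_usd=10)
premium = expected_cost_per_call(10.00, 2_000, error_rate=0.0, error_cost_usd=10)

# Q2: one critical step on SOTA, nine equal-sized steps on Tier 2 (tier averages from this page).
routed = split_workflow_price([(18.12, 0.1), (1.76, 0.9)])

print(f"per call: ${cheap:.4f} at $0.10/M vs ${premium:.4f} at $10/M")
print(f"routed workflow: ${routed:.2f}/M vs $18.12/M all-SOTA")
```

Under these assumptions, the $0.10/M model costs about $3.00 per call once errors are priced in, against $0.02 for the $10/M model. The routed workflow averages about $3.40/M, under a fifth of the all-SOTA price.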

Why this analysis exists

Every benchmark site ranks models by a single number. Every routing service picks for you. Neither of those helps when the real answer is "use three models, one for each lane, and know where each boundary sits".

CodeSOTA joins benchmark performance, live pricing, and real OpenRouter usage into one view — so you can pick not just which model but which tier for which step. That's the decision-engine layer on top of the raw catalogs.

Related: one year of market trends · inverted model leaderboard · app-level spend rankings


Disagree with our tier boundaries?

If you think the $5/M and $0.50/M cutoffs are wrong — or you've run a specific model head-to-head against one we'd put a tier higher — tell us. We reply within 48 hours and update the analysis.
