Agentic · Market structure · Snapshot 2026-04-14

The AI inference market isn't choosing a winner. It's stratifying into three lanes.

Benchmark leaders don't lose revenue — they lose the long tail of workloads they used to own. When you look at what AI apps actually route through OpenRouter and match it to live pricing, a clear three-tier structure appears: SOTA models absorbing the dollar spend, cost-effective mid-tier handling daily workflows, and commodity models batching the scale workloads at pennies per million tokens. Here's how the market actually splits, with data.

| Tier | Price band | Models | Monthly cost | Tokens | % of spend | % of tokens |
|------|------------|--------|--------------|--------|------------|-------------|
| SOTA · Premium | ≥ $5/M blended | 12 | $49.41M | 6.19T | 66% | 18% |
| Cost-effective | $0.50–$5/M blended | 44 | $22.14M | 16.51T | 30% | 47% |
| Commodity · Scale | < $0.50/M blended | 40 | $3.08M | 12.11T | 4% | 35% |

The upside-down insight: SOTA models are 66% of spend but only 18% of tokens. Commodity models are 35% of tokens but only 4% of spend. The market pays premium prices for a narrow slice of quality-critical work and routes everything else to whoever's cheapest.
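The tier totals imply an average blended price per tier: divide monthly spend by monthly tokens. A quick back-of-envelope sanity check, using the figures from the cards above (an illustrative sketch, not the site's actual pipeline):

```python
# Implied blended price per tier: monthly spend / monthly tokens.
# Spend in $M and tokens in T divide to $/M directly,
# since 1T tokens = 1M million-tokens.
tiers = {
    "sota":           {"spend_musd": 49.41, "tokens_t": 6.19},
    "cost_effective": {"spend_musd": 22.14, "tokens_t": 16.51},
    "commodity":      {"spend_musd": 3.08,  "tokens_t": 12.11},
}

def implied_blended(spend_musd: float, tokens_t: float) -> float:
    """Average realized $ per million tokens for a tier."""
    return spend_musd / tokens_t

for name, t in tiers.items():
    price = implied_blended(t["spend_musd"], t["tokens_t"])
    print(f"{name}: ${price:.2f}/M")
```

Each implied price lands inside its own band (about $7.98/M for SOTA, $1.34/M for cost-effective, $0.25/M for commodity), so the tier totals are internally consistent with the cutoffs.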

The data behind the tiers

This isn't an editorial choice. It's a distribution.

If you plot every priced model on OpenRouter by its blended price, the distribution is bimodal — one dense peak in the commodity band, another in the cost-effective band, and then a sparse premium tail above $5/M. The same shape told two ways below: first by how many dollars flow through each price band (the economic shape), then by how many models inhabit it.
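The banding itself is mechanical. A minimal sketch of the cutoff logic, using the $0.50 and $5 boundaries stated above (the example blended prices are taken from the tier tables later in this analysis):

```python
def tier(blended_price_per_m: float) -> str:
    """Assign a model to a price tier by its blended $/M price.
    Cutoffs match this analysis: < $0.50 commodity,
    $0.50 to $5 cost-effective, >= $5 SOTA."""
    if blended_price_per_m < 0.50:
        return "commodity"
    if blended_price_per_m < 5.00:
        return "cost-effective"
    return "sota"

print(tier(0.16))   # Step 3.5 Flash -> commodity
print(tier(1.56))   # MiMo-V2-Pro -> cost-effective
print(tier(10.60))  # Claude Opus 4.6 -> sota
```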

Dollar spend by price band

Where the money actually flows · log-scale buckets

[Bar chart: monthly dollar spend per blended-price bucket, $0.05/M to $25/M on a log axis, tier cutoffs marked at $0.50 and $5. Spend per bucket ranges from under $100 in the cheapest bins to $25.98M at the top of the premium tail.]

The real shape of the market. The SOTA tail above $5/M is narrow in bins but tall in dollars — premium pricing compounds small token volumes into huge spend. Meanwhile commodity models sit near zero by dollars even where they host most of the token volume. Same models, opposite shape depending on the metric.

Model count by price band

How many priced models live at each price · 96 total

[Bar chart: model count per blended-price bucket, $0.05/M to $25/M on a log axis, tier cutoffs marked at $0.50 and $5; bars colored by tier: Commodity, Cost-effective, SOTA.]

Bimodal distribution. The commodity band holds 40 models, the cost-effective band holds 44, and SOTA is a sparse tail of 12. Two density peaks plus a long tail — the shape is closer to two clouds than three clusters. The premium tier is a handful of models that matter enormously on the dollar chart above and barely register here.

Price vs monthly token volume

Each dot is a model · log-log · sized by monthly cost

[Scatter plot: blended price per million tokens on a log x-axis, $0.1/M to $25/M, vs monthly tokens on a log y-axis, 1B to 10T.]

What to see: the cloud tilts negative — higher price, lower token volume. SOTA-tier dots sit top-right by cost (big circles) but anchor low on the y-axis. The commodity tier (left) carries the highest volumes. This is the anti-correlation: the market routes most tokens to whichever model is cheapest.

Concrete comparison: 1B tokens/month reference workload

Using the average prices of the models within each tier (not cherry-picked extremes), with a 72/28 input/output token split. Scale this by your actual volume to see what a tier choice costs you.

| Tier | Cost for 1B tokens/mo | avg $/M in | avg $/M out | avg blended | Annual |
|------|----------------------|-----------|------------|-------------|--------|
| Tier 1 · SOTA | $12K/mo | $5.13 | $30.08 | $12.11 | $145K |
| Tier 2 · Cost-effective | $2K/mo | $0.81 | $3.81 | $1.65 | $20K |
| Tier 3 · Commodity | $222/mo | $0.13 | $0.47 | $0.22 | $3K |

What to see: the same 1B tokens that cost $12K/month on SOTA run for $222/month on commodity — a 54× difference. At 10B tokens/month the gap is $119K/month — the difference between a profitable product and an unprofitable one can live in this single decision.
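The card arithmetic is reproducible: blended price is the 72/28 weighted average of input and output prices stated above, and monthly cost scales linearly with volume. A sketch with the tier averages copied from the cards:

```python
def blended(in_per_m: float, out_per_m: float, in_frac: float = 0.72) -> float:
    """Blended $/M at the document's 72/28 input/output token split."""
    return in_frac * in_per_m + (1 - in_frac) * out_per_m

def monthly_cost(blended_per_m: float, tokens_b: float) -> float:
    """Monthly $ for a workload of tokens_b billion tokens (1B = 1000M)."""
    return blended_per_m * tokens_b * 1000

sota = blended(5.13, 30.08)      # ~$12.12/M blended
commodity = blended(0.13, 0.47)  # ~$0.23/M blended
gap = monthly_cost(sota, 10) - monthly_cost(commodity, 10)
print(f"10B tokens/mo gap: ${gap:,.0f}")  # roughly the $119K/mo cited above
```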

Tier 1 · SOTA

The premium lane — use when correctness is the constraint

These are the models you reach for when a wrong answer is more expensive than an extra dollar of compute. Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro. They top the agentic benchmarks (BinaryAudit, SWE-bench, OTelBench), they're slower, they're priced at $3–$30 input / $15–$150 output. In exchange, they do the thing you actually needed done.

Use when

  • Financial decisions, legal review, compliance flags
  • Critical-path reasoning in agentic workflows where each wrong step compounds
  • Final-answer generation after cheaper models have pre-filtered
  • Customer-facing output where quality is visible

Don't use for

  • Batch processing millions of documents
  • Bulk classification, OCR post-processing, or log parsing
  • Tasks where a 2× quality bump isn't worth 20× cost
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|-------|--------|---------|---------|--------|--------------|------|
| 1 | Anthropic: Claude Opus 4.6 | $5.00 | $25.00 | $10.60 | 2.37T | $25.10M | 24 |
| 2 | Anthropic: Claude Sonnet 4.6 | $3.00 | $15.00 | $6.36 | 2.62T | $16.67M | 24 |
| 3 | Anthropic: Claude Sonnet 4.5 | $3.00 | $15.00 | $6.36 | 460.6B | $2.93M | 18 |
| 4 | OpenAI: GPT-5.4 | $2.50 | $15.00 | $6.00 | 478.4B | $2.87M | 17 |
| 5 | OpenAI: GPT-5.3-Codex | $1.75 | $14.00 | $5.18 | 172.2B | $892K | 10 |
| 6 | Anthropic: Claude Opus 4.5 | $5.00 | $25.00 | $10.60 | 83.1B | $881K | 10 |
| 7 | Anthropic: Claude 3.7 Sonnet | $3.00 | $15.00 | $6.36 | 3.5B | $22K | 3 |
| 8 | OpenAI: GPT-5.4 Pro | $30.00 | $180.00 | $72.00 | 232.1M | $17K | 2 |
| 9 | OpenAI: GPT-5.2 | $1.75 | $14.00 | $5.18 | 2.8B | $15K | 4 |
| 10 | Anthropic: Claude Sonnet 4 | $3.00 | $15.00 | $6.36 | 957.9M | $6K | 2 |
+ 2 more in this tier

Tier 2 · Cost-effective

The workhorse lane — where most real work actually runs

The sweet spot between quality and cost. Gemini 3 Flash, Qwen3.6 Plus, MiMo-V2-Pro, MiniMax M2.7, DeepSeek V3.2, GPT-5.4 Mini. Priced at $0.30–$2 input / $1–$5 output, often delivering 80–90% of SOTA quality on most tasks at 5–20% of the cost. This is the tier the OpenRouter data shows is winning on tokens — the vendors gaining share are all here.

Use when

  • Production AI features where cost sensitivity matters
  • Daily agentic work — code assistants, research, drafting
  • Tool-use and function-calling at reasonable volume
  • Anywhere SOTA is overkill but commodity is too weak

Don't use for

  • Adversarial reasoning or high-stakes decisions
  • Tasks where you've verified a specific SOTA model has a meaningful quality lead
  • Extreme scale batch jobs where commodity pricing wins
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|-------|--------|---------|---------|--------|--------------|------|
| 1 | Xiaomi: MiMo-V2-Pro | $1.00 | $3.00 | $1.56 | 5.49T | $8.57M | 15 |
| 2 | Qwen: Qwen3.6 Plus | $0.33 | $1.95 | $0.78 | 2.98T | $2.33M | 27 |
| 3 | Z.ai: GLM 5 Turbo | $1.20 | $4.00 | $1.98 | 2.84T | $5.64M | 7 |
| 4 | MiniMax: MiniMax M2.7 | $0.30 | $1.20 | $0.55 | 1.72T | $947K | 19 |
| 5 | Google: Gemini 3 Flash Preview | $0.50 | $3.00 | $1.20 | 994.8B | $1.19M | 24 |
| 6 | MoonshotAI: Kimi K2.5 | $0.38 | $1.72 | $0.76 | 629.5B | $477K | 19 |
| 7 | Xiaomi: MiMo-V2-Omni | $0.40 | $2.00 | $0.85 | 466.0B | $395K | 3 |
| 8 | Anthropic: Claude Haiku 4.5 | $1.00 | $5.00 | $2.12 | 444.5B | $942K | 13 |
| 9 | Google: Gemini 2.5 Flash | $0.30 | $2.50 | $0.92 | 243.7B | $223K | 10 |
| 10 | Z.ai: GLM 5 | $0.72 | $2.30 | $1.16 | 197.8B | $230K | 16 |
+ 34 more in this tier

Tier 3 · Commodity scale

The scale lane — for workloads measured in billions of tokens

Priced at < $0.50/M blended. MiMo-V2-Flash, Step 3.5 Flash, Trinity Large, free tiers of larger open-weight models. This is the tier to reach for when the question is no longer "is this the best?" but "can we afford to run this against the whole corpus?". Quality varies wildly — some of these match the mid-tier on narrow tasks, some are only useful for trivial classification.

Use when

  • Batch processing at scale — billions of tokens per month
  • Pre-filtering, triage, and cheap first-pass classification
  • Post-processing and format conversion (where rules + small LLM beat a big LLM)
  • Offline enrichment where latency doesn't matter

Don't use for

  • Agentic decisions that affect downstream state
  • Customer-facing text generation
  • Anywhere you haven't verified the specific model actually works on your task
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|-------|--------|---------|---------|--------|--------------|------|
| 1 | StepFun: Step 3.5 Flash | $0.10 | $0.30 | $0.16 | 3.99T | $623K | 16 |
| 2 | MiniMax: MiniMax M2.5 | $0.12 | $0.99 | $0.36 | 3.70T | $1.34M | 15 |
| 3 | DeepSeek: DeepSeek V3.2 | $0.26 | $0.38 | $0.29 | 1.26T | $371K | 24 |
| 4 | NVIDIA: Nemotron 3 Super | $0.10 | $0.50 | $0.21 | 1.17T | $248K | 15 |
| 5 | Arcee AI: Trinity Large Thinking | $0.22 | $0.85 | $0.40 | 604.7B | $240K | 12 |
| 6 | Google: Gemini 2.5 Flash Lite | $0.10 | $0.40 | $0.18 | 591.4B | $109K | 3 |
| 7 | Xiaomi: MiMo-V2-Flash | $0.09 | $0.29 | $0.15 | 234.8B | $34K | 6 |
| 8 | Mistral: Mistral Nemo | $0.02 | $0.04 | $0.03 | 195.7B | $5K | 1 |
| 9 | DeepSeek: DeepSeek V3 0324 | $0.20 | $0.77 | $0.36 | 94.6B | $34K | 4 |
| 10 | Z.ai: GLM 4.5 Air | $0.13 | $0.85 | $0.33 | 60.6B | $20K | 4 |
+ 30 more in this tier

A decision framework in four questions

The mistake most teams make is picking a single model for every workload. The market's behavior tells you to route by tier instead.

  1. Is a wrong answer expensive?

     If a single bad call costs more than $10 in downstream impact (disputed invoice, broken migration, misrouted support ticket), stay in Tier 1 SOTA for that step. The pricing premium is insurance.

  2. Can you split the workload?

     Most agentic workflows have a critical step buried in ten ancillary ones. Route only the critical step to SOTA; run the rest on Tier 2 cost-effective. This is how the apps winning on unit economics are set up.

  3. What's your token volume?

     Under 100M tokens/month: tier choice matters less; pick on quality. Over 1B tokens/month: you cannot afford a blanket Tier 1 — the bill will eat your margin. This is where Tier 3 commodity for bulk paths becomes non-negotiable.

  4. Is this task quality-verified on that model?

     A $0.10/M model that hallucinates 30% of the time on your task is more expensive than a $10/M model that doesn't. Price is only half the equation. Verify against a held-out set before committing to any tier — especially commodity.
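Translated into code, the four questions collapse into a per-step routing policy. A sketch under the thresholds this framework states ($10 error cost, 1B tokens/month); the field and tier names are illustrative, not any real API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    error_cost_usd: float        # downstream cost of one wrong answer
    tokens_b_per_month: float    # monthly volume, in billions of tokens
    verified_on_commodity: bool  # passed a held-out eval on a Tier 3 model

def route(step: Step) -> str:
    """Pick a tier for one workload step, per the four questions above."""
    if step.error_cost_usd > 10:        # Q1: wrong answers are expensive
        return "tier-1-sota"
    if step.tokens_b_per_month >= 1 and step.verified_on_commodity:
        return "tier-3-commodity"       # Q3 + Q4: bulk path, quality verified
    return "tier-2-cost-effective"      # default workhorse lane

print(route(Step(50, 0.1, False)))  # critical step -> tier-1-sota
print(route(Step(0.5, 20, True)))   # bulk, verified -> tier-3-commodity
print(route(Step(0.5, 20, False)))  # bulk, unverified -> tier-2-cost-effective
```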

Why this analysis exists

Every benchmark site ranks models by a single number. Every routing service picks for you. Neither of those helps when the real answer is "use three models, one for each lane, and know where each boundary sits".

CodeSOTA joins benchmark performance, live pricing, and real OpenRouter usage into one view — so you can pick not just which model but which tier for which step. That's the decision-engine layer on top of the raw catalogs.

Related: one year of market trends · inverted model leaderboard · app-level spend rankings


Disagree with our tier boundaries?

If you think the $5/M and $0.50/M cutoffs are wrong — or you've run a specific model head-to-head against one we'd put a tier higher — tell us. We reply within 48 hours and update the analysis.

Tell us what you found →