The AI inference market isn't choosing a winner. It's stratifying into three lanes.
Benchmark leaders don't lose revenue — they lose the long tail of workloads they used to own. When you look at what AI apps actually route through OpenRouter and match it to live pricing, a clear three-tier structure appears: SOTA models absorbing the dollar spend, cost-effective mid-tier handling daily workflows, and commodity models batching the scale workloads at pennies per million tokens. Here's how the market actually splits, with data.
- SOTA · Premium: ≥ $5/M blended
- Cost-effective: $0.50–$5/M blended
- Commodity · Scale: < $0.50/M blended
The upside-down insight: SOTA models are 66% of spend but only 18% of tokens. Commodity models are 35% of tokens but only 4% of spend. The market pays premium prices for a narrow slice of quality-critical work and routes everything else to whoever's cheapest.
The data behind the tiers
This isn't an editorial choice. It's a distribution.
If you plot every priced model on OpenRouter by its blended price, the distribution is bimodal — one dense peak in the commodity band, another in the cost-effective band, and a sparse premium tail above $5/M. The same shape is shown two ways below: first by how many dollars flow through each price band (the economic shape), then by how many models inhabit it.
Chart: Dollar spend by price band (where the money actually flows; log-scale buckets).
The real shape of the market. The SOTA tail above $5/M is narrow in bins but tall in dollars — premium pricing compounds small token volumes into huge spend. Meanwhile, commodity models sit near zero by dollars even where they host most of the token volume. Same models, opposite shape depending on the metric.
Chart: Model count by price band (how many priced models live at each price; 96 total).
Bimodal distribution. The commodity band holds 48 models, the cost-effective band 40, and SOTA is a sparse tail of 8. Two density peaks plus a long tail — the shape is closer to two clouds than three clusters. The premium tier is a handful of models that matter enormously on the dollar chart above and barely register here.
Chart: Price vs monthly token volume (each dot is a model; log-log axes; dot size = monthly cost).
What to see: the cloud tilts negative — higher price, lower token volume. SOTA-tier dots are the biggest circles by cost but anchor low on the y-axis; the commodity tier (left) carries the highest volumes. This is the anti-correlation: the market routes most tokens to whichever model is cheapest.
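The tier assignment itself is mechanical. A minimal sketch of the bucketing, assuming a hand-entered sample of prices copied from the tier tables below (the real cut runs over all 96 priced models) and the 72/28 input/output blend used throughout this analysis:

```python
# Assign models to tiers by blended $/M price.
# Sample prices copied from the tier tables below, not the full catalog.
BLEND_IN, BLEND_OUT = 0.72, 0.28  # input/output token split used site-wide

PRICES = {  # model -> ($/M input, $/M output)
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.4":         (2.50, 15.00),
    "gemini-3-flash":  (0.50, 3.00),
    "minimax-m2.7":    (0.30, 1.20),
    "step-3.5-flash":  (0.10, 0.30),
    "mistral-nemo":    (0.02, 0.04),
}

def blended(p_in: float, p_out: float) -> float:
    """Blended $/M at the 72/28 split."""
    return BLEND_IN * p_in + BLEND_OUT * p_out

def tier(b: float) -> str:
    if b >= 5.00:
        return "Tier 1 - SOTA"
    if b >= 0.50:
        return "Tier 2 - cost-effective"
    return "Tier 3 - commodity"

for model, (p_in, p_out) in PRICES.items():
    b = blended(p_in, p_out)
    print(f"{model:16s} ${b:5.2f}/M  {tier(b)}")
```

Run over the full catalog, this cut reproduces the 48/40/8 split in the model-count chart.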
Concrete comparison: 1B tokens/month reference workload
Costs below use the average prices of the models within each tier (not cherry-picked extremes) and a 72/28 input/output token split. Scale by your actual volume to see what tier choice costs you.
- Tier 1 · SOTA: $12K/mo for 1B tokens/month
- Tier 2 · Cost-effective: $2K/mo for 1B tokens/month
- Tier 3 · Commodity: $222/mo for 1B tokens/month
What to see: the same 1B tokens that cost $12K/month on SOTA run for $222/month on commodity — a 54× difference. At 10B tokens/month the gap is $119K/month — the entire economics of whether a company is profitable can live in this one decision.
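The arithmetic behind those cards, as a sketch. The blended tier prices here are back-derived from the card values above ($12K, $2K, and $222 per 1B tokens), not per-model quotes:

```python
# Monthly cost of a workload at a given blended $/M price.
def monthly_cost(tokens: float, blended_per_m: float) -> float:
    return tokens / 1e6 * blended_per_m

TOKENS = 1e9  # the 1B tokens/month reference workload
for name, price in [("SOTA", 12.00), ("cost-effective", 2.00), ("commodity", 0.222)]:
    print(f"{name:15s} ${monthly_cost(TOKENS, price):>9,.0f}/mo")
# SOTA ~ $12,000/mo vs commodity ~ $222/mo: the 54x spread quoted above.
```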
Tier 1 · SOTA
The premium lane — use when correctness is the constraint
These are the models you reach for when a wrong answer is more expensive than an extra dollar of compute: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro. They top the agentic benchmarks (BinaryAudit, SWE-bench, OTelBench), they're slower, and per the table below they're priced at roughly $2–$30 input / $14–$180 output. In exchange, they do the thing you actually needed done.
Use when
- ✓ Financial decisions, legal review, compliance flags
- ✓ Critical-path reasoning in agentic workflows where each wrong step compounds
- ✓ Final-answer generation after cheaper models have pre-filtered (sketched after the table below)
- ✓ Customer-facing output where quality is visible
Don't use for
- ✗ Batch processing millions of documents
- ✗ Bulk classification, OCR post-processing, or log parsing
- ✗ Tasks where a 2× quality bump isn't worth 20× cost
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|---|---|---|---|---|---|---|
| 1 | Anthropic: Claude Opus 4.6 | $5.00 | $25.00 | $10.60 | 2.37T | $25.10M | 24 |
| 2 | Anthropic: Claude Sonnet 4.6 | $3.00 | $15.00 | $6.36 | 2.62T | $16.67M | 24 |
| 3 | Anthropic: Claude Sonnet 4.5 | $3.00 | $15.00 | $6.36 | 460.6B | $2.93M | 18 |
| 4 | OpenAI: GPT-5.4 | $2.50 | $15.00 | $6.00 | 478.4B | $2.87M | 17 |
| 5 | OpenAI: GPT-5.3-Codex | $1.75 | $14.00 | $5.18 | 172.2B | $892K | 10 |
| 6 | Anthropic: Claude Opus 4.5 | $5.00 | $25.00 | $10.60 | 83.1B | $881K | 10 |
| 7 | Anthropic: Claude 3.7 Sonnet | $3.00 | $15.00 | $6.36 | 3.5B | $22K | 3 |
| 8 | OpenAI: GPT-5.4 Pro | $30.00 | $180.00 | $72.00 | 232.1M | $17K | 2 |
| 9 | OpenAI: GPT-5.2 | $1.75 | $14.00 | $5.18 | 2.8B | $15K | 4 |
| 10 | Anthropic: Claude Sonnet 4 | $3.00 | $15.00 | $6.36 | 957.9M | $6K | 2 |

*+ 2 more in this tier*
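The "final-answer after pre-filter" pattern from the checklist above, sketched. `call_llm` is a placeholder for whatever client you use (OpenRouter's endpoint is OpenAI-compatible), and the model IDs and yes/no triage prompt are illustrative, not recommendations:

```python
# Two-stage routing: a commodity model triages, the SOTA model answers.
CHEAP_MODEL = "tier3/commodity-flash"  # hypothetical Tier 3 model ID
SOTA_MODEL = "tier1/premium-opus"      # hypothetical Tier 1 model ID

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire in your provider client here")

def answer_if_relevant(doc: str, question: str) -> str | None:
    # Stage 1 (Tier 3): pennies per million tokens, safe to run on the whole corpus.
    verdict = call_llm(CHEAP_MODEL, f"Relevant to {question!r}? yes/no\n\n{doc}")
    if not verdict.strip().lower().startswith("yes"):
        return None  # most documents stop here, at commodity prices
    # Stage 2 (Tier 1): premium price, but only for documents that passed triage.
    return call_llm(SOTA_MODEL, f"Q: {question}\n\nAnswer using this document:\n{doc}")
```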
Tier 2 · Cost-effective
The workhorse lane — where most real work actually runs
The sweet spot between quality and cost: Gemini 3 Flash, Qwen3.6 Plus, MiMo-V2-Pro, MiniMax M2.7, Kimi K2.5, GPT-5.4 Mini. Priced at roughly $0.30–$2 input / $1–$5 output, typically delivering 80–90% of SOTA quality at 5–20% of the cost. This is the tier the OpenRouter data shows winning on tokens — the vendors gaining share are all here.
Use when
- ✓ Production AI features where cost sensitivity matters
- ✓ Daily agentic work — code assistants, research, drafting
- ✓ Tool-use and function-calling at reasonable volume
- ✓ Anywhere SOTA is overkill but commodity is too weak
Don't use for
- ✗ Adversarial reasoning or high-stakes decisions
- ✗ Tasks where you've verified a specific SOTA model has a meaningful quality lead
- ✗ Extreme-scale batch jobs where commodity pricing wins
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|---|---|---|---|---|---|---|
| 1 | Xiaomi: MiMo-V2-Pro | $1.00 | $3.00 | $1.56 | 5.49T | $8.57M | 15 |
| 2 | Qwen: Qwen3.6 Plus | $0.33 | $1.95 | $0.78 | 2.98T | $2.33M | 27 |
| 3 | Z.ai: GLM 5 Turbo | $1.20 | $4.00 | $1.98 | 2.84T | $5.64M | 7 |
| 4 | MiniMax: MiniMax M2.7 | $0.30 | $1.20 | $0.55 | 1.72T | $947K | 19 |
| 5 | Google: Gemini 3 Flash Preview | $0.50 | $3.00 | $1.20 | 994.8B | $1.19M | 24 |
| 6 | MoonshotAI: Kimi K2.5 | $0.38 | $1.72 | $0.76 | 629.5B | $477K | 19 |
| 7 | Xiaomi: MiMo-V2-Omni | $0.40 | $2.00 | $0.85 | 466.0B | $395K | 3 |
| 8 | Anthropic: Claude Haiku 4.5 | $1.00 | $5.00 | $2.12 | 444.5B | $942K | 13 |
| 9 | Google: Gemini 2.5 Flash | $0.30 | $2.50 | $0.92 | 243.7B | $223K | 10 |
| 10 | Z.ai: GLM 5 | $0.72 | $2.30 | $1.16 | 197.8B | $230K | 16 |

*+ 34 more in this tier*
Tier 3 · Commodity scale
The scale lane — for workloads measured in billions of tokens
Priced at < $0.50/M blended: MiMo-V2-Flash, Step 3.5 Flash, Trinity Large, and the free tiers of larger open-weight models. These are the tools to reach for when the question is no longer "is this the best?" but "can we afford to run it against the whole corpus?". Quality varies wildly — some match mid-tier on narrow tasks, some are only useful for trivial classification.
Use when
- ✓ Batch processing at scale — billions of tokens per month
- ✓ Pre-filtering, triage, and cheap first-pass classification
- ✓ Post-processing and format conversion (where rules + a small LLM beat a big LLM)
- ✓ Offline enrichment where latency doesn't matter
Don't use for
- ✗ Agentic decisions that affect downstream state
- ✗ Customer-facing text generation
- ✗ Anywhere you haven't verified the specific model actually works on your task (a verification sketch follows the table below)
| # | Model | $/M in | $/M out | Blended | Tokens | Monthly cost | Apps |
|---|---|---|---|---|---|---|---|
| 1 | StepFun: Step 3.5 Flash | $0.10 | $0.30 | $0.16 | 3.99T | $623K | 16 |
| 2 | MiniMax: MiniMax M2.5 | $0.12 | $0.99 | $0.36 | 3.70T | $1.34M | 15 |
| 3 | DeepSeek: DeepSeek V3.2 | $0.26 | $0.38 | $0.29 | 1.26T | $371K | 24 |
| 4 | NVIDIA: Nemotron 3 Super | $0.10 | $0.50 | $0.21 | 1.17T | $248K | 15 |
| 5 | Arcee AI: Trinity Large Thinking | $0.22 | $0.85 | $0.40 | 604.7B | $240K | 12 |
| 6 | Google: Gemini 2.5 Flash Lite | $0.10 | $0.40 | $0.18 | 591.4B | $109K | 3 |
| 7 | Xiaomi: MiMo-V2-Flash | $0.09 | $0.29 | $0.15 | 234.8B | $34K | 6 |
| 8 | Mistral: Mistral Nemo | $0.02 | $0.04 | $0.03 | 195.7B | $5K | 1 |
| 9 | DeepSeek: DeepSeek V3 0324 | $0.20 | $0.77 | $0.36 | 94.6B | $34K | 4 |
| 10 | Z.ai: GLM 4.5 Air | $0.13 | $0.85 | $0.33 | 60.6B | $20K | 4 |

*+ 30 more in this tier*
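And the verification step the checklist insists on, as a minimal sketch: score the candidate commodity model against a small held-out labeled set before routing real traffic to it. `call_llm` is the same placeholder client as in the Tier 1 sketch, and the 90% bar is illustrative, not a recommendation:

```python
# Held-out quality check before committing a path to a cheap model.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire in your provider client here")

def held_out_accuracy(model: str, labeled: list[tuple[str, str]]) -> float:
    """Fraction of held-out (prompt, expected_label) pairs the model gets right."""
    hits = sum(call_llm(model, p).strip() == want for p, want in labeled)
    return hits / len(labeled)

# Route to commodity only if it clears your bar on *your* task, e.g.:
# if held_out_accuracy("tier3/commodity-flash", labeled_set) >= 0.90: ...
```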
A decision framework in four questions
The mistake most teams make is picking a single model for every workload. The market's behavior tells you to route by tier instead; four questions get you there (a sketch of the resulting router follows the list).

1. Is a wrong answer expensive? If a single bad call costs more than $10 in downstream impact (a disputed invoice, a broken migration, a misrouted support ticket), keep that step on Tier 1 SOTA. The pricing premium is insurance.
2. Can you split the workload? Most agentic workflows have one critical step buried in ten ancillary ones. Route only the critical step to SOTA; run the rest on Tier 2 cost-effective. This is how the apps winning on unit economics are set up.
3. What's your token volume? Under 100M tokens/month, tier choice matters less; pick on quality. Over 1B tokens/month, you cannot afford a blanket Tier 1 — the bill will eat your margin. This is where Tier 3 commodity for bulk paths becomes non-negotiable.
4. Is this task quality-verified on that model? A $0.10/M model that hallucinates 30% of the time on your task is more expensive than a $10/M model that doesn't. Price is only half the equation. Verify against a held-out set before committing to any tier — especially commodity.
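A sketch only: the thresholds and field names below wire up this article's heuristics literally; they are not any library's API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    error_cost_usd: float   # downstream cost of one wrong answer (Q1)
    monthly_tokens: float   # expected volume on this path (Q3)
    quality_verified: bool  # held-out check passed on the cheap model? (Q4)

def pick_tier(task: Task) -> str:
    if task.error_cost_usd > 10:   # Q1: expensive mistakes stay on SOTA
        return "tier-1-sota"
    if not task.quality_verified:  # Q4: never go cheap unverified
        return "tier-2-cost-effective"
    if task.monthly_tokens > 1e9:  # Q3: at scale, bulk paths go commodity
        return "tier-3-commodity"
    return "tier-2-cost-effective"

# Q2 in practice: split the workflow and route each step on its own merits.
steps = {
    "final-answer": Task(error_cost_usd=50.0, monthly_tokens=2e8, quality_verified=False),
    "bulk-triage":  Task(error_cost_usd=0.05, monthly_tokens=5e9, quality_verified=True),
}
for name, t in steps.items():
    print(f"{name:12s} -> {pick_tier(t)}")
```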
Why this analysis exists
Every benchmark site ranks models by a single number. Every routing service picks for you. Neither of those helps when the real answer is "use three models, one for each lane, and know where each boundary sits".
CodeSOTA brings benchmark performance, live pricing, and real OpenRouter usage together in one view — so you can pick not just which model but which tier for which step. That's the decision-engine layer on top of the raw catalogs.
Related: one year of market trends · inverted model leaderboard · app-level spend rankings
Disagree with our tier boundaries?
If you think the $5/M and $0.50/M cutoffs are wrong — or you've run a specific model head-to-head against one we'd put a tier higher — tell us. We reply within 48 hours and update the analysis.