AI Text Arena
Which Model Wins?
Real humans pick their preferred response in blind head-to-head matchups. No cherry-picked demos — just raw preference data from hundreds of thousands of conversations. Rankings update continuously as votes come in.
Ratings come from a Bradley-Terry model with bootstrap resampling. Each battle is a blind A/B test: voters don't know which model is which until after they vote.
Full Leaderboard
Ranked by Elo score · prices per million tokens (in / out)
| # | Model | Org | Elo (±95% CI) | Votes | In $/M | Out $/M | Context | License |
|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 1502±6 | 12K | $5 | $25 | 1M | Proprietary |
| 2 | claude-opus-4-6 | Anthropic | 1501±6 | 13K | $5 | $25 | 1M | Proprietary |
| 3 | gemini-3.1-pro-preview | Google | 1493±6 | 15K | $2 | $12 | 1M | Proprietary |
| 4 | grok-4.20-beta1 | xAI | 1492±7 | 7.4K | N/A | N/A | N/A | Proprietary |
| 5 | gemini-3-pro | Google | 1486±4 | 42K | $2 | $12 | 1M | Proprietary |
| 6 | gpt-5.4-high | OpenAI | 1485±9 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 7 | gpt-5.2-chat-latest | OpenAI | 1482±6 | 10K | $1.75 | $14 | 128K | Proprietary |
| 8 | grok-4.20-beta-reasoning | xAI | 1481±9 | 4.5K | $2 | $6 | 2M | Proprietary |
| 9 | gemini-3-flash | Google | 1475±4 | 31K | $0.50 | $3 | 1M | Proprietary |
| 10 | claude-opus-4-5-thinking | Anthropic | 1474±4 | 37K | $5 | $25 | 200K | Proprietary |
| 11 | grok-4.1-thinking | xAI | 1472±4 | 44K | $0.20 | $0.50 | N/A | Proprietary |
| 12 | claude-opus-4-5-20251101 | Anthropic | 1469±4 | 42K | $5 | $25 | 200K | Proprietary |
| 13 | claude-sonnet-4-6 | Anthropic | 1465±6 | 9.8K | $3 | $15 | 1M | Proprietary |
| 14 | qwen3.5-max-preview | Alibaba | 1464±9 | 4.3K | N/A | N/A | N/A | Proprietary |
| 15 | gpt-5.3-chat-latest | OpenAI | 1464±7 | 8.9K | $1.75 | $14 | 128K | Proprietary |
| 16 | gemini-3-flash-thinking-minimal | Google | 1463±4 | 27K | $0.50 | $3 | 1M | Proprietary |
| 17 | gpt-5.4 | OpenAI | 1463±8 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 18 | dola-seed-2.0-preview | ByteDance | 1462±6 | 11K | N/A | N/A | N/A | Proprietary |
| 19 | grok-4.1 | xAI | 1461±4 | 48K | $0.20 | $0.50 | N/A | Proprietary |
| 20 | gpt-5.1-high | OpenAI | 1455±4 | 41K | $1.25 | $10 | 400K | Proprietary |
| 21 | glm-5 | Z.ai | 1455±6 | 11K | $1 | $3.20 | 202.8K | MIT |
| 22 | kimi-k2.5-thinking | Moonshot | 1453±5 | 16K | $0.60 | $3 | N/A | Modified MIT |
| 23 | claude-sonnet-4-5 | Anthropic | 1453±3 | 54K | $3 | $15 | 200K | Proprietary |
| 24 | claude-sonnet-4-5-thinking | Anthropic | 1453±3 | 56K | $3 | $15 | 200K | Proprietary |
| 25 | ernie-5.0-0110 | Baidu | 1452±5 | 19K | N/A | N/A | N/A | Proprietary |
| 26 | qwen3.5-397b-a17b | Alibaba | 1452±6 | 10K | $0.39 | $2.34 | 262.1K | Apache 2.0 |
| 27 | ernie-5.0-preview | Baidu | 1450±7 | 9.9K | N/A | N/A | N/A | Proprietary |
| 28 | claude-opus-4-1-thinking | Anthropic | 1449±3 | 50K | $15 | $75 | 200K | Proprietary |
| 29 | gemini-2.5-pro | Google | 1448±3 | 103K | $1.25 | $10 | 1M | Proprietary |
| 30 | claude-opus-4-1 | Anthropic | 1447±3 | 78K | $15 | $75 | 200K | Proprietary |
How Arena Rankings Work
The methodology behind human preference evaluation
Blind A/B Battles
A user submits a prompt. Two randomly selected models generate responses in parallel. The voter sees Model A and Model B — no names, no branding. They pick the better response, or call it a tie. Only after voting is the identity of both models revealed. This eliminates brand bias.
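The battle flow above maps naturally onto a simple data model. A minimal Python sketch follows; the field names and sampling logic are illustrative assumptions, not arena.ai's actual schema.

```python
import random
from dataclasses import dataclass

# Hypothetical record of one battle; field names are illustrative,
# not arena.ai's actual schema.
@dataclass
class Battle:
    prompt: str
    model_a: str   # identity hidden from the voter until after the vote
    model_b: str
    winner: str    # "model_a", "model_b", or "tie"

def pair_models(available: list[str]) -> tuple[str, str]:
    """Draw two distinct models at random for a blind head-to-head."""
    a, b = random.sample(available, 2)
    return a, b
```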
Elo / Bradley-Terry Rating
In classic Elo, ratings are updated after every battle: the stronger model gains fewer points for beating a weaker opponent, and scores converge toward stable values. Arena.ai instead fits the Bradley-Terry model, the statistical model that underlies Elo, over the full battle history at once. This handles ties, scales to many players, and does not depend on battle order, and 1000 bootstrap iterations are run on top of the fit to compute confidence intervals.
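To make the fitting step concrete, here is a minimal Python sketch of a Bradley-Terry fit over win/loss records, using the classic minorization-maximization update and mapping the fitted strengths onto the familiar Elo scale. It is not arena.ai's code: ties are ignored, and a small pseudo-win is added as an assumed regularizer so winless models keep a finite rating.

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs and return
    Elo-scale ratings. A sketch of the general technique only."""
    wins = defaultdict(float)        # W_i: wins per model
    pair_n = defaultdict(int)        # n_ij: battles per unordered pair
    for winner, loser in battles:
        wins[winner] += 1
        pair_n[tuple(sorted((winner, loser)))] += 1

    models = sorted({m for pair in pair_n for m in pair})
    p = {m: 1.0 for m in models}     # latent strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(n / (p[a] + p[b])
                        for (a, b), n in pair_n.items() if i in (a, b))
            # 0.5 pseudo-win keeps winless models at a finite rating.
            new_p[i] = (wins[i] + 0.5) / denom
        # The scale is arbitrary; pin the geometric mean at 1 for stability.
        gm = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / gm for m, v in new_p.items()}

    # 400 * log10(strength) reproduces the Elo win-probability formula.
    return {m: round(400 * math.log10(p[m]) + 1000) for m in models}

# Tiny hypothetical example: model_a beats model_b twice, model_b beats model_c once.
print(fit_bradley_terry([("model_a", "model_b"),
                         ("model_a", "model_b"),
                         ("model_b", "model_c")]))
```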
Confidence Intervals
The ± value in the Elo column shows the 95% confidence interval. Models with fewer votes have wider intervals. A model at rank 1 with ±6 and a model at rank 2 with ±6 are statistically indistinguishable; treat them as tied until more votes accumulate. Models with 40K+ votes have tight intervals and their rankings are stable.
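A sketch of how such intervals could be produced by bootstrap resampling, reusing the hypothetical fit_bradley_terry() from the previous snippet; arena.ai's actual pipeline will differ in its details.

```python
import random
from collections import defaultdict

def bootstrap_elo_ci(battles, n_boot=1000, alpha=0.05):
    """Resample battles with replacement, refit ratings each time, and
    report a percentile confidence interval per model. Sketch only."""
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = random.choices(battles, k=len(battles))
        for model, elo in fit_bradley_terry(resample).items():
            samples[model].append(elo)

    cis = {}
    for model, elos in samples.items():
        elos.sort()
        lo = elos[int(len(elos) * alpha / 2)]
        hi = elos[int(len(elos) * (1 - alpha / 2)) - 1]
        cis[model] = (lo, hi)
    return cis

def statistically_tied(ci_a, ci_b):
    """Treat two models as tied when their confidence intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```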
What It Measures
Arena ratings measure human preference in open-ended conversation — the full breadth of tasks that real users care about: coding help, writing, reasoning, factual Q&A, creative tasks. This differs from narrow academic benchmarks, which often test specific capabilities in isolation.
Important caveat: Arena rankings reflect average human preference, not ground truth quality. Tasks requiring verified facts, long-context retrieval, or specialized domain expertise may rank models differently from arena outcomes. Use arena scores alongside task-specific benchmarks for deployment decisions.
Key Insights
What the rankings tell us about the current AI landscape
Thinking Models Dominate
Models with explicit chain-of-thought or extended thinking modes claim four of the top 12 spots: claude-opus-4-6-thinking at #1, grok-4.20-beta-reasoning at #8, claude-opus-4-5-thinking at #10, and grok-4.1-thinking at #11. Voters consistently prefer responses from models that reason before answering, even when they can't see the thinking trace.
Price-Performance Leaders
grok-4.1 (rank 19) costs just $0.20/$0.50 per million tokens yet outranks models that cost an order of magnitude more. gemini-3-flash (rank 9, $0.50/$3) outperforms multiple frontier models. qwen3.5-397b-a17b (rank 26, $0.39/$2.34, Apache 2.0) is the standout open-source value pick: near-frontier quality at sub-$0.40 input pricing.
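To put the per-token prices in concrete terms, here is a quick worked example for an assumed request shape of 2,000 input tokens and 800 output tokens, comparing grok-4.1 with the similarly ranked gpt-5.1-high using the table's prices. The workload size is a hypothetical, not arena.ai data.

```python
def request_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one request in USD given per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Hypothetical workload: 2,000 input tokens and 800 output tokens per request.
print(request_cost(2_000, 800, 0.20, 0.50))   # grok-4.1:     $0.0008
print(request_cost(2_000, 800, 1.25, 10.00))  # gpt-5.1-high: $0.0105
```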
Open Source Contenders
Three models in the top 30 carry non-proprietary licenses: glm-5 (MIT, rank 21), kimi-k2.5-thinking (Modified MIT, rank 22), and qwen3.5-397b-a17b (Apache 2.0, rank 26). The gap between the best open-weights models and frontier proprietary models has shrunk to roughly 50 Elo points — compared to 200+ points just 18 months ago.
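What a gap of roughly 50 Elo points means in practice: under the standard Elo/Bradley-Terry win-probability formula, it translates to only a modest head-to-head edge. A small illustration using the table's ratings for the #1 model and glm-5 follows; the formula is the textbook one, not arena.ai-specific code.

```python
def expected_win_rate(elo_a, elo_b):
    """Probability that A beats B under the standard Elo win-probability formula."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

# Top proprietary model (1502) vs. best open-weights model glm-5 (1455):
print(round(expected_win_rate(1502, 1455), 3))  # ~0.567, barely better than a coin flip
```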
Context Window Divergence
Anthropic and Google have standardized on 1M token context windows for their top models. OpenAI's gpt-5.4 reaches 1.1M. xAI's grok-4.20-beta-reasoning offers a massive 2M window. Several xAI and Chinese models don't publish context lengths — a transparency gap for enterprise buyers.
Anthropic vs. Google at the Top
Anthropic and Google together hold 6 of the top 10 slots, three each. OpenAI (five models in the top 30) and xAI (four) are competitive, but only xAI's grok-4.20-beta1 at #4 cracks the top 5. The top two Claude models are statistically tied (1502 vs 1501, both ±6 CI), essentially a dead heat.
Vote Count vs. Reliability
gemini-2.5-pro (rank 29) has 103K votes, more than any other model, and a CI of just ±3, tied for the tightest on the board. New models like gpt-5.4-high (4,965 votes, ±9 CI) may move significantly as more votes arrive. High-vote models with ±3 CIs are the most trustworthy signal.
Provider Breakdown
Model count per organization in the top 30
Anthropic 9 · Google 5 · OpenAI 5 · xAI 4 · Alibaba 2 · Baidu 2 · ByteDance 1 · Z.ai 1 · Moonshot 1
Want to Influence the Rankings?
Every vote on arena.ai makes the rankings more accurate. Submit prompts, compare responses, and help the community understand which AI models are truly best.
Vote on Arena.ai ↗
Data sourced from arena.ai, an independent platform. Rankings update continuously. CodeSOTA snapshot taken March 2026.