Real humans pick their preferred response in blind head-to-head matchups. No cherry-picked demos — just raw preference data from hundreds of thousands of conversations. Rankings update continuously as votes come in.
Bradley-Terry model with bootstrap resampling. Each battle is a blind A/B test — voters don't know which model is which until after they vote.
Ranked by Elo score · prices per million tokens (in / out)
| # | Model | Org | Elo | Votes | In $/M | Out $/M | Context | License |
|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 1502±6 | 12K | $5 | $25 | 1M | Proprietary |
| 2 | claude-opus-4-6 | Anthropic | 1501±6 | 13K | $5 | $25 | 1M | Proprietary |
| 3 | gemini-3.1-pro-preview | Google | 1493±6 | 15K | $2 | $12 | 1M | Proprietary |
| 4 | grok-4.20-beta1 | xAI | 1492±7 | 7.4K | N/A | N/A | N/A | Proprietary |
| 5 | gemini-3-pro | Google | 1486±4 | 42K | $2 | $12 | 1M | Proprietary |
| 6 | gpt-5.4-high | OpenAI | 1485±9 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 7 | gpt-5.2-chat-latest | OpenAI | 1482±6 | 10K | $1.75 | $14 | 128K | Proprietary |
| 8 | grok-4.20-beta-reasoning | xAI | 1481±9 | 4.5K | $2 | $6 | 2M | Proprietary |
| 9 | gemini-3-flash | Google | 1475±4 | 31K | $0.50 | $3 | 1M | Proprietary |
| 10 | claude-opus-4-5-thinking | Anthropic | 1474±4 | 37K | $5 | $25 | 200K | Proprietary |
| 11 | grok-4.1-thinking | xAI | 1472±4 | 44K | $0.20 | $0.50 | N/A | Proprietary |
| 12 | claude-opus-4-5-20251101 | Anthropic | 1469±4 | 42K | $5 | $25 | 200K | Proprietary |
| 13 | claude-sonnet-4-6 | Anthropic | 1465±6 | 9.8K | $3 | $15 | 1M | Proprietary |
| 14 | qwen3.5-max-preview | Alibaba | 1464±9 | 4.3K | N/A | N/A | N/A | Proprietary |
| 15 | gpt-5.3-chat-latest | OpenAI | 1464±7 | 8.9K | $1.75 | $14 | 128K | Proprietary |
| 16 | gemini-3-flash-thinking-minimal | Google | 1463±4 | 27K | $0.50 | $3 | 1M | Proprietary |
| 17 | gpt-5.4 | OpenAI | 1463±8 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 18 | dola-seed-2.0-preview | Bytedance | 1462±6 | 11K | N/A | N/A | N/A | Proprietary |
| 19 | grok-4.1 | xAI | 1461±4 | 48K | $0.20 | $0.50 | N/A | Proprietary |
| 20 | gpt-5.1-high | OpenAI | 1455±4 | 41K | $1.25 | $10 | 400K | Proprietary |
| 21 | glm-5 | Z.ai | 1455±6 | 11K | $1 | $3.20 | 202.8K | MIT |
| 22 | kimi-k2.5-thinking | Moonshot | 1453±5 | 16K | $0.60 | $3 | N/A | Modified MIT |
| 23 | claude-sonnet-4-5 | Anthropic | 1453±3 | 54K | $3 | $15 | 200K | Proprietary |
| 24 | claude-sonnet-4-5-thinking | Anthropic | 1453±3 | 56K | $3 | $15 | 200K | Proprietary |
| 25 | ernie-5.0-0110 | Baidu | 1452±5 | 19K | N/A | N/A | N/A | Proprietary |
| 26 | qwen3.5-397b-a17b | Alibaba | 1452±6 | 10K | $0.39 | $2.34 | 262.1K | Apache 2.0 |
| 27 | ernie-5.0-preview | Baidu | 1450±7 | 9.9K | N/A | N/A | N/A | Proprietary |
| 28 | claude-opus-4-1-thinking | Anthropic | 1449±3 | 50K | $15 | $75 | 200K | Proprietary |
| 29 | gemini-2.5-pro | Google | 1448±3 | 103K | $1.25 | $10 | 1M | Proprietary |
| 30 | claude-opus-4-1 | Anthropic | 1447±3 | 78K | $15 | $75 | 200K | Proprietary |
## The methodology behind human preference evaluation
A user submits a prompt. Two randomly selected models generate responses in parallel. The voter sees Model A and Model B — no names, no branding. They pick the better response, or call it a tie. Only after voting is the identity of both models revealed. This eliminates brand bias.
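The battle flow above can be sketched in a few lines. Everything here is a hypothetical illustration, not arena.ai's actual code: the function name, the `vote_fn` voter callback, and the record shape are all assumptions.

```python
import random

def run_battle(models, vote_fn, rng=random):
    """One blind battle (hypothetical sketch, not arena.ai's implementation).

    Two distinct models are drawn at random; the voter callback sees only the
    anonymous labels "A" and "B", never the model names. Identities are
    attached to the returned record only after the vote is taken, mirroring
    the post-vote reveal described above."""
    model_a, model_b = rng.sample(models, 2)
    vote = vote_fn("A", "B")          # returns "A", "B", or "tie"; names stay hidden
    return {"model_a": model_a, "model_b": model_b, "vote": vote}

# A stub voter that always prefers the left response, for demonstration.
record = run_battle(["m1", "m2", "m3"], lambda a, b: "A",
                    rng=random.Random(0))
```

The key design point is that `vote_fn` receives no model identifiers, so brand preference cannot leak into the vote.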
After each battle, ratings update. In the classic Elo scheme, the stronger model gains fewer points for beating a weaker opponent, so ratings converge toward stable values. Arena.ai uses the Bradley-Terry model, a probabilistic relative of Elo that handles ties and many players by refitting over the full battle history, with 1,000 bootstrap iterations to compute confidence intervals.
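The fitting procedure can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (ties dropped, Hunter's MM update, strengths mapped onto a 400·log10 Elo-like scale with an arbitrary 1000-point offset), not arena.ai's code:

```python
import math
import random

def fit_bradley_terry(battles, n_models, iters=100):
    """Fit Bradley-Terry strengths with Hunter's MM update.

    battles: list of (winner_idx, loser_idx) pairs; ties are dropped for
    simplicity. Returns Elo-like scores: 400 * log10(strength) + 1000."""
    w = [1.0] * n_models
    for _ in range(iters):
        wins = [0] * n_models
        denom = [1e-12] * n_models          # guard against division by zero
        for a, b in battles:
            wins[a] += 1
            inv = 1.0 / (w[a] + w[b])
            denom[a] += inv                 # every battle a model plays in
            denom[b] += inv                 # contributes to its denominator
        w = [max(wins[i], 1e-9) / denom[i] for i in range(n_models)]
        total = sum(w)
        w = [x * n_models / total for x in w]  # BT is scale-free; pin the scale
    return [400.0 * math.log10(x) + 1000.0 for x in w]

def bootstrap_intervals(battles, n_models, n_boot=50, seed=0):
    """Resample battles with replacement; return per-model 95% (lo, hi) bounds."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(battles) for _ in battles]
        draws.append(fit_bradley_terry(sample, n_models))
    bounds = []
    for i in range(n_models):
        vals = sorted(d[i] for d in draws)
        bounds.append((vals[max(0, int(0.025 * n_boot))],
                       vals[min(n_boot - 1, int(0.975 * n_boot))]))
    return bounds

# Synthetic arena: model 0 wins ~75% of its battles against model 1, which
# corresponds to a gap of about 400 * log10(3) ≈ 191 points on this scale.
rng = random.Random(42)
battles = [(0, 1) if rng.random() < 0.75 else (1, 0) for _ in range(1000)]
ratings = fit_bradley_terry(battles, 2)
intervals = bootstrap_intervals(battles, 2)
```

The bootstrap is what produces the ± figures: each resample of the battle log yields slightly different ratings, and the spread of those ratings is the reported uncertainty.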
The ± value attached to each Elo score is the 95% confidence interval. Models with fewer votes have wider intervals. A model at rank 1 with ±6 and a model at rank 2 with ±6 are statistically indistinguishable; treat them as tied until more votes accumulate. Models with 40K+ votes have tight intervals, and their rankings are stable.
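The overlap reading can be made mechanical. A minimal check, assuming the ± value is a symmetric 95% half-width (`statistically_tied` is a hypothetical helper, not an arena.ai API):

```python
def statistically_tied(elo_a, half_a, elo_b, half_b):
    """True when the intervals [elo - half, elo + half] overlap, i.e. the
    Elo gap is no larger than the combined half-widths."""
    return abs(elo_a - elo_b) <= half_a + half_b

# The two top Claude entries from the table: 1502±6 vs 1501±6.
top_pair_tied = statistically_tied(1502, 6, 1501, 6)
# Rank 1 vs gemini-2.5-pro (1448±3): clearly separated.
far_pair_tied = statistically_tied(1502, 6, 1448, 3)
```

Note this mirrors the article's overlap heuristic; non-overlapping intervals are a conservative signal of a real difference, while overlap simply means the data cannot yet separate the two models.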
Arena ratings measure human preference in open-ended conversation — the full breadth of tasks that real users care about: coding help, writing, reasoning, factual Q&A, creative tasks. This differs from narrow academic benchmarks, which often test specific capabilities in isolation.
Important caveat: Arena rankings reflect average human preference, not ground truth quality. Tasks requiring verified facts, long-context retrieval, or specialized domain expertise may rank models differently from arena outcomes. Use arena scores alongside task-specific benchmarks for deployment decisions.
## What the rankings tell us about the current AI landscape
Four of the top 12 models are explicit chain-of-thought / extended-thinking variants: claude-opus-4-6-thinking at #1, grok-4.20-beta-reasoning at #8, claude-opus-4-5-thinking at #10, and grok-4.1-thinking at #11. Voters consistently prefer responses from models that reason before answering, even when they can't see the thinking trace.
grok-4.1 (rank 19) costs just $0.20/$0.50 per million tokens yet outranks models costing several times more, such as gpt-5.1-high at $1.25/$10. gemini-3-flash (rank 9, $0.50/$3) outperforms multiple frontier models. qwen3.5-397b-a17b (rank 26, $0.39/$2.34, Apache 2.0) is the standout open-source value pick: near-frontier quality at sub-$0.40 input pricing.
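These value observations can be quantified with a simple cost-per-Elo-point metric over a few rows from the table. The 3:1 input:output token blend is an assumption chosen for illustration, not arena.ai's methodology; adjust `in_share` for your own workload:

```python
# (model, elo, $/M input, $/M output), taken from the table above.
rows = [
    ("grok-4.1",          1461, 0.20,  0.50),
    ("gemini-3-flash",    1475, 0.50,  3.00),
    ("qwen3.5-397b-a17b", 1452, 0.39,  2.34),
    ("gpt-5.1-high",      1455, 1.25, 10.00),
    ("claude-opus-4-6",   1501, 5.00, 25.00),
]

def blended_price(price_in, price_out, in_share=0.75):
    """Assumed 3:1 input:output token mix (an illustration, not a standard)."""
    return in_share * price_in + (1.0 - in_share) * price_out

# Sort by blended dollars per Elo point; lower means better value.
by_value = sorted(rows, key=lambda r: blended_price(r[2], r[3]) / r[1])
best_value = by_value[0][0]
```

Because Elo differences near the top are small while prices span more than an order of magnitude, price dominates any such ranking: the cheap models win on value even after conceding 40-50 Elo points.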
Three models in the top 30 carry non-proprietary licenses: glm-5 (MIT, rank 21), kimi-k2.5-thinking (Modified MIT, rank 22), and qwen3.5-397b-a17b (Apache 2.0, rank 26). The gap between the best open-weights models and frontier proprietary models has shrunk to roughly 50 Elo points — compared to 200+ points just 18 months ago.
Anthropic and Google have standardized on 1M token context windows for their top models. OpenAI's gpt-5.4 reaches 1.1M. xAI's grok-4.20-beta-reasoning offers a massive 2M window. Several xAI and Chinese models don't publish context lengths — a transparency gap for enterprise buyers.
Anthropic and Google together hold six of the top ten slots, three apiece. OpenAI and xAI remain competitive, placing five and four models respectively in the top 30; xAI's grok-4.20-beta1 reaches #4, while OpenAI's best entry, gpt-5.4-high, sits at #6. The top two Claude models are statistically tied (1502 vs 1501, each ±6): essentially a dead heat.
gemini-2.5-pro (rank 29) has 103K votes — more than any other model — giving it the tightest CI of just ±3. New models like gpt-5.4-high (4,965 votes, ±9 CI) may move significantly as more votes arrive. High-vote models with ±3 CIs are the most trustworthy signal.
## Model count per organization in the top 30

- Anthropic: 9
- Google: 5
- OpenAI: 5
- xAI: 4
- Alibaba: 2
- Baidu: 2
- Bytedance: 1
- Moonshot: 1
- Z.ai: 1
Every vote on arena.ai makes the rankings more accurate. Submit prompts, compare responses, and help the community understand which AI models are truly best.
Data sourced from arena.ai, an independent platform. Rankings update continuously. CodeSOTA snapshot taken March 2026.