Live Human Preference Rankings

AI Text Arena
Which Model Wins?

Real humans pick their preferred response in blind head-to-head matchups. No cherry-picked demos — just raw preference data from hundreds of thousands of conversations. Rankings update continuously as votes come in.

SOTA Elo Score: 1502 (claude-opus-4-6-thinking)
Models Ranked: 30 (top 30 shown)
Total Votes: 824K human preferences
Methodology

Bradley-Terry model with bootstrap resampling. Each battle is a blind A/B test — voters don't know which model is which until after they vote.

Full Leaderboard

Ranked by Elo score · prices per million tokens (in / out)

Source: arena.ai ↗
| Rank | Provider | Model | Elo ±CI | Votes | Price (in / out) | Ctx |
|------|-----------|-------|---------|-------|------------------|-----|
| 1 | Anthropic | claude-opus-4-6-thinking | 1502 ±6 | 12K | $5 / $25 | 1M |
| 2 | Anthropic | claude-opus-4-6 | 1501 ±6 | 13K | $5 / $25 | 1M |
| 3 | Google | gemini-3.1-pro-preview | 1493 ±6 | 15K | $2 / $12 | 1M |
| 4 | xAI | grok-4.20-beta1 | 1492 ±7 | 7.4K | N/A | N/A |
| 5 | Google | gemini-3-pro | 1486 ±4 | 42K | $2 / $12 | 1M |
| 6 | OpenAI | gpt-5.4-high | 1485 ±9 | 5.0K | $2.50 / $15 | 1.1M |
| 7 | OpenAI | gpt-5.2-chat-latest | 1482 ±6 | 10K | $1.75 / $14 | 128K |
| 8 | xAI | grok-4.20-beta-reasoning | 1481 ±9 | 4.5K | $2 / $6 | 2M |
| 9 | Google | gemini-3-flash | 1475 ±4 | 31K | $0.50 / $3 | 1M |
| 10 | Anthropic | claude-opus-4-5-thinking | 1474 ±4 | 37K | $5 / $25 | 200K |
| 11 | xAI | grok-4.1-thinking | 1472 ±4 | 44K | $0.20 / $0.50 | N/A |
| 12 | Anthropic | claude-opus-4-5-20251101 | 1469 ±4 | 42K | $5 / $25 | 200K |
| 13 | Anthropic | claude-sonnet-4-6 | 1465 ±6 | 9.8K | $3 / $15 | 1M |
| 14 | Alibaba | qwen3.5-max-preview | 1464 ±9 | 4.3K | N/A | N/A |
| 15 | OpenAI | gpt-5.3-chat-latest | 1464 ±7 | 8.9K | $1.75 / $14 | 128K |
| 16 | Google | gemini-3-flash-thinking-minimal | 1463 ±4 | 27K | $0.50 / $3 | 1M |
| 17 | OpenAI | gpt-5.4 | 1463 ±8 | 5.0K | $2.50 / $15 | 1.1M |
| 18 | Bytedance | dola-seed-2.0-preview | 1462 ±6 | 11K | N/A | N/A |
| 19 | xAI | grok-4.1 | 1461 ±4 | 48K | $0.20 / $0.50 | N/A |
| 20 | OpenAI | gpt-5.1-high | 1455 ±4 | 41K | $1.25 / $10 | 400K |
| 21 | Z.ai | glm-5 | 1455 ±6 | 11K | $1 / $3.20 | 202.8K |
| 22 | Moonshot | kimi-k2.5-thinking | 1453 ±5 | 16K | $0.60 / $3 | N/A |
| 23 | Anthropic | claude-sonnet-4-5 | 1453 ±3 | 54K | $3 / $15 | 200K |
| 24 | Anthropic | claude-sonnet-4-5-thinking | 1453 ±3 | 56K | $3 / $15 | 200K |
| 25 | Baidu | ernie-5.0-0110 | 1452 ±5 | 19K | N/A | N/A |
| 26 | Alibaba | qwen3.5-397b-a17b | 1452 ±6 | 10K | $0.39 / $2.34 | 262.1K |
| 27 | Baidu | ernie-5.0-preview | 1450 ±7 | 9.9K | N/A | N/A |
| 28 | Anthropic | claude-opus-4-1-thinking | 1449 ±3 | 50K | $15 / $75 | 200K |
| 29 | Google | gemini-2.5-pro | 1448 ±3 | 103K | $1.25 / $10 | 1M |
| 30 | Anthropic | claude-opus-4-1 | 1447 ±3 | 78K | $15 / $75 | 200K |

How Arena Rankings Work

The methodology behind human preference evaluation

Blind A/B Battles

A user submits a prompt. Two randomly selected models generate responses in parallel. The voter sees Model A and Model B — no names, no branding. They pick the better response, or call it a tie. Only after voting is the identity of both models revealed. This eliminates brand bias.
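For concreteness, here is a minimal sketch of that battle loop. The `generate` and `ask_voter` helpers are hypothetical stand-ins for arena.ai's serving and voting UI, which are not public:

```python
import random

# One blind A/B battle. The voter only ever sees the labels "A" and "B";
# model identities are attached to the record after the vote is cast.
def run_battle(models: list[str], prompt: str, generate, ask_voter) -> dict:
    model_a, model_b = random.sample(models, 2)      # two distinct models
    responses = {"A": generate(model_a, prompt),
                 "B": generate(model_b, prompt)}
    vote = ask_voter(prompt, responses)              # "A", "B", or "tie"
    return {"model_a": model_a, "model_b": model_b, "winner": vote}
```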

Elo / Bradley-Terry Rating

After each battle, ratings update: the stronger model gains fewer points for beating a weaker opponent, so ratings converge toward stable values. Arena.ai uses the Bradley-Terry model, the probabilistic model that Elo itself is built on; fit over the full battle history, it handles ties and scales to many players, with 1000 bootstrap iterations to compute confidence intervals.
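A minimal sketch of that fit follows. Two assumptions are mine, not arena.ai's published method: ties are credited as half a win to each side (one common convention), and the scale is anchored so the mean rating is 1500 (a display choice).

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iters=200):
    # battles: iterable of (model_a, model_b, winner), winner in {"A", "B", "tie"}.
    wins = defaultdict(float)
    games = defaultdict(float)            # battle counts per unordered model pair
    for a, b, w in battles:
        games[frozenset((a, b))] += 1.0
        if w == "A":
            wins[a] += 1.0
        elif w == "B":
            wins[b] += 1.0
        else:                             # tie: half a win each (assumption)
            wins[a] += 0.5
            wins[b] += 0.5
    models = {m for pair in games for m in pair}
    p = dict.fromkeys(models, 1.0)        # Bradley-Terry strengths
    for _ in range(iters):                # Zermelo / MM fixed-point updates;
        new_p = {}                        # assumes the battle graph is connected
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (p[m] + p[other])
            new_p[m] = max(wins[m], 1e-9) / denom
        p = new_p
    # Map strengths to the Elo scale, where P(i beats j) = 1 / (1 + 10**((r_j - r_i) / 400)).
    elo = {m: 400.0 * math.log10(s) for m, s in p.items()}
    shift = 1500.0 - sum(elo.values()) / len(elo)   # anchor: mean rating 1500 (assumption)
    return {m: r + shift for m, r in elo.items()}
```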

Confidence Intervals

The ±CI column shows the 95% confidence interval. Models with fewer votes have wider intervals. A model at rank 1 (1502 ±6) and a model at rank 2 (1501 ±6) have overlapping intervals and are statistically indistinguishable; treat them as tied until more votes accumulate. Models with 40K+ votes have tight intervals and their rankings are stable.
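A sketch of the bootstrap step, reusing `fit_bradley_terry` from above; percentiles of the resampled ratings become the ±CI column:

```python
import random
from collections import defaultdict

def bootstrap_ci(battles, n_boot=1000, alpha=0.05):
    # Resample the battle log with replacement and refit ratings each time.
    # Models with fewer battles vary more across resamples, which is exactly
    # why low-vote models show wider +/- values on the leaderboard.
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = random.choices(battles, k=len(battles))
        for m, r in fit_bradley_terry(resample).items():
            samples[m].append(r)
    ci = {}
    for m, ratings in samples.items():
        ratings.sort()
        lo = ratings[int((alpha / 2) * len(ratings))]
        hi = ratings[int((1 - alpha / 2) * len(ratings)) - 1]
        ci[m] = (lo, hi)   # e.g. (1496, 1508) renders as 1502 +/- 6
    return ci
```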

What It Measures

Arena ratings measure human preference in open-ended conversation — the full breadth of tasks that real users care about: coding help, writing, reasoning, factual Q&A, creative tasks. This differs from narrow academic benchmarks, which often test specific capabilities in isolation.

Important caveat: Arena rankings reflect average human preference, not ground truth quality. Tasks requiring verified facts, long-context retrieval, or specialized domain expertise may rank models differently from arena outcomes. Use arena scores alongside task-specific benchmarks for deployment decisions.

Key Insights

What the rankings tell us about the current AI landscape

🧠

Thinking Models Dominate

Four of the top 12 spots are occupied by models with explicit chain-of-thought or extended-thinking modes: claude-opus-4-6-thinking at #1, grok-4.20-beta-reasoning at #8, claude-opus-4-5-thinking at #10, and grok-4.1-thinking at #11. Voters consistently prefer responses from models that reason before answering, even when they can't see the thinking trace.

💰

Price-Performance Leaders

grok-4.1 (rank 19) costs just $0.20/$0.50 per million tokens yet outranks models more than ten times its price, such as claude-sonnet-4-5 ($3/$15, rank 23). gemini-3-flash (rank 9, $0.50/$3) outperforms multiple frontier models. qwen3.5-397b-a17b (rank 26, $0.39/$2.34, Apache 2.0) is the standout open-source value pick: near-frontier quality at sub-$0.40 input pricing. A worked cost comparison follows below.
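To make the in/out pricing concrete, a quick per-request cost calculation; the 2,000-input / 500-output request shape is an arbitrary assumption for illustration:

```python
# Cost of one request under per-million-token pricing.
def request_cost(in_per_m: float, out_per_m: float, tok_in: int, tok_out: int) -> float:
    return in_per_m * tok_in / 1e6 + out_per_m * tok_out / 1e6

print(request_cost(0.20, 0.50, 2000, 500))   # grok-4.1:          $0.00065
print(request_cost(3.00, 15.00, 2000, 500))  # claude-sonnet-4-5: $0.01350 (~21x more)
```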

🌍

Open Source Contenders

Three models in the top 30 carry non-proprietary licenses: glm-5 (MIT, rank 21), kimi-k2.5-thinking (Modified MIT, rank 22), and qwen3.5-397b-a17b (Apache 2.0, rank 26). The gap between the best open-weights models and frontier proprietary models has shrunk to roughly 50 Elo points — compared to 200+ points just 18 months ago.

📏

Context Window Divergence

Anthropic and Google have standardized on 1M token context windows for their top models. OpenAI's gpt-5.4 reaches 1.1M. xAI's grok-4.20-beta-reasoning offers a massive 2M window. Several xAI and Chinese models don't publish context lengths — a transparency gap for enterprise buyers.

⚔️

Anthropic vs. Google at the Top

Anthropic and Google together hold six of the top 10 slots (three each). OpenAI fields five models in the top 30 and xAI four, but only xAI cracks the top five, with grok-4.20-beta1 at #4; OpenAI's best, gpt-5.4-high, sits at #6. The top two Claude models are statistically tied (1502 vs. 1501, both ±6): essentially a dead heat.

📊

Vote Count vs. Reliability

gemini-2.5-pro (rank 29) has 103K votes, more than any other model, and a CI of just ±3, tied for the tightest on the board. New models like gpt-5.4-high (4,965 votes, ±9 CI) may move significantly as more votes arrive. High-vote models with ±3 CIs are the most trustworthy signal.

Provider Breakdown

Model count per organization in the top 30

| Provider | Models | Best model | Top score |
|----------|--------|------------|-----------|
| Anthropic | 9 | claude-opus-4-6-thinking | 1502 |
| Google | 5 | gemini-3.1-pro-preview | 1493 |
| OpenAI | 5 | gpt-5.4-high | 1485 |
| xAI | 4 | grok-4.20-beta1 | 1492 |
| Alibaba | 2 | qwen3.5-max-preview | 1464 |
| Baidu | 2 | ernie-5.0-0110 | 1452 |
| Bytedance | 1 | dola-seed-2.0-preview | 1462 |
| Z.ai | 1 | glm-5 | 1455 |
| Moonshot | 1 | kimi-k2.5-thinking | 1453 |

Want to Influence the Rankings?

Every vote on arena.ai makes the rankings more accurate. Submit prompts, compare responses, and help the community understand which AI models are truly best.

Vote on Arena.ai ↗

Data sourced from arena.ai — an independent platform. Rankings update continuously. CodeSOTA snapshot taken March 2026.

Explore More Benchmarks