Real humans pick their preferred response in blind head-to-head matchups. No cherry-picked demos — just raw preference data from hundreds of thousands of conversations. Rankings update continuously as votes come in.
Bradley-Terry model with bootstrap resampling. Each battle is a blind A/B test — voters don't know which model is which until after they vote.
Ranked by Elo score · prices per million tokens (in / out)
| # | Model | Org | Elo | Votes | In $/M | Out $/M | Context | License |
|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | Anthropic | 1502±6 | 12K | $5 | $25 | 1M | Proprietary |
| 2 | claude-opus-4-6 | Anthropic | 1501±6 | 13K | $5 | $25 | 1M | Proprietary |
| 3 | gemini-3.1-pro-preview | Google | 1493±6 | 15K | $2 | $12 | 1M | Proprietary |
| 4 | grok-4.20-beta1 | xAI | 1492±7 | 7.4K | N/A | N/A | N/A | Proprietary |
| 5 | gemini-3-pro | Google | 1486±4 | 42K | $2 | $12 | 1M | Proprietary |
| 6 | gpt-5.4-high | OpenAI | 1485±9 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 7 | gpt-5.2-chat-latest | OpenAI | 1482±6 | 10K | $1.75 | $14 | 128K | Proprietary |
| 8 | grok-4.20-beta-reasoning | xAI | 1481±9 | 4.5K | $2 | $6 | 2M | Proprietary |
| 9 | gemini-3-flash | Google | 1475±4 | 31K | $0.50 | $3 | 1M | Proprietary |
| 10 | claude-opus-4-5-thinking | Anthropic | 1474±4 | 37K | $5 | $25 | 200K | Proprietary |
| 11 | grok-4.1-thinking | xAI | 1472±4 | 44K | $0.20 | $0.50 | N/A | Proprietary |
| 12 | claude-opus-4-5-20251101 | Anthropic | 1469±4 | 42K | $5 | $25 | 200K | Proprietary |
| 13 | claude-sonnet-4-6 | Anthropic | 1465±6 | 9.8K | $3 | $15 | 1M | Proprietary |
| 14 | qwen3.5-max-preview | Alibaba | 1464±9 | 4.3K | N/A | N/A | N/A | Proprietary |
| 15 | gpt-5.3-chat-latest | OpenAI | 1464±7 | 8.9K | $1.75 | $14 | 128K | Proprietary |
| 16 | gemini-3-flash-thinking-minimal | Google | 1463±4 | 27K | $0.50 | $3 | 1M | Proprietary |
| 17 | gpt-5.4 | OpenAI | 1463±8 | 5.0K | $2.50 | $15 | 1.1M | Proprietary |
| 18 | dola-seed-2.0-preview | Bytedance | 1462±6 | 11K | N/A | N/A | N/A | Proprietary |
| 19 | grok-4.1 | xAI | 1461±4 | 48K | $0.20 | $0.50 | N/A | Proprietary |
| 20 | gpt-5.1-high | OpenAI | 1455±4 | 41K | $1.25 | $10 | 400K | Proprietary |
| 21 | glm-5 | Z.ai | 1455±6 | 11K | $1 | $3.20 | 202.8K | MIT |
| 22 | kimi-k2.5-thinking | Moonshot | 1453±5 | 16K | $0.60 | $3 | N/A | Modified MIT |
| 23 | claude-sonnet-4-5 | Anthropic | 1453±3 | 54K | $3 | $15 | 200K | Proprietary |
| 24 | claude-sonnet-4-5-thinking | Anthropic | 1453±3 | 56K | $3 | $15 | 200K | Proprietary |
| 25 | ernie-5.0-0110 | Baidu | 1452±5 | 19K | N/A | N/A | N/A | Proprietary |
| 26 | qwen3.5-397b-a17b | Alibaba | 1452±6 | 10K | $0.39 | $2.34 | 262.1K | Apache 2.0 |
| 27 | ernie-5.0-preview | Baidu | 1450±7 | 9.9K | N/A | N/A | N/A | Proprietary |
| 28 | claude-opus-4-1-thinking | Anthropic | 1449±3 | 50K | $15 | $75 | 200K | Proprietary |
| 29 | gemini-2.5-pro | Google | 1448±3 | 103K | $1.25 | $10 | 1M | Proprietary |
| 30 | claude-opus-4-1 | Anthropic | 1447±3 | 78K | $15 | $75 | 200K | Proprietary |
## The methodology behind human preference evaluation
A user submits a prompt. Two randomly selected models generate responses in parallel. The voter sees Model A and Model B — no names, no branding. They pick the better response, or call it a tie. Only after voting is the identity of both models revealed. This eliminates brand bias.
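The battle flow above can be sketched in a few lines. Everything here is a hypothetical illustration, not arena.ai's actual code: the function name, the `vote_fn` voter callback, and the record shape are all assumptions.

```python
import random

def run_battle(models, vote_fn, rng=random):
    """One blind battle (hypothetical sketch, not arena.ai's implementation).

    Two distinct models are drawn at random; the voter callback sees only the
    anonymous labels "A" and "B", never the model names. Identities are
    attached to the returned record only after the vote is taken, mirroring
    the post-vote reveal described above."""
    model_a, model_b = rng.sample(models, 2)
    vote = vote_fn("A", "B")          # returns "A", "B", or "tie"; names stay hidden
    return {"model_a": model_a, "model_b": model_b, "vote": vote}

# A stub voter that always prefers the left response, for demonstration.
record = run_battle(["m1", "m2", "m3"], lambda a, b: "A",
                    rng=random.Random(0))
```

The key design point is that `vote_fn` receives no model identifiers, so brand preference cannot leak into the vote.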
After each battle, ratings update. In the classic Elo scheme, the stronger model gains fewer points for beating a weaker opponent, so ratings converge toward stable values. Arena.ai uses the Bradley-Terry model, a probabilistic relative of Elo that handles ties and many players by refitting over the full battle history, with 1,000 bootstrap iterations to compute confidence intervals.
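The fitting procedure can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions (ties dropped, Hunter's MM update, strengths mapped onto a 400·log10 Elo-like scale with an arbitrary 1000-point offset), not arena.ai's code:

```python
import math
import random

def fit_bradley_terry(battles, n_models, iters=100):
    """Fit Bradley-Terry strengths with Hunter's MM update.

    battles: list of (winner_idx, loser_idx) pairs; ties are dropped for
    simplicity. Returns Elo-like scores: 400 * log10(strength) + 1000."""
    w = [1.0] * n_models
    for _ in range(iters):
        wins = [0] * n_models
        denom = [1e-12] * n_models          # guard against division by zero
        for a, b in battles:
            wins[a] += 1
            inv = 1.0 / (w[a] + w[b])
            denom[a] += inv                 # every battle a model plays in
            denom[b] += inv                 # contributes to its denominator
        w = [max(wins[i], 1e-9) / denom[i] for i in range(n_models)]
        total = sum(w)
        w = [x * n_models / total for x in w]  # BT is scale-free; pin the scale
    return [400.0 * math.log10(x) + 1000.0 for x in w]

def bootstrap_intervals(battles, n_models, n_boot=50, seed=0):
    """Resample battles with replacement; return per-model 95% (lo, hi) bounds."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(battles) for _ in battles]
        draws.append(fit_bradley_terry(sample, n_models))
    bounds = []
    for i in range(n_models):
        vals = sorted(d[i] for d in draws)
        bounds.append((vals[max(0, int(0.025 * n_boot))],
                       vals[min(n_boot - 1, int(0.975 * n_boot))]))
    return bounds

# Synthetic arena: model 0 wins ~75% of its battles against model 1, which
# corresponds to a gap of about 400 * log10(3) ≈ 191 points on this scale.
rng = random.Random(42)
battles = [(0, 1) if rng.random() < 0.75 else (1, 0) for _ in range(1000)]
ratings = fit_bradley_terry(battles, 2)
intervals = bootstrap_intervals(battles, 2)
```

The bootstrap is what produces the ± figures: each resample of the battle log yields slightly different ratings, and the spread of those ratings is the reported uncertainty.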
The ± value attached to each Elo score is the 95% confidence interval. Models with fewer votes have wider intervals. A model at rank 1 with ±6 and a model at rank 2 with ±6 are statistically indistinguishable; treat them as tied until more votes accumulate. Models with 40K+ votes have tight intervals, and their rankings are stable.
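The overlap reading can be made mechanical. A minimal check, assuming the ± value is a symmetric 95% half-width (`statistically_tied` is a hypothetical helper, not an arena.ai API):

```python
def statistically_tied(elo_a, half_a, elo_b, half_b):
    """True when the intervals [elo - half, elo + half] overlap, i.e. the
    Elo gap is no larger than the combined half-widths."""
    return abs(elo_a - elo_b) <= half_a + half_b

# The two top Claude entries from the table: 1502±6 vs 1501±6.
top_pair_tied = statistically_tied(1502, 6, 1501, 6)
# Rank 1 vs gemini-2.5-pro (1448±3): clearly separated.
far_pair_tied = statistically_tied(1502, 6, 1448, 3)
```

Note this mirrors the article's overlap heuristic; non-overlapping intervals are a conservative signal of a real difference, while overlap simply means the data cannot yet separate the two models.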
Arena ratings measure human preference in open-ended conversation — the full breadth of tasks that real users care about: coding help, writing, reasoning, factual Q&A, creative tasks. This differs from narrow academic benchmarks, which often test specific capabilities in isolation.
Important caveat: Arena rankings reflect average human preference, not ground truth quality. Tasks requiring verified facts, long-context retrieval, or specialized domain expertise may rank models differently from arena outcomes. Use arena scores alongside task-specific benchmarks for deployment decisions.
## What the rankings tell us about the current AI landscape
Four of the top 12 models are explicit chain-of-thought / extended-thinking variants: claude-opus-4-6-thinking at #1, grok-4.20-beta-reasoning at #8, claude-opus-4-5-thinking at #10, and grok-4.1-thinking at #11. Voters consistently prefer responses from models that reason before answering, even when they can't see the thinking trace.
grok-4.1 (rank 19) costs just $0.20/$0.50 per million tokens yet outranks models costing several times more, such as gpt-5.1-high at $1.25/$10. gemini-3-flash (rank 9, $0.50/$3) outperforms multiple frontier models. qwen3.5-397b-a17b (rank 26, $0.39/$2.34, Apache 2.0) is the standout open-source value pick: near-frontier quality at sub-$0.40 input pricing.
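These value observations can be quantified with a simple cost-per-Elo-point metric over a few rows from the table. The 3:1 input:output token blend is an assumption chosen for illustration, not arena.ai's methodology; adjust `in_share` for your own workload:

```python
# (model, elo, $/M input, $/M output), taken from the table above.
rows = [
    ("grok-4.1",          1461, 0.20,  0.50),
    ("gemini-3-flash",    1475, 0.50,  3.00),
    ("qwen3.5-397b-a17b", 1452, 0.39,  2.34),
    ("gpt-5.1-high",      1455, 1.25, 10.00),
    ("claude-opus-4-6",   1501, 5.00, 25.00),
]

def blended_price(price_in, price_out, in_share=0.75):
    """Assumed 3:1 input:output token mix (an illustration, not a standard)."""
    return in_share * price_in + (1.0 - in_share) * price_out

# Sort by blended dollars per Elo point; lower means better value.
by_value = sorted(rows, key=lambda r: blended_price(r[2], r[3]) / r[1])
best_value = by_value[0][0]
```

Because Elo differences near the top are small while prices span more than an order of magnitude, price dominates any such ranking: the cheap models win on value even after conceding 40-50 Elo points.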
Three models in the top 30 carry non-proprietary licenses: glm-5 (MIT, rank 21), kimi-k2.5-thinking (Modified MIT, rank 22), and qwen3.5-397b-a17b (Apache 2.0, rank 26). The gap between the best open-weights models and frontier proprietary models has shrunk to roughly 50 Elo points — compared to 200+ points just 18 months ago.
Anthropic and Google have standardized on 1M token context windows for their top models. OpenAI's gpt-5.4 reaches 1.1M. xAI's grok-4.20-beta-reasoning offers a massive 2M window. Several xAI and Chinese models don't publish context lengths — a transparency gap for enterprise buyers.
Anthropic and Google together hold six of the top ten slots, three apiece. OpenAI and xAI remain competitive, placing five and four models respectively in the top 30; xAI's grok-4.20-beta1 reaches #4, while OpenAI's best entry, gpt-5.4-high, sits at #6. The top two Claude models are statistically tied (1502 vs 1501, each ±6): essentially a dead heat.
gemini-2.5-pro (rank 29) has 103K votes — more than any other model — giving it the tightest CI of just ±3. New models like gpt-5.4-high (4,965 votes, ±9 CI) may move significantly as more votes arrive. High-vote models with ±3 CIs are the most trustworthy signal.
## Model count per organization in the top 30

- Anthropic: 9
- Google: 5
- OpenAI: 5
- xAI: 4
- Alibaba: 2
- Baidu: 2
- Bytedance: 1
- Moonshot: 1
- Z.ai: 1
Every vote on arena.ai makes the rankings more accurate. Submit prompts, compare responses, and help the community understand which AI models are truly best.
Data sourced from arena.ai, an independent platform. Rankings update continuously. CodeSOTA snapshot taken March 2026.