Live ELO Leaderboard

AI Code Arena:
Which Model Writes the Best Code?

Head-to-head ELO rankings from 201,164+ human coding battles covering real-world tasks: algorithm implementation, debugging, architecture design, and code review. Updated continuously as developers vote on model output quality.

Top Coding Model
claude-opus-4-6 (Anthropic)
ELO 1548 ±12 · 4,059 votes
30 models ranked
201K+ coding battles
10 labs represented
6 open-source models
Editorial Analysis

Why Claude Dominates Code Generation in 2026

Anthropic holds 8 of the top 30 positions and the entire top 5. Claude Opus 4.6 leads at ELO 1548, a 181-point gap over the bottom of the leaderboard; in ELO terms, that means Claude would win roughly 74% of head-to-head matchups against the rank-30 model.
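
To make that conversion concrete, here is the standard ELO expected-score formula (the same one chess uses; a minimal sketch, with ratings taken from the leaderboard below):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Claude Opus 4.6 (1548) vs. the rank-30 model (1367): a 181-point gap.
print(f"{elo_win_probability(1548, 1367):.0%}")  # 74%
```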

The edge is most visible in multi-step coding tasks: complex refactors, debugging novel errors, and generating production-quality code that handles edge cases. Developers consistently vote Claude outputs higher when correctness and robustness matter.

Instruction following: 96%
Code correctness: 93%
Edge case handling: 91%
Documentation quality: 89%

Relative preference scores from battle votes where Claude faced non-Claude models.

Full Leaderboard

Top 30 models ranked by ELO. Confidence intervals shown for statistical context.

#1 🥇 Anthropic · claude-opus-4-6 · 1548 ±12 · 4,059 votes · $5/$25 per MTok · 1M ctx · Proprietary
#2 🥈 Anthropic · claude-opus-4-6-thinking · 1546 ±12 · 3,317 votes · $5/$25 per MTok · 1M ctx · Proprietary
#3 🥉 Anthropic · claude-sonnet-4-6 · 1521 ±9 · 5,876 votes · $3/$15 per MTok · 1M ctx · Proprietary
#4 Anthropic · claude-opus-4-5-thinking · 1489 ±7 · 13,259 votes · $5/$25 per MTok · 200K ctx · Proprietary
#5 Anthropic · claude-opus-4-5 · 1465 ±7 · 13,313 votes · $5/$25 per MTok · 200K ctx · Proprietary
#6 OpenAI · gpt-5.4-high-codex · 1457 ±17 · 1,486 votes · Pricing TBA · N/A ctx · Proprietary
#7 Google · gemini-3.1-pro-preview · 1454 ±10 · 4,364 votes · $2/$12 per MTok · 1M ctx · Proprietary
#8 Z.ai · glm-5 · 1445 ±10 · 4,316 votes · $1/$3.20 per MTok · 203K ctx · MIT
#9 MiniMax · minimax-m2.7 · 1445 ±14 · 2,015 votes · $0.30/$1.20 per MTok · 205K ctx · Proprietary
#10 Z.ai · glm-4.7 · 1439 ±10 · 4,971 votes · $0.39/$1.75 per MTok · 203K ctx · MIT
#11 Google · gemini-3-pro · 1437 ±7 · 17,483 votes · $2/$12 per MTok · 1M ctx · Proprietary
#12 Google · gemini-3-flash · 1436 ±7 · 13,404 votes · $0.50/$3 per MTok · 1M ctx · Proprietary
#13 Xiaomi · mimo-v2-pro · 1436 ±16 · 1,350 votes · $1/$3 per MTok · 1M ctx · Proprietary
#14 Moonshot · kimi-k2.5-thinking · 1431 ±9 · 5,987 votes · $0.60/$3 per MTok · N/A ctx · Modified MIT
#15 OpenAI · gpt-5.4-medium-codex · 1428 ±16 · 1,574 votes · Pricing TBA · N/A ctx · Proprietary
#16 MiniMax · minimax-m2.5 · 1410 ±9 · 5,796 votes · $0.20/$1.17 per MTok · 197K ctx · Modified MIT
#17 Moonshot · kimi-k2.5-instant · 1409 ±11 · 3,632 votes · $0.45/$2.20 per MTok · 262K ctx · Modified MIT
#18 OpenAI · gpt-5.3-codex · 1409 ±12 · 2,973 votes · $1.75/$14 per MTok · 400K ctx · Proprietary
#19 OpenAI · gpt-5.2 · 1400 ±16 · 1,531 votes · $1.75/$14 per MTok · 400K ctx · Proprietary
#20 MiniMax · minimax-m2.1-preview · 1399 ±8 · 9,584 votes · $0.27/$0.95 per MTok · 197K ctx · MIT
#21 Google · gemini-3-flash-thinking · 1395 ±7 · 11,042 votes · $0.50/$3 per MTok · 1M ctx · Proprietary
#22 OpenAI · gpt-5-medium · 1392 ±12 · 3,835 votes · $1.25/$10 per MTok · 400K ctx · Proprietary
#23 Anthropic · claude-sonnet-4-5-thinking · 1389 ±6 · 16,012 votes · $3/$15 per MTok · 200K ctx · Proprietary
#24 OpenAI · gpt-5.1-medium · 1388 ±9 · 6,255 votes · $1.25/$10 per MTok · 400K ctx · Proprietary
#25 Alibaba · qwen3.5-397b-a17b · 1386 ±10 · 4,535 votes · $0.39/$2.34 per MTok · 262K ctx · Apache 2.0
#26 Anthropic · claude-sonnet-4-5 · 1386 ±6 · 17,832 votes · $3/$15 per MTok · 200K ctx · Proprietary
#27 Anthropic · claude-opus-4-1 · 1384 ±9 · 8,738 votes · $15/$75 per MTok · 200K ctx · Proprietary
#28 xAI · grok-4.20-beta-reasoning · 1373 ±14 · 1,941 votes · $2/$6 per MTok · 2M ctx · Proprietary
#29 DeepSeek · deepseek-v3.2-thinking · 1370 ±8 · 7,445 votes · $0.26/$0.38 per MTok · 164K ctx · MIT
#30 Alibaba · qwen3.5-122b-a10b · 1367 ±11 · 3,239 votes · $0.26/$2.08 per MTok · 262K ctx · Apache 2.0

ELO ratings from crowd-sourced blind coding battles. Prices in USD per million tokens (input/output). Context window in tokens.

Open Source Standouts

Freely licensed models that punch above their weight in code generation.

MIT · Rank #8
glm-5 (Z.ai)
ELO 1445 · Input $1/MTok · Context 203K

MIT-licensed. Statistically tied with gpt-5.4-high-codex (rank 6) once confidence intervals are taken into account (1445 ±10 vs. 1457 ±17). Exceptional value from a fully open model.

MIT · Rank #29
deepseek-v3.2-thinking (DeepSeek)
ELO 1370 · Input $0.26/MTok · Context 164K

MIT-licensed at $0.26/$0.38 per MTok: roughly 5% of Claude Opus 4.6's input price and under 2% of its output price. Delivers ELO 1370, suitable for most production coding tasks at minimal cost.

Apache 2.0 · Rank #25
qwen3.5-397b-a17b (Alibaba)
ELO 1386 · Input $0.39/MTok · Context 262K

Apache 2.0, fully self-hostable. The 397B MoE architecture activates only 17B params per token, delivering top-25 coding performance at a fraction of dense model costs.
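
As a rough illustration of the sparse-MoE idea (a schematic top-k router sketch, not Alibaba's actual implementation; the expert count and scores below are made up):

```python
import numpy as np

def top_k_experts(router_scores: np.ndarray, k: int = 2) -> np.ndarray:
    """Select the k highest-scoring experts for one token.

    Only the selected experts run a forward pass, so per-token compute
    tracks the active parameter count (17B here), not the total (397B).
    """
    return np.argsort(router_scores)[-k:]

# Hypothetical router scores for one token over 8 experts.
scores = np.array([0.1, 1.2, -0.3, 0.8, 2.1, 0.0, -1.0, 0.5])
print(top_k_experts(scores))  # [1 4]: two experts active, six idle
```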

Price-Performance Analysis

Getting the most ELO per dollar for production coding workloads.

Best Google value
gemini-3-flash (Google)
ELO 1436 at $3/MTok out

ELO 1436 at $3/MTok output — matches Gemini 3 Pro performance at a fraction of the cost.

Sub-$1 powerhouse
minimax-m2.1-preview (MiniMax)
ELO 1399 at $0.95/MTok out

ELO 1399 at just $0.95/MTok output, with a tight ±8 confidence interval across 9,584 battle votes.

Best overall value
claude-sonnet-4-6 (Anthropic)
ELO 1521 at $15/MTok out

ELO 1521 (rank 3 globally) at $15/MTok output. Outperforms models costing several times more, including claude-opus-4-1 at $75/MTok out.

Key Takeaway: The Efficiency Frontier

For most production coding tasks, claude-sonnet-4-6 (ELO 1521, $15/MTok out) represents the strongest cost-adjusted choice. It sits just 27 ELO points below the top-ranked Claude Opus 4.6 while undercutting it on output price by 40%.

If budget is the primary constraint, deepseek-v3.2-thinking at $0.38/MTok output is the open-source answer. MIT-licensed, self-hostable, and delivering ELO 1370 — within viable range for automated coding pipelines where perfect output quality is less critical than throughput.
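
One way to make "efficiency frontier" precise is a Pareto filter over (output price, ELO). A minimal sketch using a handful of rows from the table above:

```python
# (model, ELO, output price $/MTok), sampled from the leaderboard above.
MODELS = [
    ("claude-opus-4-6", 1548, 25.00),
    ("claude-sonnet-4-6", 1521, 15.00),
    ("gemini-3-flash", 1436, 3.00),
    ("minimax-m2.1-preview", 1399, 0.95),
    ("deepseek-v3.2-thinking", 1370, 0.38),
    ("claude-opus-4-1", 1384, 75.00),
]

def efficiency_frontier(models):
    """Drop any model that a cheaper-or-equal model beats on ELO."""
    return sorted(
        (m for m in models
         if not any(p <= m[2] and e > m[1] for _, e, p in models)),
        key=lambda m: m[2],
    )

for name, elo, price in efficiency_frontier(MODELS):
    print(f"{name}: ELO {elo} at ${price}/MTok out")
# claude-opus-4-1 is filtered out: claude-sonnet-4-6 is cheaper and stronger.
```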

Methodology

How battles work

Developers submit a coding prompt — ranging from LeetCode-style algorithms to real-world debugging scenarios. Two models respond anonymously. The voter picks the better output. Results feed into an ELO system identical to chess ratings: wins against strong opponents gain more points; losses against weaker opponents lose more.
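
A minimal sketch of that update rule (standard ELO; the K-factor of 32 is an assumption, since the arena's actual constant isn't published here):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Win probability the ELO model assigns to A over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return (new_a, new_b) after one battle; K=32 is an assumed K-factor."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

# An upset: a 1370-rated model beating a 1548-rated one gains ~24 points;
# beating an equally rated opponent would gain only 16.
print(update_ratings(1370, 1548, a_won=True))  # (~1393.5, ~1524.5)
```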

The confidence interval (±N) reflects statistical uncertainty — models with fewer battles have wider intervals. Models with CI > 15 should be treated with caution until more votes accumulate.
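
The claim that intervals shrink with vote count can be sanity-checked against the table: if the CI scales like 1/sqrt(votes), the product CI·sqrt(votes) should be roughly constant (a back-of-envelope check, not the arena's actual estimator):

```python
import math

# (votes, published ±CI) pairs taken from the leaderboard above.
rows = [(1486, 17), (4059, 12), (13259, 7), (17832, 6)]
for votes, ci in rows:
    print(votes, ci, round(ci * math.sqrt(votes)))
# Products land in a ~650-810 band, consistent with ~1/sqrt(n) scaling.
```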

What "code quality" means here

Voters rate on a holistic read: Does the code work? Is it readable? Does it handle edge cases? Is the explanation clear? This mirrors how a senior engineer reviews code, rather than unit-test pass rates alone.

Pricing reflects public API list prices at time of recording. Self-hosted or enterprise pricing may differ. "N/A" indicates access-only models without published per-token pricing.

Related Benchmarks