Live ELO Leaderboard

AI Code Arena:
Which Model Writes the Best Code?

Head-to-head ELO rankings from 201,164+ human coding battles covering real-world tasks: algorithm implementation, debugging, architecture design, and code review. Updated continuously as developers vote on model output quality.

Top Coding Model
claude-opus-4-6 (Anthropic)
ELO 1548 ±12 · 4,059 votes
30 models ranked
201K+ coding battles
10 labs represented
6 open-source models
Editorial Analysis

Why Claude Dominates Code Generation in 2026

Anthropic holds 8 of the top 30 positions and the entire top 5. Claude Opus 4.6 leads at ELO 1548, a 181-point gap over the bottom of the leaderboard; in ELO terms, that means Claude would win roughly 74% of head-to-head matchups against the rank-30 model.
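
To make that conversion concrete, here is the standard ELO expected-score formula (the same one chess uses; a minimal sketch, with ratings taken from the leaderboard below):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Claude Opus 4.6 (1548) vs. the rank-30 model (1367): a 181-point gap.
print(f"{elo_win_probability(1548, 1367):.0%}")  # 74%
```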

The edge is most visible in multi-step coding tasks: complex refactors, debugging novel errors, and generating production-quality code that handles edge cases. Developers consistently vote Claude outputs higher when correctness and robustness matter.

Instruction following: 96%
Code correctness: 93%
Edge case handling: 91%
Documentation quality: 89%

Relative preference scores from battle votes where Claude faced non-Claude models.

Full Leaderboard

Top 30 models ranked by ELO. Confidence intervals shown for statistical context.

#1 🥇 Anthropic · claude-opus-4-6 · 1548 ±12 · 4,059 votes · $5/$25 per MTok · 1M ctx · Proprietary
#2 🥈 Anthropic · claude-opus-4-6-thinking · 1546 ±12 · 3,317 votes · $5/$25 per MTok · 1M ctx · Proprietary
#3 🥉 Anthropic · claude-sonnet-4-6 · 1521 ±9 · 5,876 votes · $3/$15 per MTok · 1M ctx · Proprietary
#4 Anthropic · claude-opus-4-5-thinking · 1489 ±7 · 13,259 votes · $5/$25 per MTok · 200K ctx · Proprietary
#5 Anthropic · claude-opus-4-5 · 1465 ±7 · 13,313 votes · $5/$25 per MTok · 200K ctx · Proprietary
#6 OpenAI · gpt-5.4-high-codex · 1457 ±17 · 1,486 votes · Pricing TBA · N/A ctx · Proprietary
#7 Google · gemini-3.1-pro-preview · 1454 ±10 · 4,364 votes · $2/$12 per MTok · 1M ctx · Proprietary
#8 Z.ai · glm-5 · 1445 ±10 · 4,316 votes · $1/$3.20 per MTok · 203K ctx · MIT
#9 MiniMax · minimax-m2.7 · 1445 ±14 · 2,015 votes · $0.30/$1.20 per MTok · 205K ctx · Proprietary
#10 Z.ai · glm-4.7 · 1439 ±10 · 4,971 votes · $0.39/$1.75 per MTok · 203K ctx · MIT
#11 Google · gemini-3-pro · 1437 ±7 · 17,483 votes · $2/$12 per MTok · 1M ctx · Proprietary
#12 Google · gemini-3-flash · 1436 ±7 · 13,404 votes · $0.50/$3 per MTok · 1M ctx · Proprietary
#13 Xiaomi · mimo-v2-pro · 1436 ±16 · 1,350 votes · $1/$3 per MTok · 1M ctx · Proprietary
#14 Moonshot · kimi-k2.5-thinking · 1431 ±9 · 5,987 votes · $0.60/$3 per MTok · N/A ctx · Modified MIT
#15 OpenAI · gpt-5.4-medium-codex · 1428 ±16 · 1,574 votes · Pricing TBA · N/A ctx · Proprietary
#16 MiniMax · minimax-m2.5 · 1410 ±9 · 5,796 votes · $0.20/$1.17 per MTok · 197K ctx · Modified MIT
#17 Moonshot · kimi-k2.5-instant · 1409 ±11 · 3,632 votes · $0.45/$2.20 per MTok · 262K ctx · Modified MIT
#18 OpenAI · gpt-5.3-codex · 1409 ±12 · 2,973 votes · $1.75/$14 per MTok · 400K ctx · Proprietary
#19 OpenAI · gpt-5.2 · 1400 ±16 · 1,531 votes · $1.75/$14 per MTok · 400K ctx · Proprietary
#20 MiniMax · minimax-m2.1-preview · 1399 ±8 · 9,584 votes · $0.27/$0.95 per MTok · 197K ctx · MIT
#21 Google · gemini-3-flash-thinking · 1395 ±7 · 11,042 votes · $0.50/$3 per MTok · 1M ctx · Proprietary
#22 OpenAI · gpt-5-medium · 1392 ±12 · 3,835 votes · $1.25/$10 per MTok · 400K ctx · Proprietary
#23 Anthropic · claude-sonnet-4-5-thinking · 1389 ±6 · 16,012 votes · $3/$15 per MTok · 200K ctx · Proprietary
#24 OpenAI · gpt-5.1-medium · 1388 ±9 · 6,255 votes · $1.25/$10 per MTok · 400K ctx · Proprietary
#25 Alibaba · qwen3.5-397b-a17b · 1386 ±10 · 4,535 votes · $0.39/$2.34 per MTok · 262K ctx · Apache 2.0
#26 Anthropic · claude-sonnet-4-5 · 1386 ±6 · 17,832 votes · $3/$15 per MTok · 200K ctx · Proprietary
#27 Anthropic · claude-opus-4-1 · 1384 ±9 · 8,738 votes · $15/$75 per MTok · 200K ctx · Proprietary
#28 xAI · grok-4.20-beta-reasoning · 1373 ±14 · 1,941 votes · $2/$6 per MTok · 2M ctx · Proprietary
#29 DeepSeek · deepseek-v3.2-thinking · 1370 ±8 · 7,445 votes · $0.26/$0.38 per MTok · 164K ctx · MIT
#30 Alibaba · qwen3.5-122b-a10b · 1367 ±11 · 3,239 votes · $0.26/$2.08 per MTok · 262K ctx · Apache 2.0

ELO ratings from crowd-sourced blind coding battles. Prices in USD per million tokens (input/output). Context window in tokens.

Open Source Standouts

Freely licensed models that punch above their weight in code generation.

MIT · Rank #8
glm-5 (Z.ai)
ELO 1445 · Input $1/MTok · Context 203K

MIT-licensed. Statistically tied with gpt-5.4-high-codex (rank 6) once confidence intervals are taken into account (1445 ±10 vs. 1457 ±17). Exceptional value from a fully open model.

MIT · Rank #29
deepseek-v3.2-thinking (DeepSeek)
ELO 1370 · Input $0.26/MTok · Context 164K

MIT-licensed at $0.26/$0.38 per MTok: roughly 5% of Claude Opus 4.6's input price and under 2% of its output price. Delivers ELO 1370, suitable for most production coding tasks at minimal cost.

Apache 2.0 · Rank #25
qwen3.5-397b-a17b (Alibaba)
ELO 1386 · Input $0.39/MTok · Context 262K

Apache 2.0, fully self-hostable. The 397B MoE architecture activates only 17B params per token, delivering top-25 coding performance at a fraction of dense model costs.
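
As a rough illustration of the sparse-MoE idea (a schematic top-k router sketch, not Alibaba's actual implementation; the expert count and scores below are made up):

```python
import numpy as np

def top_k_experts(router_scores: np.ndarray, k: int = 2) -> np.ndarray:
    """Select the k highest-scoring experts for one token.

    Only the selected experts run a forward pass, so per-token compute
    tracks the active parameter count (17B here), not the total (397B).
    """
    return np.argsort(router_scores)[-k:]

# Hypothetical router scores for one token over 8 experts.
scores = np.array([0.1, 1.2, -0.3, 0.8, 2.1, 0.0, -1.0, 0.5])
print(top_k_experts(scores))  # [1 4]: two experts active, six idle
```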

Price-Performance Analysis

Getting the most ELO per dollar for production coding workloads.

Best Google value
gemini-3-flash (Google)
ELO 1436 at $3/MTok out

ELO 1436 at $3/MTok output — matches Gemini 3 Pro performance at a fraction of the cost.

Sub-$1 powerhouse
minimax-m2.1-preview (MiniMax)
ELO 1399 at $0.95/MTok out

ELO 1399 at just $0.95/MTok output, with a tight ±8 confidence interval across 9,584 battle votes.

Best overall value
claude-sonnet-4-6 (Anthropic)
ELO 1521 at $15/MTok out

ELO 1521 (rank 3 globally) at $15/MTok output. Outperforms models costing several times more, including claude-opus-4-1 at $75/MTok out.

Key Takeaway: The Efficiency Frontier

For most production coding tasks, claude-sonnet-4-6 (ELO 1521, $15/MTok out) represents the strongest cost-adjusted choice. It sits just 27 ELO points below the top-ranked Claude Opus 4.6 while undercutting it on output price by 40%.

If budget is the primary constraint, deepseek-v3.2-thinking at $0.38/MTok output is the open-source answer. MIT-licensed, self-hostable, and delivering ELO 1370 — within viable range for automated coding pipelines where perfect output quality is less critical than throughput.
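
One way to make "efficiency frontier" precise is a Pareto filter over (output price, ELO). A minimal sketch using a handful of rows from the table above:

```python
# (model, ELO, output price $/MTok), sampled from the leaderboard above.
MODELS = [
    ("claude-opus-4-6", 1548, 25.00),
    ("claude-sonnet-4-6", 1521, 15.00),
    ("gemini-3-flash", 1436, 3.00),
    ("minimax-m2.1-preview", 1399, 0.95),
    ("deepseek-v3.2-thinking", 1370, 0.38),
    ("claude-opus-4-1", 1384, 75.00),
]

def efficiency_frontier(models):
    """Drop any model that a cheaper-or-equal model beats on ELO."""
    return sorted(
        (m for m in models
         if not any(p <= m[2] and e > m[1] for _, e, p in models)),
        key=lambda m: m[2],
    )

for name, elo, price in efficiency_frontier(MODELS):
    print(f"{name}: ELO {elo} at ${price}/MTok out")
# claude-opus-4-1 is filtered out: claude-sonnet-4-6 is cheaper and stronger.
```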

Methodology

How battles work

Developers submit a coding prompt — ranging from LeetCode-style algorithms to real-world debugging scenarios. Two models respond anonymously. The voter picks the better output. Results feed into an ELO system identical to chess ratings: wins against strong opponents gain more points; losses against weaker opponents lose more.
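
A minimal sketch of that update rule (standard ELO; the K-factor of 32 is an assumption, since the arena's actual constant isn't published here):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Win probability the ELO model assigns to A over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return (new_a, new_b) after one battle; K=32 is an assumed K-factor."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

# An upset: a 1370-rated model beating a 1548-rated one gains ~24 points;
# beating an equally rated opponent would gain only 16.
print(update_ratings(1370, 1548, a_won=True))  # (~1393.5, ~1524.5)
```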

The confidence interval (±N) reflects statistical uncertainty — models with fewer battles have wider intervals. Models with CI > 15 should be treated with caution until more votes accumulate.
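
The claim that intervals shrink with vote count can be sanity-checked against the table: if the CI scales like 1/sqrt(votes), the product CI·sqrt(votes) should be roughly constant (a back-of-envelope check, not the arena's actual estimator):

```python
import math

# (votes, published ±CI) pairs taken from the leaderboard above.
rows = [(1486, 17), (4059, 12), (13259, 7), (17832, 6)]
for votes, ci in rows:
    print(votes, ci, round(ci * math.sqrt(votes)))
# Products land in a ~650-810 band, consistent with ~1/sqrt(n) scaling.
```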

What "code quality" means here

Voters rate on a holistic read: Does the code work? Is it readable? Does it handle edge cases? Is the explanation clear? This mirrors how a senior engineer reviews code, rather than unit-test pass rates alone.

Pricing reflects public API list prices at time of recording. Self-hosted or enterprise pricing may differ. "N/A" indicates access-only models without published per-token pricing.

Related Benchmarks