AI Code Arena:
Which Model Writes the Best Code?
Head-to-head ELO rankings from 201,164+ human coding battles covering real-world tasks: algorithm implementation, debugging, architecture design, and code review. Updated continuously as developers vote on model output quality.
Why Claude Dominates Code Generation in 2026
Anthropic holds 8 of the top 30 positions, including the entire top 5. Claude Opus 4.6 leads at ELO 1548, a 181-point gap over the bottom of the leaderboard; in ELO terms, that means Claude would win about 74% of head-to-head matchups against rank-30 models.
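That win-rate figure falls directly out of the standard Elo expectation formula, sketched below (generic Elo math, not the arena's own code):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 181-point gap (1548 vs 1367) implies roughly a 74% expected win rate.
print(f"{elo_win_probability(1548, 1367):.0%}")  # 74%
```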
The edge is most visible in multi-step coding tasks: complex refactors, debugging novel errors, and generating production-quality code that handles edge cases. Developers consistently vote Claude outputs higher when correctness and robustness matter.
Relative preference scores from battle votes where Claude faced non-Claude models.
Full Leaderboard
Top 30 models ranked by ELO. Confidence intervals shown for statistical context.
| # | Model | Vendor | ELO | Votes | Price (in/out) | Context | License |
|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-6 | Anthropic | 1548±12 | 4,059 | $5/$25 | 1M | Proprietary |
| 🥈 | claude-opus-4-6-thinking | Anthropic | 1546±12 | 3,317 | $5/$25 | 1M | Proprietary |
| 🥉 | claude-sonnet-4-6 | Anthropic | 1521±9 | 5,876 | $3/$15 | 1M | Proprietary |
| 4 | claude-opus-4-5-thinking | Anthropic | 1489±7 | 13,259 | $5/$25 | 200K | Proprietary |
| 5 | claude-opus-4-5 | Anthropic | 1465±7 | 13,313 | $5/$25 | 200K | Proprietary |
| 6 | gpt-5.4-high-codex | OpenAI | 1457±17 | 1,486 | N/A | N/A | Proprietary |
| 7 | gemini-3.1-pro-preview | Google | 1454±10 | 4,364 | $2/$12 | 1M | Proprietary |
| 8 | glm-5 | Z.ai | 1445±10 | 4,316 | $1/$3.20 | 203K | MIT |
| 9 | minimax-m2.7 | MiniMax | 1445±14 | 2,015 | $0.30/$1.20 | 205K | Proprietary |
| 10 | glm-4.7 | Z.ai | 1439±10 | 4,971 | $0.39/$1.75 | 203K | MIT |
| 11 | gemini-3-pro | Google | 1437±7 | 17,483 | $2/$12 | 1M | Proprietary |
| 12 | gemini-3-flash | Google | 1436±7 | 13,404 | $0.50/$3 | 1M | Proprietary |
| 13 | mimo-v2-pro | Xiaomi | 1436±16 | 1,350 | $1/$3 | 1M | Proprietary |
| 14 | kimi-k2.5-thinking | Moonshot | 1431±9 | 5,987 | $0.60/$3 | N/A | Modified MIT |
| 15 | gpt-5.4-medium-codex | OpenAI | 1428±16 | 1,574 | N/A | N/A | Proprietary |
| 16 | minimax-m2.5 | MiniMax | 1410±9 | 5,796 | $0.20/$1.17 | 197K | Modified MIT |
| 17 | kimi-k2.5-instant | Moonshot | 1409±11 | 3,632 | $0.45/$2.20 | 262K | Modified MIT |
| 18 | gpt-5.3-codex | OpenAI | 1409±12 | 2,973 | $1.75/$14 | 400K | Proprietary |
| 19 | gpt-5.2 | OpenAI | 1400±16 | 1,531 | $1.75/$14 | 400K | Proprietary |
| 20 | minimax-m2.1-preview | MiniMax | 1399±8 | 9,584 | $0.27/$0.95 | 197K | MIT |
| 21 | gemini-3-flash-thinking | Google | 1395±7 | 11,042 | $0.50/$3 | 1M | Proprietary |
| 22 | gpt-5-medium | OpenAI | 1392±12 | 3,835 | $1.25/$10 | 400K | Proprietary |
| 23 | claude-sonnet-4-5-thinking | Anthropic | 1389±6 | 16,012 | $3/$15 | 200K | Proprietary |
| 24 | gpt-5.1-medium | OpenAI | 1388±9 | 6,255 | $1.25/$10 | 400K | Proprietary |
| 25 | qwen3.5-397b-a17b | Alibaba | 1386±10 | 4,535 | $0.39/$2.34 | 262K | Apache 2.0 |
| 26 | claude-sonnet-4-5 | Anthropic | 1386±6 | 17,832 | $3/$15 | 200K | Proprietary |
| 27 | claude-opus-4-1 | Anthropic | 1384±9 | 8,738 | $15/$75 | 200K | Proprietary |
| 28 | grok-4.20-beta-reasoning | xAI | 1373±14 | 1,941 | $2/$6 | 2M | Proprietary |
| 29 | deepseek-v3.2-thinking | DeepSeek | 1370±8 | 7,445 | $0.26/$0.38 | 164K | MIT |
| 30 | qwen3.5-122b-a10b | Alibaba | 1367±11 | 3,239 | $0.26/$2.08 | 262K | Apache 2.0 |
ELO ratings from crowd-sourced blind coding battles. Prices in USD per million tokens (input/output). Context window in tokens.
Open Source Standouts
Freely licensed models that punch above their weight in code generation.
glm-5 (Z.ai): MIT-licensed. Statistically tied with gpt-5.4-high-codex (rank 6) once the overlapping confidence intervals are factored in. Exceptional value from a fully open model.
deepseek-v3.2-thinking (DeepSeek): MIT-licensed at $0.26/$0.38 per MTok, roughly 5% the cost of Claude Opus or less. Delivers ELO 1370, suitable for most production coding tasks at near-zero cost.
qwen3.5-397b-a17b (Alibaba): Apache 2.0, fully self-hostable. The 397B MoE architecture activates only 17B params per token (see the sketch below), delivering top-25 coding performance at a fraction of dense-model costs.
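The active-parameter arithmetic behind that last claim is top-k expert routing: a gate picks a few experts per token, and only those experts' weights run. Below is a generic, toy-scale sketch of the idea (illustrative only; the expert count, sizes, and gating details are made up and are not Qwen's actual architecture):

```python
import numpy as np

def topk_moe_forward(x, gate_w, experts, k=2):
    """Route one token through only k of n experts; the rest stay idle."""
    logits = x @ gate_w                       # one routing score per expert
    chosen = np.argsort(logits)[-k:]          # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                  # softmax over the chosen experts only
    # Only the selected experts execute, so only their parameters are "active".
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))

rng = np.random.default_rng(0)
d, n_experts = 16, 8                          # toy sizes, not a real config
experts = [(lambda x, W=rng.normal(size=(d, d)): x @ W) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
out = topk_moe_forward(rng.normal(size=d), gate_w, experts)   # shape (16,)
```

With 8 experts and k=2, only a quarter of the expert weights touch any given token; scaled up, the same mechanism is what lets a 397B-parameter model run per-token compute closer to that of a 17B dense model.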
Price-Performance Analysis
Getting the most ELO per dollar for production coding workloads.
gemini-3-flash: ELO 1436 at $3/MTok output, matching Gemini 3 Pro's rating at a quarter of its output price.
minimax-m2.1-preview: ELO 1399 at just $0.95/MTok output, with 9,584 battle votes behind a tight ±8 confidence interval.
claude-sonnet-4-6: ELO 1521, rank 3 globally, at $15/MTok output. Outperforms the far pricier claude-opus-4-1 ($75/MTok out) by 137 ELO points.
Key Takeaway: The Efficiency Frontier
For most production coding tasks, claude-sonnet-4-6 (ELO 1521, $15/MTok out) represents the strongest cost-adjusted choice. It sits 64 ELO points above the best non-Anthropic model (gpt-5.4-high-codex at 1457) while undercutting Claude Opus on output price by 40%.
If budget is the primary constraint, deepseek-v3.2-thinking at $0.38/MTok output is the open-source answer. MIT-licensed, self-hostable, and delivering ELO 1370 — within viable range for automated coding pipelines where perfect output quality is less critical than throughput.
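To make the frontier concrete, here is a quick check against a handful of rows from the leaderboard above (prices are the output rates from the table; "frontier" here just means no other model is both cheaper and higher-rated):

```python
# (model, ELO, output price in $/MTok), copied from the leaderboard above
models = [
    ("claude-opus-4-6",        1548, 25.00),
    ("claude-sonnet-4-6",      1521, 15.00),
    ("gemini-3-flash",         1436,  3.00),
    ("minimax-m2.1-preview",   1399,  0.95),
    ("deepseek-v3.2-thinking", 1370,  0.38),
    ("claude-opus-4-1",        1384, 75.00),   # dominated: pricier and lower-rated
]

# A model is on the efficiency frontier if no other model is both
# strictly cheaper and strictly higher-rated.
frontier = [
    (name, elo, price)
    for name, elo, price in models
    if not any(p < price and e > elo for _, e, p in models)
]
for name, elo, price in sorted(frontier, key=lambda m: m[2]):
    print(f"{name:<24} ELO {elo}  ${price:.2f}/MTok out")
```

Run on these rows, every model except claude-opus-4-1 survives the filter: each step down in price costs ELO, which is exactly what a frontier looks like.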
Methodology
How battles work
Developers submit a coding prompt, ranging from LeetCode-style algorithms to real-world debugging scenarios. Two models respond anonymously. The voter picks the better output. Results feed into an ELO rating system of the kind used in chess: wins over stronger opponents earn more points, and losses to weaker opponents cost more.
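That asymmetry comes from the standard Elo update rule, sketched below (generic Elo; the arena's actual K-factor and implementation are not published here):

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """One battle's rating update under standard Elo; k=32 is a common choice."""
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400))  # winner's expected score
    delta = k * (1.0 - expected)   # small for an expected win, large for an upset
    return winner + delta, loser - delta

print(elo_update(1548, 1400))  # favorite wins: ratings move by only ~10 points
print(elo_update(1400, 1548))  # upset: the underdog gains ~22 points
```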
The confidence interval (±N) reflects statistical uncertainty — models with fewer battles have wider intervals. Models with CI > 15 should be treated with caution until more votes accumulate.
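As a back-of-the-envelope check on that relationship, assume the interval narrows roughly with the square root of the vote count (a simplifying assumption; the arena's actual CI computation is not described here):

```python
import math

# Two leaderboard rows with similar ratings but very different volume:
votes_pro, ci_pro = 17_483, 7      # gemini-3-pro: 1437±7
votes_mimo = 1_350                 # mimo-v2-pro: listed at ±16

# Naive scaling: if each vote contributes independent noise, CI width
# should shrink roughly like 1/sqrt(votes).
ci_mimo_est = ci_pro * math.sqrt(votes_pro / votes_mimo)
print(f"predicted ±{ci_mimo_est:.0f}, listed ±16")
```

The naive scaling predicts ±25 against a listed ±16: the right direction and order of magnitude, though clearly not the arena's exact method.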
What "code quality" means here
Voters rate on a holistic read: Does the code work? Is it readable? Does it handle edge cases? Is the explanation clear? This mirrors how a senior engineer reviews code, rather than unit-test pass rates alone.
Pricing reflects public API list prices at the time of data collection. Self-hosted or enterprise pricing may differ. "N/A" indicates access-only models without published per-token pricing.