
AI Vision Arena: Multimodal Model Rankings

Community-voted Elo rankings for AI models on image understanding, visual reasoning, chart reading, and document analysis. Updated continuously from blind A/B comparisons.

20 Models Ranked · 250K+ Vision Votes · Live Elo System
#1 Vision Model: gemini-3-pro (Google)
Elo: 1290 ±8 · Input: $2.00 / 1M · Context: 1M · Votes: 13,906 · License: Proprietary

Key Insights

Google Dominates Vision

Google holds 6 of the top 20 spots (#1, #2, #4, #6, #9, and #17). The Gemini 3 family sweeps the top ranks with 1M-token context windows, and Google's lineup offers competitive pricing starting at $0.30/M input tokens.

#1 gemini-3-pro · 1290
#2 gemini-3.1-pro-preview · 1276
#4 gemini-3-flash · 1274

Open-Source Challengers

4 open-source models rank in the top 20. Alibaba's qwen3.5-397b-a17b (Apache 2.0) reaches 1237 Elo at just $0.39/M input. Moonshot's kimi-k2.5-thinking (Modified MIT) competes at 1246 Elo.

#10 kimi-k2.5-thinking · Modified MIT
#12 kimi-k2.5-instant · Modified MIT
#15 qwen3.5-397b-a17b · Apache 2.0
#20 qwen3.5-27b · Apache 2.0

Best Value Vision

qwen3.5-27b offers 1224 Elo at only $0.20/M input, making it the top open-weight option for cost-sensitive vision workloads. Gemini 3 Flash at $0.50/M delivers 1274 Elo with 1M context.

Cheapest top-20: qwen3.5-27b @ $0.20/M
Best Elo/dollar: gemini-3-flash @ 1274
Most votes: gemini-2.5-pro (#9)
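The value picks above boil down to a simple rule: set an Elo floor for your task, then take the cheapest model that clears it. A minimal sketch, assuming a hypothetical `cheapest_above` helper and using prices and Elo scores quoted on this page (model list abridged for illustration):

```python
# Illustrative only: prices/Elo copied from the leaderboard on this page.
MODELS = {
    "qwen3.5-27b":    {"elo": 1224, "input_per_m": 0.20},
    "gemini-3-flash": {"elo": 1274, "input_per_m": 0.50},
    "gemini-3-pro":   {"elo": 1290, "input_per_m": 2.00},
}

def cheapest_above(models, elo_floor):
    """Return the cheapest model whose Elo meets the floor (None if none do)."""
    qualifying = [(spec["input_per_m"], name)
                  for name, spec in models.items()
                  if spec["elo"] >= elo_floor]
    return min(qualifying)[1] if qualifying else None
```

Under these numbers, `cheapest_above(MODELS, 1270)` picks gemini-3-flash, while relaxing the floor to 1200 picks qwen3.5-27b.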

Vision Leaderboard

Elo ratings from blind A/B human preference votes on image tasks · March 2026

Rank  Model                     Vendor     License       Elo
  1   gemini-3-pro              Google     Prop.         1290 ±8
  2   gemini-3.1-pro-preview    Google     Prop.         1276 ±9
  3   gpt-5.2-chat-latest       OpenAI     Prop.         1275 ±11
  4   gemini-3-flash            Google     Prop.         1274 ±8
  5   dola-seed-2.0-preview     Bytedance  Prop.         1261 ±11
  6   gemini-3-flash-thinking   Google     Prop.         1258 ±8
  7   gpt-5.2-high              OpenAI     Prop.         1250 ±9
  8   gpt-5.1-high              OpenAI     Prop.         1248 ±8
  9   gemini-2.5-pro            Google     Prop.         1247 ±6
 10   kimi-k2.5-thinking        Moonshot   Modified MIT  1246 ±9
 11   grok-4.20-beta-reasoning  xAI        Prop.         1240 ±18
 12   kimi-k2.5-instant         Moonshot   Modified MIT  1240 ±11
 13   chatgpt-4o-latest         OpenAI     Prop.         1239 ±6
 14   gpt-5.1                   OpenAI     Prop.         1238 ±8
 15   qwen3.5-397b-a17b         Alibaba    Apache 2.0    1237 ±10
 16   gpt-5.2                   OpenAI     Prop.         1231 ±9
 17   gemini-2.5-flash-preview  Google     Prop.         1229 ±10
 18   gpt-4.5-preview           OpenAI     Prop.         1225 ±11
 19   gpt-5-chat                OpenAI     Prop.         1225 ±7
 20   qwen3.5-27b               Alibaba    Apache 2.0    1224 ±13

Elo scores reflect community human preference votes. Prices shown per 1M tokens at time of publication. Confidence intervals (±) indicate 95% CI from Bradley–Terry model.

Vendor Breakdown

Vendor     Models  Best rank  Best Elo
Google     6       #1         1290
OpenAI     8       #3         1275
Bytedance  1       #5         1261
Moonshot   2       #10        1246
xAI        1       #11        1240
Alibaba    2       #15        1237

What Gets Evaluated

Arena vision battles cover a diverse set of image understanding tasks. Human judges compare two models side-by-side and vote for the better response.

Visual Reasoning

Spatial relationships, object counting, scene understanding, and logic puzzles from images.

Chart & Graph Reading

Extract values, trends, and insights from bar charts, line graphs, scatter plots, and tables.

Document Analysis

Parse PDFs, invoices, forms, screenshots, and handwritten notes with high fidelity.

Creative & Aesthetic

Art style recognition, image captioning, style transfer prompting, and creative description.

Frequently Asked Questions

Which is the best AI for images in 2026?

As of March 2026, gemini-3-pro leads the vision arena with 1290 Elo — the highest score across 13,906 human preference votes. For cost-effective image understanding, gemini-3-flash (1274 Elo, $0.50/M input) or qwen3.5-27b (1224 Elo, $0.20/M input) offer excellent value.

How does multimodal AI comparison work in the arena?

The arena uses blind A/B testing: users are shown the same image prompt answered by two randomly selected models and vote for the better response without knowing which model produced it. Elo scores are computed using the Bradley–Terry model from these pairwise comparisons.
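The arena's exact fitting pipeline isn't published here, but the standard way to turn pairwise vote counts into Bradley–Terry strengths is the classic minorization-maximization update; the sketch below (function and variable names are illustrative) fits strengths from a win-count matrix and rescales them onto an Elo-like 400-point logistic scale.

```python
import math

def fit_bradley_terry(wins, models, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of battles model a won against model b.
    Uses the minorization-maximization update, then maps strengths
    onto an Elo-like scale (1000 + 400 * log10(strength)).
    """
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new[i] = total_wins / denom if denom else p[i]
        # Normalize so the geometric mean is 1 (pins down the free scale).
        g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        p = {m: v / g for m, v in new.items()}
    return {m: 1000 + 400 * math.log10(p[m]) for m in models}
```

For example, if model A beats model B in 75 of 100 battles, the fitted gap is about 400 · log10(3) ≈ 191 rating points, matching the Bradley–Terry win-probability formula.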

What is the best open-source multimodal model?

Alibaba's qwen3.5-397b-a17b (Apache 2.0) reaches 1237 Elo and ranks #15 overall, while qwen3.5-27b reaches 1224 Elo at just $0.20/M tokens. Moonshot's kimi-k2.5-thinking (Modified MIT) achieves 1246 Elo at #10.

Which vision model has the largest context window?

Google's Gemini models offer 1M-token context windows across the lineup (gemini-3-pro, gemini-3-flash, gemini-2.5-pro, gemini-2.5-flash-preview). xAI's grok-4.20-beta-reasoning has the largest context at 2M tokens, though it's still early with only 1,169 votes.

Which model is most cost-effective for vision tasks?

qwen3.5-27b at $0.20/M input and $1.56/M output is the cheapest ranked model. For proprietary models, gemini-2.5-flash-preview ($0.30/$2.50) offers 1229 Elo at minimal cost. Avoid gpt-4.5-preview ($75/$150) unless you need its specific capabilities.

Methodology

Elo Rating System

Scores are computed via the Bradley–Terry model on pairwise comparisons. A K-factor of 4 is used for well-established models (>1000 votes) and 32 for new entrants. Confidence intervals reflect 95% CI.
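The per-battle rating update the K-factor describes can be sketched as follows; this is a minimal illustration of the scheme stated above (K = 4 past 1,000 votes, 32 for new entrants), not the arena's actual implementation.

```python
def expected_score(r_a, r_b):
    """Logistic expected win probability for A on a 400-point scale."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, votes_a, votes_b):
    """Apply one battle result. score_a is 1 (A wins), 0 (B wins), or 0.5 (tie).

    K-factor follows the methodology above: 4 for well-established
    models (>1000 votes), 32 for new entrants.
    """
    k_a = 4 if votes_a > 1000 else 32
    k_b = 4 if votes_b > 1000 else 32
    e_a = expected_score(r_a, r_b)
    return (r_a + k_a * (score_a - e_a),
            r_b + k_b * ((1 - score_a) - (1 - e_a)))
```

With two established models at equal ratings, a win moves the victor up by only K · 0.5 = 2 points, which is why top-20 ratings are stable despite continuous voting.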

Battle Sampling

Image prompts are sampled from a curated pool covering 12 vision task categories. Models are matched by proximity in Elo to maximize information gain. Each battle is independently logged.
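One common way to implement "matched by proximity in Elo" is to weight opponent selection by a decaying function of rating distance. The sketch below is an assumption about how such sampling could work, not the arena's documented matchmaker; `tau` and the function name are illustrative.

```python
import math
import random

def sample_opponent(anchor_elo, candidates, tau=100.0, rng=random):
    """Pick an opponent with probability decaying in Elo distance.

    candidates maps model name -> Elo. tau (illustrative) controls
    how tightly battles cluster around the anchor's rating:
    smaller tau means closer matchups.
    """
    names = list(candidates)
    weights = [math.exp(-abs(candidates[n] - anchor_elo) / tau)
               for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Close matchups carry the most information per vote: when two models' ratings are near-equal, either outcome meaningfully shifts the estimate, whereas a 1290-vs-1000 battle is nearly always won by the favorite and teaches the system little.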

Data Sources

Rankings are compiled from Chatbot Arena (LMSYS), internal CodeSOTA evaluation runs, and third-party benchmark aggregators. Prices reflect provider API pricing as of March 2026.
