
AI Vision Arena: Multimodal Model Rankings

Community-voted Elo rankings for AI models on image understanding, visual reasoning, chart reading, and document analysis. Updated continuously from blind A/B comparisons.

20 Models Ranked · 250K+ Vision Votes · Live Elo System
#1 Vision Model: gemini-3-pro (Google)
Elo: 1290 ±8 · Input: $2.00 / 1M · Context: 1M · Votes: 13,906 · License: Proprietary

Key Insights

Google Dominates Vision

Google holds 6 of the top 20 spots (#1, #2, #4, #6, #9, and #17). The Gemini 3 family sweeps the top ranks with 1M-token context windows, and Google's lineup offers competitive pricing starting at $0.30/M input tokens.

#1 gemini-3-pro · 1290
#2 gemini-3.1-pro-preview · 1276
#4 gemini-3-flash · 1274

Open-Source Challengers

4 open-source models rank in the top 20. Alibaba's qwen3.5-397b-a17b (Apache 2.0) reaches 1237 Elo at just $0.39/M input. Moonshot's kimi-k2.5-thinking (Modified MIT) competes at 1246 Elo.

#10 kimi-k2.5-thinking · Modified MIT
#12 kimi-k2.5-instant · Modified MIT
#15 qwen3.5-397b-a17b · Apache 2.0
#20 qwen3.5-27b · Apache 2.0

Best Value Vision

qwen3.5-27b offers 1224 Elo at only $0.20/M input, making it the top open-weight option for cost-sensitive vision workloads. Gemini 3 Flash at $0.50/M delivers 1274 Elo with 1M context.

Cheapest top-20: qwen3.5-27b @ $0.20/M
Best Elo/dollar: gemini-3-flash @ 1274
Most votes: gemini-2.5-pro (#9)
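The value picks above boil down to a simple rule: set an Elo floor for your task, then take the cheapest model that clears it. A minimal sketch, assuming a hypothetical `cheapest_above` helper and using prices and Elo scores quoted on this page (model list abridged for illustration):

```python
# Illustrative only: prices/Elo copied from the leaderboard on this page.
MODELS = {
    "qwen3.5-27b":    {"elo": 1224, "input_per_m": 0.20},
    "gemini-3-flash": {"elo": 1274, "input_per_m": 0.50},
    "gemini-3-pro":   {"elo": 1290, "input_per_m": 2.00},
}

def cheapest_above(models, elo_floor):
    """Return the cheapest model whose Elo meets the floor (None if none do)."""
    qualifying = [(spec["input_per_m"], name)
                  for name, spec in models.items()
                  if spec["elo"] >= elo_floor]
    return min(qualifying)[1] if qualifying else None
```

Under these numbers, `cheapest_above(MODELS, 1270)` picks gemini-3-flash, while relaxing the floor to 1200 picks qwen3.5-27b.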

Vision Leaderboard

Elo ratings from blind A/B human preference votes on image tasks · March 2026

Rank  Model                     Vendor     License       Elo
  1   gemini-3-pro              Google     Prop.         1290 ±8
  2   gemini-3.1-pro-preview    Google     Prop.         1276 ±9
  3   gpt-5.2-chat-latest       OpenAI     Prop.         1275 ±11
  4   gemini-3-flash            Google     Prop.         1274 ±8
  5   dola-seed-2.0-preview     Bytedance  Prop.         1261 ±11
  6   gemini-3-flash-thinking   Google     Prop.         1258 ±8
  7   gpt-5.2-high              OpenAI     Prop.         1250 ±9
  8   gpt-5.1-high              OpenAI     Prop.         1248 ±8
  9   gemini-2.5-pro            Google     Prop.         1247 ±6
 10   kimi-k2.5-thinking        Moonshot   Modified MIT  1246 ±9
 11   grok-4.20-beta-reasoning  xAI        Prop.         1240 ±18
 12   kimi-k2.5-instant         Moonshot   Modified MIT  1240 ±11
 13   chatgpt-4o-latest         OpenAI     Prop.         1239 ±6
 14   gpt-5.1                   OpenAI     Prop.         1238 ±8
 15   qwen3.5-397b-a17b         Alibaba    Apache 2.0    1237 ±10
 16   gpt-5.2                   OpenAI     Prop.         1231 ±9
 17   gemini-2.5-flash-preview  Google     Prop.         1229 ±10
 18   gpt-4.5-preview           OpenAI     Prop.         1225 ±11
 19   gpt-5-chat                OpenAI     Prop.         1225 ±7
 20   qwen3.5-27b               Alibaba    Apache 2.0    1224 ±13

Elo scores reflect community human preference votes. Prices shown per 1M tokens at time of publication. Confidence intervals (±) indicate 95% CI from Bradley–Terry model.

Vendor Breakdown

Vendor     Models  Best rank  Best Elo
Google     6       #1         1290
OpenAI     8       #3         1275
Bytedance  1       #5         1261
Moonshot   2       #10        1246
xAI        1       #11        1240
Alibaba    2       #15        1237

What Gets Evaluated

Arena vision battles cover a diverse set of image understanding tasks. Human judges compare two models side-by-side and vote for the better response.

Visual Reasoning

Spatial relationships, object counting, scene understanding, and logic puzzles from images.

Chart & Graph Reading

Extract values, trends, and insights from bar charts, line graphs, scatter plots, and tables.

Document Analysis

Parse PDFs, invoices, forms, screenshots, and handwritten notes with high fidelity.

Creative & Aesthetic

Art style recognition, image captioning, style transfer prompting, and creative description.

Frequently Asked Questions

Which is the best AI for images in 2026?

As of March 2026, gemini-3-pro leads the vision arena with 1290 Elo — the highest score across 13,906 human preference votes. For cost-effective image understanding, gemini-3-flash (1274 Elo, $0.50/M input) or qwen3.5-27b (1224 Elo, $0.20/M input) offer excellent value.

How does multimodal AI comparison work in the arena?

The arena uses blind A/B testing: users are shown the same image prompt answered by two randomly selected models and vote for the better response without knowing which model produced it. Elo scores are computed using the Bradley–Terry model from these pairwise comparisons.
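The arena's exact fitting pipeline isn't published here, but the standard way to turn pairwise vote counts into Bradley–Terry strengths is the classic minorization-maximization update; the sketch below (function and variable names are illustrative) fits strengths from a win-count matrix and rescales them onto an Elo-like 400-point logistic scale.

```python
import math

def fit_bradley_terry(wins, models, iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] = number of battles model a won against model b.
    Uses the minorization-maximization update, then maps strengths
    onto an Elo-like scale (1000 + 400 * log10(strength)).
    """
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new[i] = total_wins / denom if denom else p[i]
        # Normalize so the geometric mean is 1 (pins down the free scale).
        g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        p = {m: v / g for m, v in new.items()}
    return {m: 1000 + 400 * math.log10(p[m]) for m in models}
```

For example, if model A beats model B in 75 of 100 battles, the fitted gap is about 400 · log10(3) ≈ 191 rating points, matching the Bradley–Terry win-probability formula.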

What is the best open-source multimodal model?

Alibaba's qwen3.5-397b-a17b (Apache 2.0) reaches 1237 Elo and ranks #15 overall, while qwen3.5-27b reaches 1224 Elo at just $0.20/M tokens. Moonshot's kimi-k2.5-thinking (Modified MIT) achieves 1246 Elo at #10.

Which vision model has the largest context window?

Google's Gemini models offer 1M-token context windows across the lineup (gemini-3-pro, gemini-3-flash, gemini-2.5-pro, gemini-2.5-flash-preview). xAI's grok-4.20-beta-reasoning has the largest context at 2M tokens, though it's still early with only 1,169 votes.

Which model is most cost-effective for vision tasks?

qwen3.5-27b at $0.20/M input and $1.56/M output is the cheapest ranked model. For proprietary models, gemini-2.5-flash-preview ($0.30/$2.50) offers 1229 Elo at minimal cost. Avoid gpt-4.5-preview ($75/$150) unless you need its specific capabilities.

Methodology

Elo Rating System

Scores are computed via the Bradley–Terry model on pairwise comparisons. A K-factor of 4 is used for well-established models (>1000 votes) and 32 for new entrants. Confidence intervals reflect 95% CI.
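The per-battle rating update the K-factor describes can be sketched as follows; this is a minimal illustration of the scheme stated above (K = 4 past 1,000 votes, 32 for new entrants), not the arena's actual implementation.

```python
def expected_score(r_a, r_b):
    """Logistic expected win probability for A on a 400-point scale."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, votes_a, votes_b):
    """Apply one battle result. score_a is 1 (A wins), 0 (B wins), or 0.5 (tie).

    K-factor follows the methodology above: 4 for well-established
    models (>1000 votes), 32 for new entrants.
    """
    k_a = 4 if votes_a > 1000 else 32
    k_b = 4 if votes_b > 1000 else 32
    e_a = expected_score(r_a, r_b)
    return (r_a + k_a * (score_a - e_a),
            r_b + k_b * ((1 - score_a) - (1 - e_a)))
```

With two established models at equal ratings, a win moves the victor up by only K · 0.5 = 2 points, which is why top-20 ratings are stable despite continuous voting.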

Battle Sampling

Image prompts are sampled from a curated pool covering 12 vision task categories. Models are matched by proximity in Elo to maximize information gain. Each battle is independently logged.
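One common way to implement "matched by proximity in Elo" is to weight opponent selection by a decaying function of rating distance. The sketch below is an assumption about how such sampling could work, not the arena's documented matchmaker; `tau` and the function name are illustrative.

```python
import math
import random

def sample_opponent(anchor_elo, candidates, tau=100.0, rng=random):
    """Pick an opponent with probability decaying in Elo distance.

    candidates maps model name -> Elo. tau (illustrative) controls
    how tightly battles cluster around the anchor's rating:
    smaller tau means closer matchups.
    """
    names = list(candidates)
    weights = [math.exp(-abs(candidates[n] - anchor_elo) / tau)
               for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Close matchups carry the most information per vote: when two models' ratings are near-equal, either outcome meaningfully shifts the estimate, whereas a 1290-vs-1000 battle is nearly always won by the favorite and teaches the system little.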

Data Sources

Rankings are compiled from Chatbot Arena (LMSYS), internal CodeSOTA evaluation runs, and third-party benchmark aggregators. Prices reflect provider API pricing as of March 2026.
