Codesota · Models · Qwen2-VL 72B · Alibaba · 18 results · 12 benchmarks

Model card

Qwen2-VL 72B.

Alibaba · open-source · Vision-Language Model · 1 current SOTA

Qwen2's large vision-language model.

Hugging Face ↗

§ 02 · Benchmarks

Every benchmark Qwen2-VL 72B has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	VQA v2.0	Multimodal · Visual Question Answering	accuracy	87.6%	#1/16	2024-09-18	source ↗
02	CC-OCR	Computer Vision · General OCR Capabilities	kie-f1	71.8%	#1/5	—	source ↗
03	CC-OCR	Computer Vision · General OCR Capabilities	document-parsing	53.8%	#2/6	—	source ↗
04	CC-OCR	Computer Vision · General OCR Capabilities	multi-scene-f1	78.0%	#2/9	—	source ↗
05	DocVQA	Computer Vision · Document Understanding	anls	96.5%	#2/21	—	source ↗
06	TextVQA	Multimodal · Visual Question Answering	accuracy	85.5%	#2/23	—	source ↗
07	CC-OCR	Computer Vision · General OCR Capabilities	multilingual-f1	71.1%	#3/8	—	source ↗
08	TextVQA	Multimodal · Visual Question Answering	accuracy	84.9%	#4/23	2024-09-18	source ↗
09	MMBench	Multimodal · Visual Question Answering	accuracy	88.0%	#8/20	2024-09-18	source ↗
10	MMBench	Multimodal · Visual Question Answering	accuracy	86.5%	#10/20	—	source ↗
11	RealWorldQA	Multimodal · Visual Question Answering	accuracy	77.8%	#10/23	—	source ↗
12	MVBench	Multimodal · Video Understanding	accuracy	73.6%	#11/20	—	source ↗
13	MMStar	Multimodal · Image-Text-to-Text	accuracy	68.3%	#18/21	—	source ↗
14	Video-MME	Multimodal · Video Understanding	accuracy	71.2%	#18/24	—	source ↗
15	MMMU	Multimodal · Visual Question Answering	accuracy	64.5%	#19/30	2024-09-18	source ↗
16	MMMU	Multimodal · Visual Question Answering	accuracy	64.5%	#19/30	—	source ↗
17	MMMU	Multimodal · Image-Text-to-Text	accuracy	64.5%	#20/36	—	source ↗
18	MMMU-Pro	Multimodal · Visual Question Answering	accuracy	46.2%	#26/31	—	source ↗

The Rank column shows this model's position against all other models scored on the same benchmark and metric (competitors after the slash). A #1 shown in red marks a current SOTA. Rows are sorted by rank, then by newest result.
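The ordering described above can be sketched in a few lines of Python. This is only an illustration, not Codesota's own code: the row dictionaries, the `parse_rank` and `sort_key` helpers, and the choice to place undated rows after dated ones at the same rank are all assumptions made to match the table as shown.

```python
import re
from datetime import date

# Matches rank cells such as "#2/21": position, then the number after the slash.
RANK_RE = re.compile(r"#(\d+)/(\d+)")

def parse_rank(cell):
    """Split a rank cell like '#2/21' into (position, number_after_slash)."""
    match = RANK_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised rank cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))

def sort_key(row):
    """Sort by rank first, then newest date; undated rows ('—') go last (assumption)."""
    position, _ = parse_rank(row["rank"])
    if row["date"] == "—":
        return (position, 1, 0)
    # Negate the ordinal so that newer dates sort earlier.
    return (position, 0, -date.fromisoformat(row["date"]).toordinal())

# A few rows copied from the table, deliberately shuffled.
rows = [
    {"benchmark": "MMMU", "rank": "#19/30", "date": "—"},
    {"benchmark": "VQA v2.0", "rank": "#1/16", "date": "2024-09-18"},
    {"benchmark": "MMMU", "rank": "#19/30", "date": "2024-09-18"},
    {"benchmark": "DocVQA", "rank": "#2/21", "date": "—"},
]

ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row["rank"], row["benchmark"], row["date"])
```

With these four rows the sort restores the table's order: VQA v2.0, DocVQA, then the dated MMMU row ahead of the undated one.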

§ 03 · Strengths by area

Where Qwen2-VL 72B actually performs.

Computer Vision · avg rank #2.0 · 1 SOTA

Multimodal · avg rank #12.8
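The per-area averages appear to be the plain mean of the rank positions in the benchmark table, grouped by area. The sketch below rests on that assumption (the page does not state the formula); the `rows` list is transcribed from the 18 table rows and `average_rank_by_area` is a hypothetical helper name.

```python
from collections import defaultdict

# (area, rank position) for each of the 18 rows in the benchmark table, in table order.
rows = [
    ("Multimodal", 1), ("Computer Vision", 1), ("Computer Vision", 2),
    ("Computer Vision", 2), ("Computer Vision", 2), ("Multimodal", 2),
    ("Computer Vision", 3), ("Multimodal", 4), ("Multimodal", 8),
    ("Multimodal", 10), ("Multimodal", 10), ("Multimodal", 11),
    ("Multimodal", 18), ("Multimodal", 18), ("Multimodal", 19),
    ("Multimodal", 19), ("Multimodal", 20), ("Multimodal", 26),
]

def average_rank_by_area(pairs):
    """Group rank positions by area and return the arithmetic mean for each."""
    grouped = defaultdict(list)
    for area, position in pairs:
        grouped[area].append(position)
    return {area: sum(ranks) / len(ranks) for area, ranks in grouped.items()}

avg_rank = average_rank_by_area(rows)
for area, value in avg_rank.items():
    print(f"{area}: avg rank #{value:.1f}")
```

Under this assumption the five Computer Vision rows average to 2.0 and the thirteen Multimodal rows to roughly 12.8, which agrees with the figures shown for each area.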

§ 04 · Papers

2 papers with results for Qwen2-VL 72B.

2024-09-18 · Multimodal · 4 results
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
2024-09-18 · 10 results
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

§ 05 · Related models

Other Alibaba models scored on Codesota.

Qwen3-235B-A22B

235B (22B active) params · 9 results · 1 SOTA

7B params · 5 results

Qwen2.5-72B-Instruct

72B params · 4 results

Qwen2.5-Coder 32B

32B params · 4 results

§ 06 · Sources & freshness

Where these numbers come from.

pwc-dump

arxiv

alphaxiv-leaderboard

cc-ocr-paper

6 of 18 rows marked verified.
