Qwen2-VL-72B-Instruct

Introduction

We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What’s New in Qwen2-VL?

Key Enhancements:

SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

Model Architecture Updates:

Naive Dynamic Resolution: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): Decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.
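
To make the M-ROPE idea more concrete, here is a minimal sketch of how position indices could be split into (temporal, height, width) components. The function names and the exact indexing convention (text tokens reuse one index for all three components, and indexing resumes after the largest index used so far) are assumptions for illustration and are not taken from this card or from the model's implementation.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width) index triple


def text_positions(start: int, length: int) -> List[Position]:
    """Text tokens: all three components share one 1D index."""
    return [(start + i, start + i, start + i) for i in range(length)]


def vision_positions(start: int, frames: int, grid_h: int, grid_w: int) -> List[Position]:
    """Visual tokens: one triple per (frame, row, column) patch.

    An image is the special case frames == 1, so its temporal index is constant.
    """
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


def next_start(positions: List[Position]) -> int:
    """Continue after the largest index used so far (an assumed convention)."""
    return max(max(p) for p in positions) + 1


# Two text tokens, then a 2x2-patch image, then one more text token.
prompt = text_positions(0, 2)
image = vision_positions(next_start(prompt), frames=1, grid_h=2, grid_w=2)
question = text_positions(next_start(prompt + image), 1)
print(prompt + image + question)
```

In this sketch a video differs from an image only in that `frames` is greater than 1, so the temporal component advances from frame to frame while the height and width components repeat.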

We have three models with 2, 8 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our Blog and GitHub.

Evaluation

Image Benchmarks

| Benchmark | Previous SoTA (Open-source LVLM) | Claude-3.5 Sonnet | GPT-4o | Qwen2-VL-72B |
| :--- | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 58.3 | 68.3 | 69.1 | 64.5 |
| DocVQA<sub>test</sub> | 94.1 | 95.2 | 92.8 | 96.5 |
| InfoVQA<sub>test</sub> | 82.0 | - | - | 84.5 |
| ChartQA<sub>test</sub> | 88.4 | 90.8 | 85.7 | 88.3 |
| TextVQA<sub>val</sub> | 84.4 | - | - | 85.5 |
| OCRBench | 852 | 788 | 736 | 877 |
| MTVQA | 17.3 | 25.7 | 27.8 | 30.9 |
| VCR<sub>en easy</sub> | 84.67 | 63.85 | 91.55 | 91.93 |
| VCR<sub>zh easy</sub> | 22.09 | 1.0 | 14.87 | 65.37 |
| RealWorldQA | 72.2 | 60.1 | 75.4 | 77.8 |
| MME<sub>sum</sub> | 2414.7 | 1920.0 | 2328.7 | 2482.7 |
| MMBench-EN<sub>test</sub> | 86.5 | 79.7 | 83.4 | 86.5 |
| MMBench-CN<sub>test</sub> | 86.3 | 80.7 | 82.1 | 86.6 |
| MMBench-V1.1<sub>test</sub> | 85.5 | 78.5 | 82.2 | 85.9 |
| MMT-Bench<sub>test</sub> | 63.4 | - | 65.5 | 71.7 |
| MMStar | 67.1 | 62.2 | 63.9 | 68.3 |
| MMVet<sub>GPT-4-Turbo</sub> | 65.7 | 66.0 | 69.1 | 74.0 |
| HallBench<sub>avg</sub> | 55.2 | 49.9 | 55.0 | 58.1 |
| MathVista<sub>testmini</sub> | 67.5 | 67.7 | 63.8 | 70.5 |
| MathVision | 16.97 | - | 30.4 | 25.9 |

Video Benchmarks

| Benchmark | Previous SoTA (Open-source LVLM) | Gemini 1.5-Pro | GPT-4o | Qwen2-VL-72B |
| :--- | :---: | :---: | :---: | :---: |
| MVBench | 69.6 | - | - | 73.6 |
| PerceptionTest<sub>test</sub> | 66.9 | - | - | 68.0 |
| EgoSchema<sub>test</sub> | 62.0 | 63.2 | 72.2 | 77.9 |
| Video-MME (wo/w subs) | 66.3/69.6 | 75.0/81.3 | 71.9/77.2 | 71.2/77.8 |

Agent Benchmarks

| | Benchmark | Metric | Previous SoTA | GPT-4o | Qwen2-VL-72B |
| :-- | :-- | :--: | :--: | :--: | :--: |
| General | FnCall[1] | TM | - | 90.2 | 93.1 |
| | | EM | - | 50.0 | 53.2 |
| Game | Number Line | SR | 89.4[2] | 91.5 | 100.0 |
| | BlackJack | SR | 40.2[2] | 34.5 | 42.6 |
| | EZPoint | SR | 50.0[2] | 85.5 | 100.0 |
| | Point24 | SR | 2.6[2] | 3.0 | 4.5 |
| Android | AITZ | TM | 83.0[3] | 70.0 | 89.6 |
| | | EM | 47.7[3] | 35.3 | 72.1 |
| AI2THOR | ALFRED<sub>valid-unseen</sub> | SR | 67.7[4] | - | 67.8 |
| | | GC | 75.3[4] | - | 75.8 |
| VLN | R2R<sub>valid-unseen</sub> | SR | 79.0 | 43.7[5] | 51.7 |
| | REVERIE<sub>valid-unseen</sub> | SR | 61.0 | 31.6[5] | 31.0 |

SR, GC, TM and EM are short for success rate, goal-condition success, type match and exact match. ALFRED is supported by SAM[6].

1. Self-Curated Function Call Benchmark by Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything
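
The card names the TM (type match) and EM (exact match) metrics but does not define how they are computed. As a rough illustration only, the sketch below assumes each predicted action is a dictionary with a `type` and `args`, that TM counts predictions whose action type equals the reference, and that EM additionally requires identical arguments. The action schema and the function name `type_and_exact_match` are hypothetical; the benchmarks' own scoring scripts may differ.

```python
from typing import Dict, List, Tuple

Action = Dict[str, object]  # hypothetical schema: {"type": str, "args": dict}


def type_and_exact_match(preds: List[Action], refs: List[Action]) -> Tuple[float, float]:
    """Return (TM, EM) as percentages under the assumed definitions."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and of equal length")
    # Type match: only the action type has to agree.
    tm_hits = sum(p["type"] == r["type"] for p, r in zip(preds, refs))
    # Exact match: the whole action (type and arguments) has to agree.
    em_hits = sum(p == r for p, r in zip(preds, refs))
    n = len(refs)
    return 100.0 * tm_hits / n, 100.0 * em_hits / n


refs = [
    {"type": "click", "args": {"x": 1}},
    {"type": "type", "args": {"text": "hi"}},
]
preds = [
    {"type": "click", "args": {"x": 2}},  # right type, wrong argument
    {"type": "type", "args": {"text": "hi"}},
]
tm, em = type_and_exact_match(preds, refs)
print(tm, em)
```

Under these assumed definitions EM can never exceed TM, which matches the pattern in the FnCall and AITZ rows of the table.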

Multilingual Benchmarks

<table style="width:75%; text-align:center;">
<tr> <th>Models</th> <td>AR </td> <td>DE </td> <td>FR </td> <td>IT </td> <td>JA </td> <td>KO </td> <td>RU </td> <td>TH </td> <td>VI </td> <td>AVG</td> </tr>
<tr> <th align="left">Qwen2-VL-72B</th> <td>20.7 </td> <td>36.5 </td> <td>44.1 </td> <td>42.8 </td> <td>21.6 </td> <td>37.4 </td> <td>15.6 </td> <td>17.7 </td> <td>41.6 </td> <td>30.9</td> </tr>
<tr> <th align="left">GPT-4o</th> <td>20.2 </td> <td>34.2 </td> <td>41.2 </td> <td>32.7 </td> <td>20.0 </td> <td>33.9 </td> <td>11.5 </td> <td>22.5 </td> <td>34.2 </td> <td>27.8</td> </tr>
<tr> <th align="left">Claude3 Opus</th> <td>15.1 </td> <td>33.4 </td> <td>40.6 </td> <td>34.4 </td> <td>19.4 </td> <td>27.2 </td> <td>13.0 </td> <td>19.5 </td> <td>29.1 </td> <td>25.7 </td> </tr>
<tr> <th align="left">Gemini Ultra</th> <td>14.7 </td> <td>32.3 </td> <td>40.0 </td> <td>31.8 </td> <td>12.3 </td> <td>17.2 </td> <td>11.8 </td> <td>20.3 </td> <td>28.6 </td> <td>23.2</td> </tr>
</table>
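
The card does not say how the AVG column is derived, but it appears to be the unweighted mean of the nine per-language scores. The quick check below uses only the numbers from the table; the variable names are illustrative.

```python
# Per-language scores copied from the table above (AR, DE, FR, IT, JA, KO, RU, TH, VI).
scores = {
    "Qwen2-VL-72B": [20.7, 36.5, 44.1, 42.8, 21.6, 37.4, 15.6, 17.7, 41.6],
    "GPT-4o": [20.2, 34.2, 41.2, 32.7, 20.0, 33.9, 11.5, 22.5, 34.2],
    "Claude3 Opus": [15.1, 33.4, 40.6, 34.4, 19.4, 27.2, 13.0, 19.5, 29.1],
    "Gemini Ultra": [14.7, 32.3, 40.0, 31.8, 12.3, 17.2, 11.8, 20.3, 28.6],
}

# Unweighted mean per model, rounded to one decimal like the AVG column.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
print(averages)
```

The Qwen2-VL-72B, GPT-4o and Claude averages also coincide with the MTVQA row of the image benchmark table, which suggests this table is the per-language breakdown of that benchmark.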

Requirements

The code for Qwen2-VL has been merged into the latest Hugging Face transformers. We advise you to build from source with the command pip install git+https://github.com/huggingface/transformers, or you might encounter the following error:

KeyError: 'qwen2_vl'
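
Since the 72B checkpoint is large, it may be worth checking the installed transformers before starting a download. The sketch below only tests whether the `Qwen2VLForConditionalGeneration` class used in the Quickstart can be found, which is a proxy for Qwen2-VL support rather than a guaranteed reproduction of the error above. The helper name `has_attr` is hypothetical.

```python
import importlib


def has_attr(module_name: str, attr: str) -> bool:
    """Return True if the module can be imported and exposes the given attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if not has_attr("transformers", "Qwen2VLForConditionalGeneration"):
    print("Qwen2-VL support not found; consider installing transformers from source.")
```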

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

```bash
pip install qwen-vl-utils
```
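
As an illustration of the base64 input mentioned above, the sketch below wraps raw image bytes into a message in the same shape as the Quickstart examples. The helper name `image_bytes_to_messages` is hypothetical, and the `data:image;base64,` prefix is an assumption about the string format the toolkit accepts; check the toolkit's documentation before relying on it.

```python
import base64
from typing import List


def image_bytes_to_messages(image_bytes: bytes, prompt: str) -> List[dict]:
    """Build a single-turn message whose image is an inline base64 string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                # Assumed format: a data URI carrying the base64 payload.
                {"type": "image", "image": f"data:image;base64,{encoded}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder bytes stand in for a real image file read with open(path, "rb").
messages = image_bytes_to_messages(b"abc", "Describe this image.")
print(messages[0]["content"][0]["image"])
```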

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# import torch
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output, then strip the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
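
The min_pixels and max_pixels comment in the snippet above implies that one visual token corresponds to a 28x28 pixel area (a 256-token floor is written as 256*28*28 pixels). Under that reading, the token count for a given image can be estimated as sketched below. The function name `estimate_visual_tokens` and the rounding scheme are assumptions for illustration; the processor's actual resizing logic may differ.

```python
import math

PATCH = 28  # pixels per token side, as implied by the 256*28*28 example


def estimate_visual_tokens(
    height: int,
    width: int,
    min_pixels: int = 4 * PATCH * PATCH,
    max_pixels: int = 16384 * PATCH * PATCH,
) -> int:
    """Estimate the visual token count after snapping the image to the pixel budget."""
    # Snap each side to the nearest multiple of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


default_tokens = estimate_visual_tokens(448, 448)
capped_tokens = estimate_visual_tokens(2800, 2800, max_pixels=1280 * PATCH * PATCH)
raised_tokens = estimate_visual_tokens(280, 560, min_pixels=256 * PATCH * PATCH)
print(default_tokens, capped_tokens, raised_tokens)
```

A tighter max_pixels trades detail for speed and memory, which is the balance the comment in the snippet refers to.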

<details> <summary>Without qwen_vl_utils</summary>

```python
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generate the output, then strip the prompt tokens from each sequence
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```

</details> <details> <summary>Multi image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
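
When the image paths come from a directory listing rather than being typed by hand, the multi-image message above can be assembled programmatically. This is a small sketch using only the standard library; the helper name `build_multi_image_messages` is hypothetical, and it simply reproduces the `file://` message shape shown above.

```python
from pathlib import Path
from typing import Iterable, List


def build_multi_image_messages(image_paths: Iterable[str], prompt: str) -> List[dict]:
    """Build one user turn with an image entry per path, followed by the text query."""
    content = [
        # as_uri() needs an absolute path and yields a file:// URI.
        {"type": "image", "image": Path(p).absolute().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_multi_image_messages(
    ["/path/to/image1.jpg", "/path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
print(messages[0]["content"])
```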

</details>

<details> <summary>Video inference</summary>

```python
# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inp
```

…

Card content reproduced from huggingface.co/Qwen/Qwen2-VL-72B-Instruct under the upstream license. Rendering trims fenced HTML, raw widgets and tables for safety; tap the link for the untouched original.
