Who leads the GQA benchmark?

AIMv2 ViT-3B/14 + Llama 3.0 8B currently leads GQA with an accuracy of 73.30.

What is the state-of-the-art score on GQA?

The state-of-the-art result on GQA is 73.30 (accuracy), achieved by AIMv2 ViT-3B/14 + Llama 3.0 8B as of 2025.

How many models are tracked on GQA?

Codesota tracks 4 models on GQA.

When was the GQA leaderboard last updated?

The most recent results on Codesota's GQA leaderboard date from January 2025; the earliest tracked result is from January 2023.

Visual Question Answering · benchmark dataset · 2019 · EN

GQA: Visual Reasoning in the Real World.

Name: GQA: Visual Reasoning in the Real World Benchmark Results
Creator: Codesota
Published: 2023-01-01
License: https://creativecommons.org/licenses/by/4.0/

22M compositional questions grounded in real images via scene graphs. Tests multi-step visual reasoning, spatial understanding, and attribute comparison.

§ 01 · Leaderboard

Best published scores.

4 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: accuracy · higher is better

#	Model	Org	Submitted	Paper / code	accuracy
01	AIMv2 ViT-3B/14 + Llama 3.0 8B	—	Nov 2024	Multimodal Autoregressive Pre-training of Large Vision E… · code	73.30
02	VideoLLaMA3 7B	—	Jan 2025	VideoLLaMA 3: Frontier Multimodal Foundation Models for … · code	64.90
03	VideoLLaMA3 2B	—	Jan 2025	VideoLLaMA 3: Frontier Multimodal Foundation Models for … · code	62.70
04	BLIP-2 ViT-g FlanT5 XXL	—	Jan 2023	BLIP-2: Bootstrapping Language-Image Pre-training with F… · code	44.70

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
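The ordering rule stated above (higher accuracy first, ties broken by submission date) can be sketched as follows. This is a minimal illustration, not Codesota's implementation: the rows are copied from the table, the day of each date is set to 1 because the table shows only month and year, and the choice that the earlier submission wins a tie is an assumption, since the page does not state the direction.

```python
from datetime import date

# Rows copied from the leaderboard table: (model, submitted, accuracy).
# The table lists only month and year, so day=1 is a placeholder (assumption).
ROWS = [
    ("VideoLLaMA3 2B", date(2025, 1, 1), 62.70),
    ("BLIP-2 ViT-g FlanT5 XXL", date(2023, 1, 1), 44.70),
    ("AIMv2 ViT-3B/14 + Llama 3.0 8B", date(2024, 11, 1), 73.30),
    ("VideoLLaMA3 7B", date(2025, 1, 1), 64.90),
]


def rank(rows):
    """Sort by accuracy, highest first; on a tie the earlier date comes first (assumed)."""
    return sorted(rows, key=lambda row: (-row[2], row[1]))


ranked = rank(ROWS)
for position, (model, submitted, accuracy) in enumerate(ranked, start=1):
    print(f"{position:02d}  {model:<32} {submitted:%Y-%m}  {accuracy:.2f}")
```

The first element of `ranked` corresponds to the shaded SOTA row.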

§ 03 · Progress

2 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy

Jan 30, 2023	BLIP-2 ViT-g FlanT5 XXL	44.70
Nov 21, 2024	AIMv2 ViT-3B/14 + Llama 3.0 8B	73.30

Fig 3 · SOTA-setting models only. 2 entries span Jan 2023 → Nov 2024.
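One way to derive the SOTA line from the full leaderboard is a running maximum over submissions in date order: an entry is kept only if it beats every earlier score. The sketch below assumes that reading of the rule. The BLIP-2 and AIMv2 dates come from the SOTA line; the VideoLLaMA3 rows are listed only as "Jan 2025", so their day is a placeholder.

```python
from datetime import date

# (model, submitted, accuracy). Day values for the VideoLLaMA3 rows are
# placeholders (assumption); the page gives only the month.
SUBMISSIONS = [
    ("BLIP-2 ViT-g FlanT5 XXL", date(2023, 1, 30), 44.70),
    ("AIMv2 ViT-3B/14 + Llama 3.0 8B", date(2024, 11, 21), 73.30),
    ("VideoLLaMA3 7B", date(2025, 1, 1), 64.90),
    ("VideoLLaMA3 2B", date(2025, 1, 1), 62.70),
]


def sota_steps(submissions):
    """Return the entries that strictly improved on the best earlier accuracy."""
    best = float("-inf")
    steps = []
    for model, submitted, accuracy in sorted(submissions, key=lambda s: s[1]):
        if accuracy > best:
            best = accuracy
            steps.append((submitted, model, accuracy))
    return steps


steps = sota_steps(SUBMISSIONS)
for submitted, model, accuracy in steps:
    print(f"{submitted:%Y-%m-%d}  {model}  {accuracy:.2f}")
```

With these rows the function returns two steps, matching the count above: the two VideoLLaMA3 results arrive after AIMv2 but score lower, so they stay in the leaderboard without setting a new record.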

§ 04 · Literature

3 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Jan 2025 · VideoLLaMA3 7B, VideoLLaMA3 2B
arXiv · Code

Multimodal Autoregressive Pre-training of Large Vision Encoders
Nov 2024 · AIMv2 ViT-3B/14 + Llama 3.0 8B
arXiv · Code

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Jan 2023 · BLIP-2 ViT-g FlanT5 XXL
arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
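The five items above could be collected in a small manifest emitted by the reproduction script. The sketch below is purely illustrative: the page does not describe a submission format, so `build_manifest` and every field name are hypothetical, and the angle-bracket values are placeholders to be filled in by the submitter.

```python
import json
import platform
import random


def build_manifest(checkpoint, commit, seed, dependencies, metrics, contact):
    """Collect the five submission items in one dict (hypothetical layout)."""
    random.seed(seed)  # fix the stdlib RNG so reruns use the declared seed
    return {
        "checkpoint": checkpoint,          # 01: public checkpoint or API endpoint
        "commit": commit,                  # 02: frozen commit of the script
        "seed": seed,                      # 02: seed used for the run
        "environment": {                   # 03: declared evaluation environment
            "python": platform.python_version(),
            "dependencies": sorted(dependencies),
        },
        "metrics": metrics,                # 04: one entry per declared metric
        "contact": contact,                # 05: follow-up contact
    }


manifest = build_manifest(
    checkpoint="<public checkpoint URL or API endpoint>",
    commit="<frozen commit hash>",
    seed=0,
    dependencies=["<package==version>"],
    metrics={"accuracy": None},  # GQA declares one metric; replace None with the measured score
    contact="<contact address>",
)
print(json.dumps(manifest, indent=2))
```

A real script would also need to seed any other random number generators it uses, such as those of its numerical or deep learning libraries.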
