GQA contains 22M compositional questions grounded in real images via scene graphs. It tests multi-step visual reasoning, spatial understanding, and attribute comparison.
Accuracy is the reported evaluation metric for GQA. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
Muted rows were not state of the art when published — an earlier or same-year result already scored better.
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | AIMv2 ViT-3B/14 + Llama 3.0 8B | unverified | 73.3 | 2024 | Paper, Code |
| 02 | VideoLLaMA3 7B | unverified | 64.9 | 2025 | Paper, Code |
| 03 | VideoLLaMA3 2B | unverified | 62.7 | 2025 | Paper, Code |
| 04 | BLIP-2 ViT-g FlanT5 XXL | unverified | 44.7 | 2023 | Paper, Code |
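
The muting rule stated above the table can be sketched in a few lines of Python. This is an illustration only, not Codesota's actual implementation: the `Result` class and the `is_muted` function are hypothetical names, and the sketch assumes results are compared by publication year alone, since the table gives no finer date.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    score: float  # GQA accuracy; higher is better
    year: int


# Rows copied from the table above.
RESULTS = [
    Result("AIMv2 ViT-3B/14 + Llama 3.0 8B", 73.3, 2024),
    Result("VideoLLaMA3 7B", 64.9, 2025),
    Result("VideoLLaMA3 2B", 62.7, 2025),
    Result("BLIP-2 ViT-g FlanT5 XXL", 44.7, 2023),
]


def is_muted(row: Result, rows: list[Result]) -> bool:
    """Return True if an earlier or same-year result already scored better.

    Assumption: only the year is known, so "earlier or same-year" is
    approximated as other.year <= row.year.
    """
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
    )


if __name__ == "__main__":
    for r in RESULTS:
        status = "muted" if is_muted(r, RESULTS) else "state of the art when published"
        print(f"{r.model}: {status}")
```

Under these assumptions, both VideoLLaMA3 rows would be muted, because the 2024 AIMv2 result (73.3) already exceeds their 2025 scores, while the BLIP-2 and AIMv2 rows each set the best known score at their publication year.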