VQA v2.0 (Visual Question Answering v2.0) is a large-scale visual question answering dataset and benchmark designed to reduce the language priors present in the original VQA dataset. It contains open-ended natural-language questions about images (primarily COCO images) whose answers require joint image and language understanding along with commonsense reasoning. To blunt language-only shortcuts, the dataset was balanced by collecting complementary images: questions are associated with pairs of similar images that yield different answers to the same question. Key statistics (official site): 204,721 COCO images, 1,105,904 questions (about 5.4 per image), and 10 ground-truth answers per question (11,059,040 answers in total). VQA v2.0 provides standard train/validation/test splits and an automatic evaluation metric for open-ended answers.
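The open-ended metric scores a predicted answer against the 10 human answers by consensus: an answer gets min(#matching humans / 3, 1), averaged over the 10 leave-one-out subsets of 9 annotators. Below is a minimal Python sketch of that consensus metric; note that the official evaluation script also normalizes answers (lowercasing, stripping punctuation and articles) before matching, which is omitted here for brevity.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """VQA consensus accuracy for one question.

    For each leave-one-out subset of 9 of the 10 human answers,
    score min(#matches / 3, 1); return the mean over the subsets.
    Answer normalization (done by the official script) is omitted.
    """
    scores = []
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == predicted for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# 7 of 10 annotators agreeing saturates the score at 1.0;
# 2 of 10 agreeing yields partial credit (0.6 here).
print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("2", ["2"] * 2 + ["3"] * 8))       # 0.6
```

The min(matches / 3, 1) rule means an answer given by at least three humans counts as fully correct, which makes the metric robust to annotator disagreement on subjective questions.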
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
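For reference, a reproduction script would typically end by dumping predictions in the standard VQA results format used by the official evaluation tools: a JSON list of question_id/answer records. A minimal sketch follows; the question ids below are placeholders, not real VQA v2.0 ids.

```python
import json

# Hypothetical predictions: question_id -> predicted answer string.
# These ids are made up for illustration.
predictions = {458752000: "yes", 458752001: "2", 458752002: "white"}

# Standard VQA results format: [{"question_id": int, "answer": str}, ...]
results = [{"question_id": qid, "answer": ans} for qid, ans in predictions.items()]

with open("vqa_v2_results.json", "w") as f:
    json.dump(results, f)
```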