TextVQA evaluates a model's ability to read and reason about text embedded in images. The full dataset contains 45,336 questions over 28,408 images with prominent scene text, pushing models beyond pure object recognition into OCR-grounded visual reasoning.
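The metric is VQA-style accuracy: each prediction is scored against the human-annotated answer variants for its question, so an answer given by only some annotators earns partial credit. Higher is better.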
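As a rough illustration of how VQA-style accuracy gives partial credit across answer variants, the sketch below uses the commonly cited simplified form, min(matches / 3, 1). The function name `vqa_accuracy` and the lowercase/strip normalization are assumptions made for this example; official evaluation scripts apply more extensive answer normalization and average over annotator subsets, so scores may differ slightly.

```python
from typing import Sequence


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA-style accuracy for one question (hypothetical helper).

    A prediction gets full credit if at least 3 annotators gave the same
    answer, and proportional credit (matches / 3) otherwise.
    """
    # Assumed minimal normalization: trim whitespace and lowercase.
    def normalize(text: str) -> str:
        return text.strip().lower()

    pred = normalize(prediction)
    matches = sum(1 for answer in human_answers if normalize(answer) == pred)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(
    predictions: Sequence[str], answer_sets: Sequence[Sequence[str]]
) -> float:
    """Average the per-question score over a set of questions."""
    if len(predictions) != len(answer_sets):
        raise ValueError("predictions and answer_sets must have equal length")
    if not predictions:
        return 0.0
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answer_sets)]
    return sum(scores) / len(scores)
```

Under this sketch, a prediction matching two of ten annotator answers scores 2/3, while one matching three or more scores 1.0.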