MTVQA is a Text-Centric Visual Question Answering (TEC-VQA) benchmark featuring high-quality human-expert annotations across nine languages (AR, DE, FR, IT, JA, KO, RU, TH, VI). It evaluates multimodal large language models on their ability to understand and answer questions about text in images across multiple languages.
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
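A reproduction script typically loads the checkpoint, runs inference over the benchmark split, and reports a single score. As a minimal sketch of the scoring step, here is one possibility assuming a case-insensitive answer-containment accuracy metric; the official MTVQA evaluator may differ, and all names here are illustrative:

```python
def score(predictions, references):
    """Fraction of examples whose reference answer appears in the prediction.

    Assumption: MTVQA-style accuracy is approximated here as case-insensitive
    substring containment; this is an illustrative metric, not the official one.
    """
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        if ref.strip().lower() in pred.strip().lower():
            hits += 1
    return hits / len(references)

if __name__ == "__main__":
    # Toy predictions vs. reference answers (hypothetical data).
    preds = ["The sign says OPEN", "2019", "rue de Rivoli"]
    refs = ["open", "2018", "Rivoli"]
    print(f"accuracy: {score(preds, refs):.3f}")
```

A real submission would replace the toy lists with model outputs over the full evaluation set, so the leaderboard run is deterministic and end to end.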