TOMATO (Temporal Reasoning Multimodal Evaluation) is a video question-answering benchmark designed to rigorously evaluate visual temporal reasoning in multimodal (video + language) foundation models. The benchmark is built around three principles proposed by the authors (Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity) that ensure questions can only be answered by reasoning over multiple frames in the correct temporal order. TOMATO covers diverse scenarios, including self-recorded human-centric interactions, gesture and interactive scenarios, and simulated scenes, organized into six reasoning task types. The paper reports 1,484 multiple-choice questions over 1,417 videos, with accuracy as the primary evaluation metric. The benchmark highlights a substantial gap between human and current model performance and is released with code and data resources in the authors' repository.
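To make the accuracy metric concrete, here is a minimal scoring sketch for multiple-choice predictions against gold answers. The JSON-lines layout and the field names (`question_id`, `answer`, `prediction`) are assumptions for illustration, not the benchmark's actual schema.

```python
import json
from pathlib import Path


def load_jsonl(path: Path, key: str) -> dict[str, str]:
    """Map question_id -> option letter taken from the given field of a JSON-lines file."""
    records = {}
    with path.open() as f:
        for line in f:
            item = json.loads(line)
            records[item["question_id"]] = item[key].strip().upper()
    return records


def accuracy(gold_path: Path, pred_path: Path) -> float:
    """Fraction of questions whose predicted option letter matches the gold answer."""
    gold = load_jsonl(gold_path, "answer")
    pred = load_jsonl(pred_path, "prediction")
    # Unanswered questions count as wrong, mirroring a strict accuracy metric.
    correct = sum(1 for qid, ans in gold.items() if pred.get(qid) == ans)
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical file names; point these at the real gold and prediction files.
    print(f"accuracy = {accuracy(Path('gold.jsonl'), Path('predictions.jsonl')):.4f}")
```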
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
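A submission would presumably pair the checkpoint with a script along the lines of the skeleton below, which walks a question file, queries a placeholder model, and writes one prediction per question. The `answer_question` stub, the `questions.jsonl` layout, and the CLI flags are hypothetical stand-ins for whatever interface the actual checkpoint and data release expose.

```python
import argparse
import json
from pathlib import Path


def answer_question(video_path: Path, question: str, options: list[str]) -> str:
    """Placeholder for the model under evaluation.

    A real reproduction script would load the submitted checkpoint here and
    return the letter of the chosen option (e.g. 'A'). This stub always picks
    the first option so the skeleton runs end to end.
    """
    return "A"


def main() -> None:
    parser = argparse.ArgumentParser(description="Toy reproduction-script skeleton.")
    parser.add_argument("--questions", type=Path, default=Path("questions.jsonl"))
    parser.add_argument("--video-dir", type=Path, default=Path("videos"))
    parser.add_argument("--out", type=Path, default=Path("predictions.jsonl"))
    args = parser.parse_args()

    with args.questions.open() as fin, args.out.open("w") as fout:
        for line in fin:
            # Assumed fields per question: question_id, video, question, options.
            item = json.loads(line)
            letter = answer_question(
                args.video_dir / item["video"], item["question"], item["options"]
            )
            fout.write(
                json.dumps({"question_id": item["question_id"], "prediction": letter}) + "\n"
            )


if __name__ == "__main__":
    main()
```

The output format matches the prediction file assumed by the scoring sketch above, so the two pieces can be chained to produce an accuracy number locally before submitting.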