PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about "what", "where", "when", and "how" in video content. The benchmark contains five task-specific subsets: FGQA (fine-grained multiple-choice QA), SGQA (smart-glasses open-ended QA), RCap (video region captioning), RTLoc (region temporal localization), and RDCap (region dense video captioning). The PerceptionLM paper states that the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning, and reports test-set sizes of ~4.3K for FGQA, ~665 for SGQA, ~10.06K for RCap, ~7.91K for RTLoc, and ~2.62K for RDCap. The evaluation metrics used in the paper are MBAcc for FGQA, LLM-judge accuracy for SGQA and RCap, SODA for RDCap, and mean Recall@1 (averaged over IoU thresholds) for RTLoc. The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; its listed row counts (for example: fgqa ~11k, rcap ~14.7k, rdcap ~5.17k, rtloc ~12.5k, sgqa 665) reflect the dataset files distributed on the hub rather than the paper's test-set sizes. License: CC BY 4.0. Modalities: video + text (QA/captions/temporal spans).
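The subsets can be pulled directly from the hub with the `datasets` library. A minimal sketch, assuming the config names match those listed on the dataset page and that the evaluation split is named `test`:

```python
# Minimal sketch: load one PLM-VideoBench subset from the Hugging Face hub.
# Config names ("fgqa", "sgqa", "rcap", "rtloc", "rdcap") follow the dataset
# page; the split name "test" is an assumption and may differ per subset.
from datasets import load_dataset

fgqa = load_dataset("facebook/PLM-VideoBench", "fgqa", split="test")
print(fgqa[0])  # inspect one QA record and its metadata fields
```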
MBAcc (multi-binary accuracy) is the evaluation metric reported for the FGQA subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
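The PerceptionLM paper describes MBAcc as multi-binary accuracy: each multiple-choice question is decomposed into binary questions, and the question counts as correct only if every binary variant is answered correctly. A minimal sketch of that grouping logic; the record layout and the `qa_uid`/`correct` field names are hypothetical:

```python
from collections import defaultdict

def mbacc(records):
    """Multi-binary accuracy sketch. `records` holds one entry per binary
    variant, with a hypothetical group id `qa_uid` and a boolean `correct`."""
    groups = defaultdict(list)
    for r in records:
        groups[r["qa_uid"]].append(r["correct"])
    # a question scores only when all of its binary variants are correct
    return sum(all(flags) for flags in groups.values()) / len(groups)
```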
Mean Recall@1, averaged over temporal IoU thresholds, is the evaluation metric reported for the RTLoc subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
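For RTLoc, a prediction counts as a hit when its temporal IoU with the ground-truth span clears a threshold, and the reported number averages Recall@1 over several thresholds. A sketch under that reading; the threshold set below is an assumption, not necessarily the paper's exact choice:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def mean_recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """preds/gts: parallel lists of (start, end) spans, one top-1 prediction
    per localization query. Recall@1 at a threshold is the fraction of
    queries whose prediction reaches that tIoU; the final score averages
    over the thresholds."""
    per_threshold = [
        sum(tiou(p, g) >= t for p, g in zip(preds, gts)) / len(gts)
        for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)
```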
SODA is the evaluation metric reported for the RDCap subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
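SODA (Fujita et al., 2020) scores dense video captioning by finding an order-preserving one-to-one matching between predicted and reference events via dynamic programming, weighting each matched pair by temporal IoU times a caption-similarity score (METEOR in the original), and reporting an F-measure. A simplified sketch of that structure, reusing the `tiou` helper from the RTLoc sketch above and accepting any `caption_sim` callable in place of METEOR:

```python
def soda_f1(preds, refs, caption_sim):
    """Simplified SODA-style F-measure. preds/refs: lists of
    ((start, end), caption) sorted by time; caption_sim: similarity
    in [0, 1]. dp[i][j] holds the best total pair score for an
    order-preserving matching of the first i preds to the first j refs."""
    n, m = len(preds), len(refs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = tiou(preds[i - 1][0], refs[j - 1][0]) * \
                   caption_sim(preds[i - 1][1], refs[j - 1][1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    precision = dp[n][m] / n if n else 0.0
    recall = dp[n][m] / m if m else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```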
Accuracy, as judged by an LLM, is the evaluation metric reported for the open-ended SGQA and RCap subsets of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
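For these open-ended subsets, accuracy is determined by an LLM judge that compares each model answer against the reference. A minimal sketch of that loop; the prompt wording, the verdict format, and the `call_llm` callable are all illustrative assumptions rather than the benchmark's official judge setup:

```python
JUDGE_PROMPT = """You are grading an answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: correct or incorrect."""

def llm_judge_accuracy(samples, call_llm):
    """samples: dicts with hypothetical `question`, `reference`, and
    `candidate` fields. call_llm: any callable that sends a prompt to a
    judge model and returns its text reply."""
    hits = 0
    for s in samples:
        verdict = call_llm(JUDGE_PROMPT.format(**s))
        hits += verdict.strip().lower().startswith("correct")
    return hits / len(samples)
```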