Codesota · Benchmark · PLM-VideoBench

PLM-VideoBench

PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about "what", "where", "when", and "how" in video content.

The benchmark contains five task-specific subsets, with the paper's reported test-set sizes and metrics:

- FGQA (fine-grained multiple-choice QA): ~4.3K items; metric: MBAcc
- SGQA (smart-glasses open-ended QA): ~665 items; metric: LLM-judge accuracy
- RCap (video region captioning): ~10.06K items; metric: LLM-judge accuracy
- RTLoc (region temporal localization): ~7.91K items; metric: mean Recall@1, averaged over IoU thresholds
- RDCap (region dense video captioning): ~2.62K items; metric: SODA

The PerceptionLM paper states that the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning. The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; the row counts it lists (for example: fgqa ~11k, rcap ~14.7k, rdcap ~5.17k, rtloc ~12.5k, sgqa 665) reflect the distributed dataset files on the Hub and therefore differ from the paper's test-set sizes.

License: CC BY 4.0. Modalities: video + text (QA, captions, temporal spans).

Paper Leaderboard
§ 01 · SOTA history

Year over year.

Not enough data to show trend.
§ 02 · Leaderboard

Results by metric.

Only one model is currently listed on this benchmark.

MBAcc

MBAcc (multi-binary accuracy) is the metric reported for the FGQA subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
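As a rough illustration (an assumption based on the paper's description of multi-binary accuracy, not the official PLM scoring code): each multiple-choice question is expanded into several binary sub-questions, and a question is credited only when all of its sub-questions are answered correctly.

```python
# Hypothetical sketch of multi-binary accuracy (MBAcc); NOT the official
# PLM-VideoBench scorer. Each original multiple-choice question is assumed
# to map to a list of booleans, one per binary sub-question.

def mbacc(per_question_binary_results):
    """per_question_binary_results: list of lists of bools, one inner
    list per original multiple-choice question. A question counts as
    correct only if every binary sub-question is answered correctly."""
    if not per_question_binary_results:
        return 0.0
    correct = sum(all(subs) for subs in per_question_binary_results)
    return correct / len(per_question_binary_results)
```

For example, `mbacc([[True, True], [True, False], [True]])` credits only the first and third questions, giving 2/3.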

Higher is better

Trust tiers: verified · paper · vendor · community · unverified
Rank | Model | Trust | Score | Year | Source
01 | PLM (8B) | paper | 67.7 | N/A | Source ↗
dataset: PLM-VideoBench; task: 9

Mean Recall@1

Mean Recall@1 is the metric reported for the RTLoc (region temporal localization) subset of PLM-VideoBench: the top-1 predicted temporal segment is scored against the ground truth at several IoU thresholds, and recall is averaged over those thresholds. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
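As a hedged sketch (a common temporal-localization formulation, assumed here rather than taken from the paper's evaluation code, with illustrative thresholds 0.3 to 0.9): a prediction counts at a threshold when its temporal IoU with the ground truth reaches that threshold, and the per-threshold recalls are averaged.

```python
# Hypothetical sketch of mean Recall@1 over temporal IoU thresholds;
# NOT the official PLM-VideoBench scorer. Segments are (start, end) times.

def temporal_iou(a, b):
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])  # hull == union when overlapping
    return inter / union if union > 0 else 0.0

def mean_recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """preds/gts: parallel lists of (start, end) segments, one top-1
    prediction per ground-truth instance. Returns recall averaged
    over the IoU thresholds."""
    recalls = []
    for t in thresholds:
        hits = sum(temporal_iou(p, g) >= t for p, g in zip(preds, gts))
        recalls.append(hits / len(gts))
    return sum(recalls) / len(recalls)
```

Note that when the two segments are disjoint the intersection is zero, so using the hull in the denominator still yields an IoU of 0.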

Higher is better

Rank | Model | Trust | Score | Year | Source
01 | PLM (8B) | paper | 59.1 | N/A | Source ↗
dataset: PLM-VideoBench; task: 9

SODA

SODA is the metric reported for the RDCap (region dense video captioning) subset of PLM-VideoBench; it jointly scores caption quality and temporal alignment of generated dense captions against the ground truth. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Rank | Model | Trust | Score | Year | Source
01 | PLM (8B) | paper | 52.8 | N/A | Source ↗
dataset: PLM-VideoBench; task: 9

Accuracy

Accuracy here refers to LLM-judge accuracy, the metric reported for the open-ended SGQA and RCap subsets of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Rank | Model | Trust | Score | Year | Source
01 | PLM (8B) | paper | 46.6 | N/A | Source ↗
dataset: PLM-VideoBench; task: 9