PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about "what", "where", "when", and "how" in video content. The benchmark contains five task-specific subsets: FGQA (fine-grained multiple-choice QA), SGQA (smart-glasses open-ended QA), RCap (video region captioning), RTLoc (region temporal localization), and RDCap (region dense video captioning). The PerceptionLM paper states that the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning, and reports test-set sizes of ~4.3K for FGQA, ~665 for SGQA, ~10.06K for RCap, ~7.91K for RTLoc, and ~2.62K for RDCap. The evaluation metrics used in the paper are MBAcc for FGQA, LLM-judge accuracy for SGQA and RCap, SODA for RDCap, and mean Recall@1 (averaged over IoU thresholds) for RTLoc. The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; its listed row counts (for example: fgqa ~11k, rcap ~14.7k, rdcap ~5.17k, rtloc ~12.5k, sgqa 665) reflect the dataset files distributed on the hub rather than the paper's test-set sizes. License: CC BY 4.0. Modalities: video + text (QA/captions/temporal spans).
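The subsets can be pulled directly from the hub with the `datasets` library. A minimal sketch, assuming the config names match those listed on the dataset page and that the evaluation split is named `test`:

```python
# Minimal sketch: load one PLM-VideoBench subset from the Hugging Face hub.
# Config names ("fgqa", "sgqa", "rcap", "rtloc", "rdcap") follow the dataset
# page; the split name "test" is an assumption and may differ per subset.
from datasets import load_dataset

fgqa = load_dataset("facebook/PLM-VideoBench", "fgqa", split="test")
print(fgqa[0])  # inspect one QA record and its metadata fields
```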
MBAcc (multi-binary accuracy) is the evaluation metric reported for the FGQA subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
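The PerceptionLM paper describes MBAcc as multi-binary accuracy: each multiple-choice question is decomposed into binary questions, and the question counts as correct only if every binary variant is answered correctly. A minimal sketch of that grouping logic; the record layout and the `qa_uid`/`correct` field names are hypothetical:

```python
from collections import defaultdict

def mbacc(records):
    """Multi-binary accuracy sketch. `records` holds one entry per binary
    variant, with a hypothetical group id `qa_uid` and a boolean `correct`."""
    groups = defaultdict(list)
    for r in records:
        groups[r["qa_uid"]].append(r["correct"])
    # a question scores only when all of its binary variants are correct
    return sum(all(flags) for flags in groups.values()) / len(groups)
```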
Mean Recall@1, averaged over temporal IoU thresholds, is the evaluation metric reported for the RTLoc subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
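For RTLoc, a prediction counts as a hit when its temporal IoU with the ground-truth span clears a threshold, and the reported number averages Recall@1 over several thresholds. A sketch under that reading; the threshold set below is an assumption, not necessarily the paper's exact choice:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def mean_recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """preds/gts: parallel lists of (start, end) spans, one top-1 prediction
    per localization query. Recall@1 at a threshold is the fraction of
    queries whose prediction reaches that tIoU; the final score averages
    over the thresholds."""
    per_threshold = [
        sum(tiou(p, g) >= t for p, g in zip(preds, gts)) / len(gts)
        for t in thresholds
    ]
    return sum(per_threshold) / len(per_threshold)
```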
SODA is the evaluation metric reported for the RDCap subset of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
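SODA (Fujita et al., 2020) scores dense video captioning by finding an order-preserving one-to-one matching between predicted and reference events via dynamic programming, weighting each matched pair by temporal IoU times a caption-similarity score (METEOR in the original), and reporting an F-measure. A simplified sketch of that structure, reusing the `tiou` helper from the RTLoc sketch above and accepting any `caption_sim` callable in place of METEOR:

```python
def soda_f1(preds, refs, caption_sim):
    """Simplified SODA-style F-measure. preds/refs: lists of
    ((start, end), caption) sorted by time; caption_sim: similarity
    in [0, 1]. dp[i][j] holds the best total pair score for an
    order-preserving matching of the first i preds to the first j refs."""
    n, m = len(preds), len(refs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = tiou(preds[i - 1][0], refs[j - 1][0]) * \
                   caption_sim(preds[i - 1][1], refs[j - 1][1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    precision = dp[n][m] / n if n else 0.0
    recall = dp[n][m] / m if m else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```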
Accuracy, as judged by an LLM, is the evaluation metric reported for the open-ended SGQA and RCap subsets of PLM-VideoBench. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
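For these open-ended subsets, accuracy is determined by an LLM judge that compares each model answer against the reference. A minimal sketch of that loop; the prompt wording, the verdict format, and the `call_llm` callable are all illustrative assumptions rather than the benchmark's official judge setup:

```python
JUDGE_PROMPT = """You are grading an answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: correct or incorrect."""

def llm_judge_accuracy(samples, call_llm):
    """samples: dicts with hypothetical `question`, `reference`, and
    `candidate` fields. call_llm: any callable that sends a prompt to a
    judge model and returns its text reply."""
    hits = 0
    for s in samples:
        verdict = call_llm(JUDGE_PROMPT.format(**s))
        hits += verdict.strip().lower().startswith("correct")
    return hits / len(samples)
```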