Video-Language Models · benchmark dataset · EN

PLM-VideoBench (PerceptionLM Video Benchmark).

PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about “what”, “where”, “when”, and “how” in video content. The benchmark contains multiple task-specific subsets: FGQA (fine-grained multiple-choice QA), SGQA (smart-glasses open-ended QA), RCap (video region captioning), RTLoc (region temporal localization), and RDCap (region dense video captioning).

The PerceptionLM paper states that the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning; it reports test-set sizes of roughly 4.3K for FGQA, 665 for SGQA, 10.06K for RCap, 7.91K for RTLoc, and 2.62K for RDCap. Evaluation metrics used in the paper include MBAcc for FGQA, LLM-judge accuracy for SGQA and RCap, SODA for RDCap, and mean Recall@1 (averaged over IoU thresholds) for RTLoc.

The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; the HF page lists subset row counts (for example: fgqa ~11k rows, rcap ~14.7k rows, rdcap ~5.17k rows, rtloc ~12.5k rows, sgqa 665 rows), which reflect the dataset files distributed on the hub. License: CC BY 4.0. Modalities: video + text (QA, captions, temporal spans).
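A minimal sketch of pulling one subset from the hub with the Hugging Face datasets library is shown below; the config name "fgqa" and the split name "test" are assumptions inferred from the subset names above, so check the dataset card for the exact identifiers before relying on them.

    # Minimal sketch: load one PLM-VideoBench subset from the Hugging Face hub.
    # The config name ("fgqa") and split name ("test") are assumptions inferred
    # from the subset names above; verify them against the facebook/PLM-VideoBench
    # dataset card.
    from datasets import load_dataset

    fgqa = load_dataset("facebook/PLM-VideoBench", "fgqa", split="test")
    print(fgqa)            # schema and row count of the subset
    print(fgqa[0].keys())  # field names differ per subset; inspect before use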

§ 01 · Leaderboard

Best published scores.

4 results indexed across 4 metrics. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: Accuracy · higher is better
All metrics: Accuracy, MBAcc, Mean Recall@1, SODA

Accuracy · primary · 1 row
  #   Model     Org  Submitted  Paper / code                                                       Accuracy
  01  PLM (8B)  –    Apr 2025   PerceptionLM: Open-Access Data and Models for Detailed V… · code   46.60

MBAcc · 1 row
  #   Model     Org  Submitted  Paper / code                                                       MBAcc
  01  PLM (8B)  –    Apr 2025   PerceptionLM: Open-Access Data and Models for Detailed V… · code   67.70

Mean Recall@1 · 1 row
  #   Model     Org  Submitted  Paper / code                                                       Mean Recall@1
  01  PLM (8B)  –    Apr 2025   PerceptionLM: Open-Access Data and Models for Detailed V… · code   59.10

SODA · 1 row
  #   Model     Org  Submitted  Paper / code                                                       SODA
  01  PLM (8B)  –    Apr 2025   PerceptionLM: Open-Access Data and Models for Detailed V… · code   52.80
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
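Of the metrics above, mean Recall@1 (used for RTLoc) is the least self-explanatory: the paper averages Recall@1 over temporal IoU thresholds. A minimal sketch of that computation follows; the threshold set and the one-prediction-per-query setup are assumptions for illustration, not the benchmark's official evaluation code.

    # Minimal sketch of mean Recall@1 over temporal IoU thresholds for RTLoc-style
    # localization. The threshold list and the single-prediction-per-query setup
    # are illustrative assumptions; consult the PerceptionLM paper and its eval
    # code for the exact protocol.
    def temporal_iou(pred, gt):
        """IoU between two (start, end) segments in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def mean_recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7, 0.9)):
        """preds/gts: parallel lists of (start, end) spans, one prediction per query."""
        per_threshold = []
        for t in thresholds:
            hits = sum(temporal_iou(p, g) >= t for p, g in zip(preds, gts))
            per_threshold.append(hits / len(gts))
        return sum(per_threshold) / len(per_threshold)

    # IoU of 0.75 passes thresholds 0.3/0.5/0.7 and fails 0.9, so the mean is 0.75.
    print(mean_recall_at_1([(1.0, 4.0)], [(0.5, 4.5)]))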
§ 03 · Progress

1 step of state of the art.

Each row below marks a model that broke the previous record on Accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
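In other words, the progress chart is a running maximum over submissions walked in date order. A minimal sketch of that filter is below; the (date, model, score) tuple layout is hypothetical, not Codesota's internal format.

    # Minimal sketch of the SOTA-line filter: keep only entries that beat the best
    # score seen so far when submissions are walked in date order. The tuple layout
    # is hypothetical, for illustration only.
    def sota_line(submissions):
        """submissions: iterable of (date, model, score), assumed sorted by date."""
        best = float("-inf")
        records = []
        for date, model, score in submissions:
            if score > best:          # strict improvement breaks the record
                best = score
                records.append((date, model, score))
        return records

    print(sota_line([("Apr 17, 2025", "PLM (8B)", 46.60)]))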

SOTA line · Accuracy
  1. Apr 17, 2025 · PLM (8B) · 46.60
Fig 3 · SOTA-setting models only. 1 entry, dated Apr 2025.
§ 04 · Literature

1 paper tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

  • PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
    Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer
    Apr 2025 · PLM (8B)
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed (sketched below)
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
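As a hedged illustration of items 02–04 above, a minimal reproduction-script skeleton follows. The results.json layout, field names, and placeholder values are illustrative assumptions, not the benchmark's or Codesota's official submission format; the submission guide is authoritative.

    # Hypothetical reproduction-script skeleton for a PLM-VideoBench submission.
    # The results.json layout and field names are illustrative assumptions; follow
    # the submission guide for the authoritative requirements.
    import json
    import platform

    SEED = 0                                    # fix and report the seed actually used
    COMMIT = "<frozen git commit hash>"         # pin the exact code revision
    CHECKPOINT = "<public checkpoint or API endpoint>"

    # One row per metric declared by this dataset; replace the stub values with the
    # numbers produced by your frozen evaluation run.
    RESULTS = {
        "Accuracy": None,
        "MBAcc": None,
        "Mean Recall@1": None,
        "SODA": None,
    }

    report = {
        "checkpoint": CHECKPOINT,
        "commit": COMMIT,
        "seed": SEED,
        "environment": {"python": platform.python_version()},
        "contact": "<email>",
        "results": RESULTS,
    }

    with open("results.json", "w") as f:
        json.dump(report, f, indent=2)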