Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation, which operates on a single static image, video segmentation must also track objects consistently across the frames of a sequence.
Video segmentation is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
DAVIS
DAVIS (Densely Annotated VIdeo Segmentation) / DAVIS 2017
DAVIS (Densely Annotated VIdeo Segmentation) is a high-quality video object segmentation benchmark providing per-frame, pixel-accurate ground-truth masks for video sequences. The original DAVIS release (Perazzi et al., CVPR 2016) contains 50 high-resolution (Full HD) video sequences with dense annotations intended to benchmark video object segmentation algorithms. The DAVIS challenge was extended in 2017 (Pont-Tuset et al.) to DAVIS 2017, increasing the dataset size and introducing multi-object sequences and a public challenge/benchmark (DAVIS17: ~150 videos, commonly split into train/val/test sets). The standard evaluation metrics for DAVIS are region similarity (J), contour accuracy (F), and the combined J&F measure (the mean of the two). DAVIS is widely used for semi-supervised video object segmentation (first-frame annotation propagation) as well as for unsupervised and tracking-related tasks.
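The J and F measures above can be sketched for a single frame as follows. This is a simplified illustration, not the official DAVIS evaluation code: J is the intersection-over-union of the binary masks, and F is approximated here as a boundary precision/recall F-measure with a small pixel tolerance (the official toolkit matches contour maps in a similar spirit but with a more careful matching step).

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

def _erode(m):
    # 4-neighbour binary erosion via padding and logical AND of shifts
    p = np.pad(m, 1, constant_values=False)
    return p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]

def _dilate(m):
    # 4-neighbour binary dilation, the dual of _erode
    p = np.pad(m, 1, constant_values=False)
    return p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]

def boundary_f(pred, gt, tol=1):
    """Simplified contour accuracy F: precision/recall of boundary pixels,
    where a boundary pixel counts as matched if it lies within `tol`
    pixels of the other mask's boundary."""
    pb = pred & ~_erode(pred)   # boundary pixels of the prediction
    gb = gt & ~_erode(gt)       # boundary pixels of the ground truth
    pb_zone, gb_zone = pb.copy(), gb.copy()
    for _ in range(tol):        # tolerance zone around each boundary
        pb_zone, gb_zone = _dilate(pb_zone), _dilate(gb_zone)
    prec = (pb & gb_zone).sum() / max(pb.sum(), 1)
    rec = (gb & pb_zone).sum() / max(gb.sum(), 1)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def j_and_f(pred, gt):
    """Per-frame J&F: the mean of region similarity and contour accuracy."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```

On DAVIS the reported score is this quantity averaged over all annotated frames and objects of the split.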
No results tracked yet
YouTube-VOS
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
YouTube-VOS (YouTube Video Object Segmentation) is the first large-scale benchmark for video object segmentation, released by Ning Xu et al. in 2018. It targets semi-supervised video object segmentation (given the first-frame mask, segment the same object(s) in all subsequent frames) and provides a much larger and more diverse training set than earlier VOS datasets (e.g., DAVIS). The benchmark contains several thousand high-resolution YouTube clips with dense, high-quality pixel-level annotations sampled at 6 fps (annotations for every 5th frame). Commonly reported statistics for the 2018 release: ~4,453 videos (split into train/val/test sets of 3,471 / 474 / 508), >7,800 unique object instances, and ~190k manual annotations. The dataset has been used for multiple VOS tasks (semi-supervised VOS, video instance segmentation, referring VOS) and supports evaluation protocols with unseen-category validation/test splits to test generalization.
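The semi-supervised protocol described above can be sketched as a propagation loop. This is a minimal illustration of the task interface, not any particular method: `model` is a hypothetical callable that predicts the next mask from the previous frame, previous mask, and current frame; with no model, it falls back to the trivial zero-motion baseline that simply copies the first-frame mask forward.

```python
import numpy as np

def propagate_masks(frames, first_mask, model=None):
    """Semi-supervised VOS: given a clip's frames and the ground-truth
    mask for frame 0, produce one predicted mask per frame.

    `model(prev_frame, prev_mask, frame) -> mask` is a hypothetical
    single-step propagation function; model=None uses the zero-motion
    baseline (copy the previous mask unchanged).
    """
    masks = [first_mask]
    for t in range(1, len(frames)):
        if model is None:
            masks.append(masks[-1].copy())  # zero-motion baseline
        else:
            masks.append(model(frames[t - 1], masks[-1], frames[t]))
    return masks
```

The predicted masks are then scored against the ground truth with the usual J&F measure; on the validation/test servers, the unseen-category split probes how well propagation generalizes to object classes absent from training.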
No results tracked yet
MOSE
coMplex video Object SEgmentation (MOSE)
MOSE (coMplex video Object SEgmentation) is a video object segmentation (VOS) dataset introduced to study VOS under complex, realistic scenes where target objects are often small, inconspicuous, heavily occluded, disappear/reappear, or occur in crowded environments. MOSE contains 2,149 video clips with 5,200 target objects and 431,725 high-quality per-frame object segmentation masks (videos are typically 1920×1080 and 5–60 seconds long). The dataset was created to benchmark tracking-and-segmentation robustness in challenging scenarios; standard VOS metrics such as the J&F (region similarity J and contour accuracy F) are used for evaluation. The dataset and benchmark were published in the ICCV 2023 paper "MOSE: A New Dataset for Video Object Segmentation in Complex Scenes" (arXiv:2302.01872) and have an associated project/competition site (MOSE challenge / eval servers).
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with an open vocabulary: detecting objects described by arbitrary text queries, rather than being limited to a fixed, closed set of categories.
Object counting
Object counting is a computer vision task that identifies and enumerates distinct objects within digital images and videos. Models must distinguish objects of different types, sizes, and shapes, even in crowded or dynamically changing scenes. A typical pipeline first detects and localizes objects with a deep learning model such as a convolutional neural network (CNN), then aggregates the detections into a total count. The technique is applied in fields like manufacturing for quality control and production monitoring.
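The detect-then-aggregate pipeline above can be sketched in a few lines. This is an illustrative sketch, assuming a detector that emits `(class_name, confidence, bbox)` tuples (the names and tuple layout are hypothetical, not a specific library's API); counting then reduces to filtering by confidence and tallying per class.

```python
from collections import Counter

def count_objects(detections, conf_threshold=0.5):
    """Aggregate raw detector output into per-class object counts.

    `detections`: list of (class_name, confidence, bbox) tuples, as a
    typical CNN detector might produce after post-processing.
    Detections below `conf_threshold` are discarded before counting.
    """
    kept = [cls for cls, conf, _bbox in detections if conf >= conf_threshold]
    return Counter(kept)
```

In practice a deduplication step such as non-maximum suppression would run before this aggregation, so that overlapping boxes on the same object are not counted twice.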
OCR
OCR, or Optical Character Recognition, is the task of converting an image containing text into machine-readable, editable, and searchable digital text. Typical inputs are scanned documents, photos, or image-only PDFs; once recognized, the text can be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for a banking app, translating signs with Google Translate, and creating searchable archives from old documents.