
Video classification

Video classification is the task of assigning videos to predefined categories. It involves analyzing temporal sequences of frames to understand the content and assign appropriate labels to entire video clips.

6 datasets · 13 results

Video classification assigns a single action or event label to a video clip. It is the temporal extension of image classification and requires models to understand motion, temporal structure, and scene dynamics. Kinetics-400 top-1 accuracy has climbed from about 62% (two-stream CNNs) to 92.1% (InternVideo2), driven by the shift from 3D convolutions to video transformers pretrained on massive datasets.

History

2014

Two-Stream Networks (Simonyan & Zisserman) process RGB and optical flow separately, showing that motion features are critical — 88% on UCF-101

2015

C3D (Tran et al.) applies 3D convolutions to learn spatiotemporal features directly from video, avoiding precomputed optical flow

2017

Kinetics-400 dataset (DeepMind) provides 300K video clips across 400 action classes, replacing the saturated UCF-101 as the primary benchmark

2017

I3D (Carreira & Zisserman) inflates pretrained 2D ImageNet convolutions into 3D and, after pretraining on Kinetics, reaches 80.9% on HMDB-51 and 98.0% on UCF-101

2019

SlowFast Networks (Feichtenhofer et al.) process video at two temporal rates — slow (spatial detail) and fast (motion) — reaching 79.8% on Kinetics-400

2021

TimeSformer and ViViT apply transformers to video, using divided space-time attention (TimeSformer) and factorized spatial and temporal encoders (ViViT), matching 3D CNNs

2022

VideoMAE introduces masked autoencoding for video pretraining, achieving strong results with 90% tube masking — showing video is more temporally redundant than expected

2023

InternVideo (Shanghai AI Lab) unifies video understanding via multimodal pretraining on video-text pairs, reaching 91.1% on Kinetics-400

2024

InternVideo2 scales to 6B parameters with progressive training from image→video→video-text, achieving 92.1% on Kinetics-400

How Video Classification Works

Video classification pipeline: Temporal Sampling → Spatiotemporal Encoding → Temporal Aggregation → Classification Head → Evaluation

1

Temporal Sampling

Videos are sampled into T frames (typically 8-32) using uniform sampling or segment-based strategies. Higher temporal resolution captures faster actions, but the token count grows linearly with T, so compute grows quadratically for models that use joint space-time self-attention.
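
The two sampling strategies can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's loader: the function names `uniform_indices` and `segment_indices` are hypothetical, and the segment-based variant assumes one random frame is drawn per equal-length segment.

```python
import random


def uniform_indices(num_frames: int, t: int) -> list[int]:
    """Pick t evenly spaced frame indices (the centre of each of t equal segments)."""
    if num_frames <= 0 or t <= 0:
        raise ValueError("num_frames and t must be positive")
    seg = num_frames / t
    return [min(num_frames - 1, int(seg * i + seg / 2)) for i in range(t)]


def segment_indices(num_frames: int, t: int, rng: random.Random) -> list[int]:
    """Split the video into t equal segments and draw one random frame from each."""
    if num_frames <= 0 or t <= 0:
        raise ValueError("num_frames and t must be positive")
    seg = num_frames / t
    indices = []
    for i in range(t):
        start = int(seg * i)
        end = max(start + 1, int(seg * (i + 1)))  # exclusive bound, at least one frame
        indices.append(min(num_frames - 1, rng.randrange(start, end)))
    return indices


# A 10-second clip at 30 FPS has 300 frames; sample T = 8 of them.
print(uniform_indices(300, 8))
print(segment_indices(300, 8, random.Random(0)))
```

Uniform sampling is deterministic and suits evaluation, while the randomized segment variant acts as temporal augmentation during training.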

2

Spatiotemporal Encoding

3D CNNs (I3D, SlowFast) apply convolutions across space and time jointly. Video transformers (TimeSformer, VideoMAE) tokenize frames into patches and apply space-time attention. Some architectures process 2D features per frame then aggregate temporally.
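
To make the cost of space-time attention concrete, the sketch below counts tokens and query-key pairs for one attention layer. It assumes 16×16 patches, ignores any class token, and models divided attention as one temporal pass (each token attends to the T tokens at its spatial location) followed by one spatial pass (each token attends to the N tokens in its frame). The function names are hypothetical.

```python
def num_tokens(t: int, height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for t frames of size height x width."""
    return t * (height // patch) * (width // patch)


def joint_attention_pairs(t: int, n_spatial: int) -> int:
    """Joint space-time attention: every token attends to every token."""
    tokens = t * n_spatial
    return tokens * tokens


def divided_attention_pairs(t: int, n_spatial: int) -> int:
    """Divided attention: a temporal pass (T keys) plus a spatial pass (N keys) per token."""
    tokens = t * n_spatial
    return tokens * (t + n_spatial)


n = (224 // 16) * (224 // 16)           # 196 patches per 224x224 frame
print(num_tokens(8, 224, 224))          # 1568 tokens for 8 frames
print(joint_attention_pairs(8, n))      # 2458624
print(divided_attention_pairs(8, n))    # 319872
# Doubling T from 8 to 16 quadruples the joint cost.
print(joint_attention_pairs(16, n) // joint_attention_pairs(8, n))  # 4
```

Under these assumptions, divided attention needs roughly an eighth of the pairs at T = 8, and its cost grows far more slowly as T increases.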

3

Temporal Aggregation

Features from multiple frames are combined — via 3D pooling, temporal self-attention, or simple averaging. The goal is to capture motion patterns and temporal structure that distinguish actions (e.g., 'opening a door' vs. 'closing a door').
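
A toy example shows why the choice of aggregation matters for pairs like 'opening a door' and 'closing a door'. Simple averaging is invariant to frame order, so a clip and its reverse produce the same pooled feature. The one-dimensional "door angle" features below are made up for illustration, and `temporal_difference_pool` is a hypothetical stand-in for any order-sensitive aggregation, not a method named in this article.

```python
def mean_pool(frames: list[list[float]]) -> list[float]:
    """Average per-frame feature vectors over time (order-invariant)."""
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    return [sum(f[d] for f in frames) / t for d in range(len(frames[0]))]


def temporal_difference_pool(frames: list[list[float]]) -> list[float]:
    """Average the differences between consecutive frames (order-sensitive)."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]
    return mean_pool(diffs)


opening = [[0.0], [0.25], [0.5], [1.0]]  # toy door-angle feature per frame
closing = opening[::-1]                  # same frames, reversed in time

print(mean_pool(opening), mean_pool(closing))  # identical pooled features
print(temporal_difference_pool(opening), temporal_difference_pool(closing))  # opposite signs
```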

4

Classification Head

Pooled spatiotemporal features are projected to class logits. Multi-crop testing (sampling multiple temporal clips and spatial crops, then averaging predictions) is standard for benchmarking and typically boosts accuracy by 1-3 percentage points.
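
Multi-crop testing amounts to averaging per-view class probabilities. The sketch below assumes each view (one temporal clip and spatial crop) has already been passed through the model to get logits. The logit values are invented, and averaging softmax probabilities rather than raw logits is one common choice, not the only one.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one view's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_views(view_logits: list[list[float]]) -> list[float]:
    """Average class probabilities across all views of the same video."""
    if not view_logits:
        raise ValueError("need at least one view")
    probs = [softmax(v) for v in view_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


# Three views of one video, three classes. The second view alone favours class 1.
views = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [2.5, 0.5, 0.0]]
avg = average_views(views)
prediction = max(range(len(avg)), key=lambda c: avg[c])
print(avg, prediction)
```

Here the averaged prediction is class 0 even though one view disagrees, which is the noise-reduction effect that multi-crop testing relies on.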

5

Evaluation

Models are evaluated by top-1 and top-5 accuracy on Kinetics-400/600/700, Something-Something v2 (temporal reasoning), and ActivityNet (long-form). Something-Something specifically tests temporal understanding, since its actions are defined by motion direction.
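
Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch, using a hypothetical `topk_accuracy` helper and invented scores for three clips:

```python
def topk_accuracy(scores: list[list[float]], labels: list[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # true class 1: correct at top-1
    [0.5, 0.3, 0.2],  # true class 1: correct only at top-2
    [0.2, 0.3, 0.5],  # true class 0: wrong at both
]
labels = [1, 1, 0]
print(topk_accuracy(scores, labels, k=1))
print(topk_accuracy(scores, labels, k=2))
```

With 400 or more classes, top-5 is more forgiving of near-synonymous labels, which is why both numbers are usually reported.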

Current Landscape

Video classification in 2025 is dominated by large video-language models pretrained on web-scale data. InternVideo2 leads the benchmarks, but the practical landscape has shifted: zero-shot video classification via CLIP-style models is replacing task-specific training for many applications. VideoMAE proved that self-supervised pretraining works exceptionally well for video (better than supervised ImageNet pretraining), and this is now the standard recipe. The field's center of gravity is moving from clip-level classification toward long-form video understanding, where models must reason over minutes or hours of content.

Key Challenges

Computational cost — video models process 8-32× more data than image models; training a ViT-Large on video requires thousands of GPU-hours

Temporal reasoning — many actions can be classified from a single frame (static bias); Something-Something v2 showed that models often cheat by ignoring motion

Long-form video understanding — most models process 3-10 second clips, but real applications need minute-to-hour understanding (surveillance, sports analysis)

Data efficiency — Kinetics-400 has 300K clips, tiny compared to image datasets; self-supervised pretraining (VideoMAE) helps but doesn't close the gap

Deployment speed — real-time video classification requires processing 30 FPS, far too expensive for most transformer-based models without aggressive optimization

Quick Recommendations

Best accuracy

InternVideo2-6B

92.1% top-1 on Kinetics-400; multimodal pretraining captures both visual and semantic understanding

Best accuracy/efficiency

VideoMAEv2-ViT-B or TimeSformer-L

VideoMAE V2 ViT-B reaches 87%+ Kinetics-400 accuracy at manageable compute, and its self-supervised pretraining reduces labeled data needs; TimeSformer-L is a supervised alternative at roughly 80.7%

Temporal reasoning

SlowFast + VideoMAE pretraining

SlowFast's dual-rate design captures fine temporal patterns; 75%+ on Something-Something v2

Real-time / deployment

X3D-M or MoViNet-A2

X3D achieves 76% Kinetics-400 at 6 GFLOPs; MoViNet is optimized for streaming inference

Zero-shot / open-vocabulary

InternVideo2 or LanguageBind-Video

Classify videos with arbitrary text descriptions without retraining; useful when action categories change frequently
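
The zero-shot recipe reduces to a nearest-neighbour search in a shared embedding space. The sketch below assumes a video-text model has already produced one embedding for the video and one per text prompt. The three-dimensional vectors and prompt strings are invented for illustration, and no real model API is shown.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def zero_shot_classify(video_emb: list[float], text_embs: dict[str, list[float]]) -> str:
    """Return the prompt whose embedding is most similar to the video embedding."""
    return max(text_embs, key=lambda prompt: cosine(video_emb, text_embs[prompt]))


# Hypothetical embeddings; a real system would obtain these from the model's encoders.
video = [0.9, 0.1, 0.0]
prompts = {
    "a video of a person swimming": [1.0, 0.0, 0.0],
    "a video of a person cooking": [0.0, 1.0, 0.0],
}
print(zero_shot_classify(video, prompts))
```

Changing the label set only means embedding new prompts, which is why this approach suits applications where categories change often.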

What's Next

The frontier is long-form video understanding (classify activities spanning minutes to hours), video-language models that can answer questions about video content (VideoQA), and efficient architectures that enable real-time classification on edge devices. State-space models (Mamba) and linear attention variants may finally make long-context video processing tractable. The ultimate goal is moving beyond clip classification toward temporal grounding — not just 'what action' but 'when exactly' in an untrimmed video.

Benchmarks & SOTA

Kinetics-400

2017 · 5 results

Human action recognition across 400 action classes

State of the Art

DINOv3 (7B)

88.2

accuracy

Something-Something V2

2017 · 5 results

Fine-grained temporal action understanding with objects

State of the Art

V-JEPA 2 ViT-g (1B, 384px)

77.3

accuracy

UCF-101

2012 · 3 results

Action recognition benchmark with 101 action categories

State of the Art

VideoMAE ViT-B

96.1

accuracy

COIN

COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis

0 results

COIN is a large-scale instructional video dataset for comprehensive instructional video analysis. It contains 11,827 videos covering 180 different tasks organized into a 3-level hierarchical lexicon of 12 domains → tasks → steps. Each video is annotated with task/step labels and step-level temporal localization (start/end times), making the dataset suitable for tasks such as step (action) localization, procedural step recognition, action/segment classification and instructional-video understanding. The dataset was introduced by Tang et al. (CVPR 2019 / arXiv:1903.02874) and is distributed with an official website and GitHub repositories for annotations and code (coin-dataset.github.io, github.com/coin-dataset).

No results tracked yet

Diving-48

0 results

Diving48 is a fine-grained video action recognition dataset of competitive diving. It contains approximately 18,000 trimmed video clips spanning 48 distinct dive sequences (classes) defined by FINA rules. Each class corresponds to an unambiguous dive sequence (a combination of takeoff/dive group, flight movements such as somersaults/twists, and entry/position), so distinguishing classes requires modeling subtle, long-range temporal dynamics rather than just single-frame appearance. The dataset is widely used as a benchmark for fine-grained action classification (standard train/test splits are used in the literature) and evaluations typically report top-1 classification accuracy. Public references/hosts include the UCSD SVCL project page (dataset description) and a Hugging Face dataset entry (bkprocovid19/diving48).

No results tracked yet

Epic-Kitchens-100 (EK100)

0 results

EPIC-KITCHENS-100 (EK100) is a large-scale egocentric (first-person) video dataset of daily activities in kitchens, released as an extended version of the original EPIC-KITCHENS collection. It contains ~100 hours of head-mounted camera footage captured in 45 kitchens across multiple cities, with dense audio-visual narrations and manual annotations collected via a “pause-and-talk” narration interface. Key statistics: ~100 hours of Full HD video (~20M frames), ~90K action segments, ~20K narrations, 97 verb classes and ~300 noun classes. The dataset supports multiple challenges/tasks including action recognition (full and weak supervision), action detection, action anticipation (commonly used as a benchmark for action anticipation where metrics such as mean-class recall@5 for verb, noun and joint action are reported on the validation set), cross-modal retrieval and unsupervised domain adaptation. Official resources include the dataset website, annotations GitHub repo and the dataset paper (arXiv:2006.13256).

No results tracked yet

Related Tasks

Open-Vocabulary Object Detection

Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.

Video segmentation

Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.

Object counting

Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.

Image editing

Image editing is the process of altering and improving images, whether digital or traditional, using specialized tools and software to enhance their quality, appearance, and functionality. This can involve simple tasks like cropping and color correction or complex techniques such as layering, retouching to remove blemishes, and creating new composite images. The goal of image editing is to make images more aesthetically pleasing, correct flaws, or achieve a desired artistic effect.
