
Video Classification

Video classification — recognizing actions and events in clips — extends image understanding into the temporal domain, requiring models to reason about motion, context, and temporal ordering. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers like TimeSformer and VideoMAE that treat frames as spatiotemporal tokens. Kinetics-400 accuracy now exceeds 90%, but the real challenge is long-form video understanding where events unfold over minutes, not seconds. Essential for content moderation, sports analytics, and security applications.


Video classification assigns a single action or event label to a video clip. It's the temporal extension of image classification, requiring models to understand motion, temporal structure, and scene dynamics. Kinetics-400 accuracy has climbed from 62% (two-stream CNNs) to 92%+ (InternVideo2), driven by the shift from 3D convolutions to video transformers pretrained on massive datasets.

History

2014

Two-Stream Networks (Simonyan & Zisserman) process RGB and optical flow separately, showing that motion features are critical — 88% on UCF-101

2015

C3D (Tran et al.) applies 3D convolutions to learn spatiotemporal features directly from video, avoiding precomputed optical flow

2017

Kinetics-400 dataset (DeepMind) provides 300K video clips across 400 action classes, replacing the saturated UCF-101 as the primary benchmark

2017

I3D (Carreira & Zisserman) inflates pretrained 2D ImageNet convolutions into 3D, achieving 80.9% on Kinetics-400

2019

SlowFast Networks (Feichtenhofer et al.) process video at two temporal rates — slow (spatial detail) and fast (motion) — reaching 79.8% on Kinetics-400

2021

TimeSformer and ViViT apply transformers to video with divided space-time attention, matching 3D CNNs

2022

VideoMAE introduces masked autoencoding for video pretraining, achieving strong results with 90% tube masking — showing video is more temporally redundant than expected

2023

InternVideo (Shanghai AI Lab) unifies video understanding via multimodal pretraining on video-text pairs, reaching 91.1% on Kinetics-400

2024

InternVideo2 scales to 6B parameters with progressive training from image→video→video-text, achieving 92.1% on Kinetics-400

How Video Classification Works

Video Classification Pipeline
1

Temporal Sampling

Videos are sampled into T frames (typically 8-32) using uniform sampling or segment-based strategies. Higher temporal resolution captures faster actions but increases compute quadratically for attention-based models.
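Both sampling strategies can be sketched in a few lines of numpy; the clip length and frame count below are illustrative:

```python
import numpy as np

def uniform_sample(num_frames: int, T: int) -> np.ndarray:
    """Pick T evenly spaced frame indices from a clip of num_frames frames."""
    return np.linspace(0, num_frames - 1, T).round().astype(int)

def segment_sample(num_frames: int, T: int, rng=None) -> np.ndarray:
    """Segment-based sampling: split the clip into T equal segments and
    draw one random frame from each, adding temporal jitter for training."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, T + 1).astype(int)
    return np.array([rng.integers(lo, max(lo + 1, hi))
                     for lo, hi in zip(edges[:-1], edges[1:])])

# e.g. a 10-second clip at 30 FPS, sampled down to T=8 frames
idx = uniform_sample(300, 8)  # 8 indices spanning frame 0 to frame 299
```

Uniform sampling is typical at test time for reproducibility; segment-based jitter is typical during training.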

2

Spatiotemporal Encoding

3D CNNs (I3D, SlowFast) apply convolutions across space and time jointly. Video transformers (TimeSformer, VideoMAE) tokenize frames into patches and apply space-time attention. Some architectures process 2D features per frame then aggregate temporally.
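As a shape-level sketch (sizes are illustrative), tokenizing a clip into patches and splitting attention into the two views used by divided space-time attention looks like this:

```python
import numpy as np

# Hypothetical clip: T frames of H x W RGB pixels, split into P x P patches.
T, H, W, P = 8, 224, 224, 16
clip = np.random.rand(T, H, W, 3).astype(np.float32)

# Tokenize: each frame becomes (H/P) * (W/P) = 196 patch tokens of dim P*P*3.
n = (H // P) * (W // P)
tokens = (clip.reshape(T, H // P, P, W // P, P, 3)
              .transpose(0, 1, 3, 2, 4, 5)
              .reshape(T, n, P * P * 3))  # (8, 196, 768)

# Divided space-time attention avoids full (T*n)^2 attention by alternating:
temporal_view = tokens.transpose(1, 0, 2)  # (196, 8, 768): attend across time
spatial_view = tokens                      # (8, 196, 768): attend within a frame
```

Full joint attention over all T*n tokens is also possible (as in some ViViT variants) but scales quadratically in both space and time.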

3

Temporal Aggregation

Features from multiple frames are combined — via 3D pooling, temporal self-attention, or simple averaging. The goal is to capture motion patterns and temporal structure that distinguish actions (e.g., 'opening a door' vs. 'closing a door').
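Two common aggregation choices, sketched with numpy (the query vector q stands in for a learned parameter):

```python
import numpy as np

def mean_pool(frame_feats: np.ndarray) -> np.ndarray:
    """Average per-frame features over time: (T, d) -> (d,)."""
    return frame_feats.mean(axis=0)

def attention_pool(frame_feats: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Temporal attention pooling: softmax-weighted sum of frames, with
    weights from each frame's similarity to a (hypothetical) query q."""
    scores = frame_feats @ q / np.sqrt(frame_feats.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ frame_feats
```

Mean pooling discards ordering entirely, which is one reason temporally-defined actions need attention- or convolution-based aggregation instead.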

4

Classification Head

Pooled spatiotemporal features are projected to class logits. Multi-crop testing (sampling multiple temporal clips and spatial crops, then averaging predictions) is standard for benchmarking, boosting accuracy 1-3%.
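Multi-crop testing reduces to averaging softmax probabilities over a grid of clips and crops; a minimal sketch:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_crop_predict(logits: np.ndarray) -> np.ndarray:
    """Average class probabilities over (num_clips, num_crops, num_classes)
    logits -- e.g. 10 temporal clips x 3 spatial crops in common protocols."""
    return softmax(logits, axis=-1).mean(axis=(0, 1))
```

Averaging probabilities rather than logits is the usual convention, since each view's prediction is then a proper distribution before ensembling.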

5

Evaluation

Top-1 and top-5 accuracy on Kinetics-400/600/700, Something-Something v2 (temporal reasoning), and ActivityNet (long-form). Something-Something specifically tests temporal understanding since actions are defined by motion direction.
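Top-k accuracy over a batch of predictions can be computed as:

```python
import numpy as np

def topk_accuracy(probs: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose ground-truth label appears among the
    k highest-probability predictions. probs: (N, num_classes)."""
    topk = np.argsort(-probs, axis=1)[:, :k]          # indices of top-k classes
    hits = (topk == labels[:, None]).any(axis=1)      # per-sample hit/miss
    return float(hits.mean())
```

Top-1 is this with k=1; reported Kinetics numbers are top-1 unless stated otherwise.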

Current Landscape

Video classification in 2025 is dominated by large video-language models pretrained on web-scale data. InternVideo2 leads the benchmarks, but the practical landscape has shifted: zero-shot video classification via CLIP-style models is replacing task-specific training for many applications. VideoMAE proved that self-supervised pretraining works exceptionally well for video (better than supervised ImageNet pretraining), and this is now the standard recipe. The field's center of gravity is moving from clip-level classification toward long-form video understanding, where models must reason over minutes or hours of content.

Key Challenges

Computational cost — video models process 8-32× more data than image models; training a ViT-Large on video requires thousands of GPU-hours

Temporal reasoning — many actions can be classified from a single frame (static bias); Something-Something v2 showed that models often cheat by ignoring motion

Long-form video understanding — most models process 3-10 second clips, but real applications need minute-to-hour understanding (surveillance, sports analysis)

Data efficiency — Kinetics-400 has 300K clips, tiny compared to image datasets; self-supervised pretraining (VideoMAE) helps but doesn't close the gap

Deployment speed — real-time video classification means keeping pace with 30 FPS input, far too expensive for most transformer-based models without aggressive optimization

Quick Recommendations

Best accuracy

InternVideo2-6B

92.1% top-1 on Kinetics-400; multimodal pretraining captures both visual and semantic understanding

Best accuracy/efficiency

VideoMAEv2-ViT-B or TimeSformer-L

87%+ Kinetics-400 accuracy at manageable compute; self-supervised pretraining reduces labeled data needs

Temporal reasoning

SlowFast + VideoMAE pretraining

SlowFast's dual-rate design captures fine temporal patterns; 75%+ on Something-Something v2

Real-time / deployment

X3D-M or MoViNet-A2

X3D achieves 76% Kinetics-400 at 6 GFLOPs; MoViNet is optimized for streaming inference

Zero-shot / open-vocabulary

InternVideo2 or LanguageBind-Video

Classify videos with arbitrary text descriptions without retraining; useful when action categories change frequently
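Zero-shot classification in CLIP-style models reduces to cosine similarity between a video embedding and text embeddings of candidate labels. A minimal sketch with stand-in embeddings (a real system would obtain them from an encoder such as InternVideo2 or LanguageBind):

```python
import numpy as np

def zero_shot_classify(video_emb: np.ndarray,
                       text_embs: np.ndarray,
                       labels: list) -> str:
    """Return the label whose text embedding has the highest cosine
    similarity with the video embedding. Embeddings are placeholders here."""
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return labels[int(np.argmax(t @ v))]
```

Because the label set is just a list of text prompts, categories can be added or renamed without retraining, which is the core appeal of open-vocabulary classification.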

What's Next

The frontier is long-form video understanding (classify activities spanning minutes to hours), video-language models that can answer questions about video content (VideoQA), and efficient architectures that enable real-time classification on edge devices. State-space models (Mamba) and linear attention variants may finally make long-context video processing tractable. The ultimate goal is moving beyond clip classification toward temporal grounding — not just 'what action' but 'when exactly' in an untrimmed video.

