Video Classification
Video classification assigns an action or event label to a video clip: the temporal extension of image classification, requiring models to reason about motion, temporal ordering, and scene dynamics. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers such as TimeSformer and VideoMAE that treat frames as spatiotemporal tokens. Kinetics-400 top-1 accuracy has climbed from 62% (two-stream CNNs) to 92%+ (InternVideo2), driven by the shift from 3D convolutions to transformers pretrained on massive video and video-text datasets. The harder open problem is long-form video understanding, where events unfold over minutes rather than seconds; the task is essential for content moderation, sports analytics, and security applications.
History
Two-Stream Networks (Simonyan & Zisserman, 2014) process RGB and optical flow in separate streams, showing that explicit motion features are critical: 88% on UCF-101
C3D (Tran et al., 2015) applies 3D convolutions to learn spatiotemporal features directly from video, avoiding precomputed optical flow
Kinetics-400 dataset (DeepMind, 2017) provides ~300K video clips across 400 action classes, replacing the saturated UCF-101 as the primary benchmark
I3D (Carreira & Zisserman, 2017) inflates pretrained 2D ImageNet convolutions into 3D, achieving 80.9% on Kinetics-400
SlowFast Networks (Feichtenhofer et al., 2019) process video at two temporal rates, a slow pathway for spatial detail and a fast pathway for motion, reaching 79.8% on Kinetics-400
TimeSformer and ViViT (2021) apply transformers to video with divided space-time attention, matching or surpassing 3D CNNs
VideoMAE (2022) introduces masked autoencoding for video pretraining, achieving strong results even with 90% tube masking, evidence that video is far more temporally redundant than images
InternVideo (Shanghai AI Lab, 2022) unifies video understanding via multimodal pretraining on video-text pairs, reaching 91.1% on Kinetics-400
InternVideo2 (2024) scales to 6B parameters with progressive image→video→video-text training, achieving 92.1% on Kinetics-400
How Video Classification Works
Temporal Sampling
Videos are sampled into T frames (typically 8-32) using uniform or segment-based sampling strategies. Higher temporal resolution captures faster actions but raises compute steeply: full space-time attention scales quadratically with the number of tokens, and token count grows linearly with T.
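The two sampling strategies can be sketched in a few lines. This is a minimal illustration, not a library API; the names `uniform_sample` and `segment_sample` are invented here, and the segment scheme follows the TSN-style recipe of one frame per equal-length segment:

```python
import numpy as np

def uniform_sample(total_frames, T):
    # Evenly spaced frame indices over the whole clip.
    return np.linspace(0, total_frames - 1, T).astype(int)

def segment_sample(total_frames, T, rng=None):
    # Split the clip into T equal segments and pick one frame per
    # segment: the midpoint when rng is None (deterministic eval),
    # or a random frame per segment (training-time jitter).
    edges = np.linspace(0, total_frames, T + 1).astype(int)
    if rng is None:
        return (edges[:-1] + edges[1:]) // 2
    return np.array([int(rng.integers(lo, hi))
                     for lo, hi in zip(edges[:-1], edges[1:])])
```

Segment sampling guarantees temporal coverage of the whole clip even when T is small, which is why it is the common default for classification.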
Spatiotemporal Encoding
3D CNNs (I3D, SlowFast) apply convolutions across space and time jointly. Video transformers (TimeSformer, VideoMAE) tokenize frames into patches and apply space-time attention. Some architectures process 2D features per frame then aggregate temporally.
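The tokenization step used by tubelet-based transformers (ViViT, VideoMAE) can be sketched in NumPy. `tubelet_tokenize` is a hypothetical helper name for illustration; real models follow this reshaping with a learned linear projection into the embedding dimension:

```python
import numpy as np

def tubelet_tokenize(video, t=2, p=16):
    # video: (T, H, W, C). Cut the clip into non-overlapping
    # t x p x p tubelets and flatten each into one token vector,
    # yielding (T/t * H/p * W/p) tokens of size t*p*p*C.
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    v = video.reshape(T // t, t, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # group tubelet dims together
    return v.reshape((T // t) * (H // p) * (W // p), t * p * p * C)
```

With t=2, the token count is half what per-frame patching would give, which is a large saving when attention cost is quadratic in token count.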
Temporal Aggregation
Features from multiple frames are combined — via 3D pooling, temporal self-attention, or simple averaging. The goal is to capture motion patterns and temporal structure that distinguish actions (e.g., 'opening a door' vs. 'closing a door').
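The two simplest aggregation schemes, plain temporal averaging and softmax-weighted attention pooling, can be sketched as follows (the function name `aggregate` and the single scoring vector are illustrative assumptions; real models use learned multi-head attention):

```python
import numpy as np

def aggregate(frame_feats, score_w=None):
    # frame_feats: (T, D) per-frame feature vectors.
    # score_w: optional (D,) scoring vector. None means plain
    # averaging; otherwise each frame gets a scalar score and the
    # output is a softmax-weighted combination over time.
    if score_w is None:
        return frame_feats.mean(axis=0)
    scores = frame_feats @ score_w        # (T,) one score per frame
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over the T frames
    return w @ frame_feats                # (D,) weighted combination
```

Averaging discards frame order entirely, which is why attention-based (or 3D-convolutional) aggregation matters for order-sensitive actions like opening vs. closing a door.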
Classification Head
Pooled spatiotemporal features are projected to class logits. Multi-crop testing (sampling multiple temporal clips and spatial crops, then averaging predictions) is standard for benchmarking, boosting accuracy 1-3%.
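Multi-crop testing amounts to averaging per-view softmax probabilities before taking the argmax. A minimal sketch (`multi_view_predict` is an assumed name, not a library call):

```python
import numpy as np

def multi_view_predict(view_logits):
    # view_logits: (V, K) logits from V views of one video, where
    # V = temporal clips x spatial crops and K = number of classes.
    # Softmax each view with the max-subtraction trick for numerical
    # stability, average the probabilities, then take the argmax.
    z = view_logits - view_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)
    return int(probs.mean(axis=0).argmax())
```

A common benchmark setting is V = 10 clips x 3 crops = 30 views, so reported accuracies cost roughly 30 forward passes per video.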
Evaluation
Top-1 and top-5 accuracy on Kinetics-400/600/700, Something-Something v2 (temporal reasoning), and ActivityNet (long-form). Something-Something specifically tests temporal understanding since actions are defined by motion direction.
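Top-k accuracy is straightforward to compute from raw logits; a minimal NumPy sketch (`topk_accuracy` is an illustrative name):

```python
import numpy as np

def topk_accuracy(logits, labels, k=5):
    # logits: (N, num_classes) scores, labels: (N,) integer class ids.
    # A sample counts as correct if its true label appears among the
    # k highest-scoring classes for that sample.
    topk = np.argsort(logits, axis=1)[:, -k:]
    return float((topk == labels[:, None]).any(axis=1).mean())
```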
Current Landscape
Video classification in 2025 is dominated by large video-language models pretrained on web-scale data. InternVideo2 leads the benchmarks, but the practical landscape has shifted: zero-shot video classification via CLIP-style models is replacing task-specific training for many applications. VideoMAE proved that self-supervised pretraining works exceptionally well for video (better than supervised ImageNet pretraining), and this is now the standard recipe. The field's center of gravity is moving from clip-level classification toward long-form video understanding, where models must reason over minutes or hours of content.
Key Challenges
Computational cost — video models process 8-32× more data than image models; training a ViT-Large on video requires thousands of GPU-hours
Temporal reasoning — many actions can be classified from a single frame (static bias); Something-Something v2 showed that models often cheat by ignoring motion
Long-form video understanding — most models process 3-10 second clips, but real applications need minute-to-hour understanding (surveillance, sports analysis)
Data efficiency — Kinetics-400 has 300K clips, tiny compared to image datasets; self-supervised pretraining (VideoMAE) helps but doesn't close the gap
Deployment speed — real-time video classification requires processing 30 FPS, far too expensive for most transformer-based models without aggressive optimization
Quick Recommendations
Best accuracy
InternVideo2-6B
92.1% top-1 on Kinetics-400; multimodal pretraining captures both visual and semantic understanding
Best accuracy/efficiency
VideoMAEv2-ViT-B or TimeSformer-L
87%+ Kinetics-400 accuracy at manageable compute; self-supervised pretraining reduces labeled data needs
Temporal reasoning
VideoMAE-pretrained ViT (or SlowFast for CNN pipelines)
VideoMAE pretraining reaches 75%+ on Something-Something v2; SlowFast's dual-rate design also captures fine-grained temporal patterns
Real-time / deployment
X3D-M or MoViNet-A2
X3D achieves 76% Kinetics-400 at 6 GFLOPs; MoViNet is optimized for streaming inference
Zero-shot / open-vocabulary
InternVideo2 or LanguageBind-Video
Classify videos with arbitrary text descriptions without retraining; useful when action categories change frequently
What's Next
The frontier is long-form video understanding (classify activities spanning minutes to hours), video-language models that can answer questions about video content (VideoQA), and efficient architectures that enable real-time classification on edge devices. State-space models (Mamba) and linear attention variants may finally make long-context video processing tractable. The ultimate goal is moving beyond clip classification toward temporal grounding — not just 'what action' but 'when exactly' in an untrimmed video.
Benchmarks & SOTA
Kinetics-400
Human action recognition across 400 action classes
Something-Something V2
Fine-grained temporal action understanding with objects
UCF-101
Action recognition benchmark with 101 action categories