Video classification
The task of classifying videos into predefined categories or classes. Video classification involves analyzing temporal sequences of frames to understand the content and assign appropriate labels to entire video clips.
Video classification assigns a single action or event label to a video clip. It's the temporal extension of image classification, requiring models to understand motion, temporal structure, and scene dynamics. Kinetics-400 accuracy has climbed from 62% (two-stream CNNs) to over 92% (InternVideo2), driven by the shift from 3D convolutions to video transformers pretrained on massive datasets.
History
Two-Stream Networks (Simonyan & Zisserman) process RGB and optical flow separately, showing that motion features are critical — 88% on UCF-101
C3D (Tran et al.) applies 3D convolutions to learn spatiotemporal features directly from video, avoiding precomputed optical flow
Kinetics-400 dataset (DeepMind) provides 300K video clips across 400 action classes, replacing the saturated UCF-101 as the primary benchmark
I3D (Carreira & Zisserman) inflates pretrained 2D ImageNet convolutions into 3D, reaching roughly 74% top-1 on Kinetics-400 (two-stream RGB + flow) and 98% on UCF-101 after Kinetics pretraining
SlowFast Networks (Feichtenhofer et al.) process video at two temporal rates — slow (spatial detail) and fast (motion) — reaching 79.8% on Kinetics-400
TimeSformer and ViViT apply transformers to video with divided space-time attention, matching or surpassing 3D CNNs on Kinetics-400
VideoMAE introduces masked autoencoding for video pretraining, achieving strong results with 90% tube masking — showing video is more temporally redundant than expected
InternVideo (Shanghai AI Lab) unifies video understanding via multimodal pretraining on video-text pairs, reaching 91.1% on Kinetics-400
InternVideo2 scales to 6B parameters with progressive training from image→video→video-text, achieving 92.1% on Kinetics-400
How Video Classification Works
Temporal Sampling
Videos are sampled into T frames (typically 8-32) using uniform sampling or segment-based strategies. Higher temporal resolution captures faster actions but increases compute, quadratically in token count for models with full space-time attention.
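For illustration, a minimal sketch of segment-based uniform sampling; the function name and frame counts are chosen here rather than taken from any particular library:

import numpy as np

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Split the video into num_samples equal segments and take each segment's center frame."""
    edges = np.linspace(0, num_video_frames, num_samples + 1)
    centers = ((edges[:-1] + edges[1:]) / 2).astype(int)
    return np.clip(centers, 0, num_video_frames - 1)

# Example: pick 16 of 300 decoded frames (a 10-second clip at 30 FPS)
indices = sample_frame_indices(300, num_samples=16)
print(indices)  # evenly spread indices such as [9, 28, 46, ..., 290]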
Spatiotemporal Encoding
3D CNNs (I3D, SlowFast) apply convolutions across space and time jointly. Video transformers (TimeSformer, VideoMAE) tokenize frames into patches and apply space-time attention. Some architectures process 2D features per frame then aggregate temporally.
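As a rough sketch of how divided space-time attention can be organized (in the spirit of a TimeSformer block; the module names, tensor layout, and dimensions below are assumptions, not the reference implementation):

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of one divided-attention block: temporal attention across frames
    at each spatial location, then spatial attention within each frame."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -- patch tokens from T frames
        b, t, p, d = x.shape

        # Temporal attention: each spatial patch attends across the T frames
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        n = self.norm1(xt)
        xt = xt + self.temporal_attn(n, n, n)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame's patches attend to one another
        xs = x.reshape(b * t, p, d)
        n = self.norm2(xs)
        xs = xs + self.spatial_attn(n, n, n)[0]
        x = xs.reshape(b, t, p, d)

        return x + self.mlp(self.norm3(x))

# Example: 2 clips, 8 frames, 14x14 = 196 patches, 768-dim tokens
tokens = torch.randn(2, 8, 196, 768)
print(DividedSpaceTimeBlock()(tokens).shape)  # torch.Size([2, 8, 196, 768])

Factorizing attention this way keeps each attention call linear in the other axis, which is why divided space-time attention scales to more frames than joint attention over all space-time tokens.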
Temporal Aggregation
Features from multiple frames are combined — via 3D pooling, temporal self-attention, or simple averaging. The goal is to capture motion patterns and temporal structure that distinguish actions (e.g., 'opening a door' vs. 'closing a door').
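A small sketch contrasting plain averaging with a learned, softmax-weighted temporal pooling (a generic illustration, not any specific paper's aggregation module):

import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Score each frame feature and take a softmax-weighted average."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, dim) -- one feature vector per frame
        weights = self.score(frame_feats).softmax(dim=1)   # (batch, frames, 1)
        return (weights * frame_feats).sum(dim=1)          # (batch, dim)

feats = torch.randn(4, 16, 768)                  # 4 clips, 16 frames each
clip_feat_mean = feats.mean(dim=1)               # simple averaging
clip_feat_attn = TemporalAttentionPool()(feats)  # learned weighted pooling
print(clip_feat_mean.shape, clip_feat_attn.shape)  # both (4, 768)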
Classification Head
Pooled spatiotemporal features are projected to class logits. Multi-crop testing (sampling multiple temporal clips and spatial crops, then averaging predictions) is standard for benchmarking, boosting accuracy 1-3%.
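A hedged sketch of multi-clip, multi-crop inference with an arbitrary clip-level classifier; the model, view counts, and shapes below are placeholders:

import torch

@torch.no_grad()
def multi_crop_predict(model, clips):
    """Average softmax predictions over multiple views of one video.

    clips: list of tensors, each (channels, frames, height, width), e.g.
    4 temporal clips x 3 spatial crops = 12 views. The model is any
    clip-level classifier returning (1, num_classes) logits.
    """
    probs = []
    for clip in clips:
        logits = model(clip.unsqueeze(0))      # add batch dimension
        probs.append(logits.softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)      # averaged class probabilities

# Example with a dummy "model" over 400 classes and 12 views
dummy_model = lambda x: torch.randn(1, 400)
views = [torch.randn(3, 16, 224, 224) for _ in range(12)]
avg_probs = multi_crop_predict(dummy_model, views)
print(avg_probs.argmax(dim=-1))                # predicted class index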
Evaluation
Top-1 and top-5 accuracy on Kinetics-400/600/700, Something-Something v2 (temporal reasoning), and ActivityNet (long-form). Something-Something specifically tests temporal understanding since actions are defined by motion direction.
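Top-1 and top-5 accuracy can be computed directly from clip-level logits; a minimal helper (names here are illustrative):

import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 5)):
    """Compute top-k accuracies from (batch, num_classes) logits."""
    max_k = max(ks)
    topk = logits.topk(max_k, dim=1).indices          # (batch, max_k) class indices
    correct = topk.eq(labels.unsqueeze(1))            # (batch, max_k) hit mask
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

logits = torch.randn(8, 400)             # 8 clips, 400 Kinetics classes
labels = torch.randint(0, 400, (8,))
print(topk_accuracy(logits, labels))     # {1: ..., 5: ...}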
Current Landscape
Video classification in 2025 is dominated by large video-language models pretrained on web-scale data. InternVideo2 leads the benchmarks, but the practical landscape has shifted: zero-shot video classification via CLIP-style models is replacing task-specific training for many applications. VideoMAE proved that self-supervised pretraining works exceptionally well for video (better than supervised ImageNet pretraining), and this is now the standard recipe. The field's center of gravity is moving from clip-level classification toward long-form video understanding, where models must reason over minutes or hours of content.
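As a rough illustration of that zero-shot recipe, the sketch below embeds a few sampled frames with an off-the-shelf image-text model (openai/clip-vit-base-patch32 via Hugging Face, chosen only for illustration), averages them into a single video embedding, and scores it against text prompts for each class; dedicated video-text models such as InternVideo2 or LanguageBind-Video follow the same scoring pattern:

import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative image-text model; the frame list is assumed to be PIL images
# sampled from the video (decoding omitted here).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(frames, class_names):
    prompts = [f"a video of {c}" for c in class_names]  # simplistic prompt template
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Average frame embeddings into one video embedding, then cosine-score against text
    video = img.mean(dim=0, keepdim=True)
    video = video / video.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    scores = (video @ txt.T).squeeze(0)
    return class_names[scores.argmax().item()]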
Key Challenges
Computational cost — video models process 8-32× more data than image models; training a ViT-Large on video requires thousands of GPU-hours
Temporal reasoning — many actions can be classified from a single frame (static bias); Something-Something v2 showed that models often cheat by ignoring motion
Long-form video understanding — most models process 3-10 second clips, but real applications need minute-to-hour understanding (surveillance, sports analysis)
Data efficiency — Kinetics-400 has 300K clips, tiny compared to image datasets; self-supervised pretraining (VideoMAE) helps but doesn't close the gap
Deployment speed — real-time video classification requires processing 30 FPS, far too expensive for most transformer-based models without aggressive optimization
Quick Recommendations
Best accuracy
InternVideo2-6B
92.1% top-1 on Kinetics-400; multimodal pretraining captures both visual and semantic understanding
Best accuracy/efficiency
VideoMAEv2-ViT-B or TimeSformer-L
87%+ Kinetics-400 accuracy at manageable compute; self-supervised pretraining reduces labeled data needs
Temporal reasoning
SlowFast + VideoMAE pretraining
SlowFast's dual-rate design captures fine temporal patterns; VideoMAE-pretrained transformers reach 75%+ on Something-Something v2
Real-time / deployment
X3D-M or MoViNet-A2
X3D achieves 76% Kinetics-400 at 6 GFLOPs; MoViNet is optimized for streaming inference
Zero-shot / open-vocabulary
InternVideo2 or LanguageBind-Video
Classify videos with arbitrary text descriptions without retraining; useful when action categories change frequently
What's Next
The frontier is long-form video understanding (classify activities spanning minutes to hours), video-language models that can answer questions about video content (VideoQA), and efficient architectures that enable real-time classification on edge devices. State-space models (Mamba) and linear attention variants may finally make long-context video processing tractable. The ultimate goal is moving beyond clip classification toward temporal grounding — not just 'what action' but 'when exactly' in an untrimmed video.
Benchmarks & SOTA
Kinetics-400
Kinetics Human Action Video Dataset (Kinetics-400)
Kinetics-400 (The Kinetics Human Action Video Dataset) is a large-scale human action video classification dataset introduced by Will Kay et al. (DeepMind). It contains 400 human-focused action classes, with at least 400 video clips per class. Each clip is roughly 10 seconds long and is taken from a unique YouTube video; clips are human-annotated with a single action label. The dataset is intended as a benchmark for action recognition / video classification and has been widely used for training and evaluating video classification models. The original dataset paper and release are: "The Kinetics Human Action Video Dataset" (Kay et al., arXiv:1705.06950). The dataset and its community redistributions are commonly released under a Creative Commons Attribution (CC BY 4.0) license.
No results tracked yet
UCF101
UCF101: A Dataset of 101 Human Action Classes From Videos in the Wild
UCF101 is a widely-used action recognition benchmark consisting of realistic, unconstrained video clips collected from YouTube. It contains 13,320 video clips across 101 human action categories (e.g., sports, body-motion, human-object interaction, human-human interaction, playing musical instruments). Clips are grouped into 25 groups per action (each group has 4–7 videos) to support cross-group evaluation; the dataset exhibits large variation in camera motion, viewpoint, background clutter, illumination, object scale and appearance. The dataset was introduced as a benchmark for video action recognition with baseline results reported in the original paper (Soomro et al., 2012). It also appears in evaluations of recent self-supervised models such as DINOv3, where it is used for video classification with top-1 accuracy reported.
No results tracked yet
Something-Something V2
Something-Something V2
Something-Something V2 is a large-scale temporally-sensitive action recognition / video classification dataset of short, trimmed videos of humans interacting with everyday objects. Version 2 contains 220,847 labeled video clips covering 174 fine-grained action classes (defined via caption-templates with placeholders). The data were crowd-sourced, with contributors recording themselves performing the templated actions, and the dataset is designed to emphasize temporal reasoning (e.g., distinguishing actions that require motion context rather than single-frame cues). It is widely used to benchmark action-recognition / video-classification models and leaderboards report top-1/top-5 accuracies and other standard metrics. (Sources: ICCV 2017 paper by Goyal et al., TwentyBN release notes, and Hugging Face dataset card.)
No results tracked yet
COIN
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis
COIN is a large-scale instructional video dataset for comprehensive instructional video analysis. It contains 11,827 videos covering 180 different tasks organized into a 3-level hierarchical lexicon of 12 domains → tasks → steps. Each video is annotated with task/step labels and step-level temporal localization (start/end times), making the dataset suitable for tasks such as step (action) localization, procedural step recognition, action/segment classification and instructional-video understanding. The dataset was introduced by Tang et al. (CVPR 2019 / arXiv:1903.02874) and is distributed with an official website and GitHub repositories for annotations and code (coin-dataset.github.io, github.com/coin-dataset).
No results tracked yet
Diving-48
Diving48
Diving48 is a fine-grained video action recognition dataset of competitive diving. It contains approximately 18,000 trimmed video clips spanning 48 distinct dive sequences (classes) defined by FINA rules. Each class corresponds to an unambiguous dive sequence (a combination of takeoff/dive group, flight movements such as somersaults/twists, and entry/position), so distinguishing classes requires modeling subtle, long-range temporal dynamics rather than just single-frame appearance. The dataset is widely used as a benchmark for fine-grained action classification (standard train/test splits are used in the literature) and evaluations typically report top-1 classification accuracy. Public references/hosts include the UCSD SVCL project page (dataset description) and a Hugging Face dataset entry (bkprocovid19/diving48).
No results tracked yet
Epic-Kitchens-100 (EK100)
EPIC-KITCHENS-100 (EK100)
EPIC-KITCHENS-100 (EK100) is a large-scale egocentric (first-person) video dataset of daily activities in kitchens, released as an extended version of the original EPIC-KITCHENS collection. It contains ~100 hours of head-mounted camera footage captured in 45 kitchens across multiple cities, with dense audio-visual narrations and manual annotations collected via a “pause-and-talk” narration interface. Key statistics: ~100 hours of Full HD video (~20M frames), ~90K action segments, ~20K narrations, 97 verb classes and ~300 noun classes. The dataset supports multiple challenges/tasks including action recognition (full and weak supervision), action detection, action anticipation (commonly used as a benchmark for action anticipation where metrics such as mean-class recall@5 for verb, noun and joint action are reported on the validation set), cross-modal retrieval and unsupervised domain adaptation. Official resources include the dataset website, annotations GitHub repo and the dataset paper (arXiv:2006.13256).
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.
Object counting
Object counting is the computer vision task of identifying and enumerating distinct objects in images and videos, distinguishing between object types, sizes, and shapes even in crowded or dynamically changing scenes. The typical pipeline detects and localizes objects with deep models such as convolutional neural networks (CNNs), then aggregates detections into a total count. Applications include manufacturing quality control and production monitoring.
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.