Action Recognition
Classify actions or activities in video clips for safety, sports, and analytics.
How Video Action Recognition Works
Understanding motion over time: from 3D convolutions to vision transformers. How machines learn to recognize human actions from video.
The Fundamental Challenge
Why can't we just run an image classifier on each frame?
The Core Insight
Consider two video clips: one shows a person picking up a cup, the other shows a person putting down a cup. Any single frame looks nearly identical. The action's meaning is encoded entirely in the temporal sequence of frames.
Action recognition requires understanding motion patterns, not just static appearance. The model must learn that the hand moves toward the cup, fingers close, arm lifts - this constitutes picking up, while the reverse sequence means putting down.
Image Classification
- What is in the image?
- Spatial features: shapes, textures, colors
- Time-invariant: same answer at t=0 and t=100
Action Recognition
- What is happening over time?
- Spatiotemporal: motion, velocity, acceleration
- Order matters: reverse = different action
Same Appearance, Different Actions
Temporal Modeling Approaches
Four paradigms for capturing motion information across video frames.
Two-Stream Architecture (Classic Approach)
Two-stream networks (Simonyan & Zisserman, 2014) pioneered the use of optical flow for explicit motion modeling. Flow computation is expensive but provides strong motion cues.
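As a concrete illustration of the late-fusion idea (a minimal sketch, not the original paper's implementation), the snippet below classifies one RGB frame with a spatial 2D CNN and a stack of optical-flow fields with a temporal 2D CNN, then averages their class scores. The 10-channel flow input assumes 5 flow frames x 2 components; the ResNet-18 backbones and the Kinetics-400 class count are illustrative choices.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamLateFusion(nn.Module):
    def __init__(self, num_classes=400, flow_channels=10):
        super().__init__()
        # Spatial stream: an ordinary RGB image classifier
        self.spatial = resnet18(num_classes=num_classes)
        # Temporal stream: same backbone, but the first conv reads a stack of
        # optical-flow fields (e.g. 5 frames x 2 components) instead of RGB
        self.temporal = resnet18(num_classes=num_classes)
        self.temporal.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb_frame, flow_stack):
        # rgb_frame: (B, 3, H, W); flow_stack: (B, flow_channels, H, W)
        # Late fusion: average the two streams' class scores
        return (self.spatial(rgb_frame) + self.temporal(flow_stack)) / 2

model = TwoStreamLateFusion()
scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 10, 224, 224))
print(scores.shape)  # torch.Size([1, 400])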
The Modern Shift
Modern architectures (I3D onwards) move away from precomputed optical flow. 3D convolutions and temporal attention learn to extract motion features directly from RGB frames, making the pipeline simpler and often faster. SlowFast and transformers achieve this end-to-end.
3D Convolutions: The Key Innovation
Extending spatial convolutions to the temporal dimension.
Think of It This Way
A 2D convolution slides a filter across height and width to detect spatial patterns (edges, textures, shapes). A 3D convolution adds a third dimension: it slides across height, width, and time.
This means a single 3D filter can learn patterns like "a pixel gets brighter over 3 frames" or "an edge moves rightward". It captures spatiotemporal features jointly, rather than processing frames independently and hoping the classifier figures out the motion.
| Type | Input Dims | Kernel | Captures | Use Case |
|---|---|---|---|---|
| 2D Conv | H x W | kH x kW | Spatial features per frame | Image classification, per-frame processing |
| 2D Conv + Temporal Pooling | T x H x W | kH x kW (applied per frame) | Average/max across time | Simple temporal aggregation |
| 3D Conv | T x H x W | kT x kH x kW | Spatiotemporal features | Learn motion patterns jointly |
| (2+1)D Conv | T x H x W | 1 x kH x kW, then kT x 1 x 1 | Factorized spatiotemporal | Efficient alternative to full 3D |
2D Convolution Kernel (3x3)
Slides over H x W dimensions only. Same filter applied to each frame independently.
3D Convolution Kernel (3x3x3)
Slides over T x H x W dimensions. Captures motion across 3 consecutive frames.
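A shape-level sketch of the difference (a minimal PyTorch illustration with toy sizes chosen here): Conv2d sees one frame at a time, while Conv3d convolves over the whole clip and its kernel spans 3 consecutive frames.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, T, H, W)
frame = clip[:, :, 0]                   # (batch, channels, H, W): a single frame

# 2D conv: 3x3 kernel over H x W, applied to one frame at a time
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(conv2d(frame).shape)              # torch.Size([1, 64, 112, 112])

# 3D conv: 3x3x3 kernel over T x H x W, mixes information across 3 frames
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(clip).shape)               # torch.Size([1, 64, 16, 112, 112])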
(2+1)D Convolution: Best of Both Worlds
Full 3D convolutions are computationally expensive. The (2+1)D factorization (R(2+1)D, 2018) decomposes a 3D conv into a 2D spatial conv followed by a 1D temporal conv. This is more efficient and often performs better, partly because the extra nonlinearity between the two factors doubles the number of nonlinearities in the network.
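A minimal sketch of the factorization described above. The intermediate channel count is fixed here for illustration; R(2+1)D actually chooses it so the factorized block matches the parameter count of the corresponding full 3D conv.
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    def __init__(self, in_ch, out_ch, mid_ch=64):
        super().__init__()
        # Spatial factor: 1 x 3 x 3 kernel (no mixing across time)
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # The extra nonlinearity between the factors is where the added expressiveness comes from
        self.relu = nn.ReLU(inplace=True)
        # Temporal factor: 3 x 1 x 1 kernel (mixing across 3 frames only)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.spatial(x)))

block = Conv2Plus1D(3, 64)
print(block(torch.randn(1, 3, 16, 112, 112)).shape)  # torch.Size([1, 64, 16, 112, 112])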
Frame Sampling Strategies
Videos can be minutes long, but models process fixed-size clips. Sampling strategy determines which frames the model sees.
Uniform Sampling
Select frames at regular intervals across the entire video; a minimal sampling sketch follows this list.
- Pro: Simple to implement
- Pro: Covers the full temporal extent
- Con: May miss fast actions between samples
- Con: Fixed sampling density everywhere
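A minimal NumPy sketch of uniform sampling, assuming the video has already been decoded into an array of frames; the index arithmetic is the important part.
import numpy as np

def uniform_sample_indices(num_video_frames, num_samples):
    # Evenly spaced frame indices spanning the whole video
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)

# Example: pick 8 frames from a 240-frame video
idx = uniform_sample_indices(240, 8)
print(idx)            # [  0  34  68 102 137 171 205 239]
# clip = frames[idx]  # frames: (T, H, W, C) array from your decoder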
State-of-the-Art Architectures
From I3D to vision-language models: the evolution of video understanding.
SlowFast Networks
Two pathways: Slow for spatial semantics, Fast for temporal dynamics
- Pro: Efficient design
- Pro: No optical flow needed
- Pro: Excellent accuracy/speed trade-off
- Con: Still requires many frames
- Con: Two pathways add complexity
Interactive: Video Clip Classification
Watch how action recognition unfolds frame by frame.
How Temporal Context Helps
Notice how early frames have low confidence: the model is uncertain whether the person will walk or run. As more frames arrive, the model sees the motion pattern (stride length, speed) and confidence increases. This is why temporal modeling is essential - single-frame classification would be much less reliable.
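A minimal sketch of the idea behind this demo, assuming you supply a preprocessed clip tensor and any clip-level classifier (both are placeholders here): classify progressively longer prefixes of the video and record how the top confidence evolves.
import torch

def confidence_over_time(model, frames, num_model_frames=16):
    # frames: (C, T, H, W) preprocessed video tensor
    # model: any callable mapping (1, C, num_model_frames, H, W) to class logits
    results = []
    for t in range(2, frames.shape[1] + 1):
        # Resample the first t frames to the fixed count the model expects
        idx = torch.linspace(0, t - 1, num_model_frames).round().long()
        clip = frames[:, :t][:, idx].unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(model(clip), dim=1)
        conf, label = probs.max(dim=1)
        results.append((t, label.item(), conf.item()))
    return results  # confidence typically rises as the motion pattern becomes visible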
Code Examples
Production-ready code for major video action recognition frameworks.
import torch
from pytorchvideo.models.hub import slowfast_r50
from pytorchvideo.transforms import (
    UniformTemporalSubsample,
    ShortSideScale,
)
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo,
)
# Load pretrained SlowFast model (Kinetics-400)
model = slowfast_r50(pretrained=True)
model = model.eval()
# SlowFast requires specific input format:
# - Slow pathway: T=8 frames (every 8th frame from clip)
# - Fast pathway: T=32 frames (every 2nd frame from clip)
# Preprocessing pipeline
transform = Compose([
    UniformTemporalSubsample(32),           # Sample 32 frames uniformly from the clip
    Lambda(lambda x: x / 255.0),            # Scale uint8 pixels to [0, 1], as the mean/std below expect
    ShortSideScale(size=256),
    CenterCropVideo(crop_size=(256, 256)),
    NormalizeVideo(
        mean=[0.45, 0.45, 0.45],
        std=[0.225, 0.225, 0.225]
    ),
])
# Load and process video
# load_your_video is a placeholder: it should return a uint8 tensor of shape (C, T, H, W)
video_tensor = load_your_video("action_clip.mp4")
video_tensor = transform(video_tensor)
# Create SlowFast input: list of [slow_pathway, fast_pathway]
# Slow: subsample by 4 (32 -> 8 frames)
# Fast: keep all 32 frames
slow_pathway = video_tensor[:, ::4, :, :] # Shape: (3, 8, 256, 256)
fast_pathway = video_tensor # Shape: (3, 32, 256, 256)
inputs = [slow_pathway.unsqueeze(0), fast_pathway.unsqueeze(0)]
# Inference
with torch.no_grad():
    predictions = model(inputs)
probs = torch.softmax(predictions, dim=1)
# Get top-5 predictions
# kinetics_labels is assumed to be the list of 400 Kinetics-400 class names, loaded separately
top5_probs, top5_indices = probs.topk(5)
print("Top 5 actions:")
for prob, idx in zip(top5_probs[0], top5_indices[0]):
print(f" {kinetics_labels[idx]}: {prob:.2%}")Quick Reference
- PyTorchVideo + SlowFast: 8-32 frame input, uniform sampling
- VideoMAE (pretrained): 16 frames at 224 px, fine-tune on target data (sketched in code after this list)
- X-CLIP (vision-language): 8-32 frames, custom text prompts
- Benchmark datasets: Kinetics-400/600/700, Something-Something, UCF-101, HMDB-51
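A brief VideoMAE inference sketch via Hugging Face Transformers; the "MCG-NJU/videomae-base-finetuned-kinetics" checkpoint name and the random frames are assumptions, so substitute your own decoded frames and checkpoint.
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# 16 RGB frames, each (H, W, 3); replace with frames decoded from your video
video = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt).eval()

inputs = processor(video, return_tensors="pt")  # pixel_values: (1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # Kinetics-400 class name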
Use Cases
- ✓ Safety monitoring
- ✓ Sports analytics
- ✓ Retail analytics
- ✓ Video recommendation
Architectural Patterns
2D + Temporal Pooling
CNN/ViT on frames with temporal aggregation.
3D Conv / Video Transformer
Spatiotemporal models capturing motion.
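A minimal sketch of the first pattern, assuming a torchvision ResNet-18 backbone: run a 2D CNN on every frame, average the per-frame features over time, then classify.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameCNNWithTemporalPooling(nn.Module):
    def __init__(self, num_classes=400):
        super().__init__()
        backbone = resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
        self.head = nn.Linear(512, num_classes)

    def forward(self, clip):  # clip: (B, C, T, H, W)
        b, c, t, h, w = clip.shape
        frames = clip.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        feats = self.features(frames).flatten(1)               # (B*T, 512) per-frame features
        feats = feats.reshape(b, t, -1).mean(dim=1)             # temporal average pooling
        return self.head(feats)

model = FrameCNNWithTemporalPooling()
print(model(torch.randn(2, 3, 8, 112, 112)).shape)  # torch.Size([2, 400])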
Quick Facts
- Input: Video
- Output: Structured data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches