Image to Video
Animate still images into videos. Bring photos to life with natural motion.
How Image-to-Video Works
A deep technical exploration of how AI transforms a single image into moving video. From the mathematics of temporal attention to practical generation with state-of-the-art models.
The Fundamental Problem
Why is image-to-video so hard? A single image contains no temporal information. The model must invent motion that looks plausible.
The Core Insight
Picture a photograph of a dog. What happens next? The dog could run left, right, jump, bark, lie down, or do nothing at all. There is no "correct" answer encoded in the pixels.
Image-to-video models learn a prior over plausible motions from millions of training videos. Given a dog image, the model has seen thousands of dogs moving, and it generates motion that looks like what dogs typically do.
Temporal Coherence
"How do you ensure frame 47 looks like it belongs with frames 46 and 48?"
Temporal attention layers that let each frame 'see' neighboring frames during generation
Identity Preservation
"How do you keep the dog in frame 1 looking like the same dog in frame 100?"
Condition every frame on the input image's latent representation
Motion Plausibility
"How do you generate motion that looks physically realistic?"
Train on millions of real videos so the model learns how things actually move
Computational Cost
"Video is 24-60x more data than a single image. How do you make this tractable?"
Work in latent space (compressed), use sparse attention, generate at low FPS then interpolate
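A quick back-of-envelope sketch (in Python) of why working in latent space matters, assuming the 8x spatial downsampling and 4-channel latents that SVD inherits from Stable Diffusion's VAE:

# Rough comparison: pixel-space vs latent-space tensor sizes for a 25-frame
# clip at SVD's native 1024x576 resolution.
# Assumes an SD-style VAE: 8x spatial downsampling, 4 latent channels.
frames, height, width = 25, 576, 1024

pixel_values = frames * height * width * 3             # RGB pixels
latent_values = frames * (height // 8) * (width // 8) * 4

print(f"pixel space : {pixel_values:,} values")         # ~44 million
print(f"latent space: {latent_values:,} values")        # ~0.9 million
print(f"reduction   : {pixel_values / latent_values:.0f}x fewer values to denoise")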
What Changes Between Frames?
Example: a ball that moves ~20 pixels per frame. The model must predict this entire trajectory from frame 1 alone.
Temporal Attention: The Key Innovation
Regular image diffusion models have no concept of time. Temporal attention adds a mechanism for frames to communicate with each other during generation.
Think of It This Way
Imagine you're an animator drawing frame 15 of a running dog. You don't draw it in isolation. You flip back to frames 10-14 to see where the dog was, and you think ahead to frames 16-20 to plan where it's going.
Temporal attention gives the neural network this same ability. Each frame can "look at" other frames while being generated, ensuring smooth, coherent motion.
Spatial Attention First
Each frame is processed independently through spatial self-attention, just like image generation. Tokens attend to other tokens within the same frame.
Spatial: attention within each frame (standard image diffusion)
Temporal Attention Second
Now comes the magic: tokens at the same spatial position across all frames attend to each other. Frame 1's sky pixel talks to frame 2's sky pixel, and so on.
Information Flows Through Time
This creates a 'temporal highway' where motion information propagates. If an object moved left in frames 1-5, the model knows to continue that trajectory in frames 6-10.
Motion information flows through temporal connections
Technical Implementation
SVD-style: insert temporal attention layers after every spatial attention block in the U-Net. These temporal layers are trained while the spatial layers (inherited from Stable Diffusion) stay frozen.
Sora-style: use a Diffusion Transformer (DiT) that treats video as a sequence of spacetime patches, with full attention across all patches (extremely compute-intensive).
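A minimal PyTorch sketch of the factorized spatial-then-temporal attention used in the pseudo-3D approach. The reshapes are the core idea; the dimensions, layer sizes, and class name are illustrative rather than SVD's actual configuration.

import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = x.shape

        # Spatial attention: every frame attends only to its own tokens.
        xs = x.reshape(b * f, t, d)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]

        # Temporal attention: each spatial position attends across all frames.
        xt = xs.reshape(b, f, t, d).permute(0, 2, 1, 3).reshape(b * t, f, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]

        return xt.reshape(b, t, f, d).permute(0, 2, 1, 3)

# Example: 2 clips, 8 frames, a 16x16 latent grid (256 tokens), 64-dim features.
block = FactorizedSpaceTimeBlock(dim=64)
video_tokens = torch.randn(2, 8, 256, 64)
print(block(video_tokens).shape)  # torch.Size([2, 8, 256, 64])

The temporal step is the "temporal highway" described above: tokens at the same spatial position exchange information across every frame.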
Motion Control: The Motion Bucket ID
How do you tell the model "move a lot" vs "barely move"? SVD uses a clever conditioning signal called the Motion Bucket ID.
How Motion Bucket Works
During training, Stability AI computed the optical flow magnitude for each training video. Videos with lots of motion got high values (200+), static videos got low values (20-50). This value was fed to the model as a conditioning signal.
At inference, you provide a motion bucket ID (0-255). The model generates motion that feels like videos with similar flow values in the training set.
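A rough sketch of that training-time scoring using OpenCV's Farneback optical flow. The clip path is a placeholder, and the final mapping from raw flow magnitude to a 0-255 bucket is Stability AI's own binning, which is not reproduced here.

import cv2
import numpy as np

cap = cv2.VideoCapture("reference_clip.mp4")   # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

magnitudes = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow between consecutive frames (standard Farneback params).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    prev_gray = gray
cap.release()

print(f"mean flow magnitude: {np.mean(magnitudes):.2f} px/frame")
# During SVD training this kind of score was binned into the 0-255
# motion bucket range; higher scores correspond to higher bucket IDs.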
Motion Bucket ID Scale
- Very low (≈20-50): A photograph slowly coming to life with subtle breathing. Best for: Portraits, product shots, cinemagraphs.
- Low (≈50-100): Hair blowing softly in the wind, water gently rippling. Best for: Nature scenes, ambient loops.
- Medium (≈100-150, 127 is the default): A person walking, camera slowly panning across a scene. Best for: General purpose, balanced results.
- High (≈150-200): Running, dancing, action sequences. Best for: Dynamic scenes, sports.
- Very high (200+): Fast action, chaotic movement, dramatic camera work. Best for: Action shots, but may cause artifacts.
Pro Tips for Motion Control
Start with 127 (default) and adjust based on results
Match motion to content: portraits low (50-80), action high (150+)
Values above 200 often cause artifacts and distortion
Combine with noise_aug_strength for more variation
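One practical way to apply these tips is a fixed-seed sweep: render the same clip at a few motion_bucket_id values and keep the one that suits the content. A minimal sketch (it assumes a CUDA GPU and a local input.jpg; see the full Code Examples section below for details on each parameter):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.jpg").resize((1024, 576))

for bucket in (60, 127, 180):                      # low / default / high motion
    frames = pipe(
        image,
        motion_bucket_id=bucket,
        noise_aug_strength=0.02,
        decode_chunk_size=8,
        generator=torch.Generator().manual_seed(42),  # fixed seed: only motion changes
    ).frames[0]
    export_to_video(frames, f"sweep_bucket_{bucket}.mp4", fps=6)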
Architecture Approaches
Three fundamentally different approaches to image-to-video generation, each with its own trade-offs.
The SVD Pipeline
Latent Video Diffusion (SVD)
Extend image diffusion to video by adding temporal layers between spatial layers. The 2D U-Net becomes a pseudo-3D U-Net.
- + Builds on proven image diffusion
- + Efficient in latent space
- + Good visual quality
- - Hard to scale to long videos
- - Temporal artifacts possible
Full 3D Attention (Sora-style)
Treat video as a sequence of spacetime patches. Full attention across all patches in space and time.
- + Most coherent results
- + Scales with compute
- + Unified architecture
- - Extremely expensive
- - Quadratic in video length
Warping/Stitching (LivePortrait)
Detect keypoints in source image, animate them based on driving signal, warp source image to match. No diffusion at all.
- + Real-time
- + Precise control
- + No hallucination
- - Limited to faces/bodies
- - Needs driving signal
- - Can't generate new content
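A toy sketch of the warping idea only, not LivePortrait's actual pipeline: estimate a transform from source keypoints to driving keypoints and warp the source image toward the driving pose. The file names and keypoint coordinates below are made up; real systems predict dense warp fields from dozens of learned keypoints per frame.

import cv2
import numpy as np

source = cv2.imread("source_face.jpg")               # placeholder input
h, w = source.shape[:2]

# In a real system these come from a keypoint detector on the source image
# and on one frame of the driving video.
src_pts = np.float32([[220, 180], [300, 180], [260, 260]])
drv_pts = np.float32([[225, 185], [305, 178], [262, 268]])

# Estimate a similarity transform and warp the source toward the driving pose.
matrix, _ = cv2.estimateAffinePartial2D(src_pts, drv_pts)
warped = cv2.warpAffine(source, matrix, (w, h))
cv2.imwrite("warped_frame.jpg", warped)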
Model Comparison (2024 State of the Art)
The current landscape of image-to-video models, from open-source to commercial APIs.
- Stable Video Diffusion (Stability AI, Nov 2023)
- Runway Gen-3 Alpha (Runway, Jun 2024)
- Kling (Kuaishou, Jun 2024)
- Luma Dream Machine (Luma AI, Jun 2024)
- LivePortrait (Kuaishou, open source, Jul 2024)
- CogVideoX (Tsinghua/Zhipu, Aug 2024)
Real Examples: What to Expect
Different input types produce different motion patterns. Here's what actually works.
Works Well
- Natural scenes (water, clouds, fire, foliage)
- Portraits with subtle expressions
- Camera motion (pan, zoom, push)
- Animals with predictable motion
- Atmospheric effects (rain, snow, fog)
Struggles With
- Text and fine details (often distort)
- Hands and fingers (same as image gen)
- Complex multi-object interactions
- Physics-heavy scenarios (falling, bouncing)
- Maintaining geometric consistency
Code Examples
Working code for the major image-to-video approaches.
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch
# Load the SVD-XT model (25 frames, better motion)
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt",
torch_dtype=torch.float16,
variant="fp16"
)
pipe.to("cuda")
# On low-VRAM GPUs, skip .to("cuda") and offload instead:
# pipe.enable_model_cpu_offload()
# Load and prepare your input image
# IMPORTANT: SVD expects 1024x576 (16:9) images
image = load_image("input.jpg")
image = image.resize((1024, 576))
# Generate video frames
# motion_bucket_id: 0-255, higher = more motion
# noise_aug_strength: adds noise to condition, helps with some images
frames = pipe(
image,
num_frames=25, # SVD-XT supports 25 frames
motion_bucket_id=127, # Default, moderate motion
noise_aug_strength=0.02, # Slight noise helps generalization
num_inference_steps=25, # More steps = better quality
decode_chunk_size=8, # Memory optimization
).frames[0]
# Export to video file (6 FPS is SVD's native rate)
export_to_video(frames, "output.mp4", fps=6)
print(f"Generated video with {len(frames)} frames")
# Pro tip: Use RIFE or FILM to interpolate to 24/30 FPS
# This dramatically improves perceived smoothness
Common Pitfalls and How to Avoid Them
Mistakes that waste compute time and produce bad results.
Using the wrong aspect ratio
SVD is trained on 1024x576. Other ratios cause severe distortion.
Always resize/crop to the model's native aspect ratio before inference (see the crop helper after this list).
Motion bucket too high for static scenes
High motion + static input = the model hallucinates random motion.
Match motion bucket to content. Portraits: 40-80. Action: 150-200.
Expecting long videos from short models
SVD generates 14-25 frames. Looping or extending creates artifacts.
Use models designed for length (Kling, Gen-3) or accept short clips.
Low quality input images
Garbage in, garbage out. Blurry/noisy inputs get amplified.
Use high-resolution, well-lit, sharp images. Upscale if needed.
Ignoring the first frame
The input image IS the first frame. Bad composition = bad video.
Compose input as if it's a cinematic opening shot.
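For the aspect-ratio pitfall above, center-cropping to 16:9 before resizing avoids squashing the image. A minimal Pillow sketch (the helper name is ours):

from PIL import Image

def prepare_for_svd(path: str, size=(1024, 576)) -> Image.Image:
    """Center-crop to the target aspect ratio, then resize."""
    img = Image.open(path).convert("RGB")
    target_ratio = size[0] / size[1]
    w, h = img.size
    if w / h > target_ratio:              # too wide: crop the sides
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                                 # too tall: crop top and bottom
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize(size, Image.LANCZOS)

image = prepare_for_svd("input.jpg")   # ready for the SVD pipeline above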
Quick Reference
- SVD-XT (open source)
- 1024x576 images only
- motion_bucket_id=127
- Runway Gen-3 (quality)
- Luma (speed)
- Kling (long videos)
- LivePortrait (real-time)
- EMO / Hallo (audio)
- SVD with low motion
- Motion bucket: 0-255
- Noise aug: 0.02 default
- Steps: 20-50
Use Cases
- ✓ Photo animation
- ✓ Product showcase videos
- ✓ Social media content
- ✓ Memory preservation
Architectural Patterns
Image-Conditioned Video Diffusion
Use the image as the first frame and generate subsequent frames.
- + Preserves subject identity
- + Natural motion
- - Limited by first frame
- - Motion can be subtle
Motion Transfer
Apply motion from a reference video to a still image.
- + Controllable motion
- + Consistent
- - Needs reference motion
- - May not match subject
Implementations
API Services
Runway Gen-3 Alpha
Runway. High-quality image-to-video. Camera motion control.
Kling
Kuaishou. Excellent motion quality. Long video support.
Pika
Pika Labs. Creative effects and motion. Web interface.
Open Source
Stable Video Diffusion
Stability AI Community. Standard for image animation. 14-25 frame outputs.
Quick Facts
- Input: Image
- Output: Video
- Implementations: 1 open source, 3 API
- Patterns: 2 approaches