
Image to Video

Animate still images into videos. Bring photos to life with natural motion.

How Image-to-Video Works

A deep technical exploration of how AI transforms a single image into moving video. From the mathematics of temporal attention to practical generation with state-of-the-art models.

1

The Fundamental Problem

Why is image-to-video so hard? A single image contains no temporal information. The model must invent motion that looks plausible.

The Core Insight

Picture a photograph of a dog. What happens next? The dog could run left, right, jump, bark, lie down, or do nothing at all. There is no "correct" answer encoded in the pixels.

Image-to-video models learn a prior over plausible motions from millions of training videos. Given a dog image, the model has seen thousands of dogs moving, and it generates motion that looks like what dogs typically do.

Temporal Coherence

"How do you ensure frame 47 looks like it belongs with frames 46 and 48?"

Temporal attention layers that let each frame 'see' neighboring frames during generation

Identity Preservation

"How do you keep the dog in frame 1 looking like the same dog in frame 100?"

Condition every frame on the input image's latent representation

Motion Plausibility

"How do you generate motion that looks physically realistic?"

Train on millions of real videos so the model learns how things actually move

Computational Cost

"Video is 24-60x more data than a single image. How do you make this tractable?"

Work in latent space (compressed), use sparse attention, generate at low FPS then interpolate
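
To see why working in latent space matters, here is a quick back-of-envelope comparison: a sketch assuming SVD's 1024x576 resolution, a 25-frame clip, and a Stable Diffusion-style VAE with 8x spatial downsampling and 4 latent channels.

# Rough size comparison: pixel space vs. latent space for a short clip
# (assumes 1024x576 at 25 frames and an SD-style VAE: 8x downsampling, 4 channels)
frames, height, width = 25, 576, 1024

pixel_values = frames * height * width * 3                  # RGB pixels
latent_values = frames * (height // 8) * (width // 8) * 4   # VAE latents

print(f"Pixel space : {pixel_values:,} values")   # 44,236,800
print(f"Latent space: {latent_values:,} values")  # 921,600
print(f"Ratio       : {pixel_values // latent_values}x fewer values to denoise")  # 48x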

What Changes Between Frames?

[Five-frame animation: a blue ball crossing the frame.]

The blue ball moves ~20 pixels per frame. The model must predict this trajectory from just frame 1.

2

Temporal Attention: The Key Innovation

Regular image diffusion models have no concept of time. Temporal attention adds a mechanism for frames to communicate with each other during generation.

Think of It This Way

Imagine you're an animator drawing frame 15 of a running dog. You don't draw it in isolation. You flip back to frames 10-14 to see where the dog was, and you think ahead to frames 16-20 to plan where it's going.

Temporal attention gives the neural network this same ability. Each frame can "look at" other frames while being generated, ensuring smooth, coherent motion.

1
Spatial Attention First

Each frame is processed independently through spatial self-attention, just like image generation. Tokens attend to other tokens within the same frame.

[Diagram: pixel (2,3) attends to all pixels within frame 1.]

Spatial: attention within each frame (standard image diffusion)

2
Temporal Attention Second

Now comes the magic: tokens at the same spatial position across all frames attend to each other. Frame 1's sky pixel talks to frame 2's sky pixel, and so on.

[Diagram: tokens at the same spatial position in frames F1-F4 all attend to each other.]
3
Information Flows Through Time

This creates a 'temporal highway' where motion information propagates. If an object moved left in frames 1-5, the model knows to continue that trajectory in frames 6-10.

[Diagram: the motion signal propagates forward through the temporal attention connections.]

Technical Implementation
SVD / AnimateDiff Approach

Insert temporal attention layers after every spatial attention block in the U-Net. These layers are trained while keeping spatial layers frozen (from Stable Diffusion).

Sora / Movie Gen Approach

Use a Diffusion Transformer (DiT) that treats video as a sequence of spacetime patches. Full attention across all patches (extremely compute-intensive).
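
As a concrete illustration of the spatial-then-temporal pattern, here is a minimal PyTorch sketch. It is a hypothetical module, not the actual SVD or AnimateDiff code, which adds normalization, feed-forward layers, and cross-attention around the same reshape trick.

import torch
import torch.nn as nn

class PseudoSpatioTemporalBlock(nn.Module):
    """Illustrative block: spatial attention within each frame,
    then temporal attention across frames at the same spatial position."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, n, d = x.shape

        # 1) Spatial attention: fold frames into the batch, so each frame
        #    only attends to its own tokens (standard image diffusion)
        xs = x.reshape(b * f, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, f, n, d)

        # 2) Temporal attention: fold spatial positions into the batch, so each
        #    position attends to itself across all frames (the "temporal highway")
        xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x

# Example: 2 clips, 14 frames, 24x24 latent tokens, 320-dim features
x = torch.randn(2, 14, 24 * 24, 320)
print(PseudoSpatioTemporalBlock(320)(x).shape)  # torch.Size([2, 14, 576, 320])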

3

Motion Control: The Motion Bucket ID

How do you tell the model "move a lot" vs "barely move"? SVD uses a clever conditioning signal called the Motion Bucket ID.

How Motion Bucket Works

During training, Stability AI computed the optical flow magnitude for each training video. Videos with lots of motion got high values (200+), static videos got low values (20-50). This value was fed to the model as a conditioning signal.

At inference, you provide a motion bucket ID (0-255). The model generates motion that feels like videos with similar flow values in the training set.
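
Here is a rough sketch of how such a motion score can be computed for a training clip using OpenCV's Farneback optical flow. The exact preprocessing and bucket boundaries Stability AI used are not public, so treat this as illustrative only.

import cv2
import numpy as np

def motion_score(video_path: str, max_frames: int = 50) -> float:
    """Mean optical-flow magnitude between consecutive frames (illustrative)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise ValueError(f"Could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())  # per-pixel motion
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Clips with higher mean flow magnitude map to higher motion bucket IDs
print(f"Mean flow magnitude: {motion_score('training_clip.mp4'):.2f}")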

Motion Bucket ID Scale

0 = static (cinemagraph), 127 = default, 255 = dynamic (action)
0
Almost no motion

A photograph slowly coming to life with subtle breathing

Best for: Portraits, product shots, cinemagraphs

64
Gentle motion

Hair blowing softly in the wind, water gently rippling

Best for: Nature scenes, ambient loops

127
Moderate motion (default)

A person walking, camera slowly panning across a scene

Best for: General purpose, balanced results

180
Active motion

Running, dancing, action sequences

Best for: Dynamic scenes, sports

255
Maximum motion

Fast action, chaotic movement, dramatic camera work

Best for: Action shots, but may cause artifacts

Pro Tips for Motion Control
1. Start with 127 (default) and adjust based on results.
2. Match motion to content: portraits low (50-80), action high (150+).
3. Values above 200 often cause artifacts and distortion.
4. Combine with noise_aug_strength for more variation.

4

Architecture Approaches

Three fundamentally different approaches to image-to-video generation, each with its own trade-offs.

The SVD Pipeline

Image (input) -> CLIP + VAE (encode) -> 3D U-Net + temporal attention (denoise) -> VAE decode (to pixels) -> 14-25 frames (output)
Latent Video Diffusion (SVD)

Extend image diffusion to video by adding temporal layers between spatial layers. The 2D U-Net becomes a pseudo-3D U-Net.

Advantages:
  • Builds on proven image diffusion
  • Efficient in latent space
  • Good visual quality
Disadvantages:
  • Hard to scale to long videos
  • Temporal artifacts possible

Full 3D Attention (Sora-style)

Treat video as a sequence of spacetime patches. Full attention across all patches in space and time.

Advantages:
  • Most coherent results
  • Scales with compute
  • Unified architecture
Disadvantages:
  • Extremely expensive
  • Quadratic in video length

Warping/Stitching (LivePortrait)

Detect keypoints in the source image, animate them based on a driving signal, and warp the source image to match. No diffusion at all.

Advantages:
  • Real-time
  • Precise control
  • No hallucination
Disadvantages:
  • Limited to faces/bodies
  • Needs a driving signal
  • Can't generate new content
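
To make the Sora-style "spacetime patches" idea concrete, here is a minimal sketch of turning a (latent) video into a flat token sequence with a 3D convolution. This is a standard ViT-style patch embedding extended to time, not Sora's actual implementation, which is unpublished. Full attention then runs over these tokens, which is why the cost grows quadratically with video length.

import torch
import torch.nn as nn

class SpacetimePatchEmbed(nn.Module):
    """Illustrative: cut a video into non-overlapping spacetime patches
    and project each patch to one token."""

    def __init__(self, channels=4, dim=768, patch=(2, 16, 16)):
        super().__init__()
        # kernel == stride, so each output location corresponds to one
        # (time x height x width) patch, projected to `dim` features
        self.proj = nn.Conv3d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width), e.g. VAE latents
        tokens = self.proj(video)                 # (batch, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)  # (batch, T'*H'*W', dim)

# 16 latent frames of a 64x64 latent video -> 8 * 4 * 4 = 128 spacetime tokens
video = torch.randn(1, 4, 16, 64, 64)
print(SpacetimePatchEmbed()(video).shape)  # torch.Size([1, 128, 768])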
5

Model Comparison (2024 State of the Art)

The current landscape of image-to-video models, from open-source to commercial APIs.

Stable Video Diffusion - Stability AI, Nov 2023 (Open Source)
Max length: 4s (14 frames) / 25 frames (XT) | Resolution: 1024x576 | Output FPS: 6 | Approach: Latent Video Diffusion
Strengths: Open weights | Customizable | Active community
Weaknesses: Limited motion | Fixed aspect ratio | No text prompts

Runway Gen-3 Alpha - Runway, Jun 2024 (Commercial API)
Max length: 10s | Resolution: 1080p | Output FPS: 24 | Approach: Multimodal Transformer
Strengths: Best quality | Text+image control | Consistent motion
Weaknesses: Expensive | API only | Usage limits

Kling - Kuaishou, Jun 2024 (Commercial)
Max length: 5s (standard) / 2 min (extended) | Resolution: 1080p | Output FPS: 30 | Approach: 3D VAE + DiT
Strengths: Long videos | Great motion | Physics understanding
Weaknesses: Limited access | Chinese platform | Watermark

Luma Dream Machine - Luma AI, Jun 2024 (Commercial)
Max length: 5s | Resolution: 720p | Output FPS: 24 | Approach: Multimodal Transformer
Strengths: Fast generation | Good quality/speed | Easy to use
Weaknesses: Lower resolution | Sometimes inconsistent

LivePortrait - Kuaishou, Jul 2024 (Open Source)
Max length: Unlimited | Resolution: 512px | Output FPS: 30 | Approach: Stitching + Warping
Strengths: Real-time | Precise control | No diffusion artifacts
Weaknesses: Portraits only | Needs driving video | Limited motion range

CogVideoX - Tsinghua/Zhipu, Aug 2024 (Open Source)
Max length: 6s | Resolution: 720p | Output FPS: 8 | Approach: 3D VAE + Expert Transformer
Strengths: Open weights | Text+image | Good quality
Weaknesses: Slow | High VRAM | Lower FPS
Best Open Source: SVD-XT / CogVideoX. Free, customizable, run locally. SVD for general use, CogVideoX for text+image control.
Best Quality: Runway Gen-3 Alpha. Premium API with the best overall quality, motion, and text control.
Best for Portraits: LivePortrait. Real-time, precise control, no diffusion artifacts. Limited to faces.
6

Real Examples: What to Expect

Different input types produce different motion patterns. Here's what actually works.

Input: Portrait photo of a woman with closed eyes
Expected motion: Eyes slowly open, subtle facial movements, hair gently sways
Best model: LivePortrait, or SVD with a low motion bucket
Challenge: Maintaining facial identity while adding natural micro-expressions

Input: Landscape with mountains and a lake
Expected motion: Clouds drift, water ripples, trees sway gently in the wind
Best model: SVD, or Runway Gen-3 with a nature prompt
Challenge: Keeping distant mountains static while animating foreground elements

Input: Product photo of a sneaker on a white background
Expected motion: Camera orbit around the product, subtle lighting changes
Best model: Runway Gen-3 with a camera motion prompt
Challenge: Maintaining sharp edges and product accuracy during rotation

Input: Anime character in an action pose
Expected motion: Hair flowing, clothes billowing, dynamic action continuation
Best model: AnimateDiff with an anime-trained LoRA
Challenge: Preserving the art style while adding fluid animation

Input: Close-up of flames or water
Expected motion: Fluid dynamics: fire flickering, water flowing
Best model: SVD with a high motion bucket (180+)
Challenge: Creating chaotic yet natural fluid motion
Works Well
  • Natural scenes (water, clouds, fire, foliage)
  • Portraits with subtle expressions
  • Camera motion (pan, zoom, push)
  • Animals with predictable motion
  • Atmospheric effects (rain, snow, fog)

Struggles With
  • Text and fine details (often distort)
  • Hands and fingers (same as image gen)
  • Complex multi-object interactions
  • Physics-heavy scenarios (falling, bouncing)
  • Maintaining geometric consistency
7

Code Examples

Working code for the major image-to-video approaches.

Stable Video Diffusion (SVD-XT) - Recommended Start
Install: pip install diffusers transformers accelerate
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch

# Load the SVD-XT model (25 frames, better motion)
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16"
)
# Use CPU offload to save VRAM; with offload enabled you don't also
# need to move the whole pipeline to the GPU with pipe.to("cuda")
pipe.enable_model_cpu_offload()

# Load and prepare your input image
# IMPORTANT: SVD expects 1024x576 (16:9) images
image = load_image("input.jpg")
image = image.resize((1024, 576))

# Generate video frames
# motion_bucket_id: 0-255, higher = more motion
# noise_aug_strength: adds noise to condition, helps with some images
frames = pipe(
    image,
    num_frames=25,                # SVD-XT supports 25 frames
    motion_bucket_id=127,         # Default, moderate motion
    noise_aug_strength=0.02,      # Slight noise helps generalization
    num_inference_steps=25,       # More steps = better quality
    decode_chunk_size=8,          # Memory optimization
).frames[0]

# Export to video file (6 FPS is SVD's native rate)
export_to_video(frames, "output.mp4", fps=6)
print(f"Generated video with {len(frames)} frames")

# Pro tip: Use RIFE or FILM to interpolate to 24/30 FPS
# This dramatically improves perceived smoothness
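
CogVideoX (open source, text+image): a minimal sketch, assuming your diffusers version includes CogVideoXImageToVideoPipeline and you have the THUDM/CogVideoX-5b-I2V checkpoint. Unlike SVD, motion here is steered with a text prompt rather than a motion bucket ID.

from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image, export_to_video
import torch

# Load the CogVideoX image-to-video model (text + image conditioning)
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # large model; offload layers to fit consumer GPUs

image = load_image("input.jpg")

# Describe the desired motion in natural language
frames = pipe(
    prompt="the camera slowly pans right as waves roll onto the beach",
    image=image,
    num_frames=49,            # about 6 seconds at 8 FPS
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "cogvideox_output.mp4", fps=8)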
8

Common Pitfalls and How to Avoid Them

Mistakes that waste compute time and produce bad results.

1. Using the wrong aspect ratio
SVD is trained on 1024x576. Other ratios cause severe distortion.
FIX: Always resize/crop to the model's native aspect ratio before inference.

2. Motion bucket too high for static scenes
High motion + static input = the model hallucinates random motion.
FIX: Match motion bucket to content. Portraits: 40-80. Action: 150-200.

3. Expecting long videos from short models
SVD generates 14-25 frames. Looping or extending creates artifacts.
FIX: Use models designed for length (Kling, Gen-3) or accept short clips.

4. Low quality input images
Garbage in, garbage out. Blurry/noisy inputs get amplified.
FIX: Use high-resolution, well-lit, sharp images. Upscale if needed.

5. Ignoring the first frame
The input image IS the first frame. Bad composition = bad video.
FIX: Compose the input as if it's a cinematic opening shot.

Quick Reference

For Getting Started
  • SVD-XT (open source)
  • 1024x576 images only
  • motion_bucket_id=127
For Production
  • Runway Gen-3 (quality)
  • Luma (speed)
  • Kling (long videos)
For Portraits
  • LivePortrait (real-time)
  • EMO / Hallo (audio)
  • SVD with low motion
Key Parameters
  • Motion bucket: 0-255 (see the sweep sketch below)
  • Noise aug: 0.02 default
  • Steps: 20-50
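
To pick a motion level for a specific image, a quick sweep over motion_bucket_id with the same SVD-XT pipeline shown in the code example above is often faster than guessing. A minimal, self-contained sketch:

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.enable_model_cpu_offload()

image = load_image("input.jpg").resize((1024, 576))

# Render the same image at three motion levels and compare the results
for bucket in (40, 127, 180):
    frames = pipe(image, motion_bucket_id=bucket, noise_aug_strength=0.02,
                  decode_chunk_size=8).frames[0]
    export_to_video(frames, f"output_bucket_{bucket}.mp4", fps=6)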

Use Cases

  • Photo animation
  • Product showcase videos
  • Social media content
  • Memory preservation

Architectural Patterns

Image-Conditioned Video Diffusion

Use the image as the first frame and generate subsequent frames.

Pros:
  • Preserves subject identity
  • Natural motion
Cons:
  • Limited by first frame
  • Motion can be subtle

Motion Transfer

Apply motion from a reference video to a still image.

Pros:
  • Controllable motion
  • Consistent
Cons:
  • Needs reference motion
  • May not match subject

Implementations

API Services

Runway Gen-3 Alpha (Runway) - API
High-quality image-to-video. Camera motion control.

Kling (Kuaishou) - API
Excellent motion quality. Long video support.

Pika (Pika Labs) - API
Creative effects and motion. Web interface.

Open Source

Stable Video Diffusion (Stability AI / Community) - Open Source
Standard for image animation. 14-25 frame outputs.

Benchmarks

Quick Facts

Input: Image
Output: Video
Implementations: 1 open source, 3 API
Patterns: 2 approaches
