Image to Video
Animate still images into videos. Bring photos to life with natural motion.
How Image-to-Video Works
A deep technical exploration of how AI transforms a single image into moving video. From the mathematics of temporal attention to practical generation with state-of-the-art models.
The Fundamental Problem
Why is image-to-video so hard? A single image contains no temporal information. The model must invent motion that looks plausible.
The Core Insight
Picture a photograph of a dog. What happens next? The dog could run left, right, jump, bark, lie down, or do nothing at all. There is no "correct" answer encoded in the pixels.
Image-to-video models learn a prior over plausible motions from millions of training videos. Given a dog image, the model has seen thousands of dogs moving, and it generates motion that looks like what dogs typically do.
Temporal Coherence
"How do you ensure frame 47 looks like it belongs with frames 46 and 48?"
Temporal attention layers that let each frame 'see' neighboring frames during generation
Identity Preservation
"How do you keep the dog in frame 1 looking like the same dog in frame 100?"
Condition every frame on the input image's latent representation
Motion Plausibility
"How do you generate motion that looks physically realistic?"
Train on millions of real videos so the model learns how things actually move
Computational Cost
"Video is 24-60x more data than a single image. How do you make this tractable?"
Work in latent space (compressed), use sparse attention, generate at low FPS then interpolate
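A quick back-of-envelope sketch (in Python) of why working in latent space matters, assuming the 8x spatial downsampling and 4-channel latents that SVD inherits from Stable Diffusion's VAE:

# Rough comparison: pixel-space vs latent-space tensor sizes for a 25-frame
# clip at SVD's native 1024x576 resolution.
# Assumes an SD-style VAE: 8x spatial downsampling, 4 latent channels.
frames, height, width = 25, 576, 1024

pixel_values = frames * height * width * 3             # RGB pixels
latent_values = frames * (height // 8) * (width // 8) * 4

print(f"pixel space : {pixel_values:,} values")         # ~44 million
print(f"latent space: {latent_values:,} values")        # ~0.9 million
print(f"reduction   : {pixel_values / latent_values:.0f}x fewer values to denoise")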
What Changes Between Frames?
Example: a ball that moves ~20 pixels per frame. The model must predict this entire trajectory from frame 1 alone.
Temporal Attention: The Key Innovation
Regular image diffusion models have no concept of time. Temporal attention adds a mechanism for frames to communicate with each other during generation.
Think of It This Way
Imagine you're an animator drawing frame 15 of a running dog. You don't draw it in isolation. You flip back to frames 10-14 to see where the dog was, and you think ahead to frames 16-20 to plan where it's going.
Temporal attention gives the neural network this same ability. Each frame can "look at" other frames while being generated, ensuring smooth, coherent motion.
Spatial Attention First
Each frame is processed independently through spatial self-attention, just like image generation. Tokens attend to other tokens within the same frame.
Spatial: attention within each frame (standard image diffusion)
Temporal Attention Second
Now comes the magic: tokens at the same spatial position across all frames attend to each other. Frame 1's sky pixel talks to frame 2's sky pixel, and so on.
Information Flows Through Time
This creates a 'temporal highway' where motion information propagates. If an object moved left in frames 1-5, the model knows to continue that trajectory in frames 6-10.
Motion information flows through temporal connections
Technical Implementation
SVD-style: insert temporal attention layers after every spatial attention block in the U-Net. These temporal layers are trained while the spatial layers (inherited from Stable Diffusion) stay frozen.
Sora-style: use a Diffusion Transformer (DiT) that treats video as a sequence of spacetime patches, with full attention across all patches (extremely compute-intensive).
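A minimal PyTorch sketch of the factorized spatial-then-temporal attention used in the pseudo-3D approach. The reshapes are the core idea; the dimensions, layer sizes, and class name are illustrative rather than SVD's actual configuration.

import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = x.shape

        # Spatial attention: every frame attends only to its own tokens.
        xs = x.reshape(b * f, t, d)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]

        # Temporal attention: each spatial position attends across all frames.
        xt = xs.reshape(b, f, t, d).permute(0, 2, 1, 3).reshape(b * t, f, d)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]

        return xt.reshape(b, t, f, d).permute(0, 2, 1, 3)

# Example: 2 clips, 8 frames, a 16x16 latent grid (256 tokens), 64-dim features.
block = FactorizedSpaceTimeBlock(dim=64)
video_tokens = torch.randn(2, 8, 256, 64)
print(block(video_tokens).shape)  # torch.Size([2, 8, 256, 64])

The temporal step is the "temporal highway" described above: tokens at the same spatial position exchange information across every frame.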
Motion Control: The Motion Bucket ID
How do you tell the model "move a lot" vs "barely move"? SVD uses a clever conditioning signal called the Motion Bucket ID.
How Motion Bucket Works
During training, Stability AI computed the optical flow magnitude for each training video. Videos with lots of motion got high values (200+), static videos got low values (20-50). This value was fed to the model as a conditioning signal.
At inference, you provide a motion bucket ID (0-255). The model generates motion that feels like videos with similar flow values in the training set.
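A rough sketch of that training-time scoring using OpenCV's Farneback optical flow. The clip path is a placeholder, and the final mapping from raw flow magnitude to a 0-255 bucket is Stability AI's own binning, which is not reproduced here.

import cv2
import numpy as np

cap = cv2.VideoCapture("reference_clip.mp4")   # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

magnitudes = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow between consecutive frames (standard Farneback params).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    prev_gray = gray
cap.release()

print(f"mean flow magnitude: {np.mean(magnitudes):.2f} px/frame")
# During SVD training this kind of score was binned into the 0-255
# motion bucket range; higher scores correspond to higher bucket IDs.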
Motion Bucket ID Scale
- Very low (≈20-50): A photograph slowly coming to life with subtle breathing. Best for: Portraits, product shots, cinemagraphs.
- Low (≈50-100): Hair blowing softly in the wind, water gently rippling. Best for: Nature scenes, ambient loops.
- Medium (≈100-150, 127 is the default): A person walking, camera slowly panning across a scene. Best for: General purpose, balanced results.
- High (≈150-200): Running, dancing, action sequences. Best for: Dynamic scenes, sports.
- Very high (200+): Fast action, chaotic movement, dramatic camera work. Best for: Action shots, but may cause artifacts.
Pro Tips for Motion Control
Start with 127 (default) and adjust based on results
Match motion to content: portraits low (50-80), action high (150+)
Values above 200 often cause artifacts and distortion
Combine with noise_aug_strength for more variation
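One practical way to apply these tips is a fixed-seed sweep: render the same clip at a few motion_bucket_id values and keep the one that suits the content. A minimal sketch (it assumes a CUDA GPU and a local input.jpg; see the full Code Examples section below for details on each parameter):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.jpg").resize((1024, 576))

for bucket in (60, 127, 180):                      # low / default / high motion
    frames = pipe(
        image,
        motion_bucket_id=bucket,
        noise_aug_strength=0.02,
        decode_chunk_size=8,
        generator=torch.Generator().manual_seed(42),  # fixed seed: only motion changes
    ).frames[0]
    export_to_video(frames, f"sweep_bucket_{bucket}.mp4", fps=6)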
Architecture Approaches
Three fundamentally different approaches to image-to-video generation, each with its own trade-offs.
The SVD Pipeline
Latent Video Diffusion (SVD)
Extend image diffusion to video by adding temporal layers between spatial layers. The 2D U-Net becomes a pseudo-3D U-Net.
- + Builds on proven image diffusion
- + Efficient in latent space
- + Good visual quality
- - Hard to scale to long videos
- - Temporal artifacts possible
Full 3D Attention (Sora-style)
Treat video as a sequence of spacetime patches. Full attention across all patches in space and time.
- + Most coherent results
- + Scales with compute
- + Unified architecture
- - Extremely expensive
- - Quadratic in video length
Warping/Stitching (LivePortrait)
Detect keypoints in source image, animate them based on driving signal, warp source image to match. No diffusion at all.
- + Real-time
- + Precise control
- + No hallucination
- - Limited to faces/bodies
- - Needs driving signal
- - Can't generate new content
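A toy sketch of the warping idea only, not LivePortrait's actual pipeline: estimate a transform from source keypoints to driving keypoints and warp the source image toward the driving pose. The file names and keypoint coordinates below are made up; real systems predict dense warp fields from dozens of learned keypoints per frame.

import cv2
import numpy as np

source = cv2.imread("source_face.jpg")               # placeholder input
h, w = source.shape[:2]

# In a real system these come from a keypoint detector on the source image
# and on one frame of the driving video.
src_pts = np.float32([[220, 180], [300, 180], [260, 260]])
drv_pts = np.float32([[225, 185], [305, 178], [262, 268]])

# Estimate a similarity transform and warp the source toward the driving pose.
matrix, _ = cv2.estimateAffinePartial2D(src_pts, drv_pts)
warped = cv2.warpAffine(source, matrix, (w, h))
cv2.imwrite("warped_frame.jpg", warped)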
Model Comparison (2024 State of the Art)
The current landscape of image-to-video models, from open-source to commercial APIs.
- Stable Video Diffusion (Stability AI, Nov 2023)
- Runway Gen-3 Alpha (Runway, Jun 2024)
- Kling (Kuaishou, Jun 2024)
- Luma Dream Machine (Luma AI, Jun 2024)
- LivePortrait (Kuaishou, open source, Jul 2024)
- CogVideoX (Tsinghua/Zhipu, Aug 2024)
Real Examples: What to Expect
Different input types produce different motion patterns. Here's what actually works.
Works Well
- Natural scenes (water, clouds, fire, foliage)
- Portraits with subtle expressions
- Camera motion (pan, zoom, push)
- Animals with predictable motion
- Atmospheric effects (rain, snow, fog)
Struggles With
- Text and fine details (often distort)
- Hands and fingers (same as image gen)
- Complex multi-object interactions
- Physics-heavy scenarios (falling, bouncing)
- Maintaining geometric consistency
Code Examples
Working code for the major image-to-video approaches.
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
import torch
# Load the SVD-XT model (25 frames, better motion)
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt",
torch_dtype=torch.float16,
variant="fp16"
)
pipe.to("cuda")
# On low-VRAM GPUs, skip .to("cuda") and offload instead:
# pipe.enable_model_cpu_offload()
# Load and prepare your input image
# IMPORTANT: SVD expects 1024x576 (16:9) images
image = load_image("input.jpg")
image = image.resize((1024, 576))
# Generate video frames
# motion_bucket_id: 0-255, higher = more motion
# noise_aug_strength: adds noise to condition, helps with some images
frames = pipe(
image,
num_frames=25, # SVD-XT supports 25 frames
motion_bucket_id=127, # Default, moderate motion
noise_aug_strength=0.02, # Slight noise helps generalization
num_inference_steps=25, # More steps = better quality
decode_chunk_size=8, # Memory optimization
).frames[0]
# Export to video file (6 FPS is SVD's native rate)
export_to_video(frames, "output.mp4", fps=6)
print(f"Generated video with {len(frames)} frames")
# Pro tip: Use RIFE or FILM to interpolate to 24/30 FPS
# This dramatically improves perceived smoothness
Common Pitfalls and How to Avoid Them
Mistakes that waste compute time and produce bad results.
Using the wrong aspect ratio
SVD is trained on 1024x576. Other ratios cause severe distortion.
Always resize/crop to the model's native aspect ratio before inference (see the crop helper after this list).
Motion bucket too high for static scenes
High motion + static input = the model hallucinates random motion.
Match motion bucket to content. Portraits: 40-80. Action: 150-200.
Expecting long videos from short models
SVD generates 14-25 frames. Looping or extending creates artifacts.
Use models designed for length (Kling, Gen-3) or accept short clips.
Low quality input images
Garbage in, garbage out. Blurry/noisy inputs get amplified.
Use high-resolution, well-lit, sharp images. Upscale if needed.
Ignoring the first frame
The input image IS the first frame. Bad composition = bad video.
Compose input as if it's a cinematic opening shot.
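For the aspect-ratio pitfall above, center-cropping to 16:9 before resizing avoids squashing the image. A minimal Pillow sketch (the helper name is ours):

from PIL import Image

def prepare_for_svd(path: str, size=(1024, 576)) -> Image.Image:
    """Center-crop to the target aspect ratio, then resize."""
    img = Image.open(path).convert("RGB")
    target_ratio = size[0] / size[1]
    w, h = img.size
    if w / h > target_ratio:              # too wide: crop the sides
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                                 # too tall: crop top and bottom
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize(size, Image.LANCZOS)

image = prepare_for_svd("input.jpg")   # ready for the SVD pipeline above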
Quick Reference
- SVD-XT (open source)
- 1024x576 images only
- motion_bucket_id=127
- Runway Gen-3 (quality)
- Luma (speed)
- Kling (long videos)
- LivePortrait (real-time)
- EMO / Hallo (audio)
- SVD with low motion
- Motion bucket: 0-255
- Noise aug: 0.02 default
- Steps: 20-50
Use Cases
- ✓ Photo animation
- ✓ Product showcase videos
- ✓ Social media content
- ✓ Memory preservation
Architectural Patterns
Image-Conditioned Video Diffusion
Use the image as the first frame and generate subsequent frames.
- + Preserves subject identity
- + Natural motion
- - Limited by first frame
- - Motion can be subtle
Motion Transfer
Apply motion from a reference video to a still image.
- + Controllable motion
- + Consistent
- - Needs reference motion
- - May not match subject
Implementations
API Services
Runway Gen-3 Alpha
Runway. High-quality image-to-video. Camera motion control.
Kling
Kuaishou. Excellent motion quality. Long video support.
Pika
Pika Labs. Creative effects and motion. Web interface.
Open Source
Stable Video Diffusion
Stability AI Community. Standard for image animation. 14-25 frame outputs.
Quick Facts
- Input: Image
- Output: Video
- Implementations: 1 open source, 3 API
- Patterns: 2 approaches