Video-to-Video
Video-to-video translation transforms existing footage, applying style transfer, temporal super-resolution, relighting, face swapping, or motion retargeting while preserving temporal coherence across frames. The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency.
It is harder than image-to-image translation because each output frame must also be coherent with its neighbors. Diffusion-based approaches such as Rerender-A-Video and TokenFlow (both 2023) showed that propagating attention features between frames addresses this well, and they now produce commercially usable results. The practical frontier is real-time processing for live video: current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.
History
2018: Vid2vid (Wang et al.) applies conditional GANs to video translation with optical-flow-based temporal discriminators, producing the first convincing results
2019: Few-shot vid2vid demonstrates that video translation can work from just a few example frames, enabling face reenactment and pose transfer
2019: First Order Motion Model (FOMM) enables face animation from a single source image plus a driving video, going viral for deepfakes and creative tools
2022: Text2LIVE edits real videos guided by text using layered neural atlases, the first convincing text-driven video edits
2023: ControlVideo and Text2Video-Zero apply ControlNet-style conditioning to video generation with cross-frame attention for temporal consistency
2023: TokenFlow (Geyer et al.) propagates edited features along inter-frame correspondences, producing temporally consistent video edits from per-frame processing
2023: Rerender-A-Video and CoDeF achieve high-quality video style transfer with explicit temporal modeling; RAVE enables zero-shot video editing
2024: Runway Act-One and LivePortrait enable real-time face/expression transfer in video; commercial video editing APIs proliferate
2024–2025: Wan2.1 and HunyuanVideo support video-to-video editing natively; consistency techniques mature, with flow-based warping integrated into diffusion
How Video-to-Video Works
Temporal Decomposition
Methods decompose the video into canonical content (what things look like) and motion (how they move). CoDeF uses a canonical 2D neural field + deformation field; flow-based methods compute optical flow between frames.
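Flow-based methods rely on warping one frame into another's coordinate system. The sketch below shows backward warping with bilinear sampling in plain NumPy; `backward_warp` is a hypothetical helper, not any particular paper's implementation, and the flow convention (target pixel plus flow gives source pixel) is one common choice among several.

```python
import numpy as np

def backward_warp(image, flow):
    """Warp a source frame into a target frame's coordinates.

    image: (H, W) grayscale frame, float.
    flow:  (H, W, 2) per-pixel (dx, dy); target pixel (x, y) samples the
           source at (x + dx, y + dy). Bilinear sampling, border-clamped.
    """
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)

    # Integer corners and fractional weights for bilinear interpolation.
    x0 = np.floor(src_x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(src_y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = src_x - x0; wy = src_y - y0

    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

CoDeF-style canonical fields use the same operation: the deformation field is just a learned flow from each frame back to the canonical image.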
Keyframe Editing
Selected keyframes are edited using image-to-image techniques (ControlNet, SDXL img2img, style transfer). These edited keyframes serve as anchor points for the rest of the video.
Temporal Propagation
Three mechanisms spread keyframe edits across the rest of the video. TokenFlow propagates token-level diffusion features from edited keyframes to other frames along inter-frame correspondences. Optical-flow warping transfers the edit to adjacent frames using estimated flow. Cross-frame attention lets frames attend to one another during generation.
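The propagation step can be sketched as a nearest-neighbour feature copy, assuming flattened per-token diffusion features. The function name and the use of plain cosine nearest neighbours (rather than TokenFlow's exact correspondence extraction) are illustrative simplifications.

```python
import numpy as np

def propagate_features(key_feats, key_edited, target_feats):
    """Copy edited keyframe features onto a target frame.

    key_feats:    (N, D) features of the original keyframe.
    key_edited:   (N, D) features of the edited keyframe (same token layout).
    target_feats: (M, D) features of the original target frame.
    Returns (M, D): each target token receives the edited feature of its
    nearest original keyframe token under cosine similarity.
    """
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = unit(target_feats) @ unit(key_feats).T  # (M, N) cosine similarities
    nn = sim.argmax(axis=1)                       # correspondence per token
    return key_edited[nn]
```

Because correspondences are computed on the original (unedited) features, the motion of the input video is preserved while the appearance comes from the edit.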
Consistency Refinement
Post-processing removes residual flickering via temporal filtering, EbSynth-based patch matching, or diffusion-based temporal smoothing. Some methods use a video discriminator (GAN) or video quality loss.
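Temporal filtering of this kind can be sketched as an exponential moving average in which the previous output is first aligned to the current frame; `temporal_smooth` and its `warp_prev` callback are hypothetical names for illustration, and real pipelines would pass a flow-based warp plus occlusion masking.

```python
import numpy as np

def temporal_smooth(frames, warp_prev, alpha=0.7):
    """Suppress flicker by blending each edited frame with the warped
    previous output: out[t] = alpha * frames[t] + (1 - alpha) * warp(out[t-1]).

    frames:    list of (H, W) per-frame edited results (flickery).
    warp_prev: callable that aligns the previous output to the current
               frame; the identity suffices for a static scene.
    alpha:     weight on the current frame (lower = stronger smoothing).
    """
    out = [frames[0]]
    for f in frames[1:]:
        out.append(alpha * f + (1 - alpha) * warp_prev(out[-1]))
    return out
```

The trade-off is the usual one for recursive filters: a smaller `alpha` removes more flicker but ghosts fast motion, which is why production methods gate the blend with occlusion masks.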
Evaluation
Temporal consistency is measured by warping error between consecutive frames, visual quality by per-frame FID or video-level FVD, and overall quality by human evaluation (flicker, coherence, edit fidelity). There is no single standard benchmark; most papers evaluate on custom video sets.
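A warping-error metric along these lines can be computed as the masked mean difference between a frame and its flow-warped predecessor; exact details vary by paper, and the occlusion handling here is a simplification.

```python
import numpy as np

def warping_error(frame_next, warped_prev, occlusion_mask=None):
    """Mean absolute difference between frame t+1 and frame t warped
    forward by the estimated flow. Lower is more temporally consistent.

    occlusion_mask: optional (H, W) bool array, True where the flow
    correspondence is valid (non-occluded pixels).
    """
    diff = np.abs(frame_next - warped_prev)
    if occlusion_mask is not None:
        diff = diff[occlusion_mask]
    return float(diff.mean())
```

Because the metric rewards blur (a constant gray video scores perfectly), it is always reported alongside a quality metric such as FID/FVD or human ratings.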
Current Landscape
Video-to-video in 2025 is transitioning from research demos to production tools. The key insight that made it work was treating temporal consistency as a propagation problem rather than a generation problem — edit keyframes well, then propagate edits along optical flow or attention correspondences. TokenFlow and CoDeF represent the state of the art for open research. Commercial tools (Runway, Kling) are integrating video-to-video natively into their platforms, handling temporal consistency within the generation model itself rather than as post-processing. The market is driven by content creation: style transfer for social media, face transfer for virtual production, and video enhancement for legacy content.
Key Challenges
Temporal flickering — the most visible artifact; per-frame processing produces inconsistent colors, textures, and shapes that flicker at video playback speed
Preserving fine motion — small motions (facial expressions, finger movements, cloth ripples) are often lost or distorted during translation
Identity preservation — characters must look the same throughout the video while their appearance is being transformed (e.g., style transfer shouldn't change who someone is)
Speed — processing each frame through a diffusion model takes 1-5 seconds; a 30-second video at 30 FPS means 900 frame-level operations
Occlusion handling — when objects appear or disappear (walking behind a pillar, turning around), the temporal propagation breaks because correspondences are lost
Quick Recommendations
Style transfer on real video
TokenFlow or Rerender-A-Video
Best temporal consistency for artistic style transfer; TokenFlow requires no per-video training
Text-guided video editing
RAVE or Pix2Video with SDXL
Edit video content based on text instructions while preserving structure and motion
Face/expression transfer
LivePortrait or Runway Act-One
Real-time face reenactment with natural expression transfer; production-quality for virtual avatars
Video super-resolution
RealBasicVSR or RVRT
Temporal aggregation across frames produces sharper upscaling than per-frame super-resolution
Commercial pipeline
Runway Gen-3 or Wan2.1 video-to-video mode
Integrated editing with temporal attention built into the generation model — less post-processing needed
What's Next
The frontier is real-time video-to-video for live streaming and video calls (change backgrounds, apply styles, translate expressions in real-time), 4K temporal super-resolution, and compositional video editing (change one object's appearance without affecting anything else). World-model-based approaches may eventually replace explicit optical flow by understanding scene physics. The convergence with video generation (Wan2.1, Sora) suggests that future video editing will be 'describe what you want changed' rather than manual per-frame processing.
Benchmarks & SOTA
Related Tasks