Video-to-Video
Video-to-video translation transforms existing footage, applying style transfer, temporal super-resolution, relighting, face swapping, or motion retargeting while preserving temporal coherence across frames. The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency.
It is harder than image-to-image translation because each output frame must also be coherent with its neighbors. Diffusion-based approaches such as Rerender-A-Video and TokenFlow (both 2023) showed that propagating attention features between frames addresses this well, and they now produce commercially usable results. The practical frontier is real-time processing for live video: current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.
History
2018: Vid2vid (Wang et al.) applies conditional GANs to video translation with optical-flow-based temporal discriminators, producing the first convincing results
2019: Few-shot vid2vid demonstrates that video translation can work from just a few example frames, enabling face reenactment and pose transfer
2019: First Order Motion Model (FOMM) enables face animation from a single source image plus a driving video, going viral for deepfakes and creative tools
2022: Text2LIVE edits real videos guided by text using layered neural atlases, the first convincing text-driven video edits
2023: ControlVideo and Text2Video-Zero apply ControlNet-style conditioning to video generation with cross-frame attention for temporal consistency
2023: TokenFlow (Geyer et al.) propagates edited features along inter-frame correspondences, producing temporally consistent video edits from per-frame processing
2023: Rerender-A-Video and CoDeF achieve high-quality video style transfer with explicit temporal modeling; RAVE enables zero-shot video editing
2024: Runway Act-One and LivePortrait enable real-time face/expression transfer in video; commercial video editing APIs proliferate
2024–2025: Wan2.1 and HunyuanVideo support video-to-video editing natively; consistency techniques mature, with flow-based warping integrated into diffusion
How Video-to-Video Works
Temporal Decomposition
Methods decompose the video into canonical content (what things look like) and motion (how they move). CoDeF uses a canonical 2D neural field + deformation field; flow-based methods compute optical flow between frames.
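Flow-based methods rely on warping one frame into another's coordinate system. The sketch below shows backward warping with bilinear sampling in plain NumPy; `backward_warp` is a hypothetical helper, not any particular paper's implementation, and the flow convention (target pixel plus flow gives source pixel) is one common choice among several.

```python
import numpy as np

def backward_warp(image, flow):
    """Warp a source frame into a target frame's coordinates.

    image: (H, W) grayscale frame, float.
    flow:  (H, W, 2) per-pixel (dx, dy); target pixel (x, y) samples the
           source at (x + dx, y + dy). Bilinear sampling, border-clamped.
    """
    H, W = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)

    # Integer corners and fractional weights for bilinear interpolation.
    x0 = np.floor(src_x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(src_y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = src_x - x0; wy = src_y - y0

    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bot = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

CoDeF-style canonical fields use the same operation: the deformation field is just a learned flow from each frame back to the canonical image.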
Keyframe Editing
Selected keyframes are edited using image-to-image techniques (ControlNet, SDXL img2img, style transfer). These edited keyframes serve as anchor points for the rest of the video.
Temporal Propagation
Three mechanisms spread keyframe edits across the rest of the video. TokenFlow propagates token-level diffusion features from edited keyframes to other frames along inter-frame correspondences. Optical-flow warping transfers the edit to adjacent frames using estimated flow. Cross-frame attention lets frames attend to one another during generation.
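The propagation step can be sketched as a nearest-neighbour feature copy, assuming flattened per-token diffusion features. The function name and the use of plain cosine nearest neighbours (rather than TokenFlow's exact correspondence extraction) are illustrative simplifications.

```python
import numpy as np

def propagate_features(key_feats, key_edited, target_feats):
    """Copy edited keyframe features onto a target frame.

    key_feats:    (N, D) features of the original keyframe.
    key_edited:   (N, D) features of the edited keyframe (same token layout).
    target_feats: (M, D) features of the original target frame.
    Returns (M, D): each target token receives the edited feature of its
    nearest original keyframe token under cosine similarity.
    """
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = unit(target_feats) @ unit(key_feats).T  # (M, N) cosine similarities
    nn = sim.argmax(axis=1)                       # correspondence per token
    return key_edited[nn]
```

Because correspondences are computed on the original (unedited) features, the motion of the input video is preserved while the appearance comes from the edit.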
Consistency Refinement
Post-processing removes residual flickering via temporal filtering, EbSynth-based patch matching, or diffusion-based temporal smoothing. Some methods use a video discriminator (GAN) or video quality loss.
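Temporal filtering of this kind can be sketched as an exponential moving average in which the previous output is first aligned to the current frame; `temporal_smooth` and its `warp_prev` callback are hypothetical names for illustration, and real pipelines would pass a flow-based warp plus occlusion masking.

```python
import numpy as np

def temporal_smooth(frames, warp_prev, alpha=0.7):
    """Suppress flicker by blending each edited frame with the warped
    previous output: out[t] = alpha * frames[t] + (1 - alpha) * warp(out[t-1]).

    frames:    list of (H, W) per-frame edited results (flickery).
    warp_prev: callable that aligns the previous output to the current
               frame; the identity suffices for a static scene.
    alpha:     weight on the current frame (lower = stronger smoothing).
    """
    out = [frames[0]]
    for f in frames[1:]:
        out.append(alpha * f + (1 - alpha) * warp_prev(out[-1]))
    return out
```

The trade-off is the usual one for recursive filters: a smaller `alpha` removes more flicker but ghosts fast motion, which is why production methods gate the blend with occlusion masks.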
Evaluation
Temporal consistency is measured by warping error between consecutive frames, visual quality by per-frame FID or video-level FVD, and overall quality by human evaluation (flicker, coherence, edit fidelity). There is no single standard benchmark; most papers evaluate on custom video sets.
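A warping-error metric along these lines can be computed as the masked mean difference between a frame and its flow-warped predecessor; exact details vary by paper, and the occlusion handling here is a simplification.

```python
import numpy as np

def warping_error(frame_next, warped_prev, occlusion_mask=None):
    """Mean absolute difference between frame t+1 and frame t warped
    forward by the estimated flow. Lower is more temporally consistent.

    occlusion_mask: optional (H, W) bool array, True where the flow
    correspondence is valid (non-occluded pixels).
    """
    diff = np.abs(frame_next - warped_prev)
    if occlusion_mask is not None:
        diff = diff[occlusion_mask]
    return float(diff.mean())
```

Because the metric rewards blur (a constant gray video scores perfectly), it is always reported alongside a quality metric such as FID/FVD or human ratings.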
Current Landscape
Video-to-video in 2025 is transitioning from research demos to production tools. The key insight that made it work was treating temporal consistency as a propagation problem rather than a generation problem — edit keyframes well, then propagate edits along optical flow or attention correspondences. TokenFlow and CoDeF represent the state of the art for open research. Commercial tools (Runway, Kling) are integrating video-to-video natively into their platforms, handling temporal consistency within the generation model itself rather than as post-processing. The market is driven by content creation: style transfer for social media, face transfer for virtual production, and video enhancement for legacy content.
Key Challenges
Temporal flickering — the most visible artifact; per-frame processing produces inconsistent colors, textures, and shapes that flicker at video playback speed
Preserving fine motion — small motions (facial expressions, finger movements, cloth ripples) are often lost or distorted during translation
Identity preservation — characters must look the same throughout the video while their appearance is being transformed (e.g., style transfer shouldn't change who someone is)
Speed — processing each frame through a diffusion model takes 1-5 seconds; a 30-second video at 30 FPS means 900 frame-level operations
Occlusion handling — when objects appear or disappear (walking behind a pillar, turning around), the temporal propagation breaks because correspondences are lost
Quick Recommendations
Style transfer on real video
TokenFlow or Rerender-A-Video
Best temporal consistency for artistic style transfer; TokenFlow requires no per-video training
Text-guided video editing
RAVE or Pix2Video with SDXL
Edit video content based on text instructions while preserving structure and motion
Face/expression transfer
LivePortrait or Runway Act-One
Real-time face reenactment with natural expression transfer; production-quality for virtual avatars
Video super-resolution
RealBasicVSR or RVRT
Temporal aggregation across frames produces sharper upscaling than per-frame super-resolution
Commercial pipeline
Runway Gen-3 or Wan2.1 video-to-video mode
Integrated editing with temporal attention built into the generation model — less post-processing needed
What's Next
The frontier is real-time video-to-video for live streaming and video calls (change backgrounds, apply styles, translate expressions in real-time), 4K temporal super-resolution, and compositional video editing (change one object's appearance without affecting anything else). World-model-based approaches may eventually replace explicit optical flow by understanding scene physics. The convergence with video generation (Wan2.1, Sora) suggests that future video editing will be 'describe what you want changed' rather than manual per-frame processing.
Benchmarks & SOTA
Related Tasks