
Image-Text-to-Video

Image-text-to-video is among generative AI's hardest open problems: animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion and Runway Gen-2 (both 2023) showed early promise, Sora (2024) raised the bar dramatically with minute-long, physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires an implicit world model: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-based temporal scores correlating poorly with perceived quality.


Image-text-to-video models animate a static image based on a text prompt, producing short video clips with motion, camera movement, and temporal coherence. This task is the backbone of AI filmmaking — enabling storyboard-to-video, product animation, and creative content production from a single reference frame.

History

2022

Make-A-Video (Meta) and Imagen Video (Google) demonstrate text-to-video diffusion at scale for the first time

2023

Stable Video Diffusion (Stability AI) open-sources image-to-video generation, enabling community development

2023

Gen-2 (Runway) launches the first commercial image-to-video product used in professional filmmaking

2024

Sora (OpenAI) stuns with minute-long, physically coherent videos from text and image inputs

2024

Kling (Kuaishou) and Dream Machine (Luma) ship competitive image-to-video models with public APIs

2024

CogVideoX (Tsinghua/ZhipuAI) open-sources a strong video generation model competitive with commercial options

2025

Sora launches publicly; Veo 2 (Google), Kling 2.0, and Minimax Video-01 push quality boundaries with longer, more coherent outputs

2025

Wan2.1 (Alibaba) and HunyuanVideo (Tencent) open-source competitive video generation models

How Image-Text-to-Video Works

1. Image Conditioning

The input reference image is encoded via a VAE into latent space and used as the first-frame or structural conditioning signal. CLIP or SigLIP embeddings capture the semantic content of the image.
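The first-frame encoding described above can be sketched with a toy stand-in for the VAE encoder. The 8x spatial downsampling and 4 latent channels mirror typical latent-diffusion setups; the average-pooling and random projection here are illustrative placeholders for what is, in real models, a learned convolutional encoder.

```python
import numpy as np

def encode_to_latent(image, downsample=8, latent_channels=4):
    """Toy stand-in for a VAE encoder: average-pool the image by the
    downsample factor, then project RGB to a latent channel count.
    Shapes mirror typical latent diffusion; the mapping itself is learned
    in real models, not hand-written like this."""
    h, w, c = image.shape
    lh, lw = h // downsample, w // downsample
    # Average-pool spatial blocks (crude proxy for convolutional downsampling).
    pooled = image[:lh * downsample, :lw * downsample].reshape(
        lh, downsample, lw, downsample, c).mean(axis=(1, 3))
    # Random linear projection RGB -> latent channels (a learned conv in practice).
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((c, latent_channels))
    return pooled @ proj

first_frame = np.zeros((512, 512, 3))  # hypothetical 512x512 reference image
latent = encode_to_latent(first_frame)
print(latent.shape)  # (64, 64, 4)
```

The resulting (64, 64, 4) latent is what the diffusion backbone conditions on as its first frame; the separate CLIP/SigLIP embedding supplies semantics rather than pixels.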

2. Text Encoding

The motion/action prompt is encoded via a text encoder (T5, CLIP) to guide temporal dynamics. The text specifies what should happen — camera movement, subject actions, environmental changes.
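A minimal sketch of the prompt-encoding step, assuming a whitespace tokenizer and a random embedding table in place of a pretrained T5 or CLIP text encoder (the vocabulary and dimension here are invented for illustration):

```python
import numpy as np

# Toy prompt encoder: whitespace tokenization plus a random embedding table.
# Real systems use a pretrained T5 or CLIP text encoder.
VOCAB = {"<unk>": 0, "camera": 1, "pans": 2, "left": 3, "waves": 4, "crash": 5}
EMBED_DIM = 16
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(VOCAB), EMBED_DIM))

def encode_prompt(prompt):
    """Map a motion prompt to a (seq_len, dim) conditioning sequence that the
    diffusion backbone attends to via cross-attention."""
    token_ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in prompt.lower().split()]
    return embedding_table[token_ids]

cond = encode_prompt("camera pans left")
print(cond.shape)  # (3, 16)
```

The key point is the output shape: a per-token sequence, not a single vector, so the backbone can attend to different parts of the instruction at different timesteps and spatial locations.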

3. Spatiotemporal Diffusion

A 3D U-Net or Video DiT (Diffusion Transformer) generates video latents by jointly modeling spatial and temporal dimensions. Temporal attention layers ensure frame-to-frame consistency while allowing natural motion.
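The temporal-attention idea can be shown in isolation. In a factorized video backbone, temporal layers let each spatial position attend to the same position in every other frame. This numpy sketch uses identity Q/K/V projections instead of learned weights, so it only illustrates the information flow, not a trained layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(latents):
    """Self-attention across the time axis only. latents: (T, N, d) with
    T frames, N spatial tokens, d channels. Each spatial token attends to
    itself across all frames, which is how factorized video backbones
    propagate appearance between frames. Projections are identity here;
    real layers use learned Q/K/V weights."""
    t, n, d = latents.shape
    x = latents.transpose(1, 0, 2)                   # (N, T, d): one sequence per position
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)   # (N, T, T) frame-to-frame affinity
    out = softmax(scores) @ x                        # mix information across frames
    return out.transpose(1, 0, 2)                    # back to (T, N, d)

rng = np.random.default_rng(0)
video_latents = rng.standard_normal((16, 64, 32))   # 16 frames, 64 tokens, dim 32
mixed = temporal_attention(video_latents)
print(mixed.shape)  # (16, 64, 32)
```

Interleaving layers like this with spatial attention (tokens attending within a frame) is cheaper than full 3D attention over all T*N tokens, which is why factorization dominated until large Video DiTs made joint spatiotemporal attention affordable.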

4. Frame Decoding & Interpolation

Video latents are decoded frame-by-frame through a temporal VAE decoder. Optional frame interpolation (RIFE, FILM) increases framerate from the model's native output (typically 8-24 fps) to smooth 30-60 fps.
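The interpolation step can be illustrated with the simplest possible scheme: inserting the average of each adjacent frame pair. This is a crude linear stand-in for learned interpolators like RIFE or FILM, which warp along estimated motion instead of blending, but the framerate bookkeeping is the same:

```python
import numpy as np

def interpolate_frames(frames):
    """Double the effective framerate by inserting a blend between each
    adjacent frame pair. frames: (T, H, W, C) -> (2T - 1, H, W, C).
    Linear blending is a toy proxy; RIFE/FILM estimate motion and warp."""
    midpoints = 0.5 * (frames[:-1] + frames[1:])
    out = np.empty((2 * len(frames) - 1,) + frames.shape[1:], dtype=frames.dtype)
    out[0::2] = frames       # original frames at even indices
    out[1::2] = midpoints    # blended frames in between
    return out

clip = np.random.default_rng(0).random((24, 64, 64, 3))  # e.g. 3 s at 8 fps
smooth = interpolate_frames(clip)
print(len(smooth))  # 47 frames over the same duration, roughly 16 fps
```

Applying the doubling once takes a native 8 fps output to ~16 fps; applying it again reaches ~30 fps, which is why interpolation is usually run as a cheap post-process rather than generating every frame with the diffusion model.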

Current Landscape

Video generation in 2025 is the most rapidly advancing multimodal frontier. Proprietary models (Veo 2, Sora, Kling 2.0) can produce 10-60 second clips with impressive coherence, and the gap between generated and real footage is narrowing fast for certain shot types. Open-source has made dramatic strides, with Wan2.1 and HunyuanVideo approaching commercial quality. The Diffusion Transformer (DiT) architecture has won out over 3D U-Nets for video generation. Camera control, character consistency across shots, and physics simulation remain the key differentiators between models. Cost is still high: a minute of Sora video runs roughly $0.50-2.00.

Key Challenges

Temporal coherence — objects morph, disappear, or deform unnaturally across frames, especially in longer generations

Physics violations — gravity, fluid dynamics, and rigid body motion are frequently unrealistic; hands and fine motor actions remain problematic

Motion control — specifying camera trajectories and subject movements through text alone is imprecise

Duration limitations — most models produce 4-10 second clips; extending to minutes while maintaining coherence remains unsolved at acceptable quality

Compute cost — generating a single 10-second video clip can take 2-10 minutes on high-end GPUs, making iteration slow

Quick Recommendations

Best overall quality

Veo 2 (Google DeepMind)

Highest temporal coherence and physical realism; produces cinematic quality 4K video up to 60 seconds

Best for creative/filmmaking

Sora (OpenAI)

Strong scene understanding and narrative coherence; built for creative professionals with storyboard-to-video workflows

Best for fast iteration

Kling 2.0

Good quality at fast generation speeds; competitive pricing and reliable API for production use

Open source

Wan2.1-14B (Alibaba)

Best open-weight video model; strong image-to-video capabilities with community support for fine-tuning and extensions

Open source (lightweight)

CogVideoX-5B

Runs on a single consumer GPU (24GB VRAM); good quality for prototyping and research applications

What's Next

The next milestone is consistent multi-shot video generation — maintaining character identity, scene continuity, and narrative arc across multiple generated clips. Expect real-time video generation for interactive applications, better motion control via trajectory sketches and physics engines, and integration with 3D scene representations (NeRFs, Gaussian Splatting) for view-consistent generation. The film industry will shift from 'can AI make a clip?' to 'can AI make a coherent scene?'

Benchmarks & SOTA

Related Tasks

Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models that refuse to hallucinate about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.
