Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long, physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores correlating poorly with perceived quality.
Image-text-to-video models animate a static image based on a text prompt, producing short video clips with motion, camera movement, and temporal coherence. This task is the backbone of AI filmmaking — enabling storyboard-to-video, product animation, and creative content production from a single reference frame.
History
2022: Make-A-Video (Meta) and Imagen Video (Google) demonstrate text-to-video diffusion at scale for the first time
2023: Gen-2 (Runway) launches the first commercial image-to-video product used in professional filmmaking
2023: Stable Video Diffusion (Stability AI) open-sources image-to-video generation, enabling community development
2024: Sora (OpenAI) stuns with minute-long, physically coherent videos from text and image inputs
2024: Kling (Kuaishou) and Dream Machine (Luma) ship competitive image-to-video models with public APIs
2024: CogVideoX (Tsinghua/ZhipuAI) open-sources a strong video generation model competitive with commercial options
2024: Sora launches publicly; Veo 2 (Google), Kling 2.0, and Minimax Video-01 push quality boundaries with longer, more coherent outputs
2024–2025: Wan2.1 (Alibaba) and HunyuanVideo (Tencent) open-source competitive video generation models
How Image-Text-to-Video Works
Image Conditioning
The input reference image is encoded via a VAE into latent space and used as the first-frame or structural conditioning signal. CLIP or SigLIP embeddings capture the semantic content of the image.
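The conditioning step can be illustrated with a minimal numpy sketch. Everything here is a toy stand-in: `toy_vae_encode` replaces a learned VAE with patch average-pooling, and the conditioning is modeled as channel-wise concatenation of the image latent onto each frame's noise latent — one common scheme, not the exact mechanism of any specific model.

```python
import numpy as np

def toy_vae_encode(image: np.ndarray, patch: int = 8) -> np.ndarray:
    """Stand-in for a VAE encoder: average-pool non-overlapping patches
    to map an (H, W, C) image to a spatially downsampled latent."""
    h, w, c = image.shape
    return image.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))

def build_conditioned_latents(image: np.ndarray, num_frames: int, seed: int = 0) -> np.ndarray:
    """Attach the reference-image latent to every frame's noise latent
    along the channel axis, acting as a first-frame conditioning signal."""
    rng = np.random.default_rng(seed)
    img_latent = toy_vae_encode(image)                      # (h, w, c)
    noise = rng.standard_normal((num_frames, *img_latent.shape))
    cond = np.broadcast_to(img_latent, noise.shape)         # same latent for each frame
    return np.concatenate([noise, cond], axis=-1)           # (T, h, w, 2c)

frame_latents = build_conditioned_latents(np.zeros((64, 64, 3)), num_frames=16)
print(frame_latents.shape)  # (16, 8, 8, 6)
```

In a real pipeline the denoiser sees this conditioned latent at every diffusion step, so the reference image constrains the whole clip, while a separate CLIP/SigLIP embedding supplies its semantics via cross-attention.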
Text Encoding
The motion/action prompt is encoded via a text encoder (T5, CLIP) to guide temporal dynamics. The text specifies what should happen — camera movement, subject actions, environmental changes.
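As a rough illustration of the interface a text encoder provides, the hypothetical `toy_text_encoder` below maps each token to a deterministic random vector (real encoders like T5 or CLIP are learned transformers, of course). What matters is the output shape: a per-token embedding sequence that the denoiser can attend to via cross-attention.

```python
import zlib
import numpy as np

def toy_text_encoder(prompt: str, dim: int = 64) -> np.ndarray:
    """Stand-in for T5/CLIP: map each whitespace token to a fixed
    pseudo-random vector, seeded by a stable hash of the token, yielding
    a (num_tokens, dim) context sequence for cross-attention."""
    embs = []
    for tok in prompt.lower().split():
        rng = np.random.default_rng(zlib.crc32(tok.encode()))
        embs.append(rng.standard_normal(dim))
    return np.stack(embs)

ctx = toy_text_encoder("camera pans left as the dog runs")
print(ctx.shape)  # (7, 64)
```

Because the seed comes from a stable hash rather than Python's salted `hash()`, the same token always gets the same vector across runs — a property any real encoder also has.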
Spatiotemporal Diffusion
A 3D U-Net or Video DiT (Diffusion Transformer) generates video latents by jointly modeling spatial and temporal dimensions. Temporal attention layers ensure frame-to-frame consistency while allowing natural motion.
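The key architectural ingredient — temporal attention — can be sketched in a few lines of numpy. This toy single-head version (no learned projections, no masking) shows the core idea: at each spatial location, every frame attends to every other frame, which is what lets the network keep objects consistent across time.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(latents: np.ndarray) -> np.ndarray:
    """Toy temporal attention over video latents of shape (T, N, d):
    T frames, N spatial tokens, d channels. Each spatial token attends
    across the time axis, mixing information between frames."""
    t, n, d = latents.shape
    x = latents.transpose(1, 0, 2)                    # (N, T, d): time axis per location
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)    # (N, T, T) frame-to-frame affinities
    out = softmax(scores, axis=-1) @ x                # weighted mix over frames
    return out.transpose(1, 0, 2)                     # back to (T, N, d)

x = np.random.default_rng(0).standard_normal((8, 16, 32))  # 8 frames, 16 tokens, 32 dims
y = temporal_self_attention(x)
print(y.shape)  # (8, 16, 32)
```

In a Video DiT these temporal layers are interleaved with spatial attention blocks (or fused into full 3D attention), with learned Q/K/V projections and text cross-attention omitted here for brevity.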
Frame Decoding & Interpolation
Video latents are decoded frame-by-frame through a temporal VAE decoder. Optional frame interpolation (RIFE, FILM) raises the frame rate from the model's native output (typically 8-24 fps) to a smooth 30-60 fps.
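The frame-count arithmetic behind interpolation is easy to see with a naive linear cross-fade — note that real interpolators like RIFE and FILM are learned, optical-flow-based models; this sketch only illustrates how inserting in-between frames multiplies the frame rate.

```python
import numpy as np

def linear_interpolate_frames(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """Naive stand-in for RIFE/FILM: insert (factor - 1) linearly blended
    frames between each consecutive pair. frames: (T, H, W, C).
    Output length is (T - 1) * factor + 1."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            w = k / factor
            out.append((1 - w) * a + w * b)  # blend toward the next frame
    out.append(frames[-1])
    return np.stack(out)

# A toy 8-frame clip whose brightness ramps from 0 to 1.
clip = np.linspace(0, 1, 8)[:, None, None, None] * np.ones((8, 4, 4, 3))
smooth = linear_interpolate_frames(clip, factor=3)
print(len(clip), "->", len(smooth))  # 8 -> 22
```

At factor 3 an 8 fps native output becomes roughly 24 fps; flow-based interpolators do the same count arithmetic but warp pixels along estimated motion instead of cross-fading.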
Current Landscape
Video generation in 2025 is the most rapidly advancing multimodal frontier. Proprietary models (Veo 2, Sora, Kling 2.0) can produce 10-60 second clips with impressive coherence, and the gap between 'generated' and 'real footage' is narrowing fast for certain shot types. Open-source has made dramatic strides, with Wan2.1 and HunyuanVideo approaching commercial quality. The Diffusion Transformer (DiT) architecture has won out over 3D U-Nets for video generation. Camera control, character consistency across shots, and physics simulation remain the key differentiators between models. Cost is still high — a minute of Sora video costs roughly $0.50-2.00.
Key Challenges
Temporal coherence — objects morph, disappear, or deform unnaturally across frames, especially in longer generations
Physics violations — gravity, fluid dynamics, and rigid body motion are frequently unrealistic; hands and fine motor actions remain problematic
Motion control — specifying camera trajectories and subject movements through text alone remains imprecise
Duration limitations — most models produce 4-10 second clips; extending to minutes while maintaining coherence is unsolved at quality
Compute cost — generating a single 10-second video clip can take 2-10 minutes on high-end GPUs, making iteration slow
Quick Recommendations
Best overall quality
Veo 2 (Google DeepMind)
Highest temporal coherence and physical realism; produces cinematic-quality 4K video up to 60 seconds
Best for creative/filmmaking
Sora (OpenAI)
Strong scene understanding and narrative coherence; built for creative professionals with storyboard-to-video workflows
Best for fast iteration
Kling 2.0
Good quality at fast generation speeds; competitive pricing and reliable API for production use
Open source
Wan2.1-14B (Alibaba)
Best open-weight video model; strong image-to-video capabilities with community support for fine-tuning and extensions
Open source (lightweight)
CogVideoX-5B
Runs on a single consumer GPU (24GB VRAM); good quality for prototyping and research applications
What's Next
The next milestone is consistent multi-shot video generation — maintaining character identity, scene continuity, and narrative arc across multiple generated clips. Expect real-time video generation for interactive applications, better motion control via trajectory sketches and physics engines, and integration with 3D scene representations (NeRFs, Gaussian Splatting) for view-consistent generation. The film industry will shift from 'can AI make a clip?' to 'can AI make a coherent scene?'
Benchmarks & SOTA
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF yields models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. VQAv2 (2017) defined the field, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.