Text-to-Video
Text-to-video generation is the most ambitious frontier in generative AI — synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny — the gap between demos and reliable production tools is real.
Text-to-video generation creates video clips from natural language descriptions. It's the holy grail of generative AI — combining language understanding, visual generation, physics simulation, and temporal consistency. Sora's demo in February 2024 was the field's 'ChatGPT moment,' and by 2025 both commercial (Kling, Runway) and open-source (Wan2.1, HunyuanVideo) models produce usable 10-30 second clips.
History
2022: Make-A-Video (Meta) and Imagen Video (Google) publish the first compelling text-to-video results using cascaded diffusion models
2023: Runway Gen-1 and Gen-2 become the first commercial video-generation products; quality is limited, but a market is clearly demonstrated
2023: AnimateDiff shows that temporal motion modules can be added to any Stable Diffusion checkpoint, democratizing video generation
2024: OpenAI's Sora demo (February) shows minute-long coherent video from text — a paradigm shift in perceived capabilities
2024: Kling (Kuaishou) launches publicly with 5-second clips rivaling Sora's demo quality; Chinese labs emerge as leaders
2024: Runway Gen-3 Alpha, Luma Dream Machine, and Pika 1.0 compete commercially; quality improves monthly
2024: CogVideoX (Tsinghua/Zhipu) open-sources a competitive text-to-video model with 5B parameters
2024-2025: Wan2.1 (Alibaba) and HunyuanVideo (Tencent) release open-weights models rivaling commercial quality; community fine-tuning explodes
2024-2025: Sora launches commercially (December 2024) but faces quality criticisms; the market fragments across multiple competitive offerings
How Text-to-Video Works
Text Encoding
The text prompt is encoded by a language model (T5-XXL, CLIP, or a custom LLM). Larger text encoders generally yield better prompt following; FLUX and SD3, for example, use dual encoders that pair CLIP's pooled global-semantics embedding with T5's fine-grained, token-level sequence.
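A shapes-only sketch of what the denoiser receives from a dual encoder. The dimensions (768 for CLIP's pooled vector, 4096 for T5-XXL token embeddings) match common checkpoints, but the random values merely stand in for real encoder forward passes:

```python
import numpy as np

def encode_prompt(prompt_tokens, clip_dim=768, t5_dim=4096, seed=0):
    """Toy stand-in for a dual text encoder: correct shapes, random values.

    A real pipeline runs CLIP and T5 forward passes; this only shows the
    two tensors the denoiser consumes (conditioning + cross-attention).
    """
    rng = np.random.default_rng(seed)
    clip_pooled = rng.standard_normal(clip_dim)                 # one vector: global semantics
    t5_sequence = rng.standard_normal((prompt_tokens, t5_dim))  # one row per token: word-level detail
    return clip_pooled, t5_sequence

pooled, seq = encode_prompt(prompt_tokens=77)
print(pooled.shape, seq.shape)  # (768,) (77, 4096)
```

The pooled vector typically modulates normalization layers, while the token sequence feeds cross-attention, which is why both granularities help.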
Noise Initialization
A 3D noise tensor (frames × height × width × channels) is sampled in VAE latent space. Some models initialize from a reference frame to improve consistency; others generate all frames from noise.
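A minimal sketch of the latent shape arithmetic, assuming illustrative compression factors (8x spatial, 4x temporal with a causally-kept first frame, 16 latent channels; these conventions resemble recent open video VAEs but are not any specific model's exact numbers):

```python
import numpy as np

def init_video_latents(frames=49, height=480, width=720,
                       spatial_down=8, temporal_down=4, channels=16, seed=0):
    """Sample the 3D Gaussian latent a video diffusion model denoises.

    Causal video VAEs often keep the first frame uncompressed in time,
    hence the 1 + (F - 1) / temporal_down frame count.
    """
    f = 1 + (frames - 1) // temporal_down
    h, w = height // spatial_down, width // spatial_down
    rng = np.random.default_rng(seed)
    return rng.standard_normal((f, h, w, channels))

latents = init_video_latents()
print(latents.shape)  # (13, 60, 90, 16)
```

Initializing from a reference frame, as some models do, would replace part of this pure-noise tensor with the encoded frame's latent.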
Spatiotemporal Denoising
A 3D U-Net or DiT with temporal attention layers iteratively denoises the latent video. Each step refines both spatial content and temporal coherence. Cross-attention to text embeddings guides the content. DiT architectures (used by Sora, Wan2.1) scale better than U-Nets.
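One common way to implement the temporal attention described above is to factor full 3D attention into alternating spatial and temporal passes. This numpy sketch uses identity Q/K/V projections (no learned weights) purely to show the axis bookkeeping:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over axis 1 of a (batch, seq, dim) array.
    Q = K = V = x here; a real layer would apply learned projections."""
    d = x.shape[-1]
    scores = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(d))
    return scores @ x

def factored_spatiotemporal(x):
    """x: (frames, height*width, dim) latent tokens.

    Spatial attention mixes tokens within each frame; temporal attention
    then mixes the same spatial location across frames. This is a cheap
    approximation to full 3D attention over all frames * pixels tokens.
    """
    x = attend(x)                # spatial: sequence axis = pixels per frame
    x = x.transpose(1, 0, 2)     # -> (pixels, frames, dim)
    x = attend(x)                # temporal: sequence axis = frames
    return x.transpose(1, 0, 2)  # back to (frames, pixels, dim)

out = factored_spatiotemporal(np.random.default_rng(1).standard_normal((8, 16, 32)))
print(out.shape)  # (8, 16, 32)
```

Factored attention costs O(F·S² + S·F²) instead of O((F·S)²), which is why early video models preferred it; reported full-3D-attention designs scale better in quality but pay the quadratic cost.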
Temporal Upsampling
Many models first generate a low-framerate video (e.g., 8 FPS) then use a frame interpolation model to reach 24-30 FPS, reducing computational cost while maintaining smoothness.
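The frame-count bookkeeping can be seen with a naive linear cross-fade; production interpolators are learned (optical-flow- or diffusion-based), so this is a baseline sketch rather than the actual method:

```python
import numpy as np

def linear_interpolate(frames, factor=3):
    """Temporally upsample by linearly blending adjacent frames.

    F frames become (F - 1) * factor + 1 frames, e.g. an 8 FPS clip
    interpolated 3x plays back at ~24 FPS over the same duration.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for i in range(factor):
            t = i / factor
            out.append((1 - t) * a + t * b)  # t=0 reproduces frame a exactly
    out.append(frames[-1])
    return np.stack(out)

low_fps = np.random.default_rng(2).standard_normal((9, 4, 4, 3))  # 9 frames @ 8 FPS
high_fps = linear_interpolate(low_fps, factor=3)
print(high_fps.shape)  # (25, 4, 4, 3)
```

Linear blending produces ghosting on fast motion, which is exactly why learned interpolators that estimate motion are used instead.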
VAE Decoding + Super-Resolution
A temporal VAE decoder converts latent frames to pixel space. Optional super-resolution (diffusion or GAN-based) upscales to the final resolution (720p, 1080p). The decode step can introduce flickering if the VAE isn't temporally aware.
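The flickering failure mode can be quantified crudely as the mean frame-to-frame pixel difference; `flicker_score` is a hypothetical helper for illustration, not a standard metric:

```python
import numpy as np

def flicker_score(video):
    """Mean absolute difference between consecutive frames.

    A temporally aware VAE decoder should score lower on static content
    than one that decodes each latent frame independently.
    """
    return float(np.abs(np.diff(video, axis=0)).mean())

steady = np.ones((8, 16, 16, 3))  # perfectly static clip: zero flicker
noisy = steady + np.random.default_rng(3).normal(0, 0.1, steady.shape)
print(flicker_score(steady), flicker_score(noisy) > flicker_score(steady))
```

On real generations this proxy conflates legitimate motion with flicker, so practical evaluations compare warped (motion-compensated) frames instead.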
Current Landscape
Text-to-video generation in 2025 is experiencing its Cambrian explosion. At least ten competitive models are available commercially and in open source, with quality improving monthly. Architectures have converged on diffusion transformers (DiT) with temporal attention, trained on hundreds of millions of video clips. Chinese labs (Kuaishou/Kling, Alibaba/Wan, Tencent/HunyuanVideo) have emerged as co-leaders alongside Western companies (Runway, OpenAI, Google). The quality gap between commercial and open-source models has narrowed dramatically with the Wan2.1 and HunyuanVideo releases. However, all models still fail on physics, complex multi-character scenes, and durations beyond 30 seconds.
Key Challenges
Physics and causality — models generate visually plausible but physically incorrect interactions (impossible reflections, objects phasing through each other, liquid defying gravity)
Temporal consistency — characters changing appearance, objects disappearing between cuts, and background morphing remain the most visible failure modes
Duration — coherent generation beyond 15-30 seconds requires hierarchical approaches (keyframe → interpolation) that compound errors
Prompt following — complex multi-sentence prompts with spatial relationships and sequential events are poorly handled; models often ignore clauses
Cost — generating a 10-second 720p video costs $0.10-1.00 in compute and takes 60-300 seconds, making iteration expensive and real-time generation impossible
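To make the cost point concrete, a back-of-envelope helper using the per-clip ranges quoted above (the function and budget figures are illustrative only):

```python
def iteration_budget(budget_usd, cost_per_clip, secs_per_clip):
    """How many candidate clips a budget buys, and the serial wall-clock wait.

    Returns (clip_count, total_minutes). The tiny epsilon guards against
    float rounding (10.0 / 0.10 is fractionally below 100 in binary).
    """
    n = int(budget_usd / cost_per_clip + 1e-9)
    return n, n * secs_per_clip / 60

# Optimistic vs pessimistic ends of the $0.10-1.00 / 60-300 s ranges:
print(iteration_budget(10.0, 0.10, 60))   # (100, 100.0)
print(iteration_budget(10.0, 1.00, 300))  # (10, 50.0)
```

Even at the cheap end, exploring a hundred variations of one shot means over an hour of serial waiting, which is why fast "turbo" tiers exist for iteration despite lower quality.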
Quick Recommendations
Best open-source quality: Wan2.1-14B or HunyuanVideo-13B. Closest to commercial quality in open weights; community LoRA fine-tuning enables style control.
Best commercial quality: Kling 1.6 Pro or Runway Gen-3 Alpha Turbo. Best temporal consistency and prompt following in production APIs; Kling supports camera control.
Fast iteration / prototyping: Runway Gen-3 Turbo or Pika 2.0. 5-15 second generation time; good enough for storyboarding and concept exploration.
Controllable generation: CogVideoX-5B + ControlNet. Open model with community-built control adapters for camera, motion, and structural guidance.
Long-form (30s+): Kling with extend feature or Sora. Best available for multi-shot generation; still requires manual curation between segments.
What's Next
2025-2026 will focus on: (1) world models that actually understand physics rather than just mimicking visual patterns, (2) character consistency — the same person across an entire video or multiple clips, (3) real-time generation for interactive applications, (4) controllable camera and motion via explicit 3D representations, and (5) audio-visual joint generation (synchronized speech, sound effects). The endgame is not just video generation but controllable world simulation.
Benchmarks & SOTA
Related Tasks