
Text-to-Video

Text-to-video generation is the most ambitious frontier in generative AI — synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny — the gap between demos and reliable production tools is real.


Text-to-video generation creates video clips from natural language descriptions. It's the holy grail of generative AI — combining language understanding, visual generation, physics simulation, and temporal consistency. Sora's demo in February 2024 was the field's 'ChatGPT moment,' and by 2025 both commercial (Kling, Runway) and open-source (Wan2.1, HunyuanVideo) models produce usable 10-30 second clips.

History

2022: Make-A-Video (Meta) and Imagen Video (Google) publish the first compelling text-to-video results using cascaded diffusion models

2023: Runway Gen-1 and Gen-2 launch as the first commercial AI video-generation products; quality is limited but demonstrates the market

2023: AnimateDiff shows temporal motion modules can be added to any Stable Diffusion checkpoint, democratizing video generation

2024: OpenAI's Sora demo (February) shows minute-long coherent video from text, a paradigm shift in perceived capabilities

2024: Kling (Kuaishou) launches publicly with 5-second clips rivaling Sora's demo quality; Chinese labs emerge as leaders

2024: Runway Gen-3 Alpha, Luma Dream Machine, and Pika compete commercially; quality improves monthly

2024: CogVideoX (Tsinghua/Zhipu) open-sources a competitive text-to-video model with 5B parameters

2024-2025: HunyuanVideo (Tencent, late 2024) and Wan2.1 (Alibaba) release open-weights models rivaling commercial quality; community fine-tuning explodes

2025: Sora, launched commercially in late 2024, faces quality criticisms; the market fragments across multiple competitive offerings

How Text-to-Video Works

Text-to-Video Pipeline

1. Text Encoding

The text prompt is encoded by a language model (T5-XXL, CLIP, or a custom LLM). Larger text encoders generally produce better prompt following; in the image domain, FLUX and SD3 pair CLIP with T5 to capture both coarse semantics and fine-grained detail, and many video models adopt similar encoder choices.
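
As a concrete illustration, here is a minimal sketch of this step using Hugging Face transformers with a T5 encoder. The checkpoint name and the 226-token budget are illustrative choices, not any particular video model's configuration:

```python
# Minimal sketch of prompt encoding with a T5 encoder (Hugging Face
# transformers). Real pipelines often pair T5 with a CLIP text encoder.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
text_encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl", torch_dtype=torch.float16
)

prompt = "A corgi surfing a wave at sunset, cinematic lighting"
tokens = tokenizer(
    prompt, padding="max_length", max_length=226,  # illustrative budget
    truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # (1, seq_len, hidden_dim) embeddings, consumed later via cross-attention
    prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
```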

2. Noise Initialization

A 3D noise tensor (frames × height × width × channels) is sampled in VAE latent space. Some models initialize from a reference frame to improve consistency; others generate all frames from noise.
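
A sketch of what that initialization looks like, assuming a video VAE with 8x spatial and 4x temporal compression (common but not universal choices):

```python
# Illustrative 3D latent-noise initialization. Shapes are assumptions:
# a VAE with 8x spatial and 4x temporal compression turns a 49-frame
# 480x720 video into a 13 x 60 x 90 latent grid with 16 channels.
import torch

batch, channels = 1, 16
frames, height, width = 49, 480, 720
t_down, s_down = 4, 8  # temporal / spatial compression factors

latents = torch.randn(
    batch,
    channels,
    1 + (frames - 1) // t_down,  # 13 latent frames
    height // s_down,            # 60
    width // s_down,             # 90
    dtype=torch.float16,
)
```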

3. Spatiotemporal Denoising

A 3D U-Net or DiT with temporal attention layers iteratively denoises the latent video. Each step refines both spatial content and temporal coherence. Cross-attention to text embeddings guides the content. DiT architectures (used by Sora, Wan2.1) scale better than U-Nets.
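
Schematically, the denoising loop looks like the sketch below. Here `transformer` stands in for a DiT with spatial and temporal attention, and `scheduler` for a diffusers-style noise scheduler (e.g., DDIMScheduler); the exact call signature is an assumption, not any specific model's API:

```python
# Schematic reverse-diffusion loop over a latent video tensor.
import torch

def denoise(transformer, scheduler, latents, prompt_embeds, num_steps=50):
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        with torch.no_grad():
            # The DiT predicts the noise in the current latent video,
            # conditioned on the text embeddings via cross-attention.
            noise_pred = transformer(
                latents, timestep=t, encoder_hidden_states=prompt_embeds
            )
        # One scheduler step removes a slice of the predicted noise.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```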

4. Temporal Upsampling

Many models first generate a low-framerate video (e.g., 8 FPS) then use a frame interpolation model to reach 24-30 FPS, reducing computational cost while maintaining smoothness.
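
The sketch below doubles the frame count with naive linear blending. Real interpolators such as RIFE or FILM predict motion rather than cross-fading, so treat this purely as a shape-level illustration of the step:

```python
# Naive stand-in for a learned frame interpolator: doubles the frame
# rate by linearly blending neighboring frames.
import torch

def double_fps(video: torch.Tensor) -> torch.Tensor:
    """video: (frames, C, H, W) -> (2*frames - 1, C, H, W)."""
    mid = 0.5 * (video[:-1] + video[1:])  # blended in-between frames
    out = torch.empty(
        2 * video.shape[0] - 1, *video.shape[1:], dtype=video.dtype
    )
    out[0::2] = video  # original frames at even indices
    out[1::2] = mid    # interpolated frames at odd indices
    return out

# 8 FPS footage applied twice approaches 30 FPS density:
clip = torch.rand(16, 3, 480, 720)
smooth = double_fps(double_fps(clip))  # 61 frames
```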

5. VAE Decoding + Super-Resolution

A temporal VAE decoder converts latent frames to pixel space. Optional super-resolution (diffusion or GAN-based) upscales to the final resolution (720p, 1080p). The decode step can introduce flickering if the VAE isn't temporally aware.
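
Decoding is commonly chunked along the time axis to bound memory. In this minimal sketch, `vae_decode` is a hypothetical stand-in for a temporal VAE's decode call, not a real library function:

```python
# Decode latent frames in fixed-size temporal chunks to limit peak memory.
import torch

def decode_in_chunks(vae_decode, latents: torch.Tensor, chunk: int = 4):
    """latents: (1, C, T, h, w) latent video -> (1, 3, T*, H, W) pixels."""
    pieces = []
    for start in range(0, latents.shape[2], chunk):
        with torch.no_grad():
            pieces.append(vae_decode(latents[:, :, start:start + chunk]))
    # Concatenate decoded chunks back along the time dimension.
    return torch.cat(pieces, dim=2)
```

Temporally aware decoders also carry state (or overlap) across chunk boundaries to avoid the flickering mentioned above; this sketch omits that for brevity.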

Current Landscape

Text-to-video generation in 2025 is experiencing its Cambrian explosion. At least ten competitive models are available commercially or as open weights, with quality improving monthly. Architectures have converged on diffusion transformers (DiT) with temporal attention, trained on hundreds of millions of video clips. Chinese labs (Kuaishou's Kling, Alibaba's Wan, Tencent's HunyuanVideo) have emerged as co-leaders alongside Western companies (Runway, OpenAI, Google). The quality gap between commercial and open-source models has narrowed dramatically with the Wan2.1 and HunyuanVideo releases. However, all models still fail on physics, complex multi-character scenes, and durations beyond 30 seconds.

Key Challenges

Physics and causality — models generate visually plausible but physically incorrect interactions (impossible reflections, objects phasing through each other, liquid defying gravity)

Temporal consistency — characters changing appearance, objects disappearing between cuts, and background morphing remain the most visible failure modes

Duration — coherent generation beyond 15-30 seconds requires hierarchical approaches (keyframe → interpolation) that compound errors

Prompt following — complex multi-sentence prompts with spatial relationships and sequential events are poorly handled; models often ignore clauses

Cost — generating a 10-second 720p video costs $0.10-1.00 in compute and takes 60-300 seconds, making iteration expensive and real-time generation impossible

Quick Recommendations

Best open-source quality

Wan2.1-14B or HunyuanVideo-13B

Closest to commercial quality in open weights; community LoRA fine-tuning enables style control

Best commercial quality

Kling 1.6 Pro or Runway Gen-3 Alpha Turbo

Best temporal consistency and prompt following in production APIs; Kling supports camera control

Fast iteration / prototyping

Runway Gen-3 Turbo or Pika 2.0

5-15 second generation time; good enough for storyboarding and concept exploration

Controllable generation

CogVideoX-5B + ControlNet

Open model with community-built control adapters for camera, motion, and structural guidance; a minimal usage sketch follows this list

Long-form (30s+)

Kling with extend feature or Sora

Best available for multi-shot generation; still requires manual curation between segments
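
To make the recommendations concrete, here is a hedged example of running the open CogVideoX-5B model through diffusers' CogVideoXPipeline. Exact parameter defaults vary by library version, so check the model card before relying on these values:

```python
# Text-to-video with the open CogVideoX-5B checkpoint via diffusers.
# Requires a GPU with enough VRAM (or enable model offloading).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A panda playing guitar by a campfire, golden hour",
    num_frames=49,            # ~6 seconds at the model's native 8 FPS
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "panda.mp4", fps=8)
```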

What's Next

2025-2026 will focus on: (1) world models that actually understand physics rather than just mimicking visual patterns, (2) character consistency — the same person across an entire video or multiple clips, (3) real-time generation for interactive applications, (4) controllable camera and motion via explicit 3D representations, and (5) audio-visual joint generation (synchronized speech, sound effects). The endgame is not just video generation but controllable world simulation.
