
Text-to-Video

Text-to-video generation is the most ambitious frontier in generative AI — synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny — the gap between demos and reliable production tools is real.
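The frame-consistency problem described above can be probed with very simple statistics. As a minimal sketch (not any benchmark's official metric), the mean absolute pixel change between consecutive frames serves as a crude temporal-stability proxy: a static clip scores zero, while per-frame flicker pushes the value up. The function name and synthetic data below are illustrative only.

```python
import numpy as np

def temporal_flicker(frames):
    """Mean absolute pixel change between consecutive frames.

    frames: array of shape (T, H, W, C) with values in [0, 1].
    Lower values indicate smoother, more temporally stable video.
    This is an illustrative proxy, not an official benchmark metric.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())

# Synthetic example: a static clip vs. one with random per-frame noise.
rng = np.random.default_rng(0)
static = np.full((8, 16, 16, 3), 0.5)          # perfectly stable clip
noisy = static + rng.uniform(-0.1, 0.1, size=static.shape)  # flickering clip

print(temporal_flicker(static))                 # 0.0
print(temporal_flicker(noisy) > temporal_flicker(static))   # True
```

Real evaluation suites measure consistency with far richer signals (feature-space similarity, optical flow, identity tracking), but the shape of the computation, aggregating a per-frame-pair score over time, is the same.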

Datasets: 2 · Results: 4 · Canonical metric: composite

Canonical Benchmark

VBench

Comprehensive text-to-video generation benchmark across 16 dimensions

Primary metric: composite
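VBench's composite score aggregates its 16 per-dimension scores into a single total. A hedged sketch of that aggregation step follows; the dimension names and uniform weights here are placeholders, not VBench's actual configuration (VBench defines its own dimensions and weighting).

```python
# Illustrative aggregation of per-dimension scores into one composite
# number, in the spirit of a VBench-style total score. Dimension names
# and weights are placeholders, not VBench's real configuration.
scores = {
    "subject_consistency": 0.96,
    "motion_smoothness": 0.98,
    "temporal_flickering": 0.97,
    "aesthetic_quality": 0.60,
}
weights = {dim: 1.0 for dim in scores}  # uniform weights for this sketch

composite = sum(weights[d] * scores[d] for d in scores) / sum(weights.values())
print(round(composite, 4))  # 0.8775
```

Weighted averaging like this is why a model can top the leaderboard while still lagging on individual dimensions: a high composite can mask a weak aesthetic or flicker score.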

Top 10

Leading models on VBench.

Rank  Model                Total score  Year  Source
1     Kling 1.0            85.4         2024  paper
2     Runway Gen-3 Alpha   85.2         2024  paper
3     CogVideoX-5B         82.8         2024  paper
4     Open-Sora 1.2        80.9         2024  paper

All datasets

2 datasets tracked for this task.


Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.