Codesota · Tasks · Text-to-Video
Computer Vision · text-to-video

Text-to-Video.

Text-to-video generation synthesizes temporally coherent, physically plausible video from a text prompt alone, which makes it one of the most demanding problems in generative AI. The field accelerated sharply in 2024, when Sora demonstrated cinematic-quality generation and open models such as CogVideoX and Mochi made the technology more accessible. The core technical challenge is consistency across frames: characters should not morph, physics should hold, and camera motion should feel intentional. Quality is improving quickly, but generating a clip still takes minutes and artifacts remain visible under scrutiny, so a real gap separates demos from reliable production tools.
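To make the frame-consistency challenge concrete, here is a minimal sketch of one way it could be quantified: the mean cosine similarity between feature vectors of adjacent frames. This is an illustration only, not the method used by any benchmark listed on this page. The function names are hypothetical, and the feature vectors are assumed to come from some image encoder that is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat an all-zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_features):
    """Mean cosine similarity between each pair of adjacent frames.

    frame_features: list of per-frame feature vectors (hypothetical input,
    assumed to be produced by an image encoder outside this sketch).
    """
    if len(frame_features) < 2:
        raise ValueError("need at least two frames")
    sims = [
        cosine_similarity(prev, curr)
        for prev, curr in zip(frame_features, frame_features[1:])
    ]
    return sum(sims) / len(sims)
```

A score near 1 suggests that adjacent frames look alike in feature space. That is necessary but not sufficient for good video, since a completely static clip would also score highly.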

Datasets: 2 · Results: 0 · Canonical metric: composite
§ 02 · Canonical benchmark

The reference dataset.

VBench

Comprehensive text-to-video generation benchmark across 16 dimensions

Primary metric: composite
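As a rough illustration of what a composite metric over several dimensions can look like, the sketch below computes a weighted mean of per-dimension scores. This is an assumption made for illustration: the page does not specify how the composite is aggregated, and the benchmark's actual weighting and normalization may differ. The function name and the dimension names in the usage example are hypothetical.

```python
def composite_score(dimension_scores, weights=None):
    """Weighted mean of per-dimension scores.

    dimension_scores: dict mapping a dimension name to a score in [0, 1].
    weights: optional dict mapping the same names to non-negative weights;
             defaults to equal weighting (an assumption, not the benchmark's rule).
    """
    if not dimension_scores:
        raise ValueError("no dimension scores given")
    if weights is None:
        weights = {name: 1.0 for name in dimension_scores}
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        score * weights[name] for name, score in dimension_scores.items()
    )
    return weighted_sum / total_weight


# Usage with illustrative dimension names and made-up scores.
example = {"consistency": 0.8, "motion": 0.6}
equal_weighted = composite_score(example)
quality_heavy = composite_score(example, {"consistency": 3.0, "motion": 1.0})
```

Collapsing many dimensions into one number makes ranking easy but hides trade-offs, so per-dimension scores are worth reporting alongside any composite.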
§ 03 · Top 10

Leading models.

Leading models on VBench.

No results yet. Be the first to contribute.


§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

VBench (canonical): 0 results · composite
EvalCrafter: 0 results · composite
§ 05 · Related tasks

Other tasks in Computer Vision.

Document Image Classification · Document Layout Analysis · Document Parsing · Document Understanding · General OCR Capabilities · Handwriting Recognition · Image Feature Extraction · Image-to-3D