Codesota · Tasks · Image-Text-to-Video
Multimodal · image-text-to-video

Image-Text-to-Video.

Image-text-to-video is one of the hardest open problems in generative AI: animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar with minute-long, physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires an implicit world model: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, because automatic metrics such as FVD (Fréchet Video Distance) and CLIP-based temporal scores correlate poorly with perceived quality.
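To make the evaluation problem concrete, here is a minimal sketch of one common way a frame-level temporal score can be computed: the mean cosine similarity between embeddings of consecutive frames. This page does not define its CLIP-temporal score, so the formula, the function names `cosine_similarity` and `temporal_consistency`, and the toy 2-D vectors (standing in for real per-frame CLIP image embeddings) are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def temporal_consistency(frame_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between each frame embedding and the next one.

    Assumed definition for illustration; not taken from any specific benchmark.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine_similarity(prev, nxt) for prev, nxt in pairs]
    return sum(sims) / len(sims)


# Toy 2-D vectors standing in for per-frame image embeddings (hypothetical data).
static_clip = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # every frame identical
flickering_clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # content flips each frame

print(temporal_consistency(static_clip))
print(temporal_consistency(flickering_clip))
```

Under this assumed definition, a clip whose frames never change gets the maximum score even though it contains no motion at all. That is one plausible reason a score of this kind can diverge from what human raters perceive as quality.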

1 dataset · 0 results · canonical metric: composite
§ 02 · Canonical benchmark

The reference dataset.

VideoBench

Evaluates instruction-guided video generation from image+text

Primary metric: composite
§ 03 · Top 10

Leading models.

Leading models on VideoBench.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

VideoBench (canonical) · 0 results · composite
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Text · Text-to-Image Generation · Video Understanding