
Text-to-Image Generation

Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized it as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) solved the prompt-following problem by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.

3 datasets · 0 results · Canonical metric: composite

Canonical Benchmark

DPG-Bench

Dense prompt adherence benchmark for text-to-image models

Primary metric: composite
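A "composite" metric on a dense-prompt benchmark is typically an aggregate of many per-element checks: for each object, attribute, and spatial relation in the prompt, a judge (often a VQA model) answers whether the generated image satisfies it, and the scores are averaged. A minimal sketch of that aggregation, assuming binary per-question judgments are already available; `composite_score` is a hypothetical helper, not DPG-Bench's official implementation:

```python
def composite_score(judgments):
    """Aggregate prompt-adherence score: the mean of per-question
    binary judgments (1 = the image satisfies that prompt element,
    0 = it does not). A simplified stand-in for how dense-prompt
    benchmarks combine many VQA-style checks into one number."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)

# Example: a prompt with 5 checked elements, 4 satisfied
print(composite_score([1, 1, 0, 1, 1]))  # → 0.8
```

In practice, benchmarks may also weight elements by category (object vs. attribute vs. relation) before averaging; the unweighted mean above is the simplest case.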

Top 10

Leading models on DPG-Bench.

No results yet.


All datasets

3 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.
