Text-to-Image Generation
Text-to-image generation went from interesting research to a cultural phenomenon in roughly 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized the technology as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) largely solved prompt following by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.
DPG-Bench
Dense prompt adherence benchmark for text-to-image models
Top 10
Leading models on DPG-Bench.
All datasets
3 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.