Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text instruction and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot on real images, and subsequent systems such as DALL-E's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires localizing the edit while leaving unrelated regions untouched — something current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
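The balance between following the instruction and preserving the source image is made explicit in the InstructPix2Pix paper's two-scale classifier-free guidance, which mixes three noise predictions: unconditional, image-conditioned, and fully conditioned (image plus instruction). A minimal sketch of that combination is below; the function name and default scales are illustrative, not part of any library API.

```python
import numpy as np

def combine_guidance(e_uncond, e_image, e_full, s_image=1.5, s_text=7.5):
    """Two-scale classifier-free guidance (InstructPix2Pix-style).

    e_uncond : noise prediction with no conditioning
    e_image  : noise prediction conditioned on the source image only
    e_full   : noise prediction conditioned on image + text instruction
    s_image  : higher values keep the output closer to the source image
    s_text   : higher values push the output toward the edit instruction
    """
    return (e_uncond
            + s_image * (e_image - e_uncond)
            + s_text * (e_full - e_image))

# Toy example with scalar "predictions" to show the arithmetic:
print(combine_guidance(0.0, 1.0, 2.0))  # 0 + 1.5*1 + 7.5*1 = 9.0
```

Raising `s_image` relative to `s_text` is what lets a user trade instruction strength against identity preservation at inference time, without retraining.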
InstructPix2Pix
Instruction-guided image editing benchmark
Top 10
Leading models on InstructPix2Pix.
No results reported yet.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.