Codesota · Tasks · Image-Text-to-Image
Multimodal · image-text-to-image

Image-Text-to-Image.

Image-text-to-image covers instruction-guided image editing: taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated that this could work zero-shot, and tools such as DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

2 datasets · 0 results · Canonical metric: clip-score
§ 02 · Canonical benchmark

The reference dataset.

InstructPix2Pix

Instruction-guided image editing benchmark

Primary metric: clip-score
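This page does not spell out how clip-score is computed, so the following is only a minimal sketch under stated assumptions. It treats the score as a scaled, zero-clipped cosine similarity between an image embedding and a text embedding, and adds a directional variant that is sometimes reported for editing tasks (how well the change in the image tracks the change in the caption). The embeddings are assumed to come from a CLIP encoder run elsewhere; the function names, the `scale=100.0` default, and the toy vectors are illustrative choices, not the leaderboard's definition.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(image_emb: Sequence[float],
               text_emb: Sequence[float],
               scale: float = 100.0) -> float:
    """Scaled cosine similarity, clipped at zero.

    The scale factor is an assumption; other conventions use a
    different constant.
    """
    return scale * max(cosine(image_emb, text_emb), 0.0)


def directional_similarity(src_img: Sequence[float],
                           edit_img: Sequence[float],
                           src_txt: Sequence[float],
                           edit_txt: Sequence[float]) -> float:
    """Cosine between the image-space change and the caption-space change."""
    d_img = [e - s for s, e in zip(src_img, edit_img)]
    d_txt = [e - s for s, e in zip(src_txt, edit_txt)]
    return cosine(d_img, d_txt)


if __name__ == "__main__":
    # Toy 3-d vectors standing in for real CLIP embeddings (hypothetical).
    src_img, edit_img = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
    src_txt, edit_txt = [0.5, 0.0, 0.0], [0.5, 2.0, 0.0]
    print(clip_score(edit_img, edit_txt))
    print(directional_similarity(src_img, edit_img, src_txt, edit_txt))
```

A plain image-text score rewards an output that matches the target caption even if the edit also changed the face or background, which is why editing evaluations often pair it with a directional or source-similarity measure.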
§ 03 · Top 10

Leading models.

Leading models on InstructPix2Pix.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

InstructPix2Pix (canonical) · 0 results · clip-score
MagicBrush · 0 results · clip-score
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Text · Image-Text-to-Video · Text-to-Image Generation · Video Understanding