
Image-to-Image

Image-to-image translation covers a vast family of tasks — super-resolution, style transfer, inpainting, colorization, denoising — unified by the idea of learning a mapping between image domains. Pix2Pix (2017) and CycleGAN (2017) showed that paired and unpaired translation were both learnable, but diffusion models rewrote the playbook entirely. ControlNet (2023) demonstrated that conditioning Stable Diffusion on edges, depth, or poses gives surgical control over generation, while models like SUPIR push restoration quality beyond what was thought possible. Image-to-image is the Swiss army knife of visual AI — nearly every creative and restoration workflow runs through some form of it.


Image-to-image translation transforms an input image into a modified output — style transfer, super-resolution, inpainting, colorization, and domain translation all fall here. The field was defined by Pix2Pix (2017) and CycleGAN (2017), but diffusion models (ControlNet, IP-Adapter) have made it the most practically useful generative task, powering photo editing, design tools, and content creation pipelines.

History

2015

Neural Style Transfer (Gatys et al.) shows CNNs can separate and recombine content and style, sparking public fascination with AI art

2016

Perceptual losses (Johnson et al.) enable real-time style transfer by training feed-forward networks against VGG features

2017

Pix2Pix (Isola et al.) introduces conditional GANs for paired image-to-image translation (edges→photos, day→night)

2017

CycleGAN enables unpaired translation via cycle-consistency loss — horse↔zebra, summer↔winter without aligned training pairs

2018

ESRGAN pushes super-resolution into practical territory with perceptually realistic 4× upscaling

2019

SPADE/GauGAN converts semantic layouts to photorealistic images, demonstrating label-map-to-photo translation

2022

Stable Diffusion enables text-guided image editing via img2img pipeline (add noise, re-denoise with new prompt)

2023

ControlNet (Zhang et al.) adds spatial conditioning (edges, depth, pose) to diffusion models, enabling precise structural control

2023

IP-Adapter enables image-prompted generation — use a reference image (face, style, object) instead of just text; InstantID follows in early 2024 with identity-preserving face generation

2024

FLUX and SD3 improve structural coherence; IC-Light and related relighting models show diffusion can specifically modify scene lighting

How Image-to-Image Works

1. Input Conditioning

The source image is encoded into a conditioning signal: for diffusion models, this can be a latent encoding (img2img), spatial control map (ControlNet), or image embedding (IP-Adapter). For GANs, the encoder directly processes the image.
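As a toy illustration of building a spatial control map, the sketch below computes a Sobel gradient-magnitude edge image in plain numpy — a simplified stand-in for the Canny edge maps ControlNet is commonly conditioned on. The function name `sobel_edge_map` is illustrative, not from any library.

```python
import numpy as np

def sobel_edge_map(gray: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map from a grayscale image (HxW),
    a simplified stand-in for a Canny control map."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T  # vertical-gradient kernel
    padded = np.pad(gray.astype(np.float32), 1, mode="edge")
    h, w = gray.shape
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    # Correlate with the 3x3 kernels by summing shifted views.
    for i in range(3):
        for j in range(3):
            patch = padded[i:i + h, j:j + w]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    mag = np.sqrt(gx**2 + gy**2)
    if mag.max() == 0:
        return mag.astype(np.uint8)
    return (255 * mag / mag.max()).astype(np.uint8)
```

A vertical step edge in the input produces a bright column at the boundary and zeros elsewhere — exactly the kind of sparse structural signal that gets injected as conditioning.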

2. Transformation Model

Diffusion-based: a U-Net or DiT denoises a noisy latent conditioned on both the input image and text prompt. GAN-based: a generator (typically U-Net or ResNet architecture) directly maps input to output. The model learns which aspects to preserve and which to transform.
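The denoising update described above can be written down concretely. Below is a minimal numpy sketch of one deterministic DDIM step, given the model's noise prediction and the cumulative noise-schedule terms ᾱ_t; `ddim_step` is a hypothetical helper, and in a real pipeline `eps_hat` would come from the conditioned U-Net or DiT.

```python
import numpy as np

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """One deterministic DDIM update.
    x_t:      current noisy latent
    eps_hat:  model's noise prediction at this step
    abar_t:   cumulative signal retention (alpha-bar) at step t
    abar_prev: alpha-bar at the previous (less noisy) step
    Returns (x_prev, x0_hat)."""
    # Estimate the clean latent implied by the noise prediction.
    x0_hat = (x_t - np.sqrt(1 - abar_t) * eps_hat) / np.sqrt(abar_t)
    # Re-noise the estimate to the previous step's noise level.
    x_prev = np.sqrt(abar_prev) * x0_hat + np.sqrt(1 - abar_prev) * eps_hat
    return x_prev, x0_hat
```

A sanity check on the algebra: if `eps_hat` equals the exact noise used to construct `x_t` from a clean latent, `x0_hat` recovers that latent exactly.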

3. Controllability Mechanisms

ControlNet injects spatial structure (edges, depth, segmentation) via a trainable copy of the encoder blocks. Strength/denoising parameters control how much the output can deviate from the input.
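The strength knob can be sketched as a mapping from a [0, 1] value to how many denoising steps actually run, mirroring the common convention in latent-diffusion img2img pipelines (the helper name is ours, not from any library):

```python
def img2img_schedule(num_inference_steps: int, strength: float):
    """Map the img2img `strength` knob to the denoising schedule.
    strength=1.0: start from full noise, ignoring the input structure;
    strength=0.0: run zero steps, returning the input unchanged."""
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = num_inference_steps - init_timestep
    # Skip the first t_start steps; run the remaining init_timestep steps.
    return t_start, init_timestep
```

So at `strength=0.8` with 50 steps, the input latent is noised to the level of step 10 and only the last 40 denoising steps run — enough noise to repaint textures while the coarse layout of the input survives.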

4. Decoder

Latent diffusion models decode through a VAE decoder back to pixel space. GANs directly output pixels. Post-processing (tiling for high-res, face restoration) is common.

5. Evaluation

FID and LPIPS measure distributional quality and perceptual similarity. SSIM/PSNR for structural fidelity. Human preference studies (side-by-side comparisons) remain the gold standard because metrics poorly capture semantic quality.
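Of these metrics, PSNR is simple enough to compute directly; a minimal numpy version, assuming uint8-range images (the function name is ours):

```python
import numpy as np

def psnr(reference: np.ndarray, output: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shape images."""
    mse = np.mean((reference.astype(np.float64) - output.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

A uniform pixel offset of 16 against a reference yields roughly 24 dB, which is why PSNR alone is a poor proxy for perceptual quality: a globally brightened but otherwise perfect edit scores the same as one with visible noise of equal energy.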

Current Landscape

Image-to-image in 2025 is defined by diffusion models with controllable conditioning. ControlNet was the paradigm shift — it turned text-to-image models into precise image editors by adding spatial signals (edges, depth, pose). IP-Adapter extended this to image-prompted generation. The GAN era (Pix2Pix, CycleGAN) laid the theoretical foundation but is being replaced in practice by diffusion pipelines that are more flexible and produce higher quality. The ecosystem is now centered on composable conditioning: stack ControlNet (structure) + IP-Adapter (style) + LoRA (subject) to build complex editing workflows.

Key Challenges

Preserving identity and fine details from the input while applying meaningful transformations — too much preservation = no change, too little = lost subject

Spatial consistency — diffusion models can hallucinate extra fingers, distort faces, or misalign structural elements during editing

High-resolution output — most models operate at 512-1024px and require tiling strategies for print-quality output, introducing seam artifacts
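Feathered tile blending is the usual mitigation for those seam artifacts: each tile's contribution is weighted by a ramp over the overlap region and the weighted results are averaged. A minimal numpy sketch for single-channel images (`tiled_process` and its parameters are illustrative, not from any specific library):

```python
import numpy as np

def tiled_process(img, fn, tile=64, overlap=16):
    """Apply `fn` (an HxW -> HxW per-tile model) over overlapping tiles
    and feather-blend the overlaps to hide seams."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    # 1D weight ramp: rises linearly over the overlap, flat in the middle.
    ramp = np.minimum(np.arange(tile) + 1, np.arange(tile)[::-1] + 1)
    ramp = np.minimum(ramp, overlap).astype(np.float64)
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            # Clamp so the last tile hugs the image border.
            y0, x0 = min(y, h - tile), min(x, w - tile)
            patch = fn(img[y0:y0 + tile, x0:x0 + tile])
            wmask = np.outer(ramp, ramp)  # 2D feather mask
            out[y0:y0 + tile, x0:x0 + tile] += patch * wmask
            weight[y0:y0 + tile, x0:x0 + tile] += wmask
    return out / weight
```

With an identity `fn`, the recombined output equals the input exactly, confirming that the weighted averaging itself introduces no artifacts; residual seams in practice come from per-tile models producing inconsistent content at the overlaps.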

Control granularity — users want to edit specific regions (change hair color, replace background) while keeping everything else pixel-perfect

Speed — diffusion models require 20-50 denoising steps (1-5 seconds on GPU), too slow for interactive editing in some workflows

Quick Recommendations

General-purpose image editing

FLUX.1-dev + ControlNet

Best text-image coherence in current diffusion models, strong structural control with ControlNet conditioning

Style transfer from reference image

IP-Adapter (SDXL or FLUX)

Captures style from a reference image without fine-tuning; works with any base diffusion model

Super-resolution

Real-ESRGAN or SwinIR

Real-ESRGAN handles real-world degradation (compression, noise) well; SwinIR for cleaner academic restoration

Inpainting / object removal

FLUX Inpainting or Stable Diffusion XL Inpaint

Context-aware fill that matches surrounding content, lighting, and perspective

Fast / real-time

SDXL Turbo or LCM-LoRA

1-4 step inference (~100ms on GPU) via distilled models; suitable for interactive applications

What's Next

The field is moving toward instruction-following image editing (InstructPix2Pix, MagicBrush) where you describe changes in natural language ('make it sunset', 'remove the car'), and consistent multi-image generation (same character across scenes). Video-to-video editing is the next frontier, requiring temporal consistency across frames. Expect 2025-2026 to bring real-time diffusion editing in consumer apps, one-step high-quality models, and better spatial reasoning to eliminate the 'extra fingers' problem.

Benchmarks & SOTA

Related Tasks
