
Image Transformation

Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.

How Image-to-Image Works

A technical deep-dive into image-to-image transformations. From the fundamental insight of noise-level control to advanced techniques like ControlNet and IP-Adapter.

1

The Core Insight

Understanding why image-to-image works requires grasping one fundamental idea.

The Problem

Text-to-image starts from pure noise. But what if you already have an image and want to modify it?

The Solution

Instead of starting from random noise, we start from a noisy version of your input image. The model then removes noise while following your instructions.

The Key Idea

The amount of noise added controls how much the output can deviate from the input. More noise = more creative freedom. Less noise = more faithful to the original.

Visualizing the Process

Original (your input image) + Noise (a controlled amount) = Noisy image (the starting point) -> Denoise (guided by your prompt) = Result (the transformed image)

The magic: less noise means the denoiser stays closer to your original.

More noise gives the model freedom to follow your prompt.
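
In code terms, a typical img2img pipeline turns strength into the number of denoising steps it actually runs. A minimal sketch of that mapping (illustrative, not a library API):

def img2img_start_step(num_inference_steps: int, strength: float) -> int:
    # strength = 1.0 -> start from (almost) pure noise and run every step
    # strength = 0.2 -> add a little noise and run only the last 20% of steps
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run  # index of the first step to run

# Example: 30 steps at strength 0.5 -> skip the first 15, denoise for 15
print(img2img_start_step(30, 0.5))  # 15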

2

Image-to-Image Tasks

Each task solves a different problem, but all share the same fundamental mechanism.

Inpainting

Fill masked regions with contextually appropriate content

Why this matters

Sometimes you need to remove an object, fix a defect, or replace part of an image. The challenge is generating content that seamlessly blends with the surroundings.

How it works

The model sees the unmasked regions as fixed constraints. During denoising, it conditions on the visible pixels to ensure the generated content matches lighting, texture, and semantics.

Examples: object removal, artifact repair, content replacement
Models: SDXL Inpainting, FLUX Fill, Ideogram Canvas
3

The Strength Parameter

Understanding strength is the key to controlling image-to-image transformations.

Strength controls how much noise is added to your input image before denoising begins. Think of it as the "creativity dial" - higher values give the model more freedom to change your image.

The strength scale runs from 0.0 (the original image) to 1.0 (a full transform):

0.2 - Subtle changes: minor variations, mostly the original
0.5 - Balanced: a mix of original and new content
0.8 - Major changes: the prompt dominates, overall structure preserved
1.0 - Full generation: equivalent to text-to-image
Low Strength (0.2-0.4)

Good for: Style adjustments, color correction, subtle modifications

High Strength (0.7-1.0)

Good for: Major transformations, sketches to photos, style transfer
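
As a concrete example, here is a minimal diffusers image-to-image call where strength is the main dial; the checkpoint name is the public SDXL base model and the file paths are placeholders:

from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

# Load a standard SDXL checkpoint in image-to-image mode
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

init_image = Image.open("input.jpg").resize((1024, 1024))

# strength=0.3 keeps the composition; try 0.7+ for bigger changes
result = pipe(
    prompt="the same scene as a watercolor painting",
    image=init_image,
    strength=0.3,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]

result.save("img2img.jpg")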

4

ControlNet: Spatial Control

ControlNet solves the fundamental limitation of text prompts: they cannot specify precise spatial structure.

The Problem

Text prompts are ambiguous. 'A person standing' could be any pose. How do you specify exact spatial structure?

The Solution

ControlNet adds a parallel network that encodes spatial conditions (edges, depth, pose) and injects them into the diffusion process.

ControlNet Architecture

Text + noise (the standard input) feeds the frozen U-Net encoder, while the control image (edges, pose, or depth) feeds the ControlNet encoder, a trainable copy of the U-Net encoder. The ControlNet's outputs pass through zero convolutions and are added to the U-Net decoder, which receives the combined features.
Zero Convolutions: The Training Trick
What

A 1x1 convolution layer where all weights and biases are initialized to zero.

Why

At the start of training, ControlNet outputs zeros, meaning the base model is unchanged. This preserves the pre-trained model's capabilities.

The Insight

This is like adding a volume knob that starts at zero. The model learns to turn up the volume on control signals without breaking what it already knows.
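
In code, a zero convolution is simply a 1x1 convolution whose weights and bias start at zero; a minimal PyTorch sketch (illustrative, not the reference ControlNet implementation):

import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 conv initialized to zero: it outputs zeros, and so contributes
    # nothing to the frozen base model, until training moves the weights
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

block = zero_conv(320)
x = torch.randn(1, 320, 64, 64)
assert torch.all(block(x) == 0)  # at initialization, the control branch is silent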

Control Types

Canny Edge: edges detected via gradient thresholding. Preserve exact outlines while changing textures and materials.
Depth: per-pixel distance from the camera. Maintain 3D structure while changing the objects within it.
OpenPose: human body keypoint detection. Generate images with exact human poses.
Segmentation: semantic regions (sky, person, car). Control object layout without exact shapes.
Scribble: freehand line drawings. Turn quick sketches into detailed images.
Normal Map: surface orientation at each pixel. Control lighting and surface detail.

Conditioning Scale

The conditioning scale runs from 0.0 (ignore the control) through 0.5 (balanced) to 1.0+ (strict control).

Tip: Start with 0.5-0.8. Values above 1.0 can over-constrain the model, leading to artifacts.
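
A worked example with diffusers and a canny-edge ControlNet follows; the checkpoint names are commonly used public ones and may differ from your setup:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

# Extract a canny edge map to use as the control image
image = np.array(Image.open("input.jpg").resize((1024, 1024)))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a canny ControlNet alongside the frozen SDXL base model
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning scale ~0.7: follow the edges, but leave room for the prompt
result = pipe(
    prompt="a futuristic city at sunset, cinematic lighting",
    image=canny_image,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]

result.save("controlled.jpg")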

5

IP-Adapter: Image as Prompt

What if you could use images as prompts instead of (or alongside) text?

The Problem

Text can't describe every visual detail. What if you could use images as prompts?

The Solution

IP-Adapter adds a parallel image encoder (CLIP) and cross-attention layers to inject image features alongside text.

How IP-Adapter Works

1. Encode the reference image with the CLIP vision encoder: extract high-level semantic features.
2. Project the image features into the text embedding space: a learned linear layer aligns the two modalities.
3. Add decoupled cross-attention: separate attention layers for text and image features.
4. Generate with combined conditioning: the model sees both text and image context.
Key Insight

Unlike fine-tuning (which changes the model), IP-Adapter is a lightweight adapter that preserves all base capabilities.

Decoupled Cross-Attention

Text embeddings and image embeddings each feed their own cross-attention layer (one for text, one for image), and the two outputs are merged into the combined features.

By keeping text and image attention separate, the model learns when to listen to each.
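
A minimal PyTorch sketch of that idea (not the actual IP-Adapter code): the same query attends to text and image keys/values in two separate attention calls, and the image branch is scaled before the results are added.

import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_image, v_image, image_scale=0.6):
    # One attention pass over text features, one over image features
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)
    image_out = F.scaled_dot_product_attention(q, k_image, v_image)
    # image_scale is the adapter scale: 0 ignores the image prompt entirely
    return text_out + image_scale * image_out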

IP-Adapter

Basic image conditioning. Good for style transfer.

IP-Adapter Plus

Higher fidelity. Uses more CLIP layers for detail.

IP-Adapter Face

Specialized for face identity preservation.
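
Using IP-Adapter through recent diffusers versions looks roughly like this; the repository, subfolder, and weight names are the commonly published ones and are assumptions about your setup:

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Attach the IP-Adapter weights to the base pipeline
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.6)  # 0 = text only, higher = follow the reference more closely

style_reference = load_image("reference_style.jpg")

result = pipe(
    prompt="a cozy cabin in the woods",
    ip_adapter_image=style_reference,
    num_inference_steps=30,
).images[0]

result.save("ip_adapter_result.jpg")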

6

Model Comparison

Choosing the right model depends on your specific task and requirements.

Model | Task | Quality | Speed | Architecture | Strengths
Real-ESRGAN | Super Resolution | High | Fast | RRDB-Net (CNN) | Photorealistic faces, fast inference
SUPIR | Super Resolution | Very High | Slow | Diffusion-based | Best quality, handles extreme upscaling
SDXL Inpaint | Inpainting | High | Medium | Latent diffusion | Open source, flexible, good text following
FLUX Fill | Inpainting | Very High | Medium | Rectified Flow | Best coherence, superior text understanding
ControlNet | Guided Generation | High | Medium | Parallel encoder with zero-conv | Most control modalities, well documented
IP-Adapter | Image Prompting | High | Fast | Decoupled cross-attention | Simple image conditioning, composable

Best for Inpainting: FLUX Fill (best coherence, superior text understanding)
Best for Upscaling: Real-ESRGAN / SUPIR (SUPIR for quality, Real-ESRGAN for speed)
Best for Control: ControlNet + SDXL (most control modalities, well documented)
7

Code Examples

Production-ready code with detailed comments explaining each step.

SDXL Inpainting (pip install diffusers torch)
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

# Load the inpainting pipeline
# The model was specifically trained for inpainting: its UNet takes 9 input
# channels (4 noisy latent + 4 masked-image latent + 1 mask) instead of the usual 4
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

# Load your image and create a mask
# Mask should be white (255) where you want to inpaint
image = Image.open("input.jpg").resize((1024, 1024))
mask = Image.open("mask.png").resize((1024, 1024))

# Inpaint with text guidance
# The prompt describes what should fill the masked region
result = pipe(
    prompt="a beautiful garden with colorful flowers",
    negative_prompt="blurry, low quality, distorted",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.5,
    strength=1.0,  # How much to change masked region
).images[0]

result.save("inpainted.jpg")

Quick Reference

For Inpainting
  • FLUX Fill (best quality)
  • SDXL Inpainting (open source)
  • Ideogram Canvas (API)
For Upscaling
  • Real-ESRGAN (fast)
  • SUPIR (best quality)
  • Magnific AI (API)
For Structure Control
  • ControlNet (edges, pose)
  • T2I-Adapter (lighter)
  • Multi-ControlNet
For Style/Reference
  • IP-Adapter (image prompt)
  • IP-Adapter Plus (detail)
  • IP-Adapter Face (identity)
Key Takeaways
  1. Strength controls how much the output can deviate from the input
  2. ControlNet uses zero convolutions to safely add spatial control
  3. IP-Adapter uses decoupled cross-attention for image prompts
  4. Combine techniques: ControlNet + IP-Adapter for maximum control (see the sketch below)
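
For takeaway 4, a rough sketch of combining both in a single diffusers pipeline (the checkpoint names and file paths are assumptions; adjust to your setup):

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# ControlNet fixes the spatial structure; IP-Adapter steers style and content
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.5)

canny_image = load_image("edges.png")      # precomputed edge map (placeholder path)
style_image = load_image("reference.jpg")  # style/content reference (placeholder path)

result = pipe(
    prompt="a portrait in soft morning light",
    image=canny_image,                 # spatial control
    ip_adapter_image=style_image,      # image prompt
    controlnet_conditioning_scale=0.7,
).images[0]

result.save("combined.jpg")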

Use Cases

  • Photo editing
  • Style transfer
  • Image restoration
  • Super-resolution
  • Object removal

Architectural Patterns

Diffusion-Based Editing

Use diffusion models for controlled image editing.

Pros:
  • High quality
  • Flexible control
Cons:
  • Slow
  • May change unintended areas

GAN-Based

Use GANs for image-to-image translation.

Pros:
  • Fast inference
  • Sharp outputs
Cons:
  • Limited diversity
  • Mode collapse

Inpainting Models

Specialized for filling masked regions.

Pros:
  • Great for removal
  • Context-aware
Cons:
  • Needs mask input
  • Limited editing

Implementations

API Services

Adobe Firefly

Adobe
API

Commercial-safe. Inpainting, generative fill.

Open Source

Stable Diffusion XL

CreativeML
Open Source

img2img, inpainting, outpainting. Versatile.

ControlNet

Apache 2.0
Open Source

Precise control: pose, depth, edges, etc.

InstructPix2Pix

MIT
Open Source

Edit images with text instructions.

Real-ESRGAN

BSD 3-Clause
Open Source

Fast, high-quality upscaler. 4x enhancement.

Benchmarks

Quick Facts

Input: Image
Output: Image
Implementations: 4 open source, 1 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for image transformation.
