
Image Transformation

Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.

How Image-to-Image Works

A technical deep-dive into image-to-image transformations. From the fundamental insight of noise-level control to advanced techniques like ControlNet and IP-Adapter.

1

The Core Insight

Understanding why image-to-image works requires grasping one fundamental idea.

The Problem

Text-to-image starts from pure noise. But what if you already have an image and want to modify it?

The Solution

Instead of starting from random noise, we start from a noisy version of your input image. The model then removes noise while following your instructions.

The Key Idea

The amount of noise added controls how much the output can deviate from the input. More noise = more creative freedom. Less noise = more faithful to the original.

Visualizing the Process

Original (your input image) + Noise (a controlled amount) = Noisy image (the starting point) -> Denoise (guided by your prompt) = Result (the transformed image)

The magic: less noise means the denoiser stays closer to your original.

More noise gives the model freedom to follow your prompt.
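
In code terms, a typical img2img pipeline turns strength into the number of denoising steps it actually runs. A minimal sketch of that mapping (illustrative, not a library API):

def img2img_start_step(num_inference_steps: int, strength: float) -> int:
    # strength = 1.0 -> start from (almost) pure noise and run every step
    # strength = 0.2 -> add a little noise and run only the last 20% of steps
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run  # index of the first step to run

# Example: 30 steps at strength 0.5 -> skip the first 15, denoise for 15
print(img2img_start_step(30, 0.5))  # 15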

2

Image-to-Image Tasks

Each task solves a different problem, but all share the same fundamental mechanism.

Inpainting

Fill masked regions with contextually appropriate content

Why this matters

Sometimes you need to remove an object, fix a defect, or replace part of an image. The challenge is generating content that seamlessly blends with the surroundings.

How it works

The model sees the unmasked regions as fixed constraints. During denoising, it conditions on the visible pixels to ensure the generated content matches lighting, texture, and semantics.

Examples: object removal, artifact repair, content replacement
Models: SDXL Inpainting, FLUX Fill, Ideogram Canvas
3

The Strength Parameter

Understanding strength is the key to controlling image-to-image transformations.

Strength controls how much noise is added to your input image before denoising begins. Think of it as the "creativity dial" - higher values give the model more freedom to change your image.

The strength scale runs from 0.0 (the original image) to 1.0 (a full transform):

0.2 - Subtle changes: minor variations, mostly the original
0.5 - Balanced: a mix of original and new content
0.8 - Major changes: the prompt dominates, overall structure preserved
1.0 - Full generation: equivalent to text-to-image
Low Strength (0.2-0.4)

Good for: Style adjustments, color correction, subtle modifications

High Strength (0.7-1.0)

Good for: Major transformations, sketches to photos, style transfer
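
As a concrete example, here is a minimal diffusers image-to-image call where strength is the main dial; the checkpoint name is the public SDXL base model and the file paths are placeholders:

from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

# Load a standard SDXL checkpoint in image-to-image mode
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

init_image = Image.open("input.jpg").resize((1024, 1024))

# strength=0.3 keeps the composition; try 0.7+ for bigger changes
result = pipe(
    prompt="the same scene as a watercolor painting",
    image=init_image,
    strength=0.3,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]

result.save("img2img.jpg")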

4

ControlNet: Spatial Control

ControlNet solves the fundamental limitation of text prompts: they cannot specify precise spatial structure.

The Problem

Text prompts are ambiguous. 'A person standing' could be any pose. How do you specify exact spatial structure?

The Solution

ControlNet adds a parallel network that encodes spatial conditions (edges, depth, pose) and injects them into the diffusion process.

ControlNet Architecture

Text + noise (the standard input) feeds the frozen U-Net encoder, while the control image (edges, pose, or depth) feeds the ControlNet encoder, a trainable copy of the U-Net encoder. The ControlNet's outputs pass through zero convolutions and are added to the U-Net decoder, which receives the combined features.
Zero Convolutions: The Training Trick
What

A 1x1 convolution layer where all weights and biases are initialized to zero.

Why

At the start of training, ControlNet outputs zeros, meaning the base model is unchanged. This preserves the pre-trained model's capabilities.

The Insight

This is like adding a volume knob that starts at zero. The model learns to turn up the volume on control signals without breaking what it already knows.
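
In code, a zero convolution is simply a 1x1 convolution whose weights and bias start at zero; a minimal PyTorch sketch (illustrative, not the reference ControlNet implementation):

import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 conv initialized to zero: it outputs zeros, and so contributes
    # nothing to the frozen base model, until training moves the weights
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

block = zero_conv(320)
x = torch.randn(1, 320, 64, 64)
assert torch.all(block(x) == 0)  # at initialization, the control branch is silent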

Control Types

Canny Edge: edges detected via gradient thresholding. Preserve exact outlines while changing textures and materials.
Depth: per-pixel distance from the camera. Maintain 3D structure while changing the objects within it.
OpenPose: human body keypoint detection. Generate images with exact human poses.
Segmentation: semantic regions (sky, person, car). Control object layout without exact shapes.
Scribble: freehand line drawings. Turn quick sketches into detailed images.
Normal Map: surface orientation at each pixel. Control lighting and surface detail.

Conditioning Scale

The conditioning scale runs from 0.0 (ignore the control) through 0.5 (balanced) to 1.0+ (strict control).

Tip: Start with 0.5-0.8. Values above 1.0 can over-constrain the model, leading to artifacts.
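
A worked example with diffusers and a canny-edge ControlNet follows; the checkpoint names are commonly used public ones and may differ from your setup:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

# Extract a canny edge map to use as the control image
image = np.array(Image.open("input.jpg").resize((1024, 1024)))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a canny ControlNet alongside the frozen SDXL base model
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning scale ~0.7: follow the edges, but leave room for the prompt
result = pipe(
    prompt="a futuristic city at sunset, cinematic lighting",
    image=canny_image,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]

result.save("controlled.jpg")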

5

IP-Adapter: Image as Prompt

What if you could use images as prompts instead of (or alongside) text?

The Problem

Text can't describe every visual detail. What if you could use images as prompts?

The Solution

IP-Adapter adds a parallel image encoder (CLIP) and cross-attention layers to inject image features alongside text.

How IP-Adapter Works

1. Encode the reference image with the CLIP vision encoder: extract high-level semantic features.
2. Project the image features into the text embedding space: a learned linear layer aligns the two modalities.
3. Add decoupled cross-attention: separate attention layers for text and image features.
4. Generate with combined conditioning: the model sees both text and image context.
Key Insight

Unlike fine-tuning (which changes the model), IP-Adapter is a lightweight adapter that preserves all base capabilities.

Decoupled Cross-Attention

Text embeddings and image embeddings each feed their own cross-attention layer (one for text, one for image), and the two outputs are merged into the combined features.

By keeping text and image attention separate, the model learns when to listen to each.
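
A minimal PyTorch sketch of that idea (not the actual IP-Adapter code): the same query attends to text and image keys/values in two separate attention calls, and the image branch is scaled before the results are added.

import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_image, v_image, image_scale=0.6):
    # One attention pass over text features, one over image features
    text_out = F.scaled_dot_product_attention(q, k_text, v_text)
    image_out = F.scaled_dot_product_attention(q, k_image, v_image)
    # image_scale is the adapter scale: 0 ignores the image prompt entirely
    return text_out + image_scale * image_out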

IP-Adapter

Basic image conditioning. Good for style transfer.

IP-Adapter Plus

Higher fidelity. Uses more CLIP layers for detail.

IP-Adapter Face

Specialized for face identity preservation.
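
Using IP-Adapter through recent diffusers versions looks roughly like this; the repository, subfolder, and weight names are the commonly published ones and are assumptions about your setup:

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Attach the IP-Adapter weights to the base pipeline
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.6)  # 0 = text only, higher = follow the reference more closely

style_reference = load_image("reference_style.jpg")

result = pipe(
    prompt="a cozy cabin in the woods",
    ip_adapter_image=style_reference,
    num_inference_steps=30,
).images[0]

result.save("ip_adapter_result.jpg")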

6

Model Comparison

Choosing the right model depends on your specific task and requirements.

Model | Task | Quality | Speed | Architecture | Strengths
Real-ESRGAN | Super Resolution | High | Fast | RRDB-Net (CNN) | Photorealistic faces, fast inference
SUPIR | Super Resolution | Very High | Slow | Diffusion-based | Best quality, handles extreme upscaling
SDXL Inpaint | Inpainting | High | Medium | Latent diffusion | Open source, flexible, good text following
FLUX Fill | Inpainting | Very High | Medium | Rectified Flow | Best coherence, superior text understanding
ControlNet | Guided Generation | High | Medium | Parallel encoder with zero-conv | Most control modalities, well documented
IP-Adapter | Image Prompting | High | Fast | Decoupled cross-attention | Simple image conditioning, composable

Best for Inpainting: FLUX Fill (best coherence, superior text understanding)
Best for Upscaling: Real-ESRGAN / SUPIR (SUPIR for quality, Real-ESRGAN for speed)
Best for Control: ControlNet + SDXL (most control modalities, well documented)
7

Code Examples

Production-ready code with detailed comments explaining each step.

SDXL Inpainting (pip install diffusers torch)
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

# Load the inpainting pipeline
# The model was specifically trained for inpainting: its UNet takes 9 input
# channels (4 noisy latent + 4 masked-image latent + 1 mask) instead of the usual 4
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

# Load your image and create a mask
# Mask should be white (255) where you want to inpaint
image = Image.open("input.jpg").resize((1024, 1024))
mask = Image.open("mask.png").resize((1024, 1024))

# Inpaint with text guidance
# The prompt describes what should fill the masked region
result = pipe(
    prompt="a beautiful garden with colorful flowers",
    negative_prompt="blurry, low quality, distorted",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.5,
    strength=1.0,  # How much to change masked region
).images[0]

result.save("inpainted.jpg")

Quick Reference

For Inpainting
  • FLUX Fill (best quality)
  • SDXL Inpainting (open source)
  • Ideogram Canvas (API)
For Upscaling
  • Real-ESRGAN (fast)
  • SUPIR (best quality)
  • Magnific AI (API)
For Structure Control
  • ControlNet (edges, pose)
  • T2I-Adapter (lighter)
  • Multi-ControlNet
For Style/Reference
  • IP-Adapter (image prompt)
  • IP-Adapter Plus (detail)
  • IP-Adapter Face (identity)
Key Takeaways
  1. Strength controls how much the output can deviate from the input
  2. ControlNet uses zero convolutions to safely add spatial control
  3. IP-Adapter uses decoupled cross-attention for image prompts
  4. Combine techniques: ControlNet + IP-Adapter for maximum control (see the sketch below)
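
For takeaway 4, a rough sketch of combining both in a single diffusers pipeline (the checkpoint names and file paths are assumptions; adjust to your setup):

import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# ControlNet fixes the spatial structure; IP-Adapter steers style and content
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.5)

canny_image = load_image("edges.png")      # precomputed edge map (placeholder path)
style_image = load_image("reference.jpg")  # style/content reference (placeholder path)

result = pipe(
    prompt="a portrait in soft morning light",
    image=canny_image,                 # spatial control
    ip_adapter_image=style_image,      # image prompt
    controlnet_conditioning_scale=0.7,
).images[0]

result.save("combined.jpg")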

Use Cases

  • Photo editing
  • Style transfer
  • Image restoration
  • Super-resolution
  • Object removal

Architectural Patterns

Diffusion-Based Editing

Use diffusion models for controlled image editing.

Pros:
  • High quality
  • Flexible control
Cons:
  • Slow
  • May change unintended areas

GAN-Based

Use GANs for image-to-image translation.

Pros:
  • Fast inference
  • Sharp outputs
Cons:
  • Limited diversity
  • Mode collapse

Inpainting Models

Specialized for filling masked regions.

Pros:
  • Great for removal
  • Context-aware
Cons:
  • Needs mask input
  • Limited editing

Implementations

API Services

Adobe Firefly

Adobe
API

Commercial-safe. Inpainting, generative fill.

Open Source

Stable Diffusion XL

CreativeML
Open Source

img2img, inpainting, outpainting. Versatile.

ControlNet

Apache 2.0
Open Source

Precise control: pose, depth, edges, etc.

InstructPix2Pix

MIT
Open Source

Edit images with text instructions.

Real-ESRGAN

BSD 3-Clause
Open Source

Fast, high-quality upscaler. 4x enhancement.

Benchmarks

Quick Facts

Input: Image
Output: Image
Implementations: 4 open source, 1 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for image transformation.
