Image-to-Image Models

Updated Jun 5, 2026

Image-to-image models (often shortened to img2img) are machine learning systems that take an input image and output a transformed version of it. The transformation can change style, modality, resolution, content, or composition. Typical tasks include translating semantic layouts into photographs, converting horse pictures into zebra pictures, upscaling low-resolution photos, filling masked regions, colorizing grayscale pictures, removing noise, and editing scenes based on text instructions.

The field grew out of classical image restoration and texture synthesis. Modern systems are dominated by convolutional neural networks, generative adversarial networks, and diffusion models. Many widely deployed computer vision products in 2024 and 2025, including Adobe Photoshop Generative Fill, Stable Diffusion inpainting, Midjourney Vary, and FLUX.1 Fill, are built on image-to-image architectures.

Image-to-image models sit at the intersection of image restoration, generative modeling, and visual understanding. Unlike text-to-image models, which generate images from scratch based on text prompts, image-to-image models begin with an existing image and modify it in a controlled way. The conditioning source can be a paired target image, an unpaired exemplar, a mask, a text description, a sketch, a depth map, a pose skeleton, or a combination of these signals.

Definition and taxonomy

An image-to-image model learns a function f: X to Y where both X and Y are images. The function may be deterministic or stochastic. Researchers categorize the field by how training data is collected and by the change being made.

Category	Example task	Training regime
Paired translation	Semantic map to photo, sketch to photo	Supervised: matched input-output pairs
Unpaired translation	Horses to zebras, summer to winter	Unsupervised: cycle consistency
Style transfer	Photo to Van Gogh painting	Optimization or fast feed-forward
Modality conversion	Depth to color, infrared to RGB	Supervised or unpaired
Super-resolution	64x64 image to 512x512	Supervised with synthetic downsampling
Denoising	Low-light photo cleanup	Supervised with synthetic noise
Deblurring	Camera-shake correction	Supervised with synthetic blur kernels
Inpainting	Object removal, photo restoration	Supervised with synthetic masks
Outpainting	Generative canvas expansion	Masked generation with boundary context
Colorization	Black-and-white film restoration	Self-supervised from color images
Text-guided editing	"Make it snow"	Instruction datasets or RLHF

The key distinction between paired and unpaired training runs through much of the field's history. Paired training requires collecting matched examples of input and output images of the same scene or subject. This is straightforward for synthetic tasks like adding noise to a photograph and then recovering the clean version, but expensive or outright impossible for tasks like translating a photo to a painting, where no ground-truth painting of the same scene exists. Unpaired methods relax this requirement by exploiting structural constraints, the most influential being cycle consistency: if image x is translated to domain Y and then translated back to domain X, the result should match the original x.^[2]
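The cycle-consistency constraint can be sketched numerically. The snippet below is a minimal illustration, not CycleGAN's implementation: the two translators G and F are hypothetical stand-in functions (a brightness shift and its inverse) rather than trained networks, and the L1 form of the penalty is an assumption made for this example.

```python
import numpy as np

def G(x):
    """Hypothetical X -> Y translator: brighten by 0.1 (stand-in for a network)."""
    return x + 0.1

def F(y):
    """Hypothetical Y -> X translator: darken by 0.1 (stand-in for a network)."""
    return y - 0.1

def cycle_consistency_loss(x, forward, backward):
    """Mean absolute error between x and its round trip backward(forward(x))."""
    return float(np.mean(np.abs(backward(forward(x)) - x)))

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))  # toy "image": 3 channels, 8x8 pixels

loss_good = cycle_consistency_loss(x, G, F)            # F undoes G
loss_bad = cycle_consistency_loss(x, G, lambda y: y)   # backward map ignores the shift
```

With the exact inverse pair the loss is zero up to floating-point error; with the mismatched pair it equals the uncorrected brightness shift. In training, this term would be minimized jointly with the adversarial losses of both generators.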

History

Pre-deep-learning era

Before deep learning, image-to-image tasks relied on hand-crafted filters. Classical super-resolution used interpolation (bicubic, Lanczos) and example-based methods that copied patches from a database. Denoising used wavelet shrinkage, non-local means (Buades, Coll, and Morel 2005), and BM3D (Dabov et al. 2007). Patch-based texture synthesis from Efros and Leung (1999) and image quilting (Efros and Freeman 2001) influenced later inpainting work. PatchMatch (Barnes et al. 2009) powered Photoshop's Content-Aware Fill from 2010 onward.

These classical methods shared a common strategy: define a hand-coded energy function that rewards fidelity to known pixels and penalizes implausible patterns, then solve for the output that minimizes that energy. They worked well within narrow regimes but failed catastrophically at large missing regions, complex scene geometry, or cross-domain translation where the target distribution was fundamentally different from the source.

CNN era

Deep learning entered the field with SRCNN (Dong et al. 2014), a three-layer convolutional network for single image super-resolution. DCGAN by Radford, Metz, and Chintala (2015) standardized convolutional generative adversarial networks. Gatys, Ecker, and Bethge (2015) published "A Neural Algorithm of Artistic Style," using VGG feature statistics to combine the content of a photograph with the style of a painting and launching neural style transfer.^[14] The VGG perceptual loss from Johnson, Alahi, and Fei-Fei (2016) became the standard surrogate for visual similarity.^[25]

Neural style transfer worked by minimizing two losses simultaneously: a content loss computed from deep feature activations in a VGG network (measuring how well the output preserved the content of a photograph) and a style loss computed from Gram matrices of feature activations (measuring how well the output matched the statistical texture of a target painting).^[14] The optimization ran at test time on a single image pair, making it computationally expensive. Johnson, Alahi, and Fei-Fei (2016) trained feed-forward networks to amortize this cost, enabling real-time style transfer on mobile devices.^[25]
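The difference between the two losses can be illustrated on toy feature maps. This is a sketch under stated assumptions: random arrays stand in for VGG activations, and the normalization constant in `gram_matrix` is an arbitrary choice for this example, not the one used in the original paper.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map, normalized by size."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices."""
    return float(np.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2))

def content_loss(feat_a, feat_b):
    """Mean squared difference between raw activations."""
    return float(np.mean((feat_a - feat_b) ** 2))

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 6))  # stand-in for one layer of VGG activations
# Flipping left-right moves content but keeps channel co-occurrence statistics.
flipped = feats[:, :, ::-1]

style_same = style_loss(feats, flipped)
content_diff = content_loss(feats, flipped)
```

Because the Gram matrix sums over all spatial positions, it discards where features occur and keeps only which features co-occur. The flipped map therefore has essentially zero style loss but a large content loss, which is why the two terms can be traded off against each other.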

GAN-based translation

GAN-based image-to-image translation became a distinct subfield with Pix2Pix (Isola, Zhu, Zhou, and Efros, November 2016).^[1] Pix2Pix combined a U-Net generator with a PatchGAN discriminator and an L1 reconstruction term, producing a general-purpose framework for paired tasks like edges-to-photo and semantic-map-to-photo.^[1] CycleGAN (Zhu, Park, Isola, and Efros, March 2017) removed the requirement for paired data with a cycle consistency loss that demanded F(G(x)) approximately equal to x.^[2] CycleGAN's horse-to-zebra and summer-to-winter demonstrations went viral on social media.^[2]

Follow-ups pushed quality and flexibility. pix2pixHD (Wang et al., November 2017) reached megapixel resolutions using a coarse-to-fine generator and multi-scale discriminators. StarGAN (Choi et al., November 2017) translated between many domains with a single network using a domain label as input. MUNIT (Huang et al., April 2018) added disentangled content and style codes, so a user could specify content from one image and style from another. SPADE/GauGAN (Park, Liu, Wang, and Zhu, March 2019 at NVIDIA) introduced spatially adaptive normalization for semantic image synthesis and powered a popular landscape painting demo.^[5] StyleGAN-derived editing tools used GAN inversion to project real photos into a latent space and manipulate attributes such as age or expression.

The GAN framework for image-to-image translation had characteristic strengths and weaknesses. On the strength side, a well-trained GAN generator could produce sharp, photorealistic outputs from a single forward pass in milliseconds. On the weakness side, training was notoriously unstable: mode collapse (generating the same output for different inputs), training divergence, and sensitivity to hyperparameters required careful engineering. The discriminator only provided a gradient signal about whether an output looked real; it gave no pixel-level supervision about where the generator went wrong, which made learning fine detail difficult.

Super-resolution

SRGAN (Ledig et al., September 2016) was the first GAN to deliver photo-realistic 4x super-resolution by combining adversarial training with a VGG perceptual loss.^[3] ESRGAN (Wang et al., September 2018) refined SRGAN with a residual-in-residual dense block generator and won the PIRM perceptual challenge.^[18] EDSR (Lim et al. 2017) removed batch normalization from the residual blocks, and RCAN (Zhang et al. 2018) added residual channel attention. Real-ESRGAN (Wang, Xie, Dong, and Shan, July 2021 at Tencent) trained on synthetic high-order degradations to handle blurry, compressed real-world inputs and became a popular open-source upscaler.^[4] SwinIR (Liang et al., August 2021) used a Swin Transformer backbone with fewer parameters.^[6] Later transformer models include HAT (Chen et al. 2022), DAT, and Restormer (Zamir et al. 2022).^[23] Diffusion-based super-resolution, including SR3 (Saharia et al. 2021) and Magnific AI, became prominent after 2022.^[22]

A key insight from SRGAN was that optimizing a pixel-level loss (MSE between the output and the high-resolution reference) produces over-smoothed results that score well on PSNR but look blurry to human observers.^[3] Perceptual and adversarial losses trade pixel accuracy for statistical realism: the output may not exactly match the reference pixel-by-pixel, but it looks like a plausible high-resolution image with credible texture detail.^[3]

Inpainting

Inpainting models reconstruct missing regions of an image. PartialConv (Liu et al. 2018 at NVIDIA) introduced masked convolutions that are computed only over known pixels. DeepFill v1 and v2 (Yu et al. 2018, 2019) added contextual attention and gated convolutions, respectively. EdgeConnect (Nazeri et al. 2019) inpainted edges first and then filled colors. LaMa (Suvorov et al., September 2021 at Samsung Research) used fast Fourier convolutions for a global receptive field, allowing inpainting of large masks at resolutions far higher than seen during training.^[7] MAT (Li et al., March 2022 at CUHK) combined a mask-aware transformer with style modulation. CoModGAN (Zhao et al. 2021) added stochastic modulation. Stable Diffusion Inpaint (Stability AI, August 2022) and FLUX.1 Fill (Black Forest Labs, November 2024) brought diffusion-based inpainting to mainstream creative software.^[16]

Inpainting benchmarks typically measure reconstruction quality on held-out images where a region has been synthetically masked. Standard masks include rectangular crops, irregular shapes that simulate real object removal, and segmentation-based object masks. The difficulty scales with mask size and location: filling a small smooth region at the edge of a photo is much easier than filling the center of a complex scene.

Colorization

Learning-based colorization predicts chrominance channels of a grayscale input. Zhang, Isola, and Efros (2016) framed colorization as classification over quantized colors, training the network to predict a probability distribution over discrete color bins rather than a single point estimate.^[24] This allowed the model to express uncertainty about ambiguous regions (should the car be red or blue?) and sample diverse plausible colorizations.^[24] Iizuka, Simo-Serra, and Ishikawa (2016) fused global and local features. DeOldify, by Jason Antic in 2018, became one of the most widely used colorization tools for archival photos and film. Modern diffusion editors handle colorization as one of many text-conditioned tasks.

Diffusion era

Diffusion models reshaped image-to-image translation between 2021 and 2024. SDEdit (Meng et al., August 2021) added Gaussian noise to a stroke painting or rough input and then ran reverse diffusion to produce realistic photographs; this method is the basis of the standard "img2img" slider in Stable Diffusion.^[9] Palette (Saharia et al., November 2021 at Google Research) unified colorization, inpainting, uncropping, and JPEG restoration in one conditional diffusion model and outperformed task-specific GANs.^[8]

Stable Diffusion (Stability AI, August 2022), built on the latent diffusion architecture of Rombach et al. (December 2021), operated in latent space and shipped native image-to-image and inpainting modes that became the default for open-source generative art.^[21] ControlNet (Zhang, Rao, and Agrawala, February 2023) added a trainable copy of a diffusion U-Net's encoder conditioned on edges, depth, normal maps, segmentation, or human pose, with zero-initialized convolutions that keep the new branch from disturbing the base model at the start of training.^[10] InstructPix2Pix (Brooks, Holynski, and Efros, November 2022) trained a diffusion model to follow natural-language edits using a synthetic dataset built from GPT-3 and Stable Diffusion.^[11] DiffEdit, Prompt-to-Prompt, and Null-text Inversion offered training-free editing. IP-Adapter (Ye et al., August 2023 at Tencent AI Lab) decoupled cross-attention to accept text and image prompts using only 22 million extra parameters.^[12]

The diffusion framework brought several structural advantages to image-to-image tasks. First, a diffusion model trained on large-scale data naturally builds a strong prior over realistic images; when used for editing, this prior constrains the output to look plausible even in regions that are far from the input. Second, noise level (called "denoising strength" or "strength" in user interfaces) became a natural dial for controlling how much the output deviates from the input: low strength preserves most of the original content while high strength allows radical changes. Third, diffusion models could be conditioned on many signal types by concatenating them as additional channels or by injecting them through cross-attention, making the same backbone reusable across tasks.

2024 and 2025 developments

Stable Diffusion 3 (Stability AI, June 2024) and FLUX.1 (Black Forest Labs, August 2024) switched to diffusion transformer (DiT) backbones. Black Forest Labs released FLUX.1 Tools in November 2024, including FLUX.1 Fill for inpainting and outpainting, FLUX.1 Depth and Canny for structural conditioning, and FLUX.1 Redux for variations.^[16] Meta's Emu Edit (Sheynin et al., November 2023) treated instruction-based editing as a multi-task problem with sixteen skills.^[13] ByteDance's SeedEdit (Shi et al., November 2024) aligned a text-to-image diffusion model to editing and shipped in the Doubao app, with SeedEdit 3.0 following in 2025. Google's Imagen 3 and Gemini 2.0 image API added competitive editing endpoints during 2024.

FLUX.1 Kontext (Black Forest Labs, May 2025) extended the DiT approach to in-context image editing, where the model conditions on both the source image and a text instruction using in-context learning rather than specialized adapter modules. This enabled strong identity preservation during edits that modify only specified regions of a scene. Adobe's Firefly Image 4 Ultra, released in 2025, brought diffusion-based generative fill to professional photographers working with 200+ megapixel files.

Architectures

GAN-based architectures

Early img2img networks used encoder-decoder generators. The U-Net with skip connections from Ronneberger, Fischer, and Brox (2015) became the most common backbone after Pix2Pix.^[1] Skip connections pass feature maps from encoder layers directly to the corresponding decoder layers, preserving fine-grained spatial detail that would otherwise be lost in the compressed bottleneck representation. GAN systems pair a generator with a discriminator that may operate at the patch level (PatchGAN), at multiple scales (pix2pixHD), or with attention.

PatchGAN discriminators classify overlapping image patches rather than the full image.^[1] This is computationally efficient and encourages the generator to produce locally plausible texture, but may miss global inconsistencies. Multi-scale discriminators (pix2pixHD) operate at several image resolutions simultaneously, providing gradient signals about both fine detail and coarse structure.
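The patch size a PatchGAN output unit "sees" follows from its convolution stack. The sketch below computes that receptive field; the layer configuration (five 4x4 convolutions with strides 2, 2, 2, 1, 1) is an assumption modeled on the 70x70 discriminator commonly associated with Pix2Pix and is not stated in this article.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the field it covers.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed layer stack: five 4x4 convolutions with strides 2, 2, 2, 1, 1.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(patchgan_layers)  # 70
```

Each value in the discriminator's output map therefore judges one 70x70 window of the input, and adding or removing layers is the usual way to move between texture-level and image-level judgments.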

Diffusion-based architectures

Diffusion systems use a U-Net or DiT denoiser conditioned on the input image via channel concatenation, cross-attention, or adapter modules. In the channel concatenation approach used by Palette and early versions of Stable Diffusion Inpaint, the conditioning image (or the masked input) is concatenated with the noisy latent before the denoiser.^[8] This is simple but requires retraining the denoiser; it cannot easily be applied to a pretrained text-to-image model without modification.
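A shape-level sketch shows why concatenation forces retraining. The channel counts below (4 for the noisy latent, 4 for the encoded masked image, 1 for the mask) are an assumption modeled on latent-diffusion inpainting checkpoints, and random arrays stand in for real latents.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, height, width = 1, 64, 64

noisy_latent = rng.standard_normal((batch, 4, height, width))   # current noisy sample
masked_latent = rng.standard_normal((batch, 4, height, width))  # encoded image with hole
mask = np.zeros((batch, 1, height, width))
mask[:, :, 16:48, 16:48] = 1.0  # 1 marks the region to regenerate

# Stack everything along the channel axis before the first denoiser layer.
denoiser_input = np.concatenate([noisy_latent, masked_latent, mask], axis=1)
in_channels = denoiser_input.shape[1]  # 9 instead of the base model's 4
```

A text-to-image denoiser whose first convolution expects 4 channels cannot consume this 9-channel tensor, so the input layer has to be widened and the model fine-tuned.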

ControlNet's trainable encoder branch, IP-Adapter's decoupled cross-attention, and T2I-Adapter's projection layers all expose the denoiser to image conditioning without modifying the original base model.^[10] This matters because the base model carries a strong image prior learned from billions of images; disturbing its weights risks losing the diversity and realism that makes diffusion generation powerful. Zero-initialized convolutions, the key innovation in ControlNet, allow the trainable branch to start as a no-op and gradually learn the conditioning signal during fine-tuning.^[10]
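The no-op property of zero initialization can be checked directly. This is a minimal sketch, assuming a 1x1 convolution and random arrays in place of real U-Net activations; `conv1x1` is a hypothetical helper written for this example.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (channels, height, width) array."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
channels = 8
base_features = rng.standard_normal((channels, 16, 16))     # frozen base model activations
control_features = rng.standard_normal((channels, 16, 16))  # trainable branch activations

# Zero initialization: both weight and bias start at exactly zero.
zero_w = np.zeros((channels, channels))
zero_b = np.zeros(channels)
combined_at_init = base_features + conv1x1(control_features, zero_w, zero_b)

# After some (hypothetical) training the weights are no longer zero.
trained_w = 0.1 * rng.standard_normal((channels, channels))
combined_after = base_features + conv1x1(control_features, trained_w, zero_b)
```

At initialization the combined features equal the base features exactly, so the first fine-tuning step starts from the unmodified pretrained behavior; the branch only begins to influence the output as its weights move away from zero.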

SDEdit uses a simpler approach: add a controlled amount of Gaussian noise to the input image and then run the reverse diffusion process from that noisy state.^[9] The noise level controls the tradeoff between fidelity to the input and conformity to the text prompt or style.^[9] At low noise levels the output looks similar to the input with minor tweaks; at high noise levels the result may share only the rough composition with the input.
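The noising step can be sketched as follows. The example assumes a discrete DDPM-style linear beta schedule with typical default values rather than the SDE formulation of the original paper, and it omits the reverse process, which would require a trained denoiser.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_to_step(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # toy input image scaled to [-1, 1]

x_low = noise_to_step(x0, 100, alpha_bar, rng)   # light noise: input still dominant
x_high = noise_to_step(x0, 900, alpha_bar, rng)  # heavy noise: input mostly erased

# Correlation with the original measures how much of the input survives.
corr_low = np.corrcoef(x0.ravel(), x_low.ravel())[0, 1]
corr_high = np.corrcoef(x0.ravel(), x_high.ravel())[0, 1]
```

The lightly noised sample remains strongly correlated with the input, while the heavily noised one is close to uncorrelated. That difference is what lets the reverse process either refine the input or largely reinvent it.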

Transformer restoration networks like SwinIR, Restormer, and HAT have replaced pure CNNs for super-resolution and denoising since 2021.^[6] These models use window-based or channel-based self-attention to capture long-range dependencies that convolutions struggle with, improving coherence in textures that repeat across large spatial distances.^[23]

Conditioning mechanisms

Image-to-image conditioning takes several technical forms:

Concatenation: The conditioning image is stacked with the noisy input along the channel dimension before entering the denoiser. Used by Palette, SR3, and Stable Diffusion's inpainting model.^[8]

Cross-attention injection: Conditioning signals are encoded to a sequence of tokens and fed into the denoiser through cross-attention layers. Used by IP-Adapter for image prompts and by InstructPix2Pix for text instructions.^[12]

ControlNet branches: A separate trainable copy of the denoiser encoder processes the conditioning image, and its outputs are added to the frozen decoder's residuals. Enables dense structural conditioning (edges, depth, pose) at high fidelity.^[10]

Adapter modules: Compact adapter networks (T2I-Adapter, ControlLoRA) map the conditioning image to feature residuals added to specific layers. More parameter-efficient than ControlNet.

In-context conditioning: Newer models (FLUX.1 Kontext) concatenate source and target tokens in the same attention window, letting the model reason about the relationship between the two at every layer.

Notable models

Model	Year	Organization	Task focus	Architecture
SRCNN	2014	CUHK	Single image super-resolution	3-layer CNN
Pix2Pix	2016	UC Berkeley	Paired conditional image translation	U-Net + PatchGAN
SRGAN	2016	Twitter, Imperial College	4x super-resolution with adversarial loss	ResNet + GAN
CycleGAN	2017	UC Berkeley	Unpaired translation with cycle consistency	Two ResNet GANs
pix2pixHD	2017	NVIDIA, UC Berkeley	High-resolution semantic to photo	Multi-scale GAN
StarGAN	2017	Korea University, Clova	Multi-domain attribute translation	Single GAN
ESRGAN	2018	SenseTime, CUHK	Enhanced super-resolution	RRDB + GAN
GauGAN (SPADE)	2019	NVIDIA	Semantic image synthesis	SPADE normalization GAN
Real-ESRGAN	2021	Tencent ARC Lab	Real-world blind super-resolution	RRDB + high-order degradation
SwinIR	2021	ETH Zurich	Transformer-based restoration	Swin Transformer
LaMa	2021	Samsung Research	Large mask inpainting via FFT convolutions	Fast Fourier Conv.
Palette	2021	Google Research	Unified diffusion image translation	Conditional U-Net
SDEdit	2021	Stanford, CMU	Diffusion-based stroke editing	Score-based SDE
MAT	2022	CUHK	Mask-aware transformer inpainting	ViT + style modulation
Stable Diffusion Inpaint	2022	Stability AI	Latent diffusion inpainting	LDM U-Net
InstructPix2Pix	2022	UC Berkeley	Instruction-following diffusion editor	LDM U-Net
ControlNet	2023	Stanford	Structural conditioning for diffusion	Trainable encoder branch
IP-Adapter	2023	Tencent AI Lab	Image-prompt adapter	Decoupled cross-attention
Emu Edit	2023	Meta	Multi-task instruction editing	LDM with task embeddings
SeedEdit	2024	ByteDance	Aligned diffusion image editing	DiT-based
FLUX.1 Fill	2024	Black Forest Labs	DiT-based inpainting and outpainting	12B MM-DiT
FLUX.1 Kontext	2025	Black Forest Labs	In-context image editing	MM-DiT

GAN approaches vs. diffusion approaches

The two dominant technical paradigms for image-to-image translation, generative adversarial networks and diffusion models, differ in how they model the output distribution and how they are trained.

GAN generators learn a deterministic mapping from a noise vector and a conditioning input to an output image. Training uses two adversarial objectives: the generator tries to fool a discriminator, and the discriminator tries to classify real and generated images correctly. GANs produce outputs in a single forward pass, making inference fast (milliseconds per image on a GPU). But GAN training is sensitive to the balance between generator and discriminator; training instability, mode collapse, and vanishing gradients are common. Achieving very high output diversity is difficult: a GAN may learn to generate one excellent output per conditioning image rather than the full distribution of possible outputs.

Diffusion models learn to reverse a noise-addition process. A forward process gradually corrupts a real image into Gaussian noise over many steps. The model is trained to predict and reverse one step of this corruption given the current noisy image and a timestep embedding. At inference, the model starts from pure noise and applies a sequence of denoising steps (up to a thousand in early samplers), each shifting the sample toward the image distribution. Conditioning signals (including the input image) guide this trajectory. Diffusion models produce richer output diversity, are easier to train stably, and scale better with data and compute. Their main drawback is inference speed: even with accelerated samplers like DDIM or DPM-Solver, generating a 512x512 image may take dozens of model evaluations.

For image-to-image tasks specifically, the choice between GANs and diffusion models involves additional tradeoffs. Fast restoration tasks (denoising, JPEG artifact removal) often use GANs or feed-forward CNNs because a single forward pass is sufficient and inference latency matters. Creative editing tasks (inpainting, style transfer, instruction-based editing) have shifted almost entirely to diffusion models because of their stronger generative prior and easier multi-modal conditioning.

Property	GAN	Diffusion model
Inference speed	Very fast (single forward pass)	Slower (many denoising steps)
Training stability	Difficult, mode collapse risk	Stable, MSE-like loss
Output diversity	Often limited	High with guidance
Perceptual quality at scale	Good but hard to improve	Excellent with scale
Conditioning flexibility	Specialized architectures needed	Easy via concatenation or cross-attention
Text guidance	Rare	Native through CLIP or T5 embeddings

Conditioning and control

A central challenge in image-to-image modeling is controlling which aspects of the input are preserved and which are transformed. Different methods expose this control at different levels of granularity.

Strength and noise level

In diffusion img2img (SDEdit), the denoising strength parameter controls how much noise is added to the input and, as a result, how many denoising steps are run.^[9] Starting from a small amount of noise preserves most of the input's structure; starting from nearly pure noise allows dramatic changes. A strength of 1.0 is equivalent to text-to-image generation, because the input is replaced entirely by noise, while a strength of 0.3 retains most of the original pixel arrangement.
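The relationship between strength and step count is simple arithmetic. The helper below is hypothetical; it follows a convention used by some open-source img2img pipelines, in which strength selects what fraction of the sampler's schedule is actually run, and other tools may map the slider differently.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (first_step_index, steps_actually_run) for a strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step = num_inference_steps - steps_to_run
    return first_step, steps_to_run

full = img2img_schedule(50, 1.0)   # (0, 50): start from pure noise, run everything
light = img2img_schedule(50, 0.3)  # (35, 15): skip the 35 noisiest steps
none = img2img_schedule(50, 0.0)   # (50, 0): no steps run, input unchanged
```

Under this convention a lower strength also makes img2img faster than text-to-image generation with the same sampler settings, since fewer model evaluations are needed.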

Structural conditioning with ControlNet

ControlNet accepts dense conditioning images: Canny edges, HED soft edges, depth maps, normal maps, OpenPose skeleton coordinates, segmentation masks, MLSD straight-line detection, scribbles, and line art.^[10] Each conditioning type requires a separately trained ControlNet module; multiple modules can be stacked with weighted blending.^[10] This allows a user to, for example, fix the human pose in a generated image while completely changing its clothing, background, and lighting.

For DiT-based models like FLUX.1, Black Forest Labs released FLUX.1 Depth and FLUX.1 Canny as dedicated depth-conditioned and edge-conditioned variants, training the full model to accept these conditioning signals rather than using an adapter approach.^[16]

Instruction-based editing

InstructPix2Pix and its successors accept a text instruction like "add snow to the ground" or "make this person older" and apply the edit while preserving unspecified parts of the scene.^[11] The challenge is building a training set: instruction-image-output triplets. InstructPix2Pix solved this by generating synthetic edits using GPT-3 to write instructions and Stable Diffusion with Prompt-to-Prompt to execute them, then filtering the dataset.^[11] Emu Edit collected a cleaner human-curated dataset and trained with a multi-task objective that includes recognition tasks (image captioning, segmentation) alongside editing, improving precision.^[13]

Style and identity via IP-Adapter

IP-Adapter and its variants (InstantID, PuLID, IP-Adapter-FaceID) add a reference image as a second conditioning signal.^[12] The reference is encoded by an image encoder (CLIP ViT for style, a face recognition model for identity) and injected through decoupled cross-attention layers that operate in parallel with the text cross-attention.^[12] This enables "transfer this person's face to a new scene" or "apply the color palette of this painting" as one-shot operations without fine-tuning the base model.^[12]
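Decoupled cross-attention can be sketched with plain arrays. This is an illustrative single-head version under simplifying assumptions: keys and values are passed in directly, whereas a real adapter would learn separate key and value projections for the image tokens, and `image_scale` is a hypothetical name for the weight on the image branch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for 2-D arrays of shape (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Text attention plus a separately weighted image attention on the same queries."""
    text_k, text_v = text_kv
    image_k, image_v = image_kv
    return attention(q, text_k, text_v) + image_scale * attention(q, image_k, image_v)

rng = np.random.default_rng(0)
dim = 16
q = rng.standard_normal((10, dim))                                         # latent queries
text_kv = (rng.standard_normal((7, dim)), rng.standard_normal((7, dim)))   # 7 text tokens
image_kv = (rng.standard_normal((4, dim)), rng.standard_normal((4, dim)))  # 4 image tokens

out_text_only = decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.0)
out_both = decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0)
```

Setting the image weight to zero recovers ordinary text cross-attention, which is why such an adapter can be attached to a frozen base model and dialed up or down at inference time.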

Datasets

Common training and evaluation datasets include:

Dataset	Year	Size	Typical use
ImageNet	2009	1.2M images	Pretraining and classification losses
Cityscapes	2016	25K images	Semantic-to-photo translation
COCO	2014	330K images	Captioned editing, segmentation, inpainting
FFHQ	2019	70K faces	Face editing and StyleGAN training
CelebA-HQ	2017	30K faces	Attribute editing and inpainting
Places	2014	10M images	Inpainting and scene synthesis
DIV2K	2017	1K images	Super-resolution benchmarks
LAION-5B	2022	5.85B pairs	Diffusion model pretraining
MagicBrush	2023	10K triples	Instruction-based editing
PIPE	2023	1M pairs	Paired img2img for diffusion fine-tuning
InstructPix2Pix dataset	2022	454K triples	Instruction editing (synthetic)

DIV2K (Diverse 2K resolution) is the primary benchmark dataset for super-resolution and denoising. It contains 800 training images, 100 validation images, and 100 test images at 2K resolution, along with bicubically downsampled versions at x2, x3, and x4 scales. Models are commonly evaluated on Set5 (5 images), Set14 (14 images), BSD100 (100 images from Berkeley Segmentation Dataset), and Urban100 (100 urban images) alongside DIV2K.

Evaluation

No single metric captures image quality. Researchers report a mix of pixel accuracy, perceptual similarity, and distributional scores.

Metric	Year	Notes
PSNR	classical	Pixel-wise error, used in super-resolution; higher is better
SSIM	2004	Structural similarity, closer to human judgment than PSNR
LPIPS	2018	Learned perceptual similarity (Zhang et al.); lower is better
FID	2017	Distance between feature distributions (Heusel et al.); lower is better
KID	2018	Kernel inception distance, unbiased for small datasets
CLIP image similarity	2021	Cross-modal embedding distance for edit fidelity
DINO similarity	2021	Self-supervised feature distance for subject preservation
MagicBrush score	2023	Manual edit benchmark for instruction following
EditVal	2023	Compositional edit benchmark
CLIP-T	2021	Text-image CLIP score measuring how well the edit matches the instruction

Metric details

PSNR (Peak Signal-to-Noise Ratio) is defined as 10 log10(MAX^2 / MSE) where MAX is the maximum pixel value and MSE is the mean squared error between the output and a reference image. It measures pixel-level reconstruction fidelity. PSNR is widely used for super-resolution but correlates poorly with human perceptual quality; a blurry output that is slightly wrong everywhere may outscore a sharp output with occasional pixel errors.
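The formula translates directly into code. The function below is a minimal sketch that assumes 8-bit images (MAX = 255) by default and returns infinity when the two images are identical.

```python
import numpy as np

def psnr(output, reference, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    diff = output.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.full((8, 8), 100, dtype=np.uint8)
off_by_one = reference + 1            # every pixel wrong by 1, so MSE = 1
value = psnr(off_by_one, reference)   # 10 * log10(255^2), about 48.13 dB
```

An image that is off by one gray level everywhere still scores about 48 dB, which illustrates why small uniform errors are rewarded over sharp but locally wrong detail.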

SSIM (Structural Similarity Index Measure), introduced by Wang, Bovik, Sheikh, and Simoncelli (2004), computes a weighted combination of luminance similarity, contrast similarity, and structural similarity within local image windows.^[19] SSIM correlates better with human judgment of degradation than PSNR but still favors smoothed outputs.^[19]
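A simplified version of the computation is sketched below. It treats the whole image as a single window and uses the common two-term form with stabilizing constants k1 = 0.01 and k2 = 0.03; practical implementations instead average the score over many small local windows, so this function is an illustration rather than a drop-in metric.

```python
import numpy as np

def ssim_single_window(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM over one window covering both whole images (no sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # Luminance term compares means; the second term merges contrast and structure.
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
same = ssim_single_window(image, image)              # identical images
inverted = ssim_single_window(image, 255.0 - image)  # photographic negative
```

Identical images score 1, while the negative scores below zero because its structure is anti-correlated with the original even though its contrast is the same.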

LPIPS (Learned Perceptual Image Patch Similarity), introduced by Zhang et al. (2018), computes the distance between AlexNet, VGG, or SqueezeNet feature activations of two image patches.^[15] LPIPS was calibrated on a large dataset of human perceptual judgments and correlates far better with human preference than PSNR or SSIM.^[15] It is now a standard perceptual metric for image restoration and is also used as a training loss.

FID (Frechet Inception Distance) measures the distance between the Inception-v3 feature distributions of a set of generated images and a set of real reference images, using the Frechet distance between fitted Gaussian distributions.^[20] FID captures diversity and realism at the distribution level but is sensitive to the reference dataset choice and the number of samples used for estimation.^[20]
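The distance between the two fitted Gaussians has a closed form, sketched below. Random vectors stand in for Inception-v3 features, and the trace of the matrix square root is computed from eigenvalues so that the example needs only NumPy; production implementations typically use a dedicated matrix square root routine and far more samples.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    # Tr(sqrt(sigma1 @ sigma2)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))
    return float(mean_term + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

def fit_gaussian(features):
    """Mean and covariance of an (n_samples, dim) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))  # stand-in for Inception features
shifted = real + 2.0                  # same spread, mean moved by 2 per dimension

d_same = frechet_distance(*fit_gaussian(real), *fit_gaussian(real))
d_shift = frechet_distance(*fit_gaussian(real), *fit_gaussian(shifted))
```

Comparing a feature set with itself gives a distance of zero, and shifting every feature by 2 in four dimensions gives 16 (the squared length of the mean shift), because the covariance terms cancel.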

For instruction-based editing, the standard protocol is to report CLIP-based image similarity (measuring how much of the original is preserved) alongside CLIP text-image similarity (measuring how well the edit was applied). These two metrics are in tension: applying a large edit scores high on text-image similarity but low on image preservation, so researchers typically report both and examine the tradeoff curve.

Applications

Image-to-image models are deployed across creative, scientific, and consumer software. Adobe Photoshop's Generative Fill, released in beta on 23 May 2023 and shipped in September 2023, uses Adobe Firefly to fill user-drawn masks.^[17] Lightroom and Topaz Labs ship AI-based denoise, sharpen, and upscale tools. Magnific AI built a business around diffusion super-resolution. DeOldify restores historical photographs and films. AR camera filters in Snapchat, TikTok, and Instagram are partly powered by image-to-image GANs.

E-commerce uses virtual try-on systems to fit clothing to user photos. Real estate platforms inpaint to remove clutter or stage furniture. Medical imaging uses cross-modal translation to predict CT from MRI, denoise low-dose CT, and accelerate MRI reconstruction. Satellite teams use super-resolution to enhance low-cost sensors and translate between SAR and optical bands. Video game studios generate texture maps and concept art with diffusion editors. Animation pipelines integrate ControlNet and IP-Adapter for character consistency. Face anonymization, deepfake detection, and forensic restoration also build on these components.

Creative and commercial applications

Generative inpainting has replaced PatchMatch-based Content-Aware Fill as the primary object removal tool in professional photo editing workflows. Diffusion inpainting models learn not only to fill the missing region plausibly but to understand scene geometry, lighting direction, and subject consistency. A model filling the space behind a removed tree in a sunset photograph will attempt to continue the gradient of the sky rather than paste in a neutral background patch.

Super-resolution is used in digital restoration of archival film and photographs. Upscalers like Real-ESRGAN and Topaz Gigapixel AI apply multiple passes of enhancement, sometimes combined with face restoration modules like GFPGAN or CodeFormer that specialize in restoring facial detail from heavily compressed inputs.^[4]

Virtual try-on (VTON) systems take a product image and a person image as inputs and produce a composite showing the person wearing the product. Early VTON systems used warping networks to deform the clothing texture to fit the person's body pose; modern diffusion-based approaches like TryOnDiffusion (Zhu et al., 2023) generate the composite directly from the two inputs, handling complex fabric folds and body occlusion more naturally.

Scientific and industrial applications

In medical imaging, cross-modal translation converts MRI images to synthetic CT volumes, enabling radiation treatment planning in cases where CT scans are unavailable or undesirable (for example, in pediatric patients where CT radiation exposure is a concern). CycleGAN-based methods were widely studied for this task; supervised approaches that use paired MRI/CT datasets have since taken the lead in clinical settings.^[2]

In remote sensing, super-resolution and SAR-to-optical translation are used to extract information from satellite imagery. Commercial earth observation companies use super-resolution to enhance the effective resolution of low-cost satellite constellations. Sentinel-2 multispectral images have a native ground sample distance of 10 meters in their finest bands; diffusion-based super-resolution pipelines can synthesize visually convincing 2-meter or finer imagery for monitoring crops, infrastructure, and disasters, although the added detail is generated by the model and not observed by the sensor.
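The resolution figures above translate into a simple scale calculation. The sketch below is a minimal illustration, assuming square pixels and the 10-meter and 2-meter values quoted in the text; the helper name `upscale_stats` is hypothetical.

```python
def upscale_stats(native_gsd_m: float, target_gsd_m: float) -> dict:
    """Linear scale factor and pixel multiplication for a change in ground sample distance."""
    if native_gsd_m <= 0 or target_gsd_m <= 0:
        raise ValueError("ground sample distances must be positive")
    scale = native_gsd_m / target_gsd_m      # linear factor along each axis
    pixels_per_native = scale ** 2           # output pixels covering one native pixel
    return {"scale": scale, "pixels_per_native_pixel": pixels_per_native}


# Values taken from the text: 10 m native, 2 m target.
stats = upscale_stats(10.0, 2.0)
print(stats)  # {'scale': 5.0, 'pixels_per_native_pixel': 25.0}
```

A 5x linear factor means 25 output pixels per native pixel, which gives a sense of how much detail the model has to supply.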

Limitations

Image-to-image models still have well-documented failure modes. Pixel-perfect editing is hard: small prompt changes can shift global lighting or color balance. Identity preservation during face edits is unreliable across long generation chains. Fine text in scenes is often garbled, although DiT-based (diffusion transformer) models released since 2024 render it more reliably. Hands, fingers, and small accessories remain frequent error spots. Inpainting models can hallucinate plausible but unrelated content into masked regions. Diffusion models are slow at inference relative to GANs; FLUX.1 Fill needs several seconds per image while a CycleGAN forward pass takes milliseconds. Training-data biases produce uneven performance across skin tones and cultural settings, and many models inherit copyright disputes from web-scraped corpora like LAION-5B.

Additional limitations include:

Edit leakage: Changes made to one region of an image can propagate unexpectedly to other regions, particularly in diffusion models where the denoising process operates globally. Masking the region to be edited reduces but does not eliminate this effect.
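Edit leakage can be measured, and partly corrected, in pixel space. The following is a minimal NumPy sketch, not any particular tool's implementation: the function names are hypothetical, and the "edited" image is a synthetic stand-in for a model output (a global brightness shift plus an in-mask change).

```python
import numpy as np


def composite_with_mask(original, edited, mask):
    """Keep edited pixels inside the mask and original pixels outside it."""
    m = mask.astype(np.float32)[..., None]   # H x W x 1, broadcasts over channels
    return m * edited + (1.0 - m) * original


def leakage(original, edited, mask):
    """Mean absolute change outside the mask (0.0 means nothing leaked)."""
    outside = ~mask.astype(bool)
    if not outside.any():
        return 0.0
    return float(np.abs(edited[outside] - original[outside]).mean())


rng = np.random.default_rng(0)
original = rng.random((64, 64, 3), dtype=np.float32)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True

# Synthetic stand-in for a model output: the whole image gets brighter
# (the leak), and the masked square is replaced with flat gray (the edit).
edited = np.clip(original + 0.05, 0.0, 1.0)
edited[mask] = 0.5

fixed = composite_with_mask(original, edited, mask)
print(leakage(original, edited, mask) > 0.0)  # True: pixels outside the mask changed
print(leakage(original, fixed, mask))         # 0.0 after compositing
```

Compositing restores the unmasked pixels exactly, but a hard mask edge can leave a visible seam, and any global shift the model applied inside the mask remains. This is one way to read the statement that masking reduces the effect without eliminating it.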

Semantic inconsistency: Inpainting can fill a masked region with content that is visually plausible in isolation but semantically inconsistent with the rest of the scene (a door that opens into the wrong side of a building, a reflection that does not match the lighting).

GAN artifacts: Convolutional GAN generators can produce characteristic checkerboard artifacts, high-frequency noise patterns, and color fringing near edges. These artifacts are often imperceptible at first glance but visible under close inspection.
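One commonly cited cause of checkerboard patterns, which the article does not name and which is assumed here for illustration, is uneven kernel overlap in transposed-convolution upsampling layers. The sketch below counts how many kernel taps contribute to each output position of a 1-D transposed convolution; `overlap_counts` is a hypothetical helper, not a library function.

```python
def overlap_counts(length, kernel, stride):
    """Number of kernel taps landing on each output position of a 1-D transposed convolution."""
    out = [0] * ((length - 1) * stride + kernel)
    for i in range(length):          # each input element
        for k in range(kernel):      # stamps one kernel-sized footprint
            out[i * stride + k] += 1
    return out


# Kernel size not divisible by stride: interior coverage alternates.
print(overlap_counts(6, kernel=3, stride=2))
# [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1]

# Kernel size divisible by stride: interior coverage is uniform.
print(overlap_counts(6, kernel=4, stride=2))
# [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

The alternating counts in the first case mean some output positions receive systematically more signal than their neighbors, which in two dimensions shows up as a grid-like pattern.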

Super-resolution hallucination: Upscaling models trained with adversarial losses may hallucinate texture details (skin pores, fabric weave, architectural ornament) that are not present in the original low-resolution image. The output looks sharper but is not more faithful to the actual scene.^[3]
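The reason hallucinated detail cannot be checked against the input is that many high-resolution images reduce to the same low-resolution image. The sketch below illustrates this under simplifying assumptions: the degradation is plain box averaging, the images are random single-channel arrays, and both "outputs" are synthetic stand-ins, not results from a real upscaler.

```python
import numpy as np


def box_downsample(img, factor):
    """Average non-overlapping factor x factor blocks of an H x W image."""
    h, w = img.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by factor")
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


rng = np.random.default_rng(0)
low_res = rng.random((16, 16))
block = np.ones((4, 4))

# Candidate 1: plain 4x pixel replication, which adds no new detail.
plain = np.kron(low_res, block)

# Candidate 2: the same image plus invented texture whose mean is zero
# inside every 4 x 4 block, so it vanishes under box averaging.
texture = rng.normal(0.0, 0.05, plain.shape)
texture -= np.kron(box_downsample(texture, 4), block)
textured = plain + texture

for name, candidate in [("plain", plain), ("textured", textured)]:
    error = np.abs(box_downsample(candidate, 4) - low_res).max()
    print(name, "consistent with input:", bool(error < 1e-9))

print("candidates differ:", bool(np.abs(plain - textured).max() > 0.01))
```

Both candidates reproduce the low-resolution input when downsampled, yet they disagree at full resolution. A consistency check of this kind can therefore rule out some wrong outputs but cannot confirm that added texture is real.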

Computational cost at scale: Diffusion-based image-to-image models are substantially more expensive to run than feed-forward CNN or GAN alternatives. For real-time applications like AR filters or video editing, this remains a significant practical barrier.

Data and consent concerns: Many models were trained on web-scraped image datasets. For inpainting and style transfer specifically, this means the model may reproduce recognizable stylistic elements from artists whose work appeared in training data. The legal status of this practice remains unsettled as of 2025.

Relation to other fields

Image-to-image models are closely related to several adjacent areas of generative AI. Text-to-image models share the same diffusion and GAN architectures; indeed, most text-to-image models include image-to-image and inpainting modes as standard features. The difference is in the primary conditioning source: a text prompt versus an input image.

Video-to-video translation applies image-to-image methods across time, adding consistency constraints between frames. Stable Video Diffusion and AnimateDiff extended latent diffusion models to video by adding temporal attention layers. Text-to-video models like Sora and Wan Video use diffusion transformers over video patches and can accept an input image as conditioning for the first frame.
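The need for consistency constraints can be seen with a crude flicker measure. The sketch below is illustrative only: `flicker` is a hypothetical helper, the clip is a static synthetic scene, and the two "edits" are random perturbations standing in for a per-frame model and a temporally consistent one.

```python
import numpy as np


def flicker(frames):
    """Mean absolute difference between consecutive frames of a T x H x W array."""
    frames = np.asarray(frames, dtype=np.float64)
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    return float(np.abs(np.diff(frames, axis=0)).mean())


rng = np.random.default_rng(0)
static_scene = np.tile(rng.random((32, 32)), (8, 1, 1))   # 8 identical frames

# Independent perturbation per frame, as if each frame were edited separately.
per_frame_edit = static_scene + rng.normal(0.0, 0.05, static_scene.shape)

# One perturbation shared by all frames, as if the edit were temporally consistent.
shared_edit = static_scene + rng.normal(0.0, 0.05, (32, 32))

print(flicker(per_frame_edit) > flicker(shared_edit))  # True
print(flicker(shared_edit))                            # 0.0 for a static scene
```

This raw frame difference only makes sense for a static scene; with real footage, motion would have to be compensated before comparing frames.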

3D reconstruction systems increasingly use image-to-image translation as a preprocessing step. Normal maps, depth maps, and albedo maps estimated from single images are used as conditioning inputs for NeRF and Gaussian Splatting pipelines.

Diffusion models trained for image-to-image tasks also appear as components in larger systems: a super-resolution model may be appended to a text-to-image pipeline to refine output quality, or an inpainting model may be used inside a robotic manipulation system to augment training data by filling in objects at desired locations.

References

Isola, P., Zhu, J.Y., Zhou, T., and Efros, A.A. (2016). "Image-to-Image Translation with Conditional Adversarial Networks." arXiv:1611.07004. https://arxiv.org/abs/1611.07004
Zhu, J.Y., Park, T., Isola, P., and Efros, A.A. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." arXiv:1703.10593. https://arxiv.org/abs/1703.10593
Ledig, C. et al. (2016). "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network." arXiv:1609.04802. https://arxiv.org/abs/1609.04802
Wang, X., Xie, L., Dong, C., and Shan, Y. (2021). "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." arXiv:2107.10833. https://arxiv.org/abs/2107.10833
Park, T., Liu, M.Y., Wang, T.C., and Zhu, J.Y. (2019). "Semantic Image Synthesis with Spatially-Adaptive Normalization." arXiv:1903.07291. https://arxiv.org/abs/1903.07291
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. (2021). "SwinIR: Image Restoration Using Swin Transformer." arXiv:2108.10257. https://arxiv.org/abs/2108.10257
Suvorov, R. et al. (2021). "Resolution-robust Large Mask Inpainting with Fourier Convolutions." arXiv:2109.07161. https://arxiv.org/abs/2109.07161
Saharia, C. et al. (2021). "Palette: Image-to-Image Diffusion Models." arXiv:2111.05826. https://arxiv.org/abs/2111.05826
Meng, C. et al. (2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." arXiv:2108.01073. https://arxiv.org/abs/2108.01073
Zhang, L., Rao, A., and Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models." arXiv:2302.05543. https://arxiv.org/abs/2302.05543
Brooks, T., Holynski, A., and Efros, A.A. (2022). "InstructPix2Pix: Learning to Follow Image Editing Instructions." arXiv:2211.09800. https://arxiv.org/abs/2211.09800
Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. (2023). "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." arXiv:2308.06721. https://arxiv.org/abs/2308.06721
Sheynin, S. et al. (2023). "Emu Edit: Precise Image Editing via Recognition and Generation Tasks." arXiv:2311.10089. https://arxiv.org/abs/2311.10089
Gatys, L.A., Ecker, A.S., and Bethge, M. (2015). "A Neural Algorithm of Artistic Style." arXiv:1508.06576. https://arxiv.org/abs/1508.06576
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., and Wang, O. (2018). "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric." arXiv:1801.03924. https://arxiv.org/abs/1801.03924
Black Forest Labs (2024). "Introducing FLUX.1 Tools." https://bfl.ai/flux-1-tools/
Adobe (2023). "Get started with Generative Fill, powered by Adobe Firefly Generative AI now in Photoshop." https://blog.adobe.com/en/publish/2023/05/23/future-of-photoshop-powered-by-adobe-firefly
Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Loy, C.C., Qiao, Y., and Tang, X. (2018). "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks." arXiv:1809.00219. https://arxiv.org/abs/1809.00219
Wang, Z., Bovik, A.C., Sheikh, H.R., and Simoncelli, E.P. (2004). "Image Quality Assessment: From Error Visibility to Structural Similarity." IEEE Transactions on Image Processing. https://doi.org/10.1109/TIP.2003.819861
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium." NeurIPS 2017. arXiv:1706.08500. https://arxiv.org/abs/1706.08500
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. arXiv:2112.10752. https://arxiv.org/abs/2112.10752
Saharia, C. et al. (2021). "Image Super-Resolution via Iterative Refinement." arXiv:2104.07636. https://arxiv.org/abs/2104.07636
Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., and Yang, M.H. (2022). "Restormer: Efficient Transformer for High-Resolution Image Restoration." CVPR 2022. arXiv:2111.09881. https://arxiv.org/abs/2111.09881
Zhang, R., Isola, P., and Efros, A.A. (2016). "Colorful Image Colorization." ECCV 2016. arXiv:1603.08511. https://arxiv.org/abs/1603.08511
Johnson, J., Alahi, A., and Fei-Fei, L. (2016). "Perceptual Losses for Real-Time Style Transfer and Super-Resolution." ECCV 2016. arXiv:1603.08155. https://arxiv.org/abs/1603.08155
