Image-to-Image Models
Last reviewed
Jun 5, 2026
Sources
25 citations
Review status
Source-backed
Revision
v4 · 5,962 words
Image-to-image models (often shortened to img2img) are machine learning systems that take an input image and output a transformed version of it. The transformation can change style, modality, resolution, content, or composition. Typical tasks include translating semantic layouts into photographs, converting horse pictures into zebra pictures, upscaling low-resolution photos, filling masked regions, colorizing grayscale pictures, removing noise, and editing scenes based on text instructions.
The field grew out of classical image restoration and texture synthesis. Modern systems are dominated by convolutional neural networks, generative adversarial networks, and diffusion models. Many widely deployed computer vision products in 2024 and 2025, including Adobe Photoshop Generative Fill, Stable Diffusion inpainting, Midjourney Vary, and FLUX.1 Fill, are built on image-to-image architectures.
Image-to-image models sit at the intersection of image restoration, generative modeling, and visual understanding. Unlike text-to-image models, which generate images from scratch based on text prompts, image-to-image models begin with an existing image and modify it in a controlled way. The conditioning source can be a paired target image, an unpaired exemplar, a mask, a text description, a sketch, a depth map, a pose skeleton, or a combination of these signals.
An image-to-image model learns a function f: X to Y where both X and Y are images. The function may be deterministic or stochastic. Researchers categorize the field by how training data is collected and by the change being made.
| Category | Example task | Training regime |
|---|---|---|
| Paired translation | Semantic map to photo, sketch to photo | Supervised: matched input-output pairs |
| Unpaired translation | Horses to zebras, summer to winter | Self-supervised: cycle consistency |
| Style transfer | Photo to Van Gogh painting | Optimization or fast feed-forward |
| Modality conversion | Depth to color, infrared to RGB | Supervised or unpaired |
| Super-resolution | 64x64 image to 512x512 | Supervised with synthetic downsampling |
| Denoising | Low-light photo cleanup | Supervised with synthetic noise |
| Deblurring | Camera-shake correction | Supervised with synthetic blur kernels |
| Inpainting | Object removal, photo restoration | Supervised with synthetic masks |
| Outpainting | Generative canvas expansion | Masked generation with boundary context |
| Colorization | Black-and-white film restoration | Self-supervised from color images |
| Text-guided editing | "Make it snow" | Instruction datasets or RLHF |
The key distinction between paired and unpaired training runs through much of the field's history. Paired training requires collecting matched examples of input and output images at the same scene or subject. This is straightforward for synthetic tasks like adding noise to a photograph and then recovering the clean version, but extremely expensive for tasks like translating a photo to a painting. Unpaired methods relax this requirement by exploiting structural constraints, the most influential being cycle consistency: if image x is translated to domain Y and then translated back to domain X, the result should match the original x.[2]
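The cycle-consistency constraint can be written as a loss on the round trip. The following is a minimal sketch, assuming images are flat lists of floats and using toy brighten/darken functions in place of trained generators. The names `g_brighten`, `f_darken`, and `f_mismatched` are illustrative only, and the L1 distance is one reasonable choice of penalty.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(x, g, f):
    """Penalty on the round trip X -> Y -> X: distance between f(g(x)) and x."""
    return l1_distance(f(g(x)), x)


# Toy stand-ins for the two generators (assumptions, not trained networks).
def g_brighten(img):
    return [p + 0.1 for p in img]


def f_darken(img):
    return [p - 0.1 for p in img]


def f_mismatched(img):
    return [p - 0.3 for p in img]


x = [0.2, 0.5, 0.8]
loss_consistent = cycle_consistency_loss(x, g_brighten, f_darken)
loss_inconsistent = cycle_consistency_loss(x, g_brighten, f_mismatched)
print(loss_consistent, loss_inconsistent)
```

When the backward mapping undoes the forward one, the loss is close to zero; a mismatched pair is penalized in proportion to how far the round trip drifts from the original.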
Before deep learning, image-to-image tasks relied on hand-crafted filters. Classical super-resolution used interpolation (bicubic, Lanczos) and example-based methods that copied patches from a database. Denoising used wavelet shrinkage, non-local means (Buades, Coll, and Morel 2005), and BM3D (Dabov et al. 2007). Patch-based texture synthesis from Efros and Leung (1999) and image quilting (Efros and Freeman 2001) influenced later inpainting work. PatchMatch (Barnes et al. 2009) powered Photoshop's Content-Aware Fill from 2010 onward.
These classical methods shared a common strategy: define a hand-coded energy function that rewards fidelity to known pixels and penalizes implausible patterns, then solve for the output that minimizes that energy. They worked well within narrow regimes but failed catastrophically at large missing regions, complex scene geometry, or cross-domain translation where the target distribution was fundamentally different from the source.
Deep learning entered the field with SRCNN (Dong et al. 2014), a three-layer convolutional network for single image super-resolution. DCGAN by Radford, Metz, and Chintala (2015) standardized convolutional generative adversarial networks. Gatys, Ecker, and Bethge (2015) published "A Neural Algorithm of Artistic Style," using VGG feature statistics to combine the content of a photograph with the style of a painting and launching neural style transfer.[14] The VGG perceptual loss from Johnson, Alahi, and Fei-Fei (2016) became the standard surrogate for visual similarity.[25]
Neural style transfer worked by minimizing two losses simultaneously: a content loss computed from deep feature activations in a VGG network (measuring how well the output preserved the content of a photograph) and a style loss computed from Gram matrices of feature activations (measuring how well the output matched the statistical texture of a target painting).[14] The optimization ran at test time on a single image pair, making it computationally expensive. Johnson, Alahi, and Fei-Fei (2016) trained feed-forward networks to amortize this cost, enabling real-time style transfer on mobile devices.[25]
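To make the two losses concrete, here is a small sketch, assuming feature maps are given as a list of channels with each channel flattened to a list of activations. The tiny hand-written features stand in for VGG activations, and dividing the Gram matrix by the number of spatial positions is one common normalization rather than the only one.

```python
def gram_matrix(features):
    """Channel-by-channel inner products, normalized by spatial positions."""
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
        for ci in features
    ]


def style_loss(feat_out, feat_style):
    """Mean squared difference between the two Gram matrices."""
    g_out, g_style = gram_matrix(feat_out), gram_matrix(feat_style)
    c = len(g_out)
    total = sum(
        (g_out[i][j] - g_style[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)


def content_loss(feat_out, feat_content):
    """Mean squared difference between raw activations."""
    diffs = [
        (a - b) ** 2
        for ch_out, ch_ref in zip(feat_out, feat_content)
        for a, b in zip(ch_out, ch_ref)
    ]
    return sum(diffs) / len(diffs)


features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# Same activations with the spatial positions reversed.
rearranged = [list(reversed(ch)) for ch in features]

print(style_loss(features, rearranged))    # 0.0
print(content_loss(features, rearranged))  # 3.0
```

The Gram matrix discards where each activation occurred, so spatially rearranged features have identical style statistics but a non-zero content loss. That is why the two terms can be traded off against each other with separate weights.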
GAN-based image-to-image translation became a distinct subfield with Pix2Pix (Isola, Zhu, Zhou, and Efros, November 2016).[1] Pix2Pix combined a U-Net generator with a PatchGAN discriminator and an L1 reconstruction term, producing a general-purpose framework for paired tasks like edges-to-photo and semantic-map-to-photo.[1] CycleGAN (Zhu, Park, Isola, and Efros, March 2017) removed the requirement for paired data with a cycle consistency loss that demanded F(G(x)) approximately equal to x.[2] CycleGAN's horse-to-zebra and summer-to-winter demonstrations went viral on social media.[2]
Follow-ups pushed quality and flexibility. pix2pixHD (Wang et al., November 2017) reached megapixel resolutions using a coarse-to-fine generator and multi-scale discriminators. StarGAN (Choi et al., November 2017) translated between many domains with a single network using a domain label as input. MUNIT (Huang et al., April 2018) added disentangled content and style codes, so a user could specify content from one image and style from another. SPADE/GauGAN (Park, Liu, Wang, and Zhu, March 2019 at NVIDIA) introduced spatially adaptive normalization for semantic image synthesis and powered a popular landscape painting demo.[5] StyleGAN-derived editing tools used GAN inversion to project real photos into a latent space and manipulate attributes such as age or expression.
The GAN framework for image-to-image translation had characteristic strengths and weaknesses. On the strength side, a well-trained GAN generator could produce sharp, photorealistic outputs from a single forward pass in milliseconds. On the weakness side, training was notoriously unstable: mode collapse (generating the same output for different inputs), training divergence, and sensitivity to hyperparameters required careful engineering. The discriminator only provided a gradient signal about whether an output looked real; it gave no pixel-level supervision about where the generator went wrong, which made learning fine detail difficult.
SRGAN (Ledig et al., September 2016) was the first GAN to deliver photo-realistic 4x super-resolution by combining adversarial training with a VGG perceptual loss.[3] ESRGAN (Wang et al., September 2018) refined SRGAN with a residual-in-residual dense block generator and won the PIRM perceptual challenge.[18] EDSR (Lim et al. 2017) simplified the residual blocks by removing batch normalization, and RCAN (Zhang et al. 2018) added residual channel attention. Real-ESRGAN (Wang, Xie, Dong, and Shan, July 2021 at Tencent) trained on synthetic high-order degradations to handle blurry, compressed real-world inputs and became a popular open-source upscaler.[4] SwinIR (Liang et al., August 2021) used a Swin Transformer backbone with fewer parameters.[6] Later transformer models include HAT (Chen et al. 2022), DAT, and Restormer (Zamir et al. 2022).[23] Diffusion-based super-resolution, including SR3 (Saharia et al. 2021) and Magnific AI, became prominent after 2022.[22]
A key insight from SRGAN was that optimizing a pixel-level loss (MSE between the output and the high-resolution reference) produces over-smoothed results that score well on PSNR but look blurry to human observers.[3] Perceptual and adversarial losses trade pixel accuracy for statistical realism: the output may not exactly match the reference pixel-by-pixel, but it looks like a plausible high-resolution image with credible texture detail.[3]
Inpainting models reconstruct missing regions of an image. PartialConv (Liu et al. 2018 at NVIDIA) introduced masked, renormalized convolutions that condition only on known pixels. DeepFill v1 and v2 (Yu et al. 2018, 2019) introduced contextual attention and gated convolutions, respectively. EdgeConnect (Nazeri et al. 2019) inpainted edges first and then filled colors. LaMa (Suvorov et al., September 2021 at Samsung Research) used fast Fourier convolutions for a global receptive field, allowing inpainting of large masks at resolutions far higher than seen during training.[7] MAT (Li et al., March 2022 at CUHK) combined a mask-aware transformer with style modulation. CoModGAN (Zhao et al. 2021) added stochastic modulation. Stable Diffusion Inpaint (Stability AI, August 2022) and FLUX.1 Fill (Black Forest Labs, November 2024) brought diffusion-based inpainting to mainstream creative software.[16]
Inpainting benchmarks typically measure reconstruction quality on held-out images where a region has been synthetically masked. Standard masks include rectangular crops, irregular shapes that simulate real object removal, and segmentation-based object masks. The difficulty scales with mask size and location: filling a small smooth region at the edge of a photo is much easier than filling the center of a complex scene.
Learning-based colorization predicts chrominance channels of a grayscale input. Zhang, Isola, and Efros (2016) framed colorization as classification over quantized colors, training the network to predict a probability distribution over discrete color bins rather than a single point estimate.[24] This allowed the model to express uncertainty about ambiguous regions (should the car be red or blue?) and sample diverse plausible colorizations.[24] Iizuka, Simo-Serra, and Ishikawa (2016) fused global and local features. DeOldify, by Jason Antic in 2018, became the most-used colorization tool for archival photos and film. Modern diffusion editors handle colorization as one of many text-conditioned tasks.
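The benefit of predicting a distribution over bins rather than a point estimate can be seen in a toy case. This sketch assumes a single one-dimensional chroma axis with five bins and a hand-written bimodal prediction; the bin centers and probabilities are made up for illustration and are not taken from any published model.

```python
import random

# Hypothetical bin centers on a toy chroma axis (0.0 is neutral grey).
bin_centers = [-80.0, -40.0, 0.0, 40.0, 80.0]
# A bimodal prediction: the object is plausibly either of two vivid colors.
probs = [0.5, 0.05, 0.0, 0.05, 0.4]


def expected_value(centers, weights):
    """Point estimate that averages over all bins."""
    return sum(c * w for c, w in zip(centers, weights))


def most_likely(centers, weights):
    """Center of the highest-probability bin."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return centers[best]


mean_color = expected_value(bin_centers, probs)
mode_color = most_likely(bin_centers, probs)

rng = random.Random(0)
samples = rng.choices(bin_centers, weights=probs, k=5)

print(mean_color, mode_color, samples)
```

Averaging the two modes lands near neutral grey, which is the desaturated look typical of regression-based colorization, while taking the mode or sampling yields a vivid and plausible color.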
Diffusion models reshaped image-to-image translation between 2021 and 2024. SDEdit (Meng et al., August 2021) added Gaussian noise to a stroke painting or rough input and then ran reverse diffusion to produce realistic photographs; this method is the basis of the standard "img2img" slider in Stable Diffusion.[9] Palette (Saharia et al., November 2021 at Google Research) unified colorization, inpainting, uncropping, and JPEG restoration in one conditional diffusion model and outperformed task-specific GANs.[8]
Latent diffusion (Rombach et al., December 2021), released publicly as Stable Diffusion in August 2022, operated in latent space and shipped native image-to-image and inpainting modes that became the default for open-source generative art.[21] ControlNet (Zhang, Rao, and Agrawala, February 2023) added a trainable copy of a diffusion U-Net encoder conditioned on edges, depth, normal maps, segmentation, or human pose, with zero-initialized convolutions that prevent the new branch from disturbing the base model at the start of training.[10] InstructPix2Pix (Brooks, Holynski, and Efros, November 2022) trained a diffusion model to follow natural-language edits using a synthetic dataset built from GPT-3 and Stable Diffusion.[11] DiffEdit, Prompt-to-Prompt, and Null-text Inversion offered training-free editing. IP-Adapter (Ye et al., August 2023 at Tencent AI Lab) decoupled cross-attention to accept text and image prompts using only 22 million extra parameters.[12]
The diffusion framework brought several structural advantages to image-to-image tasks. First, a diffusion model trained on large-scale data naturally builds a strong prior over realistic images; when used for editing, this prior constrains the output to look plausible even in regions that are far from the input. Second, noise level (called "denoising strength" or "strength" in user interfaces) became a natural dial for controlling how much the output deviates from the input: low strength preserves most of the original content while high strength allows radical changes. Third, diffusion models could be conditioned on many signal types by concatenating them as additional channels or by injecting them through cross-attention, making the same backbone reusable across tasks.
Stable Diffusion 3 (Stability AI, June 2024) and FLUX.1 (Black Forest Labs, August 2024) switched to diffusion transformer (DiT) backbones. Black Forest Labs released FLUX.1 Tools in November 2024, including FLUX.1 Fill for inpainting and outpainting, FLUX.1 Depth and Canny for structural conditioning, and FLUX.1 Redux for variations.[16] Meta's Emu Edit (Sheynin et al., November 2023) treated instruction-based editing as a multi-task problem with sixteen skills.[13] ByteDance's SeedEdit (Wang et al., November 2024) aligned a text-to-image diffusion model to editing and shipped in the Doubao app, with SeedEdit 3.0 following in 2025. Google's Imagen 3 and Gemini 2.0 image API added competitive editing endpoints during 2024.
FLUX.1 Kontext (Black Forest Labs, May 2025) extended the DiT approach to in-context image editing, where the model conditions on both the source image and a text instruction using in-context learning rather than specialized adapter modules. This enabled strong identity preservation during edits that modify only specified regions of a scene. Adobe's Firefly Image 4 Ultra, released in 2025, brought diffusion-based generative fill to professional photographers working with 200+ megapixel files.
Early img2img networks used encoder-decoder generators. The U-Net with skip connections from Ronneberger, Fischer, and Brox (2015) became the most common backbone after Pix2Pix.[1] Skip connections pass feature maps from encoder layers directly to the corresponding decoder layers, preserving fine-grained spatial detail that would otherwise be lost in the compressed bottleneck representation. GAN systems pair a generator with a discriminator that may operate at the patch level (PatchGAN), at multiple scales (pix2pixHD), or with attention.
PatchGAN discriminators classify overlapping image patches rather than the full image.[1] This is computationally efficient and encourages the generator to produce locally plausible texture, but may miss global inconsistencies. Multi-scale discriminators (pix2pixHD) operate at several image resolutions simultaneously, providing gradient signals about both fine detail and coarse structure.
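How large a patch each discriminator output "sees" follows from the convolution stack. The sketch below computes the receptive field and the output grid size for a list of layers. The layer list used here (five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1) is the commonly described 70x70 PatchGAN configuration; it is not spelled out in this article, so treat it as an assumption.

```python
def receptive_field(layers):
    """Input pixels seen by one output unit; layers are (kernel, stride) pairs."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(input_size, layers, padding=1):
    """Side length of the output grid after applying every convolution."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


# Assumed configuration: three stride-2 layers followed by two stride-1 layers.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(patchgan_layers))   # 70
print(output_size(256, patchgan_layers))  # 30
```

Under these assumptions a 256x256 input yields a 30x30 grid of real/fake scores, each judging an overlapping 70x70 region, which is why the discriminator is sensitive to local texture but not to image-wide consistency.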
Diffusion systems use a U-Net or DiT denoiser conditioned on the input image via channel concatenation, cross-attention, or adapter modules. In the channel concatenation approach used by Palette and early versions of Stable Diffusion Inpaint, the conditioning image (or the masked input) is concatenated with the noisy latent before the denoiser.[8] This is simple but requires retraining the denoiser; it cannot easily be applied to a pretrained text-to-image model without modification.
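The retraining requirement follows from simple channel arithmetic, sketched here with nested lists standing in for tensors. The 4 + 1 + 4 layout (noisy latent, mask, encoded masked image) is the arrangement usually described for latent-diffusion inpainting checkpoints and is an assumption in this example; `concat_channels` and `zeros` are hypothetical helpers.

```python
def zeros(channels, height, width):
    """A channels x height x width tensor of zeros as nested lists."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(*tensors):
    """Stack tensors along the channel axis after checking spatial sizes."""
    height, width = len(tensors[0][0]), len(tensors[0][0][0])
    stacked = []
    for tensor in tensors:
        for channel in tensor:
            if len(channel) != height or len(channel[0]) != width:
                raise ValueError("all inputs must share the same height and width")
            stacked.append(channel)
    return stacked


noisy_latent = zeros(4, 8, 8)         # what a text-to-image denoiser expects
mask = zeros(1, 8, 8)                 # 1 where content must be generated
masked_image_latent = zeros(4, 8, 8)  # encoded image with the hole blanked out

denoiser_input = concat_channels(noisy_latent, mask, masked_image_latent)
print(len(noisy_latent), len(denoiser_input))  # 4 9
```

A denoiser whose first convolution was trained on 4 input channels cannot consume the 9-channel tensor without new weights for the extra channels, which is why this style of conditioning involves fine-tuning rather than a drop-in change.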
ControlNet's trainable encoder branch, IP-Adapter's decoupled cross-attention, and T2I-Adapter's projection layers all expose the denoiser to image conditioning without modifying the original base model.[10] This matters because the base model carries a strong image prior learned from billions of images; disturbing its weights risks losing the diversity and realism that makes diffusion generation powerful. Zero-initialized convolutions, the key innovation in ControlNet, allow the trainable branch to start as a no-op and gradually learn the conditioning signal during fine-tuning.[10]
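The no-op behavior of a zero-initialized layer can be shown with scalars. This is a deliberately reduced sketch: `ZeroConv1x1` models a 1x1 convolution with a single weight and bias, and `frozen_block` and `control_branch` are hypothetical stand-ins for the pretrained block and the trainable copy.

```python
class ZeroConv1x1:
    """A 1x1 convolution reduced to one weight and one bias, both starting at 0."""

    def __init__(self):
        self.weight = 0.0
        self.bias = 0.0

    def __call__(self, features):
        return [self.weight * v + self.bias for v in features]


def frozen_block(x):
    """Stand-in for a pretrained block whose weights are never updated."""
    return [2.0 * v + 1.0 for v in x]


def control_branch(x, condition):
    """Stand-in for the trainable copy that also sees the conditioning image."""
    return [v + c for v, c in zip(x, condition)]


def controlled_block(x, condition, zero_conv):
    """Frozen output plus the branch output routed through the zero conv."""
    base = frozen_block(x)
    residual = zero_conv(control_branch(x, condition))
    return [b + r for b, r in zip(base, residual)]


x = [1.0, 2.0]
condition = [0.5, -0.5]
zero_conv = ZeroConv1x1()

out_at_init = controlled_block(x, condition, zero_conv)
zero_conv.weight = 0.5  # pretend fine-tuning has moved the weight away from zero
out_after_update = controlled_block(x, condition, zero_conv)

print(out_at_init, out_after_update)
```

At initialization the combined block reproduces the frozen model exactly, so fine-tuning starts from the base model's behavior and the conditioning signal only enters as the weight moves away from zero.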
SDEdit uses a simpler approach: add a controlled amount of Gaussian noise to the input image and then run the reverse diffusion process from that noisy state.[9] The noise level controls the tradeoff between fidelity to the input and conformity to the text prompt or style.[9] At low noise levels the output looks similar to the input with minor tweaks; at high noise levels the result may share only the rough composition with the input.
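The noising step can be sketched under a DDPM-style, variance-preserving formulation, where a noised image is a weighted sum of the input and Gaussian noise. The linear beta schedule and its endpoints are common defaults adopted here as assumptions, and the function names are illustrative.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_image(x0, t, alpha_bars, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for p in x0]


def signal_fraction(t, alpha_bars):
    """Weight on the original image at timestep t."""
    return math.sqrt(alpha_bars[t])


alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
x0 = [0.2, 0.5, 0.8]

lightly_noised = noise_image(x0, 100, alpha_bars, rng)
heavily_noised = noise_image(x0, 900, alpha_bars, rng)
print(signal_fraction(100, alpha_bars), signal_fraction(900, alpha_bars))
```

Choosing a small `t` keeps most of the input's signal, so reverse diffusion can only make modest changes; choosing a large `t` leaves little beyond coarse structure for the denoiser to respect.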
Transformer restoration networks like SwinIR, Restormer, and HAT have replaced pure CNNs for super-resolution and denoising since 2021.[6] These models use window-based or channel-based self-attention to capture long-range dependencies that convolutions struggle with, improving coherence in textures that repeat across large spatial distances.[23]
Image-to-image conditioning takes several technical forms:
Concatenation: The conditioning image is stacked with the noisy input along the channel dimension before entering the denoiser. Used by Palette, SR3, and Stable Diffusion's inpainting model.[8]
Cross-attention injection: Conditioning signals are encoded to a sequence of tokens and fed into the denoiser through cross-attention layers. Used by IP-Adapter for image prompts and by InstructPix2Pix for text instructions.[12]
ControlNet branches: A separate trainable copy of the denoiser encoder processes the conditioning image, and its outputs are added to the frozen decoder's residuals. Enables dense structural conditioning (edges, depth, pose) at high fidelity.[10]
Adapter modules: Compact adapter networks (T2I-Adapter, ControlLoRA) map the conditioning image to feature residuals added to specific layers. More parameter-efficient than ControlNet.
In-context conditioning: Newer models (FLUX.1 Kontext) concatenate source and target tokens in the same attention window, letting the model reason about the relationship between the two at every layer.
| Model | Year | Organization | Task focus | Architecture |
|---|---|---|---|---|
| SRCNN | 2014 | CUHK | Single image super-resolution | 3-layer CNN |
| Pix2Pix | 2016 | UC Berkeley | Paired conditional image translation | U-Net + PatchGAN |
| SRGAN | 2016 | Twitter, Imperial College | 4x super-resolution with adversarial loss | ResNet + GAN |
| CycleGAN | 2017 | UC Berkeley | Unpaired translation with cycle consistency | Two ResNet GANs |
| pix2pixHD | 2017 | NVIDIA, UC Berkeley | High-resolution semantic to photo | Multi-scale GAN |
| StarGAN | 2017 | Korea University, Clova | Multi-domain attribute translation | Single GAN |
| ESRGAN | 2018 | SenseTime, CUHK | Enhanced super-resolution | RRDB + GAN |
| GauGAN (SPADE) | 2019 | NVIDIA | Semantic image synthesis | SPADE normalization GAN |
| Real-ESRGAN | 2021 | Tencent ARC Lab | Real-world blind super-resolution | RRDB + high-order degradation |
| SwinIR | 2021 | ETH Zurich | Transformer-based restoration | Swin Transformer |
| LaMa | 2021 | Samsung Research | Large mask inpainting via FFT convolutions | Fast Fourier Conv. |
| Palette | 2021 | Google Research | Unified diffusion image translation | Conditional U-Net |
| SDEdit | 2021 | Stanford, CMU | Diffusion-based stroke editing | Score-based SDE |
| MAT | 2022 | CUHK | Mask-aware transformer inpainting | ViT + style modulation |
| Stable Diffusion Inpaint | 2022 | Stability AI | Latent diffusion inpainting | LDM U-Net |
| InstructPix2Pix | 2022 | UC Berkeley | Instruction-following diffusion editor | LDM U-Net |
| ControlNet | 2023 | Stanford | Structural conditioning for diffusion | Trainable encoder branch |
| IP-Adapter | 2023 | Tencent AI Lab | Image-prompt adapter | Decoupled cross-attention |
| Emu Edit | 2023 | Meta | Multi-task instruction editing | LDM with task embeddings |
| SeedEdit | 2024 | ByteDance | Aligned diffusion image editing | DiT-based |
| FLUX.1 Fill | 2024 | Black Forest Labs | DiT-based inpainting and outpainting | 12B MM-DiT |
| FLUX.1 Kontext | 2025 | Black Forest Labs | In-context image editing | MM-DiT |
The two dominant technical paradigms for image-to-image translation, generative adversarial networks and diffusion models, differ in how they model the output distribution and how they are trained.
GAN generators learn a deterministic mapping from a noise vector and a conditioning input to an output image. Training uses two adversarial objectives: the generator tries to fool a discriminator, and the discriminator tries to classify real and generated images correctly. GANs produce outputs in a single forward pass, making inference fast (milliseconds per image on a GPU). But GAN training is sensitive to the balance between generator and discriminator; training instability, mode collapse, and vanishing gradients are common. Achieving very high output diversity is difficult: a GAN may learn to generate one excellent output per conditioning image rather than the full distribution of possible outputs.
Diffusion models learn to reverse a noise-addition process. A forward process gradually corrupts a real image into Gaussian noise over many steps. The model is trained to predict and reverse one step of this corruption given the current noisy image and a timestep embedding. At inference, the model starts from pure noise and applies a sequence of denoising steps (hundreds to a thousand in the original formulations), each shifting the sample toward the image distribution. Conditioning signals (including the input image) guide this trajectory. Diffusion models produce richer output diversity, are easier to train stably, and scale better with data and compute. Their main drawback is inference speed: even with accelerated samplers like DDIM or DPM-Solver, generating a 512x512 image still takes dozens of model evaluations, compared with a single forward pass for a GAN.
For image-to-image tasks specifically, the choice between GANs and diffusion models involves additional tradeoffs. Fast restoration tasks (denoising, JPEG artifact removal) often use GANs or feed-forward CNNs because a single forward pass is sufficient and inference latency matters. Creative editing tasks (inpainting, style transfer, instruction-based editing) have shifted almost entirely to diffusion models because of their stronger generative prior and easier multi-modal conditioning.
| Property | GAN | Diffusion model |
|---|---|---|
| Inference speed | Very fast (single forward pass) | Slower (many denoising steps) |
| Training stability | Difficult, mode collapse risk | Stable, MSE-like loss |
| Output diversity | Often limited | High with guidance |
| Perceptual quality at scale | Good but hard to improve | Excellent with scale |
| Conditioning flexibility | Specialized architectures needed | Easy via concatenation or cross-attention |
| Text guidance | Rare | Native through CLIP or T5 embeddings |
A central challenge in image-to-image modeling is controlling which aspects of the input are preserved and which are transformed. Different methods expose this control at different levels of granularity.
In diffusion img2img (SDEdit), the denoising strength parameter controls how many diffusion steps are applied.[9] Starting from a small amount of noise preserves most of the input's structure; starting from nearly pure noise allows dramatic changes. A strength of 1.0 is equivalent to text-to-image generation with the output initialized to noise, while a strength of 0.3 retains most of the original pixel arrangement.
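One way the strength dial can translate into a step count is sketched below. It mirrors a convention found in common open-source img2img pipelines, but the function name and return format are hypothetical, and specific tools may round or clamp differently.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for a given denoising strength.

    The sampler's schedule has num_inference_steps entries ordered from the
    noisiest to the cleanest. The input image is noised to the level of entry
    steps_skipped and denoised from there.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


print(img2img_schedule(50, 0.3))  # (35, 15)
print(img2img_schedule(50, 1.0))  # (0, 50)
```

With 50 sampler steps and a strength of 0.3, only the last 15 (least noisy) steps run, so the input's layout survives; at 1.0 the full schedule runs and the input contributes essentially nothing.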
ControlNet accepts dense conditioning images: Canny edges, HED soft edges, depth maps, normal maps, OpenPose skeleton coordinates, segmentation masks, MLSD straight-line detection, scribbles, and line art.[10] Each conditioning type requires a separately trained ControlNet module; multiple modules can be stacked with weighted blending.[10] This allows a user to, for example, fix the human pose in a generated image while completely changing its clothing, background, and lighting.
For DiT-based models like FLUX.1, Black Forest Labs released FLUX.1 Depth and FLUX.1 Canny as dedicated depth-conditioned and edge-conditioned variants, training the full model to accept these conditioning signals rather than using an adapter approach.[16]
InstructPix2Pix and its successors accept a text instruction like "add snow to the ground" or "make this person older" and apply the edit while preserving unspecified parts of the scene.[11] The challenge is building a training set: instruction-image-output triplets. InstructPix2Pix solved this by generating synthetic edits using GPT-3 to write instructions and Stable Diffusion with Prompt-to-Prompt to execute them, then filtering the dataset.[11] Emu Edit collected a cleaner human-curated dataset and trained with a multi-task objective that includes recognition tasks (image captioning, segmentation) alongside editing, improving precision.[13]
IP-Adapter and its variants (InstantID, PuLID, IP-Adapter-FaceID) add a reference image as a second conditioning signal.[12] The reference is encoded by an image encoder (CLIP ViT for style, a face recognition model for identity) and injected through decoupled cross-attention layers that operate in parallel with the text cross-attention.[12] This enables "transfer this person's face to a new scene" or "apply the color palette of this painting" as one-shot operations without fine-tuning the base model.[12]
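Decoupled cross-attention can be sketched for a single query vector: the same query attends separately to text tokens and to image tokens, and the two results are summed with a scale on the image term. The sketch uses plain lists, skips the learned projection matrices, and all values are made up; `image_scale` is an illustrative name for the blending weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax(
        [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def decoupled_cross_attention(
    query, text_keys, text_values, image_keys, image_values, image_scale=1.0
):
    """Text attention plus a separately computed, scaled image attention."""
    text_out = attend(query, text_keys, text_values)
    image_out = attend(query, image_keys, image_values)
    return [t + image_scale * i for t, i in zip(text_out, image_out)]


query = [1.0, 0.0]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[0.0, 1.0]]
image_values = [[5.0, 5.0]]

text_only = attend(query, text_keys, text_values)
blended = decoupled_cross_attention(
    query, text_keys, text_values, image_keys, image_values, image_scale=1.0
)
disabled = decoupled_cross_attention(
    query, text_keys, text_values, image_keys, image_values, image_scale=0.0
)
print(text_only, blended, disabled)
```

Setting the scale to zero recovers the text-only output exactly, which is what allows the reference image to be added or removed without touching the base model's text pathway.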
Common training and evaluation datasets include:
| Dataset | Year | Size | Typical use |
|---|---|---|---|
| ImageNet | 2009 | 1.2M images | Pretraining and classification losses |
| Cityscapes | 2016 | 25K images | Semantic-to-photo translation |
| COCO | 2014 | 330K images | Captioned editing, segmentation, inpainting |
| FFHQ | 2019 | 70K faces | Face editing and StyleGAN training |
| CelebA-HQ | 2017 | 30K faces | Attribute editing and inpainting |
| Places | 2014 | 10M images | Inpainting and scene synthesis |
| DIV2K | 2017 | 1K images | Super-resolution benchmarks |
| LAION-5B | 2022 | 5.85B pairs | Diffusion model pretraining |
| MagicBrush | 2023 | 10K triples | Instruction-based editing |
| PIPE | 2023 | 1M pairs | Paired img2img for diffusion fine-tuning |
| InstructPix2Pix dataset | 2022 | 454K triples | Instruction editing (synthetic) |
DIV2K (Diverse 2K resolution) is the primary benchmark dataset for super-resolution and denoising. It contains 800 training, 100 validation, and 100 test images at 2K resolution, along with bicubically downsampled versions at x2, x3, and x4 scales. Models are commonly evaluated on Set5 (5 images), Set14 (14 images), BSD100 (100 images from Berkeley Segmentation Dataset), and Urban100 (100 urban images) alongside DIV2K.
No single metric captures image quality. Researchers report a mix of pixel accuracy, perceptual similarity, and distributional scores.
| Metric | Year | Notes |
|---|---|---|
| PSNR | classical | Pixel-wise error, used in super-resolution; higher is better |
| SSIM | 2004 | Structural similarity, closer to human judgment than PSNR |
| LPIPS | 2018 | Learned perceptual similarity (Zhang et al.); lower is better |
| FID | 2017 | Distance between feature distributions (Heusel et al.); lower is better |
| KID | 2018 | Kernel inception distance, unbiased for small datasets |
| CLIP image similarity | 2021 | Cross-modal embedding distance for edit fidelity |
| DINO similarity | 2021 | Self-supervised feature distance for subject preservation |
| MagicBrush score | 2023 | Manual edit benchmark for instruction following |
| EditVal | 2023 | Compositional edit benchmark |
| CLIP-T | 2021 | Text-image CLIP score measuring how well the edit matches the instruction |
PSNR (Peak Signal-to-Noise Ratio) is defined as 10 log10(MAX^2 / MSE) where MAX is the maximum pixel value and MSE is the mean squared error between the output and a reference image. It measures pixel-level reconstruction fidelity. PSNR is widely used for super-resolution but correlates poorly with human perceptual quality; a blurry output that is slightly wrong everywhere may outscore a sharp output with occasional pixel errors.
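The formula and the failure case described above can be checked with a few lines of code. This is a minimal sketch on flat lists of 8-bit pixel values; the two synthetic outputs are made up to illustrate the point and are not drawn from any benchmark.

```python
import math


def mse(output, reference):
    """Mean squared error between two equally sized flat images."""
    return sum((o - r) ** 2 for o, r in zip(output, reference)) / len(reference)


def psnr(output, reference, max_value=255.0):
    """10 * log10(MAX^2 / MSE), in decibels; infinite for identical images."""
    err = mse(output, reference)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [100] * 16
# Slightly wrong everywhere, as an over-smoothed output tends to be.
soft_output = [p + 2 for p in reference]
# Exact everywhere except one pixel that is badly off.
sharp_output = list(reference)
sharp_output[0] += 50

psnr_soft = psnr(soft_output, reference)
psnr_sharp = psnr(sharp_output, reference)
print(round(psnr_soft, 2), round(psnr_sharp, 2))
```

Because the error is squared before averaging, one large deviation costs far more than many small ones, so the uniformly-off output scores roughly 16 dB higher here even though 15 of the other output's 16 pixels are exact.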
SSIM (Structural Similarity Index Measure), introduced by Wang, Bovik, Sheikh, and Simoncelli (2004), computes a weighted combination of luminance similarity, contrast similarity, and structural similarity within local image windows.[19] SSIM correlates better with human judgment of degradation than PSNR but still favors smoothed outputs.[19]
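A single-window version of the index can be sketched as follows. It assumes the usual stabilizing constants C1 = (0.01 MAX)^2 and C2 = (0.03 MAX)^2 and uses population statistics over one window; practical implementations average this quantity over many local, often Gaussian-weighted, windows, which this sketch omits.

```python
def _mean(values):
    return sum(values) / len(values)


def _covariance(a, b):
    """Population covariance; _covariance(a, a) is the variance of a."""
    mean_a, mean_b = _mean(a), _mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


def ssim_single_window(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """Luminance, contrast, and structure terms combined for one window."""
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = _mean(x), _mean(y)
    var_x, var_y = _covariance(x, x), _covariance(y, y)
    cov_xy = _covariance(x, y)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


window = [50, 80, 120, 200, 90, 60, 140, 180]
flattened = [_mean(window)] * len(window)  # same brightness, no contrast
inverted = [255 - p for p in window]       # structure reversed

print(ssim_single_window(window, window))
print(ssim_single_window(window, flattened))
print(ssim_single_window(window, inverted))
```

An identical window scores 1, a window with the right brightness but no contrast scores near 0, and an inverted window scores below 0 because its covariance with the reference is negative.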
LPIPS (Learned Perceptual Image Patch Similarity), introduced by Zhang et al. (2018), computes the distance between AlexNet, VGG, or SqueezeNet feature activations of two image patches.[15] LPIPS was calibrated on a large dataset of human perceptual judgments and correlates far better with human preference than PSNR or SSIM.[15] It is now a standard perceptual metric for image restoration and is often used as a training loss as well.
FID (Frechet Inception Distance) measures the distance between the Inception-v3 feature distributions of a set of generated images and a set of real reference images, using the Frechet distance between fitted Gaussian distributions.[20] FID captures diversity and realism at the distribution level but is sensitive to the reference dataset choice and the number of samples used for estimation.[20]
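The Frechet distance between two Gaussians has a closed form that reduces, in one dimension, to the squared difference of means plus the squared difference of standard deviations. The sketch below implements only that one-dimensional special case on made-up scalar "features"; actual FID fits multivariate Gaussians to high-dimensional Inception-v3 features and involves a matrix square root, which is not shown here.

```python
import statistics


def frechet_distance_1d(real_features, generated_features):
    """Squared Frechet distance between 1-D Gaussians fitted to each sample."""
    mu_r = statistics.fmean(real_features)
    mu_g = statistics.fmean(generated_features)
    sigma_r = statistics.pstdev(real_features)
    sigma_g = statistics.pstdev(generated_features)
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


real = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 3.0 for v in real]  # right spread, wrong location
collapsed = [3.0] * 5              # right location, no diversity

print(frechet_distance_1d(real, real))
print(frechet_distance_1d(real, shifted))
print(frechet_distance_1d(real, collapsed))
```

The shifted set is penalized through the mean term and the collapsed set through the spread term, which illustrates how a distribution-level score can flag both unrealistic and insufficiently diverse outputs.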
For instruction-based editing, the standard protocol is to report CLIP-based image similarity (measuring how much of the original is preserved) alongside CLIP text-image similarity (measuring how well the edit was applied). These two metrics are in tension: applying a large edit scores high on text-image similarity but low on image preservation, so researchers typically report both and examine the tradeoff curve.
Image-to-image models are deployed across creative, scientific, and consumer software. Adobe Photoshop's Generative Fill, released in beta on 23 May 2023 and shipped in September 2023, uses Adobe Firefly to fill user-drawn masks.[17] Lightroom and Topaz Labs ship AI-based denoise, sharpen, and upscale tools. Magnific AI built a business around diffusion super-resolution. DeOldify restores historical photographs and films. AR camera filters in Snapchat, TikTok, and Instagram are partly powered by image-to-image GANs.
E-commerce uses virtual try-on systems to fit clothing to user photos. Real estate platforms inpaint to remove clutter or stage furniture. Medical imaging uses cross-modal translation to predict CT from MRI, denoise low-dose CT, and accelerate MRI reconstruction. Satellite teams use super-resolution to enhance low-cost sensors and translate between SAR and optical bands. Video game studios generate texture maps and concept art with diffusion editors. Animation pipelines integrate ControlNet and IP-Adapter for character consistency. Face anonymization, deepfake detection, and forensic restoration also build on these components.
Generative inpainting has replaced PatchMatch-based Content-Aware Fill as the primary object removal tool in professional photo editing workflows. Diffusion inpainting models learn not only to fill the missing region plausibly but to understand scene geometry, lighting direction, and subject consistency. A model filling the space behind a removed tree in a sunset photograph will attempt to continue the gradient of the sky rather than paste in a neutral background patch.
Super-resolution is used in digital restoration of archival film and photographs. Upscalers like Real-ESRGAN and Topaz Gigapixel AI apply multiple passes of enhancement, sometimes combined with face restoration modules like GFPGAN or CodeFormer that specialize in restoring facial detail from heavily compressed inputs.[4]
Virtual try-on (VTON) systems take a product image and a person image as inputs and produce a composite showing the person wearing the product. Early VTON systems used warping networks to deform the clothing texture to fit the person's body pose; modern diffusion-based approaches like TryOnDiffusion (Zhu et al., 2023) generate the composite directly from the two inputs, handling complex fabric folds and body occlusion more naturally.
In medical imaging, cross-modal translation converts MRI images to synthetic CT volumes, enabling radiation treatment planning in cases where CT scans are unavailable or undesirable (for example, in pediatric patients where CT radiation exposure is a concern). CycleGAN-based methods were widely studied for this task; supervised approaches that use paired MRI/CT datasets have since taken the lead in clinical settings.[2]
In remote sensing, super-resolution and SAR-to-optical translation are used to extract information from satellite imagery. Commercial earth observation companies use super-resolution to enhance the effective resolution of low-cost satellite constellations. Sentinel-2 multispectral images have a native 10-meter ground sample distance in their highest-resolution bands; diffusion-based super-resolution pipelines can synthesize visually convincing 2-meter or finer imagery for monitoring crops, infrastructure, and disasters.
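Going from 10 m to 2 m is a 5x upscaling per axis, so each input pixel is replaced by 25 synthesized ones. A minimal sketch of one sanity check for such outputs follows: downsample the super-resolved image by the same factor and compare it with the original input. The function names, the single-band list-of-lists image format, and the box-filter degradation model are all assumptions made for illustration; a real pipeline would use an array library and a sensor-specific blur model.

```python
def box_downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D single-band image."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                image[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


def nearest_upsample(image, factor):
    """Repeat every pixel factor times along both axes (stand-in for a model)."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def consistency_error(low_res, super_res, factor):
    """Mean absolute difference between the input and the re-downsampled output."""
    reduced = box_downsample(super_res, factor)
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(low_res, reduced)
        for a, b in zip(row_a, row_b)
    )
    return total / (len(low_res) * len(low_res[0]))


if __name__ == "__main__":
    low_res = [[10.0, 20.0], [30.0, 40.0]]       # hypothetical 2x2 tile at 10 m
    faithful = nearest_upsample(low_res, 5)      # 10x10 tile at 2 m
    drifted = [row[:] for row in faithful]
    for y in range(5):                           # overwrite one 5x5 block
        for x in range(5):
            drifted[y][x] = 60.0
    print(consistency_error(low_res, faithful, 5))
    print(consistency_error(low_res, drifted, 5))
```

A low error under this check is necessary but not sufficient for fidelity: invented texture that averages out within each block passes it unchanged, so the check catches brightness or content drift rather than hallucinated fine detail.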
Image-to-image models still have well-documented failure modes. Pixel-perfect editing is hard: small prompt changes can shift global lighting or color balance. Identity preservation during face edits is unreliable across long generation chains. Fine text in scenes is often garbled, although DiT-based models released since 2024 render text more reliably. Hands, fingers, and small accessories remain frequent error spots. Inpainting models can hallucinate plausible but unrelated content into masked regions. Diffusion models are slow at inference relative to GANs; FLUX.1 Fill needs several seconds per image while a CycleGAN forward pass takes milliseconds. Training-data biases produce uneven performance across skin tones and cultural settings, and many models inherit copyright disputes from web-scraped corpora like LAION-5B.
Additional limitations include:
Edit leakage: Changes made to one region of an image can propagate unexpectedly to other regions, particularly in diffusion models where the denoising process operates globally. Masking the region to be edited reduces but does not eliminate this effect.
Semantic inconsistency: Inpainting can fill a masked region with content that is visually plausible in isolation but semantically inconsistent with the rest of the scene (a door that opens onto the wrong side of a building, a reflection that does not match the lighting).
GAN artifacts: Convolutional GAN generators produce characteristic checkerboard artifacts, high-frequency noise patterns, and color fringing near edges. These artifacts are often imperceptible at first glance but visible under close inspection.
Super-resolution hallucination: Upscaling models trained with adversarial losses may hallucinate texture details (skin pores, fabric weave, architectural ornament) that are not present in the original low-resolution image. The output looks sharper but is not more faithful to the actual scene.[3]
Computational cost at scale: Diffusion-based image-to-image models are substantially more expensive to run than feed-forward CNN or GAN alternatives. For real-time applications like AR filters or video editing, this remains a significant practical barrier.
Data and consent concerns: Many models were trained on web-scraped image datasets. For inpainting and style transfer specifically, this means the model may reproduce recognizable stylistic elements from artists whose work appeared in training data. The legal status of this practice remains unsettled as of 2025.
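The edit leakage item above can be made concrete with a small sketch. One common mitigation, shown here as an illustration and not as the method of any particular product, is to composite the model output back onto the original so that pixels outside the mask are restored exactly. The helper names `composite` and `leakage`, the list-of-lists image format, and the convention that a mask value of 1.0 marks the editable region are assumptions made for this example.

```python
def composite(original, edited, mask):
    """Blend per pixel: mask 1.0 takes the edited value, 0.0 keeps the original."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(orig_row, edit_row, mask_row)]
        for orig_row, edit_row, mask_row in zip(original, edited, mask)
    ]


def leakage(original, edited, mask):
    """Mean absolute change over pixels that the mask says must stay untouched."""
    diffs = [
        abs(e - o)
        for orig_row, edit_row, mask_row in zip(original, edited, mask)
        for o, e, m in zip(orig_row, edit_row, mask_row)
        if m == 0.0
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    # Hypothetical 2x3 grayscale image; only the middle column is editable.
    original = [[100.0, 100.0, 100.0], [100.0, 100.0, 100.0]]
    mask = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    # Simulated model output: the masked column changed as requested,
    # but the surrounding pixels drifted slightly as well.
    edited = [[104.0, 30.0, 97.0], [101.0, 35.0, 99.0]]

    print("leakage before compositing:", leakage(original, edited, mask))
    blended = composite(original, edited, mask)
    print("leakage after compositing:", leakage(original, blended, mask))
```

Compositing removes drift outside the mask but can leave a visible seam at the boundary, and it does nothing about content inside the mask that was influenced by the global denoising process. A soft mask with values between 0.0 and 1.0 near the edge trades some leakage for a smoother transition.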
Image-to-image models are closely related to several adjacent areas of generative AI. Text-to-image models share the same diffusion and GAN architectures; indeed, most text-to-image models include image-to-image and inpainting modes as standard features. The difference is in the primary conditioning source: a text prompt versus an input image.
Video-to-video translation applies image-to-image methods over time, adding consistency constraints between frames. Stable Video Diffusion and AnimateDiff extended latent diffusion models to video by adding temporal attention layers. Text-to-video models like Sora and Wan Video use diffusion transformers over video patches and can accept an input image as conditioning for the first frame.
3D reconstruction systems increasingly use image-to-image translation as a preprocessing step. Normal maps, depth maps, and albedo maps estimated from single images are used as conditioning inputs for NeRF and Gaussian Splatting pipelines.
Diffusion models trained for image-to-image tasks also appear as components in larger systems: a super-resolution model may be appended to a text-to-image pipeline to refine output quality, or an inpainting model may be used inside a robotic manipulation system to augment training data by filling in objects at desired locations.