Image-to-Image Models
Last reviewed
May 11, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 2,440 words
Image-to-image models (often shortened to img2img) are machine learning systems that take an input image and output a transformed version of it. The transformation can change style, modality, resolution, content, or composition. Typical tasks include translating semantic layouts into photographs, converting horse pictures into zebra pictures, upscaling low-resolution photos, filling masked regions, colorizing grayscale pictures, removing noise, and editing scenes based on text instructions.
The field grew out of classical image restoration and texture synthesis. Modern systems are dominated by convolutional neural networks, generative adversarial networks, and diffusion models. Many widely deployed computer vision products in 2024 and 2025, including Adobe Photoshop Generative Fill, Stable Diffusion inpainting, Midjourney Vary, and FLUX.1 Fill, are built on image-to-image architectures.
An image-to-image model learns a function f: X to Y where both X and Y are images. The function may be deterministic or stochastic. Researchers categorize the field by how training data is collected and by the change being made.
| Category | Example task |
|---|---|
| Paired translation | Semantic map to photo, sketch to photo |
| Unpaired translation | Horses to zebras, summer to winter |
| Style transfer | Photo to Van Gogh painting |
| Modality conversion | Depth to color, infrared to RGB |
| Super-resolution | 64x64 image to 512x512 |
| Denoising | Low-light photo cleanup |
| Deblurring | Camera-shake correction |
| Inpainting | Object removal, photo restoration |
| Outpainting | Generative canvas expansion |
| Colorization | Black-and-white film restoration |
| Text-guided editing | "Make it snow" |
Before deep learning, image-to-image tasks relied on hand-crafted filters. Classical super-resolution used interpolation (bicubic, Lanczos) and example-based methods that copied patches from a database. Denoising used wavelet shrinkage, non-local means (Buades, Coll, and Morel 2005), and BM3D (Dabov et al. 2007). Patch-based texture synthesis from Efros and Leung (1999) and image quilting (Efros and Freeman 2001) influenced later inpainting work. PatchMatch (Barnes et al. 2009) powered Photoshop's Content-Aware Fill from 2010 onward.
Deep learning entered the field with SRCNN (Dong et al. 2014), a three-layer convolutional network for single image super-resolution. DCGAN by Radford, Metz, and Chintala (2015) standardized convolutional generative adversarial networks. Gatys, Ecker, and Bethge (2015) published "A Neural Algorithm of Artistic Style," using VGG feature statistics to combine the content of a photograph with the style of a painting and launching neural style transfer. The VGG perceptual loss from Johnson, Alahi, and Fei-Fei (2016) became the standard surrogate for visual similarity.
GAN-based image-to-image translation became a distinct subfield with Pix2Pix (Isola, Zhu, Zhou, and Efros, November 2016). Pix2Pix combined a U-Net generator with a PatchGAN discriminator and an L1 reconstruction term, producing a general-purpose framework for paired tasks like edges-to-photo and semantic-map-to-photo. CycleGAN (Zhu, Park, Isola, and Efros, March 2017) removed the requirement for paired data with a cycle consistency loss that demanded F(G(x)) approximately equal to x. CycleGAN's horse-to-zebra and summer-to-winter demonstrations went viral on social media.
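The cycle consistency idea can be illustrated with a small sketch. It is a toy under stated assumptions, not CycleGAN's training code: images are flat lists of floats, the "generators" `brighten`, `darken`, and `darken_too_much` are hypothetical stand-ins for G and F, and the adversarial terms of the full objective are omitted.

```python
from typing import Callable, List

Image = List[float]  # flattened toy image of pixel intensities
Generator = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(x: Image, y: Image, G: Generator, F: Generator) -> float:
    """L1 penalty on F(G(x)) versus x plus G(F(y)) versus y."""
    forward_cycle = l1_distance(F(G(x)), x)   # X -> Y -> X
    backward_cycle = l1_distance(G(F(y)), y)  # Y -> X -> Y
    return forward_cycle + backward_cycle


# Hypothetical toy mappings standing in for trained generators.
def brighten(img: Image) -> Image:
    return [p + 0.125 for p in img]


def darken(img: Image) -> Image:
    return [p - 0.125 for p in img]


def darken_too_much(img: Image) -> Image:
    return [p - 0.5 for p in img]


x = [0.25, 0.5, 0.75]
y = [0.375, 0.625, 0.875]
consistent = cycle_consistency_loss(x, y, brighten, darken)
inconsistent = cycle_consistency_loss(x, y, brighten, darken_too_much)
```

When the two mappings invert each other, the loss is zero; when they do not, each direction contributes its own reconstruction error, which is the signal that lets training proceed without paired examples.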
Follow-ups pushed quality and flexibility. pix2pixHD (Wang et al., November 2017) reached megapixel resolutions. StarGAN (Choi et al., November 2017) translated between many domains with a single network. MUNIT (Huang et al., April 2018) added disentangled content and style codes. SPADE/GauGAN (Park, Liu, Wang, and Zhu, March 2019 at NVIDIA) introduced spatially adaptive normalization for semantic image synthesis and powered a popular landscape painting demo. StyleGAN-derived editing tools used GAN inversion to project real photos into a latent space and manipulate attributes such as age or expression.
SRGAN (Ledig et al., September 2016) was the first GAN to deliver photo-realistic 4x super-resolution by combining adversarial training with a VGG perceptual loss. ESRGAN (Wang et al., September 2018) refined SRGAN with a residual-in-residual dense block generator and won the PIRM perceptual challenge. EDSR (Lim et al. 2017) simplified residual blocks by removing batch normalization, and RCAN (Zhang et al. 2018) added residual channel attention. Real-ESRGAN (Wang, Xie, Dong, and Shan, July 2021 at Tencent) trained on synthetic high-order degradations to handle blurry, compressed real-world inputs and became a popular open-source upscaler. SwinIR (Liang et al., August 2021) used a Swin Transformer backbone with fewer parameters than comparable CNN models. Later transformer models include HAT (Chen et al. 2022), DAT, and Restormer (Zamir et al. 2022). Diffusion-based super-resolution, including the research model SR3 (Saharia et al. 2021) and commercial tools such as Magnific AI, became prominent after 2022.
Inpainting models reconstruct missing regions of an image. PartialConv (Liu et al. 2018 at NVIDIA) introduced masked convolutions that condition only on known pixels. DeepFill v1 (Yu et al. 2018) introduced contextual attention, and DeepFill v2 (Yu et al. 2019) added gated convolutions. EdgeConnect (Nazeri et al. 2019) inpainted edges first and then filled colors. LaMa (Suvorov et al., September 2021 at Samsung Research) used fast Fourier convolutions for a global receptive field, allowing inpainting of large masks at resolutions far higher than seen during training. MAT (Li et al., March 2022 at CUHK) combined a mask-aware transformer with style modulation. CoModGAN (Zhao et al. 2021) co-modulated the generator with both the conditional input and a stochastic latent code. Stable Diffusion Inpaint (Stability AI, August 2022) and FLUX.1 Fill (Black Forest Labs, November 2024) brought diffusion-based inpainting to mainstream creative software.
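A masked convolution of the kind PartialConv describes can be sketched in one dimension. The sketch below is an illustration under assumptions rather than the published layer: it works on a 1D list instead of a 2D feature map, treats out-of-bounds positions as unknown, and the function name `partial_conv1d` is hypothetical. Each output is computed from known pixels only, rescaled by the ratio of window size to known-pixel count, and the mask is updated so that any window containing a known pixel becomes known.

```python
from typing import List, Tuple


def partial_conv1d(signal: List[float], mask: List[int],
                   kernel: List[float], bias: float = 0.0
                   ) -> Tuple[List[float], List[int]]:
    """Convolve using only positions where mask == 1, then update the mask."""
    if len(signal) != len(mask):
        raise ValueError("signal and mask must have the same length")
    k = len(kernel)
    half = k // 2
    out: List[float] = []
    new_mask: List[int] = []
    for i in range(len(signal)):
        acc = 0.0
        valid = 0
        for j, w in enumerate(kernel):
            idx = i + j - half
            # Skip padding and hole pixels so they never influence the output.
            if 0 <= idx < len(signal) and mask[idx]:
                acc += w * signal[idx]
                valid += 1
        if valid:
            out.append(acc * k / valid + bias)  # rescale for missing inputs
            new_mask.append(1)                  # window saw a known pixel
        else:
            out.append(0.0)                     # nothing known in this window
            new_mask.append(0)
    return out, new_mask


# Positions 2 to 4 are a hole; their stored values (99.0) must be ignored.
signal = [1.0, 1.0, 99.0, 99.0, 99.0, 1.0]
mask = [1, 1, 0, 0, 0, 1]
filled, updated_mask = partial_conv1d(signal, mask, kernel=[1.0, 1.0, 1.0])
```

In this example the three-pixel hole shrinks to one pixel after a single layer, which shows how stacking such layers progressively closes a masked region.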
Learning-based colorization predicts the chrominance channels of a grayscale input. Zhang, Isola, and Efros (2016) framed colorization as classification over quantized colors. Iizuka, Simo-Serra, and Ishikawa (2016) fused global and local features. DeOldify, released by Jason Antic in 2018, became one of the most widely used colorization tools for archival photos and film. Modern diffusion editors handle colorization as one of many text-conditioned tasks.
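Treating colorization as classification requires turning continuous chrominance values into discrete class labels. The sketch below shows one way to do that, assuming a Lab-style (a, b) plane divided into a square grid; the bin width, axis range, and function names are illustrative assumptions, and the sketch does not discard out-of-gamut bins as a production system would.

```python
from typing import Tuple

GRID = 10.0          # assumed bin width in ab units
AB_MIN = -110.0      # assumed lower bound of both chrominance axes
BINS_PER_AXIS = 22   # covers [-110, 110) at the assumed bin width


def quantize_ab(a: float, b: float) -> int:
    """Map a continuous (a, b) chrominance pair to a single class index."""
    def axis_bin(value: float) -> int:
        idx = int((value - AB_MIN) // GRID)
        return min(max(idx, 0), BINS_PER_AXIS - 1)  # clamp out-of-range values
    return axis_bin(a) * BINS_PER_AXIS + axis_bin(b)


def bin_center(index: int) -> Tuple[float, float]:
    """Return the (a, b) value at the center of a class's grid cell."""
    ia, ib = divmod(index, BINS_PER_AXIS)
    return (AB_MIN + (ia + 0.5) * GRID, AB_MIN + (ib + 0.5) * GRID)


label = quantize_ab(23.0, -47.0)   # class target a network would predict
decoded = bin_center(label)        # color recovered from the predicted class
```

A network trained this way outputs a distribution over classes per pixel, and a color is decoded from that distribution; the quantization error per axis is bounded by half the bin width.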
Diffusion models reshaped image-to-image translation between 2021 and 2024. SDEdit (Meng et al., August 2021) added Gaussian noise to a stroke painting or rough input and then ran reverse diffusion to produce realistic photographs; this method is the basis of the standard "img2img" slider in Stable Diffusion. Palette (Saharia et al., November 2021 at Google Research) unified colorization, inpainting, uncropping, and JPEG restoration in one conditional diffusion model and outperformed task-specific GANs.
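The noise-then-denoise structure behind SDEdit-style editing can be sketched as follows. This is a minimal illustration under assumptions: it uses a variance-preserving perturbation with a linear beta schedule, the `strength` parameter and the names `alpha_bar`, `sdedit`, and `denoise_step` are hypothetical, and the denoiser is passed in as a callable because a trained model is out of scope here.

```python
import math
import random
from typing import Callable, List

Image = List[float]  # flattened toy image, values roughly in [-1, 1]


def alpha_bar(t: int, num_steps: int,
              beta_min: float = 1e-4, beta_max: float = 0.02) -> float:
    """Signal fraction remaining after t steps of a linear beta schedule."""
    product = 1.0
    for i in range(t):
        beta = beta_min + (beta_max - beta_min) * i / max(num_steps - 1, 1)
        product *= 1.0 - beta
    return product


def sdedit(guide: Image, strength: float, num_steps: int,
           denoise_step: Callable[[Image, int], Image],
           rng: random.Random) -> Image:
    """Noise the guide to an intermediate step, then denoise back to step 0."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    t0 = round(strength * num_steps)  # higher strength starts from more noise
    a = alpha_bar(t0, num_steps)
    x = [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
         for p in guide]
    for t in range(t0, 0, -1):        # reverse process from t0 down to 1
        x = denoise_step(x, t)
    return x


# Stand-in denoiser that only records which steps were requested.
visited: List[int] = []


def recording_denoiser(x: Image, t: int) -> Image:
    visited.append(t)
    return x


guide = [0.5, -0.25, 0.0]
unchanged = sdedit(guide, 0.0, 10, recording_denoiser, random.Random(0))
half = sdedit(guide, 0.5, 10, recording_denoiser, random.Random(0))
```

With a strength of zero the guide image passes through untouched, and larger values hand more of the result to the generative model, which is the trade-off between faithfulness and realism that the technique exposes.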
Stable Diffusion, built on the latent diffusion architecture of Rombach et al. (December 2021) and released publicly in August 2022, operated in latent space and shipped native image-to-image and inpainting modes that became the default for open-source generative art. ControlNet (Zhang, Rao, and Agrawala, February 2023) added a trainable copy of a diffusion U-Net's encoder conditioned on edges, depth, normal maps, segmentation, or human pose, connected through zero-initialized convolutions so that the new branch leaves the base model's output unchanged at the start of training. InstructPix2Pix (Brooks, Holynski, and Efros, November 2022) trained a diffusion model to follow natural-language edits using a synthetic dataset built from GPT-3 and Stable Diffusion. DiffEdit, Prompt-to-Prompt, and Null-text Inversion offered training-free editing. IP-Adapter (Ye et al., August 2023 at Tencent AI Lab) decoupled cross-attention to accept text and image prompts using only 22 million extra parameters.
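The effect of zero-initialized convolutions can be shown with a small sketch. It is an illustration under assumptions, not ControlNet's code: features are lists of per-pixel channel values, the 1x1 convolution is written by hand, and the names `ZeroConv1x1` and `add_control` are hypothetical.

```python
from typing import List

Features = List[List[float]]  # one list of channel values per pixel


class ZeroConv1x1:
    """A 1x1 convolution whose weights and bias all start at zero."""

    def __init__(self, in_channels: int, out_channels: int) -> None:
        self.weight = [[0.0] * in_channels for _ in range(out_channels)]
        self.bias = [0.0] * out_channels

    def __call__(self, features: Features) -> Features:
        # Mix channels independently at every pixel.
        return [
            [sum(w * c for w, c in zip(row, pixel)) + b
             for row, b in zip(self.weight, self.bias)]
            for pixel in features
        ]


def add_control(base: Features, control: Features,
                zero_conv: ZeroConv1x1) -> Features:
    """Add the control branch's projected features to the base features."""
    residual = zero_conv(control)
    return [[b + r for b, r in zip(base_px, res_px)]
            for base_px, res_px in zip(base, residual)]


base = [[0.5, -1.0], [2.0, 0.25]]       # features from the frozen base model
control = [[3.0, 4.0], [-5.0, 6.0]]     # features from the control branch
conv = ZeroConv1x1(in_channels=2, out_channels=2)

at_init = add_control(base, control, conv)
conv.weight[0][0] = 0.5                 # pretend training moved one weight
after_update = add_control(base, control, conv)
```

At initialization the combined features equal the base features exactly, so the pretrained model behaves as before; the control signal only begins to contribute as training moves the weights away from zero.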
Stable Diffusion 3 (Stability AI, June 2024) and FLUX.1 (Black Forest Labs, August 2024) switched to diffusion transformer (DiT) backbones. Black Forest Labs released FLUX.1 Tools in November 2024, including FLUX.1 Fill for inpainting and outpainting, FLUX.1 Depth and Canny for structural conditioning, and FLUX.1 Redux for variations. Meta's Emu Edit (Sheynin et al., November 2023) treated instruction-based editing as a multi-task problem with sixteen skills. ByteDance's SeedEdit (Shi, Wang, and Huang, November 2024) aligned a text-to-image diffusion model to editing and shipped in the Doubao app, with SeedEdit 3.0 following in 2025. Google's Imagen 3 and Gemini 2.0 image API added competitive editing endpoints during 2024.
Early img2img networks used encoder-decoder generators. The U-Net with skip connections from Ronneberger, Fischer, and Brox (2015) became the most common backbone after Pix2Pix. GAN systems pair a generator with a discriminator that may operate at the patch level (PatchGAN), at multiple scales (pix2pixHD), or with attention. Diffusion systems use a U-Net or DiT denoiser conditioned on the input via channel concatenation, cross-attention, or adapter modules. ControlNet's trainable encoder branch, IP-Adapter's decoupled cross-attention, and T2I-Adapter's projection layers all expose the denoiser to image conditioning. Transformer restoration networks like SwinIR, Restormer, and HAT have largely displaced pure CNNs at the top of super-resolution and denoising benchmarks since 2021.
| Model | Year | Organization | Task focus |
|---|---|---|---|
| SRCNN | 2014 | CUHK | Single image super-resolution |
| Pix2Pix | 2016 | UC Berkeley | Paired conditional image translation |
| SRGAN | 2016 | Twitter, Imperial College | 4x super-resolution with adversarial loss |
| CycleGAN | 2017 | UC Berkeley | Unpaired translation with cycle consistency |
| pix2pixHD | 2017 | NVIDIA, UC Berkeley | High-resolution semantic to photo |
| StarGAN | 2017 | Korea University, Clova | Multi-domain attribute translation |
| ESRGAN | 2018 | SenseTime, CUHK | Enhanced super-resolution |
| GauGAN (SPADE) | 2019 | NVIDIA | Semantic image synthesis |
| Real-ESRGAN | 2021 | Tencent ARC Lab | Real-world blind super-resolution |
| SwinIR | 2021 | ETH Zurich | Transformer-based restoration |
| LaMa | 2021 | Samsung Research | Large mask inpainting via fast Fourier convolutions |
| Palette | 2021 | Google Research | Unified diffusion image translation |
| SDEdit | 2021 | Stanford, CMU | Diffusion-based stroke editing |
| MAT | 2022 | CUHK | Mask-aware transformer inpainting |
| Stable Diffusion Inpaint | 2022 | Stability AI | Latent diffusion inpainting |
| InstructPix2Pix | 2022 | UC Berkeley | Instruction-following diffusion editor |
| ControlNet | 2023 | Stanford | Structural conditioning for diffusion |
| IP-Adapter | 2023 | Tencent AI Lab | Image-prompt adapter |
| Emu Edit | 2023 | Meta | Multi-task instruction editing |
| SeedEdit | 2024 | ByteDance | Aligned diffusion image editing |
| FLUX.1 Fill | 2024 | Black Forest Labs | DiT-based inpainting and outpainting |
Common training and evaluation datasets include:
| Dataset | Year | Typical use |
|---|---|---|
| ImageNet | 2009 | Pretraining and classification losses |
| Cityscapes | 2016 | Semantic-to-photo translation |
| COCO | 2014 | Captioned editing, segmentation |
| FFHQ | 2019 | Face editing and StyleGAN training |
| CelebA-HQ | 2017 | Attribute editing and inpainting |
| Places | 2014 | Inpainting and scene synthesis |
| DIV2K | 2017 | Super-resolution benchmarks |
| LAION-5B | 2022 | Diffusion model pretraining |
| MagicBrush | 2023 | Instruction-based editing |
No single metric captures image quality. Researchers report a mix of pixel accuracy, perceptual similarity, and distributional scores.
| Metric | Year | Notes |
|---|---|---|
| PSNR | classical | Pixel-wise error, used in super-resolution |
| SSIM | 2004 | Structural similarity, closer to human judgment than PSNR |
| LPIPS | 2018 | Learned perceptual similarity (Zhang et al.) |
| FID | 2017 | Distance between feature distributions (Heusel et al.) |
| KID | 2018 | Kernel inception distance, unbiased for small datasets |
| CLIP image similarity | 2021 | Cross-modal embedding distance for edit fidelity |
| DINO similarity | 2021 | Self-supervised feature distance for subject preservation |
| MagicBrush score | 2023 | Manual edit benchmark for instruction following |
| EditVal | 2023 | Compositional edit benchmark |
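Of the metrics above, PSNR is the simplest to compute directly. The sketch below assumes images are given as flat sequences of pixel values on an 8-bit scale; the function name and the `max_value` default are illustrative choices rather than a reference implementation.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels between two images."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([10, 20, 30], [10, 20, 30])
off_by_one = psnr([10, 20, 30], [11, 21, 31])
worst_case = psnr([0, 0], [255, 255])
```

Because the score depends only on per-pixel error, two outputs with the same PSNR can look very different to a viewer, which is why it is usually reported alongside perceptual and distributional metrics.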
Image-to-image models are deployed across creative, scientific, and consumer software. Adobe Photoshop's Generative Fill, released in beta on 23 May 2023 and shipped in September 2023, uses Adobe Firefly to fill user-drawn masks. Lightroom and Topaz Labs ship AI-based denoise, sharpen, and upscale tools. Magnific AI built a business around diffusion super-resolution. DeOldify restores historical photographs and films. AR camera filters in Snapchat, TikTok, and Instagram are partly powered by image-to-image GANs.
E-commerce uses virtual try-on systems to fit clothing to user photos. Real estate platforms inpaint to remove clutter or stage furniture. Medical imaging uses cross-modal translation to predict CT from MRI, denoise low-dose CT, and accelerate MRI reconstruction. Satellite teams use super-resolution to enhance low-cost sensors and translate between SAR and optical bands. Video game studios generate texture maps and concept art with diffusion editors. Animation pipelines integrate ControlNet and IP-Adapter for character consistency. Face anonymization, deepfake detection, and forensic restoration also build on these components.
Image-to-image models still have well-documented failure modes. Pixel-perfect editing is hard: small prompt changes can shift global lighting or color balance. Identity preservation during face edits is unreliable across long generation chains. Fine text in scenes is often garbled, although DiT-based models since 2024 have improved. Hands, fingers, and small accessories remain frequent error spots. Inpainting models can hallucinate plausible but unrelated content into masked regions. Diffusion models are slow at inference relative to GANs; FLUX.1 Fill needs several seconds per image while a CycleGAN forward pass takes milliseconds. Training-data biases produce uneven performance across skin tones and cultural settings, and many models inherit copyright disputes from web-scraped corpora like LAION-5B.