Unconditional Image Generation Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 5,060 words
Unconditional image generation models are generative neural networks that learn the marginal distribution p(x) of a set of training images and produce new samples from that learned distribution, with no extra input such as a text caption, class label, or partial image. The model is given only random noise (or a latent code) and must return a plausible image. This contrasts with conditional image generation, where the network is asked to produce an image consistent with some side information, for example a class label as in BigGAN or a text prompt as in Stable Diffusion.
Unconditional generation has been the standard benchmark setting for foundational generative model research because it isolates how well a model captures the data distribution itself. The dominant families are variational autoencoders, generative adversarial networks, normalizing flows, autoregressive image models, and diffusion models. Between roughly 2018 and 2021, GAN families such as StyleGAN held the state of the art on most benchmarks; from 2021 onward diffusion models and their successors (flow matching, consistency models) have generally produced lower Frechet Inception Distance on ImageNet, faces, and most object datasets.
See also: Computer Vision Models and Tasks, text-to-image models
Given a training set sampled from an unknown distribution p_data(x), an unconditional image generation model approximates p_data with a parametric distribution p_theta(x) and provides a sampler that returns x ~ p_theta. Properties that distinguish the families include whether the sampler is one-shot or iterative, whether the model gives an exact log-likelihood, and whether the latent space is structured for editing. Unconditional models are evaluated almost entirely with sample-based metrics, because likelihood numbers are not always comparable across families. Standard benchmarks are CIFAR-10, CelebA-HQ and FFHQ faces, LSUN bedrooms and churches, and class-unconditional ImageNet at 64, 128, 256, and 512 resolutions.
The five canonical model families differ substantially in their generative mechanism. VAEs use a continuous latent bottleneck and optimize a variational lower bound, making them fast but prone to blurry outputs. GANs frame generation as an adversarial game and produce sharp one-shot samples, but are prone to training instability and mode collapse. Normalizing flows build a fully invertible pipeline, granting exact likelihood at the cost of an architecture constrained to be bijective. Autoregressive models decompose the image into a sequence of pixel or token predictions, achieving tractable likelihoods but generating images one element at a time. Diffusion models iteratively refine a noisy signal and currently dominate most quantitative benchmarks, at the cost of requiring many forward passes per sample.
The variational autoencoder was introduced by Diederik Kingma and Max Welling in late 2013 (arXiv 1312.6114). A VAE trains an encoder q(z|x) and a decoder p(x|z) jointly by maximising the evidence lower bound (ELBO), and samples by drawing z from a standard Gaussian prior and running it through the decoder. The ELBO balances a reconstruction term against a KL divergence penalty that keeps the approximate posterior close to the prior. Vanilla VAEs produce blurry samples because the per-pixel likelihood term, typically Gaussian or Bernoulli, encourages averaging over possible outputs rather than committing to sharp details.
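As a minimal sketch of the two ELBO terms described above, the snippet below computes the closed-form KL divergence between a diagonal Gaussian posterior and the standard Gaussian prior, plus a squared-error reconstruction term. It assumes a unit-variance Gaussian likelihood with constants dropped; the function names and the `beta` weight are illustrative choices, not taken from any particular implementation.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def negative_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction error plus beta times the KL penalty.

    Assumption: unit-variance Gaussian likelihood, so the reconstruction
    term reduces to half the squared error (additive constants dropped).
    beta=1.0 corresponds to the standard ELBO.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * gaussian_kl_to_standard_normal(mu, log_var)

# A posterior equal to the prior incurs no KL penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the posterior mean by one unit costs 0.5 nats per dimension.
kl_shifted = gaussian_kl_to_standard_normal([1.0], [0.0])
```

The squared-error term makes the source of blur visible: when several sharp outputs are plausible for one latent code, their pixel-wise average minimises this loss.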
Several variants addressed these limitations. Beta-VAE (Higgins et al. 2017) increased the KL weight to encourage a more disentangled latent space, improving controllability at some cost to reconstruction quality. VQ-VAE (van den Oord, Vinyals and Kavukcuoglu 2017, arXiv 1711.00937) replaced the continuous Gaussian latent with a discrete codebook of learned vectors. The encoder output is mapped to the nearest codebook entry, and gradients flow through a straight-through estimator; a separate autoregressive prior over the discrete codes is then trained to give a proper generative model. VQ-VAE-2 (Razavi et al. 2019, arXiv 1906.00446) extended this to a hierarchy of top and bottom latents and produced 256 by 256 ImageNet samples competitive with the best GANs of the time. NVAE (Vahdat and Kautz 2020) showed that a carefully designed hierarchical VAE with residual cells and batch normalisation tailored for deep networks could close the gap on natural images without the discrete codebook.
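The nearest-codebook lookup at the heart of VQ-VAE can be sketched in a few lines. The codebook values and the function name below are made up for illustration; a real model learns the codebook and applies the lookup at every spatial position of the encoder output.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z
    in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]

# Hypothetical 3-entry codebook of 2-dimensional vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, z_q = quantize([0.9, 1.2], codebook)

# In an autodiff framework the straight-through estimator is usually written
# as z + stop_gradient(z_q - z): the forward pass uses z_q, while the
# backward pass copies gradients from z_q to z unchanged.
```

The integer `index` is what the second-stage autoregressive prior models; the vector `z_q` is what the decoder consumes.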
The GAN framework was introduced by Ian Goodfellow et al. in 2014 (arXiv 1406.2661, NeurIPS 2014) as a two-player minimax game between a generator and a discriminator. The generator maps random noise to synthetic images; the discriminator attempts to distinguish real images from generated ones; and each network is updated to beat the other. The first practical convolutional variant was DCGAN (Radford, Metz and Chintala 2015, arXiv 1511.06434), which set the basic template of strided convolutions, batch normalisation, and a tanh output. DCGAN demonstrated that learned convolutional features were semantically meaningful and that interpolation and vector arithmetic in the latent space produced coherent images.
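The minimax game can be made concrete by writing out the losses as functions of the discriminator's outputs. This is a sketch under the assumption that the discriminator returns probabilities strictly between 0 and 1; the function names are illustrative. The non-saturating generator loss is the alternative suggested in the original GAN paper for stronger early gradients.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of the discriminator objective:
    -( mean log D(x) + mean log(1 - D(G(z))) )."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss_minimax(d_fake):
    """Original minimax generator loss: mean log(1 - D(G(z))), to be minimised."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def generator_loss_non_saturating(d_fake):
    """Non-saturating variant: -mean log D(G(z))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# When the discriminator cannot tell real from fake it outputs 0.5 everywhere,
# and its loss equals 2 * log(2).
loss_at_equilibrium = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```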
Wasserstein GAN (Arjovsky et al. 2017, arXiv 1701.07875) replaced the Jensen-Shannon divergence with the Wasserstein-1 distance, which provides a smoother training signal and a loss that correlates better with sample quality. WGAN-GP (Gulrajani et al. 2017) substituted weight clipping with a gradient penalty on the discriminator, further stabilising training.
Progressive growing of GANs (Karras et al. 2017, arXiv 1710.10196) trained at increasing resolutions, starting from 4 by 4 and progressively doubling until 1024 by 1024, producing the first photorealistic high-resolution face images on CelebA-HQ. StyleGAN (Karras, Laine and Aila 2018, arXiv 1812.04948) replaced traditional latent injection with a style-based generator using adaptive instance normalisation (AdaIN), mapping the input noise through a fully connected mapping network to an intermediate W latent space before injecting styles at each resolution. StyleGAN achieved FID 4.40 on FFHQ. StyleGAN2 (Karras et al. 2019, arXiv 1912.04958) fixed droplet artifacts and artifacts caused by progressive growing by redesigning the normalisation layers and the training scheme, reaching FID 2.84 on FFHQ at 1024 by 1024. StyleGAN3 (Karras et al. 2021, arXiv 2106.12423) removed aliasing so internal features become equivariant to translation and rotation, improving temporal consistency for video generation and animation.
BigGAN (Brock, Donahue and Simonyan 2018, arXiv 1809.11096) showed that scaling batch sizes to 2048 with orthogonal regularisation and a truncation trick gave a large jump on class-conditional ImageNet, reaching IS 166.3 and FID 9.6 at 128 by 128 resolution. Although BigGAN used class conditioning, its architectural contributions influenced unconditional training as well.
Normalizing flow models build an invertible mapping between a simple base density and the data, giving an exact tractable log-likelihood. Every step of the model is a bijective function with a tractable Jacobian determinant, so the exact log-probability of any sample can be computed in a single forward pass through the inverse network.
NICE (Dinh, Krueger and Bengio 2014) introduced additive coupling layers as the core building block for tractable bijections. RealNVP (Dinh, Sohl-Dickstein and Bengio 2016) extended these to affine couplings and added a multi-scale architecture that factored out half the channels at each spatial resolution. Glow (Kingma and Dhariwal 2018, arXiv 1807.03039) from OpenAI added invertible 1 by 1 convolutions as learnable permutations between coupling layers, and actnorm layers to replace batch normalisation. Glow produced sharp 256 by 256 face images with semantically smooth interpolations and demonstrated realistic attribute manipulations by moving in the latent space. The main limitation of flow models for images is that the architecture must preserve the full pixel dimensionality throughout, leading to much larger memory costs than models that use a compressed latent representation.
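An affine coupling layer of the kind used in RealNVP and Glow can be sketched as follows. One half of the input passes through unchanged and parameterises an elementwise scale and shift of the other half, so the Jacobian is triangular and its log-determinant is the sum of the log-scales. The two `toy_*` functions are hypothetical stand-ins for the neural networks a real model would use, and the input length is assumed to be even.

```python
import math

def toy_log_scale(x_a):
    """Hypothetical stand-in for the log-scale network."""
    return [0.5 * v for v in x_a]

def toy_shift(x_a):
    """Hypothetical stand-in for the shift network."""
    return [v + 1.0 for v in x_a]

def coupling_forward(x, log_scale_fn, shift_fn):
    """Data-to-latent direction. Returns (z, log|det Jacobian|)."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = log_scale_fn(x_a), shift_fn(x_a)
    z_b = [xb * math.exp(ls) + ti for xb, ls, ti in zip(x_b, log_s, t)]
    return x_a + z_b, sum(log_s)

def coupling_inverse(z, log_scale_fn, shift_fn):
    """Latent-to-data direction: exact inverse of coupling_forward."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = log_scale_fn(z_a), shift_fn(z_a)
    x_b = [(zb - ti) * math.exp(-ls) for zb, ls, ti in zip(z_b, log_s, t)]
    return z_a + x_b

def log_prob(x, log_scale_fn, shift_fn):
    """Exact log-density via change of variables with a standard normal base."""
    z, log_det = coupling_forward(x, log_scale_fn, shift_fn)
    base = sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in z)
    return base + log_det

x = [1.0, -2.0, 0.3, 0.7]
z, log_det = coupling_forward(x, toy_log_scale, toy_shift)
x_back = coupling_inverse(z, toy_log_scale, toy_shift)
```

Because the scale and shift networks are only ever evaluated, never inverted, they can be arbitrarily complex; invertibility comes entirely from the coupling structure.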
Autoregressive models factor the joint distribution of an image as a product of conditional distributions over individual pixels or tokens, p(x) = product over i of p(x_i | x_1, ..., x_{i-1}). This gives a tractable exact likelihood with no architectural constraints beyond autoregressive ordering.
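The chain-rule factorisation can be illustrated with a toy model over four binary pixels. The conditional below is a hypothetical logistic function of the preceding pixels, standing in for a trained network; the point of the sketch is that the log-likelihood is a sum of conditional log-probabilities and that the resulting distribution is normalised by construction.

```python
import itertools
import math

def conditional_prob_one(prefix, bias=-0.5, weight=2.0):
    """Toy p(x_i = 1 | x_1, ..., x_{i-1}); hypothetical stand-in for a network."""
    context = sum(prefix) / len(prefix) if prefix else 0.0
    return 1.0 / (1.0 + math.exp(-(bias + weight * context)))

def log_likelihood(pixels):
    """log p(x) = sum over i of log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, x_i in enumerate(pixels):
        p_one = conditional_prob_one(pixels[:i])
        total += math.log(p_one if x_i == 1 else 1.0 - p_one)
    return total

# Summing p(x) over all 2^4 binary images gives 1: no partition function needed.
total_mass = sum(
    math.exp(log_likelihood(list(bits)))
    for bits in itertools.product([0, 1], repeat=4)
)
```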
PixelRNN and PixelCNN (van den Oord, Kalchbrenner and Kavukcuoglu 2016, arXiv 1601.06759, ICML 2016 best paper) introduced the autoregressive approach to natural image modelling. PixelRNN used two-dimensional LSTM units, processing images row by row and generating each pixel conditioned on all preceding pixels; PixelCNN replaced the recurrent layers with masked convolutions, which are much faster to train because convolutions parallelize across spatial positions, though generation remains sequential at test time. PixelCNN++ (Salimans et al. 2017) improved PixelCNN with logistic mixture likelihoods, skip connections, and downsampling.
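The masked convolution can be sketched by building the 0/1 mask that is multiplied into each kernel. Following the convention of the PixelCNN paper, a type "A" mask (first layer) hides the centre pixel, while a type "B" mask (later layers) keeps it. This sketch ignores the additional ordering over colour channels, and the function name is illustrative.

```python
def causal_mask(kernel_size, mask_type):
    """Build a kernel_size x kernel_size mask of 0s and 1s.

    Positions above the centre row, and to the left of the centre within
    the centre row, are visible. The centre itself is visible only for
    mask type "B".
    """
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    centre = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            above = row < centre
            left_in_row = row == centre and col < centre
            is_centre = row == centre and col == centre
            if above or left_in_row or (is_centre and mask_type == "B"):
                mask[row][col] = 1
    return mask

mask_a = causal_mask(3, "A")
mask_b = causal_mask(3, "B")
```

With these masks, every output position depends only on pixels earlier in raster order, so all conditionals can be evaluated in one parallel pass during training.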
The Image Transformer (Parmar et al. 2018) replaced the recurrent core with a self-attention block restricted to local windows, enabling larger receptive fields without the sequential bottleneck of recurrence. VQGAN (Esser, Rombach and Ommer 2020, arXiv 2012.09841) trained a discrete autoencoder with a perceptual and adversarial loss, then trained a transformer to model the code sequence; this two-stage recipe, encoding images as short sequences of discrete tokens and modelling those tokens autoregressively, became the template for later token-based image models. MaskGIT (Chang et al. 2022) introduced parallel masked decoding so tokens can be sampled in a small number of refinement steps rather than one at a time, greatly reducing generation latency.
An important precursor to diffusion models was the noise-conditional score network (NCSN) introduced by Song and Ermon at NeurIPS 2019. NCSN trained a shared neural network to estimate the score function (gradient of the log probability density) at multiple noise levels simultaneously, then sampled by running annealed Langevin dynamics from high noise to low noise. This established the score-matching perspective on iterative generation and directly influenced the unified SDE framework.
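Annealed Langevin dynamics can be sketched in one dimension, where the score is available in closed form. The example assumes the data distribution is a Gaussian with mean 3.0 and standard deviation 0.5, so the exact score of the noise-perturbed density replaces the trained network. The noise levels, step scale, and step counts are illustrative values, not settings from the paper.

```python
import math
import random

def perturbed_score(x, sigma, data_mean=3.0, data_std=0.5):
    """Exact score of N(data_mean, data_std^2) convolved with N(0, sigma^2).
    Stands in for the trained network s_theta(x, sigma)."""
    return -(x - data_mean) / (data_std ** 2 + sigma ** 2)

def annealed_langevin(sigmas, rng, steps_per_level=100, step_scale=2e-3):
    """Run Langevin updates at each noise level, from largest to smallest."""
    x = rng.gauss(0.0, sigmas[0])  # broad initialisation
    for sigma in sigmas:
        # Step size shrinks with the noise level, as in annealed Langevin dynamics.
        alpha = step_scale * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * perturbed_score(x, sigma) + math.sqrt(alpha) * noise
    return x

rng = random.Random(0)
sigmas = [10.0, 3.0, 1.0, 0.3, 0.1]
samples = [annealed_langevin(sigmas, rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
```

The high noise levels move samples quickly toward the region of the data, and the low noise levels refine them; the sample mean should land close to the assumed data mean of 3.0.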
The original diffusion idea was published by Sohl-Dickstein et al. in 2015 (arXiv 1503.03585), framing generation as the reverse of a fixed Gaussian noising process. The modern formulation, DDPM (Ho, Jain and Abbeel 2020, arXiv 2006.11239), trained a U-Net to predict the noise added at each step and reached FID 3.17 on unconditional CIFAR-10, matching or beating most GANs of the time on that benchmark. Improved DDPM (Nichol and Dhariwal 2021, arXiv 2102.09672) learned the reverse-process variance and introduced a cosine noise schedule, further improving both log-likelihood and sample quality.
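The forward noising process and the noise-prediction objective can be sketched as below. The closed form x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps lets any timestep be sampled directly, and the cosine schedule follows the form given in the Improved DDPM paper with offset s = 0.008. Function names are illustrative, and the loss is shown for a single example rather than a batch.

```python
import math
import random

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction alpha_bar_t under the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def noise_image(x0, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot. Returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
x0 = [1.0, -1.0]
x_clean, _ = noise_image(x0, cosine_alpha_bar(0, 1000), rng)    # no noise at t = 0
x_mid, eps_mid = noise_image(x0, cosine_alpha_bar(500, 1000), rng)
```

In training, a timestep is drawn at random, the image is noised with `noise_image`, and the network is asked to recover `eps` from `x_t` and `t`.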
DDIM (Song, Meng and Ermon 2020, arXiv 2010.02502) introduced a deterministic sampler giving high-quality samples in roughly 50 steps instead of 1,000 by replacing the stochastic reverse process with a non-Markovian, deterministic update that can be read as the discretisation of an ODE. Score-SDE (Song et al. 2020, arXiv 2011.13456) unified DDPM and noise-conditional score networks under a stochastic differential equation view, showing that both are specific discretisations of continuous-time diffusion processes and enabling a range of numerical ODE/SDE solvers.
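A single deterministic DDIM update can be sketched as follows. Given the current sample and the predicted noise, the step first estimates the clean image and then re-noises that estimate to the earlier noise level using the same predicted noise. The function name is illustrative, and the noise levels are passed as cumulative alpha_bar values.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (no added noise) from level t to an earlier level."""
    x0_pred = [
        (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        for x, e in zip(x_t, eps_pred)
    ]
    return [
        math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]

# Sanity check with a perfect noise prediction: stepping to alpha_bar = 1
# (no noise) recovers the clean image exactly.
x0 = [0.5, -1.0]
eps = [0.3, 0.2]
alpha_bar_t = 0.25
x_t = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, alpha_bar_t, 1.0)
```

Because the update depends only on the two noise levels, the sampler can skip timesteps freely, which is what allows a 1,000-step training schedule to be sampled in tens of steps.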
ADM, also called Guided Diffusion (Dhariwal and Nichol 2021, arXiv 2105.05233), produced FID 2.97 on ImageNet 128, 4.59 at 256, and 7.72 at 512, the first time diffusion beat BigGAN on ImageNet. Latent Diffusion Models (Rombach et al. 2021, arXiv 2112.10752) ran diffusion in the latent space of a pretrained VQGAN autoencoder, cutting compute by roughly an order of magnitude; the same architecture underpins Stable Diffusion. EDM (Karras et al. 2022, arXiv 2206.00364) reorganised the diffusion design space, providing principled choices for the noise schedule, preconditioning, and sampler, and reached FID 1.79 on class-conditional CIFAR-10 with 35 network evaluations per sample. Consistency models (Song, Dhariwal, Chen and Sutskever 2023, arXiv 2303.01469) learned a direct map from any noise level to the clean image, so one or two function evaluations suffice. Flow Matching (Lipman et al. 2022, arXiv 2210.02747) trained continuous normalising flows by regressing on a fixed conditional vector field, and Rectified Flow (Liu, Gong and Liu 2022) used straightened transport paths; both underpin Stable Diffusion 3.
| Model | Year | Family | Group | Notable result |
|---|---|---|---|---|
| DCGAN | 2015 | GAN | indico Research / Facebook AI Research | First stable conv GAN; introduced strided-conv template |
| Progressive GAN | 2017 | GAN | NVIDIA | First 1024 by 1024 face images; CIFAR-10 IS 8.80 |
| BigGAN | 2018 | GAN | DeepMind | ImageNet 128 IS 166.3, FID 9.6 (class-conditional) |
| Glow | 2018 | Flow | OpenAI | Exact likelihood; interpolatable face latent space |
| StyleGAN | 2018 | GAN | NVIDIA | Style-based generator; FFHQ FID 4.40 |
| VQ-VAE-2 | 2019 | VAE plus prior | DeepMind | Hierarchical discrete codes; 256 by 256 ImageNet |
| StyleGAN2 | 2019 | GAN | NVIDIA | Fixed droplet artifacts; FFHQ FID 2.84 |
| DDPM | 2020 | Diffusion | Berkeley | CIFAR-10 FID 3.17; beat GANs on that benchmark |
| VQGAN | 2020 | Autoregressive | Heidelberg CompVis | Discrete tokens plus transformer; perceptual codec |
| Improved DDPM | 2021 | Diffusion | OpenAI | Cosine schedule, learned variance |
| ADM (Guided Diffusion) | 2021 | Diffusion | OpenAI | First diffusion to beat BigGAN on ImageNet |
| StyleGAN3 | 2021 | GAN | NVIDIA | Alias-free; equivariant internal features |
| Latent Diffusion | 2021 | Diffusion | Heidelberg CompVis | Diffusion in compressed VQGAN latent space |
| MaskGIT | 2022 | Masked AR | Google Research | Parallel masked decoding; 8 to 16 steps |
| EDM | 2022 | Diffusion | NVIDIA | Class-conditional CIFAR-10 FID 1.79 with 35 network evaluations; unified design space |
| DiT | 2022 | Diffusion | Meta AI | Transformer backbone; ImageNet 256 FID 2.27 |
| Consistency Models | 2023 | Distilled diffusion | OpenAI | One-step CIFAR-10 FID 3.55; two-step 2.93 |
| Dataset | Resolution | Size | Notes |
|---|---|---|---|
| MNIST | 28 by 28 grayscale | 60,000 | Handwritten digits; used for early proof-of-concept |
| CIFAR-10 | 32 by 32 | 50,000 | Ten object classes; dominant low-resolution benchmark |
| CelebA-HQ | up to 1024 | 30,000 | Celebrity faces; introduced with Progressive GAN |
| FFHQ | 1024 by 1024 | 70,000 | Flickr-Faces-HQ; released with StyleGAN; diverse ages and ethnicities |
| LSUN | 256 typical | millions | Bedrooms, churches, cats; tests object-level structure |
| ImageNet | 64 / 128 / 256 / 512 | 1.28 million | 1000 classes; used class-unconditionally for fairness |
| AFHQ | 512 by 512 | about 15,000 | Animal faces; three classes (cats, dogs, wildlife) |
| CUB-200-2011 | 256 typical | about 12,000 | Fine-grained bird images; used for conditional evaluation |
CIFAR-10 at 32 by 32 resolution is the most-cited unconditional benchmark because it is cheap to train on and has a long track record of reported numbers. FFHQ at 1024 by 1024 is the standard high-resolution face benchmark. Unconditional ImageNet generation at 256 by 256 is increasingly used to test whether a model can handle a highly multimodal distribution covering 1,000 object categories without any label conditioning.
A reliable evaluation of unconditional generators requires metrics that capture both the quality of individual samples and the coverage of the true data distribution. No single metric captures both simultaneously.
The Inception Score (IS, Salimans et al. 2016) feeds generated images through an ImageNet-pretrained Inception network and rewards samples for which the predicted class distribution is sharp (high quality) and the marginal over samples is broad (diversity). IS correlates with visual quality but does not compare generated samples to real data, and it cannot detect mode dropping within a class: a generator that emits a single sharp image per class still scores highly.
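The Inception Score is the exponential of the average KL divergence between each per-image class distribution and the marginal class distribution. The sketch below assumes the class probabilities have already been produced by a classifier; the function name is illustrative, and the usual practice of averaging over several splits is omitted.

```python
import math

def inception_score(class_probs):
    """exp( mean over images of KL( p(y|x) || p(y) ) ).

    class_probs: list of per-image probability vectors over classes.
    """
    n = len(class_probs)
    num_classes = len(class_probs[0])
    marginal = [sum(p[c] for p in class_probs) / n for c in range(num_classes)]
    mean_kl = sum(
        sum(pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0.0)
        for p in class_probs
    ) / n
    return math.exp(mean_kl)

# Confident predictions spread evenly over 3 classes: the maximum score, 3.
sharp_and_diverse = inception_score([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
# Every image gets the same prediction: the minimum score, 1.
collapsed = inception_score([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```

The score never looks at real images, and the first example would score identically if each class were represented by one repeated image.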
The Frechet Inception Distance (FID, Heusel et al. 2017, arXiv 1706.08500) computes the 2-Wasserstein distance between Gaussian fits to Inception v3 activations of the real and generated sets at the 2048-dimensional pool layer. Lower FID means the generated distribution matches the real one more closely in feature space. FID is the de facto standard metric for unconditional generation, but it has known limitations: it is statistically biased (the expected value depends on sample size and model), it is sensitive to the choice of Inception backbone, it encodes ImageNet biases (prioritising texture and edge statistics), and it can mis-rank models on non-photographic domains such as medical images or artwork.
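The Fréchet distance between two Gaussians is ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). As a simplified sketch, the code below assumes diagonal covariances, in which case the trace term reduces to a per-dimension sum and no matrix square root is needed. Real FID uses full 2048 by 2048 covariances of Inception features, so this illustrates the formula rather than reproducing published numbers. The feature vectors are made up for the example.

```python
import math

def mean_and_variance(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1) for j in range(d)]
    return mean, var

def frechet_distance_diagonal(real_features, fake_features):
    """Frechet distance between Gaussian fits, ASSUMING diagonal covariances."""
    mu_r, var_r = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(fake_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term

real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
shifted = [[a + 1.0, b + 1.0] for a, b in real]      # same spread, mean moved by (1, 1)
distance_same = frechet_distance_diagonal(real, real)
distance_shifted = frechet_distance_diagonal(real, shifted)
```

The two terms separate a shift in the average feature (mean term) from a change in spread (covariance term), which is why FID responds to both quality and diversity but cannot say which one changed.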
Kernel Inception Distance (KID, Binkowski et al. 2018) replaces the Gaussian assumption with a polynomial-kernel MMD estimator, which is unbiased and can be computed on small sample sizes.
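The unbiased MMD estimator behind KID can be sketched as below, using the cubic polynomial kernel k(x, y) = (x·y / d + 1)^3 from the KID paper, where d is the feature dimension. The feature vectors are toy values, and the common practice of averaging the estimate over several subsets is omitted.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel (x . y / d + 1)^3."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3

def kid_unbiased(real, fake):
    """Unbiased estimate of squared MMD between two feature sets.

    Within-set sums exclude the i == j terms, which is what removes the bias.
    """
    m, n = len(real), len(fake)
    k_rr = sum(
        poly_kernel(real[i], real[j]) for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    k_ff = sum(
        poly_kernel(fake[i], fake[j]) for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf

real = [[0.0], [1.0]]
near = [[0.0], [1.0]]
far = [[5.0], [6.0]]
kid_near = kid_unbiased(real, near)
kid_far = kid_unbiased(real, far)
```

Unlike FID, the estimate can be negative on small samples, as `kid_near` is here; that is a consequence of unbiasedness rather than an error.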
Precision and recall for generative models (Kynkaanniemi et al. 2019, arXiv 1904.06991) form explicit non-parametric manifold representations of the real and generated distributions: precision measures what fraction of generated samples lie in the real manifold (quality), and recall measures what fraction of the real manifold is covered by generated samples (diversity). A later refinement, density and coverage (Naeem et al. 2020), addresses edge cases where the Kynkaanniemi metrics can fail, such as identical distributions or distributions with outliers.
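The manifold construction can be sketched with k-nearest-neighbour balls: each point in a set gets a radius equal to the distance to its k-th nearest neighbour within that set, and a query counts as inside the manifold if it falls in any ball. The sketch below uses brute-force distances on toy one-dimensional features and k = 1; the function names are illustrative.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(_dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def fraction_inside(queries, support, radii):
    """Fraction of queries lying inside at least one ball around a support point."""
    hits = sum(
        1 for q in queries
        if any(_dist(q, s) <= r for s, r in zip(support, radii))
    )
    return hits / len(queries)

def precision_recall(real, fake, k=3):
    precision = fraction_inside(fake, real, knn_radii(real, k))
    recall = fraction_inside(real, fake, knn_radii(fake, k))
    return precision, recall

# Real data covers 0..9; the generator only produces 0..3 (dropped modes).
real = [[float(v)] for v in range(10)]
fake = [[0.0], [1.0], [2.0], [3.0]]
precision, recall = precision_recall(real, fake, k=1)
```

Every generated point is realistic, so precision is perfect, while recall exposes the missing half of the distribution.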
sFID uses spatial features from an intermediate Inception layer rather than the global pool layer, making it more sensitive to fine-grained spatial structure. CLIP-FID substitutes the Inception backbone with a CLIP image encoder, which may correlate better with human judgements on some datasets.
Several techniques recur across families. Spectral normalisation (Miyato et al. 2018) constrains the discriminator's Lipschitz constant and stabilises GAN training. The two-time-scale update rule (TTUR) uses different learning rates for the generator and discriminator. Self-attention blocks at intermediate resolutions, introduced in SAGAN (Zhang et al. 2018), help with long-range structure. Large-batch training, exponential moving averages of generator weights, R1 regularisation on the discriminator, and adaptive discriminator augmentation (ADA, Karras et al. 2020) are routinely used to train GANs at high resolution with limited data. Classifier-free guidance (Ho and Salimans 2022) is the dominant tool for trading diversity against fidelity in diffusion sampling.
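Of the techniques listed, spectral normalisation is the easiest to show in isolation. The sketch below estimates the largest singular value of a weight matrix by power iteration and divides the matrix by it, so the resulting linear map has spectral norm close to 1. It assumes a non-zero matrix and uses many iterations for clarity; practical implementations typically run one iteration per training step and keep the vector between steps.

```python
import math

def _normalise(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def spectral_norm(weight, n_iters=50):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(weight), len(weight[0])
    v = _normalise([1.0] * cols)
    for _ in range(n_iters):
        u = _normalise([sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)])
        v = _normalise([sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)])
    w_v = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(u_i * wv_i for u_i, wv_i in zip(u, w_v))

def spectrally_normalise(weight, n_iters=50):
    """Rescale the matrix so its largest singular value is approximately 1."""
    sigma = spectral_norm(weight, n_iters)
    return [[w / sigma for w in row] for row in weight]

weight = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(weight)
normalised = spectrally_normalise(weight)
```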
For diffusion models, architecture choices have a large effect on quality. Early DDPM used a simple U-Net with residual blocks. ADM added attention at multiple resolutions, used more attention heads, and injected timestep and class information through adaptive group normalisation, in networks with 256 or more channels. DiT (Peebles and Xie 2022) replaced the U-Net entirely with a Vision-Transformer-style architecture that operates on patchified latent tokens, conditioning on the diffusion timestep and optional class labels via adaptive layer normalisation. DiT showed clean FID scaling with compute: larger models with more Gflops consistently achieve lower FID, with DiT-XL/2 reaching FID 2.27 on class-conditional ImageNet 256.
Diffusion model quality depends heavily on the sampler. DDIM gives a deterministic ODE integrator that recovers DDPM quality in tens of steps. Heun's method, DPM-Solver and DPM-Solver++ (Lu et al. 2022) use higher-order ODE methods to cut the step count further, while Euler ancestral sampling (Euler-A) is a first-order stochastic alternative. UniPC (Zhao et al. 2023) adds predictor-corrector steps. Progressive distillation (Salimans and Ho 2022) repeatedly halves the number of required steps, and consistency distillation trains a few-step student directly, leading to two-, four-, or eight-step samplers that approach multistep teacher quality.
The sampling speed gap between diffusion models and one-shot GANs was a significant practical concern from 2020 to 2023. Consistency models, rectified flow, and flow matching have largely closed this gap: single-step consistency models achieve FID 3.55 on CIFAR-10 (two steps: FID 2.93), compared with DDPM's original 1,000-step FID 3.17. For high-resolution generation the best distilled samplers come close to the quality of multi-step models in four to eight steps.
Unconditional generation and conditional generation are closely related: most architectures can be adapted to either regime by adding or removing a conditioning signal. This relationship has shaped research in several important ways.
The classifier-free guidance technique from Ho and Salimans (2022) works by training a single diffusion model jointly as unconditional and class-conditional, typically by randomly dropping the condition during training, then at sampling time extrapolating from the unconditional score estimate in the direction of the conditional one. An unconditional model is therefore trained as a byproduct of conditional training in diffusion pipelines that use this technique. Guidance strength is the primary knob for trading diversity (weak guidance, closer to the unconditional estimate) against fidelity to the condition (strong guidance).
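The guidance rule can be sketched in a few lines. The convention below, common in open-source implementations, uses a scale of 0 for the purely unconditional estimate and 1 for the purely conditional one; the original paper parameterises the same rule with a weight offset by one. The label-dropping helper and its default probability of 0.1 are illustrative assumptions.

```python
import random

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine noise estimates: eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 -> unconditional, scale = 1 -> conditional,
    scale > 1 -> extrapolate past the conditional estimate.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def maybe_drop_label(label, rng, p_uncond=0.1):
    """During training, replace the label with None (the 'null' condition)
    with probability p_uncond so one network learns both modes."""
    return None if rng.random() < p_uncond else label

eps_uncond = [1.0, 2.0]
eps_cond = [3.0, 1.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```

At sampling time the network is evaluated twice per step, once with the condition and once with the null condition, which is the main cost of the technique.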
Many of the highest-impact generative models were originally developed in the unconditional setting and later extended to conditional generation. DDPM was an unconditional model on CIFAR-10 and CelebA; ADM added classifier guidance; and Stable Diffusion added text conditioning through cross-attention. StyleGAN was unconditional on FFHQ faces; its W latent space was later exploited for text-guided editing through GAN inversion combined with a CLIP-based direction finding step (StyleCLIP, Patashnik et al. 2021).
The shift toward large-scale text-to-image models from 2022 onward (DALL-E 2, Stable Diffusion, Imagen, Midjourney) has moved practitioner attention away from purely unconditional benchmarks. However, unconditional generation on CIFAR-10 and ImageNet 256 remains the standard way to compare architectural innovations in isolation, precisely because it removes the confound of conditioning quality. For a survey of models that add text or label conditioning, see text-to-image models.
By 2024 and 2025 the landscape has shifted in three ways. First, diffusion models and flow-matching variants dominate FID on ImageNet, CelebA-HQ, FFHQ, and similar datasets; ADM started this trend and DiT (Peebles and Xie 2022, arXiv 2212.09748), with a transformer backbone, pushed class-conditional ImageNet 256 to FID 2.27. Second, distillation methods such as consistency models and rectified flow have closed most of the latency gap that originally favoured GANs. Third, GAN-family models, especially StyleGAN3 and its descendants, remain the dominant tools for face editing through inversion in the W or W+ latent space.
A fourth trend is the return to structured unconditional generation through self-supervised representations. RCG (Li et al. 2023, "Return of Unconditional Generation," arXiv 2312.03701) proposed a self-supervised representation generation method that first generates a self-supervised feature vector and then generates an image conditioned on that vector, achieving FID 2.15 on ImageNet 256 without any class labels. This approach treats unconditional generation as a two-stage problem: modelling the distribution of semantic representations, then generating pixels conditioned on those representations, which can leverage the rich structure learned by self-supervised encoders without requiring annotated labels.
Unconditional generators have a broad range of practical applications beyond benchmark evaluation.
Synthetic data generation addresses the shortage of labeled or sensitive real data. In medical imaging, GANs and diffusion models generate synthetic radiographs, histology slides, and brain MRI scans that supplement small clinical datasets. Synthetic data can also be used for patient de-identification, generating realistic medical images with no connection to real individuals.
Face editing through GAN inversion is one of the most widely studied applications of StyleGAN and its variants. GAN inversion maps a real photograph to a latent code in the W or W+ space of a pretrained StyleGAN model, then edits the code by moving along semantic directions discovered via supervised or self-supervised analysis. The W space supports coarse edits with high editability; the extended W+ space (one code per style layer) supports better reconstruction at the cost of reduced editability; and various intermediate representations have been proposed to balance the two. Discovered directions correspond to interpretable attributes including age, expression, pose, hairstyle, and illumination, allowing controlled manipulation of real photographs.
Data augmentation for discriminative models is an established use case for unconditional generators. A generator trained on a dataset can produce additional training images to supplement a small real dataset. The synthetic images are most useful when the generator captures fine-grained intra-class variation that is hard to capture with geometric augmentation alone.
Anomaly detection uses the generative model as a reference: a real input image is inverted into the latent space, and the distance between the original and the reconstruction signals how well the model can explain the image. Regions or images that the model cannot reconstruct faithfully are flagged as anomalous. This approach has been applied to industrial inspection and medical screening.
Generative art and creative tools are a major consumer application. Unconditional face models, texture generators, and abstract image models are used in interactive design tools, video game asset pipelines, and artistic installations.
Face anonymisation systems replace real faces in a video or image dataset with synthetic faces generated by a model trained on the same distribution, preserving statistical properties (pose, age, expression distribution) while removing personally identifiable information.
Unconditional image generation models share several fundamental limitations that have motivated the continued evolution of the field.
Mode collapse is a long-standing failure mode for GANs, where the generator learns to produce a limited subset of the training distribution rather than covering all modes. A model trained on a face dataset may generate photorealistic faces of a narrow age range or ethnicity without generating diverse samples. Mode collapse is hard to detect from FID alone because FID folds quality and diversity into a single number, so a small gain in sample quality can offset a mild loss of diversity.
GAN training instability stems from the adversarial objective: the generator and discriminator can enter cycles where one overwhelms the other, leading to oscillations or divergence. This instability is a major reason most state-of-the-art generative models since 2021 use diffusion objectives rather than adversarial training, despite diffusion models being slower to sample.
Sampling cost for diffusion models is substantially higher than for one-shot GANs. Even with fast ODE solvers and distillation, generating a single high-resolution image typically requires 4 to 50 forward passes through the denoising network, compared to a single pass for a GAN generator. For applications requiring real-time generation at high resolution, this remains a practical concern.
Evaluation reliability is a persistent problem. FID is a proxy metric that reflects Inception v3 features trained on ImageNet. It is sensitive to sample size, statistically biased, and encodes the specific texture and edge preferences of the Inception backbone. It can mis-rank generators on non-photographic domains such as medical scans, satellite imagery, or abstract art. The field lacks a universally accepted perceptual quality metric.
Training data bias affects all generative models. Unconditional face models trained on web-scraped datasets such as CelebA or FFHQ exhibit systematic demographic skew because those datasets over-represent certain ages, skin tones, and facial structures. Models replicate and can amplify these biases in downstream applications.
The memory and compute cost of high-resolution generation remains a barrier for smaller research groups. Training a state-of-the-art unconditional model on ImageNet 256 or FFHQ 1024 can require hundreds to thousands of GPU-hours and careful hyperparameter tuning.
Limited compositionality is a challenge specific to unconditional models. Because there is no structured guidance from a text prompt or class label, the model must learn all of p_data(x) implicitly. Composing novel object combinations, unusual viewpoints, or rare attribute conjunctions that are underrepresented in training data is harder without explicit conditioning.
Before deep networks, image synthesis was dominated by parametric texture synthesis (Heeger and Bergen 1995), patch-based methods such as Efros and Leung 1999 and image quilting (Efros and Freeman 2001), and Markov random fields. These methods modeled local statistics but could not capture global object structure.
The VAE paper (Kingma and Welling 2013) and the GAN paper (Goodfellow et al. 2014) appeared within months of each other and established the two main paradigms for learned image generation. Early VAE samples on MNIST and small face datasets were blurry but demonstrated that latent codes could be interpolated smoothly. Early GAN training was unstable and the samples were low-resolution, but DCGAN (2015) established that convolutional architectures could be trained reliably, producing recognizable bedrooms, faces, and objects at 64 by 64 resolution. The original NCSN (Song and Ermon 2019) and the first DDPM (Ho et al. 2020) followed several years later, demonstrating that iterative score-based and diffusion approaches could surpass GAN quality.
The period from 2017 to 2020 was dominated by GAN improvements. Progressive GAN brought high-resolution photorealistic face generation within reach. StyleGAN and its successors refined the architecture to the point that generated faces were often hard for human observers to tell apart from real photographs, and they became widely known through the website thispersondoesnotexist.com. Glow demonstrated that normalizing flows could produce semantically editable face images with exact likelihoods. VQ-VAE-2 showed that discrete autoencoders could generate diverse, high-quality ImageNet images.
The DDPM paper (2020) marked the beginning of the diffusion model era. Within two years, the combination of DDIM sampling, improved architectures (ADM), and latent-space diffusion had produced models that scored better on FID than the best GANs on most major benchmarks. The ADM paper title, "Diffusion Models Beat GANs on Image Synthesis," marked a clear inflection point. Flow-matching and SDE-based frameworks unified the theoretical picture and laid the groundwork for rapid architectural iteration.
DiT (2022) demonstrated that replacing U-Nets with transformers in diffusion models gave predictable FID scaling with model size, directly analogous to the scaling laws observed in large language models. This has driven a shift toward transformer-based backbones in both unconditional and conditional generation pipelines, including the Stable Diffusion 3 and Flux models that rely on MM-DiT architectures. EDM and its successors showed that careful choice of noise schedule and preconditioning could improve efficiency substantially without architectural changes.