Unconditional Image Generation Models
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,496 words
Unconditional image generation models are generative neural networks that learn the marginal distribution p(x) of a set of training images and produce new samples from that learned distribution, with no extra input such as a text caption, class label, or partial image. The model is given only random noise (or a latent code) and must return a plausible image. This contrasts with conditional image generation, where the network is asked to produce an image consistent with some side information, for example a class label as in BigGAN or a text prompt as in Stable Diffusion.
Unconditional generation has been the standard benchmark setting for foundational generative model research because it isolates how well a model captures the data distribution itself. The dominant families are variational autoencoders, generative adversarial networks, normalizing flows, autoregressive image models, and diffusion models. Between roughly 2018 and 2021, GAN families such as StyleGAN held the state of the art on most benchmarks; from 2021 onward diffusion models and their successors (flow matching, consistency models) have generally produced lower Fréchet Inception Distance (FID) on ImageNet, faces, and most object datasets.
See also: Computer Vision Models and Tasks
Given a training set sampled from an unknown distribution p_data(x), an unconditional image generation model approximates p_data with a parametric distribution p_theta(x) and provides a sampler that returns x ~ p_theta. Properties that distinguish the families include whether the sampler is one-shot or iterative, whether the model gives an exact log-likelihood, and whether the latent space is structured for editing. Unconditional models are evaluated almost entirely with sample-based metrics, because likelihood numbers are not always comparable across families. Standard benchmarks are CIFAR-10, CelebA-HQ and FFHQ faces, LSUN bedrooms and churches, and class-unconditional ImageNet at 64, 128, 256, and 512 resolutions.
Before deep networks, image synthesis was dominated by parametric texture synthesis (Heeger and Bergen 1995), patch-based methods such as Efros and Leung 1999 and image quilting (Efros and Freeman 2001), and Markov random fields. These methods modeled local statistics but could not capture global object structure.
The variational autoencoder was introduced by Diederik Kingma and Max Welling in late 2013 (arXiv 1312.6114). A VAE trains an encoder q(z|x) and a decoder p(x|z) jointly by maximising the evidence lower bound (ELBO), and samples by drawing z from a prior and running it through the decoder. Vanilla VAEs produce blurry samples because the per-pixel likelihood term encourages averaging. Beta-VAE (Higgins et al. 2017) reweights the KL term to encourage disentangled latents. VQ-VAE (van den Oord, Vinyals and Kavukcuoglu 2017, arXiv 1711.00937) replaced the continuous Gaussian latent with a discrete codebook, then trained a separate autoregressive prior over the codes. VQ-VAE-2 (Razavi et al. 2019, arXiv 1906.00446) extended this to a hierarchy of top and bottom latents and produced 256 by 256 ImageNet samples competitive with the best GANs of the time. NVAE (Vahdat and Kautz 2020) showed that a carefully designed hierarchical VAE could narrow the sample-quality gap to other families on natural images.
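The VAE objective described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any paper's implementation: it assumes a diagonal Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant). The function names and the `beta` argument are hypothetical; `beta=1.0` corresponds to the standard ELBO and other values to the Beta-VAE weighting.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterisation trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def negative_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Per-example negative ELBO, up to an additive constant.

    Assumes a unit-variance Gaussian decoder, so the reconstruction
    term is half the squared error between x and its reconstruction.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Toy check: an encoder that outputs exactly the prior has zero KL cost.
rng = np.random.default_rng(0)
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
x = np.ones((4, 3))
loss_perfect = negative_elbo(x, x, mu, log_var)
```

The squared-error reconstruction term is what pushes the decoder toward averaging over plausible outputs, which is one way to see why vanilla VAE samples tend to look blurry.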
The GAN framework was introduced by Ian Goodfellow et al. in 2014 (arXiv 1406.2661, NeurIPS 2014) as a two-player minimax game between a generator and a discriminator. The first practical convolutional variant was DCGAN (Radford, Metz and Chintala 2015, arXiv 1511.06434), which set the basic template of strided convolutions, batch normalisation, and a tanh output. Wasserstein GAN (Arjovsky et al. 2017, arXiv 1701.07875) replaced the Jensen-Shannon objective with the Wasserstein-1 distance for a more stable training signal; WGAN-GP (Gulrajani et al. 2017) added a gradient penalty. Progressive growing of GANs (Karras et al. 2017, arXiv 1710.10196) trained at increasing resolutions, producing the first photorealistic 1024 by 1024 face images. StyleGAN (Karras, Laine and Aila 2018, arXiv 1812.04948) added a style-based generator with adaptive instance normalisation and a mapped intermediate W latent space. StyleGAN2 (Karras et al. 2019, arXiv 1912.04958) fixed the characteristic droplet artifacts and the artifacts introduced by progressive growing. StyleGAN3 (Karras et al. 2021, arXiv 2106.12423) removed aliasing so internal features become equivariant to translation and rotation. BigGAN (Brock, Donahue and Simonyan 2018, arXiv 1809.11096) showed that scaling batch sizes to 2048 with orthogonal regularisation and a truncation trick gave a large jump on class-conditional ImageNet, reaching IS 166.3 and FID 9.6 at 128 by 128.
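As a rough sketch of the adversarial objective, the snippet below computes the discriminator loss and the non-saturating generator loss from raw discriminator logits. It assumes the discriminator outputs a logit whose sigmoid is the probability that an input is real; the function names are hypothetical and no network is defined, only the loss arithmetic. The non-saturating generator loss is the commonly used variant of the minimax game, not the Wasserstein objective.

```python
import numpy as np

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return np.logaddexp(0.0, a)

def discriminator_loss(real_logits, fake_logits):
    """-log D(x) - log(1 - D(G(z))), with D = sigmoid(logit)."""
    return np.mean(softplus(-real_logits)) + np.mean(softplus(fake_logits))

def generator_loss_nonsaturating(fake_logits):
    """-log D(G(z)): the generator is rewarded when fakes are scored as real."""
    return np.mean(softplus(-fake_logits))

# Toy logits: the discriminator is fairly confident on both batches.
real_logits = np.array([2.0, 1.5])
fake_logits = np.array([-1.0, -2.0])
d_loss = discriminator_loss(real_logits, fake_logits)
g_loss = generator_loss_nonsaturating(fake_logits)

# An undecided discriminator (all logits 0) sits at the equilibrium values.
d_loss_eq = discriminator_loss(np.zeros(4), np.zeros(4))
g_loss_eq = generator_loss_nonsaturating(np.zeros(4))
```

When the discriminator cannot tell real from fake, its loss equals 2 log 2, which is the value of the original minimax game at its optimum.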
Normalizing flow models build an invertible mapping between a simple base density and the data, giving an exact tractable log-likelihood. NICE (Dinh, Krueger and Bengio 2014) introduced coupling layers. RealNVP (Dinh, Sohl-Dickstein and Bengio 2016) added affine couplings and a multi-scale architecture. Glow (Kingma and Dhariwal 2018, arXiv 1807.03039) from OpenAI added invertible 1 by 1 convolutions and produced sharp face samples while training to an exact log-likelihood objective.
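The exact-likelihood property can be illustrated with a single affine coupling layer of the kind RealNVP popularised. This is a minimal sketch: the conditioner `scale_and_shift` is a hypothetical stand-in (two small linear maps) for the deep network a real flow would use, and only one layer is shown, so half of the dimensions pass through unchanged.

```python
import numpy as np

def scale_and_shift(x_a, w_s, w_t):
    """Hypothetical conditioner: tanh keeps the log-scale bounded."""
    return np.tanh(x_a @ w_s), x_a @ w_t

def coupling_forward(x, w_s, w_t):
    """y_a = x_a, y_b = x_b * exp(log_s(x_a)) + t(x_a); returns y and log|det J|."""
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    log_s, t = scale_and_shift(x_a, w_s, w_t)
    y_b = x_b * np.exp(log_s) + t
    log_det = np.sum(log_s, axis=-1)  # the Jacobian is triangular
    return np.concatenate([x_a, y_b], axis=-1), log_det

def coupling_inverse(y, w_s, w_t):
    """Exact inverse: the conditioner only ever sees the untouched half."""
    d = y.shape[-1] // 2
    y_a, y_b = y[..., :d], y[..., d:]
    log_s, t = scale_and_shift(y_a, w_s, w_t)
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b], axis=-1)

def log_likelihood(x, w_s, w_t):
    """Change of variables: log p(x) = log N(f(x); 0, I) + log|det J|."""
    z, log_det = coupling_forward(x, w_s, w_t)
    log_base = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    return log_base + log_det

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6))
w_s = 0.1 * rng.standard_normal((3, 3))
w_t = 0.1 * rng.standard_normal((3, 3))
y, log_det = coupling_forward(x, w_s, w_t)
x_back = coupling_inverse(y, w_s, w_t)
log_px = log_likelihood(x, w_s, w_t)
```

Practical flows stack many such layers and alternate which half is transformed, with permutations or invertible 1 by 1 convolutions mixing the channels in between.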
PixelRNN and PixelCNN (van den Oord, Kalchbrenner and Kavukcuoglu 2016, arXiv 1601.06759, ICML 2016 best paper) factor the image distribution as a product of per-pixel conditionals and generate pixels one at a time. The Image Transformer (Parmar et al. 2018) replaced the recurrent core with self-attention blocks restricted to local windows. VQGAN (Esser, Rombach and Ommer 2020, arXiv 2012.09841) trained a discrete autoencoder with a perceptual and adversarial loss, then trained a transformer to model the code sequence; this two-stage recipe became the template for later token-based image models. MaskGIT (Chang et al. 2022) introduced parallel masked decoding so tokens can be sampled in a small number of refinement steps.
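The per-pixel factorisation can be made concrete with a toy raster-order sampler over binary images. The conditional `p_pixel_on` below is a hypothetical hand-written rule that looks only at the left and upper neighbours; in PixelCNN-style models a neural network plays this role. The sketch shows the two properties the factorisation gives: sequential sampling, and an exact log-likelihood obtained by summing the per-pixel log-probabilities.

```python
import numpy as np

def p_pixel_on(image, row, col):
    """Hypothetical conditional p(x_ij = 1 | earlier pixels in raster order)."""
    left = image[row, col - 1] if col > 0 else 0
    up = image[row - 1, col] if row > 0 else 0
    return 0.1 + 0.4 * left + 0.4 * up

def sample_image(height, width, rng):
    """Sample pixels one at a time; also accumulate the sample's log-probability."""
    image = np.zeros((height, width), dtype=np.int64)
    log_prob = 0.0
    for row in range(height):
        for col in range(width):
            p = p_pixel_on(image, row, col)
            pixel = int(rng.random() < p)
            image[row, col] = pixel
            log_prob += np.log(p if pixel else 1.0 - p)
    return image, log_prob

def log_likelihood(image):
    """Chain rule: log p(x) is the sum of per-pixel conditional log-probabilities."""
    total = 0.0
    height, width = image.shape
    for row in range(height):
        for col in range(width):
            p = p_pixel_on(image, row, col)
            total += np.log(p if image[row, col] else 1.0 - p)
    return total

rng = np.random.default_rng(0)
image, log_prob = sample_image(4, 4, rng)
```

Sampling needs one model call per pixel, which is the cost that parallel schemes such as masked decoding aim to reduce.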
The original diffusion idea was published by Sohl-Dickstein et al. in 2015 (arXiv 1503.03585), framing generation as the reverse of a fixed Gaussian noising process. The modern formulation, DDPM (Ho, Jain and Abbeel 2020, arXiv 2006.11239), trained a U-Net to predict the noise added at each step and reached FID 3.17 on CIFAR-10. Improved DDPM (Nichol and Dhariwal 2021, arXiv 2102.09672) learned the reverse-process variance and introduced a cosine noise schedule. DDIM (Song, Meng and Ermon 2020, arXiv 2010.02502) introduced a deterministic sampler giving high-quality samples in 50 steps instead of 1,000. Score-SDE (Song et al. 2020, arXiv 2011.13456) unified DDPM and noise-conditional score networks under a stochastic differential equation view. ADM, also called Guided Diffusion (Dhariwal and Nichol 2021, arXiv 2105.05233), reached class-conditional FID 2.97 on ImageNet 128, 4.59 at 256, and 7.72 at 512 using classifier guidance, the first time a diffusion model beat BigGAN on ImageNet. Latent Diffusion Models (Rombach et al. 2021, arXiv 2112.10752) ran diffusion in the latent space of a pretrained VQGAN-style autoencoder; the same architecture underpins Stable Diffusion. EDM (Karras et al. 2022, arXiv 2206.00364) reorganised the diffusion design space and reached FID 1.79 on class-conditional CIFAR-10 with 35 network evaluations per sample. Consistency models (Song, Dhariwal, Chen and Sutskever 2023, arXiv 2303.01469) learned a direct map from noise to data so one or two function evaluations are enough. Flow Matching (Lipman et al. 2022, arXiv 2210.02747) trained continuous normalising flows by regressing on a fixed conditional vector field, and Rectified Flow (Liu, Gong and Liu 2022) used straightened transport paths; both underpin Stable Diffusion 3.
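The DDPM training signal and the DDIM update can be sketched without a network. The snippet assumes the linear beta schedule from 1e-4 to 0.02 over 1,000 steps used in the DDPM paper; the function names are hypothetical, and `eps_pred` stands in for the output of the noise-prediction U-Net. The final lines use the true noise as a stand-in for a perfect network, in which case one deterministic DDIM step lands exactly on the less noisy sample.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumed, as in DDPM)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

def q_sample(x0, t, eps):
    """Forward process in closed form: x_t = sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def noise_prediction_loss(eps_pred, eps):
    """Simplified DDPM objective: mean squared error on the added noise."""
    return np.mean((eps_pred - eps) ** 2)

def ddim_step(x_t, eps_pred, t, t_prev):
    """Deterministic DDIM update (eta = 0) from step t to an earlier step t_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 8))     # stand-in for a batch of images
eps = rng.standard_normal((2, 8))
x_t = q_sample(x0, 800, eps)

# With a perfect noise prediction, skipping 300 steps at once is exact.
x_prev = ddim_step(x_t, eps, 800, 500)
perfect_loss = noise_prediction_loss(eps, eps)
```

Because the DDIM update only needs the cumulative schedule at the two time indices, it can skip steps freely, which is how a 1,000-step model is sampled in 50 steps.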
| Model | Year | Family | Group | Notable result |
|---|---|---|---|---|
| DCGAN | 2015 | GAN | indico Research, Facebook AI Research | First stable convolutional GAN |
| Progressive GAN | 2017 | GAN | NVIDIA | First 1024 CelebA-HQ faces; CIFAR IS 8.80 |
| BigGAN | 2018 | GAN | DeepMind | ImageNet 128 IS 166.3, FID 9.6 |
| Glow | 2018 | Flow | OpenAI | Exact likelihood; sharp faces |
| StyleGAN | 2018 | GAN | NVIDIA | Style based generator on FFHQ |
| VQ-VAE-2 | 2019 | VAE plus prior | DeepMind | Hierarchical 256 ImageNet |
| StyleGAN2 | 2019 | GAN | NVIDIA | Fixed droplet artifacts |
| DDPM | 2020 | Diffusion | Berkeley | CIFAR-10 FID 3.17 |
| VQGAN | 2020 | Autoregressive | Heidelberg CompVis | Discrete tokens plus transformer |
| Improved DDPM | 2021 | Diffusion | OpenAI | Cosine schedule, learned variance |
| ADM | 2021 | Diffusion | OpenAI | First to beat BigGAN on ImageNet |
| StyleGAN3 | 2021 | GAN | NVIDIA | Alias-free, equivariant |
| Latent Diffusion | 2021 | Diffusion | Heidelberg CompVis | Diffusion in VQGAN latent |
| MaskGIT | 2022 | Masked AR | Google Research | Parallel decoding in 8 to 16 steps |
| EDM | 2022 | Diffusion | NVIDIA | Class-conditional CIFAR-10 FID 1.79 with 35 network evaluations |
| DiT | 2022 | Diffusion | Meta AI | Transformer backbone; ImageNet 256 FID 2.27 |
| Consistency Models | 2023 | Distilled diffusion | OpenAI | One-step CIFAR-10 FID 3.55 |

The datasets most commonly used for these benchmarks are:

| Dataset | Resolution | Size | Notes |
|---|---|---|---|
| MNIST | 28 by 28 grayscale | 60,000 | Handwritten digits |
| CIFAR-10 | 32 by 32 | 50,000 | Ten object classes |
| CelebA-HQ | up to 1024 | 30,000 | Celebrity faces; used by Progressive GAN |
| FFHQ | 1024 by 1024 | 70,000 | Flickr-Faces-HQ, released with StyleGAN |
| LSUN | 256 typical | millions | Bedrooms, churches, cats |
| ImageNet | 64 / 128 / 256 / 512 | 1.28 million | 1000 classes |
| AFHQ | 512 by 512 | about 15,000 | Animal faces, three classes |
| CUB-200-2011 | 256 typical | about 12,000 | Fine grained birds |
The Inception Score (IS, Salimans et al. 2016) is the exponential of the average KL divergence between the class distribution p(y|x) that an ImageNet-pretrained Inception network predicts for each generated sample and the marginal class distribution p(y) over all generated samples; it correlates with quality, but it never compares against real data and is insensitive to mode dropping within a class. The Fréchet Inception Distance (FID, Heusel et al. 2017, arXiv 1706.08500) computes the squared 2-Wasserstein (Fréchet) distance between Gaussian fits to Inception v3 activations of the real and generated sets at the 2048-dimensional pool layer; it is the de facto standard. Kernel Inception Distance (KID, Bińkowski et al. 2018) replaces the Gaussian assumption with a polynomial-kernel MMD. Precision and recall for generative models (Kynkäänniemi et al. 2019), and the related density and coverage metrics, separately measure sample quality and distributional coverage. sFID uses spatial features instead of pool features, and CLIP-FID swaps the Inception backbone for a CLIP image encoder.
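Both headline metrics reduce to short formulas once the network outputs are available. The sketch below assumes the Inception features and class probabilities have already been extracted; here they are replaced by random stand-in arrays, and the function names are hypothetical. The trace of the matrix square root is computed from the eigenvalues of the covariance product, which avoids a dependency beyond NumPy; reference implementations typically use a dedicated matrix square root routine instead.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)) for Gaussian fits."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((C_r C_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C_r C_g, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

def inception_score(class_probs, eps=1e-12):
    """exp(E_x[KL(p(y|x) || p(y))]) from an (n_samples, n_classes) probability array."""
    marginal = class_probs.mean(axis=0, keepdims=True)
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))   # stand-in for Inception pool features
shifted = real + 2.0                   # same covariance, different mean
fid_same = frechet_distance(real, real)
fid_shifted = frechet_distance(real, shifted)

is_confident = inception_score(np.eye(4))             # one sample per class, all confident
is_uniform = inception_score(np.full((4, 4), 0.25))   # classifier cannot decide
```

Shifting every feature by a constant changes only the mean term, so the distance for the shifted set should be close to the squared norm of the shift. The Inception Score ranges from 1, when every prediction equals the marginal, up to the number of classes.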
Several techniques recur across families. Spectral normalisation (Miyato et al. 2018) constrains discriminator Lipschitz constants and stabilises GAN training. The two-time-scale update rule (TTUR) uses different learning rates for the generator and discriminator. Self-attention blocks at intermediate resolutions, introduced as SAGAN (Zhang et al. 2018), help with long-range structure. Large batch training, exponential moving averages of generator weights, R1 regularisation on the discriminator, and adaptive data augmentation (ADA, Karras et al. 2020) are routinely used to train GANs at high resolution with limited data. Classifier-free guidance (Ho and Salimans 2022) is the dominant tool for trading diversity against fidelity in diffusion sampling; it requires a model trained both with and without a conditioning signal, so it applies to conditional rather than purely unconditional generators.
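Spectral normalisation, for instance, divides each weight matrix by an estimate of its largest singular value obtained by power iteration. The sketch below runs the iteration to convergence on a small fixed matrix so the result can be checked against a full SVD; in training, implementations typically run a single iteration per update and carry the vector `u` over between steps. The function name and arguments are hypothetical.

```python
import numpy as np

def spectral_normalize(weight, n_iters=50, seed=0):
    """Return (weight / sigma, sigma), where sigma estimates the top singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(weight.shape[0])
    for _ in range(n_iters):
        # Alternate between right and left singular vector estimates.
        v = weight.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = weight @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ weight @ v
    return weight / sigma, sigma

weight = np.array([[2.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
weight_sn, sigma = spectral_normalize(weight)

# Reference values from a full SVD, for comparison only.
sigma_svd = np.linalg.svd(weight, compute_uv=False)[0]
top_after = np.linalg.svd(weight_sn, compute_uv=False)[0]
```

After normalisation the layer's linear map has spectral norm close to 1, which bounds how much it can stretch its input.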
Diffusion model quality depends heavily on the sampler. DDIM gives a deterministic sampler, interpretable as an ODE integrator, that approaches DDPM quality in tens of steps. Heun's method, DPM-Solver and DPM-Solver++ (Lu et al. 2022) use higher-order ODE solvers to cut the number of steps further, whereas Euler ancestral (Euler-A) is a first-order stochastic sampler that re-injects noise at every step. UniPC (Zhao et al. 2023) adds predictor-corrector steps. Progressive distillation (Salimans and Ho 2022) repeatedly halves the number of required steps, and consistency distillation trains a student to map any point on a sampling trajectory directly to its endpoint; both lead to two-, four-, or eight-step samplers that approach multistep teacher quality.
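The benefit of a higher-order solver can be seen on a toy problem where the exact answer is known. The sketch below integrates the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma, the form used in the EDM formulation, with Euler and Heun steps. It assumes one-dimensional Gaussian data with standard deviation `DATA_STD`, for which the ideal denoiser and the ODE solution have closed forms; the geometric noise schedule and all names are illustrative choices, not those of any published sampler.

```python
import numpy as np

DATA_STD = 0.5  # assumed toy data distribution N(0, DATA_STD^2)

def denoiser(x, sigma):
    """Ideal denoiser E[x0 | x] for Gaussian data observed at noise level sigma."""
    return x * DATA_STD ** 2 / (DATA_STD ** 2 + sigma ** 2)

def ode_derivative(x, sigma):
    """Right-hand side of the probability-flow ODE, dx/dsigma."""
    return (x - denoiser(x, sigma)) / sigma

def sample(x, sigmas, use_heun):
    """Integrate from sigmas[0] down to sigmas[-1] with Euler or Heun steps."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = ode_derivative(x, s_cur)
        x_euler = x + (s_next - s_cur) * d_cur
        if use_heun:
            # Second-order correction: average the slopes at both ends of the step.
            d_next = ode_derivative(x_euler, s_next)
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler
    return x

sigmas = np.geomspace(10.0, 0.01, 31)  # 30 steps, geometric spacing (assumed)
x_start = np.array([10.0, -5.0])       # stand-in for noise drawn at sigma = 10

# Closed-form solution of the ODE for this toy problem.
exact = x_start * np.sqrt((DATA_STD ** 2 + sigmas[-1] ** 2) / (DATA_STD ** 2 + sigmas[0] ** 2))

euler_error = np.max(np.abs(sample(x_start, sigmas, use_heun=False) - exact))
heun_error = np.max(np.abs(sample(x_start, sigmas, use_heun=True) - exact))
```

Heun's method calls the denoiser twice per step, so a fair comparison weighs its lower error against the doubled number of network evaluations.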
By 2024 and 2025 the landscape had shifted in three ways. First, diffusion models and flow-matching variants dominate FID on ImageNet, CelebA-HQ, FFHQ, and similar datasets; ADM started this trend and DiT (Peebles and Xie 2022, arXiv 2212.09748), with a transformer backbone, pushed class-conditional ImageNet 256 to FID 2.27. Second, few-step methods such as consistency models and rectified flow have closed most of the latency gap that originally favoured GANs. Third, GAN-family models, especially StyleGAN3 and its descendants, remain the dominant tools for face editing through inversion in the W or W+ latent space.
Unconditional generators are used for synthetic data generation where real data is scarce or sensitive, including medical imaging. They support face editing through GAN inversion, mapping a real photograph to a latent code that can be moved along semantic directions for age, expression, or pose. They serve as building blocks for data augmentation, power latent-space exploration in creative tools and generative art, and enable anomaly detection by comparing real inputs to the closest sample the model can produce. Face anonymisation systems use them to replace identities while preserving non-identity attributes.
Mode collapse, where the generator covers only a subset of the data distribution, is a long-standing failure mode for GANs and is hard to detect from FID alone. GAN training instability is a commonly cited reason that most recent state-of-the-art models are diffusion-based. Diffusion sampling, even with fast solvers, typically requires more compute per sample than a GAN, which needs a single forward pass. Evaluation reliability is a real concern: FID is a proxy that depends on the Inception backbone and is biased toward features useful on ImageNet, so it can mis-rank generators on faces, medical scans, or art. High-resolution generation remains expensive in memory and time. Unconditional models inherit training data biases, so face models trained on web-scraped sets show systematic demographic skew.