Wasserstein GAN (WGAN)
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 3,790 words
Wasserstein GAN (WGAN) is a variant of the generative adversarial network introduced by Martin Arjovsky, Soumith Chintala, and Léon Bottou in the January 2017 arXiv preprint Wasserstein GAN (arXiv:1701.07875), later published at the International Conference on Machine Learning in 2017. The method replaces the Jensen-Shannon divergence implicitly minimised by the original GAN of Goodfellow et al. (2014) with the Wasserstein-1 distance, also known as the Earth mover's distance (EMD). The result is a more stable training procedure, a measurable loss that correlates with the perceived quality of samples, and a partial remedy to the mode-collapse pathology that plagued earlier GAN variants.
The paper appeared alongside a companion theoretical paper by Arjovsky and Bottou, Towards Principled Methods for Training Generative Adversarial Networks (arXiv:1701.04862), which diagnosed why ordinary GAN training is unstable. A few months later Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville published Improved Training of Wasserstein GANs (arXiv:1704.00028) at NeurIPS 2017, replacing the original WGAN's weight clipping with a gradient penalty. That variant, WGAN-GP, became the canonical formulation used in most subsequent GAN papers, and the Wasserstein loss it embodies is now a standard option in deep generative modelling toolkits.
The original GAN trains a generator G and a discriminator D in a minimax game whose value function, when D is optimal, is equivalent to the Jensen-Shannon (JS) divergence between the data distribution P_r and the generator distribution P_g. Goodfellow et al. (2014) proved that the global minimum of this objective is attained when P_g = P_r and the JS divergence equals zero. In practice, however, the training dynamics are notoriously fragile.
In early 2017 Arjovsky and Bottou (arXiv:1701.04862) gave a theoretical account of why. When the supports of P_r and P_g lie on low-dimensional manifolds, which is the typical situation for natural images modelled by a deep generator, those supports almost surely have measure-zero overlap. In that regime the JS divergence is constant (equal to log 2), the Kullback-Leibler divergence is infinite, and the gradient of the discriminator with respect to the generator parameters either vanishes or explodes. Three concrete failure modes followed from this analysis:
- When D becomes too good at telling real from fake, its output saturates and the generator stops receiving useful signal.

WGAN was designed to address exactly these failures by changing the underlying distance.
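The log 2 saturation of the JS divergence on disjoint supports can be checked in a few lines. The following is a short worked derivation, assuming only that P_r and P_g have disjoint supports and writing M for their equal-weight mixture (the symbol M is introduced here for illustration):

```latex
\begin{aligned}
M &= \tfrac{1}{2}\,(P_r + P_g), \qquad
\frac{dP_r}{dM} = 2 \ \text{on } \operatorname{supp}(P_r), \qquad
\frac{dP_g}{dM} = 2 \ \text{on } \operatorname{supp}(P_g), \\
\mathrm{KL}(P_r \,\|\, M) &= \int \log\frac{dP_r}{dM}\, dP_r = \log 2, \qquad
\mathrm{KL}(P_g \,\|\, M) = \int \log\frac{dP_g}{dM}\, dP_g = \log 2, \\
\mathrm{JS}(P_r, P_g) &= \tfrac{1}{2}\,\mathrm{KL}(P_r \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(P_g \,\|\, M) = \log 2 .
\end{aligned}
```

The value does not depend on how far apart the two supports are, which is why the JS divergence gives the generator no usable gradient in this regime.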
The Wasserstein-1 distance, also called the Earth mover's distance, between two probability measures P_r and P_g on a metric space (X, d) is defined as
W(P_r, P_g) = inf over γ in Π(P_r, P_g) of E_{(x, y) ~ γ} [ ||x - y|| ]
where Π(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g. Each such joint γ is called a transport plan: γ(x, y) is the mass moved from point x (in P_r) to point y (in P_g), and the cost of a plan is the expected distance moved. The Wasserstein-1 distance is the minimum cost of any plan that turns P_r into P_g. The colloquial name comes from the picture of moving piles of dirt: W(P_r, P_g) is the minimum amount of work required to reshape one distribution into the other.
The key property that makes this metric useful for GAN training is that, under mild regularity assumptions on the generator, W(P_r, P_g_θ) is continuous and almost-everywhere differentiable in the generator parameters θ. Arjovsky, Chintala, and Bottou (2017) prove this in Theorem 1 of the WGAN paper. Total variation, KL, and JS divergences fail one or both of these properties when supports do not overlap. A simple example given in the paper involves two parallel line segments in the plane, separated by a parameter θ. The Wasserstein distance between them is |θ|, which is continuous in θ and differentiable everywhere except at θ = 0, while JS, KL, and reverse-KL are all discontinuous at θ = 0.
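The contrast can be reproduced numerically. The sketch below is a one-dimensional reduction of the parallel-lines example, not the paper's own setup: each line is collapsed to a point mass on a uniform grid, and the helper names `js_divergence`, `wasserstein_1`, and `point_mass` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def wasserstein_1(p, q, spacing):
    """W1 on a uniform 1D grid: the integral of |CDF_p - CDF_q|."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total


def point_mass(index, size):
    """Distribution that puts all of its mass on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]


size, spacing = 11, 0.1
p = point_mass(0, size)  # stands in for the line at 0
for shift in (0, 1, 5, 10):
    q = point_mass(shift, size)  # stands in for the line at theta
    theta = shift * spacing
    print(f"theta={theta:.1f}  JS={js_divergence(p, q):.4f}  W1={wasserstein_1(p, q, spacing):.4f}")
```

In this toy setting the JS divergence is 0 at θ = 0 and jumps to log 2 ≈ 0.693 for every other shift, whereas W1 grows linearly with θ and so tells the generator which way to move.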
The primal infimum-over-couplings form is intractable for the high-dimensional distributions that GANs work with. Optimal transport theory supplies a dual formulation, the Kantorovich-Rubinstein duality, that turns the infimum into a supremum over 1-Lipschitz functions:
W(P_r, P_g) = sup over f with ||f||_L ≤ 1 of E_{x ~ P_r} [ f(x) ] - E_{x ~ P_g} [ f(x) ]
A function f is 1-Lipschitz when |f(x) - f(y)| ≤ ||x - y|| for all x, y. The duality says that if a 1-Lipschitz function f is trained to score real samples high and fake samples low, the largest gap it can achieve equals the Wasserstein distance.
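A small check of the duality on one-dimensional samples, where the primal problem has a closed form. This is a sketch under simplifying assumptions: equal-size samples on the real line (for which matching sorted values is an optimal transport plan) and hand-picked candidate functions. The names `empirical_w1` and `dual_gap` are hypothetical.

```python
def empirical_w1(xs, ys):
    """Primal W1 between equal-size 1D samples via the sorted (optimal) matching."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_gap(f, real, fake):
    """Kantorovich-Rubinstein objective E[f(real)] - E[f(fake)] for a candidate f."""
    return sum(map(f, real)) / len(real) - sum(map(f, fake)) / len(fake)


real = [2.0, 3.0, 4.5]
fake = [0.0, 1.0, 1.5]

w1 = empirical_w1(real, fake)

# Every candidate below is 1-Lipschitz, so each gap is a lower bound on W1.
candidates = {
    "f(x) = x": lambda x: x,
    "f(x) = 0.5 x": lambda x: 0.5 * x,
    "f(x) = |x - 2|": lambda x: abs(x - 2.0),
    "f(x) = -x": lambda x: -x,
}
print(f"primal W1 = {w1:.4f}")
for name, f in candidates.items():
    print(f"{name:15s} gap = {dual_gap(f, real, fake):.4f}")
```

Because every real sample lies to the right of its matched fake sample here, f(x) = x attains the supremum and its gap equals the primal value; the other candidates give strictly smaller gaps. A WGAN critic searches for such a maximising f with a neural network instead of by hand.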
WGAN replaces the GAN discriminator with a network f_w (renamed critic because it no longer outputs a probability) that approximates this supremum. The critic outputs an unbounded scalar, has no sigmoid on its head, and is trained to maximise the dual gap. The generator is then trained to push that gap down by changing the fake distribution.
Algorithm 1 of Arjovsky, Chintala, and Bottou (2017) specifies the full training loop. The headline choices are:
- Critic objective, maximised over w: L_critic = E[f_w(x_real)] - E[f_w(G_θ(z))].
- Generator loss, minimised over θ: L_gen = -E[f_w(G_θ(z))].
- Weight clipping: after each critic update, clip w to the interval [-c, c] with the default c = 0.01.
- Critic updates per generator update: n_critic = 5.
- Optimiser: RMSProp with learning rate α = 5e-5 and batch size m = 64.

The authors explicitly recommend RMSProp over Adam in the original paper because they observed that Adam's momentum interacted badly with the non-stationary critic loss when weight clipping was active, leading to occasional training divergence. With these settings WGAN trains stably across the DCGAN architectures Radford et al. used in 2015, and also across several alternative critic architectures, including a multilayer perceptron without batch normalisation that reportedly fails outright in standard GANs.
A practical benefit of the dual form is that the critic can, and indeed should, be trained to near optimality between generator updates. In ordinary GANs this is a mistake because the JS-style discriminator saturates; in WGAN the closer the critic is to optimal, the better the gradient signal the generator receives.
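The loop structure described above (several clipped critic updates, then one generator update) can be sketched in a one-dimensional toy setting. Everything specific here is an assumption made so the example runs in pure Python within about a second: the linear critic f_w(x) = w · x, the shift generator G_θ(z) = z + θ, the Gaussian data centred at 3, the hand-written scalar `RMSProp` class, and the enlarged toy hyperparameters (clip 0.1, learning rate 0.01). It is not the authors' reference code or their settings.

```python
import math
import random


class RMSProp:
    """Scalar RMSProp: rescales each gradient by a running RMS of past gradients."""

    def __init__(self, lr, decay=0.9, eps=1e-8):
        self.lr, self.decay, self.eps = lr, decay, eps
        self.mean_square = 0.0

    def scaled(self, grad):
        self.mean_square = self.decay * self.mean_square + (1 - self.decay) * grad * grad
        return self.lr * grad / (math.sqrt(self.mean_square) + self.eps)


def mean(values):
    return sum(values) / len(values)


def train_toy_wgan(steps=1000, n_critic=5, clip=0.1, lr=0.01, batch=64, seed=0):
    """Fit G(z) = z + theta to N(3, 1) data using a clipped linear critic."""
    rng = random.Random(seed)
    w, theta = 0.0, 0.0
    critic_opt, gen_opt = RMSProp(lr), RMSProp(lr)
    for _ in range(steps):
        for _ in range(n_critic):
            real = [rng.gauss(3.0, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            # d/dw of E[w * x_real] - E[w * G(z)]
            grad_w = mean(real) - mean(fake)
            w += critic_opt.scaled(grad_w)   # ascent: the critic maximises the gap
            w = max(-clip, min(clip, w))     # weight clipping to [-c, c]
        # d/dtheta of L_gen = -E[w * (z + theta)] is -w
        grad_theta = -w
        theta -= gen_opt.scaled(grad_theta)  # descent: the generator minimises L_gen
    return theta


final_theta = train_toy_wgan()
print(f"learned shift theta = {final_theta:.3f} (data mean is 3.0)")
```

With a linear critic the clipped weight simply saturates at +c or -c depending on which side of the data the generator sits, so θ drifts toward the data mean and then hovers near it. Real critics are deep networks, but the alternation pattern is the same.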
The original paper acknowledged that weight clipping is a crude way to enforce 1-Lipschitz continuity. Gulrajani et al. (2017) pursued the idea further and made the case sharply. They identified three concrete pathologies:
- Clipped weights concentrate at +c or -c, producing a binary weight distribution rather than the rich one a deep network can use. Histograms in the paper show this saturation clearly.
- Choosing c too small makes activations and gradients vanish through depth; choosing it too large makes them blow up. Tuning c is brittle.

Gulrajani et al. (2017) propose to drop weight clipping in favour of a soft constraint added directly to the critic loss. The trick uses the fact that a critic that achieves the supremum in the Kantorovich-Rubinstein duality has gradient norm exactly 1 almost everywhere along straight-line paths between real and fake samples. They penalise deviations from that property:
L_critic = E[f_w(G_θ(z))] - E[f_w(x_real)] + λ · E_{x̂} [ (||∇_{x̂} f_w(x̂)||_2 - 1)^2 ]
where x̂ = ε · x_real + (1 - ε) · G_θ(z) with ε sampled uniformly from [0, 1]. The penalty coefficient is λ = 10. The interpolated samples x̂ lie on the straight line between a real and a fake sample, which is exactly the support along which an optimal critic has unit gradient norm.
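The penalty term can be made concrete with a small sketch. To stay dependency-free it assumes the critic is an ordinary Python function on lists of floats and uses central finite differences in place of automatic differentiation; a real implementation would take the gradient with the framework's autograd and backpropagate through it. The function names are hypothetical.

```python
import math
import random


def grad_norm(critic, x, h=1e-5):
    """L2 norm of the critic's input gradient at x, by central finite differences."""
    squared = 0.0
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        partial = (critic(up) - critic(down)) / (2 * h)
        squared += partial * partial
    return math.sqrt(squared)


def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=None):
    """lam * mean((||grad f(x_hat)|| - 1)^2) at random real/fake interpolates."""
    rng = rng or random.Random(0)
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()  # uniform in [0, 1)
        x_hat = [eps * r + (1 - eps) * f for r, f in zip(x_real, x_fake)]
        total += (grad_norm(critic, x_hat) - 1.0) ** 2
    return lam * total / len(real_batch)


def wgan_gp_critic_loss(critic, real_batch, fake_batch, lam=10.0, rng=None):
    """E[f(fake)] - E[f(real)] + gradient penalty, to be minimised by the critic."""
    fake_term = sum(critic(x) for x in fake_batch) / len(fake_batch)
    real_term = sum(critic(x) for x in real_batch) / len(real_batch)
    return fake_term - real_term + gradient_penalty(critic, real_batch, fake_batch, lam, rng)


real_batch = [[1.0, 2.0], [0.5, -1.0]]
fake_batch = [[0.0, 0.0], [2.0, 1.0]]

unit_slope = lambda x: x[0]          # gradient norm 1 everywhere
double_slope = lambda x: 2.0 * x[0]  # gradient norm 2 everywhere

print(gradient_penalty(unit_slope, real_batch, fake_batch))
print(gradient_penalty(double_slope, real_batch, fake_batch))
```

A critic with unit gradient norm pays essentially no penalty, while one with gradient norm 2 pays about λ · (2 - 1)^2 = 10, so the term pushes the critic toward unit slope along the interpolation lines rather than clipping its weights.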
Unlike the original WGAN, WGAN-GP works well with Adam. The recommended setting is Adam with learning rate = 1e-4, β_1 = 0.0, and β_2 = 0.9, again with n_critic = 5. Batch normalisation in the critic is replaced by layer normalisation because batch norm couples samples within a minibatch in ways that interfere with the per-sample gradient penalty. Under these settings WGAN-GP trains stably on a wide range of architectures, including 101-layer ResNets and language models over discrete data using a continuous relaxation.
WGAN-GP became the de facto WGAN variant in research code from 2017 onward. The reference implementation at igul222/improved_wgan_training is widely cited, and the original PyTorch repository for plain WGAN at martinarjovsky/WassersteinGAN is still used pedagogically.
The gradient penalty is only one way to encourage a critic to be 1-Lipschitz. Several alternatives, most published within a year or two of WGAN-GP, became important in their own right.
| Method | Year | Lipschitz mechanism | Reference |
|---|---|---|---|
| Weight clipping (original WGAN) | 2017 | Hard-clip every weight to [-0.01, 0.01] | Arjovsky, Chintala, Bottou (arXiv:1701.07875) |
| Gradient penalty (WGAN-GP) | 2017 | Soft penalty `λ · (‖∇f‖ - 1)^2` | Gulrajani et al. (arXiv:1704.00028) |
| Lipschitz penalty (WGAN-LP) | 2017 | One-sided penalty `λ · max(0, ‖∇f‖ - 1)^2` | Petzka et al. (arXiv:1709.08894) |
| Consistency term (WGAN-CT) | 2018 | Augment GP with a Lipschitz consistency penalty between perturbed pairs | Wei et al., ICLR 2018 |
| Spectral normalisation (SN-GAN) | 2018 | Divide each weight matrix by its spectral norm via power iteration | Miyato, Kataoka, Koyama, Yoshida (arXiv:1802.05957) |
| Cramér distance (Cramér GAN) | 2017 | Replace Wasserstein with the Cramér distance to fix biased gradients | Bellemare et al. (arXiv:1705.10743) |
Spectral normalisation in particular has had a long afterlife. By dividing each weight matrix by an estimate of its largest singular value, it bounds the Lipschitz constant of the entire network without requiring any extra term in the loss. Miyato et al. (2018) showed it stabilised both standard hinge-loss GANs and WGANs on CIFAR-10, STL-10, and ImageNet. Brock et al. used it as a core ingredient of BigGAN (arXiv:1809.11096), which at ICLR 2019 reached an Inception Score of 166.3 and FID of 9.6 on 128x128 ImageNet, and StyleGAN-family architectures from NVIDIA also use spectral-norm-style constraints.
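The core operation, estimating the largest singular value by power iteration and dividing by it, can be sketched in pure Python. This illustrates the idea only and is not the SN-GAN implementation: it runs the iteration to convergence on a fixed matrix from a fixed all-ones start vector, whereas practical implementations typically run very few iterations per training step and carry the vector over between steps. The helper names are hypothetical.

```python
import math


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalise(vec):
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def spectral_norm(matrix, iterations=50):
    """Estimate the largest singular value of `matrix` by power iteration."""
    matrix_t = transpose(matrix)
    u = normalise([1.0] * len(matrix))  # fixed start vector (an assumption)
    for _ in range(iterations):
        v = normalise(matvec(matrix_t, u))  # right singular vector estimate
        u = normalise(matvec(matrix, v))    # left singular vector estimate
    return sum(a * b for a, b in zip(u, matvec(matrix, v)))  # u^T W v


def spectrally_normalise(matrix, iterations=50):
    """Return W / sigma(W), whose largest singular value is approximately 1."""
    sigma = spectral_norm(matrix, iterations)
    return [[a / sigma for a in row] for row in matrix]


weights = [[3.0, 0.0], [0.0, 1.0]]  # singular values 3 and 1
sigma = spectral_norm(weights)
normalised = spectrally_normalise(weights)
print(sigma, spectral_norm(normalised))
```

Since a linear layer's Lipschitz constant with respect to the L2 norm is its largest singular value, applying this to every layer and using 1-Lipschitz activations bounds the whole network's Lipschitz constant by 1.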
Why does weight clipping enforce, even approximately, a 1-Lipschitz constraint? A feedforward network with ReLU or other 1-Lipschitz activations has Lipschitz constant bounded by the product of the operator norms of its weight matrices. Bounding every weight to [-c, c] bounds the operator norm of an m × n weight matrix by its Frobenius norm, which is at most c · sqrt(m · n), and this in turn bounds the network's overall Lipschitz constant. The bound is loose, the constant changes with depth, and the network's actual Lipschitz constant after clipping is usually much smaller than 1. That looseness is exactly the source of the capacity loss Gulrajani et al. document.
Why does the gradient penalty work? Proposition 1 of Gulrajani et al. (2017) says that the optimal critic in the Kantorovich-Rubinstein dual has unit gradient norm almost everywhere along the lines connecting samples drawn from P_r and P_g. Penalising deviations from unit norm at sampled interpolation points is therefore a soft surrogate for the constraint that the optimum already satisfies. The penalty does not strictly enforce 1-Lipschitz everywhere, but it pulls the critic toward a function family in which the dual is well-approximated.
More broadly, the WGAN framework connects deep generative modelling to optimal transport theory and the classical Monge-Kantorovich problem, which goes back to Gaspard Monge in 1781 and Leonid Kantorovich's 1942 reformulation. Generators end up optimising a transport cost between latent noise and the data manifold, and the critic plays the role of a Kantorovich potential.
The practical reasons WGAN and especially WGAN-GP became so widely used can be enumerated cleanly. Stability across architectures is the headline benefit: the original paper trained the same WGAN setup on DCGAN backbones, vanilla MLPs, and architectures without batch normalisation, all without divergence. Mode collapse is reduced substantially compared to the JS-divergence formulation, although it is not eliminated. The critic loss correlates with sample quality in a way the original GAN's discriminator loss does not, which is invaluable for monitoring runs and tuning hyperparameters. And the framework has firm theoretical grounding in optimal transport, with non-trivial theorems about continuity and convergence rather than the ad hoc justifications that tend to accompany other GAN tricks.
WGAN is not a free lunch. Several limitations have been documented since 2017:
- […] λ than the two-sided penalty in WGAN-GP, suggesting that small changes to how the constraint is encoded matter.

WGAN, and especially WGAN-GP, served as the default loss function and stability baseline for nearly every major GAN paper between 2017 and 2020. A short list of architectures and methods that built on or compared against WGAN-GP gives a sense of the breadth.
| Model / paper | Year | Relationship to WGAN | Reference |
|---|---|---|---|
| Progressive GAN (PG-GAN) | 2018 | Used WGAN-GP loss for training a progressively grown generator on 1024x1024 faces | Karras et al., ICLR 2018 (arXiv:1710.10196) |
| SN-GAN | 2018 | Spectral normalisation as a cleaner Lipschitz bound on the discriminator, often paired with hinge loss | Miyato et al., ICLR 2018 (arXiv:1802.05957) |
| Self-Attention GAN (SAGAN) | 2018 | Hinge loss plus spectral norm on both networks; explicit successor in the Lipschitz lineage | Zhang et al., ICML 2019 (arXiv:1805.08318) |
| BigGAN | 2019 | Spectral normalisation applied to both generator and discriminator at large scale | Brock, Donahue, Simonyan (arXiv:1809.11096) |
| StyleGAN family | 2019-2021 | Uses non-saturating logistic loss with R1 gradient penalty inspired by WGAN-GP and Mescheder et al. | Karras et al., NVIDIA |
| Cramér GAN | 2017 | Replaces the Wasserstein metric with the Cramér distance to fix biased sample gradients | Bellemare et al. (arXiv:1705.10743) |
| Sliced Wasserstein methods | 2018+ | Approximate Wasserstein distance via 1D projections to avoid the Lipschitz constraint entirely | Various, including SWGAN |
Beyond GANs themselves, the paper opened the door to a large body of optimal-transport-based machine-learning research. Sinkhorn divergences, Wasserstein autoencoders, Wasserstein gradient flows for sampling, and optimal-transport regularisers in domain adaptation all share intellectual lineage with the WGAN line.
The original PyTorch reference implementation by Martin Arjovsky lives at martinarjovsky/WassersteinGAN on GitHub. It uses the DCGAN architecture, trains with RMSProp at learning rate 5e-5, and applies weight clipping to [-0.01, 0.01]. The Improved-WGAN-GP code by Gulrajani at igul222/improved_wgan_training is the canonical TensorFlow implementation of WGAN-GP, including ResNet variants and the language-model relaxation. Numerous third-party reimplementations exist in PyTorch Lightning, Keras, and JAX, including community-maintained examples on Hugging Face's model hub. The MathWorks Deep Learning Toolbox documentation includes a worked WGAN-GP tutorial as one of its canonical generative-model recipes.
For sanity checking the loss, the original paper suggests one practical recipe: plot the critic's Wasserstein estimate E[f_w(x_real)] - E[f_w(G_θ(z))] over training. This is the quantity the critic maximises, or equivalently the negative of the critic loss when that loss is written in the form to be minimised. It tracks the Wasserstein distance between the real and generated distributions and decreases roughly monotonically when training is healthy. If it stops decreasing or oscillates, the critic is undertrained or the constraint is poorly enforced.
From 2020 onward, diffusion models such as denoising diffusion probabilistic models (Ho et al., 2020), score-based generative models (Song and Ermon, 2019; Song et al., 2021), and Stable Diffusion (Rombach et al., 2022) overtook GANs as the dominant approach to image synthesis on most benchmarks. Autoregressive image transformers and masked-token models like MaskGIT and Muse picked up the rest. WGAN's place in 2025-era machine learning is therefore mostly historical and pedagogical, with one important exception: in domains where one-shot inference matters more than sample diversity, such as super-resolution, real-time generation, and certain scientific simulators, GAN-style training, often with a Wasserstein or hinge loss and spectral normalisation, remains competitive.
| Model | Divergence / loss | Lipschitz / regularisation | Year | Key paper |
|---|---|---|---|---|
| Vanilla GAN | Jensen-Shannon (minimax) | None | 2014 | Goodfellow et al. (arXiv:1406.2661) |
| Non-saturating GAN | Cross-entropy with -log D(G(z)) generator | None | 2014 | Goodfellow et al. (arXiv:1406.2661) |
| DCGAN | JS-style cross entropy | Architectural priors only | 2015 | Radford, Metz, Chintala (arXiv:1511.06434) |
| WGAN | Wasserstein-1 (KR dual) | Weight clipping, c = 0.01 | 2017 | Arjovsky, Chintala, Bottou (arXiv:1701.07875) |
| WGAN-GP | Wasserstein-1 (KR dual) | Two-sided gradient penalty, λ = 10 | 2017 | Gulrajani et al. (arXiv:1704.00028) |
| WGAN-LP | Wasserstein-1 (KR dual) | One-sided Lipschitz penalty | 2017 | Petzka et al. (arXiv:1709.08894) |
| Cramér GAN | Cramér (energy) distance | Gradient penalty | 2017 | Bellemare et al. (arXiv:1705.10743) |
| SN-GAN | Hinge loss (commonly) | Spectral normalisation | 2018 | Miyato et al. (arXiv:1802.05957) |
| MMD-GAN | Maximum Mean Discrepancy | Kernel-based | 2017 | Li et al. (arXiv:1705.08584) |
| LSGAN | Least squares (Pearson chi-squared) | None | 2016 | Mao et al. (arXiv:1611.04076) |
| Hinge GAN | Hinge loss | Often combined with SN | 2017 | Lim and Ye (arXiv:1705.02894) |
| BEGAN | Auto-encoder reconstruction equilibrium | Equilibrium term | 2017 | Berthelot, Schumm, Metz (arXiv:1703.10717) |
| BigGAN | Hinge loss | Spectral normalisation, orthogonal regularisation | 2019 | Brock, Donahue, Simonyan (arXiv:1809.11096) |
| Progressive GAN | WGAN-GP | Gradient penalty, progressive growth | 2018 | Karras et al. (arXiv:1710.10196) |
| StyleGAN | Non-saturating, R1 penalty | Zero-centred gradient penalty | 2019 | Karras, Laine, Aila (arXiv:1812.04948) |
The Wasserstein distance is one element of a much older mathematical machinery. The Monge formulation, posed by Gaspard Monge in 1781, asked for a deterministic transport map T: X -> X that pushes P_r onto P_g while minimising total transport cost. Monge's problem can fail to have a solution when no deterministic map exists. Leonid Kantorovich's 1942 relaxation replaced maps with joint distributions, giving the infimum-over-couplings form used in WGAN. The Kantorovich-Rubinstein theorem, proved in the 1950s, provides the duality between the primal and the supremum-over-1-Lipschitz form. WGAN is thus, in a precise sense, a deep-learning instantiation of a 240-year-old optimisation problem.
This connection is more than historical decoration. It justifies why WGAN's loss is a metric (symmetric, non-negative, vanishing iff distributions agree), why it is well-defined on disjoint supports, and why training dynamics admit clean theoretical analysis. The framework has since been generalised through Sinkhorn divergences, Wasserstein-2 GANs (Liu et al., 2018), and unbalanced optimal transport variants for partial matching.