Wasserstein GAN (WGAN)

Updated Jun 25, 2026


A Wasserstein GAN (WGAN) is a generative adversarial network that trains its two networks to minimise the Wasserstein-1 distance, also called the Earth mover's distance, between the real data distribution and the generated distribution, instead of the Jensen-Shannon divergence used by the original GAN. It was introduced in January 2017 by Martin Arjovsky, Soumith Chintala, and Léon Bottou (arXiv:1701.07875) ^[4]. WGAN replaces the GAN's discriminator with a critic, a network constrained to be 1-Lipschitz that outputs an unbounded score rather than a probability, and this single change yields more stable training, a partial fix for mode collapse, and a loss value that actually correlates with sample quality. As the authors state in the abstract, "we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches" ^[4].

The paper was published at the International Conference on Machine Learning in 2017 ^[4]. It appeared alongside a companion theoretical paper by Arjovsky and Bottou, Towards Principled Methods for Training Generative Adversarial Networks (arXiv:1701.04862), which diagnosed why ordinary GAN training is unstable ^[3]. A few months later Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville published Improved Training of Wasserstein GANs (arXiv:1704.00028) at NeurIPS 2017, replacing the original WGAN's weight clipping with a gradient penalty ^[5]. That variant, WGAN-GP, became the canonical formulation used in most subsequent GAN papers, and the Wasserstein loss it embodies is now a standard option in deep generative modelling toolkits.

What is a Wasserstein GAN?

A Wasserstein GAN is a GAN variant whose objective is the Wasserstein-1 distance between the data distribution P_r and the generator distribution P_g, estimated through the Kantorovich-Rubinstein dual as a supremum over 1-Lipschitz functions ^[4]. In place of the original GAN's discriminator, which outputs a probability that an input is real, WGAN uses a critic f_w that outputs a real-valued score, has no sigmoid on its head, and is trained to maximise the gap E[f_w(x_real)] - E[f_w(x_fake)]. The generator is trained to close that gap. Because the Wasserstein distance is continuous and almost-everywhere differentiable in the generator parameters under mild assumptions, the critic keeps supplying a usable gradient even when the real and generated distributions do not overlap, which is precisely the case where a standard GAN's gradient collapses ^[4].

Three headline properties define the WGAN approach:

A meaningful loss. The critic's objective E[f_w(x_real)] - E[f_w(x_fake)] estimates the Wasserstein distance between P_r and P_g (up to a constant factor set by the critic's Lipschitz constant), so this estimate decreases as sample quality improves, unlike the original GAN's discriminator loss ^[4].
Train the critic to optimality. In a standard GAN, training the discriminator too well saturates it and starves the generator; in a WGAN the closer the critic is to optimal, the better the gradient the generator receives, so the critic is updated multiple times (default n_critic = 5) per generator step ^[4].
A Lipschitz constraint. The dual is only valid when the critic is 1-Lipschitz, so WGAN enforces this constraint, originally with weight clipping and later, in WGAN-GP, with a gradient penalty ^[4]^[5].

Background and motivation

The original GAN trains a generator G and a discriminator D in a minimax game whose value function, when D is optimal, is equivalent to the Jensen-Shannon (JS) divergence between the data distribution P_r and the generator distribution P_g ^[1]. Goodfellow et al. (2014) proved that the global minimum of this objective is attained when P_g = P_r and the JS divergence equals zero ^[1]. In practice, however, the training dynamics are notoriously fragile.

In early 2017 Arjovsky and Bottou (arXiv:1701.04862) gave a theoretical account of why ^[3]. When the supports of P_r and P_g lie on low-dimensional manifolds, which is the typical situation for natural images modelled by a deep generator, those supports almost surely have measure-zero overlap. In that regime the JS divergence is constant (equal to log 2), the Kullback-Leibler divergence is infinite, and the gradient of the discriminator with respect to the generator parameters either vanishes or explodes ^[3]. Three concrete failure modes followed from this analysis:

Vanishing gradients on a confident discriminator. If D becomes too good at telling real from fake, its output saturates and the generator stops receiving useful signal.
Mode collapse. The generator tends to concentrate mass on a small number of modes that fool the current discriminator, ignoring the rest of the support.
No usable training metric. Loss curves do not correlate with sample quality, so practitioners cannot tell when to stop training or compare runs.

WGAN was designed to address exactly these failures by changing the underlying distance ^[4].
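
The first failure mode can be seen in one line of calculus. The sketch below is an illustration rather than code from either paper: it compares the derivative of the original minimax generator loss log(1 - D(G(z))) with respect to the discriminator's pre-sigmoid logit against the derivative of a WGAN-style generator loss -f(G(z)) with respect to the critic score. The function names are hypothetical, and only the first link of the chain rule (loss to network output) is shown.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def minimax_generator_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)) for one fake sample; equals -sigmoid(logit)."""
    return -sigmoid(logit)

def wgan_generator_grad(score):
    """d/d(score) of -score for one fake sample; constant, whatever the score is."""
    return -1.0

# A more negative logit means the discriminator is more confident the sample is fake.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={logit:6.1f}  minimax grad={minimax_generator_grad(logit):.6f}  "
          f"wgan grad={wgan_generator_grad(logit):.1f}")
```

As the discriminator grows confident, the minimax gradient shrinks toward zero, while the unbounded critic score passes a constant-magnitude signal back to the generator.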

What is the Wasserstein-1 (Earth mover's) distance?

The Wasserstein-1 distance, also called the Earth mover's distance (EMD), between two probability measures P_r and P_g on a metric space (X, d) is defined as

W(P_r, P_g) = inf over γ in Π(P_r, P_g) of E_{(x, y) ~ γ} [ ||x - y|| ]

where Π(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g ^[4]. Each such joint γ is called a transport plan: γ(x, y) is the mass moved from point x (in P_r) to point y (in P_g), and the cost of a plan is the expected distance moved. The Wasserstein-1 distance is the minimum cost of any plan that turns P_r into P_g. The intuition behind the metric's colloquial name is that of moving piles of dirt: W(P_r, P_g) is the minimum amount of work required to reshape one distribution into the other.
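
For two small empirical distributions with equal, uniform weights, the infimum can be found by brute force, because in that special case an optimal plan can always be taken to be a one-to-one matching of points. A minimal sketch under that assumption (1-D points; the helper names are made up for illustration):

```python
from itertools import permutations

def transport_cost(xs, ys, perm):
    """Average distance moved when xs[i] is sent to ys[perm[i]]."""
    return sum(abs(x - ys[j]) for x, j in zip(xs, perm)) / len(xs)

def w1_bruteforce(xs, ys):
    """Wasserstein-1 between two equal-size, uniformly weighted 1-D samples.

    Tries every one-to-one matching, so it is only usable for a handful of points.
    """
    assert len(xs) == len(ys)
    return min(transport_cost(xs, ys, p) for p in permutations(range(len(xs))))

xs = [0.0, 4.0]
ys = [1.0, 5.0]
print("identity plan cost:", transport_cost(xs, ys, (0, 1)))
print("swapped plan cost: ", transport_cost(xs, ys, (1, 0)))
print("W1 (best plan):    ", w1_bruteforce(xs, ys))
```

The factorial search makes the point of the next section concrete: the primal form does not scale, which is why WGAN works with the dual instead.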

The key property that makes this metric useful for GAN training is that, under mild regularity assumptions on the generator, W(P_r, P_g_θ) is continuous and almost-everywhere differentiable in the generator parameters θ. Arjovsky, Chintala, and Bottou (2017) prove this in Theorem 1 of the WGAN paper ^[4]. Total variation, KL, and JS divergences fail one or both of these properties when supports do not overlap. A simple example given in the paper involves two parallel line segments in the plane, separated by a parameter θ. The Wasserstein distance between them is |θ|, which is continuous in θ and differentiable everywhere except at θ = 0, while JS, KL, and reverse-KL are all discontinuous at θ = 0 ^[4].
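
The contrast can be reproduced numerically with a simplified version of that example. The sketch below is an assumption-laden reduction, not the paper's code: it drops the coordinate that both segments share (it contributes no transport cost) and compares a point mass at 0 with a point mass at θ. The 1-D Wasserstein distance is computed by matching sorted samples, which is optimal on the real line for equal-size samples.

```python
import math

def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical samples on the real line (sorted matching)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def js_divergence(p, q):
    """Jensen-Shannon divergence between discrete distributions given as {point: mass}."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(mass * math.log(mass / m[k]) for k, mass in a.items() if mass > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

for theta in (1.0, 0.5, 0.1, 0.0):
    w = wasserstein_1d([0.0] * 4, [theta] * 4)
    js = js_divergence({0.0: 1.0}, {theta: 1.0})
    print(f"theta={theta:.1f}  W1={w:.2f}  JS={js:.4f}")
```

W1 shrinks in step with θ, whereas JS stays at log 2 for every θ ≠ 0 and drops to 0 only when the two distributions coincide, so it gives no indication of how close the generator is.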

What is Kantorovich-Rubinstein duality?

The primal infimum-over-couplings form is intractable for the high-dimensional distributions that GANs work with. Optimal transport theory supplies a dual formulation, the Kantorovich-Rubinstein duality, that turns the infimum into a supremum over 1-Lipschitz functions:

W(P_r, P_g) = sup over f with ||f||_L ≤ 1 of E_{x ~ P_r} [ f(x) ] - E_{x ~ P_g} [ f(x) ]

A function f is 1-Lipschitz when |f(x) - f(y)| ≤ ||x - y|| for all x, y ^[4]. The duality says: parameterise a 1-Lipschitz function f, push real samples up and fake samples down, and the gap converges to the Wasserstein distance.
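
A tiny numerical check of the dual, offered as a sketch rather than a general solver: it searches only over linear critics f_a(x) = a · x, which are 1-Lipschitz exactly when |a| ≤ 1. That restricted family is an assumption made for brevity; it happens to contain an optimal critic here because the fake sample is a pure shift of the real one.

```python
def dual_gap(f, real, fake):
    """E[f(real)] - E[f(fake)] for empirical samples."""
    return sum(f(x) for x in real) / len(real) - sum(f(x) for x in fake) / len(fake)

real = [1.0, 5.0]
fake = [0.0, 4.0]   # the real sample shifted left by 1, so W1 = 1

# Candidate critics f_a(x) = a * x with slopes on a grid in [-1, 1].
slopes = [i / 10 for i in range(-10, 11)]
gaps = {a: dual_gap(lambda x, a=a: a * x, real, fake) for a in slopes}

best_slope = max(gaps, key=gaps.get)
print("best slope:", best_slope, "dual gap:", gaps[best_slope])
```

The largest gap is attained at the steepest allowed slope and matches the primal distance of 1. Without the Lipschitz bound the slope, and hence the gap, could be made arbitrarily large, which is why the constraint is essential.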

WGAN replaces the GAN discriminator with a network f_w (renamed critic because it no longer outputs a probability) that approximates this supremum ^[4]. The critic outputs an unbounded scalar, has no sigmoid on its head, and is trained to maximise the dual gap. The generator is then trained to push that gap down by changing the fake distribution.

How is the original WGAN trained?

Algorithm 1 of Arjovsky, Chintala, and Bottou (2017) specifies the full training loop ^[4]. The headline choices are:

Critic loss (to be maximised): L_critic = E[f_w(x_real)] - E[f_w(G_θ(z))].
Generator loss (to be minimised): L_gen = -E[f_w(G_θ(z))].
Lipschitz constraint via weight clipping: after each critic gradient step, clamp every weight w to the interval [-c, c] with the default c = 0.01 ^[4].
Multiple critic updates per generator update: the default is n_critic = 5 ^[4].
Optimizer: RMSProp with learning rate α = 5e-5 and batch size m = 64 ^[4].
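
The choices above can be assembled into a toy training loop. The following is a minimal sketch under strong simplifying assumptions, not the paper's Algorithm 1: the data are 1-D Gaussian with mean 3, the critic is the single-weight function f_w(x) = w · x, the generator is G_θ(z) = z + θ, gradients are written out by hand, and the RMSProp update is a simplified hand-rolled version. The learning rate of 0.05 is chosen so the toy converges quickly and is not the paper's 5e-5; c = 0.01, n_critic = 5, and m = 64 follow the defaults listed above.

```python
import random

def train_toy_wgan(steps=400, n_critic=5, clip=0.01, lr=0.05, batch=64, seed=0):
    """Toy 1-D WGAN: critic f_w(x) = w * x, generator G_theta(z) = z + theta."""
    rng = random.Random(seed)
    w, theta = 0.0, 0.0
    v_w, v_theta = 0.0, 0.0      # RMSProp running averages of squared gradients
    decay, eps = 0.9, 1e-8

    for _ in range(steps):
        for _ in range(n_critic):
            real = [rng.gauss(3.0, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            # Gradient of E[f_w(real)] - E[f_w(fake)] with respect to w.
            g_w = sum(real) / batch - sum(fake) / batch
            v_w = decay * v_w + (1.0 - decay) * g_w * g_w
            w += lr * g_w / (v_w ** 0.5 + eps)   # ascent: the critic maximises the gap
            w = max(-clip, min(clip, w))         # weight clipping to [-c, c]

        # Generator loss is -E[f_w(G(z))] = -w * (E[z] + theta); its gradient in theta is -w.
        g_theta = -w
        v_theta = decay * v_theta + (1.0 - decay) * g_theta * g_theta
        theta -= lr * g_theta / (v_theta ** 0.5 + eps)
    return w, theta

w, theta = train_toy_wgan()
print(f"critic weight w = {w:+.3f}, generator shift theta = {theta:.2f} (data mean is 3.0)")
```

Because the clipped critic weight is tiny, the raw generator gradient is tiny too; the per-parameter rescaling in RMSProp is what keeps the generator moving at a usable pace in this toy.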

The authors explicitly recommend RMSProp over Adam in the original paper because they observed that Adam's momentum interacted badly with the non-stationary critic loss when weight clipping was active, leading to occasional training divergence ^[4]. With these settings WGAN trains stably across the DCGAN architectures Radford et al. used in 2015, and also across several alternative critic architectures, including a multilayer perceptron without batch normalisation that reportedly fails outright in standard GANs ^[2]^[4].

A practical benefit of the dual form is that the critic can, and indeed should, be trained to near optimality between generator updates. In ordinary GANs this is a mistake because the JS-style discriminator saturates; in WGAN the closer the critic is to optimal, the better the gradient signal the generator receives ^[4].

What are the limits of weight clipping?

The original paper itself acknowledged that weight clipping is a crude way to enforce 1-Lipschitz continuity, calling it "a clearly terrible way to enforce a Lipschitz constraint" ^[4]. In the follow-up work, Gulrajani et al. (2017) made the case sharply and identified three concrete pathologies ^[5]:

Capacity underuse. Most of the critic's weights end up sitting at one of the two clip values +c or -c, producing a binary weight distribution rather than the rich one a deep network can use. Histograms in the paper show this saturation clearly.
Vanishing or exploding gradients. Choosing c too small makes activations and gradients vanish through depth; choosing it too large makes them blow up. Tuning c is brittle.
Suboptimal Lipschitz approximation. The set of functions representable under hard weight clipping is a strict, awkward subset of the 1-Lipschitz functions, and the optimal critic is not in it.

What is WGAN-GP (gradient penalty)?

WGAN-GP is the improved WGAN of Gulrajani et al. (2017) that drops weight clipping in favour of a soft constraint added directly to the critic loss ^[5]. The trick uses the fact that a critic that achieves the supremum in the Kantorovich-Rubinstein duality has gradient norm exactly 1 almost everywhere along straight-line paths between real and fake samples. They penalise deviations from that property:

L_critic = E[f_w(G_θ(z))] - E[f_w(x_real)] + λ · E_{x̂} [ (||∇_{x̂} f_w(x̂)||_2 - 1)^2 ]

where x̂ = ε · x_real + (1 - ε) · G_θ(z) with ε sampled uniformly from [0, 1]. The loss is written here as a quantity to minimise, so its first two terms are the negative of the gap that the original WGAN critic maximises. The penalty coefficient is λ = 10 ^[5]. The interpolated samples x̂ lie on straight lines between real and fake samples, which is where an optimal critic has unit gradient norm.
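
The loss can be sketched in a few lines. This is an illustrative version under stated assumptions, not the reference implementation: the critic is any Python function from a list of floats to a float, and its input gradient is approximated with central finite differences so that the example needs only the standard library. Real implementations obtain this gradient by automatic differentiation. All function names are hypothetical.

```python
import math
import random

def grad_norm(f, x, h=1e-5):
    """L2 norm of the gradient of f at x, by central finite differences."""
    grads = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grads.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grads))

def gradient_penalty(f, real_batch, fake_batch, lam=10.0, rng=random):
    """lam * mean over pairs of (||grad f(x_hat)|| - 1)^2 at random interpolates x_hat."""
    total = 0.0
    for x_real, x_fake in zip(real_batch, fake_batch):
        eps = rng.random()   # uniform in [0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x_real, x_fake)]
        total += (grad_norm(f, x_hat) - 1.0) ** 2
    return lam * total / len(real_batch)

def wgan_gp_critic_loss(f, real_batch, fake_batch, lam=10.0, rng=random):
    """Critic loss to minimise: E[f(fake)] - E[f(real)] + gradient penalty."""
    mean_fake = sum(f(x) for x in fake_batch) / len(fake_batch)
    mean_real = sum(f(x) for x in real_batch) / len(real_batch)
    return mean_fake - mean_real + gradient_penalty(f, real_batch, fake_batch, lam, rng)

real = [[1.0, 2.0], [0.0, -1.0]]
fake = [[0.5, 0.5], [2.0, 2.0]]
unit_critic = lambda x: 0.6 * x[0] + 0.8 * x[1]   # gradient norm exactly 1
steep_critic = lambda x: 2.0 * x[0]               # gradient norm 2

print("penalty, unit-norm critic:", gradient_penalty(unit_critic, real, fake))
print("penalty, steep critic:    ", gradient_penalty(steep_critic, real, fake))
```

A critic whose gradient norm is 1 everywhere pays essentially no penalty, while one with norm 2 pays λ · (2 - 1)^2 = 10 regardless of where the interpolates fall.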

Unlike the original WGAN, WGAN-GP works well with Adam. The recommended setting is Adam with learning rate = 1e-4, β_1 = 0.0, and β_2 = 0.9, again with n_critic = 5 ^[5]. Batch normalisation in the critic is replaced by layer normalisation because batch norm couples samples within a minibatch in ways that interfere with the per-sample gradient penalty ^[5]. Gulrajani et al. report that the method "enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data" ^[5].

WGAN-GP became the de facto WGAN variant in research code from 2017 onward. The reference implementation at igul222/improved_wgan_training is widely cited ^[16], and the original PyTorch repository for plain WGAN at martinarjovsky/WassersteinGAN is still used pedagogically ^[15].

How does WGAN differ from a standard GAN?

The table below summarises the core differences between a standard (Jensen-Shannon) GAN, the original WGAN, and WGAN-GP.

Aspect	Standard GAN	WGAN	WGAN-GP
Objective	Jensen-Shannon divergence	Wasserstein-1 (KR dual)	Wasserstein-1 (KR dual)
Second network	Discriminator (outputs probability, has sigmoid)	Critic (real-valued score, no sigmoid)	Critic (real-valued score, no sigmoid)
Lipschitz constraint	None	Weight clipping to `[-0.01, 0.01]`	Gradient penalty, `λ = 10`
Optimizer	Adam (typical)	RMSProp, `α = 5e-5`	Adam, `lr = 1e-4`, `β_1 = 0`, `β_2 = 0.9`
Critic/disc updates per gen step	1 (usually)	`n_critic = 5`	`n_critic = 5`
Normalisation in critic	Batch norm (typical)	Batch norm allowed	Layer norm (batch norm avoided)
Loss interpretable?	No	Yes (estimates Wasserstein distance)	Yes (estimates Wasserstein distance)
Key paper	Goodfellow et al. 2014 (arXiv:1406.2661)	Arjovsky et al. 2017 (arXiv:1701.07875)	Gulrajani et al. 2017 (arXiv:1704.00028)

The practical upshot is that a standard GAN's discriminator stops giving the generator a useful gradient once it becomes confident, whereas WGAN's critic, being unbounded and 1-Lipschitz, keeps providing a smooth signal that points the generator toward the data distribution ^[3]^[4].

Other Lipschitz-enforcing techniques

The gradient penalty is only one way to encourage a critic to be 1-Lipschitz. Several alternatives, most published within a year or two of WGAN-GP, became important in their own right.

Method	Year	Lipschitz mechanism	Reference
Weight clipping (original WGAN)	2017	Hard-clip every weight to `[-0.01, 0.01]`	Arjovsky, Chintala, Bottou (arXiv:1701.07875)
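Gradient penalty (WGAN-GP)	2017	Soft penalty `λ · (||∇_x̂ f_w(x̂)||_2 - 1)^2`	Gulrajani et al. (arXiv:1704.00028)
Lipschitz penalty (WGAN-LP)	2017	One-sided penalty `λ · max(0, ||∇_x̂ f_w(x̂)||_2 - 1)^2`	Petzka, Fischer, Lukovnikov (arXiv:1709.08894)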
Consistency term (WGAN-CT)	2018	Augment GP with a Lipschitz consistency penalty between perturbed pairs	Wei et al., ICLR 2018
Spectral normalisation (SN-GAN)	2018	Divide each weight matrix by its spectral norm via power iteration	Miyato, Kataoka, Koyama, Yoshida (arXiv:1802.05957)
Cramér distance (Cramér GAN)	2017	Replace Wasserstein with the Cramér distance to fix biased gradients	Bellemare et al. (arXiv:1705.10743)

Spectral normalisation in particular has had a long afterlife. By dividing each weight matrix by an estimate of its largest singular value, it bounds the Lipschitz constant of the entire network without requiring any extra term in the loss ^[12]. Miyato et al. (2018) showed it stabilised both standard hinge-loss GANs and WGANs on CIFAR-10, STL-10, and ImageNet ^[12]. Brock et al. used it as a core ingredient of BigGAN (arXiv:1809.11096, ICLR 2019), which reached an Inception Score of 166.3 and FID of 9.6 on 128x128 ImageNet, roughly tripling the previous best Inception Score of 52.52 ^[13]. StyleGAN-family architectures from NVIDIA take a different route, regularising the discriminator with a zero-centred (R1) gradient penalty rather than spectral normalisation ^[14].
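
The core operation is short enough to sketch directly. The code below is a simplified illustration, not the SN-GAN implementation: it estimates the largest singular value of a small matrix by running power iteration on WᵀW from scratch, then divides the matrix by that estimate. Production code typically keeps a persistent vector and runs only a step or so per training iteration; that detail, and the helper names, are outside this sketch.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def spectral_norm(A, iters=100):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])   # assumes this start vector is not orthogonal to the top direction
    for _ in range(iters):
        u = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    Av = matvec(A, v)
    return math.sqrt(sum(x * x for x in Av))

def spectral_normalise(A, iters=100):
    """Return A divided by its estimated spectral norm (A must be non-zero)."""
    sigma = spectral_norm(A, iters)
    return [[a / sigma for a in row] for row in A]

W = [[3.0, 0.0], [0.0, 1.0]]
W_sn = spectral_normalise(W)
print("sigma(W)    =", spectral_norm(W))
print("sigma(W_sn) =", spectral_norm(W_sn))
```

After normalisation the layer can stretch any input by at most a factor of 1, so a stack of such layers with 1-Lipschitz activations is itself 1-Lipschitz.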

Why do these tricks work?

Why does weight clipping enforce, even approximately, a 1-Lipschitz constraint? A feedforward network with ReLU or other 1-Lipschitz activations has Lipschitz constant bounded by the product of the operator norms of its weight matrices. Bounding every weight to [-c, c] upper-bounds the operator norm of an m × n weight matrix by its Frobenius norm, which is at most c · sqrt(m · n), and the product of these per-layer bounds in turn bounds the network's overall Lipschitz constant. The bound is loose, it compounds with depth, and the Lipschitz constant the clipped network actually attains is generally not 1, so the critic recovers the Wasserstein distance only up to an unknown scale factor. That mismatch between a per-weight box constraint and the function-level constraint is the source of the capacity loss Gulrajani et al. document ^[5].
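
How such a bound behaves with width and depth is easy to tabulate. The sketch below uses the fact that the spectral norm of an m × n matrix with entries in [-c, c] is at most its Frobenius norm, which is at most c · sqrt(m · n), and multiplies that per-layer bound across a hypothetical fully connected critic. The layer shapes are made up for illustration.

```python
import math

def clipped_layer_bound(rows, cols, c):
    """Upper bound on the spectral norm of a rows x cols matrix with entries in [-c, c]."""
    return c * math.sqrt(rows * cols)

def clipped_network_bound(shapes, c=0.01):
    """Product of per-layer bounds: an upper bound on the Lipschitz constant of an
    MLP with 1-Lipschitz activations whose weights are all clipped to [-c, c]."""
    return math.prod(clipped_layer_bound(r, k, c) for r, k in shapes)

for depth in (1, 2, 4, 8):
    narrow = clipped_network_bound([(64, 64)] * depth)
    wide = clipped_network_bound([(512, 512)] * depth)
    print(f"depth={depth}  64-wide bound={narrow:.4g}  512-wide bound={wide:.4g}")
```

With c = 0.01 the bound decays geometrically with depth for 64-unit layers and grows geometrically for 512-unit layers, which matches the observation that a single clip value cannot suit every architecture.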

Why does the gradient penalty work? Proposition 1 of Gulrajani et al. (2017) says that the optimal critic in the Kantorovich-Rubinstein dual has unit gradient norm almost everywhere along the lines connecting samples drawn from P_r and P_g ^[5]. Penalising deviations from unit norm at sampled interpolation points is therefore a soft surrogate for the constraint that the optimum already satisfies. The penalty does not strictly enforce 1-Lipschitz everywhere, but it pulls the critic toward a function family in which the dual is well-approximated.

More broadly, the WGAN framework connects deep generative modelling to optimal transport theory and the classical Monge-Kantorovich problem, which goes back to Gaspard Monge in 1781 and Leonid Kantorovich's 1942 reformulation. Generators end up optimising a transport cost between latent noise and the data manifold, and the critic plays the role of a Kantorovich potential.

What are the strengths of WGAN?

The practical reasons WGAN and especially WGAN-GP became so widely used can be enumerated cleanly. Stability across architectures is the headline benefit: the original paper trained the same WGAN setup on DCGAN backbones, vanilla MLPs, and architectures without batch normalisation, all without divergence ^[4]. Mode collapse is reduced substantially compared to the JS-divergence formulation, although it is not eliminated ^[4]. The critic loss correlates with sample quality in a way the original GAN's discriminator loss does not, which is invaluable for monitoring runs and tuning hyperparameters ^[4]. And the framework has firm theoretical grounding in optimal transport, with non-trivial theorems about continuity and convergence rather than the ad hoc justifications that tend to accompany other GAN tricks ^[4].

What are the limitations and critiques of WGAN?

WGAN is not a free lunch. Several limitations have been documented since 2017:

Slower convergence in some settings. Because the critic is updated five times per generator step and uses a smaller effective learning rate, wall-clock training time per generator update is higher than for an Adam-tuned non-saturating GAN ^[4].
Critic tuning still required. The Lipschitz constant the critic actually achieves depends on architecture, on the gradient-penalty coefficient, and on the interpolation distribution. Defaults work, but they are not insensitive ^[5].
The Lipschitz constraint is never exactly satisfied. Both weight clipping and the gradient penalty are surrogates. Petzka, Fischer, and Lukovnikov (2017) showed that a one-sided penalty (WGAN-LP) is more robust to the choice of λ than the two-sided penalty in WGAN-GP, suggesting that small changes to how the constraint is encoded matter ^[8].
Empirical comparisons are less clear-cut than originally hoped. Lucic, Kurach, Michalski, Gelly, and Bousquet (2018), in Are GANs Created Equal? A Large-Scale Study (NeurIPS 2018, arXiv:1711.10337), ran controlled hyperparameter sweeps on seven GAN variants (minimax GAN, non-saturating GAN, LSGAN, WGAN, WGAN-GP, DRAGAN, and BEGAN), evaluated by FID. They reported that with sufficient hyperparameter optimisation and random restarts, most variants reached similar scores; they did not find evidence that any algorithm consistently outperformed the non-saturating GAN ^[9].
Convergence is not guaranteed. Mescheder, Geiger, and Nowozin (2018) in Which Training Methods for GANs do actually Converge? (ICML 2018, arXiv:1801.04406) showed by counterexample that WGAN and WGAN-GP with a finite number of critic updates per generator update do not always converge to the equilibrium when distributions are not absolutely continuous. They proposed simplified zero-centred gradient penalties (R1 and R2) for which they prove local convergence ^[11].
The metric now used to score WGAN-GP postdates it. Heusel et al. (2017), in GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NIPS 2017, arXiv:1706.08500), introduced the Fréchet Inception Distance and the two time-scale update rule (TTUR). They report that TTUR improves WGAN-GP training and use FID as the headline evaluation metric ^[7].
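
The penalties mentioned in this list differ only in how they treat the critic's gradient norm. A minimal sketch comparing them as functions of that norm follows; the coefficient values are illustrative defaults, and the R1 form assumed here is (γ/2) · ||∇D(x)||² evaluated on real samples.

```python
def two_sided_penalty(grad_norm, lam=10.0):
    """WGAN-GP: penalise any deviation of the gradient norm from 1."""
    return lam * (grad_norm - 1.0) ** 2

def one_sided_penalty(grad_norm, lam=10.0):
    """WGAN-LP: penalise only gradient norms above 1."""
    return lam * max(0.0, grad_norm - 1.0) ** 2

def r1_penalty(grad_norm, gamma=10.0):
    """Zero-centred R1: penalise the squared gradient norm itself (on real data)."""
    return 0.5 * gamma * grad_norm ** 2

for g in (0.0, 0.5, 1.0, 2.0):
    print(f"||grad||={g:.1f}  two-sided={two_sided_penalty(g):5.2f}  "
          f"one-sided={one_sided_penalty(g):5.2f}  R1={r1_penalty(g):5.2f}")
```

The two-sided form pushes flat regions of the critic up toward unit slope, the one-sided form leaves them alone, and R1 pulls the gradient toward zero instead of one.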

How influential was WGAN?

WGAN, and especially WGAN-GP, served as a default loss function and stability baseline for many major GAN papers between 2017 and 2020. A short list of architectures and methods that built on or compared against WGAN-GP gives a sense of the breadth.

Model / paper	Year	Relationship to WGAN	Reference
Progressive GAN (PG-GAN)	2018	Used WGAN-GP loss for training a progressively grown generator on `1024x1024` faces	Karras et al., ICLR 2018 (arXiv:1710.10196)
SN-GAN	2018	Spectral normalisation as a cleaner Lipschitz bound on the discriminator, often paired with hinge loss	Miyato et al., ICLR 2018 (arXiv:1802.05957)
Self-Attention GAN (SAGAN)	2018	Hinge loss plus spectral norm on both networks; explicit successor in the Lipschitz lineage	Zhang et al., ICML 2019 (arXiv:1805.08318)
BigGAN	2019	Spectral normalisation applied to both generator and discriminator at large scale	Brock, Donahue, Simonyan (arXiv:1809.11096)
StyleGAN family	2019-2021	Uses non-saturating logistic loss with R1 gradient penalty inspired by WGAN-GP and Mescheder et al.	Karras et al., NVIDIA
Cramér GAN	2017	Replaces the Wasserstein metric with the Cramér distance to fix biased sample gradients	Bellemare et al. (arXiv:1705.10743)
Sliced Wasserstein methods	2018+	Approximate Wasserstein distance via 1D projections to avoid the Lipschitz constraint entirely	Various, including SWGAN

Beyond GANs themselves, the paper opened the door to a large body of optimal-transport-based machine-learning research. Sinkhorn divergences, Wasserstein autoencoders, Wasserstein gradient flows for sampling, and optimal-transport regularisers in domain adaptation all share intellectual lineage with the WGAN line.

Implementations

The original PyTorch reference implementation by Martin Arjovsky lives at martinarjovsky/WassersteinGAN on GitHub ^[15]. It uses the DCGAN architecture, trains with RMSProp at learning rate 5e-5, and applies weight clipping to [-0.01, 0.01] ^[4]^[15]. The Improved-WGAN-GP code by Gulrajani at igul222/improved_wgan_training is the canonical TensorFlow implementation of WGAN-GP, including ResNet variants and the language-model relaxation ^[16]. Numerous third-party reimplementations exist in PyTorch Lightning, Keras, and JAX, including community-maintained examples on Hugging Face's model hub. The MathWorks Deep Learning Toolbox documentation includes a worked WGAN-GP tutorial as one of its canonical generative-model recipes ^[17].

For sanity checking the loss, the original paper suggests one practical recipe: plot the Wasserstein estimate E[f_w(x_real)] - E[f_w(G_θ(z))], which is the critic objective defined above (equivalently, the negative of a critic loss written for minimisation), over training. It tracks the Wasserstein distance between the real and generated distributions and decreases, with some noise, when training is healthy. If it stops decreasing or oscillates, the critic may be undertrained or the constraint poorly enforced ^[4].
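
A monitoring helper along these lines might look as follows. It is a sketch with hypothetical names and made-up score values, using the convention that the estimate is the mean critic score on real samples minus the mean score on generated samples, smoothed with a trailing moving average before plotting.

```python
def wasserstein_estimate(real_scores, fake_scores):
    """Mean critic score on real samples minus mean critic score on fakes."""
    return sum(real_scores) / len(real_scores) - sum(fake_scores) / len(fake_scores)

def moving_average(values, window=3):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up critic scores logged at three checkpoints: (real batch, fake batch).
checkpoints = [
    ([2.0, 4.0], [-1.0, -1.0]),
    ([2.0, 2.0], [0.0, 0.0]),
    ([1.0, 1.0], [1.0, 1.0]),
]
history = [wasserstein_estimate(r, f) for r, f in checkpoints]
print("raw estimates:", history)
print("smoothed:     ", moving_average(history, window=2))
```

Smoothing matters in practice because each estimate comes from a single minibatch and a critic that is only approximately optimal.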

What replaced WGAN?

From 2020 onward, diffusion models such as denoising diffusion probabilistic models (Ho et al., 2020), score-based generative models (Song and Ermon, 2019; Song et al., 2021), and Stable Diffusion (Rombach et al., 2022) overtook GANs as the dominant approach to image synthesis on most benchmarks. Autoregressive image transformers and masked-token models like MaskGIT and Muse picked up the rest. WGAN's place in 2025-era machine learning is therefore mostly historical and pedagogical, with one important exception: in domains where one-shot inference matters more than sample diversity, such as super-resolution, real-time generation, and certain scientific simulators, GAN-style training, often with a Wasserstein or hinge loss and spectral normalisation, remains competitive.

Comparison with related GAN models

Model	Divergence / loss	Lipschitz / regularisation	Year	Key paper
Vanilla GAN	Jensen-Shannon (minimax)	None	2014	Goodfellow et al. (arXiv:1406.2661)
Non-saturating GAN	Cross-entropy with `-log D(G(z))` generator	None	2014	Goodfellow et al. (arXiv:1406.2661)
DCGAN	JS-style cross entropy	Architectural priors only	2015	Radford, Metz, Chintala (arXiv:1511.06434)
WGAN	Wasserstein-1 (KR dual)	Weight clipping, `c = 0.01`	2017	Arjovsky, Chintala, Bottou (arXiv:1701.07875)
WGAN-GP	Wasserstein-1 (KR dual)	Two-sided gradient penalty, `λ = 10`	2017	Gulrajani et al. (arXiv:1704.00028)
WGAN-LP	Wasserstein-1 (KR dual)	One-sided Lipschitz penalty	2017	Petzka et al. (arXiv:1709.08894)
Cramér GAN	Cramér (energy) distance	Gradient penalty	2017	Bellemare et al. (arXiv:1705.10743)
SN-GAN	Hinge loss (commonly)	Spectral normalisation	2018	Miyato et al. (arXiv:1802.05957)
MMD-GAN	Maximum Mean Discrepancy	Kernel-based	2017	Li et al. (arXiv:1705.08584)
LSGAN	Least squares (Pearson chi-squared)	None	2016	Mao et al. (arXiv:1611.04076)
Hinge GAN	Hinge loss	Often combined with SN	2017	Lim and Ye (arXiv:1705.02894)
BEGAN	Auto-encoder reconstruction equilibrium	Equilibrium term	2017	Berthelot, Schumm, Metz (arXiv:1703.10717)
BigGAN	Hinge loss	Spectral normalisation, orthogonal regularisation	2019	Brock, Donahue, Simonyan (arXiv:1809.11096)
Progressive GAN	WGAN-GP	Gradient penalty, progressive growth	2018	Karras et al. (arXiv:1710.10196)
StyleGAN	Non-saturating, R1 penalty	Zero-centred gradient penalty	2019	Karras, Laine, Aila (arXiv:1812.04948)

Connection to optimal transport

The Wasserstein distance is one element of a much older mathematical machinery. The Monge formulation, posed by Gaspard Monge in 1781, asked for a deterministic transport map T: X -> X that pushes P_r onto P_g while minimising total transport cost. Monge's problem can fail to have a solution when no deterministic map exists. Leonid Kantorovich's 1942 relaxation replaced maps with joint distributions, giving the infimum-over-couplings form used in WGAN. The Kantorovich-Rubinstein theorem, proved in the 1950s, provides the duality between the primal and the supremum-over-1-Lipschitz form. WGAN is thus, in a precise sense, a deep-learning instantiation of a 240-year-old optimisation problem.

This connection is more than historical decoration. It justifies why WGAN's loss is a metric (symmetric, non-negative, vanishing iff distributions agree), why it is well-defined on disjoint supports, and why training dynamics admit clean theoretical analysis. The framework has since been generalised through Sinkhorn divergences, Wasserstein-2 GANs (Liu et al., 2018), and unbalanced optimal transport variants for partial matching.

ELI5: What is a Wasserstein GAN, simply?

Imagine two piles of sand, one shaped like the real data and one shaped like the fake data a generative model produces. The Wasserstein distance is the least amount of shovelling needed to reshape the fake pile into the real one. A standard GAN uses a judge that just says "real" or "fake," and once that judge becomes very confident it stops giving the artist any helpful feedback. A Wasserstein GAN instead uses a critic that scores how far apart the two sand piles are, and that score keeps giving the artist a clear direction to improve even when the piles look very different. Because the critic's score is a real measurement of distance, the training curve goes down as the pictures get better, which makes the whole process easier to follow and tune ^[4].

References

1. Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. *Generative Adversarial Nets*. Advances in Neural Information Processing Systems 27 (NIPS 2014). arXiv:1406.2661.
2. Radford, Alec, Luke Metz, and Soumith Chintala. *Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks*. ICLR 2016. arXiv:1511.06434.
3. Arjovsky, Martin, and Léon Bottou. *Towards Principled Methods for Training Generative Adversarial Networks*. ICLR 2017. arXiv:1701.04862.
4. Arjovsky, Martin, Soumith Chintala, and Léon Bottou. *Wasserstein GAN*. International Conference on Machine Learning, 2017. arXiv:1701.07875.
5. Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. *Improved Training of Wasserstein GANs*. NeurIPS 2017. arXiv:1704.00028.
6. Bellemare, Marc G., Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. *The Cramer Distance as a Solution to Biased Wasserstein Gradients*. 2017. arXiv:1705.10743.
7. Heusel, Martin, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. *GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium*. NIPS 2017. arXiv:1706.08500.
8. Petzka, Henning, Asja Fischer, and Denis Lukovnikov. *On the Regularization of Wasserstein GANs*. ICLR 2018. arXiv:1709.08894.
9. Lucic, Mario, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. *Are GANs Created Equal? A Large-Scale Study*. NeurIPS 2018. arXiv:1711.10337.
10. Karras, Tero, Timo Aila, Samuli Laine, and Jaakko Lehtinen. *Progressive Growing of GANs for Improved Quality, Stability, and Variation*. ICLR 2018. arXiv:1710.10196.
11. Mescheder, Lars, Andreas Geiger, and Sebastian Nowozin. *Which Training Methods for GANs do actually Converge?* ICML 2018. arXiv:1801.04406.
12. Miyato, Takeru, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. *Spectral Normalization for Generative Adversarial Networks*. ICLR 2018. arXiv:1802.05957.
13. Brock, Andrew, Jeff Donahue, and Karen Simonyan. *Large Scale GAN Training for High Fidelity Natural Image Synthesis*. ICLR 2019. arXiv:1809.11096.
14. Karras, Tero, Samuli Laine, and Timo Aila. *A Style-Based Generator Architecture for Generative Adversarial Networks*. CVPR 2019. arXiv:1812.04948.
15. Original WGAN reference implementation: github.com/martinarjovsky/WassersteinGAN.
16. Reference WGAN-GP implementation: github.com/igul222/improved_wgan_training.
17. MathWorks. *Train Wasserstein GAN with Gradient Penalty (WGAN-GP)*. MATLAB Deep Learning Toolbox documentation. mathworks.com.
