Wasserstein loss is a loss function derived from the Wasserstein distance, a mathematical metric rooted in optimal transport theory that quantifies the minimum cost of transforming one probability distribution into another. Also known as the earth mover's distance (EMD), the Wasserstein distance measures how much "work" is required to reshape one distribution to match another, where work is defined as the amount of probability mass moved multiplied by the distance it travels. In machine learning, Wasserstein loss is best known for its role in training Wasserstein Generative Adversarial Networks (WGANs), where it replaces the Jensen-Shannon divergence used in standard GANs. The metric provides more stable gradients, reduces mode collapse, and yields meaningful training curves that correlate with sample quality. Beyond generative modeling, the Wasserstein distance has found applications in domain adaptation, distributional reinforcement learning, natural language processing, and computer vision.
Imagine you have two piles of sand in different shapes. One pile represents real pictures, and the other pile represents pictures made by a computer. You want to know how different these two piles are. The Wasserstein loss is like asking: "How much sand do I need to move, and how far do I need to carry each shovelful, to reshape my computer's pile so it looks exactly like the real pile?" If the piles are already pretty similar, you only need to move a little sand a short distance, so the score is low. If they are very different, you need to move a lot of sand a long way, so the score is high. The computer uses this score to learn how to make better pictures, trying to make the score as low as possible.
The mathematical foundations of the Wasserstein distance trace back to 1781, when French mathematician Gaspard Monge formulated the optimal transport problem. Monge considered the problem of moving a pile of soil (deblais) to fill an excavation (remblais) while minimizing the total transportation cost. His formulation required a deterministic mapping: each unit of mass at a source location must be sent to exactly one target location, with no splitting allowed. Formally, the Monge problem seeks a transport map T that minimizes:
$$\inf_T \int \|x - T(x)\|^p \, dP(x)$$
subject to the constraint that T pushes the source distribution P forward to the target distribution Q. This formulation, while intuitive, has a significant limitation: a deterministic map may not exist when the source distribution contains point masses (discrete distributions), because mass splitting is forbidden.
In 1942, Soviet mathematician and economist Leonid Kantorovich reformulated the problem by relaxing the deterministic mapping requirement. Instead of requiring each source point to map to a single destination, Kantorovich allowed probabilistic transport plans where mass at a source point can be split across multiple destinations. This relaxation transforms the problem into a linear programming problem, which always has a solution.
Kantorovich's formulation seeks a joint distribution (coupling) whose marginals match the source and target distributions, while minimizing the expected transportation cost. This work, originally motivated by economic production planning, earned Kantorovich the 1975 Nobel Prize in Economics (shared with Tjalling Koopmans).
The metric is named after Leonid Vaserstein (transliterated as "Wasserstein" in German), who studied it in the context of Markov processes in 1969. The name "Wasserstein distance" was coined by Roland Dobrushin in 1970. However, since Kantorovich defined the metric 27 years earlier, some scholars advocate using the term "Kantorovich metric" or "Kantorovich-Rubinstein metric" to reflect historical priority more accurately. In the computer science community, the W1 distance is commonly called the earth mover's distance, a term popularized by Yossi Rubner and colleagues in 1998 for content-based image retrieval.
Let (M, d) be a metric space, and let P and Q be two probability distributions on M with finite p-th moments. The Wasserstein-p distance between P and Q is defined as:
$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \iint d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$
where:

- Gamma(P, Q) denotes the set of all couplings of P and Q, that is, joint distributions on M x M whose marginals are P and Q;
- d(x, y) is the ground metric, the cost of moving one unit of mass from x to y;
- p >= 1 is the order of the distance.

The infimum is taken over all valid couplings. The coupling gamma* that achieves this infimum is called the optimal transport plan.
| Variant | Cost function | Also known as | Key properties |
|---|---|---|---|
| W1 | d(x, y) | Earth mover's distance, Kantorovich-Rubinstein metric | Has a dual formulation via Lipschitz functions; used in WGANs |
| W2 | d(x, y)^2 | Quadratic Wasserstein distance | Has closed-form solution for Gaussian distributions; used in Frechet Inception Distance |
| W_infinity | ess sup d(x, y) | Chebyshev transport distance | Uses essential supremum instead of Lp norm |
For one-dimensional distributions with cumulative distribution functions F_P and F_Q, the Wasserstein-p distance has a convenient closed-form expression using quantile functions (inverse CDFs):
$$W_p(P, Q) = \left( \int_0^1 |F_P^{-1}(q) - F_Q^{-1}(q)|^p \, dq \right)^{1/p}$$
This makes computation straightforward in the one-dimensional case, as it reduces to comparing the sorted values of the two distributions.
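For empirical distributions with the same number of samples, the quantile formula reduces to comparing sorted samples directly. A minimal NumPy sketch (scipy.stats.wasserstein_distance covers the general weighted, unequal-size case):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """Wasserstein-p distance between two 1-D empirical distributions with the
    same number of samples, via the sorted-samples (quantile) formula."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    assert x_sorted.shape == y_sorted.shape, "equal sample sizes assumed for simplicity"
    return np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=10_000)
b = rng.normal(2.0, 1.0, size=10_000)
print(wasserstein_1d(a, b, p=1))  # close to 2.0, the shift between the two Gaussians
```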
When both distributions are multivariate Gaussians, P = N(m1, C1) and Q = N(m2, C2), the squared W2 distance has a closed-form solution:
$$W_2^2(P, Q) = \|m_1 - m_2\|_2^2 + \text{tr}\left(C_1 + C_2 - 2(C_2^{1/2} C_1 C_2^{1/2})^{1/2}\right)$$
The trace term in this expression is the squared Bures metric between the two covariance matrices. This formula is the basis for the Frechet Inception Distance (FID), a widely used metric for evaluating generative models.
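The closed form translates directly into code. The sketch below, which uses scipy.linalg.sqrtm for the matrix square roots, is the same computation that underlies FID; the example Gaussians are arbitrary:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, C1, m2, C2):
    """Squared W2 distance between N(m1, C1) and N(m2, C2) via the closed form."""
    sqrt_C2 = sqrtm(C2)
    # Bures term: tr(C1 + C2 - 2 * (C2^{1/2} C1 C2^{1/2})^{1/2})
    cross = np.real(sqrtm(sqrt_C2 @ C1 @ sqrt_C2))  # drop tiny imaginary round-off
    return float(np.sum((np.asarray(m1) - np.asarray(m2)) ** 2)
                 + np.trace(C1 + C2 - 2.0 * cross))

m1, C1 = np.zeros(2), np.eye(2)
m2, C2 = np.array([3.0, 0.0]), 2.0 * np.eye(2)
print(gaussian_w2_squared(m1, C1, m2, C2))  # 9 + (6 - 4*sqrt(2)) ≈ 9.343
```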
Computing the Wasserstein distance directly through the primal (coupling) formulation requires solving a large optimization problem over the space of joint distributions, which is computationally expensive. The Kantorovich-Rubinstein duality theorem provides an equivalent but more tractable formulation for the W1 distance.
The dual form states:
$$W_1(P, Q) = \sup_{\|f\|_L \leq 1} \left[ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right]$$
where the supremum is taken over all 1-Lipschitz functions f. A function f is 1-Lipschitz if |f(x) - f(y)| <= d(x, y) for all x, y in the space. The function that achieves the supremum is called the Kantorovich potential.
This duality is central to the WGAN framework. Rather than solving the primal transport problem, WGANs parameterize the function f as a neural network (called the "critic") and optimize it to approximate the supremum. The critic's output is not bounded to [0, 1] as in standard GAN discriminators; instead, it produces unbounded real-valued scores.
The duality requires the critic function to be 1-Lipschitz, meaning its output cannot change faster than its input changes. Formally, a function f: X -> R is K-Lipschitz if:
$$|f(x_1) - f(x_2)| \leq K \cdot d(x_1, x_2)$$
for all x1, x2 in X. When K = 1, the function is 1-Lipschitz. Enforcing this constraint on a neural network is one of the main practical challenges in implementing WGANs, and several techniques have been proposed to address it (discussed in the section on Lipschitz constraint enforcement).
The following table compares the Wasserstein distance with other common measures of distributional difference used in generative modeling and machine learning.
| Property | Wasserstein distance | KL divergence | Jensen-Shannon divergence | Total variation distance |
|---|---|---|---|---|
| Symmetric | Yes | No | Yes | Yes |
| Triangle inequality | Yes (true metric) | No | No | Yes |
| Handles non-overlapping supports | Yes (provides meaningful gradients) | No (undefined or infinite) | Saturates at log(2) with zero gradients | Saturates at 1 |
| Gradient behavior | Smooth, continuous gradients everywhere | Can produce infinite values | Vanishing gradients when supports are disjoint | Discontinuous |
| Sensitivity to geometry | Yes (accounts for ground metric) | No (only compares pointwise probabilities) | No | No |
| Computational cost | Higher (requires solving optimization problem) | Low (closed-form for known distributions) | Low (closed-form for known distributions) | Low |
| Use in standard GANs | No | Indirectly (via cross-entropy) | Yes (original GAN objective) | Rarely |
| Use in WGANs | Yes | No | No | No |
A key advantage of the Wasserstein distance is its sensitivity to the underlying geometry of the data space. While KL divergence and Jensen-Shannon divergence only compare the probability mass assigned to each point, the Wasserstein distance also considers how far apart the points are. This means it can meaningfully distinguish between distributions that are "almost overlapping" and distributions that are "far apart," even when both pairs have zero overlap.
For training GANs in high-dimensional spaces (such as image generation at 256x256 resolution), the generator and real data distributions are typically supported on low-dimensional manifolds that may have negligible overlap. In this regime, the Jensen-Shannon divergence saturates at its maximum value of log(2), providing zero gradient signal to the generator. The Wasserstein distance, by contrast, remains well-defined and provides useful gradients, enabling continued learning.
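A small numerical illustration of this point: for distributions with disjoint supports, the Jensen-Shannon divergence is the same no matter how far apart they are, while W1 grows with the separation. The snippet below uses SciPy's jensenshannon (which returns the square root of the JS divergence) and wasserstein_distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

# Point masses on a shared grid: one fixed, two others at different separations.
grid = np.linspace(0, 10, 1001)
p = np.zeros_like(grid); p[100] = 1.0              # mass at x = 1.0
q_near = np.zeros_like(grid); q_near[150] = 1.0    # mass at x = 1.5
q_far = np.zeros_like(grid); q_far[900] = 1.0      # mass at x = 9.0

# JS distance saturates: both pairs look equally (maximally) different.
print(jensenshannon(p, q_near), jensenshannon(p, q_far))  # both ≈ 0.833 = sqrt(ln 2)

# W1 reflects how far the mass actually has to move.
print(wasserstein_distance(grid, grid, p, q_near))   # ≈ 0.5
print(wasserstein_distance(grid, grid, p, q_far))    # ≈ 8.0
```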
The Wasserstein GAN was introduced by Martin Arjovsky, Soumith Chintala, and Leon Bottou in their 2017 paper. The key innovation was replacing the Jensen-Shannon divergence in the standard GAN objective with the W1 distance, estimated through its Kantorovich-Rubinstein dual formulation.
In a standard GAN, the discriminator D outputs a probability in [0, 1] indicating whether an input is real or fake, and the training objective is:
$$\min_G \max_D \left[ \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] \right]$$
In a WGAN, the discriminator is replaced by a critic f_w (parameterized by weights w) that outputs unbounded real values. The WGAN objective is:
$$\min_G \max_{\|f_w\|_L \leq 1} \left[ \mathbb{E}_{x \sim P_{data}}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G(z))] \right]$$
At optimality, the critic loss equals the W1 distance (up to a constant factor) between the generated distribution and the real data distribution. The generator then minimizes this estimated distance.
The WGAN training procedure differs from standard GAN training in several ways:

- The critic outputs unbounded real values rather than probabilities, so the final sigmoid layer is removed and no log terms appear in the loss.
- The critic is trained for several iterations (typically n_critic = 5) per generator update, so that its estimate of the W1 distance stays close to optimal.
- The critic's weights must be constrained so that it remains (approximately) 1-Lipschitz, originally via weight clipping.
- The original paper recommended RMSProp rather than momentum-based optimizers such as Adam, which it found destabilized training under weight clipping.
The WGAN framework offers several practical benefits:

- The critic loss is a meaningful quantity: it tracks the estimated W1 distance and empirically correlates with sample quality, so training curves can be used for monitoring and model selection.
- Training is more stable and less sensitive to architecture choices and hyperparameters than standard GAN training.
- Mode collapse is substantially reduced, because the critic can be trained to optimality without saturating the generator's gradients.
The Kantorovich-Rubinstein duality requires the critic to be a 1-Lipschitz function. Enforcing this constraint on a deep neural network is the central practical challenge in WGAN training. Several methods have been proposed, each with different trade-offs.
The original WGAN paper proposed weight clipping as a simple way to enforce the Lipschitz constraint. After each gradient update, all weights in the critic network are clamped to a fixed range [-c, c], where c is a small constant (typically 0.01). This ensures that the critic's weights remain bounded, which indirectly bounds the function's Lipschitz constant.
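In code, weight clipping amounts to clamping every critic parameter after each gradient update. A minimal PyTorch sketch (the two-layer critic is only a placeholder):

```python
import torch
import torch.nn as nn

# Placeholder critic; the clipping step is what matters here.
critic = nn.Sequential(nn.Linear(64, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

# After each critic gradient update, clamp every weight to [-c, c].
clip_value = 0.01
with torch.no_grad():
    for param in critic.parameters():
        param.clamp_(-clip_value, clip_value)
```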
Weight clipping has several known problems:

- Capacity underuse: the clipped critic tends to learn overly simple functions, with weights often concentrating at the clipping boundaries.
- Sensitivity to the clipping range: if c is too large the critic is slow to reach optimality, while if it is too small gradients can vanish as they propagate through the network.
- It enforces the Lipschitz constraint only crudely, biasing the critic toward a restricted family of functions rather than the full set of 1-Lipschitz functions.
Gulrajani, Ahmed, Arjovsky, Dumoulin, and Courville introduced the gradient penalty method in their 2017 paper "Improved Training of Wasserstein GANs." Instead of clipping weights, WGAN-GP adds a penalty term to the critic loss that encourages the gradient norm of the critic's output with respect to its input to be close to 1.
The WGAN-GP critic loss is:
$$L_{critic} = \mathbb{E}_{\tilde{x} \sim P_g}[f_w(\tilde{x})] - \mathbb{E}_{x \sim P_r}[f_w(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1\right)^2\right]$$
where:

- P_r is the real data distribution and P_g is the generator's distribution;
- P_x_hat is the distribution of points x_hat sampled uniformly along straight lines between pairs of real and generated samples;
- lambda is the gradient penalty coefficient (the paper recommends lambda = 10).
WGAN-GP offers several improvements over weight clipping:

- The critic can use its full capacity, learning richer functions instead of collapsing toward simple ones.
- There is no clipping threshold to tune, removing a hyperparameter to which training was highly sensitive.
- Gradients neither explode nor vanish as readily, yielding more stable training across a wide range of architectures, including deep residual networks.
However, WGAN-GP has its own limitations. The gradient penalty is computed only at interpolated points between real and generated samples, so it does not enforce the Lipschitz constraint globally across the entire input space. Additionally, computing the gradient penalty requires an extra forward and backward pass through the critic, increasing computational cost.
Miyato, Kataoka, Koyama, and Yoshida proposed spectral normalization in their 2018 ICLR paper. This method directly constrains the Lipschitz constant of the critic by normalizing the weight matrices by their spectral norm (largest singular value).
For each weight matrix W in the critic, spectral normalization replaces W with:
$$\bar{W} = \frac{W}{\sigma(W)}$$
where sigma(W) is the largest singular value of W. This ensures that each layer has a Lipschitz constant of at most 1, and since the composition of 1-Lipschitz functions is also 1-Lipschitz, the entire network satisfies the constraint.
Computing the exact spectral norm at every training step would be expensive, so the authors use power iteration (typically a single iteration per update) with memoized singular vectors for efficiency. Compared to WGAN-GP, spectral normalization:

- adds very little computational overhead, since no extra forward/backward pass through the critic is required;
- constrains the critic globally through its weights, rather than only at sampled interpolation points;
- does not depend on the current data batch, so the constraint is enforced uniformly throughout training.
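PyTorch provides a spectral_norm wrapper that implements this scheme, running one power iteration per forward pass by default. A minimal sketch with a placeholder critic (the architecture is illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Placeholder critic in which every linear layer is wrapped with spectral_norm.
# Each wrapped layer then has spectral norm (approximately) 1, and since
# LeakyReLU is itself 1-Lipschitz, the whole network is (approximately) 1-Lipschitz.
critic = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)

x = torch.randn(8, 784)
scores = critic(x)   # unbounded real-valued critic scores, shape (8, 1)
```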
| Method | Introduced | Constraint type | Computational overhead | Stability | Global enforcement |
|---|---|---|---|---|---|
| Weight clipping | Arjovsky et al., 2017 | Hard (weight clamping) | Minimal | Poor (sensitive to clipping range) | Yes, but crude |
| Gradient penalty (WGAN-GP) | Gulrajani et al., 2017 | Soft (penalty term in loss) | High (extra forward/backward pass) | Good | No (only at interpolated points) |
| Spectral normalization | Miyato et al., 2018 | Hard (weight matrix rescaling) | Low (single power iteration) | Good | Yes |
The following pseudocode outlines the WGAN-GP training loop:
```
Algorithm: WGAN-GP Training
Input: gradient penalty coefficient lambda, number of critic iterations n_critic,
       batch size m, Adam hyperparameters alpha, beta1, beta2

for each training iteration do:
    for t = 1, ..., n_critic do:
        Sample real batch {x_1, ..., x_m} from training data
        Sample latent batch {z_1, ..., z_m} from prior p(z)
        Generate fake batch {x_tilde_1, ..., x_tilde_m} = G(z_1), ..., G(z_m)
        Sample epsilon_1, ..., epsilon_m ~ Uniform(0, 1)
        Compute interpolated samples: x_hat_i = epsilon_i * x_i + (1 - epsilon_i) * x_tilde_i
        L_critic = (1/m) * sum(f_w(x_tilde_i)) - (1/m) * sum(f_w(x_i))
                   + lambda * (1/m) * sum((||grad(f_w(x_hat_i))||_2 - 1)^2)
        Update critic weights w using Adam on L_critic
    end for
    Sample latent batch {z_1, ..., z_m} from prior p(z)
    L_generator = -(1/m) * sum(f_w(G(z_i)))
    Update generator weights theta using Adam on L_generator
end for
```
A simplified implementation of the Wasserstein loss with gradient penalty in PyTorch:
```python
import torch
import torch.nn as nn
import torch.autograd as autograd


def wasserstein_loss_critic(critic, real_data, fake_data, lambda_gp=10):
    """Compute the WGAN-GP critic loss for a batch of real and generated samples."""
    # Wasserstein distance estimate (critic scores on real minus fake samples)
    critic_real = critic(real_data).mean()
    critic_fake = critic(fake_data).mean()
    wasserstein_distance = critic_real - critic_fake

    # Gradient penalty, evaluated at random interpolations between real and fake
    # samples (epsilon is shaped for image tensors of shape (N, C, H, W))
    batch_size = real_data.size(0)
    epsilon = torch.rand(batch_size, 1, 1, 1, device=real_data.device)
    interpolated = epsilon * real_data + (1 - epsilon) * fake_data
    interpolated.requires_grad_(True)

    critic_interpolated = critic(interpolated)
    gradients = autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,
        retain_graph=True,
    )[0]  # autograd.grad returns a tuple; take the gradient w.r.t. the inputs
    gradients = gradients.view(batch_size, -1)
    gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean()

    # Total critic loss: maximize the distance estimate, penalize gradient norms != 1
    loss = -wasserstein_distance + lambda_gp * gradient_penalty
    return loss


def wasserstein_loss_generator(critic, fake_data):
    """Compute the WGAN generator loss (maximize the critic's score on fakes)."""
    return -critic(fake_data).mean()
```
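A minimal usage sketch tying the two loss functions into one critic step and one generator step. The MLP critic and generator, batch shapes, and learning rate are placeholders; the Adam betas follow the WGAN-GP paper's recommendation (beta1 = 0, beta2 = 0.9):

```python
import torch
import torch.nn as nn

# Placeholder networks for 28x28 single-channel images; any architectures whose
# input/output shapes match the loss functions above would work the same way.
critic = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(64, 784), nn.Unflatten(1, (1, 28, 28)))

opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))

real = torch.randn(32, 1, 28, 28)            # stand-in for a batch of real images
fake = generator(torch.randn(32, 64))

# One critic update (in practice repeated n_critic times per generator update)
opt_c.zero_grad()
loss_c = wasserstein_loss_critic(critic, real, fake.detach())
loss_c.backward()
opt_c.step()

# One generator update
opt_g.zero_grad()
loss_g = wasserstein_loss_generator(critic, generator(torch.randn(32, 64)))
loss_g.backward()
opt_g.step()
```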
For discrete distributions with n support points each, computing the exact Wasserstein distance requires solving a linear program. Using the Hungarian algorithm or network simplex method, this has a time complexity of roughly O(n^3). For continuous distributions, exact computation is generally intractable.
To address the computational cost, Cuturi (2013) introduced entropy-regularized optimal transport, which adds an entropic penalty to the transport plan. The resulting optimization problem can be solved efficiently using the Sinkhorn-Knopp algorithm, an iterative matrix scaling procedure.
The Sinkhorn distance is defined as:
$$S_\epsilon(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \left[ \langle C, \gamma \rangle - \epsilon \, H(\gamma) \right]$$
where C is the cost matrix, H(gamma) is the entropy of the transport plan, and epsilon > 0 is the regularization strength. As epsilon approaches 0, the Sinkhorn distance converges to the true Wasserstein distance.
Each Sinkhorn iteration costs O(n^2), and the number of iterations needed grows as the regularization strength epsilon and the target accuracy shrink; in practice this is significantly faster than exact computation for large problems.
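The following is a minimal, numerically naive Sinkhorn sketch for two discrete histograms; practical implementations work in the log domain to remain stable for small epsilon:

```python
import numpy as np

def sinkhorn(a, b, C, epsilon=0.1, n_iters=500):
    """Minimal Sinkhorn iteration for entropy-regularized OT between histograms
    a and b with cost matrix C. Returns the transport cost of the regularized
    plan and the plan itself."""
    K = np.exp(-C / epsilon)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):                 # alternating matrix scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    gamma = u[:, None] * K * v[None, :]      # regularized transport plan
    return float(np.sum(gamma * C)), gamma

# Two histograms on the points 0..4 with squared-distance cost.
x = np.arange(5, dtype=float)
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
C = (x[:, None] - x[None, :]) ** 2
cost, plan = sinkhorn(a, b, C, epsilon=0.05)
print(cost)   # close to the exact transport cost 9.0 (= W2 squared) for small epsilon
```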
The sliced Wasserstein distance offers another computationally efficient alternative. It projects the high-dimensional distributions onto random one-dimensional subspaces, computes the (closed-form) one-dimensional Wasserstein distance for each projection, and averages the results:
$$SW_p(P, Q) = \left( \int_{S^{d-1}} W_p^p(R_\theta P, R_\theta Q) \, d\theta \right)^{1/p}$$
where R_theta denotes the Radon transform (projection) along direction theta, and the integral is over the unit sphere S^(d-1). In practice, the integral is approximated by sampling a finite number of random projections.
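In code, the Monte Carlo approximation amounts to drawing random unit vectors, projecting both sample sets onto each, and averaging the closed-form one-dimensional distances. A minimal sketch using scipy.stats.wasserstein_distance for the 1-D step:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced W1 distance between two point clouds
    X and Y, each given as an (n, d) array of samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)                       # random direction on S^(d-1)
        total += wasserstein_distance(X @ theta, Y @ theta)  # closed-form 1-D W1
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(2000, 10))
Y = rng.normal(0.5, 1.0, size=(2000, 10))
print(sliced_wasserstein(X, Y))
```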
Recent variants include:
| Variant | Year | Key idea |
|---|---|---|
| Generalized sliced Wasserstein | 2019 | Uses generalized Radon transforms beyond linear projections |
| Max-sliced Wasserstein | 2019 | Optimizes over projection directions instead of averaging |
| Tree-sliced Wasserstein | 2019 | Projects onto tree metric spaces instead of lines |
| Distributional sliced Wasserstein | 2020 | Learns a distribution over projections |
| Spherical sliced Wasserstein | 2024 | Extends to data on spherical manifolds |
The most prominent application of Wasserstein loss is in training WGANs and their variants. WGANs have been applied to image generation, text generation, music synthesis, and 3D shape generation. The training stability provided by the Wasserstein loss has made it a standard choice when training on complex, high-dimensional data where standard GAN objectives fail.
The Frechet Inception Distance, one of the most widely used metrics for evaluating generative models, is based on the W2 distance between Gaussian approximations of real and generated image feature distributions. The FID computes the W2 distance between two multivariate Gaussians fitted to the Inception network features of real and generated images, using the closed-form Gaussian Wasserstein formula.
In distributional reinforcement learning, agents model the full distribution of returns rather than just the expected value. The distributional Bellman operator is a contraction in the p-Wasserstein metric, providing a theoretical foundation for algorithms like QR-DQN (Quantile Regression DQN). QR-DQN uses quantile regression to minimize the Wasserstein distance between the predicted and target return distributions. This approach outperforms methods like C51 that minimize KL divergence instead, achieving a 33% improvement in median scores on the Atari 2600 benchmark.
Wasserstein distance has been used to measure and minimize the discrepancy between source and target domain feature distributions. By training a domain critic network to estimate the empirical Wasserstein distance and then minimizing it, models learn domain-invariant representations. This approach provides a tighter generalization bound compared to methods based on other divergence measures.
The Word Mover's Distance (WMD), introduced by Kusner et al. (2015), is a special case of the Wasserstein distance applied to text documents. WMD represents documents as distributions over word embeddings and computes the minimum cost of transforming one document's word distribution into another's. It has been used for document similarity, text classification, and automatic text evaluation (for example, the BaryScore metric).
Wasserstein distance has been applied to fairness-aware machine learning, where it quantifies the difference in prediction distributions across protected groups. Compared to KL divergence, the Wasserstein distance provides a more stable optimization target for fairness constraints and is robust to the choice of classification threshold in downstream models.
The earth mover's distance was one of the earliest applications of Wasserstein distance in computer science, used for comparing color histograms of images in content-based image retrieval systems. By treating color histograms as distributions, the EMD captures perceptual similarity more accurately than bin-by-bin comparison metrics like chi-squared distance or histogram intersection.
The Wasserstein-p distance metrizes weak convergence of probability measures (plus convergence of p-th moments). This means that a sequence of distributions converges in the Wasserstein-p metric if and only if it converges weakly and the p-th moments converge. This property makes Wasserstein distance a natural choice for problems where the topology of weak convergence is relevant.
If the underlying metric space (M, d) is complete and separable (a Polish space), then the Wasserstein space (the space of probability measures with finite p-th moments, equipped with the W_p metric) is also complete and separable. This ensures that optimization over this space is well-posed.
The W2 metric has a rich geometric structure. The space of probability measures equipped with W2 is an infinite-dimensional Riemannian manifold (informally). Geodesics in this space correspond to displacement interpolations, where mass moves along optimal transport paths. This geometric perspective, developed by Felix Otto and others, connects optimal transport to partial differential equations through the concept of gradient flows.
Despite its advantages, Wasserstein loss has several limitations:

- Computational cost: estimating the distance requires either solving an optimization problem or training a critic network toward optimality, which is more expensive than closed-form divergences.
- Approximation error: in WGANs the critic only approximates the true Kantorovich potential, and the Lipschitz constraint is enforced imperfectly, so the loss is a biased estimate of the true W1 distance.
- Sample complexity: empirical estimates of the Wasserstein distance converge slowly in high dimensions (the curse of dimensionality), which motivates the regularized and sliced variants described above.
- Minibatch bias: distances estimated from small batches are biased estimators of the distance between the full distributions.
| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 1781 | Memoire sur la theorie des deblais et des remblais | Gaspard Monge | Formulated the original optimal transport problem |
| 1942 | On the translocation of masses | Leonid Kantorovich | Relaxed formulation with transport plans; linear programming connection |
| 2013 | Sinkhorn Distances: Lightspeed Computation of Optimal Transport | Marco Cuturi | Entropy-regularized optimal transport for efficient computation |
| 2015 | From Word Embeddings to Document Distances | Matt Kusner et al. | Word Mover's Distance for text similarity |
| 2017 | Wasserstein GAN | Arjovsky, Chintala, Bottou | Introduced WGAN with weight clipping |
| 2017 | Improved Training of Wasserstein GANs | Gulrajani et al. | Gradient penalty method (WGAN-GP) |
| 2017 | A Distributional Perspective on Reinforcement Learning | Bellemare, Dabney, Munos | Distributional RL with Wasserstein distance theory |
| 2018 | Spectral Normalization for Generative Adversarial Networks | Miyato et al. | Spectral normalization for Lipschitz constraint |
| 2018 | Distributional RL with Quantile Regression | Dabney et al. | QR-DQN minimizing Wasserstein distance |