# Wasserstein Loss

> Source: https://aiwiki.ai/wiki/wasserstein_loss
> Updated: 2026-07-11
> Categories: Generative AI, Machine Learning, Mathematics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Wasserstein loss** is a [loss function](/wiki/loss_function) for training generative models that measures the distance between two probability distributions as the Wasserstein-1 distance, the minimum cost of moving probability mass to turn one distribution into the other. It is best known as the training objective of the [Wasserstein Generative Adversarial Network](/wiki/generative_adversarial_network) (WGAN), introduced by Martin Arjovsky, Soumith Chintala, and Leon Bottou in 2017, where it replaces the Jensen-Shannon divergence used in the standard [GAN](/wiki/gan).[1] Because the Wasserstein-1 distance, also called the earth mover's distance (EMD), stays well-defined and provides smooth, non-zero gradients even when the real and generated distributions barely overlap, it yields more stable training, reduces mode collapse, and produces a loss curve that correlates with sample quality.[1]

The Wasserstein distance is a true metric rooted in optimal transport theory that quantifies the minimum cost of transforming one [probability distribution](/wiki/probability_distribution) into another, where cost is the amount of probability mass moved multiplied by the distance it travels. In [machine learning](/wiki/machine_learning), Wasserstein loss is applied through the Kantorovich-Rubinstein dual formulation, in which a 1-Lipschitz "critic" network estimates the distance.[1] Beyond [generative adversarial networks](/wiki/generative_adversarial_network), the Wasserstein distance also underpins the [Frechet Inception Distance](/wiki/frechet_inception_distance),[11] distributional [reinforcement learning](/wiki/reinforcement_learning),[8] [domain adaptation](/wiki/transfer_learning),[14] [natural language processing](/wiki/natural_language_processing),[10] and [computer vision](/wiki/computer_vision).[7]

## Explain like I'm 5 (ELI5)

Imagine you have two piles of sand in different shapes. One pile represents real pictures, and the other pile represents pictures made by a computer. You want to know how different these two piles are. The Wasserstein loss is like asking: "How much sand do I need to move, and how far do I need to carry each shovelful, to reshape my computer's pile so it looks exactly like the real pile?" If the piles are already pretty similar, you only need to move a little sand a short distance, so the score is low. If they are very different, you need to move a lot of sand a long way, so the score is high. The computer uses this score to learn how to make better pictures, trying to make the score as low as possible.

## What is the Wasserstein loss?

The Wasserstein loss is the value of the Wasserstein-1 distance (W1) between a model's generated distribution and the real data distribution, used as the quantity a generative model minimizes during training.[1] The Wasserstein-1 distance answers a transport question: given a "source" distribution and a "target" distribution, what is the least total work needed to rearrange the source into the target, where work is mass moved times the distance moved? Because the answer accounts for how far apart the two distributions sit in the underlying space, the Wasserstein loss gives a graded, geometry-aware sense of similarity rather than a yes/no overlap test.

In practice the loss is not computed by solving the full transport problem directly. Instead, WGANs use the Kantorovich-Rubinstein duality, which rewrites W1 as a maximization over 1-Lipschitz functions and lets a [neural network](/wiki/neural_network) critic estimate the distance.[1] At optimality, the critic's loss equals the W1 distance (up to a constant), so the number the generator is minimizing is a meaningful estimate of how far its samples are from real data.

## Where does the Wasserstein distance come from?

### The Monge problem (1781)

The mathematical foundations of the Wasserstein distance trace back to 1781, when French mathematician Gaspard Monge formulated the optimal transport problem. Monge considered the problem of moving a pile of soil (deblais) to fill an excavation (remblais) while minimizing the total transportation cost. His formulation required a deterministic mapping: each unit of mass at a source location must be sent to exactly one target location, with no splitting allowed. Formally, the Monge problem seeks a transport map *T* that minimizes:

$$\inf_T \int \|x - T(x)\|^p \, dP(x)$$

subject to the constraint that *T* pushes the source distribution *P* forward to the target distribution *Q*. This formulation, while intuitive, has a significant limitation: a deterministic map may not exist when the source distribution contains point masses (discrete distributions), because mass splitting is forbidden.

### The Kantorovich relaxation (1942)

In 1942, Soviet mathematician and economist Leonid Kantorovich reformulated the problem by relaxing the deterministic mapping requirement.[6] Instead of requiring each source point to map to a single destination, Kantorovich allowed probabilistic transport plans where mass at a source point can be split across multiple destinations. This relaxation transforms the problem into a [linear programming](/wiki/convex_optimization) problem, which always has a solution.[6]

Kantorovich's formulation seeks a joint distribution (coupling) whose marginals match the source and target distributions, while minimizing the expected transportation cost. This work, originally motivated by economic production planning, earned Kantorovich the 1975 Nobel Prize in Economics (shared with Tjalling Koopmans).

### Naming of the Wasserstein metric

The metric is named after Leonid Vaserstein (transliterated as "Wasserstein" in German), who studied it in the context of Markov processes in 1969. The name "Wasserstein distance" was coined by Roland Dobrushin in 1970. However, since Kantorovich defined the metric 27 years earlier, some scholars advocate using the term "Kantorovich metric" or "Kantorovich-Rubinstein metric" to reflect historical priority more accurately. In the computer science community, the W1 distance is commonly called the [earth mover's distance](/wiki/earth_mover_s_distance), a term popularized by Yossi Rubner and colleagues in 1998 for content-based image retrieval.[7]

## Mathematical definition

### General Wasserstein-p distance

Let $$(M, d)$$ be a metric space, and let P and Q be two probability distributions on $$M$$ with finite p-th moments. The Wasserstein-p distance between P and Q is defined as:

$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int \int d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$

where:
- $$p \ge 1$$ is the order of the distance
- $$d(x, y)$$ is the ground metric (distance function) on the space $$M$$
- $$\Gamma(P, Q)$$ denotes the set of all joint distributions (couplings) with marginals P and Q
- $$\gamma$$ is a transport plan specifying how mass is moved from P to Q

The infimum is taken over all valid couplings. The coupling $$\gamma^*$$ that achieves this infimum is called the optimal transport plan.[4]

### Special cases

| Variant | Cost function | Also known as | Key properties |
|---------|--------------|---------------|----------------|
| $$W_1$$ | $$d(x, y)$$ | Earth mover's distance, Kantorovich-Rubinstein metric | Has a dual formulation via Lipschitz functions; used in WGANs |
| $$W_2$$ | $$d(x, y)^2$$ | Quadratic Wasserstein distance | Has closed-form solution for Gaussian distributions; used in [Frechet Inception Distance](/wiki/frechet_inception_distance) |
| $$W_\infty$$ | $$\operatorname{ess\,sup} \, d(x, y)$$ | Chebyshev transport distance | Uses essential supremum instead of $$L^p$$ norm |

### One-dimensional case

For one-dimensional distributions with cumulative distribution functions $$F_P$$ and $$F_Q$$, the Wasserstein-p distance has a convenient closed-form expression using quantile functions (inverse CDFs):

$$W_p(P, Q) = \left( \int_0^1 |F_P^{-1}(q) - F_Q^{-1}(q)|^p \, dq \right)^{1/p}$$

This makes computation straightforward in the one-dimensional case, as it reduces to comparing the sorted values of the two distributions.

### Closed-form for Gaussian distributions

When both distributions are multivariate Gaussians, $$P = \mathcal{N}(m_1, C_1)$$ and $$Q = \mathcal{N}(m_2, C_2)$$, the squared W2 distance has a closed-form solution:

$$W_2^2(P, Q) = \|m_1 - m_2\|_2^2 + \text{tr}\left(C_1 + C_2 - 2(C_2^{1/2} C_1 C_2^{1/2})^{1/2}\right)$$

The trace term in this expression is the squared Bures metric between the two covariance matrices. This formula is the basis for the [Frechet Inception Distance](/wiki/frechet_inception_distance) (FID), a widely used metric for evaluating [generative models](/wiki/generative_adversarial_network).[11]

## How does the Kantorovich-Rubinstein duality work?

Computing the Wasserstein distance directly through the primal (coupling) formulation requires solving a large optimization problem over the space of joint distributions, which is computationally expensive. The Kantorovich-Rubinstein duality theorem provides an equivalent but more tractable formulation for the W1 distance.[1]

The dual form states:

$$W_1(P, Q) = \sup_{\|f\|_L \leq 1} \left[ \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right]$$

where the supremum is taken over all 1-Lipschitz functions $$f$$. A function $$f$$ is 1-Lipschitz if $$|f(x) - f(y)| \le d(x, y)$$ for all $$x, y$$ in the space. The function that achieves the supremum is called the Kantorovich potential.

This duality is central to the WGAN framework. Rather than solving the primal transport problem, WGANs parameterize the function f as a [neural network](/wiki/neural_network) (called the "critic") and optimize it to approximate the supremum.[1] The critic's output is not bounded to [0, 1] as in standard GAN discriminators; instead, it produces unbounded real-valued scores.

### Lipschitz constraint

The duality requires the critic function to be 1-Lipschitz, meaning its output cannot change faster than its input changes. Formally, a function $$f: X \to \mathbb{R}$$ is K-Lipschitz if:

$$|f(x_1) - f(x_2)| \leq K \cdot d(x_1, x_2)$$

for all $$x_1, x_2 \in X$$. When $$K = 1$$, the function is 1-Lipschitz. Enforcing this constraint on a neural network is one of the main practical challenges in implementing WGANs, and several techniques have been proposed to address it (discussed in the section on Lipschitz constraint enforcement).

## How does the Wasserstein loss compare with other divergence measures?

The following table compares the Wasserstein distance with other common measures of distributional difference used in generative modeling and machine learning.

| Property | Wasserstein distance | [KL divergence](/wiki/kl_divergence) | Jensen-Shannon divergence | Total variation distance |
|----------|---------------------|----------------|--------------------------|-------------------------|
| Symmetric | Yes | No | Yes | Yes |
| Triangle inequality | Yes (true metric) | No | No | Yes |
| Handles non-overlapping supports | Yes (provides meaningful gradients) | No (undefined or infinite) | Saturates at $$\log(2)$$ with zero gradients | Saturates at 1 |
| Gradient behavior | Smooth, continuous gradients everywhere | Can produce infinite values | Vanishing gradients when supports are disjoint | Discontinuous |
| Sensitivity to geometry | Yes (accounts for ground metric) | No (only compares pointwise probabilities) | No | No |
| Computational cost | Higher (requires solving optimization problem) | Low (closed-form for known distributions) | Low (closed-form for known distributions) | Low |
| Use in standard GANs | No | Indirectly (via cross-entropy) | Yes (original GAN objective) | Rarely |
| Use in WGANs | Yes | No | No | No |

A key advantage of the Wasserstein distance is its sensitivity to the underlying geometry of the data space.[1] While KL divergence and Jensen-Shannon divergence only compare the probability mass assigned to each point, the Wasserstein distance also considers how far apart the points are. This means it can meaningfully distinguish between distributions that are "almost overlapping" and distributions that are "far apart," even when both pairs have zero overlap.

For training GANs in high-dimensional spaces (such as image generation at 256x256 resolution), the [generator](/wiki/generator) and real data distributions are typically supported on low-dimensional manifolds that may have negligible overlap. In this regime, the Jensen-Shannon divergence saturates at its maximum value of $$\log(2)$$, providing zero gradient signal to the generator.[1] The Wasserstein distance, by contrast, remains well-defined and provides useful gradients, enabling continued learning.

## How does the Wasserstein GAN use it?

### Original WGAN formulation

The Wasserstein GAN was introduced by Martin Arjovsky, Soumith Chintala, and Leon Bottou in a paper first submitted to arXiv on January 26, 2017.[1] The key innovation was replacing the Jensen-Shannon divergence in the standard GAN objective with the W1 distance, estimated through its Kantorovich-Rubinstein dual formulation. The authors summarized the result directly in the abstract: "we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches."[1]

In a standard GAN, the [discriminator](/wiki/discriminator) D outputs a probability in [0, 1] indicating whether an input is real or fake, and the training objective is:

$$\min_G \max_D \left[ \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] \right]$$

In a WGAN, the discriminator is replaced by a critic f_w (parameterized by weights w) that outputs unbounded real values. The WGAN objective is:

$$\min_G \max_{\|f_w\|_L \leq 1} \left[ \mathbb{E}_{x \sim P_{data}}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(G(z))] \right]$$

At optimality, the critic loss equals the W1 distance (up to a constant factor) between the generated distribution and the real data distribution.[1] The generator then minimizes this estimated distance.

### Training algorithm

The WGAN training procedure differs from standard GAN training in several ways:[1]

1. **Critic updates per generator update.** The critic is trained for multiple iterations (typically 5) per single generator update. This allows the critic to better approximate the Wasserstein distance before the generator takes a step.
2. **No sigmoid activation.** The critic's final layer has no activation function, producing unbounded outputs rather than probabilities.
3. **No logarithm in the loss.** The loss functions are simply the mean critic scores, without the logarithmic terms used in standard GANs.
4. **Optimizer choice.** The original paper recommended using RMSProp instead of Adam, as momentum-based optimizers sometimes caused instability with the Wasserstein loss.

## Why is the Wasserstein loss better than the standard GAN loss?

The WGAN framework offers several practical benefits over the original GAN objective:[1]

- **Meaningful loss metric.** The critic loss approximates the W1 distance, which correlates with the quality of generated samples. This allows practitioners to monitor training progress by watching the loss curve, unlike standard GANs where the discriminator loss is not informative.
- **Reduced mode collapse.** Standard GANs often suffer from mode collapse, where the generator produces samples from only a few modes of the data distribution. The Wasserstein loss reduces this tendency because it provides gradients that account for the full geometry of the distribution.
- **Stable gradients.** Because the Wasserstein distance provides non-zero gradients even when the generated and real distributions do not overlap, the generator always receives a useful learning signal.
- **Less sensitivity to architecture and [hyperparameters](/wiki/hyperparameter).** WGANs are more robust to the choice of network architecture and training hyperparameters compared to standard GANs.

The core reason for these improvements is that the Jensen-Shannon objective used by the standard GAN gives the generator no gradient when the real and generated distributions have disjoint supports, a common situation in high-dimensional image modeling, whereas the Wasserstein-1 distance varies smoothly with the generator's parameters and keeps the learning signal alive.[1]

## How is the 1-Lipschitz constraint enforced?

The Kantorovich-Rubinstein duality requires the critic to be a 1-Lipschitz function. Enforcing this constraint on a [deep neural network](/wiki/deep_learning) is the central practical challenge in WGAN training. Several methods have been proposed, each with different trade-offs.

### Weight clipping

The original WGAN paper proposed weight clipping as a simple way to enforce the Lipschitz constraint.[1] After each gradient update, all weights in the critic network are clamped to a fixed range $$[-c, c]$$, where c is a small constant (typically 0.01). This ensures that the critic's weights remain bounded, which indirectly bounds the function's Lipschitz constant.

Weight clipping has several known problems:[2]

- **Capacity underuse.** The weights tend to cluster at the boundary values -c and c, meaning the network does not use its full parameter space.
- **Sensitivity to c.** If c is too small, gradients vanish; if c is too large, gradients may explode and training becomes unstable.
- **Bias toward simple functions.** The clipped critic tends to learn overly simple approximations, ignoring higher-order structure in the data distribution.
- **Slow convergence.** The crude enforcement of the constraint leads to slower training compared to more refined methods.

### Gradient penalty (WGAN-GP)

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville introduced the gradient penalty method in their paper "Improved Training of Wasserstein GANs," first submitted to arXiv on March 31, 2017.[2] The paper diagnosed weight clipping as the source of WGAN's remaining instability and proposed a direct alternative: "We propose an alternative to clipping weights: penalize the norm of gradient of the critic with respect to its input."[2] Instead of clipping weights, WGAN-GP adds a penalty term to the critic loss that encourages the gradient norm of the critic's output with respect to its input to be close to 1.

The WGAN-GP critic loss is:

$$L_{critic} = \mathbb{E}_{\tilde{x} \sim P_g}[f_w(\tilde{x})] - \mathbb{E}_{x \sim P_r}[f_w(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1)^2\right]$$

where:
- $$\lambda$$ is the gradient penalty coefficient (typically 10)
- $$\hat{x}$$ is sampled uniformly along straight lines between pairs of real and generated points: $$\hat{x} = \epsilon x + (1 - \epsilon) \tilde{x}$$, with $$\epsilon \sim \mathrm{Uniform}(0, 1)$$
- The penalty encourages the gradient norm to equal 1 at these interpolated points

WGAN-GP offers several improvements over weight clipping:[2]

- It allows the critic to learn more complex functions without artificially constraining the weight space.
- It enables stable training of a wide variety of architectures with almost no hyperparameter tuning, including 101-layer [ResNets](/wiki/resnet) and language models over discrete data, and achieves high-quality generations on CIFAR-10 and LSUN bedrooms.
- It eliminates the need to tune the clipping parameter c.
- It works well with the Adam optimizer, unlike weight clipping which required RMSProp.

However, WGAN-GP has its own limitations. The gradient penalty is computed only at interpolated points between real and generated samples, so it does not enforce the Lipschitz constraint globally across the entire input space. Additionally, computing the gradient penalty requires an extra forward and backward pass through the critic, increasing computational cost.

### Spectral normalization

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida proposed spectral normalization in their 2018 ICLR paper.[3] This method directly constrains the Lipschitz constant of the critic by normalizing the weight matrices by their spectral norm (largest singular value).

For each weight matrix W in the critic, spectral normalization replaces W with:

$$\bar{W} = \frac{W}{\sigma(W)}$$

where $$\sigma(W)$$ is the largest singular value of $$W$$. This ensures that each layer has a Lipschitz constant of at most 1, and since the composition of 1-Lipschitz functions is also 1-Lipschitz, the entire network satisfies the constraint.[3]

Computing the exact spectral norm at every training step would be expensive, so the authors use power iteration (typically a single iteration per update) with memoized singular vectors for efficiency. Compared to WGAN-GP, spectral normalization:

- Has lower computational overhead (no extra forward/backward pass needed)
- Enforces the constraint globally rather than only at interpolated points
- Provides a more stable and consistent enforcement of the Lipschitz condition

### Comparison of constraint enforcement methods

| Method | Introduced | Constraint type | Computational overhead | Stability | Global enforcement |
|--------|-----------|----------------|----------------------|-----------|-------------------|
| Weight clipping | Arjovsky et al., 2017 | Hard (weight clamping) | Minimal | Poor (sensitive to clipping range) | Yes, but crude |
| Gradient penalty (WGAN-GP) | Gulrajani et al., 2017 | Soft (penalty term in loss) | High (extra forward/backward pass) | Good | No (only at interpolated points) |
| Spectral normalization | Miyato et al., 2018 | Hard (weight matrix rescaling) | Low (single power iteration) | Good | Yes |

## How is the Wasserstein loss implemented?

### Pseudocode

The following pseudocode outlines the WGAN-GP training loop:[2]

```
Algorithm: WGAN-GP Training
Input: gradient penalty coefficient lambda, number of critic iterations n_critic,
       batch size m, Adam hyperparameters alpha, beta1, beta2

for each training iteration do:
    for t = 1, ..., n_critic do:
        Sample real batch {x_1, ..., x_m} from training data
        Sample latent batch {z_1, ..., z_m} from prior p(z)
        Generate fake batch {x_tilde_1, ..., x_tilde_m} = G(z_1), ..., G(z_m)
        Sample epsilon_1, ..., epsilon_m ~ Uniform(0, 1)
        Compute interpolated samples: x_hat_i = epsilon_i * x_i + (1 - epsilon_i) * x_tilde_i
        
        L_critic = (1/m) * sum(f_w(x_tilde_i)) - (1/m) * sum(f_w(x_i))
                   + lambda * (1/m) * sum((||grad(f_w(x_hat_i))||_2 - 1)^2)
        
        Update critic weights w using Adam on L_critic
    end for
    
    Sample latent batch {z_1, ..., z_m} from prior p(z)
    L_generator = -(1/m) * sum(f_w(G(z_i)))
    Update generator weights theta using Adam on L_generator
end for
```

### PyTorch implementation sketch

A simplified implementation of the Wasserstein loss with gradient penalty in PyTorch:

```python
import torch
import torch.nn as nn
import torch.autograd as autograd

def wasserstein_loss_critic(critic, real_data, fake_data, lambda_gp=10):
    """Compute WGAN-GP critic loss."""
    # Wasserstein distance estimate
    critic_real = critic(real_data).mean()
    critic_fake = critic(fake_data).mean()
    wasserstein_distance = critic_real - critic_fake
    
    # Gradient penalty
    batch_size = real_data.size(0)
    epsilon = torch.rand(batch_size, 1, 1, 1, device=real_data.device)
    interpolated = epsilon * real_data + (1 - epsilon) * fake_data
    interpolated.requires_grad_(True)
    
    critic_interpolated = critic(interpolated)
    gradients = autograd.grad(
        outputs=critic_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(critic_interpolated),
        create_graph=True,
        retain_graph=True
    )[0]
    
    gradients = gradients.view(batch_size, -1)
    gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean()
    
    # Total critic loss
    loss = -wasserstein_distance + lambda_gp * gradient_penalty
    return loss

def wasserstein_loss_generator(critic, fake_data):
    """Compute WGAN generator loss."""
    return -critic(fake_data).mean()
```

### Key implementation details

- **No sigmoid in the critic.** The critic's output layer should not use a sigmoid or softmax activation. The output should be a raw, unbounded scalar.
- **Label convention.** In frameworks that require labels, real samples are labeled as -1 and fake samples as +1 (the opposite of standard GANs). The Wasserstein loss can then be computed as `mean(labels * critic_output)`.
- **[Batch normalization](/wiki/batch_normalization) in the critic.** WGAN-GP should not use batch normalization in the critic, because batch normalization creates dependencies between samples in a batch, which can interfere with the per-sample gradient penalty.[2] Layer normalization or instance normalization can be used instead.
- **Critic-to-generator update ratio.** The original WGAN paper recommended 5 critic updates per generator update. WGAN-GP reduced this to 1 in some experiments, though 5 remains common.

## Computational complexity and scalability

### Exact computation

For discrete distributions with n support points each, computing the exact Wasserstein distance requires solving a linear program. Using the Hungarian algorithm or network simplex method, this has a time complexity of $$O(n^3)$$. For continuous distributions, exact computation is generally intractable.

### Sinkhorn approximation

To address the computational cost, Marco Cuturi (2013) introduced entropy-regularized optimal transport, which adds an entropic penalty to the transport plan.[5] The resulting optimization problem can be solved efficiently using the Sinkhorn-Knopp algorithm, an iterative matrix scaling procedure.

The Sinkhorn distance is defined as:

$$S_\epsilon(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \left[ \langle C, \gamma \rangle + \epsilon \, H(\gamma) \right]$$

where $$C$$ is the cost matrix, $$H(\gamma)$$ is the entropy of the transport plan, and $$\epsilon > 0$$ is the regularization strength. As $$\epsilon$$ approaches 0, the Sinkhorn distance converges to the true Wasserstein distance.

The Sinkhorn algorithm achieves $$\epsilon$$-accurate solutions in $$O(n^2 / \epsilon)$$ time, which is significantly faster than exact computation for large problems.[12]

### Sliced Wasserstein distance

The sliced Wasserstein distance offers another computationally efficient alternative. It projects the high-dimensional distributions onto random one-dimensional subspaces, computes the (closed-form) one-dimensional Wasserstein distance for each projection, and averages the results:

$$SW_p(P, Q) = \left( \int_{S^{d-1}} W_p^p(R_\theta P, R_\theta Q) \, d\theta \right)^{1/p}$$

where $$R_\theta$$ denotes the Radon transform (projection) along direction $$\theta$$, and the integral is over the unit sphere $$S^{d-1}$$. In practice, the integral is approximated by sampling a finite number of random projections.

Recent variants include:

| Variant | Year | Key idea |
|---------|------|----------|
| Generalized sliced Wasserstein | 2019 | Uses generalized Radon transforms beyond linear projections |
| Max-sliced Wasserstein | 2019 | Optimizes over projection directions instead of averaging |
| Tree-sliced Wasserstein | 2019 | Projects onto tree metric spaces instead of lines |
| Distributional sliced Wasserstein | 2020 | Learns a distribution over projections |
| Spherical sliced Wasserstein | 2024 | Extends to data on spherical manifolds |

## What is the Wasserstein loss used for?

### Generative adversarial networks

The most prominent application of Wasserstein loss is in training WGANs and their variants.[1] WGANs have been applied to image generation, text generation, music synthesis, and 3D shape generation. The training stability provided by the Wasserstein loss has made it a standard choice when training on complex, high-dimensional data where standard GAN objectives fail.

### Frechet Inception Distance (FID)

The [Frechet Inception Distance](/wiki/frechet_inception_distance), one of the most widely used metrics for evaluating generative models, is based on the W2 distance between Gaussian approximations of real and generated image feature distributions.[11] The FID computes the W2 distance between two multivariate Gaussians fitted to the Inception network features of real and generated images, using the closed-form Gaussian Wasserstein formula.

### Distributional reinforcement learning

In distributional [reinforcement learning](/wiki/reinforcement_learning), agents model the full distribution of returns rather than just the expected value. The distributional Bellman operator is a contraction in the p-Wasserstein metric, providing a theoretical foundation for algorithms like QR-DQN (Quantile Regression DQN).[8] QR-DQN uses quantile regression to minimize the Wasserstein distance between the predicted and target return distributions.[9] This approach outperforms methods like C51 that minimize KL divergence instead, achieving a 33% improvement in median scores on the Atari 2600 benchmark.[9]

### Domain adaptation

Wasserstein distance has been used to measure and minimize the discrepancy between source and target domain feature distributions.[14] By training a domain critic network to estimate the empirical Wasserstein distance and then minimizing it, models learn domain-invariant representations. This approach provides a tighter generalization bound compared to methods based on other divergence measures.

### Natural language processing

The Word Mover's Distance (WMD), introduced by Kusner et al. (2015), is a special case of the Wasserstein distance applied to text documents.[10] WMD represents documents as distributions over word embeddings and computes the minimum cost of transforming one document's word distribution into another's. It has been used for document similarity, text classification, and automatic text evaluation (for example, the BaryScore metric).

### Fairness in machine learning

Wasserstein distance has been applied to fairness-aware machine learning, where it quantifies the difference in prediction distributions across protected groups. Compared to KL divergence, the Wasserstein distance provides a more stable optimization target for fairness constraints and is robust to the choice of classification threshold in downstream models.

### Image retrieval

The earth mover's distance was one of the earliest applications of Wasserstein distance in computer science, used for comparing color histograms of images in content-based image retrieval systems.[7] By treating color histograms as distributions, the EMD captures perceptual similarity more accurately than bin-by-bin comparison metrics like chi-squared distance or histogram intersection.

## Theoretical properties

### Metrization of weak convergence

The Wasserstein-p distance metrizes weak convergence of probability measures (plus convergence of p-th moments).[4] This means that a sequence of distributions converges in the Wasserstein-p metric if and only if it converges weakly and the p-th moments converge. This property makes Wasserstein distance a natural choice for problems where the topology of weak convergence is relevant.

### Completeness

If the underlying metric space $$(M, d)$$ is complete and separable (a Polish space), then the Wasserstein space (the space of probability measures with finite p-th moments, equipped with the W_p metric) is also complete and separable.[4] This ensures that optimization over this space is well-posed.

### Geodesics and displacement interpolation

The W2 metric has a rich geometric structure. The space of probability measures equipped with W2 is an infinite-dimensional Riemannian manifold (informally). Geodesics in this space correspond to displacement interpolations, where mass moves along optimal transport paths. This geometric perspective, developed by Felix Otto and others, connects optimal transport to partial differential equations through the concept of gradient flows.[4]

## What are the limitations of the Wasserstein loss?

Despite its advantages, Wasserstein loss has several limitations:

- **Computational cost.** Even with the dual formulation, training WGANs requires multiple critic updates per generator step, making training slower than standard GANs.
- **Approximation quality.** The neural network critic can only approximate the supremum in the dual formulation. In practice, the estimated Wasserstein distance may not be accurate, especially early in training.
- **Curse of dimensionality.** The sample complexity of estimating Wasserstein distance grows exponentially with dimension. For distributions in d-dimensional space, $$O(n^{1+d})$$ samples may be needed for accurate estimation.
- **Lipschitz constraint enforcement.** All methods for enforcing the Lipschitz constraint are approximate, and the quality of the Wasserstein distance estimate depends on how well the constraint is satisfied.
- **Not always necessary.** For some applications and architectures, standard GAN losses or other alternatives (such as hinge loss or least-squares loss) may perform comparably or better, with lower computational cost.

## Key papers and timeline

| Year | Paper | Authors | Contribution |
|------|-------|---------|-------------|
| 1781 | Memoire sur la theorie des deblais et des remblais | Gaspard Monge | Formulated the original optimal transport problem |
| 1942 | On the translocation of masses | Leonid Kantorovich | Relaxed formulation with transport plans; linear programming connection |
| 2013 | Sinkhorn Distances: Lightspeed Computation of Optimal Transport | Marco Cuturi | Entropy-regularized optimal transport for efficient computation |
| 2015 | From Word Embeddings to Document Distances | Matt Kusner et al. | Word Mover's Distance for text similarity |
| 2017 | Wasserstein GAN | Arjovsky, Chintala, Bottou | Introduced WGAN with weight clipping |
| 2017 | Improved Training of Wasserstein GANs | Gulrajani et al. | Gradient penalty method (WGAN-GP) |
| 2017 | A Distributional Perspective on Reinforcement Learning | Bellemare, Dabney, Munos | Distributional RL with Wasserstein distance theory |
| 2018 | Spectral Normalization for Generative Adversarial Networks | Miyato et al. | Spectral normalization for Lipschitz constraint |
| 2018 | Distributional RL with Quantile Regression | Dabney et al. | QR-DQN minimizing Wasserstein distance |

## See also

- [Generative adversarial network](/wiki/generative_adversarial_network)
- [Loss function](/wiki/loss_function)
- [Frechet Inception Distance](/wiki/frechet_inception_distance)
- [Discriminator](/wiki/discriminator)
- [Generator](/wiki/generator)
- [Deep learning](/wiki/deep_learning)
- [Regularization](/wiki/regularization)
- [Reinforcement learning](/wiki/reinforcement_learning)

## References

1. Arjovsky, M., Chintala, S., & Bottou, L. (2017). "Wasserstein GAN." *Proceedings of the 34th International Conference on Machine Learning (ICML)*. arXiv:1701.07875.
2. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). "Improved Training of Wasserstein GANs." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1704.00028.
3. Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). "Spectral Normalization for Generative Adversarial Networks." *International Conference on Learning Representations (ICLR)*. arXiv:1802.05957.
4. Villani, C. (2008). *Optimal Transport: Old and New*. Springer. The standard mathematical reference for optimal transport theory.
5. Cuturi, M. (2013). "Sinkhorn Distances: Lightspeed Computation of Optimal Transport." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1306.0895.
6. Kantorovich, L. V. (1942). "On the translocation of masses." *Doklady Akademii Nauk SSSR*, 37(7-8), 227-229.
7. Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). "The Earth Mover's Distance as a Metric for Image Retrieval." *International Journal of Computer Vision*, 40(2), 99-121.
8. Bellemare, M. G., Dabney, W., & Munos, R. (2017). "A Distributional Perspective on Reinforcement Learning." *Proceedings of the 34th International Conference on Machine Learning (ICML)*. arXiv:1707.06887.
9. Dabney, W., Rowland, M., Bellemare, M. G., & Munos, R. (2018). "Distributional Reinforcement Learning with Quantile Regression." *Proceedings of the AAAI Conference on Artificial Intelligence*. arXiv:1710.10044.
10. Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). "From Word Embeddings To Document Distances." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*.
11. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1706.08500.
12. Peyre, G. & Cuturi, M. (2019). "Computational Optimal Transport." *Foundations and Trends in Machine Learning*, 11(5-6), 355-607.
13. Kolouri, S., Nadjahi, K., Siber, U., Martin, R., & Rohde, G. K. (2019). "Generalized Sliced Wasserstein Distances." *Advances in Neural Information Processing Systems (NeurIPS)*.
14. Shen, J., Qu, Y., Zhang, W., & Yu, Y. (2018). "Wasserstein Distance Guided Representation Learning for Domain Adaptation." *Proceedings of the AAAI Conference on Artificial Intelligence*. arXiv:1707.01217.