The Earth Mover's Distance (EMD), also known as the Wasserstein-1 distance, Kantorovich-Rubinstein metric, or Mallows's distance, is a measure of dissimilarity between two probability distributions over a metric space. It originates from the optimal transport problem in mathematics and has become one of the most widely used distance metrics in machine learning, computer vision, and natural language processing.
The name comes from a physical analogy: if two distributions are imagined as two different ways of piling up dirt over a region, the EMD represents the minimum total work needed to reshape one pile into the other, where work is defined as the amount of dirt moved multiplied by the distance it travels. This intuitive interpretation makes the EMD both mathematically rigorous and practically meaningful.
Imagine you and your friend each have a sandbox, and you've both built dirt piles in different spots. Your pile might have a big mound on the left side, while your friend's pile has the dirt spread out on the right side. The Earth Mover's Distance asks: what is the least amount of work you would have to do to make your sandbox look exactly like your friend's? You measure work by multiplying how much dirt you move by how far you carry it. If you only need to slide a little dirt a short distance, the EMD is small, meaning the two sandboxes were already pretty similar. If you need to carry a lot of dirt a long way, the EMD is large, meaning they were very different.
This idea applies to anything that can be described as a distribution: colors in an image, words in a document, or data points on a graph. The EMD gives a single number that captures how different two such distributions are.
The mathematical foundations of EMD trace back to 1781, when French mathematician Gaspard Monge formulated the optimal transport problem. Monge, who was also an engineer designing military fortifications, wanted to find the most efficient way to move soil from one configuration to another. He posed the question: given a pile of material at various source locations and a desired arrangement at destination locations, what assignment of sources to destinations minimizes the total transportation cost?
Monge's formulation, however, was restrictive. It required each unit of mass to be transported from a single source to a single destination, with no splitting allowed. This constraint made the problem analytically intractable in many cases.
In 1942, Soviet mathematician and economist Leonid Kantorovich reformulated Monge's problem using the framework of linear programming. Kantorovich's key insight was to allow mass splitting: a source could send fractions of its mass to multiple destinations. This relaxation transformed the problem into a linear program that was both theoretically solvable and computationally feasible. Kantorovich's work earned him the Nobel Prize in Economics in 1975 (shared with Tjalling Koopmans) for contributions to the theory of optimal allocation of resources.
The resulting framework is commonly referred to as the Monge-Kantorovich transportation problem. It became a cornerstone of mathematical optimization and laid the groundwork for what is now called optimal transport theory.
The term "Earth Mover's Distance" was proposed by Jorge Stolfi in 1994. The name caught on in the computer science community, particularly after the influential 1998 paper by Yossi Rubner, Carlo Tomasi, and Leonid J. Guibas, who applied it to content-based image retrieval. In the mathematics and statistics communities, the same quantity is more commonly called the Wasserstein distance (specifically the first Wasserstein distance, W1), a name coined by R. L. Dobrushin in 1970 after encountering the concept in the work of Leonid Vaserstein. The spelling "Wasserstein" is a transliteration of the Yiddish-origin surname Vasershtein.
Cedric Villani, who received the Fields Medal in 2010 (for work on Landau damping and the Boltzmann equation), authored two comprehensive reference books on optimal transport theory: "Topics in Optimal Transportation" (2003) and "Optimal Transport, Old and New" (2008). These texts helped popularize optimal transport methods across mathematics, physics, and applied sciences.
Let (M, d) be a metric space, and let P and Q be two probability distributions on M. A transport plan (or coupling) between P and Q is a joint distribution gamma on M x M whose marginals are P and Q. The set of all such couplings is denoted Gamma(P, Q).
The p-Wasserstein distance between P and Q is defined as:
$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$
The Earth Mover's Distance corresponds to the case p = 1:
$$\text{EMD}(P, Q) = W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y) \, d\gamma(x, y)$$
Intuitively, gamma describes how mass is moved from locations drawn according to P to locations drawn according to Q. The integral computes the expected cost (distance) under that plan. The infimum finds the cheapest such plan.
In practice, distributions are often represented as discrete sets of weighted clusters, called signatures. Given two signatures
$$P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}, \qquad Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\},$$
where each p_i and q_j is a feature vector (cluster center), w_pi and w_qj are the corresponding weights, and d_ij = d(p_i, q_j) is the ground distance between cluster centers, the EMD is found by solving the following linear program:
Minimize:
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \cdot d_{ij}$$
Subject to:
$$f_{ij} \ge 0 \quad (1 \le i \le m,\ 1 \le j \le n)$$
$$\sum_{j=1}^{n} f_{ij} \le w_{p_i} \quad (1 \le i \le m)$$
$$\sum_{i=1}^{m} f_{ij} \le w_{q_j} \quad (1 \le j \le n)$$
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j} \right)$$
The matrix F = [f_ij] is the optimal flow. The normalized EMD is then:
$$\text{EMD}(P, Q) = \frac{\sum_{i} \sum_{j} f_{ij} \cdot d_{ij}}{\sum_{i} \sum_{j} f_{ij}}$$
Normalization by total flow makes the result comparable across pairs of distributions with different total masses.
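The transportation linear program above can be solved directly with a general-purpose LP solver. The following is an illustrative sketch using SciPy's `linprog` (the helper name `emd_lp` and the two-bin example data are ours, not from any library); dedicated network-simplex solvers are much faster in practice.

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(w_p, w_q, D):
    """Exact EMD between two equal-mass signatures, solved as a
    transportation linear program over the flattened flow matrix."""
    m, n = D.shape
    c = D.ravel()  # objective: sum_ij f_ij * d_ij
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):            # row sums: source i ships exactly w_p[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):            # column sums: sink j receives exactly w_q[j]
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([w_p, w_q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun / w_p.sum()    # normalize by total flow

w_p = np.array([0.4, 0.6])
w_q = np.array([0.3, 0.7])
D = np.array([[0.0, 1.5],
              [1.5, 0.0]])
print(round(emd_lp(w_p, w_q, D), 6))  # 0.15: move 0.1 of mass a distance of 1.5
```

With equal total masses the equality constraints pin down both marginals (one row is redundant but harmless), and normalization by the total flow of 1.0 leaves the optimal cost unchanged.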
For the W1 distance, an equivalent dual formulation exists:
$$W_1(P, Q) = \sup_{\|f\|_L \le 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$
where the supremum is taken over all 1-Lipschitz functions f (functions satisfying |f(x) - f(y)| <= d(x, y) for all x, y). This duality is known as the Kantorovich-Rubinstein duality and is the theoretical foundation for Wasserstein GANs.
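The duality is easy to check numerically in a simple case. For a pure translation in one dimension, the 1-Lipschitz function f(x) = x attains the supremum, so the dual value matches the primal W1 (a toy sketch; the sample size and the shift `delta` are arbitrary choices of ours):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
delta = 2.0
q = rng.normal(0.0, 1.0, 100_000)   # sample from Q
p = q + delta                        # P is Q shifted by delta

# f(x) = x is 1-Lipschitz, so E_P[f] - E_Q[f] lower-bounds W1;
# for a pure shift this particular f is optimal.
dual_bound = p.mean() - q.mean()
primal = wasserstein_distance(p, q)
print(dual_bound, primal)  # both approximately 2.0
```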
In the special case of one-dimensional distributions, the EMD has a simple closed-form solution. For two probability distributions P and Q with cumulative distribution functions F_P and F_Q:
$$W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx$$
That is, the EMD equals the area between the two CDFs. For discrete distributions represented as histograms with bins of equal width, this reduces to a linear-time computation: sum the absolute differences of the two cumulative sums and scale by the bin width.
This makes the one-dimensional EMD far cheaper to compute than the general case.
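A minimal NumPy sketch of this closed form (the helper name and toy histograms are ours): for two normalized histograms on shared, equally spaced bins, the EMD is the sum of absolute differences between their cumulative sums, scaled by the bin width.

```python
import numpy as np

def emd_1d_histograms(p, q, bin_width=1.0):
    """EMD between two normalized histograms on the same equally
    spaced bins: the area between their CDFs, computed in O(N)."""
    cdf_p = np.cumsum(p)
    cdf_q = np.cumsum(q)
    return np.sum(np.abs(cdf_p - cdf_q)) * bin_width

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(emd_1d_histograms(p, q))  # 1.0: each half of the mass moves one bin
```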
The EMD (when defined between distributions of equal total mass) satisfies all the axioms of a metric:
| Property | Description |
|---|---|
| Non-negativity | EMD(P, Q) >= 0, with equality if and only if P = Q |
| Symmetry | EMD(P, Q) = EMD(Q, P) |
| Triangle inequality | EMD(P, R) <= EMD(P, Q) + EMD(Q, R) |
| Identity of indiscernibles | EMD(P, Q) = 0 implies P = Q |
When partial matching is allowed (i.e., the distributions have unequal total mass and excess mass is discarded at no cost), the resulting measure is no longer a true metric because it can violate the triangle inequality.
Compared to other probability distribution distances:
| Distance/Divergence | Handles disjoint support | Metric | Differentiable | Sensitive to geometry |
|---|---|---|---|---|
| Earth Mover's Distance (W1) | Yes | Yes | Yes (via dual) | Yes |
| Kullback-Leibler divergence | No (infinite if supports differ) | No (asymmetric) | Yes | No |
| Jensen-Shannon divergence | Yes | Yes (square root) | Yes | No |
| Total variation distance | Yes | Yes | Limited | No |
| Kolmogorov-Smirnov distance | Yes | Yes | No | Partially |
A distinctive advantage of the EMD is its sensitivity to the underlying geometry of the metric space. Unlike bin-to-bin distances such as the chi-squared distance, the EMD accounts for "cross-bin" relationships. For example, in comparing color histograms, the EMD recognizes that dark red is closer to bright red than to blue, whereas bin-wise distances treat all bin mismatches equally.
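The effect can be demonstrated with toy 1D histograms (helper names ours): a bin-wise distance such as chi-squared scores a one-bin shift and a nine-bin shift identically, while the EMD grows with the distance the mass travels.

```python
import numpy as np

def emd_1d(p, q):
    # 1D EMD on unit-width bins: area between the CDFs.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def chi2(p, q):
    # Bin-wise chi-squared distance; blind to bin geometry.
    denom = p + q
    mask = denom > 0
    return 0.5 * np.sum((p[mask] - q[mask]) ** 2 / denom[mask])

base = np.zeros(10)
base[0] = 1.0
near = np.zeros(10)
near[1] = 1.0   # all mass one bin away
far = np.zeros(10)
far[9] = 1.0    # all mass nine bins away

# Chi-squared sees both mismatches as equally bad...
print(chi2(base, near), chi2(base, far))      # 1.0 1.0
# ...while EMD reflects how far the mass actually moved.
print(emd_1d(base, near), emd_1d(base, far))  # 1.0 9.0
```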
Computing the exact EMD requires solving a transportation problem, which is a special case of linear programming. The standard approach uses the network simplex algorithm, a specialized variant of the simplex method for flow problems.
| Algorithm | Time complexity | Notes |
|---|---|---|
| Network simplex | O(N^3 log N) | Standard method; N is the number of bins |
| Hungarian algorithm | O(N^3) | For bipartite matching (uniform weights) |
| 1D closed form | O(N) | One-dimensional distributions only |
| Auction algorithm | O(N^3) | Alternative for assignment problems |
The cubic-or-worse complexity makes exact EMD computation impractical for large-scale applications, especially when distributions are represented as high-dimensional histograms with thousands of bins.
Several approaches have been developed to reduce the computational cost of EMD:
Sinkhorn distances (Cuturi, 2013). Marco Cuturi proposed adding an entropic regularization term to the optimal transport problem. This transforms the linear program into a strictly convex problem that can be solved using the Sinkhorn-Knopp matrix scaling algorithm. The key benefit is computational: the Sinkhorn algorithm relies only on matrix-vector multiplications, reducing complexity to approximately O(N^2) per iteration with fast convergence. On the MNIST dataset, Sinkhorn distances were shown to be thousands of times faster than exact EMD solvers.
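A minimal NumPy sketch of the Sinkhorn iteration (the function name and the two-bin example are ours; production implementations such as POT add log-domain stabilization and convergence checks):

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.5, n_iters=200):
    """Entropy-regularized transport cost via Sinkhorn-Knopp scaling.
    Each iteration needs only matrix-vector products."""
    K = np.exp(-M / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # scale columns to match b
        u = a / (K @ v)               # scale rows to match a
    P = u[:, None] * K * v[None, :]   # regularized transport plan
    return np.sum(P * M)

a = np.array([0.4, 0.6])
b = np.array([0.3, 0.7])
M = np.array([[0.0, 1.5],
              [1.5, 0.0]])
print(sinkhorn(a, b, M))  # about 0.16; the entropic bias above the exact 0.15 shrinks as reg -> 0
```

Smaller `reg` brings the result closer to the exact EMD but slows convergence and eventually causes numerical underflow in the kernel, which is why stabilized log-domain variants are used in practice.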
Sliced Wasserstein distance. This approach projects high-dimensional distributions onto random one-dimensional subspaces, computes the closed-form 1D Wasserstein distance for each projection, and averages the results. The complexity per projection is O(N log N) due to sorting. The sliced Wasserstein distance is provably equivalent to the true Wasserstein distance in a weak sense and has found applications in generative modeling and point cloud analysis.
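A sketch of the Monte Carlo estimator (names and data ours): project both point clouds onto random unit directions and average the closed-form 1D distances. For a pure shift of 2 along one axis in 3D, the expected value is 2 · E|θ₁| = 1, since the first coordinate of a uniform direction on the sphere is uniform on [-1, 1].

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo sliced W1 between two point clouds in R^d:
    project onto random directions, use the closed-form 1D W1."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)   # random unit direction
        total += wasserstein_distance(X @ theta, Y @ theta)
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = X + np.array([2.0, 0.0, 0.0])        # same cloud, shifted along x
print(sliced_wasserstein(X, Y))          # roughly 1.0 for this pure shift
```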
Tree-based approximations. Methods such as the tree-EMD embed the ground distance into a tree metric, allowing the EMD to be computed in linear time via a simple bottom-up traversal. These methods sacrifice some accuracy for significant speed gains.
Wavelet-based approximations. The EMD can be approximated using wavelet decompositions of the distributions, with complexity linear in the number of bins.
The most influential early application of EMD was in content-based image retrieval, as introduced by Rubner, Tomasi, and Guibas (2000). Their approach represented images as distributions of color features (signatures), where each feature is a cluster center in color space and its weight is the fraction of pixels belonging to that cluster. The EMD between two image signatures captures perceptual similarity more effectively than traditional histogram distances because it accounts for the similarity between different colors rather than treating each color bin independently.
Experimental results showed that EMD-based retrieval outperformed methods based on histogram intersection, chi-squared distance, and other bin-to-bin comparisons, particularly for queries involving color variations and lighting changes.
DeepEMD (Zhang et al., CVPR 2020) introduced a differentiable EMD layer for few-shot learning. The method computes the EMD between dense feature extraction maps of a query image and a support image, treating local feature vectors as distribution elements. The EMD identifies the optimal matching between spatial regions of the two images, providing a structured distance metric that is more informative than global feature comparison.
By using the implicit function theorem, the authors made the EMD computation differentiable, enabling end-to-end training with backpropagation. DeepEMD achieved state-of-the-art results on five few-shot learning benchmarks, with improvements of up to 7% over previous methods.
EMD has been applied to comparing texture descriptors, contour features, and shape distributions. Because the EMD naturally handles distributions with different numbers of clusters (through partial matching), it is well-suited for comparing objects whose feature representations vary in complexity.
In video understanding, EMD is used to compare temporal distributions of motion features across video clips. The metric's ability to capture temporal misalignment makes it useful for action recognition tasks where the same action may be performed at different speeds.
The Word Mover's Distance (WMD), introduced by Kusner et al. (2015), adapts the EMD framework to measure semantic similarity between text documents. Each document is represented as a weighted collection of word embeddings (such as Word2Vec vectors), where the weight of each word is its normalized frequency in the document. The WMD between two documents is the minimum cumulative distance that the word embeddings of one document must travel in the embedding space to match the word embeddings of the other.
Formally, if document A has words with embeddings {a1, ..., am} and normalized weights {w_a1, ..., w_am}, and document B has words with embeddings {b1, ..., bn} and normalized weights {w_b1, ..., w_bn}, the WMD is:
$$\text{WMD}(A, B) = \min_{T \geq 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \cdot \|a_i - b_j\|_2$$
subject to the constraints that the row sums of T equal w_a and the column sums equal w_b.
WMD captures semantic relationships that bag-of-words models miss. For example, "The president addressed the nation" and "The leader spoke to the country" have no words in common, but WMD assigns a small distance because the word embeddings for "president"/"leader", "addressed"/"spoke", and "nation"/"country" are close in the embedding space.
Experiments by Kusner et al. demonstrated that WMD achieved the lowest k-nearest-neighbor classification error on eight document classification benchmarks compared to seven baselines including bag-of-words, TF-IDF, and latent semantic analysis.
Huang and Guo (NeurIPS 2016) extended WMD with a supervised variant (S-WMD) that learns a transformation of the word embedding space to improve classification accuracy. The transformation is optimized so that documents from the same class have lower WMD and documents from different classes have higher WMD.
While WMD provides high-quality distance measurements, its computational cost (cubic in the vocabulary size per pair) limits scalability. Later work explored faster approximations, including relaxed WMD (which uses lower bounds from the dual problem) and methods based on sentence-level embeddings that approximate WMD without solving the full transport problem.
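The relaxed WMD lower bound mentioned above is simple to implement: dropping one of the two marginal constraints lets every word ship all of its mass to its nearest neighbor in the other document, and the tighter of the two one-sided bounds is kept. Below is a sketch with toy 2D vectors standing in for real word embeddings (the function name and all data are illustrative):

```python
import numpy as np

def relaxed_wmd(A, w_a, B, w_b):
    """Relaxed WMD lower bound: with one marginal constraint dropped,
    each word ships all its mass to the nearest word of the other
    document; the tighter of the two one-sided bounds is returned."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    bound_ab = np.sum(w_a * D.min(axis=1))   # A's words to nearest in B
    bound_ba = np.sum(w_b * D.min(axis=0))   # B's words to nearest in A
    return max(bound_ab, bound_ba)

# Toy 2D "embeddings" (purely illustrative, not real Word2Vec vectors)
A = np.array([[0.0, 0.0], [1.0, 0.0]])   # e.g. "president", "nation"
B = np.array([[0.1, 0.0], [1.1, 0.0]])   # e.g. "leader", "country"
w = np.array([0.5, 0.5])
print(round(relaxed_wmd(A, w, B, w), 6))  # 0.1: every word has a close counterpart
```

Because each minimization decouples, the bound costs only one distance-matrix pass instead of an LP solve, which is why it is used to prune candidates in nearest-neighbor document search.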
One of the most impactful applications of the EMD in deep learning came with the introduction of the Wasserstein GAN (WGAN) by Arjovsky, Chintala, and Bottou (2017). Traditional generative adversarial networks train the generator by minimizing the Jensen-Shannon divergence between the generated and real data distributions. This objective suffers from vanishing gradients when the discriminator becomes too strong, because the JS divergence saturates when the two distributions have disjoint supports (which is common in high-dimensional spaces).
WGAN replaces the JS divergence with the Wasserstein-1 (Earth Mover's) distance. Using the Kantorovich-Rubinstein duality, the W1 distance can be estimated by training a critic network (replacing the discriminator) that is constrained to be 1-Lipschitz. The WGAN objective is:
$$\min_G \max_{\|D\|_L \le 1} \left( \mathbb{E}_{x \sim P_{\text{real}}}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(z))] \right)$$
Key advantages of WGAN over standard GANs:
| Aspect | Standard GAN | Wasserstein GAN |
|---|---|---|
| Loss function | Jensen-Shannon divergence | Wasserstein-1 distance |
| Gradient behavior | Vanishing gradients when discriminator is optimal | Continuous, informative gradients everywhere |
| Training stability | Prone to mode collapse and oscillation | More stable training dynamics |
| Correlation with quality | Generator loss does not correlate with sample quality | Critic loss correlates with generated sample quality |
| Lipschitz constraint | None | Required (via weight clipping or gradient penalty) |
The original WGAN enforced the Lipschitz constraint through weight clipping, which sometimes led to capacity underuse. Gulrajani et al. (2017) proposed WGAN-GP (gradient penalty), which enforces the constraint by penalizing the gradient norm of the critic, yielding further improvements in training stability and sample quality.
The Wasserstein distance has also been used in autoencoder-based generative models. Wasserstein autoencoders (Tolstikhin et al., 2018) minimize the Wasserstein distance between the data distribution and the model distribution, providing an alternative to the KL-divergence penalty used in variational autoencoders.
The Wasserstein distance has found applications in single-cell RNA sequencing analysis. The waddR package (Schefzik et al., 2021) uses the 2-Wasserstein distance to detect differential gene expression distributions between conditions. Unlike traditional methods that only test for differences in mean expression, the Wasserstein-based approach can detect changes in variance, distribution shape (such as unimodal versus multimodal), and the proportion of zero counts. The W2 distance can be decomposed into components capturing shifts in mean, variance, and shape, providing interpretable results.
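In one dimension the 2-Wasserstein distance has a quantile-based closed form, which makes the underlying idea easy to illustrate (a sketch of ours, not the waddR implementation): two Gaussian samples with equal means but different spreads are indistinguishable by a comparison of means, yet clearly separated by W2.

```python
import numpy as np

def w2_1d(x, y):
    """2-Wasserstein distance between two 1D samples of equal size:
    match sorted values (the empirical quantile functions)."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
# Same mean, different spread: a mean-based test sees no difference,
# but the W2 distance does.
x = rng.normal(5.0, 1.0, 10_000)
y = rng.normal(5.0, 3.0, 10_000)
print(round(w2_1d(x, y), 2))  # close to |sigma_x - sigma_y| = 2 for equal-mean Gaussians
```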
Optimal transport methods have also been applied to reconstruct cell trajectories from snapshot single-cell data, using the Wasserstein-Fisher-Rao distance to simultaneously model both cell transport and population growth.
The original Kantorovich formulation was motivated by economic resource allocation. Modern applications include distributionally robust optimization (DRO), where decision-makers optimize for worst-case performance over distributions within a Wasserstein ball around the empirical distribution. This approach provides robustness guarantees against distribution shift.
Wasserstein distance has been used as a fairness criterion in algorithmic decision-making. By measuring the distance between outcome distributions across protected groups (such as race or gender), practitioners can quantify disparate impact. Wasserstein-based distributionally robust optimization ensures that models perform consistently across varied data distributions, promoting individual fairness.
The "Ocean Mover's Distance" (Hydrological Wasserstein distance) applies optimal transport concepts to compare oceanographic data distributions. In high-energy physics, differentiable EMD has been used for data compression at the Large Hadron Collider.
Several open-source libraries provide EMD and Wasserstein distance computations:
| Library | Language | Features |
|---|---|---|
| SciPy (scipy.stats.wasserstein_distance) | Python | 1D Wasserstein distance; also wasserstein_distance_nd for N-dimensional |
| POT (Python Optimal Transport) | Python | Full optimal transport solver, Sinkhorn, sliced Wasserstein, and more |
| PyEMD | Python | Wrapper around C++ EMD implementation; now recommends POT for new projects |
| PyTorch (geomloss) | Python | GPU-accelerated Sinkhorn divergences and kernel-based OT |
| OpenCV | C++ / Python | EMD via cv2.EMD for histogram comparison |
| TensorFlow | Python | Custom implementations via tf.linalg operations |
| R (transport, waddR) | R | Exact and approximate OT; Wasserstein tests for genomics |
A basic example using SciPy:

```python
from scipy.stats import wasserstein_distance

# Two 1D distributions, each given as a sample of observed values
dist_a = [1, 2, 3, 4, 5]
dist_b = [2, 3, 4, 5, 6]

emd = wasserstein_distance(dist_a, dist_b)
print(f"EMD: {emd}")  # Output: EMD: 1.0
```
For the general discrete case with a custom ground distance matrix:

```python
import numpy as np
import ot  # POT library

# Two discrete distributions
a = np.array([0.4, 0.6])
b = np.array([0.3, 0.7])

# Ground distance matrix between the bins
M = np.array([[0.0, 1.5],
              [1.5, 0.0]])

# Compute the exact EMD (optimal transport cost)
emd = ot.emd2(a, b, M)
print(f"EMD: {emd}")  # Output: EMD: 0.15
```
The EMD is part of a broader family of optimal transport distances. Understanding how it relates to other measures helps clarify when to use it:
| Metric | Relationship to EMD |
|---|---|
| Wasserstein-1 (W1) | Identical to EMD for probability distributions of equal total mass |
| Wasserstein-2 (W2) | Uses squared distances; emphasizes large displacements more than W1 |
| Wasserstein-p (Wp) | Generalizes EMD with p-th power of distance; EMD is the p=1 case |
| Mallows's distance | Same as Wasserstein distance; name used primarily in statistics |
| Frechet Inception Distance (FID) | Uses Wasserstein-2 between Gaussian approximations of feature distributions |
| Sliced Wasserstein | Approximation of Wasserstein via 1D projections; faster but less precise |
| Sinkhorn divergence | Entropy-regularized version; faster to compute, smooth and differentiable |
| Maximum Mean Discrepancy (MMD) | Kernel-based distance; does not require solving an optimization problem |
Despite its many advantages, the EMD has several limitations: