See also: Wasserstein distance, Optimal transport, Sinkhorn algorithm
The Earth Mover's Distance (EMD) is a measure of dissimilarity between two probability distributions defined on a metric space. Intuitively, if each distribution is viewed as a pile of dirt spread over the space, the EMD is the minimum total amount of work (mass times distance) needed to reshape one pile into the other. The metric is mathematically equivalent to the Wasserstein-1 distance, and it is also called the Mallows distance in statistics and the Kantorovich-Rubinstein metric in functional analysis. EMD has become a workhorse distance in machine learning because it captures the geometry of the underlying space, remains well defined when supports do not overlap, and provides informative gradients suitable for training neural networks.
The distance was popularized in computer vision by Yossi Rubner, Carlo Tomasi, and Leonidas Guibas in their 2000 paper "The Earth Mover's Distance as a Metric for Image Retrieval" published in the International Journal of Computer Vision. Their work showed that EMD outperforms histogram-bin distances such as L1 and chi-squared on color and texture retrieval tasks because it accounts for the perceptual similarity between bins rather than treating bins as independent buckets. Since then EMD has spread far beyond image retrieval. It underlies the loss function in Wasserstein GAN, the cost in word mover's distance for document similarity, and the matching objective in single-cell trajectory inference.
The optimal transport problem behind EMD has roots that stretch back more than two centuries. Gaspard Monge formulated the original version in 1781 in his memoir Mémoire sur la théorie des déblais et des remblais, written during his work on military fortifications. Monge asked: what is the cheapest way to move a pile of soil (the déblai) into a target pile of equal volume (the remblai) when the cost per unit of soil is its travel distance? His formulation required a deterministic assignment from each grain of source mass to a single destination, which makes the problem hard to solve and sometimes infeasible.
In 1942, the Soviet mathematician and economist Leonid Kantorovich relaxed Monge's problem by allowing mass to be split across multiple destinations, recasting the question as a linear program over joint probability measures. This Kantorovich relaxation turned an intractable combinatorial problem into a tractable convex one, and it forms the modern foundation of optimal transport theory. Kantorovich shared the 1975 Nobel Memorial Prize in Economic Sciences with Tjalling Koopmans for his work on the optimal allocation of resources, of which the transport problem is the canonical example.
The name Wasserstein distance was coined by R. L. Dobrushin in 1970 after he encountered a 1969 paper by Leonid Vaseršteĭn studying Markov processes that describe large systems of automata. The computer vision community independently rediscovered the same object in the 1980s and 1990s. Stanford University doctoral student Yossi Rubner, working with Carlo Tomasi and Leonidas Guibas, gave it the evocative name Earth Mover's Distance in their 1998 ICCV paper and the 2000 IJCV journal version. The IJCV paper has since been cited tens of thousands of times.
Consider the color histograms of two photographs. A histogram-bin metric like L1 treats every bin as independent: shifting all the red mass into slightly redder bins counts as a complete change. EMD accounts for the fact that nearby bins are similar, so a small shift produces a small distance. The same intuition extends to point clouds, weighted point sets, and continuous distributions on any metric space.
A more visual analogy: imagine pile P sitting on the number line at positions {1, 2, 3} with mass {0.5, 0.3, 0.2}, and pile Q at positions {2, 3, 4} with mass {0.2, 0.5, 0.3}. To turn P into Q you can leave most of the mass in place and shift small amounts to the right. The minimum total mass-distance you must pay is the EMD. Kullback-Leibler divergence, by contrast, does not even know that 1 and 2 are nearby; it only compares per-bin probabilities and goes to infinity if any bin in Q has zero mass where P has positive mass.
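As a sanity check on this example, the one-dimensional EMD can be computed with a single SciPy call; the snippet below is a minimal sketch assuming SciPy is available and simply encodes the two piles described above.

```python
# Minimal sketch: EMD between the two piles on the number line, assuming SciPy.
from scipy.stats import wasserstein_distance

emd = wasserstein_distance(
    u_values=[1, 2, 3], v_values=[2, 3, 4],                 # positions of pile P and pile Q
    u_weights=[0.5, 0.3, 0.2], v_weights=[0.2, 0.5, 0.3],   # their masses
)
print(emd)  # 1.4: the minimum total mass-times-distance needed to turn P into Q
```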
Let P and Q be two probability distributions on a metric space (X, d). The Kantorovich formulation of EMD is:
EMD(P, Q) = inf over T in Pi(P, Q) of integral over X x X of d(x, y) dT(x, y)
where Pi(P, Q) is the set of all transport plans, joint probability measures on X x X whose marginals are P and Q. A transport plan T(x, y) specifies how much mass moves from x to y; the marginal constraints ensure that the total outflow from each source matches P and the total inflow to each destination matches Q.
For discrete distributions with point masses {p_i} at positions {x_i} and {q_j} at positions {y_j}, the formulation becomes a finite linear program:
minimize sum over i, j of T_ij d(x_i, y_j)
subject to T_ij >= 0, sum over j of T_ij = p_i, sum over i of T_ij = q_j.
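The snippet below is a small illustrative sketch of this linear program using the Python Optimal Transport (POT) package listed in the tools table later on; the point positions and masses are made up for the example.

```python
# Sketch of the discrete transport LP, assuming the POT package (import ot).
import numpy as np
import ot

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))              # source positions x_i in R^2
y = rng.normal(loc=1.0, size=(7, 2))     # target positions y_j in R^2
p = np.full(5, 1 / 5)                    # source masses p_i
q = np.full(7, 1 / 7)                    # target masses q_j

C = ot.dist(x, y, metric="euclidean")    # ground distances d(x_i, y_j)
T = ot.emd(p, q, C)                      # optimal plan T_ij from the exact LP solver
emd = np.sum(T * C)                      # objective value: sum_ij T_ij d(x_i, y_j)

# The marginal constraints hold: rows sum to p, columns sum to q.
assert np.allclose(T.sum(axis=1), p) and np.allclose(T.sum(axis=0), q)
print(emd)
```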
The Wasserstein-p distance generalizes EMD by raising the ground distance to the p-th power before integrating and taking the p-th root of the result:
W_p(P, Q) = (inf over T of integral d(x, y)^p dT(x, y))^(1/p).
EMD corresponds to W_1, the Wasserstein-1 distance. W_2 is also widely used because it has nice differential geometric properties and gives rise to displacement interpolation along geodesics in Wasserstein space.
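Continuing the sketch above, W_2 falls out of the same exact solver by feeding it squared distances and taking a square root at the end (again assuming POT; ot.emd2 returns the optimal cost rather than the plan).

```python
# Sketch continued: Wasserstein-2 between the same point sets, assuming POT.
C2 = ot.dist(x, y, metric="sqeuclidean")   # squared ground distances d(x_i, y_j)^2
w2 = np.sqrt(ot.emd2(p, q, C2))            # take the p-th root (here p = 2) of the optimal cost
print(w2)
```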
A key fact, used heavily in modern deep learning, is that W_1 admits a dual formulation:
W_1(P, Q) = sup over f Lipschitz with constant 1 of E_P[f(x)] - E_Q[f(x)].
The supremum runs over all 1-Lipschitz functions f. This is the Kantorovich-Rubinstein theorem. The dual replaces a high-dimensional linear program with an optimization over a function class, which neural networks can parameterize. This is the basis of the WGAN critic.
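To make the duality concrete, the sketch below solves the dual directly as a small linear program with SciPy on the earlier number-line example and checks that it reproduces the primal value of 1.4; the union support, the enumeration of pairwise Lipschitz constraints, and the pinning of f at one point are illustrative implementation choices.

```python
# Sketch of the Kantorovich-Rubinstein dual as a linear program, assuming SciPy.
import numpy as np
from scipy.optimize import linprog

z = np.array([1.0, 2.0, 3.0, 4.0])      # union of the two supports
p = np.array([0.5, 0.3, 0.2, 0.0])      # P expressed on the union support
q = np.array([0.0, 0.2, 0.5, 0.3])      # Q expressed on the union support
n = len(z)
D = np.abs(z[:, None] - z[None, :])     # ground distances between support points

# Maximize sum_k f_k (p_k - q_k); linprog minimizes, so negate the objective.
c = -(p - q)

# 1-Lipschitz constraints f_i - f_j <= d(z_i, z_j) for every ordered pair.
rows, rhs = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            rhs.append(D[i, j])

# Pin f at the first support point: adding a constant to f never changes the objective.
bounds = [(0.0, 0.0)] + [(None, None)] * (n - 1)
res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
print(-res.fun)  # ~1.4, matching the primal EMD of the number-line example
```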
EMD satisfies the axioms of a true metric and behaves better than many alternatives on distributions whose supports differ.
| Property | Holds for EMD? | Notes |
|---|---|---|
| Non-negativity | Yes | EMD(P, Q) >= 0. |
| Identity of indiscernibles | Yes | EMD(P, Q) = 0 iff P = Q. |
| Symmetry | Yes | EMD(P, Q) = EMD(Q, P). |
| Triangle inequality | Yes | EMD(P, R) <= EMD(P, Q) + EMD(Q, R). |
| Defined on disjoint supports | Yes | Returns a finite value even when supports do not overlap. |
| Differentiable in P, Q | Almost everywhere | Continuous and differentiable a.e., unlike the JS divergence, which saturates (is locally constant) when supports are disjoint. |
| Sensitive to ground geometry | Yes | Captures spatial structure through d(x, y). |
| Convex in (P, Q) | Yes | The transport LP is convex. |
For histograms with the same total mass, EMD is a true metric. When the two distributions have different total mass, the original Rubner-Tomasi-Guibas definition normalizes the minimum-cost flow by the total flow (the smaller of the two total masses), which supports partial matches but sacrifices the triangle inequality.
Probability distances and divergences differ in symmetry, sensitivity to geometry, and behavior on disjoint supports. The choice of distance shapes the gradients a model receives during training.
| Distance / divergence | Symmetric? | Metric? | Defined on disjoint supports? | Uses ground geometry? |
|---|---|---|---|---|
| Earth mover's distance / Wasserstein-1 | Yes | Yes | Yes | Yes |
| Wasserstein-2 | Yes | Yes | Yes | Yes |
| Kullback-Leibler divergence | No | No | No (infinite) | No |
| Jensen-Shannon divergence | Yes | Square root is a metric | Yes (bounded by log 2) | No |
| Total variation | Yes | Yes | Yes (bounded by 1) | No |
| Hellinger distance | Yes | Yes | Yes (bounded) | No |
| Maximum mean discrepancy (MMD) | Yes | Metric for characteristic kernels, otherwise a pseudo-metric | Yes | Through the kernel |
| Fréchet inception distance (FID) | Yes | Yes (under the Gaussian assumption) | Yes | Yes (in feature space) |
The last column matters most for high-dimensional generative models. KL and JS treat distributions as bags of probabilities and ignore the geometry of the sample space. Two distributions with disjoint supports look maximally far apart to JS (a constant log 2) no matter how close together their supports sit. EMD measures the actual distance the mass must travel and yields informative gradients even in this worst-case disjoint setting.
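A tiny numerical illustration of this point, assuming SciPy: two point masses separated by a gap theta look identically far apart to JS for every theta, while W_1 tracks the gap.

```python
# Sketch: JS saturates at log 2 for disjoint point masses, while W_1 equals the gap.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

for theta in [1e-3, 0.1, 10.0]:
    p, q = [1.0, 0.0], [0.0, 1.0]                 # the two Diracs on the shared support {0, theta}
    js = jensenshannon(p, q, base=np.e) ** 2      # JS divergence in nats (squared JS distance)
    w1 = wasserstein_distance([0.0], [theta])     # EMD between a Dirac at 0 and one at theta
    print(theta, js, w1)                          # js stays at log 2 ~ 0.693; w1 equals theta
```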
Solving the linear program for EMD between two histograms with n bins exactly takes roughly cubic time in n. This computational cost is the main practical drawback of EMD.
| Algorithm | Complexity | Notes |
|---|---|---|
| Network simplex (transportation simplex) | O(n^3 log n) | Exact; the standard CPU solver. Exploiting the network structure makes it far faster in practice than the general simplex method on the same problem. |
| Hungarian algorithm | O(n^3) | Exact; for the assignment special case in which both distributions are uniform over equally many points. Developed by Harold Kuhn in 1955. |
| Auction algorithm (Bertsekas) | O(n^3) typical | Exact; parallelizable. |
| Sinkhorn algorithm (entropic OT) | O(n^2 / epsilon^2) | Approximate; uses entropic regularization (Cuturi 2013). Differentiable and GPU-friendly. |
| Sliced Wasserstein | O(L n log n) | Approximate; averages 1D OT over L random projections (Bonneel, Rabin, Peyré, Pfister 2015). |
| Tree-sliced Wasserstein | O(n log n) per tree | Approximate; replaces the 1D projections with random tree metrics, on which OT has a closed form. |
| 1D closed form | O(n log n) | Exact; for distributions on R, EMD reduces to integrating the absolute difference of the two CDFs (sketched below). |
For problems with n in the thousands or tens of thousands, exact solvers become impractical. The Sinkhorn algorithm and sliced variants are the standard practical compromises.
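The one-dimensional closed form from the table is short enough to write out directly; the sketch below recomputes the earlier number-line example with plain NumPy by integrating the gap between the two step CDFs.

```python
# Sketch of the 1D closed form: EMD = integral of |F_P(x) - F_Q(x)| dx, assuming NumPy.
import numpy as np

pos_p, mass_p = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.3, 0.2])
pos_q, mass_q = np.array([2.0, 3.0, 4.0]), np.array([0.2, 0.5, 0.3])

# Evaluate both step CDFs on the sorted union of the support points.
grid = np.union1d(pos_p, pos_q)
F_p = np.array([mass_p[pos_p <= x].sum() for x in grid])
F_q = np.array([mass_q[pos_q <= x].sum() for x in grid])

# The CDFs are constant between grid points, so the integral is a finite sum.
emd = np.sum(np.abs(F_p[:-1] - F_q[:-1]) * np.diff(grid))
print(emd)  # 1.4, matching the transport-based computations above
```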
In 2013 Marco Cuturi published "Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances" at NIPS (now NeurIPS), which transformed how OT is used in machine learning. He added an entropic penalty on the transport plan:
minimize <T, C> - lambda H(T)
where C is the cost matrix C_ij = d(x_i, y_j), H(T) is the Shannon entropy of T, and lambda > 0 controls smoothing. The optimal T has the form diag(u) K diag(v) where K = exp(-C / lambda), and u, v are non-negative vectors found by alternating row and column normalization. This iterative scaling procedure was studied by Richard Sinkhorn in 1964, hence the name.
The entropic problem is strictly convex, has a unique solution, runs orders of magnitude faster than network simplex (O(n^2) per Sinkhorn iteration), is parallelizable on GPUs, and is differentiable with respect to the cost matrix and the input distributions. The last property unlocked OT as a layer or loss function inside deep learning models. The price is a bias: the Sinkhorn optimum is not exactly W_1, and the bias grows with lambda, so divergence-style corrections such as the Sinkhorn divergence of Aude Genevay, Gabriel Peyré, and Marco Cuturi are used to recover a debiased loss.
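The core iteration is only a few lines. Below is a minimal NumPy sketch without the log-domain stabilization that production implementations use; the regularization strength and iteration count are illustrative.

```python
# Sketch of Sinkhorn iterations for entropic OT, assuming NumPy only.
import numpy as np

def sinkhorn_plan(p, q, C, lam=0.05, n_iter=500):
    """Entropically regularized plan of the form diag(u) K diag(v)."""
    K = np.exp(-C / lam)               # Gibbs kernel built from the cost matrix
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)                # rescale rows to match the P marginal
        v = q / (K.T @ u)              # rescale columns to match the Q marginal
    return u[:, None] * K * v[None, :]

# Tiny example with uniform masses on two random point clouds; the regularized
# cost <T, C> approaches the exact EMD as lam shrinks, at the cost of more iterations.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
p = q = np.full(6, 1 / 6)
T = sinkhorn_plan(p, q, C)
print(np.sum(T * C))                   # approximate EMD between the two point clouds
```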
The original 2000 Rubner-Tomasi-Guibas paper applied EMD to content-based image retrieval. They represented each image as a signature: a small set of weighted cluster centers in color or texture space, obtained by clustering the image's pixels, and used EMD with a perceptually meaningful color distance as the dissimilarity. EMD outperformed histogram intersection, chi-squared, and Kullback-Leibler on color and texture databases, especially when the histograms had been quantized differently. Subsequent work in computer vision used EMD for shape matching, color transfer between images, point cloud comparison, and feature-map alignment in deep networks.
In 2017 Martin Arjovsky, Soumith Chintala, and Léon Bottou introduced WGAN at ICML. Vanilla GAN training minimizes a Jensen-Shannon divergence (or a related f-divergence) between the data distribution and the generator distribution. When the two distributions have nearly disjoint supports, which is the typical situation when a low-dimensional generator manifold is compared against high-dimensional natural images, the JS divergence saturates and is locally constant. The discriminator gradient vanishes and training stalls or collapses.
WGAN replaces JS with the Wasserstein-1 distance, exploiting the Kantorovich-Rubinstein dual. The discriminator (renamed critic) parameterizes a 1-Lipschitz function and is trained to maximize E_real[f(x)] - E_fake[f(x)]. The generator is trained to minimize the same quantity. The Lipschitz constraint was originally enforced by clipping the critic's weights to a small box such as [-0.01, 0.01]. The result was famously stable training, a meaningful loss curve that correlates with sample quality, and almost no mode collapse.
The weight clipping trick is crude. Ishaan Gulrajani and coauthors fixed it in 2017 with WGAN-GP, Improved Training of Wasserstein GANs. They added a gradient penalty on the critic, encouraging the norm of its gradient with respect to interpolations between real and fake samples to stay near 1. WGAN-GP enables stable training of very deep architectures including 101-layer ResNet generators and even discrete-output models such as character-level language models.
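A hedged PyTorch sketch of the penalty term follows; the tiny critic, the data shapes, and the penalty weight of 10 are placeholders for illustration rather than the paper's exact setup.

```python
# Sketch of the WGAN-GP critic loss with gradient penalty, assuming PyTorch.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def gradient_penalty(critic, real, fake):
    """Push the critic's gradient norm toward 1 at random interpolates."""
    eps = torch.rand(real.size(0), 1)                       # per-sample mixing weights
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

real = torch.randn(32, 64)                                  # batch of "real" samples (placeholder)
fake = torch.randn(32, 64)                                  # batch of generator samples (placeholder)

# The critic maximizes E_real[f] - E_fake[f], i.e. minimizes the loss below.
loss = critic(fake).mean() - critic(real).mean() + 10.0 * gradient_penalty(critic, real, fake)
loss.backward()                                             # gradients for one critic update
```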
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger introduced the word mover's distance (WMD) at ICML 2015 in "From Word Embeddings To Document Distances". They represented each document as a normalized bag of words and computed the EMD between two documents using the Euclidean distance between word2vec embeddings as the ground distance. WMD measures the minimum total distance the embedded words of one document must travel to reach the embedded words of the other.
The metric has no hyperparameters and beat seven strong baselines (BM25, LSI, LDA, mSDA, and others) on eight document classification datasets, often by large margins on k-nearest-neighbor classification. WMD became a standard semantic similarity baseline and inspired many follow-ups, including supervised WMD, relaxed WMD, and various speedups based on triangle-inequality bounds.
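The whole computation fits in a few lines of Python. The sketch below is loosely modeled on the paper's running example but uses made-up three-dimensional vectors in place of word2vec embeddings, and it assumes the POT package.

```python
# Sketch of word mover's distance with placeholder embeddings, assuming POT.
import numpy as np
import ot

embeddings = {                                   # made-up vectors, not word2vec
    "obama":     np.array([1.0, 0.2, 0.0]),
    "president": np.array([0.9, 0.3, 0.1]),
    "speaks":    np.array([0.1, 1.0, 0.2]),
    "greets":    np.array([0.2, 0.9, 0.3]),
    "media":     np.array([0.0, 0.1, 1.0]),
    "press":     np.array([0.1, 0.0, 0.9]),
}

def wmd(doc_a, doc_b):
    """EMD between two normalized bags of words, measured in embedding space."""
    words_a, words_b = sorted(set(doc_a)), sorted(set(doc_b))
    p = np.array([doc_a.count(w) for w in words_a], dtype=float)
    q = np.array([doc_b.count(w) for w in words_b], dtype=float)
    p, q = p / p.sum(), q / q.sum()                          # nBOW weights
    Xa = np.stack([embeddings[w] for w in words_a])
    Xb = np.stack([embeddings[w] for w in words_b])
    C = ot.dist(Xa, Xb, metric="euclidean")                  # word-to-word ground distances
    return ot.emd2(p, q, C)                                  # optimal transport cost

print(wmd(["obama", "speaks", "media"], ["president", "greets", "press"]))
```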
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy proposed Optimal Transport for Domain Adaptation in their 2017 IEEE TPAMI paper. The idea: given labeled samples from a source domain and unlabeled samples from a target domain, compute the optimal transport plan from source to target and use it to map source samples (and their labels) into the target domain. A regularizer encourages source samples of the same class to remain close after transport. The method outperformed prior unsupervised domain adaptation baselines on visual benchmarks and inspired a wave of OT-based transfer learning research.
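A hedged sketch of the core transport-then-map step, with synthetic data and without the class-based regularizer of the paper, assuming POT:

```python
# Sketch of OT domain adaptation: transport source samples onto the target domain.
# The synthetic shift is illustrative, and the class regularizer of the paper is omitted.
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))                    # labeled source samples
Xt = rng.normal(loc=[3.0, 1.0], size=(60, 2))    # unlabeled, shifted target samples
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)  # uniform sample weights

C = ot.dist(Xs, Xt)                              # squared Euclidean costs (POT default)
T = ot.emd(a, b, C)                              # exact plan (the paper uses regularized variants)

# Barycentric mapping: each source point moves to the T-weighted average of the
# target points it sends mass to; its label travels with it for training in the target domain.
Xs_mapped = (T / T.sum(axis=1, keepdims=True)) @ Xt
```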
Geoffrey Schiebinger, Eric Lander, and colleagues introduced Waddington-OT in a 2019 Cell paper, Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. They sequenced 315,000 single cells at half-day intervals across 18 days of fibroblast reprogramming and used optimal transport to infer ancestor-descendant relationships between cells observed at adjacent time points. The transport plan reveals a developmental landscape that traditional clustering methods miss, predicting transcription factors and signaling molecules that influence cell fate. The work has been extended to many other developmental and disease contexts.
Optimal transport ideas show up throughout modern generative modeling. The Wasserstein-2 geodesic between distributions provides a principled interpolation that flow-matching and rectified-flow methods use to define training targets for diffusion models. The 2023 flow matching family of papers by Yaron Lipman, Ricky T. Q. Chen, and others trains a velocity field that transports a simple prior to the data distribution along a chosen path, often inspired by OT. OT-CFM (optimal-transport conditional flow matching) explicitly computes mini-batch OT plans during training to straighten the trajectories.
Most machine learning frameworks now ship with OT primitives, and dedicated libraries cover the more advanced solvers.
| Library | Language | Notes |
|---|---|---|
| Python Optimal Transport (POT) | Python | The reference open-source toolbox; implements network simplex, Sinkhorn, Wasserstein barycenters, Gromov-Wasserstein, partial OT, and more. Flamary et al. 2021 JMLR. |
| OTT-JAX | JAX | OT solvers in JAX with full GPU and TPU support, autodiff, and JIT. Originally developed at Google Research. |
| GeomLoss | PyTorch | Sinkhorn divergences, MMD, and Hausdorff loss with online and multiscale solvers. Feydy et al. |
| torch.cdist + scipy.optimize.linear_sum_assignment | PyTorch / SciPy | Build the cost matrix and solve the assignment problem exactly (equal numbers of unit-mass points); suitable for small problems. |
| EMD-OpenCV (cv::EMD) | C++ / Python | Implements the original Rubner-Tomasi-Guibas signature-based EMD. |
| ott (R) and transport (R) | R | Discrete OT in the R ecosystem. |
| Commercial LP solvers (Mosek, Gurobi) | Various | Handle exact EMD as a generic linear program on moderate-size problems. |
EMD is not a free lunch. Its main drawbacks are well known: exact solvers scale poorly with the number of support points, the fast entropic and sliced approximations trade accuracy for speed and add hyperparameters of their own, and estimates computed from finite samples converge slowly in high dimension.
Finally, the core idea in plain language. Imagine you have two piles of LEGO bricks on the floor. The first pile is in the shape of a cat and the second pile is in the shape of a dog. You want to know how different the two shapes are. One way: pick up bricks from the cat pile, walk them across the floor, and drop them into the dog pile until both piles look the same. Each brick you carry costs work equal to how heavy it is times how far you walk. The Earth Mover's Distance is the smallest total work you need, no matter how cleverly you plan the trips. If the cat and dog are sitting almost on top of each other, the work is small. If they are in different rooms, the work is big. Computers use this idea to compare images, documents, and even the way cells change shape over time.