See also: Wasserstein distance, Optimal transport, Sinkhorn algorithm
The Earth Mover's Distance (EMD) is a measure of dissimilarity between two probability distributions defined on a metric space. Intuitively, if each distribution is viewed as a pile of dirt spread over the space, the EMD is the minimum total amount of work (mass times distance) needed to reshape one pile into the other. The metric is mathematically equivalent to the Wasserstein-1 distance, and it is also called the Mallows distance in statistics and the Kantorovich-Rubinstein metric in functional analysis. EMD has become a workhorse distance in machine learning because it captures the geometry of the underlying space, remains well defined when supports do not overlap, and provides informative gradients suitable for training neural networks.
The distance was popularized in computer vision by Yossi Rubner, Carlo Tomasi, and Leonidas Guibas in their 2000 paper "The Earth Mover's Distance as a Metric for Image Retrieval" published in the International Journal of Computer Vision. Their work showed that EMD outperforms histogram-bin distances such as L1 and chi-squared on color and texture retrieval tasks because it accounts for the perceptual similarity between bins rather than treating bins as independent buckets. Since then EMD has spread far beyond image retrieval. It underlies the loss function in Wasserstein GAN, the cost in word mover's distance for document similarity, and the matching objective in single-cell trajectory inference.
The optimal transport problem behind EMD has roots that stretch back more than two centuries. Gaspard Monge formulated the original version in 1781 in his memoir Mémoire sur la théorie des déblais et des remblais, written during his work on military fortifications. Monge asked: what is the cheapest way to move a pile of soil (the déblai) into a target pile of equal volume (the remblai) when the cost per unit of soil is its travel distance? His formulation required a deterministic assignment from each grain of source mass to a single destination, which makes the problem hard to solve and sometimes infeasible.
In 1942, the Soviet mathematician and economist Leonid Kantorovich relaxed Monge's problem by allowing mass to be split across multiple destinations, recasting the question as a linear program over joint probability measures. This Kantorovich relaxation turned an intractable combinatorial problem into a tractable convex one, and it forms the modern foundation of optimal transport theory. Kantorovich shared the 1975 Nobel Memorial Prize in Economic Sciences with Tjalling Koopmans for his work on the optimal allocation of resources, of which the transport problem is the canonical example.
The name Wasserstein distance was coined by R. L. Dobrushin in 1970 after he encountered a 1969 paper by Leonid Vaseršteĭn studying Markov processes that describe large systems of automata. The computer vision community independently rediscovered the same object in the 1980s and 1990s. Stanford University doctoral student Yossi Rubner, working with Carlo Tomasi and Leonidas Guibas, gave it the evocative name Earth Mover's Distance in their 1998 ICCV paper and the 2000 IJCV journal version. The IJCV paper has since been cited tens of thousands of times.
Consider the color histograms of two photographs. A histogram-bin metric like L1 treats every bin as independent: shifting all the red mass into slightly redder bins counts as a complete change. EMD accounts for the fact that nearby bins are similar, so a small shift produces a small distance. The same intuition extends to point clouds, weighted point sets, and continuous distributions on any metric space.
A more visual analogy: imagine pile P sitting on the number line at positions {1, 2, 3} with mass {0.5, 0.3, 0.2}, and pile Q at positions {2, 3, 4} with mass {0.2, 0.5, 0.3}. To turn P into Q you can leave most of the mass in place and shift small amounts to the right. The minimum total mass-distance you must pay is the EMD. Kullback-Leibler divergence, by contrast, does not even know that 1 and 2 are nearby; it only compares per-bin probabilities and goes to infinity if any bin in Q has zero mass where P has positive mass.
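As a sanity check on this example, the one-dimensional EMD can be computed with a single SciPy call; the snippet below is a minimal sketch assuming SciPy is available and simply encodes the two piles described above.

```python
# Minimal sketch: EMD between the two piles on the number line, assuming SciPy.
from scipy.stats import wasserstein_distance

emd = wasserstein_distance(
    u_values=[1, 2, 3], v_values=[2, 3, 4],                 # positions of pile P and pile Q
    u_weights=[0.5, 0.3, 0.2], v_weights=[0.2, 0.5, 0.3],   # their masses
)
print(emd)  # 1.4: the minimum total mass-times-distance needed to turn P into Q
```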
Let P and Q be two probability distributions on a metric space (X, d). The Kantorovich formulation of EMD is:
EMD(P, Q) = inf over T in Pi(P, Q) of integral over X x X of d(x, y) dT(x, y)
where Pi(P, Q) is the set of all transport plans, joint probability measures on X x X whose marginals are P and Q. A transport plan T(x, y) specifies how much mass moves from x to y; the marginal constraints ensure that the total outflow from each source matches P and the total inflow to each destination matches Q.
For discrete distributions with point masses {p_i} at positions {x_i} and {q_j} at positions {y_j}, the formulation becomes a finite linear program:
minimize sum over i, j of T_ij d(x_i, y_j)
subject to T_ij >= 0, sum over j of T_ij = p_i, sum over i of T_ij = q_j.
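The snippet below is a small illustrative sketch of this linear program using the Python Optimal Transport (POT) package listed in the tools table later on; the point positions and masses are made up for the example.

```python
# Sketch of the discrete transport LP, assuming the POT package (import ot).
import numpy as np
import ot

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))              # source positions x_i in R^2
y = rng.normal(loc=1.0, size=(7, 2))     # target positions y_j in R^2
p = np.full(5, 1 / 5)                    # source masses p_i
q = np.full(7, 1 / 7)                    # target masses q_j

C = ot.dist(x, y, metric="euclidean")    # ground distances d(x_i, y_j)
T = ot.emd(p, q, C)                      # optimal plan T_ij from the exact LP solver
emd = np.sum(T * C)                      # objective value: sum_ij T_ij d(x_i, y_j)

# The marginal constraints hold: rows sum to p, columns sum to q.
assert np.allclose(T.sum(axis=1), p) and np.allclose(T.sum(axis=0), q)
print(emd)
```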
The Wasserstein-p distance generalizes EMD by raising the ground distance to the p-th power before integrating and taking the p-th root of the result:
W_p(P, Q) = (inf over T of integral d(x, y)^p dT(x, y))^(1/p).
EMD corresponds to W_1, the Wasserstein-1 distance. W_2 is also widely used because it has nice differential geometric properties and gives rise to displacement interpolation along geodesics in Wasserstein space.
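Continuing the sketch above, W_2 falls out of the same exact solver by feeding it squared distances and taking a square root at the end (again assuming POT; ot.emd2 returns the optimal cost rather than the plan).

```python
# Sketch continued: Wasserstein-2 between the same point sets, assuming POT.
C2 = ot.dist(x, y, metric="sqeuclidean")   # squared ground distances d(x_i, y_j)^2
w2 = np.sqrt(ot.emd2(p, q, C2))            # take the p-th root (here p = 2) of the optimal cost
print(w2)
```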
A key fact, used heavily in modern deep learning, is that W_1 admits a dual formulation:
W_1(P, Q) = sup over f Lipschitz with constant 1 of E_P[f(x)] - E_Q[f(x)].
The supremum runs over all 1-Lipschitz functions f. This is the Kantorovich-Rubinstein theorem. The dual replaces a high-dimensional linear program with an optimization over a function class, which neural networks can parameterize. This is the basis of the WGAN critic.
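To make the duality concrete, the sketch below solves the dual directly as a small linear program with SciPy on the earlier number-line example and checks that it reproduces the primal value of 1.4; the union support, the enumeration of pairwise Lipschitz constraints, and the pinning of f at one point are illustrative implementation choices.

```python
# Sketch of the Kantorovich-Rubinstein dual as a linear program, assuming SciPy.
import numpy as np
from scipy.optimize import linprog

z = np.array([1.0, 2.0, 3.0, 4.0])      # union of the two supports
p = np.array([0.5, 0.3, 0.2, 0.0])      # P expressed on the union support
q = np.array([0.0, 0.2, 0.5, 0.3])      # Q expressed on the union support
n = len(z)
D = np.abs(z[:, None] - z[None, :])     # ground distances between support points

# Maximize sum_k f_k (p_k - q_k); linprog minimizes, so negate the objective.
c = -(p - q)

# 1-Lipschitz constraints f_i - f_j <= d(z_i, z_j) for every ordered pair.
rows, rhs = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            rhs.append(D[i, j])

# Pin f at the first support point: adding a constant to f never changes the objective.
bounds = [(0.0, 0.0)] + [(None, None)] * (n - 1)
res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
print(-res.fun)  # ~1.4, matching the primal EMD of the number-line example
```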
EMD satisfies the axioms of a true metric and behaves better than many alternatives on distributions whose supports differ.
| Property | Holds for EMD? | Notes |
|---|---|---|
| Non-negativity | Yes | EMD(P, Q) >= 0. |
| Identity of indiscernibles | Yes | EMD(P, Q) = 0 iff P = Q. |
| Symmetry | Yes | EMD(P, Q) = EMD(Q, P). |
| Triangle inequality | Yes | EMD(P, R) <= EMD(P, Q) + EMD(Q, R). |
| Defined on disjoint supports | Yes | Returns a finite value even when supports do not overlap. |
| Differentiable in P, Q | Almost everywhere | Continuous and differentiable a.e., unlike the JS divergence, which saturates (is locally constant) when supports are disjoint. |
| Sensitive to ground geometry | Yes | Captures spatial structure through d(x, y). |
| Convex in (P, Q) | Yes | The transport LP is convex. |
For histograms with the same total mass, EMD is a true metric. When the two distributions have different total mass, the original Rubner-Tomasi-Guibas definition normalizes the minimum-cost flow by the total flow (the smaller of the two total masses), which supports partial matches but sacrifices the triangle inequality.
Probability distances and divergences differ in symmetry, sensitivity to geometry, and behavior on disjoint supports. The choice of distance shapes the gradients a model receives during training.
| Distance / divergence | Symmetric? | Metric? | Defined on disjoint supports? | Uses ground geometry? |
|---|---|---|---|---|
| Earth mover's distance / Wasserstein-1 | Yes | Yes | Yes | Yes |
| Wasserstein-2 | Yes | Yes | Yes | Yes |
| Kullback-Leibler divergence | No | No | No (infinite) | No |
| Jensen-Shannon divergence | Yes | Square root is a metric | Yes (bounded by log 2) | No |
| Total variation | Yes | Yes | Yes (bounded by 1) | No |
| Hellinger distance | Yes | Yes | Yes (bounded) | No |
| Maximum mean discrepancy (MMD) | Yes | Metric for characteristic kernels, otherwise a pseudo-metric | Yes | Through the kernel |
| Fréchet inception distance (FID) | Yes | Yes (under the Gaussian assumption) | Yes | Yes (in feature space) |
The last column matters most for high-dimensional generative models. KL and JS treat distributions as bags of probabilities and ignore the geometry of the sample space. Two distributions with disjoint supports look maximally far apart to JS (a constant log 2) no matter how close together their supports sit. EMD measures the actual distance the mass must travel and yields informative gradients even in this worst-case disjoint setting.
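A tiny numerical illustration of this point, assuming SciPy: two point masses separated by a gap theta look identically far apart to JS for every theta, while W_1 tracks the gap.

```python
# Sketch: JS saturates at log 2 for disjoint point masses, while W_1 equals the gap.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

for theta in [1e-3, 0.1, 10.0]:
    p, q = [1.0, 0.0], [0.0, 1.0]                 # the two Diracs on the shared support {0, theta}
    js = jensenshannon(p, q, base=np.e) ** 2      # JS divergence in nats (squared JS distance)
    w1 = wasserstein_distance([0.0], [theta])     # EMD between a Dirac at 0 and one at theta
    print(theta, js, w1)                          # js stays at log 2 ~ 0.693; w1 equals theta
```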
Solving the linear program for EMD between two histograms with n bins exactly takes roughly cubic time in n. This computational cost is the main practical drawback of EMD.
| Algorithm | Complexity | Notes |
|---|---|---|
| Network simplex (transportation simplex) | O(n^3 log n) | Exact; the standard CPU solver. Exploiting the network structure makes it far faster in practice than the general simplex method on the same problem. |
| Hungarian algorithm | O(n^3) | Exact; for the assignment special case in which both distributions are uniform over equally many points. Developed by Harold Kuhn in 1955. |
| Auction algorithm (Bertsekas) | O(n^3) typical | Exact; parallelizable. |
| Sinkhorn algorithm (entropic OT) | O(n^2 / epsilon^2) | Approximate; uses entropic regularization (Cuturi 2013). Differentiable and GPU-friendly. |
| Sliced Wasserstein | O(L n log n) | Approximate; averages 1D OT over L random projections (Bonneel, Rabin, Peyré, Pfister 2015). |
| Tree-sliced Wasserstein | O(n log n) per tree | Approximate; replaces the 1D projections with random tree metrics, on which OT has a closed form. |
| 1D closed form | O(n log n) | Exact; for distributions on R, EMD reduces to integrating the absolute difference of the two CDFs (sketched below). |
For problems with n in the thousands or tens of thousands, exact solvers become impractical. The Sinkhorn algorithm and sliced variants are the standard practical compromises.
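The one-dimensional closed form from the table is short enough to write out directly; the sketch below recomputes the earlier number-line example with plain NumPy by integrating the gap between the two step CDFs.

```python
# Sketch of the 1D closed form: EMD = integral of |F_P(x) - F_Q(x)| dx, assuming NumPy.
import numpy as np

pos_p, mass_p = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.3, 0.2])
pos_q, mass_q = np.array([2.0, 3.0, 4.0]), np.array([0.2, 0.5, 0.3])

# Evaluate both step CDFs on the sorted union of the support points.
grid = np.union1d(pos_p, pos_q)
F_p = np.array([mass_p[pos_p <= x].sum() for x in grid])
F_q = np.array([mass_q[pos_q <= x].sum() for x in grid])

# The CDFs are constant between grid points, so the integral is a finite sum.
emd = np.sum(np.abs(F_p[:-1] - F_q[:-1]) * np.diff(grid))
print(emd)  # 1.4, matching the transport-based computations above
```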
In 2013 Marco Cuturi published "Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances" at NIPS (now NeurIPS), which transformed how OT is used in machine learning. He added an entropic penalty on the transport plan:
minimize <T, C> - lambda H(T)
where C is the cost matrix C_ij = d(x_i, y_j), H(T) is the Shannon entropy of T, and lambda > 0 controls smoothing. The optimal T has the form diag(u) K diag(v) where K = exp(-C / lambda), and u, v are non-negative vectors found by alternating row and column normalization. This iterative scaling procedure was studied by Richard Sinkhorn in 1964, hence the name.
The entropic problem is strictly convex, has a unique solution, runs orders of magnitude faster than network simplex (O(n^2) per Sinkhorn iteration), is parallelizable on GPUs, and is differentiable with respect to the cost matrix and the input distributions. The last property unlocked OT as a layer or loss function inside deep learning models. The price is a bias: the Sinkhorn optimum is not exactly W_1, and the bias grows with lambda, so divergence-style corrections such as the Sinkhorn divergence of Aude Genevay, Gabriel Peyré, and Marco Cuturi are used to recover a debiased loss.
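The core iteration is only a few lines. Below is a minimal NumPy sketch without the log-domain stabilization that production implementations use; the regularization strength and iteration count are illustrative.

```python
# Sketch of Sinkhorn iterations for entropic OT, assuming NumPy only.
import numpy as np

def sinkhorn_plan(p, q, C, lam=0.05, n_iter=500):
    """Entropically regularized plan of the form diag(u) K diag(v)."""
    K = np.exp(-C / lam)               # Gibbs kernel built from the cost matrix
    u, v = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)                # rescale rows to match the P marginal
        v = q / (K.T @ u)              # rescale columns to match the Q marginal
    return u[:, None] * K * v[None, :]

# Tiny example with uniform masses on two random point clouds; the regularized
# cost <T, C> approaches the exact EMD as lam shrinks, at the cost of more iterations.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(6, 2)), rng.normal(size=(6, 2))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
p = q = np.full(6, 1 / 6)
T = sinkhorn_plan(p, q, C)
print(np.sum(T * C))                   # approximate EMD between the two point clouds
```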
The original 2000 Rubner-Tomasi-Guibas paper applied EMD to content-based image retrieval. They represented each image as a signature: a small set of weighted cluster centers in color or texture space, obtained by clustering the image's pixels, and used EMD with a perceptually meaningful color distance as the dissimilarity. EMD outperformed histogram intersection, chi-squared, and Kullback-Leibler on color and texture databases, especially when the histograms had been quantized differently. Subsequent work in computer vision used EMD for shape matching, color transfer between images, point cloud comparison, and feature-map alignment in deep networks.
In 2017 Martin Arjovsky, Soumith Chintala, and Léon Bottou introduced WGAN at ICML. Vanilla GAN training minimizes a Jensen-Shannon divergence (or a related f-divergence) between the data distribution and the generator distribution. When the two distributions have nearly disjoint supports, which is the typical situation when a low-dimensional generator manifold is compared against high-dimensional natural images, the JS divergence saturates and is locally constant. The discriminator gradient vanishes and training stalls or collapses.
WGAN replaces JS with the Wasserstein-1 distance, exploiting the Kantorovich-Rubinstein dual. The discriminator (renamed critic) parameterizes a 1-Lipschitz function and is trained to maximize E_real[f(x)] - E_fake[f(x)]. The generator is trained to minimize the same quantity. The Lipschitz constraint was originally enforced by clipping the critic's weights to a small box such as [-0.01, 0.01]. The result was famously stable training, a meaningful loss curve that correlates with sample quality, and almost no mode collapse.
The weight clipping trick is crude. Ishaan Gulrajani and coauthors fixed it in 2017 with WGAN-GP, Improved Training of Wasserstein GANs. They added a gradient penalty on the critic, encouraging the norm of its gradient with respect to interpolations between real and fake samples to stay near 1. WGAN-GP enables stable training of very deep architectures including 101-layer ResNet generators and even discrete-output models such as character-level language models.
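A hedged PyTorch sketch of the penalty term follows; the tiny critic, the data shapes, and the penalty weight of 10 are placeholders for illustration rather than the paper's exact setup.

```python
# Sketch of the WGAN-GP critic loss with gradient penalty, assuming PyTorch.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

def gradient_penalty(critic, real, fake):
    """Push the critic's gradient norm toward 1 at random interpolates."""
    eps = torch.rand(real.size(0), 1)                       # per-sample mixing weights
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

real = torch.randn(32, 64)                                  # batch of "real" samples (placeholder)
fake = torch.randn(32, 64)                                  # batch of generator samples (placeholder)

# The critic maximizes E_real[f] - E_fake[f], i.e. minimizes the loss below.
loss = critic(fake).mean() - critic(real).mean() + 10.0 * gradient_penalty(critic, real, fake)
loss.backward()                                             # gradients for one critic update
```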
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger introduced the word mover's distance (WMD) at ICML 2015 in "From Word Embeddings To Document Distances". They represented each document as a normalized bag of words and computed the EMD between two documents using the Euclidean distance between word2vec embeddings as the ground distance. WMD measures the minimum total distance the embedded words of one document must travel to reach the embedded words of the other.
The metric has no hyperparameters and beat seven strong baselines (BM25, LSI, LDA, mSDA, and others) on eight document classification datasets, often by large margins on k-nearest-neighbor classification. WMD became a standard semantic similarity baseline and inspired many follow-ups, including supervised WMD, relaxed WMD, and various speedups based on triangle-inequality bounds.
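The whole computation fits in a few lines of Python. The sketch below is loosely modeled on the paper's running example but uses made-up three-dimensional vectors in place of word2vec embeddings, and it assumes the POT package.

```python
# Sketch of word mover's distance with placeholder embeddings, assuming POT.
import numpy as np
import ot

embeddings = {                                   # made-up vectors, not word2vec
    "obama":     np.array([1.0, 0.2, 0.0]),
    "president": np.array([0.9, 0.3, 0.1]),
    "speaks":    np.array([0.1, 1.0, 0.2]),
    "greets":    np.array([0.2, 0.9, 0.3]),
    "media":     np.array([0.0, 0.1, 1.0]),
    "press":     np.array([0.1, 0.0, 0.9]),
}

def wmd(doc_a, doc_b):
    """EMD between two normalized bags of words, measured in embedding space."""
    words_a, words_b = sorted(set(doc_a)), sorted(set(doc_b))
    p = np.array([doc_a.count(w) for w in words_a], dtype=float)
    q = np.array([doc_b.count(w) for w in words_b], dtype=float)
    p, q = p / p.sum(), q / q.sum()                          # nBOW weights
    Xa = np.stack([embeddings[w] for w in words_a])
    Xb = np.stack([embeddings[w] for w in words_b])
    C = ot.dist(Xa, Xb, metric="euclidean")                  # word-to-word ground distances
    return ot.emd2(p, q, C)                                  # optimal transport cost

print(wmd(["obama", "speaks", "media"], ["president", "greets", "press"]))
```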
Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy proposed Optimal Transport for Domain Adaptation in their 2017 IEEE TPAMI paper. The idea: given labeled samples from a source domain and unlabeled samples from a target domain, compute the optimal transport plan from source to target and use it to map source samples (and their labels) into the target domain. A regularizer encourages source samples of the same class to remain close after transport. The method outperformed prior unsupervised domain adaptation baselines on visual benchmarks and inspired a wave of OT-based transfer learning research.
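A hedged sketch of the core transport-then-map step, with synthetic data and without the class-based regularizer of the paper, assuming POT:

```python
# Sketch of OT domain adaptation: transport source samples onto the target domain.
# The synthetic shift is illustrative, and the class regularizer of the paper is omitted.
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))                    # labeled source samples
Xt = rng.normal(loc=[3.0, 1.0], size=(60, 2))    # unlabeled, shifted target samples
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)  # uniform sample weights

C = ot.dist(Xs, Xt)                              # squared Euclidean costs (POT default)
T = ot.emd(a, b, C)                              # exact plan (the paper uses regularized variants)

# Barycentric mapping: each source point moves to the T-weighted average of the
# target points it sends mass to; its label travels with it for training in the target domain.
Xs_mapped = (T / T.sum(axis=1, keepdims=True)) @ Xt
```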
Geoffrey Schiebinger, Eric Lander, and colleagues introduced Waddington-OT in a 2019 Cell paper, Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. They sequenced 315,000 single cells at half-day intervals across 18 days of fibroblast reprogramming and used optimal transport to infer ancestor-descendant relationships between cells observed at adjacent time points. The transport plan reveals a developmental landscape that traditional clustering methods miss, predicting transcription factors and signaling molecules that influence cell fate. The work has been extended to many other developmental and disease contexts.
Optimal transport ideas show up throughout modern generative modeling. The Wasserstein-2 geodesic between distributions provides a principled interpolation that flow-matching and rectified-flow methods use to define training targets for diffusion models. The 2023 flow matching family of papers by Yaron Lipman, Ricky T. Q. Chen, and others trains a velocity field that transports a simple prior to the data distribution along a chosen path, often inspired by OT. OT-CFM (optimal-transport conditional flow matching) explicitly computes mini-batch OT plans during training to straighten the trajectories.
Most machine learning frameworks now ship with OT primitives, and dedicated libraries cover the more advanced solvers.
| Library | Language | Notes |
|---|---|---|
| Python Optimal Transport (POT) | Python | The reference open-source toolbox; implements network simplex, Sinkhorn, Wasserstein barycenters, Gromov-Wasserstein, partial OT, and more. Flamary et al. 2021 JMLR. |
| OTT-JAX | JAX | OT solvers in JAX with full GPU and TPU support, autodiff, and JIT. Originally developed at Google Research. |
| GeomLoss | PyTorch | Sinkhorn divergences, MMD, and Hausdorff loss with online and multiscale solvers. Feydy et al. |
| torch.cdist + scipy.optimize.linear_sum_assignment | PyTorch / SciPy | Build the cost matrix and solve the assignment problem exactly (equal numbers of unit-mass points); suitable for small problems. |
| EMD-OpenCV (cv::EMD) | C++ / Python | Implements the original Rubner-Tomasi-Guibas signature-based EMD. |
| ott (R) and transport (R) | R | Discrete OT in the R ecosystem. |
| Commercial LP solvers (Mosek, Gurobi) | Various | Handle exact EMD as a generic linear program on moderate-size problems. |
EMD is not a free lunch. Its main drawbacks are well known: exact solvers scale poorly with the number of support points, the fast entropic and sliced approximations trade accuracy for speed and add hyperparameters of their own, and estimates computed from finite samples converge slowly in high dimension.
Finally, the core idea in plain language. Imagine you have two piles of LEGO bricks on the floor. The first pile is in the shape of a cat and the second pile is in the shape of a dog. You want to know how different the two shapes are. One way: pick up bricks from the cat pile, walk them across the floor, and drop them into the dog pile until both piles look the same. Each brick you carry costs work equal to how heavy it is times how far you walk. The Earth Mover's Distance is the smallest total work you need, no matter how cleverly you plan the trips. If the cat and dog are sitting almost on top of each other, the work is small. If they are in different rooms, the work is big. Computers use this idea to compare images, documents, and even the way cells change shape over time.