The Earth Mover's Distance (EMD), also known as the Wasserstein-1 distance, Kantorovich-Rubinstein metric, or Mallows's distance, is a measure of dissimilarity between two probability distributions over a metric space. It originates from the optimal transport problem in mathematics and has become one of the most widely used distance metrics in machine learning, computer vision, and natural language processing.
The name comes from a physical analogy: if two distributions are imagined as two different ways of piling up dirt over a region, the EMD represents the minimum total work needed to reshape one pile into the other, where work is defined as the amount of dirt moved multiplied by the distance it travels. This intuitive interpretation makes the EMD both mathematically rigorous and practically meaningful.
Imagine you and your friend each have a sandbox, and you've both built dirt piles in different spots. Your pile might have a big mound on the left side, while your friend's pile has the dirt spread out on the right side. The Earth Mover's Distance asks: what is the least amount of work you would have to do to make your sandbox look exactly like your friend's? You measure work by multiplying how much dirt you move by how far you carry it. If you only need to slide a little dirt a short distance, the EMD is small, meaning the two sandboxes were already pretty similar. If you need to carry a lot of dirt a long way, the EMD is large, meaning they were very different.
This idea applies to anything that can be described as a distribution: colors in an image, words in a document, or data points on a graph. The EMD gives a single number that captures how different two such distributions are.
The mathematical foundations of EMD trace back to 1781, when French mathematician Gaspard Monge formulated the optimal transport problem. Monge, who was also an engineer designing military fortifications, wanted to find the most efficient way to move soil from one configuration to another. He posed the question: given a pile of material at various source locations and a desired arrangement at destination locations, what assignment of sources to destinations minimizes the total transportation cost?
Monge's formulation, however, was restrictive. It required each unit of mass to be transported from a single source to a single destination, with no splitting allowed. This constraint made the problem analytically intractable in many cases.
In 1942, Soviet mathematician and economist Leonid Kantorovich reformulated Monge's problem using the framework of linear programming. Kantorovich's key insight was to allow mass splitting: a source could send fractions of its mass to multiple destinations. This relaxation transformed the problem into a linear program that was both theoretically solvable and computationally feasible. Kantorovich's work earned him the Nobel Prize in Economics in 1975 (shared with Tjalling Koopmans) for contributions to the theory of optimal allocation of resources.
The resulting framework is commonly referred to as the Monge-Kantorovich transportation problem. It became a cornerstone of mathematical optimization and laid the groundwork for what is now called optimal transport theory.
The term "Earth Mover's Distance" was proposed by Jorge Stolfi in 1994. The name caught on in the computer science community, particularly after the influential 1998 paper by Yossi Rubner, Carlo Tomasi, and Leonid J. Guibas, who applied it to content-based image retrieval. In the mathematics and statistics communities, the same quantity is more commonly called the Wasserstein distance (specifically the first Wasserstein distance, W1), a name coined by R. L. Dobrushin in 1970 after encountering the concept in the work of Leonid Vaserstein. The spelling "Wasserstein" is a transliteration of the Yiddish-origin surname Vasershtein.
Cedric Villani, who received the Fields Medal in 2010 (for work on Landau damping and the Boltzmann equation), authored two comprehensive reference books on optimal transport theory: "Topics in Optimal Transportation" (2003) and "Optimal Transport, Old and New" (2008). These texts helped popularize optimal transport methods across mathematics, physics, and applied sciences.
Let (M, d) be a metric space, and let P and Q be two probability distributions on M. A transport plan (or coupling) between P and Q is a joint distribution gamma on M x M whose marginals are P and Q. The set of all such couplings is denoted Gamma(P, Q).
The p-Wasserstein distance between P and Q is defined as:
$$W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$
The Earth Mover's Distance corresponds to the case p = 1:
$$\text{EMD}(P, Q) = W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y) \, d\gamma(x, y)$$
Intuitively, gamma describes how mass is moved from locations drawn according to P to locations drawn according to Q. The integral computes the expected cost (distance) under that plan. The infimum finds the cheapest such plan.
In practice, distributions are often represented as discrete sets of weighted clusters, called signatures. Given two signatures
$$P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}, \qquad Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\},$$
where each p_i and q_j is a feature vector (cluster center), w_pi and w_qj are the corresponding weights, and d_ij = d(p_i, q_j) is the ground distance between cluster centers, the EMD is found by solving the following linear program:
Minimize:
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \cdot d_{ij}$$
Subject to:
$$f_{ij} \ge 0 \quad (1 \le i \le m,\ 1 \le j \le n)$$
$$\sum_{j=1}^{n} f_{ij} \le w_{p_i} \quad (1 \le i \le m)$$
$$\sum_{i=1}^{m} f_{ij} \le w_{q_j} \quad (1 \le j \le n)$$
$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j} \right)$$
The matrix F = [f_ij] is the optimal flow. The normalized EMD is then:
$$\text{EMD}(P, Q) = \frac{\sum_{i} \sum_{j} f_{ij} \cdot d_{ij}}{\sum_{i} \sum_{j} f_{ij}}$$
Normalization by total flow makes the result comparable across pairs of distributions with different total masses.
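The transportation linear program above can be solved directly with a general-purpose LP solver. The following is an illustrative sketch using SciPy's `linprog` (the helper name `emd_lp` and the two-bin example data are ours, not from any library); dedicated network-simplex solvers are much faster in practice.

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(w_p, w_q, D):
    """Exact EMD between two equal-mass signatures, solved as a
    transportation linear program over the flattened flow matrix."""
    m, n = D.shape
    c = D.ravel()  # objective: sum_ij f_ij * d_ij
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):            # row sums: source i ships exactly w_p[i]
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):            # column sums: sink j receives exactly w_q[j]
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([w_p, w_q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun / w_p.sum()    # normalize by total flow

w_p = np.array([0.4, 0.6])
w_q = np.array([0.3, 0.7])
D = np.array([[0.0, 1.5],
              [1.5, 0.0]])
print(round(emd_lp(w_p, w_q, D), 6))  # 0.15: move 0.1 of mass a distance of 1.5
```

With equal total masses the equality constraints pin down both marginals (one row is redundant but harmless), and normalization by the total flow of 1.0 leaves the optimal cost unchanged.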
For the W1 distance, an equivalent dual formulation exists:
$$W_1(P, Q) = \sup_{\|f\|_L \le 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$
where the supremum is taken over all 1-Lipschitz functions f (functions satisfying |f(x) - f(y)| <= d(x, y) for all x, y). This duality is known as the Kantorovich-Rubinstein duality and is the theoretical foundation for Wasserstein GANs.
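The duality is easy to check numerically in a simple case. For a pure translation in one dimension, the 1-Lipschitz function f(x) = x attains the supremum, so the dual value matches the primal W1 (a toy sketch; the sample size and the shift `delta` are arbitrary choices of ours):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
delta = 2.0
q = rng.normal(0.0, 1.0, 100_000)   # sample from Q
p = q + delta                        # P is Q shifted by delta

# f(x) = x is 1-Lipschitz, so E_P[f] - E_Q[f] lower-bounds W1;
# for a pure shift this particular f is optimal.
dual_bound = p.mean() - q.mean()
primal = wasserstein_distance(p, q)
print(dual_bound, primal)  # both approximately 2.0
```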
In the special case of one-dimensional distributions, the EMD has a simple closed-form solution. For two probability distributions P and Q with cumulative distribution functions F_P and F_Q:
$$W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx$$
That is, the EMD equals the area between the two CDFs. For discrete distributions represented as histograms with bins of equal width, this reduces to a linear-time computation: sum the absolute differences of the two cumulative sums and scale by the bin width.
This makes the one-dimensional EMD far cheaper to compute than the general case.
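A minimal NumPy sketch of this closed form (the helper name and toy histograms are ours): for two normalized histograms on shared, equally spaced bins, the EMD is the sum of absolute differences between their cumulative sums, scaled by the bin width.

```python
import numpy as np

def emd_1d_histograms(p, q, bin_width=1.0):
    """EMD between two normalized histograms on the same equally
    spaced bins: the area between their CDFs, computed in O(N)."""
    cdf_p = np.cumsum(p)
    cdf_q = np.cumsum(q)
    return np.sum(np.abs(cdf_p - cdf_q)) * bin_width

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(emd_1d_histograms(p, q))  # 1.0: each half of the mass moves one bin
```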
The EMD (when defined between distributions of equal total mass) satisfies all the axioms of a metric:
| Property | Description |
|---|---|
| Non-negativity | EMD(P, Q) >= 0, with equality if and only if P = Q |
| Symmetry | EMD(P, Q) = EMD(Q, P) |
| Triangle inequality | EMD(P, R) <= EMD(P, Q) + EMD(Q, R) |
| Identity of indiscernibles | EMD(P, Q) = 0 implies P = Q |
When partial matching is allowed (i.e., the distributions have unequal total mass and excess mass is discarded at no cost), the resulting measure is no longer a true metric because it can violate the triangle inequality.
Compared to other probability distribution distances:
| Distance/Divergence | Handles disjoint support | Metric | Differentiable | Sensitive to geometry |
|---|---|---|---|---|
| Earth Mover's Distance (W1) | Yes | Yes | Yes (via dual) | Yes |
| Kullback-Leibler divergence | No (infinite if supports differ) | No (asymmetric) | Yes | No |
| Jensen-Shannon divergence | Yes | Yes (square root) | Yes | No |
| Total variation distance | Yes | Yes | Limited | No |
| Kolmogorov-Smirnov distance | Yes | Yes | No | Partially |
A distinctive advantage of the EMD is its sensitivity to the underlying geometry of the metric space. Unlike bin-to-bin distances such as the chi-squared distance, the EMD accounts for "cross-bin" relationships. For example, in comparing color histograms, the EMD recognizes that dark red is closer to bright red than to blue, whereas bin-wise distances treat all bin mismatches equally.
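The effect can be demonstrated with toy 1D histograms (helper names ours): a bin-wise distance such as chi-squared scores a one-bin shift and a nine-bin shift identically, while the EMD grows with the distance the mass travels.

```python
import numpy as np

def emd_1d(p, q):
    # 1D EMD on unit-width bins: area between the CDFs.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

def chi2(p, q):
    # Bin-wise chi-squared distance; blind to bin geometry.
    denom = p + q
    mask = denom > 0
    return 0.5 * np.sum((p[mask] - q[mask]) ** 2 / denom[mask])

base = np.zeros(10)
base[0] = 1.0
near = np.zeros(10)
near[1] = 1.0   # all mass one bin away
far = np.zeros(10)
far[9] = 1.0    # all mass nine bins away

# Chi-squared sees both mismatches as equally bad...
print(chi2(base, near), chi2(base, far))      # 1.0 1.0
# ...while EMD reflects how far the mass actually moved.
print(emd_1d(base, near), emd_1d(base, far))  # 1.0 9.0
```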
Computing the exact EMD requires solving a transportation problem, which is a special case of linear programming. The standard approach uses the network simplex algorithm, a specialized variant of the simplex method for flow problems.
| Algorithm | Time complexity | Notes |
|---|---|---|
| Network simplex | O(N^3 log N) | Standard method; N is the number of bins |
| Hungarian algorithm | O(N^3) | For bipartite matching (uniform weights) |
| 1D closed form | O(N) | One-dimensional distributions only |
| Auction algorithm | O(N^3) | Alternative for assignment problems |
The cubic-or-worse complexity makes exact EMD computation impractical for large-scale applications, especially when distributions are represented as high-dimensional histograms with thousands of bins.
Several approaches have been developed to reduce the computational cost of EMD:
Sinkhorn distances (Cuturi, 2013). Marco Cuturi proposed adding an entropic regularization term to the optimal transport problem. This transforms the linear program into a strictly convex problem that can be solved using the Sinkhorn-Knopp matrix scaling algorithm. The key benefit is computational: the Sinkhorn algorithm relies only on matrix-vector multiplications, reducing complexity to approximately O(N^2) per iteration with fast convergence. On the MNIST dataset, Sinkhorn distances were shown to be thousands of times faster than exact EMD solvers.
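A minimal NumPy sketch of the Sinkhorn iteration (the function name and the two-bin example are ours; production implementations such as POT add log-domain stabilization and convergence checks):

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.5, n_iters=200):
    """Entropy-regularized transport cost via Sinkhorn-Knopp scaling.
    Each iteration needs only matrix-vector products."""
    K = np.exp(-M / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # scale columns to match b
        u = a / (K @ v)               # scale rows to match a
    P = u[:, None] * K * v[None, :]   # regularized transport plan
    return np.sum(P * M)

a = np.array([0.4, 0.6])
b = np.array([0.3, 0.7])
M = np.array([[0.0, 1.5],
              [1.5, 0.0]])
print(sinkhorn(a, b, M))  # about 0.16; the entropic bias above the exact 0.15 shrinks as reg -> 0
```

Smaller `reg` brings the result closer to the exact EMD but slows convergence and eventually causes numerical underflow in the kernel, which is why stabilized log-domain variants are used in practice.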
Sliced Wasserstein distance. This approach projects high-dimensional distributions onto random one-dimensional subspaces, computes the closed-form 1D Wasserstein distance for each projection, and averages the results. The complexity per projection is O(N log N) due to sorting. The sliced Wasserstein distance is provably equivalent to the true Wasserstein distance in a weak sense and has found applications in generative modeling and point cloud analysis.
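A sketch of the Monte Carlo estimator (names and data ours): project both point clouds onto random unit directions and average the closed-form 1D distances. For a pure shift of 2 along one axis in 3D, the expected value is 2 · E|θ₁| = 1, since the first coordinate of a uniform direction on the sphere is uniform on [-1, 1].

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo sliced W1 between two point clouds in R^d:
    project onto random directions, use the closed-form 1D W1."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)   # random unit direction
        total += wasserstein_distance(X @ theta, Y @ theta)
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = X + np.array([2.0, 0.0, 0.0])        # same cloud, shifted along x
print(sliced_wasserstein(X, Y))          # roughly 1.0 for this pure shift
```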
Tree-based approximations. Methods such as the tree-EMD embed the ground distance into a tree metric, allowing the EMD to be computed in linear time via a simple bottom-up traversal. These methods sacrifice some accuracy for significant speed gains.
Wavelet-based approximations. The EMD can be approximated using wavelet decompositions of the distributions, with complexity linear in the number of bins.
The most influential early application of EMD was in content-based image retrieval, as introduced by Rubner, Tomasi, and Guibas (2000). Their approach represented images as distributions of color features (signatures), where each feature is a cluster center in color space and its weight is the fraction of pixels belonging to that cluster. The EMD between two image signatures captures perceptual similarity more effectively than traditional histogram distances because it accounts for the similarity between different colors rather than treating each color bin independently.
Experimental results showed that EMD-based retrieval outperformed methods based on histogram intersection, chi-squared distance, and other bin-to-bin comparisons, particularly for queries involving color variations and lighting changes.
DeepEMD (Zhang et al., CVPR 2020) introduced a differentiable EMD layer for few-shot learning. The method computes the EMD between dense feature extraction maps of a query image and a support image, treating local feature vectors as distribution elements. The EMD identifies the optimal matching between spatial regions of the two images, providing a structured distance metric that is more informative than global feature comparison.
By using the implicit function theorem, the authors made the EMD computation differentiable, enabling end-to-end training with backpropagation. DeepEMD achieved state-of-the-art results on five few-shot learning benchmarks, with improvements of up to 7% over previous methods.
EMD has been applied to comparing texture descriptors, contour features, and shape distributions. Because the EMD naturally handles distributions with different numbers of clusters (through partial matching), it is well-suited for comparing objects whose feature representations vary in complexity.
In video understanding, EMD is used to compare temporal distributions of motion features across video clips. The metric's ability to capture temporal misalignment makes it useful for action recognition tasks where the same action may be performed at different speeds.
The Word Mover's Distance (WMD), introduced by Kusner et al. (2015), adapts the EMD framework to measure semantic similarity between text documents. Each document is represented as a weighted collection of word embeddings (such as Word2Vec vectors), where the weight of each word is its normalized frequency in the document. The WMD between two documents is the minimum cumulative distance that the word embeddings of one document must travel in the embedding space to match the word embeddings of the other.
Formally, if document A has words with embeddings {a1, ..., am} and normalized weights {w_a1, ..., w_am}, and document B has words with embeddings {b1, ..., bn} and normalized weights {w_b1, ..., w_bn}, the WMD is:
$$\text{WMD}(A, B) = \min_{T \geq 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \cdot \|a_i - b_j\|_2$$
subject to the constraints that the row sums of T equal w_a and the column sums equal w_b.
WMD captures semantic relationships that bag-of-words models miss. For example, "The president addressed the nation" and "The leader spoke to the country" have no words in common, but WMD assigns a small distance because the word embeddings for "president"/"leader", "addressed"/"spoke", and "nation"/"country" are close in the embedding space.
Experiments by Kusner et al. demonstrated that WMD achieved the lowest k-nearest-neighbor classification error on eight document classification benchmarks compared to seven baselines including bag-of-words, TF-IDF, and latent semantic analysis.
Huang and Guo (NeurIPS 2016) extended WMD with a supervised variant (S-WMD) that learns a transformation of the word embedding space to improve classification accuracy. The transformation is optimized so that documents from the same class have lower WMD and documents from different classes have higher WMD.
While WMD provides high-quality distance measurements, its computational cost (cubic in the vocabulary size per pair) limits scalability. Later work explored faster approximations, including relaxed WMD (which uses lower bounds from the dual problem) and methods based on sentence-level embeddings that approximate WMD without solving the full transport problem.
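The relaxed WMD lower bound mentioned above is simple to implement: dropping one of the two marginal constraints lets every word ship all of its mass to its nearest neighbor in the other document, and the tighter of the two one-sided bounds is kept. Below is a sketch with toy 2D vectors standing in for real word embeddings (the function name and all data are illustrative):

```python
import numpy as np

def relaxed_wmd(A, w_a, B, w_b):
    """Relaxed WMD lower bound: with one marginal constraint dropped,
    each word ships all its mass to the nearest word of the other
    document; the tighter of the two one-sided bounds is returned."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    bound_ab = np.sum(w_a * D.min(axis=1))   # A's words to nearest in B
    bound_ba = np.sum(w_b * D.min(axis=0))   # B's words to nearest in A
    return max(bound_ab, bound_ba)

# Toy 2D "embeddings" (purely illustrative, not real Word2Vec vectors)
A = np.array([[0.0, 0.0], [1.0, 0.0]])   # e.g. "president", "nation"
B = np.array([[0.1, 0.0], [1.1, 0.0]])   # e.g. "leader", "country"
w = np.array([0.5, 0.5])
print(round(relaxed_wmd(A, w, B, w), 6))  # 0.1: every word has a close counterpart
```

Because each minimization decouples, the bound costs only one distance-matrix pass instead of an LP solve, which is why it is used to prune candidates in nearest-neighbor document search.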
One of the most impactful applications of the EMD in deep learning came with the introduction of the Wasserstein GAN (WGAN) by Arjovsky, Chintala, and Bottou (2017). Traditional generative adversarial networks train the generator by minimizing the Jensen-Shannon divergence between the generated and real data distributions. This objective suffers from vanishing gradients when the discriminator becomes too strong, because the JS divergence saturates when the two distributions have disjoint supports (which is common in high-dimensional spaces).
WGAN replaces the JS divergence with the Wasserstein-1 (Earth Mover's) distance. Using the Kantorovich-Rubinstein duality, the W1 distance can be estimated by training a critic network (replacing the discriminator) that is constrained to be 1-Lipschitz. The WGAN objective is:
$$\min_G \max_{\|D\|_L \le 1} \left( \mathbb{E}_{x \sim P_{\text{real}}}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(z))] \right)$$
Key advantages of WGAN over standard GANs:
| Aspect | Standard GAN | Wasserstein GAN |
|---|---|---|
| Loss function | Jensen-Shannon divergence | Wasserstein-1 distance |
| Gradient behavior | Vanishing gradients when discriminator is optimal | Continuous, informative gradients everywhere |
| Training stability | Prone to mode collapse and oscillation | More stable training dynamics |
| Correlation with quality | Generator loss does not correlate with sample quality | Critic loss correlates with generated sample quality |
| Lipschitz constraint | None | Required (via weight clipping or gradient penalty) |
The original WGAN enforced the Lipschitz constraint through weight clipping, which sometimes led to capacity underuse. Gulrajani et al. (2017) proposed WGAN-GP (gradient penalty), which enforces the constraint by penalizing the gradient norm of the critic, yielding further improvements in training stability and sample quality.
The Wasserstein distance has also been used in autoencoder-based generative models. Wasserstein autoencoders (Tolstikhin et al., 2018) minimize the Wasserstein distance between the data distribution and the model distribution, providing an alternative to the KL-divergence penalty used in variational autoencoders.
The Wasserstein distance has found applications in single-cell RNA sequencing analysis. The waddR package (Schefzik et al., 2021) uses the 2-Wasserstein distance to detect differential gene expression distributions between conditions. Unlike traditional methods that only test for differences in mean expression, the Wasserstein-based approach can detect changes in variance, distribution shape (such as unimodal versus multimodal), and the proportion of zero counts. The W2 distance can be decomposed into components capturing shifts in mean, variance, and shape, providing interpretable results.
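In one dimension the 2-Wasserstein distance has a quantile-based closed form, which makes the underlying idea easy to illustrate (a sketch of ours, not the waddR implementation): two Gaussian samples with equal means but different spreads are indistinguishable by a comparison of means, yet clearly separated by W2.

```python
import numpy as np

def w2_1d(x, y):
    """2-Wasserstein distance between two 1D samples of equal size:
    match sorted values (the empirical quantile functions)."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
# Same mean, different spread: a mean-based test sees no difference,
# but the W2 distance does.
x = rng.normal(5.0, 1.0, 10_000)
y = rng.normal(5.0, 3.0, 10_000)
print(round(w2_1d(x, y), 2))  # close to |sigma_x - sigma_y| = 2 for equal-mean Gaussians
```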
Optimal transport methods have also been applied to reconstruct cell trajectories from snapshot single-cell data, using the Wasserstein-Fisher-Rao distance to simultaneously model both cell transport and population growth.
The original Kantorovich formulation was motivated by economic resource allocation. Modern applications include distributionally robust optimization (DRO), where decision-makers optimize for worst-case performance over distributions within a Wasserstein ball around the empirical distribution. This approach provides robustness guarantees against distribution shift.
Wasserstein distance has been used as a fairness criterion in algorithmic decision-making. By measuring the distance between outcome distributions across protected groups (such as race or gender), practitioners can quantify disparate impact. Wasserstein-based distributionally robust optimization ensures that models perform consistently across varied data distributions, promoting individual fairness.
The "Ocean Mover's Distance" (Hydrological Wasserstein distance) applies optimal transport concepts to compare oceanographic data distributions. In high-energy physics, differentiable EMD has been used for data compression at the Large Hadron Collider.
Several open-source libraries provide EMD and Wasserstein distance computations:
| Library | Language | Features |
|---|---|---|
| SciPy (scipy.stats.wasserstein_distance) | Python | 1D Wasserstein distance; also wasserstein_distance_nd for N-dimensional |
| POT (Python Optimal Transport) | Python | Full optimal transport solver, Sinkhorn, sliced Wasserstein, and more |
| PyEMD | Python | Wrapper around C++ EMD implementation; now recommends POT for new projects |
| PyTorch (geomloss) | Python | GPU-accelerated Sinkhorn divergences and kernel-based OT |
| OpenCV | C++ / Python | EMD via cv2.EMD for histogram comparison |
| TensorFlow | Python | Custom implementations via tf.linalg operations |
| R (transport, waddR) | R | Exact and approximate OT; Wasserstein tests for genomics |
A basic example using SciPy:

```python
from scipy.stats import wasserstein_distance

# Two 1D distributions, each given as a sample of observed values
dist_a = [1, 2, 3, 4, 5]
dist_b = [2, 3, 4, 5, 6]

emd = wasserstein_distance(dist_a, dist_b)
print(f"EMD: {emd}")  # Output: EMD: 1.0
```
For the general discrete case with a custom ground distance matrix:

```python
import numpy as np
import ot  # POT library

# Two discrete distributions
a = np.array([0.4, 0.6])
b = np.array([0.3, 0.7])

# Ground distance matrix between the bins
M = np.array([[0.0, 1.5],
              [1.5, 0.0]])

# Compute the exact EMD (optimal transport cost)
emd = ot.emd2(a, b, M)
print(f"EMD: {emd}")  # Output: EMD: 0.15
```
The EMD is part of a broader family of optimal transport distances. Understanding how it relates to other measures helps clarify when to use it:
| Metric | Relationship to EMD |
|---|---|
| Wasserstein-1 (W1) | Identical to EMD for probability distributions of equal total mass |
| Wasserstein-2 (W2) | Uses squared distances; emphasizes large displacements more than W1 |
| Wasserstein-p (Wp) | Generalizes EMD with p-th power of distance; EMD is the p=1 case |
| Mallows's distance | Same as Wasserstein distance; name used primarily in statistics |
| Frechet Inception Distance (FID) | Uses Wasserstein-2 between Gaussian approximations of feature distributions |
| Sliced Wasserstein | Approximation of Wasserstein via 1D projections; faster but less precise |
| Sinkhorn divergence | Entropy-regularized version; faster to compute, smooth and differentiable |
| Maximum Mean Discrepancy (MMD) | Kernel-based distance; does not require solving an optimization problem |
Despite its many advantages, the EMD has several limitations: