See also: Machine learning, Clustering, Information retrieval
A similarity measure (also called a similarity function or similarity metric) is a real-valued function that quantifies the degree of resemblance between two objects. In machine learning, statistics, and data science, similarity measures are used to compare data points, documents, images, and other entities. They assign large values to pairs of objects that are alike and small (or zero) values to pairs that differ. The closely related concept of a distance metric (or dissimilarity measure) works in the opposite direction: small values indicate similar objects, while large values indicate dissimilar ones.
Similarity measures appear throughout virtually every subfield of machine learning. Clustering algorithms group data points by similarity. Classification algorithms such as k-nearest neighbors rely on distance computations to make predictions. Recommendation systems match users to items by comparing preference vectors. Information retrieval systems rank documents by their similarity to a query. Selecting the right similarity measure for a given task can have a significant effect on model performance.
Imagine you have a big box of crayons and you want to sort them by color. You pick up two crayons and look at them side by side. If they look almost the same color, you say they are very similar and put them in the same pile. If one is bright red and the other is dark blue, they are not similar at all, so they go in different piles.
A similarity measure is the rule you use to decide how alike two crayons are. There are different rules for different situations. Sometimes you care about the exact shade. Sometimes you only care whether two crayons are both "warm" colors or both "cool" colors. Computers use these rules too, but instead of crayons they compare numbers, words, or pictures.
A similarity measure on a set X is a real-valued function s: X × X → ℝ. It is typically required to be symmetric (s(x, y) = s(y, x)) and to satisfy s(x, y) ≤ s(x, x) for all x, y in X, so that no object is more similar to x than x is to itself.
A distance metric (or simply a metric) d: X × X → ℝ must satisfy, for all x, y, z in X:
- Non-negativity: d(x, y) ≥ 0
- Identity of indiscernibles: d(x, y) = 0 if and only if x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Not every commonly used similarity or distance function satisfies all metric axioms. For example, cosine similarity does not obey the triangle inequality in its raw form, and Kullback-Leibler divergence is not symmetric. Functions that relax one or more axioms are sometimes called semi-metrics, quasi-metrics, or divergences depending on which property is dropped.
Similarity and distance are inversely related. Several standard transformations convert one to the other:
| Transformation | Formula | Notes |
|---|---|---|
| Exponential decay | s = exp(-d * γ) | Common in kernel methods; γ controls the decay rate |
| Inverse | s = 1 / (1 + d) | Simple and bounded between 0 and 1 |
| Linear | s = 1 - d/d_max | Requires a known upper bound d_max |
| Gaussian | s = exp(-d² / 2σ²) | Used in spectral clustering and RBF kernels |
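As an illustration, the following NumPy sketch applies each transformation to a small array of distances; the distance values and the parameters γ, σ, and d_max are arbitrary choices for the example:

```python
import numpy as np

d = np.array([0.0, 0.5, 1.0, 2.0, 5.0])   # example distances
gamma, sigma, d_max = 1.0, 1.0, 5.0        # arbitrary parameters

s_exp = np.exp(-gamma * d)                 # exponential decay
s_inv = 1.0 / (1.0 + d)                    # inverse transform, bounded in (0, 1]
s_lin = 1.0 - d / d_max                    # linear transform, needs an upper bound d_max
s_gauss = np.exp(-d**2 / (2 * sigma**2))   # Gaussian / RBF-style transform
```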
The Euclidean distance (also called L2 distance) is the straight-line distance between two points in n-dimensional space:
d(x, y) = √(Σᵢ (xᵢ - yᵢ)²)
It is the most intuitive distance measure, corresponding to the length of the line segment connecting two points. Euclidean distance is a true metric because it satisfies non-negativity, identity, symmetry, and the triangle inequality.
Euclidean distance is widely used in k-nearest neighbors, k-means clustering, and principal component analysis. However, it can become unreliable in high-dimensional spaces due to the "curse of dimensionality," where distances between points tend to converge and become less discriminative.
The Manhattan distance (also called L1 distance, taxicab distance, or city-block distance) sums the absolute differences along each dimension:
d(x, y) = Σᵢ |xᵢ - yᵢ|
The name comes from the grid-like street layout of Manhattan, where travel between two points follows the grid rather than a straight line. Manhattan distance is more robust to outliers than Euclidean distance because it does not square the differences. It is also a true metric.
The Minkowski distance is a generalization that includes both Euclidean and Manhattan distance as special cases:
d(x, y) = (Σᵢ |xᵢ - yᵢ|^p)^(1/p)
The parameter p determines the type of distance:
| Value of p | Resulting metric | Also known as |
|---|---|---|
| p = 1 | Manhattan distance | L1 norm, taxicab distance |
| p = 2 | Euclidean distance | L2 norm |
| p → ∞ | Chebyshev distance | L∞ norm, maximum metric |
The Chebyshev distance equals the maximum absolute difference along any single dimension: d(x, y) = maxᵢ |xᵢ - yᵢ|. It is used in chess (the minimum number of moves a king needs to travel between two squares) and in warehouse logistics.
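The whole Minkowski family can be computed in a few lines of NumPy; the sketch below uses two arbitrary example vectors and checks that the special cases agree with the general formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))           # p = 1
euclidean = np.sqrt(np.sum((x - y) ** 2))   # p = 2
chebyshev = np.max(np.abs(x - y))           # p -> infinity

def minkowski(x, y, p):
    """General Minkowski distance for finite p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

assert np.isclose(minkowski(x, y, 1), manhattan)
assert np.isclose(minkowski(x, y, 2), euclidean)
```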
Cosine similarity measures the cosine of the angle between two non-zero vectors, ignoring their magnitudes:
cos(θ) = (x · y) / (‖x‖ · ‖y‖) = Σᵢ xᵢyᵢ / (√(Σᵢ xᵢ²) · √(Σᵢ yᵢ²))
The result ranges from -1 (opposite directions) through 0 (orthogonal) to +1 (same direction). When vectors contain only non-negative values (common in text processing), the range is 0 to 1.
Cosine similarity is one of the most widely used measures in natural language processing and information retrieval. It powers document similarity in the vector space model, where documents are represented as TF-IDF vectors. Because it ignores magnitude, cosine similarity is not affected by document length, which makes it well suited for comparing texts of different sizes.
Cosine distance is defined as 1 - cos(θ). Note that cosine distance is not a proper metric because it does not satisfy the triangle inequality in general, although a modified form based on the angular distance (arccos(cos(θ)) / π) does satisfy it.
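A minimal NumPy sketch of cosine similarity, cosine distance, and the angular (metric) variant, using two arbitrary vectors that point in the same direction:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])              # same direction, different magnitude

cos_sim = cosine_similarity(x, y)          # 1.0: magnitude is ignored
cos_dist = 1.0 - cos_sim                   # cosine distance (not a true metric)
angular_dist = np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi   # metric variant
```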
The Mahalanobis distance accounts for correlations between variables by incorporating the covariance matrix of the data:
d(x, y) = √((x - y)ᵀ S⁻¹ (x - y))
Here S is the covariance matrix. When S is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. The Mahalanobis distance effectively rescales and rotates the feature space so that correlated features are handled properly. It is especially useful in anomaly detection, where it determines whether a new data point lies within the normal distribution of training data, and in multivariate statistical tests.
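A minimal NumPy sketch using synthetic data with correlated features; in practice the covariance matrix S is estimated from the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # synthetic data with 3 features
X[:, 1] += 0.8 * X[:, 0]               # introduce correlation between features 0 and 1

S = np.cov(X, rowvar=False)            # estimated covariance matrix
S_inv = np.linalg.inv(S)

def mahalanobis(x, y, S_inv):
    diff = x - y
    return np.sqrt(diff @ S_inv @ diff)

# Distance of one point from the sample mean, accounting for correlations.
d = mahalanobis(X[0], X.mean(axis=0), S_inv)
```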
The Jaccard index (also called the Jaccard similarity coefficient or intersection over union) measures the overlap between two finite sets:
J(A, B) = |A ∩ B| / |A ∪ B|
The value ranges from 0 (no shared elements) to 1 (identical sets). When both sets are empty, the Jaccard index is conventionally defined as 1.
The Jaccard distance, defined as 1 - J(A, B), is a true metric. The Jaccard index is widely used in:
- Evaluating object detection and segmentation models, where it is known as intersection over union (IoU)
- Detecting near-duplicate documents and deduplicating records, often approximated at scale with MinHash
- Comparing sets of tokens, tags, or n-gram shingles in text-similarity and recommendation tasks
The Sørensen-Dice coefficient is closely related to the Jaccard index but weights shared elements more heavily:
DSC(A, B) = 2|A ∩ B| / (|A| + |B|)
It ranges from 0 to 1 and is commonly used in image segmentation to evaluate the overlap between predicted and ground-truth masks. The Dice coefficient can be converted to the Jaccard index via J = DSC / (2 - DSC). Unlike the Jaccard distance, the Dice distance (1 - DSC) does not satisfy the triangle inequality.
The overlap coefficient (or Szymkiewicz-Simpson coefficient) divides the intersection size by the size of the smaller set:
overlap(A, B) = |A ∩ B| / min(|A|, |B|)
This measure equals 1 whenever one set is a subset of the other, regardless of how much larger the other set is. It is useful when comparing sets of very different sizes.
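All three set-overlap measures are straightforward to implement on Python sets; the example sets below are arbitrary:

```python
def jaccard(a, b):
    if not a and not b:
        return 1.0                      # convention for two empty sets
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

a = {"apple", "banana", "cherry"}
b = {"banana", "cherry", "date", "fig"}

j = jaccard(a, b)    # 2 / 5 = 0.4
d = dice(a, b)       # 4 / 7 ≈ 0.571; note j == d / (2 - d)
o = overlap(a, b)    # 2 / 3 ≈ 0.667
```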
The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It was defined by Soviet mathematician Vladimir Levenshtein in 1965.
For example, the Levenshtein distance between "kitten" and "sitting" is 3:
- kitten → sitten (substitute "s" for "k")
- sitten → sittin (substitute "i" for "e")
- sittin → sitting (insert "g" at the end)
The Levenshtein distance is computed using dynamic programming with a matrix of size (m+1) x (n+1), where m and n are the lengths of the two strings. The time complexity is O(m * n).
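A direct Python implementation of this dynamic program (a readable sketch rather than an optimized routine):

```python
def levenshtein(s, t):
    """Levenshtein distance via dynamic programming, O(m * n) time."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                    # insert all characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution (or match)
    return dp[m][n]

assert levenshtein("kitten", "sitting") == 3
```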
Applications include spell checking, DNA sequence alignment, optical character recognition correction, and fuzzy string matching in search engines.
The Damerau-Levenshtein distance extends the Levenshtein distance by also allowing transpositions of two adjacent characters as a single edit operation. This is motivated by the observation that transpositions account for a large share of human typing errors. For example, "ab" → "ba" has a Damerau-Levenshtein distance of 1 (one transposition) but a Levenshtein distance of 2 (one deletion and one insertion, or two substitutions).
The Hamming distance counts the number of positions at which two strings (or binary vectors) of equal length differ:
d(x, y) = Σᵢ (xᵢ ≠ yᵢ)
For example, the Hamming distance between "karolin" and "kathrin" is 3 (the third, fourth, and fifth characters differ).
The Hamming distance was introduced by Richard Hamming in 1950 in the context of error-detecting and error-correcting codes. A code with minimum Hamming distance d between codewords can detect up to d - 1 errors and correct up to ⌊(d - 1) / 2⌋ errors. In machine learning, Hamming distance is used for comparing binary feature vectors, hash codes in locality-sensitive hashing, and one-hot encoded categorical data.
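A minimal Python sketch that works on strings or binary vectors of equal length:

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(x, y))

assert hamming("karolin", "kathrin") == 3
assert hamming([1, 0, 1, 1], [1, 1, 0, 1]) == 2
```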
The Jaro similarity between two strings accounts for the number and order of matching characters. Two characters are considered matching if they are the same and appear within a window of ⌊max(|s1|, |s2|) / 2⌋ - 1 positions. The formula is:
jaro(s1, s2) = (1/3) * (m/|s1| + m/|s2| + (m - t)/m)
where m is the number of matching characters and t is the number of transpositions, defined as half the number of matching characters that appear in a different order in the two strings.
The Jaro-Winkler similarity adds a prefix bonus for strings that share a common prefix (up to 4 characters), which reflects the observation that typographical errors are less common at the beginning of a string. These measures are widely used in record linkage and deduplication of person names in databases.
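Rather than reimplementing the matching-window logic, the sketch below calls NLTK (assuming it is installed; jaro_winkler_similarity is listed in the implementation section below, and jaro_similarity is assumed to be available alongside it):

```python
# Assumes NLTK is installed (pip install nltk); functions from nltk.metrics.distance.
from nltk.metrics.distance import jaro_similarity, jaro_winkler_similarity

print(jaro_similarity("martha", "marhta"))           # high: one adjacent transposition
print(jaro_winkler_similarity("martha", "marhta"))   # higher: shared-prefix bonus applied
print(jaro_winkler_similarity("dixon", "dickson"))   # typical record-linkage comparison
```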
The Pearson correlation coefficient measures the linear relationship between two variables:
r = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / (√(Σᵢ (xᵢ - x̄)²) · √(Σᵢ (yᵢ - ȳ)²))
It ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear correlation. The Pearson coefficient is equivalent to cosine similarity when both vectors are centered (mean-subtracted).
Pearson correlation is used in collaborative filtering for recommendation systems, gene expression analysis, and financial data analysis. It is sensitive to outliers and assumes a linear relationship between the variables.
The Spearman rank correlation coefficient is a non-parametric measure that assesses how well the relationship between two variables can be described by a monotonic function. It is computed by applying the Pearson formula to the rank-transformed values of each variable rather than the raw values:
ρ = 1 - (6 Σᵢ dᵢ²) / (n(n² - 1))
where dᵢ is the difference between the ranks of corresponding values and n is the number of observations; this shortcut formula assumes there are no tied ranks.
Because it operates on ranks rather than raw values, Spearman correlation is robust to outliers and does not assume a linear relationship. It is particularly useful for ordinal data and for detecting monotonic but non-linear associations.
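Both coefficients are available in SciPy; the sketch below contrasts them on an arbitrary monotonic but non-linear example:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 100.0])   # monotonic but non-linear, with an outlier

r, _ = stats.pearsonr(x, y)      # < 1: sensitive to the outlier and to non-linearity
rho, _ = stats.spearmanr(x, y)   # 1.0: the relationship is perfectly monotonic
```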
The Kullback-Leibler (KL) divergence measures how one probability distribution P diverges from a reference distribution Q:
D_KL(P ‖ Q) = Σᵢ P(i) · log(P(i) / Q(i))
KL divergence is always non-negative and equals zero if and only if P = Q. It is not symmetric (D_KL(P ‖ Q) ≠ D_KL(Q ‖ P)) and does not satisfy the triangle inequality, so it is not a true metric. It is sometimes called "relative entropy."
KL divergence is widely used in machine learning as a component of loss functions. The cross-entropy loss used to train neural networks can be decomposed into the entropy of the target distribution P plus the KL divergence D_KL(P ‖ Q) between the target and the model's predicted distribution Q. It also appears in variational inference, where the goal is to minimize the KL divergence between an approximate posterior and the true posterior distribution.
The Jensen-Shannon (JS) divergence symmetrizes the KL divergence:
JSD(P ‖ Q) = (1/2) D_KL(P ‖ M) + (1/2) D_KL(Q ‖ M)
where M = (P + Q) / 2. The JS divergence is symmetric, always finite, and bounded between 0 and log(2). Its square root is a true metric. It has been used as the training objective in the original generative adversarial network (GAN) formulation.
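A minimal NumPy sketch of both divergences for discrete distributions; the distributions P and Q are arbitrary examples, and Q is assumed to be strictly positive wherever P is:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                         # terms with p(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.4, 0.3]

print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric
print(js_divergence(p, q))                        # symmetric, bounded by log(2)
```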
The Wasserstein distance (also called the earth mover's distance or Kantorovich metric) is derived from optimal transport theory. Intuitively, it measures the minimum "work" required to transform one probability distribution into another, where work is defined as the amount of probability mass moved times the distance it is moved.
For one-dimensional distributions with cumulative distribution functions F and G:
W₁(F, G) = ∫ |F(x) - G(x)| dx
Unlike KL divergence, the Wasserstein distance is a true metric, is always finite, and provides meaningful gradients even when the distributions do not overlap. This property made it the basis for the Wasserstein GAN (WGAN), which addressed training instability in the original GAN framework.
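For one-dimensional samples, scipy.stats.wasserstein_distance (noted in the implementation section below) computes this directly; the sketch below compares two synthetic Gaussian samples whose means differ by 3:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
samples_p = rng.normal(loc=0.0, scale=1.0, size=1000)
samples_q = rng.normal(loc=3.0, scale=1.0, size=1000)

# 1D Wasserstein distance between the two empirical distributions
# (approximately 3.0, the shift between the means).
print(wasserstein_distance(samples_p, samples_q))
```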
The Bhattacharyya coefficient measures the overlap between two distributions:
BC(P, Q) = Σᵢ √(P(i) · Q(i))
The Bhattacharyya distance is defined as -ln(BC). It does not satisfy the triangle inequality.
The Hellinger distance, defined as:
H(P, Q) = (1/√2) · √(Σᵢ (√P(i) - √Q(i))²)
is related to the Bhattacharyya coefficient via H² = 1 - BC. The Hellinger distance is a proper metric and is bounded between 0 and 1.
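A minimal NumPy sketch of the Bhattacharyya coefficient, the Bhattacharyya distance, and the Hellinger distance, checking the identity H² = 1 - BC on arbitrary example distributions:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    return np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))

def hellinger(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.4, 0.3]

bc = bhattacharyya_coefficient(p, q)
bhattacharyya_dist = -np.log(bc)
h = hellinger(p, q)
assert np.isclose(h ** 2, 1 - bc)   # H² = 1 - BC
```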
A kernel function k(x, y) computes a similarity score between two data points, often corresponding to an inner product in a higher-dimensional (possibly infinite-dimensional) feature space. Kernels must be positive semi-definite. They are central to support vector machines (SVMs) and other kernel methods, where they allow linear algorithms to learn non-linear patterns without explicitly computing the high-dimensional feature mapping (the "kernel trick").
| Kernel | Formula | Key properties |
|---|---|---|
| Linear | k(x, y) = xᵀy | Simplest kernel; no mapping to higher dimensions |
| Polynomial | k(x, y) = (γ · xᵀy + c₀)^d | Captures feature interactions up to degree d |
| RBF (Gaussian) | k(x, y) = exp(-γ ‖x - y‖²) | Universal approximator; locality controlled by γ |
| Laplacian | k(x, y) = exp(-γ ‖x - y‖₁) | Uses L1 distance; useful for noiseless data |
| Sigmoid | k(x, y) = tanh(γ · xᵀy + c₀) | Resembles neural network activation; not always PSD |
| Chi-squared | k(x, y) = exp(-γ Σ (xᵢ - yᵢ)² / (xᵢ + yᵢ)) | Common in computer vision for histogram comparison |
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is the most widely used kernel function. Its output decreases smoothly from 1 (when x = y) toward 0 as the distance between x and y increases. The hyperparameter γ (gamma) controls how quickly the similarity drops off with distance: large γ values produce a narrow, localized kernel, while small γ values produce a broad kernel.
The RBF kernel is a universal approximator, meaning it can approximate any continuous function to arbitrary precision given enough data. In practice, it is the default kernel choice for SVMs. It has only two hyperparameters to tune: the regularization parameter C and the kernel width γ.
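The effect of γ can be seen with scikit-learn's rbf_kernel (listed in the implementation section below); the three example points are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0]])

# Pairwise RBF similarities; gamma controls how fast similarity decays with distance.
K_narrow = rbf_kernel(X, gamma=10.0)   # large gamma: only near-identical points look similar
K_broad = rbf_kernel(X, gamma=0.01)    # small gamma: even distant points retain similarity
```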
The polynomial kernel maps input vectors into a feature space of polynomial combinations up to degree d. Unlike the RBF kernel, it can capture global patterns. The degree parameter controls model complexity: degree 1 yields a linear kernel, while higher degrees capture increasingly complex feature interactions. Polynomial kernels are used in natural language processing and image recognition.
Modern deep learning systems represent objects (text, images, audio) as dense vectors called embeddings. These embeddings are learned by neural networks such that semantically similar objects are mapped to nearby points in the embedding space. Once objects are embedded, standard vector similarity measures (cosine similarity, Euclidean distance, dot product) can be applied to compare them.
Early text embedding methods such as Word2Vec and GloVe learned fixed vector representations for individual words. Later models such as BERT and sentence transformers produce contextualized embeddings that capture the meaning of words in context. Modern large language models generate embeddings with hundreds to thousands of dimensions that encode rich semantic information.
Cosine similarity is the standard measure for comparing text embeddings because it is invariant to vector magnitude, allowing fair comparison between embeddings of different lengths or norms. This is the foundation of semantic search and retrieval-augmented generation (RAG) systems.
Convolutional neural networks and vision transformers produce embedding vectors for images. Models such as CLIP learn a shared embedding space for both text and images, enabling cross-modal similarity search (finding images that match a text query, and vice versa). Cosine similarity and dot product are the most common measures used in these multimodal spaces.
As embedding-based approaches have grown, so has the need for efficient similarity search over large collections of vectors. Vector databases such as Pinecone, Milvus, Weaviate, and Qdrant are designed to store millions or billions of embeddings and support fast nearest-neighbor queries.
Exact nearest-neighbor search has O(n) cost per query, which is impractical for large datasets. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speedups:
| Algorithm/Library | Approach | Notes |
|---|---|---|
| FAISS (Meta) | Multiple index types: flat, IVF, PQ, HNSW | Supports GPU acceleration; handles dynamic datasets |
| Annoy (Spotify) | Random projection trees | Fast, memory-efficient; requires full index rebuild on updates |
| HNSW | Hierarchical navigable small world graphs | High recall; used in Milvus, Qdrant, and others |
| ScaNN (Google) | Learned quantization with anisotropic scoring | Optimized for maximum inner product search |
| LSH | Locality-sensitive hashing | Probabilistic; maps similar items to the same hash buckets |
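As a minimal sketch of the vector-search workflow, the snippet below builds a FAISS flat (exact, brute-force) L2 index over random vectors; it assumes the faiss-cpu package is installed, and approximate index types such as IVF or HNSW are configured differently:

```python
import numpy as np
import faiss   # assumes the faiss-cpu package is installed

d = 128                                             # embedding dimensionality
rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype=np.float32)      # database vectors
xq = rng.random((5, d), dtype=np.float32)           # query vectors

index = faiss.IndexFlatL2(d)        # exact (brute-force) L2 index as a baseline
index.add(xb)                       # add the database vectors
distances, ids = index.search(xq, 5)   # 5 nearest neighbors for each query
```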
Rather than using a fixed similarity function, similarity learning (also called metric learning) trains a neural network to learn a task-specific similarity function from data. The goal is to produce an embedding space where distances reflect the desired notion of similarity.
A Siamese network consists of two identical sub-networks that share the same weights. Each sub-network processes one input and produces an embedding vector. A similarity measure (typically Euclidean distance or cosine similarity) is then applied to the two embeddings. Siamese networks are used for face verification, signature verification, and one-shot learning.
| Loss function | Input structure | Description |
|---|---|---|
| Contrastive loss | Pairs (positive/negative) | Pulls positive pairs together, pushes negative pairs apart by at least a margin |
| Triplet loss | Triplets (anchor, positive, negative) | Ensures the anchor-positive distance is smaller than the anchor-negative distance by a margin |
| N-pair loss | One positive and N-1 negatives | Generalizes triplet loss to multiple negatives simultaneously |
| InfoNCE / NT-Xent | One positive and many negatives in a batch | Used in contrastive learning and self-supervised learning; foundation of SimCLR, MoCo |
Triplet loss, introduced in FaceNet (2015), trains the network to satisfy:
d(anchor, positive) + margin < d(anchor, negative)
Hard negative mining (selecting the most difficult negatives) is often needed to make training efficient.
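A minimal NumPy sketch of the triplet loss on a single triplet of hypothetical two-dimensional embeddings; real systems compute it over batches of learned embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on a single (anchor, positive, negative) triple."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])   # same identity: close to the anchor
negative = np.array([1.0, 0.0])   # different identity: far from the anchor

loss = triplet_loss(anchor, positive, negative)   # 0.0 when the margin is already satisfied
```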
The choice of similarity measure depends on the data type, the task, and practical constraints. The following table summarizes common guidelines:
| Data type | Recommended measures | Reasoning |
|---|---|---|
| Continuous numerical vectors | Euclidean distance, cosine similarity, Mahalanobis distance | Euclidean for low-dimensional data; cosine for high-dimensional or magnitude-invariant comparisons; Mahalanobis when features are correlated |
| Binary or categorical vectors | Hamming distance, Jaccard index | Hamming for fixed-length binary strings; Jaccard for variable-size sets |
| Text (token-level) | Jaccard index, Levenshtein distance, Jaro-Winkler | Jaccard for bag-of-words overlap; edit distances for character-level similarity |
| Text (semantic) | Cosine similarity on embeddings | Captures meaning rather than surface form |
| Probability distributions | KL divergence, JS divergence, Wasserstein distance | KL for directed comparisons; JS for symmetric; Wasserstein for geometric structure |
| Time series | Dynamic time warping, Euclidean distance | DTW handles temporal shifts and scaling |
| Graphs | Graph edit distance, kernel-based measures | Specialized measures that account for structural similarity |
Scale sensitivity. Euclidean and Manhattan distances are sensitive to the scale of features. If features have different units or ranges, normalization or standardization should be applied before computing distances.
Dimensionality. In high-dimensional spaces, distances between all pairs of points tend to converge (the curse of dimensionality), making Euclidean distance less discriminative. Cosine similarity and learned embeddings are often more effective in these settings.
Computational cost. Some measures (such as dynamic time warping and earth mover's distance) are more computationally expensive than simple vector operations. For real-time applications with large datasets, approximate methods or precomputed indexes may be required.
Metric properties. If the application relies on efficient search structures (such as ball trees or cover trees), the similarity measure should be a proper metric satisfying the triangle inequality. Non-metric measures can still be used but may require different algorithmic approaches.
Interpretability. Measures like Pearson correlation and Jaccard index have clear statistical interpretations that can be communicated to domain experts. Learned similarity functions from deep networks may be more powerful but are harder to explain.
Clustering algorithms depend heavily on the choice of similarity or distance measure. K-means uses Euclidean distance to assign data points to the nearest centroid. Hierarchical clustering can use any distance metric and linkage criterion. DBSCAN uses distance-based density estimation. Spectral clustering converts distances to similarities using a Gaussian kernel and then performs graph partitioning.
K-nearest neighbors (KNN) classifies a data point based on the majority class of its k nearest neighbors according to a chosen distance metric. The choice of metric (Euclidean, Manhattan, Minkowski, or learned) affects both accuracy and computational cost. Support vector machines use kernel functions as implicit similarity measures to find optimal separating hyperplanes.
Search engines and information retrieval systems represent queries and documents as vectors and rank results by similarity. Classic approaches use TF-IDF vectors with cosine similarity. Modern semantic search systems use dense embeddings from transformers and perform approximate nearest neighbor search over vector databases.
Recommendation systems use similarity measures in two primary ways. In user-based collaborative filtering, the system finds users with similar preference patterns (measured by Pearson correlation or cosine similarity) and recommends items those similar users liked. In item-based collaborative filtering, the system finds items similar to those a user has already engaged with.
Natural language processing uses similarity measures at multiple levels. Character-level measures (edit distance) support spell checking. Token-level measures (Jaccard on n-grams) support plagiarism detection. Semantic measures (cosine similarity on embeddings) support paraphrase detection, sentiment analysis, and question answering.
Computer vision uses similarity measures for image retrieval, face recognition, and object re-identification. Feature vectors extracted by convolutional neural networks or vision transformers are compared using cosine similarity or Euclidean distance. The Jaccard index (IoU) is the standard evaluation metric for object detection and segmentation tasks.
Anomaly detection identifies data points that are unusually far from normal examples. Mahalanobis distance is commonly used to detect multivariate outliers. Isolation forests and local outlier factor algorithms rely on distance-based density estimates.
| Measure | Type | Range | True metric? | Handles high dimensions well? | Common use cases |
|---|---|---|---|---|---|
| Euclidean distance | Distance | [0, ∞) | Yes | No (curse of dimensionality) | KNN, k-means, low-dimensional data |
| Manhattan distance | Distance | [0, ∞) | Yes | Better than Euclidean | Sparse data, grid-based problems |
| Cosine similarity | Similarity | [-1, 1] | No (angular variant is) | Yes | NLP, document similarity, embeddings |
| Jaccard index | Similarity | [0, 1] | Yes (as distance) | N/A (set-based) | Set comparison, IoU, deduplication |
| Hamming distance | Distance | [0, n] | Yes | Yes | Binary codes, error detection |
| Levenshtein distance | Distance | [0, max(m,n)] | Yes | N/A (string-based) | Spell checking, fuzzy matching |
| Pearson correlation | Similarity | [-1, 1] | No | Moderate | Collaborative filtering, statistics |
| KL divergence | Divergence | [0, ∞) | No (asymmetric) | Yes | Loss functions, variational inference |
| Wasserstein distance | Distance | [0, ∞) | Yes | No (computationally expensive) | GANs, optimal transport |
| RBF kernel | Similarity | (0, 1] | N/A (kernel) | Yes (with proper γ) | SVMs, kernel methods |
| Mahalanobis distance | Distance | [0, ∞) | Yes | Yes (accounts for correlations) | Anomaly detection, multivariate statistics |
Most similarity measures are available in standard machine learning libraries:
- The sklearn.metrics.pairwise module provides cosine_similarity, euclidean_distances, manhattan_distances, rbf_kernel, polynomial_kernel, sigmoid_kernel, laplacian_kernel, chi2_kernel, and a general pairwise_distances function supporting many metrics.
- The scipy.spatial.distance module provides cdist and pdist with support for Euclidean, Manhattan, Minkowski, Mahalanobis, Hamming, Jaccard, cosine, and many other distance functions. scipy.stats.wasserstein_distance computes the 1D Wasserstein distance.
- The nltk.metrics.distance module provides edit_distance (Levenshtein), jaccard_distance, and jaro_winkler_similarity.
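A short sketch showing a few of these functions side by side on the same small matrix of row vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [3.0, 3.0, 3.0]])

D_euclid = pairwise_distances(X, metric="euclidean")    # 3 x 3 Euclidean distance matrix
D_cosine = pairwise_distances(X, metric="cosine")       # cosine distance (1 - similarity)
S_cosine = cosine_similarity(X)                         # cosine similarity matrix

D_manhattan = squareform(pdist(X, metric="cityblock"))  # condensed -> square Manhattan distances
```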