See also: Machine learning, Clustering, Information retrieval
A similarity measure (also called a similarity function or similarity metric) is a real-valued function that quantifies the degree of resemblance between two objects. In machine learning, statistics, and data science, similarity measures are used to compare data points, documents, images, and other entities. They assign large values to pairs of objects that are alike and small (or zero) values to pairs that differ. The closely related concept of a distance metric (or dissimilarity measure) works in the opposite direction: small values indicate similar objects, while large values indicate dissimilar ones.
Similarity measures appear throughout virtually every subfield of machine learning. Clustering algorithms group data points by similarity. Classification algorithms such as k-nearest neighbors rely on distance computations to make predictions. Recommendation systems match users to items by comparing preference vectors. Information retrieval systems rank documents by their similarity to a query. Selecting the right similarity measure for a given task can have a significant effect on model performance.
Imagine you have a big box of crayons and you want to sort them by color. You pick up two crayons and look at them side by side. If they look almost the same color, you say they are very similar and put them in the same pile. If one is bright red and the other is dark blue, they are not similar at all, so they go in different piles.
A similarity measure is the rule you use to decide how alike two crayons are. There are different rules for different situations. Sometimes you care about the exact shade. Sometimes you only care whether two crayons are both "warm" colors or both "cool" colors. Computers use these rules too, but instead of crayons they compare numbers, words, or pictures.
A similarity measure on a set X is a real-valued function s: X × X → ℝ. It is typically required to be symmetric (s(x, y) = s(y, x)) and to satisfy s(x, y) ≤ s(x, x) for all x, y in X, so that no object is more similar to x than x is to itself.
A distance metric (or simply a metric) d: X × X → ℝ must satisfy, for all x, y, z in X:
- Non-negativity: d(x, y) ≥ 0
- Identity of indiscernibles: d(x, y) = 0 if and only if x = y
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Not every commonly used similarity or distance function satisfies all metric axioms. For example, cosine similarity does not obey the triangle inequality in its raw form, and Kullback-Leibler divergence is not symmetric. Functions that relax one or more axioms are sometimes called semi-metrics, quasi-metrics, or divergences depending on which property is dropped.
Similarity and distance are inversely related. Several standard transformations convert one to the other:
| Transformation | Formula | Notes |
|---|---|---|
| Exponential decay | s = exp(-d * γ) | Common in kernel methods; γ controls the decay rate |
| Inverse | s = 1 / (1 + d) | Simple and bounded between 0 and 1 |
| Linear | s = 1 - d/d_max | Requires a known upper bound d_max |
| Gaussian | s = exp(-d² / 2σ²) | Used in spectral clustering and RBF kernels |
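As an illustration, the following NumPy sketch applies each transformation to a small array of distances; the distance values and the parameters γ, σ, and d_max are arbitrary choices for the example:

```python
import numpy as np

d = np.array([0.0, 0.5, 1.0, 2.0, 5.0])   # example distances
gamma, sigma, d_max = 1.0, 1.0, 5.0        # arbitrary parameters

s_exp = np.exp(-gamma * d)                 # exponential decay
s_inv = 1.0 / (1.0 + d)                    # inverse transform, bounded in (0, 1]
s_lin = 1.0 - d / d_max                    # linear transform, needs an upper bound d_max
s_gauss = np.exp(-d**2 / (2 * sigma**2))   # Gaussian / RBF-style transform
```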
The Euclidean distance (also called L2 distance) is the straight-line distance between two points in n-dimensional space:
d(x, y) = √(Σᵢ (xᵢ - yᵢ)²)
It is the most intuitive distance measure, corresponding to the length of the line segment connecting two points. Euclidean distance is a true metric because it satisfies non-negativity, identity, symmetry, and the triangle inequality.
Euclidean distance is widely used in k-nearest neighbors, k-means clustering, and principal component analysis. However, it can become unreliable in high-dimensional spaces due to the "curse of dimensionality," where distances between points tend to converge and become less discriminative.
The Manhattan distance (also called L1 distance, taxicab distance, or city-block distance) sums the absolute differences along each dimension:
d(x, y) = Σᵢ |xᵢ - yᵢ|
The name comes from the grid-like street layout of Manhattan, where travel between two points follows the grid rather than a straight line. Manhattan distance is more robust to outliers than Euclidean distance because it does not square the differences. It is also a true metric.
The Minkowski distance is a generalization that includes both Euclidean and Manhattan distance as special cases:
d(x, y) = (Σᵢ |xᵢ - yᵢ|^p)^(1/p)
The parameter p determines the type of distance:
| Value of p | Resulting metric | Also known as |
|---|---|---|
| p = 1 | Manhattan distance | L1 norm, taxicab distance |
| p = 2 | Euclidean distance | L2 norm |
| p → ∞ | Chebyshev distance | L∞ norm, maximum metric |
The Chebyshev distance equals the maximum absolute difference along any single dimension: d(x, y) = maxᵢ |xᵢ - yᵢ|. It is used in chess (the minimum number of moves a king needs to travel between two squares) and in warehouse logistics.
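The whole Minkowski family can be computed in a few lines of NumPy; the sketch below uses two arbitrary example vectors and checks that the special cases agree with the general formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))           # p = 1
euclidean = np.sqrt(np.sum((x - y) ** 2))   # p = 2
chebyshev = np.max(np.abs(x - y))           # p -> infinity

def minkowski(x, y, p):
    """General Minkowski distance for finite p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

assert np.isclose(minkowski(x, y, 1), manhattan)
assert np.isclose(minkowski(x, y, 2), euclidean)
```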
Cosine similarity measures the cosine of the angle between two non-zero vectors, ignoring their magnitudes:
cos(θ) = (x · y) / (‖x‖ · ‖y‖) = Σᵢ xᵢyᵢ / (√(Σᵢ xᵢ²) · √(Σᵢ yᵢ²))
The result ranges from -1 (opposite directions) through 0 (orthogonal) to +1 (same direction). When vectors contain only non-negative values (common in text processing), the range is 0 to 1.
Cosine similarity is one of the most widely used measures in natural language processing and information retrieval. It powers document similarity in the vector space model, where documents are represented as TF-IDF vectors. Because it ignores magnitude, cosine similarity is not affected by document length, which makes it well suited for comparing texts of different sizes.
Cosine distance is defined as 1 - cos(θ). Note that cosine distance is not a proper metric because it does not satisfy the triangle inequality in general, although a modified form based on the angular distance (arccos(cos(θ)) / π) does satisfy it.
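A minimal NumPy sketch of cosine similarity, cosine distance, and the angular (metric) variant, using two arbitrary vectors that point in the same direction:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])              # same direction, different magnitude

cos_sim = cosine_similarity(x, y)          # 1.0: magnitude is ignored
cos_dist = 1.0 - cos_sim                   # cosine distance (not a true metric)
angular_dist = np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi   # metric variant
```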
The Mahalanobis distance accounts for correlations between variables by incorporating the covariance matrix of the data:
d(x, y) = √((x - y)ᵀ S⁻¹ (x - y))
Here S is the covariance matrix. When S is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. The Mahalanobis distance effectively rescales and rotates the feature space so that correlated features are handled properly. It is especially useful in anomaly detection, where it determines whether a new data point lies within the normal distribution of training data, and in multivariate statistical tests.
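A minimal NumPy sketch using synthetic data with correlated features; in practice the covariance matrix S is estimated from the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # synthetic data with 3 features
X[:, 1] += 0.8 * X[:, 0]               # introduce correlation between features 0 and 1

S = np.cov(X, rowvar=False)            # estimated covariance matrix
S_inv = np.linalg.inv(S)

def mahalanobis(x, y, S_inv):
    diff = x - y
    return np.sqrt(diff @ S_inv @ diff)

# Distance of one point from the sample mean, accounting for correlations.
d = mahalanobis(X[0], X.mean(axis=0), S_inv)
```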
The Jaccard index (also called the Jaccard similarity coefficient or intersection over union) measures the overlap between two finite sets:
J(A, B) = |A ∩ B| / |A ∪ B|
The value ranges from 0 (no shared elements) to 1 (identical sets). When both sets are empty, the Jaccard index is conventionally defined as 1.
The Jaccard distance, defined as 1 - J(A, B), is a true metric. The Jaccard index is widely used in:
- Evaluating object detection and segmentation models, where it is known as intersection over union (IoU)
- Detecting near-duplicate documents and deduplicating records, often approximated at scale with MinHash
- Comparing sets of tokens, tags, or n-gram shingles in text-similarity and recommendation tasks
The Sørensen-Dice coefficient is closely related to the Jaccard index but weights shared elements more heavily:
DSC(A, B) = 2|A ∩ B| / (|A| + |B|)
It ranges from 0 to 1 and is commonly used in image segmentation to evaluate the overlap between predicted and ground-truth masks. The Dice coefficient can be converted to the Jaccard index via J = DSC / (2 - DSC). Unlike the Jaccard distance, the Dice distance (1 - DSC) does not satisfy the triangle inequality.
The overlap coefficient (or Szymkiewicz-Simpson coefficient) divides the intersection size by the size of the smaller set:
overlap(A, B) = |A ∩ B| / min(|A|, |B|)
This measure equals 1 whenever one set is a subset of the other, regardless of how much larger the other set is. It is useful when comparing sets of very different sizes.
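All three set-overlap measures are straightforward to implement on Python sets; the example sets below are arbitrary:

```python
def jaccard(a, b):
    if not a and not b:
        return 1.0                      # convention for two empty sets
    return len(a & b) / len(a | b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

a = {"apple", "banana", "cherry"}
b = {"banana", "cherry", "date", "fig"}

j = jaccard(a, b)    # 2 / 5 = 0.4
d = dice(a, b)       # 4 / 7 ≈ 0.571; note j == d / (2 - d)
o = overlap(a, b)    # 2 / 3 ≈ 0.667
```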
The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It was defined by Soviet mathematician Vladimir Levenshtein in 1965.
For example, the Levenshtein distance between "kitten" and "sitting" is 3:
- kitten → sitten (substitute "s" for "k")
- sitten → sittin (substitute "i" for "e")
- sittin → sitting (insert "g" at the end)
The Levenshtein distance is computed using dynamic programming with a matrix of size (m+1) x (n+1), where m and n are the lengths of the two strings. The time complexity is O(m * n).
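A direct Python implementation of this dynamic program (a readable sketch rather than an optimized routine):

```python
def levenshtein(s, t):
    """Levenshtein distance via dynamic programming, O(m * n) time."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # delete all characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                    # insert all characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution (or match)
    return dp[m][n]

assert levenshtein("kitten", "sitting") == 3
```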
Applications include spell checking, DNA sequence alignment, optical character recognition correction, and fuzzy string matching in search engines.
The Damerau-Levenshtein distance extends the Levenshtein distance by also allowing transpositions of two adjacent characters as a single edit operation. This is motivated by the observation that transpositions account for a large share of human typing errors. For example, "ab" → "ba" has a Damerau-Levenshtein distance of 1 (one transposition) but a Levenshtein distance of 2 (one deletion and one insertion, or two substitutions).
The Hamming distance counts the number of positions at which two strings (or binary vectors) of equal length differ:
d(x, y) = Σᵢ (xᵢ ≠ yᵢ)
For example, the Hamming distance between "karolin" and "kathrin" is 3 (the third, fourth, and fifth characters differ).
The Hamming distance was introduced by Richard Hamming in 1950 in the context of error-detecting and error-correcting codes. A code with minimum Hamming distance d between codewords can detect up to d - 1 errors and correct up to ⌊(d - 1) / 2⌋ errors. In machine learning, Hamming distance is used for comparing binary feature vectors, hash codes in locality-sensitive hashing, and one-hot encoded categorical data.
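A minimal Python sketch that works on strings or binary vectors of equal length:

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(x, y))

assert hamming("karolin", "kathrin") == 3
assert hamming([1, 0, 1, 1], [1, 1, 0, 1]) == 2
```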
The Jaro similarity between two strings accounts for the number and order of matching characters. Two characters are considered matching if they are the same and appear within a window of ⌊max(|s1|, |s2|) / 2⌋ - 1 positions. The formula is:
jaro(s1, s2) = (1/3) * (m/|s1| + m/|s2| + (m - t)/m)
where m is the number of matching characters and t is the number of transpositions, defined as half the number of matching characters that appear in a different order in the two strings.
The Jaro-Winkler similarity adds a prefix bonus for strings that share a common prefix (up to 4 characters), which reflects the observation that typographical errors are less common at the beginning of a string. These measures are widely used in record linkage and deduplication of person names in databases.
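Rather than reimplementing the matching-window logic, the sketch below calls NLTK (assuming it is installed; jaro_winkler_similarity is listed in the implementation section below, and jaro_similarity is assumed to be available alongside it):

```python
# Assumes NLTK is installed (pip install nltk); functions from nltk.metrics.distance.
from nltk.metrics.distance import jaro_similarity, jaro_winkler_similarity

print(jaro_similarity("martha", "marhta"))           # high: one adjacent transposition
print(jaro_winkler_similarity("martha", "marhta"))   # higher: shared-prefix bonus applied
print(jaro_winkler_similarity("dixon", "dickson"))   # typical record-linkage comparison
```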
The Pearson correlation coefficient measures the linear relationship between two variables:
r = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / (√(Σᵢ (xᵢ - x̄)²) · √(Σᵢ (yᵢ - ȳ)²))
It ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship), with 0 indicating no linear correlation. The Pearson coefficient is equivalent to cosine similarity when both vectors are centered (mean-subtracted).
Pearson correlation is used in collaborative filtering for recommendation systems, gene expression analysis, and financial data analysis. It is sensitive to outliers and assumes a linear relationship between the variables.
The Spearman rank correlation coefficient is a non-parametric measure that assesses how well the relationship between two variables can be described by a monotonic function. It is computed by applying the Pearson formula to the rank-transformed values of each variable rather than the raw values:
ρ = 1 - (6 Σᵢ dᵢ²) / (n(n² - 1))
where dᵢ is the difference between the ranks of corresponding values and n is the number of observations; this shortcut formula assumes there are no tied ranks.
Because it operates on ranks rather than raw values, Spearman correlation is robust to outliers and does not assume a linear relationship. It is particularly useful for ordinal data and for detecting monotonic but non-linear associations.
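Both coefficients are available in SciPy; the sketch below contrasts them on an arbitrary monotonic but non-linear example:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 100.0])   # monotonic but non-linear, with an outlier

r, _ = stats.pearsonr(x, y)      # < 1: sensitive to the outlier and to non-linearity
rho, _ = stats.spearmanr(x, y)   # 1.0: the relationship is perfectly monotonic
```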
The Kullback-Leibler (KL) divergence measures how one probability distribution P diverges from a reference distribution Q:
D_KL(P ‖ Q) = Σᵢ P(i) · log(P(i) / Q(i))
KL divergence is always non-negative and equals zero if and only if P = Q. It is not symmetric (D_KL(P ‖ Q) ≠ D_KL(Q ‖ P)) and does not satisfy the triangle inequality, so it is not a true metric. It is sometimes called "relative entropy."
KL divergence is widely used in machine learning as a component of loss functions. The cross-entropy loss used to train neural networks can be decomposed into the entropy of the target distribution P plus the KL divergence D_KL(P ‖ Q) between the target and the model's predicted distribution Q. It also appears in variational inference, where the goal is to minimize the KL divergence between an approximate posterior and the true posterior distribution.
The Jensen-Shannon (JS) divergence symmetrizes the KL divergence:
JSD(P ‖ Q) = (1/2) D_KL(P ‖ M) + (1/2) D_KL(Q ‖ M)
where M = (P + Q) / 2. The JS divergence is symmetric, always finite, and bounded between 0 and log(2). Its square root is a true metric. It has been used as the training objective in the original generative adversarial network (GAN) formulation.
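A minimal NumPy sketch of both divergences for discrete distributions; the distributions P and Q are arbitrary examples, and Q is assumed to be strictly positive wherever P is:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                         # terms with p(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.4, 0.3]

print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric
print(js_divergence(p, q))                        # symmetric, bounded by log(2)
```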
The Wasserstein distance (also called the earth mover's distance or Kantorovich metric) is derived from optimal transport theory. Intuitively, it measures the minimum "work" required to transform one probability distribution into another, where work is defined as the amount of probability mass moved times the distance it is moved.
For one-dimensional distributions with cumulative distribution functions F and G:
W₁(F, G) = ∫ |F(x) - G(x)| dx
Unlike KL divergence, the Wasserstein distance is a true metric, is always finite, and provides meaningful gradients even when the distributions do not overlap. This property made it the basis for the Wasserstein GAN (WGAN), which addressed training instability in the original GAN framework.
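For one-dimensional samples, scipy.stats.wasserstein_distance (noted in the implementation section below) computes this directly; the sketch below compares two synthetic Gaussian samples whose means differ by 3:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
samples_p = rng.normal(loc=0.0, scale=1.0, size=1000)
samples_q = rng.normal(loc=3.0, scale=1.0, size=1000)

# 1D Wasserstein distance between the two empirical distributions
# (approximately 3.0, the shift between the means).
print(wasserstein_distance(samples_p, samples_q))
```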
The Bhattacharyya coefficient measures the overlap between two distributions:
BC(P, Q) = Σᵢ √(P(i) · Q(i))
The Bhattacharyya distance is defined as -ln(BC). It does not satisfy the triangle inequality.
The Hellinger distance, defined as:
H(P, Q) = (1/√2) · √(Σᵢ (√P(i) - √Q(i))²)
is related to the Bhattacharyya coefficient via H² = 1 - BC. The Hellinger distance is a proper metric and is bounded between 0 and 1.
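A minimal NumPy sketch of the Bhattacharyya coefficient, the Bhattacharyya distance, and the Hellinger distance, checking the identity H² = 1 - BC on arbitrary example distributions:

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    return np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))

def hellinger(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.4, 0.3]

bc = bhattacharyya_coefficient(p, q)
bhattacharyya_dist = -np.log(bc)
h = hellinger(p, q)
assert np.isclose(h ** 2, 1 - bc)   # H² = 1 - BC
```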
A kernel function k(x, y) computes a similarity score between two data points, often corresponding to an inner product in a higher-dimensional (possibly infinite-dimensional) feature space. Kernels must be positive semi-definite. They are central to support vector machines (SVMs) and other kernel methods, where they allow linear algorithms to learn non-linear patterns without explicitly computing the high-dimensional feature mapping (the "kernel trick").
| Kernel | Formula | Key properties |
|---|---|---|
| Linear | k(x, y) = xᵀy | Simplest kernel; no mapping to higher dimensions |
| Polynomial | k(x, y) = (γ · xᵀy + c₀)^d | Captures feature interactions up to degree d |
| RBF (Gaussian) | k(x, y) = exp(-γ ‖x - y‖²) | Universal approximator; locality controlled by γ |
| Laplacian | k(x, y) = exp(-γ ‖x - y‖₁) | Uses L1 distance; useful for noiseless data |
| Sigmoid | k(x, y) = tanh(γ · xᵀy + c₀) | Resembles neural network activation; not always PSD |
| Chi-squared | k(x, y) = exp(-γ Σ (xᵢ - yᵢ)² / (xᵢ + yᵢ)) | Common in computer vision for histogram comparison |
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is the most widely used kernel function. Its output decreases smoothly from 1 (when x = y) toward 0 as the distance between x and y increases. The hyperparameter γ (gamma) controls how quickly the similarity drops off with distance: large γ values produce a narrow, localized kernel, while small γ values produce a broad kernel.
The RBF kernel is a universal approximator, meaning it can approximate any continuous function to arbitrary precision given enough data. In practice, it is the default kernel choice for SVMs. It has only two hyperparameters to tune: the regularization parameter C and the kernel width γ.
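The effect of γ can be seen with scikit-learn's rbf_kernel (listed in the implementation section below); the three example points are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0]])

# Pairwise RBF similarities; gamma controls how fast similarity decays with distance.
K_narrow = rbf_kernel(X, gamma=10.0)   # large gamma: only near-identical points look similar
K_broad = rbf_kernel(X, gamma=0.01)    # small gamma: even distant points retain similarity
```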
The polynomial kernel maps input vectors into a feature space of polynomial combinations up to degree d. Unlike the RBF kernel, it can capture global patterns. The degree parameter controls model complexity: degree 1 yields a linear kernel, while higher degrees capture increasingly complex feature interactions. Polynomial kernels are used in natural language processing and image recognition.
Modern deep learning systems represent objects (text, images, audio) as dense vectors called embeddings. These embeddings are learned by neural networks such that semantically similar objects are mapped to nearby points in the embedding space. Once objects are embedded, standard vector similarity measures (cosine similarity, Euclidean distance, dot product) can be applied to compare them.
Early text embedding methods such as Word2Vec and GloVe learned fixed vector representations for individual words. Later models such as BERT and sentence transformers produce contextualized embeddings that capture the meaning of words in context. Modern large language models generate embeddings with hundreds to thousands of dimensions that encode rich semantic information.
Cosine similarity is the standard measure for comparing text embeddings because it is invariant to vector magnitude, allowing fair comparison between embeddings of different lengths or norms. This is the foundation of semantic search and retrieval-augmented generation (RAG) systems.
Convolutional neural networks and vision transformers produce embedding vectors for images. Models such as CLIP learn a shared embedding space for both text and images, enabling cross-modal similarity search (finding images that match a text query, and vice versa). Cosine similarity and dot product are the most common measures used in these multimodal spaces.
As embedding-based approaches have grown, so has the need for efficient similarity search over large collections of vectors. Vector databases such as Pinecone, Milvus, Weaviate, and Qdrant are designed to store millions or billions of embeddings and support fast nearest-neighbor queries.
Exact nearest-neighbor search has O(n) cost per query, which is impractical for large datasets. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speedups:
| Algorithm/Library | Approach | Notes |
|---|---|---|
| FAISS (Meta) | Multiple index types: flat, IVF, PQ, HNSW | Supports GPU acceleration; handles dynamic datasets |
| Annoy (Spotify) | Random projection trees | Fast, memory-efficient; requires full index rebuild on updates |
| HNSW | Hierarchical navigable small world graphs | High recall; used in Milvus, Qdrant, and others |
| ScaNN (Google) | Learned quantization with anisotropic scoring | Optimized for maximum inner product search |
| LSH | Locality-sensitive hashing | Probabilistic; maps similar items to the same hash buckets |
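As a minimal sketch of the vector-search workflow, the snippet below builds a FAISS flat (exact, brute-force) L2 index over random vectors; it assumes the faiss-cpu package is installed, and approximate index types such as IVF or HNSW are configured differently:

```python
import numpy as np
import faiss   # assumes the faiss-cpu package is installed

d = 128                                             # embedding dimensionality
rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype=np.float32)      # database vectors
xq = rng.random((5, d), dtype=np.float32)           # query vectors

index = faiss.IndexFlatL2(d)        # exact (brute-force) L2 index as a baseline
index.add(xb)                       # add the database vectors
distances, ids = index.search(xq, 5)   # 5 nearest neighbors for each query
```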
Rather than using a fixed similarity function, similarity learning (also called metric learning) trains a neural network to learn a task-specific similarity function from data. The goal is to produce an embedding space where distances reflect the desired notion of similarity.
A Siamese network consists of two identical sub-networks that share the same weights. Each sub-network processes one input and produces an embedding vector. A similarity measure (typically Euclidean distance or cosine similarity) is then applied to the two embeddings. Siamese networks are used for face verification, signature verification, and one-shot learning.
| Loss function | Input structure | Description |
|---|---|---|
| Contrastive loss | Pairs (positive/negative) | Pulls positive pairs together, pushes negative pairs apart by at least a margin |
| Triplet loss | Triplets (anchor, positive, negative) | Ensures the anchor-positive distance is smaller than the anchor-negative distance by a margin |
| N-pair loss | One positive and N-1 negatives | Generalizes triplet loss to multiple negatives simultaneously |
| InfoNCE / NT-Xent | One positive and many negatives in a batch | Used in contrastive learning and self-supervised learning; foundation of SimCLR, MoCo |
Triplet loss, introduced in FaceNet (2015), trains the network to satisfy:
d(anchor, positive) + margin < d(anchor, negative)
Hard negative mining (selecting the most difficult negatives) is often needed to make training efficient.
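A minimal NumPy sketch of the triplet loss on a single triplet of hypothetical two-dimensional embeddings; real systems compute it over batches of learned embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on a single (anchor, positive, negative) triple."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])   # same identity: close to the anchor
negative = np.array([1.0, 0.0])   # different identity: far from the anchor

loss = triplet_loss(anchor, positive, negative)   # 0.0 when the margin is already satisfied
```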
The choice of similarity measure depends on the data type, the task, and practical constraints. The following table summarizes common guidelines:
| Data type | Recommended measures | Reasoning |
|---|---|---|
| Continuous numerical vectors | Euclidean distance, cosine similarity, Mahalanobis distance | Euclidean for low-dimensional data; cosine for high-dimensional or magnitude-invariant comparisons; Mahalanobis when features are correlated |
| Binary or categorical vectors | Hamming distance, Jaccard index | Hamming for fixed-length binary strings; Jaccard for variable-size sets |
| Text (token-level) | Jaccard index, Levenshtein distance, Jaro-Winkler | Jaccard for bag-of-words overlap; edit distances for character-level similarity |
| Text (semantic) | Cosine similarity on embeddings | Captures meaning rather than surface form |
| Probability distributions | KL divergence, JS divergence, Wasserstein distance | KL for directed comparisons; JS for symmetric; Wasserstein for geometric structure |
| Time series | Dynamic time warping, Euclidean distance | DTW handles temporal shifts and scaling |
| Graphs | Graph edit distance, kernel-based measures | Specialized measures that account for structural similarity |
Scale sensitivity. Euclidean and Manhattan distances are sensitive to the scale of features. If features have different units or ranges, normalization or standardization should be applied before computing distances.
Dimensionality. In high-dimensional spaces, distances between all pairs of points tend to converge (the curse of dimensionality), making Euclidean distance less discriminative. Cosine similarity and learned embeddings are often more effective in these settings.
Computational cost. Some measures (such as dynamic time warping and earth mover's distance) are more computationally expensive than simple vector operations. For real-time applications with large datasets, approximate methods or precomputed indexes may be required.
Metric properties. If the application relies on efficient search structures (such as ball trees or cover trees), the similarity measure should be a proper metric satisfying the triangle inequality. Non-metric measures can still be used but may require different algorithmic approaches.
Interpretability. Measures like Pearson correlation and Jaccard index have clear statistical interpretations that can be communicated to domain experts. Learned similarity functions from deep networks may be more powerful but are harder to explain.
Clustering algorithms depend heavily on the choice of similarity or distance measure. K-means uses Euclidean distance to assign data points to the nearest centroid. Hierarchical clustering can use any distance metric and linkage criterion. DBSCAN uses distance-based density estimation. Spectral clustering converts distances to similarities using a Gaussian kernel and then performs graph partitioning.
K-nearest neighbors (KNN) classifies a data point based on the majority class of its k nearest neighbors according to a chosen distance metric. The choice of metric (Euclidean, Manhattan, Minkowski, or learned) affects both accuracy and computational cost. Support vector machines use kernel functions as implicit similarity measures to find optimal separating hyperplanes.
Search engines and information retrieval systems represent queries and documents as vectors and rank results by similarity. Classic approaches use TF-IDF vectors with cosine similarity. Modern semantic search systems use dense embeddings from transformers and perform approximate nearest neighbor search over vector databases.
Recommendation systems use similarity measures in two primary ways. In user-based collaborative filtering, the system finds users with similar preference patterns (measured by Pearson correlation or cosine similarity) and recommends items those similar users liked. In item-based collaborative filtering, the system finds items similar to those a user has already engaged with.
Natural language processing uses similarity measures at multiple levels. Character-level measures (edit distance) support spell checking. Token-level measures (Jaccard on n-grams) support plagiarism detection. Semantic measures (cosine similarity on embeddings) support paraphrase detection, sentiment analysis, and question answering.
Computer vision uses similarity measures for image retrieval, face recognition, and object re-identification. Feature vectors extracted by convolutional neural networks or vision transformers are compared using cosine similarity or Euclidean distance. The Jaccard index (IoU) is the standard evaluation metric for object detection and segmentation tasks.
Anomaly detection identifies data points that are unusually far from normal examples. Mahalanobis distance is commonly used to detect multivariate outliers. Isolation forests and local outlier factor algorithms rely on distance-based density estimates.
| Measure | Type | Range | True metric? | Handles high dimensions well? | Common use cases |
|---|---|---|---|---|---|
| Euclidean distance | Distance | [0, ∞) | Yes | No (curse of dimensionality) | KNN, k-means, low-dimensional data |
| Manhattan distance | Distance | [0, ∞) | Yes | Better than Euclidean | Sparse data, grid-based problems |
| Cosine similarity | Similarity | [-1, 1] | No (angular variant is) | Yes | NLP, document similarity, embeddings |
| Jaccard index | Similarity | [0, 1] | Yes (as distance) | N/A (set-based) | Set comparison, IoU, deduplication |
| Hamming distance | Distance | [0, n] | Yes | Yes | Binary codes, error detection |
| Levenshtein distance | Distance | [0, max(m,n)] | Yes | N/A (string-based) | Spell checking, fuzzy matching |
| Pearson correlation | Similarity | [-1, 1] | No | Moderate | Collaborative filtering, statistics |
| KL divergence | Divergence | [0, ∞) | No (asymmetric) | Yes | Loss functions, variational inference |
| Wasserstein distance | Distance | [0, ∞) | Yes | No (computationally expensive) | GANs, optimal transport |
| RBF kernel | Similarity | (0, 1] | N/A (kernel) | Yes (with proper γ) | SVMs, kernel methods |
| Mahalanobis distance | Distance | [0, ∞) | Yes | Yes (accounts for correlations) | Anomaly detection, multivariate statistics |
Most similarity measures are available in standard machine learning libraries:
- The sklearn.metrics.pairwise module provides cosine_similarity, euclidean_distances, manhattan_distances, rbf_kernel, polynomial_kernel, sigmoid_kernel, laplacian_kernel, chi2_kernel, and a general pairwise_distances function supporting many metrics.
- The scipy.spatial.distance module provides cdist and pdist with support for Euclidean, Manhattan, Minkowski, Mahalanobis, Hamming, Jaccard, cosine, and many other distance functions. scipy.stats.wasserstein_distance computes the 1D Wasserstein distance.
- The nltk.metrics.distance module provides edit_distance (Levenshtein), jaccard_distance, and jaro_winkler_similarity.
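A short sketch showing a few of these functions side by side on the same small matrix of row vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [3.0, 3.0, 3.0]])

D_euclid = pairwise_distances(X, metric="euclidean")    # 3 x 3 Euclidean distance matrix
D_cosine = pairwise_distances(X, metric="cosine")       # cosine distance (1 - similarity)
S_cosine = cosine_similarity(X)                         # cosine similarity matrix

D_manhattan = squareform(pdist(X, metric="cityblock"))  # condensed -> square Manhattan distances
```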