Machine learning terms/Clustering
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,856 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,856 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
Clustering is a class of unsupervised machine learning techniques that partition a set of objects into groups, called clusters, such that objects in the same group are more similar to each other than to objects in other groups. Unlike supervised learning, clustering algorithms operate on unlabeled data: the algorithm receives feature vectors but no target labels, and it discovers structure on its own. Clustering is one of the oldest and most widely used branches of data mining, pattern recognition, and statistical analysis, with documented applications dating back to the 1930s in psychology and biology.
The central problem of clustering is to define what makes two objects similar, and then to find a partition of the data that maximizes within-cluster similarity while minimizing between-cluster similarity. Different definitions of similarity, different cluster shapes, and different assumptions about the data generating process give rise to dozens of distinct algorithms. There is no single best clustering algorithm: the right choice depends on the geometry of the data, the size of the dataset, the noise level, and what the user intends to do with the resulting clusters.
Clustering is closely related to other unsupervised tasks such as dimensionality reduction, density estimation, and anomaly detection. It is often the first step in exploratory data analysis, customer segmentation, image processing, bioinformatics, and document organization.
A partition is hard if every object belongs to exactly one cluster, or soft (also called fuzzy) if each object has a membership weight in every cluster. Clustering can also be classified as:
| Property | Description |
|---|---|
| Hard vs. soft | Whether each point belongs to one cluster or has fractional membership in many |
| Flat vs. hierarchical | Whether clusters form a single partition or a nested tree |
| Parametric vs. non-parametric | Whether the algorithm assumes a specific distribution (e.g., Gaussian) |
| Exclusive vs. overlapping | Whether clusters can share members |
| Complete vs. partial | Whether every point must be assigned to a cluster, or some can be left as noise |
K-means is the most widely used clustering algorithm. It partitions $n$ observations into $k$ clusters by assigning each observation to the cluster whose mean (the centroid) is nearest, then recomputing the centroids until the assignment stops changing. K-means is a special case of centroid-based clustering and minimizes the within-cluster sum of squared Euclidean distances.
The algorithm is also called the Lloyd algorithm after Stuart Lloyd, who described it in a 1957 Bell Labs technical note that was not formally published until 1982 in IEEE Transactions on Information Theory. The name k-means comes from a 1967 paper by James MacQueen at the Fifth Berkeley Symposium. Edward Forgy proposed an identical method in 1965, and Hugo Steinhaus outlined the idea in 1956.
Standard k-means has time complexity $O(n \cdot k \cdot d \cdot i)$ for $n$ points, $k$ clusters, $d$ dimensions, and $i$ iterations. Finding the global optimum is NP-hard, so the algorithm only finds a local minimum that depends on initial centroid placement.
K-means++ is a careful seeding procedure introduced by David Arthur and Sergei Vassilvitskii in their 2007 paper "k-means++: The Advantages of Careful Seeding" (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms). The first centroid is chosen uniformly at random; each subsequent centroid is chosen with probability proportional to the squared distance from the nearest already-chosen centroid. The authors proved that k-means++ produces a clustering whose expected cost is within $O(\log k)$ of the optimum, a substantial improvement over random initialization. K-means++ is the default initialization in scikit-learn, MATLAB, and many other libraries.
K-medoids is a variant in which each cluster is represented by an actual data point (the medoid) rather than a computed mean. Because medoids are real observations, k-medoids works with arbitrary distance functions and is more robust to outliers than k-means. The Partitioning Around Medoids (PAM) algorithm was introduced by Leonard Kaufman and Peter Rousseeuw in their 1987 report and 1990 book Finding Groups in Data: An Introduction to Cluster Analysis. PAM has higher computational cost than k-means, $O(k(n-k)^2)$ per iteration, which led to faster variants such as CLARA (Clustering Large Applications) and CLARANS (Clustering Large Applications based on Randomized Search) by Ng and Han in 1994. The closely related k-median problem minimizes the sum of L1 distances rather than squared L2 distances.
Hierarchical clustering builds a tree of clusters called a dendrogram. The two main strategies are:
The agglomerative approach was popularized by Joe H. Ward in his 1963 JASA paper, which introduced Ward's method of minimizing variance increase. Other linkage criteria include single linkage (minimum distance, studied by Florek et al. in 1951 and Sneath in 1957), complete linkage (maximum distance), and average linkage (UPGMA, Sokal and Michener 1958).
Naive agglomerative clustering is $O(n^3)$ time and $O(n^2)$ space, though SLINK (Sibson 1973) and CLINK (Defays 1977) achieve $O(n^2)$. Divisive methods include DIANA from Kaufman and Rousseeuw (1990).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu at KDD-96. The paper received the SIGKDD Test of Time Award in 2014.
DBSCAN takes two parameters: $\varepsilon$ (the neighborhood radius) and MinPts (minimum points required for a dense region). A core point has at least MinPts neighbors within $\varepsilon$. Clusters are grown by connecting core points whose neighborhoods overlap; points unreachable from any core point are labeled as noise. DBSCAN discovers clusters of arbitrary shape without pre-specifying their number, but it struggles when cluster densities vary widely.
HDBSCAN (Hierarchical DBSCAN) was introduced by Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander at PAKDD 2013. It generalizes DBSCAN by varying $\varepsilon$ over a continuous range, then extracting a flat clustering from the hierarchy using a stability-based criterion. HDBSCAN handles clusters of varying density and requires only one main parameter (min_cluster_size). The reference Python implementation is the hdbscan library by Leland McInnes.
OPTICS (Ordering Points To Identify the Clustering Structure) was introduced by Ankerst, Breunig, Kriegel, and Sander at ACM SIGMOD 1999. Rather than producing a flat clustering, OPTICS produces a reachability plot that encodes the density-based clustering structure at all scales. The user can extract clusters at any chosen density level.
Mean shift is a non-parametric mode-seeking algorithm proposed by Fukunaga and Hostetler in 1975 and brought to prominence by Dorin Comaniciu and Peter Meer in IEEE TPAMI 2002. The algorithm shifts each point iteratively toward the local maximum of an estimated kernel density. Points converging to the same mode form a cluster. Mean shift requires only a bandwidth parameter and is widely used for image segmentation and visual tracking.
Spectral clustering treats the data as a graph and uses the eigenvectors of the graph Laplacian to embed the points in a low-dimensional space where standard clustering (typically k-means) is applied. The most cited modern formulation is Ng, Jordan, and Weiss (NeurIPS 2001). Earlier work by Shi and Malik (IEEE TPAMI 2000) and Meila and Shi (2001) established the graph-cut interpretation. Von Luxburg's 2007 "A Tutorial on Spectral Clustering" remains the standard reference. Spectral clustering recovers non-convex clusters that fail k-means, but $O(n^3)$ eigendecomposition limits scalability without approximations such as the Nyström method.
A Gaussian mixture model (GMM) assumes the data are generated by a weighted sum of $k$ multivariate Gaussians. Parameters are fit using the Expectation-Maximization (EM) algorithm of Dempster, Laird, and Rubin (1977). GMMs produce a soft clustering: each point has a posterior probability of belonging to each component. K-means is a limit case of EM on a GMM with isotropic Gaussians of equal variance. GMMs handle elliptical clusters and support model selection via AIC and BIC, but EM finds only local optima.
Affinity propagation was introduced by Brendan J. Frey and Delbert Dueck in Science 2007. The algorithm exchanges "responsibility" and "availability" messages between every pair of points until a stable set of exemplars emerges; each non-exemplar is assigned to its nearest exemplar. The number of clusters is not pre-specified, but $O(n^2)$ memory limits scalability.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) was introduced by Zhang, Ramakrishnan, and Livny at SIGMOD 1996. Designed for very large databases that do not fit in memory, BIRCH builds a Clustering Feature tree in a single pass, summarizing dense regions as compact statistics, then runs a global clustering on leaf entries. BIRCH is one of the few clustering algorithms with provably linear time complexity in dataset size.
| Algorithm | Authors and year | Key idea |
|---|---|---|
| Wishart's mode analysis | Wishart, 1969 | Early density-based method, predecessor of DBSCAN |
| CURE | Guha, Rastogi, Shim, 1998 | Cluster as a fixed number of well-scattered points |
| ROCK | Guha, Rastogi, Shim, 1999 | Uses links (shared neighbors) for categorical data |
| CHAMELEON | Karypis, Han, Kumar, 1999 | Graph partitioning then dynamic merging |
| DENCLUE | Hinneburg and Keim, 1998 | Density-attractor via kernel density estimation |
Every clustering algorithm depends on a notion of distance or similarity measure. Common metrics include:
| Metric | Formula | Typical use |
|---|---|---|
| Euclidean (L2) | $\sqrt{\sum_i (x_i - y_i)^2}$ | Continuous numeric features, isotropic clusters |
| Manhattan (L1) | $\sum_i \lvert x_i - y_i \rvert$ | Robust to outliers, grid-like data |
| Minkowski (Lp) | $(\sum_i \lvert x_i - y_i \rvert^p)^{1/p}$ | Generalization of L1 and L2 |
| Chebyshev (L∞) | $\max_i \lvert x_i - y_i \rvert$ | Worst-case coordinate difference |
| Cosine | $1 - \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$ | Text and high-dimensional sparse vectors |
| Mahalanobis | $\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$ | Correlated features with known covariance |
| Hamming | $\sum_i \mathbb{1}[x_i \neq y_i]$ | Binary or categorical data |
| Jaccard | $1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | Set-valued or sparse binary data |
| Edit (Levenshtein) | Min number of insert, delete, substitute operations | Strings, sequences |
| Dynamic time warping | Optimal alignment cost between time series | Time series analysis, speech |
| Earth mover's distance | Wasserstein-1 between probability distributions | Image histograms, distributions |
The choice of distance is itself part of the modeling decision. Cosine distance is standard for text because it ignores document length; Mahalanobis distance is appropriate when features have very different scales or are correlated; dynamic time warping handles time series of different lengths. Many algorithms also support kernelization, which replaces the inner product with a Mercer kernel and allows clustering in an implicit nonlinear feature space.
Feature scaling matters: a Euclidean k-means run on raw data with one feature in dollars and another in years will be dominated by the larger-scale feature. Standardization (zero mean, unit variance) or min-max scaling is almost always recommended.
Evaluating a clustering is harder than evaluating a classifier because there are no ground-truth labels in the typical setting. Two families of metrics are used:
Internal metrics use only the data and the clustering itself.
| Metric | Reference | Direction |
|---|---|---|
| Silhouette coefficient | Rousseeuw, 1987 | Higher is better, range [-1, 1] |
| Davies-Bouldin index | Davies and Bouldin, 1979 | Lower is better |
| Calinski-Harabasz (variance ratio) | Caliński and Harabasz, 1974 | Higher is better |
| Dunn index | Dunn, 1974 | Higher is better |
| Within-cluster sum of squares | (inertia, used by k-means) | Lower is better, but biased toward more clusters |
| Xie-Beni index | Xie and Beni, 1991 | For fuzzy clustering |
The silhouette coefficient $s(i)$ for point $i$ is defined as $(b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the mean intra-cluster distance and $b_i$ is the mean nearest-other-cluster distance. Values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest misclassification.
External metrics compare a clustering to ground-truth class labels, when those are available.
| Metric | Reference | Notes |
|---|---|---|
| Rand index | Rand, 1971 | Probability that two random points are correctly paired |
| Adjusted Rand index (ARI) | Hubert and Arabie, 1985 | Chance-corrected version of Rand index |
| Normalized mutual information (NMI) | Strehl and Ghosh, 2002 | Information-theoretic, range [0, 1] |
| Adjusted mutual information (AMI) | Vinh, Epps, Bailey, 2010 | Chance-corrected NMI |
| Fowlkes-Mallows index | Fowlkes and Mallows, 1983 | Geometric mean of pairwise precision and recall |
| V-measure (homogeneity, completeness) | Rosenberg and Hirschberg, 2007 | Harmonic mean of homogeneity and completeness |
| Purity | Manning et al., 2008 | Simple but biased toward many small clusters |
Kleinberg's impossibility theorem (Jon Kleinberg, 2002, NeurIPS) proved that no clustering function can simultaneously satisfy three intuitive axioms (scale-invariance, richness, and consistency), which is one reason why so many competing algorithms and metrics coexist.
Most partitional algorithms require the user to specify $k$ in advance. Several heuristics help:
Deep clustering jointly learns a representation and a clustering. Notable methods include:
The rise of self-supervised learning has produced methods that cluster on top of representations learned without labels. SwAV (Caron et al., 2020), SCAN (Van Gansbeke et al., 2020), and SeLa (Asano, Rupprecht, Vedaldi, 2020) all combine contrastive or self-distillation losses with online clustering. SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) provide the underlying contrastive learning framework, and the resulting embeddings cluster well under k-means without further training.
With the spread of pre-trained transformer models, a common modern workflow is to encode raw data (text, images, audio) into dense embeddings using a model like Sentence-BERT, CLIP, or a large language model, then run a classical algorithm such as HDBSCAN or k-means on the embeddings. The BERTopic library (Maarten Grootendorst, 2022) packages this idea for topic modeling: it combines transformer embeddings, UMAP dimensionality reduction, HDBSCAN clustering, and class-based TF-IDF for topic representation.
In high dimensions, the curse of dimensionality makes most distance metrics nearly useless: all pairs of points become roughly equidistant. Subspace clustering algorithms search for clusters in axis-aligned or arbitrary low-dimensional subspaces. Examples include CLIQUE (Agrawal et al., 1998), SUBCLU (Kailing, Kriegel, Kröger, 2004), and sparse subspace clustering (Elhamifar and Vidal, 2009). Sketching techniques such as random projections, locality-sensitive hashing, and the Johnson-Lindenstrauss lemma are often applied as preprocessing.
| Library | Language | Notable algorithms |
|---|---|---|
| scikit-learn | Python | k-means, k-means++, Mini-Batch, agglomerative, DBSCAN, OPTICS, GMM, mean shift, spectral, affinity propagation, BIRCH |
hdbscan | Python | HDBSCAN, robust single linkage |
pyclustering | Python, C++ | 40+ algorithms |
cluster | R | PAM, CLARA, AGNES, DIANA, FANNY |
mclust | R | GMM with model selection |
| ELKI | Java | Research-grade implementations |
| Apache Spark MLlib | Scala, Python | Distributed k-means, GMM, bisecting k-means |
| RAPIDS cuML | Python (GPU) | GPU-accelerated k-means, DBSCAN, HDBSCAN |
| FAISS | C++, Python | Large-scale k-means and nearest-neighbor search |
Marketers cluster customers by purchase history, demographics, and engagement to design targeted campaigns. RFM analysis (recency, frequency, monetary value) is a classical clustering input. K-means on standardized RFM features is still common in industry, though hierarchical and density-based methods are used when segments have unequal size.
Clustering text documents by TF-IDF or transformer embeddings reveals thematic structure in news archives, customer support tickets, and scientific literature. Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003) is technically a probabilistic topic model but functions as a soft clustering. BERTopic and Top2Vec are modern embedding-based alternatives.
Grouping pixels by color, position, and texture is one of the oldest applications of clustering. Mean shift, normalized cuts, and Felzenszwalb's graph-based segmentation (2004) are classical methods. Modern semantic segmentation has largely moved to supervised deep learning, but clustering still plays a role in superpixel generation (SLIC, Achanta et al. 2012) and unsupervised pretraining.
Density-based clustering algorithms naturally identify points that do not belong to any cluster: DBSCAN labels them as noise, and HDBSCAN provides per-point outlier scores via the GLOSH algorithm (Campello et al., 2015). Anomaly detection using clustering is widely used in fraud detection, network intrusion detection, and equipment monitoring.
Gene expression analysis routinely clusters genes or samples to identify co-regulated groups. The classic Eisen et al. (1998) PNAS paper applied hierarchical clustering to microarray data. Single-cell RNA sequencing pipelines such as Seurat (Satija lab) and Scanpy use Leiden and Louvain community detection on shared nearest-neighbor graphs to identify cell types.
User-based and item-based collaborative filtering both rely on clustering or near-neighbor structure. Matrix factorization methods (Koren, Bell, Volinsky, 2009) project users and items into a shared embedding space where nearest-neighbor and clustering operations become meaningful.
Clustering also appears in network science (community detection with Louvain, Leiden, Infomap), computer vision (visual word codebooks, vector quantization for compression), audio (speaker diarization), astronomy (galaxy clustering, photometric redshift), climate science (regime identification), and cybersecurity (malware family and log clustering).
Clustering looks deceptively simple, but several pitfalls are worth flagging: