See also: Machine learning terms
Clustering is a fundamental technique in machine learning and data mining that involves grouping similar data points or objects into clusters based on some form of similarity measure or distance metric. The goal of clustering is to identify underlying patterns or structures in data, enabling efficient data representation, classification, and interpretation. Clustering is an unsupervised learning method, meaning that it does not rely on pre-labeled data for training but rather discovers inherent relationships within the data itself.
Clustering is one of the most widely used techniques in data science and has applications spanning customer segmentation, image analysis, anomaly detection, document organization, and biological taxonomy. The choice of clustering algorithm and its configuration depend on the characteristics of the data, the expected cluster shapes, and the specific goals of the analysis.
Formally, given a dataset X = {x_1, x_2, ..., x_n} of n data points in a d-dimensional feature space, clustering seeks to partition X into k groups C = {C_1, C_2, ..., C_k} such that data points within the same cluster are more similar to each other than to data points in other clusters. The definition of "similar" depends on the distance or similarity metric chosen, and different metrics can produce very different clusterings of the same data.
The idea of grouping observations by similarity has roots stretching back to the early twentieth century, when taxonomists and statisticians developed manual classification methods. In 1956, Hugo Steinhaus published one of the earliest formal descriptions of partitioning a set of points by minimizing within-group distances. The following year, Stuart P. Lloyd devised the iterative algorithm now known as Lloyd's algorithm while working at Bell Laboratories on pulse-code modulation. Although Lloyd's internal technical memorandum circulated within Bell Labs for decades, it was not formally published until 1982. Independently, E. W. Forgy described a nearly identical procedure in 1965, and James MacQueen coined the term "k-means" in his 1967 paper presented at the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
From the 1960s through the 1980s, hierarchical clustering methods and linkage criteria were refined by researchers including Joe Ward (Ward's method, 1963) and Peter Rousseeuw (silhouette analysis, 1987). The 1990s saw the introduction of density-based and large-scale approaches, most notably DBSCAN (Ester et al., 1996) and BIRCH (Zhang, Ramakrishnan, and Livny, 1996), which addressed limitations of centroid-based methods for datasets with irregular cluster shapes, varying densities, or sizes too large to fit in memory. Spectral clustering gained prominence in the early 2000s and was later consolidated in the widely cited tutorial by Ulrike von Luxburg (2007). More recently, deep learning methods such as Deep Embedded Clustering (Xie et al., 2016) and contrastive clustering approaches have pushed the field toward learning representations and cluster assignments jointly.
There are several types of clustering algorithms, each with its own strengths and weaknesses. The choice of an appropriate algorithm depends on the specific problem being addressed, the data properties, and the desired outcome.
| Type | Key idea | Example algorithms | Cluster shape |
|---|---|---|---|
| Centroid-based | Clusters are represented by a central point (centroid) | K-means, K-medoids, Mini-Batch K-means | Spherical/convex |
| Density-based | Clusters are regions of high density separated by regions of low density | DBSCAN, OPTICS, HDBSCAN, Mean Shift | Arbitrary |
| Hierarchical | Clusters are organized in a tree-like structure (dendrogram) | Agglomerative, Divisive, BIRCH | Arbitrary |
| Distribution-based | Clusters are modeled as probability distributions | Gaussian Mixture Models (GMM), Bayesian GMM | Elliptical |
| Grid-based | Data space is divided into a grid, and clustering is performed on the grid cells | STING, CLIQUE | Varies |
| Spectral | Uses eigenvalues of a similarity matrix to reduce dimensions before clustering | Spectral clustering | Arbitrary |
Centroid-based (also called partition-based) clustering algorithms divide the dataset into non-overlapping subsets or clusters, where each data point belongs to exactly one cluster. These algorithms represent each cluster by a central point and assign data points to the cluster with the nearest center.
K-means is the most widely known and commonly used clustering algorithm. It partitions the data into k clusters by iteratively refining cluster centroids to minimize the within-cluster sum of squared distances (also known as inertia).
Algorithm:
1. Choose the number of clusters k and initialize k centroids (for example, randomly or with k-means++).
2. Assign each data point to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change or a maximum number of iterations is reached.
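The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation; the function name `kmeans` and the random initialization are choices made here for clarity, and a library implementation such as scikit-learn's `KMeans` would normally be preferred.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```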
K-means++ initialization, proposed by David Arthur and Sergei Vassilvitskii in 2007, selects initial centroids that are spread out across the data, which leads to faster convergence and better final results compared to random initialization. The algorithm picks the first centroid uniformly at random, then chooses each subsequent centroid with probability proportional to the squared distance from the nearest existing centroid.
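A minimal sketch of the k-means++ seeding step, assuming the data is a NumPy array `X`; the helper name `kmeans_pp_init` is illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids: the first uniformly at random, each subsequent one
    with probability proportional to squared distance from the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```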
Mini-Batch K-means is a variant that uses small random batches of data in each iteration rather than the full dataset, substantially reducing computation time while typically producing results close to those of standard K-means. It is particularly useful for very large datasets that do not fit into memory.
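For reference, a short usage sketch with scikit-learn (assuming it is installed); the parameter values are illustrative.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

# batch_size controls how many samples are used per update step.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42)
labels = mbk.fit_predict(X)
```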
| Property | Description |
|---|---|
| Time complexity | O(n * k * d * i), where n = samples, k = clusters, d = dimensions, i = iterations |
| Space complexity | O(n * d + k * d) |
| Cluster shape | Spherical, roughly equal-sized clusters |
| Requires | Number of clusters k specified in advance |
| Strengths | Simple, fast, scalable to large datasets |
| Weaknesses | Sensitive to initialization and outliers; assumes spherical clusters; requires choosing k |
K-medoids is similar to K-means but uses actual data points (medoids) as cluster centers instead of computed means. This makes it more robust to outliers and noise because the median-like behavior of medoids is less influenced by extreme values. The Partitioning Around Medoids (PAM) algorithm is the most common implementation. In each iteration, PAM evaluates whether swapping a medoid with a non-medoid point reduces the total dissimilarity. K-medoids is computationally more expensive than K-means, with a time complexity of O(k * (n - k)^2 * i), but it can work with any distance metric, not just Euclidean distance.
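A brief sketch using the `KMedoids` estimator from the scikit-learn-extra package (an assumption: that package must be installed separately); note that any supported distance metric can be used, not just Euclidean.

```python
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# method="pam" selects the Partitioning Around Medoids swap procedure.
km = KMedoids(n_clusters=3, metric="manhattan", method="pam", random_state=0)
labels = km.fit_predict(X)
medoid_points = km.cluster_centers_   # actual data points chosen as medoids
```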
Density-based clustering algorithms identify clusters by grouping data points that are closely packed together based on a density metric. These algorithms can find arbitrarily shaped clusters and are robust to noise in the data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996, groups together data points that are closely packed in a high-density region while marking low-density regions as noise.
Core concepts:
- Epsilon (eps) neighborhood: all points within distance epsilon of a given point.
- Core point: a point with at least minPts points (including itself) in its epsilon-neighborhood.
- Border point: a point that is not a core point but lies within the epsilon-neighborhood of a core point.
- Noise point: a point that is neither a core point nor a border point.
- Density-reachability: a point is density-reachable from a core point if it can be reached through a chain of core points whose epsilon-neighborhoods overlap.

Algorithm:
1. For each unvisited point, retrieve its epsilon-neighborhood.
2. If the point is a core point, start a new cluster and add all points density-reachable from it.
3. If the point is not a core point and is not reachable from any core point, label it as noise.
4. Repeat until every point has been visited.
| Property | Description |
|---|---|
| Time complexity | O(n log n) with spatial indexing (e.g., KD-tree or ball tree); O(n^2) without |
| Key parameters | epsilon (neighborhood radius), minPts (minimum points to form a dense region) |
| Cluster shape | Arbitrary shape |
| Requires | Parameters epsilon and minPts (does not require k) |
| Strengths | Detects arbitrary-shaped clusters; identifies noise/outliers; does not need k |
| Weaknesses | Struggles with clusters of varying density; sensitive to epsilon and minPts; poor performance in very high dimensions |
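A short scikit-learn sketch; the parameter values are illustrative and would normally be tuned, for example with a k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: a shape K-means cannot separate.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```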
HDBSCAN (Hierarchical DBSCAN), introduced by Campello, Moulavi, and Sander in 2013, extends DBSCAN by converting it into a hierarchical algorithm. This eliminates the need to specify the epsilon parameter. HDBSCAN builds a hierarchy of clusters at different density levels using a mutual reachability distance, then extracts the most persistent clusters by selecting those with the greatest stability. The algorithm is generally more robust and requires less parameter tuning than standard DBSCAN, with only the min_cluster_size parameter needed in most cases.
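A sketch using the HDBSCAN implementation available in recent scikit-learn versions (1.3 and later); the standalone hdbscan package offers a nearly identical interface. The parameter value is illustrative.

```python
from sklearn.cluster import HDBSCAN   # or: from hdbscan import HDBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# min_cluster_size is typically the only parameter that needs tuning.
hdb = HDBSCAN(min_cluster_size=15).fit(X)
labels = hdb.labels_          # -1 marks points classified as noise
probs = hdb.probabilities_    # strength of each point's cluster membership
```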
OPTICS (Ordering Points To Identify the Clustering Structure), proposed by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander in 1999, creates an ordering of the data points that reflects their density-based clustering structure. It produces a reachability plot from which clusters at different density thresholds can be extracted. OPTICS can be viewed as a generalization of DBSCAN that handles clusters of varying density. Unlike DBSCAN, OPTICS does not produce a single partitioning but rather a density-based ordering from which multiple flat clusterings can be derived.
Mean Shift is a non-parametric, kernel-based clustering algorithm that does not require specifying the number of clusters in advance. The algorithm works by iteratively shifting each data point toward the mode (peak) of the local density estimate. At each step, a kernel function (typically Gaussian) is placed over a data point, and the point is moved to the weighted mean of the surrounding points within the kernel bandwidth. This process repeats until convergence, at which point data points that converge to the same mode are assigned to the same cluster.
Mean Shift is particularly effective for image segmentation and object tracking in computer vision. Its main limitation is the sensitivity to the bandwidth parameter. If the bandwidth is too small, the algorithm may produce too many clusters; if too large, distinct clusters may be merged.
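A scikit-learn sketch; `estimate_bandwidth` provides a data-driven starting point for the bandwidth, and the `quantile` value here is an illustrative choice.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=7)

# Estimate a reasonable kernel bandwidth from the data itself.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

labels = ms.labels_
n_clusters = len(ms.cluster_centers_)   # discovered automatically, not specified
```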
Hierarchical clustering algorithms build a tree-like structure, known as a dendrogram, to represent the nested groupings of data points. These algorithms can be either agglomerative (bottom-up) or divisive (top-down).
Agglomerative hierarchical clustering starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until only one cluster remains (or until a stopping criterion is met).
The key choice in agglomerative clustering is the linkage criterion, which determines how the distance between two clusters is measured:
| Linkage method | Distance between clusters | Characteristics |
|---|---|---|
| Single linkage | Minimum distance between any pair of points from the two clusters | Tends to produce elongated, chain-like clusters; sensitive to noise |
| Complete linkage | Maximum distance between any pair of points from the two clusters | Produces compact, roughly equal-sized clusters; sensitive to outliers |
| Average linkage | Average distance between all pairs of points from the two clusters | Balances between single and complete linkage |
| Ward's method | Increase in total within-cluster variance after merging | Tends to produce compact, spherical clusters; most commonly used |
| Centroid linkage | Distance between the centroids of the two clusters | Simple but can produce inversions in the dendrogram |
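A SciPy sketch of agglomerative clustering with Ward linkage; the linkage method and the cut level are illustrative choices.

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Build the full merge tree; "ward" can be swapped for "single",
# "complete", "average", or "centroid".
Z = linkage(pdist(X), method="ward")

# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")

# dendrogram(Z) can be plotted with matplotlib to inspect the hierarchy visually.
```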
Divisive hierarchical clustering begins with all data points in a single cluster and recursively splits the clusters until each data point forms its own cluster. Divisive methods are less common in practice because they are computationally more expensive than agglomerative methods. The DIANA (Divisive Analysis) algorithm is a well-known implementation.
| Property | Description |
|---|---|
| Time complexity | O(n^3) for naive agglomerative clustering; O(n^2 log n) with efficient data structures |
| Space complexity | O(n^2) for the distance matrix |
| Cluster shape | Depends on linkage method |
| Requires | A distance metric and linkage criterion; the number of clusters can be chosen after inspecting the dendrogram |
| Strengths | Produces a full hierarchy; does not require specifying k in advance; provides a dendrogram for visual inspection |
| Weaknesses | Computationally expensive for large datasets; merges/splits are irreversible |
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), proposed by Zhang, Ramakrishnan, and Livny in 1996, is designed for efficiently clustering very large datasets. BIRCH works by building a compact summary of the data called a Clustering Feature Tree (CF Tree). Each node in the CF Tree stores a Clustering Feature vector that summarizes the statistics of a subcluster (number of points, linear sum, and sum of squares). This allows BIRCH to compress the data incrementally with a single scan of the dataset. After the CF Tree is built, a global clustering algorithm (such as agglomerative clustering) is applied to the leaf entries. BIRCH is especially useful when the dataset is too large to fit in main memory.
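A scikit-learn sketch; `threshold` and `branching_factor` control the granularity of the CF Tree, and the values below are illustrative.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50_000, centers=6, random_state=3)

# threshold: maximum radius of a subcluster absorbed into a leaf entry.
# n_clusters: the global clustering step applied to the leaf entries.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=6)
labels = birch.fit_predict(X)

# birch.partial_fit(chunk) allows the CF Tree to be built incrementally from chunks of data.
```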
Distribution-based clustering assumes that data is generated from a mixture of probability distributions and aims to identify the parameters of these distributions.
Gaussian Mixture Models (GMM) assume that the data is generated from a mixture of k multivariate Gaussian (normal) distributions, each with its own mean vector and covariance matrix. The model is trained using the Expectation-Maximization (EM) algorithm.
Algorithm (EM for GMM):
1. Initialize the means, covariance matrices, and mixing weights of the k components (for example, from a K-means run).
2. E-step: for each data point, compute the responsibility (posterior probability) of each component given the current parameters.
3. M-step: update each component's mean, covariance matrix, and mixing weight using the responsibility-weighted statistics.
4. Repeat the E-step and M-step until the log-likelihood converges.
Unlike K-means, which makes hard assignments (each point belongs to exactly one cluster), GMM provides soft assignments (each point has a probability of belonging to each cluster). This makes GMM more flexible for data where clusters overlap.
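A scikit-learn sketch contrasting hard and soft assignments; the covariance type and number of components are illustrative choices.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=2.0, random_state=5)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=5)
gmm.fit(X)

hard_labels = gmm.predict(X)         # most likely component per point
soft_labels = gmm.predict_proba(X)   # (n_samples, 3) matrix of membership probabilities
```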
| Property | Description |
|---|---|
| Time complexity | O(n * k * d^2 * i), where n = samples, k = components, d = dimensions, i = iterations |
| Cluster shape | Elliptical (defined by covariance matrices) |
| Requires | Number of components k (can be selected using BIC or AIC) |
| Strengths | Soft cluster assignments; can model elliptical clusters; probabilistic framework |
| Weaknesses | Sensitive to initialization; assumes Gaussian distributions; computationally more expensive than K-means |
Grid-based clustering algorithms partition the data space into a finite number of grid cells and then perform clustering on the grid structure. These algorithms are typically faster than other clustering methods since they work with the grid cells rather than individual data points. The STING (STatistical INformation Grid) algorithm, proposed by Wang, Yang, and Muntz in 1997, divides the space into rectangular cells arranged in a hierarchical structure. Each cell stores summary statistics, allowing queries to be answered at different resolutions. CLIQUE (Clustering In QUEst) combines grid-based and density-based ideas, and is designed for high-dimensional data by identifying dense cells in axis-aligned subspaces.
Spectral clustering uses the eigenvalues (spectrum) of a similarity matrix to perform dimension reduction before applying a standard clustering algorithm (typically K-means) in the reduced space. The algorithm constructs a similarity graph (e.g., a k-nearest-neighbor graph or an epsilon-neighborhood graph), computes the graph Laplacian matrix, and finds its smallest eigenvalues. The corresponding eigenvectors form a low-dimensional representation that preserves the connectivity structure of the data.
Spectral clustering can find non-convex clusters that K-means alone would miss because the similarity graph captures the connectivity structure of the data rather than just the pairwise distances. It is widely used for image segmentation, community detection in social networks, and any setting where the data has a natural graph structure. The main drawbacks are the O(n^3) computational cost of eigendecomposition and the need to construct and store the n x n similarity matrix.
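A scikit-learn sketch on a dataset with non-convex clusters; the affinity graph construction and neighbor count are illustrative choices.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Concentric circles: connectivity matters more than centroid distance.
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",   # build a k-nearest-neighbor similarity graph
    n_neighbors=10,
    assign_labels="kmeans",         # run K-means on the spectral embedding
    random_state=0,
)
labels = sc.fit_predict(X)
```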
The choice of distance or similarity metric fundamentally affects how clusters are formed. Different metrics capture different notions of proximity and are suited to different types of data.
| Metric | Formula | Best used when | Notes |
|---|---|---|---|
| Euclidean | sqrt(sum((x_i - y_i)^2)) | Features are continuous and on similar scales | Most common; sensitive to scale differences |
| Manhattan (L1) | sum(abs(x_i - y_i)) | High-dimensional data, or when large differences in a single feature should not dominate | Also called city-block distance; less sensitive to outliers than Euclidean |
| Cosine | 1 - (x . y) / (norm(x) * norm(y)) | Text data and other sparse, high-dimensional vectors | Measures orientation rather than magnitude; insensitive to vector length |
| Mahalanobis | sqrt((x - y)^T * S^{-1} * (x - y)) | Features are correlated and have different variances | Accounts for feature correlations; requires computing the covariance matrix inverse |
| Minkowski (Lp) | (sum(abs(x_i - y_i)^p))^{1/p} | A tunable generalization; p = 1 gives Manhattan, p = 2 gives Euclidean | Larger p puts more weight on the largest coordinate differences |
| Jaccard | 1 - size(A intersect B) / size(A union B) | Binary or set-valued data (e.g., purchased items, tags) | Compares set overlap; ignores features absent from both sets |
| Hamming | Number of positions where symbols differ | Categorical or binary data | Counts mismatches between feature vectors |
Selecting the right metric is critical because it directly shapes which data points are considered neighbors and, consequently, which clusters are formed. For example, using Euclidean distance on text data with sparse, high-dimensional term-frequency vectors often performs poorly, whereas cosine similarity naturally handles the varying document lengths in that setting.
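A short sketch computing pairwise distances under different metrics with SciPy; the same points can end up with very different nearest neighbors depending on the metric chosen.

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 0.0, 1.0]])

d_euclidean = cdist(X, X, metric="euclidean")
d_manhattan = cdist(X, X, metric="cityblock")
d_cosine    = cdist(X, X, metric="cosine")     # 1 - cosine similarity
d_minkowski = cdist(X, X, metric="minkowski", p=3)
```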
Many clustering algorithms require specifying the number of clusters k in advance. Several methods exist for estimating the optimal k:
| Method | How it works |
|---|---|
| Elbow method | Plot the within-cluster sum of squares (inertia) as a function of k. Look for the "elbow" point where adding more clusters yields diminishing returns. Simple but sometimes ambiguous when the elbow is not sharp. |
| Silhouette analysis | Compute the average silhouette score for different values of k. Select the k that maximizes the silhouette score. Provides a per-point quality measure that is easy to interpret. |
| Gap statistic | Proposed by Tibshirani, Walther, and Hastie (2001). Compares the within-cluster dispersion to that expected under a null reference distribution generated by uniform sampling. The optimal k maximizes the gap between the observed and reference dispersions. |
| BIC / AIC | For model-based approaches like GMM, Bayesian Information Criterion or Akaike Information Criterion can be used to balance model fit against complexity. BIC penalizes complexity more strongly than AIC and often selects simpler models. |
| Dendrogram inspection | For hierarchical clustering, inspect the dendrogram and cut at a level that produces meaningful clusters. Large jumps in the merge distance typically indicate a natural number of clusters. |
| Information-theoretic approaches | Methods such as the X-means algorithm automatically split clusters and use BIC to decide whether each split improves the model, effectively searching over k during execution. |
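A sketch combining the elbow method and silhouette analysis with scikit-learn; the candidate range of k is an illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # plot against k for the elbow method
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)      # k with the highest silhouette score
```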
Evaluating the quality of clustering results is challenging because there is no ground truth in unsupervised settings. Evaluation metrics fall into two categories: internal metrics (which use only the data and cluster assignments) and external metrics (which compare cluster assignments to known labels).
| Metric | Formula / key idea | Interpretation |
|---|---|---|
| Silhouette score | For each point: s = (b - a) / max(a, b), where a = mean intra-cluster distance, b = mean nearest-cluster distance | Ranges from -1 to 1. Higher is better. Values near 0 indicate overlapping clusters. Negative values indicate misassigned points. |
| Davies-Bouldin Index (DBI) | Average over all clusters of the maximum similarity ratio R_ij = (S_i + S_j) / M_ij, where S is intra-cluster dispersion and M is between-cluster separation | Lower is better. A value of 0 indicates perfectly separated clusters. |
| Calinski-Harabasz Index | Ratio of between-cluster dispersion to within-cluster dispersion, adjusted for the number of clusters and data points | Higher is better. Favors compact, well-separated clusters. Also known as the Variance Ratio Criterion. |
| Inertia (WCSS) | Sum of squared distances from each point to its cluster centroid | Lower is better, but always decreases with more clusters, so must be used with the elbow method. |
| Dunn Index | Ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter | Higher is better. Sensitive to noise and outliers. |
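The most common internal metrics are available directly in scikit-learn; a brief sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:        ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))  # higher is better
```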
When ground-truth labels are available (e.g., for evaluation on benchmark datasets), external metrics measure how well the clustering matches the true structure:
| Metric | Description | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | Measures the similarity between two clusterings, adjusted for chance. Counts pairs of points that are either in the same cluster or in different clusters in both partitions. | -1 to 1 (1 = perfect, 0 = random) |
| Normalized Mutual Information (NMI) | Measures the mutual information between two clusterings, normalized to account for the entropy of each partition | 0 to 1 (1 = perfect) |
| Fowlkes-Mallows Index | Geometric mean of pairwise precision and recall | 0 to 1 (1 = perfect) |
| Homogeneity | Measures whether each cluster contains only members of a single class | 0 to 1 (1 = perfect) |
| Completeness | Measures whether all members of a given class are assigned to the same cluster | 0 to 1 (1 = perfect) |
| V-Measure | Harmonic mean of homogeneity and completeness | 0 to 1 (1 = perfect) |
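When reference labels are available, the external metrics above can be computed directly with scikit-learn; note that they are invariant to how the cluster labels are named.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    fowlkes_mallows_score,
    homogeneity_completeness_v_measure,
)

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same grouping, different label names

print(adjusted_rand_score(true_labels, pred_labels))            # 1.0
print(normalized_mutual_info_score(true_labels, pred_labels))   # 1.0
print(fowlkes_mallows_score(true_labels, pred_labels))          # 1.0
print(homogeneity_completeness_v_measure(true_labels, pred_labels))
```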
In many practical applications, data points are represented by high-dimensional feature vectors, which can make clustering more challenging due to the curse of dimensionality. As the number of dimensions grows, the distances between all pairs of points tend to converge, meaning that the notion of "nearest neighbor" becomes less meaningful. Data also becomes increasingly sparse in high-dimensional spaces, making it harder for density-based methods to identify clusters.
To address these challenges, several strategies are used:
- Dimensionality reduction: apply PCA, t-SNE, or UMAP to project the data into a lower-dimensional space before clustering (see the sketch after this list).
- Feature selection: keep only the most informative features, discarding noisy or redundant dimensions.
- Subspace clustering: algorithms such as CLIQUE search for clusters in axis-aligned subspaces of the full feature space.
- Alternative metrics: cosine similarity or correlation-based distances often behave better than Euclidean distance in high dimensions.
- Representation learning: deep clustering methods (discussed below) learn low-dimensional embeddings tailored to the clustering objective.
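A sketch of the first strategy, reducing dimensionality with PCA before clustering; the number of components kept is an illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # 64-dimensional image features

# Standardize, project onto the top principal components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```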
Deep clustering methods combine deep neural networks with clustering objectives to jointly learn feature representations and cluster assignments. Traditional clustering algorithms operate on fixed, hand-crafted features, which may not capture the complex, nonlinear structure of high-dimensional data such as images or text. Deep clustering addresses this limitation by learning representations that are optimized for the clustering task.
Deep Embedded Clustering, proposed by Xie, Girshick, and Farhadi in 2016, uses an autoencoder to learn a low-dimensional embedding of the data, then iteratively refines both the embedding and the cluster assignments. The method proceeds in two stages. First, the autoencoder is pretrained to reconstruct the input data, learning a compressed representation. Second, K-means is applied to the learned embeddings to obtain initial cluster centroids. The model then defines a target distribution based on the current soft assignments and optimizes a KL divergence loss to sharpen the assignments and push the embeddings toward the cluster centers. Experiments on image and text datasets showed significant improvement over methods that cluster on fixed features.
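The refinement step can be illustrated with the soft assignment and sharpened target distribution described in the paper; the sketch below computes both in NumPy for a fixed embedding `Z` and centroids `mu` (in the actual method, these are updated by gradient descent on the KL divergence rather than held fixed).

```python
import numpy as np

def dec_soft_assignments(Z, mu, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings Z and cluster centers mu."""
    d2 = np.sum((Z[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, where f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# Toy example: 100 embeddings in 10 dimensions, 3 cluster centers.
Z = np.random.randn(100, 10)
mu = np.random.randn(3, 10)
q = dec_soft_assignments(Z, mu)
p = dec_target_distribution(q)
kl_loss = np.sum(p * np.log(p / q))   # the KL divergence DEC minimizes w.r.t. Z and mu
```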
SCAN (Semantic Clustering by Adopting Nearest neighbors), proposed by Van Gansbeke et al. at ECCV 2020, decouples representation learning from clustering. First, a self-supervised pretext task (such as contrastive learning with SimCLR) is used to learn semantically meaningful features. Then, a clustering head is trained to classify each image together with its nearest neighbors in the learned feature space, using a novel loss function that encourages consistent assignments. A self-labeling refinement step further improves accuracy. SCAN achieved large improvements on CIFAR-10, CIFAR-100, and STL-10 benchmarks.
Contrastive clustering methods leverage contrastive learning to train representations that are friendly to clustering. The core idea is to create augmented views of each data point and train the network so that embeddings of different augmentations of the same point are pulled together (positive pairs) while embeddings of different points are pushed apart (negative pairs). Clustering is then performed on the resulting representation space, or a clustering head is trained jointly with the contrastive objective. Variants include approaches that integrate graph convolutional networks with contrastive learning for multi-scale structure discovery, as well as multi-view contrastive methods that handle data with multiple feature sets or modalities.
The following table provides a summary comparison of the major clustering algorithms:
| Algorithm | Cluster shape | Handles noise | Requires k | Scalability | Soft assignments |
|---|---|---|---|---|---|
| K-means | Spherical | No | Yes | Excellent | No |
| K-medoids | Spherical | Partially | Yes | Moderate | No |
| DBSCAN | Arbitrary | Yes | No | Good (with indexing) | No |
| HDBSCAN | Arbitrary | Yes | No | Good | Soft (via probabilities) |
| Agglomerative | Varies by linkage | No | Optional (cut dendrogram) | Poor for large datasets | No |
| BIRCH | Spherical | Partially | Optional | Excellent (single scan) | No |
| Mean Shift | Arbitrary | Partially | No | Poor for large datasets | No |
| GMM | Elliptical | No | Yes | Moderate | Yes |
| Spectral | Arbitrary | No | Yes | Poor for large datasets | No |
| OPTICS | Arbitrary | Yes | No | Moderate | No |
Clustering large-scale datasets presents several practical challenges. The time and memory requirements of many algorithms grow quadratically or cubically with the number of data points, making them impractical for datasets with millions or billions of records.
Scalable algorithms and techniques:
- Mini-Batch K-means: processes small random batches, allowing streaming or out-of-core training (a streaming sketch follows this list).
- BIRCH: compresses the data into a CF Tree in a single scan before global clustering.
- Sampling: cluster a representative sample, then assign the remaining points to the nearest resulting cluster.
- Spatial indexing and approximate nearest neighbors: KD-trees, ball trees, and approximate indexes accelerate the neighborhood queries used by density-based methods.
- Coresets: small weighted subsets of the data that approximately preserve the clustering cost.
- Distributed computing: frameworks such as Apache Spark MLlib provide parallel implementations of K-means and Gaussian mixture models.
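As an example of the streaming approach, scikit-learn's MiniBatchKMeans can be updated one chunk at a time with `partial_fit`, so the full dataset never needs to be in memory at once; the chunk sizes here are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=8, random_state=0)

# Simulate a stream: update the model with one chunk of data at a time.
for _ in range(100):
    chunk = np.random.randn(1000, 16)   # stand-in for data read from disk or a stream
    mbk.partial_fit(chunk)

# New points can be assigned to the learned clusters as they arrive.
new_labels = mbk.predict(np.random.randn(5, 16))
```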
Choosing the right approach depends on the dataset size, available hardware, and whether the data arrives in a stream or batch.
Clustering has a wide range of applications across many industries and scientific fields:
| Domain | Application | Common algorithms |
|---|---|---|
| Marketing | Customer segmentation, market basket analysis | K-means, GMM |
| Biology | Gene expression analysis, species taxonomy, single-cell RNA-seq | Hierarchical, UMAP + clustering, HDBSCAN |
| Computer vision | Image segmentation, object detection, image compression | K-means, spectral clustering, Mean Shift |
| Natural language processing | Document clustering, topic modeling | K-means, hierarchical |
| Cybersecurity | Anomaly detection, intrusion detection | DBSCAN, Isolation Forest |
| Medicine | Patient subtype identification, medical image analysis | GMM, hierarchical |
| Social network analysis | Community detection, influence propagation | Spectral clustering, label propagation |
| Geospatial analysis | Spatial clustering of events, hotspot detection | DBSCAN, HDBSCAN |
| Astronomy | Star classification, galaxy morphology | K-means, GMM, DBSCAN |
| Recommender systems | User/item grouping for collaborative filtering | K-means, GMM |
| Autonomous driving | LiDAR point cloud segmentation | DBSCAN, Euclidean clustering |
| Finance | Fraud detection, portfolio grouping | K-means, DBSCAN, GMM |
Imagine you have a big box of differently shaped and colored toys. Clustering in machine learning is like sorting these toys into groups based on their similarities, like putting all the red cars together or all the round balls together. This helps us understand and organize the toys better. In the same way, clustering helps computers make sense of lots of information by grouping similar pieces of data together. The computer looks at each piece of data, figures out which other pieces it is most similar to, and puts them in the same group. Nobody tells the computer how to sort the toys ahead of time; it figures out the groups on its own by noticing which toys look alike.