See also: Machine learning terms
Clustering is a fundamental technique in machine learning and data mining that involves grouping similar data points or objects into clusters based on some form of similarity measure or distance metric. The goal of clustering is to identify underlying patterns or structures in data, enabling efficient data representation, classification, and interpretation. Clustering is an unsupervised learning method, meaning that it does not rely on pre-labeled data for training but rather discovers inherent relationships within the data itself.
Clustering is one of the most widely used techniques in data science and has applications spanning customer segmentation, image analysis, anomaly detection, document organization, and biological taxonomy. The choice of clustering algorithm and its configuration depend on the characteristics of the data, the expected cluster shapes, and the specific goals of the analysis.
Formally, given a dataset X = {x_1, x_2, ..., x_n} of n data points in a d-dimensional feature space, clustering seeks to partition X into k groups C = {C_1, C_2, ..., C_k} such that data points within the same cluster are more similar to each other than to data points in other clusters. The definition of "similar" depends on the distance or similarity metric chosen, and different metrics can produce very different clusterings of the same data.
The idea of grouping observations by similarity has roots stretching back to the early twentieth century, when taxonomists and statisticians developed manual classification methods. In 1956, Hugo Steinhaus published one of the earliest formal descriptions of partitioning a set of points by minimizing within-group distances. The following year, Stuart P. Lloyd devised the iterative algorithm now known as Lloyd's algorithm while working at Bell Laboratories on pulse-code modulation. Although Lloyd's internal technical memorandum circulated within Bell Labs for decades, it was not formally published until 1982. Independently, E. W. Forgy described a nearly identical procedure in 1965, and James MacQueen coined the term "k-means" in his 1967 paper presented at the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
From the 1960s through the 1980s, hierarchical clustering methods and linkage criteria were refined by researchers including Joe Ward (Ward's method, 1963) and Peter Rousseeuw (silhouette analysis, 1987). The 1990s saw the introduction of density-based and large-scale approaches, most notably DBSCAN (Ester et al., 1996) and BIRCH (Zhang, Ramakrishnan, and Livny, 1996), which addressed limitations of centroid-based methods for datasets with irregular cluster shapes, varying densities, or sizes too large to fit in memory. Spectral clustering gained prominence in the early 2000s and was later consolidated in the widely cited tutorial by Ulrike von Luxburg (2007). More recently, deep learning methods such as Deep Embedded Clustering (Xie et al., 2016) and contrastive clustering approaches have pushed the field toward learning representations and cluster assignments jointly.
There are several types of clustering algorithms, each with its own strengths and weaknesses. The choice of an appropriate algorithm depends on the specific problem being addressed, the data properties, and the desired outcome.
| Type | Key idea | Example algorithms | Cluster shape |
|---|---|---|---|
| Centroid-based | Clusters are represented by a central point (centroid) | K-means, K-medoids, Mini-Batch K-means | Spherical/convex |
| Density-based | Clusters are regions of high density separated by regions of low density | DBSCAN, OPTICS, HDBSCAN, Mean Shift | Arbitrary |
| Hierarchical | Clusters are organized in a tree-like structure (dendrogram) | Agglomerative, Divisive, BIRCH | Arbitrary |
| Distribution-based | Clusters are modeled as probability distributions | Gaussian Mixture Models (GMM), Bayesian GMM | Elliptical |
| Grid-based | Data space is divided into a grid, and clustering is performed on the grid cells | STING, CLIQUE | Varies |
| Spectral | Uses eigenvalues of a similarity matrix to reduce dimensions before clustering | Spectral clustering | Arbitrary |
Centroid-based (also called partition-based) clustering algorithms divide the dataset into non-overlapping subsets or clusters, where each data point belongs to exactly one cluster. These algorithms represent each cluster by a central point and assign data points to the cluster with the nearest center.
K-means is the most widely known and commonly used clustering algorithm. It partitions the data into k clusters by iteratively refining cluster centroids to minimize the within-cluster sum of squared distances (also known as inertia).
Algorithm:
1. Choose the number of clusters k and initialize k centroids (for example, randomly or with k-means++).
2. Assign each data point to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change or a maximum number of iterations is reached.
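The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation; the function name `kmeans` and the random initialization are choices made here for clarity, and a library implementation such as scikit-learn's `KMeans` would normally be preferred.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```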
K-means++ initialization, proposed by David Arthur and Sergei Vassilvitskii in 2007, selects initial centroids that are spread out across the data, which leads to faster convergence and better final results compared to random initialization. The algorithm picks the first centroid uniformly at random, then chooses each subsequent centroid with probability proportional to the squared distance from the nearest existing centroid.
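A minimal sketch of the k-means++ seeding step, assuming the data is a NumPy array `X`; the helper name `kmeans_pp_init` is illustrative.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids: the first uniformly at random, each subsequent one
    with probability proportional to squared distance from the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest existing centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```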
Mini-Batch K-means is a variant that uses small random batches of data in each iteration rather than the full dataset, substantially reducing computation time while typically producing results close to those of standard K-means. It is particularly useful for very large datasets that do not fit into memory.
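For reference, a short usage sketch with scikit-learn (assuming it is installed); the parameter values are illustrative.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

# batch_size controls how many samples are used per update step.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42)
labels = mbk.fit_predict(X)
```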
| Property | Description |
|---|---|
| Time complexity | O(n * k * d * i), where n = samples, k = clusters, d = dimensions, i = iterations |
| Space complexity | O(n * d + k * d) |
| Cluster shape | Spherical, roughly equal-sized clusters |
| Requires | Number of clusters k specified in advance |
| Strengths | Simple, fast, scalable to large datasets |
| Weaknesses | Sensitive to initialization and outliers; assumes spherical clusters; requires choosing k |
K-medoids is similar to K-means but uses actual data points (medoids) as cluster centers instead of computed means. This makes it more robust to outliers and noise because the median-like behavior of medoids is less influenced by extreme values. The Partitioning Around Medoids (PAM) algorithm is the most common implementation. In each iteration, PAM evaluates whether swapping a medoid with a non-medoid point reduces the total dissimilarity. K-medoids is computationally more expensive than K-means, with a time complexity of O(k * (n - k)^2 * i), but it can work with any distance metric, not just Euclidean distance.
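A brief sketch using the `KMedoids` estimator from the scikit-learn-extra package (an assumption: that package must be installed separately); note that any supported distance metric can be used, not just Euclidean.

```python
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# method="pam" selects the Partitioning Around Medoids swap procedure.
km = KMedoids(n_clusters=3, metric="manhattan", method="pam", random_state=0)
labels = km.fit_predict(X)
medoid_points = km.cluster_centers_   # actual data points chosen as medoids
```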
Density-based clustering algorithms identify clusters by grouping data points that are closely packed together based on a density metric. These algorithms can find arbitrarily shaped clusters and are robust to noise in the data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996, groups together data points that are closely packed in a high-density region while marking low-density regions as noise.
Core concepts:
- Epsilon (eps) neighborhood: all points within distance epsilon of a given point.
- Core point: a point with at least minPts points (including itself) in its epsilon-neighborhood.
- Border point: a point that is not a core point but lies within the epsilon-neighborhood of a core point.
- Noise point: a point that is neither a core point nor a border point.
- Density-reachability: a point is density-reachable from a core point if it can be reached through a chain of core points whose epsilon-neighborhoods overlap.

Algorithm:
1. For each unvisited point, retrieve its epsilon-neighborhood.
2. If the point is a core point, start a new cluster and add all points density-reachable from it.
3. If the point is not a core point and is not reachable from any core point, label it as noise.
4. Repeat until every point has been visited.
| Property | Description |
|---|---|
| Time complexity | O(n log n) with spatial indexing (e.g., KD-tree or ball tree); O(n^2) without |
| Key parameters | epsilon (neighborhood radius), minPts (minimum points to form a dense region) |
| Cluster shape | Arbitrary shape |
| Requires | Parameters epsilon and minPts (does not require k) |
| Strengths | Detects arbitrary-shaped clusters; identifies noise/outliers; does not need k |
| Weaknesses | Struggles with clusters of varying density; sensitive to epsilon and minPts; poor performance in very high dimensions |
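A short scikit-learn sketch; the parameter values are illustrative and would normally be tuned, for example with a k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: a shape K-means cannot separate.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
```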
HDBSCAN (Hierarchical DBSCAN), introduced by Campello, Moulavi, and Sander in 2013, extends DBSCAN by converting it into a hierarchical algorithm. This eliminates the need to specify the epsilon parameter. HDBSCAN builds a hierarchy of clusters at different density levels using a mutual reachability distance, then extracts the most persistent clusters by selecting those with the greatest stability. The algorithm is generally more robust and requires less parameter tuning than standard DBSCAN, with only the min_cluster_size parameter needed in most cases.
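A sketch using the HDBSCAN implementation available in recent scikit-learn versions (1.3 and later); the standalone hdbscan package offers a nearly identical interface. The parameter value is illustrative.

```python
from sklearn.cluster import HDBSCAN   # or: from hdbscan import HDBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# min_cluster_size is typically the only parameter that needs tuning.
hdb = HDBSCAN(min_cluster_size=15).fit(X)
labels = hdb.labels_          # -1 marks points classified as noise
probs = hdb.probabilities_    # strength of each point's cluster membership
```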
OPTICS (Ordering Points To Identify the Clustering Structure), proposed by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander in 1999, creates an ordering of the data points that reflects their density-based clustering structure. It produces a reachability plot from which clusters at different density thresholds can be extracted. OPTICS can be viewed as a generalization of DBSCAN that handles clusters of varying density. Unlike DBSCAN, OPTICS does not produce a single partitioning but rather a density-based ordering from which multiple flat clusterings can be derived.
Mean Shift is a non-parametric, kernel-based clustering algorithm that does not require specifying the number of clusters in advance. The algorithm works by iteratively shifting each data point toward the mode (peak) of the local density estimate. At each step, a kernel function (typically Gaussian) is placed over a data point, and the point is moved to the weighted mean of the surrounding points within the kernel bandwidth. This process repeats until convergence, at which point data points that converge to the same mode are assigned to the same cluster.
Mean Shift is particularly effective for image segmentation and object tracking in computer vision. Its main limitation is the sensitivity to the bandwidth parameter. If the bandwidth is too small, the algorithm may produce too many clusters; if too large, distinct clusters may be merged.
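A scikit-learn sketch; `estimate_bandwidth` provides a data-driven starting point for the bandwidth, and the `quantile` value here is an illustrative choice.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=7)

# Estimate a reasonable kernel bandwidth from the data itself.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)

labels = ms.labels_
n_clusters = len(ms.cluster_centers_)   # discovered automatically, not specified
```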
Hierarchical clustering algorithms build a tree-like structure, known as a dendrogram, to represent the nested groupings of data points. These algorithms can be either agglomerative (bottom-up) or divisive (top-down).
Agglomerative hierarchical clustering starts with each data point as its own cluster and iteratively merges the closest pairs of clusters until only one cluster remains (or until a stopping criterion is met).
The key choice in agglomerative clustering is the linkage criterion, which determines how the distance between two clusters is measured:
| Linkage method | Distance between clusters | Characteristics |
|---|---|---|
| Single linkage | Minimum distance between any pair of points from the two clusters | Tends to produce elongated, chain-like clusters; sensitive to noise |
| Complete linkage | Maximum distance between any pair of points from the two clusters | Produces compact, roughly equal-sized clusters; sensitive to outliers |
| Average linkage | Average distance between all pairs of points from the two clusters | Balances between single and complete linkage |
| Ward's method | Increase in total within-cluster variance after merging | Tends to produce compact, spherical clusters; most commonly used |
| Centroid linkage | Distance between the centroids of the two clusters | Simple but can produce inversions in the dendrogram |
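A SciPy sketch of agglomerative clustering with Ward linkage; the linkage method and the cut level are illustrative choices.

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Build the full merge tree; "ward" can be swapped for "single",
# "complete", "average", or "centroid".
Z = linkage(pdist(X), method="ward")

# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")

# dendrogram(Z) can be plotted with matplotlib to inspect the hierarchy visually.
```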
Divisive hierarchical clustering begins with all data points in a single cluster and recursively splits the clusters until each data point forms its own cluster. Divisive methods are less common in practice because they are computationally more expensive than agglomerative methods. The DIANA (Divisive Analysis) algorithm is a well-known implementation.
| Property | Description |
|---|---|
| Time complexity | O(n^3) for naive agglomerative clustering; O(n^2 log n) with efficient data structures |
| Space complexity | O(n^2) for the distance matrix |
| Cluster shape | Depends on linkage method |
| Requires | A distance metric and linkage criterion; the number of clusters can be chosen after inspecting the dendrogram |
| Strengths | Produces a full hierarchy; does not require specifying k in advance; provides a dendrogram for visual inspection |
| Weaknesses | Computationally expensive for large datasets; merges/splits are irreversible |
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), proposed by Zhang, Ramakrishnan, and Livny in 1996, is designed for efficiently clustering very large datasets. BIRCH works by building a compact summary of the data called a Clustering Feature Tree (CF Tree). Each node in the CF Tree stores a Clustering Feature vector that summarizes the statistics of a subcluster (number of points, linear sum, and sum of squares). This allows BIRCH to compress the data incrementally with a single scan of the dataset. After the CF Tree is built, a global clustering algorithm (such as agglomerative clustering) is applied to the leaf entries. BIRCH is especially useful when the dataset is too large to fit in main memory.
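A scikit-learn sketch; `threshold` and `branching_factor` control the granularity of the CF Tree, and the values below are illustrative.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50_000, centers=6, random_state=3)

# threshold: maximum radius of a subcluster absorbed into a leaf entry.
# n_clusters: the global clustering step applied to the leaf entries.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=6)
labels = birch.fit_predict(X)

# birch.partial_fit(chunk) allows the CF Tree to be built incrementally from chunks of data.
```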
Distribution-based clustering assumes that data is generated from a mixture of probability distributions and aims to identify the parameters of these distributions.
Gaussian Mixture Models (GMM) assume that the data is generated from a mixture of k multivariate Gaussian (normal) distributions, each with its own mean vector and covariance matrix. The model is trained using the Expectation-Maximization (EM) algorithm.
Algorithm (EM for GMM):
1. Initialize the means, covariance matrices, and mixing weights of the k components (for example, from a K-means run).
2. E-step: for each data point, compute the responsibility (posterior probability) of each component given the current parameters.
3. M-step: update each component's mean, covariance matrix, and mixing weight using the responsibility-weighted statistics.
4. Repeat the E-step and M-step until the log-likelihood converges.
Unlike K-means, which makes hard assignments (each point belongs to exactly one cluster), GMM provides soft assignments (each point has a probability of belonging to each cluster). This makes GMM more flexible for data where clusters overlap.
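A scikit-learn sketch contrasting hard and soft assignments; the covariance type and number of components are illustrative choices.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=2.0, random_state=5)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=5)
gmm.fit(X)

hard_labels = gmm.predict(X)         # most likely component per point
soft_labels = gmm.predict_proba(X)   # (n_samples, 3) matrix of membership probabilities
```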
| Property | Description |
|---|---|
| Time complexity | O(n * k * d^2 * i), where n = samples, k = components, d = dimensions, i = iterations |
| Cluster shape | Elliptical (defined by covariance matrices) |
| Requires | Number of components k (can be selected using BIC or AIC) |
| Strengths | Soft cluster assignments; can model elliptical clusters; probabilistic framework |
| Weaknesses | Sensitive to initialization; assumes Gaussian distributions; computationally more expensive than K-means |
Grid-based clustering algorithms partition the data space into a finite number of grid cells and then perform clustering on the grid structure. These algorithms are typically faster than other clustering methods since they work with the grid cells rather than individual data points. The STING (STatistical INformation Grid) algorithm, proposed by Wang, Yang, and Muntz in 1997, divides the space into rectangular cells arranged in a hierarchical structure. Each cell stores summary statistics, allowing queries to be answered at different resolutions. CLIQUE (Clustering In QUEst) combines grid-based and density-based ideas, and is designed for high-dimensional data by identifying dense cells in axis-aligned subspaces.
Spectral clustering uses the eigenvalues (spectrum) of a similarity matrix to perform dimension reduction before applying a standard clustering algorithm (typically K-means) in the reduced space. The algorithm constructs a similarity graph (e.g., a k-nearest-neighbor graph or an epsilon-neighborhood graph), computes the graph Laplacian matrix, and finds its smallest eigenvalues. The corresponding eigenvectors form a low-dimensional representation that preserves the connectivity structure of the data.
Spectral clustering can find non-convex clusters that K-means alone would miss because the similarity graph captures the connectivity structure of the data rather than just the pairwise distances. It is widely used for image segmentation, community detection in social networks, and any setting where the data has a natural graph structure. The main drawbacks are the O(n^3) computational cost of eigendecomposition and the need to construct and store the n x n similarity matrix.
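A scikit-learn sketch on a dataset with non-convex clusters; the affinity graph construction and neighbor count are illustrative choices.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Concentric circles: connectivity matters more than centroid distance.
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

sc = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",   # build a k-nearest-neighbor similarity graph
    n_neighbors=10,
    assign_labels="kmeans",         # run K-means on the spectral embedding
    random_state=0,
)
labels = sc.fit_predict(X)
```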
The choice of distance or similarity metric fundamentally affects how clusters are formed. Different metrics capture different notions of proximity and are suited to different types of data.
| Metric | Formula | Best used when | Notes |
|---|---|---|---|
| Euclidean | sqrt(sum((x_i - y_i)^2)) | Features are continuous and on similar scales | Most common; sensitive to scale differences |
| Manhattan (L1) | sum(abs(x_i - y_i)) | High-dimensional data, or when large differences in a single feature should not dominate | Also called city-block distance; less sensitive to outliers than Euclidean |
| Cosine | 1 - (x . y) / (norm(x) * norm(y)) | Text data and other sparse, high-dimensional vectors | Measures orientation rather than magnitude; insensitive to vector length |
| Mahalanobis | sqrt((x - y)^T * S^{-1} * (x - y)) | Features are correlated and have different variances | Accounts for feature correlations; requires computing the covariance matrix inverse |
| Minkowski (Lp) | (sum(abs(x_i - y_i)^p))^{1/p} | A tunable generalization; p = 1 gives Manhattan, p = 2 gives Euclidean | Larger p puts more weight on the largest coordinate differences |
| Jaccard | 1 - size(A intersect B) / size(A union B) | Binary or set-valued data (e.g., purchased items, tags) | Compares set overlap; ignores features absent from both sets |
| Hamming | Number of positions where symbols differ | Categorical or binary data | Counts mismatches between feature vectors |
Selecting the right metric is critical because it directly shapes which data points are considered neighbors and, consequently, which clusters are formed. For example, using Euclidean distance on text data with sparse, high-dimensional term-frequency vectors often performs poorly, whereas cosine similarity naturally handles the varying document lengths in that setting.
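A short sketch computing pairwise distances under different metrics with SciPy; the same points can end up with very different nearest neighbors depending on the metric chosen.

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 0.0, 1.0]])

d_euclidean = cdist(X, X, metric="euclidean")
d_manhattan = cdist(X, X, metric="cityblock")
d_cosine    = cdist(X, X, metric="cosine")     # 1 - cosine similarity
d_minkowski = cdist(X, X, metric="minkowski", p=3)
```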
Many clustering algorithms require specifying the number of clusters k in advance. Several methods exist for estimating the optimal k:
| Method | How it works |
|---|---|
| Elbow method | Plot the within-cluster sum of squares (inertia) as a function of k. Look for the "elbow" point where adding more clusters yields diminishing returns. Simple but sometimes ambiguous when the elbow is not sharp. |
| Silhouette analysis | Compute the average silhouette score for different values of k. Select the k that maximizes the silhouette score. Provides a per-point quality measure that is easy to interpret. |
| Gap statistic | Proposed by Tibshirani, Walther, and Hastie (2001). Compares the within-cluster dispersion to that expected under a null reference distribution generated by uniform sampling. The optimal k maximizes the gap between the observed and reference dispersions. |
| BIC / AIC | For model-based approaches like GMM, Bayesian Information Criterion or Akaike Information Criterion can be used to balance model fit against complexity. BIC penalizes complexity more strongly than AIC and often selects simpler models. |
| Dendrogram inspection | For hierarchical clustering, inspect the dendrogram and cut at a level that produces meaningful clusters. Large jumps in the merge distance typically indicate a natural number of clusters. |
| Information-theoretic approaches | Methods such as the X-means algorithm automatically split clusters and use BIC to decide whether each split improves the model, effectively searching over k during execution. |
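A sketch combining the elbow method and silhouette analysis with scikit-learn; the candidate range of k is an illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # plot against k for the elbow method
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)      # k with the highest silhouette score
```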
Evaluating the quality of clustering results is challenging because there is no ground truth in unsupervised settings. Evaluation metrics fall into two categories: internal metrics (which use only the data and cluster assignments) and external metrics (which compare cluster assignments to known labels).
| Metric | Formula / key idea | Interpretation |
|---|---|---|
| Silhouette score | For each point: s = (b - a) / max(a, b), where a = mean intra-cluster distance, b = mean nearest-cluster distance | Ranges from -1 to 1. Higher is better. Values near 0 indicate overlapping clusters. Negative values indicate misassigned points. |
| Davies-Bouldin Index (DBI) | Average over all clusters of the maximum similarity ratio R_ij = (S_i + S_j) / M_ij, where S is intra-cluster dispersion and M is between-cluster separation | Lower is better. A value of 0 indicates perfectly separated clusters. |
| Calinski-Harabasz Index | Ratio of between-cluster dispersion to within-cluster dispersion, adjusted for the number of clusters and data points | Higher is better. Favors compact, well-separated clusters. Also known as the Variance Ratio Criterion. |
| Inertia (WCSS) | Sum of squared distances from each point to its cluster centroid | Lower is better, but always decreases with more clusters, so must be used with the elbow method. |
| Dunn Index | Ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter | Higher is better. Sensitive to noise and outliers. |
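The most common internal metrics are available directly in scikit-learn; a brief sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:        ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))  # higher is better
```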
When ground-truth labels are available (e.g., for evaluation on benchmark datasets), external metrics measure how well the clustering matches the true structure:
| Metric | Description | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | Measures the similarity between two clusterings, adjusted for chance. Counts pairs of points that are either in the same cluster or in different clusters in both partitions. | -1 to 1 (1 = perfect, 0 = random) |
| Normalized Mutual Information (NMI) | Measures the mutual information between two clusterings, normalized to account for the entropy of each partition | 0 to 1 (1 = perfect) |
| Fowlkes-Mallows Index | Geometric mean of pairwise precision and recall | 0 to 1 (1 = perfect) |
| Homogeneity | Measures whether each cluster contains only members of a single class | 0 to 1 (1 = perfect) |
| Completeness | Measures whether all members of a given class are assigned to the same cluster | 0 to 1 (1 = perfect) |
| V-Measure | Harmonic mean of homogeneity and completeness | 0 to 1 (1 = perfect) |
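When reference labels are available, the external metrics above can be computed directly with scikit-learn; note that they are invariant to how the cluster labels are named.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    fowlkes_mallows_score,
    homogeneity_completeness_v_measure,
)

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same grouping, different label names

print(adjusted_rand_score(true_labels, pred_labels))            # 1.0
print(normalized_mutual_info_score(true_labels, pred_labels))   # 1.0
print(fowlkes_mallows_score(true_labels, pred_labels))          # 1.0
print(homogeneity_completeness_v_measure(true_labels, pred_labels))
```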
In many practical applications, data points are represented by high-dimensional feature vectors, which can make clustering more challenging due to the curse of dimensionality. As the number of dimensions grows, the distances between all pairs of points tend to converge, meaning that the notion of "nearest neighbor" becomes less meaningful. Data also becomes increasingly sparse in high-dimensional spaces, making it harder for density-based methods to identify clusters.
To address these challenges, several strategies are used:
- Dimensionality reduction: apply PCA, t-SNE, or UMAP to project the data into a lower-dimensional space before clustering (see the sketch after this list).
- Feature selection: keep only the most informative features, discarding noisy or redundant dimensions.
- Subspace clustering: algorithms such as CLIQUE search for clusters in axis-aligned subspaces of the full feature space.
- Alternative metrics: cosine similarity or correlation-based distances often behave better than Euclidean distance in high dimensions.
- Representation learning: deep clustering methods (discussed below) learn low-dimensional embeddings tailored to the clustering objective.
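A sketch of the first strategy, reducing dimensionality with PCA before clustering; the number of components kept is an illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # 64-dimensional image features

# Standardize, project onto the top principal components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```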
Deep clustering methods combine deep neural networks with clustering objectives to jointly learn feature representations and cluster assignments. Traditional clustering algorithms operate on fixed, hand-crafted features, which may not capture the complex, nonlinear structure of high-dimensional data such as images or text. Deep clustering addresses this limitation by learning representations that are optimized for the clustering task.
Deep Embedded Clustering, proposed by Xie, Girshick, and Farhadi in 2016, uses an autoencoder to learn a low-dimensional embedding of the data, then iteratively refines both the embedding and the cluster assignments. The method proceeds in two stages. First, the autoencoder is pretrained to reconstruct the input data, learning a compressed representation. Second, K-means is applied to the learned embeddings to obtain initial cluster centroids. The model then defines a target distribution based on the current soft assignments and optimizes a KL divergence loss to sharpen the assignments and push the embeddings toward the cluster centers. Experiments on image and text datasets showed significant improvement over methods that cluster on fixed features.
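The refinement step can be illustrated with the soft assignment and sharpened target distribution described in the paper; the sketch below computes both in NumPy for a fixed embedding `Z` and centroids `mu` (in the actual method, these are updated by gradient descent on the KL divergence rather than held fixed).

```python
import numpy as np

def dec_soft_assignments(Z, mu, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings Z and cluster centers mu."""
    d2 = np.sum((Z[:, None, :] - mu[None, :, :]) ** 2, axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, where f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# Toy example: 100 embeddings in 10 dimensions, 3 cluster centers.
Z = np.random.randn(100, 10)
mu = np.random.randn(3, 10)
q = dec_soft_assignments(Z, mu)
p = dec_target_distribution(q)
kl_loss = np.sum(p * np.log(p / q))   # the KL divergence DEC minimizes w.r.t. Z and mu
```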
SCAN (Semantic Clustering by Adopting Nearest neighbors), proposed by Van Gansbeke et al. at ECCV 2020, decouples representation learning from clustering. First, a self-supervised pretext task (such as contrastive learning with SimCLR) is used to learn semantically meaningful features. Then, a clustering head is trained to classify each image together with its nearest neighbors in the learned feature space, using a novel loss function that encourages consistent assignments. A self-labeling refinement step further improves accuracy. SCAN achieved large improvements on CIFAR-10, CIFAR-100, and STL-10 benchmarks.
Contrastive clustering methods leverage contrastive learning to train representations that are friendly to clustering. The core idea is to create augmented views of each data point and train the network so that embeddings of different augmentations of the same point are pulled together (positive pairs) while embeddings of different points are pushed apart (negative pairs). Clustering is then performed on the resulting representation space, or a clustering head is trained jointly with the contrastive objective. Variants include approaches that integrate graph convolutional networks with contrastive learning for multi-scale structure discovery, as well as multi-view contrastive methods that handle data with multiple feature sets or modalities.
The following table provides a summary comparison of the major clustering algorithms:
| Algorithm | Cluster shape | Handles noise | Requires k | Scalability | Soft assignments |
|---|---|---|---|---|---|
| K-means | Spherical | No | Yes | Excellent | No |
| K-medoids | Spherical | Partially | Yes | Moderate | No |
| DBSCAN | Arbitrary | Yes | No | Good (with indexing) | No |
| HDBSCAN | Arbitrary | Yes | No | Good | Soft (via probabilities) |
| Agglomerative | Varies by linkage | No | Optional (cut dendrogram) | Poor for large datasets | No |
| BIRCH | Spherical | Partially | Optional | Excellent (single scan) | No |
| Mean Shift | Arbitrary | Partially | No | Poor for large datasets | No |
| GMM | Elliptical | No | Yes | Moderate | Yes |
| Spectral | Arbitrary | No | Yes | Poor for large datasets | No |
| OPTICS | Arbitrary | Yes | No | Moderate | No |
Clustering large-scale datasets presents several practical challenges. The time and memory requirements of many algorithms grow quadratically or cubically with the number of data points, making them impractical for datasets with millions or billions of records.
Scalable algorithms and techniques:
- Mini-Batch K-means: processes small random batches, allowing streaming or out-of-core training (a streaming sketch follows this list).
- BIRCH: compresses the data into a CF Tree in a single scan before global clustering.
- Sampling: cluster a representative sample, then assign the remaining points to the nearest resulting cluster.
- Spatial indexing and approximate nearest neighbors: KD-trees, ball trees, and approximate indexes accelerate the neighborhood queries used by density-based methods.
- Coresets: small weighted subsets of the data that approximately preserve the clustering cost.
- Distributed computing: frameworks such as Apache Spark MLlib provide parallel implementations of K-means and Gaussian mixture models.
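As an example of the streaming approach, scikit-learn's MiniBatchKMeans can be updated one chunk at a time with `partial_fit`, so the full dataset never needs to be in memory at once; the chunk sizes here are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=8, random_state=0)

# Simulate a stream: update the model with one chunk of data at a time.
for _ in range(100):
    chunk = np.random.randn(1000, 16)   # stand-in for data read from disk or a stream
    mbk.partial_fit(chunk)

# New points can be assigned to the learned clusters as they arrive.
new_labels = mbk.predict(np.random.randn(5, 16))
```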
Choosing the right approach depends on the dataset size, available hardware, and whether the data arrives in a stream or batch.
Clustering has a wide range of applications across many industries and scientific fields:
| Domain | Application | Common algorithms |
|---|---|---|
| Marketing | Customer segmentation, market basket analysis | K-means, GMM |
| Biology | Gene expression analysis, species taxonomy, single-cell RNA-seq | Hierarchical, UMAP + clustering, HDBSCAN |
| Computer vision | Image segmentation, object detection, image compression | K-means, spectral clustering, Mean Shift |
| Natural language processing | Document clustering, topic modeling | K-means, hierarchical |
| Cybersecurity | Anomaly detection, intrusion detection | DBSCAN, Isolation Forest |
| Medicine | Patient subtype identification, medical image analysis | GMM, hierarchical |
| Social network analysis | Community detection, influence propagation | Spectral clustering, label propagation |
| Geospatial analysis | Spatial clustering of events, hotspot detection | DBSCAN, HDBSCAN |
| Astronomy | Star classification, galaxy morphology | K-means, GMM, DBSCAN |
| Recommender systems | User/item grouping for collaborative filtering | K-means, GMM |
| Autonomous driving | LiDAR point cloud segmentation | DBSCAN, Euclidean clustering |
| Finance | Fraud detection, portfolio grouping | K-means, DBSCAN, GMM |
Imagine you have a big box of differently shaped and colored toys. Clustering in machine learning is like sorting these toys into groups based on their similarities, like putting all the red cars together or all the round balls together. This helps us understand and organize the toys better. In the same way, clustering helps computers make sense of lots of information by grouping similar pieces of data together. The computer looks at each piece of data, figures out which other pieces it is most similar to, and puts them in the same group. Nobody tells the computer how to sort the toys ahead of time; it figures out the groups on its own by noticing which toys look alike.