Machine learning terms/Clustering

See also: Machine learning terms

Clustering is a class of unsupervised machine learning techniques that partition a set of objects into groups, called clusters, such that objects in the same group are more similar to each other than to objects in other groups. Unlike supervised learning, clustering algorithms operate on unlabeled data: the algorithm receives feature vectors but no target labels, and it discovers structure on its own. Clustering is one of the oldest and most widely used branches of data mining, pattern recognition, and statistical analysis, with documented applications dating back to the 1930s in psychology and biology.

overview

The central problem of clustering is to define what makes two objects similar, and then to find a partition of the data that maximizes within-cluster similarity while minimizing between-cluster similarity. Different definitions of similarity, different cluster shapes, and different assumptions about the data generating process give rise to dozens of distinct algorithms. There is no single best clustering algorithm: the right choice depends on the geometry of the data, the size of the dataset, the noise level, and what the user intends to do with the resulting clusters.

Clustering is closely related to other unsupervised tasks such as dimensionality reduction, density estimation, and anomaly detection. It is often the first step in exploratory data analysis, customer segmentation, image processing, bioinformatics, and document organization.

A partition is hard if every object belongs to exactly one cluster, or soft (also called fuzzy) if each object has a membership weight in every cluster. Clustering can also be classified as:

Property	Description
Hard vs. soft	Whether each point belongs to one cluster or has fractional membership in many
Flat vs. hierarchical	Whether clusters form a single partition or a nested tree
Parametric vs. non-parametric	Whether the algorithm assumes a specific distribution (e.g., Gaussian)
Exclusive vs. overlapping	Whether clusters can share members
Complete vs. partial	Whether every point must be assigned to a cluster, or some can be left as noise

core algorithms

k-means

K-means is the most widely used clustering algorithm. It partitions $n$ observations into $k$ clusters by assigning each observation to the cluster whose mean (the centroid) is nearest, then recomputing the centroids until the assignment stops changing. K-means is a special case of centroid-based clustering and minimizes the within-cluster sum of squared Euclidean distances.

The algorithm is also called the Lloyd algorithm after Stuart Lloyd, who described it in a 1957 Bell Labs technical note that was not formally published until 1982 in IEEE Transactions on Information Theory. The name k-means comes from a 1967 paper by James MacQueen at the Fifth Berkeley Symposium. Edward Forgy proposed an identical method in 1965, and Hugo Steinhaus outlined the idea in 1956.

Standard k-means has time complexity $O(n \cdot k \cdot d \cdot i)$ for $n$ points, $k$ clusters, $d$ dimensions, and $i$ iterations. Finding the global optimum is NP-hard, so the algorithm only finds a local minimum that depends on initial centroid placement.

k-means++

K-means++ is a careful seeding procedure introduced by David Arthur and Sergei Vassilvitskii in their 2007 paper "k-means++: The Advantages of Careful Seeding" (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms). The first centroid is chosen uniformly at random; each subsequent centroid is chosen with probability proportional to the squared distance from the nearest already-chosen centroid. The authors proved that k-means++ produces a clustering whose expected cost is within $O(\log k)$ of the optimum, a substantial improvement over random initialization. K-means++ is the default initialization in scikit-learn, MATLAB, and many other libraries.

k-medoids and PAM

K-medoids is a variant in which each cluster is represented by an actual data point (the medoid) rather than a computed mean. Because medoids are real observations, k-medoids works with arbitrary distance functions and is more robust to outliers than k-means. The Partitioning Around Medoids (PAM) algorithm was introduced by Leonard Kaufman and Peter Rousseeuw in their 1987 report and 1990 book Finding Groups in Data: An Introduction to Cluster Analysis. PAM has higher computational cost than k-means, $O(k(n-k)^2)$ per iteration, which led to faster variants such as CLARA (Clustering Large Applications) and CLARANS (Clustering Large Applications based on Randomized Search) by Ng and Han in 1994. The closely related k-median problem minimizes the sum of L1 distances rather than squared L2 distances.

hierarchical clustering

Hierarchical clustering builds a tree of clusters called a dendrogram. The two main strategies are:

Agglomerative clustering (bottom-up): every point starts as its own cluster, and the closest pair of clusters is merged at each step.
Divisive clustering (top-down): all points start in one cluster, which is recursively split.

The agglomerative approach was popularized by Joe H. Ward in his 1963 JASA paper, which introduced Ward's method of minimizing variance increase. Other linkage criteria include single linkage (minimum distance, studied by Florek et al. in 1951 and Sneath in 1957), complete linkage (maximum distance), and average linkage (UPGMA, Sokal and Michener 1958).

Naive agglomerative clustering is $O(n^3)$ time and $O(n^2)$ space, though SLINK (Sibson 1973) and CLINK (Defays 1977) achieve $O(n^2)$. Divisive methods include DIANA from Kaufman and Rousseeuw (1990).

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu at KDD-96. The paper received the SIGKDD Test of Time Award in 2014.

DBSCAN takes two parameters: $\varepsilon$ (the neighborhood radius) and MinPts (minimum points required for a dense region). A core point has at least MinPts neighbors within $\varepsilon$. Clusters are grown by connecting core points whose neighborhoods overlap; points unreachable from any core point are labeled as noise. DBSCAN discovers clusters of arbitrary shape without pre-specifying their number, but it struggles when cluster densities vary widely.

HDBSCAN

HDBSCAN (Hierarchical DBSCAN) was introduced by Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander at PAKDD 2013. It generalizes DBSCAN by varying $\varepsilon$ over a continuous range, then extracting a flat clustering from the hierarchy using a stability-based criterion. HDBSCAN handles clusters of varying density and requires only one main parameter (min_cluster_size). The reference Python implementation is the hdbscan library by Leland McInnes.

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) was introduced by Ankerst, Breunig, Kriegel, and Sander at ACM SIGMOD 1999. Rather than producing a flat clustering, OPTICS produces a reachability plot that encodes the density-based clustering structure at all scales. The user can extract clusters at any chosen density level.

mean shift

Mean shift is a non-parametric mode-seeking algorithm proposed by Fukunaga and Hostetler in 1975 and brought to prominence by Dorin Comaniciu and Peter Meer in IEEE TPAMI 2002. The algorithm shifts each point iteratively toward the local maximum of an estimated kernel density. Points converging to the same mode form a cluster. Mean shift requires only a bandwidth parameter and is widely used for image segmentation and visual tracking.

spectral clustering

Spectral clustering treats the data as a graph and uses the eigenvectors of the graph Laplacian to embed the points in a low-dimensional space where standard clustering (typically k-means) is applied. The most cited modern formulation is Ng, Jordan, and Weiss (NeurIPS 2001). Earlier work by Shi and Malik (IEEE TPAMI 2000) and Meila and Shi (2001) established the graph-cut interpretation. Von Luxburg's 2007 "A Tutorial on Spectral Clustering" remains the standard reference. Spectral clustering recovers non-convex clusters that fail k-means, but $O(n^3)$ eigendecomposition limits scalability without approximations such as the Nyström method.

Gaussian mixture models

A Gaussian mixture model (GMM) assumes the data are generated by a weighted sum of $k$ multivariate Gaussians. Parameters are fit using the Expectation-Maximization (EM) algorithm of Dempster, Laird, and Rubin (1977). GMMs produce a soft clustering: each point has a posterior probability of belonging to each component. K-means is a limit case of EM on a GMM with isotropic Gaussians of equal variance. GMMs handle elliptical clusters and support model selection via AIC and BIC, but EM finds only local optima.

affinity propagation

Affinity propagation was introduced by Brendan J. Frey and Delbert Dueck in Science 2007. The algorithm exchanges "responsibility" and "availability" messages between every pair of points until a stable set of exemplars emerges; each non-exemplar is assigned to its nearest exemplar. The number of clusters is not pre-specified, but $O(n^2)$ memory limits scalability.

BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) was introduced by Zhang, Ramakrishnan, and Livny at SIGMOD 1996. Designed for very large databases that do not fit in memory, BIRCH builds a Clustering Feature tree in a single pass, summarizing dense regions as compact statistics, then runs a global clustering on leaf entries. BIRCH is one of the few clustering algorithms with provably linear time complexity in dataset size.

other classical algorithms

Algorithm	Authors and year	Key idea
Wishart's mode analysis	Wishart, 1969	Early density-based method, predecessor of DBSCAN
CURE	Guha, Rastogi, Shim, 1998	Cluster as a fixed number of well-scattered points
ROCK	Guha, Rastogi, Shim, 1999	Uses links (shared neighbors) for categorical data
CHAMELEON	Karypis, Han, Kumar, 1999	Graph partitioning then dynamic merging
DENCLUE	Hinneburg and Keim, 1998	Density-attractor via kernel density estimation

distance and similarity measures

Every clustering algorithm depends on a notion of distance or similarity measure. Common metrics include:

Metric	Formula	Typical use
Euclidean (L2)	$\sqrt{\sum_i (x_i - y_i)^2}$	Continuous numeric features, isotropic clusters
Manhattan (L1)	$\sum_i \lvert x_i - y_i \rvert$	Robust to outliers, grid-like data
Minkowski (Lp)	$(\sum_i \lvert x_i - y_i \rvert^p)^{1/p}$	Generalization of L1 and L2
Chebyshev (L∞)	$\max_i \lvert x_i - y_i \rvert$	Worst-case coordinate difference
Cosine	$1 - \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$	Text and high-dimensional sparse vectors
Mahalanobis	$\sqrt{(x-y)^T \Sigma^{-1} (x-y)}$	Correlated features with known covariance
Hamming	$\sum_i \mathbb{1}[x_i \neq y_i]$	Binary or categorical data
Jaccard	$1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$	Set-valued or sparse binary data
Edit (Levenshtein)	Min number of insert, delete, substitute operations	Strings, sequences
Dynamic time warping	Optimal alignment cost between time series	Time series analysis, speech
Earth mover's distance	Wasserstein-1 between probability distributions	Image histograms, distributions

The choice of distance is itself part of the modeling decision. Cosine distance is standard for text because it ignores document length; Mahalanobis distance is appropriate when features have very different scales or are correlated; dynamic time warping handles time series of different lengths. Many algorithms also support kernelization, which replaces the inner product with a Mercer kernel and allows clustering in an implicit nonlinear feature space.

Feature scaling matters: a Euclidean k-means run on raw data with one feature in dollars and another in years will be dominated by the larger-scale feature. Standardization (zero mean, unit variance) or min-max scaling is almost always recommended.

evaluation metrics

Evaluating a clustering is harder than evaluating a classifier because there are no ground-truth labels in the typical setting. Two families of metrics are used:

internal metrics

Internal metrics use only the data and the clustering itself.

Metric	Reference	Direction
Silhouette coefficient	Rousseeuw, 1987	Higher is better, range [-1, 1]
Davies-Bouldin index	Davies and Bouldin, 1979	Lower is better
Calinski-Harabasz (variance ratio)	Caliński and Harabasz, 1974	Higher is better
Dunn index	Dunn, 1974	Higher is better
Within-cluster sum of squares	(inertia, used by k-means)	Lower is better, but biased toward more clusters
Xie-Beni index	Xie and Beni, 1991	For fuzzy clustering

The silhouette coefficient $s(i)$ for point $i$ is defined as $(b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the mean intra-cluster distance and $b_i$ is the mean nearest-other-cluster distance. Values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest misclassification.

external metrics

External metrics compare a clustering to ground-truth class labels, when those are available.

Metric	Reference	Notes
Rand index	Rand, 1971	Probability that two random points are correctly paired
Adjusted Rand index (ARI)	Hubert and Arabie, 1985	Chance-corrected version of Rand index
Normalized mutual information (NMI)	Strehl and Ghosh, 2002	Information-theoretic, range [0, 1]
Adjusted mutual information (AMI)	Vinh, Epps, Bailey, 2010	Chance-corrected NMI
Fowlkes-Mallows index	Fowlkes and Mallows, 1983	Geometric mean of pairwise precision and recall
V-measure (homogeneity, completeness)	Rosenberg and Hirschberg, 2007	Harmonic mean of homogeneity and completeness
Purity	Manning et al., 2008	Simple but biased toward many small clusters

Kleinberg's impossibility theorem (Jon Kleinberg, 2002, NeurIPS) proved that no clustering function can simultaneously satisfy three intuitive axioms (scale-invariance, richness, and consistency), which is one reason why so many competing algorithms and metrics coexist.

choosing the number of clusters

Most partitional algorithms require the user to specify $k$ in advance. Several heuristics help:

Elbow method: plot within-cluster sum of squares against $k$ and look for the point where adding another cluster yields diminishing returns. The method is informal and the elbow is often ambiguous.
Silhouette method: compute the average silhouette coefficient for each candidate $k$ and pick the maximum.
Gap statistic: introduced by Tibshirani, Walther, and Hastie in their 2001 Journal of the Royal Statistical Society paper. It compares the within-cluster dispersion to the expected dispersion under a null reference distribution.
Bayesian information criterion (BIC): applicable to model-based clusterings such as GMMs.
X-means: an extension of k-means by Pelleg and Moore (2000) that uses BIC to select $k$ automatically.
G-means: Hamerly and Elkan (2003), splits clusters until each component looks Gaussian.
Density-based methods (DBSCAN, HDBSCAN, mean shift): avoid the question entirely by determining the number of clusters from the data.

modern clustering

deep clustering

Deep clustering jointly learns a representation and a clustering. Notable methods include:

Deep Embedded Clustering (DEC): Xie, Girshick, and Farhadi (2016) train an autoencoder and refine cluster assignments using a Student's t distribution and KL divergence target.
Improved DEC (IDEC): Guo et al. (2017) add a reconstruction loss to preserve local structure.
Deep Clustering Network (DCN): Yang et al. (2017) jointly optimize reconstruction and k-means objectives.
JULE (Joint Unsupervised Learning): Yang, Parikh, Batra (2016) combine agglomerative clustering with a CNN.
VaDE (Variational Deep Embedding): Jiang et al. (2017) use a variational autoencoder with a Gaussian mixture prior.
DeepCluster: Caron et al. (2018) iteratively cluster CNN features and use the cluster assignments as pseudo-labels.

contrastive and self-supervised clustering

The rise of self-supervised learning has produced methods that cluster on top of representations learned without labels. SwAV (Caron et al., 2020), SCAN (Van Gansbeke et al., 2020), and SeLa (Asano, Rupprecht, Vedaldi, 2020) all combine contrastive or self-distillation losses with online clustering. SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) provide the underlying contrastive learning framework, and the resulting embeddings cluster well under k-means without further training.

embedding-based clustering

With the spread of pre-trained transformer models, a common modern workflow is to encode raw data (text, images, audio) into dense embeddings using a model like Sentence-BERT, CLIP, or a large language model, then run a classical algorithm such as HDBSCAN or k-means on the embeddings. The BERTopic library (Maarten Grootendorst, 2022) packages this idea for topic modeling: it combines transformer embeddings, UMAP dimensionality reduction, HDBSCAN clustering, and class-based TF-IDF for topic representation.

subspace and high-dimensional clustering

In high dimensions, the curse of dimensionality makes most distance metrics nearly useless: all pairs of points become roughly equidistant. Subspace clustering algorithms search for clusters in axis-aligned or arbitrary low-dimensional subspaces. Examples include CLIQUE (Agrawal et al., 1998), SUBCLU (Kailing, Kriegel, Kröger, 2004), and sparse subspace clustering (Elhamifar and Vidal, 2009). Sketching techniques such as random projections, locality-sensitive hashing, and the Johnson-Lindenstrauss lemma are often applied as preprocessing.

software implementations

Library	Language	Notable algorithms
scikit-learn	Python	k-means, k-means++, Mini-Batch, agglomerative, DBSCAN, OPTICS, GMM, mean shift, spectral, affinity propagation, BIRCH
`hdbscan`	Python	HDBSCAN, robust single linkage
`pyclustering`	Python, C++	40+ algorithms
`cluster`	R	PAM, CLARA, AGNES, DIANA, FANNY
`mclust`	R	GMM with model selection
ELKI	Java	Research-grade implementations
Apache Spark MLlib	Scala, Python	Distributed k-means, GMM, bisecting k-means
RAPIDS cuML	Python (GPU)	GPU-accelerated k-means, DBSCAN, HDBSCAN
FAISS	C++, Python	Large-scale k-means and nearest-neighbor search

use cases

customer segmentation

Marketers cluster customers by purchase history, demographics, and engagement to design targeted campaigns. RFM analysis (recency, frequency, monetary value) is a classical clustering input. K-means on standardized RFM features is still common in industry, though hierarchical and density-based methods are used when segments have unequal size.

document and topic clustering

Clustering text documents by TF-IDF or transformer embeddings reveals thematic structure in news archives, customer support tickets, and scientific literature. Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003) is technically a probabilistic topic model but functions as a soft clustering. BERTopic and Top2Vec are modern embedding-based alternatives.

image segmentation

Grouping pixels by color, position, and texture is one of the oldest applications of clustering. Mean shift, normalized cuts, and Felzenszwalb's graph-based segmentation (2004) are classical methods. Modern semantic segmentation has largely moved to supervised deep learning, but clustering still plays a role in superpixel generation (SLIC, Achanta et al. 2012) and unsupervised pretraining.

anomaly detection

Density-based clustering algorithms naturally identify points that do not belong to any cluster: DBSCAN labels them as noise, and HDBSCAN provides per-point outlier scores via the GLOSH algorithm (Campello et al., 2015). Anomaly detection using clustering is widely used in fraud detection, network intrusion detection, and equipment monitoring.

bioinformatics

Gene expression analysis routinely clusters genes or samples to identify co-regulated groups. The classic Eisen et al. (1998) PNAS paper applied hierarchical clustering to microarray data. Single-cell RNA sequencing pipelines such as Seurat (Satija lab) and Scanpy use Leiden and Louvain community detection on shared nearest-neighbor graphs to identify cell types.

recommender systems

User-based and item-based collaborative filtering both rely on clustering or near-neighbor structure. Matrix factorization methods (Koren, Bell, Volinsky, 2009) project users and items into a shared embedding space where nearest-neighbor and clustering operations become meaningful.

other application areas

Clustering also appears in network science (community detection with Louvain, Leiden, Infomap), computer vision (visual word codebooks, vector quantization for compression), audio (speaker diarization), astronomy (galaxy clustering, photometric redshift), climate science (regime identification), and cybersecurity (malware family and log clustering).

pitfalls

Clustering looks deceptively simple, but several pitfalls are worth flagging:

Curse of dimensionality: in high dimensions, distances concentrate and clusters become statistically indistinguishable from random points.
Initialization sensitivity: k-means and EM can converge to different local optima depending on the starting point.
Cluster validity is subjective: the same dataset can be partitioned in many equally valid ways depending on the question.
Scaling matters: features measured in different units must be normalized before applying distance-based algorithms.
Outliers: a few extreme points can pull k-means centroids far from the bulk of the data; prefer k-medoids or robust scaling.
Cluster shape assumptions: k-means assumes convex, isotropic, equally sized clusters; when that fails, prefer GMM, spectral, or density-based methods.
Stability: resampling-based stability scores (Lange et al., 2004; Ben-Hur, Elisseeff, Guyon, 2002) are useful sanity checks.

index of clustering term wiki pages

references

Lloyd, S. P. (1982). "Least Squares Quantization in PCM." IEEE Trans. Information Theory, 28(2), 129-137.
MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations." 5th Berkeley Symposium, 1, 281-297.
Arthur, D. and Vassilvitskii, S. (2007). "k-means++: The Advantages of Careful Seeding." ACM-SIAM SODA, 1027-1035.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
Ward, J. H. (1963). "Hierarchical Grouping to Optimize an Objective Function." JASA, 58(301), 236-244.
Sibson, R. (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method." The Computer Journal, 16(1), 30-34.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." KDD-96, 226-231.
Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013). "Density-Based Clustering Based on Hierarchical Density Estimates." PAKDD, 160-172.
Ankerst, M. et al. (1999). "OPTICS: Ordering Points To Identify the Clustering Structure." ACM SIGMOD Record, 28(2), 49-60.
Comaniciu, D. and Meer, P. (2002). "Mean Shift: A Robust Approach Toward Feature Space Analysis." IEEE TPAMI, 24(5), 603-619.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). "On Spectral Clustering: Analysis and an Algorithm." NeurIPS 14.
Shi, J. and Malik, J. (2000). "Normalized Cuts and Image Segmentation." IEEE TPAMI, 22(8), 888-905.
von Luxburg, U. (2007). "A Tutorial on Spectral Clustering." Statistics and Computing, 17(4), 395-416.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm." JRSS B, 39(1), 1-38.
Frey, B. J. and Dueck, D. (2007). "Clustering by Passing Messages Between Data Points." Science, 315(5814), 972-976.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). "BIRCH: An Efficient Data Clustering Method for Very Large Databases." ACM SIGMOD, 103-114.
Rousseeuw, P. J. (1987). "Silhouettes." Journal of Computational and Applied Mathematics, 20, 53-65.
Davies, D. L. and Bouldin, D. W. (1979). "A Cluster Separation Measure." IEEE TPAMI, 1(2), 224-227.
Caliński, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis." Communications in Statistics, 3(1), 1-27.
Hubert, L. and Arabie, P. (1985). "Comparing partitions." Journal of Classification, 2(1), 193-218.
Tibshirani, R., Walther, G., and Hastie, T. (2001). "Estimating the number of clusters in a data set via the gap statistic." JRSS B, 63(2), 411-423.
Kleinberg, J. (2002). "An Impossibility Theorem for Clustering." NeurIPS 15.
Xie, J., Girshick, R., and Farhadi, A. (2016). "Unsupervised Deep Embedding for Clustering Analysis." ICML, 478-487.
Grootendorst, M. (2022). "BERTopic." arXiv:2203.05794.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825-2830.
Jain, A. K. (2010). "Data clustering: 50 years beyond K-means." Pattern Recognition Letters, 31(8), 651-666.
Xu, R. and Wunsch, D. (2005). "Survey of Clustering Algorithms." IEEE TNN, 16(3), 645-678.

overview

core algorithms

k-means

k-means++

k-medoids and PAM

hierarchical clustering

DBSCAN

HDBSCAN

OPTICS

mean shift

spectral clustering

Gaussian mixture models

affinity propagation

BIRCH

other classical algorithms

distance and similarity measures

evaluation metrics

internal metrics

external metrics

choosing the number of clusters

modern clustering

deep clustering

contrastive and self-supervised clustering

embedding-based clustering

subspace and high-dimensional clustering

software implementations

use cases

customer segmentation

document and topic clustering

image segmentation

anomaly detection

bioinformatics

recommender systems

other application areas

pitfalls

index of clustering term wiki pages

references

Improve this article

Related Articles

ARC-AGI 2

AUC-ROC

Machine learning terms/Decision Forests

Machine learning terms/Fairness

Machine learning terms/Fundamentals

Machine learning terms/Google Cloud

overview

core algorithms

k-means

k-means++

k-medoids and PAM

hierarchical clustering

DBSCAN

HDBSCAN

OPTICS

mean shift

spectral clustering

Gaussian mixture models

affinity propagation

BIRCH

other classical algorithms

distance and similarity measures

evaluation metrics

internal metrics

external metrics

choosing the number of clusters

modern clustering

deep clustering

contrastive and self-supervised clustering

embedding-based clustering

subspace and high-dimensional clustering

software implementations

use cases

customer segmentation

document and topic clustering

image segmentation

anomaly detection

bioinformatics

recommender systems

other application areas

pitfalls