Centroid-based clustering

Introduction

Centroid-based clustering is a family of machine learning algorithms that group data by representing each cluster with a single prototype point called a centroid. A point is assigned to whichever cluster has the nearest centroid under some distance measure. The approach falls within unsupervised learning, since the algorithms infer structure from features alone without labelled targets.

The canonical example is the k-means clustering algorithm, which partitions a dataset into K groups by minimising the sum of squared distances between each point and the mean of its assigned cluster. Variants such as k-medoids, k-medians, k-modes, k-prototypes, fuzzy c-means, mini-batch k-means, and bisecting k-means change the prototype, distance function, or optimisation strategy, but all share the same idea: describe each cluster by one summary point and assign data based on proximity to that summary.

Centroid-based methods are popular because they are simple, easy to explain, and fast on moderate-sized data. Weaknesses include a tendency to find roughly spherical clusters of similar size and a strong dependence on K and the initial centroid placement.

How it works

Given n points in some feature space, a centroid-based algorithm chooses K prototype points (the centroids) and assigns every data point to one of them so that points in the same cluster are as close to their centroid as possible. The standard objective for k-means is the within-cluster sum of squares (WCSS), also called the inertia. For a partition into clusters with centroids m_1, ..., m_K, the WCSS is the sum over all clusters of the squared Euclidean distances between each point and its centroid.
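
Written out, the objective looks as follows. The symbol C_k, for the set of points assigned to cluster k, is notation introduced here for illustration; m_k is the centroid from the text above.

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - m_k \rVert^2,
\qquad
m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x
```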

Minimising WCSS exactly is NP-hard. The problem remains NP-hard when the points are restricted to the plane and K is part of the input (Mahajan, Nimbhorkar, and Varadarajan, 2009), and also for just two clusters when the dimension is unrestricted. In practice, algorithms use iterative heuristics that converge to a local minimum. The most widely used is Lloyd's algorithm.

Voronoi tessellation

Once centroids are fixed, assigning every point in feature space to its nearest centroid defines a Voronoi tessellation: the space is partitioned into K convex regions, one per centroid. When each region's centroid is also the mean of the points it covers, the result is a centroidal Voronoi tessellation. Lloyd's algorithm can be viewed as iteratively building such a tessellation by recomputing the Voronoi regions and moving each centroid to the mean of its region.

This explains why standard k-means produces clusters with straight, flat boundaries. The decision surface between any two clusters is a hyperplane: the perpendicular bisector of the segment joining their centroids. Each cluster is therefore a convex polyhedral region, and datasets with curved or interlocking shapes are usually a poor fit.
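
The flat boundary follows from a short calculation. A point x is equidistant from two centroids m_i and m_j (indices chosen here for illustration) exactly when the squared terms in x cancel, leaving an equation that is linear in x:

```latex
\lVert x - m_i \rVert^2 = \lVert x - m_j \rVert^2
\;\Longleftrightarrow\;
-2\, m_i^\top x + \lVert m_i \rVert^2 = -2\, m_j^\top x + \lVert m_j \rVert^2
\;\Longleftrightarrow\;
(m_j - m_i)^\top x = \tfrac{1}{2} \left( \lVert m_j \rVert^2 - \lVert m_i \rVert^2 \right)
```

The last line describes a hyperplane with normal vector m_j - m_i that passes through the midpoint (m_i + m_j) / 2.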

K-means and Lloyd's algorithm

Stuart P. Lloyd of Bell Labs proposed the standard procedure in a 1957 internal memorandum on pulse-code modulation. The work circulated informally and was not published until 1982 in IEEE Transactions on Information Theory. Joel Max independently published a similar quantisation procedure in 1960 (hence "Lloyd-Max"), and Edward W. Forgy presented the same clustering scheme in 1965 ("Lloyd-Forgy"). The term "k-means" was coined by James MacQueen in 1967, though the basic idea traces back to Hugo Steinhaus in 1956.

Algorithm steps

Lloyd's algorithm for k-means proceeds as follows:

1. Choose an initial set of K centroids, often by randomly picking K data points or by using k-means++.
2. Assignment step: assign every data point to the cluster whose centroid is nearest under squared Euclidean distance.
3. Update step: recompute each centroid as the arithmetic mean of the points currently assigned to it.
4. Repeat steps 2 and 3 until assignments no longer change, the centroids move by less than some tolerance, or a maximum iteration count is reached.
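
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production implementation: the function names, the `seed` and `max_iter` parameters, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Run Lloyd's algorithm; returns (centroids, labels, wcss)."""
    rng = random.Random(seed)
    # Initialisation: pick k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stopping rule: no assignment changed since the last pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, wcss = lloyd_kmeans(points, k=2)
```

On this toy dataset the two groups are recovered and the WCSS is 8/3. For real workloads a vectorised library implementation is the usual choice.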

Each iteration runs in time proportional to n times K times d, where d is the number of features. Neither step can increase the WCSS, and there are only finitely many ways to partition n points, so the algorithm terminates after a finite number of iterations. The result is only guaranteed to be a local minimum, not the global one. Most implementations therefore run the algorithm several times with different initial centroids and keep the run with the lowest WCSS.

K-means assumes clusters are convex, roughly equal in size and variance, and shaped like spheres in the feature space. When these assumptions hold, the algorithm is fast and produces interpretable clusters; when they do not, k-means can give visibly wrong results, for example splitting one elongated cluster or merging two adjacent clusters of unequal size. Worst-case running time can be superpolynomial in n, but on realistic data the number of iterations is usually well under fifty, with most of the improvement happening in the first handful.

Initialisation: k-means++

Plain random initialisation can produce clusterings arbitrarily worse than the optimal partition. In 2007, David Arthur and Sergei Vassilvitskii proposed k-means++, a seeding scheme that spreads the initial centroids out probabilistically:

1. Pick the first centroid uniformly at random from the data.
2. For each remaining point x, compute D(x), the distance from x to the nearest centroid already chosen.
3. Pick the next centroid with probability proportional to D(x) squared, so points far from existing centroids are more likely to be selected.
4. Repeat steps 2 and 3 until K centroids are chosen, then run standard Lloyd k-means.
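
A minimal sketch of the seeding procedure, using only the Python standard library. The function name and the `seed` parameter are illustrative assumptions, and the code assumes the data contains at least K distinct points (otherwise every weight becomes zero and the weighted draw fails).

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, seed=0):
    """Choose k initial centroids with D(x)^2 weighting."""
    rng = random.Random(seed)
    # First centroid: uniformly at random from the data.
    centroids = [rng.choice(points)]
    # d2[i] holds D(x_i)^2, the squared distance to the nearest chosen centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        # Draw the next centroid with probability proportional to D(x)^2.
        # Points already chosen have weight 0 and cannot be drawn again.
        next_centroid = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(next_centroid)
        # Refresh D(x)^2 against the newly added centroid.
        d2 = [min(old, squared_distance(p, next_centroid))
              for old, p in zip(d2, points)]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = kmeans_pp_seeds(points, k=3)
```

The returned seeds would then replace the random initial centroids in a Lloyd-style loop.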

Arthur and Vassilvitskii showed that k-means++ gives an expected approximation ratio of O(log K) relative to the optimal WCSS. Their experiments reported roughly two-fold improvements in running time and as much as a 1000-fold reduction in error on certain datasets. Scikit-learn uses k-means++ as the default initialiser. A scalable variant called k-means|| (Bahmani et al., 2012) keeps the theoretical guarantees while reducing the number of sequential passes through the data.

K-medoids and PAM

The k-medoids algorithm replaces the cluster mean with a medoid, an actual data point that minimises the sum of dissimilarities to other points in its cluster. Leonard Kaufman and Peter Rousseeuw introduced Partitioning Around Medoids (PAM) in 1987. Because medoids are drawn from the data and dissimilarities can be arbitrary, k-medoids works with any distance function, including non-Euclidean ones such as Manhattan distance, cosine dissimilarity, or domain-specific measures for strings and graphs.

PAM has two phases. The BUILD phase greedily selects K data points as initial medoids to minimise total dissimilarity from each non-medoid to its nearest medoid. The SWAP phase then iteratively swaps each medoid with each non-medoid and performs the swap that reduces cost the most, until no swap improves the objective.
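
The two phases can be sketched naively as below. This is an illustration under stated assumptions, not the optimised procedure from the literature: it takes a precomputed dissimilarity matrix as a list of lists, the function names are hypothetical, and it recomputes the full cost for every candidate, which is slower than a careful PAM implementation.

```python
def total_cost(dist, medoids):
    """Sum over all points of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def pam(dist, k):
    """Naive PAM on an n x n dissimilarity matrix; returns (medoid indices, cost)."""
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(dist, medoids + [i]))
        medoids.append(best)
    # SWAP: apply the single best (medoid, non-medoid) exchange until none helps.
    cost = total_cost(dist, medoids)
    while True:
        best_cost, best_swap = cost, None
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, trial
        if best_swap is None:
            return sorted(medoids), cost
        medoids, cost = best_swap, best_cost

# One-dimensional toy data with absolute difference as the dissimilarity.
values = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, cost = pam(dist, k=2)
```

Here BUILD first selects the values 3 and 11 (cost 5), and a single SWAP replaces 3 with 2, giving medoids at indices 1 and 4 with total cost 4.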

PAM has a runtime of O(K times (n minus K) squared) per iteration, which limits it to small datasets. CLARA runs PAM on multiple random samples and keeps the best clustering; CLARANS restricts the SWAP phase to a random subset of candidate swaps. Schubert and Rousseeuw's FastPAM and FasterPAM (2019-2021) reduced the per-iteration cost to roughly O(n squared), and BanditPAM (Tiwari et al., 2020) used multi-armed bandit techniques to focus computation on promising swaps.

K-medoids is more robust to outliers than k-means: a single far-away point cannot pull a medoid the way it pulls a mean. The trade-off is higher computational cost and, on clean data, a slightly worse WCSS.

Other variants

A number of related algorithms keep the centroid-based skeleton but change the prototype, the distance, or the optimisation strategy.

Algorithm	Prototype	Notes
K-means	Arithmetic mean	Minimises squared Euclidean distance; default for numeric data
K-medoids (PAM)	Actual data point (medoid)	Works with arbitrary dissimilarities; more robust to outliers
K-medians	Coordinate-wise median	Optimises sum of L1 distances; less affected by extreme values
K-modes	Mode of each attribute	Designed for categorical data; uses simple matching dissimilarity
K-prototypes	Mixed mean and mode	Handles datasets with both numeric and categorical features
Fuzzy c-means	Weighted mean	Each point has membership weights in all clusters
Mini-batch k-means	Running average mean	Centroids updated from small random batches; scales to huge data
Bisecting k-means	Arithmetic mean	Repeatedly splits one cluster at a time using 2-means
Spherical k-means	Unit-norm mean	Uses cosine similarity; popular for text and embedding data
Hartigan-Wong	Arithmetic mean	Local search variant that often finds better minima than Lloyd

Fuzzy c-means, introduced by J. C. Dunn in 1973 and refined by James C. Bezdek in 1981, assigns each point a degree of membership in every cluster rather than a hard assignment. A fuzziness parameter m (typically 2) controls how soft the assignments are.
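
In the standard formulation, the algorithm alternates between two updates. The symbols u_{ik} (membership of point x_i in cluster k) and c_k (cluster centre) are notation chosen here for illustration; m is the fuzziness parameter from the text.

```latex
u_{ik} = \left[ \sum_{j=1}^{K}
  \left( \frac{\lVert x_i - c_k \rVert}{\lVert x_i - c_j \rVert} \right)^{2/(m-1)}
\right]^{-1},
\qquad
c_k = \frac{\sum_{i=1}^{n} u_{ik}^{\,m}\, x_i}{\sum_{i=1}^{n} u_{ik}^{\,m}}
```

As m approaches 1 the memberships approach hard 0/1 assignments, and as m grows they flatten towards 1/K for every cluster.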

Mini-batch k-means (Sculley, WWW 2010) draws a small random sample of points at each iteration and uses a running average to update centroids. It loses a little accuracy but reduces computation by orders of magnitude, making it practical for clustering millions of points and for streaming settings.
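
A minimal sketch of the running-average update, assuming a per-centroid learning rate of one over the number of points that centroid has absorbed so far. The function name, the sampling with replacement, and the parameters `batch_size`, `n_iter`, and `seed` are illustrative assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def minibatch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    """Mini-batch k-means with a per-centroid running average."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each centroid has absorbed so far
    for _ in range(n_iter):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Cache the nearest centroid for every batch member before moving anything.
        nearest = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # learning rate shrinks as the centroid matures
            centroids[j] = [(1 - eta) * c + eta * x
                            for c, x in zip(centroids[j], p)]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = minibatch_kmeans(points, k=2, batch_size=4, n_iter=50)
```

Because each update is a convex combination of the old centroid and a data point, every centroid stays inside the bounding box of the data.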

Bisecting k-means (Steinbach, Karypis, and Kumar, 2000) starts with all data in one cluster and repeatedly applies 2-means to split the cluster with the highest variance until K leaves are produced. It yields a hierarchy as a side effect and is often more stable than flat k-means.

Choosing the number of clusters

All centroid-based methods require choosing K in advance. Several heuristics help, but none is foolproof.

The elbow method plots WCSS against K and looks for the point where the curve bends from steep to shallow. Beyond the elbow, adding clusters yields little extra reduction in within-cluster variance. The method works when clusters are clearly separated and fails when the curve is smooth.

The silhouette score, introduced by Peter Rousseeuw in 1987, measures how tightly each point fits inside its assigned cluster relative to its distance from the next nearest cluster. Scores range from minus one to one; values close to one indicate good separation. The average silhouette across all points, computed for each candidate K, gives a single number to maximise. Unlike the elbow method, it considers both within-cluster compactness and between-cluster separation.

The gap statistic, proposed by Tibshirani, Walther, and Hastie in 2001, compares the logarithm of the observed WCSS at each K with its expected value under a null reference distribution, typically generated by sampling uniformly over the range of the data. The chosen K is the smallest value whose gap is at least the gap at K + 1 minus one standard error.
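
In symbols, with W_K denoting the WCSS at K clusters, B the number of reference datasets, and sd_K the standard deviation of log W_K across those reference datasets (notation chosen here for illustration), the rule reads:

```latex
\mathrm{Gap}(K) = \mathbb{E}^{*}\!\left[ \log W_K \right] - \log W_K,
\qquad
s_K = \mathrm{sd}_K \sqrt{1 + 1/B},
\qquad
\hat{K} = \min \left\{ K : \mathrm{Gap}(K) \ge \mathrm{Gap}(K+1) - s_{K+1} \right\}
```

The expectation is estimated by averaging log W_K over the B reference datasets.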

Analysts often compute all three. Other approaches include the Calinski-Harabasz and Davies-Bouldin indexes and the BIC-based X-means algorithm (Pelleg and Moore, 2000).

Comparison with other clustering families

Centroid-based clustering is typically taught alongside hierarchical clustering and density-based clustering.

Hierarchical methods such as agglomerative clustering build a tree of nested partitions. They do not require K up front; the user picks a cut height after the fact. They scale poorly (typically O(n squared) or worse) but offer a richer view of the data and can produce non-spherical clusters depending on the linkage criterion.

Density-based methods such as DBSCAN (Ester, Kriegel, Sander, and Xu, 1996) and OPTICS define clusters as regions of high density separated by low density. They find clusters of arbitrary shape, treat low-density points as noise, and do not require K. The trade-off is two parameters (a neighbourhood radius and a minimum point count) and difficulty when clusters have very different densities.

Centroid-based methods trade shape flexibility for speed. They are usually the fastest of the three families on large numeric datasets and the easiest to explain, but they impose the strongest assumptions about cluster geometry. Gaussian mixture models are closely related, generalising k-means to allow elliptical clusters and probabilistic assignments via the EM algorithm.

Practical considerations

Feature scaling matters. Because k-means uses Euclidean distance, features with larger numeric ranges dominate the objective, so standardising or normalising before clustering is standard practice.
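
As a small illustration of standardisation, the sketch below rescales each column to zero mean and unit standard deviation using only the standard library. The function name and the income and age columns are hypothetical, and the choice of the population standard deviation is an assumption made for this example.

```python
import statistics

def standardise(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 so it is only centred.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

# Hypothetical (income, age) rows: income would dominate raw Euclidean distance.
rows = [(30000.0, 25.0), (60000.0, 35.0), (90000.0, 45.0)]
scaled = standardise(rows)
```

After scaling, both columns span the same range, so neither dominates the distance computation.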

High-dimensional data brings the curse of dimensionality: as d grows, Euclidean distances become more uniform and the contrast between near and far neighbours fades. A common remedy is to reduce dimensionality first using principal component analysis, t-SNE, UMAP, or autoencoders, bearing in mind that t-SNE and UMAP do not preserve global distances, so clusters found in their output should be interpreted with care.

Outliers can pull a cluster mean off its true centre; k-medoids, k-medians, and robust pre-processing reduce this effect.

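Multiple restarts are essentially mandatory when initialisation is random. Scikit-learn's KMeans controls the number of runs with its n_init parameter and keeps the partition with the lowest inertia; older releases defaulted to 10 runs, while recent releases default to n_init="auto", which performs a single run when k-means++ seeding is used. R has kmeans plus the cluster package for PAM, CLARA, and silhouettes. Apache Spark MLlib includes KMeans, BisectingKMeans, and Gaussian mixture models. ELKI offers many k-means variants for research benchmarks.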

Applications

Centroid-based clustering appears across many fields: customer segmentation, document clustering using bag-of-words or embedding vectors, colour quantisation and image segmentation, vector quantisation in speech coding and deep learning systems such as VQ-VAE, anomaly detection (points far from any centroid are flagged), feature learning, bioinformatics (gene expression, single-cell analysis), astronomy (stellar populations from spectroscopic features), and recommender systems where embeddings are pre-clustered for fast retrieval.

Explain like I'm 5

Imagine a playground full of kids and you want to split them into a few groups. You pick spots on the ground as "meeting spots" and ask each kid to run to the nearest one. Then you look at each group, find the average position of the kids, and move the meeting spot to that average. Some kids might now be closer to a different spot, so you let them switch. Keep moving spots and letting kids switch until nothing changes. That is centroid-based clustering: the meeting spots are the centroids, and each group is one cluster.

References

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability.
Forgy, E. W. (1965). Cluster analysis of multivariate data. Biometrics 21, 768-769.
Arthur, D., and Vassilvitskii, S. (2007). k-means++: The Advantages of Careful Seeding. Proc. 18th ACM-SIAM Symposium on Discrete Algorithms, 1027-1035.
Kaufman, L., and Rousseeuw, P. J. (1987). Clustering by means of medoids. In Statistical Data Analysis Based on the L1 Norm.
Schubert, E., and Rousseeuw, P. J. (2021). Fast and eager k-medoids clustering. Information Systems.
Sculley, D. (2010). Web-scale k-means clustering. Proc. 19th International Conference on World Wide Web, 1177-1178.
Steinbach, M., Karypis, G., and Kumar, V. (2000). A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. JRSS Series B, 63(2), 411-423.
Mahajan, M., Nimbhorkar, P., and Varadarajan, K. (2009). The planar k-means problem is NP-hard. WALCOM, 274-285.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 226-231.
Bahmani, B., et al. (2012). Scalable k-means++. PVLDB, 5(7), 622-633.
Wikipedia. k-means clustering. https://en.wikipedia.org/wiki/K-means_clustering
Wikipedia. k-medoids. https://en.wikipedia.org/wiki/K-medoids
Wikipedia. Lloyd's algorithm. https://en.wikipedia.org/wiki/Lloyd%27s_algorithm
Wikipedia. k-means++. https://en.wikipedia.org/wiki/K-means%2B%2B
Scikit-learn documentation. KMeans. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
