Centroid-based clustering
Last reviewed
May 11, 2026
Sources
19 citations
Review status
Source-backed
Revision
v2 · 2,498 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
19 citations
Review status
Source-backed
Revision
v2 · 2,498 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
Centroid-based clustering is a family of machine learning algorithms that group data by representing each cluster with a single prototype point called a centroid. A point is assigned to whichever cluster has the nearest centroid under some distance measure. The approach falls within unsupervised learning, since the algorithms infer structure from features alone without labelled targets.
The canonical example is the k-means clustering algorithm, which partitions a dataset into K groups by minimising the sum of squared distances between each point and the mean of its assigned cluster. Variants such as k-medoids, k-medians, k-modes, k-prototypes, fuzzy c-means, mini-batch k-means, and bisecting k-means change the prototype, distance function, or optimisation strategy, but all share the same idea: describe each cluster by one summary point and assign data based on proximity to that summary.
Centroid-based methods are popular because they are simple, easy to explain, and fast on moderate-sized data. Weaknesses include a tendency to find roughly spherical clusters of similar size and a strong dependence on K and the initial centroid placement.
Given n points in some feature space, a centroid-based algorithm chooses K prototype points (the centroids) and assigns every data point to one of them so that points in the same cluster are as close to their centroid as possible. The standard objective for k-means is the within-cluster sum of squares (WCSS), also called the inertia. For a partition into clusters with centroids m_1, ..., m_K, the WCSS is the sum over all clusters of the squared Euclidean distances between each point and its centroid.
Minimising WCSS exactly is NP-hard, even in the plane and even for only two clusters (Mahajan, Nimbhorkar, and Varadarajan, 2009). In practice, algorithms use heuristic iterative procedures that converge to a local minimum. The most widely used is Lloyd's algorithm.
Once centroids are fixed, assigning every point in feature space to its nearest centroid defines a Voronoi tessellation: the space is partitioned into K convex regions, one per centroid. When each region's centroid is also the mean of the points it covers, the result is a centroidal Voronoi tessellation. Lloyd's algorithm can be viewed as iteratively building such a tessellation by recomputing the Voronoi regions and moving each centroid to the mean of its region.
This explains why standard k-means produces clusters with straight, flat boundaries. The decision surface between any two clusters is a hyperplane bisecting the line between their centroids, so clusters are always convex, roughly globe-shaped regions. Datasets with curved or interlocking shapes are usually a poor fit.
Stuart P. Lloyd of Bell Labs proposed the standard procedure in a 1957 internal memorandum on pulse-code modulation. The work circulated informally and was not published until 1982 in IEEE Transactions on Information Theory. Joel Max independently published a similar quantisation procedure in 1960 (hence "Lloyd-Max"), and Edward W. Forgy presented the same clustering scheme in 1965 ("Lloyd-Forgy"). The term "k-means" was coined by James MacQueen in 1967, though the basic idea traces back to Hugo Steinhaus in 1956.
Lloyd's algorithm for k-means proceeds as follows:
Each iteration runs in time proportional to n times K times d, where d is the number of features. The WCSS decreases monotonically with each step, so the algorithm always converges in finite time to a stationary point that is only guaranteed to be a local minimum. Most implementations run the algorithm several times with different initial centroids and keep the run with the lowest WCSS.
K-means assumes clusters are convex, roughly equal in size and variance, and shaped like spheres in the feature space. When these assumptions hold, the algorithm is fast and produces interpretable clusters; when they do not, k-means can give visibly wrong results, for example splitting one elongated cluster or merging two adjacent clusters of unequal size. Worst-case running time can be superpolynomial in n, but on realistic data the number of iterations is usually well under fifty, with most of the improvement happening in the first handful.
Plain random initialisation can produce clusterings arbitrarily worse than the optimal partition. In 2007, David Arthur and Sergei Vassilvitskii proposed k-means++, a seeding scheme that spreads the initial centroids out probabilistically:
Arthur and Vassilvitskii showed that k-means++ gives an expected approximation ratio of O(log K) relative to the optimal WCSS. Their experiments reported roughly two-fold improvements in running time and as much as a 1000-fold reduction in error on certain datasets. Scikit-learn uses k-means++ as the default initialiser. A scalable variant called k-means|| (Bahmani et al., 2012) keeps the theoretical guarantees while reducing the number of sequential passes through the data.
The k-medoids algorithm replaces the cluster mean with a medoid, an actual data point that minimises the sum of dissimilarities to other points in its cluster. Leonard Kaufman and Peter Rousseeuw introduced Partitioning Around Medoids (PAM) in 1987. Because medoids are drawn from the data and dissimilarities can be arbitrary, k-medoids works with any distance function, including non-Euclidean ones such as Manhattan distance, cosine dissimilarity, or domain-specific measures for strings and graphs.
PAM has two phases. The BUILD phase greedily selects K data points as initial medoids to minimise total dissimilarity from each non-medoid to its nearest medoid. The SWAP phase then iteratively swaps each medoid with each non-medoid and performs the swap that reduces cost the most, until no swap improves the objective.
PAM has a runtime of O(K times (n minus K) squared) per iteration, which limits it to small datasets. CLARA runs PAM on multiple random samples and keeps the best clustering; CLARANS restricts the SWAP phase to a random subset of candidate swaps. Schubert and Rousseeuw's FastPAM and FasterPAM (2019-2021) reduced the per-iteration cost to roughly O(n squared), and BanditPAM (Tiwari et al., 2020) used multi-armed bandit techniques to focus computation on promising swaps.
K-medoids is more robust to outliers than k-means: a single far-away point cannot pull a medoid the way it pulls a mean. The trade-off is higher computational cost and, on clean data, a slightly worse WCSS.
A number of related algorithms keep the centroid-based skeleton but change the prototype, the distance, or the optimisation strategy.
| Algorithm | Prototype | Notes |
|---|---|---|
| K-means | Arithmetic mean | Minimises squared Euclidean distance; default for numeric data |
| K-medoids (PAM) | Actual data point (medoid) | Works with arbitrary dissimilarities; more robust to outliers |
| K-medians | Coordinate-wise median | Optimises sum of L1 distances; less affected by extreme values |
| K-modes | Mode of each attribute | Designed for categorical data; uses simple matching dissimilarity |
| K-prototypes | Mixed mean and mode | Handles datasets with both numeric and categorical features |
| Fuzzy c-means | Weighted mean | Each point has membership weights in all clusters |
| Mini-batch k-means | Running average mean | Centroids updated from small random batches; scales to huge data |
| Bisecting k-means | Arithmetic mean | Repeatedly splits one cluster at a time using 2-means |
| Spherical k-means | Unit-norm mean | Uses cosine similarity; popular for text and embedding data |
| Hartigan-Wong | Arithmetic mean | Local search variant that often finds better minima than Lloyd |
Fuzzy c-means, introduced by J. C. Dunn in 1973 and refined by James C. Bezdek in 1981, assigns each point a degree of membership in every cluster rather than a hard assignment. A fuzziness parameter m (typically 2) controls how soft the assignments are.
Mini-batch k-means (Sculley, WWW 2010) draws a small random sample of points at each iteration and uses a running average to update centroids. It loses a little accuracy but reduces computation by orders of magnitude, making it practical for clustering millions of points and for streaming settings.
Bisecting k-means (Steinbach, Karypis, and Kumar, 2000) starts with all data in one cluster and repeatedly applies 2-means to split the cluster with the highest variance until K leaves are produced. It yields a hierarchy as a side effect and is often more stable than flat k-means.
All centroid-based methods require choosing K in advance. Several heuristics help, but none is foolproof.
The elbow method plots WCSS against K and looks for the point where the curve bends from steep to shallow. Beyond the elbow, adding clusters yields little extra reduction in within-cluster variance. The method works when clusters are clearly separated and fails when the curve is smooth.
The silhouette score, introduced by Peter Rousseeuw in 1987, measures how tightly each point fits inside its assigned cluster relative to its distance from the next nearest cluster. Scores range from minus one to one; values close to one indicate good separation. The average silhouette across all points, computed for each candidate K, gives a single number to maximise. Unlike the elbow method, it considers both within-cluster compactness and between-cluster separation.
The gap statistic, proposed by Tibshirani, Walther, and Hastie in 2001, compares the observed WCSS at each K with the expected WCSS under a null reference distribution generated by uniform sampling. The chosen K is the smallest value where the gap statistic is within one standard error of the gap at the next value.
Analysts often compute all three. Other approaches include the Calinski-Harabasz and Davies-Bouldin indexes and the BIC-based X-means algorithm (Pelleg and Moore, 2000).
Centroid-based clustering is typically taught alongside hierarchical clustering and density-based clustering.
Hierarchical methods such as agglomerative clustering build a tree of nested partitions. They do not require K up front; the user picks a cut height after the fact. They scale poorly (typically O(n squared) or worse) but offer a richer view of the data and can produce non-spherical clusters depending on the linkage criterion.
Density-based methods such as DBSCAN (Ester, Kriegel, Sander, and Xu, 1996) and OPTICS define clusters as regions of high density separated by low density. They find clusters of arbitrary shape, treat low-density points as noise, and do not require K. The trade-off is two parameters (a neighbourhood radius and a minimum point count) and difficulty when clusters have very different densities.
Centroid-based methods sit between these in scalability and shape flexibility. They are usually the fastest on large numeric datasets and the easiest to explain, but impose strong assumptions about cluster geometry. Gaussian mixture models are closely related, generalising k-means to allow elliptical clusters and probabilistic assignments via the EM algorithm.
Feature scaling matters. Because k-means uses Euclidean distance, features with larger numeric ranges dominate the objective, so standardising or normalising before clustering is standard practice.
High-dimensional data brings the curse of dimensionality: as d grows, Euclidean distances become more uniform and the contrast between near and far neighbours fades. A common remedy is to reduce dimensionality first using principal component analysis, t-SNE, UMAP, or autoencoders. Outliers can pull a cluster mean off its true centre; k-medoids, k-medians, and robust pre-processing reduce this effect.
Multiple restarts are essentially mandatory. Scikit-learn's KMeans defaults to several runs and keeps the partition with the lowest inertia. R has kmeans plus the cluster package for PAM, CLARA, and silhouettes. Apache Spark MLlib includes KMeans, BisectingKMeans, and Gaussian mixture models. ELKI offers many k-means variants for research benchmarks.
Centroid-based clustering appears across many fields: customer segmentation, document clustering using bag-of-words or embedding vectors, colour quantisation and image segmentation, vector quantisation in speech coding and deep learning systems such as VQ-VAE, anomaly detection (points far from any centroid are flagged), feature learning, bioinformatics (gene expression, single-cell analysis), astronomy (stellar populations from spectroscopic features), and recommender systems where embeddings are pre-clustered for fast retrieval.
Imagine a playground full of kids and you want to split them into a few groups. You pick spots on the ground as "meeting spots" and ask each kid to run to the nearest one. Then you look at each group, find the average position of the kids, and move the meeting spot to that average. Some kids might now be closer to a different spot, so you let them switch. Keep moving spots and letting kids switch until nothing changes. That is centroid-based clustering: the meeting spots are the centroids, and each group is one cluster.