See also: Machine learning terms
A centroid is the geometric center of a set of points, computed as the arithmetic mean of their coordinates. In machine learning, centroids are a building block of many clustering algorithms and prototype-based methods. They serve as a single representative point that summarizes a group of similar observations. The most familiar use of centroids is in the k-means algorithm, where each cluster is described by the mean position of all data points assigned to it. Centroids also appear in nearest-prototype classifiers, vector quantization codebooks, anchor-free object detectors, and the index structures behind modern vector search engines.
Given a set of n observations x_1, x_2, ..., x_n in a d-dimensional space, the centroid c is the arithmetic mean of the observations:
c = (1/n) * (x_1 + x_2 + ... + x_n)
In coordinate form, each component of the centroid is the average of that coordinate across all points. For points (x_i, y_i) in two dimensions, the centroid is ((x_1 + x_2 + ... + x_n)/n, (y_1 + y_2 + ... + y_n)/n).
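As a concrete illustration, the centroid of a small point set is a one-line computation in NumPy (a minimal sketch; the `points` array is an arbitrary example):

```python
import numpy as np

# Five points in 2-D; each row is one observation.
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0],
                   [2.0, 2.0],
                   [4.0, 2.0]])

# The centroid is the per-coordinate mean of the rows.
centroid = points.mean(axis=0)
print(centroid)  # [3. 2.]
```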
The centroid is sometimes called the geometric center, the center of mass (when uniform mass density is assumed), or the barycenter. It does not need to be an actual data point in the set: it always lies within the convex hull of the points, but for a non-convex shape it can fall outside the shape itself (the centroid of a crescent, for example, sits in the empty gap).
The centroid minimizes the sum of squared Euclidean distances from itself to every point in the set. Formally, the centroid solves:
c = argmin_p Sum_{i=1..n} ||x_i - p||^2
This property follows from setting the gradient of the squared-distance sum to zero and solving for p. It is the reason k-means uses the mean during its update step: given a fixed assignment of points to clusters, the mean is the cluster center that minimizes within-cluster variance.
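A quick numerical check of this property (a sketch on random data; perturbing the mean in any direction can only increase the sum of squared distances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 points in 3-D
centroid = X.mean(axis=0)

def sum_sq_dist(p):
    # Sum of squared Euclidean distances from p to every point in X.
    return np.sum(np.linalg.norm(X - p, axis=1) ** 2)

perturbed = centroid + 0.1             # shift the candidate center slightly
assert sum_sq_dist(centroid) < sum_sq_dist(perturbed)
```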
Replacing squared Euclidean distance with other metrics changes the optimal representative. The point that minimizes the sum of Manhattan (L1) distances is the per-feature median, and the point that minimizes the sum of unsquared Euclidean distances is the geometric median; neither coincides with the mean in general. The point that minimizes the maximum distance is the Chebyshev center. These distinctions motivate alternative cluster representatives such as medoids and modes, discussed below.
Given k centroids in a d-dimensional space, the Voronoi diagram partitions the space into k regions. Each region contains all points that are closer to one specific centroid than to any other. The boundaries between regions are perpendicular bisectors of the line segments connecting neighboring centroids. In k-means, the assignment step is exactly this: every data point joins the cluster whose centroid lies closest, so the final clusters are the intersections of the dataset with the Voronoi cells of the final centroids. This geometric perspective explains why k-means produces convex cluster boundaries and struggles with non-convex shapes.
When the centroids are themselves the centers of mass of their Voronoi cells, the configuration is called a centroidal Voronoi tessellation (CVT). Lloyd's iteration is a fixed-point procedure for finding such tessellations.
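The assignment step is a single vectorized nearest-centroid lookup. The sketch below (with hypothetical arrays `X` and `centroids`) computes all pairwise squared distances by broadcasting and takes the argmin per point, which is exactly the Voronoi-cell membership described above:

```python
import numpy as np

def assign_to_centroids(X, centroids):
    """Return, for each row of X, the index of the nearest centroid (squared L2)."""
    # (n, k) matrix of squared distances via broadcasting.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
centroids = rng.normal(size=(4, 2))
labels = assign_to_centroids(X, centroids)   # each point joins its Voronoi cell
```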
The k-means algorithm is the canonical use case for centroids in machine learning. Stuart Lloyd developed the iterative procedure in 1957 as a method for least-squares quantization in pulse-code modulation at Bell Labs, but his memo was not published externally until 1982. James MacQueen independently published the algorithm in 1967 and gave it the name "k-means." Edward Forgy described a similar method in 1965, and Hugo Steinhaus had outlined the underlying partitioning idea as early as 1956.
Given a dataset and a chosen number of clusters k, k-means proceeds as follows:
1. Initialize k centroids, for example by sampling k data points at random or with k-means++.
2. Assignment step: assign every data point to the cluster of its nearest centroid.
3. Update step: recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing or a maximum number of iterations is reached.
The within-cluster sum of squares (WCSS), also called inertia, decreases monotonically with each iteration because both steps are coordinate descent moves on the same objective. The algorithm is guaranteed to terminate at a local minimum in a finite number of steps because there are only finitely many possible point-to-cluster assignments. Finding the global optimum is NP-hard (Aloise et al., 2009), so practitioners typically run the algorithm many times with different initializations and keep the best result.
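A compact from-scratch sketch of the two-step loop (illustrative only: plain random initialization rather than k-means++, and an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid per point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster empties).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```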
The quality of the final clustering depends heavily on initialization. k-means++, proposed by David Arthur and Sergei Vassilvitskii in 2007, replaces uniform random initialization with a probabilistic seeding scheme that spreads initial centroids across the data. The first centroid is chosen uniformly at random. Each subsequent centroid is sampled from the data points with probability proportional to D(x)^2, where D(x) is the distance from x to the closest already-chosen centroid. The authors proved that this initialization yields an expected WCSS within a factor of O(log k) of the optimum, which was the first such guarantee for any practical k-means seeding strategy. k-means++ is the default initialization in scikit-learn and most modern libraries.
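The seeding scheme itself is only a few lines. The sketch below implements the D(x)^2 sampling described above (illustrative; it omits the repeated local trials that library implementations add):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: uniform over the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its closest chosen centroid.
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()                      # sample proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```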
For very large datasets, recomputing centroids over the full dataset on every iteration is expensive. Mini-batch k-means, introduced by D. Sculley in his 2010 WWW paper "Web-scale k-means clustering," performs each update on a small random sample (often 100 to 10,000 points). Centroids are updated using a per-cluster running average so older mini-batches still influence the centroid position but with diminishing weight. The result is convergence that is orders of magnitude faster than the batch algorithm, with only a small loss in cluster quality. Mini-batch k-means powers many production clustering pipelines that need to handle millions or billions of points.
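The per-cluster running average amounts to a per-center learning rate of 1/count. A simplified sketch of one mini-batch update in that spirit (not the exact library implementation):

```python
import numpy as np

def minibatch_update(centroids, counts, batch, labels):
    """Apply one mini-batch update: each centroid moves toward its assigned points
    with a step size that shrinks as the centroid accumulates more points."""
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]                               # per-center learning rate
        centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids, counts
```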
| Variant | Centroid type | Distance metric | Notes |
|---|---|---|---|
| k-means (Lloyd) | Arithmetic mean | Euclidean (L2) | Standard formulation, fastest |
| k-medians | Per-feature median | Manhattan (L1) | More robust to outliers |
| k-medoids (PAM) | Actual data point | Any dissimilarity | Robust, interpretable, slower |
| k-modes | Per-feature mode | Matching dissimilarity | Designed for categorical data |
| k-prototypes | Mixed mean and mode | Mixed Euclidean and matching | Mixed numerical and categorical features |
| Spherical k-means | Mean projected to unit sphere | Cosine similarity | Common for text and embeddings |
| Fuzzy c-means | Weighted mean of soft assignments | Euclidean | Soft cluster membership |
| Bisecting k-means | Mean | Euclidean | Recursively splits using k=2 |
The choice of distance metric determines which point is the optimal centroid and which clusters the algorithm prefers. The standard choice is Euclidean distance, which pairs naturally with the mean. Other metrics produce different geometry and lead to different cluster shapes.
| Metric | Formula | Optimal representative | Common use |
|---|---|---|---|
| Euclidean (L2) | sqrt(Sum (x_i - y_i)^2) | Arithmetic mean | Continuous numeric features |
| Manhattan (L1) | Sum abs(x_i - y_i) | Per-feature median | Robust to outliers; k-medians |
| Chebyshev (L-infinity) | max abs(x_i - y_i) | Per-feature midrange (minimax objective) | Worst-case per-feature differences |
| Cosine distance | 1 - (x . y) / (norm(x) * norm(y)) | Mean projected to the unit sphere | Text and embeddings (spherical k-means) |
| Mahalanobis | sqrt((x - y)^T S^{-1} (x - y)) | Sample mean (with covariance correction) | Correlated features, anomaly detection |
| Hamming | Number of differing positions | Per-feature mode | Categorical or binary data |
When the metric is not Euclidean, using the arithmetic mean as the centroid no longer minimizes the chosen objective. Practitioners either swap the centroid for the appropriate representative (median, medoid, mode) or accept the approximation introduced by keeping the mean.
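A small check of that mean-versus-median distinction (a sketch; with an outlier present, the per-feature median achieves a lower total Manhattan distance than the mean):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [50.0, 0.0]])  # one outlier
mean, median = X.mean(axis=0), np.median(X, axis=0)

def total_l1(p):
    # Total Manhattan distance from p to every point.
    return np.abs(X - p).sum()

print(total_l1(mean) > total_l1(median))  # True: the median is the better L1 representative
```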
Centroid-based methods assume that a single point can adequately summarize a cluster. This works well for compact, roughly spherical clusters but fails for elongated, non-convex, or density-based structures. Several alternative paradigms relax this assumption.
| Method | Cluster representation | Strengths | Weaknesses |
|---|---|---|---|
| k-means | One centroid per cluster | Fast, simple, scales to millions of points | Requires k, sensitive to initialization, assumes spherical clusters |
| k-medoids | Actual data point per cluster | Robust to outliers, works with any dissimilarity | O(k(n-k)^2) per iteration, slow on large data |
| DBSCAN | Set of density-connected points | Finds arbitrary shapes, labels noise, no k needed | Sensitive to eps and min_samples, struggles with varying densities |
| OPTICS | Reachability plot of densities | Handles varying densities better than DBSCAN | Plot interpretation is non-trivial |
| Hierarchical clustering | Tree of nested merges or splits | Reveals hierarchy, no k needed up front | O(n^2) memory, O(n^3) time for naive agglomerative |
| Spectral clustering | Eigenvectors of an affinity graph | Captures non-convex clusters | O(n^2) or O(n^3) depending on solver |
| Gaussian mixture models | Mean and covariance per component | Soft assignments, elliptical clusters | EM is slow, requires k, can collapse |
| Affinity propagation | Self-selected exemplar points | No k needed, exemplar-based | O(n^2) memory and per iteration |
A pragmatic decision rule: try k-means first, and switch to DBSCAN or a Gaussian mixture if the clusters are visibly non-convex, of mixed densities, or noticeably overlapping.
Once a centroid-based algorithm produces clusters, several internal metrics quantify how good those clusters are.
WCSS is the total squared distance from each point to its assigned centroid. K-means minimizes this quantity directly. WCSS always decreases as k increases, so it is most useful for comparing different runs at the same k or for the elbow plot.
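In scikit-learn, WCSS is exposed as the `inertia_` attribute, so an elbow plot is a short loop (a sketch assuming `X_scaled` is a standardized feature matrix):

```python
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)   # WCSS for this k; plot against k and look for the bend
```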
The silhouette score, introduced by Peter Rousseeuw in 1987, measures how well each point fits its cluster compared to its second-best alternative. For each point i, let a(i) be the mean distance to other points in its own cluster and b(i) the mean distance to points in the nearest neighboring cluster. The silhouette coefficient is
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Values near +1 indicate the point is well matched to its cluster. Values near 0 mean the point is on a boundary. Negative values suggest a likely misassignment. The mean silhouette across the dataset is a single-number summary that can be compared across different values of k (Rousseeuw, 1987).
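scikit-learn computes the mean silhouette directly (a sketch; `X_scaled` and `labels` are assumed to come from a previously fitted clustering):

```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X_scaled, labels)   # mean s(i) over all points, in [-1, 1]
```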
The Davies-Bouldin index (Davies and Bouldin, 1979) compares the average within-cluster scatter to the between-cluster separation, using centroids to define both. Lower values indicate tighter, better-separated clusters. The index is fast to compute and is provided by scikit-learn as davies_bouldin_score.
The Calinski-Harabasz index, also called the variance ratio criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, both computed from centroids. Higher values indicate better-defined clusters.
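Both indices are one-liners in scikit-learn (a sketch with the same assumed `X_scaled` and `labels`):

```python
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

db = davies_bouldin_score(X_scaled, labels)       # lower is better
ch = calinski_harabasz_score(X_scaled, labels)    # higher is better
```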
Centroid-based methods require k to be specified in advance. Several techniques help estimate a reasonable value.
| Technique | Idea | Caveats |
|---|---|---|
| Elbow method | Plot WCSS vs k and look for the bend | Subjective; the elbow is often unclear |
| Silhouette analysis | Choose k that maximizes mean silhouette | Computationally expensive on large data |
| Gap statistic | Compare WCSS to that under a uniform null distribution (Tibshirani, Walther, Hastie, 2001) | Slow, requires many simulations |
| BIC / AIC | Penalize model complexity (typically with Gaussian mixture interpretation) | Assumes a probabilistic model |
| X-means | Start small, split clusters and accept splits that improve BIC (Pelleg and Moore, 2000) | Approximation, but automatic |
| G-means | Like X-means, but uses Anderson-Darling Gaussianity test | Sensitive to non-Gaussian clusters |
| Domain knowledge | Use prior structure (number of customer tiers, products, segments) | Often the most useful in practice |
No single technique is universally reliable. Experienced practitioners triangulate using two or three methods and validate against domain understanding.
The simplicity that makes centroid-based methods popular also creates well-known failure modes.
Centroid methods assume spherical, similarly sized clusters. When real clusters are elongated, ring-shaped, or of widely different densities, the algorithm draws boundaries through dense regions and assigns points incorrectly. DBSCAN and spectral clustering handle these cases better.
They are sensitive to outliers. A single distant point pulls its assigned centroid toward itself, distorting the cluster center. K-medoids, k-medians, or robust pre-filtering helps.
They require k in advance. Choosing the wrong k produces either over-merged or fragmented clusters. Validation indices and domain knowledge help, but no automatic method is perfect.
They are sensitive to feature scaling. Because Euclidean distance treats every dimension equally, a feature with a large range dominates the distance computation. Standardization or normalization is almost always required before clustering.
They converge to local minima. Different random initializations produce different solutions. Running k-means many times (the n_init parameter in scikit-learn) and keeping the best WCSS is the standard mitigation. Using k-means++ also improves consistency.
They assume a meaningful Euclidean geometry. For categorical, ordinal, or graph-structured data, the very notion of a centroid is questionable, and methods like k-modes, k-prototypes, or graph clustering are more appropriate.
Centroids show up in several other corners of machine learning beyond clustering proper.
The nearest centroid classifier represents each class by its centroid in feature space, then classifies a new observation by assigning it to the class of the nearest centroid. The same idea, applied to text vectors weighted by TF-IDF, is known as the Rocchio classifier after the 1971 work by Joseph Rocchio for relevance feedback in the SMART information retrieval system. Despite its simplicity, the nearest centroid classifier remains a strong baseline for high-dimensional text classification and is provided by scikit-learn as NearestCentroid. Rocchio's original formulation also added a negative term that pushes the query away from the centroid of non-relevant documents, an idea that still influences modern relevance feedback.
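A Rocchio-style text classifier is a two-step pipeline in scikit-learn (a sketch; `docs_train`, `y_train`, and `docs_test` are hypothetical placeholders for raw documents and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

# TF-IDF vectors plus one centroid per class, as in the Rocchio classifier.
clf = make_pipeline(TfidfVectorizer(), NearestCentroid())
clf.fit(docs_train, y_train)      # docs_train: list of strings, y_train: class labels
preds = clf.predict(docs_test)
```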
Vector quantization (VQ), developed by Robert Gray and others starting in the late 1970s, compresses signals by mapping each input vector to the nearest entry in a finite codebook. Each codebook entry is a centroid. The Linde-Buzo-Gray (LBG) algorithm, a generalization of Lloyd's algorithm, builds the codebook by iteratively splitting and re-clustering until distortion is minimized. VQ is the foundation of speech codecs, image compression schemes, and modern product quantization for fast nearest-neighbor search.
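A vector-quantization codebook can be built with ordinary k-means and then used to encode each vector as a centroid index (a sketch on random data; the LBG splitting schedule is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
signal = rng.normal(size=(10_000, 8))               # vectors to compress

codebook = KMeans(n_clusters=256, n_init=3, random_state=0).fit(signal)
codes = codebook.predict(signal)                    # each vector -> index of nearest codeword
reconstructed = codebook.cluster_centers_[codes]    # lossy reconstruction from the codebook
```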
Large-scale vector retrieval libraries such as FAISS, ScaNN, and Milvus use centroids in the IVF (inverted file) index. During training, the library runs k-means to learn a coarse codebook of, for example, 4,096 centroids. Each database vector is assigned to its nearest centroid and stored in that centroid's posting list. At query time, the system computes distances between the query and all centroids, then probes only the few nearest posting lists. Combined with product quantization for compact encoding, IVF brings billion-vector search to commodity hardware (Douze et al., 2024).
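With FAISS, the IVF structure described above takes a few lines (a sketch; the dataset is random and the sizes are illustrative rather than production settings):

```python
import numpy as np
import faiss

d, nlist = 64, 4096                        # vector dimension and number of coarse centroids
xb = np.random.default_rng(0).random((500_000, d), dtype=np.float32)

quantizer = faiss.IndexFlatL2(d)           # assigns vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # k-means learns the coarse codebook
index.add(xb)                              # each vector lands in its centroid's posting list

index.nprobe = 8                           # probe only the 8 nearest posting lists at query time
distances, ids = index.search(xb[:5], 10)  # approximate 10-nearest-neighbor search
```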
Mixture-of-experts (MoE) layers in large transformer models route each token to a small number of experts. Recent work explores associating each expert with a learned centroid in embedding space and routing tokens by cosine similarity to the centroid. This geometric routing reduces the number of routing parameters and makes expert specialization directly inspectable from the centroid matrix.
In computer vision, CenterNet (Zhou et al., 2019, Objects as Points) treats each object as a single keypoint at the center of its bounding box and regresses size, depth, and orientation from features at that keypoint. The center of the bounding box is the centroid of the object's annotation, and the design eliminates the anchor boxes used by earlier detectors such as Faster R-CNN. Pose estimation pipelines often use centroid heatmaps for similar reasons.
Learning vector quantization (LVQ), self-organizing maps (SOM), radial basis function networks, and many few-shot classifiers all rely on prototype vectors that behave similarly to centroids. Prototypical networks (Snell et al., 2017) compute a class centroid in embedding space from a few support examples and classify queries by nearest centroid, achieving strong few-shot performance.
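The prototypical-network classification rule reduces to a nearest-centroid lookup once embeddings are available (a sketch with random stand-in embeddings; in practice these come from a trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_support, dim = 5, 5, 64
support = rng.normal(size=(n_classes, n_support, dim))   # (classes, shots, embedding dim)
queries = rng.normal(size=(20, dim))

prototypes = support.mean(axis=1)                        # one centroid ("prototype") per class
d2 = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
pred_classes = d2.argmin(axis=1)                         # nearest-prototype classification
```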
scikit-learn provides several centroid-based classes that follow the same fit and predict interface.
```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import StandardScaler

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features);
# X_train_scaled, y_train, and X_test_scaled are assumed to be defined analogously.
X_scaled = StandardScaler().fit_transform(X)

# Standard k-means with k-means++ initialization
kmeans = KMeans(n_clusters=8, init="k-means++", n_init=10, random_state=0)
kmeans.fit(X_scaled)
centroids = kmeans.cluster_centers_  # shape: (8, n_features)
labels = kmeans.labels_

# Mini-batch variant for large datasets
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
mbk.fit(X_scaled)

# Nearest centroid classifier
ncc = NearestCentroid()
ncc.fit(X_train_scaled, y_train)
preds = ncc.predict(X_test_scaled)
```
Key scikit-learn parameters worth understanding when working with centroids:
| Parameter | Class | Default | Purpose |
|---|---|---|---|
| n_clusters | KMeans, MiniBatchKMeans | 8 | Number of centroids to fit |
| init | KMeans | 'k-means++' | Initialization scheme |
| n_init | KMeans | 10 | Number of restarts; best run is kept |
| algorithm | KMeans | 'lloyd' | Use 'elkan' for many clusters with low dimensionality |
| batch_size | MiniBatchKMeans | 1024 | Points used per update |
| reassignment_ratio | MiniBatchKMeans | 0.01 | Controls when low-count clusters are reseeded |
| metric | NearestCentroid | 'euclidean' | Switch to 'manhattan' for L1 prototypes |
| shrink_threshold | NearestCentroid | None | Shrunken centroids for high-dimensional data |
A few rules of thumb make centroid-based methods more reliable in practice:
- Standardize features before clustering unless they are already on the same scale; otherwise the metric is dominated by whichever feature has the largest variance.
- Run multiple restarts. The default n_init=10 in scikit-learn is a sensible starting point; for difficult datasets, more restarts help.
- Seed deterministically when reproducibility matters. Setting random_state ensures the same centroids appear across runs.
- Validate cluster quality with at least one internal metric, ideally combined with a visualization (a PCA or t-SNE projection of the data with cluster colors and centroid markers).
- Reassess whether the data really has a centroid structure. If silhouette scores are low across all k and visualizations show overlapping or non-convex clusters, switch to DBSCAN, a Gaussian mixture, or hierarchical clustering rather than forcing more centroids onto the problem.
- When feature dimensionality is very high, consider dimensionality reduction (PCA, UMAP) or shrunken centroids before applying nearest-centroid classification or k-means.
| Year | Researcher | Contribution |
|---|---|---|
| 1956 | Hugo Steinhaus | Early formulation of partitioning by minimizing within-group variance |
| 1957 | Stuart Lloyd | Iterative centroid algorithm at Bell Labs (memo unpublished until 1982) |
| 1965 | Edward Forgy | Independent description of an equivalent algorithm |
| 1967 | James MacQueen | Coined the term "k-means" |
| 1971 | Joseph Rocchio | Centroid-based relevance feedback for the SMART IR system |
| 1973 | J. C. Dunn | Introduced fuzzy c-means with soft centroid assignments |
| 1979 | Davies and Bouldin | Centroid-based cluster separation index |
| 1980 | Linde, Buzo, Gray | LBG vector quantization codebook design |
| 1987 | Kaufman and Rousseeuw | Partitioning Around Medoids (PAM) and the silhouette score |
| 1998 | Zhexue Huang | k-modes for categorical data |
| 2000 | Pelleg and Moore | X-means for automatic k selection via BIC |
| 2007 | Arthur and Vassilvitskii | k-means++ with provable O(log k) guarantee |
| 2010 | D. Sculley | Mini-batch k-means for web-scale data |
| 2012 | Bahmani et al. | Scalable k-means++ (k-means&#124;&#124;) for distributed initialization |
| 2017 | Snell et al. | Prototypical networks for few-shot learning |
| 2019 | Zhou et al. | CenterNet treats each object as a single centroid |
Imagine you have a bunch of marbles scattered on the floor and you want to find the "middle" of the pile. You take the average of all the marble positions (the average left-right position and the average front-back position) and you get a point that sits right in the middle of the pile. That middle point is the centroid.
Now imagine the marbles come in a few different colors and you want to find the middle of each color group. The k-means algorithm picks a few starting middle points (centroids), then asks every marble "which middle is closest to me?" and groups them. Then it slides each middle to the actual center of its group of marbles, and asks again. After a few rounds, the middles stop moving and you have your color groups.
The centroid is just the middle. K-means is the game of moving the middles around until they sit exactly where they should.