See also: Machine learning terms
A centroid is the geometric center of a set of points, computed as the arithmetic mean of their coordinates. In machine learning, centroids are a building block of many clustering algorithms and prototype-based methods. They serve as a single representative point that summarizes a group of similar observations. The most familiar use of centroids is in the k-means algorithm, where each cluster is described by the mean position of all data points assigned to it. Centroids also appear in nearest-prototype classifiers, vector quantization codebooks, anchor-free object detectors, and the index structures behind modern vector search engines.
Given a set of n observations x_1, x_2, ..., x_n in a d-dimensional space, the centroid c is the arithmetic mean of the observations:
c = (1/n) * (x_1 + x_2 + ... + x_n)
In coordinate form, each component of the centroid is the average of that coordinate across all points. For points (x_i, y_i) in two dimensions, the centroid is ((x_1 + x_2 + ... + x_n)/n, (y_1 + y_2 + ... + y_n)/n).
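As a concrete illustration, the centroid of a small point set is a one-line computation in NumPy (a minimal sketch; the `points` array is an arbitrary example):

```python
import numpy as np

# Five points in 2-D; each row is one observation.
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0],
                   [2.0, 2.0],
                   [4.0, 2.0]])

# The centroid is the per-coordinate mean of the rows.
centroid = points.mean(axis=0)
print(centroid)  # [3. 2.]
```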
The centroid is sometimes called the geometric center, the center of mass (when uniform mass density is assumed), or the barycenter. It does not need to be an actual data point in the set: it always lies within the convex hull of the points, but for a non-convex shape it can fall outside the shape itself (the centroid of a crescent, for example, sits in the empty gap).
The centroid minimizes the sum of squared Euclidean distances from itself to every point in the set. Formally, the centroid solves:
c = argmin_p Sum_{i=1..n} ||x_i - p||^2
This property follows from setting the gradient of the squared-distance sum to zero and solving for p. It is the reason k-means uses the mean during its update step: given a fixed assignment of points to clusters, the mean is the cluster center that minimizes within-cluster variance.
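A quick numerical check of this property (a sketch on random data; perturbing the mean in any direction can only increase the sum of squared distances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 points in 3-D
centroid = X.mean(axis=0)

def sum_sq_dist(p):
    # Sum of squared Euclidean distances from p to every point in X.
    return np.sum(np.linalg.norm(X - p, axis=1) ** 2)

perturbed = centroid + 0.1             # shift the candidate center slightly
assert sum_sq_dist(centroid) < sum_sq_dist(perturbed)
```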
Replacing squared Euclidean distance with other metrics changes the optimal representative. The point that minimizes the sum of Manhattan (L1) distances is the per-feature median, and the point that minimizes the sum of unsquared Euclidean distances is the geometric median; neither coincides with the mean in general. The point that minimizes the maximum distance is the Chebyshev center. These distinctions motivate alternative cluster representatives such as medoids and modes, discussed below.
Given k centroids in a d-dimensional space, the Voronoi diagram partitions the space into k regions. Each region contains all points that are closer to one specific centroid than to any other. The boundaries between regions are perpendicular bisectors of the line segments connecting neighboring centroids. In k-means, the assignment step is exactly this: every data point joins the cluster whose centroid lies closest, so the final clusters are the intersections of the dataset with the Voronoi cells of the final centroids. This geometric perspective explains why k-means produces convex cluster boundaries and struggles with non-convex shapes.
When the centroids are themselves the centers of mass of their Voronoi cells, the configuration is called a centroidal Voronoi tessellation (CVT). Lloyd's iteration is a fixed-point procedure for finding such tessellations.
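The assignment step is a single vectorized nearest-centroid lookup. The sketch below (with hypothetical arrays `X` and `centroids`) computes all pairwise squared distances by broadcasting and takes the argmin per point, which is exactly the Voronoi-cell membership described above:

```python
import numpy as np

def assign_to_centroids(X, centroids):
    """Return, for each row of X, the index of the nearest centroid (squared L2)."""
    # (n, k) matrix of squared distances via broadcasting.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
centroids = rng.normal(size=(4, 2))
labels = assign_to_centroids(X, centroids)   # each point joins its Voronoi cell
```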
The k-means algorithm is the canonical use case for centroids in machine learning. Stuart Lloyd developed the iterative procedure in 1957 as a method for least-squares quantization in pulse-code modulation at Bell Labs, but his memo was not published externally until 1982. James MacQueen independently published the algorithm in 1967 and gave it the name "k-means." Edward Forgy described a similar method in 1965, and Hugo Steinhaus had outlined the underlying partitioning idea as early as 1956.
Given a dataset and a chosen number of clusters k, k-means proceeds as follows:
1. Initialize k centroids, for example by sampling k data points at random or with k-means++.
2. Assignment step: assign every data point to the cluster of its nearest centroid.
3. Update step: recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing or a maximum number of iterations is reached.
The within-cluster sum of squares (WCSS), also called inertia, decreases monotonically with each iteration because both steps are coordinate descent moves on the same objective. The algorithm is guaranteed to terminate at a local minimum in a finite number of steps because there are only finitely many possible point-to-cluster assignments. Finding the global optimum is NP-hard (Aloise et al., 2009), so practitioners typically run the algorithm many times with different initializations and keep the best result.
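A compact from-scratch sketch of the two-step loop (illustrative only: plain random initialization rather than k-means++, and an empty cluster simply keeps its previous centroid):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid per point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster empties).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```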
The quality of the final clustering depends heavily on initialization. k-means++, proposed by David Arthur and Sergei Vassilvitskii in 2007, replaces uniform random initialization with a probabilistic seeding scheme that spreads initial centroids across the data. The first centroid is chosen uniformly at random. Each subsequent centroid is sampled from the data points with probability proportional to D(x)^2, where D(x) is the distance from x to the closest already-chosen centroid. The authors proved that this initialization yields an expected WCSS within a factor of O(log k) of the optimum, which was the first such guarantee for any practical k-means seeding strategy. k-means++ is the default initialization in scikit-learn and most modern libraries.
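The seeding scheme itself is only a few lines. The sketch below implements the D(x)^2 sampling described above (illustrative; it omits the repeated local trials that library implementations add):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: uniform over the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its closest chosen centroid.
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()                      # sample proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```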
For very large datasets, recomputing centroids over the full dataset on every iteration is expensive. Mini-batch k-means, introduced by D. Sculley in his 2010 WWW paper "Web-scale k-means clustering," performs each update on a small random sample (often 100 to 10,000 points). Centroids are updated using a per-cluster running average so older mini-batches still influence the centroid position but with diminishing weight. The result is convergence that is orders of magnitude faster than the batch algorithm, with only a small loss in cluster quality. Mini-batch k-means powers many production clustering pipelines that need to handle millions or billions of points.
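The per-cluster running average amounts to a per-center learning rate of 1/count. A simplified sketch of one mini-batch update in that spirit (not the exact library implementation):

```python
import numpy as np

def minibatch_update(centroids, counts, batch, labels):
    """Apply one mini-batch update: each centroid moves toward its assigned points
    with a step size that shrinks as the centroid accumulates more points."""
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]                               # per-center learning rate
        centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids, counts
```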
| Variant | Centroid type | Distance metric | Notes |
|---|---|---|---|
| k-means (Lloyd) | Arithmetic mean | Euclidean (L2) | Standard formulation, fastest |
| k-medians | Per-feature median | Manhattan (L1) | More robust to outliers |
| k-medoids (PAM) | Actual data point | Any dissimilarity | Robust, interpretable, slower |
| k-modes | Per-feature mode | Matching dissimilarity | Designed for categorical data |
| k-prototypes | Mixed mean and mode | Mixed Euclidean and matching | Mixed numerical and categorical features |
| Spherical k-means | Mean projected to unit sphere | Cosine similarity | Common for text and embeddings |
| Fuzzy c-means | Weighted mean of soft assignments | Euclidean | Soft cluster membership |
| Bisecting k-means | Mean | Euclidean | Recursively splits using k=2 |
The choice of distance metric determines which point is the optimal centroid and which clusters the algorithm prefers. The standard choice is Euclidean distance, which pairs naturally with the mean. Other metrics produce different geometry and lead to different cluster shapes.
| Metric | Formula | Optimal representative | Common use |
|---|---|---|---|
| Euclidean (L2) | sqrt(Sum (x_i - y_i)^2) | Arithmetic mean | Continuous numeric features |
| Manhattan (L1) | Sum abs(x_i - y_i) | Per-feature median | Robust to outliers; k-medians |
| Chebyshev (L-infinity) | max abs(x_i - y_i) | Per-feature midrange (minimax objective) | Worst-case per-feature differences |
| Cosine distance | 1 - (x . y) / (norm(x) * norm(y)) | Mean projected to the unit sphere | Text and embeddings (spherical k-means) |
| Mahalanobis | sqrt((x - y)^T S^{-1} (x - y)) | Sample mean (with covariance correction) | Correlated features, anomaly detection |
| Hamming | Number of differing positions | Per-feature mode | Categorical or binary data |
When the metric is not Euclidean, using the arithmetic mean as the centroid no longer minimizes the chosen objective. Practitioners either swap the centroid for the appropriate representative (median, medoid, mode) or accept the approximation introduced by keeping the mean.
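A small check of that mean-versus-median distinction (a sketch; with an outlier present, the per-feature median achieves a lower total Manhattan distance than the mean):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [50.0, 0.0]])  # one outlier
mean, median = X.mean(axis=0), np.median(X, axis=0)

def total_l1(p):
    # Total Manhattan distance from p to every point.
    return np.abs(X - p).sum()

print(total_l1(mean) > total_l1(median))  # True: the median is the better L1 representative
```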
Centroid-based methods assume that a single point can adequately summarize a cluster. This works well for compact, roughly spherical clusters but fails for elongated, non-convex, or density-based structures. Several alternative paradigms relax this assumption.
| Method | Cluster representation | Strengths | Weaknesses |
|---|---|---|---|
| k-means | One centroid per cluster | Fast, simple, scales to millions of points | Requires k, sensitive to initialization, assumes spherical clusters |
| k-medoids | Actual data point per cluster | Robust to outliers, works with any dissimilarity | O(k(n-k)^2) per iteration, slow on large data |
| DBSCAN | Set of density-connected points | Finds arbitrary shapes, labels noise, no k needed | Sensitive to eps and min_samples, struggles with varying densities |
| OPTICS | Reachability plot of densities | Handles varying densities better than DBSCAN | Plot interpretation is non-trivial |
| Hierarchical clustering | Tree of nested merges or splits | Reveals hierarchy, no k needed up front | O(n^2) memory, O(n^3) time for naive agglomerative |
| Spectral clustering | Eigenvectors of an affinity graph | Captures non-convex clusters | O(n^2) or O(n^3) depending on solver |
| Gaussian mixture models | Mean and covariance per component | Soft assignments, elliptical clusters | EM is slow, requires k, can collapse |
| Affinity propagation | Self-selected exemplar points | No k needed, exemplar-based | O(n^2) memory and per iteration |
A pragmatic decision rule: try k-means first, and switch to DBSCAN or a Gaussian mixture if the clusters are visibly non-convex, of mixed densities, or noticeably overlapping.
Once a centroid-based algorithm produces clusters, several internal metrics quantify how good those clusters are.
WCSS is the total squared distance from each point to its assigned centroid. K-means minimizes this quantity directly. WCSS always decreases as k increases, so it is most useful for comparing different runs at the same k or for the elbow plot.
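In scikit-learn, WCSS is exposed as the `inertia_` attribute, so an elbow plot is a short loop (a sketch assuming `X_scaled` is a standardized feature matrix):

```python
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)   # WCSS for this k; plot against k and look for the bend
```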
The silhouette score, introduced by Peter Rousseeuw in 1987, measures how well each point fits its cluster compared to its second-best alternative. For each point i, let a(i) be the mean distance to other points in its own cluster and b(i) the mean distance to points in the nearest neighboring cluster. The silhouette coefficient is
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Values near +1 indicate the point is well matched to its cluster. Values near 0 mean the point is on a boundary. Negative values suggest a likely misassignment. The mean silhouette across the dataset is a single-number summary that can be compared across different values of k (Rousseeuw, 1987).
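scikit-learn computes the mean silhouette directly (a sketch; `X_scaled` and `labels` are assumed to come from a previously fitted clustering):

```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X_scaled, labels)   # mean s(i) over all points, in [-1, 1]
```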
The Davies-Bouldin index (Davies and Bouldin, 1979) compares the average within-cluster scatter to the between-cluster separation, using centroids to define both. Lower values indicate tighter, better-separated clusters. The index is fast to compute and is provided by scikit-learn as davies_bouldin_score.
The Calinski-Harabasz index, also called the variance ratio criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, both computed from centroids. Higher values indicate better-defined clusters.
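Both indices are one-liners in scikit-learn (a sketch with the same assumed `X_scaled` and `labels`):

```python
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

db = davies_bouldin_score(X_scaled, labels)       # lower is better
ch = calinski_harabasz_score(X_scaled, labels)    # higher is better
```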
Centroid-based methods require k to be specified in advance. Several techniques help estimate a reasonable value.
| Technique | Idea | Caveats |
|---|---|---|
| Elbow method | Plot WCSS vs k and look for the bend | Subjective; the elbow is often unclear |
| Silhouette analysis | Choose k that maximizes mean silhouette | Computationally expensive on large data |
| Gap statistic | Compare WCSS to that under a uniform null distribution (Tibshirani, Walther, Hastie, 2001) | Slow, requires many simulations |
| BIC / AIC | Penalize model complexity (typically with Gaussian mixture interpretation) | Assumes a probabilistic model |
| X-means | Start small, split clusters and accept splits that improve BIC (Pelleg and Moore, 2000) | Approximation, but automatic |
| G-means | Like X-means, but uses Anderson-Darling Gaussianity test | Sensitive to non-Gaussian clusters |
| Domain knowledge | Use prior structure (number of customer tiers, products, segments) | Often the most useful in practice |
No single technique is universally reliable. Experienced practitioners triangulate using two or three methods and validate against domain understanding.
The simplicity that makes centroid-based methods popular also creates well-known failure modes.
Centroid methods assume spherical, similarly sized clusters. When real clusters are elongated, ring-shaped, or of widely different densities, the algorithm draws boundaries through dense regions and assigns points incorrectly. DBSCAN and spectral clustering handle these cases better.
They are sensitive to outliers. A single distant point pulls its assigned centroid toward itself, distorting the cluster center. K-medoids, k-medians, or robust pre-filtering helps.
They require k in advance. Choosing the wrong k produces either over-merged or fragmented clusters. Validation indices and domain knowledge help, but no automatic method is perfect.
They are sensitive to feature scaling. Because Euclidean distance treats every dimension equally, a feature with a large range dominates the distance computation. Standardization or normalization is almost always required before clustering.
They converge to local minima. Different random initializations produce different solutions. Running k-means many times (the n_init parameter in scikit-learn) and keeping the best WCSS is the standard mitigation. Using k-means++ also improves consistency.
They assume a meaningful Euclidean geometry. For categorical, ordinal, or graph-structured data, the very notion of a centroid is questionable, and methods like k-modes, k-prototypes, or graph clustering are more appropriate.
Centroids show up in several other corners of machine learning beyond clustering proper.
The nearest centroid classifier represents each class by its centroid in feature space, then classifies a new observation by assigning it to the class of the nearest centroid. The same idea, applied to text vectors weighted by TF-IDF, is known as the Rocchio classifier after the 1971 work by Joseph Rocchio for relevance feedback in the SMART information retrieval system. Despite its simplicity, the nearest centroid classifier remains a strong baseline for high-dimensional text classification and is provided by scikit-learn as NearestCentroid. Rocchio's original formulation also added a negative term that pushes the query away from the centroid of non-relevant documents, an idea that still influences modern relevance feedback.
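A Rocchio-style text classifier is a two-step pipeline in scikit-learn (a sketch; `docs_train`, `y_train`, and `docs_test` are hypothetical placeholders for raw documents and labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

# TF-IDF vectors plus one centroid per class, as in the Rocchio classifier.
clf = make_pipeline(TfidfVectorizer(), NearestCentroid())
clf.fit(docs_train, y_train)      # docs_train: list of strings, y_train: class labels
preds = clf.predict(docs_test)
```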
Vector quantization (VQ), developed by Robert Gray and others starting in the late 1970s, compresses signals by mapping each input vector to the nearest entry in a finite codebook. Each codebook entry is a centroid. The Linde-Buzo-Gray (LBG) algorithm, a generalization of Lloyd's algorithm, builds the codebook by iteratively splitting and re-clustering until distortion is minimized. VQ is the foundation of speech codecs, image compression schemes, and modern product quantization for fast nearest-neighbor search.
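A vector-quantization codebook can be built with ordinary k-means and then used to encode each vector as a centroid index (a sketch on random data; the LBG splitting schedule is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
signal = rng.normal(size=(10_000, 8))               # vectors to compress

codebook = KMeans(n_clusters=256, n_init=3, random_state=0).fit(signal)
codes = codebook.predict(signal)                    # each vector -> index of nearest codeword
reconstructed = codebook.cluster_centers_[codes]    # lossy reconstruction from the codebook
```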
Large-scale vector retrieval libraries such as FAISS, ScaNN, and Milvus use centroids in the IVF (inverted file) index. During training, the library runs k-means to learn a coarse codebook of, for example, 4,096 centroids. Each database vector is assigned to its nearest centroid and stored in that centroid's posting list. At query time, the system computes distances between the query and all centroids, then probes only the few nearest posting lists. Combined with product quantization for compact encoding, IVF brings billion-vector search to commodity hardware (Douze et al., 2024).
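With FAISS, the IVF structure described above takes a few lines (a sketch; the dataset is random and the sizes are illustrative rather than production settings):

```python
import numpy as np
import faiss

d, nlist = 64, 4096                        # vector dimension and number of coarse centroids
xb = np.random.default_rng(0).random((500_000, d), dtype=np.float32)

quantizer = faiss.IndexFlatL2(d)           # assigns vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # k-means learns the coarse codebook
index.add(xb)                              # each vector lands in its centroid's posting list

index.nprobe = 8                           # probe only the 8 nearest posting lists at query time
distances, ids = index.search(xb[:5], 10)  # approximate 10-nearest-neighbor search
```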
Mixture-of-experts (MoE) layers in large transformer models route each token to a small number of experts. Recent work explores associating each expert with a learned centroid in embedding space and routing tokens by cosine similarity to the centroid. This geometric routing reduces the number of routing parameters and makes expert specialization directly inspectable from the centroid matrix.
In computer vision, CenterNet (Zhou et al., 2019, Objects as Points) treats each object as a single keypoint at the center of its bounding box and regresses size, depth, and orientation from features at that keypoint. The center of the bounding box is the centroid of the object's annotation, and the design eliminates the anchor boxes used by earlier detectors such as Faster R-CNN. Pose estimation pipelines often use centroid heatmaps for similar reasons.
Learning vector quantization (LVQ), self-organizing maps (SOM), radial basis function networks, and many few-shot classifiers all rely on prototype vectors that behave similarly to centroids. Prototypical networks (Snell et al., 2017) compute a class centroid in embedding space from a few support examples and classify queries by nearest centroid, achieving strong few-shot performance.
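The prototypical-network classification rule reduces to a nearest-centroid lookup once embeddings are available (a sketch with random stand-in embeddings; in practice these come from a trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_support, dim = 5, 5, 64
support = rng.normal(size=(n_classes, n_support, dim))   # (classes, shots, embedding dim)
queries = rng.normal(size=(20, dim))

prototypes = support.mean(axis=1)                        # one centroid ("prototype") per class
d2 = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
pred_classes = d2.argmin(axis=1)                         # nearest-prototype classification
```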
scikit-learn provides several centroid-based classes that follow the same fit and predict interface.
```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import StandardScaler

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features);
# X_train_scaled, y_train, and X_test_scaled are assumed to be defined analogously.
X_scaled = StandardScaler().fit_transform(X)

# Standard k-means with k-means++ initialization
kmeans = KMeans(n_clusters=8, init="k-means++", n_init=10, random_state=0)
kmeans.fit(X_scaled)
centroids = kmeans.cluster_centers_  # shape: (8, n_features)
labels = kmeans.labels_

# Mini-batch variant for large datasets
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
mbk.fit(X_scaled)

# Nearest centroid classifier
ncc = NearestCentroid()
ncc.fit(X_train_scaled, y_train)
preds = ncc.predict(X_test_scaled)
```
Key scikit-learn parameters worth understanding when working with centroids:
| Parameter | Class | Default | Purpose |
|---|---|---|---|
| n_clusters | KMeans, MiniBatchKMeans | 8 | Number of centroids to fit |
| init | KMeans | 'k-means++' | Initialization scheme |
| n_init | KMeans | 10 | Number of restarts; best run is kept |
| algorithm | KMeans | 'lloyd' | Use 'elkan' for many clusters with low dimensionality |
| batch_size | MiniBatchKMeans | 1024 | Points used per update |
| reassignment_ratio | MiniBatchKMeans | 0.01 | Controls when low-count clusters are reseeded |
| metric | NearestCentroid | 'euclidean' | Switch to 'manhattan' for L1 prototypes |
| shrink_threshold | NearestCentroid | None | Shrunken centroids for high-dimensional data |
A few rules of thumb make centroid-based methods more reliable in practice:
- Standardize features before clustering unless they are already on the same scale; otherwise the metric is dominated by whichever feature has the largest variance.
- Run multiple restarts. The default n_init=10 in scikit-learn is a sensible starting point; for difficult datasets, more restarts help.
- Seed deterministically when reproducibility matters. Setting random_state ensures the same centroids appear across runs.
- Validate cluster quality with at least one internal metric, ideally combined with a visualization (a PCA or t-SNE projection of the data with cluster colors and centroid markers).
- Reassess whether the data really has a centroid structure. If silhouette scores are low across all k and visualizations show overlapping or non-convex clusters, switch to DBSCAN, a Gaussian mixture, or hierarchical clustering rather than forcing more centroids onto the problem.
- When feature dimensionality is very high, consider dimensionality reduction (PCA, UMAP) or shrunken centroids before applying nearest-centroid classification or k-means.
| Year | Researcher | Contribution |
|---|---|---|
| 1956 | Hugo Steinhaus | Early formulation of partitioning by minimizing within-group variance |
| 1957 | Stuart Lloyd | Iterative centroid algorithm at Bell Labs (memo unpublished until 1982) |
| 1965 | Edward Forgy | Independent description of an equivalent algorithm |
| 1967 | James MacQueen | Coined the term "k-means" |
| 1971 | Joseph Rocchio | Centroid-based relevance feedback for the SMART IR system |
| 1973 | J. C. Dunn | Introduced fuzzy c-means with soft centroid assignments |
| 1979 | Davies and Bouldin | Centroid-based cluster separation index |
| 1980 | Linde, Buzo, Gray | LBG vector quantization codebook design |
| 1987 | Kaufman and Rousseeuw | Partitioning Around Medoids (PAM) and the silhouette score |
| 1998 | Zhexue Huang | k-modes for categorical data |
| 2000 | Pelleg and Moore | X-means for automatic k selection via BIC |
| 2007 | Arthur and Vassilvitskii | k-means++ with provable O(log k) guarantee |
| 2010 | D. Sculley | Mini-batch k-means for web-scale data |
| 2012 | Bahmani et al. | Scalable k-means++ (k-means&#124;&#124;) for distributed initialization |
| 2017 | Snell et al. | Prototypical networks for few-shot learning |
| 2019 | Zhou et al. | CenterNet treats each object as a single centroid |
Imagine you have a bunch of marbles scattered on the floor and you want to find the "middle" of the pile. You take the average of all the marble positions (the average left-right position and the average front-back position) and you get a point that sits right in the middle of the pile. That middle point is the centroid.
Now imagine the marbles come in a few different colors and you want to find the middle of each color group. The k-means algorithm picks a few starting middle points (centroids), then asks every marble "which middle is closest to me?" and groups them. Then it slides each middle to the actual center of its group of marbles, and asks again. After a few rounds, the middles stop moving and you have your color groups.
The centroid is just the middle. K-means is the game of moving the middles around until they sit exactly where they should.