See also: Machine learning terms
In machine learning and data analysis, k-means is an unsupervised clustering algorithm that partitions a dataset into k distinct clusters. The algorithm aims to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters. It is one of the most widely used clustering methods in practice, applied to problems ranging from pattern recognition and image segmentation to customer segmentation and anomaly detection.
K-means was first described by Stuart Lloyd in 1957 as a technique for pulse-code modulation at Bell Labs, though his paper was not published externally until 1982. The algorithm was independently discovered and published by James MacQueen in 1967, who gave it the name "k-means." Hugo Steinhaus had proposed a similar idea even earlier, in 1956, making him the first researcher to explicitly describe an algorithm for partitioning multidimensional instances. Edward Forgy also described a similar method in 1965. Because of this complex history, the standard k-means algorithm is sometimes referred to as Lloyd's algorithm or the Lloyd-Forgy algorithm.
Despite its simplicity, k-means remains one of the top ten algorithms in data mining, according to a well-known survey by Wu et al. (2008). Its popularity stems from its ease of implementation, interpretability, and computational efficiency. The algorithm works best when clusters are roughly spherical, similarly sized, and well separated in the feature space.
The standard k-means algorithm, known as Lloyd's algorithm, is an iterative procedure that alternates between two steps: assigning data points to clusters and updating cluster centroids. The algorithm is guaranteed to converge, though it may converge to a local minimum rather than the global minimum.
Given a dataset of n observations, each represented as a d-dimensional vector, the k-means objective is to partition the n observations into k sets (S1, S2, ..., Sk) so as to minimize the within-cluster sum of squares (WCSS), also called inertia:
WCSS = Sum from i=1 to k of Sum over x in Si of ||x - mu_i||^2
where mu_i is the centroid (mean) of the points in cluster Si and ||.|| denotes the Euclidean distance.
Finding the globally optimal solution to the k-means objective is NP-hard in general (as proved by Aloise et al., 2009), which is why the algorithm relies on the iterative heuristic described below. The problem remains NP-hard even for k = 2 in general dimension, and even in the plane for general k, although in practice Lloyd's algorithm finds good solutions quickly.
After k initial centroids have been chosen (initialization strategies are discussed in a later section), the k-means algorithm alternates between the following two steps:
Each data point is assigned to the nearest centroid, forming a cluster. The distance between a data point and a centroid is typically calculated using the Euclidean distance:
d(x, mu) = sqrt(Sum from j=1 to d of (xj - mu_j)^2)
Formally, each point x is assigned to the cluster whose centroid is closest:
C(x) = argmin_i ||x - mu_i||^2
Other distance measures can be used in variants of k-means. For example, the Manhattan distance (L1 norm) is sometimes used for greater robustness to outliers, and cosine distance is preferred for text data and other high-dimensional sparse representations. Changing the distance generally requires changing the centroid update as well (as in k-median clustering, discussed below), since the mean minimizes only the squared Euclidean objective.
The centroids are recalculated by taking the mean of all data points assigned to their respective clusters:
mu_i = (1/|Si|) * Sum over x in Si of x
This step ensures that the centroids are updated to the center of the new clusters formed in the assignment step.
The assignment and update steps are repeated until the centroids' positions no longer change significantly or a predetermined number of iterations is reached. When convergence is achieved, the algorithm terminates, and the final cluster assignments are returned.
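To make the two alternating steps concrete, the following is a minimal NumPy sketch of Lloyd's algorithm. The function name, the purely random seeding, and the convergence test are illustrative choices, not a reference implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, seed=None):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```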
The WCSS objective decreases (or stays the same) with each iteration of the algorithm. Since there are only finitely many possible partitions of the data, the algorithm is guaranteed to converge in a finite number of steps. In practice, convergence typically occurs within 20 to 50 iterations for most datasets. However, worst-case examples exist where Lloyd's algorithm requires an exponential number of iterations to converge (Vattani, 2011), though such cases are rare in practice.
The time complexity of a single iteration of Lloyd's algorithm is O(n * k * d), where n is the number of data points, k is the number of clusters, and d is the number of dimensions. If the algorithm runs for T iterations, the total time complexity is O(T * n * k * d). The space complexity is O(n * d + k * d), since the algorithm needs to store the data matrix and the k centroids.
In the worst case, the number of iterations T can be superpolynomial in the number of data points, so Lloyd's algorithm has no polynomial worst-case running-time guarantee. In practice, however, k-means is considered one of the fastest clustering algorithms, making it a practical first choice even for datasets with millions of points.
The choice of initial centroids has a significant effect on the final clustering result. Poor initialization can lead to suboptimal solutions or slow convergence.
The simplest initialization strategy randomly selects k observations from the dataset as initial centroids. While easy to implement, this approach is highly sensitive to the random seed and can produce poor results if the initial centroids happen to be close together or placed in sparse regions of the data.
A common mitigation strategy is to run k-means multiple times with different random initializations and keep the result with the lowest WCSS. Scikit-learn's implementation does this by default, running the algorithm 10 times (controlled by the n_init parameter).
K-means++ is a smarter initialization method proposed by David Arthur and Sergei Vassilvitskii in 2007. It selects initial centroids that are spread out across the data, leading to better and more consistent clustering results.
The k-means++ initialization procedure is:
1. Choose the first centroid uniformly at random from the data points.
2. For each data point x, compute D(x), the distance from x to the nearest centroid chosen so far.
3. Choose the next centroid from the data points with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centroids have been chosen, then run the standard k-means algorithm.
By selecting centroids that are far apart, k-means++ avoids the problem of placing multiple centroids in the same cluster region. Arthur and Vassilvitskii proved that k-means++ achieves an expected WCSS that is at most O(log k) times the optimal WCSS, providing a theoretical guarantee on the quality of the initialization. This was the first initialization scheme for k-means with a provable approximation bound.
K-means++ is the default initialization method in scikit-learn and most other modern implementations.
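For illustration, the D^2 weighting at the heart of k-means++ can be sketched in a few lines of NumPy. The function name is hypothetical, and numerical edge cases (such as duplicate points) are ignored:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Illustrative k-means++ seeding: pick each new centroid with probability
    proportional to the squared distance to the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]     # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()                 # D(x)^2 weighting
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```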
K-means|| (pronounced "k-means parallel") was proposed by Bahmani et al. (2012) as a scalable version of k-means++ suitable for distributed computing environments. Instead of selecting one centroid at a time (which requires a sequential pass over the data), k-means|| oversamples multiple candidate centroids in each round. After a small number of rounds (typically O(log n)), the oversampled candidates are reduced to k centroids using a weighted clustering step. K-means|| achieves similar quality to k-means++ with far fewer passes over the data, making it practical for very large distributed datasets.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random | Select k random data points | Simple, fast | Unstable results |
| K-means++ | Spread-out probabilistic selection | Provable O(log k) guarantee, stable | Slightly slower initialization (sequential) |
| K-means\|\| | Parallelized version of k-means++ | Quality similar to k-means++ with far fewer passes over the data; suits distributed settings | More complex to implement |
| Forgy | Select k random observations as initial centroids (essentially the same as Random) | Simple | Can produce poor initial centroids |
| Random partition | Randomly assign points to k clusters, then compute centroids | Centroids near data center | Tends to produce similar initial centroids |
One challenge with k-means is selecting the appropriate number of clusters (k). Several methods can be used to estimate the optimal value of k.
The elbow method involves running k-means for a range of k values (for example, from 1 to 10) and plotting the WCSS (inertia) against k. As k increases, WCSS decreases because each data point gets closer to its nearest centroid. The "elbow" of the curve, where the rate of decrease sharply levels off, suggests the optimal number of clusters.
The intuition is that adding more clusters beyond the elbow point provides diminishing returns in terms of explaining the variance in the data. However, the elbow is not always clearly defined, making this method somewhat subjective. In practice, it is best used as a rough guide rather than a definitive answer.
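An elbow plot can be produced with a few lines of scikit-learn and matplotlib; X_scaled below is assumed to be a preprocessed feature matrix defined elsewhere:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)   # WCSS for this value of k

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```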
The silhouette score provides a more rigorous measure of clustering quality. For each data point i, the silhouette coefficient is:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
- a(i) is the mean distance between point i and all other points in its own cluster (a measure of cohesion), and
- b(i) is the smallest mean distance from point i to the points of any other cluster (a measure of separation).
The silhouette coefficient ranges from -1 to +1:
| Score range | Interpretation |
|---|---|
| Close to +1 | Point is well-matched to its own cluster and poorly matched to neighboring clusters |
| Close to 0 | Point is on or near the boundary between two clusters |
| Close to -1 | Point may have been assigned to the wrong cluster |
The overall silhouette score is the mean of the silhouette coefficients across all data points. The optimal k is typically the one that maximizes the average silhouette score.
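The same sweep over candidate values of k can be scored by average silhouette instead of inertia. As in the elbow example, X_scaled is assumed to be a preprocessed feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The silhouette score is undefined for a single cluster, so start at k = 2
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print("k with the highest average silhouette:", best_k)
```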
The gap statistic, introduced by Tibshirani, Walther, and Hastie in 2001, compares the WCSS of the observed data with the WCSS expected under a null reference distribution (typically a uniform distribution over the data range). The optimal k is the value where the gap between the observed and expected WCSS is largest. The gap statistic formalizes the elbow method by providing a statistical criterion rather than a visual one.
The Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) can be used to select k when the clustering model is interpreted probabilistically. These criteria balance model fit against model complexity: BIC penalizes the number of parameters more heavily than AIC, often favoring simpler models. They are most commonly applied with Gaussian mixture models, but can also be adapted to k-means by treating each cluster as a spherical Gaussian.
| Method | Based on | Strengths | Weaknesses |
|---|---|---|---|
| Elbow method | WCSS (inertia) | Simple, intuitive | Subjective; elbow not always clear |
| Silhouette score | Cohesion and separation | Principled; considers cluster shape | Computationally expensive for large n |
| Gap statistic | Comparison to null distribution | Statistically grounded | Complex to implement, slow |
| BIC/AIC | Model likelihood | Works well with GMMs; penalizes complexity | Assumes a specific probabilistic model |
Several variants of the k-means algorithm have been developed to address its limitations.
Mini-batch k-means is a variant designed for large datasets. Instead of using the entire dataset in each iteration, mini-batch k-means randomly samples a small batch of data points (typically 100 to 1,000) and performs the assignment and update steps on just that batch.
The update rule for mini-batch k-means uses a learning rate that decreases over time, allowing the centroids to gradually converge:
mu_i = (1 - eta) * mu_i + eta * x
where eta is the learning rate for the current iteration.
Mini-batch k-means was popularized by David Sculley in a 2010 paper for web-scale clustering applications. It runs significantly faster than standard k-means, often producing results that are only slightly worse in terms of WCSS. It is particularly useful when the dataset is too large to fit in memory or when speed is a priority.
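A simplified sketch of the per-batch update follows. It uses one common choice of learning rate, taken from Sculley's formulation, in which eta for a centroid is the reciprocal of the number of points that centroid has absorbed so far; other details of the published algorithm are omitted:

```python
import numpy as np

def minibatch_update(X_batch, centroids, counts):
    """One mini-batch pass over X_batch, updating centroids in place.
    counts[i] tracks how many points centroid i has absorbed so far."""
    # Assign every point in the batch to its nearest centroid
    labels = np.linalg.norm(
        X_batch[:, None, :] - centroids[None, :, :], axis=2
    ).argmin(axis=1)
    for x, i in zip(X_batch, labels):
        counts[i] += 1
        eta = 1.0 / counts[i]    # learning rate shrinks as the centroid sees more points
        centroids[i] = (1 - eta) * centroids[i] + eta * x
    return centroids, counts
```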
| Aspect | Standard k-means | Mini-batch k-means |
|---|---|---|
| Data used per iteration | Entire dataset | Small random batch |
| Speed | Slower for large datasets | Much faster |
| Result quality | Converges to a local WCSS minimum | Slightly worse WCSS on average |
| Memory usage | Requires full dataset in memory | Low memory footprint |
| Best for | Small to medium datasets | Large datasets, streaming data |
K-medoids, also known as Partitioning Around Medoids (PAM), uses actual data points as cluster centers (called medoids) instead of means. Because medoids are real data points, the algorithm is more robust to outliers than standard k-means. The objective function minimizes the sum of dissimilarities between each point and its medoid, and it can work with any distance metric, not just Euclidean distance. The main drawback is higher computational cost: PAM has O(k * (n - k)^2) complexity per iteration, which makes it impractical for very large datasets. The CLARA and CLARANS algorithms were developed as scalable approximations of PAM.
K-median clustering replaces the mean with the median when computing cluster centers, and uses the Manhattan distance (L1 norm) instead of Euclidean distance. This makes the algorithm more robust to outliers, since the median is less affected by extreme values than the mean. K-median is the L1 analogue of k-means: just as the mean minimizes the within-cluster sum of squared Euclidean distances, the per-coordinate median minimizes the within-cluster sum of Manhattan distances.
Bisecting k-means is a divisive hierarchical clustering approach. It starts with all data in a single cluster and iteratively splits the cluster with the largest WCSS (or the largest number of points) into two sub-clusters using standard k-means with k=2. This process repeats until the desired number of clusters is reached. Bisecting k-means often produces more balanced cluster sizes than standard k-means and can be more stable with respect to initialization, since only binary splits are performed at each step.
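The splitting loop is straightforward to sketch on top of scikit-learn's KMeans; the splitting criterion used here (largest WCSS) and the label bookkeeping are illustrative simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=42):
    """Sketch: repeatedly split the cluster with the largest WCSS using 2-means."""
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < k:
        # Find the cluster with the largest within-cluster sum of squares
        wcss = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in range(labels.max() + 1)]
        target = int(np.argmax(wcss))
        mask = labels == target
        # Split that cluster into two with standard k-means (k = 2)
        sub = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[mask])
        # One half keeps the old label, the other half gets a brand-new label
        labels[np.where(mask)[0][sub == 1]] = labels.max() + 1
    return labels
```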
Kernel k-means addresses the limitation that standard k-means can only find linearly separable (convex) clusters. It applies the kernel trick to implicitly map data points into a higher-dimensional feature space where the clusters become linearly separable. In this feature space, the standard k-means algorithm is applied. The key insight is that the algorithm can be rewritten so that it only requires pairwise dot products between data points, which can be replaced by a kernel function (such as the radial basis function or polynomial kernel). Kernel k-means is closely related to spectral clustering (Dhillon et al., 2004). The main disadvantage is that it requires computing and storing an n-by-n kernel matrix, which limits scalability.
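Because only the kernel (Gram) matrix is needed, the assignment step can be written entirely in terms of kernel entries. The following NumPy sketch assumes a precomputed n-by-n kernel matrix K and uses random initial assignments:

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=None):
    """Kernel k-means sketch. The squared feature-space distance from point i to the
    mean of cluster C is K[i, i] - 2 * mean_{j in C} K[i, j] + mean_{j, l in C} K[j, l]."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)                  # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:                         # empty cluster: never chosen
                dist[:, c] = np.inf
                continue
            within = K[np.ix_(idx, idx)].mean()       # mean_{j,l in C} K[j, l]
            cross = K[:, idx].mean(axis=1)            # mean_{j in C} K[i, j] for every i
            dist[:, c] = np.diag(K) - 2 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # assignments stable: converged
            break
        labels = new_labels
    return labels
```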
Fuzzy c-means (FCM) allows soft cluster assignments, where each data point has a degree of membership in each cluster rather than belonging to exactly one cluster. The membership values range from 0 to 1, and for each point they sum to 1 across all clusters. A fuzziness parameter m (typically set to 2) controls how soft or hard the assignments are. When m approaches 1, fuzzy c-means converges to hard k-means. FCM was introduced by Dunn in 1973 and improved by Bezdek in 1981. It is widely used in image processing and pattern recognition where cluster boundaries are not well defined.
| Variant | Key difference from standard k-means |
|---|---|
| Spherical k-means | Uses cosine similarity instead of Euclidean distance; suited for text data and high-dimensional sparse data |
| K-modes | Designed for categorical data; uses the mode instead of the mean and a matching dissimilarity measure |
| K-prototypes | Combines k-means and k-modes to handle datasets with both numerical and categorical features |
| X-means | Automatically determines k by starting with k=2 and repeatedly splitting clusters, using BIC to decide whether to accept each split |
| G-means | Similar to x-means, but tests each cluster for Gaussianity using the Anderson-Darling test to decide whether to split |
Before running k-means, several preprocessing steps can improve results significantly.
Because k-means uses Euclidean distance, features with larger scales dominate the distance calculation. Normalization or standardization of features is critical for meaningful results. Standardizing features to zero mean and unit variance (z-score normalization) is the most common approach. Min-max scaling to the [0, 1] range is another option. Without proper scaling, a feature measured in thousands (such as salary) will completely overshadow a feature measured in tens (such as age), leading to clusters that are determined almost entirely by the high-magnitude feature.
When the dataset has many features, applying PCA or other dimensionality reduction techniques before clustering can reduce noise and speed up computation. Research by Ding and He (2004) showed that the cluster indicators produced by k-means are closely related to the principal components of the data, providing theoretical justification for using PCA as a preprocessing step.
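A sketch of this preprocessing chain as a scikit-learn pipeline; the number of retained components (10) and the number of clusters are placeholders, and X is assumed to be the raw feature matrix:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize, project onto the leading principal components, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```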
Because k-means is sensitive to outliers (outliers can pull centroids away from the true cluster centers), removing extreme values before clustering can produce cleaner results. Alternatively, using a more robust variant such as k-medoids can reduce the influence of outliers without requiring their explicit removal.
K-means is just one of many clustering algorithms. Understanding its strengths and weaknesses relative to alternatives helps practitioners choose the right method for their data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points in high-density regions and marks points in low-density regions as outliers.
| Aspect | K-means | DBSCAN |
|---|---|---|
| Number of clusters | Must be specified (k) | Determined automatically |
| Cluster shape | Spherical (convex) | Arbitrary shapes |
| Outlier handling | No built-in mechanism | Labels outliers as noise |
| Parameters | k (number of clusters) | eps (neighborhood radius), min_samples |
| Scalability | Very scalable | Moderate (can struggle with varying densities) |
| Cluster sizes | Tends toward equal-sized clusters | Can find clusters of varying sizes |
DBSCAN is preferred when clusters have irregular shapes or when the data contains outliers. K-means is preferred when clusters are approximately spherical and the number of clusters is known.
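A minimal side-by-side run illustrates the practical difference; the eps and min_samples values below are placeholders that normally require tuning, and X_scaled is assumed to be a preprocessed feature matrix:

```python
from sklearn.cluster import DBSCAN, KMeans

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# DBSCAN marks outliers with the label -1; k-means assigns every point to a cluster
print("Points labelled as noise by DBSCAN:", (dbscan_labels == -1).sum())
```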
Gaussian mixture models (GMMs) are a probabilistic clustering method that models the data as a mixture of Gaussian distributions. GMMs are fitted using the Expectation-Maximization (EM) algorithm.
| Aspect | K-means | GMM |
|---|---|---|
| Cluster assignment | Hard (each point to exactly one cluster) | Soft (probability of belonging to each cluster) |
| Cluster shape | Spherical only | Elliptical (flexible covariance) |
| Output | Cluster labels | Cluster probabilities |
| Speed | Faster | Slower |
| Sensitivity to initialization | High | High |
| Number of clusters | Must be specified | Must be specified (can use BIC/AIC) |
K-means can be viewed as a special case of GMM where each Gaussian component has equal, spherical covariance. GMMs are more flexible but also more computationally expensive and prone to overfitting with insufficient data.
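For comparison, a GMM can be fitted with scikit-learn's GaussianMixture; the covariance type and number of components below are illustrative:

```python
from sklearn.mixture import GaussianMixture

# 'full' covariance allows elliptical clusters; spherical covariance behaves
# more like k-means
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X_scaled)
hard_labels = gmm.predict(X_scaled)        # hard assignments, comparable to k-means labels
soft_probs = gmm.predict_proba(X_scaled)   # soft assignments: one probability per cluster
```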
Hierarchical clustering builds a tree (dendrogram) of nested cluster merges or splits, without requiring a predefined number of clusters. It is useful for understanding the hierarchical structure of the data but does not scale well to large datasets (O(n^2) space and O(n^3) time for agglomerative methods). K-means is much faster and scales to larger datasets, but it does not reveal the nested structure that hierarchical clustering provides.
| Aspect | K-means | DBSCAN | GMM | Hierarchical |
|---|---|---|---|---|
| Must specify k | Yes | No | Yes | No (cut dendrogram) |
| Cluster shape | Spherical | Arbitrary | Elliptical | Arbitrary |
| Outlier handling | None | Built-in | Possible via low probability | None |
| Scalability | O(nkd) per iteration | O(n log n) with index | O(nkd) to O(nkd^2) per EM step | O(n^2) to O(n^3) |
| Soft assignments | No | No | Yes | No |
| Best use case | Large data, spherical clusters | Irregular shapes, noise | Overlapping clusters | Small data, hierarchy exploration |
K-means clustering is used across many domains.
Businesses use k-means to group customers based on purchasing behavior, demographics, website interactions, and other attributes. By identifying distinct customer segments, companies can tailor marketing strategies, personalize recommendations, and optimize resource allocation. A typical segmentation might identify groups such as high-value loyal customers, bargain shoppers, and new prospects.
K-means is widely used for color quantization, where the goal is to reduce the number of distinct colors in an image while preserving visual quality. Each pixel's RGB values are treated as a 3-dimensional data point, and k-means groups similar colors together. The centroid of each cluster becomes one of the k representative colors in the compressed palette. For example, reducing a photograph from 96,000+ unique colors to just 64 colors using k-means produces an image that looks nearly identical to the original while requiring far less storage. A single byte can address up to 256 palette colors, compared to the 3 bytes per pixel required for full RGB encoding.
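A sketch of color quantization with scikit-learn, assuming image is an (H, W, 3) RGB array loaded elsewhere:

```python
import numpy as np
from sklearn.cluster import KMeans

# Treat every pixel as a 3-dimensional point in RGB space
pixels = image.reshape(-1, 3).astype(float)

# Cluster the pixel colors into a 64-color palette
kmeans = KMeans(n_clusters=64, n_init=4, random_state=42).fit(pixels)
palette = kmeans.cluster_centers_.round().astype(np.uint8)

# Rebuild the image by replacing each pixel with its cluster's representative color
quantized = palette[kmeans.labels_].reshape(image.shape)
```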
In natural language processing, k-means is used to group documents by topic. Documents are represented as vectors using techniques like TF-IDF or word embeddings, and k-means clusters them based on content similarity. Spherical k-means (using cosine distance) is often preferred for this task because text vectors are typically high-dimensional and sparse.
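A sketch of TF-IDF document clustering, assuming documents is a list of raw text strings. Because scikit-learn's TF-IDF rows are L2-normalized by default, squared Euclidean distance on them is a monotone function of cosine similarity, so standard k-means here approximates spherical k-means:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sparse TF-IDF matrix with unit-length rows (norm='l2' is the default)
tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
X_text = tfidf.fit_transform(documents)

km = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_text)
print(km.labels_[:10])   # cluster label for the first ten documents
```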
K-means can be used for anomaly detection by treating data points that are far from all cluster centroids as potential anomalies. After clustering, points with a distance to their nearest centroid that exceeds a threshold (often set as a multiple of the standard deviation) are flagged as outliers. This approach is used in fraud detection, network intrusion detection, and equipment failure monitoring.
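One possible implementation of this thresholding rule; the choice of five clusters and the three-standard-deviation cutoff are heuristics rather than fixed rules, and X_scaled is assumed to be a preprocessed feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

# Distance from each point to the centroid of its assigned cluster
dist_to_centroid = np.linalg.norm(
    X_scaled - kmeans.cluster_centers_[kmeans.labels_], axis=1
)

# Flag points whose distance exceeds mean + 3 standard deviations
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
anomalies = np.where(dist_to_centroid > threshold)[0]
```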
| Application | Description | Typical k |
|---|---|---|
| Gene expression analysis | Group genes with similar expression patterns | 5 to 20 |
| Geographic analysis | Identify spatial clusters of events or facilities | Varies |
| Recommendation systems | Group similar users or items for collaborative filtering | 10 to 100 |
| Network traffic analysis | Group similar network traffic patterns to detect anomalies | 3 to 20 |
| Market basket analysis | Identify groups of products frequently purchased together | 5 to 15 |
| Medical imaging | Segment tissues or structures in MRI and CT scans | 3 to 10 |
| Search engine result grouping | Cluster search results into topical groups | 3 to 8 |
Scikit-learn provides a robust and well-optimized implementation of k-means.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Scale features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print("Inertia (WCSS):", kmeans.inertia_)
print("Silhouette Score:", silhouette_score(X_scaled, labels))
```
For large datasets, MiniBatchKMeans provides a faster alternative:
```python
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10)
mbk.fit(X_scaled)
```
Key parameters in scikit-learn's KMeans include:
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 8 | Number of clusters (k) |
| init | 'k-means++' | Initialization method |
| n_init | 10 | Number of times to run with different initializations |
| max_iter | 300 | Maximum number of iterations per run |
| tol | 1e-4 | Convergence tolerance based on inertia change |
| algorithm | 'lloyd' | Algorithm to use ('lloyd' or 'elkan') |
The Elkan algorithm (Elkan, 2003) is an accelerated variant that uses the triangle inequality to skip unnecessary distance calculations. It can be significantly faster than Lloyd's algorithm when the number of clusters is large, though it requires additional memory to store inter-centroid distances.
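Selecting the Elkan variant in scikit-learn is a one-line change; the clustering result matches Lloyd's algorithm, only redundant distance computations are skipped:

```python
from sklearn.cluster import KMeans

# Elkan tends to pay off when the number of clusters is large
kmeans = KMeans(n_clusters=50, algorithm="elkan", n_init=10, random_state=42)
kmeans.fit(X_scaled)
```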
K-means occupies a central position in the landscape of unsupervised learning algorithms. It can be viewed as a special case of the Expectation-Maximization (EM) algorithm applied to a mixture of spherical Gaussians with equal covariance. In this interpretation, the assignment step corresponds to the E-step (computing posterior responsibilities) and the update step corresponds to the M-step (re-estimating parameters). The difference is that k-means uses hard assignments (each point to one cluster), while EM uses soft probabilistic assignments.
K-means is also closely related to principal component analysis (PCA). Research by Ding and He (2004) showed that the principal components are the continuous solutions to a relaxed version of the cluster-membership indicators in the k-means objective. This relationship suggests that applying PCA before k-means is a natural preprocessing step that often preserves the cluster structure while reducing noise and computation time.
The Voronoi tessellation is another related concept. The assignment step of k-means partitions the feature space into Voronoi cells, where each cell contains all points closest to a given centroid. This geometric perspective helps explain why k-means produces convex cluster boundaries.
The history of k-means involves several independent discoveries across different disciplines:
| Year | Researcher | Contribution |
|---|---|---|
| 1956 | Hugo Steinhaus | Proposed a similar partitioning idea in the context of mathematical optimization |
| 1957 | Stuart Lloyd | Developed the iterative algorithm for pulse-code modulation at Bell Labs (unpublished until 1982) |
| 1965 | Edward Forgy | Published a similar method independently |
| 1967 | James MacQueen | Published the algorithm and coined the term "k-means" |
| 1973 | J.C. Dunn | Introduced fuzzy c-means clustering |
| 1981 | J.C. Bezdek | Generalized fuzzy c-means |
| 2003 | Charles Elkan | Proposed an accelerated k-means using the triangle inequality |
| 2007 | Arthur and Vassilvitskii | Introduced k-means++ initialization with provable guarantees |
| 2010 | David Sculley | Published mini-batch k-means for web-scale applications |
| 2012 | Bahmani et al. | Proposed k-means\|\|, a scalable parallel variant of k-means++ initialization |
The convergence of these independent discoveries reflects the fundamental nature of the problem: partitioning data into groups by proximity is a natural and broadly applicable idea that arises in many different fields.
Imagine you have a bunch of different colored balls on the floor, and you want to group them by color. The k-means algorithm is like a smart robot that helps you do this. First, you tell the robot how many color groups (clusters) you want, and it randomly picks a spot for each group.
Then, the robot looks at each ball and puts it in the group with the closest matching spot. After all the balls are in groups, the robot moves each spot to the middle of the balls in that group. It repeats this process (look at each ball, put it in the closest group, then move the spots to the middle) until the groups don't change anymore. At the end, you have all your balls neatly grouped by color.
The tricky part is that you have to tell the robot how many groups to make before it starts. If you say 3 but there are really 5 colors, the robot will do its best, but some colors will be lumped together. That is why picking the right number of groups is important.