See also: Machine learning terms
In machine learning and data analysis, k-means is an unsupervised clustering algorithm that partitions a dataset into k distinct clusters. The algorithm aims to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters. It is one of the most widely used clustering methods in practice, applied to problems ranging from pattern recognition and image segmentation to customer segmentation and anomaly detection.
K-means was first described by Stuart Lloyd in 1957 as a technique for pulse-code modulation at Bell Labs, though his paper was not published externally until 1982. The algorithm was independently discovered and published by James MacQueen in 1967, who gave it the name "k-means." Hugo Steinhaus had proposed a similar idea even earlier, in 1956, making him the first researcher to explicitly describe an algorithm for partitioning multidimensional instances. Edward Forgy also described a similar method in 1965. Because of this complex history, the standard k-means algorithm is sometimes referred to as Lloyd's algorithm or the Lloyd-Forgy algorithm.
Despite its simplicity, k-means remains one of the top ten algorithms in data mining, according to a well-known survey by Wu et al. (2008). Its popularity stems from its ease of implementation, interpretability, and computational efficiency. The algorithm works best when clusters are roughly spherical, similarly sized, and well separated in the feature space.
The standard k-means algorithm, known as Lloyd's algorithm, is an iterative procedure that alternates between two steps: assigning data points to clusters and updating cluster centroids. The algorithm is guaranteed to converge, though it may converge to a local minimum rather than the global minimum.
Given a dataset of n observations, each represented as a d-dimensional vector, the k-means objective is to partition the n observations into k sets (S1, S2, ..., Sk) so as to minimize the within-cluster sum of squares (WCSS), also called inertia:
WCSS = Sum from i=1 to k of Sum over x in Si of ||x - mu_i||^2
where mu_i is the centroid (mean) of the points in cluster Si and ||.|| denotes the Euclidean distance.
Finding the globally optimal solution to the k-means objective is NP-hard in general (as proved by Aloise et al., 2009), which is why the algorithm relies on the iterative heuristic described below. The problem remains NP-hard even for k = 2 in general dimension, and even in the plane for general k, although in practice Lloyd's algorithm finds good solutions quickly.
After k initial centroids have been chosen (initialization strategies are discussed in a later section), the k-means algorithm alternates between the following two steps:
Each data point is assigned to the nearest centroid, forming a cluster. The distance between a data point and a centroid is typically calculated using the Euclidean distance:
d(x, mu) = sqrt(Sum from j=1 to d of (xj - mu_j)^2)
Formally, each point x is assigned to the cluster whose centroid is closest:
C(x) = argmin_i ||x - mu_i||^2
Other distance measures can be used in variants of k-means. For example, the Manhattan distance (L1 norm) is sometimes used for greater robustness to outliers, and cosine distance is preferred for text data and other high-dimensional sparse representations. Changing the distance generally requires changing the centroid update as well (as in k-median clustering, discussed below), since the mean minimizes only the squared Euclidean objective.
The centroids are recalculated by taking the mean of all data points assigned to their respective clusters:
mu_i = (1/|Si|) * Sum over x in Si of x
This step ensures that the centroids are updated to the center of the new clusters formed in the assignment step.
The assignment and update steps are repeated until the centroids' positions no longer change significantly or a predetermined number of iterations is reached. When convergence is achieved, the algorithm terminates, and the final cluster assignments are returned.
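To make the two alternating steps concrete, the following is a minimal NumPy sketch of Lloyd's algorithm. The function name, the purely random seeding, and the convergence test are illustrative choices, not a reference implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-4, seed=None):
    """Minimal Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```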
The WCSS objective decreases (or stays the same) with each iteration of the algorithm. Since there are only finitely many possible partitions of the data, the algorithm is guaranteed to converge in a finite number of steps. In practice, convergence typically occurs within 20 to 50 iterations for most datasets. However, worst-case examples exist where Lloyd's algorithm requires an exponential number of iterations to converge (Vattani, 2011), though such cases are rare in practice.
The time complexity of a single iteration of Lloyd's algorithm is O(n * k * d), where n is the number of data points, k is the number of clusters, and d is the number of dimensions. If the algorithm runs for T iterations, the total time complexity is O(T * n * k * d). The space complexity is O(n * d + k * d), since the algorithm needs to store the data matrix and the k centroids.
In the worst case, the number of iterations T can be superpolynomial in the number of data points, so Lloyd's algorithm has no polynomial worst-case running-time guarantee. In practice, however, k-means is considered one of the fastest clustering algorithms, making it a practical first choice even for datasets with millions of points.
The choice of initial centroids has a significant effect on the final clustering result. Poor initialization can lead to suboptimal solutions or slow convergence.
The simplest initialization strategy randomly selects k observations from the dataset as initial centroids. While easy to implement, this approach is highly sensitive to the random seed and can produce poor results if the initial centroids happen to be close together or placed in sparse regions of the data.
A common mitigation strategy is to run k-means multiple times with different random initializations and keep the result with the lowest WCSS. Scikit-learn's implementation does this by default, running the algorithm 10 times (controlled by the n_init parameter).
K-means++ is a smarter initialization method proposed by David Arthur and Sergei Vassilvitskii in 2007. It selects initial centroids that are spread out across the data, leading to better and more consistent clustering results.
The k-means++ initialization procedure is:
1. Choose the first centroid uniformly at random from the data points.
2. For each data point x, compute D(x), the distance from x to the nearest centroid chosen so far.
3. Choose the next centroid from the data points with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centroids have been chosen, then run the standard k-means algorithm.
By selecting centroids that are far apart, k-means++ avoids the problem of placing multiple centroids in the same cluster region. Arthur and Vassilvitskii proved that k-means++ achieves an expected WCSS that is at most O(log k) times the optimal WCSS, providing a theoretical guarantee on the quality of the initialization. This was the first initialization scheme for k-means with a provable approximation bound.
K-means++ is the default initialization method in scikit-learn and most other modern implementations.
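For illustration, the D^2 weighting at the heart of k-means++ can be sketched in a few lines of NumPy. The function name is hypothetical, and numerical edge cases (such as duplicate points) are ignored:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Illustrative k-means++ seeding: pick each new centroid with probability
    proportional to the squared distance to the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]     # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = d2 / d2.sum()                 # D(x)^2 weighting
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```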
K-means|| (pronounced "k-means parallel") was proposed by Bahmani et al. (2012) as a scalable version of k-means++ suitable for distributed computing environments. Instead of selecting one centroid at a time (which requires a sequential pass over the data), k-means|| oversamples multiple candidate centroids in each round. After a small number of rounds (typically O(log n)), the oversampled candidates are reduced to k centroids using a weighted clustering step. K-means|| achieves similar quality to k-means++ with far fewer passes over the data, making it practical for very large distributed datasets.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random | Select k random data points | Simple, fast | Unstable results |
| K-means++ | Spread-out probabilistic selection | Provable O(log k) guarantee, stable | Slightly slower initialization (sequential) |
| K-means\|\| | Parallelized version of k-means++ | Quality similar to k-means++ with far fewer passes over the data; suits distributed settings | More complex to implement |
| Forgy | Select k random observations as initial centroids (essentially the same as Random) | Simple | Can produce poor initial centroids |
| Random partition | Randomly assign points to k clusters, then compute centroids | Centroids near data center | Tends to produce similar initial centroids |
One challenge with k-means is selecting the appropriate number of clusters (k). Several methods can be used to estimate the optimal value of k.
The elbow method involves running k-means for a range of k values (for example, from 1 to 10) and plotting the WCSS (inertia) against k. As k increases, WCSS decreases because each data point gets closer to its nearest centroid. The "elbow" of the curve, where the rate of decrease sharply levels off, suggests the optimal number of clusters.
The intuition is that adding more clusters beyond the elbow point provides diminishing returns in terms of explaining the variance in the data. However, the elbow is not always clearly defined, making this method somewhat subjective. In practice, it is best used as a rough guide rather than a definitive answer.
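An elbow plot can be produced with a few lines of scikit-learn and matplotlib; X_scaled below is assumed to be a preprocessed feature matrix defined elsewhere:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)   # WCSS for this value of k

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()
```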
The silhouette score provides a more rigorous measure of clustering quality. For each data point i, the silhouette coefficient is:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
- a(i) is the mean distance between point i and all other points in its own cluster (a measure of cohesion), and
- b(i) is the smallest mean distance from point i to the points of any other cluster (a measure of separation).
The silhouette coefficient ranges from -1 to +1:
| Score range | Interpretation |
|---|---|
| Close to +1 | Point is well-matched to its own cluster and poorly matched to neighboring clusters |
| Close to 0 | Point is on or near the boundary between two clusters |
| Close to -1 | Point may have been assigned to the wrong cluster |
The overall silhouette score is the mean of the silhouette coefficients across all data points. The optimal k is typically the one that maximizes the average silhouette score.
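The same sweep over candidate values of k can be scored by average silhouette instead of inertia. As in the elbow example, X_scaled is assumed to be a preprocessed feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The silhouette score is undefined for a single cluster, so start at k = 2
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print("k with the highest average silhouette:", best_k)
```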
The gap statistic, introduced by Tibshirani, Walther, and Hastie in 2001, compares the WCSS of the observed data with the WCSS expected under a null reference distribution (typically a uniform distribution over the data range). The optimal k is the value where the gap between the observed and expected WCSS is largest. The gap statistic formalizes the elbow method by providing a statistical criterion rather than a visual one.
The Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) can be used to select k when the clustering model is interpreted probabilistically. These criteria balance model fit against model complexity: BIC penalizes the number of parameters more heavily than AIC, often favoring simpler models. They are most commonly applied with Gaussian mixture models, but can also be adapted to k-means by treating each cluster as a spherical Gaussian.
| Method | Based on | Strengths | Weaknesses |
|---|---|---|---|
| Elbow method | WCSS (inertia) | Simple, intuitive | Subjective; elbow not always clear |
| Silhouette score | Cohesion and separation | Principled; considers cluster shape | Computationally expensive for large n |
| Gap statistic | Comparison to null distribution | Statistically grounded | Complex to implement, slow |
| BIC/AIC | Model likelihood | Works well with GMMs; penalizes complexity | Assumes a specific probabilistic model |
Several variants of the k-means algorithm have been developed to address its limitations.
Mini-batch k-means is a variant designed for large datasets. Instead of using the entire dataset in each iteration, mini-batch k-means randomly samples a small batch of data points (typically 100 to 1,000) and performs the assignment and update steps on just that batch.
The update rule for mini-batch k-means uses a learning rate that decreases over time, allowing the centroids to gradually converge:
mu_i = (1 - eta) * mu_i + eta * x
where eta is the learning rate for the current iteration.
Mini-batch k-means was popularized by David Sculley in a 2010 paper for web-scale clustering applications. It runs significantly faster than standard k-means, often producing results that are only slightly worse in terms of WCSS. It is particularly useful when the dataset is too large to fit in memory or when speed is a priority.
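A simplified sketch of the per-batch update follows. It uses one common choice of learning rate, taken from Sculley's formulation, in which eta for a centroid is the reciprocal of the number of points that centroid has absorbed so far; other details of the published algorithm are omitted:

```python
import numpy as np

def minibatch_update(X_batch, centroids, counts):
    """One mini-batch pass over X_batch, updating centroids in place.
    counts[i] tracks how many points centroid i has absorbed so far."""
    # Assign every point in the batch to its nearest centroid
    labels = np.linalg.norm(
        X_batch[:, None, :] - centroids[None, :, :], axis=2
    ).argmin(axis=1)
    for x, i in zip(X_batch, labels):
        counts[i] += 1
        eta = 1.0 / counts[i]    # learning rate shrinks as the centroid sees more points
        centroids[i] = (1 - eta) * centroids[i] + eta * x
    return centroids, counts
```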
| Aspect | Standard k-means | Mini-batch k-means |
|---|---|---|
| Data used per iteration | Entire dataset | Small random batch |
| Speed | Slower for large datasets | Much faster |
| Result quality | Converges to a local WCSS minimum | Slightly worse WCSS on average |
| Memory usage | Requires full dataset in memory | Low memory footprint |
| Best for | Small to medium datasets | Large datasets, streaming data |
K-medoids, also known as Partitioning Around Medoids (PAM), uses actual data points as cluster centers (called medoids) instead of means. Because medoids are real data points, the algorithm is more robust to outliers than standard k-means. The objective function minimizes the sum of dissimilarities between each point and its medoid, and it can work with any distance metric, not just Euclidean distance. The main drawback is higher computational cost: PAM has O(k * (n - k)^2) complexity per iteration, which makes it impractical for very large datasets. The CLARA and CLARANS algorithms were developed as scalable approximations of PAM.
K-median clustering replaces the mean with the median when computing cluster centers, and uses the Manhattan distance (L1 norm) instead of Euclidean distance. This makes the algorithm more robust to outliers, since the median is less affected by extreme values than the mean. K-median is the L1 analogue of k-means: just as the mean minimizes the within-cluster sum of squared Euclidean distances, the per-coordinate median minimizes the within-cluster sum of Manhattan distances.
Bisecting k-means is a divisive hierarchical clustering approach. It starts with all data in a single cluster and iteratively splits the cluster with the largest WCSS (or the largest number of points) into two sub-clusters using standard k-means with k=2. This process repeats until the desired number of clusters is reached. Bisecting k-means often produces more balanced cluster sizes than standard k-means and can be more stable with respect to initialization, since only binary splits are performed at each step.
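The splitting loop is straightforward to sketch on top of scikit-learn's KMeans; the splitting criterion used here (largest WCSS) and the label bookkeeping are illustrative simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=42):
    """Sketch: repeatedly split the cluster with the largest WCSS using 2-means."""
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < k:
        # Find the cluster with the largest within-cluster sum of squares
        wcss = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                for c in range(labels.max() + 1)]
        target = int(np.argmax(wcss))
        mask = labels == target
        # Split that cluster into two with standard k-means (k = 2)
        sub = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X[mask])
        # One half keeps the old label, the other half gets a brand-new label
        labels[np.where(mask)[0][sub == 1]] = labels.max() + 1
    return labels
```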
Kernel k-means addresses the limitation that standard k-means can only find linearly separable (convex) clusters. It applies the kernel trick to implicitly map data points into a higher-dimensional feature space where the clusters become linearly separable. In this feature space, the standard k-means algorithm is applied. The key insight is that the algorithm can be rewritten so that it only requires pairwise dot products between data points, which can be replaced by a kernel function (such as the radial basis function or polynomial kernel). Kernel k-means is closely related to spectral clustering (Dhillon et al., 2004). The main disadvantage is that it requires computing and storing an n-by-n kernel matrix, which limits scalability.
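Because only the kernel (Gram) matrix is needed, the assignment step can be written entirely in terms of kernel entries. The following NumPy sketch assumes a precomputed n-by-n kernel matrix K and uses random initial assignments:

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=None):
    """Kernel k-means sketch. The squared feature-space distance from point i to the
    mean of cluster C is K[i, i] - 2 * mean_{j in C} K[i, j] + mean_{j, l in C} K[j, l]."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)                  # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:                         # empty cluster: never chosen
                dist[:, c] = np.inf
                continue
            within = K[np.ix_(idx, idx)].mean()       # mean_{j,l in C} K[j, l]
            cross = K[:, idx].mean(axis=1)            # mean_{j in C} K[i, j] for every i
            dist[:, c] = np.diag(K) - 2 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # assignments stable: converged
            break
        labels = new_labels
    return labels
```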
Fuzzy c-means (FCM) allows soft cluster assignments, where each data point has a degree of membership in each cluster rather than belonging to exactly one cluster. The membership values range from 0 to 1, and for each point they sum to 1 across all clusters. A fuzziness parameter m (typically set to 2) controls how soft or hard the assignments are. When m approaches 1, fuzzy c-means converges to hard k-means. FCM was introduced by Dunn in 1973 and improved by Bezdek in 1981. It is widely used in image processing and pattern recognition where cluster boundaries are not well defined.
| Variant | Key difference from standard k-means |
|---|---|
| Spherical k-means | Uses cosine similarity instead of Euclidean distance; suited for text data and high-dimensional sparse data |
| K-modes | Designed for categorical data; uses the mode instead of the mean and a matching dissimilarity measure |
| K-prototypes | Combines k-means and k-modes to handle datasets with both numerical and categorical features |
| X-means | Automatically determines k by starting with k=2 and repeatedly splitting clusters, using BIC to decide whether to accept each split |
| G-means | Similar to x-means, but tests each cluster for Gaussianity using the Anderson-Darling test to decide whether to split |
Before running k-means, several preprocessing steps can improve results significantly.
Because k-means uses Euclidean distance, features with larger scales dominate the distance calculation. Normalization or standardization of features is critical for meaningful results. Standardizing features to zero mean and unit variance (z-score normalization) is the most common approach. Min-max scaling to the [0, 1] range is another option. Without proper scaling, a feature measured in thousands (such as salary) will completely overshadow a feature measured in tens (such as age), leading to clusters that are determined almost entirely by the high-magnitude feature.
When the dataset has many features, applying PCA or other dimensionality reduction techniques before clustering can reduce noise and speed up computation. Research by Ding and He (2004) showed that the cluster indicators produced by k-means are closely related to the principal components of the data, providing theoretical justification for using PCA as a preprocessing step.
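A sketch of this preprocessing chain as a scikit-learn pipeline; the number of retained components (10) and the number of clusters are placeholders, and X is assumed to be the raw feature matrix:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize, project onto the leading principal components, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```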
Because k-means is sensitive to outliers (outliers can pull centroids away from the true cluster centers), removing extreme values before clustering can produce cleaner results. Alternatively, using a more robust variant such as k-medoids can reduce the influence of outliers without requiring their explicit removal.
K-means is just one of many clustering algorithms. Understanding its strengths and weaknesses relative to alternatives helps practitioners choose the right method for their data.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points in high-density regions and marks points in low-density regions as outliers.
| Aspect | K-means | DBSCAN |
|---|---|---|
| Number of clusters | Must be specified (k) | Determined automatically |
| Cluster shape | Spherical (convex) | Arbitrary shapes |
| Outlier handling | No built-in mechanism | Labels outliers as noise |
| Parameters | k (number of clusters) | eps (neighborhood radius), min_samples |
| Scalability | Very scalable | Moderate (can struggle with varying densities) |
| Cluster sizes | Tends toward equal-sized clusters | Can find clusters of varying sizes |
DBSCAN is preferred when clusters have irregular shapes or when the data contains outliers. K-means is preferred when clusters are approximately spherical and the number of clusters is known.
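A minimal side-by-side run illustrates the practical difference; the eps and min_samples values below are placeholders that normally require tuning, and X_scaled is assumed to be a preprocessed feature matrix:

```python
from sklearn.cluster import DBSCAN, KMeans

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# DBSCAN marks outliers with the label -1; k-means assigns every point to a cluster
print("Points labelled as noise by DBSCAN:", (dbscan_labels == -1).sum())
```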
Gaussian mixture models (GMMs) are a probabilistic clustering method that models the data as a mixture of Gaussian distributions. GMMs are fitted using the Expectation-Maximization (EM) algorithm.
| Aspect | K-means | GMM |
|---|---|---|
| Cluster assignment | Hard (each point to exactly one cluster) | Soft (probability of belonging to each cluster) |
| Cluster shape | Spherical only | Elliptical (flexible covariance) |
| Output | Cluster labels | Cluster probabilities |
| Speed | Faster | Slower |
| Sensitivity to initialization | High | High |
| Number of clusters | Must be specified | Must be specified (can use BIC/AIC) |
K-means can be viewed as a special case of GMM where each Gaussian component has equal, spherical covariance. GMMs are more flexible but also more computationally expensive and prone to overfitting with insufficient data.
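For comparison, a GMM can be fitted with scikit-learn's GaussianMixture; the covariance type and number of components below are illustrative:

```python
from sklearn.mixture import GaussianMixture

# 'full' covariance allows elliptical clusters; spherical covariance behaves
# more like k-means
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X_scaled)
hard_labels = gmm.predict(X_scaled)        # hard assignments, comparable to k-means labels
soft_probs = gmm.predict_proba(X_scaled)   # soft assignments: one probability per cluster
```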
Hierarchical clustering builds a tree (dendrogram) of nested cluster merges or splits, without requiring a predefined number of clusters. It is useful for understanding the hierarchical structure of the data but does not scale well to large datasets (O(n^2) space and O(n^3) time for agglomerative methods). K-means is much faster and scales to larger datasets, but it does not reveal the nested structure that hierarchical clustering provides.
| Aspect | K-means | DBSCAN | GMM | Hierarchical |
|---|---|---|---|---|
| Must specify k | Yes | No | Yes | No (cut dendrogram) |
| Cluster shape | Spherical | Arbitrary | Elliptical | Arbitrary |
| Outlier handling | None | Built-in | Possible via low probability | None |
| Scalability | O(nkd) per iteration | O(n log n) with index | O(nkd) to O(nkd^2) per EM step | O(n^2) to O(n^3) |
| Soft assignments | No | No | Yes | No |
| Best use case | Large data, spherical clusters | Irregular shapes, noise | Overlapping clusters | Small data, hierarchy exploration |
K-means clustering is used across many domains.
Businesses use k-means to group customers based on purchasing behavior, demographics, website interactions, and other attributes. By identifying distinct customer segments, companies can tailor marketing strategies, personalize recommendations, and optimize resource allocation. A typical segmentation might identify groups such as high-value loyal customers, bargain shoppers, and new prospects.
K-means is widely used for color quantization, where the goal is to reduce the number of distinct colors in an image while preserving visual quality. Each pixel's RGB values are treated as a 3-dimensional data point, and k-means groups similar colors together. The centroid of each cluster becomes one of the k representative colors in the compressed palette. For example, reducing a photograph from 96,000+ unique colors to just 64 colors using k-means produces an image that looks nearly identical to the original while requiring far less storage. A single byte can address up to 256 palette colors, compared to the 3 bytes per pixel required for full RGB encoding.
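A sketch of color quantization with scikit-learn, assuming image is an (H, W, 3) RGB array loaded elsewhere:

```python
import numpy as np
from sklearn.cluster import KMeans

# Treat every pixel as a 3-dimensional point in RGB space
pixels = image.reshape(-1, 3).astype(float)

# Cluster the pixel colors into a 64-color palette
kmeans = KMeans(n_clusters=64, n_init=4, random_state=42).fit(pixels)
palette = kmeans.cluster_centers_.round().astype(np.uint8)

# Rebuild the image by replacing each pixel with its cluster's representative color
quantized = palette[kmeans.labels_].reshape(image.shape)
```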
In natural language processing, k-means is used to group documents by topic. Documents are represented as vectors using techniques like TF-IDF or word embeddings, and k-means clusters them based on content similarity. Spherical k-means (using cosine distance) is often preferred for this task because text vectors are typically high-dimensional and sparse.
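A sketch of TF-IDF document clustering, assuming documents is a list of raw text strings. Because scikit-learn's TF-IDF rows are L2-normalized by default, squared Euclidean distance on them is a monotone function of cosine similarity, so standard k-means here approximates spherical k-means:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Sparse TF-IDF matrix with unit-length rows (norm='l2' is the default)
tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
X_text = tfidf.fit_transform(documents)

km = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_text)
print(km.labels_[:10])   # cluster label for the first ten documents
```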
K-means can be used for anomaly detection by treating data points that are far from all cluster centroids as potential anomalies. After clustering, points with a distance to their nearest centroid that exceeds a threshold (often set as a multiple of the standard deviation) are flagged as outliers. This approach is used in fraud detection, network intrusion detection, and equipment failure monitoring.
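One possible implementation of this thresholding rule; the choice of five clusters and the three-standard-deviation cutoff are heuristics rather than fixed rules, and X_scaled is assumed to be a preprocessed feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

# Distance from each point to the centroid of its assigned cluster
dist_to_centroid = np.linalg.norm(
    X_scaled - kmeans.cluster_centers_[kmeans.labels_], axis=1
)

# Flag points whose distance exceeds mean + 3 standard deviations
threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
anomalies = np.where(dist_to_centroid > threshold)[0]
```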
| Application | Description | Typical k |
|---|---|---|
| Gene expression analysis | Group genes with similar expression patterns | 5 to 20 |
| Geographic analysis | Identify spatial clusters of events or facilities | Varies |
| Recommendation systems | Group similar users or items for collaborative filtering | 10 to 100 |
| Network traffic analysis | Group similar network traffic patterns to detect anomalies | 3 to 20 |
| Market basket analysis | Identify groups of products frequently purchased together | 5 to 15 |
| Medical imaging | Segment tissues or structures in MRI and CT scans | 3 to 10 |
| Search engine result grouping | Cluster search results into topical groups | 3 to 8 |
Scikit-learn provides a robust and well-optimized implementation of k-means.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Scale features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print("Inertia (WCSS):", kmeans.inertia_)
print("Silhouette Score:", silhouette_score(X_scaled, labels))
```
For large datasets, MiniBatchKMeans provides a faster alternative:
```python
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10)
mbk.fit(X_scaled)
```
Key parameters in scikit-learn's KMeans include:
| Parameter | Default | Description |
|---|---|---|
| n_clusters | 8 | Number of clusters (k) |
| init | 'k-means++' | Initialization method |
| n_init | 10 | Number of times to run with different initializations |
| max_iter | 300 | Maximum number of iterations per run |
| tol | 1e-4 | Convergence tolerance based on inertia change |
| algorithm | 'lloyd' | Algorithm to use ('lloyd' or 'elkan') |
The Elkan algorithm (Elkan, 2003) is an accelerated variant that uses the triangle inequality to skip unnecessary distance calculations. It can be significantly faster than Lloyd's algorithm when the number of clusters is large, though it requires additional memory to store inter-centroid distances.
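Selecting the Elkan variant in scikit-learn is a one-line change; the clustering result matches Lloyd's algorithm, only redundant distance computations are skipped:

```python
from sklearn.cluster import KMeans

# Elkan tends to pay off when the number of clusters is large
kmeans = KMeans(n_clusters=50, algorithm="elkan", n_init=10, random_state=42)
kmeans.fit(X_scaled)
```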
K-means occupies a central position in the landscape of unsupervised learning algorithms. It can be viewed as a special case of the Expectation-Maximization (EM) algorithm applied to a mixture of spherical Gaussians with equal covariance. In this interpretation, the assignment step corresponds to the E-step (computing posterior responsibilities) and the update step corresponds to the M-step (re-estimating parameters). The difference is that k-means uses hard assignments (each point to one cluster), while EM uses soft probabilistic assignments.
K-means is also closely related to principal component analysis (PCA). Research by Ding and He (2004) showed that the principal components are the continuous solutions to a relaxed version of the cluster-membership indicators in the k-means objective. This relationship suggests that applying PCA before k-means is a natural preprocessing step that often preserves the cluster structure while reducing noise and computation time.
The Voronoi tessellation is another related concept. The assignment step of k-means partitions the feature space into Voronoi cells, where each cell contains all points closest to a given centroid. This geometric perspective helps explain why k-means produces convex cluster boundaries.
The history of k-means involves several independent discoveries across different disciplines:
| Year | Researcher | Contribution |
|---|---|---|
| 1956 | Hugo Steinhaus | Proposed a similar partitioning idea in the context of mathematical optimization |
| 1957 | Stuart Lloyd | Developed the iterative algorithm for pulse-code modulation at Bell Labs (unpublished until 1982) |
| 1965 | Edward Forgy | Published a similar method independently |
| 1967 | James MacQueen | Published the algorithm and coined the term "k-means" |
| 1973 | J.C. Dunn | Introduced fuzzy c-means clustering |
| 1981 | J.C. Bezdek | Generalized fuzzy c-means |
| 2003 | Charles Elkan | Proposed an accelerated k-means using the triangle inequality |
| 2007 | Arthur and Vassilvitskii | Introduced k-means++ initialization with provable guarantees |
| 2010 | David Sculley | Published mini-batch k-means for web-scale applications |
| 2012 | Bahmani et al. | Proposed k-means\|\|, a scalable parallel variant of k-means++ initialization |
The convergence of these independent discoveries reflects the fundamental nature of the problem: partitioning data into groups by proximity is a natural and broadly applicable idea that arises in many different fields.
Imagine you have a bunch of different colored balls on the floor, and you want to group them by color. The k-means algorithm is like a smart robot that helps you do this. First, you tell the robot how many color groups (clusters) you want, and it randomly picks a spot for each group.
Then, the robot looks at each ball and puts it in the group with the closest matching spot. After all the balls are in groups, the robot moves each spot to the middle of the balls in that group. It repeats this process (look at each ball, put it in the closest group, then move the spots to the middle) until the groups don't change anymore. At the end, you have all your balls neatly grouped by color.
The tricky part is that you have to tell the robot how many groups to make before it starts. If you say 3 but there are really 5 colors, the robot will do its best, but some colors will be lumped together. That is why picking the right number of groups is important.