# K-Means

> Source: https://aiwiki.ai/wiki/k-means
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What is k-means clustering?

**K-means** is an [unsupervised machine learning](/wiki/unsupervised_machine_learning) [clustering](/wiki/clustering) algorithm that partitions a dataset into **k** distinct, non-overlapping clusters by repeatedly assigning each data point to its nearest cluster center and then recomputing each center as the mean of its assigned points. The algorithm minimizes the within-cluster sum of squared distances between the data points and the [centroid](/wiki/centroid)s of their corresponding clusters. It is one of the most widely used clustering methods in practice, applied to problems ranging from [pattern recognition](/wiki/pattern_recognition) and [image segmentation](/wiki/image_segmentation) to customer segmentation and [anomaly detection](/wiki/anomaly_detection). A 2006 IEEE International Conference on Data Mining (ICDM) panel ranked k-means the second most influential algorithm in data mining, behind only the C4.5 decision tree.[10]

In the field of [machine learning](/wiki/machine_learning) and data analysis, k-means was first described by Stuart Lloyd in 1957 as a technique for pulse-code modulation at Bell Labs, though his paper was not published externally until 1982 in IEEE Transactions on Information Theory, volume 28(2), pages 129-137.[1] The algorithm was independently discovered and published by James MacQueen in 1967, who gave it the name "k-means."[2] Hugo Steinhaus had proposed a similar idea even earlier, in 1956, making him the first researcher to explicitly describe an algorithm for partitioning multidimensional instances.[6] Edward Forgy also described a similar method in 1965.[13] Because of this complex history, the standard k-means algorithm is sometimes referred to as **Lloyd's algorithm** or the **Lloyd-Forgy algorithm**.

Despite its simplicity, k-means remains one of the top ten algorithms in data mining, according to a well-known survey by Wu et al. (2008).[10] Its popularity stems from its ease of implementation, interpretability, and computational efficiency. The algorithm works best when clusters are roughly spherical, similarly sized, and well separated in the [feature](/wiki/feature) space.

## How does Lloyd's algorithm work?

The standard k-means algorithm, known as Lloyd's algorithm, is an iterative procedure that alternates between two steps: assigning data points to clusters and updating cluster centroids.[1] The algorithm is guaranteed to converge, though it may converge to a local minimum rather than the global minimum. As Arthur and Vassilvitskii summarized in 2007, "twenty-five years ago, Lloyd proposed a local search solution that is still very widely used today."[3]

### Formal definition

Given a dataset of n observations, each represented as a d-dimensional vector, the k-means objective is to partition the n observations into k sets (S1, S2, ..., Sk) so as to minimize the **within-cluster sum of squares** (WCSS), also called inertia:

$$
\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
$$

where $$\mu_i$$ is the centroid (mean) of the points in cluster $$S_i$$ and $$\lVert \cdot \rVert$$ denotes the Euclidean distance.

Finding the globally optimal solution to the k-means objective is NP-hard in general (as proved by Aloise et al., 2009), which is why the algorithm relies on the iterative heuristic described below.[7] Even for two clusters in two dimensions, the problem is NP-hard, although in practice Lloyd's algorithm finds good solutions quickly. The k-means++ authors put it directly: "Solving this problem exactly is NP-hard, even with just two clusters."[3]

### Algorithm steps

The k-means algorithm proceeds through the following steps:

1. **Initialization:** Select k initial centroids (see initialization methods below).
2. **Assignment step:** Assign each data point to the nearest centroid, forming k clusters.
3. **Update step:** Recalculate each centroid as the mean of all data points assigned to its cluster.
4. **Convergence check:** If the centroids have not changed (or the change is below a tolerance threshold), stop. Otherwise, return to step 2.

### Assignment step

Each data point is assigned to the nearest centroid, forming a cluster. The distance between a data point and a centroid is typically calculated using the Euclidean distance:

$$
d(x, \mu) = \sqrt{\sum_{j=1}^{d} (x_j - \mu_j)^2}
$$

Formally, each point x is assigned to the cluster whose centroid is closest:

$$
C(x) = \arg\min_i \lVert x - \mu_i \rVert^2
$$

Other distance metrics can be used in variants of k-means. For example, the Manhattan distance (L1 norm) is sometimes used when features represent different measurement scales, and cosine distance is preferred for text data and other high-dimensional sparse representations.

### Update step

The centroids are recalculated by taking the mean of all data points assigned to their respective clusters:

$$
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
$$

This step ensures that the centroids are updated to the center of the new clusters formed in the assignment step.

### Convergence

The assignment and update steps are repeated until the centroids' positions no longer change significantly or a predetermined number of iterations is reached. When convergence is achieved, the algorithm terminates, and the final cluster assignments are returned.

The WCSS objective decreases (or stays the same) with each iteration of the algorithm. Since there are only finitely many possible partitions of the data, the algorithm is guaranteed to converge in a finite number of steps. In practice, convergence typically occurs within 20 to 50 iterations for most datasets. However, worst-case examples exist where Lloyd's algorithm requires an exponential number of iterations to converge: Vattani (2011) constructed a set of points in the plane that forces $$2^{\Omega(n)}$$ iterations, confirming an earlier conjecture by Arthur and Vassilvitskii, though such cases are rare in practice.[14]

### Computational complexity

The time complexity of a single iteration of Lloyd's algorithm is $$O(nkd)$$, where n is the number of data points, k is the number of clusters, and d is the number of dimensions. If the algorithm runs for T iterations, the total time complexity is $$O(Tnkd)$$. The space complexity is $$O(nd + kd)$$, since the algorithm needs to store the data matrix and the k centroids.

In the worst case, the number of iterations T can be superpolynomial, giving a worst-case complexity of $$O(n^{k + 2/p})$$ where p is the number of features. In practice, however, k-means is considered one of the fastest clustering algorithms, making it a practical first choice even for datasets with millions of points.

## Initialization methods

The choice of initial centroids has a significant effect on the final clustering result. Poor initialization can lead to suboptimal solutions or slow convergence.

### Random initialization

The simplest initialization strategy randomly selects k observations from the dataset as initial centroids. While easy to implement, this approach is highly sensitive to the random seed and can produce poor results if the initial centroids happen to be close together or placed in sparse regions of the data.

A common mitigation strategy is to run k-means multiple times with different random initializations and keep the result with the lowest WCSS. Scikit-learn runs the algorithm 10 times by default when initialization is random (controlled by the `n_init` parameter).

### What is k-means++ initialization?

**K-means++** is a smarter initialization method proposed by David Arthur and Sergei Vassilvitskii in 2007 at the ACM-SIAM Symposium on Discrete Algorithms (SODA).[3] It selects initial centroids that are spread out across the data, leading to better and more consistent clustering results.

The k-means++ initialization procedure is:

1. Choose the first centroid uniformly at random from the data points.
2. For each remaining data point, compute the distance D(x) to the nearest already-chosen centroid.
3. Choose the next centroid from the data points with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until all k centroids have been chosen.

By selecting centroids that are far apart, k-means++ avoids the problem of placing multiple centroids in the same cluster region. Arthur and Vassilvitskii proved that k-means++ achieves an expected WCSS that is at most O(log k) times the optimal WCSS, providing a theoretical guarantee on the quality of the initialization.[3] Their Theorem 1.1 states the explicit bound $$\mathbb{E}[\phi] \le 8(\ln k + 2)\, \phi_{\text{OPT}}$$, where $$\phi$$ is the resulting potential and $$\phi_{\text{OPT}}$$ is the optimum.[3] As the paper's abstract puts it, "By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Theta(log k)-competitive with the optimal clustering."[3] This was the first initialization scheme for k-means with a provable approximation bound.

K-means++ is the default initialization method in [scikit-learn](/wiki/scikit-learn) and most other modern implementations.

### K-means|| (scalable k-means++)

K-means|| (pronounced "k-means parallel") was proposed by Bahmani et al. (2012) as a scalable version of k-means++ suitable for distributed computing environments.[12] Instead of selecting one centroid at a time (which requires a sequential pass over the data), k-means|| oversamples multiple candidate centroids in each round. After a small number of rounds (typically O(log n)), the oversampled candidates are reduced to k centroids using a weighted clustering step. K-means|| achieves similar quality to k-means++ with far fewer passes over the data, making it practical for very large distributed datasets.

### Comparison of initialization methods

| Method | Description | Pros | Cons |
|---|---|---|---|
| Random | Select k random data points | Simple, fast | Unstable results |
| K-means++ | Spread-out probabilistic selection | Provable O(log k) guarantee, stable | Slightly slower initialization (sequential) |
| K-means|| | Parallelized version of k-means++ | Scalable to large distributed datasets | More complex implementation |
| Forgy | Random assignment then compute centroids | Simple | Can produce poor initial centroids |
| Random partition | Randomly assign points to k clusters, then compute centroids | Centroids near data center | Tends to produce similar initial centroids |

## How do you choose the optimal number of clusters?

One challenge with k-means is selecting the appropriate number of clusters (k). Several methods can be used to estimate the optimal value of k.

### Elbow method

The **elbow method** involves running k-means for a range of k values (for example, from 1 to 10) and plotting the WCSS (inertia) against k. As k increases, WCSS decreases because each data point gets closer to its nearest centroid. The "elbow" of the curve, where the rate of decrease sharply levels off, suggests the optimal number of clusters.

The intuition is that adding more clusters beyond the elbow point provides diminishing returns in terms of explaining the variance in the data. However, the elbow is not always clearly defined, making this method somewhat subjective. In practice, it is best used as a rough guide rather than a definitive answer.

### Silhouette score

The **silhouette score** provides a more rigorous measure of clustering quality. For each data point i, the silhouette coefficient is:

$$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$$

where:
- a(i) is the average distance from point i to all other points in the same cluster (cohesion).
- b(i) is the average distance from point i to all points in the nearest neighboring cluster (separation).

The silhouette coefficient ranges from -1 to +1:

| Score range | Interpretation |
|---|---|
| Close to +1 | Point is well-matched to its own cluster and poorly matched to neighboring clusters |
| Close to 0 | Point is on or near the boundary between two clusters |
| Close to -1 | Point may have been assigned to the wrong cluster |

The overall silhouette score is the mean of the silhouette coefficients across all data points. The optimal k is typically the one that maximizes the average silhouette score.

### Gap statistic

The **gap statistic**, introduced by Tibshirani, Walther, and Hastie in 2001, compares the WCSS of the observed data with the WCSS expected under a null reference distribution (typically a uniform distribution over the data range).[4] The optimal k is the value where the gap between the observed and expected WCSS is largest. The gap statistic formalizes the elbow method by providing a statistical criterion rather than a visual one.

### Information criteria (BIC and AIC)

The **Bayesian Information Criterion** (BIC) and **Akaike Information Criterion** (AIC) can be used to select k when the clustering model is interpreted probabilistically. These criteria balance model fit against model complexity: BIC penalizes the number of parameters more heavily than AIC, often favoring simpler models. They are most commonly applied with [Gaussian mixture models](/wiki/gaussian_mixture_model), but can also be adapted to k-means by treating each cluster as a spherical Gaussian.

### Comparison of methods for choosing k

| Method | Based on | Strengths | Weaknesses |
|---|---|---|---|
| Elbow method | WCSS (inertia) | Simple, intuitive | Subjective; elbow not always clear |
| Silhouette score | Cohesion and separation | Principled; considers cluster shape | Computationally expensive for large n |
| Gap statistic | Comparison to null distribution | Statistically grounded | Complex to implement, slow |
| BIC/AIC | Model likelihood | Works well with GMMs; penalizes complexity | Assumes a specific probabilistic model |

## Variants and extensions

Several variants of the k-means algorithm have been developed to address its limitations.

### Mini-batch k-means

**Mini-batch k-means** is a variant designed for large datasets. Instead of using the entire dataset in each iteration, mini-batch k-means randomly samples a small batch of data points (typically 100 to 1,000) and performs the assignment and update steps on just that batch.

The update rule for mini-batch k-means uses a learning rate that decreases over time, allowing the centroids to gradually converge:

$$
\mu_i = (1 - \eta)\, \mu_i + \eta\, x
$$

where $$\eta$$ is the learning rate for the current iteration.

Mini-batch k-means was popularized by David Sculley in a 2010 paper for web-scale clustering applications.[5] Sculley reported that the mini-batch approach "reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent."[5] It runs significantly faster than standard k-means, often producing results that are only slightly worse in terms of WCSS. It is particularly useful when the dataset is too large to fit in memory or when speed is a priority.

| Aspect | Standard k-means | Mini-batch k-means |
|---|---|---|
| Data used per iteration | Entire dataset | Small random batch |
| Speed | Slower for large datasets | Much faster |
| Result quality | Optimal (for given initialization) | Slightly worse WCSS |
| Memory usage | Requires full dataset in memory | Low memory footprint |
| Best for | Small to medium datasets | Large datasets, streaming data |

### K-medoids (PAM)

K-medoids, also known as Partitioning Around Medoids (PAM), uses actual data points as cluster centers (called medoids) instead of means. Because medoids are real data points, the algorithm is more robust to outliers than standard k-means. The objective function minimizes the sum of dissimilarities between each point and its medoid, and it can work with any distance metric, not just Euclidean distance. The main drawback is higher computational cost: PAM has $$O(k(n - k)^2)$$ complexity per iteration, which makes it impractical for very large datasets. The CLARA and CLARANS algorithms were developed as scalable approximations of PAM.

### [K-median](/wiki/k-median) clustering

K-median clustering replaces the mean with the median when computing cluster centers, and uses the Manhattan distance (L1 norm) instead of Euclidean distance. This makes the algorithm more robust to outliers, since the median is less affected by extreme values than the mean. K-median is equivalent to k-means with the L1 distance metric.

### Bisecting k-means

Bisecting k-means is a divisive hierarchical clustering approach. It starts with all data in a single cluster and iteratively splits the cluster with the largest WCSS (or the largest number of points) into two sub-clusters using standard k-means with k=2. This process repeats until the desired number of clusters is reached. Bisecting k-means often produces more balanced cluster sizes than standard k-means and can be more stable with respect to initialization, since only binary splits are performed at each step.

### Kernel k-means

Kernel k-means addresses the limitation that standard k-means can only find linearly separable (convex) clusters. It applies the kernel trick to implicitly map data points into a higher-dimensional feature space where the clusters become linearly separable. In this feature space, the standard k-means algorithm is applied. The key insight is that the algorithm can be rewritten so that it only requires pairwise dot products between data points, which can be replaced by a kernel function (such as the radial basis function or polynomial kernel). Kernel k-means is closely related to spectral clustering (Dhillon et al., 2004).[11] The main disadvantage is that it requires computing and storing an n-by-n kernel matrix, which limits scalability.

### Fuzzy c-means

Fuzzy c-means (FCM) allows soft cluster assignments, where each data point has a degree of membership in each cluster rather than belonging to exactly one cluster. The membership values range from 0 to 1, and for each point they sum to 1 across all clusters. A fuzziness parameter m (typically set to 2) controls how soft or hard the assignments are. When m approaches 1, fuzzy c-means converges to hard k-means. FCM was introduced by Dunn in 1973 and improved by Bezdek in 1981.[15] It is widely used in image processing and pattern recognition where cluster boundaries are not well defined.

### Other variants

| Variant | Key difference from standard k-means |
|---|---|
| Spherical k-means | Uses cosine similarity instead of Euclidean distance; suited for text data and high-dimensional sparse data |
| K-modes | Designed for categorical data; uses the mode instead of the mean and a matching dissimilarity measure |
| K-prototypes | Combines k-means and k-modes to handle datasets with both numerical and categorical features |
| X-means | Automatically determines k by starting with k=2 and repeatedly splitting clusters, using BIC to decide whether to accept each split |
| G-means | Similar to x-means, but tests each cluster for Gaussianity using the Anderson-Darling test to decide whether to split |

## Feature scaling and preprocessing

Before running k-means, several preprocessing steps can improve results significantly.

### Feature scaling

Because k-means uses Euclidean distance, features with larger scales dominate the distance calculation. [Normalization](/wiki/normalization) or standardization of features is critical for meaningful results. Standardizing features to zero mean and unit variance (z-score normalization) is the most common approach. Min-max scaling to the [0, 1] range is another option. Without proper scaling, a feature measured in thousands (such as salary) will completely overshadow a feature measured in single digits (such as age), leading to clusters that are determined almost entirely by the high-magnitude feature.

### [Dimension reduction](/wiki/dimension_reduction)

When the dataset has many features, applying PCA or other dimensionality reduction techniques before clustering can reduce noise and speed up computation. Research by Ding and He (2004) showed that the cluster indicators produced by k-means are closely related to the principal components of the data, providing theoretical justification for using PCA as a preprocessing step.[9]

### Outlier handling

Because k-means is sensitive to outliers (outliers can pull centroids away from the true cluster centers), removing extreme values before clustering can produce cleaner results. Alternatively, using a more robust variant such as k-medoids can reduce the influence of outliers without requiring their explicit removal.

## Advantages and disadvantages

### Advantages

- K-means is simple to implement and computationally efficient, making it suitable for large datasets with millions of data points.
- It is guaranteed to converge, although the solution may be a local minimum rather than the global minimum.
- The algorithm is highly adaptable and can be easily modified to incorporate additional constraints or requirements.
- Results are easy to interpret: each data point belongs to exactly one cluster, and each cluster is represented by its centroid.
- K-means scales well to high-dimensional data, especially when k is small relative to n.
- Extensive library support across all major programming languages and frameworks.

### Disadvantages

- The user must specify the number of clusters (k) beforehand, which may not always be known or easily determined.
- The algorithm is sensitive to the initial centroid positions, which can affect the quality of the final clustering. Multiple restarts partially mitigate this issue.
- K-means assumes that clusters are **spherical** (isotropic) and **equally sized**, which may not be true for many real-world datasets.
- The algorithm is sensitive to **outliers**, as outliers can pull centroids away from the true cluster centers.
- K-means uses Euclidean distance by default, which may not be appropriate for all data types (for example, categorical data or data with mixed feature types).
- The algorithm can only find convex cluster boundaries and struggles with elongated, irregular, or overlapping clusters.
- K-means does not provide a built-in measure of cluster uncertainty or probability of membership (unlike Gaussian mixture models).

## How does k-means compare with other clustering methods?

K-means is just one of many clustering algorithms. Understanding its strengths and weaknesses relative to alternatives helps practitioners choose the right method for their data.

### K-means vs. DBSCAN

[DBSCAN](/wiki/dbscan) (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together points in high-density regions and marks points in low-density regions as outliers.

| Aspect | K-means | DBSCAN |
|---|---|---|
| Number of clusters | Must be specified (k) | Determined automatically |
| Cluster shape | Spherical (convex) | Arbitrary shapes |
| Outlier handling | No built-in mechanism | Labels outliers as noise |
| Parameters | k (number of clusters) | eps (neighborhood radius), min_samples |
| Scalability | Very scalable | Moderate (can struggle with varying densities) |
| Cluster sizes | Tends toward equal-sized clusters | Can find clusters of varying sizes |

DBSCAN is preferred when clusters have irregular shapes or when the data contains outliers. K-means is preferred when clusters are approximately spherical and the number of clusters is known.

### K-means vs. Gaussian mixture models (GMM)

[Gaussian mixture models](/wiki/gaussian_mixture_model) (GMMs) are a probabilistic clustering method that models the data as a mixture of Gaussian distributions. GMMs are fitted using the [Expectation-Maximization](/wiki/expectation_maximization) (EM) algorithm.

| Aspect | K-means | GMM |
|---|---|---|
| Cluster assignment | Hard (each point to exactly one cluster) | Soft (probability of belonging to each cluster) |
| Cluster shape | Spherical only | Elliptical (flexible covariance) |
| Output | Cluster labels | Cluster probabilities |
| Speed | Faster | Slower |
| Sensitivity to initialization | High | High |
| Number of clusters | Must be specified | Must be specified (can use BIC/AIC) |

K-means can be viewed as a special case of GMM where each Gaussian component has equal, spherical covariance. GMMs are more flexible but also more computationally expensive and prone to overfitting with insufficient data.

### K-means vs. hierarchical clustering

[Hierarchical clustering](/wiki/hierarchical_clustering) builds a tree (dendrogram) of nested cluster merges or splits, without requiring a predefined number of clusters. It is useful for understanding the hierarchical structure of the data but does not scale well to large datasets ($$O(n^2)$$ space and $$O(n^3)$$ time for agglomerative methods). K-means is much faster and scales to larger datasets, but it does not reveal the nested structure that hierarchical clustering provides.

### Summary comparison

| Aspect | K-means | DBSCAN | GMM | Hierarchical |
|---|---|---|---|---|
| Must specify k | Yes | No | Yes | No (cut dendrogram) |
| Cluster shape | Spherical | Arbitrary | Elliptical | Arbitrary |
| Outlier handling | None | Built-in | Possible via low probability | None |
| Scalability | $$O(nkd)$$ per iteration | $$O(n \log n)$$ with index | $$O(nk)$$ per EM step | $$O(n^2)$$ to $$O(n^3)$$ |
| Soft assignments | No | No | Yes | No |
| Best use case | Large data, spherical clusters | Irregular shapes, noise | Overlapping clusters | Small data, hierarchy exploration |

## What is k-means used for?

K-means clustering is used across many domains.

### Customer segmentation

Businesses use k-means to group customers based on purchasing behavior, demographics, website interactions, and other attributes. By identifying distinct customer segments, companies can tailor marketing strategies, personalize recommendations, and optimize resource allocation. A typical segmentation might identify groups such as high-value loyal customers, bargain shoppers, and new prospects.

### Image compression and color quantization

K-means is widely used for color quantization, where the goal is to reduce the number of distinct colors in an image while preserving visual quality. Each pixel's RGB values are treated as a 3-dimensional data point, and k-means groups similar colors together. The centroid of each cluster becomes one of the k representative colors in the compressed palette. In a canonical scikit-learn example, k-means reduces a photograph of the Summer Palace from 96,615 unique colors to just 64 colors while preserving the overall appearance of the image.[16] A single byte can address up to 256 palette colors, compared to the 3 bytes per pixel required for full RGB encoding.

### Document clustering

In natural language processing, k-means is used to group documents by topic. Documents are represented as vectors using techniques like TF-IDF or word embeddings, and k-means clusters them based on content similarity. Spherical k-means (using cosine distance) is often preferred for this task because text vectors are typically high-dimensional and sparse.

### Anomaly detection

K-means can be used for anomaly detection by treating data points that are far from all cluster centroids as potential anomalies. After clustering, points with a distance to their nearest centroid that exceeds a threshold (often set as a multiple of the standard deviation) are flagged as outliers. This approach is used in fraud detection, network intrusion detection, and equipment failure monitoring.

### Other applications

| Application | Description | Typical k |
|---|---|---|
| Gene expression analysis | Group genes with similar expression patterns | 5 to 20 |
| Geographic analysis | Identify spatial clusters of events or facilities | Varies |
| Recommendation systems | Group similar users or items for collaborative filtering | 10 to 100 |
| Network traffic analysis | Group similar network traffic patterns to detect anomalies | 3 to 20 |
| Market basket analysis | Identify groups of products frequently purchased together | 5 to 15 |
| Medical imaging | Segment tissues or structures in MRI and CT scans | 3 to 10 |
| Search engine result grouping | Cluster search results into topical groups | 3 to 8 |

## How is k-means implemented in scikit-learn?

Scikit-learn provides a robust and well-optimized implementation of k-means.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Scale features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print("Inertia (WCSS):", kmeans.inertia_)
print("Silhouette Score:", silhouette_score(X_scaled, labels))
```

For large datasets, `MiniBatchKMeans` provides a faster alternative:

```python
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10)
mbk.fit(X_scaled)
```

Key parameters in scikit-learn's KMeans include:

| Parameter | Default | Description |
|---|---|---|
| n_clusters | 8 | Number of clusters (k) |
| init | 'k-means++' | Initialization method |
| n_init | 'auto' | Number of times to run with different initializations |
| max_iter | 300 | Maximum number of iterations per run |
| tol | 1e-4 | Convergence tolerance based on inertia change |
| algorithm | 'lloyd' | Algorithm to use ('lloyd' or 'elkan') |

Note that the default value of `n_init` changed to `'auto'` in scikit-learn version 1.4.[17] With the default `init='k-means++'`, the `'auto'` setting runs a single initialization, whereas it runs 10 initializations when `init='random'`. Earlier versions defaulted to running 10 initializations regardless of the initialization method, so code written for older versions may behave differently after upgrading.

The Elkan algorithm (Elkan, 2003) is an accelerated variant that uses the triangle inequality to skip unnecessary distance calculations.[8] Elkan showed how to "accelerate k-means dramatically while computing exactly the same result as the standard algorithm," and reported that the method is effective for datasets with up to 1,000 dimensions and becomes more effective as k increases.[8] It requires additional memory to store inter-centroid distances.

## Relationship to other algorithms

K-means occupies a central position in the landscape of unsupervised learning algorithms. It can be viewed as a special case of the Expectation-Maximization (EM) algorithm applied to a mixture of spherical Gaussians with equal covariance. In this interpretation, the assignment step corresponds to the E-step (computing posterior responsibilities) and the update step corresponds to the M-step (re-estimating parameters). The difference is that k-means uses hard assignments (each point to one cluster), while EM uses soft probabilistic assignments.

K-means is also closely related to principal component analysis (PCA). Research by Ding and He (2004) showed that the cluster indicators produced by k-means are the solutions to a relaxed version of the k-means problem that is equivalent to PCA.[9] This relationship suggests that applying PCA before k-means is a natural preprocessing step that often preserves the cluster structure while reducing noise and computation time.

The Voronoi tessellation is another related concept. The assignment step of k-means partitions the feature space into Voronoi cells, where each cell contains all points closest to a given centroid. This geometric perspective helps explain why k-means produces convex cluster boundaries.

## When was k-means invented? (Historical context)

The history of k-means involves several independent discoveries across different disciplines:

| Year | Researcher | Contribution |
|---|---|---|
| 1956 | Hugo Steinhaus | Proposed a similar partitioning idea in the context of mathematical optimization |
| 1957 | Stuart Lloyd | Developed the iterative algorithm for pulse-code modulation at Bell Labs (unpublished until 1982) |
| 1965 | Edward Forgy | Published a similar method independently |
| 1967 | James MacQueen | Published the algorithm and coined the term "k-means" |
| 1973 | J.C. Dunn | Introduced fuzzy c-means clustering |
| 1981 | J.C. Bezdek | Generalized fuzzy c-means |
| 2003 | Charles Elkan | Proposed an accelerated k-means using the triangle inequality |
| 2007 | Arthur and Vassilvitskii | Introduced k-means++ initialization with provable guarantees |
| 2010 | David Sculley | Published mini-batch k-means for web-scale applications |
| 2012 | Bahmani et al. | Proposed k-means|| for scalable parallel initialization |

The convergence of these independent discoveries reflects the fundamental nature of the problem: partitioning data into groups by proximity is a natural and broadly applicable idea that arises in many different fields.

## Explain like I'm 5 (ELI5)

Imagine you have a bunch of different colored balls on the floor, and you want to group them by color. The k-means algorithm is like a smart robot that helps you do this. First, you tell the robot how many color groups (clusters) you want, and it randomly picks a spot for each group.

Then, the robot looks at each ball and puts it in the group with the closest matching spot. After all the balls are in groups, the robot moves each spot to the middle of the balls in that group. It repeats this process (look at each ball, put it in the closest group, then move the spots to the middle) until the groups don't change anymore. At the end, you have all your balls neatly grouped by color.

The tricky part is that you have to tell the robot how many groups to make before it starts. If you say 3 but there are really 5 colors, the robot will do its best, but some colors will be lumped together. That is why picking the right number of groups is important.

## References

1. Lloyd, S. P. (1982). "Least squares quantization in PCM." *IEEE Transactions on Information Theory*, 28(2), 129-137. (Originally an internal Bell Labs report from 1957.)
2. MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations." *Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability*, 1, 281-297.
3. Arthur, D., & Vassilvitskii, S. (2007). "K-means++: The Advantages of Careful Seeding." *Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms*, 1027-1035.
4. Tibshirani, R., Walther, G., & Hastie, T. (2001). "Estimating the number of clusters in a data set via the gap statistic." *Journal of the Royal Statistical Society, Series B*, 63(2), 411-423.
5. Sculley, D. (2010). "Web-scale k-means clustering." *Proceedings of the 19th International Conference on World Wide Web*, 1177-1178.
6. Steinhaus, H. (1956). "Sur la division des corps materiels en parties." *Bulletin de l'Academie Polonaise des Sciences*, 4(12), 801-804.
7. Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). "NP-hardness of Euclidean sum-of-squares clustering." *Machine Learning*, 75(2), 245-248.
8. Elkan, C. (2003). "Using the triangle inequality to accelerate k-means." *Proceedings of the 20th International Conference on Machine Learning*, 147-153.
9. Ding, C., & He, X. (2004). "K-means clustering via principal component analysis." *Proceedings of the 21st International Conference on Machine Learning*, 225-232.
10. Wu, X. et al. (2008). "Top 10 algorithms in data mining." *Knowledge and Information Systems*, 14(1), 1-37.
11. Dhillon, I. S., Guan, Y., & Kulis, B. (2004). "Kernel k-means, spectral clustering, and normalized cuts." *Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 551-556.
12. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). "Scalable K-Means++." *Proceedings of the VLDB Endowment*, 5(7), 622-633.
13. Forgy, E. W. (1965). "Cluster analysis of multivariate data: efficiency versus interpretability of classifications." *Biometrics*, 21(3), 768-769.
14. Vattani, A. (2011). "k-means Requires Exponentially Many Iterations Even in the Plane." *Discrete & Computational Geometry*, 45(4), 596-616.
15. Bezdek, J. C. (1981). *Pattern Recognition with Fuzzy Objective Function Algorithms*. Plenum Press. (Building on Dunn, J. C. (1973). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters." *Journal of Cybernetics*, 3(3), 32-57.)
16. scikit-learn developers. "Color Quantization using K-Means." scikit-learn documentation, example gallery. https://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html
17. scikit-learn developers. "sklearn.cluster.KMeans." scikit-learn API documentation (n_init default changed to 'auto' in version 1.4). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html