K-median clustering is a partitioning-based clustering technique in unsupervised learning that divides a dataset of n objects into k groups by minimizing the sum of distances between each data point and the median of its assigned cluster. Unlike k-means, which uses the arithmetic mean and squared Euclidean distance, k-median typically relies on the Manhattan distance (L1 norm) and computes the coordinate-wise median as each cluster's center. This makes k-median considerably more robust to outliers and extreme values, since the statistical median has a breakdown point of 50%, compared to 0% for the mean.
K-median has roots in both statistics and operations research, where it appears in facility location theory. The algorithm sees wide use in domains where data contains noise or heavy-tailed distributions, including customer segmentation, anomaly detection, spatial analysis, and image segmentation.
Imagine you have a big pile of colored marbles spread out on the floor, and you want to sort them into a few groups. You pick a few marbles to be the "leaders" of each group. Every other marble joins the group of whichever leader is closest to it.
Now you need to find the best spot for each leader. In regular k-means, you would put the leader at the "average" position of all its marbles. But if one marble rolled way off into the corner, the average gets pulled toward that loner, and the leader ends up in a weird spot.
With k-median, instead of finding the average, you find the "middle" position. You line up all the marbles in each group and pick the one right in the center. That middle marble does not get yanked around by a single stray marble in the corner. You keep reassigning marbles and finding new middles until nothing changes. The result is groups that better represent where most of the marbles actually are, even when a few have rolled far away.
Given a set of n data points X = {x₁, x₂, ..., xₙ} in a d-dimensional space, the k-median problem seeks a set of k centers C = {c₁, c₂, ..., cₖ} that minimizes the following objective function:
min Σᵢ₌₁ⁿ min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₁
where ‖·‖₁ denotes the L1 norm (Manhattan distance). In contrast, k-means minimizes the sum of squared L2 (Euclidean) distances:
min Σᵢ₌₁ⁿ min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₂²
The key difference is that k-median uses absolute deviations rather than squared deviations. Squaring amplifies the influence of distant points, which is exactly why k-means is sensitive to outliers. The L1 objective treats each unit of distance equally, giving outliers proportional rather than outsized influence.
Each center cⱼ is computed as the coordinate-wise median of all points assigned to cluster j. For a cluster Sⱼ, the m-th coordinate of its center is:
cⱼ[m] = median({xᵢ[m] : xᵢ ∈ Sⱼ})
Because the median is computed independently along each dimension, the resulting center may not coincide with any actual data point. This distinguishes k-median from k-medoids, where each center must be a member of the dataset.
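A quick NumPy illustration of this point: the coordinate-wise median of three 2-D points can be a location that belongs to none of them.

```python
import numpy as np

S = np.array([[0, 5],
              [2, 1],
              [4, 3]])
print(np.median(S, axis=0))  # [2. 3.] -- not one of the three input points
```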
The standard k-median algorithm follows a Lloyd-style iterative procedure similar to k-means, alternating between an assignment step and an update step.
Like k-means, the k-median algorithm is guaranteed to converge because the objective function decreases monotonically with each iteration and is bounded below by zero. However, the algorithm may converge to a local minimum rather than the global optimum, and the quality of the final solution depends heavily on the initialization. Running the algorithm multiple times with different random seeds and selecting the solution with the lowest objective value is a standard mitigation strategy. The table below summarizes common initialization methods.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random selection (Forgy) | Choose k data points uniformly at random as initial centers | Simple, fast; tends to spread centers across the data | Highly variable results; may converge to poor local minima |
| K-means++ (adapted) | Select the first center uniformly at random, then choose each subsequent center with probability proportional to its distance to the nearest existing center (see the sketch after this table) | Provable O(log k) approximation guarantee in expectation; widely used | Slightly more expensive initialization step |
| Random partition | Randomly assign each point to one of k clusters, then compute medians | Places initial centers near the data center | May produce poor initial separation |
| BUILD (from PAM) | Greedily select medoids that minimize the total cost | Deterministic; often high-quality initialization | O(n²k) time complexity |
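As an illustration of the adapted k-means++ row above, here is a minimal seeding sketch; the helper name `kmedianpp_init` is ours, and the adaptation (sampling proportional to L1 distance rather than squared L2 distance) follows the description in the table.

```python
import numpy as np

def kmedianpp_init(X, k, rng=None):
    # k-means++-style seeding adapted to the L1 objective: first center
    # uniform at random, each later center drawn with probability
    # proportional to its Manhattan distance to the nearest chosen center.
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.sum(np.abs(X - c), axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centers)
```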
K-median is frequently compared to two related algorithms: k-means and k-medoids. The following table summarizes the main differences.
| Property | K-median | K-means | K-medoids (PAM) |
|---|---|---|---|
| Objective function | Minimize sum of L1 (Manhattan) distances | Minimize sum of squared L2 (Euclidean) distances | Minimize sum of arbitrary dissimilarities |
| Cluster center | Coordinate-wise median | Coordinate-wise mean (centroid) | An actual data point (medoid) |
| Center is a data point? | Not necessarily | Not necessarily | Always |
| Distance metric | Manhattan distance (L1) | Euclidean distance (L2) | Any dissimilarity metric |
| Robustness to outliers | High (median has 50% breakdown point) | Low (mean has 0% breakdown point) | High (medoids resist outlier pull) |
| Computational cost per iteration | O(nkd) with sorting overhead | O(nkd) | O(k(n-k)²) for PAM |
| Best suited for | Noisy data, heavy-tailed distributions | Spherical, normally distributed clusters | Non-numeric data, arbitrary metrics |
| Update rule | Median of cluster members per dimension | Mean of cluster members per dimension | Swap medoid with best candidate point |
The robustness advantage of k-median over k-means stems from two complementary factors.
Statistical robustness of the median. The median is one of the most robust measures of central tendency. Its breakdown point is 50%, meaning that up to half of the data must be corrupted before the median can be driven arbitrarily far from the true center. The arithmetic mean, by contrast, has a breakdown point of 0%; a single extreme observation can shift the mean without bound.
Linear vs. quadratic penalty. The L1 loss function penalizes residuals linearly, while the L2 loss used in k-means penalizes residuals quadratically. Consider a point that lies 100 units from its cluster center. Under L1, its contribution to the objective is 100. Under L2, its contribution is 10,000. This squared penalty gives disproportionate influence to distant points in k-means, pulling cluster centers toward outliers.
Together, these properties make k-median a preferred choice when data is known to contain noise, measurement errors, or heavy-tailed distributions.
Consider five one-dimensional data points: {1, 2, 3, 4, 100}. The value 100 is an outlier.
| Statistic | Value | Effect of outlier |
|---|---|---|
| Mean | 22.0 | Pulled far from the bulk of the data |
| Median | 3 | Remains at the center of the majority |
If these points formed a single cluster, k-means would place its center at 22.0, while k-median would place its center at 3. The k-median center more accurately represents where most of the data lies.
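The comparison is easy to verify directly with a two-line NumPy check:

```python
import numpy as np

points = np.array([1, 2, 3, 4, 100])
print(np.mean(points))    # 22.0 -- dragged toward the outlier
print(np.median(points))  # 3.0  -- stays with the bulk of the data
```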
The k-median problem is NP-hard in general metric spaces, so no polynomial-time algorithm can find the globally optimal solution for all instances unless P = NP. Even in Euclidean space, finding the exact optimum is NP-hard when both k and d are part of the input.
The Lloyd-style iterative heuristic described above runs in O(nkd) time per iteration, where n is the number of data points, k is the number of clusters, and d is the dimensionality. Computing the coordinate-wise median naively requires sorting, adding an O(n log n) factor per dimension per cluster in the update step, although linear-time selection algorithms (such as the introselect behind NumPy's median) avoid the full sort. In practice, the algorithm typically converges in a small number of iterations, but the worst-case number of iterations can be exponential.
Because exact optimization is intractable for large instances, a rich body of theoretical work focuses on approximation algorithms with provable guarantees.
| Algorithm / Authors | Year | Approximation ratio | Notes |
|---|---|---|---|
| Charikar, Guha, Tardos, Shmoys | 2002 | 6⅔ | First constant-factor approximation using LP rounding and primal-dual methods |
| Arya, Garg, Khandekar, Meyerson, Munagala, Pandit | 2004 | 3 + 2/p | Local search with p-swaps; for single swaps (p=1), ratio is 5 |
| Li and Svensson | 2013 | 1 + √3 + ε ≈ 2.732 + ε | Pseudo-approximation technique; breakthrough improvement over the long-standing ratio of 3 + ε |
| Byrka, Pensyl, Rybicki, Srinivasan, Trinh | 2017 | 2.675 + ε | Best known ratio for general metrics at the time of publication |
The local search approach by Arya et al. is especially notable for its practical simplicity. Starting with any feasible solution, the algorithm iteratively tries to improve the objective by swapping one or more centers. If each swap considers replacing p centers simultaneously, the approximation guarantee is 3 + 2/p. For p = 1 (single swap), this yields a 5-approximation.
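A minimal sketch of the single-swap (p = 1) variant, assuming centers are restricted to the data points themselves and the L1 metric used throughout this article; the function name and structure are ours, not Arya et al.'s pseudocode.

```python
import numpy as np

def local_search_kmedian(X, k, rng=None):
    # Single-swap local search: repeatedly replace one center with one
    # non-center data point whenever the swap lowers the total L1 cost.
    rng = np.random.default_rng(rng)

    def cost(center_idx):
        d = np.array([np.sum(np.abs(X - X[i]), axis=1) for i in center_idx])
        return d.min(axis=0).sum()  # each point pays its nearest center

    current = list(rng.choice(len(X), k, replace=False))
    best = cost(current)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(X)):
                if candidate in current:
                    continue
                trial = current[:i] + [candidate] + current[i + 1:]
                c = cost(trial)
                if c < best:
                    current, best = trial, c
                    improved = True
    return current, best
```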
In the operations research and combinatorial optimization literature, the k-median problem is often studied as a metric facility location problem. Given a set of clients D and a set of potential facility locations F in a metric space satisfying the triangle inequality, the goal is to open exactly k facilities and assign each client to its nearest open facility so as to minimize the total assignment cost.
Formally:
min Σ_{u ∈ D} min_{v ∈ S} d(u, v)
subject to |S| = k, S ⊆ F
This formulation is known as the discrete k-median problem when centers must be chosen from a finite candidate set F, and as the continuous k-median problem when centers can be placed anywhere in the space.
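For very small instances, the discrete formulation can be solved exactly by brute force, which makes the definition concrete. This is a sketch assuming the L1 metric; the function name is ours, and the approach is only feasible when the number of size-k subsets is tiny.

```python
import numpy as np
from itertools import combinations

def exact_discrete_kmedian(clients, facilities, k):
    # Try every size-k subset S of candidate facilities and keep the one
    # minimizing the total distance from each client to its nearest facility.
    best_cost, best_S = float("inf"), None
    for S in combinations(range(len(facilities)), k):
        d = np.array([np.sum(np.abs(clients - facilities[i]), axis=1) for i in S])
        total = d.min(axis=0).sum()
        if total < best_cost:
            best_cost, best_S = total, S
    return best_S, best_cost
```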
The facility location perspective has driven much of the theoretical work on approximation algorithms and connects k-median to problems such as the p-median problem studied by Kariv and Hakimi (1979) in network optimization.
The PAM algorithm, introduced by Kaufman and Rousseeuw in 1987, is closely related to k-median but belongs to the k-medoids family. Instead of computing coordinate-wise medians, PAM selects actual data points as cluster centers (medoids). PAM operates in two phases: a BUILD phase that greedily selects an initial set of medoids to minimize the total cost, and a SWAP phase that iteratively exchanges a medoid with a non-medoid whenever the exchange lowers the objective.
The original PAM algorithm has a per-iteration complexity of O(k(n-k)²). The FastPAM variant (Schubert and Rousseeuw, 2019) reduces this to O(n²) per iteration, yielding up to a k-fold speedup. FasterPAM further improves performance through eager swap strategies.
For large datasets where PAM becomes prohibitively expensive, Kaufman and Rousseeuw also proposed CLARA (Clustering Large Applications). CLARA works by drawing multiple random samples from the dataset, running PAM on each sample, and selecting the best result.
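A sketch of the CLARA idea, assuming scikit-learn-extra's KMedoids (listed in the software table below) as the PAM subroutine; the parameter names and sample sizes here are illustrative, not Kaufman and Rousseeuw's originals.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumed available

def clara(X, k, n_samples=5, sample_size=200, rng=None):
    # Run PAM on several random subsamples; keep the medoid set whose
    # total L1 cost over the FULL dataset is lowest.
    rng = np.random.default_rng(rng)
    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        pam = KMedoids(n_clusters=k, metric="manhattan", method="pam").fit(X[idx])
        cost = np.array([
            np.sum(np.abs(X - m), axis=1) for m in pam.cluster_centers_
        ]).min(axis=0).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, pam.cluster_centers_
    return best_medoids
```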
CLARANS (Clustering Large Applications based on RANdomized Search), introduced by Ng and Han in 1994, combines ideas from PAM and CLARA. Rather than exhaustively searching the neighborhood of the current solution, CLARANS randomly samples a subset of possible swaps. Its cost scales roughly linearly with n, making it more practical for large datasets.
In the weighted variant, each data point xᵢ has an associated weight wᵢ, and the objective becomes:
min Σᵢ₌₁ⁿ wᵢ · min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₁
This formulation is useful when certain data points represent aggregated observations or carry different levels of importance.
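In the weighted update step, the coordinate-wise median is replaced by a weighted median along each dimension. A minimal sketch, with a helper name of our choosing:

```python
import numpy as np

def weighted_median(values, weights):
    # Smallest value v such that the points <= v hold at least half
    # of the total weight.
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, cum[-1] / 2.0)]

# Per-dimension weighted update for cluster members `pts` with weights `w`:
# center = np.array([weighted_median(pts[:, m], w) for m in range(pts.shape[1])])
```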
The k-median with outliers variant allows a specified number of points to be discarded as outliers, further improving robustness. Chen (2008) gave the first constant-factor approximation for this problem.
K-median is available in several statistical and machine learning software packages.
| Software | Package / Module | Notes |
|---|---|---|
| Python | pyclustering (kmedians class) | Dedicated k-median implementation with L1 distance |
| Python | scikit-learn-extra (KMedoids) | Supports k-medoids with PAM and alternate methods; can approximate k-median behavior |
| R | flexclust package | Provides k-median clustering via the kcca function with Manhattan distance |
| R | cluster package (pam function) | PAM implementation for k-medoids |
| ELKI | KMediansLloyd | Java-based k-median with multiple distance options |
| Stata | cluster kmedians | Built-in command for k-median clustering |
| GeoDa | Clusters menu | Spatial data analysis tool with k-median support |
For users of scikit-learn, there is no built-in k-median class in the core library. However, a k-median algorithm can be implemented by modifying the centroid update step of KMeans to use numpy's median function instead of the mean. Alternatively, the pyclustering library provides a ready-to-use implementation.
A minimal k-median implementation in Python:
```python
import numpy as np

def k_median(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    # Random initialization: pick k distinct data points as starting centers
    indices = rng.choice(len(X), k, replace=False)
    centers = X[indices].astype(float)
    for _ in range(max_iter):
        # Assignment step: Manhattan (L1) distance from every point to every center
        distances = np.array([
            np.sum(np.abs(X - c), axis=1) for c in centers
        ]).T                                   # shape (n, k)
        labels = np.argmin(distances, axis=1)
        # Update step: coordinate-wise median of each cluster's members;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```
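A usage sketch, reusing the k_median function above: run it several times and keep the lowest-cost solution, the restart strategy mentioned earlier. The data here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),   # one cluster near the origin
    rng.normal(8, 1, (50, 2)),   # a second cluster near (8, 8)
    [[40.0, 40.0]],              # a single far-away outlier
])

def l1_cost(X, labels, centers):
    # Total Manhattan distance from each point to its assigned center
    return np.sum(np.abs(X - centers[labels]))

# Restart strategy: best of 10 random initializations
labels, centers = min(
    (k_median(X, k=2) for _ in range(10)),
    key=lambda res: l1_cost(X, *res),
)
print(centers)  # medians stay near (0, 0) and (8, 8) despite the outlier
```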
Like k-means, k-median requires the user to specify the number of clusters k in advance. Several methods can help determine an appropriate value, including the elbow method (plotting the objective against k and looking for the bend where additional clusters stop paying off), silhouette analysis, and the gap statistic; an elbow-method sketch follows.
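This sketch reuses the k_median function and the synthetic X from the usage example above; any dataset would do.

```python
import numpy as np

# Elbow-method sketch: compute the L1 objective for a range of k and
# look for the bend in the resulting curve.
costs = []
for k in range(1, 10):
    labels, centers = k_median(X, k)
    costs.append(np.sum(np.abs(X - centers[labels])))
# e.g. with matplotlib: plt.plot(range(1, 10), costs, marker="o")
```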
K-median clustering is applied across many domains.
| Domain | Application | Why k-median is preferred |
|---|---|---|
| Customer segmentation | Grouping customers by purchasing behavior | Transaction data often contains outlier purchases; k-median resists skew from high-spending outliers |
| Facility location | Choosing warehouse or store locations to minimize total delivery distance | The L1 objective corresponds to minimizing total travel distance on a grid (city-block distance) |
| Image segmentation | Partitioning pixels into regions with similar color or intensity | Robust to sensor noise and extreme pixel values |
| Anomaly detection | Identifying data points far from any cluster center | Cleaner cluster centers improve outlier scoring |
| Gene expression analysis | Clustering genes with similar expression profiles | Biological data frequently contains measurement noise and outlier expressions |
| Text mining | Grouping documents by topic similarity | Document feature vectors are high-dimensional and sparse; L1 distance handles sparsity well |
| Spatial analysis | Partitioning geographic data into service regions | Manhattan distance naturally models travel on street grids |