K-median clustering is a partitioning-based clustering technique in unsupervised learning that divides a dataset of n objects into k groups by minimizing the sum of distances between each data point and the median of its assigned cluster. Unlike k-means, which uses the arithmetic mean and squared Euclidean distance, k-median typically relies on the Manhattan distance (L1 norm) and computes the coordinate-wise median as each cluster's center. This makes k-median considerably more robust to outliers and extreme values, since the statistical median has a breakdown point of 50%, compared to 0% for the mean.
K-median has roots in both statistics and operations research, where it appears in facility location theory. The algorithm sees wide use in domains where data contains noise or heavy-tailed distributions, including customer segmentation, anomaly detection, spatial analysis, and image segmentation.
Imagine you have a big pile of colored marbles spread out on the floor, and you want to sort them into a few groups. You pick a few marbles to be the "leaders" of each group. Every other marble joins the group of whichever leader is closest to it.
Now you need to find the best spot for each leader. In regular k-means, you would put the leader at the "average" position of all its marbles. But if one marble rolled way off into the corner, the average gets pulled toward that loner, and the leader ends up in a weird spot.
With k-median, instead of finding the average, you find the "middle" position. You line up all the marbles in each group and pick the one right in the center. That middle marble does not get yanked around by a single stray marble in the corner. You keep reassigning marbles and finding new middles until nothing changes. The result is groups that better represent where most of the marbles actually are, even when a few have rolled far away.
Given a set of n data points X = {x₁, x₂, ..., xₙ} in a d-dimensional space, the k-median problem seeks a set of k centers C = {c₁, c₂, ..., cₖ} that minimizes the following objective function:
min Σᵢ₌₁ⁿ min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₁
where ‖·‖₁ denotes the L1 norm (Manhattan distance). In contrast, k-means minimizes the sum of squared L2 (Euclidean) distances:
min Σᵢ₌₁ⁿ min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₂²
The key difference is that k-median uses absolute deviations rather than squared deviations. Squaring amplifies the influence of distant points, which is exactly why k-means is sensitive to outliers. The L1 objective treats each unit of distance equally, giving outliers proportional rather than outsized influence.
Each center cⱼ is computed as the coordinate-wise median of all points assigned to cluster j. For a cluster Sⱼ, the m-th coordinate of its center is:
cⱼ[m] = median({xᵢ[m] : xᵢ ∈ Sⱼ})
Because the median is computed independently along each dimension, the resulting center may not coincide with any actual data point. This distinguishes k-median from k-medoids, where each center must be a member of the dataset.
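A quick NumPy illustration of this point: the coordinate-wise median of three 2-D points can be a location that belongs to none of them.

```python
import numpy as np

S = np.array([[0, 5],
              [2, 1],
              [4, 3]])
print(np.median(S, axis=0))  # [2. 3.] -- not one of the three input points
```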
The standard k-median algorithm follows a Lloyd-style iterative procedure similar to k-means, alternating between an assignment step and an update step.
Like k-means, the k-median algorithm is guaranteed to converge because the objective function decreases monotonically with each iteration and is bounded below by zero. However, the algorithm may converge to a local minimum rather than the global optimum, and the quality of the final solution depends heavily on the initialization. Running the algorithm multiple times with different random seeds and selecting the solution with the lowest objective value is a standard mitigation strategy. The table below summarizes common initialization methods.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Random selection (Forgy) | Choose k data points uniformly at random as initial centers | Simple, fast; tends to spread centers across the data | Highly variable results; may converge to poor local minima |
| K-means++ (adapted) | Select the first center uniformly at random, then choose each subsequent center with probability proportional to its distance to the nearest existing center (see the sketch after this table) | Provable O(log k) approximation guarantee in expectation; widely used | Slightly more expensive initialization step |
| Random partition | Randomly assign each point to one of k clusters, then compute medians | Places initial centers near the data center | May produce poor initial separation |
| BUILD (from PAM) | Greedily select medoids that minimize the total cost | Deterministic; often high-quality initialization | O(n²k) time complexity |
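As an illustration of the adapted k-means++ row above, here is a minimal seeding sketch; the helper name `kmedianpp_init` is ours, and the adaptation (sampling proportional to L1 distance rather than squared L2 distance) follows the description in the table.

```python
import numpy as np

def kmedianpp_init(X, k, rng=None):
    # k-means++-style seeding adapted to the L1 objective: first center
    # uniform at random, each later center drawn with probability
    # proportional to its Manhattan distance to the nearest chosen center.
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.sum(np.abs(X - c), axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.array(centers)
```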
K-median is frequently compared to two related algorithms: k-means and k-medoids. The following table summarizes the main differences.
| Property | K-median | K-means | K-medoids (PAM) |
|---|---|---|---|
| Objective function | Minimize sum of L1 (Manhattan) distances | Minimize sum of squared L2 (Euclidean) distances | Minimize sum of arbitrary dissimilarities |
| Cluster center | Coordinate-wise median | Coordinate-wise mean (centroid) | An actual data point (medoid) |
| Center is a data point? | Not necessarily | Not necessarily | Always |
| Distance metric | Manhattan distance (L1) | Euclidean distance (L2) | Any dissimilarity metric |
| Robustness to outliers | High (median has 50% breakdown point) | Low (mean has 0% breakdown point) | High (medoids resist outlier pull) |
| Computational cost per iteration | O(nkd) with sorting overhead | O(nkd) | O(k(n-k)²) for PAM |
| Best suited for | Noisy data, heavy-tailed distributions | Spherical, normally distributed clusters | Non-numeric data, arbitrary metrics |
| Update rule | Median of cluster members per dimension | Mean of cluster members per dimension | Swap medoid with best candidate point |
The robustness advantage of k-median over k-means stems from two complementary factors.
Statistical robustness of the median. The median is one of the most robust measures of central tendency. Its breakdown point is 50%, meaning that up to half of the data must be corrupted before the median can be driven arbitrarily far from the true center. The arithmetic mean, by contrast, has a breakdown point of 0%; a single extreme observation can shift the mean without bound.
Linear vs. quadratic penalty. The L1 loss function penalizes residuals linearly, while the L2 loss used in k-means penalizes residuals quadratically. Consider a point that lies 100 units from its cluster center. Under L1, its contribution to the objective is 100. Under L2, its contribution is 10,000. This squared penalty gives disproportionate influence to distant points in k-means, pulling cluster centers toward outliers.
Together, these properties make k-median a preferred choice when data is known to contain noise, measurement errors, or heavy-tailed distributions.
Consider five one-dimensional data points: {1, 2, 3, 4, 100}. The value 100 is an outlier.
| Statistic | Value | Effect of outlier |
|---|---|---|
| Mean | 22.0 | Pulled far from the bulk of the data |
| Median | 3 | Remains at the center of the majority |
If these points formed a single cluster, k-means would place its center at 22.0, while k-median would place its center at 3. The k-median center more accurately represents where most of the data lies.
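The comparison is easy to verify directly with a two-line NumPy check:

```python
import numpy as np

points = np.array([1, 2, 3, 4, 100])
print(np.mean(points))    # 22.0 -- dragged toward the outlier
print(np.median(points))  # 3.0  -- stays with the bulk of the data
```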
The k-median problem is NP-hard in general metric spaces, so no polynomial-time algorithm can find the globally optimal solution for all instances unless P = NP. Even in Euclidean space, finding the exact optimum is NP-hard when both k and d are part of the input.
The Lloyd-style iterative heuristic described above runs in O(nkd) time per iteration, where n is the number of data points, k is the number of clusters, and d is the dimensionality. Computing the coordinate-wise median naively requires sorting, adding an O(n log n) factor per dimension per cluster in the update step, although linear-time selection algorithms (such as the introselect behind NumPy's median) avoid the full sort. In practice, the algorithm typically converges in a small number of iterations, but the worst-case number of iterations can be exponential.
Because exact optimization is intractable for large instances, a rich body of theoretical work focuses on approximation algorithms with provable guarantees.
| Algorithm / Authors | Year | Approximation ratio | Notes |
|---|---|---|---|
| Charikar, Guha, Tardos, Shmoys | 2002 | 6⅔ | First constant-factor approximation using LP rounding and primal-dual methods |
| Arya, Garg, Khandekar, Meyerson, Munagala, Pandit | 2004 | 3 + 2/p | Local search with p-swaps; for single swaps (p=1), ratio is 5 |
| Li and Svensson | 2013 | 1 + √3 + ε ≈ 2.732 + ε | Pseudo-approximation technique; breakthrough improvement over the long-standing ratio of 3 + ε |
| Byrka, Pensyl, Rybicki, Srinivasan, Trinh | 2017 | 2.675 + ε | Best known ratio for general metrics at the time of publication |
The local search approach by Arya et al. is especially notable for its practical simplicity. Starting with any feasible solution, the algorithm iteratively tries to improve the objective by swapping one or more centers. If each swap considers replacing p centers simultaneously, the approximation guarantee is 3 + 2/p. For p = 1 (single swap), this yields a 5-approximation.
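A minimal sketch of the single-swap (p = 1) variant, assuming centers are restricted to the data points themselves and the L1 metric used throughout this article; the function name and structure are ours, not Arya et al.'s pseudocode.

```python
import numpy as np

def local_search_kmedian(X, k, rng=None):
    # Single-swap local search: repeatedly replace one center with one
    # non-center data point whenever the swap lowers the total L1 cost.
    rng = np.random.default_rng(rng)

    def cost(center_idx):
        d = np.array([np.sum(np.abs(X - X[i]), axis=1) for i in center_idx])
        return d.min(axis=0).sum()  # each point pays its nearest center

    current = list(rng.choice(len(X), k, replace=False))
    best = cost(current)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(X)):
                if candidate in current:
                    continue
                trial = current[:i] + [candidate] + current[i + 1:]
                c = cost(trial)
                if c < best:
                    current, best = trial, c
                    improved = True
    return current, best
```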
In the operations research and combinatorial optimization literature, the k-median problem is often studied as a metric facility location problem. Given a set of clients D and a set of potential facility locations F in a metric space satisfying the triangle inequality, the goal is to open exactly k facilities and assign each client to its nearest open facility so as to minimize the total assignment cost.
Formally:
min Σ_{u ∈ D} min_{v ∈ S} d(u, v)
subject to |S| = k, S ⊆ F
This formulation is known as the discrete k-median problem when centers must be chosen from a finite candidate set F, and as the continuous k-median problem when centers can be placed anywhere in the space.
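For very small instances, the discrete formulation can be solved exactly by brute force, which makes the definition concrete. This is a sketch assuming the L1 metric; the function name is ours, and the approach is only feasible when the number of size-k subsets is tiny.

```python
import numpy as np
from itertools import combinations

def exact_discrete_kmedian(clients, facilities, k):
    # Try every size-k subset S of candidate facilities and keep the one
    # minimizing the total distance from each client to its nearest facility.
    best_cost, best_S = float("inf"), None
    for S in combinations(range(len(facilities)), k):
        d = np.array([np.sum(np.abs(clients - facilities[i]), axis=1) for i in S])
        total = d.min(axis=0).sum()
        if total < best_cost:
            best_cost, best_S = total, S
    return best_S, best_cost
```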
The facility location perspective has driven much of the theoretical work on approximation algorithms and connects k-median to problems such as the p-median problem studied by Kariv and Hakimi (1979) in network optimization.
The PAM algorithm, introduced by Kaufman and Rousseeuw in 1987, is closely related to k-median but belongs to the k-medoids family. Instead of computing coordinate-wise medians, PAM selects actual data points as cluster centers (medoids). PAM operates in two phases: a BUILD phase that greedily selects an initial set of medoids to minimize the total cost, and a SWAP phase that iteratively exchanges a medoid with a non-medoid whenever the exchange lowers the objective.
The original PAM algorithm has a per-iteration complexity of O(k(n-k)²). The FastPAM variant (Schubert and Rousseeuw, 2019) reduces this to O(n²) per iteration, yielding up to a k-fold speedup. FasterPAM further improves performance through eager swap strategies.
For large datasets where PAM becomes prohibitively expensive, Kaufman and Rousseeuw also proposed CLARA (Clustering Large Applications). CLARA works by drawing multiple random samples from the dataset, running PAM on each sample, and selecting the best result.
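A sketch of the CLARA idea, assuming scikit-learn-extra's KMedoids (listed in the software table below) as the PAM subroutine; the parameter names and sample sizes here are illustrative, not Kaufman and Rousseeuw's originals.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumed available

def clara(X, k, n_samples=5, sample_size=200, rng=None):
    # Run PAM on several random subsamples; keep the medoid set whose
    # total L1 cost over the FULL dataset is lowest.
    rng = np.random.default_rng(rng)
    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        pam = KMedoids(n_clusters=k, metric="manhattan", method="pam").fit(X[idx])
        cost = np.array([
            np.sum(np.abs(X - m), axis=1) for m in pam.cluster_centers_
        ]).min(axis=0).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, pam.cluster_centers_
    return best_medoids
```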
CLARANS (Clustering Large Applications based on RANdomized Search), introduced by Ng and Han in 1994, combines ideas from PAM and CLARA. Rather than exhaustively searching the neighborhood of the current solution, CLARANS randomly samples a subset of possible swaps. Its cost scales roughly linearly with n, making it more practical for large datasets.
In the weighted variant, each data point xᵢ has an associated weight wᵢ, and the objective becomes:
min Σᵢ₌₁ⁿ wᵢ · min_{j ∈ {1,...,k}} ‖xᵢ − cⱼ‖₁
This formulation is useful when certain data points represent aggregated observations or carry different levels of importance.
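In the weighted update step, the coordinate-wise median is replaced by a weighted median along each dimension. A minimal sketch, with a helper name of our choosing:

```python
import numpy as np

def weighted_median(values, weights):
    # Smallest value v such that the points <= v hold at least half
    # of the total weight.
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, cum[-1] / 2.0)]

# Per-dimension weighted update for cluster members `pts` with weights `w`:
# center = np.array([weighted_median(pts[:, m], w) for m in range(pts.shape[1])])
```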
The k-median with outliers variant allows a specified number of points to be discarded as outliers, further improving robustness. Chen (2008) gave the first constant-factor approximation for this problem.
K-median is available in several statistical and machine learning software packages.
| Software | Package / Module | Notes |
|---|---|---|
| Python | pyclustering (kmedians class) | Dedicated k-median implementation with L1 distance |
| Python | scikit-learn-extra (KMedoids) | Supports k-medoids with PAM and alternate methods; can approximate k-median behavior |
| R | flexclust package | Provides k-median clustering via the kcca function with Manhattan distance |
| R | cluster package (pam function) | PAM implementation for k-medoids |
| ELKI | KMediansLloyd | Java-based k-median with multiple distance options |
| Stata | cluster kmedians | Built-in command for k-median clustering |
| GeoDa | Clusters menu | Spatial data analysis tool with k-median support |
For users of scikit-learn, there is no built-in k-median class in the core library. However, a k-median algorithm can be implemented by modifying the centroid update step of KMeans to use numpy's median function instead of the mean. Alternatively, the pyclustering library provides a ready-to-use implementation.
A minimal k-median implementation in Python:
```python
import numpy as np

def k_median(X, k, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    # Random initialization: pick k distinct data points as starting centers
    indices = rng.choice(len(X), k, replace=False)
    centers = X[indices].astype(float)
    for _ in range(max_iter):
        # Assignment step: Manhattan (L1) distance from every point to every center
        distances = np.array([
            np.sum(np.abs(X - c), axis=1) for c in centers
        ]).T                                   # shape (n, k)
        labels = np.argmin(distances, axis=1)
        # Update step: coordinate-wise median of each cluster's members;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```
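A usage sketch, reusing the k_median function above: run it several times and keep the lowest-cost solution, the restart strategy mentioned earlier. The data here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),   # one cluster near the origin
    rng.normal(8, 1, (50, 2)),   # a second cluster near (8, 8)
    [[40.0, 40.0]],              # a single far-away outlier
])

def l1_cost(X, labels, centers):
    # Total Manhattan distance from each point to its assigned center
    return np.sum(np.abs(X - centers[labels]))

# Restart strategy: best of 10 random initializations
labels, centers = min(
    (k_median(X, k=2) for _ in range(10)),
    key=lambda res: l1_cost(X, *res),
)
print(centers)  # medians stay near (0, 0) and (8, 8) despite the outlier
```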
Like k-means, k-median requires the user to specify the number of clusters k in advance. Several methods can help determine an appropriate value, including the elbow method (plotting the objective against k and looking for the bend where additional clusters stop paying off), silhouette analysis, and the gap statistic; an elbow-method sketch follows.
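This sketch reuses the k_median function and the synthetic X from the usage example above; any dataset would do.

```python
import numpy as np

# Elbow-method sketch: compute the L1 objective for a range of k and
# look for the bend in the resulting curve.
costs = []
for k in range(1, 10):
    labels, centers = k_median(X, k)
    costs.append(np.sum(np.abs(X - centers[labels])))
# e.g. with matplotlib: plt.plot(range(1, 10), costs, marker="o")
```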
K-median clustering is applied across many domains.
| Domain | Application | Why k-median is preferred |
|---|---|---|
| Customer segmentation | Grouping customers by purchasing behavior | Transaction data often contains outlier purchases; k-median resists skew from high-spending outliers |
| Facility location | Choosing warehouse or store locations to minimize total delivery distance | The L1 objective corresponds to minimizing total travel distance on a grid (city-block distance) |
| Image segmentation | Partitioning pixels into regions with similar color or intensity | Robust to sensor noise and extreme pixel values |
| Anomaly detection | Identifying data points far from any cluster center | Cleaner cluster centers improve outlier scoring |
| Gene expression analysis | Clustering genes with similar expression profiles | Biological data frequently contains measurement noise and outlier expressions |
| Text mining | Grouping documents by topic similarity | Document feature vectors are high-dimensional and sparse; L1 distance handles sparsity well |
| Spatial analysis | Partitioning geographic data into service regions | Manhattan distance naturally models travel on street grids |