# K-Median

> Source: https://aiwiki.ai/wiki/k-median
> Updated: 2026-07-13
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**K-median clustering** (also written k-medians) is a partitioning-based [clustering](/wiki/clustering) algorithm that divides a dataset of $$n$$ objects into $$k$$ groups by minimizing the sum of distances between each data point and the median of its assigned cluster, typically under the L1 (Manhattan) norm. [1][14] It is a variant of [k-means](/wiki/k-means) clustering that swaps two ingredients: instead of the arithmetic mean and squared [Euclidean distance](/wiki/euclidean_distance) (L2), k-median uses the coordinate-wise median and absolute (L1) distance, which makes it considerably more robust to [outliers](/wiki/outliers) and heavy-tailed data. [14][15] Because the statistical median has a breakdown point of 50%, compared with 0% for the mean, up to half of a cluster's points can be corrupted before the k-median center is driven arbitrarily far from the bulk of the data. [16]

K-median belongs to [unsupervised learning](/wiki/unsupervised_learning) and has roots in both statistics and [operations research](/wiki/operations_research), where the closely related k-median problem appears in facility location theory. [3][8] The algorithm sees wide use in domains where data contains noise or heavy-tailed distributions, including customer segmentation, [anomaly detection](/wiki/anomaly_detection), spatial analysis, and [image segmentation](/wiki/image_segmentation).

## ELI5: Explain like I'm 5

Imagine you have a big pile of colored marbles spread out on the floor, and you want to sort them into a few groups. You pick a few marbles to be the "leaders" of each group. Every other marble joins the group of whichever leader is closest to it.

Now you need to find the best spot for each leader. In regular k-means, you would put the leader at the "average" position of all its marbles. But if one marble rolled way off into the corner, the average gets pulled toward that loner, and the leader ends up in a weird spot.

With k-median, instead of finding the average, you find the "middle" position. You line up all the marbles in each group and pick the one right in the center. That middle marble does not get yanked around by a single stray marble in the corner. You keep reassigning marbles and finding new middles until nothing changes. The result is groups that better represent where most of the marbles actually are, even when a few have rolled far away.

## What is the mathematical formulation of k-median?

Given a set of $$n$$ data points $$X = \{x_1, x_2, \ldots, x_n\}$$ in a $$d$$-dimensional space, the k-median problem seeks a set of $$k$$ centers $$C = \{c_1, c_2, \ldots, c_k\}$$ that minimizes the following objective function: [14]

$$
\min \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \lVert x_i - c_j \rVert_1
$$

where $$\lVert \cdot \rVert_1$$ denotes the L1 norm (Manhattan distance). In contrast, k-means minimizes the sum of squared L2 (Euclidean) distances:

$$
\min \sum_{i=1}^{n} \min_{j \in \{1, \ldots, k\}} \lVert x_i - c_j \rVert_2^2
$$

The key difference is that k-median uses absolute deviations rather than squared deviations. Squaring amplifies the influence of distant points, which is exactly why k-means is sensitive to outliers. [15] The L1 objective treats each unit of distance equally, giving outliers proportional rather than outsized influence.

Each center $$c_j$$ is computed as the coordinate-wise median of all points assigned to cluster $$j$$. For a cluster $$S_j$$ and dimension $$d$$, the $$d$$-th coordinate of the center is:

$$
c_j[d] = \operatorname{median}(\{x_i[d] : x_i \in S_j\})
$$

This exploits the separability of the L1 norm: because the sum of absolute deviations decomposes across dimensions, the optimal L1 center is found in closed form by sorting each coordinate independently and taking the middle value (or the average of the two middle values for an even count). [14] Because the median is computed independently along each dimension, the resulting center may not coincide with any actual data point. As one practitioner guide puts it, "because the median is calculated for each dimension separately in k-medians, the medians would be x = 3, and y = 3. But there exists no point (3, 3) in the data set." [15] This is exactly what distinguishes k-median from [k-medoids](/wiki/k_medoids), where each center must be a member of the dataset.

## How does the k-median algorithm work?

The standard k-median algorithm follows a Lloyd-style iterative procedure similar to k-means, alternating between an assignment step and an update step.

### Step-by-step procedure

1. **Initialization.** Select k initial centers. Common methods include random selection, the k-means++ strategy (adapted for L1 distance), or deterministic approaches. [13]
2. **Assignment step (E-step).** Assign each data point $$x_i$$ to the cluster whose center is closest under the Manhattan distance:
   - $$\operatorname{cluster}(x_i) = \arg\min_j \lVert x_i - c_j \rVert_1$$
3. **Update step (M-step).** Recompute each cluster center as the coordinate-wise median of all points currently assigned to that cluster.
4. **Convergence check.** If no assignments changed (or if the change in the objective function falls below a threshold), stop. Otherwise, return to step 2.

### Convergence properties

Like k-means, the k-median algorithm is guaranteed to converge because the objective function decreases monotonically with each iteration and is bounded below by zero. However, the algorithm may converge to a local minimum rather than the global optimum. The quality of the final solution depends heavily on the initialization. [12] Running the algorithm multiple times with different random seeds and selecting the solution with the lowest objective value is a standard mitigation strategy.

### Initialization methods

| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| Random selection | Choose k data points uniformly at random as initial centers | Simple, fast | Highly variable results; may converge to poor local minima |
| K-means++ (adapted) | Select the first center randomly, then choose subsequent centers with probability proportional to their distance from existing centers | Provable $$O(\log k)$$ approximation guarantee; widely used [13] | Slightly more expensive initialization step |
| Forgy method | Select k random observations as initial centers | Tends to spread initial centers well | No formal guarantees beyond random |
| Random partition | Randomly assign each point to one of k clusters, then compute medians | Places initial centers near the data center | May produce poor initial separation |
| BUILD (from PAM) | Greedily select medoids that minimize the total cost | Deterministic; often high-quality initialization [1] | $$O(n^2 k)$$ time complexity |

## How is k-median different from k-means and k-medoids?

K-median is frequently compared to two related algorithms: [k-means](/wiki/k-means) and [k-medoids](/wiki/k_medoids). The crucial distinction is the cluster center: k-means uses the mean, k-median uses the per-dimension median (which need not be a real point), and k-medoids uses a medoid, an actual data point from the cluster. [15] The following table summarizes the main differences.

| Property | K-median | K-means | K-medoids (PAM) |
|----------|----------|---------|------------------|
| Objective function | Minimize sum of L1 (Manhattan) distances | Minimize sum of squared L2 (Euclidean) distances | Minimize sum of arbitrary dissimilarities |
| Cluster center | Coordinate-wise median | Coordinate-wise mean (centroid) | An actual data point (medoid) |
| Center is a data point? | Not necessarily | Not necessarily | Always |
| Distance metric | Manhattan distance (L1) | Euclidean distance (L2) | Any dissimilarity metric |
| Robustness to outliers | High (median has 50% breakdown point) | Low (mean has 0% breakdown point) | High (medoids resist outlier pull) |
| Computational cost per iteration | $$O(nkd)$$ with sorting overhead | $$O(nkd)$$ | $$O(k(n-k)^2)$$ for PAM |
| Best suited for | Noisy data, heavy-tailed distributions | Spherical, normally distributed clusters | Non-numeric data, arbitrary metrics |
| Update rule | Median of cluster members per dimension | Mean of cluster members per dimension | Swap medoid with best candidate point |

A common source of confusion is treating k-median and k-medoids as the same method. They are not: k-medoids constrains each [centroid](/wiki/centroid) to be one of the observed objects, while k-median's coordinate-wise median is a synthetic point that may not exist in the dataset. [15]

## Why is k-median more robust to outliers than k-means?

The robustness advantage of k-median over k-means stems from two complementary factors. Wikipedia's summary of the method captures the intuition: "Because the median is resistant to extreme values, k-medians is generally more robust in the presence of outliers." [14]

**Statistical robustness of the median.** The median is one of the most robust measures of central tendency. Its breakdown point is 50%, meaning that up to half of the data must be corrupted before the median can be driven arbitrarily far from the true center. [16] The arithmetic mean, by contrast, has a breakdown point of 0%; a single extreme observation can shift the mean without bound. [16]

**Linear vs. quadratic penalty.** The L1 [loss function](/wiki/loss_function) penalizes residuals linearly, while the L2 loss used in k-means penalizes residuals quadratically. Consider a point that lies 100 units from its cluster center. Under L1, its contribution to the objective is 100. Under L2, its contribution is 10,000. This squared penalty gives disproportionate influence to distant points in k-means, pulling cluster centers toward outliers. As one Towards Data Science guide notes, by using the median and the Manhattan metric k-median is "much less sensitive to outliers," because each outlier contributes "only with their actual distance to the center, instead of the square of the distance." [15]

Together, these properties make k-median a preferred choice when data is known to contain noise, measurement errors, or heavy-tailed distributions.

### Illustrative example

Consider five one-dimensional data points: {1, 2, 3, 4, 100}. The value 100 is an outlier.

| Statistic | Value | Effect of outlier |
|-----------|-------|-------------------|
| Mean | 22.0 | Pulled far from the bulk of the data |
| Median | 3 | Remains at the center of the majority |

If these points formed a single cluster, k-means would place its center at 22.0, while k-median would place its center at 3. The k-median center more accurately represents where most of the data lies.

## Is k-median NP-hard, and how well can it be approximated?

The k-median problem is [NP-hard](/wiki/np_hardness) in general metric spaces, meaning that no polynomial-time algorithm is known that can find the globally optimal solution for all instances. [3] Even in Euclidean space, finding the exact optimum is NP-hard when both k and d are part of the input.

### Heuristic complexity

The Lloyd-style iterative heuristic described above runs in $$O(nkd)$$ time per iteration, where $$n$$ is the number of data points, $$k$$ is the number of clusters, and $$d$$ is the dimensionality. However, computing the coordinate-wise median requires sorting, adding an $$O(n \log n)$$ factor per dimension per cluster in the update step. In practice, the algorithm typically converges in a small number of iterations, but the worst-case number of iterations can be exponential.

### Approximation algorithms

Because exact optimization is intractable for large instances, a rich body of theoretical work focuses on [approximation algorithms](/wiki/approximation_algorithm) with provable guarantees. The best known constant factor for general metrics has been driven down over two decades from $$6\frac{2}{3}$$ to $$2.675 + \epsilon$$.

| Algorithm / Authors | Year | Approximation ratio | Notes |
|---------------------|------|---------------------|-------|
| Charikar, Guha, Tardos, Shmoys | 2002 | $$6\frac{2}{3}$$ | First constant-factor approximation using LP rounding and primal-dual methods [3] |
| Arya, Garg, Khandekar, Meyerson, Munagala, Pandit | 2004 | $$3 + 2/p$$ | Local search with p-swaps; for single swaps ($$p=1$$), ratio is 5 [4] |
| Li and Svensson | 2013 | $$1 + \sqrt{3} + \epsilon \approx 2.732 + \epsilon$$ | Pseudo-approximation technique; breakthrough improvement over the long-standing ratio of $$3 + \epsilon$$ [6] |
| Byrka, Pensyl, Rybicki, Srinivasan, Trinh | 2015 (SODA; journal 2017) | $$2.675 + \epsilon$$ | Current best known ratio for general metrics [7] |

The local search approach by Arya et al. is especially notable for its practical simplicity. Starting with any feasible solution, the algorithm iteratively tries to improve the objective by swapping one or more centers. If each swap considers replacing p centers simultaneously, the approximation guarantee is $$3 + 2/p$$. For $$p = 1$$ (single swap), this yields a 5-approximation. [4] The Li and Svensson result, which achieves $$1 + \sqrt{3} + \epsilon \approx 2.732 + \epsilon$$, broke the decade-old $$3 + \epsilon$$ barrier by showing that an $$\alpha$$-approximation follows from a pseudo-approximation that opens only $$k + O(1)$$ facilities. [6]

## What is the metric k-median problem in facility location?

In the operations research and combinatorial optimization literature, the k-median problem is often studied as a metric facility location problem. Given a set of clients D and a set of potential facility locations F in a [metric space](/wiki/metric_space) satisfying the triangle inequality, the goal is to open exactly k facilities and assign each client to its nearest open facility so as to minimize the total assignment cost. [3][5]

Formally:

$$
\min \sum_{u \in D} \min_{v \in S} d(u, v) \quad \text{subject to } \lvert S \rvert = k, \; S \subseteq F
$$

This formulation is known as the discrete k-median problem when centers must be chosen from a finite candidate set F, and as the continuous k-median problem when centers can be placed anywhere in the space.

The facility location perspective has driven much of the theoretical work on approximation algorithms and connects k-median to problems such as the p-median problem studied by Kariv and Hakimi (1979) in network optimization. [8] Jain and Vazirani showed that primal-dual and Lagrangian-relaxation techniques developed for metric facility location can be adapted to the k-median problem. [5]

## Related algorithms and variants

### Partitioning around medoids (PAM)

The PAM algorithm, introduced by Kaufman and Rousseeuw in 1987, is closely related to k-median but belongs to the k-medoids family. [1] Instead of computing coordinate-wise medians, PAM selects actual data points as cluster centers (medoids). PAM operates in two phases:

1. **BUILD phase.** Greedily select k initial medoids to minimize the total dissimilarity.
2. **SWAP phase.** Iteratively consider swapping each medoid with each non-medoid point, performing the swap that yields the largest decrease in the objective. Continue until no beneficial swap exists.

The original PAM algorithm has a per-iteration complexity of $$O(k(n-k)^2)$$. [2] The FastPAM variant (Schubert and Rousseeuw, 2019) reduces this to $$O(n^2)$$ per iteration, yielding up to a k-fold speedup. [9] FasterPAM further improves performance through eager swap strategies. [9]

### CLARA and CLARANS

For large datasets where PAM becomes prohibitively expensive, Kaufman and Rousseeuw also proposed CLARA (Clustering Large Applications). [2] CLARA works by drawing multiple random samples from the dataset, running PAM on each sample, and selecting the best result.

CLARANS (Clustering Large Applications based on RANdomized Search), introduced by Ng and Han in 1994, combines ideas from PAM and CLARA. [10] Rather than exhaustively searching the neighborhood of the current solution, CLARANS randomly samples a subset of possible swaps. Its cost scales roughly linearly with n, making it more practical for large datasets.

### Weighted k-median

In the weighted variant, each data point $$x_i$$ has an associated weight $$w_i$$, and the objective becomes:

$$
\min \sum_{i=1}^{n} w_i \cdot \min_{j \in \{1, \ldots, k\}} \lVert x_i - c_j \rVert_1
$$

This formulation is useful when certain data points represent aggregated observations or carry different levels of importance.

### K-median with outliers

The k-median with outliers variant allows a specified number of points to be discarded as outliers, further improving robustness. Chen (2008) gave the first constant-factor approximation for this problem. [11]

## Software implementations

K-median is available in several statistical and [machine learning](/wiki/machine_learning) software packages.

| Software | Package / Module | Notes |
|----------|------------------|-------|
| Python | pyclustering (`kmedians` class) | Dedicated k-median implementation with L1 distance |
| Python | [scikit-learn-extra](/wiki/scikit-learn) (`KMedoids`) | Supports k-medoids with PAM and alternate methods; can approximate k-median behavior |
| R | `flexclust` package | Provides k-median clustering via the `kcca` function with Manhattan distance |
| R | `cluster` package (`pam` function) | PAM implementation for k-medoids |
| ELKI | `KMediansLloyd` | Java-based k-median with multiple distance options |
| Stata | `cluster kmedians` | Built-in command for k-median clustering |
| GeoDa | Clusters menu | Spatial data analysis tool with k-median support |

For users of [scikit-learn](/wiki/scikit-learn), there is no built-in k-median class in the core library. However, a k-median algorithm can be implemented by modifying the centroid update step of [KMeans](/wiki/k-means) to use numpy's median function instead of the mean. Alternatively, the `pyclustering` library provides a ready-to-use implementation.

### Python example

A minimal k-median implementation in Python:

```python
import numpy as np

def k_median(X, k, max_iter=100):
    # Random initialization
    indices = np.random.choice(len(X), k, replace=False)
    centers = X[indices].copy()

    for _ in range(max_iter):
        # Assignment step: compute Manhattan distances
        distances = np.array([
            np.sum(np.abs(X - c), axis=1) for c in centers
        ]).T
        labels = np.argmin(distances, axis=1)

        # Update step: compute coordinate-wise median
        new_centers = np.array([
            np.median(X[labels == j], axis=0) for j in range(k)
        ])

        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return labels, centers
```

## How do you choose the number of clusters k?

Like k-means, k-median requires the user to specify the number of clusters k in advance. Several methods can help determine an appropriate value:

- **Elbow method.** Plot the total L1 cost as a function of k and look for a "bend" in the curve where adding more clusters provides diminishing returns.
- **Silhouette analysis.** Compute the average [silhouette score](/wiki/silhouette_score) across all data points for different values of k. Higher silhouette values indicate better-defined clusters. Kaufman and Rousseeuw suggest that an average silhouette width above roughly 0.5 indicates a reasonable clustering structure. [2]
- **Gap statistic.** Compare the total within-cluster distance to that expected under a null reference distribution.
- **Cross-validation.** For downstream prediction tasks, select k based on performance on held-out data.

## What are the advantages and limitations of k-median?

### Advantages

- More robust to outliers and noise than k-means due to the median's high breakdown point and the L1 distance's linear penalty. [14][16]
- Well suited for data with heavy-tailed or skewed distributions where the mean is a poor measure of central tendency.
- Works effectively with discrete or even binary data.
- Interpretable results; each cluster is represented by a center that reflects the typical value along each dimension.

### Limitations

- Computationally more expensive than k-means per iteration due to the sorting required for median computation.
- Like k-means, assumes roughly spherical clusters and may struggle with elongated or irregularly shaped groups.
- Convergence to local minima depends on initialization, and the globally optimal solution is NP-hard to find. [3]
- Requires the user to choose k in advance.
- The coordinate-wise median center is not invariant under rotation of the coordinate axes, unlike the L2 centroid used in k-means.

## When should you use k-median?

K-median clustering is applied across many domains, and it is the natural choice whenever cluster centers should resist being dragged toward extreme values or whenever Manhattan (city-block) distance is the meaningful metric.

| Domain | Application | Why k-median is preferred |
|--------|-------------|---------------------------|
| Customer segmentation | Grouping customers by purchasing behavior | Transaction data often contains outlier purchases; k-median resists skew from high-spending outliers |
| Facility location | Choosing warehouse or store locations to minimize total delivery distance | The L1 objective corresponds to minimizing total travel distance on a grid (city-block distance) |
| Image segmentation | Partitioning pixels into regions with similar color or intensity | Robust to sensor noise and extreme pixel values |
| Anomaly detection | Identifying data points far from any cluster center | Cleaner cluster centers improve outlier scoring |
| Gene expression analysis | Clustering genes with similar expression profiles | Biological data frequently contains measurement noise and outlier expressions |
| Text mining | Grouping documents by topic similarity | Document feature vectors are high-dimensional and sparse; L1 distance handles sparsity well |
| Spatial analysis | Partitioning geographic data into service regions | Manhattan distance naturally models travel on street grids |

## See also

- [K-means](/wiki/k-means)
- [K-medoids](/wiki/k_medoids)
- [Clustering](/wiki/clustering)
- [Unsupervised learning](/wiki/unsupervised_learning)
- [Manhattan distance](/wiki/manhattan_distance)
- [DBSCAN](/wiki/dbscan)
- [Hierarchical clustering](/wiki/hierarchical_clustering)
- [Anomaly detection](/wiki/anomaly_detection)

## References

1. Kaufman, L. and Rousseeuw, P.J. (1987). "Clustering by Means of Medoids." In Y. Dodge (Ed.), *Statistical Data Analysis Based on the L1-Norm and Related Methods*, pp. 405-416. North-Holland, Amsterdam.
2. Kaufman, L. and Rousseeuw, P.J. (1990). *Finding Groups in Data: An Introduction to Cluster Analysis*. Wiley Series in Probability and Statistics.
3. Charikar, M., Guha, S., Tardos, E., and Shmoys, D.B. (2002). "A Constant-Factor Approximation Algorithm for the k-Median Problem." *Journal of Computer and System Sciences*, 65(1), 129-149.
4. Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., and Pandit, V. (2004). "Local Search Heuristics for k-Median and Facility Location Problems." *SIAM Journal on Computing*, 33(3), 544-562.
5. Jain, K. and Vazirani, V.V. (2001). "Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation." *Journal of the ACM*, 48(2), 274-296.
6. Li, S. and Svensson, O. (2013). "Approximating k-Median via Pseudo-Approximation." *Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC)*, pp. 901-910. (Journal version: *SIAM Journal on Computing*, 45(2), 530-547, 2016.)
7. Byrka, J., Pensyl, T., Rybicki, B., Srinivasan, A., and Trinh, K. (2015). "An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization." *Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)*, pp. 737-756. (Journal version: *ACM Transactions on Algorithms*, 13(2), Article 23, 2017.)
8. Kariv, O. and Hakimi, S.L. (1979). "An Algorithmic Approach to Network Location Problems. II: The p-Medians." *SIAM Journal on Applied Mathematics*, 37(3), 539-560.
9. Schubert, E. and Rousseeuw, P.J. (2019). "Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms." *Proceedings of the International Conference on Similarity Search and Applications (SISAP)*, pp. 171-187.
10. Ng, R.T. and Han, J. (1994). "Efficient and Effective Clustering Methods for Spatial Data Mining." *Proceedings of the 20th International Conference on Very Large Data Bases (VLDB)*, pp. 144-155.
11. Chen, K. (2008). "A Constant Factor Approximation Algorithm for k-Median Clustering with Outliers." *Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)*, pp. 826-835.
12. Bradley, P.S. and Fayyad, U.M. (1998). "Refining Initial Points for K-Means Clustering." *Proceedings of the 15th International Conference on Machine Learning (ICML)*, pp. 91-99.
13. Arthur, D. and Vassilvitskii, S. (2007). "k-means++: The Advantages of Careful Seeding." *Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)*, pp. 1027-1035.
14. Wikipedia contributors. "k-medians clustering." *Wikipedia, The Free Encyclopedia*. Retrieved 2026. https://en.wikipedia.org/wiki/K-medians_clustering
15. Helm, M. (2021). "Use this clustering method if you have many outliers." *Towards Data Science*, August 6, 2021. https://towardsdatascience.com/use-this-clustering-method-if-you-have-many-outliers-5c99b4cd380d/
16. Wikipedia contributors. "Robust statistics." *Wikipedia, The Free Encyclopedia*. Retrieved 2026. https://en.wikipedia.org/wiki/Robust_statistics