# Agglomerative clustering

> Source: https://aiwiki.ai/wiki/agglomerative_clustering
> Updated: 2026-06-27
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Agglomerative clustering** is a bottom-up form of [hierarchical clustering](/wiki/hierarchical_clustering) used in [unsupervised learning](/wiki/unsupervised_learning) in which each data point starts in its own singleton cluster and the algorithm repeatedly merges the two closest clusters until one cluster remains or a stopping rule fires.[1][5] The full merge history is drawn as a [dendrogram](/wiki/dendrogram), a binary tree recording which clusters were joined and at what distance, and cutting that tree at a chosen height yields any desired number of clusters without specifying it in advance.[1] How "closest" is measured is set by a **linkage criterion**, the four standard choices being single, complete, average, and Ward linkage.[5] Older statistics literature also calls the method AGNES (AGglomerative NESting), in contrast to DIANA (DIvisive ANAlysis), which works top-down.[12]

## What is agglomerative clustering?

The core idea is simple. Treat every point as its own cluster, find the two clusters that are closest under some distance rule, merge them, and repeat. After n minus 1 merges every point sits in a single cluster, and the sequence of merges defines a hierarchy at every possible granularity. To get k clusters, you cut the dendrogram at the height that produces k branches.[1] As the scikit-learn user guide puts it, "The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together."[6]

Unlike [k-means clustering](/wiki/k-means_clustering), agglomerative clustering does not need the number of clusters up front. You can decide on k after looking at the dendrogram, or set a distance threshold and let the algorithm stop when the next merge would exceed it.[5] The trade off is that the method is greedy: once two clusters are merged the decision is never revisited, even if it later looks like a bad call.[1]

## How does the algorithm work?

A generic version of the procedure runs as follows:

1. Compute the n by n pairwise distance matrix between all data points.
2. Place each point in its own cluster, giving n initial clusters.
3. Find the pair of clusters with the smallest inter-cluster distance under the chosen linkage criterion.
4. Merge that pair and update the distance matrix.
5. Repeat steps 3 and 4 until only one cluster is left or a stopping criterion is met.[1]

The distance-matrix update step is where the **linkage criterion** matters, because it decides how the distance between a newly merged cluster and the remaining clusters is computed.

### Lance-Williams update formula

Most common linkage methods can be expressed through the **Lance-Williams formula**, a recursive equation that updates inter-cluster distances in constant time per pair instead of recomputing them from scratch.[8] For a merge of clusters A and B and any other cluster C, the new distance is

```
d(A union B, C) = alpha_A * d(A, C) + alpha_B * d(B, C) + beta * d(A, B) + gamma * |d(A, C) - d(B, C)|
```

Different values of alpha_A, alpha_B, beta and gamma give different linkages: single, complete, average (UPGMA), weighted average (WPGMA), centroid, median, and Ward. One implementation can support all of them without rewriting the inner loop.[8]

## What are linkage criteria?

The linkage criterion is the most consequential choice, because it changes the shape of the clusters the algorithm prefers. scikit-learn supports four, defined in its documentation as follows: "Ward minimizes the sum of squared differences within all clusters"; "Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters"; "Average linkage minimizes the average of the distances between all observations of pairs of clusters"; and "Single linkage minimizes the distance between the closest observations of pairs of clusters."[6]

| Linkage | Distance between clusters A and B | Tends to produce | Notes |
|---|---|---|---|
| **Single** | min over x in A, y in B of d(x, y) | Long, thin, chained clusters | Sensitive to noise; can join distant clusters through a chain of close points.[3] SLINK runs in O(n^2).[10] |
| **Complete** | max over x in A, y in B of d(x, y) | Compact, roughly equal-diameter clusters | Less prone to chaining than single linkage but can split large irregular clusters.[4] CLINK runs in O(n^2).[11] |
| **Average (UPGMA)** | mean over x in A, y in B of d(x, y) | Globular clusters of similar size | A compromise between single and complete; commonly used in [bioinformatics](/wiki/bioinformatics).[6] |
| **Ward** | Increase in within-cluster variance (ESS) after merging | Compact, roughly spherical, evenly sized clusters | Requires squared Euclidean distance. Often the default in practice.[2] |
| **Centroid (UPGMC)** | Distance between cluster centroids | Globular clusters | Can produce dendrogram inversions (a merge happens at a lower height than a previous one). |
| **Median (WPGMC)** | Distance between cluster medians | Globular clusters | Like centroid, also subject to inversions. |

**Single linkage** defines the distance between two clusters as the distance between their two closest members. It can find non-convex shapes well, and it has a clean connection to graph theory: running single-linkage clustering is mathematically equivalent to computing a minimum spanning tree of the points and cutting edges in decreasing order of weight.[3] The downside is **chaining**, where a string of points lying between two natural clusters causes them to merge through a thin bridge.[3] The SLINK algorithm, proposed by R. Sibson in 1973, computes single-linkage hierarchies in O(n^2) time and O(n) space.[10]

**Complete linkage** uses the distance between the two farthest members. This avoids chaining and tends to produce tight, comparable clusters, but it can split a single elongated cluster into pieces if a few outliers stretch its diameter.[4] The CLINK algorithm (D. Defays, 1977) runs in O(n^2) and is the complete-linkage analogue of SLINK.[11]

**Average linkage**, also called UPGMA (unweighted pair group method with arithmetic mean), takes the mean of all pairwise distances between members of the two clusters. It sits between single and complete linkage in behavior, and is the standard distance-based method for building phylogenetic trees in [bioinformatics](/wiki/bioinformatics).[6]

**Ward's method**, introduced by Joe H. Ward Jr. in 1963, takes a different angle.[9] Instead of asking how far apart two clusters are, it asks how much the within-cluster variance (the error sum of squares, ESS) would grow if the two clusters were merged, and picks the pair whose merge produces the smallest increase.[9] The scikit-learn user guide describes it as "a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach," which is why Ward output often resembles [k-means clustering](/wiki/k-means_clustering) but comes with a full hierarchy.[6] It tends to produce clusters of similar size and shape.[2] Ward requires the original point-to-point distances to be squared Euclidean, so the metric is not freely interchangeable; in scikit-learn, when `linkage='ward'` only `'euclidean'` (equivalently `'l2'`) is accepted as the metric.[5]

## Which distance metric should you use?

For linkages other than Ward, agglomerative clustering can be combined with any valid distance metric. Euclidean distance is the default for continuous numeric data. Manhattan (L1) distance is used for sparse data and some text-mining problems. Cosine distance is common when clustering text [embeddings](/wiki/embedding). Hamming distance is used for categorical or binary attributes, and Mahalanobis distance accounts for feature covariance. Many implementations also accept a precomputed n by n distance matrix when the notion of similarity is application-specific.[5] The metric matters as much as the linkage, because together they decide what "close" means.

## What is a dendrogram and how do you read it?

The output of agglomerative clustering is most naturally visualized as a **dendrogram**, a binary tree where leaves are individual data points and internal nodes are merges. The height of each internal node is the distance at which the merge happened, so reading from bottom to top traces the order of agglomeration.[1] Cutting the dendrogram with a horizontal line at a chosen height yields a flat clustering: every subtree below the cut becomes one cluster.[1] Long vertical segments (large jumps in merge height between consecutive joins) are sometimes interpreted as natural breaks in the data, but this is informal; there is no guarantee that the largest jump corresponds to a meaningful split. Centroid and median linkage can produce **inversions** where a merge appears at a lower height than an earlier one, which violates the visual convention of monotonically increasing heights and is one reason those linkages are less popular than Ward, average, single, and complete linkage.

## What is the time and space complexity?

A straightforward implementation computes the full n by n distance matrix and scans it for the closest pair at each of the n minus 1 merge steps. That gives **O(n^3) time** and **O(n^2) space**, since the matrix dominates memory.[7] With a priority queue (or one per row of the similarity matrix), the merge search drops to **O(n^2 log n)**.[7] Two special cases reach **O(n^2)**: SLINK for single linkage[10] and CLINK for complete linkage.[11] In practice the difference between O(n^2 log n) and O(n^2) is small, because all four classical variants still need O(n^2) pairwise distance computations regardless of the merge strategy.[8]

The O(n^2) memory footprint is what really limits scaling. For a million points the distance matrix alone is on the order of a terabyte at single precision, so agglomerative clustering is usually applied to a few thousand to a few hundred thousand points, or to subsampled or feature-reduced versions of larger datasets.

## How do you run it in scikit-learn?

In [scikit-learn](/wiki/scikit-learn), agglomerative clustering is exposed through `sklearn.cluster.AgglomerativeClustering`. The main parameters are:[5]

| Parameter | Default | Purpose |
|---|---|---|
| `n_clusters` | 2 | Target number of clusters. Set to `None` if `distance_threshold` is used. |
| `metric` | `'euclidean'` | Distance metric. Options: `'l1'`, `'l2'`, `'manhattan'`, `'cosine'`, `'precomputed'`. Only `'euclidean'`/`'l2'` allowed when `linkage='ward'`. |
| `linkage` | `'ward'` | One of `'ward'`, `'complete'`, `'average'`, `'single'`. |
| `distance_threshold` | `None` | Stop merging when the minimum inter-cluster distance exceeds this value. |
| `connectivity` | `None` | Sparse matrix of allowed merges; forbids merging clusters whose points are not adjacent in the graph. |
| `compute_distances` | `False` | If `True`, store merge distances so a dendrogram can be drawn. |

After fitting, the model exposes `labels_` for the flat cluster assignment, `children_` describing each merge, and `distances_` (the merge heights, useful for plotting with `scipy.cluster.hierarchy.dendrogram`).[5]

### Connectivity constraints

A useful feature of the scikit-learn implementation is **connectivity-constrained clustering**, where a sparse adjacency matrix tells the algorithm which pairs of points are allowed to merge.[6] The matrix is built either from prior knowledge (for example, web pages linked by hyperlinks) or from the data with helpers like `sklearn.neighbors.kneighbors_graph` for k-nearest-neighbor graphs or `sklearn.feature_extraction.image.grid_to_graph` for image pixel grids.[6] Connectivity constraints impose local structure (a cluster should be spatially contiguous, useful for image segmentation) and reduce computation by ignoring most candidate merges. The scikit-learn documentation warns that "Connectivity constraints and single, complete or average linkage can enhance the 'rich getting richer' aspect of agglomerative clustering," so that "in the limit of a small number of clusters, they tend to give a few macroscopically occupied clusters and almost empty ones"; Ward linkage gives the most regular sizes and is the safest pairing.[6]

### Basic example

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

clustering = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(clustering.labels_)  # e.g. [1 1 1 0 0 0]
```

A related class, `sklearn.cluster.FeatureAgglomeration`, applies the same algorithm to columns rather than rows for unsupervised feature reduction.[6]

## How is it different from k-means and DBSCAN?

Agglomerative clustering is one of several common families, each with trade-offs.

| Property | Agglomerative | [K-means](/wiki/k-means_clustering) | [DBSCAN](/wiki/dbscan) | Divisive (DIANA) |
|---|---|---|---|---|
| Direction | Bottom-up | Iterative reassignment | Density expansion | Top-down |
| Need to specify k? | Optional (can use a distance threshold or pick k from the dendrogram) | Yes | No (uses `eps` and `min_samples` instead) | Optional |
| Cluster shapes | Depends on linkage; flexible | Spherical, similar sizes | Arbitrary shapes, supports noise | Depends on splitting rule |
| Handles noise / outliers | Poorly (every point is assigned) | Poorly | Well (explicit noise label) | Mixed |
| Time complexity | O(n^2) to O(n^3) | O(n * k * iterations) | O(n log n) with spatial index | Up to O(2^n) in the worst case |
| Memory | O(n^2) for distance matrix | O(n + k) | O(n) | O(n^2) typically |
| Reproducible? | Yes (deterministic given metric and linkage) | No (depends on initialization) | Yes | Yes |
| Hierarchical output? | Yes, dendrogram | No | No | Yes, dendrogram |

K-means assumes roughly spherical clusters of similar size, requires k in advance, depends on random initialization, and scales well to large n. Agglomerative clustering with Ward linkage optimizes a closely related variance criterion but produces a full hierarchy and is deterministic, at the cost of much higher memory and runtime.[2] [DBSCAN](/wiki/dbscan) is density-based: it grows clusters from dense neighborhoods and labels low-density points as noise, does not need k, and scales better when a spatial index is used.[6] Agglomerative clustering with single linkage can mimic some of DBSCAN's flexibility on non-globular shapes, but it has no built-in concept of noise: every point ends up in some cluster.

Divisive hierarchical clustering, implemented in the DIANA algorithm (Kaufman and Rousseeuw, 1990), starts with all points in one cluster and splits at each step.[12] In principle DIANA can be better at finding the coarse top-level structure of the data, while agglomerative methods are better at picking out fine structure near the leaves. In practice divisive methods are far less common, because the general problem of finding the best split is combinatorial and most implementations rely on heuristics rather than an exact search.

## What are the strengths and limitations?

The big appeal is that agglomerative clustering produces a full hierarchy, which is more informative than a flat partition. You can read clusters at any resolution by cutting the dendrogram at different heights, you do not need to pick k in advance, and the result is deterministic given a metric and a linkage choice. It works with any pairwise distance (including a precomputed similarity matrix), which helps when all you have is a kernel or a custom dissimilarity score. Single linkage in particular can discover arbitrary-shaped clusters.

The limitations are mostly about scale and sensitivity. The O(n^2) memory footprint is a hard ceiling; for more than a few hundred thousand points the distance matrix does not fit in RAM on most machines.[7] The greedy merge rule cannot be undone, so a bad early merge propagates up the tree.[1] Results are sensitive to the distance metric, the linkage, and outliers, which can drag single-linkage chains across the dataset. There is no probabilistic model behind the clusters, so there is no built-in way to assign new points after the model is built; standard practice is to refit or to assign new points to the nearest existing centroid. Centroid and median linkage can produce dendrogram inversions that complicate interpretation.

## What is agglomerative clustering used for?

Agglomerative clustering appears wherever a hierarchical view of similarity is useful. In bioinformatics and phylogenetics, UPGMA and related average-linkage methods build evolutionary trees from genetic distance matrices. Cosine distance with average or complete linkage groups related documents, news articles, or search results in text clustering.[7] Marketing teams use Ward linkage on standardized purchase features for customer segmentation that produces evenly sized groups without picking k in advance. With connectivity constraints from a pixel grid, the algorithm can partition an image into spatially contiguous regions.[6] Very small clusters or single points that only merge into the main tree at high distance can be flagged as outliers. Even when a different algorithm is used for the final clustering, a dendrogram is a cheap way to inspect what natural groupings exist in a small to medium dataset.

## History

Ward's method was published in 1963,[9] and Stephen Johnson's 1967 paper *Hierarchical Clustering Schemes* in Psychometrika unified single-linkage and complete-linkage clustering under a common framework (sometimes called Johnson's algorithm). The Lance-Williams formula appeared in 1967, providing the recursive update that powers most modern implementations.[8] SLINK (Sibson, 1973)[10] and CLINK (Defays, 1977)[11] gave optimal O(n^2) algorithms for single and complete linkage. The AGNES and DIANA names were popularized by Kaufman and Rousseeuw's 1990 textbook *Finding Groups in Data*, still cited as the standard reference for the algorithm family.[12]

## Explain like I'm 5

Imagine a pile of toys on the floor. Treat each toy as its own tiny group of one. Find the two groups that look most alike (maybe two teddy bears) and push them together. Keep doing that, again and again, joining the two most similar groups each time, until you have a few big piles (bears, trucks, dolls), and eventually one giant pile that contains every toy. The drawing of who got joined to whom, and at what step, is the dendrogram, and you can stop whenever the groups look the way you want.

## References

1. Wikipedia. [Hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
2. Wikipedia. [Ward's method](https://en.wikipedia.org/wiki/Ward%27s_method).
3. Wikipedia. [Single-linkage clustering](https://en.wikipedia.org/wiki/Single-linkage_clustering).
4. Wikipedia. [Complete-linkage clustering](https://en.wikipedia.org/wiki/Complete-linkage_clustering).
5. scikit-learn developers. [AgglomerativeClustering documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html).
6. scikit-learn developers. [Clustering user guide (section 2.3.6, Hierarchical clustering)](https://scikit-learn.org/stable/modules/clustering.html).
7. Manning, Raghavan, and Schutze. *Introduction to Information Retrieval*, Chapter 17: [Time complexity of HAC](https://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html). Cambridge University Press, 2008.
8. Mullner, Daniel. [Modern hierarchical, agglomerative clustering algorithms](https://arxiv.org/abs/1109.2378). arXiv:1109.2378, 2011.
9. Ward, Joe H. Jr. "Hierarchical Grouping to Optimize an Objective Function." *Journal of the American Statistical Association*, 58(301), 1963.
10. Sibson, R. "SLINK: an optimally efficient algorithm for the single-link cluster method." *The Computer Journal*, 16(1), 1973.
11. Defays, D. "An efficient algorithm for a complete-link method." *The Computer Journal*, 20(4), 1977.
12. Kaufman, L., and Rousseeuw, P. J. *Finding Groups in Data: An Introduction to Cluster Analysis*. Wiley, 1990.