Agglomerative clustering
Last reviewed
May 11, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 ยท 2,739 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 ยท 2,739 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
Agglomerative clustering is a bottom-up form of hierarchical clustering used in unsupervised learning. Each data point starts in its own singleton cluster, and the algorithm repeatedly merges the two closest clusters until one remains or a stopping rule fires. The full merge history is drawn as a dendrogram, a binary tree recording which clusters were joined and at what distance. Older statistics literature also calls it AGNES (AGglomerative NESting), in contrast to DIANA (DIvisive ANAlysis), which works top-down.
The core idea is simple. Treat every point as its own cluster, find the two clusters that are closest under some distance rule, merge them, and repeat. After n minus 1 merges every point sits in a single cluster, and the sequence of merges defines a hierarchy at every possible granularity. To get k clusters, you cut the dendrogram at the height that produces k branches.
Unlike k-means clustering, agglomerative clustering does not need the number of clusters up front. You can decide on k after looking at the dendrogram, or set a distance threshold and let the algorithm stop when the next merge would exceed it. The trade off is that the method is greedy: once two clusters are merged the decision is never revisited, even if it later looks like a bad call.
A generic version of the procedure runs as follows:
The distance-matrix update step is where the linkage criterion matters, because it decides how the distance between a newly merged cluster and the remaining clusters is computed.
Most common linkage methods can be expressed through the Lance-Williams formula, a recursive equation that updates inter-cluster distances in constant time per pair instead of recomputing them from scratch. For a merge of clusters A and B and any other cluster C, the new distance is
d(A union B, C) = alpha_A * d(A, C) + alpha_B * d(B, C) + beta * d(A, B) + gamma * |d(A, C) - d(B, C)|
Different values of alpha_A, alpha_B, beta and gamma give different linkages: single, complete, average (UPGMA), weighted average (WPGMA), centroid, median, and Ward. One implementation can support all of them without rewriting the inner loop.
The linkage criterion is the most consequential choice, because it changes the shape of the clusters the algorithm prefers.
| Linkage | Distance between clusters A and B | Tends to produce | Notes |
|---|---|---|---|
| Single | min over x in A, y in B of d(x, y) | Long, thin, chained clusters | Sensitive to noise; can join distant clusters through a chain of close points. SLINK runs in O(n^2). |
| Complete | max over x in A, y in B of d(x, y) | Compact, roughly equal-diameter clusters | Less prone to chaining than single linkage but can split large irregular clusters. CLINK runs in O(n^2). |
| Average (UPGMA) | mean over x in A, y in B of d(x, y) | Globular clusters of similar size | A compromise between single and complete; commonly used in bioinformatics. |
| Ward | Increase in within-cluster variance (ESS) after merging | Compact, roughly spherical, evenly sized clusters | Requires squared Euclidean distance. Often the default in practice. |
| Centroid (UPGMC) | Distance between cluster centroids | Globular clusters | Can produce dendrogram inversions (a merge happens at a lower height than a previous one). |
| Median (WPGMC) | Distance between cluster medians | Globular clusters | Like centroid, also subject to inversions. |
Single linkage defines the distance between two clusters as the distance between their two closest members. It can find non-convex shapes well, and it has a clean connection to graph theory: running single-linkage clustering is mathematically equivalent to computing a minimum spanning tree of the points and cutting edges in decreasing order of weight. The downside is chaining, where a string of points lying between two natural clusters causes them to merge through a thin bridge. The SLINK algorithm, proposed by R. Sibson in 1973, computes single-linkage hierarchies in O(n^2) time and O(n) space.
Complete linkage uses the distance between the two farthest members. This avoids chaining and tends to produce tight, comparable clusters, but it can split a single elongated cluster into pieces if a few outliers stretch its diameter. The CLINK algorithm (D. Defays, 1977) runs in O(n^2) and is the complete-linkage analogue of SLINK.
Average linkage, also called UPGMA (unweighted pair group method with arithmetic mean), takes the mean of all pairwise distances between members of the two clusters. It sits between single and complete linkage in behavior, and is the standard distance-based method for building phylogenetic trees in bioinformatics.
Ward's method, introduced by Joe H. Ward Jr. in 1963, takes a different angle. Instead of asking how far apart two clusters are, it asks how much the within-cluster variance (the error sum of squares, ESS) would grow if the two clusters were merged, and picks the pair whose merge produces the smallest increase. This makes Ward closely related to the objective function of k-means clustering, and it tends to produce clusters of similar size and shape. Ward requires the original point-to-point distances to be squared Euclidean, so the metric is not freely interchangeable; scikit-learn forces metric='euclidean' whenever linkage='ward' is selected.
For linkages other than Ward, agglomerative clustering can be combined with any valid distance metric. Euclidean distance is the default for continuous numeric data. Manhattan (L1) distance is used for sparse data and some text-mining problems. Cosine distance is common when clustering text embeddings. Hamming distance is used for categorical or binary attributes, and Mahalanobis distance accounts for feature covariance. Many implementations also accept a precomputed n by n distance matrix when the notion of similarity is application-specific. The metric matters as much as the linkage, because together they decide what "close" means.
The output of agglomerative clustering is most naturally visualized as a dendrogram, a binary tree where leaves are individual data points and internal nodes are merges. The height of each internal node is the distance at which the merge happened, so reading from bottom to top traces the order of agglomeration. Cutting the dendrogram with a horizontal line at a chosen height yields a flat clustering: every subtree below the cut becomes one cluster. Long vertical segments (large jumps in merge height between consecutive joins) are sometimes interpreted as natural breaks in the data, but this is informal; there is no guarantee that the largest jump corresponds to a meaningful split. Centroid and median linkage can produce inversions where a merge appears at a lower height than an earlier one, which violates the visual convention of monotonically increasing heights and is one reason those linkages are less popular than Ward, average, single, and complete linkage.
A straightforward implementation computes the full n by n distance matrix and scans it for the closest pair at each of the n minus 1 merge steps. That gives O(n^3) time and O(n^2) space, since the matrix dominates memory. With a priority queue (or one per row of the similarity matrix), the merge search drops to O(n^2 log n). Two special cases reach O(n^2): SLINK for single linkage and CLINK for complete linkage. In practice the difference between O(n^2 log n) and O(n^2) is small, because all four classical variants still need O(n^2) pairwise distance computations regardless of the merge strategy.
The O(n^2) memory footprint is what really limits scaling. For a million points the distance matrix alone is on the order of a terabyte at single precision, so agglomerative clustering is usually applied to a few thousand to a few hundred thousand points, or to subsampled or feature-reduced versions of larger datasets.
In scikit-learn, agglomerative clustering is exposed through sklearn.cluster.AgglomerativeClustering. The main parameters are:
| Parameter | Default | Purpose |
|---|---|---|
n_clusters | 2 | Target number of clusters. Set to None if distance_threshold is used. |
metric | 'euclidean' | Distance metric. Options: 'l1', 'l2', 'manhattan', 'cosine', 'precomputed'. Ignored when linkage='ward'. |
linkage | 'ward' | One of 'ward', 'complete', 'average', 'single'. |
distance_threshold | None | Stop merging when the minimum inter-cluster distance exceeds this value. |
connectivity | None | Sparse matrix of allowed merges; forbids merging clusters whose points are not adjacent in the graph. |
compute_distances | False | If True, store merge distances so a dendrogram can be drawn. |
After fitting, the model exposes labels_ for the flat cluster assignment, children_ describing each merge, and distances_ (the merge heights, useful for plotting with scipy.cluster.hierarchy.dendrogram).
A useful feature of the scikit-learn implementation is connectivity-constrained clustering, where a sparse adjacency matrix tells the algorithm which pairs of points are allowed to merge. The matrix is built either from prior knowledge (for example, web pages linked by hyperlinks) or from the data with helpers like sklearn.neighbors.kneighbors_graph for k-nearest-neighbor graphs or sklearn.feature_extraction.image.grid_to_graph for image pixel grids. Connectivity constraints impose local structure (a cluster should be spatially contiguous, useful for image segmentation) and reduce computation by ignoring most candidate merges. The scikit-learn documentation warns that combining single, complete, or average linkage with connectivity constraints can amplify a "rich get richer" effect where a few clusters absorb most points; Ward linkage is the safest pairing.
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(clustering.labels_) # e.g. [1 1 1 0 0 0]
A related class, sklearn.cluster.FeatureAgglomeration, applies the same algorithm to columns rather than rows for unsupervised feature reduction.
Agglomerative clustering is one of several common families, each with trade-offs.
| Property | Agglomerative | K-means | DBSCAN | Divisive (DIANA) |
|---|---|---|---|---|
| Direction | Bottom-up | Iterative reassignment | Density expansion | Top-down |
| Need to specify k? | Optional (can use a distance threshold or pick k from the dendrogram) | Yes | No (uses eps and min_samples instead) | Optional |
| Cluster shapes | Depends on linkage; flexible | Spherical, similar sizes | Arbitrary shapes, supports noise | Depends on splitting rule |
| Handles noise / outliers | Poorly (every point is assigned) | Poorly | Well (explicit noise label) | Mixed |
| Time complexity | O(n^2) to O(n^3) | O(n * k * iterations) | O(n log n) with spatial index | Up to O(2^n) in the worst case |
| Memory | O(n^2) for distance matrix | O(n + k) | O(n) | O(n^2) typically |
| Reproducible? | Yes (deterministic given metric and linkage) | No (depends on initialization) | Yes | Yes |
| Hierarchical output? | Yes, dendrogram | No | No | Yes, dendrogram |
K-means assumes roughly spherical clusters of similar size, requires k in advance, depends on random initialization, and scales well to large n. Agglomerative clustering with Ward linkage optimizes a closely related variance criterion but produces a full hierarchy and is deterministic, at the cost of much higher memory and runtime. DBSCAN is density-based: it grows clusters from dense neighborhoods and labels low-density points as noise, does not need k, and scales better when a spatial index is used. Agglomerative clustering with single linkage can mimic some of DBSCAN's flexibility on non-globular shapes, but it has no built-in concept of noise: every point ends up in some cluster.
Divisive hierarchical clustering, implemented in the DIANA algorithm (Kaufman and Rousseeuw, 1990), starts with all points in one cluster and splits at each step. In principle DIANA can be better at finding the coarse top-level structure of the data, while agglomerative methods are better at picking out fine structure near the leaves. In practice divisive methods are far less common, because the general problem of finding the best split is combinatorial and most implementations rely on heuristics rather than an exact search.
The big appeal is that agglomerative clustering produces a full hierarchy, which is more informative than a flat partition. You can read clusters at any resolution by cutting the dendrogram at different heights, you do not need to pick k in advance, and the result is deterministic given a metric and a linkage choice. It works with any pairwise distance (including a precomputed similarity matrix), which helps when all you have is a kernel or a custom dissimilarity score. Single linkage in particular can discover arbitrary-shaped clusters.
The limitations are mostly about scale and sensitivity. The O(n^2) memory footprint is a hard ceiling; for more than a few hundred thousand points the distance matrix does not fit in RAM on most machines. The greedy merge rule cannot be undone, so a bad early merge propagates up the tree. Results are sensitive to the distance metric, the linkage, and outliers, which can drag single-linkage chains across the dataset. There is no probabilistic model behind the clusters, so there is no built-in way to assign new points after the model is built; standard practice is to refit or to assign new points to the nearest existing centroid. Centroid and median linkage can produce dendrogram inversions that complicate interpretation.
Agglomerative clustering appears wherever a hierarchical view of similarity is useful. In bioinformatics and phylogenetics, UPGMA and related average-linkage methods build evolutionary trees from genetic distance matrices. Cosine distance with average or complete linkage groups related documents, news articles, or search results in text clustering. Marketing teams use Ward linkage on standardized purchase features for customer segmentation that produces evenly sized groups without picking k in advance. With connectivity constraints from a pixel grid, the algorithm can partition an image into spatially contiguous regions. Very small clusters or single points that only merge into the main tree at high distance can be flagged as outliers. Even when a different algorithm is used for the final clustering, a dendrogram is a cheap way to inspect what natural groupings exist in a small to medium dataset.
Ward's method was published in 1963, and Stephen Johnson's 1967 paper Hierarchical Clustering Schemes in Psychometrika unified single-linkage and complete-linkage clustering under a common framework (sometimes called Johnson's algorithm). The Lance-Williams formula appeared in 1967, providing the recursive update that powers most modern implementations. SLINK (Sibson, 1973) and CLINK (Defays, 1977) gave optimal O(n^2) algorithms for single and complete linkage. The AGNES and DIANA names were popularized by Kaufman and Rousseeuw's 1990 textbook Finding Groups in Data, still cited as the standard reference for the algorithm family.
Imagine a pile of toys on the floor. Treat each toy as its own tiny group of one. Find the two groups that look most alike (maybe two teddy bears) and push them together. Keep doing that, again and again, joining the two most similar groups each time, until you have a few big piles (bears, trucks, dolls), and eventually one giant pile that contains every toy. The drawing of who got joined to whom, and at what step, is the dendrogram, and you can stop whenever the groups look the way you want.