Agglomerative clustering

See also: Machine learning terms

Agglomerative clustering is a bottom-up form of hierarchical clustering used in unsupervised learning. Each data point starts in its own singleton cluster, and the algorithm repeatedly merges the two closest clusters until one remains or a stopping rule fires. The full merge history is drawn as a dendrogram, a binary tree recording which clusters were joined and at what distance. Older statistics literature also calls it AGNES (AGglomerative NESting), in contrast to DIANA (DIvisive ANAlysis), which works top-down.

Overview

The core idea is simple. Treat every point as its own cluster, find the two clusters that are closest under some distance rule, merge them, and repeat. After n minus 1 merges every point sits in a single cluster, and the sequence of merges defines a hierarchy at every possible granularity. To get k clusters, you cut the dendrogram at the height that produces k branches.

Unlike k-means clustering, agglomerative clustering does not need the number of clusters up front. You can decide on k after looking at the dendrogram, or set a distance threshold and let the algorithm stop when the next merge would exceed it. The trade off is that the method is greedy: once two clusters are merged the decision is never revisited, even if it later looks like a bad call.

Algorithm

A generic version of the procedure runs as follows:

Compute the n by n pairwise distance matrix between all data points.
Place each point in its own cluster, giving n initial clusters.
Find the pair of clusters with the smallest inter-cluster distance under the chosen linkage criterion.
Merge that pair and update the distance matrix.
Repeat steps 3 and 4 until only one cluster is left or a stopping criterion is met.

The distance-matrix update step is where the linkage criterion matters, because it decides how the distance between a newly merged cluster and the remaining clusters is computed.

Lance-Williams update formula

Most common linkage methods can be expressed through the Lance-Williams formula, a recursive equation that updates inter-cluster distances in constant time per pair instead of recomputing them from scratch. For a merge of clusters A and B and any other cluster C, the new distance is

d(A union B, C) = alpha_A * d(A, C) + alpha_B * d(B, C) + beta * d(A, B) + gamma * |d(A, C) - d(B, C)|

Different values of alpha_A, alpha_B, beta and gamma give different linkages: single, complete, average (UPGMA), weighted average (WPGMA), centroid, median, and Ward. One implementation can support all of them without rewriting the inner loop.

Linkage criteria

The linkage criterion is the most consequential choice, because it changes the shape of the clusters the algorithm prefers.

Linkage	Distance between clusters A and B	Tends to produce	Notes
Single	min over x in A, y in B of d(x, y)	Long, thin, chained clusters	Sensitive to noise; can join distant clusters through a chain of close points. SLINK runs in O(n^2).
Complete	max over x in A, y in B of d(x, y)	Compact, roughly equal-diameter clusters	Less prone to chaining than single linkage but can split large irregular clusters. CLINK runs in O(n^2).
Average (UPGMA)	mean over x in A, y in B of d(x, y)	Globular clusters of similar size	A compromise between single and complete; commonly used in bioinformatics.
Ward	Increase in within-cluster variance (ESS) after merging	Compact, roughly spherical, evenly sized clusters	Requires squared Euclidean distance. Often the default in practice.
Centroid (UPGMC)	Distance between cluster centroids	Globular clusters	Can produce dendrogram inversions (a merge happens at a lower height than a previous one).
Median (WPGMC)	Distance between cluster medians	Globular clusters	Like centroid, also subject to inversions.

Single linkage defines the distance between two clusters as the distance between their two closest members. It can find non-convex shapes well, and it has a clean connection to graph theory: running single-linkage clustering is mathematically equivalent to computing a minimum spanning tree of the points and cutting edges in decreasing order of weight. The downside is chaining, where a string of points lying between two natural clusters causes them to merge through a thin bridge. The SLINK algorithm, proposed by R. Sibson in 1973, computes single-linkage hierarchies in O(n^2) time and O(n) space.

Complete linkage uses the distance between the two farthest members. This avoids chaining and tends to produce tight, comparable clusters, but it can split a single elongated cluster into pieces if a few outliers stretch its diameter. The CLINK algorithm (D. Defays, 1977) runs in O(n^2) and is the complete-linkage analogue of SLINK.

Average linkage, also called UPGMA (unweighted pair group method with arithmetic mean), takes the mean of all pairwise distances between members of the two clusters. It sits between single and complete linkage in behavior, and is the standard distance-based method for building phylogenetic trees in bioinformatics.

Ward's method, introduced by Joe H. Ward Jr. in 1963, takes a different angle. Instead of asking how far apart two clusters are, it asks how much the within-cluster variance (the error sum of squares, ESS) would grow if the two clusters were merged, and picks the pair whose merge produces the smallest increase. This makes Ward closely related to the objective function of k-means clustering, and it tends to produce clusters of similar size and shape. Ward requires the original point-to-point distances to be squared Euclidean, so the metric is not freely interchangeable; scikit-learn forces metric='euclidean' whenever linkage='ward' is selected.

Distance metrics

For linkages other than Ward, agglomerative clustering can be combined with any valid distance metric. Euclidean distance is the default for continuous numeric data. Manhattan (L1) distance is used for sparse data and some text-mining problems. Cosine distance is common when clustering text embeddings. Hamming distance is used for categorical or binary attributes, and Mahalanobis distance accounts for feature covariance. Many implementations also accept a precomputed n by n distance matrix when the notion of similarity is application-specific. The metric matters as much as the linkage, because together they decide what "close" means.

Dendrogram

The output of agglomerative clustering is most naturally visualized as a dendrogram, a binary tree where leaves are individual data points and internal nodes are merges. The height of each internal node is the distance at which the merge happened, so reading from bottom to top traces the order of agglomeration. Cutting the dendrogram with a horizontal line at a chosen height yields a flat clustering: every subtree below the cut becomes one cluster. Long vertical segments (large jumps in merge height between consecutive joins) are sometimes interpreted as natural breaks in the data, but this is informal; there is no guarantee that the largest jump corresponds to a meaningful split. Centroid and median linkage can produce inversions where a merge appears at a lower height than an earlier one, which violates the visual convention of monotonically increasing heights and is one reason those linkages are less popular than Ward, average, single, and complete linkage.

Time and space complexity

A straightforward implementation computes the full n by n distance matrix and scans it for the closest pair at each of the n minus 1 merge steps. That gives O(n^3) time and O(n^2) space, since the matrix dominates memory. With a priority queue (or one per row of the similarity matrix), the merge search drops to O(n^2 log n). Two special cases reach O(n^2): SLINK for single linkage and CLINK for complete linkage. In practice the difference between O(n^2 log n) and O(n^2) is small, because all four classical variants still need O(n^2) pairwise distance computations regardless of the merge strategy.

The O(n^2) memory footprint is what really limits scaling. For a million points the distance matrix alone is on the order of a terabyte at single precision, so agglomerative clustering is usually applied to a few thousand to a few hundred thousand points, or to subsampled or feature-reduced versions of larger datasets.

scikit-learn implementation

In scikit-learn, agglomerative clustering is exposed through sklearn.cluster.AgglomerativeClustering. The main parameters are:

Parameter	Default	Purpose
`n_clusters`	2	Target number of clusters. Set to `None` if `distance_threshold` is used.
`metric`	`'euclidean'`	Distance metric. Options: `'l1'`, `'l2'`, `'manhattan'`, `'cosine'`, `'precomputed'`. Ignored when `linkage='ward'`.
`linkage`	`'ward'`	One of `'ward'`, `'complete'`, `'average'`, `'single'`.
`distance_threshold`	`None`	Stop merging when the minimum inter-cluster distance exceeds this value.
`connectivity`	`None`	Sparse matrix of allowed merges; forbids merging clusters whose points are not adjacent in the graph.
`compute_distances`	`False`	If `True`, store merge distances so a dendrogram can be drawn.

After fitting, the model exposes labels_ for the flat cluster assignment, children_ describing each merge, and distances_ (the merge heights, useful for plotting with scipy.cluster.hierarchy.dendrogram).

Connectivity constraints

A useful feature of the scikit-learn implementation is connectivity-constrained clustering, where a sparse adjacency matrix tells the algorithm which pairs of points are allowed to merge. The matrix is built either from prior knowledge (for example, web pages linked by hyperlinks) or from the data with helpers like sklearn.neighbors.kneighbors_graph for k-nearest-neighbor graphs or sklearn.feature_extraction.image.grid_to_graph for image pixel grids. Connectivity constraints impose local structure (a cluster should be spatially contiguous, useful for image segmentation) and reduce computation by ignoring most candidate merges. The scikit-learn documentation warns that combining single, complete, or average linkage with connectivity constraints can amplify a "rich get richer" effect where a few clusters absorb most points; Ward linkage is the safest pairing.

Basic example

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

clustering = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(clustering.labels_)  # e.g. [1 1 1 0 0 0]

A related class, sklearn.cluster.FeatureAgglomeration, applies the same algorithm to columns rather than rows for unsupervised feature reduction.

Comparison with other clustering methods

Agglomerative clustering is one of several common families, each with trade-offs.

Property	Agglomerative	K-means	DBSCAN	Divisive (DIANA)
Direction	Bottom-up	Iterative reassignment	Density expansion	Top-down
Need to specify k?	Optional (can use a distance threshold or pick k from the dendrogram)	Yes	No (uses `eps` and `min_samples` instead)	Optional
Cluster shapes	Depends on linkage; flexible	Spherical, similar sizes	Arbitrary shapes, supports noise	Depends on splitting rule
Handles noise / outliers	Poorly (every point is assigned)	Poorly	Well (explicit noise label)	Mixed
Time complexity	O(n^2) to O(n^3)	O(n * k * iterations)	O(n log n) with spatial index	Up to O(2^n) in the worst case
Memory	O(n^2) for distance matrix	O(n + k)	O(n)	O(n^2) typically
Reproducible?	Yes (deterministic given metric and linkage)	No (depends on initialization)	Yes	Yes
Hierarchical output?	Yes, dendrogram	No	No	Yes, dendrogram

K-means assumes roughly spherical clusters of similar size, requires k in advance, depends on random initialization, and scales well to large n. Agglomerative clustering with Ward linkage optimizes a closely related variance criterion but produces a full hierarchy and is deterministic, at the cost of much higher memory and runtime. DBSCAN is density-based: it grows clusters from dense neighborhoods and labels low-density points as noise, does not need k, and scales better when a spatial index is used. Agglomerative clustering with single linkage can mimic some of DBSCAN's flexibility on non-globular shapes, but it has no built-in concept of noise: every point ends up in some cluster.

Divisive hierarchical clustering, implemented in the DIANA algorithm (Kaufman and Rousseeuw, 1990), starts with all points in one cluster and splits at each step. In principle DIANA can be better at finding the coarse top-level structure of the data, while agglomerative methods are better at picking out fine structure near the leaves. In practice divisive methods are far less common, because the general problem of finding the best split is combinatorial and most implementations rely on heuristics rather than an exact search.

Strengths and limitations

The big appeal is that agglomerative clustering produces a full hierarchy, which is more informative than a flat partition. You can read clusters at any resolution by cutting the dendrogram at different heights, you do not need to pick k in advance, and the result is deterministic given a metric and a linkage choice. It works with any pairwise distance (including a precomputed similarity matrix), which helps when all you have is a kernel or a custom dissimilarity score. Single linkage in particular can discover arbitrary-shaped clusters.

The limitations are mostly about scale and sensitivity. The O(n^2) memory footprint is a hard ceiling; for more than a few hundred thousand points the distance matrix does not fit in RAM on most machines. The greedy merge rule cannot be undone, so a bad early merge propagates up the tree. Results are sensitive to the distance metric, the linkage, and outliers, which can drag single-linkage chains across the dataset. There is no probabilistic model behind the clusters, so there is no built-in way to assign new points after the model is built; standard practice is to refit or to assign new points to the nearest existing centroid. Centroid and median linkage can produce dendrogram inversions that complicate interpretation.

Applications

Agglomerative clustering appears wherever a hierarchical view of similarity is useful. In bioinformatics and phylogenetics, UPGMA and related average-linkage methods build evolutionary trees from genetic distance matrices. Cosine distance with average or complete linkage groups related documents, news articles, or search results in text clustering. Marketing teams use Ward linkage on standardized purchase features for customer segmentation that produces evenly sized groups without picking k in advance. With connectivity constraints from a pixel grid, the algorithm can partition an image into spatially contiguous regions. Very small clusters or single points that only merge into the main tree at high distance can be flagged as outliers. Even when a different algorithm is used for the final clustering, a dendrogram is a cheap way to inspect what natural groupings exist in a small to medium dataset.

History

Ward's method was published in 1963, and Stephen Johnson's 1967 paper Hierarchical Clustering Schemes in Psychometrika unified single-linkage and complete-linkage clustering under a common framework (sometimes called Johnson's algorithm). The Lance-Williams formula appeared in 1967, providing the recursive update that powers most modern implementations. SLINK (Sibson, 1973) and CLINK (Defays, 1977) gave optimal O(n^2) algorithms for single and complete linkage. The AGNES and DIANA names were popularized by Kaufman and Rousseeuw's 1990 textbook Finding Groups in Data, still cited as the standard reference for the algorithm family.

Explain like I'm 5

Imagine a pile of toys on the floor. Treat each toy as its own tiny group of one. Find the two groups that look most alike (maybe two teddy bears) and push them together. Keep doing that, again and again, joining the two most similar groups each time, until you have a few big piles (bears, trucks, dolls), and eventually one giant pile that contains every toy. The drawing of who got joined to whom, and at what step, is the dendrogram, and you can stop whenever the groups look the way you want.

References

Wikipedia. Hierarchical clustering.
Wikipedia. Ward's method.
Wikipedia. Single-linkage clustering.
Wikipedia. Complete-linkage clustering.
scikit-learn developers. AgglomerativeClustering documentation.
scikit-learn developers. Clustering user guide.
Manning, Raghavan, and Schutze. *Introduction to Information Retrieval*, Chapter 17: Time complexity of HAC. Cambridge University Press, 2008.
Mullner, Daniel. Modern hierarchical, agglomerative clustering algorithms. arXiv:1109.2378, 2011.
Ward, Joe H. Jr. "Hierarchical Grouping to Optimize an Objective Function." *Journal of the American Statistical Association*, 58(301), 1963.
Sibson, R. "SLINK: an optimally efficient algorithm for the single-link cluster method." *The Computer Journal*, 16(1), 1973.
Defays, D. "An efficient algorithm for a complete-link method." *The Computer Journal*, 20(4), 1977.
Kaufman, L., and Rousseeuw, P. J. *Finding Groups in Data: An Introduction to Cluster Analysis*. Wiley, 1990.

Overview

Algorithm

Lance-Williams update formula

Linkage criteria

Distance metrics

Dendrogram

Time and space complexity

scikit-learn implementation

Connectivity constraints

Basic example

Comparison with other clustering methods

Strengths and limitations

Applications

History

Explain like I'm 5

References

Improve this article

Related Articles

Machine learning terms/Natural Language Processing

Machine learning terms/Computer Vision

Machine learning terms/Sequence Models

Split

Static

Area under the PR curve

Overview

Algorithm

Lance-Williams update formula

Linkage criteria

Distance metrics

Dendrogram

Time and space complexity

scikit-learn implementation

Connectivity constraints

Basic example

Comparison with other clustering methods

Strengths and limitations

Applications

History

Explain like I'm 5

References

Related Articles

Machine learning terms/Natural Language Processing

Machine learning terms/Computer Vision

Machine learning terms/Sequence Models

Split

Static

Area under the PR curve