Hierarchical clustering is a family of cluster analysis methods that organize data into a tree-like structure of nested groups. Unlike flat clustering techniques such as k-means, hierarchical clustering does not require the number of clusters to be specified in advance. Instead, it produces a hierarchy of clusters that can be visualized as a dendrogram, allowing analysts to choose the appropriate level of granularity after the algorithm has run. Hierarchical clustering is widely used in bioinformatics, natural language processing, information retrieval, and many other fields where understanding nested relationships in data is valuable.
The two main strategies are agglomerative (bottom-up) and divisive (top-down). Agglomerative methods start with each data point as its own cluster and repeatedly merge the closest pair, while divisive methods start with all data in a single cluster and repeatedly split. The choice of distance metric and linkage criterion has a significant impact on the shape and quality of the resulting clusters.
Imagine you have a big pile of different-colored building blocks on the floor. You start by picking up two blocks that look the most alike (maybe two red ones) and snapping them together. Then you find the next two things that look most alike, which might be two blue blocks, or it might be your red pair and an orange block. You keep snapping together the most similar pieces, step by step. As you go, small groups join into bigger groups, and bigger groups join into even bigger ones. By the end, all your blocks are connected in one big tower. If someone asks "show me three groups," you can break the tower apart at the right spot and hand them three sections. That tower you built is called a dendrogram, and the whole process of building it is hierarchical clustering.
Agglomerative hierarchical clustering (AHC), also called the bottom-up approach, is the more commonly used of the two strategies. The algorithm proceeds as follows:

1. Treat each data point as its own singleton cluster.
2. Compute the matrix of pairwise distances (or dissimilarities) between clusters.
3. Find the two closest clusters according to the chosen linkage criterion and merge them into one.
4. Update the distance matrix to reflect the distances between the new cluster and all remaining clusters.
5. Repeat steps 3 and 4 until all points belong to a single cluster, or until a stopping criterion is met.
Because every merge is final (the algorithm never revisits a previous decision), agglomerative clustering is a greedy algorithm. This means it finds a locally optimal solution at each step but does not guarantee a globally optimal clustering.
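To make the procedure concrete, here is a minimal sketch of the naive greedy bottom-up loop using complete linkage; the function `naive_agglomerative` is illustrative and not part of any library:

```python
import numpy as np
from itertools import combinations

def naive_agglomerative(X, n_clusters=2):
    """Naive O(n^3) agglomerative clustering with complete linkage."""
    # Start with every point in its own singleton cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best_pair, best_dist = None, np.inf
        # Greedy step: scan every pair of clusters for the closest one.
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: maximum pairwise distance between members.
            d = max(np.linalg.norm(X[i] - X[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best_dist:
                best_pair, best_dist = (a, b), d
        a, b = best_pair
        # Merge the two closest clusters; the decision is never revisited.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6]])
print(naive_agglomerative(X, n_clusters=2))
```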
Rather than recomputing all pairwise distances from scratch after each merge, the Lance-Williams recurrence formula provides an efficient update rule. If clusters i and j are merged into a new cluster (i, j), the distance from the new cluster to any other cluster k is:
d(i ∪ j, k) = αi * d(i, k) + αj * d(j, k) + β * d(i, j) + γ * |d(i, k) - d(j, k)|
The parameters αi, αj, β, and γ differ depending on the linkage method. This formula unifies many linkage criteria into a single framework and reduces the per-step cost of updating the distance matrix.
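As a sketch of how the recurrence is applied, the function below plugs in the standard coefficient values for single, complete, and average linkage (the function name and argument layout are illustrative):

```python
def lance_williams_update(d_ik, d_jk, d_ij, n_i, n_j, method="average"):
    """Distance from the merged cluster (i, j) to another cluster k."""
    if method == "single":        # nearest neighbor
        a_i, a_j, beta, gamma = 0.5, 0.5, 0.0, -0.5
    elif method == "complete":    # farthest neighbor
        a_i, a_j, beta, gamma = 0.5, 0.5, 0.0, 0.5
    elif method == "average":     # UPGMA: coefficients proportional to cluster sizes
        a_i = n_i / (n_i + n_j)
        a_j = n_j / (n_i + n_j)
        beta, gamma = 0.0, 0.0
    else:
        raise ValueError(f"unsupported method: {method}")
    return a_i * d_ik + a_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

# With single linkage the update reduces to min(d_ik, d_jk):
print(lance_williams_update(2.0, 5.0, 3.0, 1, 1, method="single"))  # 2.0
```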
Divisive hierarchical clustering, also called the top-down approach, works in the opposite direction:

1. Start with all data points in a single cluster.
2. Select a cluster to split, typically the one with the largest diameter or greatest heterogeneity.
3. Split the selected cluster into two sub-clusters.
4. Repeat steps 2 and 3 until every point forms its own cluster, or until a stopping criterion is met.
The splitting step is computationally expensive because finding the optimal binary partition of a cluster of size m requires evaluating 2^(m-1) - 1 possible splits. In practice, heuristic methods are used to make this tractable.
The most well-known divisive algorithm is DIANA (DIvisive ANAlysis), introduced by Leonard Kaufman and Peter J. Rousseeuw in their 1990 book Finding Groups in Data. DIANA selects the cluster with the largest diameter and identifies the most dissimilar object within it (the object with the greatest average dissimilarity to the other members). This object starts a "splinter group." The algorithm then iteratively moves objects from the original cluster to the splinter group if they are, on average, closer to the splinter group than to the remaining members of the original cluster. The process continues until no more objects qualify for reassignment, producing two new clusters.
DIANA also computes a divisive coefficient, which measures the overall clustering structure found in the data. Values closer to 1 indicate strong cluster structure.
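A minimal sketch of DIANA's splitting step, assuming a precomputed dissimilarity matrix D; the helper name `diana_split` is illustrative and not the cluster package's API:

```python
import numpy as np

def diana_split(D, members):
    """Split one cluster into (remaining, splinter) using DIANA's heuristic.
    D is a full pairwise dissimilarity matrix; members is a list of indices."""
    members = list(members)
    # Seed the splinter group with the object having the greatest average
    # dissimilarity to the other members of the cluster.
    avg = [np.mean([D[i][j] for j in members if j != i]) for i in members]
    splinter = [members[int(np.argmax(avg))]]
    remaining = [i for i in members if i not in splinter]
    moved = True
    while moved and len(remaining) > 1:
        moved = False
        for i in list(remaining):
            d_rem = np.mean([D[i][j] for j in remaining if j != i])
            d_spl = np.mean([D[i][j] for j in splinter])
            # Reassign the object if it is, on average, closer to the splinter group.
            if d_spl < d_rem:
                remaining.remove(i)
                splinter.append(i)
                moved = True
    return remaining, splinter
```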
The linkage criterion determines how the distance between two clusters is calculated from the pairwise distances of their member points. The choice of linkage has a major effect on the shape, size, and quality of the resulting clusters.
| Linkage method | Definition | Characteristics |
|---|---|---|
| Single linkage (nearest neighbor) | Minimum distance between any point in cluster A and any point in cluster B | Tends to produce elongated, chain-like clusters. Subject to the "chaining effect" where noise points can bridge distinct clusters. Can detect non-convex shapes. |
| Complete linkage (farthest neighbor) | Maximum distance between any point in cluster A and any point in cluster B | Produces compact, roughly spherical clusters. Less sensitive to noise and outliers than single linkage. Tends to break large clusters. |
| Average linkage (UPGMA) | Arithmetic mean of all pairwise distances between points in cluster A and points in cluster B | Balances the extremes of single and complete linkage. Produces moderately compact clusters. Widely used in bioinformatics for phylogenetic analysis. |
| Weighted average linkage (WPGMA) | Weighted average where each cluster contributes equally regardless of size | Similar to UPGMA but does not weight by cluster size. Useful when cluster sizes differ greatly and equal contribution is desired. |
| Centroid linkage (UPGMC) | Euclidean distance between the centroids of the two clusters | Can produce inversions (non-monotonic dendrograms) where a merge occurs at a lower distance than a previous merge. Requires Euclidean space. |
| Median linkage (WPGMC) | Distance between weighted centroids, where each merged cluster contributes equally | Also susceptible to inversions. Less commonly used in practice. |
| Ward's method (minimum variance) | Increase in total within-cluster variance after merging | Tends to produce compact, equally sized clusters. Minimizes the within-cluster sum of squares at each step. Often yields the best results for data with roughly spherical clusters. |
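To make the definitions in the table concrete, the snippet below computes the single, complete, and average linkage distances between two small, purely illustrative clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small illustrative clusters
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [5.0, 1.0]])

pairwise = cdist(A, B)      # distances between every point in A and every point in B
print(pairwise.min())       # single linkage: distance of the closest pair
print(pairwise.max())       # complete linkage: distance of the farthest pair
print(pairwise.mean())      # average linkage (UPGMA): mean of all pairwise distances
```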
Single linkage is the simplest and fastest linkage criterion, but it has a well-known drawback called the chaining effect. Because the distance between two clusters is determined solely by their closest pair of points, a chain of intermediate points with small pairwise distances can cause two otherwise distinct clusters to merge prematurely. In noisy datasets, individual noise points or outliers can act as "bridges" that connect clusters that should remain separate. For this reason, single linkage is best suited for detecting elongated or irregularly shaped clusters in clean data, and it is less appropriate for datasets with significant noise.
Ward's method is one of the most popular linkage criteria. At each step, it merges the pair of clusters that produces the smallest increase in the total within-cluster sum of squares (also called the error sum of squares, or ESS). Formally, when considering whether to merge clusters A and B, the increase in ESS is:
ΔESS = (nA * nB) / (nA + nB) * ||cA - cB||^2
where nA and nB are the sizes of the two clusters, and cA and cB are their centroids. Ward's method can also be implemented through the Lance-Williams formula with parameters:

αi = (ni + nk) / (ni + nj + nk), αj = (nj + nk) / (ni + nj + nk), β = -nk / (ni + nj + nk), γ = 0,

where ni, nj, and nk denote the sizes of clusters i, j, and k.
Ward's method requires the use of squared Euclidean distance as the dissimilarity measure. It tends to find compact, spherical clusters of roughly equal size and is often a good default choice when no prior knowledge about cluster shapes is available.
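As a small sketch of the formula above, the merge cost can be computed directly from two clusters represented as NumPy arrays (the function name is illustrative):

```python
import numpy as np

def ward_merge_cost(A, B):
    """Increase in the within-cluster sum of squares (ΔESS) from merging A and B."""
    n_a, n_b = len(A), len(B)
    c_a, c_b = A.mean(axis=0), B.mean(axis=0)   # cluster centroids
    return (n_a * n_b) / (n_a + n_b) * np.sum((c_a - c_b) ** 2)

A = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6]])
B = np.array([[8.0, 8.0], [9.0, 11.0]])
print(ward_merge_cost(A, B))  # large value: merging these distant clusters is costly
```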
Before applying a linkage criterion, one must choose a distance metric to measure the dissimilarity between individual data points. Common choices include:
| Distance metric | Formula | Best suited for |
|---|---|---|
| Euclidean distance | sqrt(Σ(xi - yi)^2) | Continuous numerical data in low to moderate dimensions. Sensitive to scale differences between features. |
| Squared Euclidean distance | Σ(xi - yi)^2 | Same as Euclidean but emphasizes larger differences. Required by Ward's method. |
| Manhattan distance (city block) | Σ \| xi - yi \| | High-dimensional or sparse numerical data, and settings where robustness to a single large feature difference is desired. Sensitive to scale differences between features. |
| Cosine distance | 1 - (A · B) / (\|\|A\|\| \|\|B\|\|) | Text and other sparse, high-dimensional vectors where the direction of the vector matters more than its magnitude. |
| Correlation distance | 1 - Pearson correlation | Gene expression data and other settings where the profile shape matters more than absolute values. |
| Jaccard distance | 1 - \|A ∩ B\| / \|A ∪ B\| | Binary or set-valued data, such as presence/absence features or market basket items. |
Scaling or standardizing features before clustering is generally recommended when using Euclidean or Manhattan distance, because these metrics are sensitive to differences in feature magnitudes. Without normalization, features with larger numeric ranges will dominate the distance calculations.
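A minimal sketch of such standardization with plain NumPy (the data values are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 200.0], [1.5, 180.0], [5.0, 800.0], [8.0, 820.0]])

# Standardize each feature (column) to zero mean and unit variance so that
# the second, larger-ranged feature does not dominate the Euclidean distances.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

Z = linkage(X_scaled, method='ward')
```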
A dendrogram is a tree diagram that records the sequence of merges (in agglomerative clustering) or splits (in divisive clustering) along with the distance or dissimilarity at which each operation occurs. The leaves of the dendrogram represent individual data points, and the internal nodes represent clusters formed by merging or splitting.
The vertical axis of a dendrogram represents the distance (or dissimilarity) at which clusters are joined. The key principles for interpretation are:

- The height of a merge reflects how dissimilar the two joined clusters are; merges lower in the tree indicate more similar clusters.
- The horizontal ordering of the leaves is largely arbitrary, so the left-to-right proximity of leaves should not be over-interpreted.
- Long vertical branches between successive merges suggest well-separated clusters, while a run of merges at similar heights suggests weak structure.
- Cutting the tree at a given height produces a flat partition whose clusters are the branches intersected by the cut.
To obtain a flat partition (a specific number of clusters) from a dendrogram, one draws a horizontal line at a chosen height. All branches that cross this line define separate clusters. There are several strategies for choosing the cut height:

- Cut at a fixed dissimilarity threshold derived from domain knowledge.
- Cut at the height that yields a desired number of clusters.
- Cut where the largest vertical gap between successive merge heights occurs, since a large gap often marks a natural separation in the data.
Although hierarchical clustering does not require specifying the number of clusters up front, one still needs to decide where to cut the dendrogram. Several quantitative methods can help:

- The elbow method, applied to the merge heights or the within-cluster sum of squares as a function of the number of clusters.
- The silhouette coefficient, which measures how well each point fits its assigned cluster relative to the nearest neighboring cluster (a small example follows this list).
- The gap statistic, which compares the observed clustering structure against a null reference distribution.
- The inconsistency coefficient, which compares the height of each merge with the heights of the merges immediately below it in the dendrogram.
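As a sketch (assuming scikit-learn is available for the scoring), candidate cluster counts can be compared by cutting the dendrogram at each and computing silhouette scores:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8],
              [1, 0.6], [9, 11], [8, 2], [10, 2], [9, 3]])
Z = linkage(X, method='ward')

# Cut the dendrogram at several candidate cluster counts and compare silhouettes.
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(k, silhouette_score(X, labels))
```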
The computational cost of hierarchical clustering is an important practical consideration, especially for large datasets.
| Algorithm | Time complexity | Space complexity | Notes |
|---|---|---|---|
| Naive agglomerative (any linkage) | O(n^3) | O(n^2) | Recomputes all pairwise distances after each merge. |
| Agglomerative with priority queue | O(n^2 log n) | O(n^2) | Uses a min-heap to efficiently find the closest pair. |
| SLINK (single linkage, Sibson 1973) | O(n^2) | O(n) | Optimal algorithm for single linkage. Processes points incrementally. |
| CLINK (complete linkage, Defays 1977) | O(n^2) | O(n) | Optimal algorithm for complete linkage. Similar incremental strategy to SLINK. |
| Naive divisive | O(2^n) | O(n^2) | Exhaustive search over all possible binary partitions. Impractical for all but very small datasets. |
| DIANA (divisive heuristic) | O(n^2) | O(n^2) | Uses a greedy heuristic instead of exhaustive search. |
The O(n^2) space requirement for storing the distance matrix is often the primary bottleneck. For a dataset of 100,000 points, the distance matrix alone requires approximately 40 GB of memory (assuming 4-byte floats). This makes standard hierarchical clustering impractical for very large datasets without modifications.
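The arithmetic behind that estimate (full, uncondensed matrix; 1 GB taken as 10^9 bytes):

```python
n = 100_000
bytes_needed = n * n * 4           # full n-by-n matrix of 4-byte floats
print(bytes_needed / 1e9, "GB")    # 40.0 GB
```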
R. Sibson published the SLINK algorithm in 1973, providing an optimal O(n^2) time and O(n) space algorithm for single-linkage clustering. SLINK works by processing data points one at a time and updating a compact representation of the dendrogram, avoiding the need to store the full distance matrix. D. Defays published the analogous CLINK algorithm for complete linkage in 1977, achieving the same optimal complexity bounds. These algorithms made hierarchical clustering feasible for datasets with thousands to tens of thousands of points on the hardware available in the 1970s.
Hierarchical clustering and k-means are the two most widely used clustering methods. They differ in several ways:
| Property | Hierarchical clustering | K-means clustering |
|---|---|---|
| Number of clusters | Does not need to be specified in advance; determined by cutting the dendrogram | Must be specified before running the algorithm |
| Output | Nested hierarchy of clusters (dendrogram) | Flat partition into k groups |
| Determinism | Deterministic (given the same data and parameters, always produces the same result) | Non-deterministic due to random initialization of centroids |
| Cluster shapes | Can find clusters of various shapes depending on the linkage criterion | Assumes spherical (convex) clusters |
| Scalability | O(n^2 log n) to O(n^3) time, O(n^2) space | O(n * k * t) time, O(n * d) space, where t is iterations and d is dimensions |
| Typical dataset size | Small to medium (up to ~10,000-50,000 points) | Large datasets (millions of points) |
| Reversibility | Merges or splits are permanent and cannot be undone | Points are reassigned to centroids in each iteration |
In general, hierarchical clustering is preferred when the goal is to explore the cluster structure at multiple levels of granularity, when the number of clusters is unknown, or when the data has a natural hierarchical organization (such as biological taxonomies). K-means is preferred when computational efficiency is the priority and a flat partition with a known number of clusters is sufficient.
Several algorithms have been developed to extend hierarchical clustering to large datasets:
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), proposed by Tian Zhang, Raghu Ramakrishnan, and Miron Livny in 1996, is designed for very large datasets. It works in two phases. First, it scans the data once and builds a compact summary structure called a CF-tree (Clustering Feature tree). Each leaf of the CF-tree stores a clustering feature vector that summarizes a sub-cluster using three values: the number of points, the linear sum of points, and the squared sum of points. In the second phase, a standard clustering algorithm (such as agglomerative clustering) is applied to the leaf entries of the CF-tree. BIRCH can handle datasets that do not fit in memory and typically requires only one or two passes over the data.
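scikit-learn provides an implementation in sklearn.cluster.Birch; the snippet below is a minimal usage sketch with illustrative, untuned parameter values:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Illustrative data: three well-separated blobs in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0, 5, 10)])

# Phase 1 builds the CF-tree; phase 2 clusters its leaf entries into 3 groups.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))  # roughly 200 points per cluster
```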
CURE (Clustering Using Representatives), proposed by Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim in 1998, addresses two limitations of traditional agglomerative clustering: sensitivity to outliers and the assumption of spherical clusters. Instead of representing each cluster by a single centroid or by all its points, CURE selects a fixed number of well-scattered representative points for each cluster and then shrinks them toward the cluster centroid by a specified fraction. Clusters are merged based on the minimum distance between their representative points. This approach allows CURE to identify non-spherical clusters while remaining robust to outliers.
ROCK (Robust Clustering using Links), proposed by Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim in 1999, is an agglomerative hierarchical clustering algorithm designed for categorical data. Instead of using traditional distance metrics, ROCK uses the concept of "links" between data points, where the number of links between two points is defined as the number of common neighbors they share. Clusters are merged based on a goodness measure that considers both the number of links and the expected number of links. ROCK is particularly useful for market basket data and other categorical datasets.
Hierarchical clustering is used across a wide range of domains:
Hierarchical clustering is one of the most widely used techniques in bioinformatics. In gene expression analysis, it groups genes with similar expression profiles into co-expression clusters, helping researchers identify genes that may share regulatory mechanisms or biological functions. The resulting dendrograms have a natural interpretation analogous to evolutionary trees (phylogenies), which makes hierarchical clustering particularly appealing to biologists. It is also used for protein sequence classification, microarray data analysis, and constructing phylogenetic trees from DNA sequence similarity.
In text mining and information retrieval, hierarchical clustering can organize documents into topic hierarchies. Using cosine distance on TF-IDF or word embedding vectors, the algorithm groups similar documents at multiple levels of specificity. This is useful for building browsable topic directories, improving search result organization, and exploratory analysis of text corpora.
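As an illustrative sketch, with toy vectors standing in for TF-IDF features of real documents, cosine distance can be paired with average linkage in SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-ins for TF-IDF vectors of six short documents.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # topic A
    [0.8, 0.2, 0.0, 0.1],   # topic A
    [0.0, 0.1, 0.9, 0.2],   # topic B
    [0.1, 0.0, 0.8, 0.3],   # topic B
    [0.0, 0.9, 0.1, 0.8],   # topic C
    [0.1, 0.8, 0.0, 0.9],   # topic C
])

# Average linkage on cosine distance groups documents by topical direction.
Z = linkage(docs, method='average', metric='cosine')
print(fcluster(Z, t=3, criterion='maxclust'))
```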
In computer vision, hierarchical clustering can group pixels based on color, intensity, or texture similarity to perform image segmentation. Ward's method is commonly used for this purpose because it tends to produce compact, evenly sized regions.
Hierarchical clustering can identify communities within social networks by grouping nodes (users) based on their connection patterns. The dendrogram reveals the hierarchical community structure at different scales.
Ecologists use hierarchical clustering to classify species based on morphological, genetic, or ecological similarity. The resulting dendrograms can serve as phenetic classifications, complementing phylogenetic analyses derived from evolutionary models.
In marketing and customer analytics, hierarchical clustering groups customers based on purchasing behavior, demographics, or preferences. The hierarchy allows marketers to examine segments at different levels of detail, from broad categories to fine-grained niches.
Hierarchical clustering is available in most major data analysis software packages:
| Language / Tool | Library or function | Notes |
|---|---|---|
| Python | scipy.cluster.hierarchy.linkage and dendrogram | Full suite of linkage methods with dendrogram visualization |
| Python | sklearn.cluster.AgglomerativeClustering | Integrates with the scikit-learn ecosystem. Supports connectivity constraints. |
| R | hclust() in the stats package | Built-in function supporting multiple linkage methods |
| R | diana() in the cluster package | Implementation of divisive clustering (DIANA algorithm) |
| MATLAB | linkage(), cluster(), dendrogram() | Comprehensive hierarchical clustering toolkit |
| Java | ELKI, Weka | Research-oriented implementations with many algorithm variants |
| Julia | Clustering.jl | Efficient implementations for the Julia ecosystem |
A basic example using SciPy and scikit-learn:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8],
              [1, 0.6], [9, 11], [8, 2], [10, 2], [9, 3]])

# Compute the linkage matrix using Ward's method
Z = linkage(X, method='ward', metric='euclidean')

# Plot the dendrogram
dendrogram(Z)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Data point index')
plt.ylabel('Distance')
plt.show()

# Cut the dendrogram to obtain 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print('Cluster labels:', labels)

# Equivalent flat result with scikit-learn's AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
print('scikit-learn labels:', model.fit_predict(X))
```
Hierarchical clustering has a long history in statistics and numerical taxonomy. Key milestones include:

- 1958: Sokal and Michener describe the UPGMA (average linkage) procedure for numerical taxonomy.
- 1963: Joe H. Ward Jr. introduces the minimum-variance criterion now known as Ward's method.
- 1967: Lance and Williams publish the recurrence formula that unifies the common linkage criteria.
- 1973: R. Sibson publishes SLINK, the optimal O(n^2) time, O(n) space algorithm for single linkage.
- 1977: D. Defays publishes CLINK, the analogous optimal algorithm for complete linkage.
- 1990: Kaufman and Rousseeuw describe the DIANA divisive algorithm in Finding Groups in Data.
- 1996: BIRCH extends hierarchical clustering to very large datasets via the CF-tree summary structure.
- 1998: CURE introduces scattered representative points to handle non-spherical clusters and outliers.
- 1999: ROCK adapts agglomerative clustering to categorical data using shared-neighbor links.