# Hierarchical Clustering

> Source: https://aiwiki.ai/wiki/hierarchical_clustering
> Updated: 2026-07-12
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Hierarchical clustering** is an [unsupervised learning](/wiki/unsupervised_learning) method that groups data into a tree of nested clusters, building the hierarchy by repeatedly merging the most similar groups (agglomerative, bottom-up) or repeatedly splitting them (divisive, top-down).[1] Unlike flat clustering techniques such as [k-means](/wiki/k-means), it does not require the number of clusters to be fixed in advance: it produces a full hierarchy that can be visualized as a **dendrogram**, so analysts choose the level of granularity after the algorithm has run. The most common variant, agglomerative hierarchical clustering, runs in $$O(n^2 \log n)$$ time with $$O(n^2)$$ memory using a priority queue, which limits standard implementations to roughly tens of thousands of points.[14] Hierarchical clustering is a core family of [cluster analysis](/wiki/clustering) methods and is widely used in [bioinformatics](/wiki/bioinformatics), [natural language processing](/wiki/natural_language_understanding), [information retrieval](/wiki/information_retrieval), and many other fields where understanding nested relationships in data is valuable.

The two main strategies are **agglomerative** (bottom-up) and **divisive** (top-down).[2] Agglomerative methods start with each data point as its own cluster and repeatedly merge the closest pair, while divisive methods start with all data in a single cluster and repeatedly split. The choice of **distance metric** and **linkage criterion** has a significant impact on the shape and quality of the resulting clusters.

## Explain like I'm 5

Imagine you have a big pile of different-colored building blocks on the floor. You start by picking up two blocks that look the most alike (maybe two red ones) and snapping them together. Then you find the next two things that look most alike, which might be two blue blocks, or it might be your red pair and an orange block. You keep snapping together the most similar pieces, step by step. As you go, small groups join into bigger groups, and bigger groups join into even bigger ones. By the end, all your blocks are connected in one big tower. If someone asks "show me three groups," you can break the tower apart at the right spot and hand them three sections. That tower you built is called a dendrogram, and the whole process of building it is hierarchical clustering.

## How does agglomerative clustering work?

Agglomerative hierarchical clustering (AHC), also called the bottom-up approach, is the more commonly used of the two strategies. The algorithm proceeds as follows:

1. **Initialize.** Treat each of the *n* data points as a singleton cluster. Compute the pairwise distance (or dissimilarity) matrix, which is an $$n \times n$$ symmetric matrix.
2. **Find the closest pair.** Scan the distance matrix for the two clusters with the smallest inter-cluster distance according to the chosen linkage criterion.
3. **Merge.** Combine those two clusters into a single new cluster. Record the merge and the distance at which it occurred (this information is used to build the dendrogram).
4. **Update distances.** Recompute the distances between the new cluster and all remaining clusters. This can be done efficiently using the Lance-Williams recurrence formula.[6]
5. **Repeat.** Go back to step 2 and continue until only one cluster remains (or until a stopping criterion is met).

Because every merge is final (the algorithm never revisits a previous decision), agglomerative clustering is a greedy algorithm. This means it finds a locally optimal solution at each step but does not guarantee a globally optimal clustering.

### The Lance-Williams formula

Rather than recomputing all pairwise distances from scratch after each merge, the Lance-Williams recurrence formula provides an efficient update rule. If clusters *i* and *j* are merged into a new cluster *(i, j)*, the distance from the new cluster to any other cluster *k* is:

$$
d(i \cup j, k) = \alpha_i\, d(i, k) + \alpha_j\, d(j, k) + \beta\, d(i, j) + \gamma\, \lvert d(i, k) - d(j, k) \rvert
$$

The parameters $$\alpha_i$$, $$\alpha_j$$, $$\beta$$, and $$\gamma$$ differ depending on the linkage method. Lance and Williams introduced this recurrence in their 1967 paper in *The Computer Journal*, which unified many linkage criteria into a single framework and reduced the per-step cost of updating the distance matrix.[6]

## How does divisive clustering work?

Divisive hierarchical clustering, also called the top-down approach, works in the opposite direction:

1. **Initialize.** Start with a single cluster containing all *n* data points.
2. **Select a cluster to split.** Typically, the cluster with the largest diameter (maximum pairwise dissimilarity) is chosen.
3. **Split.** Divide the selected cluster into two sub-clusters using some splitting strategy.
4. **Repeat.** Continue selecting and splitting clusters until each data point is in its own singleton cluster or a stopping criterion is reached.

The splitting step is computationally expensive because finding the optimal binary partition of a cluster of size $$m$$ requires evaluating $$2^{(m-1)} - 1$$ possible splits. In practice, heuristic methods are used to make this tractable.

### The DIANA algorithm

The most well-known divisive algorithm is DIANA (DIvisive ANAlysis), introduced by Leonard Kaufman and Peter J. Rousseeuw in their 1990 book *Finding Groups in Data*.[7] DIANA selects the cluster with the largest diameter and identifies the most dissimilar object within it (the object with the greatest average dissimilarity to the other members). This object starts a "splinter group." The algorithm then iteratively moves objects from the original cluster to the splinter group if they are, on average, closer to the splinter group than to the remaining members of the original cluster. The process continues until no more objects qualify for reassignment, producing two new clusters.

DIANA also computes a **divisive coefficient**, which measures the overall clustering structure found in the data. Values closer to 1 indicate strong cluster structure.[7]

## What are the linkage criteria?

The linkage criterion determines how the distance between two clusters is calculated from the pairwise distances of their member points. The choice of linkage has a major effect on the shape, size, and quality of the resulting clusters.

| Linkage method | Definition | Characteristics |
|---|---|---|
| Single linkage (nearest neighbor) | Minimum distance between any point in cluster A and any point in cluster B | Tends to produce elongated, chain-like clusters. Subject to the "chaining effect" where noise points can bridge distinct clusters. Can detect non-convex shapes. |
| Complete linkage (farthest neighbor) | Maximum distance between any point in cluster A and any point in cluster B | Produces compact, roughly spherical clusters. Less sensitive to noise and outliers than single linkage. Tends to break large clusters. |
| Average linkage (UPGMA) | Arithmetic mean of all pairwise distances between points in cluster A and points in cluster B | Balances the extremes of single and complete linkage. Produces moderately compact clusters. Widely used in bioinformatics for phylogenetic analysis. |
| Weighted average linkage (WPGMA) | Weighted average where each cluster contributes equally regardless of size | Similar to UPGMA but does not weight by cluster size. Useful when cluster sizes differ greatly and equal contribution is desired. |
| Centroid linkage (UPGMC) | Euclidean distance between the centroids of the two clusters | Can produce inversions (non-monotonic dendrograms) where a merge occurs at a lower distance than a previous merge. Requires Euclidean space. |
| Median linkage (WPGMC) | Distance between weighted centroids, where each merged cluster contributes equally | Also susceptible to inversions. Less commonly used in practice. |
| Ward's method (minimum variance) | Increase in total within-cluster variance after merging | Tends to produce compact, equally sized clusters. Minimizes the within-cluster sum of squares at each step. Often yields the best results for data with roughly spherical clusters. |

### Single linkage and the chaining effect

Single linkage is the simplest and fastest linkage criterion, but it has a well-known drawback called the **chaining effect**. Because the distance between two clusters is determined solely by their closest pair of points, a chain of intermediate points with small pairwise distances can cause two otherwise distinct clusters to merge prematurely. In noisy datasets, individual noise points or outliers can act as "bridges" that connect clusters that should remain separate. For this reason, single linkage is best suited for detecting elongated or irregularly shaped clusters in clean data, and it is less appropriate for datasets with significant noise.[2]

### Ward's method in detail

Ward's method is one of the most popular linkage criteria and originates from Joe H. Ward's 1963 paper "Hierarchical Grouping to Optimize an Objective Function," published in the *Journal of the American Statistical Association*, which has accumulated more than 19,000 citations.[5] Ward described it as "a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics."[5] At each step, the method merges the pair of clusters that produces the smallest increase in the total within-cluster sum of squares (also called the error sum of squares, or ESS). Formally, when considering whether to merge clusters A and B, the increase in ESS is:

$$
\Delta \text{ESS} = \frac{n_A\, n_B}{n_A + n_B} \lVert c_A - c_B \rVert^2
$$

where $$n_A$$ and $$n_B$$ are the sizes of the two clusters, and $$c_A$$ and $$c_B$$ are their centroids. Ward's method can also be implemented through the Lance-Williams formula with parameters:[6]

- $$\alpha_i = \frac{n_i + n_k}{n_i + n_j + n_k}$$
- $$\alpha_j = \frac{n_j + n_k}{n_i + n_j + n_k}$$
- $$\beta = \frac{-n_k}{n_i + n_j + n_k}$$
- $$\gamma = 0$$

Ward's method requires the use of squared Euclidean distance as the dissimilarity measure. It tends to find compact, spherical clusters of roughly equal size and is often a good default choice when no prior knowledge about cluster shapes is available.[5]

## Distance metrics

Before applying a linkage criterion, one must choose a distance metric to measure the dissimilarity between individual data points. Common choices include:

| Distance metric | Formula | Best suited for |
|---|---|---|
| Euclidean distance | $$\sqrt{\sum (x_i - y_i)^2}$$ | Continuous numerical data in low to moderate dimensions. Sensitive to scale differences between features. |
| Squared Euclidean distance | $$\sum (x_i - y_i)^2$$ | Same as Euclidean but emphasizes larger differences. Required by Ward's method. |
| Manhattan distance (city block) | $$\sum \lvert x_i - y_i \rvert$$ | Data where movement is constrained to grid-like paths. More robust to outliers than Euclidean. |
| Cosine distance | $$1 - \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$ | Text data, [word embeddings](/wiki/word_embedding), and other high-dimensional sparse data where magnitude is less important than direction. |
| Correlation distance | 1 - Pearson correlation | Gene expression data and other settings where the profile shape matters more than absolute values. |
| Jaccard distance | $$1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$$ | Binary or categorical data, such as document-term matrices. |

Scaling or standardizing features before clustering is generally recommended when using Euclidean or Manhattan distance, because these metrics are sensitive to differences in feature magnitudes. Without [normalization](/wiki/normalization), features with larger numeric ranges will dominate the distance calculations.

## What is a dendrogram?

A **dendrogram** is a tree diagram that records the sequence of merges (in agglomerative clustering) or splits (in divisive clustering) along with the distance or dissimilarity at which each operation occurs. The leaves of the dendrogram represent individual data points, and the internal nodes represent clusters formed by merging or splitting.

### Reading a dendrogram

The vertical axis of a dendrogram represents the distance (or dissimilarity) at which clusters are joined. The key principles for interpretation are:

- **Height of a horizontal bar** indicates the distance at which two sub-clusters merged. Lower bars mean the merged items are more similar; higher bars mean they are less similar.
- **Long vertical lines** (large jumps in height) suggest that the clusters being merged are relatively dissimilar. These jumps can help identify natural cluster boundaries.
- **The order of leaves** along the horizontal axis is partly arbitrary. Only the hierarchical nesting structure is meaningful, not the left-to-right arrangement of leaves at the same level.

### Cutting the dendrogram

To obtain a flat partition (a specific number of clusters) from a dendrogram, one draws a horizontal line at a chosen height. All branches that cross this line define separate clusters. There are several strategies for choosing the cut height:

- **Visual inspection.** Look for the largest vertical gap in the dendrogram (the longest "branches" without any merges). Cutting at this gap often yields natural clusters.
- **Fixed number of clusters.** If domain knowledge suggests a specific number of groups, cut at the height that produces that many clusters.
- **Inconsistency coefficient.** Compare the height of each link with the average height of links below it. Links with unusually large inconsistency values indicate significant merges and good cutting points.
- **Dynamic tree cut.** Proposed by Peter Langfelder, Bin Zhang, and Steve Horvath (2008), this method analyzes the shape of dendrogram branches rather than using a single horizontal cut, adapting to different densities and sizes of clusters within the same dendrogram.[13]

## How do you choose the number of clusters?

Although hierarchical clustering does not require specifying the number of clusters up front, one still needs to decide where to cut the dendrogram. Several quantitative methods can help:

- **Silhouette analysis.** For each data point, the silhouette coefficient measures how similar it is to members of its own cluster compared to the nearest neighboring cluster. Values range from -1 to +1, with higher values indicating better-defined clusters. The number of clusters that maximizes the average silhouette score is preferred.[7]
- **Gap statistic.** Proposed by Robert Tibshirani, Guenther Walther, and Trevor Hastie (2001), this method compares the total within-cluster variation for different numbers of clusters against what would be expected under a null reference distribution (uniform random data). The optimal number of clusters is the smallest *k* for which the gap statistic is within one standard error of the maximum.[12]
- **Elbow method.** Plot the total within-cluster sum of squares as a function of the number of clusters. The "elbow" point, where adding more clusters yields diminishing returns, suggests a good choice. However, this method can be subjective because the elbow is not always clearly defined.
- **Cophenetic correlation coefficient.** Introduced by Robert Sokal and F. James Rohlf in 1962, this measures the correlation between the original pairwise distances and the cophenetic distances (the dendrogram heights at which pairs of points are first merged). Values closer to 1 indicate that the dendrogram faithfully represents the underlying distance structure.[11] While this does not directly select the number of clusters, it helps evaluate and compare different linkage methods.

## What is the computational complexity of hierarchical clustering?

The computational cost of hierarchical clustering is an important practical consideration, especially for large datasets. The naive algorithm runs in $$O(n^3)$$ time, while priority-queue and SLINK/CLINK variants improve this to $$O(n^2 \log n)$$ or $$O(n^2)$$, but all standard approaches need $$O(n^2)$$ space for the distance matrix.

| Algorithm | Time complexity | Space complexity | Notes |
|---|---|---|---|
| Naive agglomerative (any linkage) | $$O(n^3)$$ | $$O(n^2)$$ | Recomputes all pairwise distances after each merge. |
| Agglomerative with priority queue | $$O(n^2 \log n)$$ | $$O(n^2)$$ | Uses a min-heap to efficiently find the closest pair. |
| SLINK (single linkage, Sibson 1973) | $$O(n^2)$$ | $$O(n)$$ | Optimal algorithm for single linkage. Processes points incrementally. |
| CLINK (complete linkage, Defays 1977) | $$O(n^2)$$ | $$O(n)$$ | Optimal algorithm for complete linkage. Similar incremental strategy to SLINK. |
| Naive divisive | $$O(2^n)$$ | $$O(n^2)$$ | Exhaustive search over all possible binary partitions. Impractical for all but very small datasets. |
| DIANA (divisive heuristic) | $$O(n^2)$$ | $$O(n^2)$$ | Uses a greedy heuristic instead of exhaustive search. |

The $$O(n^2)$$ space requirement for storing the distance matrix is often the primary bottleneck. For a dataset of 100,000 points, the distance matrix alone requires approximately 40 GB of memory (assuming 4-byte floats). This makes standard hierarchical clustering impractical for very large datasets without modifications.[14]

### SLINK and CLINK

R. Sibson published the **SLINK** algorithm in 1973 in *The Computer Journal*, providing an optimal $$O(n^2)$$ time and $$O(n)$$ space algorithm for single-linkage clustering, which Sibson proved "attains the theoretically optimal lower bound" on running time for the single-link method.[3] SLINK works by processing data points one at a time and updating a compact representation of the dendrogram, avoiding the need to store the full distance matrix. D. Defays published the analogous **CLINK** algorithm for complete linkage in 1977, achieving the same optimal complexity bounds.[4] These algorithms made hierarchical clustering feasible for datasets with thousands to tens of thousands of points on the hardware available in the 1970s.

## How does hierarchical clustering differ from k-means?

Hierarchical clustering and [k-means](/wiki/k-means) are the two most widely used clustering methods. They differ in several ways:

| Property | Hierarchical clustering | K-means clustering |
|---|---|---|
| Number of clusters | Does not need to be specified in advance; determined by cutting the dendrogram | Must be specified before running the algorithm |
| Output | Nested hierarchy of clusters (dendrogram) | Flat partition into *k* groups |
| Determinism | Deterministic (given the same data and parameters, always produces the same result) | Non-deterministic due to random initialization of centroids |
| Cluster shapes | Can find clusters of various shapes depending on the linkage criterion | Assumes spherical (convex) clusters |
| Scalability | $$O(n^2 \log n)$$ to $$O(n^3)$$ time, $$O(n^2)$$ space | $$O(n \cdot k \cdot t)$$ time, $$O(n \cdot d)$$ space, where $$t$$ is iterations and $$d$$ is dimensions |
| Typical dataset size | Small to medium (up to ~10,000-50,000 points) | Large datasets (millions of points) |
| Reversibility | Merges or splits are permanent and cannot be undone | Points are reassigned to centroids in each iteration |

In general, hierarchical clustering is preferred when the goal is to explore the cluster structure at multiple levels of granularity, when the number of clusters is unknown, or when the data has a natural hierarchical organization (such as biological taxonomies). K-means is preferred when computational efficiency is the priority and a flat partition with a known number of clusters is sufficient.

## Scalable variants

Several algorithms have been developed to extend hierarchical clustering to large datasets:

### BIRCH

**BIRCH** (Balanced Iterative Reducing and Clustering using Hierarchies), proposed by Tian Zhang, Raghu Ramakrishnan, and Miron Livny in 1996, is designed for very large datasets. It works in two phases. First, it scans the data once and builds a compact summary structure called a **CF-tree** (Clustering Feature tree). Each leaf of the CF-tree stores a clustering feature vector that summarizes a sub-cluster using three values: the number of points, the linear sum of points, and the squared sum of points. In the second phase, a standard clustering algorithm (such as agglomerative clustering) is applied to the leaf entries of the CF-tree. BIRCH can handle datasets that do not fit in memory and typically requires only one or two passes over the data.[8]

### CURE

**CURE** (Clustering Using Representatives), proposed by Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim in 1998, addresses two limitations of traditional agglomerative clustering: sensitivity to outliers and the assumption of spherical clusters. Instead of representing each cluster by a single centroid or by all its points, CURE selects a fixed number of well-scattered representative points for each cluster and then shrinks them toward the cluster centroid by a specified fraction. Clusters are merged based on the minimum distance between their representative points. This approach allows CURE to identify non-spherical clusters while remaining robust to outliers.[9]

### ROCK

**ROCK** (Robust Clustering using Links), proposed by Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim in 1999, is an agglomerative hierarchical clustering algorithm designed for categorical data. Instead of using traditional distance metrics, ROCK uses the concept of "links" between data points, where the number of links between two points is defined as the number of common neighbors they share. Clusters are merged based on a goodness measure that considers both the number of links and the expected number of links. ROCK is particularly useful for market basket data and other categorical datasets.[10]

## What is hierarchical clustering used for?

Hierarchical clustering is used across a wide range of domains:

### Bioinformatics and genomics

Hierarchical clustering is one of the most widely used techniques in bioinformatics.[1] In [gene expression](/wiki/gene_expression) analysis, it groups genes with similar expression profiles into co-expression clusters, helping researchers identify genes that may share regulatory mechanisms or biological functions. The resulting dendrograms have a natural interpretation analogous to evolutionary trees (phylogenies), which makes hierarchical clustering particularly appealing to biologists. It is also used for protein sequence classification, microarray data analysis, and constructing phylogenetic trees from DNA sequence similarity.

### Document clustering and information retrieval

In text mining and [information retrieval](/wiki/information_retrieval), hierarchical clustering can organize documents into topic hierarchies. Using cosine distance on TF-IDF or [word embedding](/wiki/word_embedding) vectors, the algorithm groups similar documents at multiple levels of specificity. This is useful for building browsable topic directories, improving search result organization, and exploratory analysis of text corpora.

### Image segmentation

In [computer vision](/wiki/computer_vision), hierarchical clustering can group pixels based on color, intensity, or texture similarity to perform [image segmentation](/wiki/image_segmentation). Ward's method is commonly used for this purpose because it tends to produce compact, evenly sized regions.

### Social network analysis

Hierarchical clustering can identify communities within social networks by grouping nodes (users) based on their connection patterns. The dendrogram reveals the hierarchical community structure at different scales.

### Ecology and taxonomy

Ecologists use hierarchical clustering to classify species based on morphological, genetic, or ecological similarity. The resulting dendrograms can serve as phenetic classifications, complementing phylogenetic analyses derived from evolutionary models.

### Market segmentation

In marketing and customer analytics, hierarchical clustering groups customers based on purchasing behavior, demographics, or preferences. The hierarchy allows marketers to examine segments at different levels of detail, from broad categories to fine-grained niches.

## Advantages and limitations

### Advantages

- **No need to pre-specify cluster count.** The dendrogram provides a complete view of the cluster structure at all levels, and the number of clusters can be chosen after examining the results.
- **Deterministic output.** Given the same data and parameters, the algorithm always produces the same dendrogram (unlike k-means, which depends on random initialization).
- **Rich visualization.** The dendrogram provides an intuitive visual summary of cluster relationships and the distances at which clusters merge.
- **Flexibility in distance metrics.** Any valid distance or dissimilarity measure can be used. The algorithm only requires a distance matrix, not the raw data points.
- **Hierarchical structure.** Captures nested relationships in the data, which is valuable in domains like biology and taxonomy.

### Limitations

- **High computational cost.** Even with optimized algorithms, the $$O(n^2)$$ memory requirement and $$O(n^2 \log n)$$ time complexity make standard hierarchical clustering impractical for datasets larger than roughly 10,000 to 50,000 points.
- **Irreversible decisions.** Once two clusters are merged (agglomerative) or split (divisive), the decision cannot be reconsidered. An early mistake propagates through the rest of the hierarchy.
- **Sensitivity to noise and outliers.** Particularly with single linkage, noise points can bridge distinct clusters. Complete linkage and Ward's method are less sensitive but still affected.
- **Choice of distance metric and linkage.** Different combinations of distance metric and linkage criterion can produce very different clusterings from the same data. There is no universally best combination; the choice depends on the data and the application.
- **Difficulty with high-dimensional data.** In high-dimensional spaces, distance metrics lose discriminative power due to the curse of dimensionality. [Dimensionality reduction](/wiki/dimension_reduction) techniques are often applied as a preprocessing step.

## Software implementations

Hierarchical clustering is available in most major data analysis software packages:

| Language / Tool | Library or function | Notes |
|---|---|---|
| Python | `scipy.cluster.hierarchy.linkage` and `dendrogram` | Full suite of linkage methods with dendrogram visualization |
| Python | `sklearn.cluster.AgglomerativeClustering` | Integrates with the [scikit-learn](/wiki/scikit-learn) ecosystem. Supports connectivity constraints. |
| R | `hclust()` in the stats package | Built-in function supporting multiple linkage methods |
| R | `diana()` in the cluster package | Implementation of divisive clustering (DIANA algorithm) |
| MATLAB | `linkage()`, `cluster()`, `dendrogram()` | Comprehensive hierarchical clustering toolkit |
| Java | ELKI, Weka | Research-oriented implementations with many algorithm variants |
| Julia | `Clustering.jl` | Efficient implementations for the Julia ecosystem |

### Python example

A basic example using SciPy and [scikit-learn](/wiki/scikit-learn):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8],
              [1, 0.6], [9, 11], [8, 2], [10, 2], [9, 3]])

# Compute linkage matrix using Ward's method
Z = linkage(X, method='ward', metric='euclidean')

# Plot dendrogram
dendrogram(Z)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Data point index')
plt.ylabel('Distance')
plt.show()

# Cut the dendrogram to obtain 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print('Cluster labels:', labels)
```

## When was hierarchical clustering developed?

Hierarchical clustering has a long history in statistics and numerical taxonomy. Key milestones include:

- **1901.** Karl Pearson introduced the concept of principal axes and distance-based data analysis, laying groundwork for multivariate methods.
- **1930s-1950s.** Numerical taxonomists developed agglomerative methods for biological classification. Robert Sokal and Peter Sneath systematized these approaches in their 1963 book *Principles of Numerical Taxonomy*.
- **1963.** Joe H. Ward published his minimum variance method for hierarchical clustering in the *Journal of the American Statistical Association*.[5]
- **1967.** Lance and Williams published their general recurrence formula in *The Computer Journal*, unifying many linkage criteria.[6]
- **1973.** Sibson published the SLINK algorithm, providing an optimal $$O(n^2)$$ algorithm for single-linkage clustering.[3]
- **1977.** Defays published the CLINK algorithm for optimal complete-linkage clustering.[4]
- **1990.** Kaufman and Rousseeuw published *Finding Groups in Data*, introducing the DIANA divisive algorithm and the silhouette method for cluster validation.[7]
- **1996.** Zhang, Ramakrishnan, and Livny published the BIRCH algorithm for scalable hierarchical clustering.[8]
- **1998-1999.** Guha, Rastogi, and Shim published CURE and ROCK for handling non-spherical clusters and categorical data.[9][10]

## See also

- [Clustering](/wiki/clustering)
- [K-means](/wiki/k-means)
- [Unsupervised learning](/wiki/unsupervised_learning)
- [Dimensionality reduction](/wiki/dimension_reduction)
- [Decision tree](/wiki/decision_tree)

## References

1. Rokach, L., & Maimon, O. (2005). "Clustering Methods." In *Data Mining and Knowledge Discovery Handbook*, pp. 321-352. Springer.
2. Murtagh, F., & Contreras, P. (2012). "Algorithms for hierarchical clustering: an overview." *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*, 2(1), 86-97.
3. Sibson, R. (1973). "SLINK: An optimally efficient algorithm for the single-link cluster method." *The Computer Journal*, 16(1), 30-34. https://academic.oup.com/comjnl/article-abstract/16/1/30/434805
4. Defays, D. (1977). "An efficient algorithm for a complete link method." *The Computer Journal*, 20(4), 364-366.
5. Ward, J.H. (1963). "Hierarchical Grouping to Optimize an Objective Function." *Journal of the American Statistical Association*, 58(301), 236-244. https://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500845
6. Lance, G.N., & Williams, W.T. (1967). "A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems." *The Computer Journal*, 9(4), 373-380. https://academic.oup.com/comjnl/article/9/4/373/354711
7. Kaufman, L., & Rousseeuw, P.J. (1990). *Finding Groups in Data: An Introduction to Cluster Analysis*. John Wiley & Sons.
8. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). "BIRCH: An Efficient Data Clustering Method for Very Large Databases." *Proceedings of the ACM SIGMOD Conference*, pp. 103-114.
9. Guha, S., Rastogi, R., & Shim, K. (1998). "CURE: An Efficient Clustering Algorithm for Large Databases." *Proceedings of the ACM SIGMOD Conference*, pp. 73-84.
10. Guha, S., Rastogi, R., & Shim, K. (2000). "ROCK: A Robust Clustering Algorithm for Categorical Attributes." *Information Systems*, 25(5), 345-366.
11. Sokal, R.R., & Rohlf, F.J. (1962). "The Comparison of Dendrograms by Objective Methods." *Taxon*, 11(2), 33-40.
12. Tibshirani, R., Walther, G., & Hastie, T. (2001). "Estimating the number of clusters in a data set via the gap statistic." *Journal of the Royal Statistical Society: Series B*, 63(2), 411-423.
13. Langfelder, P., Zhang, B., & Horvath, S. (2008). "Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R." *Bioinformatics*, 24(5), 719-720.
14. Mullner, D. (2011). "Modern hierarchical, agglomerative clustering algorithms." *arXiv preprint arXiv:1109.2378*. https://arxiv.org/abs/1109.2378