Divisive clustering, also called top-down clustering, is a hierarchical clustering method that begins with all data points in a single cluster and recursively splits that cluster into smaller subclusters. The process continues until each observation occupies its own cluster, or until a predefined stopping criterion is satisfied. This approach contrasts with agglomerative clustering (bottom-up clustering), which starts with individual data points and merges them iteratively.
Divisive methods are sometimes preferred when analysts are primarily interested in the broad, high-level structure of a dataset because the most meaningful partitions are produced during the earliest splitting steps. The best-known divisive algorithm is DIANA (DIvisive ANAlysis), introduced by Kaufman and Rousseeuw in 1990. Other popular divisive strategies include bisecting k-means and Principal Direction Divisive Partitioning (PDDP).
Imagine you have a big pile of colored blocks. Divisive clustering is like starting with the whole pile and splitting it into two groups: maybe all the warm-colored blocks (red, orange, yellow) in one pile and all the cool-colored blocks (blue, green, purple) in another. Then you look at each pile and split it again. You keep splitting until every single block has its own spot. It is the opposite of putting blocks together one by one into groups; instead, you start big and break things apart.
The general divisive clustering procedure can be described in the following steps:

1. Start with all observations in a single cluster.
2. Select a cluster to split, typically the largest or most heterogeneous one.
3. Divide the selected cluster into two (or more) subclusters according to a splitting criterion.
4. Repeat steps 2 and 3 until every observation is in its own cluster or a stopping criterion is satisfied.
The result is a hierarchy of nested clusters, which can be represented as a dendrogram. The root of the dendrogram corresponds to the initial single cluster containing all observations, and the leaves correspond to individual data points.
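This loop can be captured in a short, generic R skeleton. The `divisive()` function and its `split_fn` argument below are hypothetical names used for illustration; the splitting rule is left abstract so that any concrete algorithm (DIANA, bisecting k-means, PDDP) can supply its own.

```r
# Generic top-down splitting loop (illustrative sketch, not a library API).
# `split_fn` takes a matrix of rows and returns a logical vector selecting
# one of the two subclusters; `min_size` acts as a simple stopping criterion.
divisive <- function(X, split_fn, min_size = 1) {
  result <- list()
  queue  <- list(seq_len(nrow(X)))           # start: one cluster of all rows
  while (length(queue) > 0) {
    idx   <- queue[[1]]
    queue <- queue[-1]
    if (length(idx) <= min_size) {           # stopping criterion reached
      result[[length(result) + 1]] <- idx
      next
    }
    in_a <- split_fn(X[idx, , drop = FALSE]) # logical: which rows go left
    if (all(in_a) || all(!in_a)) {           # degenerate split: keep as leaf
      result[[length(result) + 1]] <- idx
      next
    }
    queue <- c(queue, list(idx[in_a]), list(idx[!in_a]))
  }
  result
}
```

For instance, passing `function(Z) kmeans(Z, 2)$cluster == 1` as `split_fn` turns this skeleton into a bare-bones bisecting 2-means.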
Divisive clustering methods can be classified by how they use features during the splitting step:

- Monothetic methods split clusters on a single feature (variable) at a time; MONA (MONothetic Analysis), designed for binary variables, is the standard example.
- Polythetic methods use all features simultaneously, typically through a dissimilarity measure computed over complete feature vectors; DIANA is polythetic.
DIANA (DIvisive ANAlysis) was introduced by Leonard Kaufman and Peter J. Rousseeuw in their 1990 book Finding Groups in Data: An Introduction to Cluster Analysis. It is the most widely cited divisive clustering algorithm and is implemented in the R programming language through the diana() function in the cluster package.
DIANA constructs a hierarchy of clusters from the top down through the following procedure:
1. Start with one cluster containing all n observations.
2. Select the cluster with the largest diameter. The diameter of a cluster is defined as the maximum dissimilarity between any two observations in that cluster.
3. Identify the most disparate observation. Within the selected cluster, find the observation with the largest average dissimilarity to all other members. This observation is removed from the cluster and placed into a new group called the "splinter group." The remaining observations form the "old party."
4. Reassign observations. For each observation i still in the old party, compute the difference:

   D(i) = [average dissimilarity of i to the other observations in the old party] - [average dissimilarity of i to all observations in the splinter group]

   If the largest D(i) is positive, move that observation from the old party to the splinter group. Repeat this step until all remaining D(i) values are negative or zero, meaning every remaining observation is at least as close (on average) to the old party as to the splinter group. This reassignment loop is sketched in code after the list.
5. Record the split. The selected cluster has now been divided into two subclusters.
6. Repeat from step 2 until every cluster contains a single observation.
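The splinter-group mechanics (steps 3 and 4) can be expressed in a few lines of R. This is an illustrative re-implementation on a precomputed dissimilarity matrix, not the internal code of the cluster package; the `diana_split` name is our own.

```r
# Illustrative re-implementation of one DIANA split (steps 3-4) on a full
# dissimilarity matrix `d`; not the internals of cluster::diana.
diana_split <- function(d, idx) {
  # Step 3: seed the splinter group with the most disparate observation.
  avg_diss  <- sapply(idx, function(i) mean(d[i, setdiff(idx, i)]))
  splinter  <- idx[which.max(avg_diss)]
  old_party <- setdiff(idx, splinter)
  # Step 4: while some D(i) is positive, move that observation across.
  repeat {
    if (length(old_party) == 1) break
    D <- sapply(old_party, function(i) {
      mean(d[i, setdiff(old_party, i)]) - mean(d[i, splinter])
    })
    if (max(D) <= 0) break                 # all D(i) <= 0: the split is done
    mover     <- old_party[which.max(D)]
    splinter  <- c(splinter, mover)
    old_party <- setdiff(old_party, mover)
  }
  list(splinter = splinter, old_party = old_party)
}

# Example: first split of the standardized USArrests data
d <- as.matrix(dist(scale(USArrests)))
first_split <- diana_split(d, seq_len(nrow(d)))
```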
DIANA produces a quality measure called the divisive coefficient (DC). For each observation i, let d(i) be the diameter of the last cluster to which it belonged before being separated as a singleton, divided by the diameter of the entire dataset. The divisive coefficient is then computed as:
DC = (1/n) * sum of (1 - d(i)) for all observations i
DC values range from 0 to 1. A value close to 1 indicates strong clustering structure, while a value close to 0 suggests that the data lacks clear group separation. The DC is analogous to the agglomerative coefficient (AC) used in agglomerative methods.
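As a toy illustration of the formula, with made-up d(i) values for n = 4 observations:

```r
# Hypothetical normalized last-cluster diameters d(i), n = 4
d_i <- c(0.20, 0.25, 0.10, 0.15)
DC  <- mean(1 - d_i)  # equivalent to (1/n) * sum of (1 - d(i))
DC                    # 0.825, indicating fairly strong structure
```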
DIANA also provides a visualization called the banner plot, which displays the order in which clusters are split. Each observation is represented by a horizontal bar whose length corresponds to the diameter of the cluster from which it was last separated. The banner plot offers an intuitive view of the hierarchical splitting process and helps analysts assess the clustering structure.
```r
library(cluster)

# Run DIANA on the USArrests dataset (standardized variables)
res.diana <- diana(USArrests, stand = TRUE)

# View the divisive coefficient
res.diana$dc
# [1] 0.8514 (example value)

# Plot the dendrogram
plot(res.diana, which.plots = 2, main = "DIANA Dendrogram")

# Cut the dendrogram into 4 clusters
clusters <- cutree(as.hclust(res.diana), k = 4)
```
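The banner plot described above is available through the same plot() method:

```r
# Banner plot: bar lengths show the diameter at which each observation
# was last split off
plot(res.diana, which.plots = 1, main = "DIANA Banner")
```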
Bisecting k-means is a divisive clustering algorithm that combines the hierarchical top-down strategy with the flat partitioning approach of k-means clustering. It was popularized by Steinbach, Karypis, and Kumar in their 2000 paper A Comparison of Document Clustering Techniques.
The bisecting k-means procedure works as follows:

1. Start with all observations in a single cluster.
2. Select a cluster to split, typically the largest cluster or the one with the highest sum of squared errors.
3. Bisect the selected cluster into two subclusters by running k-means with k = 2.
4. Repeat steps 2 and 3 until the desired number of clusters is reached.
To improve robustness, the bisection step is often repeated multiple times with different random initializations, and the split that produces the best result (lowest total sum of squared errors) is kept.
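A minimal sketch of this procedure in R, built on the base kmeans() function, might look as follows; the `bisecting_kmeans` name and the SSE-based selection rule are our own choices for illustration, not a library API.

```r
# Minimal bisecting k-means sketch using base R's kmeans()
# (illustrative; not a library implementation).
bisecting_kmeans <- function(X, k, nstart = 10) {
  clusters <- list(seq_len(nrow(X)))            # start with one cluster
  while (length(clusters) < k) {
    # Select the cluster with the highest within-cluster sum of squares.
    wss <- sapply(clusters, function(idx) {
      Z <- X[idx, , drop = FALSE]
      sum(scale(Z, scale = FALSE)^2)            # SSE about the centroid
    })
    target <- which.max(wss)
    idx    <- clusters[[target]]
    # Bisect with k-means; nstart restarts keep the best of several splits.
    km <- kmeans(X[idx, , drop = FALSE], centers = 2, nstart = nstart)
    clusters[[target]] <- idx[km$cluster == 1]
    clusters[[length(clusters) + 1]] <- idx[km$cluster == 2]
  }
  clusters
}

# Example: four clusters from the standardized USArrests data
groups <- bisecting_kmeans(scale(USArrests), k = 4)
```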
Bisecting k-means is available in several major libraries and frameworks:
| Library / Framework | Class or function | Language |
|---|---|---|
| scikit-learn | sklearn.cluster.BisectingKMeans | Python |
| Apache Spark MLlib | pyspark.ml.clustering.BisectingKMeans | Python, Scala, Java, R |
| R cluster package | Manual implementation using kmeans() | R |
| NLTK | nltk.cluster.KMeansClusterer (with bisecting wrapper) | Python |
In scikit-learn (added in version 1.1), the BisectingKMeans class supports two bisecting strategies: "largest_cluster" (selects the cluster with the most points) and "biggest_inertia" (selects the cluster with the highest sum of squared errors).
PDDP is a divisive clustering algorithm introduced by Daniel Boley in 1998. Instead of using distance-based heuristics or k-means, PDDP splits clusters by projecting data onto the first principal component (the direction of greatest variance) and dividing along a hyperplane perpendicular to that direction.
PDDP has been applied extensively in text mining and information retrieval, where documents are represented as high-dimensional sparse vectors.
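A single PDDP split is straightforward to express with a rank-1 singular value decomposition: center the data, project onto the first principal direction, and cut at the hyperplane through the mean. The sketch below is an illustrative R translation of that idea, not Boley's reference implementation.

```r
# One PDDP split: project the centered data onto the first principal
# component and divide at the hyperplane through the column means.
pddp_split <- function(X) {
  Xc   <- scale(X, center = TRUE, scale = FALSE)  # center each column
  v    <- svd(Xc, nu = 0, nv = 1)$v[, 1]          # first principal direction
  proj <- Xc %*% v                                # signed projections
  list(left = which(proj <= 0), right = which(proj > 0))
}

# Example: first split of the USArrests data
split1 <- pddp_split(as.matrix(USArrests))
```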
The computational cost of divisive clustering depends on the specific algorithm used and the splitting strategy employed.
| Algorithm | Time complexity | Space complexity | Notes |
|---|---|---|---|
| Exhaustive divisive | O(2^n) | O(n^2) | Considers all possible bipartitions; impractical for large datasets |
| DIANA | O(n^2) per split, O(n^3) total | O(n^2) | Requires a full dissimilarity matrix; quadratic space |
| Bisecting k-means | O(n * t * d) per split | O(n * d) | Linear in n; d is the number of features, t is the k-means iteration count (each split uses k = 2) |
| PDDP | O(n * d) per split | O(n * d) | Efficient for sparse, high-dimensional data |
The exhaustive approach evaluates all 2^(n-1) - 1 possible ways to divide a cluster of n points into two nonempty subsets; even a modest cluster of 30 points admits 2^29 - 1 ≈ 5.4 × 10^8 candidate bipartitions. This is computationally infeasible for any nontrivial dataset, which is why all practical divisive algorithms use heuristics to select the split.
Divisive and agglomerative clustering are the two main families of hierarchical clustering. They produce dendrograms that look similar, but their construction processes differ, which leads to distinct strengths and weaknesses.
| Property | Divisive (top-down) | Agglomerative (bottom-up) |
|---|---|---|
| Starting point | All points in one cluster | Each point in its own cluster |
| Direction | Splits clusters | Merges clusters |
| Global vs. local decisions | Makes global decisions early (top-level splits consider all data) | Makes local decisions early (bottom-level merges consider nearby points) |
| Strength | Better at identifying large, well-separated clusters | Better at identifying small, tightly grouped clusters |
| Dendrogram reliability | Upper levels of dendrogram are more reliable | Lower levels of dendrogram are more reliable |
| Common algorithms | DIANA, bisecting k-means, PDDP | Single linkage, complete linkage, average linkage, Ward's method |
| Typical time complexity | O(n^2) to O(n^3) for DIANA; O(n * k) for bisecting k-means | O(n^2) to O(n^3) depending on linkage |
| Implementation availability | Less common; fewer library implementations | Widely available in most ML libraries |
| Irreversibility | Early splits cannot be undone and constrain all later clusters | Early merges cannot be undone and constrain all later clusters |
One practical advantage of divisive methods is that analysts who are interested only in the top few partitions can stop early and obtain a useful result without running the algorithm to completion. In agglomerative clustering, the full dendrogram must typically be built before meaningful large-scale structure is visible.
Conversely, agglomerative methods tend to handle outliers more gracefully. In divisive clustering, an outlier may be separated into its own cluster during an early split, which can distort the remaining hierarchy.
Divisive clustering algorithms require a stopping criterion to determine when to stop splitting. Common approaches include:

- Stopping once a predefined number of clusters k has been reached.
- Stopping once every cluster's diameter, sum of squared errors, or other heterogeneity measure falls below a threshold.
- Stopping once clusters reach a minimum size, below which further splits are not considered meaningful.
- Building the full hierarchy down to singletons and cutting the dendrogram afterwards, as in the snippet below.
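For example, with the DIANA hierarchy fitted earlier, the first two criteria can be applied after the fact by cutting the dendrogram at a cluster count or at a dissimilarity height:

```r
hc <- as.hclust(res.diana)  # hierarchy from the DIANA example above
cutree(hc, k = 4)           # cut to a fixed number of clusters
cutree(hc, h = 2.5)         # cut at a fixed dissimilarity height instead
```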
Divisive clustering methods have been applied across a wide range of domains:
Divisive methods are well-suited for organizing large collections of documents into hierarchical topic taxonomies. Bisecting k-means, in particular, has been shown to perform well for document clustering because the top-down approach naturally produces a browsable hierarchy. Steinbach et al. (2000) demonstrated that bisecting k-means outperformed both standard k-means and several agglomerative methods on document clustering benchmarks. PDDP was also designed specifically with text data in mind and takes advantage of the sparse, high-dimensional nature of document-term matrices.
Hierarchical clustering is widely used in bioinformatics to group genes with similar expression profiles. Divisive methods can produce dendrograms that reveal the major functional categories of genes before showing finer subcategories. The MITree (Matrix Incision Tree) algorithm is a divisive method specifically designed for gene expression data that uses geometric space-partitioning to determine optimal splitting hyperplanes. Divisive approaches have also been used for phylogenetic tree construction and protein sequence clustering.
In computer vision, divisive clustering can be used to segment images by recursively partitioning pixel groups based on color, texture, or spatial features. The top-down approach can quickly separate an image into foreground and background regions before refining smaller segments.
Divisive methods can partition social networks into communities by recursively splitting the network graph. The Girvan-Newman algorithm, while not a traditional clustering algorithm, follows a divisive logic by iteratively removing edges with the highest betweenness centrality to split the network into communities.
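In R, the igraph package implements this edge-betweenness procedure directly, as a brief illustration on the classic Zachary karate-club network shows:

```r
library(igraph)

g <- make_graph("Zachary")           # Zachary karate club benchmark graph
comm <- cluster_edge_betweenness(g)  # Girvan-Newman divisive communities
membership(comm)                     # community label for each vertex
```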
DIANA has been applied to GPS data for classifying urban street segments based on traffic characteristics and level-of-service criteria, demonstrating the algorithm's utility in spatial data analysis.
Several extensions and variants of divisive clustering have been proposed. MONA (MONothetic Analysis), also introduced by Kaufman and Rousseeuw and implemented as mona() in the R cluster package, splits on one binary variable at a time rather than using a full dissimilarity matrix. Bisecting k-means admits variants that differ in how the next cluster to split is chosen (for example, largest size versus largest sum of squared errors, as in scikit-learn's two strategies), and PDDP has inspired refinements that alter the projection direction or the position of the splitting hyperplane.