Divisive clustering, also called top-down clustering, is a hierarchical clustering method that begins with all data points in a single cluster and recursively splits that cluster into smaller subclusters. The process continues until each observation occupies its own cluster, or until a predefined stopping criterion is satisfied. This approach contrasts with agglomerative clustering (bottom-up clustering), which starts with individual data points and merges them iteratively.
Divisive methods are sometimes preferred when analysts are primarily interested in the broad, high-level structure of a dataset because the most meaningful partitions are produced during the earliest splitting steps. The best-known divisive algorithm is DIANA (DIvisive ANAlysis), introduced by Kaufman and Rousseeuw in 1990. Other popular divisive strategies include bisecting k-means and Principal Direction Divisive Partitioning (PDDP).
Imagine you have a big pile of colored blocks. Divisive clustering is like starting with the whole pile and splitting it into two groups: maybe all the warm-colored blocks (red, orange, yellow) in one pile and all the cool-colored blocks (blue, green, purple) in another. Then you look at each pile and split it again. You keep splitting until every single block has its own spot. It is the opposite of putting blocks together one by one into groups; instead, you start big and break things apart.
The general divisive clustering procedure can be described in the following steps:

1. Start with all observations in a single cluster.
2. Select a cluster to split, typically the largest or most heterogeneous one.
3. Divide the selected cluster into two (or more) subclusters according to a splitting criterion.
4. Repeat steps 2 and 3 until every observation is in its own cluster or a stopping criterion is satisfied.
The result is a hierarchy of nested clusters, which can be represented as a dendrogram. The root of the dendrogram corresponds to the initial single cluster containing all observations, and the leaves correspond to individual data points.
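This loop can be captured in a short, generic R skeleton. The `divisive()` function and its `split_fn` argument below are hypothetical names used for illustration; the splitting rule is left abstract so that any concrete algorithm (DIANA, bisecting k-means, PDDP) can supply its own.

```r
# Generic top-down splitting loop (illustrative sketch, not a library API).
# `split_fn` takes a matrix of rows and returns a logical vector selecting
# one of the two subclusters; `min_size` acts as a simple stopping criterion.
divisive <- function(X, split_fn, min_size = 1) {
  result <- list()
  queue  <- list(seq_len(nrow(X)))           # start: one cluster of all rows
  while (length(queue) > 0) {
    idx   <- queue[[1]]
    queue <- queue[-1]
    if (length(idx) <= min_size) {           # stopping criterion reached
      result[[length(result) + 1]] <- idx
      next
    }
    in_a <- split_fn(X[idx, , drop = FALSE]) # logical: which rows go left
    if (all(in_a) || all(!in_a)) {           # degenerate split: keep as leaf
      result[[length(result) + 1]] <- idx
      next
    }
    queue <- c(queue, list(idx[in_a]), list(idx[!in_a]))
  }
  result
}
```

For instance, passing `function(Z) kmeans(Z, 2)$cluster == 1` as `split_fn` turns this skeleton into a bare-bones bisecting 2-means.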
Divisive clustering methods can be classified by how they use features during the splitting step:

- Monothetic methods split clusters on a single feature (variable) at a time; MONA (MONothetic Analysis), designed for binary variables, is the standard example.
- Polythetic methods use all features simultaneously, typically through a dissimilarity measure computed over complete feature vectors; DIANA is polythetic.
DIANA (DIvisive ANAlysis) was introduced by Leonard Kaufman and Peter J. Rousseeuw in their 1990 book Finding Groups in Data: An Introduction to Cluster Analysis. It is the most widely cited divisive clustering algorithm and is implemented in the R programming language through the diana() function in the cluster package.
DIANA constructs a hierarchy of clusters from the top down through the following procedure:
1. Start with one cluster containing all n observations.
2. Select the cluster with the largest diameter. The diameter of a cluster is defined as the maximum dissimilarity between any two observations in that cluster.
3. Identify the most disparate observation. Within the selected cluster, find the observation with the largest average dissimilarity to all other members. This observation is removed from the cluster and placed into a new group called the "splinter group." The remaining observations form the "old party."
4. Reassign observations. For each observation i still in the old party, compute the difference:

   D(i) = [average dissimilarity of i to the other observations in the old party] - [average dissimilarity of i to all observations in the splinter group]

   If the largest D(i) is positive, move that observation from the old party to the splinter group. Repeat this step until all remaining D(i) values are negative or zero, meaning every remaining observation is at least as close (on average) to the old party as to the splinter group. This reassignment loop is sketched in code after the list.
5. Record the split. The selected cluster has now been divided into two subclusters.
6. Repeat from step 2 until every cluster contains a single observation.
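The splinter-group mechanics (steps 3 and 4) can be expressed in a few lines of R. This is an illustrative re-implementation on a precomputed dissimilarity matrix, not the internal code of the cluster package; the `diana_split` name is our own.

```r
# Illustrative re-implementation of one DIANA split (steps 3-4) on a full
# dissimilarity matrix `d`; not the internals of cluster::diana.
diana_split <- function(d, idx) {
  # Step 3: seed the splinter group with the most disparate observation.
  avg_diss  <- sapply(idx, function(i) mean(d[i, setdiff(idx, i)]))
  splinter  <- idx[which.max(avg_diss)]
  old_party <- setdiff(idx, splinter)
  # Step 4: while some D(i) is positive, move that observation across.
  repeat {
    if (length(old_party) == 1) break
    D <- sapply(old_party, function(i) {
      mean(d[i, setdiff(old_party, i)]) - mean(d[i, splinter])
    })
    if (max(D) <= 0) break                 # all D(i) <= 0: the split is done
    mover     <- old_party[which.max(D)]
    splinter  <- c(splinter, mover)
    old_party <- setdiff(old_party, mover)
  }
  list(splinter = splinter, old_party = old_party)
}

# Example: first split of the standardized USArrests data
d <- as.matrix(dist(scale(USArrests)))
first_split <- diana_split(d, seq_len(nrow(d)))
```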
DIANA produces a quality measure called the divisive coefficient (DC). For each observation i, let d(i) be the diameter of the last cluster to which it belonged before being separated as a singleton, divided by the diameter of the entire dataset. The divisive coefficient is then computed as:
DC = (1/n) * sum of (1 - d(i)) for all observations i
DC values range from 0 to 1. A value close to 1 indicates strong clustering structure, while a value close to 0 suggests that the data lacks clear group separation. The DC is analogous to the agglomerative coefficient (AC) used in agglomerative methods.
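As a toy illustration of the formula, with made-up d(i) values for n = 4 observations:

```r
# Hypothetical normalized last-cluster diameters d(i), n = 4
d_i <- c(0.20, 0.25, 0.10, 0.15)
DC  <- mean(1 - d_i)  # equivalent to (1/n) * sum of (1 - d(i))
DC                    # 0.825, indicating fairly strong structure
```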
DIANA also provides a visualization called the banner plot, which displays the order in which clusters are split. Each observation is represented by a horizontal bar whose length corresponds to the diameter of the cluster from which it was last separated. The banner plot offers an intuitive view of the hierarchical splitting process and helps analysts assess the clustering structure.
```r
library(cluster)

# Run DIANA on the USArrests dataset (standardized variables)
res.diana <- diana(USArrests, stand = TRUE)

# View the divisive coefficient
res.diana$dc
# [1] 0.8514 (example value)

# Plot the dendrogram
plot(res.diana, which.plots = 2, main = "DIANA Dendrogram")

# Cut the dendrogram into 4 clusters
clusters <- cutree(as.hclust(res.diana), k = 4)
```
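The banner plot described above is available through the same plot() method:

```r
# Banner plot: bar lengths show the diameter at which each observation
# was last split off
plot(res.diana, which.plots = 1, main = "DIANA Banner")
```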
Bisecting k-means is a divisive clustering algorithm that combines the hierarchical top-down strategy with the flat partitioning approach of k-means clustering. It was popularized by Steinbach, Karypis, and Kumar in their 2000 paper A Comparison of Document Clustering Techniques.
The bisecting k-means procedure works as follows:

1. Start with all observations in a single cluster.
2. Select a cluster to split, typically the largest cluster or the one with the highest sum of squared errors.
3. Bisect the selected cluster into two subclusters by running k-means with k = 2.
4. Repeat steps 2 and 3 until the desired number of clusters is reached.
To improve robustness, the bisection step is often repeated multiple times with different random initializations, and the split that produces the best result (lowest total sum of squared errors) is kept.
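A minimal sketch of this procedure in R, built on the base kmeans() function, might look as follows; the `bisecting_kmeans` name and the SSE-based selection rule are our own choices for illustration, not a library API.

```r
# Minimal bisecting k-means sketch using base R's kmeans()
# (illustrative; not a library implementation).
bisecting_kmeans <- function(X, k, nstart = 10) {
  clusters <- list(seq_len(nrow(X)))            # start with one cluster
  while (length(clusters) < k) {
    # Select the cluster with the highest within-cluster sum of squares.
    wss <- sapply(clusters, function(idx) {
      Z <- X[idx, , drop = FALSE]
      sum(scale(Z, scale = FALSE)^2)            # SSE about the centroid
    })
    target <- which.max(wss)
    idx    <- clusters[[target]]
    # Bisect with k-means; nstart restarts keep the best of several splits.
    km <- kmeans(X[idx, , drop = FALSE], centers = 2, nstart = nstart)
    clusters[[target]] <- idx[km$cluster == 1]
    clusters[[length(clusters) + 1]] <- idx[km$cluster == 2]
  }
  clusters
}

# Example: four clusters from the standardized USArrests data
groups <- bisecting_kmeans(scale(USArrests), k = 4)
```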
Bisecting k-means is available in several major libraries and frameworks:
| Library / Framework | Class or function | Language |
|---|---|---|
| scikit-learn | sklearn.cluster.BisectingKMeans | Python |
| Apache Spark MLlib | pyspark.ml.clustering.BisectingKMeans | Python, Scala, Java, R |
| R cluster package | Manual implementation using kmeans() | R |
| NLTK | nltk.cluster.KMeansClusterer (with bisecting wrapper) | Python |
In scikit-learn (added in version 1.1), the BisectingKMeans class supports two bisecting strategies: "largest_cluster" (selects the cluster with the most points) and "biggest_inertia" (selects the cluster with the highest sum of squared errors).
PDDP is a divisive clustering algorithm introduced by Daniel Boley in 1998. Instead of using distance-based heuristics or k-means, PDDP splits clusters by projecting data onto the first principal component (the direction of greatest variance) and dividing along a hyperplane perpendicular to that direction.
PDDP has been applied extensively in text mining and information retrieval, where documents are represented as high-dimensional sparse vectors.
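A single PDDP split is straightforward to express with a rank-1 singular value decomposition: center the data, project onto the first principal direction, and cut at the hyperplane through the mean. The sketch below is an illustrative R translation of that idea, not Boley's reference implementation.

```r
# One PDDP split: project the centered data onto the first principal
# component and divide at the hyperplane through the column means.
pddp_split <- function(X) {
  Xc   <- scale(X, center = TRUE, scale = FALSE)  # center each column
  v    <- svd(Xc, nu = 0, nv = 1)$v[, 1]          # first principal direction
  proj <- Xc %*% v                                # signed projections
  list(left = which(proj <= 0), right = which(proj > 0))
}

# Example: first split of the USArrests data
split1 <- pddp_split(as.matrix(USArrests))
```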
The computational cost of divisive clustering depends on the specific algorithm used and the splitting strategy employed.
| Algorithm | Time complexity | Space complexity | Notes |
|---|---|---|---|
| Exhaustive divisive | O(2^n) | O(n^2) | Considers all possible bipartitions; impractical for large datasets |
| DIANA | O(n^2) per split, O(n^3) total | O(n^2) | Requires a full dissimilarity matrix; quadratic space |
| Bisecting k-means | O(n * t * d) per split | O(n * d) | Linear in n; d is the number of features, t is the k-means iteration count (each split uses k = 2) |
| PDDP | O(n * d) per split | O(n * d) | Efficient for sparse, high-dimensional data |
The exhaustive approach evaluates all 2^(n-1) - 1 possible ways to divide a cluster of n points into two nonempty subsets; even a modest cluster of 30 points admits 2^29 - 1 ≈ 5.4 × 10^8 candidate bipartitions. This is computationally infeasible for any nontrivial dataset, which is why all practical divisive algorithms use heuristics to select the split.
Divisive and agglomerative clustering are the two main families of hierarchical clustering. They produce dendrograms that look similar, but their construction processes differ, which leads to distinct strengths and weaknesses.
| Property | Divisive (top-down) | Agglomerative (bottom-up) |
|---|---|---|
| Starting point | All points in one cluster | Each point in its own cluster |
| Direction | Splits clusters | Merges clusters |
| Global vs. local decisions | Makes global decisions early (top-level splits consider all data) | Makes local decisions early (bottom-level merges consider nearby points) |
| Strength | Better at identifying large, well-separated clusters | Better at identifying small, tightly grouped clusters |
| Dendrogram reliability | Upper levels of dendrogram are more reliable | Lower levels of dendrogram are more reliable |
| Common algorithms | DIANA, bisecting k-means, PDDP | Single linkage, complete linkage, average linkage, Ward's method |
| Typical time complexity | O(n^2) to O(n^3) for DIANA; O(n * k) for bisecting k-means | O(n^2) to O(n^3) depending on linkage |
| Implementation availability | Less common; fewer library implementations | Widely available in most ML libraries |
| Irreversibility | Early splits cannot be undone and constrain all later clusters | Early merges cannot be undone and constrain all later clusters |
One practical advantage of divisive methods is that analysts who are interested only in the top few partitions can stop early and obtain a useful result without running the algorithm to completion. In agglomerative clustering, the full dendrogram must typically be built before meaningful large-scale structure is visible.
Conversely, agglomerative methods tend to handle outliers more gracefully. In divisive clustering, an outlier may be separated into its own cluster during an early split, which can distort the remaining hierarchy.
Divisive clustering algorithms require a stopping criterion to determine when to stop splitting. Common approaches include:

- Stopping once a predefined number of clusters k has been reached.
- Stopping once every cluster's diameter, sum of squared errors, or other heterogeneity measure falls below a threshold.
- Stopping once clusters reach a minimum size, below which further splits are not considered meaningful.
- Building the full hierarchy down to singletons and cutting the dendrogram afterwards, as in the snippet below.
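For example, with the DIANA hierarchy fitted earlier, the first two criteria can be applied after the fact by cutting the dendrogram at a cluster count or at a dissimilarity height:

```r
hc <- as.hclust(res.diana)  # hierarchy from the DIANA example above
cutree(hc, k = 4)           # cut to a fixed number of clusters
cutree(hc, h = 2.5)         # cut at a fixed dissimilarity height instead
```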
Divisive clustering methods have been applied across a wide range of domains:
Divisive methods are well-suited for organizing large collections of documents into hierarchical topic taxonomies. Bisecting k-means, in particular, has been shown to perform well for document clustering because the top-down approach naturally produces a browsable hierarchy. Steinbach et al. (2000) demonstrated that bisecting k-means outperformed both standard k-means and several agglomerative methods on document clustering benchmarks. PDDP was also designed specifically with text data in mind and takes advantage of the sparse, high-dimensional nature of document-term matrices.
Hierarchical clustering is widely used in bioinformatics to group genes with similar expression profiles. Divisive methods can produce dendrograms that reveal the major functional categories of genes before showing finer subcategories. The MITree (Matrix Incision Tree) algorithm is a divisive method specifically designed for gene expression data that uses geometric space-partitioning to determine optimal splitting hyperplanes. Divisive approaches have also been used for phylogenetic tree construction and protein sequence clustering.
In computer vision, divisive clustering can be used to segment images by recursively partitioning pixel groups based on color, texture, or spatial features. The top-down approach can quickly separate an image into foreground and background regions before refining smaller segments.
Divisive methods can partition social networks into communities by recursively splitting the network graph. The Girvan-Newman algorithm, while not a traditional clustering algorithm, follows a divisive logic by iteratively removing edges with the highest betweenness centrality to split the network into communities.
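In R, the igraph package implements this edge-betweenness procedure directly, as a brief illustration on the classic Zachary karate-club network shows:

```r
library(igraph)

g <- make_graph("Zachary")           # Zachary karate club benchmark graph
comm <- cluster_edge_betweenness(g)  # Girvan-Newman divisive communities
membership(comm)                     # community label for each vertex
```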
DIANA has been applied to GPS data for classifying urban street segments based on traffic characteristics and level-of-service criteria, demonstrating the algorithm's utility in spatial data analysis.
Several extensions and variants of divisive clustering have been proposed. MONA (MONothetic Analysis), also introduced by Kaufman and Rousseeuw and implemented as mona() in the R cluster package, splits on one binary variable at a time rather than using a full dissimilarity matrix. Bisecting k-means admits variants that differ in how the next cluster to split is chosen (for example, largest size versus largest sum of squared errors, as in scikit-learn's two strategies), and PDDP has inspired refinements that alter the projection direction or the position of the splitting hyperplane.