Gini impurity is a measure of how often a randomly chosen element from a dataset would be incorrectly classified if it were labeled randomly according to the distribution of classes in that dataset. It is one of the most widely used splitting criteria in decision tree algorithms, serving as the default impurity measure in the CART (Classification and Regression Trees) algorithm introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984. The metric quantifies the degree of "impurity" or disorder within a node of a decision tree, guiding the algorithm toward splits that produce the purest possible child nodes.
Gini impurity is valued in practice for its computational simplicity. Unlike entropy-based measures, it does not require logarithmic calculations, making it faster to evaluate during tree construction. Despite this simplicity, it produces decision trees that are nearly identical in quality to those built using information gain, which is why it remains the default splitting criterion in popular machine learning libraries such as scikit-learn.
The measure goes by several names across different fields. In ecology and biodiversity research, the same formula is known as the Gini-Simpson index or Simpson's diversity index. In sociology it appears as the Gibbs-Martin index or Blau index. In machine learning, the terms "Gini impurity," "Gini index," and "Gini's diversity index" are used interchangeably.
The concept behind Gini impurity originates from the work of Italian statistician Corrado Gini (1884-1965). In 1912, Gini published his paper "Variabilità e mutabilità" ("Variability and Mutability") while serving as the head of statistics at the University of Cagliari in Sardinia, Italy. In that work he introduced the simple mean difference as a measure of variability, which he later applied to income and wealth inequality. The Gini coefficient used in economics to measure income inequality descends from this 1912 paper.
The specific application of Gini's diversity measure to decision tree construction came in 1984, when Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published Classification and Regression Trees. This book introduced the CART algorithm, which adopted Gini impurity as its primary splitting criterion for classification tasks. The CART framework established a rigorous, theoretically grounded approach to tree-based learning that remains foundational today.
Around the same time, Ross Quinlan independently developed the ID3 algorithm (1986) and later C4.5 (1993), both of which use Shannon entropy and information gain as their splitting criteria. The parallel development of CART with Gini impurity and ID3/C4.5 with entropy created two traditions in decision tree construction that continue to coexist in modern software.
Separately, ecologist Edward H. Simpson introduced the same mathematical formula in 1949 to measure species concentration in ecological communities. Simpson's diversity index, defined as 1 minus the sum of squared species proportions, is mathematically identical to the Gini impurity formula used in machine learning.
Given a dataset (or node) containing samples from K classes, let p_i denote the proportion of samples belonging to class i. The Gini impurity is defined as:
Gini(S) = 1 - Σ(p_i)² for i = 1 to K
An equivalent formulation expresses the same quantity as:
Gini(S) = Σ p_i(1 - p_i) for i = 1 to K
Both formulas yield the same result. The first form emphasizes that Gini impurity equals one minus the sum of squared class probabilities. The second form makes the probabilistic interpretation more transparent: for each class i, the term p_i(1 - p_i) represents the probability of selecting an element of class i (probability p_i) and then misclassifying it by assigning a different label (probability 1 - p_i).
The equivalence between the two forms can be shown algebraically:
Σ p_i(1 - p_i) = Σ p_i - Σ p_i² = 1 - Σ p_i²
The last step follows because the class probabilities sum to one: Σ p_i = 1.
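As a quick numerical check of this equivalence, the following sketch evaluates both formulations on an arbitrary class distribution (the probabilities are illustrative, not from any particular dataset):

```python
import numpy as np

# Check that the two equivalent formulations agree numerically.
p = np.array([0.1, 0.4, 0.5])
print(1 - np.sum(p ** 2))   # 1 - sum of squared proportions -> 0.58
print(np.sum(p * (1 - p)))  # sum of p_i * (1 - p_i)         -> 0.58
```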
The Gini impurity is always non-negative and has well-defined minimum and maximum values:
| Condition | Gini impurity value | Interpretation |
|---|---|---|
| All samples belong to one class | 0 | Perfectly pure node; no misclassification possible |
| Binary classification, equal split (50/50) | 0.5 | Maximum impurity for two classes |
| 3 classes, uniform distribution | 0.667 | Maximum impurity for three classes |
| 10 classes, uniform distribution | 0.9 | Maximum impurity for ten classes |
| K classes, uniform distribution (each class has probability 1/K) | 1 - 1/K | Maximum impurity for K classes |
For a binary classification problem, the Gini impurity ranges from 0 to 0.5. For a problem with K classes, the maximum value is 1 - 1/K, which approaches 1 as the number of classes grows large. The maximum is achieved when all classes are equally represented, meaning p_i = 1/K for every class.
To see why the maximum Gini impurity equals 1 - 1/K, substitute p_i = 1/K for all classes into the formula:
Gini_max = 1 - Σ(1/K)² = 1 - K × (1/K²) = 1 - 1/K
For binary classification (K = 2), this gives 1 - 1/2 = 0.5. For three classes, it gives 1 - 1/3 ≈ 0.667. For ten classes, it gives 1 - 1/10 = 0.9.
The Gini impurity has an intuitive probabilistic interpretation. Imagine drawing two samples independently and at random from a node, with replacement. The Gini impurity equals the probability that the two samples belong to different classes. Equivalently, if you draw one sample and assign it a label drawn at random from the node's class distribution, the Gini impurity is the probability that the assigned label does not match the sample's true class.
This interpretation explains why a pure node (all samples from one class) has Gini impurity of 0: any two random draws will always agree. Conversely, a maximally impure node has the highest chance that two random draws will disagree.
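The two-draw interpretation is easy to verify by simulation. The following sketch uses an arbitrary three-class distribution and confirms that the empirical disagreement rate matches the analytic Gini value:

```python
import numpy as np

# Monte Carlo check of the two-draw interpretation.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])           # node's class proportions
print(1 - np.sum(p ** 2))               # analytic Gini: 0.62

a = rng.choice(3, size=1_000_000, p=p)  # first draw
b = rng.choice(3, size=1_000_000, p=p)  # second draw, independent
print(np.mean(a != b))                  # empirical disagreement rate, approximately 0.62
```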
Consider a dataset of 10 animals that need to be classified as "Cat," "Dog," or "Bird." Suppose the dataset contains 5 cats, 3 dogs, and 2 birds.
Step 1: Calculate class proportions.
| Class | Count | Proportion (p_i) |
|---|---|---|
| Cat | 5 | 5/10 = 0.50 |
| Dog | 3 | 3/10 = 0.30 |
| Bird | 2 | 2/10 = 0.20 |
Step 2: Square each proportion and sum them.
Σ(p_i)² = 0.50² + 0.30² + 0.20² = 0.25 + 0.09 + 0.04 = 0.38
Step 3: Compute Gini impurity.
Gini = 1 - 0.38 = 0.62
This value of 0.62 indicates moderate impurity. The maximum possible Gini impurity for three classes is 1 - 1/3 ≈ 0.667, so this dataset is fairly close to maximum disorder.
Consider a dataset of 80 objects across 3 classes: 19 objects in class A, 21 in class B, and 40 in class C.
| Class | Count | Proportion (p_i) | (p_i)² |
|---|---|---|---|
| A | 19 | 19/80 = 0.2375 | 0.0564 |
| B | 21 | 21/80 = 0.2625 | 0.0689 |
| C | 40 | 40/80 = 0.5000 | 0.2500 |
Gini = 1 - (0.0564 + 0.0689 + 0.2500) = 1 - 0.3753 = 0.6247
This is lower than the maximum of 0.667 for three classes because class C is overrepresented, pulling the distribution away from uniform.
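Both worked examples can be reproduced in a few lines from the class counts alone (a small sketch; the helper name is illustrative):

```python
import numpy as np

def gini_from_counts(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(gini_from_counts([5, 3, 2]))     # 0.62     (cats, dogs, birds)
print(gini_from_counts([19, 21, 40]))  # 0.624688 (classes A, B, C)
```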
When a decision tree evaluates a candidate split, it does not simply compute the Gini impurity of the child nodes independently. Instead, it calculates a weighted Gini impurity that accounts for the number of samples routed to each child.
If a parent node with N samples is split into a left child with N_L samples and a right child with N_R samples, the weighted Gini impurity of the split is:
Gini_split = (N_L / N) × Gini(left) + (N_R / N) × Gini(right)
The algorithm evaluates this weighted impurity for every possible split on every feature and selects the split that produces the lowest weighted Gini impurity. Equivalently, the algorithm selects the split that maximizes the Gini gain, defined as:
Gini Gain = Gini(parent) - Gini_split
A larger Gini gain means the split does a better job of separating classes.
Continuing the animal classification example, suppose we consider splitting the dataset on the feature "Has Feathers" (Yes or No).
Left child (Has Feathers = Yes): 0 cats, 0 dogs, 2 birds (2 samples)
Gini(left) = 1 - (0² + 0² + 1.0²) = 1 - 1.0 = 0.0 (pure node)
Right child (Has Feathers = No): 5 cats, 3 dogs, 0 birds (8 samples)
Gini(right) = 1 - (0.625² + 0.375² + 0²) = 1 - (0.3906 + 0.1406) = 1 - 0.5312 = 0.4688
Weighted Gini impurity:
Gini_split = (2/10) × 0.0 + (8/10) × 0.4688 = 0 + 0.3750 = 0.3750
Gini gain:
Gini Gain = 0.62 - 0.375 = 0.245
The algorithm would compare this Gini gain against the gains from all other candidate splits and choose the one with the highest gain (or equivalently, the lowest weighted Gini impurity).
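The arithmetic of this split evaluation fits in a short script (a sketch reproducing the numbers above):

```python
# Verifying the "Has Feathers" split by hand.
parent = 1 - (0.5**2 + 0.3**2 + 0.2**2)       # 0.62
left = 0.0                                     # pure node: birds only
right = 1 - (0.625**2 + 0.375**2)              # 0.46875
weighted = (2 / 10) * left + (8 / 10) * right  # 0.375
print(round(parent - weighted, 3))             # Gini gain: 0.245
```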
The procedure for evaluating candidate splits using Gini impurity differs depending on whether a feature is continuous or categorical.
For a continuous feature (such as age, temperature, or income), the algorithm must determine a threshold value t so that samples with feature values less than or equal to t go to the left child and samples with values greater than t go to the right child. The standard approach involves:

1. Sorting the unique values of the feature.
2. Generating candidate thresholds at the midpoints between consecutive distinct values.
3. Evaluating the weighted Gini impurity of the split induced by each candidate threshold.
4. Selecting the threshold that yields the lowest weighted impurity.
For example, if a feature has sorted values [2, 5, 7, 10], the candidate thresholds are 3.5, 6.0, and 8.5. The algorithm evaluates all three and selects the one that produces the purest child nodes.
Maintaining pre-sorted indices for continuous features is a common optimization. By sorting each feature once at the beginning, the algorithm avoids repeated sorting at every node, reducing the per-node computation from O(n log n) to O(n) for each feature.
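The midpoint rule from the list above is a one-liner in practice. This sketch reproduces the candidate thresholds for the example values:

```python
import numpy as np

# Candidate thresholds: midpoints between consecutive sorted unique values.
values = np.array([2, 5, 7, 10])
v = np.unique(values)
thresholds = (v[:-1] + v[1:]) / 2
print(thresholds)  # [3.5 6.  8.5]
```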
For a categorical feature with L levels, the CART algorithm searches for the best binary partition of the levels into two groups. In principle, there are 2^(L-1) - 1 possible binary partitions, which grows exponentially with the number of levels. For example, a feature with 10 categories has 511 possible partitions, and a feature with 20 categories has 524,287.
For binary classification problems, an efficient shortcut exists. If the categories are sorted by the proportion of the positive class within each category, the optimal binary split based on Gini impurity will always occur at one of the L - 1 boundaries in this sorted order. This reduces the search from exponential to linear in the number of categories. Breiman et al. (1984) proved this property in the original CART book.
For multiclass problems with more than two target classes, no such shortcut is available, and the search remains exponential. In practice, heuristics such as one-hot encoding, target encoding, or limiting the maximum number of categories are used to keep computation tractable.
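A minimal sketch of the binary-target shortcut is shown below: the levels are sorted by their positive-class rate, and only the L - 1 boundaries in that order are scanned. The function and variable names here are illustrative, not from any library:

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def best_categorical_split(cats, y):
    """Binary-target shortcut: scan only L - 1 boundaries in sorted order."""
    levels = np.unique(cats)
    pos_rate = np.array([y[cats == c].mean() for c in levels])
    order = levels[np.argsort(pos_rate)]  # levels sorted by P(positive)
    best_levels, best_g = None, np.inf
    for i in range(1, len(order)):        # only L - 1 candidate partitions
        mask = np.isin(cats, order[:i])
        nl, nr = mask.sum(), (~mask).sum()
        g = (nl * gini_from_labels(y[mask]) + nr * gini_from_labels(y[~mask])) / len(y)
        if g < best_g:
            best_levels, best_g = set(order[:i]), g
    return best_levels, best_g
```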
The CART algorithm, introduced by Breiman et al. in their landmark 1984 book Classification and Regression Trees, uses Gini impurity as its primary splitting criterion for classification tasks. CART builds binary trees by recursively partitioning the data, and at each node it searches exhaustively over all features and all possible binary split points to find the split that minimizes the weighted Gini impurity.
The CART tree-building process follows these steps:

1. Begin with all training samples at the root node.
2. For every feature, evaluate every possible binary split and compute its weighted Gini impurity.
3. Select the split with the lowest weighted Gini impurity and partition the node into two children.
4. Recurse on each child node.
5. Stop splitting when a node is pure, contains too few samples, or another stopping criterion (such as maximum depth) is reached.
After growing a full tree, CART applies a pruning procedure called cost-complexity pruning (also known as minimal cost-complexity pruning or weakest-link pruning). The full tree is progressively simplified by removing branches that contribute least to overall predictive accuracy on a validation set. This pruning step is important because the greedy splitting procedure tends to produce overly deep trees that overfit the training data. Breiman et al. (1984) showed that growing a large tree and then pruning it back yields better generalization than stopping the splitting process early.
The CART algorithm was influential not only because of its practical effectiveness but also because it established a rigorous framework for tree-based learning. Breiman et al. proved that the CART procedure, under mild conditions, is Bayes risk consistent, meaning that as the training set grows, the classifier's error converges to the best achievable error rate. Many modern ensemble methods, including random forest and gradient boosting, build upon the CART splitting procedure.
Gini impurity and entropy are the two most common impurity measures used in decision tree construction. While they serve the same purpose, they differ in their mathematical formulation, computational properties, and subtle behavioral characteristics.
Entropy for a node is defined as:
Entropy(S) = -Σ p_i log₂(p_i) for i = 1 to K
Information gain is the reduction in entropy after a split, and it is used by algorithms such as ID3 and C4.5 to select features.
| Property | Gini impurity | Entropy |
|---|---|---|
| Formula | 1 - Σ(p_i)² | -Σ p_i log₂(p_i) |
| Range (binary) | [0, 0.5] | [0, 1.0] |
| Range (K classes) | [0, 1 - 1/K] | [0, log₂(K)] |
| Computational cost | Lower (no logarithms) | Higher (requires logarithms) |
| Split behavior | Tends to isolate the dominant class | Tends to produce more balanced splits |
| Sensitivity to probability changes | Less sensitive to small changes | More sensitive to small probability differences |
| Used by | CART (scikit-learn default) | ID3, C4.5, C5.0 |
| Theoretical basis | Tsallis entropy (q = 2) | Shannon information theory |
The following table summarizes the major decision tree algorithms and their default impurity measures:
| Algorithm | Year | Author(s) | Default splitting criterion | Split type |
|---|---|---|---|---|
| CART | 1984 | Breiman, Friedman, Olshen, Stone | Gini impurity | Binary |
| ID3 | 1986 | Ross Quinlan | Information gain (entropy) | Multi-way |
| C4.5 | 1993 | Ross Quinlan | Gain ratio (normalized entropy) | Multi-way |
| C5.0 | 1997 | Ross Quinlan | Gain ratio (normalized entropy) | Multi-way |
| scikit-learn DecisionTreeClassifier | 2007+ | Pedregosa et al. | Gini impurity (default) | Binary |
In practice, the choice between Gini impurity and entropy rarely makes a significant difference in model performance. Research has shown that the two measures agree on the best split in over 98% of cases. The primary practical distinction is computational speed: because Gini impurity avoids logarithmic operations, it is slightly faster to compute, which can matter when building large ensembles of trees.
One notable behavioral difference is that Gini impurity tends to favor splits that isolate the largest class into its own pure node, whereas entropy-based splits tend to produce more evenly sized child nodes. For imbalanced datasets, this difference can become meaningful. Research by Kamperi (2021) tested both criteria on a dataset with 980 majority-class samples and 20 minority-class samples, finding that entropy produced substantially better recall (0.400 vs. 0.200) and AUC (0.883 vs. 0.600) for minority class detection. The reason is that Gini impurity penalizes small impurities less aggressively than entropy, which can cause the tree to stop splitting prematurely when minority classes are present.
From a theoretical perspective, Gini impurity is a special case of Tsallis entropy with deformation parameter q = 2. Tsallis entropy, a generalization of Shannon entropy used in statistical physics, is defined as:
S_q = (1 / (q - 1)) × (1 - Σ p_i^q)
When q = 2, this reduces to:
S_2 = 1 - Σ p_i²
which is exactly the Gini impurity formula. As q approaches 1, Tsallis entropy converges to Shannon entropy. This mathematical relationship explains why Gini impurity and entropy behave so similarly in practice: they are members of the same parametric family of impurity functions, differing only in the value of the parameter q. In physics, Tsallis entropy with q = 2 is associated with non-extensive, dissipative, and out-of-equilibrium systems.
This connection also means that Gini impurity, Shannon entropy, and gain ratio can all be viewed as special cases of a unified framework based on Tsallis entropy, as demonstrated by Maszczyk and Duch (2008).
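The reduction is easy to confirm numerically. The following sketch evaluates Tsallis entropy at q = 2 (recovering the Gini impurity) and at q very close to 1 (approaching Shannon entropy, here in nats, i.e., natural log rather than the log₂ used above):

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy with deformation parameter q."""
    return (1 - np.sum(p ** q)) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
print(tsallis(p, 2.0))         # 0.62: exactly the Gini impurity
print(tsallis(p, 1.000001))    # approximately 1.0297 nats
print(-np.sum(p * np.log(p)))  # Shannon entropy, approximately 1.0297 nats
```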
Beyond individual decision trees, Gini impurity plays a central role in several ensemble learning methods.
In random forest models, each tree in the ensemble is built using the CART procedure with Gini impurity as the splitting criterion (by default). Because random forests construct hundreds or thousands of trees, each trained on a bootstrap sample with a random subset of features, the computational efficiency of Gini impurity compared to entropy can provide meaningful speedups during training.
Gini impurity also serves a second important role in random forests: it provides a measure of feature importance known as Mean Decrease in Impurity (MDI), sometimes called "Gini importance." For each feature, the MDI is calculated as the total reduction in Gini impurity attributable to splits on that feature, averaged across all trees in the forest. More precisely, at each node where a feature is used for splitting, the decrease in impurity is computed as:
ΔGini = N_node/N_total × (Gini(node) - N_L/N_node × Gini(left) - N_R/N_node × Gini(right))
These decreases are summed across all nodes and all trees, then normalized so that the importance values for all features sum to one. Features that produce large impurity reductions at many nodes across many trees receive higher importance scores.
In scikit-learn, this importance measure is accessible via the feature_importances_ attribute of RandomForestClassifier and DecisionTreeClassifier objects.
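The MDI formula above can be recomputed by hand from a fitted tree. The sketch below walks scikit-learn's public tree_ arrays (tree_.feature, tree_.impurity, tree_.children_left, and so on are real attributes; the traversal function itself is illustrative) and checks the result against feature_importances_:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def manual_mdi(tree):
    """Recompute Gini-based MDI from a fitted tree's internal arrays."""
    t = tree.tree_
    importances = np.zeros(tree.n_features_in_)
    n = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        # Weighted impurity decrease at this node, per the formula above.
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right]) / n[0]
        importances[t.feature[node]] += decrease
    return importances / importances.sum()  # normalize to sum to one

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(np.allclose(manual_mdi(clf), clf.feature_importances_))  # True
```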
In gradient boosting frameworks, the relationship with Gini impurity is more indirect. Gradient boosted trees (including XGBoost, LightGBM, and CatBoost) optimize a differentiable loss function (such as log loss for classification) using gradient descent in function space. Each successive tree is fit to the negative gradient of the loss, not directly to the class labels. The actual splits within each tree are chosen to maximize the reduction in a second-order approximation of the loss function, not by minimizing Gini impurity.
However, Gini impurity remains relevant for computing feature importances in gradient boosting models. The "gain" or "weight" feature importance metrics in XGBoost and LightGBM are calculated using the impurity reduction at splits, analogous to the MDI measure in random forests.
While Gini importance is computationally cheap and readily available, it has known limitations:
| Limitation | Description |
|---|---|
| Bias toward high-cardinality features | Features with many unique values (e.g., unique IDs) receive artificially high importance because they offer more possible splits |
| Bias toward continuous features | Continuous features tend to receive higher importance than categorical features with few levels |
| Training set bias | Gini importance is computed on the training data and may not reflect a feature's predictive value on unseen data |
| Correlation blindness | Correlated features share importance, making it difficult to identify the truly important variable |
An alternative that addresses these limitations is permutation importance, which measures the decrease in model performance when a feature's values are randomly shuffled. Permutation importance can be computed on a held-out test set and does not suffer from the cardinality bias of Gini importance. Strobl et al. (2007) provided a detailed analysis of the bias sources in Gini-based importance measures and proposed conditional permutation importance as a remedy.
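Permutation importance is available directly in scikit-learn. A minimal sketch on a held-out split (dataset and parameters chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Permutation importance on a held-out test set, as an alternative to MDI.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean accuracy drop per shuffled feature
```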
Despite sharing the "Gini" name, Gini impurity and the Gini coefficient are distinct concepts from different fields. The naming similarity can cause confusion, so it is worth clarifying the differences.
| Aspect | Gini impurity | Gini coefficient |
|---|---|---|
| Domain | Machine learning, decision trees | Economics, income inequality |
| Purpose | Measures class disorder in a dataset | Measures wealth or income inequality in a population |
| Formula | 1 - Σ(p_i)² | Based on the area between the Lorenz curve and the line of equality |
| Range | [0, 1 - 1/K] | [0, 1] |
| Value of 0 means | Perfect purity (one class only) | Perfect equality (everyone has the same income) |
| Value near 1 means | High class disorder | Extreme inequality (one person holds all the wealth) |
| Named after | Corrado Gini (Italian statistician) | Corrado Gini (Italian statistician) |
Both metrics are named after Italian statistician Corrado Gini, who introduced the Gini coefficient in his 1912 paper on income variability. The Gini impurity used in machine learning is related to Gini's diversity index (also called the Gini-Simpson index in ecology), which measures the probability that two randomly selected individuals from a sample belong to different categories. While the mathematical forms share a family resemblance, the Gini coefficient in economics involves the Lorenz curve and cumulative distribution functions, making it a structurally different calculation.
While Gini impurity is effective and widely used, practitioners should be aware of several limitations:
Greedy splitting. At any given node, the decision tree evaluates only the local subset of data and selects the split that minimizes impurity at that node. This greedy approach does not guarantee finding the globally optimal tree structure. A locally suboptimal split might lead to better overall performance if it enables stronger splits downstream.
Sensitivity to noise. Noisy features or mislabeled samples can distort Gini impurity calculations, leading the tree to split on noise rather than genuine patterns. Regularization techniques such as setting minimum sample thresholds for leaf nodes or limiting tree depth help mitigate this.
Class imbalance. As noted in the comparison with entropy, Gini impurity can underperform on datasets with severe class imbalance because it penalizes small impurities less aggressively. Techniques such as class weighting, oversampling (SMOTE), or using entropy as the splitting criterion can improve performance in these scenarios (see the short sketch after this list).
Classification only. Gini impurity is defined for classification tasks and does not apply to regression. For regression trees, CART uses variance reduction (minimizing the mean squared error of the target variable within each child node) as the splitting criterion.
Tendency toward deep trees. Without constraints, impurity-driven splitting can produce overly deep and complex trees. Pruning, maximum depth limits, and minimum samples per leaf are standard countermeasures.
Feature interaction blindness. Gini impurity evaluates each feature independently at each node. It does not consider interactions between features directly. Complex interactions can only be captured through a series of sequential splits.
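As a brief illustration of the imbalance mitigations mentioned above, both class weighting and the entropy criterion are one-line options in scikit-learn (the variable names are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Reweight classes inversely to their frequency, or switch the
# splitting criterion to entropy; both are common imbalance mitigations.
dt_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=0)
dt_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)
```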
A basic implementation of Gini impurity in Python requires only a few lines of code:
```python
import numpy as np

def gini_impurity(labels):
    """Calculate the Gini impurity of a list of class labels."""
    if len(labels) == 0:
        return 0.0
    # Class counts -> probabilities -> 1 minus sum of squared probabilities.
    classes, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    return 1 - np.sum(probabilities ** 2)

def weighted_gini(left_labels, right_labels):
    """Calculate the weighted Gini impurity of a binary split."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    weight_left = len(left_labels) / n
    weight_right = len(right_labels) / n
    return weight_left * gini_impurity(left_labels) + weight_right * gini_impurity(right_labels)
```
```python
def best_split(X, y):
    """Find the best feature and threshold to split on."""
    best_gini = float('inf')
    best_feature = None
    best_threshold = None
    for feature_idx in range(X.shape[1]):  # loop over features
        thresholds = np.unique(X[:, feature_idx])
        for threshold in thresholds:
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask
            # Skip degenerate splits that leave one side empty.
            if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                continue
            gini = weighted_gini(y[left_mask], y[right_mask])
            if gini < best_gini:
                best_gini = gini
                best_feature = feature_idx
                best_threshold = threshold
    return best_feature, best_threshold, best_gini
```
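These helpers can be exercised on a real dataset. A small usage sketch on iris (the exact output depends on tie-breaking, but a split on petal length that isolates setosa typically wins first):

```python
from sklearn.datasets import load_iris

# Example use of best_split from the sketch above.
iris = load_iris()
feature, threshold, g = best_split(iris.data, iris.target)
print(feature, threshold, round(g, 4))  # e.g. 2 1.9 0.3333
```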
In scikit-learn, Gini impurity is the default criterion for DecisionTreeClassifier and RandomForestClassifier:
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Decision tree with Gini impurity (default)
dt_clf = DecisionTreeClassifier(criterion='gini', random_state=0)
iris = load_iris()
scores = cross_val_score(dt_clf, iris.data, iris.target, cv=10)
print(f"Mean accuracy: {scores.mean():.3f}")

# Random forest with Gini impurity (default)
rf_clf = RandomForestClassifier(criterion='gini', n_estimators=100, random_state=0)

# To use entropy instead:
dt_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)

# To use log loss (equivalent to entropy in scikit-learn):
dt_logloss = DecisionTreeClassifier(criterion='log_loss', random_state=0)
```
After fitting a model, feature importances based on Gini impurity (MDI) can be accessed directly:
```python
# X_train and y_train stand in for your own training data.
rf_clf.fit(X_train, y_train)
importances = rf_clf.feature_importances_  # Gini-based MDI
```
The following properties make Gini impurity a practical and effective splitting criterion:

- It is bounded and interpretable: 0 for a perfectly pure node, at most 1 - 1/K for K classes.
- It equals zero if and only if the node contains samples from a single class.
- It reaches its maximum at the uniform class distribution, where prediction is hardest.
- It is cheap to compute, requiring only squaring and summation with no logarithms.
- It is a strictly concave function of the class proportions, which guarantees that splitting never increases the weighted impurity.
- It has a direct probabilistic interpretation as the chance that two random draws from a node disagree in class.
Imagine you have a bag full of colored marbles. Some are red, some are blue, and some are green. If you close your eyes, reach into the bag, and pull out one marble, Gini impurity tells you how likely it is that you would guess the color wrong.
If the bag has only red marbles, you would always guess "red" and never be wrong. The Gini impurity is 0, which means the bag is perfectly "pure" because there is only one color.
If the bag has exactly half red and half blue marbles, you would guess wrong about half the time. The Gini impurity is 0.5, the highest it can be for two colors, because the bag is as mixed up as possible.
When computers build decision trees to sort things into groups, they look for questions that make the groups as "pure" as possible. Gini impurity helps the computer figure out which question to ask at each step. The computer picks the question that makes the resulting groups have the lowest Gini impurity, meaning the groups are the least mixed up.
Think of it like sorting a pile of toys. If someone asked you, "Are you a stuffed animal?" and all the stuffed animals went into one pile and everything else into another, those piles would be more "pure" than if you had asked, "Are you bigger than a shoebox?" which might leave a messy mix in each pile. Gini impurity measures how messy each pile is.