Gini impurity is a measure of how often a randomly chosen element from a dataset would be incorrectly classified if it were labeled randomly according to the distribution of classes in that dataset. It is one of the most widely used splitting criteria in decision tree algorithms, serving as the default impurity measure in the CART (Classification and Regression Trees) algorithm introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in 1984. The metric quantifies the degree of "impurity" or disorder within a node of a decision tree, guiding the algorithm toward splits that produce the purest possible child nodes.
Gini impurity is valued in practice for its computational simplicity. Unlike entropy-based measures, it does not require logarithmic calculations, making it faster to evaluate during tree construction. Despite this simplicity, it produces decision trees that are nearly identical in quality to those built using information gain, which is why it remains the default splitting criterion in popular machine learning libraries such as scikit-learn.
The measure goes by several names across different fields. In ecology and biodiversity research, the same formula is known as the Gini-Simpson index or Simpson's diversity index. In sociology it appears as the Gibbs-Martin index or Blau index. In machine learning, the terms "Gini impurity," "Gini index," and "Gini's diversity index" are used interchangeably.
The concept behind Gini impurity originates from the work of Italian statistician Corrado Gini (1884-1965). In 1912, Gini published his paper "Variabilità e mutabilità" ("Variability and Mutability") while serving as the head of statistics at the University of Cagliari in Sardinia, Italy. In that work he introduced the simple mean difference as a measure of variability, which he later applied to income and wealth inequality. The Gini coefficient used in economics to measure income inequality descends from this 1912 paper.
The specific application of Gini's diversity measure to decision tree construction came in 1984, when Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published Classification and Regression Trees. This book introduced the CART algorithm, which adopted Gini impurity as its primary splitting criterion for classification tasks. The CART framework established a rigorous, theoretically grounded approach to tree-based learning that remains foundational today.
Around the same time, Ross Quinlan independently developed the ID3 algorithm (1986) and later C4.5 (1993), both of which use Shannon entropy and information gain as their splitting criteria. The parallel development of CART with Gini impurity and ID3/C4.5 with entropy created two traditions in decision tree construction that continue to coexist in modern software.
Separately, ecologist Edward H. Simpson introduced the same mathematical formula in 1949 to measure species concentration in ecological communities. Simpson's diversity index, defined as 1 minus the sum of squared species proportions, is mathematically identical to the Gini impurity formula used in machine learning.
Given a dataset (or node) containing samples from K classes, let p_i denote the proportion of samples belonging to class i. The Gini impurity is defined as:
Gini(S) = 1 - Σ(p_i)² for i = 1 to K
An equivalent formulation expresses the same quantity as:
Gini(S) = Σ p_i(1 - p_i) for i = 1 to K
Both formulas yield the same result. The first form emphasizes that Gini impurity equals one minus the sum of squared class probabilities. The second form makes the probabilistic interpretation more transparent: for each class i, the term p_i(1 - p_i) represents the probability of selecting an element of class i (probability p_i) and then misclassifying it by assigning a different label (probability 1 - p_i).
The equivalence between the two forms can be shown algebraically:
Σ p_i(1 - p_i) = Σ p_i - Σ p_i² = 1 - Σ p_i²
The last step follows because the class probabilities sum to one: Σ p_i = 1.
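As a quick numerical check of this equivalence, the following sketch evaluates both formulations on an arbitrary class distribution (the probabilities are illustrative, not from any particular dataset):

```python
import numpy as np

# Check that the two equivalent formulations agree numerically.
p = np.array([0.1, 0.4, 0.5])
print(1 - np.sum(p ** 2))   # 1 - sum of squared proportions -> 0.58
print(np.sum(p * (1 - p)))  # sum of p_i * (1 - p_i)         -> 0.58
```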
The Gini impurity is always non-negative and has well-defined minimum and maximum values:
| Condition | Gini impurity value | Interpretation |
|---|---|---|
| All samples belong to one class | 0 | Perfectly pure node; no misclassification possible |
| Binary classification, equal split (50/50) | 0.5 | Maximum impurity for two classes |
| 3 classes, uniform distribution | 0.667 | Maximum impurity for three classes |
| 10 classes, uniform distribution | 0.9 | Maximum impurity for ten classes |
| K classes, uniform distribution (each class has probability 1/K) | 1 - 1/K | Maximum impurity for K classes |
For a binary classification problem, the Gini impurity ranges from 0 to 0.5. For a problem with K classes, the maximum value is 1 - 1/K, which approaches 1 as the number of classes grows large. The maximum is achieved when all classes are equally represented, meaning p_i = 1/K for every class.
To see why the maximum Gini impurity equals 1 - 1/K, substitute p_i = 1/K for all classes into the formula:
Gini_max = 1 - Σ(1/K)² = 1 - K × (1/K²) = 1 - 1/K
For binary classification (K = 2), this gives 1 - 1/2 = 0.5. For three classes, it gives 1 - 1/3 ≈ 0.667. For ten classes, it gives 1 - 1/10 = 0.9.
The Gini impurity has an intuitive probabilistic interpretation. Imagine drawing two samples independently and at random from a node, with replacement. The Gini impurity equals the probability that the two samples belong to different classes. Equivalently, if you draw one sample and assign it a label drawn at random from the node's class distribution, the Gini impurity is the probability that the assigned label does not match the sample's true class.
This interpretation explains why a pure node (all samples from one class) has Gini impurity of 0: any two random draws will always agree. Conversely, a maximally impure node has the highest chance that two random draws will disagree.
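The two-draw interpretation is easy to verify by simulation. The following sketch uses an arbitrary three-class distribution and confirms that the empirical disagreement rate matches the analytic Gini value:

```python
import numpy as np

# Monte Carlo check of the two-draw interpretation.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])           # node's class proportions
print(1 - np.sum(p ** 2))               # analytic Gini: 0.62

a = rng.choice(3, size=1_000_000, p=p)  # first draw
b = rng.choice(3, size=1_000_000, p=p)  # second draw, independent
print(np.mean(a != b))                  # empirical disagreement rate, approximately 0.62
```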
Consider a dataset of 10 animals that need to be classified as "Cat," "Dog," or "Bird." Suppose the dataset contains 5 cats, 3 dogs, and 2 birds.
Step 1: Calculate class proportions.
| Class | Count | Proportion (p_i) |
|---|---|---|
| Cat | 5 | 5/10 = 0.50 |
| Dog | 3 | 3/10 = 0.30 |
| Bird | 2 | 2/10 = 0.20 |
Step 2: Square each proportion and sum them.
Σ(p_i)² = 0.50² + 0.30² + 0.20² = 0.25 + 0.09 + 0.04 = 0.38
Step 3: Compute Gini impurity.
Gini = 1 - 0.38 = 0.62
This value of 0.62 indicates moderate impurity. The maximum possible Gini impurity for three classes is 1 - 1/3 ≈ 0.667, so this dataset is fairly close to maximum disorder.
Consider a dataset of 80 objects across 3 classes: 19 objects in class A, 21 in class B, and 40 in class C.
| Class | Count | Proportion (p_i) | (p_i)² |
|---|---|---|---|
| A | 19 | 19/80 = 0.2375 | 0.0564 |
| B | 21 | 21/80 = 0.2625 | 0.0689 |
| C | 40 | 40/80 = 0.5000 | 0.2500 |
Gini = 1 - (0.0564 + 0.0689 + 0.2500) = 1 - 0.3753 = 0.6247
This is lower than the maximum of 0.667 for three classes because class C is overrepresented, pulling the distribution away from uniform.
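Both worked examples can be reproduced in a few lines from the class counts alone (a small sketch; the helper name is illustrative):

```python
import numpy as np

def gini_from_counts(counts):
    """Gini impurity from raw class counts."""
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(gini_from_counts([5, 3, 2]))     # 0.62     (cats, dogs, birds)
print(gini_from_counts([19, 21, 40]))  # 0.624688 (classes A, B, C)
```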
When a decision tree evaluates a candidate split, it does not simply compute the Gini impurity of the child nodes independently. Instead, it calculates a weighted Gini impurity that accounts for the number of samples routed to each child.
If a parent node with N samples is split into a left child with N_L samples and a right child with N_R samples, the weighted Gini impurity of the split is:
Gini_split = (N_L / N) × Gini(left) + (N_R / N) × Gini(right)
The algorithm evaluates this weighted impurity for every possible split on every feature and selects the split that produces the lowest weighted Gini impurity. Equivalently, the algorithm selects the split that maximizes the Gini gain, defined as:
Gini Gain = Gini(parent) - Gini_split
A larger Gini gain means the split does a better job of separating classes.
Continuing the animal classification example, suppose we consider splitting the dataset on the feature "Has Feathers" (Yes or No).
Left child (Has Feathers = Yes): 0 cats, 0 dogs, 2 birds (2 samples)
Gini(left) = 1 - (0² + 0² + 1.0²) = 1 - 1.0 = 0.0 (pure node)
Right child (Has Feathers = No): 5 cats, 3 dogs, 0 birds (8 samples)
Gini(right) = 1 - (0.625² + 0.375² + 0²) = 1 - (0.3906 + 0.1406) = 1 - 0.5312 = 0.4688
Weighted Gini impurity:
Gini_split = (2/10) × 0.0 + (8/10) × 0.4688 = 0 + 0.3750 = 0.3750
Gini gain:
Gini Gain = 0.62 - 0.375 = 0.245
The algorithm would compare this Gini gain against the gains from all other candidate splits and choose the one with the highest gain (or equivalently, the lowest weighted Gini impurity).
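The arithmetic of this split evaluation fits in a short script (a sketch reproducing the numbers above):

```python
# Verifying the "Has Feathers" split by hand.
parent = 1 - (0.5**2 + 0.3**2 + 0.2**2)       # 0.62
left = 0.0                                     # pure node: birds only
right = 1 - (0.625**2 + 0.375**2)              # 0.46875
weighted = (2 / 10) * left + (8 / 10) * right  # 0.375
print(round(parent - weighted, 3))             # Gini gain: 0.245
```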
The procedure for evaluating candidate splits using Gini impurity differs depending on whether a feature is continuous or categorical.
For a continuous feature (such as age, temperature, or income), the algorithm must determine a threshold value t so that samples with feature values less than or equal to t go to the left child and samples with values greater than t go to the right child. The standard approach involves:

1. Sorting the unique values of the feature.
2. Generating candidate thresholds at the midpoints between consecutive distinct values.
3. Evaluating the weighted Gini impurity of the split induced by each candidate threshold.
4. Selecting the threshold that yields the lowest weighted impurity.
For example, if a feature has sorted values [2, 5, 7, 10], the candidate thresholds are 3.5, 6.0, and 8.5. The algorithm evaluates all three and selects the one that produces the purest child nodes.
Maintaining pre-sorted indices for continuous features is a common optimization. By sorting each feature once at the beginning, the algorithm avoids repeated sorting at every node, reducing the per-node computation from O(n log n) to O(n) for each feature.
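The midpoint rule from the list above is a one-liner in practice. This sketch reproduces the candidate thresholds for the example values:

```python
import numpy as np

# Candidate thresholds: midpoints between consecutive sorted unique values.
values = np.array([2, 5, 7, 10])
v = np.unique(values)
thresholds = (v[:-1] + v[1:]) / 2
print(thresholds)  # [3.5 6.  8.5]
```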
For a categorical feature with L levels, the CART algorithm searches for the best binary partition of the levels into two groups. In principle, there are 2^(L-1) - 1 possible binary partitions, which grows exponentially with the number of levels. For example, a feature with 10 categories has 511 possible partitions, and a feature with 20 categories has 524,287.
For binary classification problems, an efficient shortcut exists. If the categories are sorted by the proportion of the positive class within each category, the optimal binary split based on Gini impurity will always occur at one of the L - 1 boundaries in this sorted order. This reduces the search from exponential to linear in the number of categories. Breiman et al. (1984) proved this property in the original CART book.
For multiclass problems with more than two target classes, no such shortcut is available, and the search remains exponential. In practice, heuristics such as one-hot encoding, target encoding, or limiting the maximum number of categories are used to keep computation tractable.
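A minimal sketch of the binary-target shortcut is shown below: the levels are sorted by their positive-class rate, and only the L - 1 boundaries in that order are scanned. The function and variable names here are illustrative, not from any library:

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1 - np.sum(p ** 2)

def best_categorical_split(cats, y):
    """Binary-target shortcut: scan only L - 1 boundaries in sorted order."""
    levels = np.unique(cats)
    pos_rate = np.array([y[cats == c].mean() for c in levels])
    order = levels[np.argsort(pos_rate)]  # levels sorted by P(positive)
    best_levels, best_g = None, np.inf
    for i in range(1, len(order)):        # only L - 1 candidate partitions
        mask = np.isin(cats, order[:i])
        nl, nr = mask.sum(), (~mask).sum()
        g = (nl * gini_from_labels(y[mask]) + nr * gini_from_labels(y[~mask])) / len(y)
        if g < best_g:
            best_levels, best_g = set(order[:i]), g
    return best_levels, best_g
```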
The CART algorithm, introduced by Breiman et al. in their landmark 1984 book Classification and Regression Trees, uses Gini impurity as its primary splitting criterion for classification tasks. CART builds binary trees by recursively partitioning the data, and at each node it searches exhaustively over all features and all possible binary split points to find the split that minimizes the weighted Gini impurity.
The CART tree-building process follows these steps:

1. Begin with all training samples at the root node.
2. For every feature, evaluate every possible binary split and compute its weighted Gini impurity.
3. Select the split with the lowest weighted Gini impurity and partition the node into two children.
4. Recurse on each child node.
5. Stop splitting when a node is pure, contains too few samples, or another stopping criterion (such as maximum depth) is reached.
After growing a full tree, CART applies a pruning procedure called cost-complexity pruning (also known as minimal cost-complexity pruning or weakest-link pruning). The full tree is progressively simplified by removing branches that contribute least to overall predictive accuracy on a validation set. This pruning step is important because the greedy splitting procedure tends to produce overly deep trees that overfit the training data. Breiman et al. (1984) showed that growing a large tree and then pruning it back yields better generalization than stopping the splitting process early.
The CART algorithm was influential not only because of its practical effectiveness but also because it established a rigorous framework for tree-based learning. Breiman et al. proved that the CART procedure, under mild conditions, is Bayes risk consistent, meaning that as the training set grows, the classifier's error converges to the best achievable error rate. Many modern ensemble methods, including random forest and gradient boosting, build upon the CART splitting procedure.
Gini impurity and entropy are the two most common impurity measures used in decision tree construction. While they serve the same purpose, they differ in their mathematical formulation, computational properties, and subtle behavioral characteristics.
Entropy for a node is defined as:
Entropy(S) = -Σ p_i log₂(p_i) for i = 1 to K
Information gain is the reduction in entropy after a split, and it is used by algorithms such as ID3 and C4.5 to select features.
| Property | Gini impurity | Entropy |
|---|---|---|
| Formula | 1 - Σ(p_i)² | -Σ p_i log₂(p_i) |
| Range (binary) | [0, 0.5] | [0, 1.0] |
| Range (K classes) | [0, 1 - 1/K] | [0, log₂(K)] |
| Computational cost | Lower (no logarithms) | Higher (requires logarithms) |
| Split behavior | Tends to isolate the dominant class | Tends to produce more balanced splits |
| Sensitivity to probability changes | Less sensitive to small changes | More sensitive to small probability differences |
| Used by | CART (scikit-learn default) | ID3, C4.5, C5.0 |
| Theoretical basis | Tsallis entropy (q = 2) | Shannon information theory |
The following table summarizes the major decision tree algorithms and their default impurity measures:
| Algorithm | Year | Author(s) | Default splitting criterion | Split type |
|---|---|---|---|---|
| CART | 1984 | Breiman, Friedman, Olshen, Stone | Gini impurity | Binary |
| ID3 | 1986 | Ross Quinlan | Information gain (entropy) | Multi-way |
| C4.5 | 1993 | Ross Quinlan | Gain ratio (normalized entropy) | Multi-way |
| C5.0 | 1997 | Ross Quinlan | Gain ratio (normalized entropy) | Multi-way |
| scikit-learn DecisionTreeClassifier | 2007+ | Pedregosa et al. | Gini impurity (default) | Binary |
In practice, the choice between Gini impurity and entropy rarely makes a significant difference in model performance. Research has shown that the two measures agree on the best split in over 98% of cases. The primary practical distinction is computational speed: because Gini impurity avoids logarithmic operations, it is slightly faster to compute, which can matter when building large ensembles of trees.
One notable behavioral difference is that Gini impurity tends to favor splits that isolate the largest class into its own pure node, whereas entropy-based splits tend to produce more evenly sized child nodes. For imbalanced datasets, this difference can become meaningful. Research by Kamperi (2021) tested both criteria on a dataset with 980 majority-class samples and 20 minority-class samples, finding that entropy produced substantially better recall (0.400 vs. 0.200) and AUC (0.883 vs. 0.600) for minority class detection. The reason is that Gini impurity penalizes small impurities less aggressively than entropy, which can cause the tree to stop splitting prematurely when minority classes are present.
From a theoretical perspective, Gini impurity is a special case of Tsallis entropy with deformation parameter q = 2. Tsallis entropy, a generalization of Shannon entropy used in statistical physics, is defined as:
S_q = (1 / (q - 1)) × (1 - Σ p_i^q)
When q = 2, this reduces to:
S_2 = 1 - Σ p_i²
which is exactly the Gini impurity formula. As q approaches 1, Tsallis entropy converges to Shannon entropy. This mathematical relationship explains why Gini impurity and entropy behave so similarly in practice: they are members of the same parametric family of impurity functions, differing only in the value of the parameter q. In physics, Tsallis entropy with q = 2 is associated with non-extensive, dissipative, and out-of-equilibrium systems.
This connection also means that Gini impurity, Shannon entropy, and gain ratio can all be viewed as special cases of a unified framework based on Tsallis entropy, as demonstrated by Maszczyk and Duch (2008).
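The reduction is easy to confirm numerically. The following sketch evaluates Tsallis entropy at q = 2 (recovering the Gini impurity) and at q very close to 1 (approaching Shannon entropy, here in nats, i.e., natural log rather than the log₂ used above):

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy with deformation parameter q."""
    return (1 - np.sum(p ** q)) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
print(tsallis(p, 2.0))         # 0.62: exactly the Gini impurity
print(tsallis(p, 1.000001))    # approximately 1.0297 nats
print(-np.sum(p * np.log(p)))  # Shannon entropy, approximately 1.0297 nats
```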
Beyond individual decision trees, Gini impurity plays a central role in several ensemble learning methods.
In random forest models, each tree in the ensemble is built using the CART procedure with Gini impurity as the splitting criterion (by default). Because random forests construct hundreds or thousands of trees, each trained on a bootstrap sample with a random subset of features, the computational efficiency of Gini impurity compared to entropy can provide meaningful speedups during training.
Gini impurity also serves a second important role in random forests: it provides a measure of feature importance known as Mean Decrease in Impurity (MDI), sometimes called "Gini importance." For each feature, the MDI is calculated as the total reduction in Gini impurity attributable to splits on that feature, averaged across all trees in the forest. More precisely, at each node where a feature is used for splitting, the decrease in impurity is computed as:
ΔGini = N_node/N_total × (Gini(node) - N_L/N_node × Gini(left) - N_R/N_node × Gini(right))
These decreases are summed across all nodes and all trees, then normalized so that the importance values for all features sum to one. Features that produce large impurity reductions at many nodes across many trees receive higher importance scores.
In scikit-learn, this importance measure is accessible via the feature_importances_ attribute of RandomForestClassifier and DecisionTreeClassifier objects.
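The MDI formula above can be recomputed by hand from a fitted tree. The sketch below walks scikit-learn's public tree_ arrays (tree_.feature, tree_.impurity, tree_.children_left, and so on are real attributes; the traversal function itself is illustrative) and checks the result against feature_importances_:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def manual_mdi(tree):
    """Recompute Gini-based MDI from a fitted tree's internal arrays."""
    t = tree.tree_
    importances = np.zeros(tree.n_features_in_)
    n = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no contribution
            continue
        # Weighted impurity decrease at this node, per the formula above.
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right]) / n[0]
        importances[t.feature[node]] += decrease
    return importances / importances.sum()  # normalize to sum to one

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print(np.allclose(manual_mdi(clf), clf.feature_importances_))  # True
```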
In gradient boosting frameworks, the relationship with Gini impurity is more indirect. Gradient boosted trees (including XGBoost, LightGBM, and CatBoost) optimize a differentiable loss function (such as log loss for classification) using gradient descent in function space. Each successive tree is fit to the negative gradient of the loss, not directly to the class labels. The actual splits within each tree are chosen to maximize the reduction in a second-order approximation of the loss function, not by minimizing Gini impurity.
However, Gini impurity remains relevant for computing feature importances in gradient boosting models. The "gain" or "weight" feature importance metrics in XGBoost and LightGBM are calculated using the impurity reduction at splits, analogous to the MDI measure in random forests.
While Gini importance is computationally cheap and readily available, it has known limitations:
| Limitation | Description |
|---|---|
| Bias toward high-cardinality features | Features with many unique values (e.g., unique IDs) receive artificially high importance because they offer more possible splits |
| Bias toward continuous features | Continuous features tend to receive higher importance than categorical features with few levels |
| Training set bias | Gini importance is computed on the training data and may not reflect a feature's predictive value on unseen data |
| Correlation blindness | Correlated features share importance, making it difficult to identify the truly important variable |
An alternative that addresses these limitations is permutation importance, which measures the decrease in model performance when a feature's values are randomly shuffled. Permutation importance can be computed on a held-out test set and does not suffer from the cardinality bias of Gini importance. Strobl et al. (2007) provided a detailed analysis of the bias sources in Gini-based importance measures and proposed conditional permutation importance as a remedy.
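Permutation importance is available directly in scikit-learn. A minimal sketch on a held-out split (dataset and parameters chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Permutation importance on a held-out test set, as an alternative to MDI.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean accuracy drop per shuffled feature
```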
Despite sharing the "Gini" name, Gini impurity and the Gini coefficient are distinct concepts from different fields. The naming similarity can cause confusion, so it is worth clarifying the differences.
| Aspect | Gini impurity | Gini coefficient |
|---|---|---|
| Domain | Machine learning, decision trees | Economics, income inequality |
| Purpose | Measures class disorder in a dataset | Measures wealth or income inequality in a population |
| Formula | 1 - Σ(p_i)² | Based on the area between the Lorenz curve and the line of equality |
| Range | [0, 1 - 1/K] | [0, 1] |
| Value of 0 means | Perfect purity (one class only) | Perfect equality (everyone has the same income) |
| Value near 1 means | High class disorder | Extreme inequality (one person holds all the wealth) |
| Named after | Corrado Gini (Italian statistician) | Corrado Gini (Italian statistician) |
Both metrics are named after Italian statistician Corrado Gini, who introduced the Gini coefficient in his 1912 paper on income variability. The Gini impurity used in machine learning is related to Gini's diversity index (also called the Gini-Simpson index in ecology), which measures the probability that two randomly selected individuals from a sample belong to different categories. While the mathematical forms share a family resemblance, the Gini coefficient in economics involves the Lorenz curve and cumulative distribution functions, making it a structurally different calculation.
While Gini impurity is effective and widely used, practitioners should be aware of several limitations:
Greedy splitting. At any given node, the decision tree evaluates only the local subset of data and selects the split that minimizes impurity at that node. This greedy approach does not guarantee finding the globally optimal tree structure. A locally suboptimal split might lead to better overall performance if it enables stronger splits downstream.
Sensitivity to noise. Noisy features or mislabeled samples can distort Gini impurity calculations, leading the tree to split on noise rather than genuine patterns. Regularization techniques such as setting minimum sample thresholds for leaf nodes or limiting tree depth help mitigate this.
Class imbalance. As noted in the comparison with entropy, Gini impurity can underperform on datasets with severe class imbalance because it penalizes small impurities less aggressively. Techniques such as class weighting, oversampling (SMOTE), or using entropy as the splitting criterion can improve performance in these scenarios (see the short sketch after this list).
Classification only. Gini impurity is defined for classification tasks and does not apply to regression. For regression trees, CART uses variance reduction (minimizing the mean squared error of the target variable within each child node) as the splitting criterion.
Tendency toward deep trees. Without constraints, impurity-driven splitting can produce overly deep and complex trees. Pruning, maximum depth limits, and minimum samples per leaf are standard countermeasures.
Feature interaction blindness. Gini impurity evaluates each feature independently at each node. It does not consider interactions between features directly. Complex interactions can only be captured through a series of sequential splits.
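As a brief illustration of the imbalance mitigations mentioned above, both class weighting and the entropy criterion are one-line options in scikit-learn (the variable names are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Reweight classes inversely to their frequency, or switch the
# splitting criterion to entropy; both are common imbalance mitigations.
dt_weighted = DecisionTreeClassifier(class_weight='balanced', random_state=0)
dt_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)
```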
A basic implementation of Gini impurity in Python requires only a few lines of code:
```python
import numpy as np

def gini_impurity(labels):
    """Calculate the Gini impurity of a list of class labels."""
    if len(labels) == 0:
        return 0.0
    # Class counts -> probabilities -> 1 minus sum of squared probabilities.
    classes, counts = np.unique(labels, return_counts=True)
    probabilities = counts / len(labels)
    return 1 - np.sum(probabilities ** 2)

def weighted_gini(left_labels, right_labels):
    """Calculate the weighted Gini impurity of a binary split."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    weight_left = len(left_labels) / n
    weight_right = len(right_labels) / n
    return weight_left * gini_impurity(left_labels) + weight_right * gini_impurity(right_labels)
```
```python
def best_split(X, y):
    """Find the best feature and threshold to split on."""
    best_gini = float('inf')
    best_feature = None
    best_threshold = None
    for feature_idx in range(X.shape[1]):  # loop over features
        thresholds = np.unique(X[:, feature_idx])
        for threshold in thresholds:
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask
            # Skip degenerate splits that leave one side empty.
            if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                continue
            gini = weighted_gini(y[left_mask], y[right_mask])
            if gini < best_gini:
                best_gini = gini
                best_feature = feature_idx
                best_threshold = threshold
    return best_feature, best_threshold, best_gini
```
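These helpers can be exercised on a real dataset. A small usage sketch on iris (the exact output depends on tie-breaking, but a split on petal length that isolates setosa typically wins first):

```python
from sklearn.datasets import load_iris

# Example use of best_split from the sketch above.
iris = load_iris()
feature, threshold, g = best_split(iris.data, iris.target)
print(feature, threshold, round(g, 4))  # e.g. 2 1.9 0.3333
```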
In scikit-learn, Gini impurity is the default criterion for DecisionTreeClassifier and RandomForestClassifier:
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Decision tree with Gini impurity (default)
dt_clf = DecisionTreeClassifier(criterion='gini', random_state=0)
iris = load_iris()
scores = cross_val_score(dt_clf, iris.data, iris.target, cv=10)
print(f"Mean accuracy: {scores.mean():.3f}")

# Random forest with Gini impurity (default)
rf_clf = RandomForestClassifier(criterion='gini', n_estimators=100, random_state=0)

# To use entropy instead:
dt_entropy = DecisionTreeClassifier(criterion='entropy', random_state=0)

# To use log loss (equivalent to entropy in scikit-learn):
dt_logloss = DecisionTreeClassifier(criterion='log_loss', random_state=0)
```
After fitting a model, feature importances based on Gini impurity (MDI) can be accessed directly:
```python
# X_train and y_train stand in for your own training data.
rf_clf.fit(X_train, y_train)
importances = rf_clf.feature_importances_  # Gini-based MDI
```
The following properties make Gini impurity a practical and effective splitting criterion:

- It is bounded and interpretable: 0 for a perfectly pure node, at most 1 - 1/K for K classes.
- It equals zero if and only if the node contains samples from a single class.
- It reaches its maximum at the uniform class distribution, where prediction is hardest.
- It is cheap to compute, requiring only squaring and summation with no logarithms.
- It is a strictly concave function of the class proportions, which guarantees that splitting never increases the weighted impurity.
- It has a direct probabilistic interpretation as the chance that two random draws from a node disagree in class.
Imagine you have a bag full of colored marbles. Some are red, some are blue, and some are green. If you close your eyes, reach into the bag, and pull out one marble, Gini impurity tells you how likely it is that you would guess the color wrong.
If the bag has only red marbles, you would always guess "red" and never be wrong. The Gini impurity is 0, which means the bag is perfectly "pure" because there is only one color.
If the bag has exactly half red and half blue marbles, you would guess wrong about half the time. The Gini impurity is 0.5, the highest it can be for two colors, because the bag is as mixed up as possible.
When computers build decision trees to sort things into groups, they look for questions that make the groups as "pure" as possible. Gini impurity helps the computer figure out which question to ask at each step. The computer picks the question that makes the resulting groups have the lowest Gini impurity, meaning the groups are the least mixed up.
Think of it like sorting a pile of toys. If someone asked you, "Are you a stuffed animal?" and all the stuffed animals went into one pile and everything else into another, those piles would be more "pure" than if you had asked, "Are you bigger than a shoebox?" which might leave a messy mix in each pile. Gini impurity measures how messy each pile is.