Information Gain (IG) is a measure from information theory that quantifies the reduction in entropy (uncertainty) achieved by partitioning a dataset based on a particular feature. It was introduced as the primary splitting criterion in the ID3 decision tree algorithm by J. Ross Quinlan in 1986, and it remains one of the most important concepts in supervised machine learning. Information gain is mathematically equivalent to mutual information between the feature and the target variable, making it a foundational tool in both tree-based learning and feature engineering for high-dimensional datasets.
At its core, information gain answers a simple question: "How much does knowing the value of this feature reduce our uncertainty about the class label?" Features that produce the largest reduction in uncertainty are considered the most informative and are selected first when building a decision tree or ranking features for a classification model.
Imagine you have a big jar of mixed-up candies and you want to sort them by flavor, but you cannot taste them. You can only ask yes-or-no questions about how they look. Some questions are really helpful. If you ask "Is it wrapped in red foil?", and all the strawberry candies happen to be in red foil, that one question instantly sorts out every strawberry candy. That question has high information gain because it removes a lot of confusion.
Other questions are not very helpful. If you ask "Is it bigger than my thumb?", and big candies come in every flavor, that question does not help you sort at all. That question has low information gain because you are just as confused after asking it.
Information gain is the number that tells a computer which question to ask first so it can sort things out as quickly as possible. The computer picks the question that clears up the most confusion, then picks the next best question, and keeps going until everything is sorted.
The intellectual foundations of information gain trace back to Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal. Shannon defined entropy as a measure of the average information content (or uncertainty) of a random variable, establishing the mathematical framework that would underpin all later work on information-based learning. Shannon's paper has been called the "Magna Carta of the Information Age" and is one of the most cited scientific papers in history.
The concept of entropy remained largely within the domain of communications engineering and statistics for several decades. Solomon Kullback and Richard Leibler had formalized their divergence measure (now called Kullback-Leibler divergence), which is closely related to information gain, as early as 1951. However, these information-theoretic tools were not applied to machine learning until the 1980s.
J. Ross Quinlan bridged this gap in his 1986 paper "Induction of Decision Trees," published in Machine Learning. Quinlan introduced the ID3 algorithm and demonstrated that entropy-based splitting produced compact, accurate decision trees on several benchmark problems. The paper drew explicitly on information theory, framing the attribute selection problem as one of maximizing the expected information content of each split.
Quinlan's subsequent book, C4.5: Programs for Machine Learning (1993), introduced the gain ratio refinement and addressed practical issues such as continuous features, missing values, and pruning. C4.5 became one of the most widely used classification algorithms of the 1990s. In a 2006 survey conducted by the IEEE International Conference on Data Mining (ICDM), C4.5 was ranked as the number one algorithm in data mining, and it was described as "probably the machine learning workhorse most widely used in practice to date."
Meanwhile, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone developed the CART algorithm in 1984, which used Gini impurity instead of entropy. The parallel development of these two approaches led to decades of comparative research, ultimately showing that the choice between entropy and Gini rarely matters in practice.
Information gain is built on entropy, Shannon's measure of uncertainty. For a dataset S with C classes, where p_i is the proportion of samples belonging to class i, entropy is defined as:
$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$
Entropy equals 0 when all samples belong to a single class (perfect purity) and reaches its maximum value of log_2(C) when classes are uniformly distributed (maximum uncertainty). For a binary classification problem, entropy ranges between 0 and 1 bit.
The choice of base-2 logarithm means that entropy is measured in bits (binary digits). Using the natural logarithm instead yields entropy in nats, and using base-10 yields hartleys. The base affects only the scale; the relative ordering of entropies remains the same.
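As a quick illustration, entropy can be computed in a few lines of Python. The helper below is a minimal sketch (the function name `entropy` is ours, not from any library):

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit: a fair coin
print(entropy([0.5, 0.5], base=math.e))  # ~0.693 nats: same uncertainty, different unit
print(entropy([1.0, 0.0]))               # -0.0 == 0.0: a pure node has no uncertainty
print(entropy([0.25] * 4))               # 2.0 bits: maximum for four equally likely classes
```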
Given a dataset S and a feature A that partitions S into subsets {S_v | v in Values(A)}, the information gain of splitting on A is:
$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
The first term, H(S), is the entropy of the dataset before the split. The second term is the weighted average entropy of the child subsets after the split, where each subset's entropy is weighted by the fraction of samples it contains. The difference represents how much uncertainty was eliminated by learning the value of feature A.
Information gain is always non-negative. It equals zero when the feature provides no information about the class, and it reaches its maximum when the feature perfectly separates all classes.
The weighted average entropy term in the formula is the conditional entropy of the target S given feature A:
$$H(S|A) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
Conditional entropy represents the expected uncertainty about the class label that remains after observing the feature value. Information gain can therefore be written compactly as:
$$IG(S, A) = H(S) - H(S|A)$$
This formulation makes the connection to mutual information immediately apparent: information gain is the difference between the prior entropy and the posterior (conditional) entropy.
The classic "Play Tennis" dataset, used in many machine learning textbooks (including Mitchell, 1997), illustrates how information gain is computed step by step. The dataset contains 14 days of observations, recording whether a tennis game was played based on weather conditions.
| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
Out of 14 days, 9 are "Yes" (play) and 5 are "No" (don't play).
H(S) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = -(0.643)(−0.637) - (0.357)(−1.485) = 0.410 + 0.530 = 0.940 bits
The feature "Outlook" has three values: Sunny (5 days), Overcast (4 days), and Rain (5 days).
| Outlook Value | Total | Yes | No | Entropy |
|---|---|---|---|---|
| Sunny | 5 | 2 | 3 | 0.971 |
| Overcast | 4 | 4 | 0 | 0.000 |
| Rain | 5 | 3 | 2 | 0.971 |
Weighted entropy = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) = 0.347 + 0.000 + 0.347 = 0.693 bits
IG(S, Outlook) = H(S) - Weighted Entropy = 0.940 - 0.693 = 0.247 bits
The same calculation is performed for every feature:
| Feature | Information Gain |
|---|---|
| Outlook | 0.247 |
| Humidity | 0.152 |
| Wind | 0.048 |
| Temperature | 0.029 |
Since Outlook has the highest information gain, the ID3 algorithm selects it as the root node of the decision tree. Notice that the Overcast branch is completely pure (all "Yes"), so it becomes a leaf node. The algorithm then recurses on the Sunny and Rain branches, computing information gain on the remaining features within each subset.
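The hand calculation above can be reproduced in a few lines of pure Python. The lists below simply transcribe the table; the helper names `entropy` and `information_gain` are illustrative, not from any library:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    parent = entropy(labels)
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(ys) / total * entropy(ys) for ys in groups.values())
    return parent - weighted

# The 14-day Play Tennis dataset from the table above.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
               "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
            "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

for name, values in [("Outlook", outlook), ("Humidity", humidity),
                     ("Wind", wind), ("Temperature", temperature)]:
    print(f"{name}: {information_gain(values, play):.3f}")
# Outlook: 0.247, Humidity: 0.152, Wind: 0.048, Temperature: 0.029
```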
Information gain serves as the splitting criterion in several influential decision tree algorithms.
The Iterative Dichotomiser 3 (ID3) algorithm, introduced by Quinlan in 1986, was the first decision tree algorithm to use information gain as its sole splitting criterion. At each internal node, ID3 computes the information gain of every available feature and selects the one with the highest value. The algorithm then creates one branch for each possible value of the selected feature and recurses on each resulting subset until all samples in a node belong to the same class or no more features remain.
The ID3 algorithm follows a top-down, greedy approach:

1. Compute the entropy of the training samples at the current node.
2. Compute the information gain of every remaining feature.
3. Split on the feature with the highest information gain, creating one branch per feature value.
4. Recurse on each branch until all samples in a node belong to the same class or no features remain.

ID3 handles only categorical (discrete) features and does not include pruning, which can lead to overfitting on noisy data. Its greedy strategy may also converge on locally optimal splits rather than a globally optimal tree; a compact Python sketch of the recursion appears below.
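A minimal, illustrative sketch of the ID3 recursion (not Quinlan's original implementation; the function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
    return entropy(labels) - weighted

def id3(rows, labels, features):
    """rows: list of dicts mapping feature name -> categorical value."""
    # Stop when the node is pure or no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    branches = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [f for f in features if f != best])
    return {best: branches}
```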
Quinlan's C4.5 algorithm (1993) is the successor to ID3 and addresses several of its shortcomings. While C4.5 still computes information gain, it uses a normalized version called the information gain ratio (discussed below) as its default splitting criterion. C4.5 also introduces several important capabilities: binary threshold splits for continuous features, handling of missing values, and error-based pruning of the grown tree.
The commercial successor C5.0 further improves on C4.5 with faster execution, lower memory usage, and boosting capabilities. It continues to rely on entropy-based splitting criteria.
| Algorithm | Year | Splitting Criterion | Handles Continuous Features | Pruning | Developer |
|---|---|---|---|---|---|
| CART | 1984 | Gini impurity | Yes | Yes (cost-complexity) | Breiman et al. |
| ID3 | 1986 | Information gain | No | No | J. R. Quinlan |
| C4.5 | 1993 | Gain ratio (default) | Yes | Yes (error-based) | J. R. Quinlan |
| C5.0 | 1997 | Gain ratio / boosted | Yes | Yes | J. R. Quinlan |
A well-known limitation of information gain is its bias toward features with many distinct values. A feature such as a customer ID or telephone number, which assigns a unique value to almost every record, will produce child nodes that are nearly pure simply because each node contains very few samples. As a result, information gain for such features can be very high even though splitting on them produces no useful generalization.
Consider a dataset of 1,000 records with a feature "Record ID" that takes 1,000 unique values. Splitting on Record ID creates 1,000 child nodes, each containing exactly one record with zero entropy. The information gain equals the full parent entropy, which is the maximum possible. Yet this split is useless for prediction on new data because it has memorized the training set.
This bias is a fundamental consequence of the formula: more branches mean more opportunities to reduce the weighted average entropy, regardless of whether the split captures meaningful patterns. The problem was identified early in the decision tree literature and was one of the primary motivations for Quinlan's development of the gain ratio.
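A quick back-of-the-envelope check of the Record ID scenario, written as a short Python sketch:

```python
import math
from collections import Counter

labels = ["Yes"] * 500 + ["No"] * 500   # 1,000 records, perfectly balanced classes
total = len(labels)
parent_entropy = -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# Splitting on a unique Record ID creates 1,000 single-record children, each pure
# (entropy 0), so the weighted child entropy is 0 and the information gain equals
# the full parent entropy -- the maximum possible -- despite being useless for prediction.
weighted_child_entropy = sum((1 / total) * 0.0 for _ in range(total))
print(parent_entropy)                           # 1.0 bit
print(parent_entropy - weighted_child_entropy)  # 1.0 bit of "gain" from pure memorization
```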
To address the high-cardinality bias, Quinlan introduced the information gain ratio (IGR) in the C4.5 algorithm. The gain ratio normalizes information gain by the intrinsic information (also called split information) of the feature:
$$GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$$
where the split information is:
$$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
Split information measures the entropy of the distribution of samples across the feature's branches. A feature with many equally sized branches has high split information, which drives down the gain ratio even if the raw information gain is large. This penalizes features that fragment the data into many small subsets.
For the "Play Tennis" example, the split information for Outlook is:
SplitInfo(S, Outlook) = -(5/14) log_2(5/14) - (4/14) log_2(4/14) - (5/14) log_2(5/14) = 1.577 bits
GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157
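The split information and gain ratio for Outlook can be verified with a short sketch (the Outlook values transcribe the table above; the variable names are illustrative):

```python
import math
from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

# Split information is the entropy of the branch sizes, not of the class labels.
total = len(outlook)
split_info = -sum((n / total) * math.log2(n / total) for n in Counter(outlook).values())

info_gain = 0.247  # IG(S, Outlook) computed earlier
print(f"SplitInfo = {split_info:.3f} bits")         # 1.577
print(f"GainRatio = {info_gain / split_info:.3f}")  # 0.157
```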
While gain ratio corrects the bias toward high-cardinality features, it can overcorrect in some cases. Features with very few values (such as a binary feature) have low split information, which can inflate the gain ratio even when the raw information gain is modest. To mitigate this, C4.5 uses a heuristic: it only considers features whose information gain is at least as high as the average information gain across all features, and then selects the one with the highest gain ratio among those candidates. This two-stage process combines the strengths of both measures, using information gain as a filter and gain ratio for final selection.
Research by Harris (2002) in "Information Gain Versus Gain Ratio: A Study of Split Method Biases" confirmed that neither measure is universally superior. Information gain tends to favor many-branched splits that can overfit, while gain ratio can over-penalize attributes with few distinct values but high information content.
Information gain is mathematically equivalent to mutual information between the feature variable A and the target variable Y. Mutual information is defined as:
$$I(A; Y) = H(Y) - H(Y|A)$$
This is identical to the information gain formula: IG(S, A) = H(S) - H(S|A), where H(S|A) is the conditional entropy of the target given the feature. Both expressions quantify how much uncertainty about Y is removed by observing A.
The equivalence means that any property of mutual information applies directly to information gain:
| Property | Description |
|---|---|
| Non-negativity | IG(S, A) >= 0 for all features A |
| Symmetry | I(A; Y) = I(Y; A); the gain is the same regardless of which variable is conditioned on |
| Zero iff independent | IG(S, A) = 0 if and only if A and Y are statistically independent |
| Bounded | I(A; Y) <= min(H(A), H(Y)) |
Mutual information can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of the marginals:
$$I(A; Y) = D_{KL}\left(P(A, Y) \,\|\, P(A)\,P(Y)\right)$$
This KL divergence formulation provides an intuitive interpretation: mutual information measures how much the joint distribution of A and Y differs from what would be expected if the two variables were independent. When A and Y are independent, the joint distribution equals the product of marginals, and the KL divergence (and therefore the information gain) is zero.
The term "mutual information" is more common in information theory and statistics, while "information gain" is the standard name in the decision tree literature. In feature selection contexts, the two terms are often used interchangeably.
The basic information gain formula considers one feature at a time. Several extensions address multi-feature interactions.
Conditional mutual information measures the mutual information between two variables X and Y given a third variable Z:
$$I(X; Y | Z) = H(X | Z) - H(X | Y, Z)$$
This quantity is useful in feature selection when one wishes to evaluate how much additional information a candidate feature provides beyond what is already known from previously selected features. Unlike standard information gain, conditional mutual information accounts for redundancy between features.
First studied by McGill (1954) under the name "interaction information" and independently by Hu Kuo Ting (1962), this measure generalizes mutual information to three or more variables. For three variables X, Y, and Z:
$$II(X; Y; Z) = I(X; Y | Z) - I(X; Y)$$
Unlike mutual information, interaction information can be negative. Negative interaction information indicates redundancy: X and Y share information about Z that overlaps. Positive interaction information indicates synergy: the variables together convey more information than the sum of their individual contributions. This distinction is important in feature selection because synergistic features may each appear uninformative individually but become highly predictive when combined.
Joint mutual information considers multiple features simultaneously:
$$I(X_1, X_2, ..., X_k; Y) = H(Y) - H(Y | X_1, X_2, ..., X_k)$$
This captures the total information a set of features provides about the target. However, computing joint mutual information exactly becomes impractical as the number of features grows, because estimating high-dimensional joint distributions requires exponentially more data. Approximation methods such as the minimum redundancy maximum relevance (mRMR) algorithm (Peng, Long, and Ding, 2005) decompose the joint objective into pairwise terms to make computation feasible.
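The sketch below illustrates the general idea behind mRMR-style greedy selection: each candidate is scored by its relevance to the target minus its average redundancy with already-selected features. It uses scikit-learn's mutual information estimators on the breast cancer dataset purely for illustration; it is not the reference mRMR implementation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_select = 5

# Relevance: I(X_j; Y) for every feature, estimated once up front.
relevance = mutual_info_classif(X, y, random_state=0)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(n_select):
    def score(j):
        if not selected:
            return relevance[j]
        # Redundancy: mean MI between candidate j and the features already chosen.
        redundancy = np.mean([
            mutual_info_regression(X[:, [k]], X[:, j], random_state=0)[0]
            for k in selected
        ])
        return relevance[j] - redundancy
    best = max(remaining, key=score)
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```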
Beyond decision trees, information gain (mutual information) is widely used as a filter method for feature selection. Filter methods evaluate features independently of any particular learning algorithm, ranking them by a statistical measure of relevance to the target variable.
For each feature X_j in the dataset, compute the mutual information I(X_j; Y) with respect to the target Y. Features are then ranked by their mutual information values, and the top-k features are selected. This approach has several advantages: it is fast to compute, it is independent of the downstream classifier, and it scales to very high-dimensional feature spaces.
Information gain has been one of the most effective feature selection methods for text classification. Yang and Pedersen (1997) conducted an influential comparative study of five feature selection methods for text categorization and found that information gain and chi-square were the most effective at aggressive dimensionality reduction without losing classification accuracy. Their experiments showed that these two measures were strongly correlated and that information gain could reduce the feature space by orders of magnitude (from tens of thousands of terms to a few hundred) while maintaining or even improving classifier performance.
Information gain is particularly valuable in text classification because the feature space (vocabulary) is typically very large. By ranking words according to their mutual information with the class label, practitioners can eliminate noise words that contribute little predictive power and retain only the most discriminative terms. This approach has been applied to spam filtering, sentiment analysis, topic labeling, and document routing.
In bioinformatics, information gain is used for gene selection in microarray data analysis. Gene expression datasets often contain thousands of genes (features) but only a small number of samples, creating a high-dimensional, low-sample-size problem. Information gain helps identify the genes most relevant to a disease phenotype or biological condition, enabling researchers to focus experimental efforts on the most informative genetic markers.
The popular Python library scikit-learn provides the mutual_info_classif function for classification tasks and mutual_info_regression for regression tasks. These functions estimate mutual information using a non-parametric method based on k-nearest neighbor distances (Kraskov, Stögbauer, and Grassberger, 2004).
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Compute mutual information (information gain) for each feature
mi_scores = mutual_info_classif(X, y)
for name, score in zip(iris.feature_names, mi_scores):
    print(f"{name}: {score:.4f}")

# Select the top 2 features using mutual information
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
```
Because mutual information evaluates each feature individually, it does not account for interactions between features. Two features that are individually uninformative may become highly informative when combined (synergistic features). Methods such as minimum redundancy maximum relevance (mRMR) and joint mutual information extend the basic filter approach by also considering feature-feature dependencies.
Additionally, estimating mutual information for continuous features requires discretization or density estimation, both of which introduce additional hyperparameters. The choice of binning strategy or the number of nearest neighbors in a KNN-based estimator can influence the estimated values.
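A quick way to see this sensitivity is to vary the `n_neighbors` parameter of scikit-learn's estimator and compare the resulting scores (illustrative only; the exact numbers depend on the estimator's random state):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# The same data can yield noticeably different estimates under different k-NN settings.
for k in (3, 5, 10):
    scores = mutual_info_classif(X, y, n_neighbors=k, random_state=0)
    print(k, [round(float(s), 3) for s in scores])
```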
Gini impurity is the other major splitting criterion used in decision tree algorithms. The CART algorithm (Breiman et al., 1984) uses Gini impurity, while the ID3/C4.5 family uses entropy-based information gain. Both metrics measure node impurity, but they differ in their mathematical formulation and practical behavior.
For a node with C classes and class proportions p_1, p_2, ..., p_C:
| Metric | Formula | Range (binary) |
|---|---|---|
| Entropy | H = -sum(p_i log_2 p_i) | 0 to 1 |
| Gini impurity | G = 1 - sum(p_i^2) = sum(p_i (1 - p_i)) | 0 to 0.5 |
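For a binary node with positive-class proportion p, the two measures can be compared directly with a small sketch (helper names are ours):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def binary_gini(p):
    return 2 * p * (1 - p)   # equivalent to 1 - p**2 - (1 - p)**2

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p = {p:.1f}   entropy = {binary_entropy(p):.3f}   gini = {binary_gini(p):.3f}")
# Both measures are 0 for pure nodes and peak at p = 0.5 (entropy 1.0 bit, Gini 0.5).
```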
| Aspect | Information Gain (Entropy) | Gini Impurity |
|---|---|---|
| Computation | Requires logarithms (slower) | Uses only multiplication and subtraction (faster) |
| Sensitivity to class imbalance | More sensitive; penalizes small impurities more heavily | Less sensitive; can underweight minority classes |
| Split type | Multi-way splits (one branch per feature value) | Binary splits only (CART produces binary trees) |
| Information-theoretic meaning | Measures average bits of information | Measures probability of misclassification |
| Used by | ID3, C4.5, C5.0 | CART, random forest (scikit-learn default) |
| Empirical tree structure | Very similar to Gini in most cases | Very similar to entropy in most cases |
In practice, entropy and Gini impurity produce nearly identical decision trees in the majority of cases. Research by Raileanu and Stoffel (2004) found that the two criteria disagree on the chosen split in only about 2% of cases. The choice can still matter at the margins: entropy penalizes small impurities more heavily, which can be helpful with imbalanced classes, while Gini avoids logarithm computations and is marginally cheaper on very large datasets.
Scikit-learn's DecisionTreeClassifier uses Gini impurity by default (criterion='gini'), but switching to entropy is a single parameter change (criterion='entropy').
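For example (a minimal sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# On most datasets the two criteria produce trees of very similar size and accuracy.
print(gini_tree.get_n_leaves(), entropy_tree.get_n_leaves())
```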
Beyond information gain and Gini impurity, several alternative splitting criteria have been proposed:
| Criterion | Description | Used by |
|---|---|---|
| Chi-square | Uses statistical significance testing to determine whether the split is meaningful | CHAID algorithm |
| Variance reduction | Measures the reduction in target variance for regression tasks | Regression trees |
| Twoing | Finds the best binary partition of classes at each node | CART (optional) |
| Distance measure | Based on the distance between class probability distributions | Various |
Modern tree-based ensemble methods build on the same splitting principles as individual decision trees, and information gain continues to play a role in these methods.
Random forests construct hundreds or thousands of decision trees, each trained on a bootstrap sample of the data. At each split, only a random subset of features is considered. The individual trees in a random forest can use either Gini impurity or entropy (information gain) as their splitting criterion. Feature importance in random forests is often computed as the total decrease in impurity across all splits in all trees, which is directly related to information gain when entropy is the criterion.
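A short sketch of an entropy-based random forest and its impurity-based importances in scikit-learn (the parameter choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                random_state=0).fit(iris.data, iris.target)

# Mean decrease in impurity, accumulated over every entropy-based split in every tree.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```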
Gradient boosting algorithms such as XGBoost, LightGBM, and CatBoost typically use custom loss-function-based splitting criteria rather than pure information gain. However, the concept remains closely related. XGBoost, for example, reports a "gain" metric for feature importance that measures the total improvement in the loss function brought by splits on each feature. This gain is conceptually analogous to information gain, though it is computed with respect to the specific loss function rather than entropy.
| Method | Default Criterion | Feature Importance Metric |
|---|---|---|
| Random forest | Gini impurity (scikit-learn) | Mean decrease in impurity |
| XGBoost | Custom loss (logloss, etc.) | Gain, cover, weight |
| LightGBM | Custom loss | Split, gain |
| CatBoost | Custom loss | PredictionValuesChange |
Beyond classification and feature selection, information gain appears as a guiding principle in Bayesian optimization and active learning.
In Bayesian experimental design, the goal is to choose which experiment to run next so as to learn as much as possible about a quantity of interest. The most natural formalization is to maximize the expected information gain (EIG) of the experiment's outcome with respect to the model parameters. Following Bayesian decision theory, one begins with a prior over model parameters and a likelihood model, then selects the experimental design that maximizes the expected reduction in entropy of the posterior distribution.
Entropy-based acquisition functions such as Entropy Search (Hennig and Schuler, 2012) and Max-Value Entropy Search (Wang and Jegelka, 2017) use this principle to guide sequential optimization. These methods estimate the entropy of the location of the global optimum and select the next evaluation point to maximize the mutual information between the observation and the optimum's location.
In active learning, a classifier selects which unlabeled data points to query for labels. Information-gain-based strategies select the data point whose label would produce the largest reduction in the model's overall uncertainty, connecting directly to the same entropy-reduction framework used in decision tree splitting.
When implementing information gain in practice, several details require attention.
Information gain is defined for categorical features that partition the data into discrete subsets. To handle continuous (numerical) features, the standard approach is to evaluate binary threshold splits: for each candidate threshold t, the data is split into {x | x_j <= t} and {x | x_j > t}, and the information gain of this binary partition is computed. The threshold producing the highest information gain is selected. C4.5 sorts the feature values and evaluates thresholds at the midpoints between consecutive distinct values.
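A sketch of this midpoint search for a single continuous feature, using the iris data for illustration (the helper names are ours):

```python
import math
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(x, y):
    """Scan midpoints between consecutive distinct values of a continuous feature."""
    parent = entropy(y)
    values = np.unique(x)
    best_t, best_gain = None, -1.0
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if parent - weighted > best_gain:
            best_t, best_gain = t, parent - weighted
    return best_t, best_gain

X, y = load_iris(return_X_y=True)
threshold, gain = best_threshold(X[:, 2], y)   # petal length (cm)
print(f"best threshold = {threshold:.2f}, information gain = {gain:.3f} bits")
```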
C4.5 handles missing values by distributing records with unknown feature values across all branches in proportion to the known-value distribution. The information gain calculation uses only the records with known values for the feature, and the gain is scaled by the fraction of records with known values.
For a dataset with N records, F features, and V_j possible values for feature j, computing information gain for all features at a single node requires O(N * F) time. For continuous features with sorting, the cost increases to O(N * F * log N). In large-scale applications, techniques such as histogram-based binning (used in LightGBM and XGBoost) reduce the cost by discretizing continuous features into a fixed number of bins before computing splits.
The following Python implementation mirrors the original pseudocode for computing the information gain of a categorical feature (the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution over a list of labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

def information_gain(feature_values, labels):
    """Information gain of splitting `labels` on the categorical `feature_values`."""
    parent_entropy = entropy(labels)
    total = len(labels)
    # Group the labels by feature value, then compute the weighted child entropy.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return parent_entropy - weighted_child_entropy
```
| Quantity | Formula | Description |
|---|---|---|
| Entropy | H(S) = -sum(p_i log_2 p_i) | Uncertainty of the class distribution |
| Conditional entropy | H(S\|A) = sum((\|S_v\|/\|S\|) H(S_v)) | Expected uncertainty remaining after observing feature A |
| Information gain | IG(S, A) = H(S) - H(S\|A) | Reduction in class uncertainty from splitting on A |
| Split information | SplitInfo(S, A) = -sum((\|S_v\|/\|S\|) log_2(\|S_v\|/\|S\|)) | Entropy of the branch-size distribution |
| Gain ratio | GainRatio(S, A) = IG(S, A) / SplitInfo(S, A) | Normalized information gain (C4.5) |
| Mutual information | I(X; Y) = H(X) - H(X\|Y) | Shared information between X and Y; equivalent to information gain |
| Conditional mutual information | I(X; Y\|Z) = H(X\|Z) - H(X\|Y, Z) | Information shared by X and Y once Z is known |
| Gini impurity | G(S) = 1 - sum(p_i^2) | Alternative impurity measure (CART) |