Information Gain (IG) is a measure from information theory that quantifies the reduction in entropy (uncertainty) achieved by partitioning a dataset based on a particular feature. It was introduced as the primary splitting criterion in the ID3 decision tree algorithm by J. Ross Quinlan in 1986, and it remains one of the most important concepts in supervised machine learning. Information gain is mathematically equivalent to mutual information between the feature and the target variable, making it a foundational tool in both tree-based learning and feature engineering for high-dimensional datasets.
At its core, information gain answers a simple question: "How much does knowing the value of this feature reduce our uncertainty about the class label?" Features that produce the largest reduction in uncertainty are considered the most informative and are selected first when building a decision tree or ranking features for a classification model.
Imagine you have a big jar of mixed-up candies and you want to sort them by flavor, but you cannot taste them. You can only ask yes-or-no questions about how they look. Some questions are really helpful. If you ask "Is it wrapped in red foil?", and all the strawberry candies happen to be in red foil, that one question instantly sorts out every strawberry candy. That question has high information gain because it removes a lot of confusion.
Other questions are not very helpful. If you ask "Is it bigger than my thumb?", and big candies come in every flavor, that question does not help you sort at all. That question has low information gain because you are just as confused after asking it.
Information gain is the number that tells a computer which question to ask first so it can sort things out as quickly as possible. The computer picks the question that clears up the most confusion, then picks the next best question, and keeps going until everything is sorted.
The intellectual foundations of information gain trace back to Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication," published in the Bell System Technical Journal. Shannon defined entropy as a measure of the average information content (or uncertainty) of a random variable, establishing the mathematical framework that would underpin all later work on information-based learning. Shannon's paper has been called the "Magna Carta of the Information Age" and is one of the most cited scientific papers in history.
The concept of entropy remained largely within the domain of communications engineering and statistics for several decades. Solomon Kullback and Richard Leibler had formalized their divergence measure (now called Kullback-Leibler divergence), which is closely related to information gain, as early as 1951. However, these information-theoretic tools were not applied to machine learning until the 1980s.
J. Ross Quinlan bridged this gap in his 1986 paper "Induction of Decision Trees," published in Machine Learning. Quinlan introduced the ID3 algorithm and demonstrated that entropy-based splitting produced compact, accurate decision trees on several benchmark problems. The paper drew explicitly on information theory, framing the attribute selection problem as one of maximizing the expected information content of each split.
Quinlan's subsequent book, C4.5: Programs for Machine Learning (1993), introduced the gain ratio refinement and addressed practical issues such as continuous features, missing values, and pruning. C4.5 became one of the most widely used classification algorithms of the 1990s. In a 2006 survey conducted by the IEEE International Conference on Data Mining (ICDM), C4.5 was ranked as the number one algorithm in data mining, and it was described as "probably the machine learning workhorse most widely used in practice to date."
Meanwhile, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone developed the CART algorithm in 1984, which used Gini impurity instead of entropy. The parallel development of these two approaches led to decades of comparative research, ultimately showing that the choice between entropy and Gini rarely matters in practice.
Information gain is built on entropy, Shannon's measure of uncertainty. For a dataset S with C classes, where p_i is the proportion of samples belonging to class i, entropy is defined as:
$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$$
Entropy equals 0 when all samples belong to a single class (perfect purity) and reaches its maximum value of log_2(C) when classes are uniformly distributed (maximum uncertainty). For a binary classification problem, entropy ranges between 0 and 1 bit.
The choice of base-2 logarithm means that entropy is measured in bits (binary digits). Using the natural logarithm instead yields entropy in nats, and using base-10 yields hartleys. The base affects only the scale; the relative ordering of entropies remains the same.
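As a quick illustration, entropy can be computed in a few lines of Python. The helper below is a minimal sketch (the function name `entropy` is ours, not from any library):

```python
import math

def entropy(probs, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit: a fair coin
print(entropy([0.5, 0.5], base=math.e))  # ~0.693 nats: same uncertainty, different unit
print(entropy([1.0, 0.0]))               # -0.0 == 0.0: a pure node has no uncertainty
print(entropy([0.25] * 4))               # 2.0 bits: maximum for four equally likely classes
```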
Given a dataset S and a feature A that partitions S into subsets {S_v | v in Values(A)}, the information gain of splitting on A is:
$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
The first term, H(S), is the entropy of the dataset before the split. The second term is the weighted average entropy of the child subsets after the split, where each subset's entropy is weighted by the fraction of samples it contains. The difference represents how much uncertainty was eliminated by learning the value of feature A.
Information gain is always non-negative. It equals zero when the feature provides no information about the class, and it reaches its maximum when the feature perfectly separates all classes.
The weighted average entropy term in the formula is the conditional entropy of the target S given feature A:
$$H(S|A) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$
Conditional entropy represents the expected uncertainty about the class label that remains after observing the feature value. Information gain can therefore be written compactly as:
$$IG(S, A) = H(S) - H(S|A)$$
This formulation makes the connection to mutual information immediately apparent: information gain is the difference between the prior entropy and the posterior (conditional) entropy.
The classic "Play Tennis" dataset, used in many machine learning textbooks (including Mitchell, 1997), illustrates how information gain is computed step by step. The dataset contains 14 days of observations, recording whether a tennis game was played based on weather conditions.
| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
Out of 14 days, 9 are "Yes" (play) and 5 are "No" (don't play).
H(S) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = -(0.643)(−0.637) - (0.357)(−1.485) = 0.410 + 0.530 = 0.940 bits
The feature "Outlook" has three values: Sunny (5 days), Overcast (4 days), and Rain (5 days).
| Outlook Value | Total | Yes | No | Entropy |
|---|---|---|---|---|
| Sunny | 5 | 2 | 3 | 0.971 |
| Overcast | 4 | 4 | 0 | 0.000 |
| Rain | 5 | 3 | 2 | 0.971 |
Weighted entropy = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) = 0.347 + 0.000 + 0.347 = 0.693 bits
IG(S, Outlook) = H(S) - Weighted Entropy = 0.940 - 0.693 = 0.247 bits
The same calculation is performed for every feature:
| Feature | Information Gain |
|---|---|
| Outlook | 0.247 |
| Humidity | 0.152 |
| Wind | 0.048 |
| Temperature | 0.029 |
Since Outlook has the highest information gain, the ID3 algorithm selects it as the root node of the decision tree. Notice that the Overcast branch is completely pure (all "Yes"), so it becomes a leaf node. The algorithm then recurses on the Sunny and Rain branches, computing information gain on the remaining features within each subset.
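The hand calculation above can be reproduced in a few lines of pure Python. The lists below simply transcribe the table; the helper names `entropy` and `information_gain` are illustrative, not from any library:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    parent = entropy(labels)
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(ys) / total * entropy(ys) for ys in groups.values())
    return parent - weighted

# The 14-day Play Tennis dataset from the table above.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
               "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"]
humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
            "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

for name, values in [("Outlook", outlook), ("Humidity", humidity),
                     ("Wind", wind), ("Temperature", temperature)]:
    print(f"{name}: {information_gain(values, play):.3f}")
# Outlook: 0.247, Humidity: 0.152, Wind: 0.048, Temperature: 0.029
```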
Information gain serves as the splitting criterion in several influential decision tree algorithms.
The Iterative Dichotomiser 3 (ID3) algorithm, introduced by Quinlan in 1986, was the first decision tree algorithm to use information gain as its sole splitting criterion. At each internal node, ID3 computes the information gain of every available feature and selects the one with the highest value. The algorithm then creates one branch for each possible value of the selected feature and recurses on each resulting subset until all samples in a node belong to the same class or no more features remain.
The ID3 algorithm follows a top-down, greedy approach:

1. Compute the entropy of the training samples at the current node.
2. Compute the information gain of every remaining feature.
3. Split on the feature with the highest information gain, creating one branch per feature value.
4. Recurse on each branch until all samples in a node belong to the same class or no features remain.

ID3 handles only categorical (discrete) features and does not include pruning, which can lead to overfitting on noisy data. Its greedy strategy may also converge on locally optimal splits rather than a globally optimal tree; a compact Python sketch of the recursion appears below.
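A minimal, illustrative sketch of the ID3 recursion (not Quinlan's original implementation; the function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
    return entropy(labels) - weighted

def id3(rows, labels, features):
    """rows: list of dicts mapping feature name -> categorical value."""
    # Stop when the node is pure or no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    branches = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [f for f in features if f != best])
    return {best: branches}
```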
Quinlan's C4.5 algorithm (1993) is the successor to ID3 and addresses several of its shortcomings. While C4.5 still computes information gain, it uses a normalized version called the information gain ratio (discussed below) as its default splitting criterion. C4.5 also introduces several important capabilities: binary threshold splits for continuous features, handling of missing values, and error-based pruning of the grown tree.
The commercial successor C5.0 further improves on C4.5 with faster execution, lower memory usage, and boosting capabilities. It continues to rely on entropy-based splitting criteria.
| Algorithm | Year | Splitting Criterion | Handles Continuous Features | Pruning | Developer |
|---|---|---|---|---|---|
| CART | 1984 | Gini impurity | Yes | Yes (cost-complexity) | Breiman et al. |
| ID3 | 1986 | Information gain | No | No | J. R. Quinlan |
| C4.5 | 1993 | Gain ratio (default) | Yes | Yes (error-based) | J. R. Quinlan |
| C5.0 | 1997 | Gain ratio / boosted | Yes | Yes | J. R. Quinlan |
A well-known limitation of information gain is its bias toward features with many distinct values. A feature such as a customer ID or telephone number, which assigns a unique value to almost every record, will produce child nodes that are nearly pure simply because each node contains very few samples. As a result, information gain for such features can be very high even though splitting on them produces no useful generalization.
Consider a dataset of 1,000 records with a feature "Record ID" that takes 1,000 unique values. Splitting on Record ID creates 1,000 child nodes, each containing exactly one record with zero entropy. The information gain equals the full parent entropy, which is the maximum possible. Yet this split is useless for prediction on new data because it has memorized the training set.
This bias is a fundamental consequence of the formula: more branches mean more opportunities to reduce the weighted average entropy, regardless of whether the split captures meaningful patterns. The problem was identified early in the decision tree literature and was one of the primary motivations for Quinlan's development of the gain ratio.
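A quick back-of-the-envelope check of the Record ID scenario, written as a short Python sketch:

```python
import math
from collections import Counter

labels = ["Yes"] * 500 + ["No"] * 500   # 1,000 records, perfectly balanced classes
total = len(labels)
parent_entropy = -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

# Splitting on a unique Record ID creates 1,000 single-record children, each pure
# (entropy 0), so the weighted child entropy is 0 and the information gain equals
# the full parent entropy -- the maximum possible -- despite being useless for prediction.
weighted_child_entropy = sum((1 / total) * 0.0 for _ in range(total))
print(parent_entropy)                           # 1.0 bit
print(parent_entropy - weighted_child_entropy)  # 1.0 bit of "gain" from pure memorization
```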
To address the high-cardinality bias, Quinlan introduced the information gain ratio (IGR) in the C4.5 algorithm. The gain ratio normalizes information gain by the intrinsic information (also called split information) of the feature:
$$GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}$$
where the split information is:
$$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
Split information measures the entropy of the distribution of samples across the feature's branches. A feature with many equally sized branches has high split information, which drives down the gain ratio even if the raw information gain is large. This penalizes features that fragment the data into many small subsets.
For the "Play Tennis" example, the split information for Outlook is:
SplitInfo(S, Outlook) = -(5/14) log_2(5/14) - (4/14) log_2(4/14) - (5/14) log_2(5/14) = 1.577 bits
GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157
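The split information and gain ratio for Outlook can be verified with a short sketch (the Outlook values transcribe the table above; the variable names are illustrative):

```python
import math
from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

# Split information is the entropy of the branch sizes, not of the class labels.
total = len(outlook)
split_info = -sum((n / total) * math.log2(n / total) for n in Counter(outlook).values())

info_gain = 0.247  # IG(S, Outlook) computed earlier
print(f"SplitInfo = {split_info:.3f} bits")         # 1.577
print(f"GainRatio = {info_gain / split_info:.3f}")  # 0.157
```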
While gain ratio corrects the bias toward high-cardinality features, it can overcorrect in some cases. Features with very few values (such as a binary feature) have low split information, which can inflate the gain ratio even when the raw information gain is modest. To mitigate this, C4.5 uses a heuristic: it only considers features whose information gain is at least as high as the average information gain across all features, and then selects the one with the highest gain ratio among those candidates. This two-stage process combines the strengths of both measures, using information gain as a filter and gain ratio for final selection.
Research by Harris (2002) in "Information Gain Versus Gain Ratio: A Study of Split Method Biases" confirmed that neither measure is universally superior. Information gain tends to favor many-branched splits that can overfit, while gain ratio can over-penalize attributes with few distinct values but high information content.
Information gain is mathematically equivalent to mutual information between the feature variable A and the target variable Y. Mutual information is defined as:
$$I(A; Y) = H(Y) - H(Y|A)$$
This is identical to the information gain formula: IG(S, A) = H(S) - H(S|A), where H(S|A) is the conditional entropy of the target given the feature. Both expressions quantify how much uncertainty about Y is removed by observing A.
The equivalence means that any property of mutual information applies directly to information gain:
| Property | Description |
|---|---|
| Non-negativity | IG(S, A) >= 0 for all features A |
| Symmetry | I(A; Y) = I(Y; A); the gain is the same regardless of which variable is conditioned on |
| Zero iff independent | IG(S, A) = 0 if and only if A and Y are statistically independent |
| Bounded | I(A; Y) <= min(H(A), H(Y)) |
Mutual information can also be expressed as the Kullback-Leibler divergence between the joint distribution and the product of the marginals:
$$I(A; Y) = D_{KL}\left(P(A, Y) \,\|\, P(A)\,P(Y)\right)$$
This KL divergence formulation provides an intuitive interpretation: mutual information measures how much the joint distribution of A and Y differs from what would be expected if the two variables were independent. When A and Y are independent, the joint distribution equals the product of marginals, and the KL divergence (and therefore the information gain) is zero.
The term "mutual information" is more common in information theory and statistics, while "information gain" is the standard name in the decision tree literature. In feature selection contexts, the two terms are often used interchangeably.
The basic information gain formula considers one feature at a time. Several extensions address multi-feature interactions.
Conditional mutual information measures the mutual information between two variables X and Y given a third variable Z:
$$I(X; Y | Z) = H(X | Z) - H(X | Y, Z)$$
This quantity is useful in feature selection when one wishes to evaluate how much additional information a candidate feature provides beyond what is already known from previously selected features. Unlike standard information gain, conditional mutual information accounts for redundancy between features.
First studied by McGill (1954) under the name "interaction information" and independently by Hu Kuo Ting (1962), this measure generalizes mutual information to three or more variables. For three variables X, Y, and Z:
$$II(X; Y; Z) = I(X; Y | Z) - I(X; Y)$$
Unlike mutual information, interaction information can be negative. Negative interaction information indicates redundancy: X and Y share information about Z that overlaps. Positive interaction information indicates synergy: the variables together convey more information than the sum of their individual contributions. This distinction is important in feature selection because synergistic features may each appear uninformative individually but become highly predictive when combined.
Joint mutual information considers multiple features simultaneously:
$$I(X_1, X_2, ..., X_k; Y) = H(Y) - H(Y | X_1, X_2, ..., X_k)$$
This captures the total information a set of features provides about the target. However, computing joint mutual information exactly becomes impractical as the number of features grows, because estimating high-dimensional joint distributions requires exponentially more data. Approximation methods such as the minimum redundancy maximum relevance (mRMR) algorithm (Peng, Long, and Ding, 2005) decompose the joint objective into pairwise terms to make computation feasible.
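The sketch below illustrates the general idea behind mRMR-style greedy selection: each candidate is scored by its relevance to the target minus its average redundancy with already-selected features. It uses scikit-learn's mutual information estimators on the breast cancer dataset purely for illustration; it is not the reference mRMR implementation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_breast_cancer(return_X_y=True)
n_select = 5

# Relevance: I(X_j; Y) for every feature, estimated once up front.
relevance = mutual_info_classif(X, y, random_state=0)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(n_select):
    def score(j):
        if not selected:
            return relevance[j]
        # Redundancy: mean MI between candidate j and the features already chosen.
        redundancy = np.mean([
            mutual_info_regression(X[:, [k]], X[:, j], random_state=0)[0]
            for k in selected
        ])
        return relevance[j] - redundancy
    best = max(remaining, key=score)
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```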
Beyond decision trees, information gain (mutual information) is widely used as a filter method for feature selection. Filter methods evaluate features independently of any particular learning algorithm, ranking them by a statistical measure of relevance to the target variable.
For each feature X_j in the dataset, compute the mutual information I(X_j; Y) with respect to the target Y. Features are then ranked by their mutual information values, and the top-k features are selected. This approach has several advantages: it is fast to compute, it is independent of the downstream classifier, and it scales to very high-dimensional feature spaces.
Information gain has been one of the most effective feature selection methods for text classification. Yang and Pedersen (1997) conducted an influential comparative study of five feature selection methods for text categorization and found that information gain and chi-square were the most effective at aggressive dimensionality reduction without losing classification accuracy. Their experiments showed that these two measures were strongly correlated and that information gain could reduce the feature space by orders of magnitude (from tens of thousands of terms to a few hundred) while maintaining or even improving classifier performance.
Information gain is particularly valuable in text classification because the feature space (vocabulary) is typically very large. By ranking words according to their mutual information with the class label, practitioners can eliminate noise words that contribute little predictive power and retain only the most discriminative terms. This approach has been applied to spam filtering, sentiment analysis, topic labeling, and document routing.
In bioinformatics, information gain is used for gene selection in microarray data analysis. Gene expression datasets often contain thousands of genes (features) but only a small number of samples, creating a high-dimensional, low-sample-size problem. Information gain helps identify the genes most relevant to a disease phenotype or biological condition, enabling researchers to focus experimental efforts on the most informative genetic markers.
The popular Python library scikit-learn provides the mutual_info_classif function for classification tasks and mutual_info_regression for regression tasks. These functions estimate mutual information using a non-parametric method based on k-nearest neighbor distances (Kraskov, Stögbauer, and Grassberger, 2004).
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Compute mutual information (information gain) for each feature
mi_scores = mutual_info_classif(X, y)
for name, score in zip(iris.feature_names, mi_scores):
    print(f"{name}: {score:.4f}")

# Select the top 2 features using mutual information
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
```
Because mutual information evaluates each feature individually, it does not account for interactions between features. Two features that are individually uninformative may become highly informative when combined (synergistic features). Methods such as minimum redundancy maximum relevance (mRMR) and joint mutual information extend the basic filter approach by also considering feature-feature dependencies.
Additionally, estimating mutual information for continuous features requires discretization or density estimation, both of which introduce additional hyperparameters. The choice of binning strategy or the number of nearest neighbors in a KNN-based estimator can influence the estimated values.
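A quick way to see this sensitivity is to vary the `n_neighbors` parameter of scikit-learn's estimator and compare the resulting scores (illustrative only; the exact numbers depend on the estimator's random state):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)

# The same data can yield noticeably different estimates under different k-NN settings.
for k in (3, 5, 10):
    scores = mutual_info_classif(X, y, n_neighbors=k, random_state=0)
    print(k, [round(float(s), 3) for s in scores])
```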
Gini impurity is the other major splitting criterion used in decision tree algorithms. The CART algorithm (Breiman et al., 1984) uses Gini impurity, while the ID3/C4.5 family uses entropy-based information gain. Both metrics measure node impurity, but they differ in their mathematical formulation and practical behavior.
For a node with C classes and class proportions p_1, p_2, ..., p_C:
| Metric | Formula | Range (binary) |
|---|---|---|
| Entropy | H = -sum(p_i log_2 p_i) | 0 to 1 |
| Gini impurity | G = 1 - sum(p_i^2) = sum(p_i (1 - p_i)) | 0 to 0.5 |
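For a binary node with positive-class proportion p, the two measures can be compared directly with a small sketch (helper names are ours):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def binary_gini(p):
    return 2 * p * (1 - p)   # equivalent to 1 - p**2 - (1 - p)**2

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p = {p:.1f}   entropy = {binary_entropy(p):.3f}   gini = {binary_gini(p):.3f}")
# Both measures are 0 for pure nodes and peak at p = 0.5 (entropy 1.0 bit, Gini 0.5).
```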
| Aspect | Information Gain (Entropy) | Gini Impurity |
|---|---|---|
| Computation | Requires logarithms (slower) | Uses only multiplication and subtraction (faster) |
| Sensitivity to class imbalance | More sensitive; penalizes small impurities more heavily | Less sensitive; can underweight minority classes |
| Split type | Multi-way splits (one branch per feature value) | Binary splits only (CART produces binary trees) |
| Information-theoretic meaning | Measures average bits of information | Measures probability of misclassification |
| Used by | ID3, C4.5, C5.0 | CART, random forest (scikit-learn default) |
| Empirical tree structure | Very similar to Gini in most cases | Very similar to entropy in most cases |
In practice, entropy and Gini impurity produce nearly identical decision trees in the majority of cases. Research by Raileanu and Stoffel (2004) found that the two criteria disagree on the chosen split in only about 2% of cases. The choice can still matter at the margins: entropy penalizes small impurities more heavily, which can be helpful with imbalanced classes, while Gini avoids logarithm computations and is marginally cheaper on very large datasets.
Scikit-learn's DecisionTreeClassifier uses Gini impurity by default (criterion='gini'), but switching to entropy is a single parameter change (criterion='entropy').
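For example (a minimal sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# On most datasets the two criteria produce trees of very similar size and accuracy.
print(gini_tree.get_n_leaves(), entropy_tree.get_n_leaves())
```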
Beyond information gain and Gini impurity, several alternative splitting criteria have been proposed:
| Criterion | Description | Used by |
|---|---|---|
| Chi-square | Uses statistical significance testing to determine whether the split is meaningful | CHAID algorithm |
| Variance reduction | Measures the reduction in target variance for regression tasks | Regression trees |
| Twoing | Finds the best binary partition of classes at each node | CART (optional) |
| Distance measure | Based on the distance between class probability distributions | Various |
Modern tree-based ensemble methods build on the same splitting principles as individual decision trees, and information gain continues to play a role in these methods.
Random forests construct hundreds or thousands of decision trees, each trained on a bootstrap sample of the data. At each split, only a random subset of features is considered. The individual trees in a random forest can use either Gini impurity or entropy (information gain) as their splitting criterion. Feature importance in random forests is often computed as the total decrease in impurity across all splits in all trees, which is directly related to information gain when entropy is the criterion.
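A short sketch of an entropy-based random forest and its impurity-based importances in scikit-learn (the parameter choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                random_state=0).fit(iris.data, iris.target)

# Mean decrease in impurity, accumulated over every entropy-based split in every tree.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```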
Gradient boosting algorithms such as XGBoost, LightGBM, and CatBoost typically use custom loss-function-based splitting criteria rather than pure information gain. However, the concept remains closely related. XGBoost, for example, reports a "gain" metric for feature importance that measures the total improvement in the loss function brought by splits on each feature. This gain is conceptually analogous to information gain, though it is computed with respect to the specific loss function rather than entropy.
| Method | Default Criterion | Feature Importance Metric |
|---|---|---|
| Random forest | Gini impurity (scikit-learn) | Mean decrease in impurity |
| XGBoost | Custom loss (logloss, etc.) | Gain, cover, weight |
| LightGBM | Custom loss | Split, gain |
| CatBoost | Custom loss | PredictionValuesChange |
Beyond classification and feature selection, information gain appears as a guiding principle in Bayesian optimization and active learning.
In Bayesian experimental design, the goal is to choose which experiment to run next so as to learn as much as possible about a quantity of interest. The most natural formalization is to maximize the expected information gain (EIG) of the experiment's outcome with respect to the model parameters. Following Bayesian decision theory, one begins with a prior over model parameters and a likelihood model, then selects the experimental design that maximizes the expected reduction in entropy of the posterior distribution.
Entropy-based acquisition functions such as Entropy Search (Hennig and Schuler, 2012) and Max-Value Entropy Search (Wang and Jegelka, 2017) use this principle to guide sequential optimization. These methods estimate the entropy of the location of the global optimum and select the next evaluation point to maximize the mutual information between the observation and the optimum's location.
In active learning, a classifier selects which unlabeled data points to query for labels. Information-gain-based strategies select the data point whose label would produce the largest reduction in the model's overall uncertainty, connecting directly to the same entropy-reduction framework used in decision tree splitting.
When implementing information gain in practice, several details require attention.
Information gain is defined for categorical features that partition the data into discrete subsets. To handle continuous (numerical) features, the standard approach is to evaluate binary threshold splits: for each candidate threshold t, the data is split into {x | x_j <= t} and {x | x_j > t}, and the information gain of this binary partition is computed. The threshold producing the highest information gain is selected. C4.5 sorts the feature values and evaluates thresholds at the midpoints between consecutive distinct values.
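A sketch of this midpoint search for a single continuous feature, using the iris data for illustration (the helper names are ours):

```python
import math
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(x, y):
    """Scan midpoints between consecutive distinct values of a continuous feature."""
    parent = entropy(y)
    values = np.unique(x)
    best_t, best_gain = None, -1.0
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if parent - weighted > best_gain:
            best_t, best_gain = t, parent - weighted
    return best_t, best_gain

X, y = load_iris(return_X_y=True)
threshold, gain = best_threshold(X[:, 2], y)   # petal length (cm)
print(f"best threshold = {threshold:.2f}, information gain = {gain:.3f} bits")
```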
C4.5 handles missing values by distributing records with unknown feature values across all branches in proportion to the known-value distribution. The information gain calculation uses only the records with known values for the feature, and the gain is scaled by the fraction of records with known values.
For a dataset with N records, F features, and V_j possible values for feature j, computing information gain for all features at a single node requires O(N * F) time. For continuous features with sorting, the cost increases to O(N * F * log N). In large-scale applications, techniques such as histogram-based binning (used in LightGBM and XGBoost) reduce the cost by discretizing continuous features into a fixed number of bins before computing splits.
The following Python implementation mirrors the original pseudocode for computing the information gain of a categorical feature (the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution over a list of labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

def information_gain(feature_values, labels):
    """Information gain of splitting `labels` on the categorical `feature_values`."""
    parent_entropy = entropy(labels)
    total = len(labels)
    # Group the labels by feature value, then compute the weighted child entropy.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return parent_entropy - weighted_child_entropy
```
| Quantity | Formula | Description |
|---|---|---|
| Entropy | H(S) = -sum(p_i log_2 p_i) | Uncertainty of the class distribution |
| Conditional entropy | H(S\|A) = sum((\|S_v\|/\|S\|) H(S_v)) | Expected uncertainty remaining after observing feature A |
| Information gain | IG(S, A) = H(S) - H(S\|A) | Reduction in class uncertainty from splitting on A |
| Split information | SplitInfo(S, A) = -sum((\|S_v\|/\|S\|) log_2(\|S_v\|/\|S\|)) | Entropy of the branch-size distribution |
| Gain ratio | GainRatio(S, A) = IG(S, A) / SplitInfo(S, A) | Normalized information gain (C4.5) |
| Mutual information | I(X; Y) = H(X) - H(X\|Y) | Shared information between X and Y; equivalent to information gain |
| Conditional mutual information | I(X; Y\|Z) = H(X\|Z) - H(X\|Y, Z) | Information shared by X and Y once Z is known |
| Gini impurity | G(S) = 1 - sum(p_i^2) | Alternative impurity measure (CART) |