Split

Overview

In machine learning, the word split shows up in two distinct contexts. The first sense describes dividing a dataset into non-overlapping subsets, usually a training set, a validation set, and a test set, so that a model can be fit, tuned, and evaluated on different slices of data. The second sense refers to the decision boundary chosen at each internal node of a decision tree, where the algorithm picks a feature and a threshold that partitions the samples into child nodes.

Both meanings share an underlying idea of partitioning. Dataset splits partition rows of data across roles in the training pipeline. Decision tree splits partition the feature space into regions that are more homogeneous than the parent. The mechanics, however, are completely different.

This article covers both senses in detail.

Splitting a dataset

When a model is evaluated on the same data it trained on, the resulting accuracy says little about generalization, because a high score could simply reflect memorization. To get an honest read on how a model will behave on new inputs, the data has to be split into parts that play different roles.

The three classic subsets

The Wikipedia entry on training, validation, and test sets defines the three roles cleanly. The training set is the portion used to fit model parameters, such as the weights of a neural network or the coefficients of a linear regression. The validation set is held back during training and used to tune hyperparameters, choose architectures, and decide when to stop training. The test set is locked away until the final model is chosen, then used exactly once to estimate how the model will perform on unseen inputs.

The separation is what gives the test number any credibility. If the test set is consulted during model selection, even indirectly, it stops being a fair sample of unseen data and starts behaving like a second validation set.

Typical ratios

There is no universally correct ratio. The right choice depends on dataset size, model complexity, and how many hyperparameter trials the validation set has to absorb. Common defaults from practitioner guides include the following.

Ratio (train/val/test)	When it tends to be used
80 / 10 / 10	Default for medium-sized tabular datasets
70 / 15 / 15	Popular when the validation set has to support heavy hyperparameter tuning
60 / 20 / 20	Smaller datasets where evaluation needs more statistical power
98 / 1 / 1	Very large datasets where even 1% is enough samples to evaluate reliably

For deep learning on millions of examples, a 1% test set still contains enough rows for tight confidence intervals. For a few hundred rows, 20% may leave the test set too small to distinguish real performance from noise, which is part of why cross-validation is often preferred for small data.

Random split

The simplest split assigns rows to training, validation, and test sets uniformly at random. In scikit-learn, the train_test_split function produces a two-way split in one line, so a three-way split takes two calls. It accepts a test_size parameter (0.25 by default when neither test_size nor train_size is given), a random_state for reproducibility, and a shuffle flag that is on by default.
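A minimal sketch of a random 60 / 20 / 20 split, assuming synthetic data and calling train_test_split twice because each call only divides its input in two. The array shapes, the seed, and the variable names are illustrative choices, not part of any prescribed recipe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 1000 rows, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First call: hold out 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second call: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

Fixing random_state makes the assignment of rows reproducible between runs, which matters when results from different experiments are compared.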

A random split assumes that the rows are independent and identically distributed. When that assumption holds, metrics computed on the test set are unbiased estimates of generalization performance. When it fails, which happens often in practice, the random split silently produces optimistic numbers.

Stratified split

For classification problems with imbalanced classes, a random split can place very few minority-class examples in the test set, or none at all in extreme cases. Stratified splitting fixes this by drawing rows so that each class is represented in each subset in roughly the same proportion as in the full dataset.

In scikit-learn, passing stratify=y to train_test_split triggers stratified sampling based on the label vector. The same idea generalizes to k-fold settings through StratifiedKFold, which preserves class ratios across every fold. For a dataset with 95% negatives and 5% positives, stratification keeps roughly that ratio in every fold rather than leaving some folds with no positives at all.
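The 95 / 5 case can be sketched as follows, assuming a synthetic label vector with 950 negatives and 50 positives. The feature matrix is a placeholder, since only the labels drive stratification here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic imbalanced labels: 95% negatives, 5% positives.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Stratified holdout: the test set keeps roughly the same 5% positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
test_positives = int(y_test.sum())

# Stratified k-fold: every fold receives its share of positives.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_counts = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

print(test_positives, fold_positive_counts)
```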

Time-based split

If the data has a time dimension, a random split is almost always wrong. The model gets to peek at the future during training, which leaks information that would not be available at prediction time and produces evaluation numbers that no real deployment will match.

The right approach is a chronological split. Rows from earlier dates form the training set, later dates become the validation set, and the most recent dates form the test set. Scikit-learn provides TimeSeriesSplit for the cross-validated version. It implements walk-forward validation, where each successive fold uses an expanding training window of all earlier folds, and the next fold becomes the test set. Walk-forward evaluation is sometimes called the k-fold of the time series world.
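A small sketch of the expanding-window behaviour, assuming twelve rows that are already sorted by time. Printing the indices makes it easy to check that every training window ends before its test window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be in chronological order already.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv.split(X)
]

for train_idx, test_idx in folds:
    # The training window grows each round; the test window is always later.
    print("train:", train_idx, "test:", test_idx)
```

TimeSeriesSplit does not sort the data itself, so the rows need to be ordered by timestamp before the splitter is applied.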

Group split

The trickiest case is when rows are not independent because they share some grouping variable, such as a patient ID, a customer, a session, or a manufacturing unit. If two transactions from the same customer end up in different subsets, the model can use customer-specific patterns to score the test row, even when the goal is to generalize to new customers.

GroupKFold in scikit-learn handles this. It guarantees that every group appears in exactly one fold, so the train and test sets never share groups. There is also StratifiedGroupKFold, which combines group separation with class-ratio preservation for imbalanced grouped data. Medical machine learning and recommendation systems are two areas where group splits are essential. Splitting MRI slices from one patient across train and test can inflate test accuracy while telling you nothing useful about new patients.
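A minimal sketch of a group-aware split, assuming a hypothetical array of patient IDs in which several rows belong to the same patient. The check at the end counts how many groups are shared between train and test in each fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical patient IDs: several rows (e.g. scans) per patient.
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])
X = np.arange(len(groups)).reshape(-1, 1)  # placeholder features
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0])

gkf = GroupKFold(n_splits=3)
shared_group_counts = []
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No patient should appear on both sides of the split.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    shared_group_counts.append(len(shared))

print(shared_group_counts)  # prints [0, 0, 0]
```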

K-fold cross-validation

For small or medium datasets, holding out a single validation set wastes data and produces noisy estimates. K-fold cross-validation cuts the data into k roughly equal folds, then loops k times, using each fold once as validation and the remaining k-1 folds for training. The validation scores from each iteration are averaged into a single estimate.

A value of k=5 or k=10 is typical. Higher k uses more data for training each round at the cost of more model fits. Leave-one-out cross-validation is the extreme case where k equals the number of rows. Cross-validation is most useful when the dataset is small enough that a single holdout split would starve the model or leave the test set too noisy.
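The loop described above can be sketched with cross_val_score, assuming a synthetic classification problem and a logistic regression as a stand-in model. Both are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Five folds: each fold serves once as validation, the other four as training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# One score per fold, averaged into a single estimate.
mean_score = scores.mean()
print(scores, mean_score)
```

The spread of the per-fold scores is worth inspecting alongside the mean, since a wide spread suggests the estimate is still noisy.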

Data leakage

The whole point of splitting is to keep the test data unseen. Data leakage is the umbrella term for any situation in which information that would not be available at prediction time, including information from the test set, finds its way into model fitting. Common causes include the following.

Preprocessing before splitting. If a scaler, encoder, or imputer is fit on rows that later end up in the test set, the model has been exposed to statistics from data it should not have seen. The fix is to fit transforms only on the training portion. In scikit-learn, wrapping the transforms and the model in a Pipeline does this automatically whenever the pipeline is fit on training data or passed to a cross-validation utility.

Target leakage. A feature that is only available after the label is known will produce suspiciously good models that fail in production. Including a cancellation date when predicting churn is a classic example.

Temporal leakage. Randomly splitting time series data lets the model train on the future. The cure is a time-aware split.

Group leakage. Allowing the same patient, user, or session to appear in both training and test sets lets the model exploit identity rather than learning generalizable patterns.

The scikit-learn guide on common pitfalls treats leakage as one of the most frequent sources of overoptimistic evaluation, and the cure is almost always to split first and process second.
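A sketch of the split-first, process-second pattern, assuming synthetic data, a StandardScaler, and a logistic regression. The final check compares the scaler's learned means against the training rows only, which is what a leak-free fit should produce.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Split first, so the test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler and the model on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# At scoring time the test rows are transformed with the training statistics.
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
fitted_on_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(test_accuracy, fitted_on_train_only)
```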

Splits in decision trees

The second meaning of split lives inside decision tree models. Here a split is a yes/no test applied to a single feature that partitions the samples flowing into a node into two child nodes. A typical split looks like age <= 35.5 or country == "FR". The tree-building algorithm tries many candidate splits at each node and keeps the one that most improves a chosen impurity measure.

Binary recursive partitioning

The CART algorithm, short for Classification and Regression Trees, builds a tree by repeatedly splitting the current node into two children, then recursing on each child until a stopping condition fires. Other tree algorithms such as ID3, C4.5, and C5.0 use the same recursive structure with different splitting rules, and C4.5 supports multi-way splits on categorical variables.

At every internal node the algorithm considers candidate features. For numeric features it scans candidate thresholds, usually midpoints between sorted unique values. For categorical features it considers subsets of categories. The split with the largest impurity drop wins.
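The midpoint rule for numeric features can be sketched in a few lines of plain Python. The function name candidate_thresholds and the sample ages are hypothetical, and real implementations typically work on pre-sorted arrays rather than rebuilding the list at every node.

```python
def candidate_thresholds(values):
    """Return midpoints between consecutive sorted unique values."""
    uniq = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]


# Hypothetical feature column at one node.
ages = [22, 35, 36, 35, 58, 41]
thresholds = candidate_thresholds(ages)
print(thresholds)  # prints [28.5, 35.5, 38.5, 49.5]
```

Each threshold t defines a candidate test of the form age <= t, and the tree builder scores every such test with its impurity measure.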

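In scikit-learn, the splitter='best' setting performs a greedy exhaustive search over the candidate thresholds of every feature considered at the node, which is all features unless max_features restricts them. The splitter='random' setting instead draws one random threshold per candidate feature and keeps the best of those draws, which is the mechanism behind Extra Trees. Random Forest takes a different route: it keeps the best-threshold search but restricts each node to a random subset of features.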

Gini impurity

Gini impurity is the default splitting criterion for classification trees in scikit-learn and the original CART implementation. For a node with class proportions p_k, Gini impurity equals 1 minus the sum of p_k squared. It can be read as the probability that a randomly chosen sample is misclassified if labelled by drawing from the empirical class distribution. A pure node has Gini impurity zero. A 50/50 binary node has Gini 0.5. The split with the highest weighted Gini drop, often called Gini gain, is selected.
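A worked example of the formula in plain Python, assuming a ten-sample parent node and one hypothetical split. The helper names gini and gini_gain are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5  # 50/50 node, Gini 0.5
left = ["a"] * 4 + ["b"] * 1    # mostly "a", Gini 0.32
right = ["a"] * 1 + ["b"] * 4   # mostly "b", Gini 0.32

gain = gini_gain(parent, left, right)
print(round(gini(parent), 2), round(gain, 2))  # prints 0.5 0.18
```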

Information gain and entropy

The ID3 algorithm and its descendants use information gain, which is the drop in Shannon entropy caused by a split. Entropy at a node is the negative sum of p_k times log_2(p_k) over all classes, which measures it in bits. A pure node has entropy zero. A maximally mixed node has entropy log_2(K) for K classes. Information gain equals the entropy of the parent minus the weighted average entropy of the children, and the split with the highest information gain is chosen.
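The same calculation for entropy, sketched in plain Python with base-2 logarithms. The helper names and the example node are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) over the classes present."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5  # maximally mixed binary node: 1 bit
left = ["a"] * 4 + ["b"] * 1
right = ["a"] * 1 + ["b"] * 4

info_gain = information_gain(parent, [left, right])
print(entropy(parent), round(info_gain, 3))
```

Taking a list of children rather than exactly two keeps the sketch usable for the multi-way splits that ID3 and C4.5 allow on categorical features.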

Gini and entropy usually agree on which split is best. Gini is slightly faster because it avoids the logarithm. C4.5 extends information gain to gain ratio, which normalizes by the intrinsic information of the split and penalizes splits with many small partitions, the failure mode that pure information gain has on high-cardinality categorical variables like ID columns.

Variance reduction

For regression trees the target is continuous and Gini and entropy do not apply. The standard criterion is reduction in variance, also called squared-error reduction. At each candidate split, the algorithm picks the partition that minimizes the weighted variance of the target across the two child nodes, equivalent to minimizing the sum of squared residuals from each child's mean.
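A worked example in plain Python, assuming a six-value target and one hypothetical split that separates the small values from the large ones. The helper names are illustrative.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


parent = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
left = [1.0, 2.0, 3.0]      # each child predicts its own mean
right = [10.0, 11.0, 12.0]

reduction = variance_reduction(parent, left, right)
print(round(reduction, 2))  # prints 20.25
```

Almost all of the parent's variance disappears because the split lines up with the gap between the two clusters of target values.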

Scikit-learn calls this squared_error (formerly mse). It also supports friedman_mse, a variant introduced by Jerome Friedman, and absolute_error, which splits on mean absolute deviation from the median. A Poisson deviance criterion is also available for count or frequency targets.

Chi-square

The CHAID algorithm uses a chi-square test of independence to evaluate splits. Splits with larger chi-square statistics, meaning child distributions that differ more from the parent than chance predicts, are preferred. Chi-square is less common in modern tree libraries but still appears in some statistical software.
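The Pearson chi-square statistic behind this test can be sketched in plain Python for a table of class counts per child node. This shows only the statistic, not the full CHAID procedure, which also merges categories and works with adjusted p-values. The function name and the counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square for a contingency table (rows: children, cols: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if child membership and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


informative_split = [[40, 10], [10, 40]]    # children differ strongly in class mix
uninformative_split = [[25, 25], [25, 25]]  # children mirror the parent

print(chi_square_statistic(informative_split))    # prints 36.0
print(chi_square_statistic(uninformative_split))  # prints 0.0
```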

Stopping criteria

A tree could keep splitting until every leaf has a single sample, which produces a perfectly pure but useless model. Practical implementations stop on conditions such as maximum depth, a minimum sample count to split, a minimum leaf size, or a minimum impurity decrease. Pruning afterwards removes splits that did not generalize. The interaction between splitting criteria and stopping rules controls the bias-variance tradeoff of a tree, and ensemble methods exploit that tradeoff deliberately: Random Forest averages many deep, individually overfit trees, while gradient boosting sums many shallow, individually underfit trees that are fit in sequence.
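In scikit-learn these stopping conditions map onto constructor parameters of DecisionTreeClassifier. The sketch below assumes synthetic data and arbitrary parameter values, and compares an unrestricted tree with a constrained one.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# No stopping rules: the tree grows until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Illustrative limits, not recommended defaults.
limited = DecisionTreeClassifier(
    max_depth=3,                  # maximum depth
    min_samples_split=20,         # minimum samples needed to split a node
    min_samples_leaf=10,          # minimum samples allowed in a leaf
    min_impurity_decrease=0.001,  # minimum impurity drop to accept a split
    random_state=0,
).fit(X, y)

depths = (unrestricted.get_depth(), limited.get_depth())
leaves = (unrestricted.get_n_leaves(), limited.get_n_leaves())
print(depths, leaves)
```

Post-hoc pruning is available separately through the ccp_alpha parameter, which applies minimal cost-complexity pruning after the tree is grown.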

Explain like I'm 5

Imagine you have a big bag of toy animals and you want to teach a friend how to sort them. You first show them a few examples (the training set), then quiz them with toys they have not seen (the validation set) so you can give hints, and finally test them on a batch you have kept hidden the whole time (the test set). If you let them peek at the hidden batch, the test score does not mean anything.

Now think of a different kind of split. Instead of splitting the bag of toys, you are splitting the toys themselves into groups by asking yes/no questions. "Does it have four legs?" splits the animals into two piles. The decision tree keeps asking questions like this until each pile contains mostly one kind of animal. That kind of split is the building block of a decision tree.
