See also: Machine learning terms
A splitter is a term used in two distinct senses in machine learning. The first and most common sense is a utility that partitions a dataset into subsets such as training, validation, and test sets, or into the folds of a cross-validation scheme. The second sense, used in the decision tree literature and in Google's machine learning glossary, refers to the routine inside a tree-learning algorithm that chooses the best condition at each internal node. Both meanings share the underlying idea of dividing data so that downstream learning is honest, reproducible, and accurate, but the algorithms involved are quite different. This article covers the data splitter first, then the decision-tree splitter.
A data splitter is a method or class that divides a dataset into subsets, typically a training set, a validation set, and a test set, or into k folds for cross-validation. Splitting matters because a model evaluated on the same data it was trained on will look better than it really is, often dramatically so when the model has high capacity. A held-out test set, untouched until the very end of model development, is the closest a practitioner can get to an honest estimate of generalization error.
Splitters do three jobs at once. They support honest model evaluation by reserving data the learning algorithm has never seen. They enable hyperparameter tuning by providing an inner validation set or cross-validation loop separate from the final test set. And they prevent a class of bugs known as data leakage, in which information from the test set sneaks into training, sometimes through a careless preprocessing step, sometimes through ignoring grouping or temporal structure in the data.
There is no single best splitting strategy; the right choice depends on the size of the dataset, the structure of the labels, the presence of grouping, and whether the data are temporal. The most common strategies form a small family of named procedures.
| Strategy | What it does | When to use it |
|---|---|---|
| Hold-out | Single random partition into train and test (or train, validation, test) | Large datasets where one split is statistically reliable |
| K-fold cross-validation | Rotates each of k folds through the role of validation set | Default for small to medium datasets |
| Stratified k-fold | K-fold that preserves class proportions in every fold | Classification with imbalanced or rare classes |
| Group k-fold | K-fold that keeps all rows from a group on the same side | Patients in clinical data, users in behavioural logs |
| Stratified group k-fold | Combines stratification with non-overlapping groups | Imbalanced classification with grouped observations |
| Leave-one-out (LOO) | n folds, each containing a single test point | Tiny datasets where every observation matters |
| Leave-p-out | All possible test sets of size p | Statistical theory and very small datasets |
| Time-series split | Train on past, test on future, expanding window | Forecasting and any temporally ordered data |
| Walk-forward / rolling-origin | Repeated train-test pairs with a moving cut-off | Backtesting trading strategies and demand models |
| Repeated k-fold | Runs k-fold several times with different seeds | Reduces variance of the cross-validation estimate |
| Stratified shuffle split | Random shuffle splits with stratified class proportions | Many quick repeats on imbalanced data |
The choice between these is rarely arbitrary. Stratification is essential when the positive class is rare, because a plain random split can produce folds with no positive examples at all, breaking metrics like recall and area under the precision-recall curve. Group splitters matter whenever the same entity appears in many rows, for example a patient with several scans or a user with several sessions; a random split would put part of the same patient's data in train and part in test, producing optimistic accuracy that does not survive deployment.
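A minimal sketch of the group rule in code, using scikit-learn's GroupKFold on a synthetic dataset (the patient identifiers, array sizes, and labels here are made up for illustration):

```python
# Minimal sketch: GroupKFold keeps all rows of a patient on one side of the split.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))          # 12 scans, 3 features
y = rng.integers(0, 2, size=12)       # binary labels
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])  # patient of each scan

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups)):
    train_patients = set(groups[train_idx])
    test_patients = set(groups[test_idx])
    # No patient appears on both sides of any fold.
    assert train_patients.isdisjoint(test_patients)
    print(f"fold {fold}: test patients {sorted(test_patients)}")
```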
Time-series splitters enforce a different rule: the test set must come strictly after the training set in time. Random k-fold on time-stamped data is a classic data leakage trap, because the training fold can contain points from after the test fold and the model effectively peeks at the future. The scikit-learn TimeSeriesSplit and walk-forward variants like rolling-origin evaluation produce a sequence of expanding or sliding windows that mimic a real deployment in which a model is retrained periodically and used to predict the next interval.
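A minimal sketch of an expanding-window split, assuming a short synthetic series; the series length and number of splits are arbitrary choices:

```python
# Minimal sketch of an expanding-window split on time-ordered data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)      # 20 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index: no peeking at the future.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train ends at t={train_idx.max()}, "
          f"test covers t={test_idx.min()}..{test_idx.max()}")
```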
The scikit-learn library exposes splitters as iterators in sklearn.model_selection. Each class implements a split(X, y, groups) method that yields pairs of integer index arrays for the training and validation portions of each fold. A model selection routine such as cross_val_score or GridSearchCV accepts any of these objects as its cv argument, which makes swapping splitters trivial.
| Class | Stratified | Groups | Order-aware | Notes |
|---|---|---|---|---|
| KFold | No | No | No | Plain k-fold, default n_splits=5 |
| StratifiedKFold | Yes | No | No | Preserves class proportions in each fold |
| GroupKFold | No | Yes | No | No group appears in two folds |
| StratifiedGroupKFold | Yes | Yes | No | Stratified, with non-overlapping groups |
| TimeSeriesSplit | No | No | Yes | Train on past, test on next chunk |
| ShuffleSplit | No | No | No | Repeated random train/test splits |
| StratifiedShuffleSplit | Yes | No | No | Shuffle split that preserves class ratios |
| GroupShuffleSplit | No | Yes | No | Shuffle split that respects groups |
| LeaveOneOut | No | No | No | Equivalent to KFold(n_splits=n) |
| LeavePOut | No | No | No | All C(n, p) train/test pairs |
| LeaveOneGroupOut | No | Yes | No | Each group becomes the test set in turn |
| LeavePGroupsOut | No | Yes | No | All combinations of p groups as test |
| PredefinedSplit | No | No | No | Use externally specified fold indices |
| RepeatedKFold | No | No | No | Runs k-fold multiple times with new seeds |
| RepeatedStratifiedKFold | Yes | No | No | Repeated stratified k-fold |
The official scikit-learn user guide for cross-validation walks through these classes with diagrams that show, for each splitter, which samples land in the training and validation sets across folds. The diagrams make the differences between, say, KFold and GroupKFold easy to see at a glance and are worth consulting before choosing a splitter for a new project.
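A minimal usage sketch of the interface described above, on a synthetic imbalanced classification problem; the estimator choice is arbitrary:

```python
# Minimal sketch: any splitter object can be passed as the cv argument.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# split() yields one (train_indices, test_indices) pair of integer arrays per fold.
for train_idx, test_idx in cv.split(X, y):
    print(len(train_idx), len(test_idx), y[test_idx].mean())

# cross_val_score drives the same iterator internally via its cv argument.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```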
No official rule fixes the proportions of a hold-out split. Practitioners use a few defaults that have stood up in practice. An 80/20 split between train and test is common when no separate validation set is needed because hyperparameter tuning is done with k-fold cross-validation on the training portion. A 70/15/15 or 60/20/20 three-way split adds an explicit validation set and is a typical recipe for medium-sized datasets and deep learning. For very large datasets, the training fraction is often pushed higher (say 95/2.5/2.5) because even a small percentage of a very large dataset still contains enough examples for a statistically reliable test set. For very small datasets, k-fold cross-validation or leave-one-out is preferred to a single hold-out, because a single small test set is too noisy to trust.
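One common way to obtain a 70/15/15 split in practice is to chain two hold-out splits, as in this sketch on synthetic data; the exact fractions are conventions, not requirements:

```python
# Sketch of the 70/15/15 recipe using two chained hold-out splits.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)

# First carve off 15% as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Then split the remainder so that 15% of the original data becomes validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```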
Stratification matters most when the data are imbalanced. Suppose 1% of customers churn in a given month. A plain random split with 100 test points has an expected count of one churner, with a non-trivial probability of zero churners and therefore zero recall. Stratified splitters fix this by partitioning each class separately and recombining the pieces, so every fold contains the same 1% positive rate as the full dataset. The same idea applies to multi-class classification; StratifiedKFold extends naturally to any number of classes by stratifying along the full label distribution.
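The effect is easy to verify empirically. The sketch below counts test-fold positives under a plain and a stratified splitter on a synthetic label with exactly 1% positives (the dataset is made up for illustration):

```python
# Sketch: with a 1% positive rate, plain KFold can leave a fold with no
# positives at all, while StratifiedKFold keeps the rate in every fold.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.zeros(1000, dtype=int)
y[:10] = 1                                   # exactly 1% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

for name, cv in [("KFold", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("StratifiedKFold",
                  StratifiedKFold(n_splits=10, shuffle=True, random_state=0))]:
    positives = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
    print(name, positives)   # stratified gives exactly one positive per fold
```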
Stratification is incompatible with strict temporal order, because shuffling and resorting by class breaks the chronology. For time-series classification, practitioners typically use a time-series split first and verify after the fact that the training and test windows have similar class distributions. If the class distribution drifts over time, stratification cannot save the model from concept drift; that is a separate problem.
Many high-profile errors in published machine-learning results trace back to a poorly chosen splitter. The most common patterns are: splitting time-series data at random so the model peeks at future values; ignoring patient or user grouping so the same person appears in both train and test; and applying preprocessing such as scaling or feature selection to the full dataset before splitting, so the test set influences the training pipeline.
A correctly chosen splitter is the first defence against these bugs. Group-aware splitters prevent same-entity leakage; time-series splitters prevent future-information leakage; and pipeline objects such as scikit-learn's Pipeline ensure that all preprocessing is fit on the training fold only. The combination of the right splitter and a pipeline is the standard recipe for honest cross-validation in modern Python machine learning.
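A sketch of that recipe, combining a group-aware splitter with a pipeline so the scaler is fit on each training fold only (the user groups here are synthetic):

```python
# Sketch of the "right splitter + pipeline" recipe: scaling is fit inside each
# training fold only, and no user appears on both sides of a fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
groups = np.repeat(np.arange(100), 6)        # 100 users, 6 rows each

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print(scores.mean())
```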
The second meaning of splitter comes from the decision tree literature. Google's Decision Forests glossary defines a splitter as the routine and algorithm responsible for finding the best condition at each node while training a decision tree. In other words, the splitter is the inner-loop search that, given a node and the data routed to it, picks a feature and a threshold (or a feature and a category set, or a hyperplane) that splits the node's data into two child nodes in a way that improves a chosen criterion.
A decision tree learning algorithm is built around two nested loops. The outer loop grows the tree, deciding which leaves to expand and when to stop. The inner loop, the splitter, scans the candidate splits at each leaf and picks one. Almost every difference between popular tree algorithms (CART, ID3, C4.5, Extra Trees, LightGBM, XGBoost) lives in this inner loop: which candidate splits are considered, what criterion is used to score them, and how ties and edge cases are handled.
The scoring function inside the splitter is called the split criterion. For classification trees, the criterion measures the impurity of a node and the algorithm chooses the split that reduces the weighted impurity of the children by the largest amount. For regression trees, the criterion measures variance or some loss-derived quantity around the node's mean prediction.
| Criterion | Task | Definition | Used by |
|---|---|---|---|
| Gini impurity | Classification | Probability that a random sample is mislabelled by class probabilities of the node | CART, scikit-learn DecisionTreeClassifier (default) |
| Entropy | Classification | Shannon entropy of the class distribution; reduction is information gain | ID3, C4.5, scikit-learn (criterion="entropy") |
| Log loss | Classification | Same numerical value as entropy in scikit-learn, exposed under a separate name | scikit-learn (criterion="log_loss") |
| Mean squared error | Regression | Variance within the node | scikit-learn DecisionTreeRegressor (default) |
| Friedman MSE | Regression | MSE-based improvement score derived in Friedman's GBM paper | scikit-learn (criterion="friedman_mse"), gradient boosting |
| Mean absolute error | Regression | Sum of absolute deviations from the node median | scikit-learn (criterion="absolute_error") |
| Half Poisson deviance | Regression with counts | Deviance of a Poisson model around the node mean | scikit-learn (criterion="poisson") |
| Quantile loss / pinball | Quantile regression trees | Asymmetric loss around a target quantile | LightGBM objective="quantile", XGBoost reg:quantileerror |
Gini impurity ranges from 0 (a node where every sample has the same label) to 1 minus 1/k for k equally distributed classes (0.5 in the binary case). Shannon entropy ranges from 0 to log k. The two criteria almost always pick the same split in practice, and the choice between them is more about computational cost (Gini avoids logarithms and is slightly faster) than statistical accuracy. Friedman MSE is a special-purpose criterion designed for additive regression trees in gradient boosting; it maximises the squared mean difference between children, weighted by their sizes, instead of using the standard MSE reduction.
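The sketch below shows how a splitter uses such a criterion, here Gini impurity, to score candidate thresholds and pick the best one. It is a deliberately naive NumPy-only illustration; real implementations sort each feature once and update class counts incrementally rather than rescanning the node's data for every threshold.

```python
# Toy exact-greedy splitter for one node: score every candidate threshold of
# every feature by the weighted Gini decrease.
import numpy as np

def gini(y):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    n, n_features = X.shape
    parent = gini(y)
    best = (None, None, 0.0)                  # (feature, threshold, gain)
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:     # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            children = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - children          # weighted impurity decrease
            if gain > best[2]:
                best = (j, t, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.3).astype(int)               # the "true" split is on feature 1
print(best_split(X, y))                        # picks feature 1, threshold just below 0.3
```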
Given a criterion, several algorithms compete to find a good split efficiently. The differences between them dominate the runtime characteristics of modern tree libraries.
| Algorithm | How it picks candidates | Strengths | Used by |
|---|---|---|---|
| Exact greedy | Enumerates every unique value of every feature | Optimal w.r.t. the criterion; simple | CART, scikit-learn (splitter="best"), early XGBoost (tree_method="exact") |
| Random splitter | Picks a random threshold per feature, takes the best | Very fast, decorrelates trees in an ensemble | scikit-learn (splitter="random"), ExtraTreeClassifier |
| Histogram-based | Buckets each feature into a fixed number of bins and scans bin boundaries | Fast and memory-light on large data | LightGBM, XGBoost (tree_method="hist"), CatBoost, scikit-learn HistGradientBoosting* |
| Approximate greedy with quantile sketch | Proposes candidate thresholds at weighted quantiles of the feature | Scales to billions of rows; supports distributed training | XGBoost (tree_method="approx") |
| Sparsity-aware | Sends missing values to the side with the larger gain | Native handling of sparse and missing data | XGBoost, LightGBM, scikit-learn HistGradientBoosting* |
| Oblique / multivariate | Searches over linear combinations of features (hyperplane splits) | Captures diagonal structure with fewer nodes | OC1, scikit-learn ObliqueTree (third-party), CART-LC |
The exact greedy algorithm tries every value as a candidate threshold for every feature at every node. It is the textbook description of CART and scales as O(features times samples) per split, which is fine for small data and prohibitive for large data. Histogram-based splitters trade a small amount of accuracy for speed by binning continuous features into, by default, 255 buckets and scanning only the bucket boundaries. The trick is that once the per-bin gradient and Hessian sums are computed, scoring all candidate splits for one feature is O(bins) rather than O(samples). LightGBM also exploits histogram subtraction, computing the histogram of the smaller child and subtracting it from the parent to get the larger child for free. These optimisations are the reason histogram-based gradient boosting is the default choice on tabular data with millions of rows.
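The following sketch illustrates the histogram idea for a single feature, using per-bin gradient and Hessian sums and a simplified, unregularised second-order gain; the bin count and gain formula are illustrative rather than any particular library's exact implementation:

```python
# Sketch of histogram-based split scoring for one feature.
import numpy as np

def histogram_best_threshold(x, grad, hess, n_bins=255):
    # 1. Bucket the feature once; the bin edges are reused for every node.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)

    # 2. Accumulate per-bin gradient/Hessian sums: O(samples), done once.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    # 3. Scan bin boundaries: O(bins) candidate splits, not O(samples).
    g_left, h_left = np.cumsum(g_hist), np.cumsum(h_hist)
    g_total, h_total = g_left[-1], h_left[-1]
    g_right, h_right = g_total - g_left, h_total - h_left
    eps = 1e-12
    gain = (g_left**2 / (h_left + eps) + g_right**2 / (h_right + eps)
            - g_total**2 / (h_total + eps))
    best_bin = int(np.argmax(gain[:-1]))      # last boundary has an empty right child
    return best_bin, gain[best_bin]

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
residual = (x > 0.5).astype(float) - 0.5      # pretend gradients from a partial fit
print(histogram_best_threshold(x, grad=residual, hess=np.ones_like(x)))
```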
XGBoost's approximate greedy algorithm uses a different idea, the weighted quantile sketch introduced in the original XGBoost paper by Chen and Guestrin. Instead of binning into fixed-width buckets, it places candidate thresholds at quantiles of the feature distribution weighted by the second-order gradient (the Hessian) so that each bucket carries roughly equal loss curvature. This produces accurate splits even when the feature distribution is skewed and is provably mergeable across distributed shards.
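A toy stand-in for the idea (not the real mergeable sketch data structure): place candidate thresholds at Hessian-weighted quantiles of the feature, so each bucket between consecutive candidates carries roughly the same total Hessian.

```python
# Toy Hessian-weighted candidate proposal, illustrating the weighted-quantile idea.
import numpy as np

def weighted_quantile_candidates(x, hess, n_candidates=32):
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], hess[order]
    cum = np.cumsum(h_sorted) / h_sorted.sum()          # weighted CDF in (0, 1]
    targets = np.linspace(0, 1, n_candidates + 2)[1:-1]  # interior quantile levels
    idx = np.searchsorted(cum, targets)
    return np.unique(x_sorted[idx])

rng = np.random.default_rng(0)
x = rng.lognormal(size=5_000)                 # heavily skewed feature
hess = np.ones_like(x)                        # uniform curvature in this toy case
print(weighted_quantile_candidates(x, hess)[:5])
```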
The random splitter is a deliberate weakening of the exact greedy algorithm. Instead of scanning every threshold, it draws a single random threshold for each feature in the candidate set and picks the best of those random thresholds. The motivation is variance reduction in an ensemble: when many random trees are aggregated, the noise introduced by random thresholds averages out, while the trees become less correlated and the ensemble's variance falls.
Extremely Randomized Trees, introduced by Pierre Geurts, Damien Ernst, and Louis Wehenkel in their 2006 Machine Learning paper of the same name, take this idea further. At each node, Extra Trees draw a random subset of features (like random forest) and then a single random threshold per feature, and pick the best of those random splits using the standard impurity criterion. The result is faster training (no full sort per feature) and often comparable accuracy to a random forest, especially when the dataset is noisy and the decorrelation benefits dominate. In scikit-learn, ExtraTreesClassifier and ExtraTreesRegressor implement this algorithm, and the underlying ExtraTreeClassifier corresponds to DecisionTreeClassifier(splitter="random").
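A quick comparison sketch on synthetic, noisy data; the dataset parameters and the resulting scores are illustrative only:

```python
# Sketch: Extra-Trees vs a random forest on a noisy synthetic problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)   # 10% label noise

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              ExtraTreesClassifier(n_estimators=200, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```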
Scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor expose the choice of splitting algorithm through a splitter parameter with two settings:
| Value | Behaviour |
|---|---|
"best" (default) | Exact greedy: scan every feature and every candidate threshold, pick the highest-gain split |
"random" | For each feature, sample one random threshold; pick the best of those |
A standalone DecisionTreeClassifier(splitter="random") is rarely competitive with the default, but it shines inside an ensemble such as BaggingClassifier or ExtraTreesClassifier, where decorrelation between trees boosts the aggregate accuracy. The criterion parameter (Gini, entropy, log loss, MSE, Friedman MSE, MAE, Poisson) is independent of the splitter parameter and chooses the scoring function the splitter uses.
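A sketch of that contrast, bagging a random-splitter tree on synthetic data; the exact scores will vary, but the single tree typically trails its bagged ensemble:

```python
# Sketch: a single random-splitter tree vs the same tree bagged.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# criterion and splitter are independent knobs on the same tree.
tree = DecisionTreeClassifier(splitter="random", criterion="entropy",
                              random_state=0)
# "estimator" is named "base_estimator" in scikit-learn versions before 1.2.
bagged = BaggingClassifier(estimator=tree, n_estimators=100, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())
```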
For a deeper treatment of how thresholds are chosen for numerical features, see threshold (decision trees), which discusses tie-breaking, missing-value handling, and how histogram-based libraries propose candidate thresholds.
The historical decision-tree algorithms differ mainly in their splitters. The CART algorithm, introduced by Breiman, Friedman, Olshen, and Stone in 1984, uses exact greedy splitting with Gini impurity for classification and MSE for regression, and produces strictly binary trees. ID3, introduced by Quinlan in 1986, uses information gain with entropy and produces multiway splits on categorical features. C4.5, Quinlan's 1993 extension of ID3, uses gain ratio (information gain divided by split entropy) to penalise high-cardinality features, supports continuous attributes via threshold splits, and includes pessimistic error pruning. The differences are entirely inside the splitter and the tree construction loop, not in the prediction logic, which is identical: traverse the tree, return the leaf prediction.
Modern gradient-boosted tree libraries inherit from this lineage but have re-engineered the splitter for scale. LightGBM adds gradient-based one-side sampling (GOSS) on top of histogram splitting to focus the splitter on samples with large gradients; XGBoost adds the weighted quantile sketch and a sparsity-aware default direction for missing values; gradient boosting frameworks such as CatBoost add ordered boosting and oblivious trees in which the splitter chooses the same condition at every node of a given depth.
The two senses of splitter, the data partitioner and the tree-node split-finder, share a common premise: machine learning needs principled rules for dividing data so that what comes out of the model is honest. The data splitter divides examples between training and evaluation; the tree splitter divides examples between branches of a tree. Both have to fight against shortcuts that look attractive in the short term and break the model in the long term, whether that is a leaky split that flatters the test metric or a greedy threshold that overfits a noisy feature. The right choice in either case depends on the structure of the data, the size of the dataset, and the downstream use of the model.
Imagine you want to teach your little robot to recognize different animals. You have lots of pictures of animals to help your robot learn. To make sure your robot really knows its stuff, you need to test it using some of the pictures, but pictures the robot has never seen before.
A splitter in machine learning is like a helper who organizes the pictures into groups for teaching and testing the robot. The helper makes sure that the robot sees a variety of animals in each group, so it learns how to recognize all of them properly. Once the robot has seen many different groups of pictures, you can be more confident that it can recognize animals it hasn't seen before.
There is also a second kind of splitter inside a special model called a decision tree. A decision tree is like a game of twenty questions: at every step it asks one yes-or-no question about the picture, like "does it have whiskers?" The splitter is the part of the program that decides which question to ask at each step, by trying lots of possible questions and picking the one that does the best job of separating cats from dogs from rabbits. Both kinds of splitters share the same goal: chopping data into smaller pieces in a smart way, so the robot can learn from the pieces and we can check it has really learned.