See also: Machine learning terms
A splitter is a term used in two distinct senses in machine learning. The first and most common sense is a utility that partitions a dataset into subsets such as training, validation, and test sets, or into the folds of a cross-validation scheme. The second sense, used in the decision tree literature and in Google's machine learning glossary, refers to the routine inside a tree-learning algorithm that chooses the best condition at each internal node. Both meanings share the underlying idea of dividing data so that downstream learning is honest, reproducible, and accurate, but the algorithms involved are quite different. This article covers the data splitter first, then the decision-tree splitter.
A data splitter is a method or class that divides a dataset into subsets, typically a training set, a validation set, and a test set, or into k folds for cross-validation. Splitting matters because a model evaluated on the same data it was trained on will look better than it really is, often dramatically so when the model has high capacity. A held-out test set, untouched until the very end of model development, is the closest a practitioner can get to an honest estimate of generalization error.
Splitters do three jobs at once. They support honest model evaluation by reserving data the learning algorithm has never seen. They enable hyperparameter tuning by providing an inner validation set or cross-validation loop separate from the final test set. And they prevent a class of bugs known as data leakage, in which information from the test set sneaks into training, sometimes through a careless preprocessing step, sometimes through ignoring grouping or temporal structure in the data.
There is no single best splitting strategy; the right choice depends on the size of the dataset, the structure of the labels, the presence of grouping, and whether the data are temporal. The most common strategies form a small family of named procedures.
| Strategy | What it does | When to use it |
|---|---|---|
| Hold-out | Single random partition into train and test (or train, validation, test) | Large datasets where one split is statistically reliable |
| K-fold cross-validation | Rotates each of k folds through the role of validation set | Default for small to medium datasets |
| Stratified k-fold | K-fold that preserves class proportions in every fold | Classification with imbalanced or rare classes |
| Group k-fold | K-fold that keeps all rows from a group on the same side | Patients in clinical data, users in behavioural logs |
| Stratified group k-fold | Combines stratification with non-overlapping groups | Imbalanced classification with grouped observations |
| Leave-one-out (LOO) | n folds, each containing a single test point | Tiny datasets where every observation matters |
| Leave-p-out | All possible test sets of size p | Statistical theory and very small datasets |
| Time-series split | Train on past, test on future, expanding window | Forecasting and any temporally ordered data |
| Walk-forward / rolling-origin | Repeated train-test pairs with a moving cut-off | Backtesting trading strategies and demand models |
| Repeated k-fold | Runs k-fold several times with different seeds | Reduces variance of the cross-validation estimate |
| Stratified shuffle split | Random shuffle splits with stratified class proportions | Many quick repeats on imbalanced data |
The choice between these is rarely arbitrary. Stratification is essential when the positive class is rare, because a plain random split can produce folds with no positive examples at all, breaking metrics like recall and area under the precision-recall curve. Group splitters matter whenever the same entity appears in many rows, for example a patient with several scans or a user with several sessions; a random split would put part of the same patient's data in train and part in test, producing optimistic accuracy that does not survive deployment.
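A minimal sketch of the group rule in code, using scikit-learn's GroupKFold on a synthetic dataset (the patient identifiers, array sizes, and labels here are made up for illustration):

```python
# Minimal sketch: GroupKFold keeps all rows of a patient on one side of the split.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))          # 12 scans, 3 features
y = rng.integers(0, 2, size=12)       # binary labels
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])  # patient of each scan

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups)):
    train_patients = set(groups[train_idx])
    test_patients = set(groups[test_idx])
    # No patient appears on both sides of any fold.
    assert train_patients.isdisjoint(test_patients)
    print(f"fold {fold}: test patients {sorted(test_patients)}")
```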
Time-series splitters enforce a different rule: the test set must come strictly after the training set in time. Random k-fold on time-stamped data is a classic data leakage trap, because the training fold can contain points from after the test fold and the model effectively peeks at the future. The scikit-learn TimeSeriesSplit and walk-forward variants like rolling-origin evaluation produce a sequence of expanding or sliding windows that mimic a real deployment in which a model is retrained periodically and used to predict the next interval.
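A minimal sketch of an expanding-window split, assuming a short synthetic series; the series length and number of splits are arbitrary choices:

```python
# Minimal sketch of an expanding-window split on time-ordered data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)      # 20 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index: no peeking at the future.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train ends at t={train_idx.max()}, "
          f"test covers t={test_idx.min()}..{test_idx.max()}")
```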
The scikit-learn library exposes splitters as iterators in sklearn.model_selection. Each class implements a split(X, y, groups) method that yields pairs of integer index arrays for the training and validation portions of each fold. A model selection routine such as cross_val_score or GridSearchCV accepts any of these objects as its cv argument, which makes swapping splitters trivial.
| Class | Stratified | Groups | Order-aware | Notes |
|---|---|---|---|---|
| KFold | No | No | No | Plain k-fold, default n_splits=5 |
| StratifiedKFold | Yes | No | No | Preserves class proportions in each fold |
| GroupKFold | No | Yes | No | No group appears in two folds |
| StratifiedGroupKFold | Yes | Yes | No | Stratified, with non-overlapping groups |
| TimeSeriesSplit | No | No | Yes | Train on past, test on next chunk |
| ShuffleSplit | No | No | No | Repeated random train/test splits |
| StratifiedShuffleSplit | Yes | No | No | Shuffle split that preserves class ratios |
| GroupShuffleSplit | No | Yes | No | Shuffle split that respects groups |
| LeaveOneOut | No | No | No | Equivalent to KFold(n_splits=n) |
| LeavePOut | No | No | No | All C(n, p) train/test pairs |
| LeaveOneGroupOut | No | Yes | No | Each group becomes the test set in turn |
| LeavePGroupsOut | No | Yes | No | All combinations of p groups as test |
| PredefinedSplit | No | No | No | Use externally specified fold indices |
| RepeatedKFold | No | No | No | Runs k-fold multiple times with new seeds |
| RepeatedStratifiedKFold | Yes | No | No | Repeated stratified k-fold |
The official scikit-learn user guide for cross-validation walks through these classes with diagrams that show, for each splitter, which samples land in the training and validation sets across folds. The diagrams make the differences between, say, KFold and GroupKFold easy to see at a glance and are worth consulting before choosing a splitter for a new project.
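A minimal usage sketch of the interface described above, on a synthetic imbalanced classification problem; the estimator choice is arbitrary:

```python
# Minimal sketch: any splitter object can be passed as the cv argument.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# split() yields one (train_indices, test_indices) pair of integer arrays per fold.
for train_idx, test_idx in cv.split(X, y):
    print(len(train_idx), len(test_idx), y[test_idx].mean())

# cross_val_score drives the same iterator internally via its cv argument.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```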
No official rule fixes the proportions of a hold-out split. Practitioners use a few defaults that have stood up in practice. An 80/20 split between train and test is common when no separate validation set is needed because hyperparameter tuning is done with k-fold cross-validation on the training portion. A 70/15/15 or 60/20/20 three-way split adds an explicit validation set and is a typical recipe for medium-sized datasets and deep learning. For very large datasets, the training fraction is often pushed higher (say 95/2.5/2.5) because even a small percentage of a very large dataset still contains enough examples for a statistically reliable test set. For very small datasets, k-fold cross-validation or leave-one-out is preferred to a single hold-out, because a single small test set is too noisy to trust.
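One common way to obtain a 70/15/15 split in practice is to chain two hold-out splits, as in this sketch on synthetic data; the exact fractions are conventions, not requirements:

```python
# Sketch of the 70/15/15 recipe using two chained hold-out splits.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, random_state=0)

# First carve off 15% as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Then split the remainder so that 15% of the original data becomes validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 700 / 150 / 150
```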
Stratification matters most when the data are imbalanced. Suppose 1% of customers churn in a given month. A plain random split with 100 test points has an expected count of one churner, with a non-trivial probability of zero churners and therefore zero recall. Stratified splitters fix this by partitioning each class separately and recombining the pieces, so every fold contains the same 1% positive rate as the full dataset. The same idea applies to multi-class classification; StratifiedKFold extends naturally to any number of classes by stratifying along the full label distribution.
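The effect is easy to verify empirically. The sketch below counts test-fold positives under a plain and a stratified splitter on a synthetic label with exactly 1% positives (the dataset is made up for illustration):

```python
# Sketch: with a 1% positive rate, plain KFold can leave a fold with no
# positives at all, while StratifiedKFold keeps the rate in every fold.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.zeros(1000, dtype=int)
y[:10] = 1                                   # exactly 1% positives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

for name, cv in [("KFold", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("StratifiedKFold",
                  StratifiedKFold(n_splits=10, shuffle=True, random_state=0))]:
    positives = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
    print(name, positives)   # stratified gives exactly one positive per fold
```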
Stratification is incompatible with strict temporal order, because shuffling and resorting by class breaks the chronology. For time-series classification, practitioners typically use a time-series split first and verify after the fact that the training and test windows have similar class distributions. If the class distribution drifts over time, stratification cannot save the model from concept drift; that is a separate problem.
Many high-profile errors in published machine-learning results trace back to a poorly chosen splitter. The most common patterns are: splitting time-series data at random so the model peeks at future values; ignoring patient or user grouping so the same person appears in both train and test; and applying preprocessing such as scaling or feature selection to the full dataset before splitting, so the test set influences the training pipeline.
A correctly chosen splitter is the first defence against these bugs. Group-aware splitters prevent same-entity leakage; time-series splitters prevent future-information leakage; and pipeline objects such as scikit-learn's Pipeline ensure that all preprocessing is fit on the training fold only. The combination of the right splitter and a pipeline is the standard recipe for honest cross-validation in modern Python machine learning.
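A sketch of that recipe, combining a group-aware splitter with a pipeline so the scaler is fit on each training fold only (the user groups here are synthetic):

```python
# Sketch of the "right splitter + pipeline" recipe: scaling is fit inside each
# training fold only, and no user appears on both sides of a fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
groups = np.repeat(np.arange(100), 6)        # 100 users, 6 rows each

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print(scores.mean())
```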
The second meaning of splitter comes from the decision tree literature. Google's Decision Forests glossary defines a splitter as the routine and algorithm responsible for finding the best condition at each node while training a decision tree. In other words, the splitter is the inner-loop search that, given a node and the data routed to it, picks a feature and a threshold (or a feature and a category set, or a hyperplane) that splits the node's data into two child nodes in a way that improves a chosen criterion.
A decision tree learning algorithm is built around two nested loops. The outer loop grows the tree, deciding which leaves to expand and when to stop. The inner loop, the splitter, scans the candidate splits at each leaf and picks one. Almost every difference between popular tree algorithms (CART, ID3, C4.5, Extra Trees, LightGBM, XGBoost) lives in this inner loop: which candidate splits are considered, what criterion is used to score them, and how ties and edge cases are handled.
The scoring function inside the splitter is called the split criterion. For classification trees, the criterion measures the impurity of a node and the algorithm chooses the split that reduces the weighted impurity of the children by the largest amount. For regression trees, the criterion measures variance or some loss-derived quantity around the node's mean prediction.
| Criterion | Task | Definition | Used by |
|---|---|---|---|
| Gini impurity | Classification | Probability that a random sample is mislabelled by class probabilities of the node | CART, scikit-learn DecisionTreeClassifier (default) |
| Entropy | Classification | Shannon entropy of the class distribution; reduction is information gain | ID3, C4.5, scikit-learn (criterion="entropy") |
| Log loss | Classification | Same numerical value as entropy in scikit-learn, exposed under a separate name | scikit-learn (criterion="log_loss") |
| Mean squared error | Regression | Variance within the node | scikit-learn DecisionTreeRegressor (default) |
| Friedman MSE | Regression | MSE-based improvement score derived in Friedman's GBM paper | scikit-learn (criterion="friedman_mse"), gradient boosting |
| Mean absolute error | Regression | Sum of absolute deviations from the node median | scikit-learn (criterion="absolute_error") |
| Half Poisson deviance | Regression with counts | Deviance of a Poisson model around the node mean | scikit-learn (criterion="poisson") |
| Quantile loss / pinball | Quantile regression trees | Asymmetric loss around a target quantile | LightGBM objective="quantile", XGBoost reg:quantileerror |
Gini impurity ranges from 0 (a node where every sample has the same label) to 1 minus 1/k for k equally distributed classes (0.5 in the binary case). Shannon entropy ranges from 0 to log k. The two criteria almost always pick the same split in practice, and the choice between them is more about computational cost (Gini avoids logarithms and is slightly faster) than statistical accuracy. Friedman MSE is a special-purpose criterion designed for additive regression trees in gradient boosting; it maximises the squared mean difference between children, weighted by their sizes, instead of using the standard MSE reduction.
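The sketch below shows how a splitter uses such a criterion, here Gini impurity, to score candidate thresholds and pick the best one. It is a deliberately naive NumPy-only illustration; real implementations sort each feature once and update class counts incrementally rather than rescanning the node's data for every threshold.

```python
# Toy exact-greedy splitter for one node: score every candidate threshold of
# every feature by the weighted Gini decrease.
import numpy as np

def gini(y):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    n, n_features = X.shape
    parent = gini(y)
    best = (None, None, 0.0)                  # (feature, threshold, gain)
    for j in range(n_features):
        for t in np.unique(X[:, j])[:-1]:     # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            children = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - children          # weighted impurity decrease
            if gain > best[2]:
                best = (j, t, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.3).astype(int)               # the "true" split is on feature 1
print(best_split(X, y))                        # picks feature 1, threshold just below 0.3
```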
Given a criterion, several algorithms compete to find a good split efficiently. The differences between them dominate the runtime characteristics of modern tree libraries.
| Algorithm | How it picks candidates | Strengths | Used by |
|---|---|---|---|
| Exact greedy | Enumerates every unique value of every feature | Optimal w.r.t. the criterion; simple | CART, scikit-learn (splitter="best"), early XGBoost (tree_method="exact") |
| Random splitter | Picks a random threshold per feature, takes the best | Very fast, decorrelates trees in an ensemble | scikit-learn (splitter="random"), ExtraTreeClassifier |
| Histogram-based | Buckets each feature into a fixed number of bins and scans bin boundaries | Fast and memory-light on large data | LightGBM, XGBoost (tree_method="hist"), CatBoost, scikit-learn HistGradientBoosting* |
| Approximate greedy with quantile sketch | Proposes candidate thresholds at weighted quantiles of the feature | Scales to billions of rows; supports distributed training | XGBoost (tree_method="approx") |
| Sparsity-aware | Sends missing values to the side with the larger gain | Native handling of sparse and missing data | XGBoost, LightGBM, scikit-learn HistGradientBoosting* |
| Oblique / multivariate | Searches over linear combinations of features (hyperplane splits) | Captures diagonal structure with fewer nodes | OC1, scikit-learn ObliqueTree (third-party), CART-LC |
The exact greedy algorithm tries every value as a candidate threshold for every feature at every node. It is the textbook description of CART and scales as O(features times samples) per split, which is fine for small data and prohibitive for large data. Histogram-based splitters trade a small amount of accuracy for speed by binning continuous features into, by default, 255 buckets and scanning only the bucket boundaries. The trick is that once the per-bin gradient and Hessian sums are computed, scoring all candidate splits for one feature is O(bins) rather than O(samples). LightGBM also exploits histogram subtraction, computing the histogram of the smaller child and subtracting it from the parent to get the larger child for free. These optimisations are the reason histogram-based gradient boosting is the default choice on tabular data with millions of rows.
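The following sketch illustrates the histogram idea for a single feature, using per-bin gradient and Hessian sums and a simplified, unregularised second-order gain; the bin count and gain formula are illustrative rather than any particular library's exact implementation:

```python
# Sketch of histogram-based split scoring for one feature.
import numpy as np

def histogram_best_threshold(x, grad, hess, n_bins=255):
    # 1. Bucket the feature once; the bin edges are reused for every node.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)

    # 2. Accumulate per-bin gradient/Hessian sums: O(samples), done once.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    # 3. Scan bin boundaries: O(bins) candidate splits, not O(samples).
    g_left, h_left = np.cumsum(g_hist), np.cumsum(h_hist)
    g_total, h_total = g_left[-1], h_left[-1]
    g_right, h_right = g_total - g_left, h_total - h_left
    eps = 1e-12
    gain = (g_left**2 / (h_left + eps) + g_right**2 / (h_right + eps)
            - g_total**2 / (h_total + eps))
    best_bin = int(np.argmax(gain[:-1]))      # last boundary has an empty right child
    return best_bin, gain[best_bin]

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
residual = (x > 0.5).astype(float) - 0.5      # pretend gradients from a partial fit
print(histogram_best_threshold(x, grad=residual, hess=np.ones_like(x)))
```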
XGBoost's approximate greedy algorithm uses a different idea, the weighted quantile sketch introduced in the original XGBoost paper by Chen and Guestrin. Instead of binning into fixed-width buckets, it places candidate thresholds at quantiles of the feature distribution weighted by the second-order gradient (the Hessian) so that each bucket carries roughly equal loss curvature. This produces accurate splits even when the feature distribution is skewed and is provably mergeable across distributed shards.
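A toy stand-in for the idea (not the real mergeable sketch data structure): place candidate thresholds at Hessian-weighted quantiles of the feature, so each bucket between consecutive candidates carries roughly the same total Hessian.

```python
# Toy Hessian-weighted candidate proposal, illustrating the weighted-quantile idea.
import numpy as np

def weighted_quantile_candidates(x, hess, n_candidates=32):
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], hess[order]
    cum = np.cumsum(h_sorted) / h_sorted.sum()          # weighted CDF in (0, 1]
    targets = np.linspace(0, 1, n_candidates + 2)[1:-1]  # interior quantile levels
    idx = np.searchsorted(cum, targets)
    return np.unique(x_sorted[idx])

rng = np.random.default_rng(0)
x = rng.lognormal(size=5_000)                 # heavily skewed feature
hess = np.ones_like(x)                        # uniform curvature in this toy case
print(weighted_quantile_candidates(x, hess)[:5])
```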
The random splitter is a deliberate weakening of the exact greedy algorithm. Instead of scanning every threshold, it draws a single random threshold for each feature in the candidate set and picks the best of those random thresholds. The motivation is variance reduction in an ensemble: when many random trees are aggregated, the noise introduced by random thresholds averages out, while the trees become less correlated and the ensemble's variance falls.
Extremely Randomized Trees, introduced by Pierre Geurts, Damien Ernst, and Louis Wehenkel in their 2006 Machine Learning paper of the same name, take this idea further. At each node, Extra Trees draw a random subset of features (like random forest) and then a single random threshold per feature, and pick the best of those random splits using the standard impurity criterion. The result is faster training (no full sort per feature) and often comparable accuracy to a random forest, especially when the dataset is noisy and the decorrelation benefits dominate. In scikit-learn, ExtraTreesClassifier and ExtraTreesRegressor implement this algorithm, and the underlying ExtraTreeClassifier corresponds to DecisionTreeClassifier(splitter="random").
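A quick comparison sketch on synthetic, noisy data; the dataset parameters and the resulting scores are illustrative only:

```python
# Sketch: Extra-Trees vs a random forest on a noisy synthetic problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)   # 10% label noise

for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              ExtraTreesClassifier(n_estimators=200, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```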
Scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor expose the choice of splitting algorithm through a splitter parameter with two settings:
| Value | Behaviour |
|---|---|
"best" (default) | Exact greedy: scan every feature and every candidate threshold, pick the highest-gain split |
"random" | For each feature, sample one random threshold; pick the best of those |
A standalone DecisionTreeClassifier(splitter="random") is rarely competitive with the default, but it shines inside an ensemble such as BaggingClassifier or ExtraTreesClassifier, where decorrelation between trees boosts the aggregate accuracy. The criterion parameter (Gini, entropy, log loss, MSE, Friedman MSE, MAE, Poisson) is independent of the splitter parameter and chooses the scoring function the splitter uses.
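A sketch of that contrast, bagging a random-splitter tree on synthetic data; the exact scores will vary, but the single tree typically trails its bagged ensemble:

```python
# Sketch: a single random-splitter tree vs the same tree bagged.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# criterion and splitter are independent knobs on the same tree.
tree = DecisionTreeClassifier(splitter="random", criterion="entropy",
                              random_state=0)
# "estimator" is named "base_estimator" in scikit-learn versions before 1.2.
bagged = BaggingClassifier(estimator=tree, n_estimators=100, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())
```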
For a deeper treatment of how thresholds are chosen for numerical features, see threshold (decision trees), which discusses tie-breaking, missing-value handling, and how histogram-based libraries propose candidate thresholds.
The historical decision-tree algorithms differ mainly in their splitters. The CART algorithm, introduced by Breiman, Friedman, Olshen, and Stone in 1984, uses exact greedy splitting with Gini impurity for classification and MSE for regression, and produces strictly binary trees. ID3, introduced by Quinlan in 1986, uses information gain with entropy and produces multiway splits on categorical features. C4.5, Quinlan's 1993 extension of ID3, uses gain ratio (information gain divided by split entropy) to penalise high-cardinality features, supports continuous attributes via threshold splits, and includes pessimistic error pruning. The differences are entirely inside the splitter and the tree construction loop, not in the prediction logic, which is identical: traverse the tree, return the leaf prediction.
Modern gradient-boosted tree libraries inherit from this lineage but have re-engineered the splitter for scale. LightGBM adds gradient-based one-side sampling (GOSS) on top of histogram splitting to focus the splitter on samples with large gradients; XGBoost adds the weighted quantile sketch and a sparsity-aware default direction for missing values; gradient boosting frameworks such as CatBoost add ordered boosting and oblivious trees in which the splitter chooses the same condition at every node of a given depth.
The two senses of splitter, the data partitioner and the tree-node split-finder, share a common premise: machine learning needs principled rules for dividing data so that what comes out of the model is honest. The data splitter divides examples between training and evaluation; the tree splitter divides examples between branches of a tree. Both have to fight against shortcuts that look attractive in the short term and break the model in the long term, whether that is a leaky split that flatters the test metric or a greedy threshold that overfits a noisy feature. The right choice in either case depends on the structure of the data, the size of the dataset, and the downstream use of the model.
Imagine you want to teach your little robot to recognize different animals. You have lots of pictures of animals to help your robot learn. To make sure your robot really knows its stuff, you need to test it using some of the pictures, but pictures the robot has never seen before.
A splitter in machine learning is like a helper who organizes the pictures into groups for teaching and testing the robot. The helper makes sure that the robot sees a variety of animals in each group, so it learns how to recognize all of them properly. Once the robot has seen many different groups of pictures, you can be more confident that it can recognize animals it hasn't seen before.
There is also a second kind of splitter inside a special model called a decision tree. A decision tree is like a game of twenty questions: at every step it asks one yes-or-no question about the picture, like "does it have whiskers?" The splitter is the part of the program that decides which question to ask at each step, by trying lots of possible questions and picking the one that does the best job of separating cats from dogs from rabbits. Both kinds of splitters share the same goal: chopping data into smaller pieces in a smart way, so the robot can learn from the pieces and we can check it has really learned.