In-set condition
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v3 ยท 2,204 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v3 ยท 2,204 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Decision tree, Oblique condition, Threshold for decision trees, Categorical feature
An in-set condition is a split condition used inside a decision tree node that tests whether the value of a single categorical feature belongs to a specified subset of that feature's possible values. Google's Machine Learning Glossary defines it as "a condition that tests for the presence of one item in a set of items," and gives the example house-style in [tudor, colonial, cape] [1]. During inference, if the example's house-style value is one of those three, the condition evaluates to Yes and the example follows the positive branch; otherwise it goes the other way [1].
In-set conditions sit alongside the other two condition families that decision-forest libraries normally recognise: axis-aligned (numerical threshold) conditions and oblique conditions [2]. They are the standard tool for categorical splits because they encode group membership directly inside one tree node, which is far more compact and usually more accurate than splitting on a one-hot encoded version of the same feature. Google's glossary states plainly that "in-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features" [1].
The formal shape of an in-set condition is
is feature_x in S?
where feature_x is a single categorical feature and S is a subset of that feature's vocabulary. Some examples:
species in {"cat", "dog", "bird"} (from Google's decision-forests course) [2]country in {France, Germany, Spain}userAgent in {"Mozilla/5.0", "InternetExplorer/10.0"}fruit in {apple, banana}If the example's value lies in S, the example is sent down one child; otherwise it is sent down the other. Because there are only two children, the in-set condition is binary, just like a numerical threshold split. A multiway categorical split (one branch per category) is a different construct used by C4.5 and is not an in-set condition [6].
Google's decision-forests documentation groups split conditions into three families [2]. An axis-aligned condition "involves only a single feature" with the example area > 200, while an oblique condition "involves more than one feature" with the example height > width [1]. The table below summarises how the three families differ.
| Condition type | Example | Features used | Feature kind | Typical libraries |
|---|---|---|---|---|
| Axis-aligned (numerical threshold) | area > 200 | One | Numerical | All decision-tree libraries |
| In-set | species in {cat, dog, bird} | One | Categorical | LightGBM, XGBoost (>=1.5), TensorFlow Decision Forests, YDF, CART |
| Oblique | 5 * num_legs + 2 * num_eyes >= 10 | Multiple | Numerical | YDF, TF-DF, oblique-tree research libraries |
An axis-aligned condition uses a threshold for decision trees to compare a single numerical feature against a learned cut-point. An oblique condition draws a hyperplane that combines several numerical features at once, which can capture diagonal decision boundaries that axis-aligned trees need many splits to approximate [2]. An in-set condition is the categorical analogue of the axis-aligned condition: one feature, two children, but the test is set membership instead of inequality.
A categorical feature has no natural ordering. "Tudor" is not less than or greater than "colonial." Standard numerical splits do not apply directly, so trees need a different mechanism. There are three common approaches, each with different consequences for tree shape and accuracy.
| Approach | What it does | Trade-off |
|---|---|---|
| One-hot encoding plus axis-aligned splits | Convert each category to a 0/1 column, then use threshold conditions | Each split isolates a single category, leading to deep, imbalanced trees, especially for high-cardinality features |
| Multiway categorical split | One child per category | Used by C4.5 and ID3; leaves with very few samples, hurts variance |
| In-set condition (binary partition) | One child for categories in S, another for the rest | Compact, supports balanced trees, used by CART, LightGBM, TF-DF, and modern XGBoost |
Google's glossary notes that "in-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features" [1]. The intuition: with one-hot, the tree can only ever pick one category at a time per split, so it takes many levels to express "the colour is red, blue, or green." An in-set condition expresses the same logic in a single node. The scikit-learn documentation makes the same point about its own native categorical handling, noting that "one-hot encoding requires more tree depth to achieve equivalent splits" [12].
Finding the best in-set condition is harder than finding the best numerical threshold. With K distinct categories, the number of non-trivial bipartitions of the value set is 2^(K-1) - 1. Searching this space exhaustively becomes intractable around K = 30 or so.
Luckily there is a classic shortcut. Walter D. Fisher's 1958 paper "On Grouping for Maximum Homogeneity," published in the Journal of the American Statistical Association (volume 53, pages 789-798), proved that the optimal groups must be contiguous when the categories are placed in order [4]. For a binary classification target with a concave splitting criterion such as Gini or entropy, the optimal partition can therefore be found by:
This reduces the search from O(2^K) to O(K log K) and gives the same answer as the brute-force search [4][7]. Breiman, Friedman, Olshen, and Stone proved an analogous result for regression with squared error in the original CART book and generalised Fisher's contiguity result to arbitrary convex node-impurity functions [5]. Most modern gradient-boosting libraries use Fisher's trick or a gradient-based extension of it.
| Setting | Optimal-split complexity |
|---|---|
| Brute force | O(2^(K-1)) partitions |
| Binary classification, Gini or entropy (Fisher 1958) | O(K log K) after sorting |
| Regression, squared error | O(K log K) after sorting by mean target |
| Multiclass, generic criteria | NP-hard in general; libraries use heuristics |
Not every decision-tree library implements in-set conditions, and those that do use slightly different splitting strategies. The table below covers the libraries most practitioners encounter.
| Library | Native in-set support | Splitting strategy | Notes |
|---|---|---|---|
| CART (Breiman et al., 1984) | Yes | Fisher-style sort for binary targets, exhaustive otherwise | The original native-categorical implementation [5] |
| C4.5 (Quinlan, 1993) | Multiway, not in-set | One branch per category | Different family of categorical split [6] |
scikit-learn DecisionTreeClassifier / Regressor | No | Requires one-hot or ordinal encoding upstream | HistGradientBoosting* does support native categoricals [12] |
| LightGBM | Yes, via categorical_feature | Sorts histogram by sum_gradient / sum_hessian, picks best partition; Fisher 1958 + gradient extension | O(K log K) [7][8] |
| XGBoost (>=1.5) | Yes, via enable_categorical=True with tree_method='hist' or 'approx' | Sorts gradient histogram, then partitions; max_cat_to_onehot controls the cutoff | Experimental support added in 1.5; optimal partitioning added in 1.6 [9] |
| CatBoost | Native categorical support, but uses ordered target statistics rather than in-set conditions | Encodes categoricals as numerical target stats, then uses standard threshold splits in symmetric (oblivious) trees | Different mechanism, similar end goal [10] |
| TensorFlow Decision Forests and YDF | Yes, first-class | Three different categorical-split algorithms, including optimal Fisher-style | YDF documents "optimal categorical splits" as a headline feature [11] |
The scikit-learn case is worth flagging. The classic DecisionTreeClassifier and RandomForestClassifier cannot consume categorical features directly; you have to encode them yourself. For high-cardinality features this often means one-hot encoding, which scales badly. Practitioners who care about categorical performance in scikit-learn often switch to HistGradientBoostingClassifier, which has native categorical support: at each split it "partitions the categories of such a feature into disjoint sets using a heuristic that sorts them with respect to the target" [12]. The newer gradient-boosting estimators, LightGBM, and XGBoost are all options here.
Suppose you are predicting whether a customer will click an ad and one of your features is device_type with four possible values: phone, tablet, desktop, tv. Click-through rates in your training data look like this:
| Device type | Examples | Clicks | Click rate |
|---|---|---|---|
| phone | 4000 | 800 | 20% |
| tablet | 1000 | 150 | 15% |
| desktop | 3000 | 300 | 10% |
| tv | 500 | 25 | 5% |
Using Fisher's trick, you sort the categories by click rate descending: phone, tablet, desktop, tv. The candidate in-set conditions then reduce to the three prefix splits:
| Candidate condition | Left (Yes) child | Right (No) child |
|---|---|---|
device_type in {phone} | phone | tablet, desktop, tv |
device_type in {phone, tablet} | phone, tablet | desktop, tv |
device_type in {phone, tablet, desktop} | phone, tablet, desktop | tv |
You compute the Gini impurity reduction for each candidate and keep the best one. This is exactly three evaluations instead of 2^(4-1) - 1 = 7. The saving looks small here but grows fast: with K = 50 categories, 2^49 - 1 is more than 5 * 10^14, while K log K is roughly 280 [7].
In-set conditions tend to dominate one-hot baselines in two situations:
For low-cardinality features (say, K <= 4), the difference is usually small. XGBoost's max_cat_to_onehot parameter exists for exactly this reason: it "controls whether one-hot encoding or partitioning should be used for each feature," falling back to one-hot below the threshold, since the simpler representation is typically as good and slightly cheaper [9].
In-set conditions are powerful but not free.
The choice of split-condition family has a direct effect on tabular model quality. On benchmarks dominated by categorical features such as Criteo, Avazu, or KDD Cup data, libraries with native in-set support consistently outperform pipelines that one-hot encode then run a categorical-blind tree learner. That is one reason LightGBM, XGBoost, and CatBoost have largely displaced scikit-learn's classic tree implementations in industry tabular workflows.
In-set conditions also matter for interpretability work. A condition like country in {US, CA, MX} is more readable than "country_US == 1 OR country_CA == 1 OR country_MX == 1" reconstructed across three tree levels, even if the two encode the same logic.