# In-set condition

> Source: https://aiwiki.ai/wiki/in-set_condition
> Updated: 2026-06-24
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Decision tree](/wiki/decision_tree), [Oblique condition](/wiki/oblique_condition), [Threshold for decision trees](/wiki/threshold_for_decision_trees), [Categorical feature](/wiki/categorical_feature)*

An **in-set condition** is a split [condition](/wiki/condition) used inside a [decision tree](/wiki/decision_tree) node that tests whether the value of a single [categorical feature](/wiki/categorical_feature) belongs to a specified subset of that feature's possible values. Google's Machine Learning Glossary defines it as "a condition that tests for the presence of one item in a set of items," and gives the example `house-style in [tudor, colonial, cape]` [1]. During inference, if the example's `house-style` value is one of those three, the condition evaluates to Yes and the example follows the positive branch; otherwise it goes the other way [1].

In-set conditions sit alongside the other two condition families that decision-forest libraries normally recognise: axis-aligned (numerical threshold) conditions and [oblique conditions](/wiki/oblique_condition) [2]. They are the standard tool for categorical splits because they encode group membership directly inside one tree node, which is far more compact and usually more accurate than splitting on a [one-hot encoded](/wiki/one-hot_encoding) version of the same feature. Google's glossary states plainly that "in-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features" [1].

## What does an in-set condition look like?

The formal shape of an in-set condition is

```
is feature_x in S?
```

where `feature_x` is a single categorical feature and `S` is a subset of that feature's vocabulary. Some examples:

- `species in {"cat", "dog", "bird"}` (from Google's decision-forests course) [2]
- `country in {France, Germany, Spain}`
- `userAgent in {"Mozilla/5.0", "InternetExplorer/10.0"}`
- `fruit in {apple, banana}`

If the example's value lies in `S`, the example is sent down one child; otherwise it is sent down the other. Because there are only two children, the in-set condition is binary, just like a numerical threshold split. A multiway categorical split (one branch per category) is a different construct used by [C4.5](/wiki/c4_5_algorithm) and is not an in-set condition [6].

## How do in-set conditions compare to other condition types?

Google's decision-forests documentation groups split conditions into three families [2]. An axis-aligned condition "involves only a single feature" with the example `area > 200`, while an oblique condition "involves more than one feature" with the example `height > width` [1]. The table below summarises how the three families differ.

| Condition type | Example | Features used | Feature kind | Typical libraries |
|---|---|---|---|---|
| Axis-aligned (numerical threshold) | `area > 200` | One | Numerical | All decision-tree libraries |
| In-set | `species in {cat, dog, bird}` | One | Categorical | [LightGBM](/wiki/lightgbm), [XGBoost](/wiki/xgboost) (>=1.5), [TensorFlow Decision Forests](/wiki/tensorflow_decision_forests), YDF, [CART](/wiki/cart_algorithm) |
| Oblique | `5 * num_legs + 2 * num_eyes >= 10` | Multiple | Numerical | YDF, TF-DF, oblique-tree research libraries |

An axis-aligned condition uses a [threshold for decision trees](/wiki/threshold_for_decision_trees) to compare a single numerical feature against a learned cut-point. An [oblique condition](/wiki/oblique_condition) draws a hyperplane that combines several numerical features at once, which can capture diagonal decision boundaries that axis-aligned trees need many splits to approximate [2]. An in-set condition is the categorical analogue of the axis-aligned condition: one feature, two children, but the test is set membership instead of inequality.

## Why do categorical features need a special condition type?

A categorical feature has no natural ordering. "Tudor" is not less than or greater than "colonial." Standard numerical splits do not apply directly, so trees need a different mechanism. There are three common approaches, each with different consequences for tree shape and accuracy.

| Approach | What it does | Trade-off |
|---|---|---|
| One-hot encoding plus axis-aligned splits | Convert each category to a 0/1 column, then use threshold conditions | Each split isolates a single category, leading to deep, imbalanced trees, especially for high-cardinality features |
| Multiway categorical split | One child per category | Used by C4.5 and ID3; leaves with very few samples, hurts variance |
| In-set condition (binary partition) | One child for categories in `S`, another for the rest | Compact, supports balanced trees, used by CART, LightGBM, TF-DF, and modern XGBoost |

Google's glossary notes that "in-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features" [1]. The intuition: with one-hot, the tree can only ever pick one category at a time per split, so it takes many levels to express "the colour is red, blue, or green." An in-set condition expresses the same logic in a single node. The scikit-learn documentation makes the same point about its own native categorical handling, noting that "one-hot encoding requires more tree depth to achieve equivalent splits" [12].

## How hard is it to find the best in-set split?

Finding the best in-set condition is harder than finding the best numerical threshold. With `K` distinct categories, the number of non-trivial bipartitions of the value set is `2^(K-1) - 1`. Searching this space exhaustively becomes intractable around `K = 30` or so.

Luckily there is a classic shortcut. Walter D. Fisher's 1958 paper "On Grouping for Maximum Homogeneity," published in the Journal of the American Statistical Association (volume 53, pages 789-798), proved that the optimal groups must be contiguous when the categories are placed in order [4]. For a binary classification target with a concave splitting criterion such as Gini or entropy, the optimal partition can therefore be found by:

1. Sorting the categories by the proportion of positive labels they each carry.
2. Treating the sorted categories as if they were ordered numerical values.
3. Picking the best prefix split.

This reduces the search from `O(2^K)` to `O(K log K)` and gives the same answer as the brute-force search [4][7]. Breiman, Friedman, Olshen, and Stone proved an analogous result for regression with squared error in the original CART book and generalised Fisher's contiguity result to arbitrary convex node-impurity functions [5]. Most modern gradient-boosting libraries use Fisher's trick or a gradient-based extension of it.

| Setting | Optimal-split complexity |
|---|---|
| Brute force | `O(2^(K-1))` partitions |
| Binary classification, Gini or entropy (Fisher 1958) | `O(K log K)` after sorting |
| Regression, squared error | `O(K log K)` after sorting by mean target |
| Multiclass, generic criteria | NP-hard in general; libraries use heuristics |

## Which libraries support in-set conditions?

Not every decision-tree library implements in-set conditions, and those that do use slightly different splitting strategies. The table below covers the libraries most practitioners encounter.

| Library | Native in-set support | Splitting strategy | Notes |
|---|---|---|---|
| [CART](/wiki/cart_algorithm) (Breiman et al., 1984) | Yes | Fisher-style sort for binary targets, exhaustive otherwise | The original native-categorical implementation [5] |
| [C4.5](/wiki/c4_5_algorithm) (Quinlan, 1993) | Multiway, not in-set | One branch per category | Different family of categorical split [6] |
| [scikit-learn](/wiki/scikit-learn) `DecisionTreeClassifier` / `Regressor` | No | Requires one-hot or ordinal encoding upstream | `HistGradientBoosting*` does support native categoricals [12] |
| [LightGBM](/wiki/lightgbm) | Yes, via `categorical_feature` | Sorts histogram by `sum_gradient / sum_hessian`, picks best partition; Fisher 1958 + gradient extension | `O(K log K)` [7][8] |
| [XGBoost](/wiki/xgboost) (>=1.5) | Yes, via `enable_categorical=True` with `tree_method='hist'` or `'approx'` | Sorts gradient histogram, then partitions; `max_cat_to_onehot` controls the cutoff | Experimental support added in 1.5; optimal partitioning added in 1.6 [9] |
| [CatBoost](/wiki/catboost) | Native categorical support, but uses ordered target statistics rather than in-set conditions | Encodes categoricals as numerical target stats, then uses standard threshold splits in symmetric (oblivious) trees | Different mechanism, similar end goal [10] |
| [TensorFlow Decision Forests](/wiki/tensorflow_decision_forests) and YDF | Yes, first-class | Three different categorical-split algorithms, including optimal Fisher-style | YDF documents "optimal categorical splits" as a headline feature [11] |

The scikit-learn case is worth flagging. The classic `DecisionTreeClassifier` and `RandomForestClassifier` cannot consume categorical features directly; you have to encode them yourself. For high-cardinality features this often means [one-hot encoding](/wiki/one-hot_encoding), which scales badly. Practitioners who care about categorical performance in scikit-learn often switch to `HistGradientBoostingClassifier`, which has native categorical support: at each split it "partitions the categories of such a feature into disjoint sets using a heuristic that sorts them with respect to the target" [12]. The newer gradient-boosting estimators, LightGBM, and XGBoost are all options here.

## A worked example

Suppose you are predicting whether a customer will click an ad and one of your features is `device_type` with four possible values: `phone`, `tablet`, `desktop`, `tv`. Click-through rates in your training data look like this:

| Device type | Examples | Clicks | Click rate |
|---|---|---|---|
| phone | 4000 | 800 | 20% |
| tablet | 1000 | 150 | 15% |
| desktop | 3000 | 300 | 10% |
| tv | 500 | 25 | 5% |

Using Fisher's trick, you sort the categories by click rate descending: `phone`, `tablet`, `desktop`, `tv`. The candidate in-set conditions then reduce to the three prefix splits:

| Candidate condition | Left (Yes) child | Right (No) child |
|---|---|---|
| `device_type in {phone}` | phone | tablet, desktop, tv |
| `device_type in {phone, tablet}` | phone, tablet | desktop, tv |
| `device_type in {phone, tablet, desktop}` | phone, tablet, desktop | tv |

You compute the Gini impurity reduction for each candidate and keep the best one. This is exactly three evaluations instead of `2^(4-1) - 1 = 7`. The saving looks small here but grows fast: with `K = 50` categories, `2^49 - 1` is more than `5 * 10^14`, while `K log K` is roughly 280 [7].

## When do in-set conditions help most?

In-set conditions tend to dominate one-hot baselines in two situations:

1. **High-cardinality features.** Things like ZIP code, product SKU, user ID, or city name can have thousands of distinct values. One-hot encoding produces thousands of nearly-empty columns and forces the tree to split them off one at a time, which wastes capacity and inflates depth. A single in-set condition can carve the value set into two meaningful groups in one node.
2. **Imbalanced category frequencies.** If most examples land in a handful of categories and the rest are rare, axis-aligned splits on one-hot columns push the tree toward extreme imbalance, since splitting off a rare category leaves a tiny child. In-set conditions can group rare categories together against frequent ones in a single test.

For low-cardinality features (say, `K <= 4`), the difference is usually small. XGBoost's `max_cat_to_onehot` parameter exists for exactly this reason: it "controls whether one-hot encoding or partitioning should be used for each feature," falling back to one-hot below the threshold, since the simpler representation is typically as good and slightly cheaper [9].

## Limitations and caveats

In-set conditions are powerful but not free.

- For some criteria (notably multiclass log-loss without simplifying assumptions), finding the optimal in-set partition is NP-hard, so libraries use greedy or gradient-based heuristics. The result is not always the global optimum.
- Native categorical handling can overfit on rare categories. Practitioners often set a minimum number of examples per category, or merge rare values into an "other" bucket, to keep splits stable.
- Models that use in-set conditions are slightly less portable. A serialised tree carries the categorical vocabularies it was trained with, so feeding the model an unseen category at inference time has to be handled explicitly. Both LightGBM and YDF treat unknown categories as out-of-dictionary and route them down a default branch [8][11].
- Some downstream tooling assumes purely numerical splits, which can complicate exporting an in-set tree to formats like ONNX or PMML.

## Why does this matter?

The choice of split-condition family has a direct effect on tabular model quality. On benchmarks dominated by categorical features such as Criteo, Avazu, or KDD Cup data, libraries with native in-set support consistently outperform pipelines that one-hot encode then run a categorical-blind tree learner. That is one reason LightGBM, XGBoost, and CatBoost have largely displaced scikit-learn's classic tree implementations in industry tabular workflows.

In-set conditions also matter for interpretability work. A condition like `country in {US, CA, MX}` is more readable than "country_US == 1 OR country_CA == 1 OR country_MX == 1" reconstructed across three tree levels, even if the two encode the same logic.

## See also

- [Decision tree](/wiki/decision_tree)
- [Oblique condition](/wiki/oblique_condition)
- [Threshold for decision trees](/wiki/threshold_for_decision_trees)
- [Categorical feature](/wiki/categorical_feature)
- [CART algorithm](/wiki/cart_algorithm)
- [C4.5 algorithm](/wiki/c4_5_algorithm)
- [Random forest](/wiki/random_forest)
- [Gradient boosting](/wiki/gradient_boosting)
- [LightGBM](/wiki/lightgbm)
- [XGBoost](/wiki/xgboost)
- [CatBoost](/wiki/catboost)
- [TensorFlow Decision Forests](/wiki/tensorflow_decision_forests)
- [scikit-learn](/wiki/scikit-learn)

## References

1. Google for Developers. "Machine Learning Glossary: Decision Forests, In-set condition." https://developers.google.com/machine-learning/glossary/df
2. Google for Developers. "Types of conditions." Decision Forests course. https://developers.google.com/machine-learning/decision-forests/conditions
3. Google for Developers. "Growing decision trees." Decision Forests course. https://developers.google.com/machine-learning/decision-forests/growing
4. Fisher, W. D. (1958). "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association, 53(284), 789-798. https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
5. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). "Classification and Regression Trees." Wadsworth.
6. Quinlan, J. R. (1993). "C4.5: Programs for Machine Learning." Morgan Kaufmann.
7. LightGBM documentation. "Optimal Split for Categorical Features." https://lightgbm.readthedocs.io/en/latest/Features.html
8. LightGBM documentation. "Advanced Topics: Categorical Feature Support." https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html
9. XGBoost documentation. "Categorical Data." https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html
10. CatBoost documentation. "Transforming categorical features to numerical features." https://catboost.ai/docs/en/concepts/algorithm-main-stages_cat-to-numberic
11. TensorFlow Decision Forests and YDF documentation. https://ydf.readthedocs.io/
12. scikit-learn documentation. "Categorical Feature Support in Gradient Boosting" and "Ensembles: Gradient boosting." https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_categorical.html