# Feature Cross

> Source: https://aiwiki.ai/wiki/feature_cross
> Updated: 2026-06-01
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Feature engineering](/wiki/feature_engineering)*

## introduction

A **feature cross** (also called a **crossed feature** or **feature interaction**) is a synthetic [feature](/wiki/feature) created by combining two or more existing features to capture their joint effect on a prediction. In [machine learning](/wiki/machine_learning), individual features sometimes fail to represent the patterns that emerge only when variables act together. Feature crossing addresses this gap by explicitly encoding interactions, giving models access to information that would otherwise remain hidden in the raw inputs.

Feature crosses are one of the most practical techniques in [feature engineering](/wiki/feature_engineering). They are especially valuable for [linear models](/wiki/linear_model) such as [logistic regression](/wiki/logistic_regression) and [linear regression](/wiki/linear_regression), which cannot learn interactions on their own. By adding crossed features, a linear model gains the ability to approximate nonlinear decision boundaries without increasing architectural complexity. The technique has roots in classical statistics, where interaction terms have been used in regression analysis for decades, but it gained renewed attention in the deep learning era through architectures like Wide and Deep,[1] [Deep and Cross Network](/wiki/deep_and_cross_network),[2] and [DeepFM](/wiki/deepfm)[6] that integrate explicit feature crosses with neural networks.

The term feature cross became widely adopted in industry partly through Google's [Machine Learning Crash Course](/wiki/machine_learning_crash_course), which devotes a section to the concept under the heading "Categorical data: Feature crosses."[10] Google's TensorFlow team also exposed the idea through library APIs such as `tf.feature_column.crossed_column` (now deprecated)[11] and `tf.keras.layers.HashedCrossing`,[12] cementing the vocabulary used by practitioners today.

## how feature crosses work

At a high level, a feature cross takes two or more source features and produces a new feature whose value depends on the specific combination of values from those sources. The exact mechanics differ depending on whether the source features are numerical or categorical.

### numerical feature crosses

For numerical (continuous) features, the simplest cross is the element-wise product. Given two features *A* and *B*, the crossed feature is:

```
C = A * B
```

For example, a dataset might contain the features `temperature` and `humidity`. Individually, neither variable may strongly predict rainfall. But their product, `temperature * humidity`, can capture the combined atmospheric condition that leads to rain. This multiplicative interaction is the most common form of numerical feature crossing.

Mathematically, if a linear model has the form:

```
y = w_0 + w_1 * x_1 + w_2 * x_2
```

then adding a numerical cross transforms it into:

```
y = w_0 + w_1 * x_1 + w_2 * x_2 + w_3 * (x_1 * x_2)
```

The new term `w_3 * (x_1 * x_2)` lets the model represent nonlinear behavior. The slope of `y` with respect to `x_1` now depends on the value of `x_2`, which is the defining property of an interaction effect in classical regression analysis.
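The interaction property can be checked numerically. The sketch below uses made-up weights (the defaults for `w0` through `w3` are illustrative assumptions, not fitted values) and measures the slope of `y` with respect to `x_1` at two different values of `x_2`.

```python
def predict(x1, x2, w0=1.0, w1=2.0, w2=0.5, w3=3.0):
    """Linear model with one numerical cross term (weights are illustrative)."""
    return w0 + w1 * x1 + w2 * x2 + w3 * (x1 * x2)


def slope_wrt_x1(x2, step=1.0):
    """Change in y per unit change in x1, holding x2 fixed."""
    return (predict(step, x2) - predict(0.0, x2)) / step


slope_at_x2_0 = slope_wrt_x1(x2=0.0)  # equals w1
slope_at_x2_2 = slope_wrt_x1(x2=2.0)  # equals w1 + w3 * 2
print(slope_at_x2_0, slope_at_x2_2)   # 2.0 8.0
```

Without the `w3` term, both slopes would equal `w1` regardless of `x2`.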

Higher-order crosses are also possible. A degree-3 cross of features *A*, *B*, and *C* would be `A * B * C`. These higher-order interactions grow in number very quickly. With *n* features and degree *d*, the count of possible crosses is on the order of *n^d*, so practitioners must be selective about which crosses to include.

### categorical feature crosses

For [categorical data](/wiki/categorical_data), a feature cross is the **Cartesian product** of the value sets. If feature *X* has values {red, blue} and feature *Y* has values {small, large}, the crossed feature *X x Y* has four possible values: {red_small, red_large, blue_small, blue_large}.

After the cross is formed, each combination is typically represented through [one-hot encoding](/wiki/one-hot_encoding). The resulting vector has one dimension per unique combination, with a 1 in the position corresponding to the observed pair and 0 everywhere else.

| Feature X | Feature Y | Crossed Feature (X x Y) |
|-----------|-----------|-------------------------|
| red       | small     | red_small               |
| red       | large     | red_large               |
| blue      | small     | blue_small              |
| blue      | large     | blue_large              |

This encoding lets the model learn a separate weight for every combination, giving it far more expressive power than treating each feature independently. A canonical worked example from Google's documentation crosses `city` (e.g., New York, Boston, Seattle) with `weather` (e.g., sunny, rainy, snowy).[10] The crossed feature contains values such as `Boston_rainy` or `Seattle_snowy`, and a linear model can learn that, say, `Seattle_snowy` strongly predicts a delivery delay even when neither `Seattle` nor `snowy` is very informative on its own.
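As a minimal sketch of the Cartesian product and one-hot steps described above, the following uses only the Python standard library. The helper name `one_hot_cross` and the `_` separator are illustrative choices, not part of any library.

```python
from itertools import product

colors = ["red", "blue"]
sizes = ["small", "large"]

# Vocabulary of the cross: the Cartesian product of the two value sets.
vocab = [f"{c}_{s}" for c, s in product(colors, sizes)]
index = {value: i for i, value in enumerate(vocab)}


def one_hot_cross(color, size):
    """Return the one-hot vector for the crossed value color_size."""
    vec = [0] * len(vocab)
    vec[index[f"{color}_{size}"]] = 1
    return vec


print(vocab)                           # ['red_small', 'red_large', 'blue_small', 'blue_large']
print(one_hot_cross("blue", "small"))  # [0, 0, 1, 0]
```

A linear model trained on these vectors holds one weight per position, which is what allows it to score each combination separately.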

### bucketized numerical crosses

A common hybrid approach discretizes continuous features into buckets before crossing them. For instance, latitude and longitude can each be split into bins (for example, 10-degree ranges), and then crossed to produce a grid of geographic cells. Google's Machine Learning Crash Course uses this latitude-longitude example to show how a simple linear model, when given bucketized location crosses, can learn location-specific patterns that would otherwise require a nonlinear model.[10]

Bucketization turns a continuous variable into a categorical one with a small number of bins. Once bucketized, the variable behaves like any other categorical feature for the purpose of crossing. The advantage over a raw numerical product is that the model can learn distinct weights for each cell in the grid, capturing sharp regional patterns that a smooth product term would average out.
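The latitude-longitude grid can be sketched in a few lines. This is an illustration under assumed settings (fixed 10-degree bins, and the hypothetical helpers `bucketize` and `geo_cell`), not the code used in the Crash Course.

```python
def bucketize(value, low, width, num_bins):
    """Map a continuous value to a bin index in [0, num_bins - 1]."""
    idx = int((value - low) // width)
    return min(max(idx, 0), num_bins - 1)  # clamp values that fall on the edges


LAT_BINS = 18  # 10-degree bins covering [-90, 90]
LON_BINS = 36  # 10-degree bins covering [-180, 180]


def geo_cell(lat, lon):
    """Cross the two bin indices into a single grid-cell ID."""
    lat_bin = bucketize(lat, -90.0, 10.0, LAT_BINS)
    lon_bin = bucketize(lon, -180.0, 10.0, LON_BINS)
    return lat_bin * LON_BINS + lon_bin


seattle = geo_cell(47.6, -122.3)
portland = geo_cell(45.5, -122.7)
boston = geo_cell(42.4, -71.1)
print(seattle, portland, boston)  # 473 473 478
```

With bins this coarse, Seattle and Portland share a cell while Boston gets its own, so a linear model would learn one weight for the first two and a different one for Boston. Finer bins give sharper regional patterns at the cost of more cells and sparser data per cell.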

## why feature crosses matter

### enabling nonlinearity in linear models

A [linear model](/wiki/linear_model) computes predictions as a weighted sum of input features. Without feature crosses, it can only represent additive relationships, in which the effect of feature A does not depend on the value of feature B. Many real-world problems violate this assumption. By adding the product A * B as a new feature, the model can capture the interaction effect, which is the contribution that appears only when both A and B take certain values simultaneously.
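A small experiment makes the point concrete. The sketch below trains a plain perceptron (chosen here only as a simple stand-in for a linear classifier) on an XOR-style target, once with the raw features A and B and once with the added cross A * B. The function names and the epoch count are illustrative.

```python
# XOR-style target: the label is 1 only when exactly one of A, B is 1.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def features(a, b, use_cross):
    """Bias term, the raw features, and optionally the cross A * B."""
    return [1.0, a, b] + ([a * b] if use_cross else [])


def predict(w, x):
    """Threshold a weighted sum, as a linear classifier does."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_and_score(use_cross, epochs=200):
    """Train a perceptron on DATA and return its accuracy on the same points."""
    w = [0.0] * (4 if use_cross else 3)
    for _ in range(epochs):
        for (a, b), y in DATA:
            x = features(a, b, use_cross)
            error = y - predict(w, x)  # 0 when correct, +1 or -1 on a mistake
            w = [wi + error * xi for wi, xi in zip(w, x)]
    hits = sum(predict(w, features(a, b, use_cross)) == y for (a, b), y in DATA)
    return hits / len(DATA)


acc_additive = train_and_score(use_cross=False)
acc_crossed = train_and_score(use_cross=True)
print(acc_additive, acc_crossed)
```

The additive model cannot classify all four points correctly because no single line separates them, whereas the crossed model reaches perfect accuracy: the target equals `A + B - 2 * (A * B)`, which is linear once the cross is available.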

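Consider a fraud detection system. The features `transaction_amount` and `time_of_day` may each be weak predictors of fraud on their own. But a high transaction amount at 3 AM is far more suspicious than the same amount at noon. Crossing `transaction_amount` with `time_of_day`, typically after bucketizing both (a raw product would treat the hour as a magnitude rather than a time slot), lets a linear model learn a separate weight for combinations such as a high amount in the early-morning hours.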

### memorization of specific patterns

Feature crosses excel at memorization: learning that a particular combination of inputs maps to a particular output. In recommendation systems, for example, a cross of `user_id x item_id` lets the model memorize which user-item pairs led to clicks. This memorization ability is central to Google's Wide and Deep architecture, discussed later in this article.[1]

Memorization and generalization are often described as two sides of the same coin in [recommender system](/wiki/recommender_system) design. Memorization captures co-occurrences that have been observed frequently in training data. Generalization extrapolates to combinations that were rare or unseen. Crossed features lean strongly toward memorization, while [embeddings](/wiki/embedding_vector) and dense neural layers lean toward generalization. Production systems usually need both.

### computational simplicity

Compared to deploying a [neural network](/wiki/neural_network) with multiple hidden layers, feature crosses provide interaction modeling at minimal computational cost. They require no gradient-based interaction discovery; the engineer specifies the cross, and the model learns the weights. Training and inference remain fast because the underlying model is still linear, so each prediction reduces to a weighted sum over the features that are active for that example.

This simplicity makes crosses attractive for latency-sensitive deployments such as ad-serving systems that must produce predictions in single-digit milliseconds. The underlying model, often [logistic regression](/wiki/logistic_regression) or a [follow-the-regularized-leader](/wiki/ftrl) variant, can be served on commodity CPUs without specialized hardware.

### interpretability

A cross feature has a clear semantic meaning. The weight on `country=Japan x device=mobile` directly answers the question "how much does the combination of Japan and a mobile device shift the predicted click score, beyond what the separate country and device features already contribute?" That kind of inspectability matters for debugging models, explaining decisions to stakeholders, and complying with regulatory requirements that demand model transparency.

Neural networks, by contrast, distribute their interaction knowledge across many weights and nonlinear activations, which makes it harder to point at a single number and say what it means. The trade-off is one of the reasons crosses persist in production stacks even when deep models are available.

## feature crosses for categorical data

Categorical feature crosses are particularly common in web-scale applications such as advertising, search ranking, and recommendation systems. In these domains, inputs are often high-cardinality categorical variables (for example, user IDs, product IDs, query terms, publisher domains, geographic regions).

### the sparsity challenge

Crossing two categorical features with *m* and *n* unique values produces up to *m x n* possible combinations. If a user ID feature has 1 million values and a product ID feature has 100,000 values, the full cross has 100 billion possible values. The resulting one-hot vector is extremely sparse: only one entry out of 100 billion is nonzero for each example.

| Feature A (cardinality) | Feature B (cardinality) | Crossed Feature       | Cross Cardinality   |
|-------------------------|-------------------------|-----------------------|---------------------|
| Country (200)           | Language (100)          | Country x Language    | 200 x 100 = 20,000  |
| User ID (1,000,000)     | Product ID (100,000)    | User ID x Product ID  | 100 billion         |
| Query token (1,000,000) | Country (200)           | Query token x Country | 200 million         |

While such sparse representations are feasible with sparse matrix libraries, the sheer dimensionality can slow training and inflate memory use. Most production systems either prune the cross to combinations that appear above a minimum count threshold, or apply the hashing trick described next.
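Minimum-count pruning can be sketched as follows. The training pairs, the threshold, and the convention of reserving index 0 for out-of-vocabulary crosses are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical training log of (country, language) pairs.
pairs = [
    ("US", "en"), ("US", "en"), ("US", "en"),
    ("US", "es"), ("US", "es"),
    ("JP", "ja"), ("JP", "ja"), ("JP", "ja"),
    ("JP", "en"),
]
MIN_COUNT = 2   # illustrative threshold; real systems tune this
OOV_INDEX = 0   # shared slot for rare or unseen crosses

counts = Counter(f"{country}_{lang}" for country, lang in pairs)
frequent = sorted(cross for cross, n in counts.items() if n >= MIN_COUNT)
vocab = {cross: i + 1 for i, cross in enumerate(frequent)}


def cross_index(country, lang):
    """Look up the index of a crossed value, falling back to the OOV slot."""
    return vocab.get(f"{country}_{lang}", OOV_INDEX)


print(vocab)                     # {'JP_ja': 1, 'US_en': 2, 'US_es': 3}
print(cross_index("JP", "en"))   # 0 (seen only once, so pruned)
```

The vocabulary size now depends on how many combinations actually occur often enough, not on the product of the two cardinalities.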

### hashing for high-dimensional crosses

The **hashing trick** (also called **feature hashing**, formalized by Weinberger et al. in 2009) addresses the dimensionality problem.[7][15] Instead of maintaining a full one-hot vector for every possible combination, the crossed value is run through a hash function and mapped to one of a fixed number of buckets:

```
bucket_index = hash(feature_A_value, feature_B_value) % hash_bucket_size
```

This reduces the feature space from potentially billions of dimensions to a manageable, fixed size (for example, 10,000 or 100,000 buckets). The trade-off is **hash collisions**: different feature combinations may map to the same bucket, introducing some noise.[15] In practice, a sufficiently large bucket size keeps collision rates low enough that model accuracy is largely preserved.
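A minimal sketch of the bucket computation, using `zlib.crc32` from the standard library as a stand-in hash (production systems generally use stronger hash functions). The separator byte and the bucket count are illustrative choices.

```python
import zlib

NUM_BUCKETS = 10_000  # illustrative size


def hashed_cross(value_a, value_b, num_buckets=NUM_BUCKETS):
    """Map a pair of categorical values to a bucket index."""
    # A separator avoids ambiguity such as ("ab", "c") versus ("a", "bc").
    key = f"{value_a}\x1f{value_b}".encode("utf-8")
    # crc32 gives the same result on every run, unlike Python's built-in
    # hash(), which is salted per process for strings by default.
    return zlib.crc32(key) % num_buckets


idx = hashed_cross("NYC", "mobile")
print(0 <= idx < NUM_BUCKETS)                # True
print(idx == hashed_cross("NYC", "mobile"))  # True: same pair, same bucket
```

No vocabulary needs to be stored: any pair, including one never seen in training, maps to a valid bucket, which is why the memory footprint stays fixed.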

The hashing trick was popularized by [Vowpal Wabbit](/wiki/vowpal_wabbit), an online learning system developed at Yahoo and later Microsoft Research. Vowpal Wabbit relied heavily on hashed features to train logistic regression models with billions of parameters using only modest amounts of memory. The same idea later became standard in TensorFlow, scikit-learn, and most production CTR-prediction stacks.

TensorFlow's `tf.feature_column.crossed_column` uses this approach internally.[11] The function signature was:

```python
tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)
```

Here, `keys` lists the features to cross, and `hash_bucket_size` controls the number of hash buckets. A common recommendation is to include the original (uncrossed) features alongside the cross so the model retains access to the individual signals.

### tensorflow keras hashedcrossing

In TensorFlow 2.x, the `tf.feature_column` API has been deprecated in favor of Keras preprocessing layers.[11] The modern equivalent of `crossed_column` is `tf.keras.layers.HashedCrossing`. The Keras layer supports two output modes: `"int"` (returns the bucket index as an integer) and `"one_hot"` (returns a one-hot vector).[12] It can be composed naturally inside a Keras `Model` definition without the older `feature_column` plumbing:

```python
import tensorflow as tf

cross_layer = tf.keras.layers.HashedCrossing(
    num_bins=20,
    output_mode="one_hot"
)

city = tf.constant(["NYC", "LA", "NYC"])
device = tf.constant(["mobile", "desktop", "mobile"])
crossed = cross_layer((city, device))
```

For users who want to keep using the higher-level `tf.feature_column` style without writing layers manually, TensorFlow recommends the `tf.keras.utils.FeatureSpace` utility, which provides a declarative wrapper around the underlying preprocessing layers.

## polynomial features as feature crosses

Polynomial feature generation is a closely related technique that is most commonly applied to numerical data. [Scikit-learn](/wiki/scikit_learn)'s `PolynomialFeatures` class produces all polynomial combinations of input features up to a specified degree:[13]

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False)
# Input: [a, b]
# Output: [1, a, b, a^2, ab, b^2]
```

Setting `interaction_only=True` removes the pure-power terms (like `a^2` and `b^2`), leaving only the cross terms:

```python
poly = PolynomialFeatures(degree=2, interaction_only=True)
# Input: [a, b]
# Output: [1, a, b, ab]
```

This `interaction_only` mode is essentially pure feature crossing for numerical data. The key difference between polynomial features and categorical feature crosses is that polynomial features multiply continuous values, while categorical crosses form the Cartesian product and one-hot encode the result.
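To make the expansion concrete, here is a standard-library sketch of what an interaction-only expansion computes for a single row. The function `interaction_features` is a hypothetical helper written for illustration. It is not scikit-learn code, although the column order is intended to follow the same degree-by-degree layout.

```python
from itertools import combinations
from math import prod


def interaction_features(row, degree=2):
    """Bias, raw features, and products of distinct features up to `degree`."""
    out = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations(range(len(row)), d):
            out.append(prod(row[i] for i in idx))
    return out


# Input [a, b, c] = [2, 3, 5] -> [1, a, b, c, ab, ac, bc]
expanded = interaction_features([2.0, 3.0, 5.0])
print(expanded)  # [1.0, 2.0, 3.0, 5.0, 6.0, 10.0, 15.0]
```

Because `combinations` never repeats an index, pure-power terms such as `a^2` are excluded by construction.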

| Method                      | Data Type   | Technique                        | Output Format         |
|-----------------------------|-------------|----------------------------------|-----------------------|
| Categorical feature cross   | Categorical | Cartesian product + one-hot      | Sparse binary vector  |
| Polynomial features         | Numerical   | Multiplication of feature values | Dense numeric vector  |
| Bucketized cross            | Numerical (discretized) | Bin + Cartesian product | Sparse binary vector  |
| Hashed cross                | Categorical (high cardinality) | Hash combined keys to fixed bins | Sparse binary vector |

A related class in scikit-learn, `SplineTransformer`, generates basis-function expansions that capture nonlinear effects of single features without producing explicit crosses. Practitioners often combine `SplineTransformer` for individual nonlinearities with `PolynomialFeatures(interaction_only=True)` for cross terms.

## variants of feature crosses

Different applications call for different cross representations. The table below summarizes the main variants encountered in practice.

| Variant                   | Best For                                | Pros                                       | Cons                                            |
|---------------------------|------------------------------------------|--------------------------------------------|-------------------------------------------------|
| Manual concatenated cross | Small cardinality categorical data       | Simple, exact, interpretable               | Explodes with high cardinality                  |
| Hashed cross              | High cardinality, web-scale CTR systems  | Fixed memory footprint, fast               | Hash collisions reduce signal slightly          |
| Bucketized numerical cross | Geographic or time-of-day patterns      | Captures sharp regional variation          | Requires choosing bucket boundaries             |
| Polynomial cross          | Continuous numerical features             | Smooth multiplicative interactions          | Can amplify outliers and ill-conditioned scales |
| Embedded cross (FM, FFM)  | Sparse high-cardinality with rare combos | Generalizes to unseen pairs via embeddings | More parameters and tuning required             |
| Learned cross (DCN, xDeepFM) | Deep models with many features          | Discovers crosses automatically            | Less interpretable than explicit crosses        |

## feature crosses vs. learned interactions

### neural networks

A [neural network](/wiki/neural_network) with one or more hidden layers can learn feature interactions automatically through its nonlinear activation functions. Each neuron in a hidden layer computes a weighted sum of inputs and passes it through a nonlinearity, allowing the network to represent arbitrarily complex interactions without manual engineering.

However, neural networks learn interactions implicitly. They require sufficient training data to discover useful combinations, and the learned interactions are embedded in the network weights, making them difficult to interpret. Feature crosses, by contrast, are explicit and interpretable: the engineer can inspect which combinations matter and assign clear semantic meaning to each one.

Two failure modes are particularly common. First, neural networks may struggle to learn very high-order or very rare interactions when training data is limited. Crossing the relevant features and feeding them in directly removes the burden of discovery. Second, even when a network has enough capacity, the optimizer may not find the relevant interaction without thousands of gradient updates. Explicit crosses act as a strong inductive bias that accelerates training.

### factorization machines

[Factorization machines](/wiki/factorization_machines) (FM), introduced by Steffen Rendle in 2010 at the IEEE International Conference on Data Mining, generalize the idea of feature crosses by replacing each crossed weight with a dot product of two low-dimensional embeddings.[4] For pairwise interactions, the FM scoring function is:

```
y = w_0 + sum_i (w_i * x_i) + sum_{i<j} (<v_i, v_j> * x_i * x_j)
```

where each feature `i` has both a scalar weight `w_i` and an embedding vector `v_i`, and `<v_i, v_j>` is the dot product. The crucial advantage is that FMs can estimate interaction strengths even for pairs that never co-occur in training, because the embeddings are learned across all pairs that share at least one component.[4] FMs were a step toward bridging the gap between sparse explicit crosses and dense learned representations.
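The scoring function can be written out directly. The sketch below gives a naive version that follows the formula term by term, plus the linear-time reformulation from the FM paper, in which the pairwise sum equals half of the squared sum minus the sum of squares, per embedding dimension. All parameter values are made up for illustration.

```python
def fm_score(x, w0, w, v):
    """Naive FM score: explicit sum over all pairs i < j."""
    n = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return linear + pairwise


def fm_score_fast(x, w0, w, v):
    """Same score in O(k * n) via the squared-sum identity."""
    n, k = len(x), len(v[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


x = [1.0, 0.0, 1.0]                           # features 0 and 2 are active
w0, w = 0.1, [0.2, 0.3, 0.4]                  # illustrative weights
v = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]     # one 2-d embedding per feature

naive = fm_score(x, w0, w, v)
fast = fm_score_fast(x, w0, w, v)
print(round(naive, 6), round(fast, 6))  # 1.7 1.7
```

The interaction weight for features 0 and 2 is never stored. It is computed as the dot product of `v[0]` and `v[2]`, and each embedding is trained on every pair its feature participates in, so a pair that never co-occurred still receives an estimate.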

**Field-aware Factorization Machines (FFM)**, proposed by Juan et al. in 2016, extend FM by giving each feature multiple embedding vectors, one per interacting field.[9] FFM won several Kaggle CTR competitions, including the Criteo Display Ad Challenge, before deep models took over.[9]

### deep and cross network (DCN)

The [Deep and Cross Network](/wiki/deep_and_cross_network) (DCN), introduced by Wang et al. in the 2017 ADKDD workshop paper "Deep & Cross Network for Ad Click Predictions" (arXiv:1708.05123), automates feature crossing within a neural architecture.[2] DCN adds a "cross network" alongside a standard deep network. Each layer of the cross network explicitly computes feature interactions of increasing polynomial degree.[2]

The cross network update rule for layer *l+1* is:

```
x_{l+1} = x_0 * x_l^T * w_l + b_l + x_l
```

where `x_0` is the input feature vector and `w_l`, `b_l` are learned parameters. After *L* cross layers, the network has implicitly enumerated polynomial cross terms up to degree *L+1*, but with parameter cost that grows linearly in *L* rather than exponentially.
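A plain-Python sketch of the update rule shows how the degree grows with each layer. The weights below are illustrative, and for brevity the same `w` and `b` are reused across both layers, whereas a real cross network learns separate parameters per layer.

```python
def cross_layer(x0, xl, w, b):
    """One cross layer: x0 * (xl . w) + b + xl."""
    # x_l^T w is a scalar, so the first term simply rescales x0.
    scale = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0_i * scale + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


x0 = [1.0, 2.0]
w = [1.0, 1.0]  # illustrative weights
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, w, b)  # contains degree-2 terms in the inputs
x2 = cross_layer(x0, x1, w, b)  # contains degree-3 terms in the inputs
print(x1, x2)  # [4.0, 8.0] [16.0, 32.0]
```

Each layer adds only one weight vector and one bias vector, which is why the parameter count grows linearly with depth.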

**DCN-V2**, published by Wang et al. at The Web Conference 2021 (arXiv:2008.13535), upgrades the cross network with a full weight matrix instead of a vector and adds a mixture-of-experts variant that exploits the low-rank structure of the learned cross matrix.[3] DCN-V2 has been deployed across multiple Google web-scale ranking systems and is reported to deliver significant offline and online metric improvements over the original DCN.[3]

### xdeepfm

The **eXtreme Deep Factorization Machine (xDeepFM)**, proposed by Lian et al. at KDD 2018 (arXiv:1803.05170), introduces a **Compressed Interaction Network (CIN)** that performs explicit feature crossing at the vector level rather than at the bit level.[5] CIN captures bounded-degree interactions explicitly while a parallel deep network captures arbitrary high-order interactions implicitly.[5] The combination is designed to inherit the strengths of factorization machines, [Wide and Deep](/wiki/wide_and_deep), and DCN while addressing some of their limitations.

### deepfm

**DeepFM**, introduced by Guo et al. at IJCAI 2017 (arXiv:1703.04247), is another widely deployed hybrid. It combines a factorization-machine component for low-order interactions with a [multi-layer perceptron](/wiki/multilayer_perceptron) for high-order interactions, with both components sharing the same input embeddings.[6] Compared with Wide and Deep, DeepFM removes the need for hand-crafted cross features by relying on the FM component to learn pairwise interactions automatically.[6]

### tree-based models

Decision trees and ensemble methods like [random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) (including [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost)) learn feature interactions inherently. A decision tree that splits first on feature A and then on feature B within a subtree has effectively learned the interaction A x B. Because of this, manually adding feature crosses to tree-based models usually provides less benefit than adding them to linear models.

That said, there are cases where explicit crosses still help tree models. If an important interaction involves more than two features, a tree may need several levels of splits to capture it, and providing the cross directly as a new feature can shorten the required tree depth. Crosses can also reduce the number of trees needed for the same predictive performance, which lowers inference latency in production.

| Model Type         | Learns Interactions Automatically? | Benefit of Manual Feature Crosses |
|--------------------|------------------------------------|-----------------------------------|
| [Linear model](/wiki/linear_model) | No                  | Very high                         |
| [Neural network](/wiki/neural_network) | Yes (implicitly) | Low to moderate                  |
| Decision tree / ensemble | Yes (via splits)              | Low (occasionally moderate)       |
| Factorization machine | Yes (pairwise via embeddings)    | Built-in                          |
| DCN / xDeepFM      | Yes (explicitly + implicitly)      | Built-in                          |

## google's wide and deep model

In 2016, Google published "Wide & Deep Learning for Recommender Systems" (Cheng et al., arXiv:1606.07792), describing an architecture that pairs a wide linear model with a deep neural network.[1] The **wide component** uses feature crosses to memorize specific user-item co-occurrences, while the **deep component** uses [embeddings](/wiki/embedding_vector) and hidden layers to generalize to unseen feature combinations.[1]

The wide side takes cross-product transformations of the form:

```
cross(feature_i, feature_j) = 1 if feature_i and feature_j are both active, 0 otherwise
```

These sparse crosses allow the model to memorize that, for example, "users who installed app X also installed app Y." The deep side embeds sparse features into low-dimensional dense vectors and feeds them through multiple layers to learn generalizable patterns.
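For binary features, the cross-product transformation amounts to a logical AND. A minimal sketch follows, where the feature strings and the helper `cross_product` are hypothetical names chosen for illustration.

```python
def cross_product(active, required):
    """Return 1 if every required binary feature is active, else 0."""
    return int(all(name in active for name in required))


# Hypothetical example in the style of an app-store setting.
example = {"installed_app=X", "impression_app=Y", "language=en"}

hit = cross_product(example, ["installed_app=X", "impression_app=Y"])
miss = cross_product(example, ["installed_app=X", "impression_app=Z"])
print(hit, miss)  # 1 0
```

The wide component then learns one weight per such cross, so a frequently observed pair can be memorized directly.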

Google deployed Wide and Deep on the [Google Play](/wiki/google_play) app store, which served over one billion active users and over one million apps at the time of the paper.[1] The system significantly improved app acquisition rates compared to wide-only and deep-only baselines while meeting strict training and serving latency requirements.[1] The architecture demonstrated that memorization (via feature crosses) and generalization (via deep networks) are complementary strengths that work best in combination.

Alongside the Wide and Deep paper, Google open-sourced an implementation through a high-level TensorFlow API, which made the architecture easy to reproduce and contributed to the broad adoption of explicit feature crosses in production deep learning pipelines through the late 2010s. Even after pure deep models became more common, the lessons from Wide and Deep influenced later architectures including DCN, DeepFM, and xDeepFM, all of which try to give a deep model some form of explicit cross signal.

## implementation examples

### manual crossing in python

The simplest way to create a feature cross in Python is to combine columns directly in a pandas DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, 35],
    'income': [50000, 80000, 65000]
})

# Numerical cross
df['age_x_income'] = df['age'] * df['income']
```

For categorical features:

```python
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'device': ['mobile', 'desktop', 'mobile']
})

# Categorical cross
df['city_x_device'] = df['city'] + '_' + df['device']
```

### scikit-learn polynomialfeatures

```python
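from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    # interaction_only=True adds pairwise products but no squared terms.
    ('poly', PolynomialFeatures(degree=2, interaction_only=True)),
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)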
```

### tensorflow legacy crossed column

```python
import tensorflow as tf

city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['NYC', 'LA', 'Chicago'])
device = tf.feature_column.categorical_column_with_vocabulary_list(
    'device', ['mobile', 'desktop', 'tablet'])

city_x_device = tf.feature_column.crossed_column(
    [city, device], hash_bucket_size=20)
```

Note that `tf.feature_column` has been deprecated.[11] New TensorFlow code should prefer the Keras-native version below.

### tensorflow keras hashedcrossing layer

```python
import tensorflow as tf

cross = tf.keras.layers.HashedCrossing(
    num_bins=1000,
    output_mode='one_hot'
)

city = tf.keras.Input(shape=(1,), dtype=tf.string, name='city')
device = tf.keras.Input(shape=(1,), dtype=tf.string, name='device')
crossed = cross((city, device))
```

### pytorch manual crossing with hashing

PyTorch does not ship a dedicated cross layer, but the same pattern is easy to write by hand:

```python
import torch
import torch.nn as nn

class HashedCross(nn.Module):
    def __init__(self, num_bins, embedding_dim):
        super().__init__()
        self.num_bins = num_bins
        self.embed = nn.Embedding(num_bins, embedding_dim)

    def forward(self, a, b):
        # a, b are LongTensor IDs
        combined = a * 1_000_003 + b  # mix the two IDs
        bucket = combined % self.num_bins
        return self.embed(bucket)
```

Many production systems use a similar pattern with a 64-bit hash function (such as MurmurHash3) before the modulus to keep the bucket distribution uniform across keys.

### deep and cross network in tensorflow recommenders

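TensorFlow Recommenders provides a ready-made `tfrs.layers.dcn.Cross` layer that implements a DCN cross layer.[14] When layers are stacked, each layer after the first should receive the original input `x_0` together with the previous layer's output, matching the update rule given earlier. A minimal model looks like:

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

inputs = tf.keras.Input(shape=(64,))
# First cross layer: x_0 is crossed with itself.
x1 = tfrs.layers.dcn.Cross()(inputs)
# Later layers take the original input x_0 plus the previous layer's output.
x2 = tfrs.layers.dcn.Cross()(inputs, x1)
output = tf.keras.layers.Dense(1)(x2)
model = tf.keras.Model(inputs, output)
```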

## practical considerations

### choosing which features to cross

Not all feature combinations are useful. Practitioners typically rely on:

- **Domain knowledge.** Understanding which variables interact in the real world (for example, location and time for ride pricing, or device and creative size for ad serving).
- **Statistical tests.** Checking for interaction effects in exploratory data analysis using techniques such as ANOVA, mutual information, or visualizing per-bin response curves.
- **Model-based importance.** Training a gradient-boosted tree first and inspecting the most common pairs of features that appear together in tree paths. Pairs that frequently co-split are good cross candidates for a downstream linear model.
- **Automated search.** Tools like AutoCross (Luo et al., KDD 2019, arXiv:1904.12857) use beam search over a tree-structured space of candidate crosses with successive mini-batch gradient descent and multi-granularity discretization to find effective high-order crosses without expert tuning.[8]

### avoiding feature explosion

Crossing many features at high order produces a combinatorial explosion. A dataset with 50 features has 1,225 pairwise crosses and over 19,000 three-way crosses. Strategies to manage this include:

- Limiting cross order to 2 (pairwise interactions).
- Using the hashing trick to cap dimensionality at a fixed bucket count.
- Applying L1 [regularization](/wiki/regularization) to drive irrelevant cross weights to zero.
- Selecting features with known domain relevance before crossing.
- Filtering crosses by minimum count in the training set, dropping combinations seen fewer than, say, 10 times.
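The counts quoted above follow from the binomial coefficient, and a quick check shows how fast they grow with cross order:

```python
from math import comb

n_features = 50
pairwise = comb(n_features, 2)
three_way = comb(n_features, 3)
four_way = comb(n_features, 4)
print(pairwise, three_way, four_way)  # 1225 19600 230300
```

These numbers count which features are crossed, not the resulting dimensions. For categorical crosses, each one additionally expands to the product of its source cardinalities.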

### sparsity and memory

Crossed categorical features are inherently sparse. Efficient storage with compressed sparse row (CSR) matrices and sparse-aware optimizers is essential for web-scale systems. TensorFlow, PyTorch, and scikit-learn all support sparse feature representations. For very wide models with billions of crossed features, distributed parameter servers or specialized embedding-table sharding (as in [TorchRec](/wiki/torchrec) or [DeepRec](/wiki/deeprec)) are commonly used.

### hashing collisions and quality

The hashing trick reduces dimensionality at the cost of collisions. Two strategies help mitigate the impact:

- **Increase the bucket size** until collision rates fall below a target threshold. For a uniform hash function and *n* unique keys mapped to *m* buckets, the expected number of colliding key pairs is roughly *n^2 / (2m)* by the birthday-bound approximation.[15]
- **Use multiple independent hashes.** The "two-hash" trick stores each value at two different buckets and averages or sums the predictions, which reduces variance from collisions at modest extra cost.

In practice, modern CTR systems use bucket sizes in the millions and report negligible accuracy loss compared with the full feature space.
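When sizing the hash space, the birthday-bound estimate is easy to compute. The sketch below assumes an idealized uniform hash, and the function names are illustrative.

```python
def expected_colliding_pairs(n_keys, n_buckets):
    """Expected number of key pairs that share a bucket under a uniform hash."""
    return n_keys * (n_keys - 1) / (2 * n_buckets)


def expected_occupied_buckets(n_keys, n_buckets):
    """Expected number of buckets that receive at least one key."""
    return n_buckets * (1 - (1 - 1 / n_buckets) ** n_keys)


n, m = 1_000, 1_000_000
print(expected_colliding_pairs(n, m))  # 0.4995, close to n^2 / (2m) = 0.5
print(expected_occupied_buckets(n, m))
```

Holding the key count fixed, doubling the number of buckets roughly halves the expected number of colliding pairs. That gives a simple rule for trading memory against collision noise.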

### regularization

Because crossed features dramatically inflate the parameter count, regularization is essential. The two most common choices are:

- **L2 regularization (weight decay).** Penalizes the squared magnitude of weights, leading to smooth, dense solutions.
- **L1 regularization (lasso).** Penalizes the absolute magnitude of weights, leading to sparse solutions where most cross weights are exactly zero. L1 is often preferred for crossed features because it acts as automatic feature selection.

Google's FTRL-Proximal optimizer, used in many production click-through-rate models, combines L1 and L2 regularization and is particularly well suited to wide linear models with hashed crosses.
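
The following is a minimal sketch of the per-coordinate FTRL-Proximal update as it is commonly presented for logistic regression with binary features. The class name, the default hyperparameters, and the string feature keys are assumptions for illustration; a production system would use a tuned, vectorized implementation.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on binary (e.g. hashed cross) features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-feature accumulated adjusted gradients
        self.n = defaultdict(float)  # per-feature sum of squared gradients

    def weight(self, feature):
        """Closed-form weight; the L1 term makes it exactly zero when |z| <= l1."""
        z = self.z.get(feature, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(feature, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2
        )

    def predict(self, features):
        """Predicted click probability for a set of active binary features."""
        logit = sum(self.weight(f) for f in features)
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, features, label):
        """One online step on a single example; features must be unique (pass a set)."""
        p = self.predict(features)
        g = p - label  # log-loss gradient for each active binary feature
        for f in features:
            sigma = (math.sqrt(self.n[f] + g * g) - math.sqrt(self.n[f])) / self.alpha
            self.z[f] += g - sigma * self.weight(f)
            self.n[f] += g * g
        return p

model = FTRLProximal()
initial = model.predict({"US_x_shoes", "mobile_x_evening"})
for _ in range(5):
    model.update({"US_x_shoes", "mobile_x_evening"}, label=1)
after = model.predict({"US_x_shoes", "mobile_x_evening"})
```

Because weights are derived lazily from `z` and `n`, a cross whose accumulated signal never exceeds the L1 threshold contributes exactly zero, which is the source of the sparsity.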

### feature drift and retraining

In production, the distribution of crossed features can shift quickly. A new product launch may introduce a previously unseen `category x country` combination, or a viral query may suddenly dominate the `query_term x location` cross. Regular retraining and online learning are common defenses. Some systems use streaming algorithms such as FTRL to update weights continuously as new data arrives.
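
One simple drift signal is the share of serving-time cross values that never appeared in training. The sketch below assumes crosses are represented as tuples; the function name and the toy values are hypothetical, and any alerting threshold would have to be chosen per system.

```python
def unseen_cross_rate(training_crosses, serving_crosses):
    """Fraction of serving-time cross values that never appeared in the training data."""
    known = set(training_crosses)
    serving = list(serving_crosses)
    if not serving:
        return 0.0
    unseen = sum(1 for cross in serving if cross not in known)
    return unseen / len(serving)

# Toy (category, country) crosses: a product launch adds a combination not seen in training.
training = [("toys", "US"), ("toys", "DE"), ("books", "US")]
serving = [("toys", "US"), ("toys", "BR"), ("books", "US"), ("toys", "BR")]
rate = unseen_cross_rate(training, serving)
```

A rising rate suggests that the model is serving more traffic on crosses for which it has no learned weight, which is one reasonable trigger for retraining.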

## use cases

### click-through rate (CTR) prediction

[CTR prediction](/wiki/ctr_prediction) for online advertising is the canonical home of feature crossing. Crossed features such as `user_demographic x ad_category`, `query_term x ad_creative`, or `time_of_day x device` capture the joint signals that make a particular ad relevant to a particular user in a particular context. Engineering teams behind major ad platforms, including Google, Microsoft (Bing), and Facebook, have published papers describing linear or hybrid models for CTR prediction.

### recommendation systems

[Recommender systems](/wiki/recommender_system) for streaming, e-commerce, and app stores rely on user-item interactions. Crossing user IDs with item categories, time of day, or device produces strong memorization signals, while embeddings on the deep side handle generalization to new users and items. The Google Play Wide and Deep deployment is the best-known example.[1]

### search ranking

Web and product search ranking models cross query terms with document attributes, user location, and session features. The cross `query_intent x document_topic` is particularly common for capturing topical relevance.

### fraud and risk

Financial fraud detection benefits from crosses that combine transactional and contextual variables. Examples include `merchant_category x transaction_amount`, `card_country x transaction_country`, and `device_id x time_of_day`. These crosses encode the kinds of "out of pattern" combinations that often precede fraud.

### geo-temporal modeling

Ride-sharing, food delivery, and weather forecasting all rely on bucketized latitude and longitude crossed with time of day or day of week. The Google Machine Learning Crash Course example using bucketized lat-lon crosses for a California housing model is a classic illustration.[10]
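
A bucketized latitude, longitude, and hour cross can be sketched as follows. The bucket counts, the token format, and the function names are illustrative assumptions and are not taken from the Crash Course example.

```python
def bucketize(value, low, high, num_buckets):
    """Map a value in [low, high] to an integer bucket in [0, num_buckets - 1]."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside [{low}, {high}]")
    index = int((value - low) / (high - low) * num_buckets)
    return min(index, num_buckets - 1)  # the upper edge falls into the last bucket

def geo_time_cross(lat, lon, hour, lat_buckets=10, lon_buckets=10):
    """Cross bucketized latitude and longitude with hour of day into a single token."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    lat_bucket = bucketize(lat, -90.0, 90.0, lat_buckets)
    lon_bucket = bucketize(lon, -180.0, 180.0, lon_buckets)
    return f"lat{lat_bucket}_x_lon{lon_bucket}_x_hour{hour}"

# A morning pickup near San Francisco (coordinates rounded for the example).
token = geo_time_cross(37.77, -122.42, hour=8)
```

Each token identifies one grid cell at one hour, so a linear model can learn a separate weight for, say, a downtown cell during the morning rush.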

## historical context

The idea of including interaction terms in regression models predates machine learning by many decades. R. A. Fisher studied factorial designs and interaction effects in agricultural experiments from the 1920s onward, and George Box extended the approach to industrial experiments in the 1950s. Modern feature crossing is the same idea, scaled up and packaged for high-cardinality categorical data.

The first widely cited use of large-scale hashed feature crosses in industry came with sponsored search systems at Google, Yahoo, and Microsoft in the mid-2000s. These systems trained logistic regression models with billions of parameters by hashing combinations of query terms, ad creatives, and user attributes into fixed-size bucket arrays. Vowpal Wabbit, released as open source by John Langford and collaborators around 2007, made the hashing trick accessible outside the largest tech companies.

The 2010 Factorization Machines paper by Steffen Rendle reframed crossed features as embedding lookups, opening the door to dense generalization.[4] The 2016 Wide and Deep paper from Google then combined explicit crosses with deep networks.[1] From 2017 onward, DCN, DeepFM, and xDeepFM each tried to automate the discovery of which crosses matter most.[2][6][5] By 2020 and 2021, with the rise of deeper recommenders trained on trillions of examples, the trend shifted toward learning crosses implicitly through embeddings and attention, but the explicit cross has not disappeared. It remains a competitive technique whenever interpretability, low latency, or data efficiency is a priority.

## limitations

Feature crosses are powerful but not universally applicable. Some of their key limitations are:

- **Cardinality explosion.** Without hashing or careful pruning, crossed feature vocabularies can grow into the billions, exceeding the memory available on most hardware.
- **Cold-start problem.** A linear model with explicit crosses cannot generalize to combinations that never appear in training data. The weight on an unseen cross is initialized to zero (or its prior) and stays there. Embedding-based methods such as FM and DCN do better here.[4][2]
- **Manual effort.** Choosing which features to cross is still partly an art, even with tools like AutoCross.[8] Domain knowledge remains the most reliable guide.
- **Brittleness.** Crossed features can amplify noise in high-cardinality categorical variables, especially when individual categories appear in only a few training examples.
- **Interaction with regularization.** Without strong L1 or L2 regularization, the additional parameters introduced by crosses can lead to severe overfitting.

## explain like i'm 5 (ELI5)

Imagine you are trying to figure out which ice cream flavors people like. Knowing someone's favorite color ("blue") or their age ("7") alone does not tell you much. But if you put those two facts together ("7-year-old who likes blue"), you might discover that kids around that age who like blue tend to love blueberry ice cream. A feature cross is just putting two clues together to make a stronger clue that helps the computer guess better.

It is like making a checklist of every interesting pair: red and small, red and large, blue and small, blue and large. Each combo gets its own little cubbyhole, and the computer learns which cubbyhole goes with which answer. When two clues are weak on their own but powerful together, the cross is what lets the computer notice the partnership.

## see also

- [Feature engineering](/wiki/feature_engineering)
- [Wide and Deep](/wiki/wide_and_deep)
- [Deep and Cross Network](/wiki/deep_and_cross_network)
- [Factorization machines](/wiki/factorization_machines)
- [DeepFM](/wiki/deepfm)
- [Hashing trick](/wiki/hashing_trick)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Categorical data](/wiki/categorical_data)
- [Recommender system](/wiki/recommender_system)
- [CTR prediction](/wiki/ctr_prediction)
- [TensorFlow](/wiki/tensorflow)
- [Scikit-learn](/wiki/scikit_learn)
- [Vowpal Wabbit](/wiki/vowpal_wabbit)
- [Embedding vector](/wiki/embedding_vector)
- [Logistic regression](/wiki/logistic_regression)

## references

1. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*, ACM. arXiv:1606.07792. https://arxiv.org/abs/1606.07792
2. Wang, R., Fu, B., Fu, G., and Wang, M. (2017). "Deep & Cross Network for Ad Click Predictions." *Proceedings of the ADKDD'17*, ACM. arXiv:1708.05123. https://arxiv.org/abs/1708.05123
3. Wang, R., Shivanna, R., Cheng, D. Z., Jain, S., Lin, D., Hong, L., and Chi, E. H. (2021). "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." *Proceedings of the Web Conference 2021*, ACM. arXiv:2008.13535. https://arxiv.org/abs/2008.13535
4. Rendle, S. (2010). "Factorization Machines." *Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM)*, IEEE. https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
5. Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., and Sun, G. (2018). "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. arXiv:1803.05170. https://arxiv.org/abs/1803.05170
6. Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." *Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI)*. arXiv:1703.04247. https://arxiv.org/abs/1703.04247
7. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). "Feature Hashing for Large Scale Multitask Learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*. arXiv:0902.2206. https://arxiv.org/abs/0902.2206
8. Luo, Y., Wang, M., Zhou, H., Yao, Q., Tu, W., Chen, Y., Yang, Q., and Dai, W. (2019). "AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications." *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. arXiv:1904.12857. https://arxiv.org/abs/1904.12857
9. Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. (2016). "Field-aware Factorization Machines for CTR Prediction." *Proceedings of the 10th ACM Conference on Recommender Systems (RecSys)*, ACM. https://dl.acm.org/doi/10.1145/2959100.2959134
10. Google Developers. "Categorical data: Feature crosses." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses
11. TensorFlow. "tf.feature_column.crossed_column (deprecated)." *TensorFlow API Docs*. https://www.tensorflow.org/api_docs/python/tf/feature_column/crossed_column
12. TensorFlow. "tf.keras.layers.HashedCrossing." *TensorFlow Keras API Docs*. https://www.tensorflow.org/api_docs/python/tf/keras/layers/HashedCrossing
13. scikit-learn. "PolynomialFeatures." *scikit-learn Documentation*. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
14. TensorFlow Recommenders. "Deep & Cross Network (DCN)." *TensorFlow Documentation*. https://www.tensorflow.org/recommenders/examples/dcn
15. Wikipedia contributors. "Feature hashing." *Wikipedia, The Free Encyclopedia*. https://en.wikipedia.org/wiki/Feature_hashing

