Feature Cross

introduction

A feature cross (also called a crossed feature or feature interaction) is a synthetic feature created by combining two or more existing features to capture their joint effect on a prediction. In machine learning, individual features sometimes fail to represent the patterns that emerge only when variables act together. Feature crossing addresses this gap by explicitly encoding interactions, giving models access to information that would otherwise remain hidden in the raw inputs.

Feature crosses are one of the most practical techniques in feature engineering. They are especially valuable for linear models such as logistic regression and linear regression, which cannot learn interactions on their own. By adding crossed features, a linear model gains the ability to approximate nonlinear decision boundaries without increasing architectural complexity. The technique has roots in classical statistics, where interaction terms have been used in regression analysis for decades, but it gained renewed attention in the deep learning era through architectures like Wide and Deep, Deep and Cross Network, and DeepFM that integrate explicit feature crosses with neural networks.

The term feature cross became widely adopted in industry partly through Google's Machine Learning Crash Course, which devotes a section to the concept under the heading "Categorical data: Feature crosses." Google's TensorFlow team also exposed the idea through library APIs such as tf.feature_column.crossed_column (now deprecated) and tf.keras.layers.HashedCrossing, cementing the vocabulary used by practitioners today.

how feature crosses work

At a high level, a feature cross takes two or more source features and produces a new feature whose value depends on the specific combination of values from those sources. The exact mechanics differ depending on whether the source features are numerical or categorical.

numerical feature crosses

For numerical (continuous) features, the simplest cross is the element-wise product. Given two features A and B, the crossed feature is:

C = A * B

For example, a dataset might contain the features temperature and humidity. Individually, neither variable may strongly predict rainfall. But their product, temperature * humidity, can capture the combined atmospheric condition that leads to rain. This multiplicative interaction is the most common form of numerical feature crossing.

Mathematically, if a linear model has the form:

y = w_0 + w_1 * x_1 + w_2 * x_2

then adding a numerical cross transforms it into:

y = w_0 + w_1 * x_1 + w_2 * x_2 + w_3 * (x_1 * x_2)

The new term w_3 * (x_1 * x_2) lets the model represent nonlinear behavior. The slope of y with respect to x_1 now depends on the value of x_2, which is the defining property of an interaction effect in classical regression analysis.
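
The effect can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using synthetic data invented for illustration: the target is generated purely from the interaction x_1 * x_2, so a least-squares fit without the cross term leaves a large error, while the fit with the cross column recovers w_3 close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=200)
x2 = rng.uniform(-1.0, 1.0, size=200)
y = x1 * x2  # synthetic target driven purely by the interaction

ones = np.ones_like(x1)
X_plain = np.column_stack([ones, x1, x2])            # w_0, w_1, w_2
X_cross = np.column_stack([ones, x1, x2, x1 * x2])   # adds w_3 * (x_1 * x_2)

# Ordinary least squares for both design matrices
w_plain, *_ = np.linalg.lstsq(X_plain, y, rcond=None)
w_cross, *_ = np.linalg.lstsq(X_cross, y, rcond=None)

mse_plain = float(np.mean((X_plain @ w_plain - y) ** 2))
mse_cross = float(np.mean((X_cross @ w_cross - y) ** 2))
```

Comparing mse_plain with mse_cross shows how much of the signal is reachable only through the crossed column.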

Higher-order crosses are also possible. A degree-3 cross of features A, B, and C would be A * B * C. These higher-order interactions grow in number very quickly. With n features and degree d, the count of possible crosses is on the order of n^d, so practitioners must be selective about which crosses to include.

categorical feature crosses

For categorical data, a feature cross is the Cartesian product of the value sets. If feature X has values {red, blue} and feature Y has values {small, large}, the crossed feature X x Y has four possible values: {red_small, red_large, blue_small, blue_large}.

After the cross is formed, each combination is typically represented through one-hot encoding. The resulting vector has one dimension per unique combination, with a 1 in the position corresponding to the observed pair and 0 everywhere else.

Feature X	Feature Y	Crossed Feature (X x Y)
red	small	red_small
red	large	red_large
blue	small	blue_small
blue	large	blue_large

This encoding lets the model learn a separate weight for every combination, giving it far more expressive power than treating each feature independently. A canonical worked example from Google's documentation crosses city (e.g., New York, Boston, Seattle) with weather (e.g., sunny, rainy, snowy). The crossed feature contains values such as Boston_rainy or Seattle_snowy, and a linear model can learn that, say, Seattle_snowy strongly predicts a delivery delay even when neither Seattle nor snowy is very informative on its own.
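
The Cartesian product and one-hot step described above can be sketched in plain Python. The helper name one_hot_cross and the fixed vocabulary order are assumptions made for this illustration, not part of any library.

```python
from itertools import product

cities = ["New York", "Boston", "Seattle"]
weather = ["sunny", "rainy", "snowy"]

# Vocabulary of the cross: Cartesian product of the two value sets.
vocab = [f"{c}_{w}" for c, w in product(cities, weather)]
index = {value: i for i, value in enumerate(vocab)}

def one_hot_cross(city, sky):
    """Return the one-hot vector for a single (city, weather) pair."""
    vec = [0] * len(vocab)
    vec[index[f"{city}_{sky}"]] = 1
    return vec

encoded = one_hot_cross("Seattle", "snowy")
```

With three cities and three weather values the vector has nine positions, and a linear model would hold one weight per position.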

bucketized numerical crosses

A common hybrid approach discretizes continuous features into buckets before crossing them. For instance, latitude and longitude can each be split into bins (for example, 10-degree ranges), and then crossed to produce a grid of geographic cells. Google's Machine Learning Crash Course uses this latitude-longitude example to show how a simple linear model, when given bucketized location crosses, can learn location-specific patterns that would otherwise require a nonlinear model.

Bucketization turns a continuous variable into a categorical one with a small number of bins. Once bucketized, the variable behaves like any other categorical feature for the purpose of crossing. The advantage over a raw numerical product is that the model can learn distinct weights for each cell in the grid, capturing sharp regional patterns that a smooth product term would average out.
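
A minimal sketch of a bucketized latitude-longitude cross, using only the standard library. The 10-degree bin edges and the function names are illustrative assumptions; real systems usually choose boundaries from the data.

```python
from bisect import bisect_right

# Hypothetical 10-degree bin edges (interior boundaries only).
LAT_EDGES = list(range(-80, 90, 10))    # -80, -70, ..., 80  -> 18 bins
LON_EDGES = list(range(-170, 180, 10))  # -170, ..., 170     -> 36 bins

def bucketize(value, edges):
    """Index of the bin containing value; bin 0 is everything below edges[0]."""
    return bisect_right(edges, value)

def geo_cell(lat, lon):
    """Cross the two bucket indices into a single grid-cell ID."""
    n_lon_bins = len(LON_EDGES) + 1
    return bucketize(lat, LAT_EDGES) * n_lon_bins + bucketize(lon, LON_EDGES)

san_francisco = geo_cell(37.77, -122.42)
los_angeles = geo_cell(34.05, -118.24)
seattle = geo_cell(47.61, -122.33)
```

Each grid cell ID can then be one-hot encoded or hashed like any other categorical value, so the model learns one weight per cell.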

why feature crosses matter

enabling nonlinearity in linear models

A linear model computes predictions as a weighted sum of input features. Without feature crosses, it can only represent additive relationships, in which the effect of feature A is independent of the value of feature B. Many real-world problems violate this assumption. By adding the product A * B as a new feature, the model can capture the interaction effect, which is the contribution that appears only when both A and B take certain values simultaneously.

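Consider a fraud detection system. The features transaction_amount and time_of_day may each be weak predictors of fraud on their own. But a high transaction amount at 3 AM is far more suspicious than the same amount at noon. Crossing a bucketized transaction_amount with a bucketized time_of_day lets a linear model learn this nuance. A raw numeric product of the two would not work well here, because the hour value is not monotonically related to risk: 3 AM is a smaller number than noon.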

memorization of specific patterns

Feature crosses excel at memorization: learning that a particular combination of inputs maps to a particular output. In recommendation systems, for example, a cross of user_id x item_id lets the model memorize which user-item pairs led to clicks. This memorization ability is central to Google's Wide and Deep architecture, discussed later in this article.

Memorization and generalization are often described as two sides of the same coin in recommender system design. Memorization captures co-occurrences that have been observed frequently in training data. Generalization extrapolates to combinations that were rare or unseen. Crossed features lean strongly toward memorization, while embeddings and dense neural layers lean toward generalization. Production systems usually need both.

computational simplicity

Compared to deploying a neural network with multiple hidden layers, feature crosses provide interaction modeling at minimal computational cost. They require no gradient-based interaction discovery; the engineer specifies the cross, and the model learns the weights. Training and inference remain fast because the underlying model is still linear, so inference reduces to a single dot product between the weight vector and the (typically sparse) feature vector.

This simplicity makes crosses attractive for latency-sensitive deployments such as ad-serving systems that must produce predictions in single-digit milliseconds. The underlying model, often logistic regression or a follow-the-regularized-leader variant, can be served on commodity CPUs without specialized hardware.

interpretability

A cross feature has a clear semantic meaning. The weight on country=Japan x device=mobile directly answers the question "how much does being on a mobile device in Japan increase the predicted click rate?" That kind of inspectability matters for debugging models, explaining decisions to stakeholders, and complying with regulatory requirements that demand model transparency.

Neural networks, by contrast, distribute their interaction knowledge across many weights and nonlinear activations, which makes it harder to point at a single number and say what it means. The trade-off is one of the reasons crosses persist in production stacks even when deep models are available.

feature crosses for categorical data

Categorical feature crosses are particularly common in web-scale applications such as advertising, search ranking, and recommendation systems. In these domains, inputs are often high-cardinality categorical variables (for example, user IDs, product IDs, query terms, publisher domains, geographic regions).

the sparsity challenge

Crossing two categorical features with m and n unique values produces up to m x n possible combinations. If a user ID feature has 1 million values and a product ID feature has 100,000 values, the full cross has 100 billion possible values. The resulting one-hot vector is extremely sparse: only one entry out of 100 billion is nonzero for each example.

Crossed Feature	Source Cardinalities	Crossed Cardinality
Country x Language	200 x 100	20,000
User ID x Product ID	1,000,000 x 100,000	100 billion
Query token x Country	1,000,000 x 200	200 million

While such sparse representations are feasible with sparse matrix libraries, the sheer dimensionality can slow training and inflate memory use. Most production systems either prune the cross to combinations that appear above a minimum count threshold, or apply the hashing trick described next.
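
Minimum-count pruning can be sketched as follows. The toy pairs, the threshold MIN_COUNT, and the shared "__other__" token are assumptions chosen for illustration.

```python
from collections import Counter

# Toy training pairs, invented for illustration.
pairs = [
    ("u1", "p1"), ("u1", "p1"), ("u1", "p1"),
    ("u2", "p1"), ("u2", "p2"),
    ("u3", "p3"), ("u3", "p3"),
]
MIN_COUNT = 2      # hypothetical threshold
OOV = "__other__"  # shared token for rare combinations

# Count how often each crossed value appears in training data.
counts = Counter(f"{u}_x_{p}" for u, p in pairs)
kept = {key for key, n in counts.items() if n >= MIN_COUNT}

def cross_token(user, product):
    """Return the crossed value if it is frequent enough, else the shared token."""
    key = f"{user}_x_{product}"
    return key if key in kept else OOV
```

Only the crosses in kept receive their own weight; everything else shares a single weight, which bounds the vocabulary by the number of frequent combinations rather than by m x n.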

hashing for high-dimensional crosses

The hashing trick (also called feature hashing, formalized by Weinberger et al. in 2009) addresses the dimensionality problem. Instead of maintaining a full one-hot vector for every possible combination, the crossed value is run through a hash function and mapped to one of a fixed number of buckets:

bucket_index = hash(feature_A_value, feature_B_value) % hash_bucket_size

This reduces the feature space from potentially billions of dimensions to a manageable, fixed size (for example, 10,000 or 100,000 buckets). The trade-off is hash collisions: different feature combinations may map to the same bucket, introducing some noise. In practice, a sufficiently large bucket size keeps collision rates low enough that model accuracy is largely preserved.
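
A minimal sketch of the bucket computation, using only the standard library. The function name hashed_cross, the choice of BLAKE2b, and the separator byte are assumptions for this example. Python's built-in hash() is avoided because string hashes are salted per process and would not be stable between training and serving.

```python
import hashlib

def hashed_cross(a, b, hash_bucket_size):
    """Map a pair of feature values to a bucket index in [0, hash_bucket_size)."""
    # The unit-separator character keeps ("ab", "c") and ("a", "bc")
    # from being joined into the same key.
    key = f"{a}\x1f{b}".encode("utf-8")
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % hash_bucket_size

bucket = hashed_cross("NYC", "mobile", 10_000)
```

The same pair always lands in the same bucket, so no vocabulary needs to be stored; only the weight array of length hash_bucket_size is kept.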

The hashing trick was popularized by Vowpal Wabbit, an online learning system developed at Yahoo and later Microsoft Research. Vowpal Wabbit relied heavily on hashed features to train logistic regression models with billions of parameters using only modest amounts of memory. The same idea later became standard in TensorFlow, scikit-learn, and most production CTR-prediction stacks.

TensorFlow's tf.feature_column.crossed_column uses this approach internally. The function signature was:

tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)

Here, keys lists the features to cross, and hash_bucket_size controls the number of hash buckets. A common recommendation is to include the original (uncrossed) features alongside the cross so the model retains access to the individual signals.

tensorflow keras hashedcrossing

In TensorFlow 2.x, the tf.feature_column API has been deprecated in favor of Keras preprocessing layers. The modern equivalent of crossed_column is tf.keras.layers.HashedCrossing. The Keras layer supports two output modes: "int" (returns the bucket index as an integer) and "one_hot" (returns a one-hot vector). It can be composed naturally inside a Keras Model definition without the older feature_column plumbing:

import tensorflow as tf

cross_layer = tf.keras.layers.HashedCrossing(
    num_bins=20,
    output_mode="one_hot"
)

city = tf.constant(["NYC", "LA", "NYC"])
device = tf.constant(["mobile", "desktop", "mobile"])
crossed = cross_layer((city, device))

For users who want to keep a declarative, feature_column-like style without writing layers manually, TensorFlow offers the tf.keras.utils.FeatureSpace utility, which wraps the underlying preprocessing layers.

polynomial features as feature crosses

Polynomial feature generation is a closely related technique that is most commonly applied to numerical data. Scikit-learn's PolynomialFeatures class produces all polynomial combinations of input features up to a specified degree:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False)
# Input: [a, b]
# Output: [1, a, b, a^2, ab, b^2]

Setting interaction_only=True removes the pure-power terms (like a^2 and b^2), leaving only the cross terms:

poly = PolynomialFeatures(degree=2, interaction_only=True)
# Input: [a, b]
# Output: [1, a, b, ab]

This interaction_only mode is essentially pure feature crossing for numerical data. The key difference between polynomial features and categorical feature crosses is that polynomial features multiply continuous values, while categorical crosses form the Cartesian product and one-hot encode the result.
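
To make the expansion concrete, here is a hand-rolled sketch of an interaction-only expansion using the standard library. It is not scikit-learn's implementation; the function name interaction_terms is an assumption, and the output order is chosen to match the [1, a, b, ab] layout shown above.

```python
from itertools import combinations
from math import prod

def interaction_terms(row, degree=2):
    """Bias, original features, then products of distinct features up to degree."""
    out = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations(range(len(row)), d):
            out.append(prod(row[i] for i in idx))
    return out

expanded = interaction_terms([2.0, 3.0])     # [1, a, b, ab]
three = interaction_terms([2.0, 3.0, 5.0])   # [1, a, b, c, ab, ac, bc]
```

Because each product uses distinct feature indices, no pure-power terms such as a^2 ever appear.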

Method	Data Type	Technique	Output Format
Categorical feature cross	Categorical	Cartesian product + one-hot	Sparse binary vector
Polynomial features	Numerical	Multiplication of feature values	Dense numeric vector
Bucketized cross	Numerical (discretized)	Bin + Cartesian product	Sparse binary vector
Hashed cross	Categorical (high cardinality)	Hash combined keys to fixed bins	Sparse binary vector

A related class in scikit-learn, SplineTransformer, generates basis-function expansions that capture nonlinear effects of single features without producing explicit crosses. Practitioners often combine SplineTransformer for individual nonlinearities with PolynomialFeatures(interaction_only=True) for cross terms.

variants of feature crosses

Different applications call for different cross representations. The table below summarizes the main variants encountered in practice.

Variant	Best For	Pros	Cons
Manual concatenated cross	Small cardinality categorical data	Simple, exact, interpretable	Explodes with high cardinality
Hashed cross	High cardinality, web-scale CTR systems	Fixed memory footprint, fast	Hash collisions reduce signal slightly
Bucketized numerical cross	Geographic or time-of-day patterns	Captures sharp regional variation	Requires choosing bucket boundaries
Polynomial cross	Continuous numerical features	Smooth multiplicative interactions	Can amplify outliers and ill-conditioned scales
Embedded cross (FM, FFM)	Sparse high-cardinality with rare combos	Generalizes to unseen pairs via embeddings	More parameters and tuning required
Learned cross (DCN, xDeepFM)	Deep models with many features	Discovers crosses automatically	Less interpretable than explicit crosses

feature crosses vs. learned interactions

neural networks

A neural network with one or more hidden layers can learn feature interactions automatically through its nonlinear activation functions. Each neuron in a hidden layer computes a weighted sum of inputs and passes it through a nonlinearity, allowing the network to represent arbitrarily complex interactions without manual engineering.

However, neural networks learn interactions implicitly. They require sufficient training data to discover useful combinations, and the learned interactions are embedded in the network weights, making them difficult to interpret. Feature crosses, by contrast, are explicit and interpretable: the engineer can inspect which combinations matter and assign clear semantic meaning to each one.

Two failure modes are particularly common. First, neural networks may struggle to learn very high-order or very rare interactions when training data is limited. Crossing the relevant features and feeding them in directly removes the burden of discovery. Second, even when a network has enough capacity, the optimizer may not find the relevant interaction without thousands of gradient updates. Explicit crosses act as a strong inductive bias that accelerates training.

factorization machines

Factorization machines (FM), introduced by Steffen Rendle in 2010 at the IEEE International Conference on Data Mining, generalize the idea of feature crosses by replacing each crossed weight with a dot product of two low-dimensional embeddings. For pairwise interactions, the FM scoring function is:

y = w_0 + sum_i (w_i * x_i) + sum_{i<j} (<v_i, v_j> * x_i * x_j)

where each feature i has both a scalar weight w_i and an embedding vector v_i, and <v_i, v_j> is the dot product. The crucial advantage is that FMs can estimate interaction strengths even for pairs that never co-occur in training, because each feature's embedding is trained from every pair that feature participates in, and the interaction weight for a new pair is computed from those shared embeddings. FMs were a step toward bridging the gap between sparse explicit crosses and dense learned representations.
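
The scoring function can be sketched directly in Python. The first function follows the formula above term by term; the second uses the linear-time reformulation from Rendle's paper, 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]. The function names and the small example values are assumptions for illustration.

```python
def fm_score(x, w0, w, V):
    """Naive FM score: explicit sum over all pairs i < j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_score_fast(x, w0, w, V):
    """Same score computed in O(k * n) instead of O(k * n^2)."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

x = [1.0, 0.0, 1.0]                          # features 0 and 2 are active
w0, w = 0.1, [0.2, -0.3, 0.4]
V = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.5]]    # one 2-d embedding per feature
slow = fm_score(x, w0, w, V)
fast = fm_score_fast(x, w0, w, V)
```

Both functions return the same value; the second form is what makes FMs practical on sparse inputs with many features.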

Field-aware Factorization Machines (FFM), proposed by Juan et al. in 2016, extend FM by giving each feature multiple embedding vectors, one per interacting field. FFM won several Kaggle CTR competitions, including the Criteo Display Ad Challenge, before deep models took over.

deep and cross network (DCN)

The Deep and Cross Network (DCN), introduced by Wang et al. in the 2017 ADKDD workshop paper "Deep & Cross Network for Ad Click Predictions" (arXiv:1708.05123), automates feature crossing within a neural architecture. DCN adds a "cross network" alongside a standard deep network. Each layer of the cross network explicitly computes feature interactions of increasing polynomial degree.

The cross network update rule for layer l+1 is:

x_{l+1} = x_0 * x_l^T * w_l + b_l + x_l

where x_0 is the input feature vector and w_l, b_l are learned parameters. After L cross layers, the network has implicitly enumerated polynomial cross terms up to degree L+1, but with parameter cost that grows linearly in L rather than exponentially.
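
A single cross layer following the update rule above can be sketched in a few lines. The weights are illustrative rather than trained, and the function name cross_layer is an assumption for this example.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x_0 * (x_l^T w_l) + b_l + x_l."""
    scalar = sum(xi * wi for xi, wi in zip(xl, w))   # x_l^T w_l is a scalar
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

x0 = [1.0, 2.0]
w = [0.5, -1.0]   # illustrative weights, not trained values
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, w, b)   # contains degree-2 terms in x0
x2 = cross_layer(x0, x1, w, b)   # contains degree-3 terms in x0
```

Because x_l^T w_l collapses to a scalar, each layer adds only one weight vector and one bias vector, which is why the parameter count grows linearly with depth.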

DCN-V2, published by Wang et al. at The Web Conference 2021 (arXiv:2008.13535), upgrades the cross network with a full weight matrix instead of a vector and adds a mixture-of-experts variant that exploits the low-rank structure of the learned cross matrix. DCN-V2 has been deployed across multiple Google web-scale ranking systems and is reported to deliver significant offline and online metric improvements over the original DCN.

xdeepfm

The eXtreme Deep Factorization Machine (xDeepFM), proposed by Lian et al. at KDD 2018 (arXiv:1803.05170), introduces a Compressed Interaction Network (CIN) that performs explicit feature crossing at the vector level rather than at the bit level. CIN captures bounded-degree interactions explicitly while a parallel deep network captures arbitrary high-order interactions implicitly. The combination is designed to inherit the strengths of factorization machines, Wide and Deep, and DCN while addressing some of their limitations.

deepfm

DeepFM, introduced by Guo et al. at IJCAI 2017 (arXiv:1703.04247), is another widely deployed hybrid. It combines a factorization-machine component for low-order interactions with a multi-layer perceptron for high-order interactions, with both components sharing the same input embeddings. Compared with Wide and Deep, DeepFM removes the need for hand-crafted cross features by relying on the FM component to learn pairwise interactions automatically.

tree-based models

Decision trees and ensemble methods like random forests and gradient boosting (including XGBoost, LightGBM, and CatBoost) learn feature interactions inherently. A decision tree that splits first on feature A and then on feature B within a subtree has effectively learned the interaction A x B. Because of this, manually adding feature crosses to tree-based models usually provides less benefit than adding them to linear models.

That said, there are cases where explicit crosses still help tree models. If an important interaction involves more than two features, a tree may need several levels of splits to capture it, and providing the cross directly as a new feature can shorten the required tree depth. Crosses can also reduce the number of trees needed for the same predictive performance, which lowers inference latency in production.

Model Type	Learns Interactions Automatically?	Benefit of Manual Feature Crosses
Linear model	No	Very high
Neural network	Yes (implicitly)	Low to moderate
Decision tree / ensemble	Yes (via splits)	Low (occasionally moderate)
Factorization machine	Yes (pairwise via embeddings)	Built-in
DCN / xDeepFM	Yes (explicitly + implicitly)	Built-in

google's wide and deep model

In 2016, Google published "Wide & Deep Learning for Recommender Systems" (Cheng et al., arXiv:1606.07792), describing an architecture that pairs a wide linear model with a deep neural network. The wide component uses feature crosses to memorize specific user-item co-occurrences, while the deep component uses embeddings and hidden layers to generalize to unseen feature combinations.

The wide side takes cross-product transformations of the form:

cross(feature_i, feature_j) = 1 if feature_i and feature_j are both active, and 0 otherwise

These sparse crosses allow the model to memorize that, for example, "users who installed app X also installed app Y." The deep side embeds sparse features into low-dimensional dense vectors and feeds them through multiple layers to learn generalizable patterns.
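
For binary features, the cross-product transformation amounts to a logical AND, which can be sketched as a product of 0/1 values. The feature names and the helper cross_product below are hypothetical, chosen only to mirror the app-install example.

```python
def cross_product(features, cross_keys):
    """Return 1 only if every binary feature named in cross_keys is 1."""
    result = 1
    for key in cross_keys:
        result *= features.get(key, 0)
    return result

# Hypothetical binary features for one impression.
features = {"installed_app=X": 1, "impression_app=Y": 1, "language=en": 0}

both_active = cross_product(features, ["installed_app=X", "impression_app=Y"])
one_inactive = cross_product(features, ["installed_app=X", "language=en"])
```

The wide component learns one weight per such cross, so a frequently observed pair can be memorized directly.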

Google deployed Wide and Deep on the Google Play app store, which served over one billion active users and over one million apps at the time of the paper. The system significantly improved app acquisition rates compared to wide-only and deep-only baselines while meeting strict training and serving latency requirements. The architecture demonstrated that memorization (via feature crosses) and generalization (via deep networks) are complementary strengths that work best in combination.

The authors also open-sourced their implementation as a high-level TensorFlow API, which made the architecture easy to reproduce and contributed to the broad adoption of explicit feature crosses in production deep learning pipelines through the late 2010s. Even after pure deep models became more common, the lessons from Wide and Deep influenced later architectures including DCN, DeepFM, and xDeepFM, all of which try to give a deep model some form of explicit cross signal.

implementation examples

manual crossing in python

The simplest way to create a feature cross in Python is to combine columns directly in a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, 35],
    'income': [50000, 80000, 65000]
})

# Numerical cross
df['age_x_income'] = df['age'] * df['income']

For categorical features:

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'device': ['mobile', 'desktop', 'mobile']
})

# Categorical cross
df['city_x_device'] = df['city'] + '_' + df['device']

scikit-learn polynomialfeatures

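from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data so the snippet runs end to end
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=0)

pipeline = Pipeline([
    # include_bias=False: LogisticRegression already fits its own intercept
    ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)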

tensorflow legacy crossed column

import tensorflow as tf

city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['NYC', 'LA', 'Chicago'])
device = tf.feature_column.categorical_column_with_vocabulary_list(
    'device', ['mobile', 'desktop', 'tablet'])

city_x_device = tf.feature_column.crossed_column(
    [city, device], hash_bucket_size=20)

Note that tf.feature_column has been deprecated. New TensorFlow code should prefer the Keras-native version below.

tensorflow keras hashedcrossing layer

import tensorflow as tf

cross = tf.keras.layers.HashedCrossing(
    num_bins=1000,
    output_mode='one_hot'
)

city = tf.keras.Input(shape=(1,), dtype=tf.string, name='city')
device = tf.keras.Input(shape=(1,), dtype=tf.string, name='device')
crossed = cross((city, device))

pytorch manual crossing with hashing

PyTorch does not ship a dedicated cross layer, but the same pattern is easy to write by hand:

import torch
import torch.nn as nn

class HashedCross(nn.Module):
    def __init__(self, num_bins, embedding_dim):
        super().__init__()
        self.num_bins = num_bins
        self.embed = nn.Embedding(num_bins, embedding_dim)

    def forward(self, a, b):
        # a, b are LongTensor IDs
        combined = a * 1_000_003 + b  # mix the two IDs
        bucket = combined % self.num_bins
        return self.embed(bucket)

Many production systems use a similar pattern with a 64-bit hash function (such as MurmurHash3) before the modulus to keep the bucket distribution uniform across keys.

deep and cross network in tensorflow recommenders

TensorFlow Recommenders provides a ready-made tfrs.layers.dcn.Cross layer that implements the DCN cross-network update rule. A minimal model looks like:

import tensorflow as tf
import tensorflow_recommenders as tfrs

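inputs = tf.keras.Input(shape=(64,))
# Each cross layer takes the original input x0 and the previous layer's output,
# so x0 must be passed again at every layer.
x = tfrs.layers.dcn.Cross()(inputs, inputs)
x = tfrs.layers.dcn.Cross()(inputs, x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, output)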

practical considerations

choosing which features to cross

Not all feature combinations are useful. Practitioners typically rely on:

Domain knowledge. Understanding which variables interact in the real world (for example, location and time for ride pricing, or device and creative size for ad serving).
Statistical tests. Checking for interaction effects in exploratory data analysis using techniques such as ANOVA, mutual information, or visualizing per-bin response curves.
Model-based importance. Training a gradient-boosted tree first and inspecting the most common pairs of features that appear together in tree paths. Pairs that frequently co-split are good cross candidates for a downstream linear model.
Automated search. Tools like AutoCross (Luo et al., KDD 2019, arXiv:1904.12857) use beam search over a tree-structured space of candidate crosses with successive mini-batch gradient descent and multi-granularity discretization to find effective high-order crosses without expert tuning.

avoiding feature explosion

Crossing many features at high order produces a combinatorial explosion. A dataset with 50 features has 1,225 pairwise crosses and over 19,000 three-way crosses. Strategies to manage this include:

Limiting cross order to 2 (pairwise interactions).
Using the hashing trick to cap dimensionality at a fixed bucket count.
Applying L1 regularization to drive irrelevant cross weights to zero.
Selecting features with known domain relevance before crossing.
Filtering crosses by minimum count in the training set, dropping combinations seen fewer than, say, 10 times.
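
The counts quoted above are binomial coefficients, which can be checked with the standard library. The helper name n_crosses is an assumption for this example.

```python
from math import comb

def n_crosses(n_features, order):
    """Number of distinct crosses of exactly `order` different features."""
    return comb(n_features, order)

pairwise = n_crosses(50, 2)    # 1,225
three_way = n_crosses(50, 3)   # 19,600
four_way = n_crosses(50, 4)    # 230,300
```

Each additional order multiplies the candidate count by roughly an order of magnitude at this feature count, which is why most systems stop at pairwise crosses.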

sparsity and memory

Crossed categorical features are inherently sparse. Efficient storage with compressed sparse row (CSR) matrices and sparse-aware optimizers is essential for web-scale systems. TensorFlow, PyTorch, and scikit-learn all support sparse feature representations. For very wide models with billions of crossed features, distributed parameter servers or specialized embedding-table sharding (as in TorchRec or DeepRec) are commonly used.

hashing collisions and quality

The hashing trick reduces dimensionality at the cost of collisions. Two strategies help mitigate the impact:

Increase the bucket size until collision rates fall below a target threshold. For a uniform hash function and n unique keys mapped to m buckets, the expected number of colliding key pairs is n(n-1) / (2m), or roughly n^2 / (2m), by the birthday-bound approximation.
Use multiple independent hashes. The "two-hash" trick stores each value at two different buckets and averages or sums the predictions, which reduces variance from collisions at modest extra cost.
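
The birthday-bound estimate can be turned into a quick sizing check. This sketch assumes an ideal uniform hash; the function names and the example sizes are illustrative.

```python
def expected_colliding_pairs(n_keys, n_buckets):
    """Expected number of key pairs sharing a bucket under a uniform hash."""
    return n_keys * (n_keys - 1) / (2 * n_buckets)

def expected_occupied_buckets(n_keys, n_buckets):
    """Expected number of buckets that receive at least one key."""
    return n_buckets * (1 - (1 - 1 / n_buckets) ** n_keys)

n, m = 1_000, 1_000_000
pairs = expected_colliding_pairs(n, m)   # exact expectation: 0.4995
approx = n ** 2 / (2 * m)                # n^2 / (2m) approximation: 0.5
occupied = expected_occupied_buckets(n, m)
```

Running the same functions with a realistic key count and a few candidate bucket sizes gives a rough idea of how large the table needs to be before collisions become rare.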

In practice, modern CTR systems use bucket sizes in the millions and report negligible accuracy loss compared with the full feature space.

regularization

Because crossed features dramatically inflate the parameter count, regularization is essential. The two most common choices are:

L2 regularization (weight decay). Penalizes the squared magnitude of weights, leading to smooth, dense solutions.
L1 regularization (lasso). Penalizes the absolute magnitude of weights, leading to sparse solutions where most cross weights are exactly zero. L1 is often preferred for crossed features because it acts as automatic feature selection.
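
The difference between the two penalties is easiest to see in their proximal steps. This is a sketch with made-up weights: the L1 step (soft-thresholding) sets small weights exactly to zero, while the L2 step only shrinks them. The function names are assumptions for this example.

```python
def l1_prox(w, lam):
    """Soft-thresholding: the proximal step for an L1 penalty of strength lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_prox(w, lam):
    """Proximal step for an L2 penalty (lam / 2) * w**2: uniform shrinkage."""
    return w / (1.0 + lam)

weights = [0.75, -0.25, 0.125, -1.5]
after_l1 = [l1_prox(w, 0.25) for w in weights]   # [0.5, 0.0, 0.0, -1.25]
after_l2 = [l2_prox(w, 0.25) for w in weights]   # all entries stay nonzero
```

For a model with millions of crossed features, the zeros produced by the L1 step translate directly into weights that never need to be stored or served.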

Google's FTRL-Proximal optimizer, used in many production click-through-rate models, combines L1 and L2 regularization and is particularly well suited to wide linear models with hashed crosses.
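
A compact sketch of per-coordinate FTRL-Proximal for logistic regression on sparse binary features, following the update published by McMahan et al. The class name, hyperparameter defaults, and the bucket indices in the usage lines are assumptions for illustration, not a production implementation.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # accumulated adjusted gradients per feature index
        self.n = {}  # accumulated squared gradients per feature index

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 term keeps this weight exactly zero
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, active):
        """active: iterable of distinct feature indices whose value is 1."""
        logit = sum(self._weight(i) for i in active)
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, active, label):
        p = self.predict(active)
        g = p - label  # gradient of log loss w.r.t. each active weight
        for i in active:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p

model = FTRLProximal(alpha=0.5, l1=0.1, l2=0.1)
for _ in range(50):
    model.update([3, 17], 1)   # hashed bucket indices, invented for illustration
    model.update([5, 17], 0)
```

Weights are derived lazily from z and n, so only buckets that have actually been seen consume memory, which suits hashed crosses with very large bucket counts.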

feature drift and retraining

In production, the distribution of crossed features can shift quickly. A new product launch may introduce a previously unseen category x country combination, or a viral query may suddenly dominate the query_term x location cross. Regular retraining and online learning are common defenses. Some systems use streaming algorithms such as FTRL to update weights continuously as new data arrives.

use cases

click-through rate (CTR) prediction

CTR prediction for online advertising is the canonical home of feature crossing. Crossed features such as user_demographic x ad_category, query_term x ad_creative, or time_of_day x device capture the joint signals that make a particular ad relevant to a particular user in a particular context. Teams at Google, Microsoft (Bing), and Facebook have all published papers describing linear or hybrid CTR models built on large sparse feature spaces of this kind.

recommendation systems

Recommender systems for streaming, e-commerce, and app stores rely on user-item interactions. Crossing user IDs with item categories, time of day, or device produces strong memorization signals, while embeddings on the deep side handle generalization to new users and items. The Google Play Wide and Deep deployment is the best-known example.

search ranking

Web and product search ranking models cross query terms with document attributes, user location, and session features. The cross query_intent x document_topic is particularly common for capturing topical relevance.

fraud and risk

Financial fraud detection benefits from crosses that combine transactional and contextual variables. Examples include merchant_category x transaction_amount, card_country x transaction_country, and device_id x time_of_day. These crosses encode the kinds of "out of pattern" combinations that often precede fraud.

geo-temporal modeling

Ride-sharing, food delivery, and weather forecasting all rely on bucketized latitude and longitude crossed with time of day or day of week. The Google Machine Learning Crash Course example using bucketized lat-lon crosses for a California housing model is a classic illustration.

historical context

The idea of including interaction terms in regression models predates machine learning by many decades. Statisticians studied factorial designs and interaction effects in agricultural and industrial experiments, beginning with R. A. Fisher in the 1920s and continuing with George Box in the 1950s. Modern feature crossing is the same idea, scaled up and packaged for high-cardinality categorical data.

The first widely cited use of large-scale hashed feature crosses in industry came with sponsored search systems at Google, Yahoo, and Microsoft in the mid-2000s. These systems trained logistic regression models with billions of parameters by hashing combinations of query terms, ad creatives, and user attributes into fixed-size bucket arrays. Vowpal Wabbit, released as open source by John Langford and collaborators around 2007, made the hashing trick accessible outside the largest tech companies.

The 2010 Factorization Machines paper by Steffen Rendle reframed crossed features as embedding lookups, opening the door to dense generalization. The 2016 Wide and Deep paper from Google then combined explicit crosses with deep networks. From 2017 onward, DCN, DeepFM, and xDeepFM each tried to automate the discovery of which crosses matter most. By 2020 and 2021, with the rise of deeper recommenders trained on trillions of examples, the trend shifted toward learning crosses implicitly through embeddings and attention, but the explicit cross has not disappeared. It remains a competitive technique whenever interpretability, low latency, or data efficiency is a priority.

limitations

Feature crosses are powerful but not universally applicable. Some of their key limitations are:

Cardinality explosion. Without hashing or careful pruning, crossed feature vocabularies can grow into the billions, exceeding the memory available on most hardware.
Cold-start problem. A linear model with explicit crosses cannot generalize to combinations that never appear in training data. The weight on an unseen cross stays at its initial value (zero, or its prior) because no training example ever updates it. Embedding-based methods such as FM and DCN do better here because they share parameters across combinations.
Manual effort. Choosing which features to cross is still partly an art, even with tools like AutoCross. Domain knowledge remains the most reliable guide.
Brittleness. Crossed features can amplify noise in high-cardinality categorical variables, especially when individual categories appear in only a few training examples.
Interaction with regularization. Without strong L1 or L2 regularization, the additional parameters introduced by crosses can lead to severe overfitting.
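The first two limitations can be made concrete with a small sketch. The cardinalities, bucket count, and weights below are made-up numbers for illustration, and `cross_vocab_size`, `hashed_bucket`, and `cross_score` are hypothetical helpers rather than library functions.

```python
import hashlib


def cross_vocab_size(*cardinalities):
    """Number of distinct values a full cross of the given features can take."""
    size = 1
    for c in cardinalities:
        size *= c
    return size


def hashed_bucket(values, num_buckets):
    """Deterministically map a combination of values to a fixed bucket range."""
    key = "_x_".join(str(v) for v in values).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets


# Cardinality explosion: assumed 1M users crossed with 100K items.
full_size = cross_vocab_size(1_000_000, 100_000)
print(full_size)  # 100000000000 possible crossed values

# Hashing caps the parameter count at num_buckets, at the cost of collisions.
bucket = hashed_bucket(("user_42", "item_7"), num_buckets=2**20)

# Cold start: a linear model only has weights for crosses seen in training.
# These weights are invented for the example.
learned_weights = {"US_x_BR": 1.7, "US_x_US": -0.4}


def cross_score(card_country, txn_country):
    """Contribution of the cross term; unseen combinations contribute 0.0."""
    return learned_weights.get(f"{card_country}_x_{txn_country}", 0.0)


print(cross_score("US", "BR"))  # seen in training
print(cross_score("DE", "JP"))  # never seen, so the cross adds nothing
```

The unseen pair falls back to zero no matter how similar it is to pairs the model has seen, which is the gap that shared embeddings are meant to close.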

explain like i'm 5 (ELI5)

Imagine you are trying to figure out which ice cream flavors people like. Knowing someone's favorite color ("blue") or their age ("7") alone does not tell you much. But if you put those two facts together ("7-year-old who likes blue"), you might discover that kids around that age who like blue tend to love blueberry ice cream. A feature cross is just putting two clues together to make a stronger clue that helps the computer guess better.

It is like making a checklist of every interesting pair: red and small, red and large, blue and small, blue and large. Each combo gets its own little cubbyhole, and the computer learns which cubbyhole goes with which answer. When two clues are weak on their own but powerful together, the cross is what lets the computer notice the partnership.

references

Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. (2016). "Wide & Deep Learning for Recommender Systems." Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM. arXiv:1606.07792. https://arxiv.org/abs/1606.07792
Wang, R., Fu, B., Fu, G., and Wang, M. (2017). "Deep & Cross Network for Ad Click Predictions." Proceedings of the ADKDD'17, ACM. arXiv:1708.05123. https://arxiv.org/abs/1708.05123
Wang, R., Shivanna, R., Cheng, D. Z., Jain, S., Lin, D., Hong, L., and Chi, E. H. (2021). "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." Proceedings of the Web Conference 2021, ACM. arXiv:2008.13535. https://arxiv.org/abs/2008.13535
Rendle, S. (2010). "Factorization Machines." Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM), IEEE. https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., and Sun, G. (2018). "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. arXiv:1803.05170. https://arxiv.org/abs/1803.05170
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI). arXiv:1703.04247. https://arxiv.org/abs/1703.04247
Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). "Feature Hashing for Large Scale Multitask Learning." Proceedings of the 26th International Conference on Machine Learning (ICML). arXiv:0902.2206. https://arxiv.org/abs/0902.2206
Luo, Y., Wang, M., Zhou, H., Yao, Q., Tu, W., Chen, Y., Yang, Q., and Dai, W. (2019). "AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. arXiv:1904.12857. https://arxiv.org/abs/1904.12857
Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. (2016). "Field-aware Factorization Machines for CTR Prediction." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), ACM. https://dl.acm.org/doi/10.1145/2959100.2959134
Google Developers. "Categorical data: Feature crosses." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses
TensorFlow. "tf.feature_column.crossed_column (deprecated)." TensorFlow API Docs. https://www.tensorflow.org/api_docs/python/tf/feature_column/crossed_column
TensorFlow. "tf.keras.layers.HashedCrossing." TensorFlow Keras API Docs. https://www.tensorflow.org/api_docs/python/tf/keras/layers/HashedCrossing
scikit-learn. "PolynomialFeatures." scikit-learn Documentation. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
TensorFlow Recommenders. "Deep & Cross Network (DCN)." TensorFlow Documentation. https://www.tensorflow.org/recommenders/examples/dcn
Wikipedia contributors. "Feature hashing." Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Feature_hashing
