See also: Machine learning terms, Feature engineering
A feature cross (also called a crossed feature or feature interaction) is a synthetic feature created by combining two or more existing features to capture their joint effect on a prediction. In machine learning, individual features sometimes fail to represent the patterns that emerge only when variables act together. Feature crossing addresses this gap by explicitly encoding interactions, giving models access to information that would otherwise remain hidden.
Feature crosses are one of the most practical techniques in feature engineering. They are especially valuable for linear models such as logistic regression and linear regression, which cannot learn interactions on their own. By adding crossed features, a linear model gains the ability to approximate nonlinear decision boundaries without increasing architectural complexity.
At a high level, a feature cross takes two or more source features and produces a new feature whose value depends on the specific combination of values from those sources. The exact mechanics differ depending on whether the source features are numerical or categorical.
For numerical (continuous) features, the simplest cross is the element-wise product. Given two features A and B, the crossed feature is:
C = A * B
For example, a dataset might contain the features temperature and humidity. Individually, neither variable may strongly predict rainfall. But their product, temperature * humidity, can capture the combined atmospheric condition that leads to rain. This multiplicative interaction is the most common form of numerical feature crossing.
Higher-order crosses are also possible. A degree-3 cross of features A, B, and C would be A * B * C. These higher-order interactions grow in number very quickly; with n features and degree d, the count of possible crosses is on the order of n^d, so practitioners must be selective about which crosses to include.
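As a quick illustration of that growth, the following sketch (standard library only) enumerates the candidate crosses for a handful of features and counts them for larger feature sets:
import math
from itertools import combinations
features = ['A', 'B', 'C', 'D']
# All degree-3 candidate crosses of four features
print(list(combinations(features, 3)))  # [('A','B','C'), ('A','B','D'), ('A','C','D'), ('B','C','D')]
# How the number of candidates grows with more features
print(math.comb(50, 2))  # 1225 pairwise crosses
print(math.comb(50, 3))  # 19600 three-way crosses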
For categorical data, a feature cross is the Cartesian product of the value sets. If feature X has values {red, blue} and feature Y has values {small, large}, the crossed feature X x Y has four possible values: {red_small, red_large, blue_small, blue_large}.
After the cross is formed, each combination is typically represented through one-hot encoding. The resulting vector has one dimension per unique combination, with a 1 in the position corresponding to the observed pair and 0 everywhere else.
| Feature X | Feature Y | Crossed Feature (X x Y) |
|---|---|---|
| red | small | red_small |
| red | large | red_large |
| blue | small | blue_small |
| blue | large | blue_large |
This encoding lets the model learn a separate weight for every combination, giving it far more expressive power than treating each feature independently.
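As a minimal pandas sketch of this process (column names and values are illustrative), the cross can be formed by string concatenation and then one-hot encoded:
import pandas as pd
df = pd.DataFrame({
    'X': ['red', 'blue', 'red'],
    'Y': ['small', 'small', 'large']
})
# Cartesian-product cross: one combined value per row
df['X_x_Y'] = df['X'] + '_' + df['Y']
# One column per observed combination, with a 1/0 (or True/False) entry per row
one_hot = pd.get_dummies(df['X_x_Y'])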
A common hybrid approach discretizes continuous features into buckets before crossing them. For instance, latitude and longitude can each be split into bins (e.g., 10-degree ranges), and then crossed to produce a grid of geographic cells. Google's Machine Learning Crash Course uses this latitude-longitude example to show how a simple linear model, when given bucketized location crosses, can learn location-specific patterns that would otherwise require a nonlinear model.
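A rough pandas sketch of that idea (coordinates and bin width are illustrative) bins each coordinate into 10-degree ranges and then crosses the bins:
import pandas as pd
df = pd.DataFrame({
    'latitude': [37.7, 40.7, 51.5],
    'longitude': [-122.4, -74.0, -0.1]
})
# Discretize each coordinate into 10-degree bins, then cross the bins
df['lat_bin'] = (df['latitude'] // 10).astype(int).astype(str)
df['lon_bin'] = (df['longitude'] // 10).astype(int).astype(str)
df['lat_x_lon'] = df['lat_bin'] + '_' + df['lon_bin']  # e.g. '3_-13' identifies one grid cell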
A linear model computes predictions as a weighted sum of input features. Without feature crosses, it can only represent additive relationships: the effect of feature A is independent of feature B. Many real-world problems violate this assumption. By adding the product A * B as a new feature, the model can capture the interaction effect, which is the contribution that appears only when both A and B take certain values simultaneously.
Consider a fraud detection system. The features transaction_amount and time_of_day may each be weak predictors of fraud on their own. But a high transaction amount at 3 AM is far more suspicious than the same amount at noon. The crossed feature transaction_amount * time_of_day lets a linear model learn this nuance.
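A small experiment with synthetic data (purely illustrative) makes the point concrete: a linear model fit on A and B alone cannot explain a target driven by their interaction, while the same model given the A * B cross fits it almost perfectly:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
A = rng.normal(size=500)
B = rng.normal(size=500)
y = 2 * A * B + rng.normal(scale=0.1, size=500)  # target depends only on the interaction
X_plain = np.column_stack([A, B])
X_crossed = np.column_stack([A, B, A * B])
print(LinearRegression().fit(X_plain, y).score(X_plain, y))      # R^2 close to 0
print(LinearRegression().fit(X_crossed, y).score(X_crossed, y))  # R^2 close to 1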
Feature crosses excel at memorization: learning that a particular combination of inputs maps to a particular output. In recommendation systems, for example, a cross of user_id x item_id lets the model memorize which user-item pairs led to clicks. This memorization ability is central to Google's Wide and Deep architecture, discussed later in this article.
Compared to deploying a neural network with multiple hidden layers, feature crosses provide interaction modeling at minimal computational cost. They require no gradient-based interaction discovery; the engineer specifies the cross, and the model learns the weights. Training and inference remain fast because the underlying model is still linear.
Categorical feature crosses are particularly common in web-scale applications such as advertising, search ranking, and recommendation systems. In these domains, inputs are often high-cardinality categorical variables (e.g., user IDs, product IDs, query terms).
Crossing two categorical features with m and n unique values produces up to m x n possible combinations. If a user ID feature has 1 million values and a product ID feature has 100,000 values, the full cross has 100 billion possible values. The resulting one-hot vector is extremely sparse: only one entry out of 100 billion is nonzero for each example.
| Source Features | Cardinalities | Crossed Feature | Crossed Cardinality |
|---|---|---|---|
| Country, Language | 200; 100 | Country x Language | 200 x 100 = 20,000 |
| User ID, Product ID | 1,000,000; 100,000 | User ID x Product ID | 100 billion |
While such sparse representations are feasible with sparse matrix libraries, the sheer dimensionality can slow training and inflate memory use.
The hashing trick (also called feature hashing) addresses the dimensionality problem. Instead of maintaining a full one-hot vector for every possible combination, the crossed value is run through a hash function and mapped to one of a fixed number of buckets:
bucket_index = hash(feature_A_value, feature_B_value) % hash_bucket_size
This reduces the feature space from potentially billions of dimensions to a manageable, fixed size (e.g., 10,000 or 100,000 buckets). The trade-off is hash collisions: different feature combinations may map to the same bucket, introducing some noise. In practice, a sufficiently large bucket size keeps collision rates low enough that model accuracy is largely preserved.
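A minimal sketch of the same idea in plain Python (hashlib is used so the bucket assignment is stable across runs; the built-in hash() is randomized per process):
import hashlib
def crossed_bucket(value_a, value_b, hash_bucket_size=10000):
    # Hash the combined value and fold it into a fixed number of buckets
    key = f'{value_a}_x_{value_b}'.encode('utf-8')
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size
crossed_bucket('NYC', 'mobile')  # deterministic bucket index in [0, 10000)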
TensorFlow's tf.feature_column.crossed_column uses this approach internally. The function signature is:
tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)
Here, keys lists the features to cross, and hash_bucket_size controls the number of hash buckets. A common recommendation is to include the original (uncrossed) features alongside the cross so the model retains access to the individual signals.
Polynomial feature generation is a closely related technique that is most commonly applied to numerical data. Scikit-learn's PolynomialFeatures class produces all polynomial combinations of input features up to a specified degree:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
poly = PolynomialFeatures(degree=2, interaction_only=False)
# Input: [a, b], e.g. [2, 3]
poly.fit_transform(np.array([[2, 3]]))
# Output: [1, a, b, a^2, ab, b^2], i.e. [[1, 2, 3, 4, 6, 9]]
Setting interaction_only=True removes the pure-power terms (like a^2 and b^2), leaving only the cross terms:
poly = PolynomialFeatures(degree=2, interaction_only=True)
# Input: [a, b], e.g. [2, 3]
poly.fit_transform(np.array([[2, 3]]))
# Output: [1, a, b, ab], i.e. [[1, 2, 3, 6]]
This interaction_only mode is essentially pure feature crossing for numerical data. The key difference between polynomial features and categorical feature crosses is that polynomial features multiply continuous values, while categorical crosses form the Cartesian product and one-hot encode the result.
| Method | Data Type | Technique | Output Format |
|---|---|---|---|
| Categorical feature cross | Categorical | Cartesian product + one-hot | Sparse binary vector |
| Polynomial features | Numerical | Multiplication of feature values | Dense numeric vector |
| Bucketized cross | Numerical (discretized) | Bin + Cartesian product | Sparse binary vector |
A neural network with one or more hidden layers can learn feature interactions automatically through its nonlinear activation functions. Each neuron in a hidden layer computes a weighted sum of inputs and passes it through a nonlinearity, allowing the network to represent arbitrarily complex interactions without manual engineering.
However, neural networks learn interactions implicitly. They require sufficient training data to discover useful combinations, and the learned interactions are embedded in the network weights, making them difficult to interpret. Feature crosses, by contrast, are explicit and interpretable: the engineer can inspect which combinations matter and assign clear semantic meaning to each one.
The Deep and Cross Network (DCN), introduced by Wang et al. in 2017, automates feature crossing within a neural architecture. DCN adds a "cross network" alongside a standard deep network. Each layer of the cross network explicitly computes feature interactions of increasing polynomial degree, while the deep network captures implicit patterns. DCN-V2 (2020) improved on the original design with more expressive cross layers, using full weight matrices that can be factorized into a mixture of low-rank experts, and achieved better performance on web-scale ranking tasks at Google.
DCN can be seen as a middle ground: it retains the explicit interaction modeling of feature crosses while automating the process of discovering which crosses are useful.
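For intuition, a single cross layer in the DCN-V2 style can be written in a few lines of NumPy (shapes and initialization here are illustrative; in the real model W and b are trained end to end):
import numpy as np
def cross_layer(x0, xl, W, b):
    # DCN-V2-style cross layer: element-wise product of the original input x0
    # with a learned projection of the current layer xl, plus a residual term.
    # x0, xl, and b have shape (d,); W has shape (d, d).
    return x0 * (W @ xl + b) + xl
d = 4
x0 = np.random.randn(d)
W, b = np.random.randn(d, d), np.zeros(d)
x1 = cross_layer(x0, x0, W, b)  # stacking layers raises the interaction degree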
Decision trees and ensemble methods like random forests and gradient boosting learn feature interactions inherently. A decision tree that splits first on feature A and then on feature B within a subtree has effectively learned the interaction A x B. Because of this, manually adding feature crosses to tree-based models usually provides less benefit than adding them to linear models.
That said, there are cases where explicit crosses still help tree models. If an important interaction involves more than two features, a tree may need several levels of splits to capture it, and providing the cross directly as a new feature can shorten the required tree depth.
| Model Type | Learns Interactions Automatically? | Benefit of Manual Feature Crosses |
|---|---|---|
| Linear model | No | Very high |
| Neural network | Yes (implicitly) | Low to moderate |
| Decision tree / ensemble | Yes (via splits) | Low (occasionally moderate) |
| DCN / Cross Network | Yes (explicitly + implicitly) | Built-in |
In 2016, Google published "Wide and Deep Learning for Recommender Systems" (Cheng et al.), describing an architecture that pairs a wide linear model with a deep neural network. The wide component uses feature crosses to memorize specific user-item co-occurrences, while the deep component uses embeddings and hidden layers to generalize to unseen feature combinations.
The wide side takes cross-product transformations of the form:
cross(feature_i, feature_j) = 1 if feature_i and feature_j are both active, and 0 otherwise
These sparse crosses allow the model to memorize that "users who installed app X also installed app Y." The deep side embeds sparse features into low-dimensional dense vectors and feeds them through multiple layers to learn generalizable patterns.
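The overall shape of the architecture can be sketched in Keras; every size, name, and layer width below is an illustrative assumption rather than the published configuration:
import tensorflow as tf
n_crossed_buckets = 10000  # hashed feature-cross vocabulary (wide side)
n_items = 5000             # item vocabulary (deep side)
wide_in = tf.keras.Input(shape=(n_crossed_buckets,), name='crossed_features')
deep_in = tf.keras.Input(shape=(1,), dtype='int32', name='item_id')
# Wide component: a single linear layer over the feature crosses (memorization)
wide_logit = tf.keras.layers.Dense(1, use_bias=False)(wide_in)
# Deep component: embedding plus hidden layers (generalization)
emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_items, 16)(deep_in))
hidden = tf.keras.layers.Dense(64, activation='relu')(emb)
deep_logit = tf.keras.layers.Dense(1)(hidden)
# Joint training: sum the two logits and apply a sigmoid
output = tf.keras.layers.Activation('sigmoid')(
    tf.keras.layers.Add()([wide_logit, deep_logit]))
model = tf.keras.Model([wide_in, deep_in], output)
model.compile(optimizer='adam', loss='binary_crossentropy')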
Google deployed Wide and Deep on Google Play, where it significantly increased app install rates compared to wide-only and deep-only baselines. The architecture demonstrated that memorization (via feature crosses) and generalization (via deep networks) are complementary strengths that work best in combination.
The simplest way to create a feature cross in Python is to combine columns directly in a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
    'age': [25, 40, 35],
    'income': [50000, 80000, 65000]
})
# Numerical cross
df['age_x_income'] = df['age'] * df['income']
For categorical features:
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'device': ['mobile', 'desktop', 'mobile']
})
# Categorical cross
df['city_x_device'] = df['city'] + '_' + df['device']
Feature crosses can also be generated inside a scikit-learn pipeline, so the crossing step is fitted and applied consistently at training and inference time:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, interaction_only=True)),
    ('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
In TensorFlow, the same kind of categorical cross can be built directly with crossed_column:
import tensorflow as tf
city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['NYC', 'LA', 'Chicago'])
device = tf.feature_column.categorical_column_with_vocabulary_list(
    'device', ['mobile', 'desktop', 'tablet'])
city_x_device = tf.feature_column.crossed_column(
    [city, device], hash_bucket_size=20)
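As noted earlier, the original columns are usually kept alongside the cross; with the legacy estimator API that crossed_column belongs to, a sketch of a linear model using all three might look like:
linear_model = tf.estimator.LinearClassifier(
    feature_columns=[city, device, city_x_device])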
Not all feature combinations are useful. Practitioners typically rely on domain knowledge and on empirical validation, adding a candidate cross and checking whether it improves held-out performance, to decide which crosses to keep.
Crossing many features at high order produces a combinatorial explosion. A dataset with 50 features has 1,225 pairwise crosses and over 19,000 three-way crosses. Strategies to manage this include restricting crosses to a small, hand-picked set, limiting the cross degree, and using the hashing trick described earlier to cap the dimensionality of each cross.
Crossed categorical features are inherently sparse. Efficient storage with compressed sparse row (CSR) matrices and sparse-aware optimizers is essential for web-scale systems. TensorFlow, PyTorch, and scikit-learn all support sparse feature representations.
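A small SciPy sketch (bucket indices are illustrative) shows how a batch of hashed crosses can be held in CSR form, storing only the single active bucket per example:
import numpy as np
from scipy.sparse import csr_matrix
n_examples, hash_bucket_size = 3, 10000
bucket_indices = np.array([42, 7, 42])  # one active crossed bucket per example
rows = np.arange(n_examples)
values = np.ones(n_examples)
X_sparse = csr_matrix((values, (rows, bucket_indices)),
                      shape=(n_examples, hash_bucket_size))
print(X_sparse.nnz)  # 3 stored entries instead of 30,000 dense cells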
Imagine you are trying to figure out which ice cream flavors people like. Knowing someone's favorite color ("blue") or their age ("7") alone does not tell you much. But if you put those two facts together ("7-year-old who likes blue"), you might discover that kids around that age who like blue tend to love blueberry ice cream. A feature cross is just putting two clues together to make a stronger clue that helps the computer guess better.