Feature Cross
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 5,546 words
See also: Machine learning terms, Feature engineering
A feature cross (also called a crossed feature or feature interaction) is a synthetic feature created by combining two or more existing features to capture their joint effect on a prediction. In machine learning, individual features sometimes fail to represent the patterns that emerge only when variables act together. Feature crossing addresses this gap by explicitly encoding interactions, giving models access to information that would otherwise remain hidden in the raw inputs.
Feature crosses are one of the most practical techniques in feature engineering. They are especially valuable for linear models such as logistic regression and linear regression, which cannot learn interactions on their own. By adding crossed features, a linear model gains the ability to approximate nonlinear decision boundaries without increasing architectural complexity. The technique has roots in classical statistics, where interaction terms have been used in regression analysis for decades, but it gained renewed attention in the deep learning era through architectures like Wide and Deep, Deep and Cross Network, and DeepFM that integrate explicit feature crosses with neural networks.
The term feature cross became widely adopted in industry partly through Google's Machine Learning Crash Course, which devotes a section to the concept under the heading "Categorical data: Feature crosses." Google's TensorFlow team also exposed the idea through library APIs such as tf.feature_column.crossed_column (now deprecated) and tf.keras.layers.HashedCrossing, cementing the vocabulary used by practitioners today.
At a high level, a feature cross takes two or more source features and produces a new feature whose value depends on the specific combination of values from those sources. The exact mechanics differ depending on whether the source features are numerical or categorical.
For numerical (continuous) features, the simplest cross is the element-wise product. Given two features A and B, the crossed feature is:
C = A * B
For example, a dataset might contain the features temperature and humidity. Individually, neither variable may strongly predict rainfall. But their product, temperature * humidity, can capture the combined atmospheric condition that leads to rain. This multiplicative interaction is the most common form of numerical feature crossing.
Mathematically, if a linear model has the form:
y = w_0 + w_1 * x_1 + w_2 * x_2
then adding a numerical cross transforms it into:
y = w_0 + w_1 * x_1 + w_2 * x_2 + w_3 * (x_1 * x_2)
The new term w_3 * (x_1 * x_2) lets the model represent nonlinear behavior. The slope of y with respect to x_1 now depends on the value of x_2, which is the defining property of an interaction effect in classical regression analysis.
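The interaction property can be checked numerically. The following is a minimal sketch with made-up weights (the values in `w` are illustrative assumptions, not fitted to any data). It measures the slope of the prediction with respect to x_1 at two different values of x_2.

```python
def predict(w, x1, x2):
    """Linear model with one numerical cross: w0 + w1*x1 + w2*x2 + w3*(x1*x2)."""
    w0, w1, w2, w3 = w
    return w0 + w1 * x1 + w2 * x2 + w3 * (x1 * x2)


def slope_wrt_x1(w, x2, x1=0.0, step=1.0):
    """Finite-difference slope of the prediction with respect to x1 at a fixed x2."""
    return (predict(w, x1 + step, x2) - predict(w, x1, x2)) / step


# Hypothetical weights chosen only for illustration.
w = (1.0, 2.0, 3.0, 0.5)

slope_at_x2_0 = slope_wrt_x1(w, x2=0.0)  # equals w1
slope_at_x2_4 = slope_wrt_x1(w, x2=4.0)  # equals w1 + w3 * 4
```

With w_3 set to zero the two slopes would be identical, which is the purely additive case.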
Higher-order crosses are also possible. A degree-3 cross of features A, B, and C would be A * B * C. These higher-order interactions grow in number very quickly. With n features and degree d, the count of possible crosses is on the order of n^d, so practitioners must be selective about which crosses to include.
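As a rough illustration of this growth, the sketch below counts crosses that combine distinct features, which is the binomial coefficient "n choose d" (bounded above by n^d). The helper name `count_crosses` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def count_crosses(n_features, degree):
    """Number of crosses that combine `degree` distinct features out of `n_features`."""
    return comb(n_features, degree)


pairwise = count_crosses(50, 2)   # two-way crosses among 50 features
three_way = count_crosses(50, 3)  # three-way crosses among 50 features
```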
For categorical data, a feature cross is the Cartesian product of the value sets. If feature X has values {red, blue} and feature Y has values {small, large}, the crossed feature X x Y has four possible values: {red_small, red_large, blue_small, blue_large}.
After the cross is formed, each combination is typically represented through one-hot encoding. The resulting vector has one dimension per unique combination, with a 1 in the position corresponding to the observed pair and 0 everywhere else.
| Feature X | Feature Y | Crossed Feature (X x Y) |
|---|---|---|
| red | small | red_small |
| red | large | red_large |
| blue | small | blue_small |
| blue | large | blue_large |
This encoding lets the model learn a separate weight for every combination, giving it far more expressive power than treating each feature independently. A canonical worked example from Google's documentation crosses city (e.g., New York, Boston, Seattle) with weather (e.g., sunny, rainy, snowy). The crossed feature contains values such as Boston_rainy or Seattle_snowy, and a linear model can learn that, say, Seattle_snowy strongly predicts a delivery delay even when neither Seattle nor snowy is very informative on its own.
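The color and size example can be sketched in plain Python. The function names here are hypothetical and the vocabulary is built eagerly, which is only practical for small cardinalities.

```python
from itertools import product


def build_cross_vocabulary(values_x, values_y):
    """Enumerate every combination in the Cartesian product as 'x_y' strings."""
    return [f"{x}_{y}" for x, y in product(values_x, values_y)]


def one_hot_cross(x, y, vocabulary):
    """Return a one-hot list with a 1 at the position of the observed pair."""
    index = vocabulary.index(f"{x}_{y}")
    return [1 if i == index else 0 for i in range(len(vocabulary))]


vocab = build_cross_vocabulary(["red", "blue"], ["small", "large"])
encoded = one_hot_cross("blue", "small", vocab)
```

Each position in `encoded` corresponds to one row of the table, so a linear model attaches one weight to each combination.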
A common hybrid approach discretizes continuous features into buckets before crossing them. For instance, latitude and longitude can each be split into bins (for example, 10-degree ranges), and then crossed to produce a grid of geographic cells. Google's Machine Learning Crash Course uses this latitude-longitude example to show how a simple linear model, when given bucketized location crosses, can learn location-specific patterns that would otherwise require a nonlinear model.
Bucketization turns a continuous variable into a categorical one with a small number of bins. Once bucketized, the variable behaves like any other categorical feature for the purpose of crossing. The advantage over a raw numerical product is that the model can learn distinct weights for each cell in the grid, capturing sharp regional patterns that a smooth product term would average out.
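A minimal sketch of a bucketized latitude-longitude cross follows. The boundary lists and the `geo_cell` helper are assumptions made for illustration; real systems choose boundaries from the data or from quantiles.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    With sorted boundaries [b0, b1, ...], bucket 0 is value < b0,
    bucket 1 is b0 <= value < b1, and so on.
    """
    return bisect_right(boundaries, value)


# Hypothetical 10-degree boundaries covering a small region.
LAT_BOUNDARIES = [30.0, 40.0, 50.0]
LON_BOUNDARIES = [-120.0, -110.0, -100.0]


def geo_cell(lat, lon):
    """Cross the latitude bucket with the longitude bucket into one cell ID."""
    lat_bucket = bucketize(lat, LAT_BOUNDARIES)
    lon_bucket = bucketize(lon, LON_BOUNDARIES)
    n_lon_buckets = len(LON_BOUNDARIES) + 1
    return lat_bucket * n_lon_buckets + lon_bucket


cell_a = geo_cell(37.8, -122.4)
cell_b = geo_cell(34.1, -118.2)
cell_c = geo_cell(47.6, -122.3)
```

The resulting cell ID can be one-hot encoded like any other categorical value, so each grid cell receives its own weight.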
A linear model computes predictions as a weighted sum of input features. Without feature crosses, it can only represent additive relationships, in which the effect of feature A is independent of the value of feature B. Many real-world problems violate this assumption. By adding the product A * B as a new feature, the model can capture the interaction effect, which is the contribution that appears only when both A and B take certain values simultaneously.
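Consider a fraud detection system. The features transaction_amount and time_of_day may each be weak predictors of fraud on their own, yet a high transaction amount at 3 AM is far more suspicious than the same amount at noon. A raw product transaction_amount * time_of_day only captures this if time_of_day is encoded so that larger values mean riskier hours. A more robust choice is to bucketize time_of_day (and optionally the amount) and cross the buckets, so that a linear model learns a separate weight for combinations such as a large amount in the 2 AM to 4 AM window.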
Feature crosses excel at memorization: learning that a particular combination of inputs maps to a particular output. In recommendation systems, for example, a cross of user_id x item_id lets the model memorize which user-item pairs led to clicks. This memorization ability is central to Google's Wide and Deep architecture, discussed later in this article.
Memorization and generalization are often described as two sides of the same coin in recommender system design. Memorization captures co-occurrences that have been observed frequently in training data. Generalization extrapolates to combinations that were rare or unseen. Crossed features lean strongly toward memorization, while embeddings and dense neural layers lean toward generalization. Production systems usually need both.
Compared to deploying a neural network with multiple hidden layers, feature crosses provide interaction modeling at minimal computational cost. They require no gradient-based interaction discovery; the engineer specifies the cross, and the model learns the weights. Training and inference remain fast because the underlying model is still linear: a prediction is a single dot product between the weight vector and the feature vector, which for sparse crosses reduces to summing the weights of the active features.
This simplicity makes crosses attractive for latency-sensitive deployments such as ad-serving systems that must produce predictions in single-digit milliseconds. The underlying model, often logistic regression or a follow-the-regularized-leader variant, can be served on commodity CPUs without specialized hardware.
A cross feature has a clear semantic meaning. The weight on country=Japan x device=mobile directly answers the question "how much does being on a mobile device in Japan increase the predicted click rate?" That kind of inspectability matters for debugging models, explaining decisions to stakeholders, and complying with regulatory requirements that demand model transparency.
Neural networks, by contrast, distribute their interaction knowledge across many weights and nonlinear activations, which makes it harder to point at a single number and say what it means. The trade-off is one of the reasons crosses persist in production stacks even when deep models are available.
Categorical feature crosses are particularly common in web-scale applications such as advertising, search ranking, and recommendation systems. In these domains, inputs are often high-cardinality categorical variables (for example, user IDs, product IDs, query terms, publisher domains, geographic regions).
Crossing two categorical features with m and n unique values produces up to m x n possible combinations. If a user ID feature has 1 million values and a product ID feature has 100,000 values, the full cross has 100 billion possible values. The resulting one-hot vector is extremely sparse: only one entry out of 100 billion is nonzero for each example.
| Crossed Feature | Source Features (Cardinality) | Crossed Cardinality |
|---|---|---|
| Country x Language | Country (200), Language (100) | 200 x 100 = 20,000 |
| User ID x Product ID | User ID (1,000,000), Product ID (100,000) | 1,000,000 x 100,000 = 100 billion |
| Query x Country | Query token (1,000,000), Country (200) | 1,000,000 x 200 = 200 million |
While such sparse representations are feasible with sparse matrix libraries, the sheer dimensionality can slow training and inflate memory use. Most production systems either prune the cross to combinations that appear above a minimum count threshold, or apply the hashing trick described next.
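The count-threshold pruning mentioned above might look like the following sketch. The shared `__rare__` fallback value is an assumption made here; some systems simply drop pruned crosses instead.

```python
from collections import Counter


def frequent_crosses(pairs, min_count):
    """Keep only crossed values that occur at least `min_count` times."""
    counts = Counter(f"{a}_{b}" for a, b in pairs)
    return {cross for cross, n in counts.items() if n >= min_count}


def encode_cross(a, b, kept, fallback="__rare__"):
    """Map a pair to its crossed value, or to a shared fallback if it was pruned."""
    cross = f"{a}_{b}"
    return cross if cross in kept else fallback


# Toy (country, language) observations, invented for illustration.
observations = [("US", "en"), ("US", "en"), ("US", "es"), ("JP", "ja"), ("JP", "ja")]
kept = frequent_crosses(observations, min_count=2)
```

Only the surviving crosses need a dimension in the one-hot vector, so the feature space scales with the number of frequent combinations rather than with the full Cartesian product.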
The hashing trick (also called feature hashing, formalized by Weinberger et al. in 2009) addresses the dimensionality problem. Instead of maintaining a full one-hot vector for every possible combination, the crossed value is run through a hash function and mapped to one of a fixed number of buckets:
bucket_index = hash(feature_A_value, feature_B_value) % hash_bucket_size
This reduces the feature space from potentially billions of dimensions to a manageable, fixed size (for example, 10,000 or 100,000 buckets). The trade-off is hash collisions: different feature combinations may map to the same bucket, introducing some noise. In practice, a sufficiently large bucket size keeps collision rates low enough that model accuracy is largely preserved.
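A small, framework-free sketch of the bucket computation is shown below. It uses CRC32 from the standard library purely as a stand-in for a production hash function, and the `hashed_cross` name is hypothetical. Python's built-in `hash()` is avoided because string hashes are salted per process and would not be stable between training and serving.

```python
import zlib


def hashed_cross(value_a, value_b, hash_bucket_size):
    """Map a pair of feature values to a bucket index in [0, hash_bucket_size)."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
    key = f"{value_a}\x1f{value_b}".encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size


bucket = hashed_cross("NYC", "mobile", hash_bucket_size=1000)
same_bucket = hashed_cross("NYC", "mobile", hash_bucket_size=1000)
```

The same pair always lands in the same bucket, and the memory needed for the weight vector depends only on `hash_bucket_size`, not on how many distinct pairs exist.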
The hashing trick was popularized by Vowpal Wabbit, an online learning system developed at Yahoo and later Microsoft Research. Vowpal Wabbit relied heavily on hashed features to train logistic regression models with billions of parameters using only modest amounts of memory. The same idea later became standard in TensorFlow, scikit-learn, and most production CTR-prediction stacks.
TensorFlow's tf.feature_column.crossed_column uses this approach internally. The function signature was:
```python
tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)
```
Here, keys lists the features to cross, and hash_bucket_size controls the number of hash buckets. A common recommendation is to include the original (uncrossed) features alongside the cross so the model retains access to the individual signals.
In TensorFlow 2.x, the tf.feature_column API has been deprecated in favor of Keras preprocessing layers. The modern equivalent of crossed_column is tf.keras.layers.HashedCrossing. The Keras layer supports two output modes: "int" (returns the bucket index as an integer) and "one_hot" (returns a one-hot vector). It can be composed naturally inside a Keras Model definition without the older feature_column plumbing:
```python
import tensorflow as tf

cross_layer = tf.keras.layers.HashedCrossing(
    num_bins=20,
    output_mode="one_hot"
)

city = tf.constant(["NYC", "LA", "NYC"])
device = tf.constant(["mobile", "desktop", "mobile"])
crossed = cross_layer((city, device))
```
For users who want to keep using the higher-level tf.feature_column style without writing layers manually, TensorFlow recommends the tf.keras.utils.FeatureSpace utility, which provides a declarative wrapper around the underlying preprocessing layers.
Polynomial feature generation is a closely related technique that is most commonly applied to numerical data. Scikit-learn's PolynomialFeatures class produces all polynomial combinations of input features up to a specified degree:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False)
# Input: [a, b]
# Output: [1, a, b, a^2, ab, b^2]
```
Setting interaction_only=True removes the pure-power terms (like a^2 and b^2), leaving only the cross terms:
```python
poly = PolynomialFeatures(degree=2, interaction_only=True)
# Input: [a, b]
# Output: [1, a, b, ab]
```
This interaction_only mode is essentially pure feature crossing for numerical data. The key difference between polynomial features and categorical feature crosses is that polynomial features multiply continuous values, while categorical crosses form the Cartesian product and one-hot encode the result.
| Method | Data Type | Technique | Output Format |
|---|---|---|---|
| Categorical feature cross | Categorical | Cartesian product + one-hot | Sparse binary vector |
| Polynomial features | Numerical | Multiplication of feature values | Dense numeric vector |
| Bucketized cross | Numerical (discretized) | Bin + Cartesian product | Sparse binary vector |
| Hashed cross | Categorical (high cardinality) | Hash combined keys to fixed bins | Sparse binary vector |
A related class in scikit-learn, SplineTransformer, generates basis-function expansions that capture nonlinear effects of single features without producing explicit crosses. Practitioners often combine SplineTransformer for individual nonlinearities with PolynomialFeatures(interaction_only=True) for cross terms.
Different applications call for different cross representations. The table below summarizes the main variants encountered in practice.
| Variant | Best For | Pros | Cons |
|---|---|---|---|
| Manual concatenated cross | Small cardinality categorical data | Simple, exact, interpretable | Explodes with high cardinality |
| Hashed cross | High cardinality, web-scale CTR systems | Fixed memory footprint, fast | Hash collisions reduce signal slightly |
| Bucketized numerical cross | Geographic or time-of-day patterns | Captures sharp regional variation | Requires choosing bucket boundaries |
| Polynomial cross | Continuous numerical features | Smooth multiplicative interactions | Can amplify outliers and ill-conditioned scales |
| Embedded cross (FM, FFM) | Sparse high-cardinality with rare combos | Generalizes to unseen pairs via embeddings | More parameters and tuning required |
| Learned cross (DCN, xDeepFM) | Deep models with many features | Discovers crosses automatically | Less interpretable than explicit crosses |
A neural network with one or more hidden layers can learn feature interactions automatically through its nonlinear activation functions. Each neuron in a hidden layer computes a weighted sum of inputs and passes it through a nonlinearity, allowing the network to represent arbitrarily complex interactions without manual engineering.
However, neural networks learn interactions implicitly. They require sufficient training data to discover useful combinations, and the learned interactions are embedded in the network weights, making them difficult to interpret. Feature crosses, by contrast, are explicit and interpretable: the engineer can inspect which combinations matter and assign clear semantic meaning to each one.
Two failure modes are particularly common. First, neural networks may struggle to learn very high-order or very rare interactions when training data is limited. Crossing the relevant features and feeding them in directly removes the burden of discovery. Second, even when a network has enough capacity, the optimizer may not find the relevant interaction without thousands of gradient updates. Explicit crosses act as a strong inductive bias that accelerates training.
Factorization machines (FM), introduced by Steffen Rendle in 2010 at the IEEE International Conference on Data Mining, generalize the idea of feature crosses by replacing each crossed weight with a dot product of two low-dimensional embeddings. For pairwise interactions, the FM scoring function is:
y = w_0 + sum_i (w_i * x_i) + sum_{i<j} (<v_i, v_j> * x_i * x_j)
where each feature i has both a scalar weight w_i and an embedding vector v_i, and <v_i, v_j> is the dot product. The crucial advantage is that FMs can estimate interaction strengths even for pairs that never co-occur in training, because the embeddings are learned across all pairs that share at least one component. FMs were a step toward bridging the gap between sparse explicit crosses and dense learned representations.
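The scoring function can be written out directly. The sketch below implements the double sum as stated and, as an addition not spelled out in this article, the well-known linear-time rewrite of the pairwise term, sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i (v_{i,f} x_i)^2]. All parameter values are hypothetical.

```python
def fm_score_naive(w0, w, V, x):
    """FM score using the explicit double sum over pairs i < j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_score_fast(w0, w, V, x):
    """Same score via the linear-time identity for the pairwise term."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Tiny hypothetical parameters: 3 features, embedding size 2.
w0 = 0.1
w = [0.2, -0.1, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
x = [1.0, 1.0, 0.0]  # features 0 and 1 are active

naive = fm_score_naive(w0, w, V, x)
fast = fm_score_fast(w0, w, V, x)
```

Because the interaction weight for a pair is a dot product of per-feature embeddings, a pair that never appeared together in training still receives a nonzero estimate as long as each feature was seen with other partners.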
Field-aware Factorization Machines (FFM), proposed by Juan et al. in 2016, extend FM by giving each feature multiple embedding vectors, one per interacting field. FFM won several Kaggle CTR competitions, including the Criteo Display Ad Challenge, before deep models took over.
The Deep and Cross Network (DCN), introduced by Wang et al. in the 2017 ADKDD workshop paper "Deep & Cross Network for Ad Click Predictions" (arXiv:1708.05123), automates feature crossing within a neural architecture. DCN adds a "cross network" alongside a standard deep network. Each layer of the cross network explicitly computes feature interactions of increasing polynomial degree.
The cross network update rule for layer l+1 is:
x_{l+1} = x_0 * x_l^T * w_l + b_l + x_l
where x_0 is the input feature vector and w_l, b_l are learned parameters. After L cross layers, the network has implicitly enumerated polynomial cross terms up to degree L+1, but with parameter cost that grows linearly in L rather than exponentially.
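A dependency-free sketch of one cross layer follows, assuming x_0, x_l, w_l, and b_l are plain lists of equal length. The parameter values are hypothetical, and the two layers share weights only to keep the example short; in DCN each layer has its own w_l and b_l.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl.

    x0 * xl^T * w is computed as x0 scaled by the scalar dot product (xl . w),
    which avoids materializing the outer product x0 * xl^T.
    """
    scale = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0_i * scale + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


# Hypothetical 2-dimensional input and parameters.
x0 = [1.0, 2.0]
w = [0.5, -1.0]
b = [0.0, 0.1]

x1 = cross_layer(x0, x0, w, b)  # first layer: xl is x0 itself
x2 = cross_layer(x0, x1, w, b)  # second layer multiplies by x0 again
```

Each layer multiplies by x_0 once more, which is why the highest polynomial degree rises by one per layer while the parameter count per layer stays proportional to the input dimension.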
DCN-V2, published by Wang et al. at The Web Conference 2021 (arXiv:2008.13535), upgrades the cross network with a full weight matrix instead of a vector and adds a mixture-of-experts variant that exploits the low-rank structure of the learned cross matrix. DCN-V2 has been deployed across multiple Google web-scale ranking systems and is reported to deliver significant offline and online metric improvements over the original DCN.
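For comparison, a sketch of the matrix-weighted cross layer is given below. It assumes the commonly cited DCN-V2 form x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, where ⊙ is the elementwise product; that formula is not stated in this article, so treat it as an assumption. The values of `W` and `b` are hypothetical.

```python
def cross_layer_v2(x0, xl, W, b):
    """DCN-V2 style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl, elementwise in x0."""
    projected = [
        sum(W_ij * xl_j for W_ij, xl_j in zip(row, xl)) + b_i
        for row, b_i in zip(W, b)
    ]
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


# Hypothetical 2-dimensional input and parameters.
x0 = [1.0, 2.0]
W = [[0.5, 0.0], [0.0, -1.0]]
b = [0.0, 1.0]

x1 = cross_layer_v2(x0, x0, W, b)
```

The full matrix costs d x d parameters per layer for input dimension d, which is the motivation for the low-rank variant described above.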
The eXtreme Deep Factorization Machine (xDeepFM), proposed by Lian et al. at KDD 2018 (arXiv:1803.05170), introduces a Compressed Interaction Network (CIN) that performs explicit feature crossing at the vector level rather than at the bit level. CIN captures bounded-degree interactions explicitly while a parallel deep network captures arbitrary high-order interactions implicitly. The combination is designed to inherit the strengths of factorization machines, Wide and Deep, and DCN while addressing some of their limitations.
DeepFM, introduced by Guo et al. at IJCAI 2017 (arXiv:1703.04247), is another widely deployed hybrid. It combines a factorization-machine component for low-order interactions with a multi-layer perceptron for high-order interactions, with both components sharing the same input embeddings. Compared with Wide and Deep, DeepFM removes the need for hand-crafted cross features by relying on the FM component to learn pairwise interactions automatically.
Decision trees and ensemble methods like random forests and gradient boosting (including XGBoost, LightGBM, and CatBoost) learn feature interactions inherently. A decision tree that splits first on feature A and then on feature B within a subtree has effectively learned the interaction A x B. Because of this, manually adding feature crosses to tree-based models usually provides less benefit than adding them to linear models.
That said, there are cases where explicit crosses still help tree models. If an important interaction involves more than two features, a tree may need several levels of splits to capture it, and providing the cross directly as a new feature can shorten the required tree depth. Crosses can also reduce the number of trees needed for the same predictive performance, which lowers inference latency in production.
| Model Type | Learns Interactions Automatically? | Benefit of Manual Feature Crosses |
|---|---|---|
| Linear model | No | Very high |
| Neural network | Yes (implicitly) | Low to moderate |
| Decision tree / ensemble | Yes (via splits) | Low (occasionally moderate) |
| Factorization machine | Yes (pairwise via embeddings) | Built-in |
| DCN / xDeepFM | Yes (explicitly + implicitly) | Built-in |
In 2016, Google published "Wide & Deep Learning for Recommender Systems" (Cheng et al., arXiv:1606.07792), describing an architecture that pairs a wide linear model with a deep neural network. The wide component uses feature crosses to memorize specific user-item co-occurrences, while the deep component uses embeddings and hidden layers to generalize to unseen feature combinations.
The wide side takes cross-product transformations of the form:
cross(feature_i, feature_j) = 1 if feature_i and feature_j are both active, and 0 otherwise
These sparse crosses allow the model to memorize that, for example, "users who installed app X also installed app Y." The deep side embeds sparse features into low-dimensional dense vectors and feeds them through multiple layers to learn generalizable patterns.
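The wide-side transformation amounts to a logical AND over binary features. The sketch below assumes each example is a dictionary from feature names to 0 or 1; the feature names are invented for illustration.

```python
def cross_product(x, cross_features):
    """Return 1 if every feature named in `cross_features` is active in x, else 0.

    `x` maps binary feature names to 0/1; a missing name counts as inactive.
    """
    return int(all(x.get(name, 0) == 1 for name in cross_features))


# Hypothetical one-hot style features for a single example.
example = {"installed_app=X": 1, "impression_app=Y": 1, "impression_app=Z": 0}

fires = cross_product(example, ["installed_app=X", "impression_app=Y"])
silent = cross_product(example, ["installed_app=X", "impression_app=Z"])
```

The wide linear model then learns one weight per cross, so a frequently observed pair can be memorized directly.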
Google deployed Wide and Deep on the Google Play app store, which served over one billion active users and over one million apps at the time of the paper. The system significantly improved app acquisition rates compared to wide-only and deep-only baselines while meeting strict training and serving latency requirements. The architecture demonstrated that memorization (via feature crosses) and generalization (via deep networks) are complementary strengths that work best in combination.
The Wide and Deep paper also released a high-level TensorFlow API that made the architecture easy to reproduce, which contributed to the broad adoption of explicit feature crosses in production deep learning pipelines through the late 2010s. Even after pure deep models became more common, the lessons from Wide and Deep influenced later architectures including DCN, DeepFM, and xDeepFM, all of which try to give a deep model some form of explicit cross signal.
The simplest way to create a feature cross in Python is to combine columns directly in a pandas DataFrame:
```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 40, 35],
    'income': [50000, 80000, 65000]
})

# Numerical cross
df['age_x_income'] = df['age'] * df['income']
```
For categorical features:
```python
df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'device': ['mobile', 'desktop', 'mobile']
})

# Categorical cross
df['city_x_device'] = df['city'] + '_' + df['device']
```
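With scikit-learn, interaction terms can be generated inside a pipeline ahead of a linear classifier:
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X_train (numeric feature matrix) and y_train (labels) are assumed to be defined elsewhere.
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, interaction_only=True)),
    ('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
```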
The legacy TensorFlow feature-column API expresses a hashed cross as follows:
```python
import tensorflow as tf

city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', ['NYC', 'LA', 'Chicago'])
device = tf.feature_column.categorical_column_with_vocabulary_list(
    'device', ['mobile', 'desktop', 'tablet'])

# Hash every (city, device) combination into one of 20 buckets.
city_x_device = tf.feature_column.crossed_column(
    [city, device], hash_bucket_size=20)
```
Note that tf.feature_column has been deprecated. New TensorFlow code should prefer the Keras-native version below.
```python
import tensorflow as tf

cross = tf.keras.layers.HashedCrossing(
    num_bins=1000,
    output_mode='one_hot'
)

city = tf.keras.Input(shape=(1,), dtype=tf.string, name='city')
device = tf.keras.Input(shape=(1,), dtype=tf.string, name='device')
crossed = cross((city, device))
```
PyTorch does not ship a dedicated cross layer, but the same pattern is easy to write by hand:
```python
import torch
import torch.nn as nn


class HashedCross(nn.Module):
    def __init__(self, num_bins, embedding_dim):
        super().__init__()
        self.num_bins = num_bins
        self.embed = nn.Embedding(num_bins, embedding_dim)

    def forward(self, a, b):
        # a, b are LongTensor IDs
        combined = a * 1_000_003 + b  # mix the two IDs
        bucket = combined % self.num_bins
        return self.embed(bucket)
```
Many production systems use a similar pattern but pass the combined key through a well-mixing non-cryptographic hash function (such as MurmurHash3) before the modulus to keep the bucket distribution uniform across keys.
TensorFlow Recommenders provides a ready-made tfrs.layers.dcn.Cross layer that implements the DCN cross-network update rule. A minimal model looks like:
```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

inputs = tf.keras.Input(shape=(64,))
x = tfrs.layers.dcn.Cross()(inputs)
x = tfrs.layers.dcn.Cross()(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, output)
```
Not all feature combinations are useful. Practitioners typically rely on:
Crossing many features at high order produces a combinatorial explosion. A dataset with 50 features has 1,225 pairwise crosses and over 19,000 three-way crosses. Strategies to manage this include:
Crossed categorical features are inherently sparse. Efficient storage with compressed sparse row (CSR) matrices and sparse-aware optimizers is essential for web-scale systems. TensorFlow, PyTorch, and scikit-learn all support sparse feature representations. For very wide models with billions of crossed features, distributed parameter servers or specialized embedding-table sharding (as in TorchRec or DeepRec) are commonly used.
The hashing trick reduces dimensionality at the cost of collisions. Two strategies help mitigate the impact:
In practice, modern CTR systems use bucket sizes in the millions and report negligible accuracy loss compared with the full feature space.
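A back-of-the-envelope estimate shows why larger bucket counts help. The sketch assumes an idealized hash that spreads keys uniformly and independently, under which a given key shares its bucket with at least one other key with probability 1 - (1 - 1/m)^(n - 1) for n distinct keys and m buckets. The key and bucket counts below are illustrative only.

```python
def expected_collision_fraction(n_keys, n_buckets):
    """Expected fraction of keys that share a bucket with at least one other key,
    assuming the hash spreads keys uniformly and independently."""
    return 1.0 - (1.0 - 1.0 / n_buckets) ** (n_keys - 1)


small = expected_collision_fraction(n_keys=100_000, n_buckets=10_000)
large = expected_collision_fraction(n_keys=100_000, n_buckets=10_000_000)
```

With far fewer buckets than keys almost every key collides, while with roughly 100 times more buckets than keys only about one key in a hundred does. Real key frequencies are heavily skewed, so the effect on accuracy also depends on which keys collide.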
Because crossed features dramatically inflate the parameter count, regularization is essential. The two most common choices are:
Google's FTRL-Proximal optimizer, used in many production click-through-rate models, combines L1 and L2 regularization and is particularly well suited to wide linear models with hashed crosses.
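For readers who want to see the mechanics, here is a compact sketch of per-coordinate FTRL-Proximal for logistic regression over sparse binary features, following the commonly published form of the algorithm. The class name, hyperparameter defaults, and feature strings are assumptions for illustration, not a reference implementation.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.01, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, feature):
        z = self.z.get(feature, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps weakly supported features at exactly zero
        n = self.n.get(feature, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        """Predicted probability for an example given as a list of active feature names."""
        score = sum(self._weight(f) for f in features)
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """One online step on a single example with label 0 or 1."""
        p = self.predict(features)
        g = p - label  # log-loss gradient for each active feature (value 1)
        for f in features:
            n = self.n.get(f, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[f] = self.z.get(f, 0.0) + g - sigma * self._weight(f)
            self.n[f] = n + g * g


model = FTRLProximal()
for _ in range(20):
    model.update(["country=JP_x_device=mobile"], label=1)
    model.update(["country=US_x_device=desktop"], label=0)
```

Weights are materialized lazily from `z` and `n`, so crosses whose accumulated signal never exceeds the L1 threshold stay at zero and need no storage in the served model.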
In production, the distribution of crossed features can shift quickly. A new product launch may introduce a previously unseen category x country combination, or a viral query may suddenly dominate the query_term x location cross. Regular retraining and online learning are common defenses. Some systems use streaming algorithms such as FTRL to update weights continuously as new data arrives.
CTR prediction for online advertising is the canonical home of feature crossing. Crossed features such as user_demographic x ad_category, query_term x ad_creative, or time_of_day x device capture the joint signals that make a particular ad relevant to a particular user in a particular context. Major ad platforms including Google AdSense, Microsoft Bing Ads, and Facebook Ads have all published papers describing cross-heavy linear or hybrid models for CTR.
Recommender systems for streaming, e-commerce, and app stores rely on user-item interactions. Crossing user IDs with item categories, time of day, or device produces strong memorization signals, while embeddings on the deep side handle generalization to new users and items. The Google Play Wide and Deep deployment is the best-known example.
Web and product search ranking models cross query terms with document attributes, user location, and session features. The cross query_intent x document_topic is particularly common for capturing topical relevance.
Financial fraud detection benefits from crosses that combine transactional and contextual variables. Examples include merchant_category x transaction_amount, card_country x transaction_country, and device_id x time_of_day. These crosses encode the kinds of "out of pattern" combinations that often precede fraud.
Ride-sharing, food delivery, and weather forecasting models often use bucketized latitude and longitude crossed with time of day or day of week. The Google Machine Learning Crash Course example using bucketized lat-lon crosses for a California housing model is a classic illustration.
The idea of including interaction terms in regression models predates machine learning by many decades. R. A. Fisher studied factorial designs and interaction effects in agricultural experiments in the 1920s, and George Box extended this line of work to industrial experiments in the 1950s. Modern feature crossing is the same idea, scaled up and packaged for high-cardinality categorical data.
The first widely cited use of large-scale hashed feature crosses in industry came with sponsored search systems at Google, Yahoo, and Microsoft in the mid-2000s. These systems trained logistic regression models with billions of parameters by hashing combinations of query terms, ad creatives, and user attributes into fixed-size bucket arrays. Vowpal Wabbit, released as open source by John Langford and collaborators around 2007, made the hashing trick accessible outside the largest tech companies.
The 2010 Factorization Machines paper by Steffen Rendle reframed crossed features as embedding lookups, opening the door to dense generalization. The 2016 Wide and Deep paper from Google then combined explicit crosses with deep networks. From 2017 onward, DCN, DeepFM, and xDeepFM each tried to automate the discovery of which crosses matter most. By 2020 and 2021, with the rise of deeper recommenders trained on trillions of examples, the trend shifted toward learning crosses implicitly through embeddings and attention, but the explicit cross has not disappeared. It remains a competitive technique whenever interpretability, low latency, or data efficiency is a priority.
Feature crosses are powerful but not universally applicable. Some of their key limitations are:
Imagine you are trying to figure out which ice cream flavors people like. Knowing someone's favorite color ("blue") or their age ("7") alone does not tell you much. But if you put those two facts together ("7-year-old who likes blue"), you might discover that kids around that age who like blue tend to love blueberry ice cream. A feature cross is just putting two clues together to make a stronger clue that helps the computer guess better.
It is like making a checklist of every interesting pair: red and small, red and large, blue and small, blue and large. Each combo gets its own little cubbyhole, and the computer learns which cubbyhole goes with which answer. When two clues are weak on their own but powerful together, the cross is what lets the computer notice the partnership.