See also: Machine learning, Deep learning, Recommendation system
A wide model is a type of machine learning model that uses a large number of input features, often with sparse, high-dimensional representations such as one-hot encoding and cross-product feature transformations, to memorize specific patterns in training data. Wide models are most commonly associated with the Wide & Deep Learning framework introduced by Google in 2016, which combines a wide linear component with a deep neural network to balance memorization and generalization. The term "wide model" can also refer more broadly to neural networks with a large number of neurons per layer relative to their depth.
Imagine you are studying for a test. One way to prepare is to memorize every single flashcard your teacher gave you. You will do great on any question that matches a flashcard exactly, but if the test has a question you never saw before, you might be stuck. That is what a wide model does: it memorizes specific combinations of things it has seen.
Another way to study is to understand the general ideas behind the flashcards. You might not remember every detail, but you can figure out answers to new questions by thinking about the big picture. That is what a deep model does.
A Wide & Deep model uses both strategies at the same time. It memorizes the specific flashcards AND understands the big ideas, so it can handle both familiar questions and new ones. This is why Google uses it to recommend apps: it remembers that people who searched for "fried chicken" liked a specific restaurant app, and it also understands that people who like food apps might enjoy cooking apps too.
In many real-world applications, particularly in recommendation systems and advertising, models need to perform two distinct tasks simultaneously: memorization of feature interactions that appear frequently in historical data, and generalization to feature combinations that have rarely or never been observed.
Traditional linear regression and logistic regression models excel at memorization when supplied with cross-product feature transformations. These models are interpretable, fast to train, and effective with sparse representations. However, they require extensive feature engineering and cannot generalize well to unseen feature combinations.
Deep neural networks, on the other hand, can generalize to previously unseen feature combinations through learned embeddings. But when the interaction matrix between users and items is sparse and high-rank (as is typical in recommendation settings), deep networks can over-generalize and produce less relevant predictions.
This tension between memorization and generalization motivated the development of the Wide & Deep Learning framework.
The Wide & Deep Learning framework was introduced in the paper "Wide & Deep Learning for Recommender Systems" by Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah, all from Google. The paper was published at the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016) and made available on arXiv in June 2016 (arXiv:1606.07792).
The framework jointly trains two components: a wide (linear) component and a deep (neural network) component. Their outputs are combined before being fed to a common loss function for joint training.
The wide component is a generalized linear model of the form:
y = w^T * x + b
where y is the prediction, w is a vector of model parameters, x is the feature vector, and b is the bias term.
The key ingredient of the wide component is the cross-product transformation, which creates new features by taking the cross-product of binary input features. For example, if the input features are gender=female and language=en, the cross-product transformation produces a new feature AND(gender=female, language=en) that is 1 only when both conditions are true. These cross-product features allow the wide component to memorize specific feature co-occurrences and their correlation with the target label.
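As a minimal sketch, a cross-product transformation over one-hot categorical features reduces to a conjunction check. The feature names below follow the gender/language example above; the helper function is illustrative, not part of any library:

```python
def cross_product(example, feature_pairs):
    """Return 1/0 cross-product features for the given (feature, value) pairs.

    example: dict mapping feature name -> categorical value.
    feature_pairs: list of ((f1, v1), (f2, v2)) pairs to cross.
    """
    crossed = {}
    for (f1, v1), (f2, v2) in feature_pairs:
        name = f"AND({f1}={v1}, {f2}={v2})"
        # The crossed feature fires only when both conditions hold.
        crossed[name] = int(example.get(f1) == v1 and example.get(f2) == v2)
    return crossed

example = {"gender": "female", "language": "en"}
features = cross_product(example, [(("gender", "female"), ("language", "en"))])
# features == {"AND(gender=female, language=en)": 1}
```

Each crossed feature becomes one more dimension of the sparse input vector x in the linear model above, with its own learned weight in w.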
The wide component captures interactions that appear frequently in the training data. For instance, in an app recommendation setting, the model can learn that AND(user_installed_app=netflix, impression_app=hulu) has a high correlation with app installation, effectively memorizing the relationship between these two apps.
A limitation of cross-product transformations is that they cannot generalize to feature pairs that have not appeared in the training data. If a new streaming app launches and no training examples exist with that app, the wide component has no information about it.
The deep component is a feed-forward neural network that converts high-dimensional, sparse categorical features into low-dimensional, dense embedding vectors. Each categorical feature is mapped to an embedding vector (typically 32 dimensions in the original Google implementation), and all embeddings are concatenated together with continuous dense features to form a single dense input vector.
This concatenated vector is then passed through multiple hidden layers with ReLU activation functions. In the Google Play implementation described in the original paper, the deep component used three hidden layers with 1024, 512, and 256 units respectively, fed by a concatenated embedding vector of approximately 1200 dimensions.
Because the deep component learns embeddings, it can generalize to feature combinations never seen during training. If two apps have similar embedding vectors (because they share attributes like category, developer, or user demographics), the deep component can predict that a user who likes one might also like the other, even without direct co-occurrence evidence.
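The deep component's forward pass can be sketched in NumPy as an embedding lookup, a concatenation with the dense features, and a stack of ReLU layers. The vocabulary sizes, feature names, and random weights below are illustrative; only the 32-dimensional embeddings and the 1024/512/256 layer sizes come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 32  # embedding size used in the original Google implementation
vocab_sizes = {"app_category": 10, "device": 5}  # hypothetical features
embeddings = {f: rng.normal(size=(n, EMB_DIM)) for f, n in vocab_sizes.items()}

def deep_forward(categorical_ids, dense_features, weights):
    # Look up each categorical feature's embedding and concatenate them
    # with the continuous features into one dense input vector.
    x = np.concatenate([embeddings[f][i] for f, i in categorical_ids.items()]
                       + [dense_features])
    # Pass through the ReLU hidden layers (1024 -> 512 -> 256 in the paper).
    for W, b in weights:
        x = np.maximum(0.0, W @ x + b)
    return x

# Layer dimensions: 2 embeddings * 32 dims + 3 dense features = 67 inputs.
dims = [len(vocab_sizes) * EMB_DIM + 3, 1024, 512, 256]
weights = [(rng.normal(scale=0.01, size=(dims[i + 1], dims[i])),
            np.zeros(dims[i + 1])) for i in range(3)]

h = deep_forward({"app_category": 2, "device": 1},
                 np.array([0.5, 1.0, 0.0]), weights)
# h is the 256-unit top hidden layer, ready to be turned into a logit
```

In the full model, h is projected to a single logit that is summed with the wide component's output.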
The wide and deep components are combined by summing their output log-odds as the final prediction, which is then fed into a common logistic loss function. During training, prediction errors are backpropagated to both sides of the model simultaneously using mini-batch stochastic optimization.
This joint training approach differs from an ensemble of separate models. In an ensemble, individual models are trained independently and their predictions are combined only at inference time. In joint training, both components are optimized together. The wide component only needs to complement the weaknesses of the deep component (and vice versa), so it requires fewer cross-product feature transformations than a standalone wide model would.
The optimizers used for the two components also differ. In the original implementation, the wide component was trained using Follow-the-Regularized-Leader (FTRL) with L1 regularization, while the deep component was trained using AdaGrad.
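A toy sketch of one joint update follows. For brevity, plain SGD stands in for FTRL and AdaGrad, and a single linear weight vector stands in for each tower; the point is that one shared prediction error updates both components simultaneously:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def joint_step(w_wide, w_deep, x_wide, x_deep, label, lr=0.1):
    """One joint update on a single example: both components share the error."""
    # Sum the two components' log-odds to form the combined logit.
    logit = (sum(w * x for w, x in zip(w_wide, x_wide))
             + sum(w * x for w, x in zip(w_deep, x_deep)))
    p = sigmoid(logit)
    err = p - label  # gradient of the logistic loss w.r.t. the logit
    # Backpropagate the same error into both components at once.
    w_wide = [w - lr * err * x for w, x in zip(w_wide, x_wide)]
    w_deep = [w - lr * err * x for w, x in zip(w_deep, x_deep)]
    return w_wide, w_deep, p
```

Because both weight sets move in response to the same residual, each component only needs to model what the other misses, which is what distinguishes joint training from an ensemble.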
| Component | Model type | Input features | Role | Optimizer |
|---|---|---|---|---|
| Wide | Generalized linear model | Cross-product transformations of sparse categorical features | Memorization of specific feature co-occurrences | FTRL with L1 regularization |
| Deep | Feed-forward neural network | Dense embeddings of categorical features concatenated with continuous features | Generalization to unseen feature combinations | AdaGrad |
| Combined | Joint model | Both sparse cross-product features and dense embeddings | Balanced memorization and generalization | Both (simultaneously) |
The Wide & Deep model was deployed in production at Google Play, a commercial mobile app store with over one billion active users and over one million apps. The system powered the app recommendation feature on the main landing page of the store.
The recommendation pipeline consisted of two stages: a retrieval stage, which used a combination of machine-learned models and human-defined rules to reduce the full catalog to a short list of candidate apps, and a ranking stage, in which the Wide & Deep model scored each candidate by its probability of installation.
The model was evaluated both offline (on held-out data) and online (through live A/B testing over a three-week period).
| Metric | Wide-only model | Deep-only model | Wide & Deep model |
|---|---|---|---|
| Offline AUC improvement over baseline | Baseline | -0.004 | +0.002 |
| Online acquisition gain vs. control | Control | +2.9% | +3.9% |
| Serving latency | Not reported | Not reported | 14 ms (reduced from 31 ms) |
The online A/B test showed that the Wide & Deep model achieved a 3.9% improvement in app acquisition rate over the wide-only control group, while a deep-only model achieved a 2.9% improvement over the same control. Notably, the online gains were much larger than the modest offline AUC differences, a discrepancy the authors attributed to the limits of offline evaluation on fixed historical data. Serving latency was also reduced from 31 ms to 14 ms by splitting each scoring batch into smaller batches run in parallel across multiple threads.
The distinction between memorization and generalization is central to understanding wide models and their role in the Wide & Deep framework.
Memorization refers to a model's ability to learn and exploit specific feature co-occurrences in the training data. A purely memorization-based model (like a wide linear model with cross-product features) can make highly precise predictions for feature combinations it has seen before, but it cannot handle novel combinations.
In recommendation systems, memorization means the model can recall that a specific user who installed app A also installed app B. This is valuable for "exploitation" of known user preferences but does not help with discovering new interests.
Generalization refers to a model's ability to make accurate predictions on data it has not seen during training. A deep learning model generalizes by learning low-dimensional feature representations (embeddings) that capture semantic similarities between items. Items with similar attributes end up close together in embedding space, allowing the model to infer relationships between items even without direct co-occurrence data.
In recommendation systems, generalization enables "exploration" of new content. If a user likes cooking apps, the model can infer they might also enjoy meal planning apps, even if no users in the training data have installed both.
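This kind of inference works because related items sit near each other in embedding space, which is typically measured with cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings are learned and much higher-dimensional:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical learned embeddings for three apps.
cooking = np.array([0.9, 0.1, 0.4])
meal_plan = np.array([0.8, 0.2, 0.5])
weather = np.array([-0.3, 0.9, 0.0])

# The cooking and meal-planning apps are close in embedding space,
# so a model can infer shared appeal without co-occurrence evidence.
```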
| Aspect | Memorization (wide) | Generalization (deep) |
|---|---|---|
| Mechanism | Cross-product feature transformations | Learned dense embeddings |
| Strength | Precise recall of seen patterns | Transfer to unseen combinations |
| Weakness | Cannot handle novel feature pairs | May over-generalize with sparse data |
| Analogy in recommendation | "Users who bought X also bought Y" | "Users who like category A may like category B" |
| Feature type | Sparse, high-dimensional | Dense, low-dimensional |
| Engineering effort | Requires manual feature engineering | Learns features automatically |
The term "wide model" is also used in a broader sense in neural network research to describe networks that have many neurons per hidden layer, as opposed to deep networks that have many layers.
In 2016, Sergey Zagoruyko and Nikos Komodakis introduced Wide Residual Networks (WRN), which challenged the prevailing trend of making networks deeper. They demonstrated that a 16-layer wide residual network could match or exceed the accuracy of a 1000-layer thin residual network with a comparable number of parameters, while being several times faster to train. Their experiments on CIFAR-10, CIFAR-100, SVHN, COCO, and ImageNet showed that increasing width is often a more computationally efficient strategy than increasing depth.
Wide residual networks addressed two key problems with extremely deep networks: diminishing feature reuse, in which many residual blocks contribute little because gradient flow can bypass them through the skip connections, and long training times, since greater depth makes computation more sequential.
Theoretical work on infinitely wide neural networks has revealed connections between wide networks and kernel methods. The Neural Tangent Kernel (NTK) framework, developed by Jacot, Gabriel, and Hongler in 2018, showed that in the limit of infinite width, a neural network's training dynamics become equivalent to kernel regression with a specific kernel (the NTK). In this regime, known as the "lazy training" regime, the network's parameters barely move from their initialization, and the network effectively fits the data by reweighting a fixed set of basis functions rather than learning new representations.
This theoretical result has practical implications: very wide networks may behave more like kernel methods and less like feature-learning neural networks. Finite-width networks used in practice do learn representations and exhibit feature learning, which is considered one of the most important properties of deep learning. The NTK framework helps explain why simply making a network wider does not always improve performance and why depth plays a complementary role in learning hierarchical features.
Research comparing wide and deep architectures has revealed several key differences:
| Property | Wider networks | Deeper networks |
|---|---|---|
| Feature representation | Capture diverse features at each layer simultaneously | Build hierarchical, increasingly abstract features across layers |
| Optimization | Smoother loss landscapes, easier to train | More complex optimization, potential for vanishing/exploding gradients |
| Computational cost | Scales quadratically with width per layer | Scales linearly with depth (but more sequential) |
| Theoretical behavior | Approach kernel methods at infinite width | Can learn richer representations beyond kernel regime |
| Robustness | More robust to adversarial perturbations due to redundancy | More susceptible to adversarial examples in some cases |
| Universal approximation | Single hidden layer with sufficient width can approximate any function (Cybenko, 1989) | Networks with width n+m+2 and arbitrary depth are also universal approximators (Lu et al., 2017) |
The universal approximation theorem, first proven by George Cybenko in 1989 for sigmoid activation functions and later extended to other activations by Kurt Hornik in 1991, states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets, given sufficient width. While this theorem guarantees that wide, shallow networks have the representational capacity to approximate any function, it does not provide bounds on how many neurons are needed. In practice, deep networks can often represent the same functions much more efficiently (requiring exponentially fewer parameters) than shallow, wide networks.
The Wide & Deep framework inspired a family of models that aim to capture feature interactions more effectively. These successors generally improve on the original Wide & Deep architecture by replacing the manually engineered cross-product features in the wide component with learned feature interactions.
DeepFM, proposed by Guo et al. at IJCAI 2017, replaces the wide component's linear model with a Factorization Machine (FM). The FM component models second-order (pairwise) feature interactions using learned latent vectors, while the deep component models higher-order interactions through a feed-forward network. Both components share the same input embeddings, eliminating the need for manual feature engineering. The final prediction is the sum of the FM and deep outputs.
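The FM's pairwise-interaction term can be computed in O(nk) time rather than O(n²k) using a well-known algebraic identity. A sketch, with random toy data standing in for learned latent vectors:

```python
import numpy as np

def fm_second_order(x, V):
    """Pairwise-interaction term of a Factorization Machine.

    x: (n,) feature vector; V: (n, k) latent vectors, one row per feature.
    Uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ].
    """
    mixed = V.T @ x                   # (k,): sum_i x_i * v_i
    squared = (V ** 2).T @ (x ** 2)   # (k,): sum_i x_i^2 * v_i^2
    return 0.5 * float(np.sum(mixed ** 2 - squared))

# Sanity check against the explicit O(n^2) pairwise sum.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
V = rng.normal(size=(4, 3))
brute = sum(float(V[i] @ V[j]) * x[i] * x[j]
            for i in range(4) for j in range(i + 1, 4))
```

In DeepFM this term is added to a first-order linear term and the deep network's output before the sigmoid.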
Key advantages of DeepFM over Wide & Deep:

- No manual feature engineering: the FM component learns pairwise interactions directly from data, rather than relying on hand-specified cross-product transformations.
- Shared embeddings: the FM and deep components read the same input embeddings, making training more efficient and keeping both views of the data consistent.
- End-to-end training: the whole model is trained jointly from raw features, with no pre-training step required.
The Deep & Cross Network, introduced by Wang et al. in 2017, replaces the wide component with a "cross network" that explicitly computes feature crossings at each layer. Each layer of the cross network takes the original input features and the output of the previous cross layer and combines them through a specific crossing operation. This allows the model to learn bounded-degree feature interactions without manual feature engineering.
DCN-V2, an improved version proposed by Wang et al. in 2021, replaced the cross weight vector in the original DCN with a full weight matrix. This change enabled the model to capture "bit-wise" interactions (interactions between specific positions within feature vectors) rather than just element-wise interactions. DCN-V2 achieved state-of-the-art results on several benchmark datasets and was deployed at scale in Google's production ranking systems.
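The two crossing operations can be written in a few lines of NumPy. The dimensions and random inputs below are illustrative; only the update rules follow the papers:

```python
import numpy as np

def cross_layer_v1(x0, xl, w, b):
    # Original DCN: x_{l+1} = x0 * (xl . w) + b + xl.
    # w is a vector, so the interaction reduces to a scalar
    # reweighting of the original input x0.
    return x0 * float(xl @ w) + b + xl

def cross_layer_v2(x0, xl, W, b):
    # DCN-V2: x_{l+1} = x0 * (W @ xl + b) + xl (elementwise product).
    # W is a full matrix, enabling "bit-wise" interactions between
    # individual positions of the feature vectors.
    return x0 * (W @ xl + b) + xl

d = 4
rng = np.random.default_rng(2)
x0 = rng.normal(size=d)
out_v1 = cross_layer_v1(x0, x0, rng.normal(size=d), np.zeros(d))
out_v2 = cross_layer_v2(x0, x0, rng.normal(size=(d, d)), np.zeros(d))
```

Stacking L such layers yields feature interactions of degree up to L+1 in the original input, with no manual cross-product engineering.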
| Model | Year | Wide/interaction component | Feature engineering required | Feature interaction order | Key innovation |
|---|---|---|---|---|---|
| Wide & Deep | 2016 | Linear model with cross-product features | Yes (manual) | Manually specified | Joint training of wide and deep components |
| DeepFM | 2017 | Factorization Machine | No | 2nd order (automatic) | Shared embeddings between FM and deep components |
| DCN | 2017 | Cross network with weight vector | No | Bounded degree (automatic) | Cross network for explicit feature crossing |
| xDeepFM | 2018 | Compressed Interaction Network (CIN) | No | Bounded order (automatic) | Vector-level feature interactions via outer products |
| DCN-V2 | 2021 | Cross network with weight matrix | No | Bounded degree (automatic) | Bit-wise interactions via full weight matrix |
Wide models and the Wide & Deep framework have been applied across a range of domains:
The original application of Wide & Deep was in recommendation systems, specifically for Google Play app recommendations. The framework has since been widely adopted by companies building large-scale recommendation engines for content, products, and advertisements. Both memorization of user-item interaction history and generalization to new items are important for producing relevant recommendations.
Click-through rate (CTR) prediction is a core task in online advertising, where the goal is to estimate the probability that a user will click on a given advertisement. Wide & Deep models and their successors (DeepFM, DCN, DCN-V2) have become standard architectures for CTR prediction, as they handle the sparse, high-dimensional feature spaces typical of ad targeting while also capturing complex user-ad interactions through learned representations.
Wide & Deep models have been applied to search ranking problems, where the system must score and rank search results based on relevance to a user query. The wide component can memorize specific query-document matches, while the deep component generalizes to related queries and documents that share semantic similarity.
In natural language processing, wide models have been used for text classification and sentiment analysis. Sparse feature representations (such as bag-of-words or n-gram features) fed into a wide component can capture specific word patterns associated with particular classes, while the deep component can learn distributed representations of text for better generalization.
In computer vision, the concept of wider networks has been explored through Wide Residual Networks and other architectures that increase the number of channels (feature maps) per layer. These wider architectures have shown competitive or superior performance compared to extremely deep but narrow networks, particularly when training computational budgets are constrained.
The Wide & Deep model was open-sourced as part of TensorFlow. The tf.estimator.DNNLinearCombinedClassifier and tf.estimator.DNNLinearCombinedRegressor classes provided built-in implementations of the framework. In TensorFlow 2.x and Keras, practitioners can build custom Wide & Deep architectures using the Functional API by defining separate input branches for the wide and deep components and merging them before the output layer.
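A minimal Functional API sketch of such a custom build follows. The feature sizes are illustrative, not production values, and a single optimizer is used for both branches; reproducing the per-component FTRL/AdaGrad split of the original implementation would require a custom training loop or the legacy estimator classes:

```python
import tensorflow as tf

n_wide = 100  # hypothetical number of sparse cross-product features (multi-hot)
n_vocab = 10  # hypothetical vocabulary size of one categorical feature

wide_in = tf.keras.Input(shape=(n_wide,), name="wide")
deep_in = tf.keras.Input(shape=(1,), dtype="int32", name="deep")

# Deep branch: 32-dim embedding (as in the paper) -> ReLU stack.
emb = tf.keras.layers.Embedding(n_vocab, 32)(deep_in)
x = tf.keras.layers.Flatten()(emb)
for units in (1024, 512, 256):
    x = tf.keras.layers.Dense(units, activation="relu")(x)

# Merge: sum the wide and deep log-odds, then apply the sigmoid.
wide_logit = tf.keras.layers.Dense(1)(wide_in)
deep_logit = tf.keras.layers.Dense(1)(x)
logit = tf.keras.layers.Add()([wide_logit, deep_logit])
out = tf.keras.layers.Activation("sigmoid")(logit)

model = tf.keras.Model([wide_in, deep_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because both branches feed one loss, fitting this model trains the wide and deep components jointly, as described above.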
A PyTorch implementation is available through the pytorch-widedeep library, which provides a flexible framework for combining tabular data with text and image inputs using Wide & Deep architectures.
A typical Wide & Deep architecture for a recommendation task might include:
| Parameter | Value |
|---|---|
| Wide features | Cross-product transformations of user and item categorical features |
| Deep embedding dimension | 32 per categorical feature |
| Deep hidden layers | 3 layers (1024, 512, 256 units) |
| Activation function | ReLU |
| Output | Logistic (sigmoid) for binary classification |
| Wide optimizer | FTRL with L1 regularization |
| Deep optimizer | AdaGrad or Adam |
| Training | Mini-batch stochastic gradient descent |