Wide Model


Updated Jul 16, 2026


A wide model is a type of machine learning model that uses a large number of input features, often with sparse, high-dimensional representations such as one-hot encoding and cross-product feature transformations, to memorize specific patterns in training data. Wide models are most commonly associated with the Wide & Deep Learning framework introduced by Google in 2016, which combines a wide linear component with a deep neural network to balance memorization and generalization.^[1]^[12] The term "wide model" can also refer more broadly to neural networks with a large number of neurons per layer relative to their depth.

Explain like I'm 5 (ELI5)

Imagine you are studying for a test. One way to prepare is to memorize every single flashcard your teacher gave you. You will do great on any question that matches a flashcard exactly, but if the test has a question you never saw before, you might be stuck. That is what a wide model does: it memorizes specific combinations of things it has seen.

Another way to study is to understand the general ideas behind the flashcards. You might not remember every detail, but you can figure out answers to new questions by thinking about the big picture. That is what a deep model does.

A Wide & Deep model uses both strategies at the same time. It memorizes the specific flashcards AND understands the big ideas, so it can handle both familiar questions and new ones. This is why Google uses it to recommend apps: it remembers that people who searched for "fried chicken" liked a specific restaurant app, and it also understands that people who like food apps might enjoy cooking apps too.

Background and motivation

In many real-world applications, particularly in recommendation systems and advertising, models need to perform two distinct tasks simultaneously:

Memorization: learning the direct associations between features that co-occur frequently in the training data. For example, remembering that users who installed a specific app also tend to install a related app.^[1]
Generalization: being able to make predictions about feature combinations that have rarely or never appeared in the training data, by learning transferable representations.^[1]

Traditional linear regression and logistic regression models excel at memorization when supplied with cross-product feature transformations. These models are interpretable, fast to train, and effective with sparse representations. However, they require extensive feature engineering and cannot generalize well to unseen feature combinations.^[1]

Deep neural networks, on the other hand, can generalize to previously unseen feature combinations through learned embeddings. But when the interaction matrix between users and items is sparse and high-rank (as is typical in recommendation settings), deep networks can over-generalize and produce less relevant predictions.^[1]

This tension between memorization and generalization motivated the development of the Wide & Deep Learning framework.

The Wide & Deep Learning framework

The Wide & Deep Learning framework was introduced in the paper "Wide & Deep Learning for Recommender Systems" by Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah, all from Google. The paper was published at the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016) and made available on arXiv in June 2016 (arXiv:1606.07792).^[1]

The framework jointly trains two components: a wide (linear) component and a deep (neural network) component. Their outputs are combined before being fed to a common loss function for joint training.^[1]

Wide component

The wide component is a generalized linear model of the form:

y = w^T * x + b

where y is the prediction, w is a vector of model parameters, x is the feature vector, and b is the bias term.

The key ingredient of the wide component is the cross-product transformation, which creates new features by taking the cross-product of binary input features. For example, if the input features are gender=female and language=en, the cross-product transformation produces a new feature AND(gender=female, language=en) that is 1 only when both conditions are true.^[1] These cross-product features allow the wide component to memorize specific feature co-occurrences and their correlation with the target label.
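As a minimal sketch of the transformation described above, the following treats an example as the set of binary features that are active and returns 1 only when every feature in the cross is present. The function name and the feature strings are illustrative, not taken from any library.

```python
def cross_product(active: set[str], crossed: tuple[str, ...]) -> int:
    """Return 1 only if every binary feature named in `crossed` is active."""
    return int(all(name in active for name in crossed))


# Binary (one-hot) features that are "on" for one hypothetical example.
example = {"gender=female", "language=en"}

both = cross_product(example, ("gender=female", "language=en"))
mixed = cross_product(example, ("gender=female", "language=es"))
print(both, mixed)  # 1 0
```

Each such cross becomes one extra column in the sparse feature vector x, with its own weight in w.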

The wide component captures interactions that appear frequently in the training data. For instance, in an app recommendation setting, the model can learn that AND(user_installed_app=netflix, impression_app=hulu) has a high correlation with app installation, effectively memorizing the relationship between these two apps.^[1]

A limitation of cross-product transformations is that they cannot generalize to feature pairs that have not appeared in the training data.^[1] If a new streaming app launches and no training examples exist with that app, the wide component has no information about it.

Deep component

The deep component is a feed-forward neural network that converts high-dimensional, sparse categorical features into low-dimensional, dense embedding vectors. Each categorical feature is mapped to an embedding vector (typically 32 dimensions in the original Google implementation), and all embeddings are concatenated together with continuous dense features to form a single dense input vector.^[1]

This concatenated vector is then passed through multiple hidden layers with ReLU activation functions. In the Google Play implementation described in the original paper, the deep component used three hidden layers with 1024, 512, and 256 units respectively, fed by a concatenated embedding vector of approximately 1200 dimensions.^[1]

Because the deep component learns embeddings, it can generalize to feature combinations never seen during training. If two apps have similar embedding vectors (because they share attributes like category, developer, or user demographics), the deep component can predict that a user who likes one might also like the other, even without direct co-occurrence evidence.
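The forward pass of the deep component can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Google implementation: the vocabularies, the tiny layer sizes, and the random initialization are all hypothetical, and the dimensions are shrunk from the 32-dimensional embeddings and 1024/512/256 hidden units reported in the paper.

```python
import random

random.seed(0)

EMBED_DIM = 4          # the paper used 32; shrunk here for readability
HIDDEN_UNITS = [8, 4]  # the paper used 1024, 512, 256


def make_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def relu_layer(x, weights, bias):
    """Dense layer followed by ReLU: max(0, W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


# One embedding table per categorical feature (hypothetical vocabularies).
vocab = {"app_category": ["games", "video"], "language": ["en", "es"]}
tables = {name: {tok: [random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
                 for tok in tokens}
          for name, tokens in vocab.items()}


def deep_forward(categorical, continuous, layers):
    """Look up embeddings, concatenate with dense features, apply ReLU layers."""
    x = [v for name, tok in categorical.items() for v in tables[name][tok]]
    x = x + list(continuous)
    for weights, bias in layers:
        x = relu_layer(x, weights, bias)
    return x


input_dim = EMBED_DIM * len(vocab) + 1  # two embeddings plus one continuous feature
sizes = [input_dim] + HIDDEN_UNITS
layers = [(make_matrix(n_out, n_in), [0.0] * n_out)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

hidden = deep_forward({"app_category": "video", "language": "en"}, [0.3], layers)
print(len(hidden))  # 4
```

In a trained model the embedding tables and layer weights are learned jointly, so categories that behave alike end up with nearby embedding vectors.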

Joint training

The wide and deep components are combined by taking a weighted sum of their output log-odds as the final prediction, which is then fed into a common logistic loss function. During training, prediction errors are backpropagated to both sides of the model simultaneously using mini-batch stochastic optimization.^[1]
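The combination step can be illustrated with a short sketch. It assumes each component has already produced a scalar log-odds value (with its output weights folded in); the function names and the numeric inputs are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def joint_prediction(wide_logit: float, deep_logit: float, bias: float = 0.0) -> float:
    """Add the two components' log-odds, then squash to a probability."""
    return sigmoid(wide_logit + deep_logit + bias)


def logistic_loss(p: float, label: int) -> float:
    """Binary cross-entropy for a single example."""
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical log-odds produced by each component for one positive example.
p = joint_prediction(wide_logit=0.8, deep_logit=-0.3)
loss = logistic_loss(p, label=1)

# d(loss)/d(logit) = p - label. Because the logit is a sum, this same error
# signal flows back into both the wide and the deep parameters.
shared_grad = p - 1
print(round(p, 4), round(loss, 4), round(shared_grad, 4))
```

Because both components receive the same error signal from a single loss, each one only has to correct what the other gets wrong.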

This joint training approach differs from an ensemble of separate models. In an ensemble, individual models are trained independently and their predictions are combined only at inference time. In joint training, both components are optimized together. The wide component only needs to complement the weaknesses of the deep component (and vice versa), so it requires fewer cross-product feature transformations than a standalone wide model would.^[1]

The optimizers used for the two components also differ. In the original implementation, the wide component was trained using Follow-the-Regularized-Leader (FTRL) with L1 regularization, while the deep component was trained using AdaGrad.^[1]

Component	Model type	Input features	Role	Optimizer
Wide	Generalized linear model	Cross-product transformations of sparse categorical features	Memorization of specific feature co-occurrences	FTRL with L1 regularization
Deep	Feed-forward neural network	Dense embeddings of categorical features concatenated with continuous features	Generalization to unseen feature combinations	AdaGrad
Combined	Joint model	Both sparse cross-product features and dense embeddings	Balanced memorization and generalization	Both (simultaneously)
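To show why FTRL with L1 regularization suits a sparse wide component, here is a minimal sketch of per-coordinate FTRL-Proximal for logistic regression, following the commonly published formulation (McMahan et al., 2013) rather than the original Wide & Deep code. The class name, hyperparameter values, and feature name are assumptions made for illustration.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated (adjusted) gradients
        self.n = defaultdict(float)  # accumulated squared gradients

    def weight(self, key):
        """Closed-form weight; exactly zero while |z| stays within the L1 threshold."""
        z = self.z[key]
        if abs(z) <= self.l1:
            return 0.0
        scale = (self.beta + math.sqrt(self.n[key])) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / scale

    def predict(self, features):
        logit = sum(self.weight(k) * v for k, v in features.items())
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, features, label):
        p = self.predict(features)
        for k, v in features.items():
            g = (p - label) * v  # gradient of the logistic loss for this coordinate
            sigma = (math.sqrt(self.n[k] + g * g) - math.sqrt(self.n[k])) / self.alpha
            self.z[k] += g - sigma * self.weight(k)
            self.n[k] += g * g


key = "AND(installed=netflix, impression=hulu)"  # hypothetical crossed feature
model = FTRLProximal()
model.update({key: 1.0}, label=1)
after_one = model.weight(key)    # still exactly 0.0: too little evidence so far
for _ in range(2):
    model.update({key: 1.0}, label=1)
after_three = model.weight(key)  # now positive
print(after_one, after_three > 0)
```

A crossed feature keeps a weight of exactly zero until enough gradient has accumulated to pass the L1 threshold, which keeps the wide component sparse.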

Google Play deployment

The Wide & Deep model was deployed in production at Google Play, a commercial mobile app store with over one billion active users and over one million apps.^[1] The system powered the app recommendation feature on the main landing page of the store.

The recommendation pipeline consisted of two stages:

Retrieval: A combination of machine-learned models and human-defined rules generated a short list of candidate apps from the full catalog based on the user's context.
Ranking: The Wide & Deep model scored each candidate app and ranked them by their predicted probability of installation.
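The two stages above can be sketched as a small function. The catalog, the retrieval rule, and the scoring function are all stand-ins invented for illustration; in the deployed system the score would come from the Wide & Deep model.

```python
from typing import Callable


def recommend(catalog: list[dict], user: dict,
              retrieve: Callable[[dict, dict], bool],
              score: Callable[[dict, dict], float],
              top_k: int = 2) -> list[str]:
    """Two-stage recommendation: cheap retrieval, then model-based ranking."""
    candidates = [app for app in catalog if retrieve(user, app)]
    ranked = sorted(candidates, key=lambda app: score(user, app), reverse=True)
    return [app["name"] for app in ranked[:top_k]]


catalog = [
    {"name": "video_a", "category": "video", "p_install": 0.12},
    {"name": "video_b", "category": "video", "p_install": 0.30},
    {"name": "game_a", "category": "games", "p_install": 0.50},
]
user = {"interests": {"video"}}

top = recommend(
    catalog, user,
    retrieve=lambda u, app: app["category"] in u["interests"],  # stand-in rule
    score=lambda u, app: app["p_install"],  # stand-in for the ranking model
)
print(top)  # ['video_b', 'video_a']
```

Retrieval keeps the expensive ranking model from having to score the full catalog on every request.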

Experimental results

The model was evaluated both offline (on held-out data) and online (through live A/B testing over a three-week period).^[1]

Metric	Wide-only model	Deep-only model	Wide & Deep model
Offline AUC	0.726	0.722	0.728
Online acquisition gain vs. control	Control	+2.9%	+3.9%
Serving latency	Not reported	Not reported	14 ms (reduced from 31 ms)

The online A/B test showed that the Wide & Deep model improved the app acquisition rate by 3.9% relative to the wide-only control group, while the deep-only model improved it by 2.9%, leaving Wide & Deep about 1% ahead of the deep-only model.^[1]^[12] Client-side serving latency was also reduced from 31 ms to 14 ms by using multithreading and splitting each batch of candidates into smaller batches.^[1]

Memorization vs. generalization

The distinction between memorization and generalization is central to understanding wide models and their role in the Wide & Deep framework.^[1]

Memorization

Memorization refers to a model's ability to learn and exploit specific feature co-occurrences in the training data. A purely memorization-based model (like a wide linear model with cross-product features) can make highly precise predictions for feature combinations it has seen before, but it cannot handle novel combinations.^[1]

In recommendation systems, memorization means the model can recall that a specific user who installed app A also installed app B. This is valuable for "exploitation" of known user preferences but does not help with discovering new interests.

Generalization

Generalization refers to a model's ability to make accurate predictions on data it has not seen during training. A deep learning model generalizes by learning low-dimensional feature representations (embeddings) that capture semantic similarities between items. Items with similar attributes end up close together in embedding space, allowing the model to infer relationships between items even without direct co-occurrence data.^[1]

In recommendation systems, generalization enables "exploration" of new content. If a user likes cooking apps, the model can infer they might also enjoy meal planning apps, even if no users in the training data have installed both.

The tradeoff in practice

Aspect	Memorization (wide)	Generalization (deep)
Mechanism	Cross-product feature transformations	Learned dense embeddings
Strength	Precise recall of seen patterns	Transfer to unseen combinations
Weakness	Cannot handle novel feature pairs	May over-generalize with sparse data
Analogy in recommendation	"Users who bought X also bought Y"	"Users who like category A may like category B"
Feature type	Sparse, high-dimensional	Dense, low-dimensional
Engineering effort	Requires manual feature engineering	Learns features automatically

Wide models beyond Wide & Deep

The term "wide model" is also used in a broader sense in neural network research to describe networks that have many neurons per hidden layer, as opposed to deep networks that have many layers.

Wide residual networks

In 2016, Sergey Zagoruyko and Nikos Komodakis introduced Wide Residual Networks (WRN), which challenged the prevailing trend of making networks deeper.^[2] They demonstrated that a 16-layer wide residual network could match or exceed the accuracy of a 1000-layer thin residual network with a comparable number of parameters, while being several times faster to train.^[2] Their experiments on CIFAR-10, CIFAR-100, SVHN, COCO, and ImageNet showed that increasing width is often a more computationally efficient strategy than increasing depth.^[2]

Wide residual networks addressed two key problems with extremely deep networks:^[2]

Diminishing feature reuse: because gradients can flow through the identity shortcut connections, nothing forces them through every residual block's weights, so in very deep networks many blocks may learn little that is useful.
Training efficiency: each additional layer adds both computational cost and training time, with diminishing returns in accuracy.

Neural tangent kernel and infinite-width networks

Theoretical work on infinitely wide neural networks has revealed connections between wide networks and kernel methods. The Neural Tangent Kernel (NTK) framework, developed by Jacot, Gabriel, and Hongler in 2018, showed that in the limit of infinite width, a neural network's training dynamics become equivalent to kernel regression with a specific kernel (the NTK).^[9] In this regime, known as the "lazy training" regime, the network's parameters barely move from their initialization, and the network effectively fits the data by reweighting a fixed set of basis functions rather than learning new representations.^[9]
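For reference, the quantities involved can be written compactly. The notation is an assumption for illustration: f(x; θ) is the network output, θ_0 the parameters at initialization, and training is taken to be gradient flow on a squared loss over examples (x_i, y_i).

```latex
% Neural tangent kernel: inner product of parameter gradients at two inputs
\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \, \nabla_\theta f(x'; \theta)

% Lazy regime: the network stays close to its linearization around \theta_0
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

% Output dynamics under gradient flow on L = \tfrac{1}{2} \sum_i (f_t(x_i) - y_i)^2
\partial_t f_t(x) = -\sum_i \Theta(x, x_i) \, \bigl( f_t(x_i) - y_i \bigr)
```

In the infinite-width limit the kernel stays fixed during training, which is what reduces the dynamics to kernel regression.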

This theoretical result has practical implications: very wide networks may behave more like kernel methods and less like feature-learning neural networks. Finite-width networks used in practice do learn representations and exhibit feature learning, which is considered one of the most important properties of deep learning. The NTK framework helps explain why simply making a network wider does not always improve performance and why depth plays a complementary role in learning hierarchical features.

Width vs. depth in neural network design

Research comparing wide and deep architectures has revealed several key differences:^[11]

Property	Wider networks	Deeper networks
Feature representation	Capture diverse features at each layer simultaneously	Build hierarchical, increasingly abstract features across layers
Optimization	Smoother loss landscapes, easier to train	More complex optimization, potential for vanishing/exploding gradients
Computational cost	Scales quadratically with width per layer	Scales linearly with depth (but more sequential)
Theoretical behavior	Approach kernel methods at infinite width	Can learn richer representations beyond kernel regime
Robustness	More robust to adversarial perturbations due to redundancy	More susceptible to adversarial examples in some cases
Universal approximation	Single hidden layer with sufficient width can approximate any continuous function on a compact set (Cybenko, 1989)^[7]	ReLU networks with width n+4 (n = input dimension) and arbitrary depth are also universal approximators (Lu et al., 2017)^[10]
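The computational-cost row can be checked with a quick parameter count for fully connected networks. The layer sizes below are arbitrary and chosen only to compare doubling the hidden width against doubling the number of hidden layers.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


base = mlp_param_count([64, 128, 128, 1])
wider = mlp_param_count([64, 256, 256, 1])              # hidden width doubled
deeper = mlp_param_count([64, 128, 128, 128, 128, 1])   # hidden layers doubled

print(base, wider, deeper)  # 24961 82689 57985
```

Doubling the width quadruples each hidden-to-hidden weight matrix (128 × 128 becomes 256 × 256), while each extra layer of the same width adds a fixed number of parameters.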

The universal approximation theorem, first proven by George Cybenko in 1989 for sigmoid activation functions and later extended to other activations by Kurt Hornik in 1991, states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets to arbitrary accuracy, given sufficient width.^[7]^[8] While this theorem guarantees that wide, shallow networks have the representational capacity to approximate any continuous function, it does not provide bounds on how many neurons are needed. In practice, deep networks can represent some classes of functions much more efficiently (in some cases with exponentially fewer parameters) than shallow, wide networks.^[10]
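Stated as a formula, the single-hidden-layer result says that for any continuous f on a compact set K and any tolerance ε > 0 there exist a width N and parameters such that the following holds. The symbol names (α_j, w_j, b_j) are chosen here for illustration.

```latex
G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left( w_j^\top x + b_j \right),
\qquad
\sup_{x \in K} \, \lvert G(x) - f(x) \rvert < \varepsilon
```

The theorem asserts that such an N exists but says nothing about how large it must be.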

Successors and related architectures

The Wide & Deep framework inspired a family of models that aim to capture feature interactions more effectively. These successors generally improve on the original Wide & Deep architecture by replacing the manually engineered cross-product features in the wide component with learned feature interactions.

DeepFM (2017)

DeepFM, proposed by Guo et al. at IJCAI 2017, replaces the wide component's linear model with a Factorization Machine (FM). The FM component models second-order (pairwise) feature interactions using learned latent vectors, while the deep component models higher-order interactions through a feed-forward network. Both components share the same input embeddings, eliminating the need for manual feature engineering. The final prediction is obtained by applying a sigmoid to the sum of the FM and deep outputs.^[3]

Key advantages of DeepFM over Wide & Deep:

No need for handcrafted cross-product features
The FM component automatically learns pairwise feature interactions
Shared embeddings reduce the total number of parameters
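The pairwise term that the FM component learns can be sketched as follows. The code compares the direct sum over feature pairs with the standard linear-time rewriting of the same quantity; the input vector and latent vectors are made-up values.

```python
from itertools import combinations


def fm_pairwise_naive(x, v):
    """Sum over all feature pairs of <v_i, v_j> * x_i * x_j."""
    return sum(sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
               for i, j in combinations(range(len(x)), 2))


def fm_pairwise_fast(x, v):
    """Same quantity in O(n * k): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    total = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total


x = [1.0, 0.0, 1.0, 1.0]                                  # sparse input
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]    # one latent vector per feature

naive = fm_pairwise_naive(x, v)
fast = fm_pairwise_fast(x, v)
print(round(naive, 6), round(fast, 6))  # 0.07 0.07
```

Because the interaction weight for a pair is the dot product of two latent vectors, a pair that never co-occurred in training still receives a non-trivial weight.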

Deep & Cross Network (DCN, 2017)

The Deep & Cross Network, introduced by Wang et al. in 2017, replaces the wide component with a "cross network" that explicitly computes feature crossings at each layer. Each layer of the cross network takes the original input features and the output of the previous cross layer and combines them through a specific crossing operation. This allows the model to learn bounded-degree feature interactions without manual feature engineering.^[4]
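A single cross layer can be sketched in a few lines, using the update x_{l+1} = x_0 (x_l · w_l) + b_l + x_l from the DCN paper. The input values and weights below are arbitrary.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    scalar = sum(xi * wi for xi, wi in zip(xl, w))  # xl . w is a single number
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


x0 = [1.0, 2.0]   # original input features
w = [0.5, -1.0]   # cross weight vector (one per layer in practice; reused here)
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, w, b)  # first layer crosses the input with itself
x2 = cross_layer(x0, x1, w, b)  # second layer crosses the input with x1
print(x1, x2)  # [-0.5, -1.0] [0.25, 0.5]
```

Each layer multiplies by the original input once more, so stacking layers raises the highest interaction degree by one per layer, and the added x_l term acts as a residual connection.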

DCN-V2 (2021)

DCN-V2, an improved version proposed by Wang et al. in 2021, replaced the cross weight vector in the original DCN with a full weight matrix. This change made the cross network more expressive, allowing it to model both "bit-wise" interactions (between individual elements of the feature vectors) and feature-wise interactions. The authors reported strong results on several benchmark datasets and deployment at scale in Google's production ranking systems.^[5]
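The difference between the two cross layers is easiest to see side by side. Here d is the input dimension, x_0 the original input, and ⊙ element-wise multiplication; the layout follows the formulas given in the two papers.

```latex
% DCN: a weight vector reduces x_l to one scalar before it multiplies x_0
x_{l+1} = x_0 \, x_l^\top w_l + b_l + x_l, \qquad w_l, b_l \in \mathbb{R}^{d}

% DCN-V2: a full matrix gives every coordinate of x_0 its own mixture of x_l
x_{l+1} = x_0 \odot \left( W_l \, x_l + b_l \right) + x_l, \qquad W_l \in \mathbb{R}^{d \times d},\; b_l \in \mathbb{R}^{d}
```

The matrix form costs d² parameters per layer instead of d, which is the price of the added expressiveness.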

Comparison of feature interaction models

Model	Year	Wide/interaction component	Feature engineering required	Feature interaction order	Key innovation
Wide & Deep	2016	Linear model with cross-product features	Yes (manual)	Manually specified	Joint training of wide and deep components^[1]
DeepFM	2017	Factorization Machine	No	2nd order (automatic)	Shared embeddings between FM and deep components^[3]
DCN	2017	Cross network with weight vector	No	Bounded degree (automatic)	Cross network for explicit feature crossing^[4]
xDeepFM	2018	Compressed Interaction Network (CIN)	No	Bounded order (automatic)	Vector-level feature interactions via outer products^[6]
DCN-V2	2021	Cross network with weight matrix	No	Bounded degree (automatic)	Bit-wise interactions via full weight matrix^[5]

Applications

Wide models and the Wide & Deep framework have been applied across a range of domains:

Recommendation systems

The original application of Wide & Deep was in recommendation systems, specifically for Google Play app recommendations.^[1] The framework has since been widely adopted by companies building large-scale recommendation engines for content, products, and advertisements. Both memorization of user-item interaction history and generalization to new items are important for producing relevant recommendations.

Click-through rate prediction

Click-through rate (CTR) prediction is a core task in online advertising, where the goal is to estimate the probability that a user will click on a given advertisement. Wide & Deep models and their successors (DeepFM, DCN, DCN-V2) have become standard architectures for CTR prediction, as they handle the sparse, high-dimensional feature spaces typical of ad targeting while also capturing complex user-ad interactions through learned representations.

Search ranking

Wide & Deep models have been applied to search ranking problems, where the system must score and rank search results based on relevance to a user query. The wide component can memorize specific query-document matches, while the deep component generalizes to related queries and documents that share semantic similarity.

Natural language processing

In natural language processing, wide models have been used for text classification and sentiment analysis. Sparse feature representations (such as bag-of-words or n-gram features) fed into a wide component can capture specific word patterns associated with particular classes, while the deep component can learn distributed representations of text for better generalization.

Computer vision

In computer vision, the concept of wider networks has been explored through Wide Residual Networks and other architectures that increase the number of channels (feature maps) per layer. These wider architectures have shown competitive or superior performance compared to extremely deep but narrow networks, particularly when training computational budgets are constrained.^[2]

Implementation

The Wide & Deep model was open-sourced as part of TensorFlow. The tf.estimator.DNNLinearCombinedClassifier and tf.estimator.DNNLinearCombinedRegressor classes provided built-in implementations of the framework.^[12] In TensorFlow 2.x and Keras, practitioners can build custom Wide & Deep architectures using the Functional API by defining separate input branches for the wide and deep components and merging them before the output layer.

A PyTorch implementation is available through the pytorch-widedeep library, which provides a flexible framework for combining tabular data with text and image inputs using Wide & Deep architectures.

Example architecture specification

A typical Wide & Deep architecture for a recommendation task might include:

Parameter	Value
Wide features	Cross-product transformations of user and item categorical features
Deep embedding dimension	32 per categorical feature
Deep hidden layers	3 layers (1024, 512, 256 units)
Activation function	ReLU
Output	Logistic (sigmoid) for binary classification
Wide optimizer	FTRL with L1 regularization
Deep optimizer	AdaGrad or Adam
Training	Mini-batch stochastic gradient descent

Advantages and limitations

Advantages

Combined memorization and generalization: The architecture handles both the precise recall of known patterns and the ability to generalize to new patterns in a single model.^[1]
Proven at scale: The framework was deployed in production at Google Play, an app store with over one billion active users, and has since been adopted in other large-scale recommendation systems.^[1]
Flexible architecture: The wide and deep components can be independently configured to suit different data characteristics and task requirements.
Efficient serving: Joint training allows the wide component to remain small (needing only a few cross-product features) since it only has to complement the deep component.^[1]

Limitations

Feature engineering for the wide component: The original Wide & Deep model requires manual engineering of cross-product features for the wide component. This requires domain expertise and can be time-consuming. (Successor models like DeepFM and DCN address this limitation.)^[3]^[4]
No shared representation: The wide component operates on hand-built sparse cross-product features while the deep component operates on learned embeddings, so neither component reuses the other's inputs. (DeepFM addresses this by sharing embeddings between its two components.)^[3]
Complexity of joint training: Training two different model architectures jointly requires careful tuning of separate learning rates, regularization strategies, and hyperparameters for each component.
Cold-start problem: While the deep component can generalize to new items through embeddings, the wide component provides no value for completely new items that have no cross-product feature history.

References

1. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., & Shah, H. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016)*, pp. 7-10. arXiv:1606.07792.
2. Zagoruyko, S. & Komodakis, N. (2016). "Wide Residual Networks." *Proceedings of the British Machine Vision Conference (BMVC 2016)*. arXiv:1605.07146.
3. Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." *Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017)*, pp. 1725-1731. arXiv:1703.04247.
4. Wang, R., Fu, B., Fu, G., & Wang, M. (2017). "Deep & Cross Network for Ad Click Predictions." *Proceedings of the ADKDD'17*. arXiv:1708.05123.
5. Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., & Chi, E. (2021). "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." *Proceedings of the Web Conference 2021*. arXiv:2008.13535.
6. Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. arXiv:1803.05170.
7. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314.
8. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
9. Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS 2018)*. arXiv:1806.07572.
10. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width." *Advances in Neural Information Processing Systems (NeurIPS 2017)*. arXiv:1709.02540.
11. Nguyen, T., Raghu, M., & Kornblith, S. (2021). "Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth." *Proceedings of the 9th International Conference on Learning Representations (ICLR 2021)*. arXiv:2010.15327.
12. Google Research Blog. (2016). "Wide & Deep Learning: Better Together with TensorFlow." https://research.google/blog/wide-amp-deep-learning-better-together-with-tensorflow/
