# Wide Model

> Source: https://aiwiki.ai/wiki/wide_model
> Updated: 2026-07-16
> Categories: Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning](/wiki/machine_learning), [Deep learning](/wiki/deep_learning), [Recommendation system](/wiki/recommender_system)*

A **wide model** is a type of [machine learning](/wiki/machine_learning) model that uses a large number of input features, often with sparse, high-dimensional representations such as [one-hot encoding](/wiki/one-hot_encoding) and cross-product feature transformations, to memorize specific patterns in training data. Wide models are most commonly associated with the **Wide & Deep Learning** framework introduced by Google in 2016, which combines a wide linear component with a [deep neural network](/wiki/deep_neural_network) to balance memorization and [generalization](/wiki/generalization).[1][12] The term "wide model" can also refer more broadly to neural networks with a large number of neurons per layer relative to their depth.

## Explain like I'm 5 (ELI5)

Imagine you are studying for a test. One way to prepare is to memorize every single flashcard your teacher gave you. You will do great on any question that matches a flashcard exactly, but if the test has a question you never saw before, you might be stuck. That is what a wide model does: it memorizes specific combinations of things it has seen.

Another way to study is to understand the general ideas behind the flashcards. You might not remember every detail, but you can figure out answers to new questions by thinking about the big picture. That is what a deep model does.

A Wide & Deep model uses both strategies at the same time. It memorizes the specific flashcards AND understands the big ideas, so it can handle both familiar questions and new ones. This is why Google uses it to recommend apps: it remembers that people who searched for "fried chicken" liked a specific restaurant app, and it also understands that people who like food apps might enjoy cooking apps too.

## Background and motivation

In many real-world applications, particularly in [recommendation systems](/wiki/recommender_system) and advertising, models need to perform two distinct tasks simultaneously:

1. **Memorization**: learning the direct associations between features that co-occur frequently in the training data. For example, remembering that users who installed a specific app also tend to install a related app.[1]
2. **Generalization**: being able to make predictions about feature combinations that have rarely or never appeared in the training data, by learning transferable representations.[1]

Traditional [linear regression](/wiki/linear_regression) and [logistic regression](/wiki/logistic_regression) models excel at memorization when supplied with cross-product feature transformations. These models are interpretable, fast to train, and effective with [sparse representations](/wiki/sparse_representation). However, they require extensive [feature engineering](/wiki/feature_engineering) and cannot generalize well to unseen feature combinations.[1]

[Deep neural networks](/wiki/deep_neural_network), on the other hand, can generalize to previously unseen feature combinations through learned [embeddings](/wiki/embeddings). But when the interaction matrix between users and items is sparse and high-rank (as is typical in recommendation settings), deep networks can over-generalize and produce less relevant predictions.[1]

This tension between memorization and generalization motivated the development of the Wide & Deep Learning framework.

## The Wide & Deep Learning framework

The Wide & Deep Learning framework was introduced in the paper "Wide & Deep Learning for Recommender Systems" by Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah, all from Google. The paper was published at the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016) and made available on arXiv in June 2016 (arXiv:1606.07792).[1]

The framework jointly trains two components: a wide (linear) component and a deep (neural network) component. Their outputs are combined before being fed to a common [loss function](/wiki/loss_function) for joint training.[1]

### Wide component

The wide component is a generalized linear model of the form:

```
y = w^T * x + b
```

where `y` is the prediction, `w` is a vector of model parameters, `x` is the feature vector, and `b` is the bias term.

The key ingredient of the wide component is the **cross-product transformation**, which creates new features by taking the cross-product of binary input features. For example, if the input features are `gender=female` and `language=en`, the cross-product transformation produces a new feature `AND(gender=female, language=en)` that is 1 only when both conditions are true.[1] These cross-product features allow the wide component to memorize specific feature co-occurrences and their correlation with the target label.

The wide component captures interactions that appear frequently in the training data. For instance, in an app recommendation setting, the model can learn that `AND(user_installed_app=netflix, impression_app=hulu)` has a high correlation with app installation, effectively memorizing the relationship between these two apps.[1]

A limitation of cross-product transformations is that they cannot generalize to feature pairs that have not appeared in the training data.[1] If a new streaming app launches and no training examples exist with that app, the wide component has no information about it.
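The cross-product behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the feature strings, weights, and bias are hypothetical values chosen for the example, and `newstream` stands in for an app that never appeared in training.

```python
def cross_product(active_features, template):
    """Return 1 only if every binary feature named in the template is active."""
    return int(all(name in active_features for name in template))

# Hypothetical learned weights, one per cross-product template.
wide_weights = {
    ("user_installed_app=netflix", "impression_app=hulu"): 1.3,
    ("gender=female", "language=en"): 0.2,
}
bias = -2.0  # hypothetical bias term b

def wide_logit(active_features):
    """Compute y = w^T x + b over the cross-product features."""
    return bias + sum(
        weight * cross_product(active_features, template)
        for template, weight in wide_weights.items()
    )

seen = {"user_installed_app=netflix", "impression_app=hulu", "language=en"}
unseen = {"user_installed_app=netflix", "impression_app=newstream", "language=en"}

seen_logit = wide_logit(seen)      # the memorized pair adds its weight
unseen_logit = wide_logit(unseen)  # no template fires, only the bias remains
```

The memorized pair raises the logit above the bias, while the pair involving the new app contributes nothing, so the wide component falls back to the bias alone.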

### Deep component

The deep component is a feed-forward [neural network](/wiki/neural_network) that converts high-dimensional, sparse categorical features into low-dimensional, dense embedding vectors. Each categorical feature is mapped to an embedding vector (typically 32 dimensions in the original Google implementation), and all embeddings are concatenated together with continuous dense features to form a single dense input vector.[1]

This concatenated vector is then passed through multiple hidden layers with [ReLU](/wiki/relu) [activation functions](/wiki/activation_function). In the Google Play implementation described in the original paper, the deep component used three hidden layers with 1024, 512, and 256 units respectively, fed by a concatenated embedding vector of approximately 1200 dimensions.[1]

Because the deep component learns embeddings, it can generalize to feature combinations never seen during training. If two apps have similar embedding vectors (because they share attributes like category, developer, or user demographics), the deep component can predict that a user who likes one might also like the other, even without direct co-occurrence evidence.
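The data flow of the deep component (embedding lookup, concatenation with continuous features, then ReLU layers) can be sketched as follows. This is a toy forward pass under stated assumptions: the vocabularies, the two continuous features, and the random weights are hypothetical, and the sizes are shrunk from 32-dimensional embeddings and 1024/512/256 units to keep the example readable.

```python
import random

random.seed(0)
EMBED_DIM = 4  # the paper uses 32; kept tiny here for readability

def make_matrix(rows, cols):
    """Random matrix standing in for learned parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical vocabularies, with one embedding table per categorical feature.
vocab = {"app_category": ["video", "music", "games"], "language": ["en", "de"]}
tables = {name: make_matrix(len(values), EMBED_DIM) for name, values in vocab.items()}

def embed(name, value):
    """Look up the dense embedding row for one categorical value."""
    return tables[name][vocab[name].index(value)]

def dense_relu(x, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

def deep_forward(categorical, continuous, layers):
    # Concatenate all embeddings with the continuous features.
    x = [v for name, value in categorical.items() for v in embed(name, value)]
    x = x + list(continuous)
    for weights, biases in layers:
        x = dense_relu(x, weights, biases)
    return x

input_dim = EMBED_DIM * len(vocab) + 2  # two continuous features
hidden_sizes = [8, 4, 2]                # stand-ins for 1024, 512, 256
layers, prev = [], input_dim
for size in hidden_sizes:
    layers.append((make_matrix(size, prev), [0.0] * size))
    prev = size

activations = deep_forward(
    {"app_category": "music", "language": "en"}, [0.5, 1.0], layers
)
```

In a full model, the final activations would be mapped to a single logit by one more linear layer before being combined with the wide component.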

### Joint training

The wide and deep components are combined by summing their output log-odds as the final prediction, which is then fed into a common logistic [loss function](/wiki/loss_function). During training, prediction errors are [backpropagated](/wiki/backpropagation) to both sides of the model simultaneously using mini-batch stochastic optimization.[1]
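A minimal sketch of the combination step, assuming each component has already produced a scalar logit (the input values below are hypothetical). It shows why both components receive the same error signal: the derivative of the logistic loss with respect to the summed logit is shared.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def joint_step(wide_logit, deep_logit, label):
    """Sum the two log-odds, apply the logistic loss, and return the shared gradient."""
    logit = wide_logit + deep_logit
    prob = sigmoid(logit)
    loss = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    # d(loss)/d(logit) = prob - label. Because the logit is a plain sum,
    # this same value is backpropagated into both components.
    grad_logit = prob - label
    return prob, loss, grad_logit

prob, loss, grad_logit = joint_step(wide_logit=-0.7, deep_logit=0.7, label=1)
```

With these inputs the two logits cancel, giving a probability of 0.5, a loss of ln 2, and a gradient of -0.5 that pushes both components toward a higher logit.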

This joint training approach differs from an [ensemble](/wiki/ensemble_learning) of separate models. In an ensemble, individual models are trained independently and their predictions are combined only at inference time. In joint training, both components are optimized together. The wide component only needs to complement the weaknesses of the deep component (and vice versa), so it requires fewer cross-product feature transformations than a standalone wide model would.[1]

The optimizers used for the two components also differ. In the original implementation, the wide component was trained using Follow-the-Regularized-Leader (FTRL) with [L1 regularization](/wiki/l1_regularization), while the deep component was trained using [AdaGrad](/wiki/adagrad).[1]

| Component | Model type | Input features | Role | Optimizer |
|---|---|---|---|---|
| Wide | Generalized linear model | Cross-product transformations of sparse categorical features | Memorization of specific feature co-occurrences | FTRL with L1 regularization |
| Deep | Feed-forward neural network | Dense embeddings of categorical features concatenated with continuous features | Generalization to unseen feature combinations | AdaGrad |
| Combined | Joint model | Both sparse cross-product features and dense embeddings | Balanced memorization and generalization | Both (simultaneously) |
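To make the difference between the two optimizers concrete, the sketch below implements a per-coordinate FTRL-Proximal update with L1 regularization and a basic AdaGrad update. It follows the commonly published form of both algorithms and is not taken from the Wide & Deep paper; the hyperparameter values (`alpha`, `beta`, `l1`, `lr`) and the gradient sequences are assumptions chosen for illustration.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal with L1 (and optional L2) regularization."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps weakly supported weights at exactly zero
        sign = 1.0 if z > 0 else -1.0
        scale = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / scale

    def update(self, grads):
        for i, g in enumerate(grads):
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g

class AdaGrad:
    """AdaGrad: per-coordinate step size shrinks with accumulated squared gradients."""

    def __init__(self, dim, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.w = [0.0] * dim
        self.g2 = [0.0] * dim

    def update(self, grads):
        for i, g in enumerate(grads):
            self.g2[i] += g * g
            self.w[i] -= self.lr * g / (math.sqrt(self.g2[i]) + self.eps)

# Coordinate 0 sees a consistent gradient; coordinate 1 sees small alternating noise.
wide_opt = FTRLProximal(dim=2)
for step in range(5):
    noisy = 0.1 if step % 2 == 0 else -0.1
    wide_opt.update([-1.0, noisy])
wide_weights = [wide_opt.weight(i) for i in range(2)]

deep_opt = AdaGrad(dim=1)
deep_opt.update([-1.0])
```

In this toy run the consistently useful wide feature receives a positive weight while the noisy one stays at exactly zero, which is the sparsity effect that makes L1-regularized FTRL attractive for very large sparse feature spaces.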

## Google Play deployment

The Wide & Deep model was deployed in production at Google Play, a commercial mobile app store with over one billion active users and over one million apps.[1] The system powered the app recommendation feature on the main landing page of the store.

The recommendation pipeline consisted of two stages:

1. **Retrieval**: A combination of machine-learned models and human-defined rules generated a short list of candidate apps from the full catalog based on the user's context.
2. **Ranking**: The Wide & Deep model scored each candidate app and ranked them by their predicted probability of installation.
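The two stages above can be sketched as a pair of functions. Everything here is a hypothetical stand-in: the catalog entries, the language-matching rule used for retrieval, and the precomputed `score` field that plays the role of the model's predicted install probability.

```python
def retrieve(catalog, user_context, limit=3):
    """Stage 1 stand-in: a simple rule in place of the retrieval models and rules."""
    matches = [app for app in catalog if app["language"] == user_context["language"]]
    return matches[:limit]

def rank(candidates, predict_install_probability):
    """Stage 2: order candidates by predicted install probability, highest first."""
    return sorted(candidates, key=predict_install_probability, reverse=True)

# Hypothetical catalog; "score" stands in for the ranking model's output.
catalog = [
    {"name": "app_a", "language": "en", "score": 0.12},
    {"name": "app_b", "language": "de", "score": 0.90},
    {"name": "app_c", "language": "en", "score": 0.47},
    {"name": "app_d", "language": "en", "score": 0.05},
]

candidates = retrieve(catalog, {"language": "en"})
ranked = rank(candidates, lambda app: app["score"])
ranked_names = [app["name"] for app in ranked]
```

Splitting the work this way means the expensive ranking model only scores the short candidate list rather than the full catalog.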

### Experimental results

The model was evaluated both offline (on held-out data) and online (through live A/B testing over a three-week period).[1]

| Metric | Wide-only model | Deep-only model | Wide & Deep model |
|---|---|---|---|
| Offline AUC | 0.726 | 0.722 | 0.728 |
| Online acquisition gain vs. control | Control | +2.9% | +3.9% |
| Serving latency | Not reported | Not reported | 14 ms (reduced from 31 ms) |

The online A/B test showed that the Wide & Deep model improved the app acquisition rate by 3.9% relative to the wide-only control group. The deep-only model improved it by 2.9% relative to the same control, so Wide & Deep gained roughly 1% over the deep-only model.[1][12] Serving latency for scoring a batch of candidates was also reduced from 31 ms to 14 ms by splitting each batch into smaller batches scored in parallel with multithreading.[1]
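Reading both online gains in the table as relative to the same wide-only control, the gain of Wide & Deep over the deep-only model and the relative latency reduction can be worked out directly. This is plain arithmetic on the reported figures, not an additional result from the paper.

```python
wide_and_deep_gain = 0.039  # +3.9% vs. wide-only control
deep_only_gain = 0.029      # +2.9% vs. wide-only control

# Ratio of the two acquisition rates, each expressed relative to the control.
gain_over_deep = (1 + wide_and_deep_gain) / (1 + deep_only_gain) - 1

latency_before_ms = 31
latency_after_ms = 14
latency_reduction = 1 - latency_after_ms / latency_before_ms

print(f"Wide & Deep vs. deep-only: {gain_over_deep:.2%}")
print(f"Latency reduction: {latency_reduction:.1%}")
```

The first figure comes out at just under 1%, and the latency reduction at roughly 55%.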

## Memorization vs. generalization

The distinction between memorization and generalization is central to understanding wide models and their role in the Wide & Deep framework.[1]

### Memorization

Memorization refers to a model's ability to learn and exploit specific feature co-occurrences in the training data. A purely memorization-based model (like a wide linear model with cross-product features) can make highly precise predictions for feature combinations it has seen before, but it cannot handle novel combinations.[1]

In recommendation systems, memorization means the model can recall that a specific user who installed app A also installed app B. This is valuable for "exploitation" of known user preferences but does not help with discovering new interests.

### Generalization

Generalization refers to a model's ability to make accurate predictions on data it has not seen during training. A [deep learning](/wiki/deep_learning) model generalizes by learning low-dimensional feature representations (embeddings) that capture semantic similarities between items. Items with similar attributes end up close together in embedding space, allowing the model to infer relationships between items even without direct co-occurrence data.[1]

In recommendation systems, generalization enables "exploration" of new content. If a user likes cooking apps, the model can infer they might also enjoy meal planning apps, even if no users in the training data have installed both.
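The idea that similar items sit close together in embedding space can be illustrated with cosine similarity. The three-dimensional embeddings below are made-up values for illustration only; real embeddings are learned during training and are much larger.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical learned item embeddings.
embeddings = {
    "cooking_app": [0.9, 0.1, 0.0],
    "meal_planner_app": [0.8, 0.2, 0.1],
    "racing_game": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(embeddings["cooking_app"], embeddings["meal_planner_app"])
sim_unrelated = cosine_similarity(embeddings["cooking_app"], embeddings["racing_game"])
```

Here the cooking and meal planning apps score close to 1 while the racing game scores close to 0, so a model working in this space can relate the first two without ever observing them installed together.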

### The tradeoff in practice

| Aspect | Memorization (wide) | Generalization (deep) |
|---|---|---|
| Mechanism | Cross-product feature transformations | Learned dense [embeddings](/wiki/embeddings) |
| Strength | Precise recall of seen patterns | Transfer to unseen combinations |
| Weakness | Cannot handle novel feature pairs | May over-generalize with sparse data |
| Analogy in recommendation | "Users who bought X also bought Y" | "Users who like category A may like category B" |
| Feature type | Sparse, high-dimensional | Dense, low-dimensional |
| Engineering effort | Requires manual feature engineering | Learns features automatically |

## Wide models beyond Wide & Deep

The term "wide model" is also used in a broader sense in neural network research to describe networks that have many neurons per hidden layer, as opposed to deep networks that have many layers.

### Wide residual networks

In 2016, Sergey Zagoruyko and Nikos Komodakis introduced Wide Residual Networks (WRN), which challenged the prevailing trend of making networks deeper.[2] They demonstrated that a 16-layer wide residual network could match or exceed the accuracy of a 1000-layer thin [residual network](/wiki/convolutional_neural_network) with a comparable number of parameters, while being several times faster to train.[2] Their experiments on CIFAR-10, CIFAR-100, SVHN, COCO, and ImageNet showed that increasing width is often a more computationally efficient strategy than increasing depth.[2]

Wide residual networks addressed two key problems with extremely deep networks:[2]

- **Diminishing feature reuse**: as gradients flow through a very deep residual network, nothing forces them through the weights of each residual block, so many blocks may learn little and only a few end up contributing useful representations.
- **Training efficiency**: each additional layer adds both computational cost and training time, with diminishing returns in accuracy.
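The way width is traded against depth in a wide residual network can be sketched with the WRN-d-k naming convention, where d is the number of convolutional layers and k is the widening factor. The helper below is a hypothetical illustration that assumes the basic two-convolution residual block and three groups of blocks; it only computes the layout and does not build a network.

```python
def wrn_layout(depth, widen_factor):
    """Return (blocks per group, channel widths) for a WRN-depth-widen_factor network."""
    if (depth - 4) % 6 != 0:
        raise ValueError("depth must have the form 6N + 4 for basic blocks")
    blocks_per_group = (depth - 4) // 6
    # The first convolution keeps 16 channels; the three groups are widened by k.
    channels = [16, 16 * widen_factor, 32 * widen_factor, 64 * widen_factor]
    return blocks_per_group, channels

shallow_wide = wrn_layout(16, 8)
deeper_wide = wrn_layout(28, 10)
thin = wrn_layout(40, 1)
```

A widening factor of 1 gives the channel widths of a thin residual network, so the same helper shows how a 16-layer network can carry far more channels per layer than a much deeper thin one.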

### Neural tangent kernel and infinite-width networks

Theoretical work on infinitely wide neural networks has revealed connections between wide networks and kernel methods. The Neural Tangent Kernel (NTK) framework, developed by Jacot, Gabriel, and Hongler in 2018, showed that in the limit of infinite width, a neural network's training dynamics become equivalent to kernel regression with a specific kernel (the NTK).[9] In this regime, known as the "lazy training" regime, the network's parameters barely move from their initialization, and the network effectively fits the data by reweighting a fixed set of basis functions rather than learning new representations.[9]
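For reference, the kernel in question can be written down explicitly. The block below is a sketch for a scalar-output network f(x; θ) trained by gradient flow on a sum of per-example losses ℓ over n training points; the symbols Θ, f_t, and ℓ are notation introduced here for illustration.

```latex
% Neural tangent kernel: inner product of parameter gradients at two inputs
\Theta(x, x') = \nabla_\theta f(x;\theta)^{\top} \, \nabla_\theta f(x';\theta)

% Under gradient flow, the network output evolves as kernel gradient descent
\frac{\partial f_t(x)}{\partial t}
  = -\sum_{i=1}^{n} \Theta(x, x_i) \,
    \frac{\partial \ell\bigl(f_t(x_i), y_i\bigr)}{\partial f_t(x_i)}
```

In the infinite-width limit the kernel Θ stays fixed during training, which is what reduces the dynamics to kernel regression.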

This theoretical result has practical implications: very wide networks may behave more like kernel methods and less like feature-learning neural networks. Finite-width networks used in practice do learn representations and exhibit feature learning, which is considered one of the most important properties of [deep learning](/wiki/deep_learning). The NTK framework helps explain why simply making a network wider does not always improve performance and why depth plays a complementary role in learning hierarchical features.

### Width vs. depth in neural network design

Research comparing wide and deep architectures has revealed several key differences:[11]

| Property | Wider networks | Deeper networks |
|---|---|---|
| Feature representation | Capture diverse features at each layer simultaneously | Build hierarchical, increasingly abstract features across layers |
| Optimization | Smoother loss landscapes, easier to train | More complex optimization, potential for vanishing/exploding gradients |
| Computational cost | Scales quadratically with width per layer | Scales linearly with depth (but more sequential) |
| Theoretical behavior | Approach kernel methods at infinite width | Can learn richer representations beyond kernel regime |
| Robustness | More robust to adversarial perturbations due to redundancy | More susceptible to adversarial examples in some cases |
| Universal approximation | Single hidden layer with sufficient width can approximate any function (Cybenko, 1989)[7] | ReLU networks with width n+4 (for input dimension n) and arbitrary depth are also universal approximators (Lu et al., 2017)[10] |

The universal approximation theorem, first proven by George Cybenko in 1989 for sigmoid activation functions and later extended to other activations by Kurt Hornik in 1991, states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets, given sufficient width.[7][8] While this theorem guarantees that wide, shallow networks have the representational capacity to approximate any such function, it does not provide bounds on how many neurons are needed. For some function classes, deep networks can represent the same functions far more efficiently than shallow, wide networks, which may need exponentially more units to reach the same accuracy.[10]
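The cost row in the table above can be checked with a simple parameter count for fully connected networks. The layer sizes below are arbitrary illustrative choices; the sketch compares doubling the hidden width with doubling the number of hidden layers.

```python
def dense_params(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

base = dense_params([32, 64, 64, 1])            # two hidden layers of 64 units
wider = dense_params([32, 128, 128, 1])         # hidden width doubled
deeper = dense_params([32, 64, 64, 64, 64, 1])  # hidden depth doubled

# Hidden-to-hidden weight matrices grow with the square of the width.
hidden_block_base = 64 * 64 + 64
hidden_block_wide = 128 * 128 + 128
```

Doubling the width roughly quadruples each hidden-to-hidden block, while doubling the depth only adds more blocks of the original size.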

## Successors and related architectures

The Wide & Deep framework inspired a family of models that aim to capture feature interactions more effectively. These successors generally improve on the original Wide & Deep architecture by replacing the manually engineered cross-product features in the wide component with learned feature interactions.

### DeepFM (2017)

DeepFM, proposed by Guo et al. at IJCAI 2017, replaces the wide component's linear model with a Factorization Machine (FM). The FM component models second-order (pairwise) feature interactions using learned latent vectors, while the deep component models higher-order interactions through a feed-forward network. Both components share the same input embeddings, eliminating the need for manual feature engineering. The final prediction is the sum of the FM and deep outputs.[3]

Key advantages of DeepFM over Wide & Deep:

- No need for handcrafted cross-product features
- The FM component automatically learns pairwise feature interactions
- Shared embeddings reduce the total number of parameters
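The pairwise term of a Factorization Machine can be sketched as follows. The example computes the second-order interaction both directly, as a sum over feature pairs of the inner product of their latent vectors, and with the standard linear-time reformulation. The input vector and latent vectors are hypothetical values; in DeepFM the latent vectors would be the shared embeddings.

```python
from itertools import combinations

def fm_pairwise_naive(x, v):
    """Sum over i < j of <v_i, v_j> * x_i * x_j."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, v):
    """Same quantity in O(k * n): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    total = 0.0
    for f in range(len(v[0])):
        terms = [v[i][f] * x[i] for i in range(len(x))]
        total += sum(terms) ** 2 - sum(t * t for t in terms)
    return 0.5 * total

# Hypothetical sparse input (features 0, 2, 3 active) and 2-dimensional latent vectors.
x = [1.0, 0.0, 1.0, 1.0]
v = [[0.1, 0.2], [0.3, 0.4], [0.5, -0.1], [-0.2, 0.3]]

naive = fm_pairwise_naive(x, v)
fast = fm_pairwise_fast(x, v)
```

Because every pair's weight is an inner product of latent vectors, the model can score a pair of features that never co-occurred in training, which a hand-specified cross-product feature cannot do.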

### Deep & Cross Network (DCN, 2017)

The Deep & Cross Network, introduced by Wang et al. in 2017, replaces the wide component with a "cross network" that explicitly computes feature crossings at each layer. Each layer of the cross network takes the original input features and the output of the previous cross layer and combines them through a specific crossing operation. This allows the model to learn bounded-degree feature interactions without manual feature engineering.[4]
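A minimal sketch of one cross layer in the original DCN, written as x_{l+1} = x_0 (x_l · w) + b + x_l, where x_0 is the original input and x_l the previous layer's output. The input and weight values below are hypothetical and chosen so the arithmetic is easy to follow.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer with a weight vector w and bias vector b."""
    scalar = sum(xi * wi for xi, wi in zip(xl, w))  # x_l . w is a single number
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

x0 = [1.0, 2.0]
w = [0.5, 0.25]  # hypothetical learned weight vector, reused for both layers here
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, w, b)  # first cross layer: interactions up to degree 2
x2 = cross_layer(x0, x1, w, b)  # second cross layer: interactions up to degree 3
```

Each additional layer multiplies by x_0 once more, so the highest interaction degree grows by one per layer, which is why the degree is bounded by the number of cross layers.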

### DCN-V2 (2021)

DCN-V2, an improved version proposed by Wang et al. in 2021, replaced the cross weight vector in the original DCN with a full weight matrix. In the original DCN, each cross layer can only rescale the input vector by a single learned scalar before adding the bias and residual. The matrix form instead gives every pair of positions in the input and the previous layer's output its own weight, which makes the explicit "bit-wise" feature crossing considerably more expressive. DCN-V2 achieved state-of-the-art results on several benchmark datasets and was deployed at scale in Google's production ranking systems.[5]
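The matrix-based cross layer can be sketched as x_{l+1} = x_0 ⊙ (W x_l + b) + x_l, where ⊙ is the element-wise product. The values of `W`, `b`, and the input are hypothetical and only illustrate the shape of the computation.

```python
def cross_layer_v2(x0, xl, W, b):
    """One DCN-V2 cross layer with a full weight matrix W and bias vector b."""
    projected = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, xl)) + b_i
        for row, b_i in zip(W, b)
    ]
    # Element-wise product with the original input, plus the residual connection.
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]

x0 = [1.0, 2.0]
W = [[0.5, 0.25],
     [0.0, 1.0]]  # hypothetical learned weight matrix
b = [0.0, 0.0]

x1 = cross_layer_v2(x0, x0, W, b)
```

Unlike the vector form, each output position now has its own row of weights, so different positions can emphasize different crossings.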

### Comparison of feature interaction models

| Model | Year | Wide/interaction component | Feature engineering required | Feature interaction order | Key innovation |
|---|---|---|---|---|---|
| Wide & Deep | 2016 | Linear model with cross-product features | Yes (manual) | Manually specified | Joint training of wide and deep components[1] |
| DeepFM | 2017 | Factorization Machine | No | 2nd order (automatic) | Shared embeddings between FM and deep components[3] |
| DCN | 2017 | Cross network with weight vector | No | Bounded degree (automatic) | Cross network for explicit feature crossing[4] |
| xDeepFM | 2018 | Compressed Interaction Network (CIN) | No | Bounded order (automatic) | Vector-level feature interactions via outer products[6] |
| DCN-V2 | 2021 | Cross network with weight matrix | No | Bounded degree (automatic) | Bit-wise interactions via full weight matrix[5] |

## Applications

Wide models and the Wide & Deep framework have been applied across a range of domains:

### Recommendation systems

The original application of Wide & Deep was in [recommendation systems](/wiki/recommender_system), specifically for Google Play app recommendations.[1] The framework has since been widely adopted by companies building large-scale recommendation engines for content, products, and advertisements. Both memorization of user-item interaction history and generalization to new items are important for producing relevant recommendations.

### Click-through rate prediction

Click-through rate (CTR) prediction is a core task in online advertising, where the goal is to estimate the probability that a user will click on a given advertisement. Wide & Deep models and their successors (DeepFM, DCN, DCN-V2) have become standard architectures for CTR prediction, as they handle the sparse, high-dimensional feature spaces typical of ad targeting while also capturing complex user-ad interactions through learned representations.

### Search ranking

Wide & Deep models have been applied to search ranking problems, where the system must score and rank search results based on relevance to a user query. The wide component can memorize specific query-document matches, while the deep component generalizes to related queries and documents that share semantic similarity.

### Natural language processing

In [natural language processing](/wiki/natural_language_processing), wide models have been used for text classification and [sentiment analysis](/wiki/sentiment_analysis). Sparse feature representations (such as bag-of-words or n-gram features) fed into a wide component can capture specific word patterns associated with particular classes, while the deep component can learn distributed representations of text for better generalization.

### Computer vision

In [computer vision](/wiki/computer_vision), the concept of wider networks has been explored through Wide Residual Networks and other architectures that increase the number of channels (feature maps) per layer. These wider architectures have shown competitive or superior performance compared to extremely deep but narrow networks, particularly when the compute budget for training is constrained.[2]

## Implementation

The Wide & Deep model was open-sourced as part of [TensorFlow](/wiki/tensorflow). The `tf.estimator.DNNLinearCombinedClassifier` and `tf.estimator.DNNLinearCombinedRegressor` classes provided built-in implementations of the framework.[12] In TensorFlow 2.x and Keras, practitioners can build custom Wide & Deep architectures using the Functional API by defining separate input branches for the wide and deep components and merging them before the output layer.

A [PyTorch](/wiki/pytorch) implementation is available through the `pytorch-widedeep` library, which provides a flexible framework for combining tabular data with text and image inputs using Wide & Deep architectures.

### Example architecture specification

A typical Wide & Deep architecture for a recommendation task might include:

| Parameter | Value |
|---|---|
| Wide features | Cross-product transformations of user and item categorical features |
| Deep embedding dimension | 32 per categorical feature |
| Deep hidden layers | 3 layers (1024, 512, 256 units) |
| Activation function | [ReLU](/wiki/relu) |
| Output | Logistic (sigmoid) for binary classification |
| Wide optimizer | FTRL with [L1 regularization](/wiki/l1_regularization) |
| Deep optimizer | [AdaGrad](/wiki/adagrad) or [Adam](/wiki/adam_optimizer) |
| Training | Mini-batch [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) |

## Advantages and limitations

### Advantages

- **Combined memorization and generalization**: The architecture handles both the precise recall of known patterns and the ability to generalize to new patterns in a single model.[1]
- **Proven at scale**: The framework was deployed in production on Google Play, an app store with over one billion active users, and has since been adopted in other large-scale recommendation systems.[1]
- **Flexible architecture**: The wide and deep components can be independently configured to suit different data characteristics and task requirements.
- **Efficient serving**: Joint training allows the wide component to remain small (needing only a few cross-product features) since it only has to complement the deep component.[1]

### Limitations

- **Feature engineering for the wide component**: The original Wide & Deep model requires manual engineering of cross-product features for the wide component. This requires domain expertise and can be time-consuming. (Successor models like DeepFM and DCN address this limitation.)[3][4]
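- **No shared feature representation**: The wide component works on sparse cross-product features while the deep component works on dense embeddings, so interactions learned by one component are not available to the other. (DeepFM addresses this by sharing the same input embeddings between its two components.)[3]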
- **Complexity of joint training**: Training two different model architectures jointly requires careful tuning of separate [learning rates](/wiki/learning_rate), [regularization](/wiki/regularization) strategies, and [hyperparameters](/wiki/hyperparameter) for each component.
- **Cold-start problem**: While the deep component can generalize to new items through embeddings, the wide component provides no value for completely new items that have no cross-product feature history.

## See also

- [Deep learning](/wiki/deep_learning)
- [Neural network](/wiki/neural_network)
- [Feature engineering](/wiki/feature_engineering)
- [Recommendation system](/wiki/recommender_system)
- [Logistic regression](/wiki/logistic_regression)
- [Embeddings](/wiki/embeddings)
- [Overfitting](/wiki/overfitting)
- [Generalization](/wiki/generalization)

## References

1. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., & Shah, H. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016)*, pp. 7-10. arXiv:1606.07792.
2. Zagoruyko, S. & Komodakis, N. (2016). "Wide Residual Networks." *Proceedings of the British Machine Vision Conference (BMVC 2016)*. arXiv:1605.07146.
3. Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." *Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017)*, pp. 1725-1731. arXiv:1703.04247.
4. Wang, R., Fu, B., Fu, G., & Wang, M. (2017). "Deep & Cross Network for Ad Click Predictions." *Proceedings of the ADKDD'17*. arXiv:1708.05123.
5. Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., & Chi, E. (2021). "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." *Proceedings of the Web Conference 2021*. arXiv:2008.13535.
6. Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. arXiv:1803.05170.
7. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314.
8. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
9. Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS 2018)*. arXiv:1806.07572.
10. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width." *Advances in Neural Information Processing Systems (NeurIPS 2017)*. arXiv:1709.02540.
11. Nguyen, T., Raghu, M., & Kornblith, S. (2021). "Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth." *Proceedings of the 9th International Conference on Learning Representations (ICLR 2021)*. arXiv:2010.15327.
12. Google Research Blog. (2016). "Wide & Deep Learning: Better Together with TensorFlow." https://research.google/blog/wide-amp-deep-learning-better-together-with-tensorflow/