# Dense Feature

> Source: https://aiwiki.ai/wiki/dense_feature
> Updated: 2026-06-24
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **dense feature** is a [feature](/wiki/feature) in [machine learning](/wiki/machine_learning) whose vector representation consists mostly or entirely of non-zero values, typically stored as a dense numeric array (a vector or tensor) of floating-point numbers. Google's Machine Learning Glossary defines a dense feature as "a feature in which most or all values are nonzero, typically a Tensor of floating-point values," and gives a 10-element tensor with 9 non-zero values as an example. [1] Dense features encode information as compact, continuous numerical vectors where every dimension carries meaningful signal, standing in contrast to [sparse features](/wiki/sparse_feature), which contain predominantly zero-valued entries. They appear naturally in many data types (such as sensor readings, pixel intensities, and audio amplitudes) and can also be produced by learned transformations like [word embeddings](/wiki/word_embedding) and [neural network](/wiki/neural_network) hidden layers.

## Explain like I'm 5 (ELI5)

Imagine you have a box of crayons. A "dense" box means almost every crayon slot is filled with a different color. If someone asks you to describe a picture, you can use lots of colors to describe every little detail. A "sparse" box, on the other hand, is mostly empty slots with only a few crayons. Dense features in machine learning work the same way: they are descriptions of something (like a photo or a sentence) that use lots of numbers, and almost none of those numbers are zero. That means the computer has a rich, detailed description to work with when it tries to learn patterns.

## What is a dense feature? Definition and core concepts

In machine learning and statistics, a feature (also called a variable or attribute) is a measurable property of the data being observed. Features serve as the inputs to models that produce predictions or classifications. When those input values are represented as vectors, the distribution of zero versus non-zero entries determines whether the representation is considered dense or sparse.

A dense feature vector has the property that most or all of its components are non-zero real numbers. Formally, given a feature vector **x** in R^d, the vector is dense if the number of non-zero entries is close to d. This is sometimes quantified using the concept of sparsity ratio:

```
Sparsity ratio = (number of zero entries) / (total entries)
```

A vector with a sparsity ratio near 0 is dense; a vector with a sparsity ratio near 1 is sparse.
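
The ratio above can be computed directly. The following is a minimal sketch in plain Python; the function name `sparsity_ratio` and both example vectors are made up for illustration and are not taken from any library.

```python
def sparsity_ratio(vector):
    """Return the fraction of entries in `vector` that are exactly zero."""
    if not vector:
        raise ValueError("vector must contain at least one entry")
    zero_count = sum(1 for value in vector if value == 0)
    return zero_count / len(vector)


# Illustrative vectors (hypothetical values chosen for this sketch).
dense_vector = [8.0, 3.0, 7.0, 5.0, 2.0, 4.0, 0.0, 4.0, 9.0, 6.0]  # 1 zero in 10
one_hot_vector = [0.0] * 10_000
one_hot_vector[42] = 1.0  # a single active index

dense_ratio = sparsity_ratio(dense_vector)      # close to 0, so dense
one_hot_ratio = sparsity_ratio(one_hot_vector)  # close to 1, so sparse
```

There is no universal cutoff between "dense" and "sparse"; the ratio is best read as a continuum.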

Dense features typically arise in two ways:

1. **Naturally dense data:** Measurements from the physical world where every reading carries information, such as pixel values in images, amplitude values in audio signals, or sensor readings from IoT devices.
2. **Learned dense representations:** Representations produced by [deep learning](/wiki/deep_learning) models that compress high-dimensional, often sparse inputs into lower-dimensional dense vectors through embedding layers or encoder networks.

## How do dense features differ from sparse features?

The distinction between dense and sparse features is one of the most fundamental concepts in [feature engineering](/wiki/feature_engineering) and model design. A sparse feature, by contrast, is one "whose values are predominantly zero or empty," as Google's glossary puts it, with a one-hot vector over a large vocabulary being the canonical example. [1] The following table summarizes the key differences.

| Property | Dense features | Sparse features |
|---|---|---|
| Value distribution | Most or all values are non-zero | Most values are zero |
| Typical dimensionality | Lower (tens to a few thousand dimensions for learned embeddings) | Higher (thousands to millions of dimensions) |
| Storage efficiency | Requires storing every element | Can use compressed formats (CSR, CSC) to skip zeros |
| Information per dimension | Each dimension carries meaningful signal | Most dimensions carry no signal for a given sample |
| Common sources | Sensor data, pixel intensities, embeddings | [One-hot encoding](/wiki/one-hot_encoding), [bag of words](/wiki/bag_of_words), [TF-IDF](/wiki/tf_idf) vectors |
| Interpretability | Individual dimensions are often not directly interpretable | Individual dimensions may correspond to specific categories or terms |
| Suitable algorithms | [Neural networks](/wiki/neural_network), k-nearest neighbors, SVMs with RBF kernels | [Logistic regression](/wiki/logistic_regression), [SVMs](/wiki/support_vector_machine_svm) with linear kernels, [decision trees](/wiki/decision_tree) |
| Computational profile | Matrix operations on fully populated arrays | Requires sparse matrix libraries; benefits from sparsity-aware optimizations |

### Example comparison

Consider representing the word "cat" in a vocabulary of 10,000 words:

- **Sparse (one-hot):** A vector of length 10,000 with a single 1 at the index for "cat" and 0 everywhere else. The sparsity ratio is 9,999/10,000 = 0.9999.
- **Dense (embedding):** A learned vector of length 300 (as in [Word2Vec](/wiki/word2vec) or GloVe), where every element is a non-zero floating-point number like [0.25, -0.73, 0.41, ...]. The sparsity ratio is approximately 0.

## What are the types and sources of dense features?

### Continuous numerical features

The most straightforward dense features are [continuous](/wiki/continuous) numerical values drawn from real-world measurements. These include:

- **Image pixels:** Each pixel in a grayscale image is a single intensity value (0 to 255), and in a color image, each pixel has three channel values (red, green, blue). A 224 x 224 RGB image produces a dense vector of 150,528 values.
- **Audio waveforms:** Raw audio is a time series of amplitude values sampled at rates like 16 kHz or 44.1 kHz. Nearly every sample is a non-zero floating-point value.
- **Sensor and IoT data:** Temperature, pressure, acceleration, and other physical measurements are inherently dense. Each reading from every sensor at every timestep contributes a non-zero value.
- **Financial data:** Stock prices, trading volumes, and economic indicators form dense numerical feature sets.

### Learned dense embeddings

Modern machine learning frequently converts sparse or [categorical](/wiki/categorical) inputs into dense vectors through learned [embedding](/wiki/embedding_layer) layers. This is one of the most common ways dense features are created in practice.

**Word embeddings** are a classic example. Models like [Word2Vec](/wiki/word2vec), GloVe (Global Vectors for Word Representation), and FastText learn to map each word in a vocabulary to a dense vector of fixed dimensionality (typically 50 to 300 dimensions). These vectors capture semantic relationships: words with similar meanings end up with similar vectors. For instance, the vectors for "king" and "queen" are closer together than the vectors for "king" and "banana."

GloVe works by constructing a word co-occurrence matrix from a large text corpus and then fitting word vectors so that the dot product of two word vectors (plus bias terms) approximates the logarithm of the two words' co-occurrence count. The result is a set of dense vectors that encode both syntactic and semantic properties of words. [2]
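
For reference, the weighted least-squares objective from the GloVe paper [2] can be sketched as follows. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. The values $x_{\max} = 100$ and $\alpha = 3/4$ are the settings reported in the paper, not requirements of the method.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function $f$ down-weights rare co-occurrences and caps the influence of very frequent ones.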

**Contextual embeddings** from models like [BERT](/wiki/bert) and [GPT](/wiki/gpt) go further by producing different dense vectors for the same word depending on its surrounding context. Unlike static embeddings from Word2Vec, BERT generates dense representations at the subword level using [transformer](/wiki/transformer) architectures: a word that is missing from its vocabulary is split into known WordPiece subword units, so it can still be represented. [5]

**Entity embeddings** for [categorical](/wiki/categorical) variables convert categories (such as user IDs, product IDs, or zip codes) into dense vectors through a trainable embedding layer. Instead of using one-hot encoding, which produces very sparse and high-dimensional vectors, an embedding layer maps each category to a compact dense vector. Research has shown that entity embeddings can capture meaningful relationships between categories. For example, embeddings for geographic regions might learn spatial proximity, and embeddings for product categories might learn functional similarity. [8]

### Dense features from feature extraction

In [computer vision](/wiki/computer_vision), dense feature descriptors compute a feature vector at every pixel or at regularly spaced grid points across an image. Traditional hand-crafted methods include:

- **Dense SIFT (Scale-Invariant Feature Transform):** Computes SIFT descriptors on a dense grid rather than only at detected keypoints. Research has shown that dense SIFT often outperforms sparse keypoint-based SIFT for tasks like object categorization and texture classification because the dense grid captures more information about the image. [10]
- **HOG (Histogram of Oriented Gradients):** Computes gradient orientation histograms over a dense grid of uniformly spaced cells. HOG was originally developed for pedestrian detection and counts gradient orientations in localized portions of an image. [9]

These hand-crafted dense descriptors have largely been replaced by learned features from [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs), which automatically extract dense [feature maps](/wiki/feature_extraction) at multiple spatial resolutions. The intermediate [layers](/wiki/layer) of a CNN produce dense activation maps where every spatial location has a multi-channel feature vector.

## How are dense features processed and preprocessed?

Dense features require specific preprocessing steps to work well with machine learning algorithms. The choice of preprocessing technique can significantly affect model performance.

### Feature scaling

Because dense features often have different numerical ranges (for example, age might range from 0 to 100 while income might range from 0 to 1,000,000), feature scaling brings all features to a comparable range. This is particularly important for algorithms that rely on distance calculations or [gradient descent](/wiki/gradient_descent).

| Technique | Formula | Range | Best used when |
|---|---|---|---|
| Min-max [normalization](/wiki/normalization) | x' = (x - x_min) / (x_max - x_min) | [0, 1] | Bounded distributions with known min and max |
| Z-score standardization | x' = (x - mean) / std | Unbounded (centered at 0) | Gaussian or approximately normal distributions |
| Robust scaling | x' = (x - median) / IQR | Unbounded | Data with significant outliers |
| Max-abs scaling | x' = x / max(abs(x)) | [-1, 1] | Sparse data that should not be centered |
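
The four formulas in the table can be sketched with the Python standard library. This is an illustrative implementation for a single feature column, not the code of any particular library; the function names and the `incomes` values are hypothetical. Each function assumes the column is not constant (otherwise the denominator is zero).

```python
import statistics


def min_max_scale(values):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center at 0 and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


# Hypothetical income column with one extreme outlier.
incomes = [20_000.0, 35_000.0, 50_000.0, 65_000.0, 1_000_000.0]
min_max_scaled = min_max_scale(incomes)
robust_scaled = robust_scale(incomes)
```

With the outlier present, min-max scaling squeezes the four typical incomes into a narrow band near 0, while robust scaling keeps them spread between -1 and 0.5 and pushes only the outlier far away.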

### Batch normalization

[Batch normalization](/wiki/batch_normalization) is a technique used within [neural networks](/wiki/neural_network) to normalize dense feature activations during training. It computes the mean and variance of activations for each feature within a mini-batch, then normalizes using these statistics. Learnable scale and shift parameters allow the network to adapt to the optimal activation distribution. Batch normalization can speed up and stabilize training, and it can also act as a form of [regularization](/wiki/regularization). [7]
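
The training-time computation described above can be sketched as follows. This is a simplified forward pass only: it omits the running averages that are typically used at inference time and the backward pass. The names `batch_norm_forward`, `gamma` (scale), and `beta` (shift) are illustrative choices.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature column of `batch` (a list of equal-length rows)
    with mini-batch statistics, then apply scale `gamma` and shift `beta`."""
    num_rows = len(batch)
    num_features = len(batch[0])
    output = [[0.0] * num_features for _ in range(num_rows)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / num_rows
        # Biased (divide-by-n) variance of the mini-batch.
        variance = sum((x - mean) ** 2 for x in column) / num_rows
        inv_std = 1.0 / math.sqrt(variance + eps)  # eps avoids division by zero
        for i in range(num_rows):
            output[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return output


# Two features on very different scales (hypothetical values).
mini_batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm_forward(mini_batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, both output columns end up with mean near 0 and variance near 1 regardless of their original scale.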

### Handling missing values

Unlike sparse features where zero values are expected and meaningful, missing values in dense features typically indicate data collection errors or unavailable measurements. Common strategies include mean or median imputation, k-nearest neighbor imputation, and model-based imputation using algorithms that can predict missing values from observed ones.
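
As a minimal sketch of the simplest of these strategies, the function below fills missing readings with the median of the observed values. Representing a missing value as `None` is an assumption of this example; real datasets often use NaN or a sentinel value instead.

```python
import statistics


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill_value = statistics.median(observed)
    return [fill_value if value is None else value for value in column]


# Hypothetical temperature readings with two missing measurements.
temperatures = [21.5, None, 19.0, 22.5, None, 20.0]
imputed = impute_median(temperatures)
```

In a real pipeline the fill value would be computed on the training split only and then reused for validation and test data, to avoid leaking information.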

### Dimensionality reduction

When dense feature vectors are very high-dimensional, [dimensionality reduction](/wiki/principal_component_analysis) techniques can compress them while preserving the most important information:

- **Principal Component Analysis (PCA):** Projects features onto principal component directions that capture the maximum variance. Only the top components are retained.
- **Autoencoders:** [Neural network](/wiki/neural_network) architectures that learn to compress dense features into a lower-dimensional bottleneck representation and then reconstruct them.
- **t-SNE and UMAP:** Non-linear methods often used to visualize high-dimensional dense features in 2D or 3D spaces.
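
To make the PCA item above concrete, here is a dependency-free sketch that estimates only the first principal component by power iteration on the covariance matrix. Production code would normally use a linear algebra library and an SVD or eigendecomposition instead; the function names and the sample points are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the top principal direction of `rows` by power iteration
    on the sample covariance matrix. Returns (feature means, unit vector)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d  # arbitrary non-zero starting direction
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(row, means, component):
    """Reduce one row to a single coordinate along `component`."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Points lying close to the line y = 2x (hypothetical data).
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
means, component = first_principal_component(points)
scores = [project(p, means, component) for p in points]  # 2D -> 1D
```

Because the points are almost collinear, a single coordinate along the recovered direction preserves nearly all of their variance.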

## How are dense features used in neural network architectures?

### Dense (fully connected) layers

The term "dense" also refers to [fully connected layers](/wiki/dense_layer) in neural networks, where every input [neuron](/wiki/neuron) is connected to every output neuron. A [dense layer](/wiki/dense_layer) performs a linear transformation (matrix multiplication plus bias) followed by an [activation function](/wiki/activation_function). Dense layers are designed to process dense feature vectors and learn interactions between all input features.
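
A dense layer's forward pass for a single input vector can be sketched as below. The weights, biases, and the choice of ReLU as the activation are illustrative assumptions; frameworks implement the same computation as a batched matrix multiplication.

```python
def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, x)


def dense_layer(inputs, weights, biases, activation=relu):
    """Compute activation(W x + b) for one input vector.

    `weights` has one row per output unit, and each row holds one weight
    per input feature, so every input connects to every output.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(pre_activation))
    return outputs


features = [0.5, -1.0, 2.0]      # dense input vector with 3 features
weights = [[1.0, 0.0, 0.5],      # weights for output unit 0
           [-1.0, 2.0, 0.0]]     # weights for output unit 1
biases = [0.1, 0.0]
activations = dense_layer(features, weights, biases)
```

Here the first unit computes 0.5 + 0 + 1.0 + 0.1 = 1.6, while the second computes -0.5 - 2.0 = -2.5, which ReLU clamps to 0.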

In a typical CNN architecture, convolutional layers extract spatial dense features from images, and then one or more dense layers at the end of the network combine these features to produce final predictions.

### Processing dense and sparse features together

Many real-world systems need to process both dense and sparse features simultaneously. This is particularly common in [recommendation systems](/wiki/recommender_system) and advertising click-through rate prediction, where the input data includes both dense numerical features (such as user age or item price) and sparse categorical features (such as user ID or item category).

Several architectures have been designed specifically for this purpose:

| Model | Year | Key idea | How dense features are used |
|---|---|---|---|
| Wide and Deep | 2016 | Combines a wide linear model with a deep neural network | Dense embeddings from sparse features are fed into the deep component; raw features go to the wide component |
| DeepFM | 2017 | Combines factorization machines with deep neural networks | Dense embeddings of all categorical fields are concatenated with dense numerical features as shared input |
| DCN (Deep and Cross Network) | 2017 | Introduces a cross network for explicit feature interactions | Concatenation of dense embeddings and normalized dense features serves as input to both cross and deep networks [11] |
| xDeepFM | 2018 | Adds compressed interaction network | Processes dense feature embeddings through both explicit and implicit interaction layers |

The Wide and Deep architecture, introduced by Google in 2016 (Cheng et al.), was one of the first systems to demonstrate the value of combining memorization (from the wide component) with generalization (from the deep component). The paper describes how "deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features." [3] The deep component converts sparse features into low-dimensional dense embeddings and passes them through multiple hidden layers, while the wide component handles cross-product feature transformations directly. This architecture was productionized and evaluated on Google Play, a store with over one billion active users and over one million apps, and online experiments showed it significantly increased app acquisitions compared with wide-only or deep-only models. [3]

DeepFM (Guo et al., 2017) improved upon Wide and Deep by replacing the wide component with a factorization machine, eliminating the need for manual feature engineering. The FM component models low-order feature interactions using dot products of embedding vectors, while the deep component learns high-order interactions. Both components share the same embedding input, meaning that dense embeddings of sparse features are used by both simultaneously. [4]
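
The pairwise (second-order) interaction term of a factorization machine, which is built from dot products of embedding vectors, can be sketched as follows. This is an illustration of the general FM formula rather than DeepFM's actual implementation; it omits the first-order terms and the deep component, and the input values are hypothetical. The second function uses the standard algebraic identity that avoids looping over all pairs.

```python
def fm_pairwise_naive(x, embeddings):
    """Sum <v_i, v_j> * x_i * x_j over all feature pairs i < j."""
    total = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, embeddings):
    """Same quantity in O(n * k) time, using
    0.5 * sum_f [ (sum_i v_if * x_i)^2 - sum_i (v_if * x_i)^2 ]."""
    k = len(embeddings[0])
    total = 0.0
    for f in range(k):
        weighted = [v[f] * xi for v, xi in zip(embeddings, x)]
        total += sum(weighted) ** 2 - sum(w * w for w in weighted)
    return 0.5 * total


# Four features, three of them active, each with a 2-dimensional embedding.
x = [1.0, 0.0, 1.0, 1.0]
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]
naive_score = fm_pairwise_naive(x, embeddings)
fast_score = fm_pairwise_fast(x, embeddings)
```

Both functions return the same score; the second form is what makes factorization machines practical when the number of features is large.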

## How are dense features used in natural language processing?

In [natural language processing](/wiki/natural_language_processing) (NLP), the shift from sparse to dense feature representations was one of the most significant developments in the field's history.

### From sparse to dense word representations

Traditional NLP relied on sparse representations like [bag of words](/wiki/bag_of_words) and TF-IDF, where each dimension corresponded to a specific word in the vocabulary. For a vocabulary of 100,000 words, each document would be represented as a 100,000-dimensional vector with mostly zero entries.

The introduction of dense word embeddings changed this fundamentally. [Word2Vec](/wiki/word2vec), introduced by Mikolov et al. at Google in 2013, demonstrated that words could be represented as 100- to 300-dimensional dense vectors that captured semantic relationships. [1] Word2Vec uses two architectures:

- **Continuous Bag of Words (CBOW):** Predicts a target word from its surrounding context words.
- **Skip-gram:** Predicts surrounding context words from a target word.
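
The difference between the two architectures is easiest to see in the training examples each one builds from the same text. The sketch below only generates those examples; it does not train a model, and it leaves out details of the original implementation such as subsampling of frequent words. The function names and the sentence are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs: each target word is paired with
    every neighbor within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: the surrounding words
    are the input and the center word is the prediction target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
```

For the word "cat" with a window of 1, Skip-gram yields the pairs ("cat", "the") and ("cat", "sat"), while CBOW yields the single example (["the", "sat"], "cat").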

GloVe, developed at Stanford by Pennington et al. in 2014, takes a different approach by analyzing global word co-occurrence statistics rather than local context windows. Pre-trained GloVe embeddings, trained on roughly 6 billion tokens of Wikipedia and Gigaword text, are released in 50, 100, 200, and 300 dimensions. [2]

### Dense representations in modern NLP

Modern language models like [BERT](/wiki/bert) and [GPT](/wiki/gpt) produce contextual dense feature vectors for every token in a sequence. BERT-base generates 768-dimensional dense vectors, while BERT-large produces 1,024-dimensional vectors. [5] These dense representations capture not just word-level semantics but also syntactic structure, co-reference relationships, and discourse-level meaning.

These dense features serve as the foundation for downstream tasks such as [sentiment analysis](/wiki/sentiment_analysis), named entity recognition, question answering, and [text classification](/wiki/text_classification_models). [Fine-tuning](/wiki/fine_tuning) pre-trained dense representations on task-specific data has become the standard approach in NLP.

## How are dense features used in information retrieval?

Dense features have transformed [information retrieval](/wiki/information_retrieval) through dense retrieval methods, which represent queries and documents as dense vectors and use similarity measures like [cosine similarity](/wiki/cosine_similarity) to find relevant matches.

**Dense Passage Retrieval (DPR)**, introduced by Karpukhin et al. at Facebook AI in 2020, uses two separate BERT-based encoders to produce dense vectors for questions and passages. The system pre-computes passage embeddings and stores them in a [vector database](/wiki/vector_database) index (such as FAISS), enabling fast nearest-neighbor search at query time. According to the authors, the dense retriever "outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy" across several open-domain QA benchmarks. [6]
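
The retrieval step can be sketched as a brute-force similarity search over pre-computed passage vectors. The three-dimensional vectors below are hypothetical stand-ins for encoder outputs, and the linear scan takes the place of an approximate nearest-neighbor index; this is not DPR's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, passage_vectors, top_k=2):
    """Rank pre-computed passage vectors by similarity to the query
    and return the ids of the `top_k` best matches."""
    scored = [(cosine_similarity(query_vector, vec), passage_id)
              for passage_id, vec in passage_vectors.items()]
    scored.sort(reverse=True)
    return [passage_id for _, passage_id in scored[:top_k]]


# Toy vectors standing in for encoder outputs (hypothetical values).
passage_index = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.0, 1.0, 0.2],
    "p3": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
top_passages = retrieve(query, passage_index)
```

A linear scan like this costs time proportional to the number of passages, which is why large collections rely on a dedicated vector index.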

Dense retrieval plays a central role in [Retrieval-Augmented Generation](/wiki/retrieval_augmented_generation) (RAG) systems, where dense feature vectors are used to find relevant text passages that are then passed to a [large language model](/wiki/large_language_model) for answer generation.

### Dense vs. sparse retrieval comparison

| Aspect | Dense retrieval | Sparse retrieval (BM25/TF-IDF) |
|---|---|---|
| Representation | Learned dense vectors (typically 768 dimensions) | Sparse term-frequency vectors |
| Matching | [Semantic similarity](/wiki/semantic_search) via dot product or cosine | Exact lexical matching |
| Strengths | Handles synonyms, paraphrases, and semantic variations | Fast, interpretable, strong for exact keyword queries |
| Weaknesses | Requires training data; may miss exact term matches | Cannot handle semantic variations or paraphrases |
| Infrastructure | Requires vector index (FAISS, HNSW) | Inverted index (Lucene, Elasticsearch) |

## How are dense features used in computer vision?

In [computer vision](/wiki/computer_vision), dense features refer to feature representations computed at every spatial location in an image, as opposed to sparse features computed only at detected keypoints.

[Convolutional neural networks](/wiki/convolutional_neural_network) naturally produce dense feature maps. Each convolutional layer outputs a 3D tensor (height x width x channels) where every spatial position has a dense feature vector. For example, given a 224 x 224 input image, the last convolutional stage of ResNet-50 produces a 7 x 7 x 2048 tensor, meaning 49 spatial locations each described by a 2048-dimensional dense vector. [13]

Dense feature maps are used in several vision tasks:

- **[Object detection](/wiki/object_detection):** Models like Faster R-CNN and YOLO use dense feature maps to generate predictions at every spatial location.
- **Semantic segmentation:** Fully convolutional networks produce per-pixel dense predictions by using dense features throughout the network.
- **[Image recognition](/wiki/image_recognition):** Global average pooling reduces dense spatial features to a single dense vector per image for classification.
- **Dense correspondence:** Methods like DensePose compute dense feature maps that establish correspondence between image pixels and 3D surface models.
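
Global average pooling, mentioned in the image recognition item above, can be sketched in a few lines. The nested-list layout (height, then width, then channels) and the tiny 2 x 2 x 3 feature map are assumptions made to keep the example readable.

```python
def global_average_pool(feature_map):
    """Average an H x W x C feature map over its spatial positions,
    returning a single C-dimensional dense vector."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                pooled[c] += value
    return [total / (height * width) for total in pooled]


# A tiny 2 x 2 x 3 map for illustration (hypothetical activations).
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
image_vector = global_average_pool(feature_map)
```

The same operation applied to a 7 x 7 x 2048 feature map would yield one 2048-dimensional vector per image.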

## How do you convert sparse features to dense features?

Converting sparse features into dense representations is a common operation in machine learning pipelines. Several techniques are used:

### Embedding layers

The most common approach in deep learning uses trainable [embedding layers](/wiki/embedding_layer) to map each sparse categorical value to a dense vector. For a categorical feature with N possible values, an embedding layer maintains a lookup table of N dense vectors, each of dimensionality d (where d is much smaller than N). During training, these vectors are updated through [backpropagation](/wiki/backpropagation).
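
The lookup-table view of an embedding layer can be sketched as follows. The table is filled with small random values, which is one common way to initialize it; training is omitted, and the sizes (N = 1,000, d = 8) and function names are hypothetical. The second function shows that a lookup is mathematically the same as multiplying a one-hot vector by the table, only far cheaper.

```python
import random


def make_embedding_table(num_categories, dim, seed=0):
    """Create an N x d table of small random values (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]


def embed(category_index, table):
    """Look up the dense vector for one category: an O(d) row read."""
    return table[category_index]


def embed_via_one_hot(category_index, table):
    """Equivalent but wasteful: multiply a one-hot row vector by the table."""
    one_hot = [0.0] * len(table)
    one_hot[category_index] = 1.0
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][f] for i in range(len(table)))
            for f in range(dim)]


table = make_embedding_table(num_categories=1000, dim=8)
vector = embed(417, table)                  # 8 dense values for category 417
same_vector = embed_via_one_hot(417, table)
```

During training, only the rows that were looked up in a batch receive gradient updates, which keeps the cost manageable even for very large N.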

### Feature hashing

Feature hashing (also called the hashing trick) applies a hash function to map high-dimensional sparse features into a vector of fixed, much smaller size, so the result is far more compact even though it may still contain zeros. [12] Large-scale advertising datasets such as Criteo contain tens of millions of distinct categorical feature values, and feature hashing lets practitioners reduce them to a fixed, much smaller number of dimensions, usually at a small cost in accuracy caused by hash collisions. Feature hashing is fast, memory-efficient, and well-suited for online learning scenarios.
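
A minimal sketch of signed feature hashing is shown below. Using SHA-256 as the hash function, taking the sign from a separate byte of the digest, and the default of 16 buckets are all choices made for this illustration; production systems typically use a faster non-cryptographic hash and far more buckets. The feature strings are hypothetical.

```python
import hashlib


def hash_features(tokens, num_buckets=16):
    """Map any number of string features to a fixed-length vector.

    Each token is hashed to a bucket index, and a second piece of the
    hash decides whether it contributes +1 or -1, so that colliding
    features tend to cancel rather than accumulate.
    """
    vector = [0.0] * num_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % num_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[bucket] += sign
    return vector


# Hypothetical categorical features for one ad impression.
features = ["user_id=8731", "device=mobile", "site=news"]
hashed = hash_features(features)
```

No vocabulary has to be stored or built in advance, which is what makes the approach convenient when new feature values keep arriving.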

### Dimensionality reduction

[Principal Component Analysis](/wiki/principal_component_analysis) (PCA) and other linear methods can project sparse feature vectors into lower-dimensional dense subspaces. However, these methods are limited by their linear nature and may not capture complex non-linear relationships.

### Autoencoders

[Autoencoders](/wiki/autoencoder) are neural networks trained to compress high-dimensional inputs (including sparse ones) into lower-dimensional dense bottleneck representations and then reconstruct the original input. The bottleneck layer serves as a dense encoding of the input.

## What are the advantages of dense features?

- **Semantic richness:** Dense features capture nuanced relationships and semantic similarities that sparse representations cannot. For example, dense word embeddings encode analogical relationships (king - man + woman ≈ queen).
- **Computational efficiency for neural networks:** Dense matrix operations are highly optimized on modern hardware, especially [GPUs](/wiki/gpu_computing) and TPUs. For the same effective information content, multiplying compact dense matrices is often faster than running sparse matrix operations, which involve irregular memory access.
- **Lower dimensionality:** Dense features typically require far fewer dimensions than sparse alternatives. A 300-dimensional dense word embedding encodes more semantic information than a 100,000-dimensional one-hot vector.
- **Generalization:** Dense representations can generalize to unseen inputs by placing similar items close together in the embedding space. This allows models to make reasonable predictions even for inputs not seen during training.
- **Transfer learning:** Pre-trained dense features (such as word embeddings or CNN features) can be reused across different tasks, a technique known as [transfer learning](/wiki/transfer_learning). Entity embeddings trained on one task can serve as informative features for tree-based models or other classifiers.

## What are the disadvantages and challenges of dense features?

- **Loss of interpretability:** Individual dimensions of a dense feature vector usually do not correspond to any identifiable concept, making it difficult to understand what the model has learned.
- **Computational cost for very large inputs:** While dense representations are efficient per dimension, processing very high-resolution images or long sequences can be computationally expensive because every element must be stored and processed.
- **Memory requirements:** Dense vectors require storing every element, unlike sparse representations that can skip zero values. For a batch of 10,000 data points each with 1,024 dense features stored as 32-bit floats, the system must store roughly 41 MB of values (10,000 x 1,024 x 4 bytes = 40,960,000 bytes).
- **Training data requirements:** Learning good dense representations typically requires large amounts of training data. Models like Word2Vec and BERT were trained on billions of words of text. [1][5]
- **Risk of [overfitting](/wiki/overfitting):** Dense layers with many parameters can memorize training data instead of learning generalizable patterns, particularly on smaller datasets. Techniques like [dropout](/wiki/dropout) and [regularization](/wiki/regularization) are commonly used to mitigate this.
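
The memory arithmetic in the list above can be checked with a short calculation. The sketch also estimates the size of the same array in CSR (compressed sparse row) form; the 4-byte index width and the 1% non-zero scenario are assumptions made for illustration.

```python
def dense_bytes(num_rows, num_features, bytes_per_value=4):
    """Memory for a fully populated array of fixed-width values."""
    return num_rows * num_features * bytes_per_value


def csr_bytes(num_rows, nonzeros, bytes_per_value=4, bytes_per_index=4):
    """Memory for CSR storage: one value and one column index per
    non-zero entry, plus (num_rows + 1) row pointers."""
    per_entry = bytes_per_value + bytes_per_index
    return nonzeros * per_entry + (num_rows + 1) * bytes_per_index


rows, features = 10_000, 1_024
batch_bytes = dense_bytes(rows, features)

# Same shape stored as CSR when only 1% of entries are non-zero ...
sparse_case = csr_bytes(rows, nonzeros=rows * features // 100)
# ... and when every entry is non-zero (i.e. the data is actually dense).
dense_as_csr = csr_bytes(rows, nonzeros=rows * features)
```

Under these assumptions CSR needs under 1 MB for the 1% case but about twice the dense size when nothing is zero, which is why compressed formats only pay off for genuinely sparse data.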

## Practical considerations

### Choosing between dense and sparse representations

The choice between dense and sparse feature representations depends on several factors:

- **Data type:** Naturally continuous data (images, audio, sensor data) is inherently dense. Categorical data with many possible values benefits from conversion to dense embeddings.
- **Model type:** [Neural networks](/wiki/neural_network) generally work best with dense features. Linear models and tree-based models can work well with both dense and sparse features.
- **Dataset size:** Dense embedding approaches require sufficient training data to learn meaningful representations. With very small datasets, sparse representations may be more reliable.
- **Latency requirements:** Dense embeddings can be pre-computed and cached for fast inference, while sparse features with very high dimensionality may require more memory at serving time.

### Tools and libraries

Several popular tools support working with dense features:

- **[PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow):** Both provide embedding layers, dense layers, and optimized GPU operations for dense feature processing.
- **[scikit-learn](/wiki/scikit-learn):** Offers preprocessing utilities like StandardScaler, MinMaxScaler, and PCA for dense feature preparation.
- **Gensim:** Provides implementations of Word2Vec and other dense word embedding models.
- **FAISS (Facebook AI Similarity Search):** Optimized library for efficient nearest-neighbor search over large collections of dense vectors.
- **NumPy and pandas:** Standard tools for manipulating dense numerical feature arrays and data frames.

## See also

- [Sparse feature](/wiki/sparse_feature)
- [Feature engineering](/wiki/feature_engineering)
- [Feature vector](/wiki/feature_vector)
- [Feature extraction](/wiki/feature_extraction)
- [Word embedding](/wiki/word_embedding)
- [Dense layer](/wiki/dense_layer)
- [Embedding layer](/wiki/embedding_layer)
- [Normalization](/wiki/normalization)
- [Dimensionality reduction](/wiki/principal_component_analysis)

## References

1. Google for Developers. "Machine Learning Glossary." *developers.google.com*. Entries for "dense feature" and "sparse feature." https://developers.google.com/machine-learning/glossary
2. Pennington, J., Socher, R., & Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pp. 1532-1543. https://nlp.stanford.edu/projects/glove/
3. Cheng, H.-T., Koc, L., Harmsen, J., et al. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*, pp. 7-10. https://arxiv.org/abs/1606.07792
4. Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). "DeepFM: A Factorization-Machine based Neural Network for CTR Prediction." *Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI)*, pp. 1725-1731. https://arxiv.org/abs/1703.04247
5. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*, pp. 4171-4186. https://arxiv.org/abs/1810.04805
6. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W.-t. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pp. 6769-6781. https://arxiv.org/abs/2004.04906
7. Ioffe, S. & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, pp. 448-456. https://arxiv.org/abs/1502.03167
8. Guo, C. & Berkhahn, F. (2016). "Entity Embeddings of Categorical Variables." *arXiv preprint arXiv:1604.06737*. https://arxiv.org/abs/1604.06737
9. Dalal, N. & Triggs, B. (2005). "Histograms of Oriented Gradients for Human Detection." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 886-893.
10. Lowe, D. G. (2004). "Distinctive Image Features from Scale-Invariant Keypoints." *International Journal of Computer Vision*, 60(2), pp. 91-110.
11. Wang, R., Fu, B., Fu, G., & Wang, M. (2017). "Deep & Cross Network for Ad Click Predictions." *Proceedings of the ADKDD'17*, Article 12. https://arxiv.org/abs/1708.05123
12. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). "Feature Hashing for Large Scale Multitask Learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, pp. 1113-1120. https://arxiv.org/abs/0902.2206
13. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, pp. 770-778. https://arxiv.org/abs/1512.03385