# Sparse Feature

> Source: https://aiwiki.ai/wiki/sparse_feature
> Updated: 2026-06-24
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **sparse feature** is a [feature](/wiki/feature) in [machine learning](/wiki/machine_learning) whose values are predominantly zero or empty across a dataset. Google's Machine Learning Crash Course defines it directly: "A feature whose values are predominantly zero (or empty) is termed a sparse feature."[13] In contrast to a [dense feature](/wiki/dense_feature), which Google defines as "a feature in which most or all values are nonzero, typically a Tensor of floating-point values,"[14] a sparse feature contains only a small fraction of meaningful (non-zero) entries relative to the total number of possible entries. Such [sparsity](/wiki/sparsity) arises naturally in many domains, including [natural language processing](/wiki/natural_language_understanding), [recommendation systems](/wiki/recommender_system), click-through rate prediction, and any setting where [categorical data](/wiki/categorical_data) is encoded numerically. In text features in particular, scikit-learn notes that "the resulting matrix will have many feature values that are zeros (typically more than 99% of them)."[15]

Sparse features are typically stored using specialized [sparse matrix](/wiki/sparse_representation) formats that record only the non-zero values and their positions.[9] In Google's framing, "sparse representation means storing the position of the 1.0 in a sparse vector": a one-hot vector such as [0, 0, 1, 0, 0, 0, 0, 0] for the color "Blue" is stored simply as the index 2, and a multi-hot vector for an item that is both "Blue" and "Black" is stored as the indices 2, 5.[13] This approach saves memory and speeds up computation, since algorithms can skip the vast majority of zero-valued entries. Understanding how sparse features originate, how they are represented, and which algorithms handle them well is a practical requirement for anyone building machine learning systems at scale.

## Explain like I'm 5 (ELI5)

Imagine you have a giant box of crayons with 1,000 different colors. When you draw a picture, you only use about 5 or 10 of those crayons. If someone wrote down a list of all 1,000 colors and put a checkmark next to the ones you actually used, almost every spot on the list would be blank, and only a few would have checkmarks. That list is a "sparse feature." Most of the information is just "nope, didn't use this one," and only a tiny part says something interesting. Computers are smart about this: instead of writing out the whole long list of blanks, they just remember which crayons you did use. That way they don't waste space or time on all the empty spots.

## What is a sparse feature versus a dense feature?

The distinction between sparse and [dense features](/wiki/dense_feature) is one of the first concepts taught in feature representation, because it drives memory usage, algorithm selection, and model design. A sparse feature is mostly zeros; a dense feature is mostly non-zeros. Google's Machine Learning Crash Course gives car color as a canonical example of a sparse feature, since each example takes only one of many possible values and the one-hot encoding is therefore almost entirely zero.[13]

| Property | Sparse features | Dense features |
|---|---|---|
| Value distribution | Mostly zeros | Mostly non-zero |
| Typical origin | [One-hot encoding](/wiki/one-hot_encoding), [bag of words](/wiki/bag_of_words), user-item matrices | Sensor readings, pixel values, [embeddings](/wiki/embeddings) |
| Dimensionality | Often very high (thousands to millions) | Typically lower (tens to hundreds) |
| Storage | Compressed formats (CSR, CSC, COO) | Standard dense arrays |
| Memory efficiency | High (store only non-zero entries) | Lower (store every value) |
| Interpretability | Often interpretable (each dimension has a clear meaning) | Dimensions may lack direct meaning |
| Best algorithms | Linear models, SVMs, Naive Bayes | [Neural networks](/wiki/neural_network), tree-based methods |
| Conversion | Can be converted to dense via embeddings | Can be sparsified via thresholding |

In modern systems, sparse and dense features are often combined. For example, a recommendation model might use sparse features for user and item IDs (via [one-hot encoding](/wiki/one-hot_encoding)) alongside dense features for numerical attributes like price or rating. Frameworks such as [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch) provide native support for sparse tensors and embedding lookups to handle this combination efficiently.

## How do sparse features arise?

Sparse features appear whenever the number of possible values or categories is large but each individual data point activates only a small subset. Several common processes produce sparse features.

### One-hot encoding

[One-hot encoding](/wiki/one-hot_encoding) converts a categorical variable with _k_ categories into a binary vector of length _k_, where exactly one element is set to 1 and all others are 0. For example, encoding the variable "country" across 195 possible countries produces a 195-dimensional vector with a single 1 and 194 zeros for each data point. As the number of categories grows, the resulting vectors become extremely sparse.

One-hot encoding is one of the most common sources of sparse features in practice. When a dataset contains multiple categorical columns, each with many categories, the combined one-hot encoded representation can have thousands or tens of thousands of dimensions, with well over 95% of all values equal to zero.

| Property | Value |
|---|---|
| Input | A single categorical variable with _k_ categories |
| Output | A binary vector of length _k_ |
| Number of non-zero entries per vector | Exactly 1 |
| Sparsity | (_k_ - 1) / _k_, approaching 100% for large _k_ |
| Common use cases | Encoding colors, countries, product IDs, user IDs |

### Bag of words

The [bag of words](/wiki/bag_of_words) (BoW) model represents a text document as a vector over a fixed vocabulary, where each element records how many times a particular word appears in that document.[7] Because any single document uses only a small fraction of the total vocabulary, the resulting vectors are highly sparse.

Consider a vocabulary of 50,000 words. A typical document might contain 200 unique words, leaving 49,800 entries at zero. This translates to 99.6% sparsity. With larger vocabularies (100,000 or more words), sparsity routinely exceeds 99.8%. As a concrete real-world figure, the scikit-learn documentation observes that a collection of 10,000 short text documents such as emails "will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually."[15] The document-term matrix, where each row is a document and each column is a vocabulary word, is therefore stored as a sparse matrix in practice.

The bag of words model ignores word order and grammar entirely, treating each document as an unordered collection of word counts. Despite this simplification, BoW remains widely used for tasks such as document classification, spam detection, and [sentiment analysis](/wiki/sentiment_analysis).[8]

### TF-IDF

TF-IDF (term frequency-inverse document frequency) extends the bag of words model by weighting each word count according to how common or rare the word is across the entire document collection.[7] The TF-IDF score for a term _t_ in a document _d_ is computed as:

**TF-IDF(t, d) = TF(t, d) x IDF(t)**

where:

- **TF(t, d)** is the term frequency, measuring how often term _t_ appears in document _d_, often normalized by the total number of terms in _d_.
- **IDF(t)** is the inverse document frequency, calculated as log(N / df_t), where _N_ is the total number of documents and df_t is the number of documents containing term _t_.

Words that appear in many documents (such as "the" or "is") receive low IDF values, while words that appear in only a few documents receive high IDF values. The resulting TF-IDF vectors have the same dimensionality as BoW vectors and are equally sparse, since most terms still do not appear in any given document.

TF-IDF is commonly used for search ranking, text classification, keyword extraction, and as input to traditional machine learning classifiers.

| Method | What it measures | Sparsity pattern | Typical use |
|---|---|---|---|
| [Bag of words](/wiki/bag_of_words) | Raw word counts | Same as vocabulary sparsity | Document classification, spam filtering |
| TF-IDF | Weighted word importance | Same dimensionality as BoW, equally sparse | Search ranking, text classification |
| Binary BoW | Word presence (0 or 1) | Same structure, slightly sparser | Simple text similarity |

### Multi-hot encoding

Multi-hot encoding generalizes [one-hot encoding](/wiki/one-hot_encoding) for cases where a data point can belong to multiple categories simultaneously. For example, a movie might be tagged with both "Action" and "Comedy" from a set of 20 genres. The resulting vector has length 20, with 1s in two positions and 0s in the remaining 18. Multi-hot vectors are sparse whenever the number of active categories per data point is small relative to the total number of categories.

### User-item interaction matrices

In [recommendation systems](/wiki/recommender_system), user-item interaction matrices record which users have interacted with which items (through ratings, purchases, or clicks). A platform with 1 million users and 100,000 items has a potential matrix of 100 billion entries, but each user typically interacts with only a tiny fraction of all items. On platforms like Netflix, fewer than 1% of possible user-item pairs have recorded interactions. The widely used MovieLens 100K benchmark is similarly sparse: its 100,000 ratings from roughly 1,000 users on roughly 1,700 movies fill only about 1.6% of the matrix, a sparsity near 98.4%.[16] This extreme sparsity is one of the central challenges in [collaborative filtering](/wiki/collaborative_filtering) and [matrix factorization](/wiki/matrix_factorization).[10]

## How are sparse features stored efficiently?

Storing a sparse feature vector or matrix in a standard dense array wastes memory on the zero entries. A 10,000 x 10,000 matrix with only 0.1% non-zero entries would require approximately 800 MB in a dense double-precision format but only about 0.8 MB when stored in a sparse format. Several storage formats exist, each optimized for different access patterns.

### Coordinate list (COO)

The COO format stores each non-zero entry as a triplet of (row index, column index, value). It is simple to construct and easy to understand, making it a natural format for building sparse matrices incrementally. However, COO does not support efficient arithmetic operations or slicing directly. It is commonly used as an intermediate format: data is assembled in COO and then converted to CSR or CSC for computation.

### Compressed sparse row (CSR)

The CSR format stores non-zero values in a contiguous array alongside two auxiliary arrays. One array holds the column indices of each non-zero value, and a second array (the row pointer array) records where each row begins in the values array. CSR supports efficient row slicing and fast matrix-vector multiplication, making it the default sparse format in many machine learning libraries, including [scikit-learn](/wiki/scikit-learn) and [SciPy](https://docs.scipy.org/doc/scipy/reference/sparse.html).[9] CSR is also known as the "Yale format" because it was proposed in the 1977 Yale Sparse Matrix Package report by Eisenstat, Gursky, Schultz, and Sherman, though the format had already been in use since at least the mid-1960s, with a complete description appearing in 1967.[17]

### Compressed sparse column (CSC)

CSC is the column-oriented counterpart to CSR. It stores non-zero values column by column, with a column pointer array and row index array. CSC is efficient for column slicing and is the preferred input format for certain linear algebra solvers. Conversions between CSR, CSC, and COO are all linear-time operations.

| Format | Structure | Best for | Limitations |
|---|---|---|---|
| COO | (row, col, value) triplets | Incremental construction, format conversion | No direct arithmetic or slicing |
| CSR | Row pointers, column indices, values | Row access, matrix-vector products, most ML tasks | Column slicing is slower |
| CSC | Column pointers, row indices, values | Column access, certain solvers | Row slicing is slower |
| Dictionary of keys (DOK) | Dictionary mapping (row, col) to value | Random element access and updates | Slow for arithmetic |
| Block sparse row (BSR) | CSR over dense sub-blocks | Data with dense sub-regions | Not general-purpose |

## Feature hashing

[Feature hashing](/wiki/feature_engineering), also called the hashing trick, maps high-dimensional sparse features into a fixed-size, lower-dimensional vector using a hash function. Instead of maintaining an explicit mapping from feature names to column indices (as in a vocabulary-based approach), feature hashing applies a hash function to each feature name and uses the result, modulo the chosen vector size, as the column index.

The technique was formalized by Weinberger et al. in 2009, though earlier descriptions by John Moody appeared as far back as 1989. Weinberger et al. applied feature hashing to multi-task learning and spam filtering for hundreds of thousands of users, demonstrating that a single parameter vector could capture both per-user and global spam filters.[3]

### How feature hashing works

1. Choose a fixed output dimension _m_ (for example, 2^18 = 262,144). In scikit-learn, the `FeatureHasher` and `HashingVectorizer` classes default to a much larger 2^20 = 1,048,576 columns to keep hash collisions rare.[18]
2. For each feature name (such as a word or categorical value), compute a hash: index = hash(feature_name) mod _m_.
3. Optionally, use a second hash function to determine the sign (+1 or -1) of the contribution, which helps reduce the bias introduced by collisions.
4. Add the feature value at the computed index.

Because multiple distinct features can hash to the same index (a collision), some information is lost. In practice, when the output dimension is reasonably large and the input is sparse, collisions have minimal impact on model accuracy.[3]

### Advantages and trade-offs of feature hashing

| Aspect | Detail |
|---|---|
| Memory | Bounded by the chosen output dimension; no vocabulary dictionary needed |
| Speed | Hash computation is fast; no dictionary lookups |
| Streaming data | New features can be hashed on the fly without retraining; scikit-learn's `HashingVectorizer` is stateless, so it needs no `fit` step[15] |
| Collisions | Multiple features may map to the same index, causing small accuracy loss |
| Interpretability | Individual feature names cannot be recovered from the hashed representation |
| Implementation | Available in [scikit-learn](/wiki/scikit-learn) as `HashingVectorizer` and `FeatureHasher` |

## Sparse coding

Sparse coding is an [unsupervised learning](/wiki/unsupervised_machine_learning) technique that represents each data point as a sparse linear combination of basis vectors (also called a dictionary) learned from the data. The goal is to find a set of basis vectors such that each input can be reconstructed using only a small number of them, with the remaining coefficients set to zero or near zero.

Olshausen and Field introduced a foundational sparse coding model in 1996, showing that when a learning algorithm is trained to find sparse representations of natural images, the learned basis vectors resemble the orientation-selective receptive fields of neurons in the primary visual cortex (V1).[4] Their Nature paper reported that "a learning algorithm that attempts to find sparse linear codes for natural scenes will develop a complete family of localized, oriented, bandpass receptive fields, similar to those found in the primary visual cortex."[4] This finding provided evidence that the brain may use sparse coding as a strategy for efficient sensory representation.

In machine learning, sparse coding is used for [feature extraction](/wiki/feature_extraction), [dimensionality reduction](/wiki/dimension_reduction), denoising, and as a building block in deep learning architectures. The sparsity constraint acts as a form of [regularization](/wiki/regularization), preventing the model from using all basis vectors simultaneously and encouraging it to learn more interpretable, localized features.

## Sparse principal component analysis

Standard [principal component analysis](/wiki/dimension_reduction) (PCA) finds the directions of maximum variance in the data, but the resulting principal components are typically dense linear combinations of all original features. This makes them difficult to interpret, especially in high-dimensional settings.

Sparse PCA adds an [L1 regularization](/wiki/l1_regularization) (lasso) penalty to the PCA objective, encouraging many of the loadings in each principal component to be exactly zero. The result is a set of principal components that are easier to interpret because each component depends on only a small subset of the original features.[5]

Sparse PCA is particularly useful in genomics, text analysis, and other domains where the number of features is very large and practitioners need to identify which specific features drive each component.

## Which algorithms work best with sparse data?

Different machine learning algorithms handle sparse features with varying degrees of effectiveness. The choice of algorithm can significantly affect both model performance and computational efficiency when working with high-dimensional sparse data.

### Linear models

[Logistic regression](/wiki/logistic_regression) and [linear regression](/wiki/linear_regression) with [L1 regularization](/wiki/l1_regularization) (lasso) are well-suited for sparse data. L1 regularization drives many model coefficients to exactly zero, effectively performing automatic [feature selection](/wiki/feature_engineering).[6] This property makes lasso particularly appropriate when the true signal involves only a subset of the available features. [L2 regularization](/wiki/l2_regularization) (ridge regression) is also commonly used but does not produce sparse coefficient vectors.[1]

### Support vector machines

Support vector machines (SVMs) with linear kernels handle sparse, high-dimensional data efficiently. Because SVMs rely on support vectors (the data points closest to the decision boundary), they can perform well even when the feature space is very large, as long as the data is linearly separable or nearly so.

### Tree-based models

[Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) models can handle sparse features, but they sometimes underperform compared to linear models in extremely high-dimensional sparse settings.[12] Tree-based models may underestimate the importance of sparse features and favor denser features, even when the sparse features carry predictive information. Careful hyperparameter tuning and feature selection can mitigate this issue.

### Naive Bayes

[Naive Bayes](/wiki/naive_bayes) classifiers are a natural fit for sparse text data, especially the multinomial variant. They are fast to train and scale well to large vocabularies, making them popular for text classification and spam filtering.[8]

### Neural networks

[Neural networks](/wiki/neural_network) can work with sparse inputs through [embedding layers](/wiki/embedding_layer), which map sparse categorical features into low-dimensional dense vectors (see [word embeddings](/wiki/word_embedding)).[11] This is the standard approach in modern [deep learning](/wiki/deep_neural_network) for handling categorical features with high cardinality. The embedding approach converts sparse inputs into dense representations that the network can process efficiently.

| Algorithm | Suitability for sparse data | Notes |
|---|---|---|
| [Logistic regression](/wiki/logistic_regression) with L1 | High | Performs automatic feature selection |
| SVM (linear kernel) | High | Scales well to high dimensions |
| [Naive Bayes](/wiki/naive_bayes) | High | Fast, natural fit for text data |
| [Random forest](/wiki/random_forest) | Moderate | May undervalue sparse features |
| [Gradient boosting](/wiki/gradient_boosting) | Moderate | Libraries like XGBoost handle sparsity natively |
| [Neural networks](/wiki/neural_network) | High (with embeddings) | Embedding layers convert sparse to dense |
| k-nearest neighbors | Low | Distance metrics are unreliable in high-dimensional sparse spaces |

## How are sparse features converted to dense representations?

For high-cardinality categorical features, one-hot encoding becomes impractical because the vectors grow with the number of categories and carry no notion of similarity between categories. The standard modern remedy is to learn an [embedding](/wiki/embeddings): a lower-dimensional dense vector representation of the sparse input. Google's Machine Learning Crash Course describes embeddings as "lower-dimensional representations of sparse data" that "capture semantic relationships and reduce the dimensionality of data."[19]

The motivation is concrete. With M entries in a one-hot encoding and N nodes in the first layer of a network, that layer must train M x N weights, and Google notes that "large input vectors mean a huge number of weights for a neural network."[19] More weights require more data, more computation, and more accelerator memory. An [embedding layer](/wiki/embedding_layer) replaces the wide sparse vector with a compact dense vector (often tens to a few hundred dimensions) whose values are learned during training, so that similar categories end up nearby in the embedding space. This is the technique behind [Word2Vec](/wiki/word_embedding), GloVe, and the input layers of transformer-based models like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers).[11] In production recommendation and ranking systems, sparse user and item IDs are routinely mapped through embedding tables before being combined with dense numerical features.

## What challenges do sparse features create?

Sparse features present several practical challenges that practitioners must address.

### The curse of dimensionality

As the number of sparse features grows, the feature space becomes increasingly high-dimensional. In high-dimensional spaces, data points tend to be roughly equidistant from each other, which undermines distance-based algorithms such as k-nearest neighbors and k-means clustering. [Dimensionality reduction](/wiki/dimension_reduction) techniques, including PCA, sparse PCA, and [feature extraction](/wiki/feature_extraction), can help alleviate this problem.

### Overfitting

When the number of features is large relative to the number of training examples, models are at risk of [overfitting](/wiki/overfitting). [Regularization](/wiki/regularization) (L1, L2, or elastic net) is a standard countermeasure.[1] L1 regularization is especially useful because it sets many coefficients to zero, effectively reducing the number of active features.[6]

### Computational cost

Although sparse storage formats reduce memory usage, some operations on sparse matrices are slower than their dense counterparts, particularly random element access and certain matrix decompositions. Choosing the right sparse format for the intended operations (for example, CSR for row-based operations and CSC for column-based operations) is important for performance.

### Feature importance estimation

Some models, particularly tree-based ensembles, may underestimate the importance of sparse features because each individual sparse feature appears in only a small fraction of training examples.[12] Techniques such as permutation importance and SHAP values can provide more reliable importance estimates for sparse features.

## Software and library support

Most major machine learning frameworks and numerical computing libraries provide built-in support for sparse data.

| Library | Sparse support |
|---|---|
| [SciPy](https://docs.scipy.org/doc/scipy/reference/sparse.html) | `csr_matrix`, `csc_matrix`, `coo_matrix`, `dok_matrix`, and other formats |
| [scikit-learn](/wiki/scikit-learn) | Accepts SciPy sparse matrices as input; provides `HashingVectorizer`, `CountVectorizer`, `TfidfVectorizer` |
| [TensorFlow](/wiki/tensorflow) | `tf.sparse.SparseTensor` for sparse inputs and embedding lookups |
| [PyTorch](/wiki/pytorch) | `torch.sparse` module with COO and CSR support |
| XGBoost | Native sparse data handling; missing values treated as sparse |
| LightGBM | Optimized for sparse features with histogram-based algorithms |
| [pandas](https://pandas.pydata.org/) | `pd.arrays.SparseArray` for memory-efficient storage of sparse columns |

## Practical example

To illustrate the concept concretely, consider a small text classification task with three documents and a vocabulary of eight words.

**Documents:**
1. "the cat sat on the mat"
2. "the dog chased the cat"
3. "the bird flew over the mat"

**Bag of words matrix (document-term matrix):**

| | the | cat | sat | on | mat | dog | chased | bird | flew | over |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Doc 2 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Doc 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |

Even in this tiny example, 14 out of 30 entries (47%) are zero. In real-world settings with vocabularies of tens of thousands of words, sparsity typically exceeds 99%.[7]

In CSR format, only the non-zero entries and their column indices are stored, along with row pointers indicating where each row starts:

- **Values:** [2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1]
- **Column indices:** [0, 1, 2, 3, 4, 0, 1, 5, 6, 0, 4, 7, 8, 9]
- **Row pointers:** [0, 5, 9, 14]

This representation avoids storing any of the zero values, and the savings grow proportionally as the matrix becomes larger and sparser.

## Historical context

Sparse representations have been studied in mathematics and computer science long before the rise of modern machine learning. Sparse matrix storage formats such as CSR and CSC were developed in the context of numerical linear algebra and finite element methods during the 1960s and 1970s. The Yale Sparse Matrix Package, one of the earliest sparse matrix libraries, was released by Yale University's Department of Computer Science in 1977 and lends its name to the "Yale format" still used for CSR today.[17]

In machine learning specifically, sparse features became prominent with the growth of text classification and information retrieval in the 1990s and 2000s. The [bag of words](/wiki/bag_of_words) model and TF-IDF weighting, both of which produce sparse representations, were foundational techniques in early NLP systems. Gerard Salton's vector space model, developed in the 1960s and 1970s, laid the groundwork for these approaches.[2]

The feature hashing trick, formalized by Weinberger et al. in 2009, provided a scalable way to handle extremely high-dimensional sparse features without maintaining explicit vocabularies. This technique saw widespread adoption in large-scale systems for spam filtering, online advertising, and click-through rate prediction.

More recently, the trend in deep learning has been to convert sparse features into dense [embeddings](/wiki/embeddings) using learned [embedding layers](/wiki/embedding_layer), as seen in [Word2Vec](/wiki/word_embedding), GloVe, and transformer-based architectures like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers).[11] However, sparse features remain central to many production systems, particularly in recommendation engines and search ranking.

## See also

- [Dense feature](/wiki/dense_feature)
- [Feature engineering](/wiki/feature_engineering)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Bag of words](/wiki/bag_of_words)
- [Sparse representation](/wiki/sparse_representation)
- [Sparsity](/wiki/sparsity)
- [Feature extraction](/wiki/feature_extraction)
- [Dimensionality reduction](/wiki/dimension_reduction)
- [Embeddings](/wiki/embeddings)
- [Regularization](/wiki/regularization)

## References

1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.
2. Salton, G., Wong, A., & Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." *Communications of the ACM*, 18(11), 613-620.
3. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). "Feature Hashing for Large Scale Multitask Learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, 1113-1120. arXiv:0902.2206.
4. Olshausen, B. A., & Field, D. J. (1996). "Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images." *Nature*, 381(6583), 607-609. doi:10.1038/381607a0.
5. Zou, H., Hastie, T., & Tibshirani, R. (2006). "Sparse Principal Component Analysis." *Journal of Computational and Graphical Statistics*, 15(2), 265-286.
6. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society: Series B*, 58(1), 267-288.
7. Manning, C. D., Raghavan, P., & Schutze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press.
8. Jurafsky, D., & Martin, J. H. (2023). *Speech and Language Processing* (3rd ed. draft). Chapters on text classification and vector semantics.
9. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
10. Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." *Computer*, 42(8), 30-37.
11. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv preprint arXiv:1301.3781*.
12. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794.
13. Google for Developers. "Categorical data: Vocabulary and one-hot encoding." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/categorical-data/one-hot-encoding
14. Google for Developers. "Machine Learning Glossary: dense feature." https://developers.google.com/machine-learning/glossary
15. scikit-learn developers. "6.2. Feature extraction" (Text feature extraction; Sparsity; Vectorizing a large text corpus with the hashing trick). scikit-learn documentation. https://scikit-learn.org/stable/modules/feature_extraction.html
16. Harper, F. M., & Konstan, J. A. (2015). "The MovieLens Datasets: History and Context." *ACM Transactions on Interactive Intelligent Systems*, 5(4), Article 19. https://grouplens.org/datasets/movielens/100k/
17. "Sparse matrix." Wikipedia. (CSR / Yale format; Eisenstat, Gursky, Schultz, & Sherman, Yale Sparse Matrix Package, 1977.) https://en.wikipedia.org/wiki/Sparse_matrix
18. scikit-learn developers. "FeatureHasher" (n_features default 2**20 = 1,048,576). scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
19. Google for Developers. "Embeddings." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/embeddings

