# Sparse Vector

> Source: https://aiwiki.ai/wiki/sparse_vector
> Updated: 2026-06-27
> Categories: Computer Science, Machine Learning, Mathematics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **sparse vector** is a vector in which most of the elements are zero, in contrast to a dense vector, in which most elements are non-zero. Google's Machine Learning Glossary defines it simply as "a vector whose values are mostly zero": for example, a 10,000-dimensional bag-of-words vector representing a 20-word phrase would have 9,980 zero values and only 20 non-zero values.[1] Because only the non-zero elements and their positions need to be stored, sparse vectors are far more memory-efficient than dense ones when working with high-dimensional data, and they appear throughout [machine learning](/wiki/machine_learning), [natural language processing](/wiki/natural_language_processing), [information retrieval](/wiki/information_retrieval), signal processing, and scientific computing. Understanding how to represent, store, and operate on sparse vectors is a foundational skill in applied mathematics and data science.

## ELI5 (Explain like I'm 5)

Imagine you have a very long row of 1,000 lockers at school. Only 5 of them have anything inside; the rest are completely empty. Instead of walking past every single locker and writing down "empty, empty, empty..." over and over, you could just make a short list: "Locker 12 has a ball, Locker 47 has a book, Locker 213 has a hat, Locker 650 has a pencil, Locker 899 has a toy." That short list is like a sparse vector. You skip all the zeros (the empty lockers) and only keep track of the ones that matter. Computers do the same thing when most of the numbers in a list are zero: they save time and space by remembering only the non-zero values and where they are.

## What is a sparse vector?

A vector **v** of dimension *n* is written as:

```
v = (v₁, v₂, v₃, ..., vₙ)
```

The vector is considered sparse when the number of non-zero entries, often denoted *nnz* (number of non-zeros), is much smaller than *n*. There is no universal threshold, but a common rule of thumb is that a vector (or matrix) is sparse when nnz is on the order of *n* rather than *n²* (for matrices) or when the fraction of non-zero elements is well below 50%. Google's glossary makes the same point about features: "a feature in which most values are zero" is a sparse feature, the opposite of a dense feature.[1]

The **sparsity** of a vector is defined as:

```
sparsity = (number of zero elements) / (total number of elements)
```

Conversely, **density** is the fraction of non-zero elements:

```
density = nnz / n = 1 - sparsity
```

For example, a 10,000-dimensional vector with only 15 non-zero entries has a sparsity of 0.9985 (99.85%) and a density of 0.0015 (0.15%).

## Why does sparsity matter?

[Sparsity](/wiki/sparsity) is not just a mathematical curiosity; it has direct practical consequences:

- **Memory savings.** Storing all *n* elements of a dense vector requires O(n) space. A sparse vector with *nnz* non-zeros requires only O(nnz) space (plus index overhead), which can be orders of magnitude smaller.
- **Faster computation.** Arithmetic operations such as dot products, addition, and [matrix factorization](/wiki/matrix_factorization) can skip zero entries entirely, reducing time complexity from O(n) to O(nnz).
- **Interpretability.** In many applications, each dimension corresponds to a human-readable feature (for example, a word in a vocabulary). Sparse vectors make it straightforward to inspect which features are active.
- **Scalability.** Many real-world datasets are naturally sparse. User-item interaction matrices in [recommendation systems](/wiki/recommender_system) may have densities below 1%. Text corpora represented as term-document matrices routinely have densities below 0.1%.

## What is the difference between sparse and dense vectors?

The core distinction is straightforward: a sparse vector is mostly zeros, while a dense vector is mostly non-zeros.[1] Sparse vectors are typically very high-dimensional (one dimension per vocabulary term or category) but store only a handful of values, whereas dense vectors, such as those produced by an [embedding](/wiki/embeddings) model, are lower-dimensional but pack meaningful values into every slot. The following table summarizes the main differences.

| Property | Sparse vector | Dense vector |
|---|---|---|
| Typical dimensionality | High (thousands to millions) | Low to moderate (tens to hundreds) |
| Fraction of non-zero values | Very low (often < 1%) | All or most elements are non-zero |
| Storage cost | Proportional to nnz | Proportional to n |
| Common generation methods | [Bag of words](/wiki/bag_of_words), [TF-IDF](/wiki/tf_idf), [one-hot encoding](/wiki/one-hot_encoding), feature hashing | [Word embedding](/wiki/word_embedding), neural encoder output |
| Semantic richness | Limited; relies on exact lexical match | High; captures synonymy and context |
| Interpretability | High; each dimension maps to a known feature | Low; dimensions lack obvious meaning |
| Similarity metrics | Cosine similarity, Jaccard index, dot product | [Cosine similarity](/wiki/cosine_similarity), Euclidean distance, dot product |
| Typical use cases | Text search, keyword matching, categorical features | Semantic search, image retrieval, language models |

## Where do sparse vectors come from in machine learning?

Sparse vectors arise naturally in many stages of a machine learning pipeline.

### One-hot encoding

[One-hot encoding](/wiki/one-hot_encoding) converts a categorical variable with *k* possible values into a binary vector of length *k*. Exactly one element is set to 1 and the remaining *k - 1* elements are 0. When *k* is large (for instance, the number of unique zip codes or product IDs), the resulting vector is highly sparse.

### Bag of words

The [bag-of-words](/wiki/bag_of_words) (BoW) model represents a document as a vector whose length equals the vocabulary size. Each element records how many times the corresponding word appears in the document. Because any single document uses only a small fraction of the full vocabulary, the resulting vector is extremely sparse. The [scikit-learn](/wiki/scikit-learn) library provides `CountVectorizer` for creating bag-of-words representations, and it stores the output as a SciPy sparse matrix.

### TF-IDF

[TF-IDF](/wiki/tf_idf) (Term Frequency-Inverse Document Frequency) extends the bag-of-words model by weighting each term according to how informative it is across the corpus. Common words like "the" and "is" receive low weights, while rare but meaningful terms receive higher weights. The resulting vector is still sparse and remains one of the most widely used representations in text classification and [information retrieval](/wiki/information_retrieval).

### Feature hashing

Also called the "hashing trick," feature hashing maps feature names to a fixed-size vector using a hash function. Scikit-learn's `FeatureHasher` uses the signed 32-bit Murmurhash3 function to determine column indices, producing a CSR sparse matrix. Feature hashing is useful when the feature space is too large to enumerate explicitly or when memory is constrained, such as in online learning scenarios.

### Learned sparse embeddings

Recent advances have produced neural models that output sparse vectors with learned weights. Unlike traditional TF-IDF, these models can perform term expansion, adding weights for semantically related terms that do not appear in the original text. Two notable examples are:

- **[SPLADE](/wiki/splade)** (Sparse Lexical and Expansion Model), introduced by Naver Labs in 2021, uses a pretrained [BERT](/wiki/bert) model and its masked language model head to produce sparse vectors with learned term importance weights. The authors set out to learn "sparse representations for documents and queries that could inherit from desirable properties of bag-of-words models such as exact term matching and efficiency of inverted indexes."[6] SPLADE achieves latency comparable to BM25 (within 4ms) while approaching the retrieval quality of dense neural rankers.
- **BGE-M3**, developed by the Beijing Academy of Artificial Intelligence (BAAI), supports dense, sparse, and multi-vector retrieval within a single model. It covers more than 100 languages and handles inputs up to 8,192 [tokens](/wiki/token). Its sparse output is generated by applying a linear layer followed by a [ReLU](/wiki/relu) activation to the hidden states.[7]

### Neural network activations with ReLU

[Neural networks](/wiki/neural_network) that use the ReLU [activation function](/wiki/activation_function) naturally produce sparse activations because ReLU sets all negative values to zero. In a typical hidden layer, 50% or more of the activations may be zero. This activation-level sparsity is exploited in [sparse attention](/wiki/sparse_attention) mechanisms and in structured pruning of network weights.

## How are sparse vectors stored?

The defining idea of sparse storage is to keep only the non-zero values together with the indices that locate them, rather than materializing every zero. Storing every element of a sparse vector or matrix wastes memory, so several compressed storage formats have been developed, each making a different tradeoff between construction speed, modification flexibility, and computational performance.[4]

### Formats optimized for construction

These formats support efficient element-by-element insertion and modification. They are typically used to build a sparse structure before converting it to a computation-optimized format. The SciPy documentation advises: "To construct an array efficiently, use any of coo_array, dok_array or lil_array."[4]

| Format | Full name | Structure | Best for |
|---|---|---|---|
| COO | Coordinate list | Three arrays: row indices, column indices, values | Incremental assembly; easy conversion to CSR/CSC |
| DOK | Dictionary of keys | Hash table mapping (row, col) pairs to values | Random single-element access and insertion |
| LIL | List of lists | One sorted list per row containing (column, value) pairs | Row-by-row construction |

**COO** (Coordinate) format stores each non-zero entry as a (row, column, value) triple. Sorting the entries first by row, then by column improves access speed. COO is the simplest format and converts quickly to CSR or CSC.

**DOK** (Dictionary of Keys) uses a hash map internally, giving O(1) lookup time for any individual element. It is well-suited for situations where the non-zero pattern is not known in advance.

**LIL** (List of Lists) keeps one list per row. Within each row, entries are stored sorted by column index. This format supports efficient row-wise insertion but is slower for column-wise access.

### Formats optimized for computation

Once a sparse matrix is assembled, it is usually converted to one of the following formats for fast arithmetic and linear algebra. SciPy puts the rule plainly: "To perform manipulations such as multiplication or inversion, first convert the array to either CSC or CSR format," and "the CSR format is especially suitable for fast matrix vector products."[4]

| Format | Full name | Structure | Best for |
|---|---|---|---|
| CSR | Compressed Sparse Row | Three arrays: values, column indices, row pointers | Row slicing, sparse matrix-vector multiply (SpMV) |
| CSC | Compressed Sparse Column | Three arrays: values, row indices, column pointers | Column slicing, solving triangular systems |
| BSR | Block Sparse Row | Like CSR but stores dense sub-blocks instead of scalars | Matrices with dense sub-blocks (finite element methods) |
| DIA | Diagonal | Stores diagonals as dense arrays plus offsets | Banded or diagonal matrices |

#### CSR (Compressed Sparse Row)

CSR is the most widely used format for sparse computation. It stores three one-dimensional arrays:

- **values**: the non-zero entries, listed row by row.
- **col_indices**: the column index of each entry in the values array.
- **row_ptr**: an array of length (number of rows + 1) where `row_ptr[i]` is the index into `values` where row *i* begins.

For example, consider the 4x5 matrix:

```
1  0  0  0  2
0  0  3  0  0
0  0  0  0  0
0  4  0  0  5
```

Its CSR representation is:

```
values     = [1, 2, 3, 4, 5]
col_indices = [0, 4, 2, 1, 4]
row_ptr    = [0, 2, 3, 3, 5]
```

Row 0 starts at index 0 in `values` and contains entries at indices 0 and 1 (values 1 and 2). Row 2 is empty because `row_ptr[2] == row_ptr[3] == 3`. CSR format traces its origins to at least the mid-1960s, with the first complete description appearing in 1967.

CSR saves memory compared to dense storage when nnz < (m(n - 1) - 1) / 2 for an m x n matrix. It supports fast row slicing and efficient sparse matrix-vector multiplication (SpMV), which is a core operation in iterative solvers, [gradient](/wiki/gradient) computation, and graph algorithms.

#### CSC (Compressed Sparse Column)

CSC is the transpose analogue of CSR. It stores values column by column, with row indices and column pointers. CSC is the default sparse format in MATLAB and is efficient for column slicing and solving sparse triangular systems. Many direct sparse solvers (LU, Cholesky) expect input in CSC format.

#### BSR (Block Sparse Row)

BSR extends CSR by replacing individual scalar entries with small dense blocks. It is well-suited for problems where non-zeros appear in regular block patterns, such as finite element discretizations with multiple degrees of freedom per node. For such problems, BSR is considerably more efficient than CSR or CSC for arithmetic operations because block operations can leverage BLAS routines.

#### DIA (Diagonal)

The DIA format stores the diagonals of a matrix as a 2D array of shape (number of diagonals, number of columns) plus an offset array indicating which diagonal each row represents. It is highly efficient for banded matrices but wastes memory when the non-zero pattern does not align with diagonals.

## How is sparse matrix-vector multiplication (SpMV) computed?

Sparse matrix-vector multiplication (SpMV) computes **y = Ax** where **A** is a sparse matrix and **x** is a dense vector. SpMV is one of the most performance-sensitive operations in scientific computing and machine learning because it lies at the heart of iterative solvers (conjugate gradient, GMRES), [PageRank](/wiki/information_retrieval), and [neural network](/wiki/neural_network) forward passes in sparse layers.

SpMV has very low arithmetic intensity (floating-point operations per byte of memory accessed), which makes it memory-bandwidth-bound rather than compute-bound. Optimizing SpMV therefore focuses on maximizing memory throughput through techniques such as:

- Choosing the right storage format for the matrix structure (CSR for general matrices, ELL or HYB for GPU execution).
- Reordering rows and columns to improve cache locality.
- Using vectorized (SIMD) instructions on CPUs or warp-level parallelism on GPUs.

On GPUs, the "vector kernel" approach uses multiple threads within a warp to collaboratively reduce a single sparse dot product, achieving higher utilization than scalar kernels. Research on multi-GPU SpMV shows that data transmission overhead can consume up to 53% of total SpMV time on two-GPU platforms and 63% on four-GPU platforms, making partitioning and communication-minimization strategies essential at scale.

## What software libraries support sparse vectors?

Several mature libraries provide sparse vector and matrix support across programming languages.

| Library | Language | Key sparse features |
|---|---|---|
| SciPy (`scipy.sparse`) | Python | Seven sparse formats (CSR, CSC, COO, BSR, DOK, LIL, DIA); sparse linear algebra; eigenvalue solvers |
| [NumPy](/wiki/numpy) | Python | Does not natively support sparse arrays but interoperates with SciPy sparse |
| [scikit-learn](/wiki/scikit-learn) | Python | Accepts SciPy sparse matrices as input; `CountVectorizer`, `TfidfVectorizer`, `FeatureHasher` all produce sparse output |
| [PyTorch](/wiki/pytorch) (`torch.sparse`) | Python/C++ | COO and CSR sparse [tensors](/wiki/tensor) on CPU and GPU; sparse-dense matrix multiply; autograd support |
| [TensorFlow](/wiki/tensorflow) (`tf.sparse`) | Python/C++ | `SparseTensor` class using COO-like representation; sparse-dense operations; TPU support |
| CuPy (`cupyx.scipy.sparse`) | Python | GPU-accelerated sparse matrices using cuSPARSE; mirrors SciPy's API |
| Eigen | C++ | High-performance sparse matrices in CSC format; direct and iterative solvers |
| PETSc | C/C++/Fortran | Distributed-memory parallel sparse matrices; used in large-scale simulations |
| MATLAB | MATLAB | Built-in sparse matrix type (CSC internally); extensive sparse solver support |

SciPy recommends building sparse structures with COO, DOK, or LIL formats, then converting to CSR or CSC for computation.[4] As of SciPy 1.8+, the library has been transitioning from `*_matrix` classes to `*_array` classes that follow NumPy's array interface.

CuPy provides a SciPy-compatible API on the GPU, but explicit conversion is needed when moving data between CuPy and SciPy because the data resides on different devices (GPU vs. CPU).

[PyTorch](/wiki/pytorch) supports sparse tensors primarily in COO and CSR layouts.[5] CSR tensors in PyTorch benefit from MKL (on CPU) and MAGMA (on GPU) backends for operations like SpMV. PyTorch's autograd engine can backpropagate through certain sparse operations, enabling sparse layers in trainable models.

## What are sparse vectors used for?

### Natural language processing

Sparse vectors are the backbone of classical [NLP](/wiki/natural_language_processing) pipelines. Document classification, spam filtering, [sentiment analysis](/wiki/sentiment_analysis), and topic modeling all rely on sparse term-document matrices. Even with the shift toward dense [embeddings](/wiki/embeddings) produced by [transformer](/wiki/transformer) models, sparse representations remain valuable for keyword matching, lexical overlap scoring, and as features in ensemble systems.

### Information retrieval and search

Search engines have historically relied on sparse vector models. [BM25](/wiki/bm25), the ranking function used by systems like Elasticsearch and Apache Lucene, scores documents based on sparse term frequency statistics. The inverted index, which maps each term to the list of documents containing it, is effectively a column-oriented sparse data structure. Modern search systems increasingly combine sparse retrieval with dense retrieval in a hybrid approach, using fusion algorithms such as Reciprocal Rank Fusion (RRF) to merge ranked lists from both methods.[11]

### Recommender systems

In collaborative filtering, the user-item interaction matrix records which users have interacted with (rated, purchased, clicked) which items. With millions of users and items, this matrix is extremely sparse because each user interacts with only a tiny fraction of available items. Sparse matrix factorization techniques decompose this matrix into lower-rank dense factors to predict missing entries. Platforms like Netflix, Spotify, and Amazon rely on such methods at scale.

### Compressed sensing and signal processing

Compressed sensing exploits the fact that many signals are sparse in some transform domain (for example, the Fourier or wavelet domain). By solving an [L1-regularized](/wiki/l1_regularization) optimization problem (basis pursuit or LASSO), it is possible to reconstruct a signal from far fewer measurements than the Nyquist-Shannon sampling theorem would normally require.[10] This principle is used in medical imaging (MRI acceleration), radar, and communications.

### Scientific computing and simulation

Finite element methods, circuit simulation, and computational fluid dynamics all produce large sparse linear systems. The stiffness and mass matrices arising from finite element discretizations are sparse because each element interacts only with its neighbors. Efficient sparse solvers (both direct, like Cholesky or LU decomposition, and iterative, like conjugate gradient) are essential for solving these systems.[2]

### Graph and network analysis

The adjacency matrix of a graph is sparse when the graph has few edges relative to the number of possible edges. Social networks, web link graphs, and biological interaction networks are all sparse. Graph algorithms such as PageRank, community detection, and shortest-path computation operate on sparse adjacency matrices.

## How does sparsity appear inside neural networks (pruning)?

Beyond sparse data representations, sparsity also appears in the structure of neural networks themselves. Weight pruning removes connections (sets [weights](/wiki/weight) to zero) in a trained network to reduce its size and speed up [inference](/wiki/inference).

The **Lottery Ticket Hypothesis**, proposed by Frankle and Carlin in 2018, states that dense, randomly initialized networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from the same initial weights, match the accuracy of the full network.[8] Experiments have shown that [transformer](/wiki/transformer) models can be pruned to remove up to two-thirds of all weights while maintaining strong performance. For [BERT](/wiki/bert) models specifically, researchers have found matching subnetworks at 40% to 90% sparsity on GLUE and SQuAD benchmarks.[9]

Structured pruning removes entire neurons, attention heads, or layers rather than individual weights. This produces models that are not just theoretically sparse but practically faster on standard hardware, since dense matrix operations on smaller matrices are easier to accelerate than sparse operations on irregular patterns.

## How are sparse and dense vectors combined in hybrid search?

Modern search and retrieval systems increasingly combine sparse and dense vector representations to get the best of both approaches. Sparse vectors excel at exact keyword matching and are efficient to index using inverted indices. Dense vectors, produced by neural encoders, capture semantic meaning and handle synonyms. Hybrid search fuses results from both retrieval paths, typically using a weighted combination or Reciprocal Rank Fusion.[12]

Several [vector databases](/wiki/vector_database) support hybrid search natively:

| Vector database | Hybrid search support |
|---|---|
| Pinecone | Stores sparse-dense vectors as a single unified vector; supports dot-product scoring |
| Milvus | Separate sparse and dense fields with built-in hybrid search API |
| Qdrant | Native sparse vector support alongside dense vectors |
| Weaviate | Combines BM25 (sparse) with vector search using configurable fusion |
| Elasticsearch | Sparse vector field type plus dense k-NN; supports RRF fusion |

## Performance considerations

Working with sparse vectors involves several practical tradeoffs.

**Format selection.** Use COO, DOK, or LIL for assembly, then convert to CSR or CSC for computation.[4] Choosing the wrong format can result in orders-of-magnitude slowdowns. For example, element-wise insertion into a CSR matrix requires rebuilding its internal arrays, while the same operation in DOK takes O(1) time.

**Fill-in.** Some algorithms (such as LU factorization) introduce new non-zero entries ("fill-in") during computation, turning a sparse matrix into a less sparse one. Row and column reordering heuristics (approximate minimum degree, nested dissection) minimize fill-in and keep memory usage manageable.[14]

**GPU considerations.** GPUs achieve high throughput on dense, regular operations. Sparse operations with irregular access patterns can underperform because of thread divergence and poor memory coalescing. The NVIDIA cuSPARSE library provides GPU-optimized SpMV in multiple formats (CSR, COO, BSR, HYB, ELL). Choosing the right format depends on the sparsity pattern of the matrix.

**Sparse-dense boundary.** Very low sparsity (below roughly 90%) can make sparse formats slower than dense arrays due to index overhead and indirect memory access. Profiling is recommended to determine whether sparse storage is actually beneficial for a given workload.

## See also

- [Sparsity](/wiki/sparsity)
- [Sparse Representation](/wiki/sparse_representation)
- [Sparse Attention](/wiki/sparse_attention)
- [Embeddings](/wiki/embeddings)
- [Bag of Words](/wiki/bag_of_words)
- [Word Embedding](/wiki/word_embedding)
- [Feature Engineering](/wiki/feature_engineering)
- [Matrix Factorization](/wiki/matrix_factorization)
- [Vector Database](/wiki/vector_database)
- [L1 Regularization](/wiki/l1_regularization)

## References

1. "Machine Learning Glossary." Google for Developers. https://developers.google.com/machine-learning/glossary
2. Saad, Y. (2003). *Iterative Methods for Sparse Linear Systems*, 2nd edition. SIAM.
3. "Sparse matrix." Wikipedia. https://en.wikipedia.org/wiki/Sparse_matrix
4. "Sparse arrays (scipy.sparse)." SciPy v1.17.0 documentation. https://docs.scipy.org/doc/scipy/reference/sparse.html
5. "torch.sparse." PyTorch documentation. https://docs.pytorch.org/docs/stable/sparse.html
6. Formal, T., Piwowarski, B., and Clinchant, S. (2021). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking." *Proceedings of SIGIR 2021*. https://arxiv.org/abs/2107.05720
7. Chen, J., Xiao, S., Zhang, P., et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." https://arxiv.org/abs/2402.03216
8. Frankle, J. and Carlin, M. (2018). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *ICLR 2019*. https://arxiv.org/abs/1803.03635
9. Chen, T., Frankle, J., Chang, S., et al. (2020). "The Lottery Ticket Hypothesis for Pre-trained BERT Networks." *NeurIPS 2020*. https://arxiv.org/abs/2007.12223
10. "Compressed sensing." Wikipedia. https://en.wikipedia.org/wiki/Compressed_sensing
11. "Sparse embeddings: Dense vs. sparse vector and usage with ML models." Elasticsearch Labs. https://www.elastic.co/search-labs/blog/sparse-vector-embedding
12. "SPLADE for Sparse Vector Search Explained." Pinecone. https://www.pinecone.io/learn/splade/
13. "Sparse and Dense Embeddings." Zilliz Learn. https://zilliz.com/learn/sparse-and-dense-embeddings
14. Davis, T. A. (2006). *Direct Methods for Sparse Linear Systems*. SIAM.

