A sparse vector is a vector in which the majority of elements are zero. Only the non-zero elements and their positions need to be stored, which makes sparse vectors far more memory-efficient than their dense counterparts when working with high-dimensional data. Sparse vectors appear throughout machine learning, natural language processing, information retrieval, signal processing, and scientific computing. Understanding how to represent, store, and operate on sparse vectors is a foundational skill in applied mathematics and data science.
Imagine you have a very long row of 1,000 lockers at school. Only 5 of them have anything inside; the rest are completely empty. Instead of walking past every single locker and writing down "empty, empty, empty..." over and over, you could just make a short list: "Locker 12 has a ball, Locker 47 has a book, Locker 213 has a hat, Locker 650 has a pencil, Locker 899 has a toy." That short list is like a sparse vector. You skip all the zeros (the empty lockers) and only keep track of the ones that matter. Computers do the same thing when most of the numbers in a list are zero: they save time and space by remembering only the non-zero values and where they are.
A vector v of dimension n is written as:
v = (v₁, v₂, v₃, ..., vₙ)
The vector is considered sparse when the number of non-zero entries, often denoted nnz (number of non-zeros), is much smaller than n. There is no universal threshold, but a common rule of thumb is that an n x n matrix is sparse when nnz is on the order of n rather than n², and that a vector is sparse when its fraction of non-zero elements is well below 50%.
The sparsity of a vector is defined as:
sparsity = (number of zero elements) / (total number of elements)
Conversely, density is the fraction of non-zero elements:
density = nnz / n = 1 - sparsity
For example, a 10,000-dimensional vector with only 15 non-zero entries has a sparsity of 0.9985 (99.85%) and a density of 0.0015 (0.15%).
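These two quantities are trivial to compute; a minimal NumPy sketch (the array contents here are invented for illustration) checks them directly:

```python
import numpy as np

v = np.zeros(10_000)
v[[12, 47, 213]] = [1.5, -2.0, 0.7]   # a handful of non-zero entries

nnz = np.count_nonzero(v)
density = nnz / v.size      # nnz / n
sparsity = 1 - density      # fraction of zeros
print(f"nnz={nnz}, density={density:.4%}, sparsity={sparsity:.4%}")
```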
Sparsity is not just a mathematical curiosity; it has direct practical consequences for storage cost, computation speed, and the choice of algorithms and data structures.
The following table summarizes the main differences between sparse and dense vectors.
| Property | Sparse vector | Dense vector |
|---|---|---|
| Typical dimensionality | High (thousands to millions) | Low to moderate (tens to hundreds) |
| Fraction of non-zero values | Very low (often < 1%) | All or most elements are non-zero |
| Storage cost | Proportional to nnz | Proportional to n |
| Common generation methods | Bag of words, TF-IDF, one-hot encoding, feature hashing | Word embeddings, neural encoder outputs |
| Semantic richness | Limited; relies on exact lexical match | High; captures synonymy and context |
| Interpretability | High; each dimension maps to a known feature | Low; dimensions lack obvious meaning |
| Similarity metrics | Cosine similarity, Jaccard index, dot product | Cosine similarity, Euclidean distance, dot product |
| Typical use cases | Text search, keyword matching, categorical features | Semantic search, image retrieval, language models |
Sparse vectors arise naturally in many stages of a machine learning pipeline.
One-hot encoding converts a categorical variable with k possible values into a binary vector of length k. Exactly one element is set to 1 and the remaining k - 1 elements are 0. When k is large (for instance, the number of unique zip codes or product IDs), the resulting vector is highly sparse.
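A minimal scikit-learn sketch (the category values are invented for illustration) shows that OneHotEncoder returns a SciPy sparse matrix by default, with exactly one stored entry per row:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

zip_codes = np.array([["10001"], ["94105"], ["10001"], ["60601"]])
X = OneHotEncoder().fit_transform(zip_codes)  # SciPy sparse output by default

print(type(X))         # a SciPy sparse matrix
print(X.shape, X.nnz)  # (4, 3), one non-zero per row
```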
The bag-of-words (BoW) model represents a document as a vector whose length equals the vocabulary size. Each element records how many times the corresponding word appears in the document. Because any single document uses only a small fraction of the full vocabulary, the resulting vector is extremely sparse. The scikit-learn library provides CountVectorizer for creating bag-of-words representations, and it stores the output as a SciPy sparse matrix.
Term Frequency-Inverse Document Frequency (TF-IDF) extends the bag-of-words model by weighting each term according to how informative it is across the corpus. Common words like "the" and "is" receive low weights, while rare but meaningful terms receive higher weights. The resulting vector is still sparse and remains one of the most widely used representations in text classification and information retrieval.
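Both representations are easy to produce with scikit-learn; this small sketch (toy documents, invented for illustration) shows that each vectorizer returns a CSR matrix storing only the terms each document actually contains:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

bow = CountVectorizer().fit_transform(docs)    # raw term counts, CSR format
tfidf = TfidfVectorizer().fit_transform(docs)  # corpus-weighted terms, CSR format

print(bow.shape, bow.nnz)      # vocabulary-sized rows, few stored entries
print(tfidf.shape, tfidf.nnz)
```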
Also called the "hashing trick," feature hashing maps feature names to a fixed-size vector using a hash function. Scikit-learn's FeatureHasher uses the signed 32-bit Murmurhash3 function to determine column indices, producing a CSR sparse matrix. Feature hashing is useful when the feature space is too large to enumerate explicitly or when memory is constrained, such as in online learning scenarios.
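A brief sketch of FeatureHasher (the feature dictionaries and table width are made up) illustrates how arbitrary feature names are hashed into a fixed-width CSR matrix:

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**10, input_type="dict")
X = hasher.transform([
    {"user_id=123": 1, "item_id=42": 1},
    {"user_id=456": 1, "country=DE": 1},
])

print(X.shape, X.nnz)  # (2, 1024) CSR matrix, regardless of how many feature names exist
```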
Recent advances have produced neural models that output sparse vectors with learned weights. Unlike traditional TF-IDF, these models can perform term expansion, adding weights for semantically related terms that do not appear in the original text. Notable examples include learned sparse retrieval models such as SPLADE.
Neural networks that use the ReLU activation function naturally produce sparse activations because ReLU sets all negative values to zero. In a typical hidden layer, 50% or more of the activations may be zero. This activation-level sparsity is exploited in sparse attention mechanisms and in structured pruning of network weights.
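The effect is easy to see with a quick sketch: for roughly zero-mean inputs (random data here, purely illustrative), ReLU zeroes out about half of the values:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)  # stand-in for a hidden layer's inputs
post = np.maximum(pre_activations, 0)          # ReLU: negatives become exactly zero

print(f"{np.mean(post == 0):.1%} of activations are zero")  # close to 50%
```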
Storing every element of a sparse vector or matrix wastes memory. Several compressed storage formats have been developed to address this, each making a different tradeoff between construction speed, modification flexibility, and computational performance.
These formats support efficient element-by-element insertion and modification. They are typically used to build a sparse structure before converting it to a computation-optimized format.
| Format | Full name | Structure | Best for |
|---|---|---|---|
| COO | Coordinate list | Three arrays: row indices, column indices, values | Incremental assembly; easy conversion to CSR/CSC |
| DOK | Dictionary of keys | Hash table mapping (row, col) pairs to values | Random single-element access and insertion |
| LIL | List of lists | One sorted list per row containing (column, value) pairs | Row-by-row construction |
COO (Coordinate) format stores each non-zero entry as a (row, column, value) triple. Sorting the entries first by row, then by column improves access speed. COO is the simplest format and converts quickly to CSR or CSC.
DOK (Dictionary of Keys) uses a hash map internally, giving O(1) lookup time for any individual element. It is well-suited for situations where the non-zero pattern is not known in advance.
LIL (List of Lists) keeps one list per row. Within each row, entries are stored sorted by column index. This format supports efficient row-wise insertion but is slower for column-wise access.
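The usual workflow, then, is to assemble in a flexible format and convert once. A minimal SciPy sketch (indices and values invented for illustration):

```python
import scipy.sparse as sp

A = sp.lil_matrix((1000, 1000))  # LIL: cheap incremental insertion
A[0, 10] = 3.0
A[500, 2] = -1.5
A[999, 999] = 7.0

A_csr = A.tocsr()                # convert once assembly is finished
print(A_csr.format, A_csr.nnz)   # 'csr', 3
```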
Once a sparse matrix is assembled, it is usually converted to one of the following formats for fast arithmetic and linear algebra.
| Format | Full name | Structure | Best for |
|---|---|---|---|
| CSR | Compressed Sparse Row | Three arrays: values, column indices, row pointers | Row slicing, sparse matrix-vector multiply (SpMV) |
| CSC | Compressed Sparse Column | Three arrays: values, row indices, column pointers | Column slicing, solving triangular systems |
| BSR | Block Sparse Row | Like CSR but stores dense sub-blocks instead of scalars | Matrices with dense sub-blocks (finite element methods) |
| DIA | Diagonal | Stores diagonals as dense arrays plus offsets | Banded or diagonal matrices |
CSR is the most widely used format for sparse computation. It stores three one-dimensional arrays:
- values: the non-zero entries, stored row by row
- col_indices: the column index of each entry in values
- row_ptr: row_ptr[i] is the index into values where row i begins (with a final entry equal to nnz)

For example, consider the 4x5 matrix:
```
1 0 0 0 2
0 0 3 0 0
0 0 0 0 0
0 4 0 0 5
```
Its CSR representation is:
```
values      = [1, 2, 3, 4, 5]
col_indices = [0, 4, 2, 1, 4]
row_ptr     = [0, 2, 3, 3, 5]
```
Row 0 starts at index 0 in values and contains entries at indices 0 and 1 (values 1 and 2). Row 2 is empty because row_ptr[2] == row_ptr[3] == 3. CSR format traces its origins to at least the mid-1960s, with the first complete description appearing in 1967.
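SciPy exposes these three arrays as data, indices, and indptr (its names for values, col_indices, and row_ptr); a short sketch reproduces the arrays above:

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([
    [1, 0, 0, 0, 2],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 5],
])
A = sp.csr_matrix(dense)

print(A.data)     # [1 2 3 4 5]   -> values
print(A.indices)  # [0 4 2 1 4]   -> col_indices
print(A.indptr)   # [0 2 3 3 5]   -> row_ptr
```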
CSR saves memory compared to dense storage when nnz < (m(n - 1) - 1) / 2 for an m x n matrix. It supports fast row slicing and efficient sparse matrix-vector multiplication (SpMV), which is a core operation in iterative solvers, gradient computation, and graph algorithms.
CSC is the transpose analogue of CSR. It stores values column by column, with row indices and column pointers. CSC is the default sparse format in MATLAB and is efficient for column slicing and solving sparse triangular systems. Many direct sparse solvers (LU, Cholesky) expect input in CSC format.
BSR extends CSR by replacing individual scalar entries with small dense blocks. It is well-suited for problems where non-zeros appear in regular block patterns, such as finite element discretizations with multiple degrees of freedom per node. For such problems, BSR is considerably more efficient than CSR or CSC for arithmetic operations because block operations can leverage BLAS routines.
The DIA format stores the diagonals of a matrix as a 2D array of shape (number of diagonals, number of columns) plus an offset array indicating which diagonal each row represents. It is highly efficient for banded matrices but wastes memory when the non-zero pattern does not align with diagonals.
Sparse matrix-vector multiplication (SpMV) computes y = Ax where A is a sparse matrix and x is a dense vector. SpMV is one of the most performance-sensitive operations in scientific computing and machine learning because it lies at the heart of iterative solvers (conjugate gradient, GMRES), PageRank, and neural network forward passes in sparse layers.
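In SciPy, SpMV is simply the @ operator on a CSR matrix; this sketch (random matrix, sizes and density chosen arbitrarily) multiplies a 5000x5000 matrix at 0.1% density by a dense vector:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(5000, 5000, density=0.001, format="csr", random_state=rng)
x = rng.standard_normal(5000)

y = A @ x  # SpMV: only the ~25,000 stored entries are touched
print(y.shape)
```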
SpMV has very low arithmetic intensity (floating-point operations per byte of memory accessed), which makes it memory-bandwidth-bound rather than compute-bound. Optimizing SpMV therefore focuses on maximizing memory throughput through techniques such as cache and register blocking, choosing a storage format matched to the sparsity pattern, compressing index arrays, and reordering rows or columns to improve locality.
On GPUs, the "vector kernel" approach uses multiple threads within a warp to collaboratively reduce a single sparse dot product, achieving higher utilization than scalar kernels. Research on multi-GPU SpMV shows that data transmission overhead can consume up to 53% of total SpMV time on two-GPU platforms and 63% on four-GPU platforms, making partitioning and communication-minimization strategies essential at scale.
Several mature libraries provide sparse vector and matrix support across programming languages.
| Library | Language | Key sparse features |
|---|---|---|
| SciPy (scipy.sparse) | Python | Seven sparse formats (CSR, CSC, COO, BSR, DOK, LIL, DIA); sparse linear algebra; eigenvalue solvers |
| NumPy | Python | Does not natively support sparse arrays but interoperates with SciPy sparse |
| scikit-learn | Python | Accepts SciPy sparse matrices as input; CountVectorizer, TfidfVectorizer, FeatureHasher all produce sparse output |
| PyTorch (torch.sparse) | Python/C++ | COO and CSR sparse tensors on CPU and GPU; sparse-dense matrix multiply; autograd support |
| TensorFlow (tf.sparse) | Python/C++ | SparseTensor class using COO-like representation; sparse-dense operations; TPU support |
| CuPy (cupyx.scipy.sparse) | Python | GPU-accelerated sparse matrices using cuSPARSE; mirrors SciPy's API |
| Eigen | C++ | High-performance sparse matrices in CSC format; direct and iterative solvers |
| PETSc | C/C++/Fortran | Distributed-memory parallel sparse matrices; used in large-scale simulations |
| MATLAB | MATLAB | Built-in sparse matrix type (CSC internally); extensive sparse solver support |
SciPy recommends building sparse structures with COO, DOK, or LIL formats, then converting to CSR or CSC for computation. As of SciPy 1.8+, the library has been transitioning from *_matrix classes to *_array classes that follow NumPy's array interface.
CuPy provides a SciPy-compatible API on the GPU, but explicit conversion is needed when moving data between CuPy and SciPy because the data resides on different devices (GPU vs. CPU).
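A minimal sketch of that round trip, assuming a CUDA-capable GPU with CuPy installed:

```python
import cupy as cp
import cupyx.scipy.sparse as cusp
import scipy.sparse as sp

A_cpu = sp.random(1000, 1000, density=0.01, format="csr")
A_gpu = cusp.csr_matrix(A_cpu)  # explicit host-to-device copy

y_gpu = A_gpu @ cp.ones(1000)   # SpMV executes on the GPU via cuSPARSE
A_back = A_gpu.get()            # explicit device-to-host copy, back to SciPy
```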
PyTorch supports sparse tensors primarily in COO and CSR layouts. CSR tensors in PyTorch benefit from MKL (on CPU) and MAGMA (on GPU) backends for operations like SpMV. PyTorch's autograd engine can backpropagate through certain sparse operations, enabling sparse layers in trainable models.
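A small sketch of the COO workflow (coordinates and values invented for illustration):

```python
import torch

indices = torch.tensor([[0, 1, 2], [2, 0, 1]])  # row coordinates, column coordinates
values = torch.tensor([3.0, 4.0, 5.0])
A = torch.sparse_coo_tensor(indices, values, size=(3, 3))

y = torch.sparse.mm(A, torch.ones(3, 1))  # sparse-dense matrix multiply
A_csr = A.coalesce().to_sparse_csr()      # switch to the CSR layout
```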
Sparse vectors are the backbone of classical NLP pipelines. Document classification, spam filtering, sentiment analysis, and topic modeling all rely on sparse term-document matrices. Even with the shift toward dense embeddings produced by transformer models, sparse representations remain valuable for keyword matching, lexical overlap scoring, and as features in ensemble systems.
Search engines have historically relied on sparse vector models. BM25, the ranking function used by systems like Elasticsearch and Apache Lucene, scores documents based on sparse term frequency statistics. The inverted index, which maps each term to the list of documents containing it, is effectively a column-oriented sparse data structure. Modern search systems increasingly combine sparse retrieval with dense retrieval in a hybrid approach, using fusion algorithms such as Reciprocal Rank Fusion (RRF) to merge ranked lists from both methods.
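RRF itself is only a few lines: each document's fused score is the sum of 1 / (k + rank) over the result lists in which it appears, with k = 60 as the commonly used constant. A sketch (toy document IDs, hypothetical rankings):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["d3", "d1", "d7"]  # e.g. a BM25 ranking
dense_hits = ["d1", "d5", "d3"]   # e.g. an embedding ranking
print(rrf([sparse_hits, dense_hits]))  # d1 and d3 appear in both lists and rise to the top
```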
In collaborative filtering, the user-item interaction matrix records which users have interacted with (rated, purchased, clicked) which items. With millions of users and items, this matrix is extremely sparse because each user interacts with only a tiny fraction of available items. Sparse matrix factorization techniques decompose this matrix into lower-rank dense factors to predict missing entries. Platforms like Netflix, Spotify, and Amazon rely on such methods at scale.
Compressed sensing exploits the fact that many signals are sparse in some transform domain (for example, the Fourier or wavelet domain). By solving an L1-regularized optimization problem (basis pursuit or LASSO), it is possible to reconstruct a signal from far fewer measurements than the Nyquist-Shannon sampling theorem would normally require. This principle is used in medical imaging (MRI acceleration), radar, and communications.
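A toy LASSO recovery sketch (the dimensions, sparsity level, and regularization strength alpha are arbitrary choices that would need tuning in practice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5                          # signal length, measurements, non-zeros
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement matrix
y = A @ x_true                                # only m << n measurements

model = Lasso(alpha=0.01, fit_intercept=False, max_iter=10_000).fit(A, y)
print(np.count_nonzero(np.abs(model.coef_) > 1e-3))  # roughly k spikes recovered
```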
Finite element methods, circuit simulation, and computational fluid dynamics all produce large sparse linear systems. The stiffness and mass matrices arising from finite element discretizations are sparse because each element interacts only with its neighbors. Efficient sparse solvers (both direct, like Cholesky or LU decomposition, and iterative, like conjugate gradient) are essential for solving these systems.
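A small sketch solving a classic sparse system, the 1-D Poisson (tridiagonal) matrix, with SciPy's conjugate gradient (problem size chosen arbitrarily):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 1000
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)  # iterative solve; info == 0 means it converged
print(info, np.linalg.norm(A @ x - b))
```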
The adjacency matrix of a graph is sparse when the graph has few edges relative to the number of possible edges. Social networks, web link graphs, and biological interaction networks are all sparse. Graph algorithms such as PageRank, community detection, and shortest-path computation operate on sparse adjacency matrices.
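As an illustration, PageRank reduces to repeated SpMV on a row-normalized adjacency matrix; a toy sketch on a three-node graph (edges invented, damping factor 0.85, no dangling nodes for simplicity):

```python
import numpy as np
import scipy.sparse as sp

edges = np.array([[0, 1], [1, 2], [2, 0], [2, 1]])  # directed edges i -> j
A = sp.csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(3, 3))

out_deg = np.asarray(A.sum(axis=1)).ravel()
P = sp.diags(1.0 / out_deg) @ A  # row-stochastic transition matrix

r, d = np.full(3, 1 / 3), 0.85
for _ in range(100):             # power iteration: one SpMV per step
    r = (1 - d) / 3 + d * (P.T @ r)
print(r)
```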
Beyond sparse data representations, sparsity also appears in the structure of neural networks themselves. Weight pruning removes connections (sets weights to zero) in a trained network to reduce its size and speed up inference.
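A minimal magnitude-pruning sketch (random stand-in weights; the 90% pruning ratio is chosen arbitrarily, and real pipelines prune trained networks, often iteratively with retraining):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))       # stand-in for a trained weight matrix

threshold = np.quantile(np.abs(W), 0.90)  # keep only the largest 10% of weights
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

print(f"sparsity after pruning: {np.mean(W_pruned == 0):.1%}")
```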
The Lottery Ticket Hypothesis, proposed by Frankle and Carbin in 2018, states that dense, randomly initialized networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from the same initial weights, match the accuracy of the full network. Experiments have shown that transformer models can be pruned to remove up to two-thirds of all weights while maintaining strong performance. For BERT models specifically, researchers have found matching subnetworks at 40% to 90% sparsity on GLUE and SQuAD benchmarks.
Structured pruning removes entire neurons, attention heads, or layers rather than individual weights. This produces models that are not just theoretically sparse but practically faster on standard hardware, since dense matrix operations on smaller matrices are easier to accelerate than sparse operations on irregular patterns.
Modern search and retrieval systems increasingly combine sparse and dense vector representations to get the best of both approaches. Sparse vectors excel at exact keyword matching and are efficient to index using inverted indices. Dense vectors, produced by neural encoders, capture semantic meaning and handle synonyms. Hybrid search fuses results from both retrieval paths, typically using a weighted combination or Reciprocal Rank Fusion.
Several vector databases support hybrid search natively:
| Vector database | Hybrid search support |
|---|---|
| Pinecone | Stores sparse-dense vectors as a single unified vector; supports dot-product scoring |
| Milvus | Separate sparse and dense fields with built-in hybrid search API |
| Qdrant | Native sparse vector support alongside dense vectors |
| Weaviate | Combines BM25 (sparse) with vector search using configurable fusion |
| Elasticsearch | Sparse vector field type plus dense k-NN; supports RRF fusion |
Working with sparse vectors involves several practical tradeoffs.
Format selection. Use COO, DOK, or LIL for assembly, then convert to CSR or CSC for computation. Choosing the wrong format can result in orders-of-magnitude slowdowns. For example, element-wise insertion into a CSR matrix requires rebuilding its internal arrays, while the same operation in DOK takes O(1) time.
Fill-in. Some algorithms (such as LU factorization) introduce new non-zero entries ("fill-in") during computation, turning a sparse matrix into a less sparse one. Row and column reordering heuristics (approximate minimum degree, nested dissection) minimize fill-in and keep memory usage manageable.
GPU considerations. GPUs achieve high throughput on dense, regular operations. Sparse operations with irregular access patterns can underperform because of thread divergence and poor memory coalescing. The NVIDIA cuSPARSE library provides GPU-optimized SpMV in multiple formats (CSR, COO, BSR, HYB, ELL). Choosing the right format depends on the sparsity pattern of the matrix.
Sparse-dense boundary. When sparsity falls below roughly 90%, sparse formats can become slower than dense arrays due to index overhead and indirect memory access. Profiling is recommended to determine whether sparse storage is actually beneficial for a given workload.