See also: Machine learning terms
Sparsity, in the context of machine learning and mathematics, refers to the condition where most values in a data structure, model, or representation are zero or near-zero. A sparse representation encodes information using only a small number of active (non-zero) elements, while the majority remain at zero. This property appears throughout modern AI systems: in training data, in model weights, in neural network activations, and in attention patterns.
Sparsity matters because it offers practical benefits in memory efficiency, computational speed, and model interpretability. When most values are zero, algorithms can skip unnecessary computations and store only the non-zero entries, saving both time and space. The concept has deep roots in signal processing and statistics, and it now plays a central role in scaling deep learning to billions of parameters.
Formally, a vector x of dimension n is called k-sparse if at most k of its entries are non-zero, where k is much smaller than n. The sparsity ratio of a vector or matrix is the number of zero-valued elements divided by the total number of elements. For example, a vector with 95 zeros out of 100 entries has a sparsity ratio of 0.95 (or 95%).
For matrices, sparsity is measured the same way. A matrix is considered sparse when the number of non-zero entries is roughly on the order of the number of rows or columns, rather than the product of the two. There is no universal threshold that separates "sparse" from "dense," but in practice, matrices with sparsity above 90% are commonly treated as sparse.
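As a quick illustration, the sparsity ratio can be computed in a few lines; the `sparsity_ratio` helper below is purely illustrative:

```python
import numpy as np

def sparsity_ratio(x: np.ndarray) -> float:
    """Fraction of elements that are exactly zero."""
    return float(np.sum(x == 0)) / x.size

v = np.zeros(100)
v[:5] = np.random.randn(5)   # 5 non-zero entries, i.e. a 5-sparse vector
print(sparsity_ratio(v))     # 0.95, matching the example above
```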
Sparsity manifests in several distinct forms across machine learning systems. The following table summarizes the main categories.
| Type | Where It Occurs | How It Arises | Example |
|---|---|---|---|
| Data sparsity | Input features, datasets | High-dimensional data with many missing or zero values | User-item matrices in recommendation systems |
| Weight sparsity | Model parameters | Pruning, L1 regularization, or training constraints | A pruned neural network with 90% zero weights |
| Activation sparsity | Hidden layer outputs | ReLU or other activation functions that output zero for negative inputs | Only 15-30% of neurons firing for a given input |
| Gradient sparsity | Training updates | Many gradient components being zero or negligible | Sparse gradient updates in large-scale distributed training |
| Attention sparsity | Transformer attention matrices | Sparse attention patterns that attend to a subset of tokens | Longformer's sliding window attention |
Data sparsity occurs when input features contain mostly zero values. This is common in natural language processing, where text is represented as high-dimensional vectors (such as bag of words or TF-IDF) with most entries equal to zero because any single document uses only a small fraction of the total vocabulary. Recommendation systems also face extreme data sparsity: a user-item interaction matrix may have millions of entries but only a tiny percentage filled in, since each user interacts with only a handful of items.
Sparse data requires specialized storage formats and algorithms to handle efficiently. Treating these matrices as dense would waste enormous amounts of memory and compute.
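As a sketch of how such data is usually handled, the following stores a hypothetical user-item interaction matrix in SciPy's coordinate (COO) format rather than as a dense array; the sizes and the random interactions are illustrative only:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical interaction matrix: 100,000 users x 50,000 items,
# but only 1,000,000 observed interactions (~0.02% of all entries).
rng = np.random.default_rng(0)
rows = rng.integers(0, 100_000, size=1_000_000)   # user indices
cols = rng.integers(0, 50_000, size=1_000_000)    # item indices
vals = np.ones(1_000_000, dtype=np.float32)       # implicit-feedback "1"s

interactions = coo_matrix((vals, (rows, cols)), shape=(100_000, 50_000))
print(interactions.nnz)  # number of stored entries
# Dense float32 storage would need 100,000 * 50,000 * 4 bytes ≈ 20 GB;
# the sparse format stores only the non-zero coordinates and values.
```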
Weight sparsity refers to the condition where most parameters in a trained model are zero. This can be achieved through pruning (removing weights after training), through regularization techniques that drive weights toward zero during training, or through specialized training procedures that maintain sparsity throughout the process.
A neural network with 90% weight sparsity contains 10x fewer active parameters than its dense counterpart. If the remaining weights preserve the model's accuracy, this translates directly to reduced storage requirements and potentially faster inference.
Activation sparsity arises when the outputs of hidden layers in a neural network are mostly zero. The ReLU (Rectified Linear Unit) activation function naturally produces sparse activations because it outputs zero for any negative input. In practice, ReLU-based networks typically have only 15-30% of neurons activated for any given input, meaning 70-85% of activation values are zero.
This natural sparsity has computational implications. If a neuron's activation is zero, the computations that depend on that activation can be skipped entirely. Researchers have developed methods to increase activation sparsity further, such as FATReLU, which implements a variable threshold below which all activations are set to zero, and the "ReLU Strikes Back" approach that exploits activation sparsity specifically in large language models.
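Activation sparsity is easy to measure directly. The sketch below counts zero outputs after a ReLU layer; with random inputs and untrained weights roughly half the activations come out zero, while trained networks often reach the higher levels quoted above:

```python
import torch
import torch.nn as nn

# A single linear + ReLU layer; measure how many activations are exactly zero.
torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(512, 2048), nn.ReLU())
x = torch.randn(64, 512)                       # a batch of random inputs

h = layer(x)
sparsity = (h == 0).float().mean().item()
print(f"activation sparsity: {sparsity:.2%}")  # roughly 50% for this untrained layer
```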
L1 regularization (also known as Lasso regularization) is the most widely used technique for inducing sparsity in model parameters. It works by adding a penalty term proportional to the sum of the absolute values of the model's weights to the loss function:
Loss = Original Loss + lambda * sum(|w_i|)
where lambda controls the strength of the regularization and w_i are the model weights.
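A minimal sketch of adding such a penalty to a PyTorch training loss follows; the `l1_penalty` helper is illustrative, and in practice bias terms are often excluded from the penalty:

```python
import torch

def l1_penalty(model: torch.nn.Module, lam: float) -> torch.Tensor:
    """lambda * sum of absolute weight values, added to the task loss."""
    return lam * sum(p.abs().sum() for p in model.parameters())

model = torch.nn.Linear(20, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(32, 20), torch.randn(32, 1)

loss = criterion(model(x), y) + l1_penalty(model, lam=1e-3)
loss.backward()  # gradients now include the sparsity-inducing penalty term
```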
The reason L1 regularization produces exactly zero weights (while L2 regularization merely shrinks them toward zero) has a geometric explanation. The L1 constraint region forms a diamond shape (or hyperdiamond in higher dimensions), with sharp corners aligned with the coordinate axes. When the optimization algorithm finds the point where the loss function contour first touches this constraint region, contact is most likely at one of these corners, where one or more weight values are exactly zero. The L2 constraint region, by contrast, forms a smooth sphere with no corners, making exact-zero solutions far less probable.
The Least Absolute Shrinkage and Selection Operator (LASSO) regression applies L1 regularization to linear regression, performing both parameter estimation and feature selection simultaneously. Elastic Net regularization combines L1 and L2 penalties, offering a middle ground that encourages sparsity while maintaining some of L2's beneficial properties.
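A small scikit-learn sketch shows the feature-selection effect; the synthetic data, the number of informative features, and the alpha value (which plays the role of lambda) are all illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]   # only 5 informative features
y = X @ true_w + 0.1 * rng.standard_normal(200)

model = Lasso(alpha=0.1).fit(X, y)
print(np.sum(model.coef_ != 0))            # typically close to the 5 informative features
```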
Pruning is the process of removing weights, neurons, or entire structures from a trained neural network to create a sparser, more efficient model. The core insight behind pruning is that large neural networks are often overparameterized: many of their weights contribute little to the model's predictions and can be removed without significant accuracy loss.
Unstructured pruning removes individual weights anywhere in the network, typically by zeroing out weights with the smallest magnitudes. This approach is simple to implement and can achieve very high sparsity ratios (90-99%) with minimal accuracy degradation. However, unstructured pruning produces irregular sparsity patterns that are difficult to accelerate on standard hardware. The resulting sparse weight matrices have scattered non-zero entries, leading to irregular memory access patterns that do not map well to the parallel execution model of modern GPUs.
Structured pruning removes entire groups of parameters at once: whole neurons, convolutional filters, attention heads, or even entire layers. Because the pruned model retains a regular, dense structure (just smaller), it can run efficiently on standard hardware without specialized sparse computation support. The trade-off is that structured pruning is coarser: removing an entire filter eliminates all its weights regardless of their individual importance, which can cause larger accuracy drops compared to unstructured pruning at the same compression ratio.
The following table compares these two pruning approaches.
| Property | Unstructured Pruning | Structured Pruning |
|---|---|---|
| Granularity | Individual weights | Neurons, filters, heads, layers |
| Achievable sparsity | Very high (90-99%) | Moderate (50-80%) |
| Accuracy impact | Lower at same compression | Higher at same compression |
| Hardware efficiency | Requires sparse hardware/software | Works on standard hardware |
| Implementation complexity | Simple (magnitude thresholding) | Requires importance scoring for groups |
| Memory savings | Requires sparse formats | Direct size reduction |
The Lottery Ticket Hypothesis, proposed by Jonathan Frankle and Michael Carbin in 2019, is one of the most influential ideas in neural network pruning. The hypothesis states that dense, randomly initialized neural networks contain sparse subnetworks (called "winning tickets") that, when trained in isolation from their original initialization, can match the full network's test accuracy in a comparable number of training iterations.
The key finding is that it is not just the architecture of the subnetwork that matters but also its specific initial weights. The winning tickets have "won the initialization lottery": their connections happen to have initial weight values that make training particularly effective.
Frankle and Carbin demonstrated this through iterative magnitude pruning: train the full network, prune the lowest-magnitude weights, reset the surviving weights to their original initialization values, and repeat. This iterative process finds winning tickets that are 10-20% the size of the original network while matching its performance. The hypothesis has profound implications: it suggests that the success of large networks may partly come from their ability to contain many potential sparse subnetworks, increasing the odds that at least one of them will train well.
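A schematic sketch of the iterative procedure follows; `train_fn` stands in for the actual training loop, biases are pruned along with weights for simplicity, and a full implementation would also keep pruned weights frozen at zero during training:

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Sketch of the lottery-ticket procedure: train, prune the smallest
    surviving weights, rewind survivors to their initial values, repeat."""
    init_state = copy.deepcopy(model.state_dict())        # original initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_fn(model)                                   # train the current subnetwork
        for name, p in model.named_parameters():
            alive = p[masks[name].bool()].abs()
            threshold = torch.quantile(alive, prune_frac)  # cut the smallest 20% of survivors
            masks[name] = masks[name] * (p.abs() > threshold).float()
        model.load_state_dict(init_state)                  # rewind to the original init
        for name, p in model.named_parameters():
            p.data.mul_(masks[name])                       # apply the sparsity mask
    return masks
```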
Standard self-attention in transformer models computes attention scores between every pair of tokens in a sequence, resulting in quadratic computational complexity with respect to sequence length. For long sequences, this becomes prohibitively expensive. Sparse attention mechanisms address this by restricting each token to attend to only a subset of other tokens, reducing complexity from O(n^2) to O(n) or O(n log n).
Several sparse attention patterns have been developed:

- Sliding-window (local) attention, where each token attends only to a fixed number of neighboring tokens, as in Longformer.
- Global attention, where a small set of designated tokens (such as a classification token) attends to, and is attended by, every position.
- Dilated or strided attention, which skips positions at regular intervals to widen the receptive field, as in the Sparse Transformer.
- Random attention, where each token attends to a random subset of positions, combined with local and global patterns in BigBird.
These sparse attention designs have enabled transformers to process sequences of tens of thousands or even millions of tokens, which would be infeasible with full dense attention.
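To make the idea concrete, the sketch below builds a sliding-window attention mask of the kind used in Longformer-style local attention; it constructs only the boolean mask, not a full attention implementation:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where position i may attend only to positions within
    `window` steps of i (a local, sliding-window attention pattern)."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())
# Each token attends to at most 2*window + 1 positions, so the number of
# attended pairs grows linearly with sequence length instead of quadratically.
```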
Mixture of Experts (MoE) represents a different form of sparsity: instead of making individual weights or activations sparse, MoE makes the computation path through the network sparse. An MoE layer contains multiple parallel sub-networks ("experts"), and a gating network (router) selects only a small subset of these experts to process each input token.
For example, a model might have 64 experts but activate only 2 of them for any given input. This means the model has the total capacity of all 64 experts but only uses a fraction of the compute for each forward pass. The total parameter count is large, but the active parameter count per token remains small.
Notable MoE implementations include Google's Switch Transformer (which routes each token to a single expert), Mixtral by Mistral AI, and DeepSeek's models. MoE has become a dominant strategy for scaling language models efficiently: it allows increasing model capacity without proportionally increasing compute costs.
| MoE Property | Description |
|---|---|
| Total parameters | Sum of all expert parameters (can be very large) |
| Active parameters per token | Parameters in selected experts only (small fraction of total) |
| Routing mechanism | Gating network that assigns tokens to experts |
| Common expert count | 8 to 64+ experts per layer |
| Common active experts | 1 to 4 experts selected per token |
| Key benefit | Scales model capacity without proportional compute increase |
| Key challenge | Load balancing across experts during training |
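A minimal sketch of the routing mechanism described above follows, with the expert count, top-k value, and layer sizes chosen purely for illustration; real MoE layers also add load-balancing losses and expert capacity limits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-2 routing over a handful of single-layer experts."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)   # gating network
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, dim)
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                  # only the selected experts run
            for e, expert in enumerate(self.experts):
                chosen = indices[:, k] == e
                if chosen.any():
                    out[chosen] += weights[chosen, k, None] * expert(x[chosen])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(16, 64)).shape)              # torch.Size([16, 64])
```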
Efficient storage of sparse data is essential for realizing the benefits of sparsity. Dense storage of a matrix with 99% zeros wastes 99% of the memory. Several compressed sparse formats address this problem.
| Format | Full Name | Storage Method | Best For |
|---|---|---|---|
| COO | Coordinate | List of (row, column, value) tuples | Matrix construction, format conversion |
| CSR | Compressed Sparse Row | Row pointers, column indices, values | Row-wise access, matrix-vector multiplication |
| CSC | Compressed Sparse Column | Column pointers, row indices, values | Column-wise access, solving linear systems |
| BSR | Block Sparse Row | Block row pointers, block column indices, dense sub-blocks | Matrices with dense sub-blocks |
CSR is the most widely used format for sparse computation. It stores only the non-zero values along with their column indices and a pointer array indicating where each row starts. For a matrix with m rows, n columns, and nnz non-zero entries, CSR requires O(nnz + m) storage compared to O(m * n) for dense storage.
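The three CSR arrays are easy to inspect with SciPy; the small matrix below is only an example:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 2, 0, 4]], dtype=np.float64)

A = csr_matrix(dense)
print(A.data)           # [3. 1. 2. 4.]   non-zero values
print(A.indices)        # [2 0 1 3]       column index of each value
print(A.indptr)         # [0 1 2 4]       row i spans data[indptr[i]:indptr[i+1]]
print(A @ np.ones(4))   # sparse matrix-vector product: [3. 1. 6.]
```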
Exploiting sparsity on modern hardware presents significant challenges due to the irregular memory access patterns that sparse data introduces. Dense matrix operations benefit from predictable, sequential memory access that maximizes cache utilization and memory bandwidth. Sparse operations, by contrast, involve scattered non-zero entries, leading to poor cache performance and memory coalescing issues on GPUs.
NVIDIA has addressed this with hardware-level sparse computation support. Starting with the Ampere architecture (A100 GPU), NVIDIA introduced Sparse Tensor Cores that accelerate a specific 2:4 structured sparsity pattern: in every contiguous block of four values, exactly two must be zero. This yields 50% sparsity with a regular pattern that the hardware can exploit, delivering up to 2x throughput compared to dense Tensor Core operations. The Hopper architecture (H100) continues this support. NVIDIA's cuSPARSELt library provides optimized routines for this pattern.
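The 2:4 pattern itself is simple to illustrate: keep the two largest-magnitude values in every contiguous group of four and zero the rest. The sketch below only demonstrates the pattern; it does not invoke the accelerated cuSPARSELt kernels:

```python
import torch

def enforce_2_4_sparsity(w: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in every contiguous group of four."""
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices   # 2 largest-magnitude entries per group
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(8, 16)                 # total element count must be divisible by 4
w_24 = enforce_2_4_sparsity(w)
print((w_24 == 0).float().mean())      # 0.5, i.e. exactly 50% sparsity
```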
PyTorch supports semi-structured (2:4) sparsity for both training and inference. The broader ecosystem includes several sparse computation libraries, such as SciPy's scipy.sparse module for CPU sparse linear algebra, NVIDIA's cuSPARSE and cuSPARSELt for GPU sparse kernels, and Neural Magic's DeepSparse engine for running sparse networks efficiently on CPUs.
Sparsity offers several practical advantages across different stages of the machine learning pipeline.
Memory reduction. Sparse models and data require less storage. A neural network pruned to 10% density needs roughly 10x less memory to store its weights when using compressed formats. This is critical for deploying models on mobile devices and edge hardware with limited memory.
Computational efficiency. When most values are zero, multiplications involving those values can be skipped. Sparse matrix-vector multiplication only computes products for non-zero entries, reducing the number of floating-point operations proportionally to the sparsity ratio. NVIDIA's 2:4 sparse Tensor Cores can deliver up to 2x throughput for supported operations.
Faster training and inference. Sparse attention mechanisms reduce transformer complexity from quadratic to linear in sequence length. MoE models activate only a fraction of parameters per token. These approaches enable training and deploying models that would otherwise be computationally infeasible.
Improved interpretability. Sparse models are easier to interpret because fewer active features or weights contribute to each prediction. In linear models, L1 regularization performs automatic feature selection, identifying which inputs are most relevant. Sparse activations in neural networks indicate which neurons are important for particular inputs.
Better generalization. Sparsity-inducing regularization helps prevent overfitting by constraining model complexity. The Lottery Ticket Hypothesis suggests that the essential computation in a network can be captured by a small subset of weights, and finding this subset may lead to better-generalizing models.
Despite its benefits, sparsity introduces several practical difficulties.
Irregular memory access. Sparse data structures lead to non-contiguous memory access patterns. On GPUs, this reduces memory bandwidth utilization and causes workload imbalance across parallel execution units. Because threads within a GPU warp execute in lockstep, threads processing rows with many non-zeros force threads processing rows with few non-zeros to sit idle.
Metadata overhead. Compressed sparse formats require additional storage for index arrays and pointers. For low-sparsity matrices, this metadata can actually make the compressed representation larger than the dense one. The break-even point depends on the format and the sparsity ratio.
Limited hardware support. Most current hardware (CPUs and GPUs) is optimized for dense computation. Only NVIDIA's recent GPU architectures provide dedicated sparse Tensor Cores, and those only support the specific 2:4 structured sparsity pattern. Arbitrary unstructured sparsity receives limited hardware acceleration.
Training instability. Maintaining sparsity during training can be challenging. Pruning too aggressively too early may remove weights that would have become important later. Techniques like gradual magnitude pruning address this by slowly increasing the sparsity ratio over the course of training.
Accuracy trade-offs. Beyond a certain level, increasing sparsity degrades model accuracy. Finding the sparsity ratio that best balances efficiency and performance requires careful experimentation, and the relationship between sparsity and accuracy is model-specific and task-dependent.
Compressed sensing (also called compressive sampling) is a signal processing technique that leverages sparsity to reconstruct high-dimensional signals from far fewer measurements than traditional sampling theory would require. The key assumption is that the signal has a sparse representation in some basis (for example, a natural image is sparse in the wavelet basis). Under this assumption, the signal can be recovered accurately from a small number of random linear measurements by solving an L1-minimization problem.
Compressed sensing has applications in medical imaging (reducing MRI scan times), radar systems, and sensor networks. Its mathematical foundations, developed by Emmanuel Candes, Terence Tao, and David Donoho in the mid-2000s, provided rigorous conditions under which sparse recovery is guaranteed.
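A toy recovery experiment illustrates the idea, using scikit-learn's Lasso as a stand-in for the L1-minimization step; the dimensions, sparsity level, and regularization strength are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 400, 100, 5                      # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x_true                                  # only 100 measurements of a length-400 signal

# L1-regularized least squares used as a proxy for exact L1 minimization.
x_hat = Lasso(alpha=1e-3, max_iter=50_000, fit_intercept=False).fit(A, y).coef_
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # small relative recovery error
```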
Imagine you have a huge coloring book with 1,000 pages, but you only actually colored in 20 of them. Most pages are blank. That is sparsity: most of the space is empty, and only a small part has something in it.
In machine learning, computers work with big grids of numbers (called matrices). Many of those numbers are zero, meaning they are not doing anything useful. Instead of remembering all 1,000 pages (including the 980 blank ones), a smart computer can just remember which 20 pages have colors and what those colors look like. That saves a lot of memory and makes things faster.
When scientists build AI models, they sometimes find that the model has way more parts than it actually needs. It is like having a team of 100 people but only 10 of them are doing the real work. Sparsity helps find those 10 important workers and lets the rest sit out, so the team runs faster and uses less energy.