L0 regularization is a regularization technique in machine learning and statistics that penalizes the number of nonzero parameters in a model. Unlike L1 regularization (which penalizes the sum of absolute values) or L2 regularization (which penalizes the sum of squared values), L0 regularization directly counts how many parameters are active. This makes it the most natural formulation for enforcing sparsity, but it also makes the resulting optimization problem NP-hard in general. Because of this computational difficulty, researchers have developed a range of approximate and relaxation-based methods to make L0 regularization practical for real-world applications, including neural network pruning, feature selection, and compressed sensing.
Imagine you have a big box of crayons but you want to draw a picture using as few crayons as possible. L0 regularization is like a rule that says: "Every crayon you pick up costs you one point." So you try to draw the best picture you can while keeping the number of crayons small. The tricky part is that figuring out which exact set of crayons gives you the best picture with the fewest crayons is really, really hard, even for computers. So instead, people use clever tricks to get close to the best answer without checking every possible combination.
For a vector x = (x₁, x₂, ..., x_p), the L0 "norm" is defined as:
||x||₀ = |{j : x_j ≠ 0}|
In other words, it is simply the count of nonzero entries in the vector. Although it is commonly referred to as a "norm," the L0 function is not a true norm because it violates homogeneity: a norm must satisfy ||kx|| = |k|·||x|| for every scalar k, but multiplying all entries of x by a nonzero scalar does not change the count of nonzero entries. For this reason it is usually called a pseudonorm. The L0 pseudonorm is also non-convex, which distinguishes it from the L1 and L2 norms.
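A minimal Python sketch of the definition (the helper name `l0_norm` is illustrative, not from any particular library); it also shows why scaling a vector does not change the count:

```python
import numpy as np

def l0_norm(x, tol=0.0):
    """Count the nonzero entries of x (the L0 pseudonorm).

    With floating-point data, entries within `tol` of zero may be treated as zero.
    """
    return int(np.sum(np.abs(np.asarray(x)) > tol))

x = np.array([0.0, 3.5, 0.0, -1.2, 0.0])
print(l0_norm(x))         # 2
print(l0_norm(10.0 * x))  # still 2: scaling does not change the count
```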
Given a dataset of n observations, a loss function L(θ), and a parameter vector θ, the L0-regularized optimization problem is:
minimize L(θ) + λ ||θ||₀
where λ > 0 is the regularization strength that controls the tradeoff between fitting the data and keeping the model sparse. Larger values of λ encourage sparser solutions by imposing a heavier penalty for each additional nonzero parameter.
In the context of linear regression, this formulation is equivalent to the best subset selection problem, which seeks to find the subset of at most k features that minimizes the residual sum of squares.
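To make the penalized objective concrete in the linear regression setting, the following sketch enumerates every support set for a small number of features and returns the coefficient vector that minimizes the L0-penalized residual sum of squares; the helper `best_subset` and the synthetic data are hypothetical and only meant to illustrate the objective:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Exhaustively minimize ||y - X @ theta||^2 + lam * ||theta||_0.

    Enumerates all 2^p support sets, so this is only feasible for small p.
    """
    n, p = X.shape
    best_obj, best_theta = np.inf, np.zeros(p)
    for k in range(p + 1):
        for support in combinations(range(p), k):
            theta = np.zeros(p)
            if k > 0:
                S = list(support)
                # Least-squares fit restricted to the chosen support
                theta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            obj = np.sum((y - X @ theta) ** 2) + lam * k
            if obj < best_obj:
                best_obj, best_theta = obj, theta
    return best_theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0]) + 0.1 * rng.normal(size=50)
print(best_subset(X, y, lam=1.0))  # typically recovers the two active features
```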
The L0-regularized optimization problem is NP-hard: no polynomial-time algorithm is known for solving it in general, and none exists unless P = NP. This was formally established by Natarajan (1995), who proved that finding the sparsest solution to a system of linear equations is NP-hard.
The core difficulty is combinatorial in nature. For a model with p parameters, there are 2^p possible subsets of active parameters. Evaluating every subset to find the optimal one becomes computationally infeasible even for moderate values of p. For example, a model with just 50 features has over 10^15 possible subsets.
This NP-hardness result has had a lasting impact on the field. For decades, the accepted view in statistics was that best subset selection was impractical beyond roughly 30 to 40 features. This motivated the development of computationally tractable alternatives, most notably L1 regularization (the Lasso).
The L1 norm is the tightest convex relaxation of the L0 pseudonorm on the unit hypercube [-1, 1]^p. This means that among all convex functions that lower-bound the L0 pseudonorm on that domain, the L1 norm is the largest. This mathematical relationship is why L1 regularization produces sparse solutions: it is the closest convex approximation to directly counting nonzero parameters.
In practice, replacing the L0 penalty with an L1 penalty transforms the NP-hard problem into a convex optimization problem that can be solved efficiently. In the linear regression setting, this substitution yields the Lasso (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani in 1996. In signal processing, the same idea is known as basis pursuit.
However, the L1 relaxation is not without drawbacks. L1 regularization introduces shrinkage bias: it pushes all coefficient estimates toward zero, not just the irrelevant ones. This means that even the truly important features have their coefficients systematically underestimated. L0 regularization, by contrast, imposes no penalty on the magnitude of nonzero parameters. It only penalizes whether a parameter is zero or not, so the surviving coefficients are not shrunk toward zero.
| Property | L0 regularization | L1 regularization | L2 regularization |
|---|---|---|---|
| Penalty term | Number of nonzero parameters | Sum of absolute values | Sum of squared values |
| Convexity | Non-convex | Convex | Convex |
| Sparsity | Exact sparsity (hard zeros) | Approximate sparsity (soft thresholding) | No sparsity (shrinkage only) |
| Shrinkage bias on large coefficients | None | Yes | Yes |
| Computational complexity | NP-hard in general | Polynomial time | Polynomial time |
| Common name (linear regression) | Best subset selection | Lasso | Ridge regression |
| Differentiability | Non-differentiable, discontinuous | Non-differentiable at zero | Differentiable everywhere |
L0 regularization has a direct connection to classical model selection criteria from information theory. When the regularization parameter λ is set to specific values, the L0-penalized objective recovers well-known criteria.
Setting λ = 1 (in appropriate units) corresponds to the Akaike Information Criterion (AIC), which penalizes each additional parameter by a fixed cost of 1 (in units of twice the negative log-likelihood). Setting λ = (1/2) log(n), where n is the sample size, corresponds to the Bayesian Information Criterion (BIC), which applies a stronger penalty that grows with the sample size.
This connection shows that AIC and BIC can be viewed as special cases of L0 regularization with particular choices of the penalty strength. The Louizos et al. (2018) paper on L0 regularization for neural networks explicitly noted this relationship, observing that their method with appropriate λ settings is equivalent to performing AIC or BIC model selection over the network architecture.
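The correspondence can be checked directly. The sketch below uses placeholder numbers for a hypothetical fitted model (negative log-likelihood `nll`, k nonzero parameters, n observations) and verifies that the AIC and BIC values equal the L0-penalized negative log-likelihood, up to a factor of 2, for the λ values stated above:

```python
import numpy as np

# Hypothetical fitted model; the numbers below are placeholders.
nll, k, n = 123.4, 7, 500

aic = 2 * k + 2 * nll            # AIC = 2k - 2 * log-likelihood
bic = k * np.log(n) + 2 * nll    # BIC = k * log(n) - 2 * log-likelihood

# The same criteria written as L0-penalized objectives (up to a factor of 2):
print(np.isclose(aic, 2 * (nll + 1.0 * k)))              # lambda = 1
print(np.isclose(bic, 2 * (nll + 0.5 * np.log(n) * k)))  # lambda = (1/2) log n
```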
Because exact L0 optimization is NP-hard, a variety of algorithms have been developed to find approximate solutions. These can be grouped into several families.
Greedy methods build sparse solutions incrementally by adding or removing one feature at a time.
Forward stepwise selection starts with an empty model and adds the feature that most improves the objective at each step. Backward stepwise elimination starts with all features and removes the least useful one at each step. Matching pursuit and its variants (orthogonal matching pursuit, or OMP) iteratively select the dictionary atom most correlated with the current residual, making them particularly popular in signal processing and compressed sensing.
These methods are fast and simple to implement, but they provide no guarantee of finding the globally optimal solution. A feature added early on may not be optimal given features added later.
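A minimal sketch of orthogonal matching pursuit for the linear model y ≈ Ax; the function is illustrative, and practical implementations add column normalization and residual-based stopping rules:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily build a k-sparse solution to y ≈ A x."""
    p = A.shape[1]
    support, residual = [], y.copy()
    x = np.zeros(p)
    for _ in range(k):
        # Select the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit least squares on the selected columns and update the residual
        coef = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
        x = np.zeros(p)
        x[support] = coef
        residual = y - A @ x
    return x
```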
Iterative hard thresholding (IHT), introduced by Blumensath and Davies (2008), is a projected gradient descent method for L0-constrained problems. At each iteration, it takes a gradient step and then applies a hard thresholding operator that keeps only the k largest-magnitude components and sets the rest to zero.
The update rule is:
θ^(t+1) = H_k(θ^(t) - η ∇L(θ^(t)))
where H_k denotes the hard thresholding operator that retains the top k entries and η is the step size. IHT has convergence guarantees under the restricted isometry property (RIP), a condition commonly studied in compressed sensing. Hard thresholding pursuit (HTP) combines IHT with debiasing steps, improving recovery performance.
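A minimal sketch of the IHT update for a squared-error loss; the fixed step size is a simple heuristic rather than the adaptive choices analyzed in the literature, and the function names are illustrative:

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def iht(A, y, k, n_iter=200):
    """Iterative hard thresholding for min ||y - A @ theta||^2 s.t. ||theta||_0 <= k."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # heuristic: 1 / (largest singular value)^2
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - y)         # gradient of the squared-error loss
        theta = hard_threshold(theta - step * grad, k)
    return theta
```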
The smoothed L0 algorithm, proposed by Mohimani, Babaie-Zadeh, and Jutten (2009), approximates the L0 norm with a smooth function. Instead of counting the number of nonzero entries directly, it replaces the discontinuous indicator function with a family of Gaussian-based approximations parameterized by σ:
F_σ(x) = Σ_j (1 - exp(-x_j² / 2σ²))
As σ approaches zero, F_σ converges to the L0 norm. The algorithm proceeds by gradually decreasing σ, solving a smooth optimization problem at each stage. This continuation strategy was reported to be two to three orders of magnitude faster than interior-point linear programming solvers used for L1 minimization, while achieving comparable or better accuracy.
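The sketch below shows the smooth surrogate, its gradient, and a generic continuation loop that decreases σ while warm-starting each stage from the previous solution. It illustrates the continuation idea only; the original SL0 algorithm for compressed sensing instead maximizes the Gaussian surrogate while projecting onto the constraint set Ax = y:

```python
import numpy as np

def f_sigma(x, sigma):
    """Smooth surrogate for ||x||_0: sum_j (1 - exp(-x_j^2 / (2 sigma^2)))."""
    return np.sum(1.0 - np.exp(-x**2 / (2.0 * sigma**2)))

def f_sigma_grad(x, sigma):
    """Gradient of the surrogate with respect to x."""
    return (x / sigma**2) * np.exp(-x**2 / (2.0 * sigma**2))

def continuation_minimize(grad_loss, x0, lam, sigmas, step=0.01, inner_iters=100):
    """Minimize loss(x) + lam * F_sigma(x) for a decreasing sequence of sigma values."""
    x = x0.copy()
    for sigma in sigmas:                      # e.g. sigmas = [1.0, 0.5, 0.2, 0.1, 0.05]
        for _ in range(inner_iters):
            x = x - step * (grad_loss(x) + lam * f_sigma_grad(x, sigma))
    return x
```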
Bertsimas, King, and Mazumder (2016) demonstrated that best subset selection can be formulated as a mixed integer optimization (MIO) problem and solved to provable global optimality using modern solvers. They observed that algorithmic advances in MIO, combined with hardware improvements, had produced roughly a 450-billion-fold speedup between 1991 and 2015, making exact L0 optimization feasible for problems with hundreds or even thousands of features.
Their approach provides certificates of optimality (or bounded suboptimality if terminated early) and can incorporate additional constraints on the coefficients. This work challenged the prevailing assumption that best subset selection was impractical, though it remains computationally expensive for very large-scale problems.
Alternating direction method of multipliers (ADMM) and penalty decomposition methods reformulate the L0-regularized problem by introducing auxiliary variables and decomposing it into subproblems that are easier to solve individually. These methods alternate between updating the model parameters (using standard gradient-based methods) and applying a hard thresholding step to the auxiliary variables. They are particularly useful for structured sparsity problems where groups of parameters should be zeroed out together.
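The thresholding step at the heart of these splitting methods is the proximal operator of the scaled L0 penalty, which has a simple closed form. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def prox_l0(v, t):
    """Proximal operator of t * ||.||_0: entrywise hard thresholding.

    Solves argmin_z 0.5 * ||z - v||^2 + t * ||z||_0, which keeps v_j whenever
    0.5 * v_j**2 > t and sets it to zero otherwise. In an ADMM or penalty
    decomposition loop, the auxiliary variable is updated with exactly this step,
    while the model parameters are updated with a standard smooth solver.
    """
    z = v.copy()
    z[0.5 * v**2 <= t] = 0.0
    return z

print(prox_l0(np.array([0.1, -2.0, 0.5, 3.0]), t=0.5))  # [ 0. -2.  0.  3.]
```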
The most influential approach to applying L0 regularization to deep learning was proposed by Louizos, Welling, and Kingma at ICLR 2018 in their paper "Learning Sparse Neural Networks through L0 Regularization." Their method addresses the fundamental challenge that the L0 norm is both non-differentiable and combinatorial, making it incompatible with standard stochastic gradient descent training.
The key idea is to introduce a learnable binary gate z_j for each weight θ_j in the network. The effective weight becomes:
θ_j^(eff) = θ_j * z_j
where z_j is either 0 (weight removed) or 1 (weight kept). If z_j follows a Bernoulli distribution with parameter π_j, then the expected L0 norm is simply the sum of the gate probabilities:
E[||θ^(eff)||₀] = Σ_j π_j
This expected L0 cost is differentiable with respect to the gate parameters π_j, allowing it to be optimized with gradient-based methods.
Sampling from Bernoulli distributions is not differentiable, so the authors use the reparameterization trick with a continuous relaxation called the hard concrete distribution. This distribution is constructed in three steps:
1. Sample from a binary concrete (Gumbel-softmax) distribution: s = σ((log u - log(1-u) + log α) / β), where u is drawn from Uniform(0, 1), log α is a learnable location parameter, and β is a temperature parameter.
2. Stretch the samples beyond [0, 1]: s' = s(ζ - γ) + γ, where γ < 0 and ζ > 1 are constants that extend the support of the distribution to (γ, ζ).
3. Apply a hard sigmoid: z = min(1, max(0, s')), which clips the stretched samples to [0, 1], creating exact zeros and ones at the boundaries.
The hard concrete distribution concentrates probability mass at exactly 0 and exactly 1, mimicking the discrete Bernoulli distribution while remaining differentiable. The expected L0 cost under this distribution has a closed-form expression involving a sigmoid function of the gate parameters.
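A minimal NumPy sketch of the sampling procedure and the closed-form expected L0 cost described above. The constants γ = -0.1, ζ = 1.1, and β = 2/3 are commonly used settings but should be treated as illustrative hyperparameters, and a real implementation would use an automatic-differentiation framework so the gate parameters log α can be trained jointly with the weights:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative hyperparameters (stretch interval and temperature)
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_hard_concrete(log_alpha, rng):
    """Draw gates z in [0, 1] from the hard concrete distribution."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)  # binary concrete
    s_stretched = s * (ZETA - GAMMA) + GAMMA                        # stretch beyond [0, 1]
    return np.clip(s_stretched, 0.0, 1.0)                           # hard sigmoid

def expected_l0(log_alpha):
    """Closed-form probability that each gate is nonzero, summed over all gates."""
    return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))

rng = np.random.default_rng(0)
log_alpha = np.array([2.0, -2.0, 0.0])     # learnable location parameters, one per weight
z = sample_hard_concrete(log_alpha, rng)   # effective weights would be theta * z
print(z, expected_l0(log_alpha))
```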
Louizos et al. evaluated their method on MNIST (using LeNet-300-100 and LeNet-5 architectures) and CIFAR-10/CIFAR-100 (using Wide Residual Networks, WRN-28-10). Key findings included:
| Experiment | Architecture | Test error (L0) | Test error (baseline) | Outcome |
|---|---|---|---|---|
| MNIST | LeNet-5 | ~0.9% | ~0.8% (dense) | Near-baseline accuracy with significant compression |
| CIFAR-10 | WRN-28-10 | 3.83% | 3.89% (dropout) | Improved over dropout with fewer FLOPs |
| CIFAR-100 | WRN-28-10 | 18.75% | 18.85% (dropout) | Improved over dropout with fewer FLOPs |
On the CIFAR experiments, L0 regularization not only matched but slightly outperformed dropout-based regularization, while simultaneously reducing the number of floating-point operations during both training and inference. The method achieved substantial compression on the LeNet architectures with minimal loss in accuracy.
Savarese et al. (NeurIPS 2020) proposed continuous sparsification, a deterministic alternative to the stochastic gates of Louizos et al. Their method constructs a smooth continuation path connecting a soft-gated L1-like objective to the intractable L0-regularized objective. A hyperparameter β controls the hardness of the objective: as β increases from 1 toward infinity, the objective transitions from L1 regularization to L0 regularization.
By gradually increasing β during training, the method continuously sparsifies the network without requiring heuristic decisions about which parameters to remove or when to remove them. This approach was shown to find sparse subnetworks ("winning tickets") competitive with those found by iterative magnitude pruning.
L0 regularization can be extended from individual weight pruning to structured pruning, where entire neurons, channels, or attention heads are removed together. This is more useful in practice because modern hardware (GPUs, TPUs) cannot easily exploit unstructured sparsity for speed gains. Structured L0 regularization applies gates at the level of groups rather than individual weights, enabling the removal of entire computational units and producing models that are genuinely faster on standard hardware.
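As a simple illustration of gating at the group level, the sketch below (with hypothetical shapes) applies one gate per output channel of a weight matrix, so a zero gate removes an entire row rather than scattered individual weights:

```python
import numpy as np

def gate_output_channels(W, z):
    """Multiply each output channel (row) of W by its gate value.

    W has shape (out_channels, in_channels); z has one gate per output channel.
    A zero gate removes the whole channel, which standard hardware can exploit.
    """
    return W * z[:, None]

W = np.random.default_rng(1).normal(size=(4, 3))
z = np.array([1.0, 0.0, 1.0, 0.0])     # e.g. sampled hard concrete gates
print(gate_output_channels(W, z))      # rows 1 and 3 are zeroed out entirely
```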
The lottery ticket hypothesis, proposed by Frankle and Carbin (ICLR 2019), states that dense, randomly initialized neural networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the full network's test accuracy.
L0 regularization provides a principled, optimization-based approach to finding such subnetworks. Rather than relying on the heuristic process of training, pruning by magnitude, rewinding to initial weights, and repeating (iterative magnitude pruning), L0 regularization learns which weights to keep and which to discard during a single training run by optimizing the gate parameters alongside the network weights.
Continuous sparsification methods based on L0 approximations have been shown to find winning tickets that are competitive with those discovered by iterative magnitude pruning, suggesting that L0 regularization captures the same structural insights as the lottery ticket hypothesis but through a more direct optimization framework.
Compressed sensing is the field where L0 minimization has its deepest theoretical roots. The problem is to recover a sparse signal x from an underdetermined system of linear measurements y = Ax, where A is a measurement matrix with fewer rows than columns. The ideal recovery algorithm would solve:
minimize ||x||₀ subject to Ax = y
Donoho (2006) and Candès, Romberg, and Tao (2006) proved that under certain conditions on the measurement matrix (the restricted isometry property), the L1 minimization problem recovers the same sparse solution as the L0 problem. This theoretical result provides the foundation for compressed sensing and has enabled applications in magnetic resonance imaging (MRI), where the technique allows faster acquisition by measuring fewer Fourier coefficients, and in radar, seismology, and astronomical imaging.
In statistical modeling, L0 regularization is the most direct formulation of the feature selection problem: select the smallest subset of features that adequately explains the response variable. Applications include genomics (identifying relevant genes from thousands of candidates), finance (selecting predictive factors for asset returns), and clinical research (finding diagnostic biomarkers from high-dimensional patient data).
L0 regularization enables the training of sparse neural networks that require fewer parameters, less memory, and fewer computations at inference time. This is valuable for deploying models on resource-constrained devices such as mobile phones and embedded systems. By learning which weights (or entire neurons and channels) to remove during training, L0-based methods produce compressed models without a separate post-training pruning step.
L0 penalties are used to learn the structure of graphical models, such as Gaussian graphical models and Bayesian networks, where the goal is to identify which edges (conditional dependencies) exist in the graph. The L0 penalty directly controls the number of edges, producing interpretable models with known sparsity levels.
Despite its theoretical appeal, L0 regularization faces several practical challenges.
Computational cost. Even with approximate methods, L0 regularization is more expensive than L1 or L2 regularization. Stochastic gate methods introduce additional parameters (the gate distributions), and mixed integer programming approaches scale poorly to very large problems.
Non-convexity and local minima. The L0 objective is non-convex, meaning that gradient-based methods can converge to different solutions depending on initialization, and there is no guarantee that the solution found is globally optimal. By contrast, when the loss function is convex, the L1-regularized objective is convex, so every local minimum is a global minimum.
Hyperparameter sensitivity. The regularization strength λ, along with additional parameters introduced by relaxation methods (temperatures, stretch constants, etc.), must be carefully tuned. The performance of L0 methods can be sensitive to these choices.
Discrete-continuous gap. Continuous relaxations of the L0 norm (such as the hard concrete distribution) approximate the discrete counting function, but the gap between the relaxation and the true L0 objective can lead to solutions that are not perfectly sparse or that differ from the true L0 optimum.
Scalability to modern architectures. While L0 regularization has been demonstrated on moderately sized networks (LeNet, Wide Residual Networks), applying it to very large models with billions of parameters, such as modern large language models, remains an active research challenge. The additional memory and computation required for the gate parameters can be substantial at this scale.
| Algorithm | Type | Guarantees | Scalability | Key advantage |
|---|---|---|---|---|
| Exhaustive search | Exact | Global optimum | Very low (2^p subsets) | Provably optimal |
| Mixed integer programming | Exact | Global optimum (with certificate) | Moderate (up to ~1000 features) | Certificates of optimality |
| Forward/backward stepwise | Greedy | No global guarantee | High | Simple and fast |
| Matching pursuit / OMP | Greedy | Recovery guarantees under RIP | High | Well-suited for compressed sensing |
| Iterative hard thresholding | Projected gradient | Convergence to local minimum under RIP | High | Combines gradient descent with hard sparsity |
| Smoothed L0 (SL0) | Continuous approximation | Converges to L0 as σ approaches 0 | High | Very fast in practice |
| Hard concrete gates | Stochastic relaxation | Expected L0 is differentiable | Moderate to high | End-to-end training with SGD |
| Continuous sparsification | Deterministic relaxation | Continuation from L1 to L0 | Moderate to high | No stochastic sampling required |
| ADMM / penalty decomposition | Splitting method | Convergence to stationary point | High | Handles structured sparsity |