L0 regularization is a regularization technique in machine learning and statistics that penalizes the number of nonzero parameters in a model. Unlike L1 regularization (which penalizes the sum of absolute values) or L2 regularization (which penalizes the sum of squared values), L0 regularization directly counts how many parameters are active. This makes it the most natural formulation for enforcing sparsity, but it also makes the resulting optimization problem NP-hard in general. Because of this computational difficulty, researchers have developed a range of approximate and relaxation-based methods to make L0 regularization practical for real-world applications, including neural network pruning, feature selection, and compressed sensing.
Imagine you have a big box of crayons but you want to draw a picture using as few crayons as possible. L0 regularization is like a rule that says: "Every crayon you pick up costs you one point." So you try to draw the best picture you can while keeping the number of crayons small. The tricky part is that figuring out which exact set of crayons gives you the best picture with the fewest crayons is really, really hard, even for computers. So instead, people use clever tricks to get close to the best answer without checking every possible combination.
For a vector x = (x₁, x₂, ..., x_p), the L0 "norm" is defined as:
||x||₀ = |{j : x_j ≠ 0}|
In other words, it is simply the count of nonzero entries in the vector. Although it is commonly referred to as a "norm," the L0 function is not a true norm because it violates homogeneity: a norm must satisfy ||kx|| = |k|·||x|| for every scalar k, but multiplying all entries of x by a nonzero scalar does not change the count of nonzero entries. For this reason it is usually called a pseudonorm. The L0 pseudonorm is also non-convex, which distinguishes it from the L1 and L2 norms.
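A minimal Python sketch of the definition (the helper name `l0_norm` is illustrative, not from any particular library); it also shows why scaling a vector does not change the count:

```python
import numpy as np

def l0_norm(x, tol=0.0):
    """Count the nonzero entries of x (the L0 pseudonorm).

    With floating-point data, entries within `tol` of zero may be treated as zero.
    """
    return int(np.sum(np.abs(np.asarray(x)) > tol))

x = np.array([0.0, 3.5, 0.0, -1.2, 0.0])
print(l0_norm(x))         # 2
print(l0_norm(10.0 * x))  # still 2: scaling does not change the count
```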
Given a dataset of n observations, a loss function L(θ), and a parameter vector θ, the L0-regularized optimization problem is:
minimize L(θ) + λ ||θ||₀
where λ > 0 is the regularization strength that controls the tradeoff between fitting the data and keeping the model sparse. Larger values of λ encourage sparser solutions by imposing a heavier penalty for each additional nonzero parameter.
In the context of linear regression, this formulation is equivalent to the best subset selection problem, which seeks to find the subset of at most k features that minimizes the residual sum of squares.
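To make the penalized objective concrete in the linear regression setting, the following sketch enumerates every support set for a small number of features and returns the coefficient vector that minimizes the L0-penalized residual sum of squares; the helper `best_subset` and the synthetic data are hypothetical and only meant to illustrate the objective:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, lam):
    """Exhaustively minimize ||y - X @ theta||^2 + lam * ||theta||_0.

    Enumerates all 2^p support sets, so this is only feasible for small p.
    """
    n, p = X.shape
    best_obj, best_theta = np.inf, np.zeros(p)
    for k in range(p + 1):
        for support in combinations(range(p), k):
            theta = np.zeros(p)
            if k > 0:
                S = list(support)
                # Least-squares fit restricted to the chosen support
                theta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            obj = np.sum((y - X @ theta) ** 2) + lam * k
            if obj < best_obj:
                best_obj, best_theta = obj, theta
    return best_theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0]) + 0.1 * rng.normal(size=50)
print(best_subset(X, y, lam=1.0))  # typically recovers the two active features
```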
The L0-regularized optimization problem is NP-hard: no polynomial-time algorithm is known for solving it in general, and none exists unless P = NP. This was formally established by Natarajan (1995), who proved that finding the sparsest solution to a system of linear equations is NP-hard.
The core difficulty is combinatorial in nature. For a model with p parameters, there are 2^p possible subsets of active parameters. Evaluating every subset to find the optimal one becomes computationally infeasible even for moderate values of p. For example, a model with just 50 features has over 10^15 possible subsets.
This NP-hardness result has had a lasting impact on the field. For decades, the accepted view in statistics was that best subset selection was impractical beyond roughly 30 to 40 features. This motivated the development of computationally tractable alternatives, most notably L1 regularization (the Lasso).
The L1 norm is the tightest convex relaxation of the L0 pseudonorm on the unit hypercube [-1, 1]^p. This means that among all convex functions that lower-bound the L0 pseudonorm on that domain, the L1 norm is the largest. This mathematical relationship is why L1 regularization produces sparse solutions: it is the closest convex approximation to directly counting nonzero parameters.
In practice, replacing the L0 penalty with an L1 penalty transforms the NP-hard problem into a convex optimization problem that can be solved efficiently. In the linear regression setting, this substitution yields the Lasso (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani in 1996. In signal processing, the same idea is known as basis pursuit.
However, the L1 relaxation is not without drawbacks. L1 regularization introduces shrinkage bias: it pushes all coefficient estimates toward zero, not just the irrelevant ones. This means that even the truly important features have their coefficients systematically underestimated. L0 regularization, by contrast, imposes no penalty on the magnitude of nonzero parameters. It only penalizes whether a parameter is zero or not, so the surviving coefficients are not shrunk toward zero.
| Property | L0 regularization | L1 regularization | L2 regularization |
|---|---|---|---|
| Penalty term | Number of nonzero parameters | Sum of absolute values | Sum of squared values |
| Convexity | Non-convex | Convex | Convex |
| Sparsity | Exact sparsity (hard zeros) | Approximate sparsity (soft thresholding) | No sparsity (shrinkage only) |
| Shrinkage bias on large coefficients | None | Yes | Yes |
| Computational complexity | NP-hard in general | Polynomial time | Polynomial time |
| Common name (linear regression) | Best subset selection | Lasso | Ridge regression |
| Differentiability | Non-differentiable, discontinuous | Non-differentiable at zero | Differentiable everywhere |
L0 regularization has a direct connection to classical model selection criteria from information theory. When the regularization parameter λ is set to specific values, the L0-penalized objective recovers well-known criteria.
Setting λ = 1 (in appropriate units) corresponds to the Akaike Information Criterion (AIC), which penalizes each additional parameter by a fixed cost of 1 (in units of twice the negative log-likelihood). Setting λ = (1/2) log(n), where n is the sample size, corresponds to the Bayesian Information Criterion (BIC), which applies a stronger penalty that grows with the sample size.
This connection shows that AIC and BIC can be viewed as special cases of L0 regularization with particular choices of the penalty strength. The Louizos et al. (2018) paper on L0 regularization for neural networks explicitly noted this relationship, observing that their method with appropriate λ settings is equivalent to performing AIC or BIC model selection over the network architecture.
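The correspondence can be checked directly. The sketch below uses placeholder numbers for a hypothetical fitted model (negative log-likelihood `nll`, k nonzero parameters, n observations) and verifies that the AIC and BIC values equal the L0-penalized negative log-likelihood, up to a factor of 2, for the λ values stated above:

```python
import numpy as np

# Hypothetical fitted model; the numbers below are placeholders.
nll, k, n = 123.4, 7, 500

aic = 2 * k + 2 * nll            # AIC = 2k - 2 * log-likelihood
bic = k * np.log(n) + 2 * nll    # BIC = k * log(n) - 2 * log-likelihood

# The same criteria written as L0-penalized objectives (up to a factor of 2):
print(np.isclose(aic, 2 * (nll + 1.0 * k)))              # lambda = 1
print(np.isclose(bic, 2 * (nll + 0.5 * np.log(n) * k)))  # lambda = (1/2) log n
```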
Because exact L0 optimization is NP-hard, a variety of algorithms have been developed to find approximate solutions. These can be grouped into several families.
Greedy methods build sparse solutions incrementally by adding or removing one feature at a time.
Forward stepwise selection starts with an empty model and adds the feature that most improves the objective at each step. Backward stepwise elimination starts with all features and removes the least useful one at each step. Matching pursuit and its variants (orthogonal matching pursuit, or OMP) iteratively select the dictionary atom most correlated with the current residual, making them particularly popular in signal processing and compressed sensing.
These methods are fast and simple to implement, but they provide no guarantee of finding the globally optimal solution. A feature added early on may not be optimal given features added later.
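A minimal sketch of orthogonal matching pursuit for the linear model y ≈ Ax; the function is illustrative, and practical implementations add column normalization and residual-based stopping rules:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily build a k-sparse solution to y ≈ A x."""
    p = A.shape[1]
    support, residual = [], y.copy()
    x = np.zeros(p)
    for _ in range(k):
        # Select the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit least squares on the selected columns and update the residual
        coef = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
        x = np.zeros(p)
        x[support] = coef
        residual = y - A @ x
    return x
```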
Iterative hard thresholding (IHT), introduced by Blumensath and Davies (2008), is a projected gradient descent method for L0-constrained problems. At each iteration, it takes a gradient step and then applies a hard thresholding operator that keeps only the k largest-magnitude components and sets the rest to zero.
The update rule is:
θ^(t+1) = H_k(θ^(t) - η ∇L(θ^(t)))
where H_k denotes the hard thresholding operator that retains the top k entries and η is the step size. IHT has convergence guarantees under the restricted isometry property (RIP), a condition commonly studied in compressed sensing. Hard thresholding pursuit (HTP) combines IHT with debiasing steps, improving recovery performance.
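A minimal sketch of the IHT update for a squared-error loss; the fixed step size is a simple heuristic rather than the adaptive choices analyzed in the literature, and the function names are illustrative:

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def iht(A, y, k, n_iter=200):
    """Iterative hard thresholding for min ||y - A @ theta||^2 s.t. ||theta||_0 <= k."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # heuristic: 1 / (largest singular value)^2
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - y)         # gradient of the squared-error loss
        theta = hard_threshold(theta - step * grad, k)
    return theta
```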
The smoothed L0 algorithm, proposed by Mohimani, Babaie-Zadeh, and Jutten (2009), approximates the L0 norm with a smooth function. Instead of counting the number of nonzero entries directly, it replaces the discontinuous indicator function with a family of Gaussian-based approximations parameterized by σ:
F_σ(x) = Σ_j (1 - exp(-x_j² / 2σ²))
As σ approaches zero, F_σ converges to the L0 norm. The algorithm proceeds by gradually decreasing σ, solving a smooth optimization problem at each stage. This continuation strategy was reported to be two to three orders of magnitude faster than interior-point linear programming solvers used for L1 minimization, while achieving comparable or better accuracy.
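The sketch below shows the smooth surrogate, its gradient, and a generic continuation loop that decreases σ while warm-starting each stage from the previous solution. It illustrates the continuation idea only; the original SL0 algorithm for compressed sensing instead maximizes the Gaussian surrogate while projecting onto the constraint set Ax = y:

```python
import numpy as np

def f_sigma(x, sigma):
    """Smooth surrogate for ||x||_0: sum_j (1 - exp(-x_j^2 / (2 sigma^2)))."""
    return np.sum(1.0 - np.exp(-x**2 / (2.0 * sigma**2)))

def f_sigma_grad(x, sigma):
    """Gradient of the surrogate with respect to x."""
    return (x / sigma**2) * np.exp(-x**2 / (2.0 * sigma**2))

def continuation_minimize(grad_loss, x0, lam, sigmas, step=0.01, inner_iters=100):
    """Minimize loss(x) + lam * F_sigma(x) for a decreasing sequence of sigma values."""
    x = x0.copy()
    for sigma in sigmas:                      # e.g. sigmas = [1.0, 0.5, 0.2, 0.1, 0.05]
        for _ in range(inner_iters):
            x = x - step * (grad_loss(x) + lam * f_sigma_grad(x, sigma))
    return x
```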
Bertsimas, King, and Mazumder (2016) demonstrated that best subset selection can be formulated as a mixed integer optimization (MIO) problem and solved to provable global optimality using modern solvers. They observed that algorithmic advances in MIO, combined with hardware improvements, had produced roughly a 450-billion-fold speedup between 1991 and 2015, making exact L0 optimization feasible for problems with hundreds or even thousands of features.
Their approach provides certificates of optimality (or bounded suboptimality if terminated early) and can incorporate additional constraints on the coefficients. This work challenged the prevailing assumption that best subset selection was impractical, though it remains computationally expensive for very large-scale problems.
Alternating direction method of multipliers (ADMM) and penalty decomposition methods reformulate the L0-regularized problem by introducing auxiliary variables and decomposing it into subproblems that are easier to solve individually. These methods alternate between updating the model parameters (using standard gradient-based methods) and applying a hard thresholding step to the auxiliary variables. They are particularly useful for structured sparsity problems where groups of parameters should be zeroed out together.
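The thresholding step at the heart of these splitting methods is the proximal operator of the scaled L0 penalty, which has a simple closed form. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def prox_l0(v, t):
    """Proximal operator of t * ||.||_0: entrywise hard thresholding.

    Solves argmin_z 0.5 * ||z - v||^2 + t * ||z||_0, which keeps v_j whenever
    0.5 * v_j**2 > t and sets it to zero otherwise. In an ADMM or penalty
    decomposition loop, the auxiliary variable is updated with exactly this step,
    while the model parameters are updated with a standard smooth solver.
    """
    z = v.copy()
    z[0.5 * v**2 <= t] = 0.0
    return z

print(prox_l0(np.array([0.1, -2.0, 0.5, 3.0]), t=0.5))  # [ 0. -2.  0.  3.]
```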
The most influential approach to applying L0 regularization to deep learning was proposed by Louizos, Welling, and Kingma at ICLR 2018 in their paper "Learning Sparse Neural Networks through L0 Regularization." Their method addresses the fundamental challenge that the L0 norm is both non-differentiable and combinatorial, making it incompatible with standard stochastic gradient descent training.
The key idea is to introduce a learnable binary gate z_j for each weight θ_j in the network. The effective weight becomes:
θ_j^(eff) = θ_j * z_j
where z_j is either 0 (weight removed) or 1 (weight kept). If z_j follows a Bernoulli distribution with parameter π_j, then the expected L0 norm is simply the sum of the gate probabilities:
E[||θ^(eff)||₀] = Σ_j π_j
This expected L0 cost is differentiable with respect to the gate parameters π_j, allowing it to be optimized with gradient-based methods.
Sampling from Bernoulli distributions is not differentiable, so the authors use the reparameterization trick with a continuous relaxation called the hard concrete distribution. This distribution is constructed in three steps:
1. Sample from a binary concrete (Gumbel-softmax) distribution: s = σ((log u - log(1-u) + log α) / β), where u is drawn from Uniform(0, 1), log α is a learnable location parameter, and β is a temperature parameter.
2. Stretch the samples beyond [0, 1]: s' = s(ζ - γ) + γ, where γ < 0 and ζ > 1 are constants that extend the support of the distribution to (γ, ζ).
3. Apply a hard sigmoid: z = min(1, max(0, s')), which clips the stretched samples to [0, 1], creating exact zeros and ones at the boundaries.
The hard concrete distribution concentrates probability mass at exactly 0 and exactly 1, mimicking the discrete Bernoulli distribution while remaining differentiable. The expected L0 cost under this distribution has a closed-form expression involving a sigmoid function of the gate parameters.
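A minimal NumPy sketch of the sampling procedure and the closed-form expected L0 cost described above. The constants γ = -0.1, ζ = 1.1, and β = 2/3 are commonly used settings but should be treated as illustrative hyperparameters, and a real implementation would use an automatic-differentiation framework so the gate parameters log α can be trained jointly with the weights:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative hyperparameters (stretch interval and temperature)
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_hard_concrete(log_alpha, rng):
    """Draw gates z in [0, 1] from the hard concrete distribution."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)  # binary concrete
    s_stretched = s * (ZETA - GAMMA) + GAMMA                        # stretch beyond [0, 1]
    return np.clip(s_stretched, 0.0, 1.0)                           # hard sigmoid

def expected_l0(log_alpha):
    """Closed-form probability that each gate is nonzero, summed over all gates."""
    return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))

rng = np.random.default_rng(0)
log_alpha = np.array([2.0, -2.0, 0.0])     # learnable location parameters, one per weight
z = sample_hard_concrete(log_alpha, rng)   # effective weights would be theta * z
print(z, expected_l0(log_alpha))
```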
Louizos et al. evaluated their method on MNIST (using LeNet-300-100 and LeNet-5 architectures) and CIFAR-10/CIFAR-100 (using Wide Residual Networks, WRN-28-10). Key findings included:
| Experiment | Architecture | Test error (L0) | Test error (baseline) | Outcome |
|---|---|---|---|---|
| MNIST | LeNet-5 | ~0.9% | ~0.8% (dense) | Near-baseline accuracy with significant compression |
| CIFAR-10 | WRN-28-10 | 3.83% | 3.89% (dropout) | Improved over dropout with fewer FLOPs |
| CIFAR-100 | WRN-28-10 | 18.75% | 18.85% (dropout) | Improved over dropout with fewer FLOPs |
On the CIFAR experiments, L0 regularization not only matched but slightly outperformed dropout-based regularization, while simultaneously reducing the number of floating-point operations during both training and inference. The method achieved substantial compression on the LeNet architectures with minimal loss in accuracy.
Savarese et al. (NeurIPS 2020) proposed continuous sparsification, a deterministic alternative to the stochastic gates of Louizos et al. Their method constructs a smooth continuation path connecting a soft-gated L1-like objective to the intractable L0-regularized objective. A hyperparameter β controls the hardness of the objective: as β increases from 1 toward infinity, the objective transitions from L1 regularization to L0 regularization.
By gradually increasing β during training, the method continuously sparsifies the network without requiring heuristic decisions about which parameters to remove or when to remove them. This approach was shown to find sparse subnetworks ("winning tickets") competitive with those found by iterative magnitude pruning.
L0 regularization can be extended from individual weight pruning to structured pruning, where entire neurons, channels, or attention heads are removed together. This is more useful in practice because modern hardware (GPUs, TPUs) cannot easily exploit unstructured sparsity for speed gains. Structured L0 regularization applies gates at the level of groups rather than individual weights, enabling the removal of entire computational units and producing models that are genuinely faster on standard hardware.
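As a simple illustration of gating at the group level, the sketch below (with hypothetical shapes) applies one gate per output channel of a weight matrix, so a zero gate removes an entire row rather than scattered individual weights:

```python
import numpy as np

def gate_output_channels(W, z):
    """Multiply each output channel (row) of W by its gate value.

    W has shape (out_channels, in_channels); z has one gate per output channel.
    A zero gate removes the whole channel, which standard hardware can exploit.
    """
    return W * z[:, None]

W = np.random.default_rng(1).normal(size=(4, 3))
z = np.array([1.0, 0.0, 1.0, 0.0])     # e.g. sampled hard concrete gates
print(gate_output_channels(W, z))      # rows 1 and 3 are zeroed out entirely
```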
The lottery ticket hypothesis, proposed by Frankle and Carbin (ICLR 2019), states that dense, randomly initialized neural networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the full network's test accuracy.
L0 regularization provides a principled, optimization-based approach to finding such subnetworks. Rather than relying on the heuristic process of training, pruning by magnitude, rewinding to initial weights, and repeating (iterative magnitude pruning), L0 regularization learns which weights to keep and which to discard during a single training run by optimizing the gate parameters alongside the network weights.
Continuous sparsification methods based on L0 approximations have been shown to find winning tickets that are competitive with those discovered by iterative magnitude pruning, suggesting that L0 regularization captures the same structural insights as the lottery ticket hypothesis but through a more direct optimization framework.
Compressed sensing is the field where L0 minimization has its deepest theoretical roots. The problem is to recover a sparse signal x from an underdetermined system of linear measurements y = Ax, where A is a measurement matrix with fewer rows than columns. The ideal recovery algorithm would solve:
minimize ||x||₀ subject to Ax = y
Donoho (2006) and Candès, Romberg, and Tao (2006) proved that under certain conditions on the measurement matrix (the restricted isometry property), the L1 minimization problem recovers the same sparse solution as the L0 problem. This theoretical result provides the foundation for compressed sensing and has enabled applications in magnetic resonance imaging (MRI), where the technique allows faster acquisition by measuring fewer Fourier coefficients, and in radar, seismology, and astronomical imaging.
In statistical modeling, L0 regularization is the most direct formulation of the feature selection problem: select the smallest subset of features that adequately explains the response variable. Applications include genomics (identifying relevant genes from thousands of candidates), finance (selecting predictive factors for asset returns), and clinical research (finding diagnostic biomarkers from high-dimensional patient data).
L0 regularization enables the training of sparse neural networks that require fewer parameters, less memory, and fewer computations at inference time. This is valuable for deploying models on resource-constrained devices such as mobile phones and embedded systems. By learning which weights (or entire neurons and channels) to remove during training, L0-based methods produce compressed models without a separate post-training pruning step.
L0 penalties are used to learn the structure of graphical models, such as Gaussian graphical models and Bayesian networks, where the goal is to identify which edges (conditional dependencies) exist in the graph. The L0 penalty directly controls the number of edges, producing interpretable models with known sparsity levels.
Despite its theoretical appeal, L0 regularization faces several practical challenges.
Computational cost. Even with approximate methods, L0 regularization is more expensive than L1 or L2 regularization. Stochastic gate methods introduce additional parameters (the gate distributions), and mixed integer programming approaches scale poorly to very large problems.
Non-convexity and local minima. The L0 objective is non-convex, meaning that gradient-based methods can converge to different solutions depending on initialization, and there is no guarantee that the solution found is globally optimal. By contrast, when the loss function is convex, the L1-regularized objective is convex, so every local minimum is a global minimum.
Hyperparameter sensitivity. The regularization strength λ, along with additional parameters introduced by relaxation methods (temperatures, stretch constants, etc.), must be carefully tuned. The performance of L0 methods can be sensitive to these choices.
Discrete-continuous gap. Continuous relaxations of the L0 norm (such as the hard concrete distribution) approximate the discrete counting function, but the gap between the relaxation and the true L0 objective can lead to solutions that are not perfectly sparse or that differ from the true L0 optimum.
Scalability to modern architectures. While L0 regularization has been demonstrated on moderately sized networks (LeNet, Wide Residual Networks), applying it to very large models with billions of parameters, such as modern large language models, remains an active research challenge. The additional memory and computation required for the gate parameters can be substantial at this scale.
| Algorithm | Type | Guarantees | Scalability | Key advantage |
|---|---|---|---|---|
| Exhaustive search | Exact | Global optimum | Very low (2^p subsets) | Provably optimal |
| Mixed integer programming | Exact | Global optimum (with certificate) | Moderate (up to ~1000 features) | Certificates of optimality |
| Forward/backward stepwise | Greedy | No global guarantee | High | Simple and fast |
| Matching pursuit / OMP | Greedy | Recovery guarantees under RIP | High | Well-suited for compressed sensing |
| Iterative hard thresholding | Projected gradient | Convergence to local minimum under RIP | High | Combines gradient descent with hard sparsity |
| Smoothed L0 (SL0) | Continuous approximation | Converges to L0 as σ approaches 0 | High | Very fast in practice |
| Hard concrete gates | Stochastic relaxation | Expected L0 is differentiable | Moderate to high | End-to-end training with SGD |
| Continuous sparsification | Deterministic relaxation | Continuation from L1 to L0 | Moderate to high | No stochastic sampling required |
| ADMM / penalty decomposition | Splitting method | Convergence to stationary point | High | Handles structured sparsity |