L1 regularization is a widely used regularization technique in machine learning and statistical modeling that prevents overfitting by adding the sum of the absolute values of a model's parameters as a penalty term to the loss function. Also known as the Lasso penalty (Least Absolute Shrinkage and Selection Operator), L1 regularization encourages sparsity in the learned parameters, effectively driving many coefficients to exactly zero. This property makes it a powerful tool for automatic feature selection, model interpretability, and working with high-dimensional datasets.
Given a standard supervised learning objective, L1 regularization modifies the loss function by appending a penalty term proportional to the L1 norm (the sum of absolute values) of the model's weight vector. For a model with parameters w = (w_1, w_2, ..., w_p), the regularized objective is:
L_total(w) = L_data(w) + λ ∑|w_i|
where L_data(w) is the original (unregularized) data loss, λ ≥ 0 is a hyperparameter controlling the strength of the penalty, and the sum runs over all model parameters w_i.
In the specific case of linear regression, this formulation yields the Lasso regression objective:
minimize (1/2n) ∑(y_i - x_i^T w)^2 + λ ∑|w_j|
An equivalent constrained formulation expresses the same idea: minimize the residual sum of squares subject to the constraint that ∑|w_j| ≤ t, where t is a budget parameter. Smaller values of t correspond to larger values of λ, producing sparser solutions.
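As a concrete reference point, the short NumPy sketch below evaluates this penalized objective for a given weight vector; the names `lasso_objective`, `X`, `y`, `w`, and `lam` are illustrative only, not taken from any particular library.

```python
import numpy as np

# Minimal sketch of the penalized Lasso objective described above.
def lasso_objective(X, y, w, lam):
    n = len(y)
    data_loss = 0.5 / n * np.sum((y - X @ w) ** 2)  # (1/2n) * residual sum of squares
    penalty = lam * np.sum(np.abs(w))               # lambda * sum of |w_j|
    return data_loss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.0])       # a sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)
print(lasso_objective(X, y, np.zeros(5), lam=0.1))
```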
The defining characteristic of L1 regularization is its ability to drive model coefficients to exactly zero, producing sparse weight vectors. This behavior arises from several complementary perspectives.
The constrained form of L1 regularization restricts the weight vector to lie within a diamond-shaped region (or, in higher dimensions, a cross-polytope). In two dimensions, the set of all points satisfying |w_1| + |w_2| ≤ t forms a diamond (rotated square) with corners on the coordinate axes.
During optimization, the algorithm seeks the point where the loss function's contours first touch this constraint region. Because the L1 constraint region has sharp corners that lie directly on the coordinate axes, the point of contact frequently lands at one of these corners. At a corner, one or more of the weights equals exactly zero. By contrast, L2 regularization uses a circular (spherical) constraint region with no corners, so the intersection with the loss contours typically occurs at a point where all weights are nonzero.
The L1 penalty contributes a term of constant magnitude to the gradient of the regularized objective. For a weight w_i > 0, the penalty's gradient is +λ; for w_i < 0, it is -λ. This constant "push" toward zero operates independently of the weight's current magnitude: even when a weight is very small, the push stays the same size, so the optimization process can drive the weight all the way to zero.
In contrast, the L2 penalty contributes a gradient of 2λw_i, which shrinks proportionally as the weight approaches zero. The diminishing push under L2 means weights get very small but rarely reach exactly zero.
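The difference is easy to see numerically; the small sketch below (plain NumPy, illustrative variable names) compares the two penalty gradients at progressively smaller weights.

```python
import numpy as np

# Gradient contribution of each penalty at progressively smaller weights.
# The L1 push has constant size lambda; the L2 push vanishes as w -> 0.
lam = 0.1
w = np.array([1.0, 0.1, 0.01, 0.001])
l1_push = lam * np.sign(w)   # [0.1, 0.1, 0.1, 0.1]
l2_push = 2 * lam * w        # [0.2, 0.02, 0.002, 0.0002]
print(l1_push, l2_push)
```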
From a Bayesian perspective, L1 regularization corresponds to placing a Laplace (double-exponential) prior on the model parameters and performing maximum a posteriori (MAP) estimation. The Laplace distribution is sharply peaked at zero with heavy tails, encoding a prior belief that most parameters should be zero or near zero while a few may be large. This contrasts with the Gaussian prior implied by L2 regularization, which favors small but nonzero values distributed symmetrically around zero.
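This correspondence can be checked numerically: up to an additive constant, the negative log-density of a zero-mean Laplace prior with scale b is an L1 penalty with λ = 1/b. The sketch below assumes SciPy is available; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import laplace

# For a zero-mean Laplace prior with scale b,
#   -log p(w) = |w| / b + log(2b),
# so MAP estimation adds an L1 penalty with lambda = 1/b (plus a constant).
b = 0.5
w = np.linspace(-3, 3, 7)
neg_log_prior = -laplace.logpdf(w, loc=0.0, scale=b)
l1_form = np.abs(w) / b + np.log(2 * b)
print(np.allclose(neg_log_prior, l1_form))  # True
```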
The most well-known application of L1 regularization is Lasso regression, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso," published in the Journal of the Royal Statistical Society, Series B. The name "LASSO" stands for Least Absolute Shrinkage and Selection Operator.
Tibshirani's key insight was that the L1 penalty performs both shrinkage (reducing the magnitude of coefficients) and selection (setting some coefficients exactly to zero) simultaneously. This dual capability was a significant advance over the existing alternatives: ridge regression shrinks coefficients but never sets them exactly to zero, while best-subset selection picks variables but is discrete, unstable, and computationally expensive for more than a handful of predictors.
The idea of using L1 penalties in regression had appeared earlier in geophysics literature around 1986, but Tibshirani's work formalized the statistical properties of the method and popularized it for broad use across the statistical and machine learning communities. His original paper has been cited over 50,000 times and is considered one of the most influential papers in modern statistics.
L1 and L2 regularization represent two fundamental approaches to constraining model complexity. Their differences are summarized in the table below.
| Property | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
|---|---|---|
| Penalty term | λ ∑|w_i| | λ ∑w_i^2 |
| Constraint shape | Diamond (cross-polytope) | Sphere (Euclidean ball) |
| Sparsity | Produces exact zeros | Shrinks toward zero but rarely reaches it |
| Feature selection | Built-in (automatic) | Not performed |
| Correlated features | May arbitrarily select one from a group | Distributes weight among correlated features |
| Bayesian prior | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May not be unique when p > n | Always unique |
| Typical use case | High-dimensional data with few relevant features | Data where most features contribute |
Elastic Net regularization, proposed by Zou and Hastie in 2005, combines L1 and L2 penalties to address limitations of each method used alone:
L_elastic(w) = L_data(w) + λ_1 ∑|w_i| + λ_2 ∑w_i^2
Elastic Net retains the feature selection capability of L1 while gaining the grouping effect of L2, meaning it tends to select or exclude groups of correlated features together rather than arbitrarily picking one. This makes Elastic Net particularly useful when the number of predictors exceeds the number of observations or when predictors are highly correlated.
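In scikit-learn, the combined penalty is exposed through the ElasticNet class, which parameterizes λ_1 and λ_2 via alpha (the overall strength) and l1_ratio (the L1/L2 mix); the sketch below is a minimal usage example on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Minimal Elastic Net example; alpha sets the overall penalty strength,
# l1_ratio the mix between the L1 and L2 terms.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=0.5, random_state=0)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(f"Non-zero coefficients: {(enet.coef_ != 0).sum()} out of {len(enet.coef_)}")
```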
The L1 penalty introduces a challenge for optimization: the absolute value function |w| is not differentiable at w = 0. Several specialized algorithms have been developed to handle this.
The subdifferential of |w| at w = 0 is the interval [-1, 1], meaning any value in that range is a valid subgradient. Subgradient descent generalizes gradient descent to work with non-differentiable convex functions by replacing the gradient with a subgradient at points of non-differentiability. While simple to implement, subgradient methods converge slowly and do not produce exactly sparse iterates during optimization.
Coordinate descent optimizes one parameter at a time while holding all others fixed. For L1-regularized problems, each coordinate update has a closed-form solution involving a soft-thresholding operation:
w_j = S(z_j, λ) = sign(z_j) max(|z_j| - λ, 0)
where z_j is the partial residual. This operation naturally produces exact zeros whenever |z_j| ≤ λ. Coordinate descent is the algorithm underlying popular implementations like the glmnet package in R and scikit-learn's Lasso in Python.
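A toy implementation makes the mechanism concrete. The sketch below assumes the columns of X are standardized so that each column's mean squared value is 1 (consistent with the (1/2n) formulation above); it illustrates the update rather than replacing optimized libraries such as glmnet.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) from the update above."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Toy coordinate descent for (1/2n)||y - Xw||^2 + lam*||w||_1,
    assuming each column of X satisfies (1/n) * sum(x_ij^2) = 1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual excluding feature j
            z_j = X[:, j] @ r_j / n            # coordinate-wise least-squares fit
            w[j] = soft_threshold(z_j, lam)    # exact zero whenever |z_j| <= lam
    return w
```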
Proximal gradient methods split the objective into a smooth part (the data loss) and a non-smooth part (the L1 penalty). At each iteration, a standard gradient step is taken on the smooth part, followed by a proximal operator that handles the L1 term. For L1 regularization, the proximal operator is the soft-thresholding function.
ISTA (Iterative Shrinkage-Thresholding Algorithm) applies this approach directly and achieves a convergence rate of O(1/k), where k is the number of iterations.
FISTA (Fast ISTA), proposed by Beck and Teboulle in 2009, incorporates Nesterov-style momentum to accelerate convergence to O(1/k^2), often reducing computation time by orders of magnitude compared to ISTA.
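A bare-bones ISTA loop looks like the sketch below (plain NumPy, illustrative names): each iteration takes a gradient step on the data loss, then applies soft-thresholding as the proximal step.

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    """Toy ISTA for (1/2n)||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n               # gradient step on the data loss
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # proximal (soft-threshold) step
    return w
```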
The LARS algorithm, developed by Efron, Hastie, Johnstone, and Tibshirani in 2004, computes the entire Lasso solution path (solutions for all values of λ) with essentially the same computational cost as a single ordinary least squares fit. This makes it efficient for model selection across a range of regularization strengths.
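scikit-learn exposes this through lars_path; the sketch below (synthetic data, illustrative names) computes the entire Lasso path using method='lasso'.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Compute the full Lasso solution path with the LARS algorithm.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=0.5, random_state=0)
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)  # (n_features, n_breakpoints): one coefficient vector per path breakpoint
```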
In deep learning, L1 regularization can be applied to the weights of neural networks to encourage sparse connectivity. The regularized loss becomes:
L_total = L_data + λ ∑ ∑ |W_ij^(l)|
where the double summation runs over all weights across all layers.
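In a framework such as PyTorch, this is typically implemented by summing the absolute values of the weight tensors and adding the result to the training loss; the sketch below uses an arbitrary small network purely for illustration, not as a prescription for any particular architecture.

```python
import torch
import torch.nn as nn

# Minimal example: add an L1 penalty over all layer weights (biases skipped)
# to the data loss before backpropagation.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.MSELoss()
lam = 1e-4

x, y = torch.randn(32, 20), torch.randn(32, 1)
data_loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
loss = data_loss + lam * l1_penalty
loss.backward()
```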
L1 regularization serves as a foundation for network pruning strategies. By training with an L1 penalty, many weights are driven to near-zero values and can then be removed (pruned) to create a smaller, faster network. However, the sparsity produced by standard L1 regularization is typically unstructured, meaning individual weights are zeroed out in scattered positions throughout the weight matrices. Unstructured sparsity does not always translate to computational speedups on standard hardware because the remaining nonzero weights do not form regular patterns.
To address this limitation, Group Lasso regularization (Yuan and Lin, 2006) applies the L1 norm at the group level rather than the individual weight level. Specifically, it penalizes the L2 norms of predefined groups of weights using an L1-style penalty:
λ ∑ ||w_g||_2
where w_g represents the weights belonging to group g. This drives entire groups of weights to zero simultaneously, enabling structured pruning of neurons, channels, or attention heads. Wen et al. (2016) applied group Lasso to convolutional neural networks for structured pruning of filters, achieving meaningful speedups on GPU hardware.
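As a small illustration, the sketch below computes the group Lasso penalty for a weight matrix whose rows are treated as groups (so zeroing a group removes an entire output neuron); the grouping and variable names are illustrative.

```python
import numpy as np

def group_lasso_penalty(W, lam):
    """L1-style sum over the L2 norms of predefined groups (here, rows of W)."""
    group_norms = np.linalg.norm(W, axis=1)   # one L2 norm per row/group
    return lam * group_norms.sum()

W = np.random.randn(8, 16)                    # 8 output neurons, 16 inputs
print(group_lasso_penalty(W, lam=0.01))
```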
In practice, L2 regularization (weight decay) is more commonly used as the default regularizer in deep learning because it interacts more smoothly with gradient-based optimizers like Adam and SGD. L1 regularization is typically reserved for situations where explicit sparsity is desired, such as model compression, interpretability analysis, or embedded feature selection within neural architectures.
Several important extensions of L1 regularization have been developed since Tibshirani's original work.
| Variant | Authors (Year) | Key Idea |
|---|---|---|
| Adaptive Lasso | Zou (2006) | Uses weighted penalties (1/|w_j^initial|) to achieve oracle properties |
| Group Lasso | Yuan and Lin (2006) | Selects entire groups of variables together |
| Fused Lasso | Tibshirani et al. (2005) | Penalizes differences between successive coefficients for ordered data |
| Elastic Net | Zou and Hastie (2005) | Combines L1 and L2 penalties for correlated features |
| Sparse Group Lasso | Simon et al. (2013) | Combines Group Lasso with element-wise L1 for within-group sparsity |
| Prior Lasso | Jiang et al. (2016) | Incorporates external prior information through pseudo-responses |
L1 regularization is used across a wide range of fields and problem types.
In genome-wide association studies and gene expression analysis, datasets often contain thousands or tens of thousands of features (genes) but relatively few samples (patients). L1 regularization identifies the small subset of genes most relevant to a disease outcome, making results interpretable for biologists.
Compressed sensing theory, developed by Candes, Romberg, Tao, and Donoho around 2004 to 2006, showed that sparse signals can be reconstructed from far fewer measurements than traditionally required, provided the reconstruction uses L1 minimization. This has applications in medical imaging (faster MRI scans), radar, and communications.
In text classification and sentiment analysis, the feature space can include millions of n-gram features. L1-regularized logistic regression effectively selects the most discriminative terms while ignoring the vast majority of irrelevant features.
Lasso regression helps economists identify which variables among many candidates genuinely drive outcomes in macroeconomic forecasting, asset pricing, and causal inference under high-dimensional confounding.
L1 regularization is used for image denoising, inpainting, and reconstruction tasks where the underlying signal is sparse in some transform domain (such as wavelet or discrete cosine transform bases).
The Python library scikit-learn provides straightforward implementations of L1-regularized models. The Lasso class implements L1-regularized linear regression:
from sklearn.linear_model import Lasso, LassoCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=0.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit Lasso with a specific alpha (lambda)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Check sparsity: number of non-zero coefficients
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()} out of {len(lasso.coef_)}")
print(f"Test R^2 score: {lasso.score(X_test, y_test):.4f}")
# Use cross-validation to find the optimal alpha
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
For classification, LogisticRegression with penalty='l1' provides L1-regularized logistic regression. The ElasticNet class supports combined L1 and L2 penalties.
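A minimal classification example follows; note that penalty='l1' requires a compatible solver such as 'liblinear' or 'saga', and that scikit-learn's C is the inverse of the regularization strength (smaller C means a stronger penalty).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression on synthetic data.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
clf.fit(X, y)
print(f"Non-zero coefficients: {(clf.coef_ != 0).sum()} out of {clf.coef_.size}")
```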
Imagine you are packing a suitcase for a trip, but your suitcase is small and you cannot take everything. You have to pick only the most important items and leave the rest behind.
L1 regularization works the same way for a computer learning from data. When a computer builds a model to make predictions, it looks at many different pieces of information (called features). Some of those pieces are really useful, but many are not helpful at all. L1 regularization is like a rule that says: "For every piece of information you use, you have to pay a small cost." Because of this cost, the computer learns to completely ignore the useless pieces (setting their importance to exactly zero) and only keeps the truly helpful ones.
This makes the model simpler, easier to understand, and better at making predictions on new data it has never seen before.