L1 regularization is a widely used regularization technique in machine learning and statistical modeling that prevents overfitting by adding the sum of the absolute values of a model's parameters as a penalty term to the loss function. Also known as the Lasso penalty (Least Absolute Shrinkage and Selection Operator), L1 regularization encourages sparsity in the learned parameters, effectively driving many coefficients to exactly zero. This property makes it a powerful tool for automatic feature selection, model interpretability, and working with high-dimensional datasets.
Given a standard supervised learning objective, L1 regularization modifies the loss function by appending a penalty term proportional to the L1 norm (the sum of absolute values) of the model's weight vector. For a model with parameters w = (w_1, w_2, ..., w_p), the regularized objective is:
L_total(w) = L_data(w) + λ ∑|w_i|
where L_data(w) is the original (unregularized) data loss, λ ≥ 0 is a hyperparameter controlling the strength of the penalty, and the sum runs over all model parameters w_i.
In the specific case of linear regression, this formulation yields the Lasso regression objective:
minimize (1/2n) ∑(y_i - x_i^T w)^2 + λ ∑|w_j|
An equivalent constrained formulation expresses the same idea: minimize the residual sum of squares subject to the constraint that ∑|w_j| ≤ t, where t is a budget parameter. Smaller values of t correspond to larger values of λ, producing sparser solutions.
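As a concrete reference point, the short NumPy sketch below evaluates this penalized objective for a given weight vector; the names `lasso_objective`, `X`, `y`, `w`, and `lam` are illustrative only, not taken from any particular library.

```python
import numpy as np

# Minimal sketch of the penalized Lasso objective described above.
def lasso_objective(X, y, w, lam):
    n = len(y)
    data_loss = 0.5 / n * np.sum((y - X @ w) ** 2)  # (1/2n) * residual sum of squares
    penalty = lam * np.sum(np.abs(w))               # lambda * sum of |w_j|
    return data_loss + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.0])       # a sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)
print(lasso_objective(X, y, np.zeros(5), lam=0.1))
```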
The defining characteristic of L1 regularization is its ability to drive model coefficients to exactly zero, producing sparse weight vectors. This behavior arises from several complementary perspectives.
The constrained form of L1 regularization restricts the weight vector to lie within a diamond-shaped region (or, in higher dimensions, a cross-polytope). In two dimensions, the set of all points satisfying |w_1| + |w_2| ≤ t forms a diamond (rotated square) with corners on the coordinate axes.
During optimization, the algorithm seeks the point where the loss function's contours first touch this constraint region. Because the L1 constraint region has sharp corners that lie directly on the coordinate axes, the point of contact frequently lands at one of these corners. At a corner, one or more of the weights equals exactly zero. By contrast, L2 regularization uses a circular (spherical) constraint region with no corners, so the intersection with the loss contours typically occurs at a point where all weights are nonzero.
The L1 penalty contributes a term of constant magnitude to the gradient of the regularized objective. For a weight w_i > 0, the penalty's gradient is +λ; for w_i < 0, it is -λ. This constant "push" toward zero operates independently of the weight's current magnitude: even when a weight is very small, the push stays the same size, so the optimization process can drive the weight all the way to zero.
In contrast, the L2 penalty contributes a gradient of 2λw_i, which shrinks proportionally as the weight approaches zero. The diminishing push under L2 means weights get very small but rarely reach exactly zero.
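The difference is easy to see numerically; the small sketch below (plain NumPy, illustrative variable names) compares the two penalty gradients at progressively smaller weights.

```python
import numpy as np

# Gradient contribution of each penalty at progressively smaller weights.
# The L1 push has constant size lambda; the L2 push vanishes as w -> 0.
lam = 0.1
w = np.array([1.0, 0.1, 0.01, 0.001])
l1_push = lam * np.sign(w)   # [0.1, 0.1, 0.1, 0.1]
l2_push = 2 * lam * w        # [0.2, 0.02, 0.002, 0.0002]
print(l1_push, l2_push)
```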
From a Bayesian perspective, L1 regularization corresponds to placing a Laplace (double-exponential) prior on the model parameters and performing maximum a posteriori (MAP) estimation. The Laplace distribution is sharply peaked at zero with heavy tails, encoding a prior belief that most parameters should be zero or near zero while a few may be large. This contrasts with the Gaussian prior implied by L2 regularization, which favors small but nonzero values distributed symmetrically around zero.
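This correspondence can be checked numerically: up to an additive constant, the negative log-density of a zero-mean Laplace prior with scale b is an L1 penalty with λ = 1/b. The sketch below assumes SciPy is available; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import laplace

# For a zero-mean Laplace prior with scale b,
#   -log p(w) = |w| / b + log(2b),
# so MAP estimation adds an L1 penalty with lambda = 1/b (plus a constant).
b = 0.5
w = np.linspace(-3, 3, 7)
neg_log_prior = -laplace.logpdf(w, loc=0.0, scale=b)
l1_form = np.abs(w) / b + np.log(2 * b)
print(np.allclose(neg_log_prior, l1_form))  # True
```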
The most well-known application of L1 regularization is Lasso regression, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso," published in the Journal of the Royal Statistical Society, Series B. The name "LASSO" stands for Least Absolute Shrinkage and Selection Operator.
Tibshirani's key insight was that the L1 penalty performs both shrinkage (reducing the magnitude of coefficients) and selection (setting some coefficients exactly to zero) simultaneously. This dual capability was a significant advance over the existing alternatives: ridge regression shrinks coefficients but never sets them exactly to zero, while best-subset selection picks variables but is discrete, unstable, and computationally expensive for more than a handful of predictors.
The idea of using L1 penalties in regression had appeared earlier in geophysics literature around 1986, but Tibshirani's work formalized the statistical properties of the method and popularized it for broad use across the statistical and machine learning communities. His original paper has been cited over 50,000 times and is considered one of the most influential papers in modern statistics.
L1 and L2 regularization represent two fundamental approaches to constraining model complexity. Their differences are summarized in the table below.
| Property | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
|---|---|---|
| Penalty term | λ ∑|w_i| | λ ∑w_i^2 |
| Constraint shape | Diamond (cross-polytope) | Sphere (Euclidean ball) |
| Sparsity | Produces exact zeros | Shrinks toward zero but rarely reaches it |
| Feature selection | Built-in (automatic) | Not performed |
| Correlated features | May arbitrarily select one from a group | Distributes weight among correlated features |
| Bayesian prior | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May not be unique when p > n | Always unique |
| Typical use case | High-dimensional data with few relevant features | Data where most features contribute |
Elastic Net regularization, proposed by Zou and Hastie in 2005, combines L1 and L2 penalties to address limitations of each method used alone:
L_elastic(w) = L_data(w) + λ_1 ∑|w_i| + λ_2 ∑w_i^2
Elastic Net retains the feature selection capability of L1 while gaining the grouping effect of L2, meaning it tends to select or exclude groups of correlated features together rather than arbitrarily picking one. This makes Elastic Net particularly useful when the number of predictors exceeds the number of observations or when predictors are highly correlated.
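In scikit-learn, the combined penalty is exposed through the ElasticNet class, which parameterizes λ_1 and λ_2 via alpha (the overall strength) and l1_ratio (the L1/L2 mix); the sketch below is a minimal usage example on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Minimal Elastic Net example; alpha sets the overall penalty strength,
# l1_ratio the mix between the L1 and L2 terms.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=0.5, random_state=0)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(f"Non-zero coefficients: {(enet.coef_ != 0).sum()} out of {len(enet.coef_)}")
```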
The L1 penalty introduces a challenge for optimization: the absolute value function |w| is not differentiable at w = 0. Several specialized algorithms have been developed to handle this.
The subdifferential of |w| at w = 0 is the interval [-1, 1], meaning any value in that range is a valid subgradient. Subgradient descent generalizes gradient descent to work with non-differentiable convex functions by replacing the gradient with a subgradient at points of non-differentiability. While simple to implement, subgradient methods converge slowly and do not produce exactly sparse iterates during optimization.
Coordinate descent optimizes one parameter at a time while holding all others fixed. For L1-regularized problems, each coordinate update has a closed-form solution involving a soft-thresholding operation:
w_j = S(z_j, λ) = sign(z_j) max(|z_j| - λ, 0)
where z_j is the partial residual. This operation naturally produces exact zeros whenever |z_j| ≤ λ. Coordinate descent is the algorithm underlying popular implementations like the glmnet package in R and scikit-learn's Lasso in Python.
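A toy implementation makes the mechanism concrete. The sketch below assumes the columns of X are standardized so that each column's mean squared value is 1 (consistent with the (1/2n) formulation above); it illustrates the update rather than replacing optimized libraries such as glmnet.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) from the update above."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Toy coordinate descent for (1/2n)||y - Xw||^2 + lam*||w||_1,
    assuming each column of X satisfies (1/n) * sum(x_ij^2) = 1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual excluding feature j
            z_j = X[:, j] @ r_j / n            # coordinate-wise least-squares fit
            w[j] = soft_threshold(z_j, lam)    # exact zero whenever |z_j| <= lam
    return w
```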
Proximal gradient methods split the objective into a smooth part (the data loss) and a non-smooth part (the L1 penalty). At each iteration, a standard gradient step is taken on the smooth part, followed by a proximal operator that handles the L1 term. For L1 regularization, the proximal operator is the soft-thresholding function.
ISTA (Iterative Shrinkage-Thresholding Algorithm) applies this approach directly and achieves a convergence rate of O(1/k), where k is the number of iterations.
FISTA (Fast ISTA), proposed by Beck and Teboulle in 2009, incorporates Nesterov-style momentum to accelerate convergence to O(1/k^2), often reducing computation time by orders of magnitude compared to ISTA.
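A bare-bones ISTA loop looks like the sketch below (plain NumPy, illustrative names): each iteration takes a gradient step on the data loss, then applies soft-thresholding as the proximal step.

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    """Toy ISTA for (1/2n)||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n               # gradient step on the data loss
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # proximal (soft-threshold) step
    return w
```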
The LARS algorithm, developed by Efron, Hastie, Johnstone, and Tibshirani in 2004, computes the entire Lasso solution path (solutions for all values of λ) with essentially the same computational cost as a single ordinary least squares fit. This makes it efficient for model selection across a range of regularization strengths.
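scikit-learn exposes this through lars_path; the sketch below (synthetic data, illustrative names) computes the entire Lasso path using method='lasso'.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Compute the full Lasso solution path with the LARS algorithm.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=0.5, random_state=0)
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)  # (n_features, n_breakpoints): one coefficient vector per path breakpoint
```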
In deep learning, L1 regularization can be applied to the weights of neural networks to encourage sparse connectivity. The regularized loss becomes:
L_total = L_data + λ ∑ ∑ |W_ij^(l)|
where the double summation runs over all weights across all layers.
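In a framework such as PyTorch, this is typically implemented by summing the absolute values of the weight tensors and adding the result to the training loss; the sketch below uses an arbitrary small network purely for illustration, not as a prescription for any particular architecture.

```python
import torch
import torch.nn as nn

# Minimal example: add an L1 penalty over all layer weights (biases skipped)
# to the data loss before backpropagation.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.MSELoss()
lam = 1e-4

x, y = torch.randn(32, 20), torch.randn(32, 1)
data_loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
loss = data_loss + lam * l1_penalty
loss.backward()
```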
L1 regularization serves as a foundation for network pruning strategies. By training with an L1 penalty, many weights are driven to near-zero values and can then be removed (pruned) to create a smaller, faster network. However, the sparsity produced by standard L1 regularization is typically unstructured, meaning individual weights are zeroed out in scattered positions throughout the weight matrices. Unstructured sparsity does not always translate to computational speedups on standard hardware because the remaining nonzero weights do not form regular patterns.
To address this limitation, Group Lasso regularization (Yuan and Lin, 2006) applies the L1 norm at the group level rather than the individual weight level. Specifically, it penalizes the L2 norms of predefined groups of weights using an L1-style penalty:
λ ∑ ||w_g||_2
where w_g represents the weights belonging to group g. This drives entire groups of weights to zero simultaneously, enabling structured pruning of neurons, channels, or attention heads. Wen et al. (2016) applied group Lasso to convolutional neural networks for structured pruning of filters, achieving meaningful speedups on GPU hardware.
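As a small illustration, the sketch below computes the group Lasso penalty for a weight matrix whose rows are treated as groups (so zeroing a group removes an entire output neuron); the grouping and variable names are illustrative.

```python
import numpy as np

def group_lasso_penalty(W, lam):
    """L1-style sum over the L2 norms of predefined groups (here, rows of W)."""
    group_norms = np.linalg.norm(W, axis=1)   # one L2 norm per row/group
    return lam * group_norms.sum()

W = np.random.randn(8, 16)                    # 8 output neurons, 16 inputs
print(group_lasso_penalty(W, lam=0.01))
```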
In practice, L2 regularization (weight decay) is more commonly used as the default regularizer in deep learning because it interacts more smoothly with gradient-based optimizers like Adam and SGD. L1 regularization is typically reserved for situations where explicit sparsity is desired, such as model compression, interpretability analysis, or embedded feature selection within neural architectures.
Several important extensions of L1 regularization have been developed since Tibshirani's original work.
| Variant | Authors (Year) | Key Idea |
|---|---|---|
| Adaptive Lasso | Zou (2006) | Uses weighted penalties (1/|w_j^initial|) to achieve oracle properties |
| Group Lasso | Yuan and Lin (2006) | Selects entire groups of variables together |
| Fused Lasso | Tibshirani et al. (2005) | Penalizes differences between successive coefficients for ordered data |
| Elastic Net | Zou and Hastie (2005) | Combines L1 and L2 penalties for correlated features |
| Sparse Group Lasso | Simon et al. (2013) | Combines Group Lasso with element-wise L1 for within-group sparsity |
| Prior Lasso | Jiang et al. (2016) | Incorporates external prior information through pseudo-responses |
L1 regularization is used across a wide range of fields and problem types.
In genome-wide association studies and gene expression analysis, datasets often contain thousands or tens of thousands of features (genes) but relatively few samples (patients). L1 regularization identifies the small subset of genes most relevant to a disease outcome, making results interpretable for biologists.
Compressed sensing theory, developed by Candes, Romberg, Tao, and Donoho around 2004 to 2006, showed that sparse signals can be reconstructed from far fewer measurements than traditionally required, provided the reconstruction uses L1 minimization. This has applications in medical imaging (faster MRI scans), radar, and communications.
In text classification and sentiment analysis, the feature space can include millions of n-gram features. L1-regularized logistic regression effectively selects the most discriminative terms while ignoring the vast majority of irrelevant features.
Lasso regression helps economists identify which variables among many candidates genuinely drive outcomes in macroeconomic forecasting, asset pricing, and causal inference under high-dimensional confounding.
L1 regularization is used for image denoising, inpainting, and reconstruction tasks where the underlying signal is sparse in some transform domain (such as wavelet or discrete cosine transform bases).
The Python library scikit-learn provides straightforward implementations of L1-regularized models. The Lasso class implements L1-regularized linear regression:
from sklearn.linear_model import Lasso, LassoCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=0.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit Lasso with a specific alpha (lambda)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Check sparsity: number of non-zero coefficients
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()} out of {len(lasso.coef_)}")
print(f"Test R^2 score: {lasso.score(X_test, y_test):.4f}")
# Use cross-validation to find the optimal alpha
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
For classification, LogisticRegression with penalty='l1' provides L1-regularized logistic regression. The ElasticNet class supports combined L1 and L2 penalties.
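A minimal classification example follows; note that penalty='l1' requires a compatible solver such as 'liblinear' or 'saga', and that scikit-learn's C is the inverse of the regularization strength (smaller C means a stronger penalty).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression on synthetic data.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
clf.fit(X, y)
print(f"Non-zero coefficients: {(clf.coef_ != 0).sum()} out of {clf.coef_.size}")
```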
Imagine you are packing a suitcase for a trip, but your suitcase is small and you cannot take everything. You have to pick only the most important items and leave the rest behind.
L1 regularization works the same way for a computer learning from data. When a computer builds a model to make predictions, it looks at many different pieces of information (called features). Some of those pieces are really useful, but many are not helpful at all. L1 regularization is like a rule that says: "For every piece of information you use, you have to pay a small cost." Because of this cost, the computer learns to completely ignore the useless pieces (setting their importance to exactly zero) and only keeps the truly helpful ones.
This makes the model simpler, easier to understand, and better at making predictions on new data it has never seen before.