# L1 Regularization

> Source: https://aiwiki.ai/wiki/l1_regularization
> Updated: 2026-07-13
> Categories: Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**L1 regularization** is a [regularization](/wiki/regularization) technique in [machine learning](/wiki/machine_learning) and statistics that prevents [overfitting](/wiki/overfitting) by adding the sum of the absolute values of a model's parameters as a penalty term to the [loss function](/wiki/loss_function). It is also called the **Lasso penalty** (Least Absolute Shrinkage and Selection Operator). Unlike [L2 regularization](/wiki/l2_regularization), L1 regularization drives many coefficients to exactly zero, so it performs automatic [feature](/wiki/feature) selection at the same time as it shrinks coefficients. The method was introduced by Robert Tibshirani in 1996, and his original paper has been cited more than 50,000 times, making it one of the most influential papers in modern statistics [1][7].

In his 1996 paper, Tibshirani described the method this way: "The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models." [1]

## Mathematical Formulation

Given a standard supervised learning objective, L1 regularization modifies the loss function by appending a penalty term proportional to the L1 norm (the sum of absolute values) of the model's weight vector. For a model with parameters $$\mathbf{w} = (w_1, w_2, \ldots, w_p)$$, the regularized objective is:

$$
L_{\text{total}}(w) = L_{\text{data}}(w) + \lambda \sum |w_i|
$$

where:

- $$L_{\text{data}}(w)$$ is the original data-fitting loss (for example, mean squared error in [linear regression](/wiki/linear_regression) or cross-entropy in classification).
- $$\lambda$$ (lambda) is the regularization hyperparameter that controls the strength of the penalty. Larger values of $$\lambda$$ impose stronger regularization, shrinking more coefficients toward zero.
- $$\sum |w_i|$$ is the L1 norm of the weight vector, computed as the sum of the absolute values of all model weights.

In the specific case of linear regression, this formulation yields the **Lasso regression** objective:

$$
\text{minimize} \quad \frac{1}{2n} \sum (y_i - x_i^\top w)^2 + \lambda \sum |w_j|
$$

An equivalent constrained formulation expresses the same idea: minimize the residual sum of squares subject to the constraint that $$\sum |w_j| \le t$$, where t is a budget parameter. Smaller values of t correspond to larger values of $$\lambda$$, producing sparser solutions.
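The penalized Lasso objective can be evaluated directly. The following is a minimal NumPy sketch, not a reference implementation; the function name `lasso_objective` and the toy data are assumptions made for illustration.

```python
import numpy as np

def lasso_objective(X, y, w, lam):
    """Return (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1."""
    n = X.shape[0]
    residual = y - X @ w
    data_loss = (residual @ residual) / (2 * n)   # scaled residual sum of squares
    l1_penalty = lam * np.abs(w).sum()            # lambda times the L1 norm
    return data_loss + l1_penalty

# Tiny hand-checkable example: 2 samples, 2 features (hypothetical values).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 0.0])

value = lasso_objective(X, y, w, lam=0.5)
print(value)  # data loss 4 / (2 * 2) = 1.0 plus penalty 0.5 * 1 = 0.5
```

Increasing `lam` raises the cost of every nonzero weight, which is what pushes the minimizer toward sparser weight vectors.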

## Why does L1 regularization produce sparse solutions?

The defining characteristic of L1 regularization is its ability to drive model coefficients to exactly zero, producing sparse weight vectors. This behavior can be understood from several complementary perspectives.

### Geometric Interpretation

The constrained form of L1 regularization restricts the weight vector to lie within a diamond-shaped region (or, in higher dimensions, a cross-polytope). In two dimensions, the set of all points satisfying $$|w_1| + |w_2| \le t$$ forms a diamond (rotated square) with corners on the coordinate axes.

The constrained solution lies at the point where the loss function's contours first touch this constraint region. Because the L1 constraint region has sharp corners that lie directly on the coordinate axes, the point of contact frequently lands at one of these corners. At a corner, one or more of the weights equals exactly zero. By contrast, [L2 regularization](/wiki/l2_regularization) uses a circular (spherical) constraint region with no corners, so the intersection with the loss contours typically occurs at a point where all weights are nonzero [9].

### Gradient-Based Intuition

The L1 penalty contributes a constant magnitude to the gradient of the loss function. For a weight $$w_i > 0$$, the penalty's gradient is $$+\lambda$$; for $$w_i < 0$$, it is $$-\lambda$$. This constant "push" toward zero operates independently of the weight's current magnitude. Even when a weight is very small, the push remains the same size, making it possible for the optimization process to drive the weight all the way to zero.

In contrast, the L2 penalty contributes a gradient of $$2\lambda w_i$$, which shrinks proportionally as the weight approaches zero. The diminishing push under L2 means weights get very small but rarely reach exactly zero.
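The difference between the two pushes can be seen on a single weight with the data loss ignored. This is a toy sketch in plain Python; the helper names are hypothetical, and the L1 step is clipped at zero (an assumption made here so the weight does not overshoot and oscillate around zero).

```python
def l1_step(w, lam, lr):
    """One step on lam * |w|: a constant-size move toward zero, clipped at zero."""
    step = lr * lam
    if abs(w) <= step:
        return 0.0
    return w - step if w > 0 else w + step

def l2_step(w, lam, lr):
    """One step on lam * w**2, whose gradient 2 * lam * w shrinks with w."""
    return w - lr * 2 * lam * w

w_l1 = w_l2 = 1.0
for _ in range(20):
    w_l1 = l1_step(w_l1, lam=1.0, lr=0.1)
    w_l2 = l2_step(w_l2, lam=1.0, lr=0.1)

print("L1 weight after 20 steps:", w_l1)  # reaches exactly 0.0
print("L2 weight after 20 steps:", w_l2)  # small but still positive
```

Under L2 the weight is multiplied by a constant factor at each step, so it decays geometrically and never lands on zero in exact arithmetic.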

### Bayesian Interpretation

From a Bayesian perspective, L1 regularization corresponds to placing a **Laplace (double-exponential) prior** on the model parameters and performing maximum a posteriori (MAP) estimation. The Laplace distribution is sharply peaked at zero and has heavier tails than a Gaussian, encoding a prior belief that most parameters should be zero or near zero while a few may be large. This contrasts with the Gaussian prior implied by L2 regularization, which favors small but nonzero values distributed symmetrically around zero.
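The correspondence can be made explicit with a short derivation sketch. It assumes independent Laplace priors with a common scale $$b$$ and a Gaussian likelihood with noise variance $$\sigma^2$$; both symbols are introduced here for illustration only. Each weight has prior density

$$
p(w_j) = \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right)
$$

so the negative log-posterior is

$$
-\log p(w \mid X, y) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2 + \frac{1}{b} \sum_{j=1}^{p} |w_j| + \text{const}
$$

Multiplying through by $$\sigma^2 / n$$ leaves the minimizer unchanged and gives the Lasso objective with $$\lambda = \sigma^2 / (n b)$$. Under these assumptions, a narrower prior (smaller $$b$$) corresponds to a stronger penalty.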

## Lasso Regression

The most well-known application of L1 regularization is **Lasso regression**, introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso," published in the *Journal of the Royal Statistical Society, Series B*, volume 58, issue 1, pages 267 to 288 [1]. The name "LASSO" stands for Least Absolute Shrinkage and Selection Operator.

Tibshirani's key insight was that the L1 penalty performs both shrinkage (reducing the magnitude of coefficients) and selection (setting some coefficients exactly to zero) simultaneously [1]. As the abstract states, the lasso "enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression" [1]. This dual capability was a significant advance over existing methods:

- **Subset selection** identifies important variables but is computationally expensive (combinatorial search) and can be unstable.
- **Ridge regression** stabilizes estimates through shrinkage but retains all variables in the model.
- **Lasso** combines the interpretability of subset selection with the stability of ridge regression.

The idea of using L1 penalties in regression had appeared earlier in the geophysics literature around 1986, but Tibshirani's work formalized the statistical properties of the method and popularized it across the statistical and machine learning communities. His original paper has been cited over 50,000 times and is considered one of the most influential papers in modern statistics [7].

## How does L1 regularization differ from L2 regularization?

L1 and L2 regularization represent two fundamental approaches to constraining model complexity. The core difference is that L1 produces exact zeros and therefore selects features, while L2 shrinks coefficients smoothly but keeps every feature in the model. Their differences are summarized in the table below.

| Property | L1 Regularization (Lasso) | [L2 Regularization](/wiki/l2_regularization) (Ridge) |
|---|---|---|
| Penalty term | $$\lambda \sum \lvert w_i \rvert$$ | $$\lambda \sum w_i^2$$ |
| Constraint shape | Diamond (cross-polytope) | Sphere (Euclidean ball) |
| Sparsity | Produces exact zeros | Shrinks toward zero but rarely reaches it |
| Feature selection | Built-in (automatic) | Not performed |
| Correlated features | May arbitrarily select one from a group | Distributes weight among correlated features |
| Bayesian prior | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May not be unique when $$p > n$$ | Always unique |
| Typical use case | High-dimensional data with few relevant features | Data where most features contribute |

**Elastic Net** regularization, proposed by Hui Zou and Trevor Hastie in 2005 in the *Journal of the Royal Statistical Society, Series B*, volume 67, issue 2, pages 301 to 320, combines L1 and L2 penalties to address limitations of each method used alone [2]:

$$
L_{\text{elastic}}(w) = L_{\text{data}}(w) + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2
$$

Elastic Net retains the feature selection capability of L1 while gaining the grouping effect of L2, meaning it tends to select or exclude groups of correlated features together rather than arbitrarily picking one [2]. Zou and Hastie describe this grouping effect as the property that "strongly correlated predictors tend to be in or out of the model together" [2]. This makes Elastic Net particularly useful when the number of predictors exceeds the number of observations or when predictors are highly correlated.
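The grouping effect can be illustrated with two identical features. This is a small sketch using scikit-learn's `Lasso` and `ElasticNet` on synthetic data (the data and hyperparameter values are assumptions). Note that scikit-learn parameterizes the penalty with `alpha` and `l1_ratio` rather than $$\lambda_1$$ and $$\lambda_2$$; with its documented objective, $$\lambda_1$$ corresponds to `alpha * l1_ratio` and $$\lambda_2$$ to `0.5 * alpha * (1 - l1_ratio)`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])                 # two perfectly correlated features
y = 3.0 * x + 0.1 * rng.normal(size=200)

# A tight tolerance so both solvers run close to their optimum.
lasso = Lasso(alpha=0.1, tol=1e-8, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10_000).fit(X, y)

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

With the L2 term present the objective is strictly convex, so the two duplicated features receive (nearly) equal weight; the pure Lasso objective has many equally good ways to split the weight, and the solver simply returns one of them.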

## What algorithms solve L1-regularized problems?

The L1 penalty introduces a challenge for optimization: the absolute value function $$|w|$$ is not differentiable at $$w = 0$$. Several specialized algorithms have been developed to handle this.

### Subgradient Methods

The subdifferential of $$|w|$$ at $$w = 0$$ is the interval $$[-1, 1]$$, meaning any value in that range is a valid subgradient. Subgradient descent generalizes [gradient descent](/wiki/gradient_descent) to work with non-differentiable convex functions by replacing the gradient with a subgradient at points of non-differentiability. While simple to implement, subgradient methods converge slowly and do not produce exactly sparse iterates during optimization.
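A minimal NumPy sketch of subgradient descent on the Lasso objective follows. The function name, the diminishing step-size schedule, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, lr=0.1, n_iters=2000):
    """Subgradient descent on (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for k in range(n_iters):
        grad_smooth = -X.T @ (y - X @ w) / n
        # np.sign(0) == 0, which is a valid subgradient of |w| at w = 0.
        subgrad = grad_smooth + lam * np.sign(w)
        w = w - (lr / np.sqrt(k + 1)) * subgrad   # diminishing step size
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -3.0]                          # only two informative features
y = X @ true_w + 0.1 * rng.normal(size=100)

w = lasso_subgradient_descent(X, y, lam=0.1)
print("coefficients:", np.round(w, 4))
print("exact zeros: ", int(np.sum(w == 0.0)))
```

The irrelevant coefficients typically end up small rather than exactly zero, because each update adds or subtracts a fixed-size term instead of thresholding.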

### Coordinate Descent

Coordinate descent optimizes one parameter at a time while holding all others fixed. For L1-regularized problems, each coordinate update has a closed-form solution involving a **soft-thresholding** operation:

$$
w_j = S(z_j, \lambda) = \operatorname{sign}(z_j) \max(|z_j| - \lambda, 0)
$$

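where $$z_j = \frac{1}{n} \sum_i x_{ij} r_i^{(j)}$$ is the scaled inner product of feature $$j$$ with the partial residual $$r^{(j)}$$ (the residual computed with feature $$j$$'s own contribution left out), assuming each feature is standardized so that $$\frac{1}{n} \sum_i x_{ij}^2 = 1$$. This operation produces exact zeros whenever $$|z_j| \le \lambda$$. Coordinate descent is the algorithm underlying popular implementations like the *glmnet* package in R and scikit-learn's Lasso in Python.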
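The update can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code used by *glmnet* or scikit-learn: it runs a fixed number of sweeps, has no intercept or convergence check, and divides by each column's mean square so that unstandardized features are also handled. The function names and data are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # (1 / n) * ||x_j||^2 per column
    residual = y - X @ w
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]      # partial residual: leave feature j out
            z_j = X[:, j] @ residual / n
            w[j] = soft_threshold(z_j, lam) / col_sq[j]
            residual -= X[:, j] * w[j]      # put the updated contribution back
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -3.0]                    # only two informative features
y = X @ true_w + 0.1 * rng.normal(size=100)

w = lasso_coordinate_descent(X, y, lam=0.1)
print("coefficients:", np.round(w, 4))
print("non-zero count:", np.count_nonzero(w))
```

Keeping the residual up to date incrementally makes each coordinate update cost one pass over a single column rather than a full matrix product.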

### Proximal Gradient Descent (ISTA and FISTA)

Proximal gradient methods split the objective into a smooth part (the data loss) and a non-smooth part (the L1 penalty). At each iteration, a standard gradient step is taken on the smooth part, followed by a **proximal operator** that handles the L1 term. For L1 regularization, the proximal operator is the soft-thresholding function.

**ISTA** (Iterative Shrinkage-Thresholding Algorithm) applies this approach directly and achieves a convergence rate of $$O(1/k)$$, where $$k$$ is the number of iterations.

**FISTA** (Fast ISTA), proposed by Amir Beck and Marc Teboulle in 2009 in the *SIAM Journal on Imaging Sciences*, incorporates Nesterov-style momentum to accelerate convergence from $$O(1/k)$$ to $$O(1/k^2)$$, often reducing computation time by orders of magnitude compared to ISTA [5].
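Both methods can be sketched side by side for the Lasso objective. The code below is a minimal NumPy illustration under stated assumptions: a constant step size equal to the reciprocal of the largest eigenvalue of $$X^\top X / n$$, a fixed iteration count, and hypothetical function names and synthetic data.

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def objective(X, y, w, lam):
    r = y - X @ w
    return r @ r / (2 * len(y)) + lam * np.abs(w).sum()

def ista(X, y, lam, n_iters=200):
    n, p = X.shape
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ w) / n                    # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)  # proximal (shrinkage) step
    return w

def fista(X, y, lam, n_iters=200):
    n, p = X.shape
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(p)
    v = w.copy()                                         # extrapolated point
    t = 1.0
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ v) / n
        w_next = soft_threshold(v - step * grad, step * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        v = w_next + ((t - 1.0) / t_next) * (w_next - w)  # momentum term
        w, t = w_next, t_next
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -3.0]
y = X @ true_w + 0.1 * rng.normal(size=100)

lam = 0.1
w_ista = ista(X, y, lam)
w_fista = fista(X, y, lam)
print("ISTA objective: ", objective(X, y, w_ista, lam))
print("FISTA objective:", objective(X, y, w_fista, lam))
```

The only difference between the two loops is that FISTA evaluates the gradient at the extrapolated point `v` rather than at the current iterate.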

### LARS (Least Angle Regression)

The LARS algorithm, published by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in 2004 in *The Annals of Statistics*, volume 32, issue 2, pages 407 to 499, can, with a simple modification, compute the entire Lasso solution path (solutions for all values of $$\lambda$$) at a computational cost of the same order as a single ordinary least squares fit [4]. This makes it efficient for model selection across a range of regularization strengths.
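For experimentation, scikit-learn exposes a path solver as `lars_path`. The sketch below calls it on synthetic data (the data are an assumption for illustration) and inspects the breakpoints of the path.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -3.0]
y = X @ true_w + 0.1 * rng.normal(size=100)

# method="lasso" requests the Lasso-modified path rather than plain LARS.
alphas, active, coefs = lars_path(X, y, method="lasso")

print("number of breakpoints:", len(alphas))
print("active features at the end of the path:", active)
print("coefficients at the largest alpha:", coefs[:, 0])
```

Each column of `coefs` holds the coefficient vector at the corresponding entry of `alphas`; between breakpoints the Lasso solution changes linearly, so these points describe the whole path.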

## L1 Regularization in Neural Networks

In deep learning, L1 regularization can be applied to the weights of [neural networks](/wiki/neural_network) to encourage sparse connectivity. The regularized loss becomes:

$$
L_{\text{total}} = L_{\text{data}} + \lambda \sum_{l} \sum_{i,j} |W_{ij}^{(l)}|
$$

where the double summation runs over all weights across all layers.
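The penalty and one of its subgradients can be computed with a framework-agnostic NumPy sketch. The helper names and the two toy weight matrices are hypothetical; in a real training loop the same quantity would be added to the data loss inside the deep learning framework being used.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lam times the sum of |W_ij| over every weight matrix in the list."""
    return lam * sum(np.abs(W).sum() for W in weights)

def l1_penalty_subgradient(weights, lam):
    """One subgradient of the penalty per matrix: lam * sign(W)."""
    return [lam * np.sign(W) for W in weights]

# Two hypothetical layers of a small network.
W1 = np.array([[0.5, -1.0], [0.0, 2.0]])
W2 = np.array([[-0.25, 0.25]])

penalty = l1_penalty([W1, W2], lam=0.01)
grads = l1_penalty_subgradient([W1, W2], lam=0.01)
print(penalty)      # 0.01 * (3.5 + 0.5)
print(grads[0])
```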

### Weight Pruning

L1 regularization serves as a foundation for network pruning strategies. By training with an L1 penalty, many weights are driven to near-zero values and can then be removed (pruned) to create a smaller, faster network. However, the sparsity produced by standard L1 regularization is typically **unstructured**, meaning individual weights are zeroed out in scattered positions throughout the weight matrices. Unstructured sparsity does not always translate to computational speedups on standard hardware because the remaining nonzero weights do not form regular patterns.
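The pruning step itself amounts to a thresholding pass over the trained weights. A minimal NumPy sketch follows, assuming a hypothetical magnitude threshold of `1e-3` and a toy weight matrix; choosing the threshold in practice is a separate tuning decision.

```python
import numpy as np

def prune_small_weights(W, threshold):
    """Zero every entry with |w| below threshold; return the pruned copy and mask."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

# Hypothetical weights after L1-regularized training: some are near zero.
W = np.array([[0.8, 1e-4, -0.3],
              [-2e-5, 0.0, 1.2]])

W_pruned, mask = prune_small_weights(W, threshold=1e-3)
sparsity = 1.0 - mask.mean()
print(W_pruned)
print("fraction of weights removed:", sparsity)
```

The surviving entries sit at scattered positions, which is the unstructured pattern described above.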

### Structured Sparsity and Group Lasso

To address this limitation, **Group Lasso** regularization (Yuan and Lin, 2006) applies the L1 norm at the group level rather than the individual weight level [3]. Specifically, it penalizes the L2 norms of predefined groups of weights using an L1-style penalty:

$$
\lambda \sum \lVert w_g \rVert_2
$$

where $$w_g$$ represents the weights belonging to group g. This drives entire groups of weights to zero simultaneously, enabling structured pruning of neurons, channels, or attention heads. Wen et al. (2016), in "Learning Structured Sparsity in Deep Neural Networks," applied group Lasso to convolutional neural networks for structured pruning of filters and channels, reporting average speedups of 5.1x on CPU and 3.1x on GPU for the convolutional layers of AlexNet using off-the-shelf libraries [11].
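The way the group penalty zeroes whole groups can be seen from its proximal operator, often called block soft-thresholding. The sketch below is a generic NumPy illustration with hypothetical names and values; it is not the training procedure of Wen et al.

```python
import numpy as np

def group_soft_threshold(w_g, thresh):
    """Proximal operator of thresh * ||w_g||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(w_g)
    if norm <= thresh:
        return np.zeros_like(w_g)        # the whole group is removed at once
    return (1.0 - thresh / norm) * w_g   # otherwise shrink the group's length

weak_group = np.array([0.03, -0.04])     # L2 norm 0.05
strong_group = np.array([3.0, 4.0])      # L2 norm 5.0

weak_out = group_soft_threshold(weak_group, thresh=0.1)
strong_out = group_soft_threshold(strong_group, thresh=0.1)
print(weak_out)
print(strong_out)
```

Because the decision depends only on the group's norm, all weights in a group (for example, one filter or channel) survive or vanish together.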

### Practical Considerations

In practice, L2 regularization (weight decay) is more commonly used as the default regularizer in deep learning because it interacts more smoothly with gradient-based optimizers like Adam and SGD. L1 regularization is typically reserved for situations where explicit sparsity is desired, such as model compression, interpretability analysis, or embedded feature selection within neural architectures.

## Extensions and Variants

Several important extensions of L1 regularization have been developed since Tibshirani's original work.

| Variant | Authors (Year) | Key Idea |
|---|---|---|
| Adaptive Lasso | Zou (2006) [8] | Weights each coefficient's penalty by $$1/\lvert w_j^{\text{initial}} \rvert$$, based on an initial estimate, to achieve oracle properties |
| Group Lasso | Yuan and Lin (2006) | Selects entire groups of variables together |
| Fused Lasso | Tibshirani et al. (2005) | Penalizes differences between successive coefficients for ordered data |
| Elastic Net | Zou and Hastie (2005) | Combines L1 and L2 penalties for correlated features |
| Sparse Group Lasso | Simon et al. (2013) | Combines Group Lasso with element-wise L1 for within-group sparsity |
| Prior Lasso | Jiang et al. (2016) | Incorporates external prior information through pseudo-responses |

## What is L1 regularization used for?

L1 regularization is used across a wide range of fields and problem types, especially when datasets have many more features than samples and only a few features are expected to matter.

### Genomics and Bioinformatics

In genome-wide association studies and gene expression analysis, datasets often contain thousands or tens of thousands of features (genes) but relatively few samples (patients). L1 regularization identifies the small subset of genes most relevant to a disease outcome, making results interpretable for biologists.

### Compressed Sensing and Signal Processing

Compressed sensing theory, developed by Emmanuel Candes, Justin Romberg, Terence Tao, and David Donoho around 2004 to 2006, showed that sparse signals can, under suitable conditions on the measurement process, be reconstructed from far fewer measurements than traditionally required by solving an L1 minimization problem [6]. This has applications in medical imaging (faster MRI scans), radar, and communications.

### Natural Language Processing

In text classification and sentiment analysis, the [feature](/wiki/feature) space can include millions of n-gram features. L1-regularized logistic regression effectively selects the most discriminative terms while ignoring the vast majority of irrelevant features.

### Economics and Finance

Lasso regression helps economists identify which variables among many candidates genuinely drive outcomes in macroeconomic forecasting, asset pricing, and causal inference under high-dimensional confounding.

### Computer Vision

L1 regularization is used for image denoising, inpainting, and reconstruction tasks where the underlying signal is sparse in some transform domain (such as wavelet or discrete cosine transform bases).

## Implementation in Scikit-Learn

The Python library scikit-learn provides straightforward implementations of L1-regularized models [10]. The `Lasso` class implements L1-regularized linear regression:

```python
from sklearn.linear_model import Lasso, LassoCV
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate sample data
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit Lasso with a specific alpha (lambda)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Check sparsity: count the non-zero coefficients
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()} out of {len(lasso.coef_)}")
print(f"Test R^2 score: {lasso.score(X_test, y_test):.4f}")

# Use cross-validation to find the optimal alpha
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train, y_train)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
```

For classification, `LogisticRegression` with `penalty='l1'` provides L1-regularized logistic regression, provided a solver that supports the L1 penalty (such as `liblinear` or `saga`) is selected. The `ElasticNet` class supports combined L1 and L2 penalties.

## Explain Like I'm 5 (ELI5)

Imagine you are packing a suitcase for a trip, but your suitcase is small and you cannot take everything. You have to pick only the most important items and leave the rest behind.

L1 regularization works the same way for a computer learning from data. When a computer builds a model to make predictions, it looks at many different pieces of information (called features). Some of those pieces are really useful, but many are not helpful at all. L1 regularization is like a rule that says: "For every piece of information you use, you have to pay a small cost." Because of this cost, the computer learns to completely ignore the useless pieces (setting their importance to exactly zero) and only keeps the truly helpful ones.

This makes the model simpler, easier to understand, and better at making predictions on new data it has never seen before.

## References

1. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society, Series B*, 58(1), 267-288.
2. Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." *Journal of the Royal Statistical Society, Series B*, 67(2), 301-320.
3. Yuan, M. and Lin, Y. (2006). "Model Selection and Estimation in Regression with Grouped Variables." *Journal of the Royal Statistical Society, Series B*, 68(1), 49-67.
4. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least Angle Regression." *The Annals of Statistics*, 32(2), 407-499.
5. Beck, A. and Teboulle, M. (2009). "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems." *SIAM Journal on Imaging Sciences*, 2(1), 183-202.
6. Candes, E. and Tao, T. (2005). "Decoding by Linear Programming." *IEEE Transactions on Information Theory*, 51(12), 4203-4215.
7. Tibshirani, R. (2011). "Regression Shrinkage and Selection via the Lasso: A Retrospective." *Journal of the Royal Statistical Society, Series B*, 73(3), 273-282.
8. Zou, H. (2006). "The Adaptive Lasso and Its Oracle Properties." *Journal of the American Statistical Association*, 101(476), 1418-1429.
9. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*. 2nd edition. Springer.
10. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
11. Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). "Learning Structured Sparsity in Deep Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 29, 2074-2082.