Lasso regression (an acronym for Least Absolute Shrinkage and Selection Operator) is a linear regression technique that adds an L1 penalty to the ordinary least squares objective. The penalty pulls coefficients toward zero and, because of the geometry of the L1 norm, drives many of them to exactly zero. The result is a sparse model that simultaneously estimates regression coefficients and performs automatic feature selection, which makes lasso one of the most widely used tools in modern statistics and machine learning.
The method was introduced by Robert Tibshirani in a 1996 paper in the Journal of the Royal Statistical Society, Series B. It has since become a cornerstone of regularization, inspiring a large family of variants and serving as the basis for routine analysis in genomics, signal processing, finance, and text mining, where the number of candidate predictors is large relative to the sample size.
Imagine you are baking a cake and someone hands you a tray of fifty different ingredients. Some, like flour and sugar, really matter. Others, like a sprinkle of paprika, would just make a mess. You want a recipe card that lists only the ingredients that actually help. Lasso regression does that with numbers: it starts with a long list of possible ingredients (the features), and instead of just shrinking each one a little, it sets the unhelpful ones to zero so they fall right off the recipe. What is left is a short, clean list that still bakes a pretty good cake.
Before the lasso, statisticians had two main ways to control overly flexible regression models. Best subset selection tries different combinations of predictors and keeps the one with the best fit under a penalty for model size. It produces interpretable models but is computationally expensive and unstable, because tiny changes in the data can switch the chosen subset. Ridge regression, introduced by Arthur Hoerl and Robert Kennard in 1970, adds a squared penalty on the coefficients. Ridge keeps every variable in the model and shrinks all of them smoothly toward zero, but never produces a truly sparse model.
In 1996 Robert Tibshirani proposed a third option in "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B. He combined the convexity of ridge with the sparsity of subset selection by replacing the squared penalty with a sum of absolute values. The resulting estimator was convex (so it could be solved efficiently) and capable of producing exact zeros (so it acted as a feature selector). Tibshirani drew inspiration from Leo Breiman's earlier work on the non-negative garrote.
The years that followed saw an explosion of activity. Bradley Efron, Trevor Hastie, Iain Johnstone, and Tibshirani published Least Angle Regression in 2004, which computes the entire lasso solution path in roughly the time of a single least squares fit. Hui Zou and Hastie introduced the elastic net in 2005 to address lasso's instability with correlated predictors. Jerome Friedman, Hastie, and Tibshirani published glmnet in 2010, popularizing coordinate descent as the workhorse method for fitting lasso, ridge, and elastic net models on very large datasets. By the late 2000s the lasso had become standard in genomics, and by the 2010s it was a default tool in essentially every statistical software package.
Given a data matrix X of size n by p (each row an observation, each column a feature) and a response vector y of length n, ordinary least squares chooses coefficients beta to minimize the sum of squared residuals. Lasso adds a penalty proportional to the L1 norm of beta:
minimize over beta: (1 / (2n)) * sum_i (y_i - x_i^T beta)^2 + lambda * sum_j |beta_j|
Here lambda is a non-negative tuning parameter that controls the strength of the penalty. When lambda is zero, the lasso reduces to ordinary least squares. As lambda grows, more coefficients are shrunk toward zero, and beyond a certain threshold individual coefficients become exactly zero. When lambda is large enough, every coefficient is zero and the model predicts only the intercept (which is typically not penalized).
An equivalent constrained formulation makes the geometry explicit:
minimize over beta: sum_i (y_i - x_i^T beta)^2 subject to sum_j |beta_j| <= t
For each lambda there is a corresponding budget t, and the two formulations trace the same path of solutions. The constrained form shows the lasso as ordinary least squares restricted to an L1 ball of radius t.
Because the L1 penalty depends on feature scale, the columns of X are almost always standardized to unit variance and zero mean before fitting. Without standardization, features in large units (income in dollars) would receive much smaller coefficients than features in small units (proportions), and the penalty would shrink them unequally. The response is often centered as well so that the intercept can be handled separately.
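As a concrete illustration, here is a minimal scikit-learn sketch (synthetic data from make_regression, arbitrary penalty values) that standardizes the features and fits the lasso at several penalty strengths; scikit-learn's alpha parameter plays the role of lambda in the objective above.

```python
# Minimal sketch: standardize, then fit the lasso at a few penalty strengths.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 observations, 50 features, only 5 truly informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    model.fit(X, y)
    coef = model.named_steps["lasso"].coef_
    print(f"alpha={alpha}: {np.sum(coef != 0)} non-zero coefficients")
```

As alpha grows, the printed count of non-zero coefficients shrinks toward zero, which is exactly the behavior described above.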
The L1 penalty is non-differentiable at zero. Its gradient is the sign of the coefficient, which produces a constant pull toward zero whose magnitude does not depend on the size of the coefficient. As lambda increases, that pull eventually overcomes the residual gradient, and the coefficient snaps to zero. By contrast, the L2 penalty in ridge regression has a gradient proportional to the coefficient itself, so the pull weakens as the coefficient shrinks and the coefficient approaches zero only in the limit.
When the columns of X are orthonormal, the lasso solution has a closed form. For each coefficient j, the lasso estimate equals the soft thresholding of the corresponding least squares estimate:
beta_j_hat = sign(beta_j_OLS) * max(|beta_j_OLS| - lambda, 0)
This operator subtracts lambda from the magnitude of the OLS estimate and clamps the result at zero. Soft thresholding is the basic building block of nearly every lasso algorithm and appears as the proximal operator of the L1 norm in optimization theory.
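A minimal NumPy version of the operator (the function name and example values are ours) makes the behavior concrete:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink |z| by lam and clamp at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With an orthonormal design, the lasso solution is just the OLS estimates
# passed through this operator.
beta_ols = np.array([3.0, -0.4, 1.2, 0.05])
print(soft_threshold(beta_ols, lam=0.5))   # [ 2.5 -0.   0.7  0. ]
```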
The contrast between lasso and ridge becomes vivid in two dimensions. With just two coefficients, beta_1 and beta_2, the contours of the squared error loss form ellipses centered on the OLS estimate. Ridge confines the solution to a circular disk; lasso confines it to a diamond, a square rotated 45 degrees so that its corners sit on the coordinate axes (in higher dimensions, an L1 ball with sharp corners aligned with the axes).
When the loss ellipse expands and first touches the constraint region, the touching point is the solution. For the ridge disk, the boundary is smooth, so the contact point is almost never on an axis and both coefficients remain non-zero. For the lasso diamond, the corners stick out along the axes, so the contact point very often lands on a corner where one coefficient is exactly zero. As the dimension grows, the L1 ball has more corners and edges aligned with low-dimensional faces, which is why even moderate values of lambda can drive most coefficients to zero in high-dimensional problems. This sharp geometry underlies the lasso's distinctive properties: variable selection, its role in compressed sensing, and occasional instability when two columns of X point in nearly the same direction.
The three most common penalized linear regressions form a closely related family. They differ only in the shape of the penalty, but those small differences produce very different behavior.
| Property | Ridge | Lasso | Elastic Net |
|---|---|---|---|
| Penalty term | lambda * sum beta_j^2 | lambda * sum \|beta_j\| | lambda * (alpha * sum \|beta_j\| + (1 - alpha) * sum beta_j^2) |
| Constraint shape | Hypersphere | L1 polytope (diamond in 2D) | Smoothed polytope |
| Produces exact zeros | No | Yes | Yes (for alpha > 0) |
| Correlated predictors | Spreads weight across the group | Picks one, drops the others | Tends to keep or drop the group together |
| Closed-form solution | Yes | No (iterative) | No (iterative) |
| Handles p > n | Yes | Yes (at most n features) | Yes |
| Motivation | Stabilization, multicollinearity | Sparsity, interpretability | Both at once |
| Year introduced | 1970 (Hoerl and Kennard) | 1996 (Tibshirani) | 2005 (Zou and Hastie) |
When the true model is sparse and predictors are roughly independent, lasso is hard to beat. When predictors form correlated groups (genes in the same biological pathway, words that frequently co-occur), lasso tends to pick a single representative; elastic net is then the better choice because the L2 component encourages correlated predictors to enter or exit together.
Unlike ridge regression, the lasso does not have a closed-form solution in general, because the L1 penalty is non-differentiable at zero. Several algorithms have been developed to solve it efficiently, and the choice matters when n or p is large.
The dominant solver in modern software is cyclical coordinate descent, popularized for the lasso by Jerome Friedman, Trevor Hastie, and Robert Tibshirani in their 2010 paper introducing glmnet. The algorithm cycles through the coefficients one at a time, each time solving a one-dimensional lasso problem (a single soft thresholding step) with the others held fixed. Glmnet computes the lasso along the entire regularization path, starting at the smallest lambda for which all coefficients are zero and decreasing step by step, warm-starting each new lambda from the previous solution. Combined with active-set strategies that ignore coefficients still at zero, this makes coordinate descent dramatically faster than fitting a single lambda from scratch. Glmnet handles linear, logistic, multinomial, Poisson, and Cox models in the same framework.
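The core update is easy to write down. The sketch below is a bare-bones cyclical coordinate descent for the objective above, assuming standardized columns; it omits the warm starts, active sets, and screening rules that make glmnet fast, and the function name and iteration count are illustrative.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclical coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1.
    Assumes the columns of X are standardized (mean 0, variance 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta          # full residual, kept up to date
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual correlation: add back feature j's own contribution.
            rho = X[:, j] @ (residual + X[:, j] * beta[j]) / n
            new_bj = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft-threshold step
            residual += X[:, j] * (beta[j] - new_bj)           # keep residual current
            beta[j] = new_bj
    return beta
```

Each inner step is exactly the one-dimensional soft-thresholding update described above; maintaining the residual incrementally is what keeps the cost per sweep low.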
Least Angle Regression, introduced by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in the 2004 Annals of Statistics paper, builds the lasso path one variable at a time. Starting from all-zero coefficients, LARS finds the predictor most correlated with the response and moves the corresponding coefficient in that direction. As soon as another predictor becomes equally correlated with the residual, both coefficients move together, and so on. With a small modification, LARS produces the exact piecewise-linear lasso solution path. Its total cost is of the same order as a single ordinary least squares fit on the full set of predictors. LARS underlies the LassoLars and LassoLarsCV implementations in scikit-learn.
Proximal gradient methods treat the lasso as a smooth term (squared error) plus a non-smooth penalty (L1 norm). At each iteration the algorithm takes a gradient step on the smooth term and applies the proximal operator of the penalty, which for the L1 norm is exactly soft thresholding. The base algorithm is ISTA (Iterative Shrinkage Thresholding Algorithm), converging at rate O(1 / k). Amir Beck and Marc Teboulle introduced FISTA in 2009, which uses Nesterov acceleration to achieve rate O(1 / k^2). FISTA is widely used in image processing and signal recovery, where the design matrix is often implicit. The lasso can also be solved by ADMM, quadratic programming, or interior-point methods, which appear in distributed computing and theoretical research.
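A bare-bones ISTA iteration, written for the same (1/(2n)) scaling used above, looks like this; the step size comes from the spectral norm of X, and the function name and iteration count are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    """ISTA for (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(X, ord=2) ** 2 / n
    step = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n          # gradient of the smooth term
        z = beta - step * grad                   # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox (soft threshold)
    return beta
```

FISTA adds a momentum term between iterations but keeps the same gradient-then-threshold structure.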
| Solver | Best when | Cost per iteration | Produces full path | Notes |
|---|---|---|---|---|
| Coordinate descent (glmnet) | n and p both large, especially with sparse X | Very low (one feature at a time) | Yes (with warm starts) | Default in R glmnet, scikit-learn Lasso, statsmodels |
| LARS | Want exact piecewise-linear path, p modest | Comparable to one OLS fit | Yes (knot by knot) | Sensitive to numerical precision when columns are nearly collinear |
| ISTA / FISTA | Implicit X (e.g., Fourier matrices), very large dense problems | One matrix-vector product | Often run at fixed lambda | Easy to parallelize, common in signal processing |
| ADMM | Distributed data, extra constraints | One matrix solve plus prox step | At fixed lambda | Popular in distributed and consensus formulations |
| Quadratic programming | Small problems with extra linear constraints | Polynomial in size | At fixed lambda | Mostly of historical and theoretical interest |
The value of lambda controls how many features survive and how much each is shrunk. Picking it well is one of the most important practical decisions when fitting a lasso model.
Cross-validation. The most common strategy is k-fold cross-validation. The data are split into k folds (typically 5 or 10), and for each candidate lambda the model is trained on k - 1 folds and evaluated on the held-out fold. The lambda that minimizes mean cross-validated error is called lambda_min. Many practitioners instead use lambda_1se, the largest lambda whose cross-validated error is within one standard error of the minimum, which produces a more parsimonious model with similar accuracy. In glmnet and scikit-learn's LassoCV, cross-validation fits the entire regularization path within each fold, warm-starting each lambda from the previous one, which makes it dramatically faster than refitting each lambda from scratch.
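In scikit-learn this takes only a few lines. The sketch below uses synthetic data and default settings; LassoCV reports the error-minimizing alpha, the analogue of lambda_min.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# 10-fold CV over an automatically generated grid of 100 penalty values.
cv_model = make_pipeline(StandardScaler(), LassoCV(cv=10, n_alphas=100))
cv_model.fit(X, y)

lasso = cv_model.named_steps["lassocv"]
print("chosen alpha (lambda_min analogue):", lasso.alpha_)
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
```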
Information criteria. When the noise variance can be reasonably estimated, AIC and BIC are valid criteria. They are computed from a single fit and require no resampling. BIC penalizes complexity more heavily than AIC and is consistent for variable selection in the classical setting where p is fixed; AIC is asymptotically loss-efficient for prediction. Both rely on the degrees of freedom of the lasso, which Zou, Hastie, and Tibshirani showed equals the number of non-zero coefficients in expectation. Information criteria can break down when p > n because the noise variance is hard to estimate, so cross-validation is usually preferred there.
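scikit-learn exposes this idea through LassoLarsIC, which fits a single LARS path and scores each knot with AIC or BIC; the sketch below uses synthetic data with n larger than p so the noise variance can be estimated.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Single fit along the LARS path; the criterion picks the stopping point.
for criterion in ["aic", "bic"]:
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    print(criterion, "alpha:", round(float(model.alpha_), 4),
          "non-zeros:", (model.coef_ != 0).sum())
```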
Stability selection. Nicolai Meinshausen and Peter Bühlmann introduced stability selection in 2010. The procedure subsamples the data many times, fits the lasso on each subsample, and records how often each variable enters the active set. Variables selected in a high fraction of subsamples are considered stable. Stability selection provides finite-sample error control on the number of false positives and is widely used in genomics.
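The core of the procedure is a simple subsampling loop. The sketch below (synthetic data, arbitrary penalty and threshold, and without the randomized-weights refinements or formal error-control bound of the published method) just counts how often each feature survives.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

rng = np.random.default_rng(0)
n, p = X.shape
counts = np.zeros(p)
n_subsamples = 100

for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # half-sample without replacement
    coef = Lasso(alpha=0.5).fit(X[idx], y[idx]).coef_
    counts += (coef != 0)

frequencies = counts / n_subsamples
stable = np.where(frequencies >= 0.8)[0]              # selection-frequency threshold
print("stable features:", stable)
```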
The lasso is unbiased only when lambda is zero. For positive lambda the estimator is biased toward zero, the price paid for variance reduction and sparsity. Asymptotic theory by Knight and Fu (2000), Greenshtein and Ritov (2004), and others has shown that the lasso is risk-consistent in high dimensions: its prediction error converges to the optimal rate even when p grows with n.
Variable selection consistency is subtler. The lasso recovers the true sparsity pattern only under what Peng Zhao and Bin Yu in 2006 called the Irrepresentable Condition, a constraint on the correlation between relevant and irrelevant predictors. When the condition fails, the lasso may include irrelevant variables or miss relevant ones, no matter how large n becomes. Zou's adaptive lasso modifies the penalty using initial coefficient estimates so that selection consistency holds under weaker conditions. In the Bayesian view, the lasso estimator is the posterior mode of a regression model with a Laplace (double exponential) prior on each coefficient, which places more mass at zero than a Gaussian prior and gives a probabilistic explanation for the sparsity-inducing behavior.
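The adaptive lasso is easy to emulate with a standard lasso solver, because a weighted L1 penalty is equivalent to rescaling the columns by the inverse weights before fitting and rescaling the coefficients afterwards. A minimal sketch, using a ridge fit as the pilot estimate and arbitrary penalty values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: pilot estimate (ridge is a common choice when p is large).
beta_init = Ridge(alpha=1.0).fit(X, y).coef_
weights = 1.0 / (np.abs(beta_init) + 1e-6)      # penalty weights w_j = 1 / |beta_init_j|

# Step 2: rescale columns so a plain lasso applies the weighted penalty,
# then undo the scaling on the returned coefficients.
X_scaled = X / weights
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
beta_adaptive = lasso.coef_ / weights
print("non-zero coefficients:", (beta_adaptive != 0).sum())
```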
The lasso has spawned a large family of related estimators that adapt the basic idea to different data structures or fix specific weaknesses.
| Variant | Reference | Idea | Typical use |
|---|---|---|---|
| Elastic Net | Zou and Hastie 2005 | Mix of L1 and L2 penalties | Correlated predictors |
| Adaptive Lasso | Zou 2006 | Reweight L1 penalty using preliminary estimates | Consistent variable selection |
| Group Lasso | Yuan and Lin 2006 | Penalty acts on groups of coefficients jointly | Categorical variables, multitask learning |
| Sparse Group Lasso | Simon, Friedman, Hastie, Tibshirani 2013 | Mix of group and single-coefficient L1 penalties | Sparsity within and across groups |
| Fused Lasso | Tibshirani, Saunders, Rosset, Zhu, Knight 2005 | Penalize differences between consecutive coefficients | Ordered features such as genomic positions |
| Generalized Lasso | Tibshirani and Taylor 2011 | Penalize a linear transform of the coefficients | Trend filtering, image denoising |
| Graphical Lasso | Friedman, Hastie, Tibshirani 2008 | L1 penalty on the inverse covariance matrix | Sparse Gaussian graphical models |
| Square-root Lasso | Belloni, Chernozhukov, Wang 2011 | Removes the need to estimate noise variance for tuning | Robust theoretical guarantees |
| Bayesian Lasso | Park and Casella 2008 | Full Bayesian treatment with Laplace prior | Uncertainty quantification |
The lasso also extends beyond linear regression. The same L1 penalty is added to logistic regression for sparse classification, to Cox proportional hazards models for survival analysis, to Poisson regression for count data, and to deep neural networks as a regularizer that encourages weight sparsity.
High-dimensional problems with p greater than n. The defining application of the lasso is the regime where the number of candidate predictors is much larger than the number of observations. Classical least squares breaks down here because the system has more unknowns than equations and admits infinitely many solutions that fit the training data perfectly. The lasso, by selecting at most n variables out of p, returns a sparse, usable model. This is the main reason it became the dominant tool for analyzing genome-wide association studies, RNA sequencing data, and proteomic measurements, where p can be tens of thousands while n is at most a few hundred.
Feature selection in machine learning pipelines. The lasso is often used as a feature selection step in front of more complex models. By fitting a lasso to a large feature pool and keeping only variables with non-zero coefficients, an analyst can reduce the dimensionality of the input to a downstream tree-based model or neural network. Scikit-learn's SelectFromModel utility wraps this idea directly.
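A minimal pipeline of this kind might look like the following (synthetic data, arbitrary penalty, and a gradient-boosted regressor standing in for whatever downstream model is used):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=200, n_informative=15,
                       noise=10.0, random_state=0)

# The lasso keeps features with non-zero coefficients; the boosted trees see only those.
pipeline = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1)),
    GradientBoostingRegressor(random_state=0),
)
pipeline.fit(X, y)
print("features kept:", pipeline.named_steps["selectfrommodel"].get_support().sum())
```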
Interpretable models in regulated industries. In credit scoring, insurance underwriting, and clinical risk prediction, models must be auditable. A sparse lasso model with a handful of non-zero coefficients is easier to explain to regulators than a dense model with hundreds of small weights. Many production scoring models in banking and insurance are essentially L1-regularized logistic regressions.
Text mining and signal recovery. Bag-of-words representations of text produce extremely high-dimensional, mostly empty feature vectors, and the L1 penalty prunes irrelevant words. Logistic lasso models on document-term matrices were a baseline for sentiment classification and spam filtering throughout the 2000s. In imaging, audio compression, and radar processing, the lasso (also known as basis pursuit denoising) recovers signals from far fewer measurements than a Nyquist analysis would suggest, an insight that underpinned compressed sensing.
Time series and economics. Macroeconomic forecasters often have hundreds of candidate indicators but only a few decades of monthly observations. Lasso variants such as the adaptive lasso and the fused lasso are used to pick leading indicators or to detect change points.
| Tool | Language | Notes |
|---|---|---|
| glmnet | R, Python | Original coordinate descent implementation; supports linear, logistic, multinomial, Poisson, and Cox models |
| scikit-learn Lasso, LassoCV, LassoLarsCV, MultiTaskLasso | Python | Standard ML interface with cross-validation and warm starts built in |
| statsmodels | Python | Lasso for OLS and generalized linear models, with a focus on inference |
| Lasso.jl | Julia | High-performance lasso and elastic net for the Julia ecosystem |
| MATLAB lasso | MATLAB | Built into the Statistics and Machine Learning Toolbox |
| H2O.ai GLM | Java, Python, R | Distributed lasso and elastic net for large data |
| Apache Spark MLlib ElasticNet | Scala, Python, Java | Distributed L1-regularized regression for cluster-scale data |
| celer | Python | Fast lasso solver competitive with glmnet on dense and sparse data |
Glmnet remains the de facto standard in the R community and is widely cited as a benchmark for new solvers. In Python, scikit-learn's Lasso and LassoCV are the most common starting points.
Instability under correlated predictors. When two or more predictors are highly correlated, the lasso tends to pick one arbitrarily and shrink the rest to zero. The choice can flip with small changes to the data, which leads to unstable variable selection. This was the original motivation for the elastic net.
Bias toward zero. The lasso shrinks every non-zero coefficient toward zero, including ones it correctly identifies as important, which can hurt predictive performance when the true coefficients are large. A common remedy is to use the lasso for selection only, then refit ordinary least squares on the selected subset, a practice often referred to as the relaxed lasso; a sketch of the two-step refit appears below.
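A minimal sketch of the two-step refit, on synthetic data with an arbitrary penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=80, n_informative=6,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: the lasso chooses the support.
support = Lasso(alpha=1.0).fit(X, y).coef_ != 0

# Step 2: unpenalized OLS on the selected columns removes the shrinkage bias.
ols = LinearRegression().fit(X[:, support], y)
print("selected features:", support.sum())
print("refit coefficients:", np.round(ols.coef_, 2))
```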
Cap on selected variables when p exceeds n. In the high-dimensional regime the lasso can select at most n non-zero coefficients. If the underlying model has more active predictors, the lasso will miss some by construction. The elastic net relaxes this limit because the L2 component breaks ties differently.
Sensitivity to scaling. Because the L1 penalty depends on feature units, the lasso requires careful standardization. Feeding raw, unscaled features produces a model dominated by whichever variables happen to be measured in small units.
Tuning is critical. A poor choice of lambda can produce a model that is either essentially OLS or empty. Cross-validation is reliable but adds a constant factor of computation. Analysts should plot the regularization path and the cross-validation error before reporting a final model.
Inference is hard. Classical confidence intervals and p-values do not apply to lasso estimates because the model is selected adaptively. Specialized tools such as the post-selection inference framework of Lockhart, Taylor, Tibshirani, and Tibshirani (2014) and the desparsified lasso of Zhang and Zhang (2014) provide valid inference, but are more complex than standard regression output.
The lasso sits at the intersection of several broader ideas in modern statistics and machine learning. It is a canonical example of regularization, the practice of adding a penalty to an estimator to control its complexity, a theme that runs from ridge and lasso through support vector machines and deep learning (where weight decay, dropout, and early stopping all play similar roles). The lasso is also a paradigm case of sparsity-inducing methods, a design principle that recurs in compressed sensing, sparse coding, dictionary learning, and even attention-based neural network architectures. Through its variants the lasso connects to multitask learning (group lasso), graphical models (graphical lasso), change-point detection (fused lasso), and survival analysis (Cox lasso).