Lasso regression (an acronym for Least Absolute Shrinkage and Selection Operator) is a linear regression technique that adds an L1 penalty to the ordinary least squares objective. The penalty pulls coefficients toward zero and, because of the geometry of the L1 norm, drives many of them to exactly zero. The result is a sparse model that simultaneously estimates regression coefficients and performs automatic feature selection, which makes lasso one of the most widely used tools in modern statistics and machine learning.
The method was introduced by Robert Tibshirani in a 1996 paper in the Journal of the Royal Statistical Society, Series B. It has since become a cornerstone of regularization, inspiring a large family of variants and serving as the basis for routine analysis in genomics, signal processing, finance, and text mining, where the number of candidate predictors is large relative to the sample size.
Imagine you are baking a cake and someone hands you a tray of fifty different ingredients. Some, like flour and sugar, really matter. Others, like a sprinkle of paprika, would just make a mess. You want a recipe card that lists only the ingredients that actually help. Lasso regression does that with numbers: it starts with a long list of possible ingredients (the features), and instead of just shrinking each one a little, it sets the unhelpful ones to zero so they fall right off the recipe. What is left is a short, clean list that still bakes a pretty good cake.
Before the lasso, statisticians had two main ways to control overly flexible regression models. Best subset selection tries different combinations of predictors and keeps the one with the best fit under a penalty for model size. It produces interpretable models but is computationally expensive and unstable, because tiny changes in the data can switch the chosen subset. Ridge regression, introduced by Arthur Hoerl and Robert Kennard in 1970, adds a squared penalty on the coefficients. Ridge keeps every variable in the model and shrinks all of them smoothly toward zero, but never produces a truly sparse model.
In 1996 Robert Tibshirani proposed a third option in "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B. He combined the convexity of ridge with the sparsity of subset selection by replacing the squared penalty with a sum of absolute values. The resulting estimator was convex (so it could be solved efficiently) and capable of producing exact zeros (so it acted as a feature selector). Tibshirani drew inspiration from Leo Breiman's earlier work on the non-negative garrote.
The years that followed saw an explosion of activity. Bradley Efron, Trevor Hastie, Iain Johnstone, and Tibshirani published Least Angle Regression in 2004, which computes the entire lasso solution path in roughly the time of a single least squares fit. Hui Zou and Hastie introduced the elastic net in 2005 to address lasso's instability with correlated predictors. Jerome Friedman, Hastie, and Tibshirani published glmnet in 2010, popularizing coordinate descent as the workhorse method for fitting lasso, ridge, and elastic net models on very large datasets. By the late 2000s the lasso had become standard in genomics, and by the 2010s it was a default tool in essentially every statistical software package.
Given a data matrix X of size n by p (each row an observation, each column a feature) and a response vector y of length n, ordinary least squares chooses coefficients beta to minimize the sum of squared residuals. Lasso adds a penalty proportional to the L1 norm of beta:
minimize over beta: (1 / (2n)) * sum_i (y_i - x_i^T beta)^2 + lambda * sum_j |beta_j|
Here lambda is a non-negative tuning parameter that controls the strength of the penalty. When lambda is zero, the lasso reduces to ordinary least squares. As lambda grows, more coefficients are shrunk toward zero, and beyond a certain threshold individual coefficients become exactly zero. When lambda is large enough, every coefficient is zero and the model predicts only the intercept (which is typically not penalized).
An equivalent constrained formulation makes the geometry explicit:
minimize over beta: sum_i (y_i - x_i^T beta)^2 subject to sum_j |beta_j| <= t
For each lambda there is a corresponding budget t, and the two formulations trace the same path of solutions. The constrained form shows the lasso as ordinary least squares restricted to an L1 ball of radius t.
Because the L1 penalty depends on feature scale, the columns of X are almost always standardized to unit variance and zero mean before fitting. Without standardization, features in large units (income in dollars) would receive much smaller coefficients than features in small units (proportions), and the penalty would shrink them unequally. The response is often centered as well so that the intercept can be handled separately.
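As a concrete illustration, here is a minimal scikit-learn sketch (synthetic data from make_regression, arbitrary penalty values) that standardizes the features and fits the lasso at several penalty strengths; scikit-learn's alpha parameter plays the role of lambda in the objective above.

```python
# Minimal sketch: standardize, then fit the lasso at a few penalty strengths.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 observations, 50 features, only 5 truly informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    model.fit(X, y)
    coef = model.named_steps["lasso"].coef_
    print(f"alpha={alpha}: {np.sum(coef != 0)} non-zero coefficients")
```

As alpha grows, the printed count of non-zero coefficients shrinks toward zero, which is exactly the behavior described above.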
The L1 penalty is non-differentiable at zero. Its gradient is the sign of the coefficient, which produces a constant pull toward zero whose magnitude does not depend on the size of the coefficient. As lambda increases, that pull eventually overcomes the residual gradient, and the coefficient snaps to zero. By contrast, the L2 penalty in ridge regression has a gradient proportional to the coefficient itself, so the pull weakens as the coefficient shrinks and the coefficient approaches zero only in the limit.
When the columns of X are orthonormal, the lasso solution has a closed form. For each coefficient j, the lasso estimate equals the soft thresholding of the corresponding least squares estimate:
beta_j_hat = sign(beta_j_OLS) * max(|beta_j_OLS| - lambda, 0)
This operator subtracts lambda from the magnitude of the OLS estimate and clamps the result at zero. Soft thresholding is the basic building block of nearly every lasso algorithm and appears as the proximal operator of the L1 norm in optimization theory.
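A minimal NumPy version of the operator (the function name and example values are ours) makes the behavior concrete:

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink |z| by lam and clamp at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With an orthonormal design, the lasso solution is just the OLS estimates
# passed through this operator.
beta_ols = np.array([3.0, -0.4, 1.2, 0.05])
print(soft_threshold(beta_ols, lam=0.5))   # [ 2.5 -0.   0.7  0. ]
```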
The contrast between lasso and ridge becomes vivid in two dimensions. With just two coefficients, beta_1 and beta_2, the contours of the squared error loss form ellipses centered on the OLS estimate. Ridge confines the solution to a circular disk; lasso confines it to a diamond, a square rotated 45 degrees so that its corners sit on the coordinate axes (in higher dimensions, an L1 ball with sharp corners aligned with the axes).
When the loss ellipse expands and first touches the constraint region, the touching point is the solution. For the ridge disk, the boundary is smooth, so the contact point is almost never on an axis and both coefficients remain non-zero. For the lasso diamond, the corners stick out along the axes, so the contact point very often lands on a corner where one coefficient is exactly zero. As the dimension grows, the L1 ball has more corners and edges aligned with low-dimensional faces, which is why even moderate values of lambda can drive most coefficients to zero in high-dimensional problems. This sharp geometry underlies the lasso's distinctive properties: variable selection, its role in compressed sensing, and occasional instability when two columns of X point in nearly the same direction.
The three most common penalized linear regressions form a closely related family. They differ only in the shape of the penalty, but those small differences produce very different behavior.
| Property | Ridge | Lasso | Elastic Net |
|---|---|---|---|
| Penalty term | lambda * sum beta_j^2 | lambda * sum \|beta_j\| | lambda * (alpha * sum \|beta_j\| + (1 - alpha) * sum beta_j^2) |
| Constraint shape | Hypersphere | L1 polytope (diamond in 2D) | Smoothed polytope |
| Produces exact zeros | No | Yes | Yes (for alpha > 0) |
| Correlated predictors | Spreads weight across the group | Picks one, drops the others | Tends to keep or drop the group together |
| Closed-form solution | Yes | No (iterative) | No (iterative) |
| Handles p > n | Yes | Yes (at most n features) | Yes |
| Motivation | Stabilization, multicollinearity | Sparsity, interpretability | Both at once |
| Year introduced | 1970 (Hoerl and Kennard) | 1996 (Tibshirani) | 2005 (Zou and Hastie) |
When the true model is sparse and predictors are roughly independent, lasso is hard to beat. When predictors form correlated groups (genes in the same biological pathway, words that frequently co-occur), lasso tends to pick a single representative; elastic net is then the better choice because the L2 component encourages correlated predictors to enter or exit together.
Unlike ridge regression, the lasso does not have a closed-form solution in general, because the L1 penalty is non-differentiable at zero. Several algorithms have been developed to solve it efficiently, and the choice matters when n or p is large.
The dominant solver in modern software is cyclical coordinate descent, popularized for the lasso by Jerome Friedman, Trevor Hastie, and Robert Tibshirani in their 2010 paper introducing glmnet. The algorithm cycles through the coefficients one at a time, each time solving a one-dimensional lasso problem (a single soft thresholding step) with the others held fixed. Glmnet computes the lasso along the entire regularization path, starting at the smallest lambda for which all coefficients are zero and decreasing step by step, warm-starting each new lambda from the previous solution. Combined with active-set strategies that ignore coefficients still at zero, this makes coordinate descent dramatically faster than fitting a single lambda from scratch. Glmnet handles linear, logistic, multinomial, Poisson, and Cox models in the same framework.
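The core update is easy to write down. The sketch below is a bare-bones cyclical coordinate descent for the objective above, assuming standardized columns; it omits the warm starts, active sets, and screening rules that make glmnet fast, and the function name and iteration count are illustrative.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclical coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1.
    Assumes the columns of X are standardized (mean 0, variance 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta          # full residual, kept up to date
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual correlation: add back feature j's own contribution.
            rho = X[:, j] @ (residual + X[:, j] * beta[j]) / n
            new_bj = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft-threshold step
            residual += X[:, j] * (beta[j] - new_bj)           # keep residual current
            beta[j] = new_bj
    return beta
```

Each inner step is exactly the one-dimensional soft-thresholding update described above; maintaining the residual incrementally is what keeps the cost per sweep low.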
Least Angle Regression, introduced by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in the 2004 Annals of Statistics paper, builds the lasso path one variable at a time. Starting from all-zero coefficients, LARS finds the predictor most correlated with the response and moves the corresponding coefficient in that direction. As soon as another predictor becomes equally correlated with the residual, both coefficients move together, and so on. With a small modification, LARS produces the exact piecewise-linear lasso solution path. Its total cost is of the same order as a single ordinary least squares fit on the full set of predictors. LARS underlies the LassoLars and LassoLarsCV implementations in scikit-learn.
Proximal gradient methods treat the lasso as a smooth term (squared error) plus a non-smooth penalty (L1 norm). At each iteration the algorithm takes a gradient step on the smooth term and applies the proximal operator of the penalty, which for the L1 norm is exactly soft thresholding. The base algorithm is ISTA (Iterative Shrinkage Thresholding Algorithm), converging at rate O(1 / k). Amir Beck and Marc Teboulle introduced FISTA in 2009, which uses Nesterov acceleration to achieve rate O(1 / k^2). FISTA is widely used in image processing and signal recovery, where the design matrix is often implicit. The lasso can also be solved by ADMM, quadratic programming, or interior-point methods, which appear in distributed computing and theoretical research.
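A bare-bones ISTA iteration, written for the same (1/(2n)) scaling used above, looks like this; the step size comes from the spectral norm of X, and the function name and iteration count are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    """ISTA for (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(X, ord=2) ** 2 / n
    step = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n          # gradient of the smooth term
        z = beta - step * grad                   # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox (soft threshold)
    return beta
```

FISTA adds a momentum term between iterations but keeps the same gradient-then-threshold structure.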
| Solver | Best when | Cost per iteration | Produces full path | Notes |
|---|---|---|---|---|
| Coordinate descent (glmnet) | n and p both large, especially with sparse X | Very low (one feature at a time) | Yes (with warm starts) | Default in R glmnet, scikit-learn Lasso, statsmodels |
| LARS | Want exact piecewise-linear path, p modest | Comparable to one OLS fit | Yes (knot by knot) | Sensitive to numerical precision when columns are nearly collinear |
| ISTA / FISTA | Implicit X (e.g., Fourier matrices), very large dense problems | One matrix-vector product | Often run at fixed lambda | Easy to parallelize, common in signal processing |
| ADMM | Distributed data, extra constraints | One matrix solve plus prox step | At fixed lambda | Popular in distributed and consensus formulations |
| Quadratic programming | Small problems with extra linear constraints | Polynomial in size | At fixed lambda | Mostly of historical and theoretical interest |
The value of lambda controls how many features survive and how much each is shrunk. Picking it well is one of the most important practical decisions when fitting a lasso model.
Cross-validation. The most common strategy is k-fold cross-validation. The data are split into k folds (typically 5 or 10), and for each candidate lambda the model is trained on k - 1 folds and evaluated on the held-out fold. The lambda that minimizes mean cross-validated error is called lambda_min. Many practitioners instead use lambda_1se, the largest lambda whose cross-validated error is within one standard error of the minimum, which produces a more parsimonious model with similar accuracy. In glmnet and scikit-learn's LassoCV, cross-validation fits the entire regularization path within each fold, warm-starting each lambda from the previous one, which makes it dramatically faster than refitting each lambda from scratch.
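In scikit-learn this takes only a few lines. The sketch below uses synthetic data and default settings; LassoCV reports the error-minimizing alpha, the analogue of lambda_min.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

# 10-fold CV over an automatically generated grid of 100 penalty values.
cv_model = make_pipeline(StandardScaler(), LassoCV(cv=10, n_alphas=100))
cv_model.fit(X, y)

lasso = cv_model.named_steps["lassocv"]
print("chosen alpha (lambda_min analogue):", lasso.alpha_)
print("non-zero coefficients:", (lasso.coef_ != 0).sum())
```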
Information criteria. When the noise variance can be reasonably estimated, AIC and BIC are valid criteria. They are computed from a single fit and require no resampling. BIC penalizes complexity more heavily than AIC and is consistent for variable selection in the classical setting where p is fixed; AIC is asymptotically loss-efficient for prediction. Both rely on the degrees of freedom of the lasso, which Zou, Hastie, and Tibshirani showed equals the number of non-zero coefficients in expectation. Information criteria can break down when p > n because the noise variance is hard to estimate, so cross-validation is usually preferred there.
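scikit-learn exposes this idea through LassoLarsIC, which fits a single LARS path and scores each knot with AIC or BIC; the sketch below uses synthetic data with n larger than p so the noise variance can be estimated.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Single fit along the LARS path; the criterion picks the stopping point.
for criterion in ["aic", "bic"]:
    model = LassoLarsIC(criterion=criterion).fit(X, y)
    print(criterion, "alpha:", round(float(model.alpha_), 4),
          "non-zeros:", (model.coef_ != 0).sum())
```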
Stability selection. Nicolai Meinshausen and Peter Bühlmann introduced stability selection in 2010. The procedure subsamples the data many times, fits the lasso on each subsample, and records how often each variable enters the active set. Variables selected in a high fraction of subsamples are considered stable. Stability selection provides finite-sample error control on the number of false positives and is widely used in genomics.
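The core of the procedure is a simple subsampling loop. The sketch below (synthetic data, arbitrary penalty and threshold, and without the randomized-weights refinements or formal error-control bound of the published method) just counts how often each feature survives.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

rng = np.random.default_rng(0)
n, p = X.shape
counts = np.zeros(p)
n_subsamples = 100

for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # half-sample without replacement
    coef = Lasso(alpha=0.5).fit(X[idx], y[idx]).coef_
    counts += (coef != 0)

frequencies = counts / n_subsamples
stable = np.where(frequencies >= 0.8)[0]              # selection-frequency threshold
print("stable features:", stable)
```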
The lasso is unbiased only when lambda is zero. For positive lambda the estimator is biased toward zero, the price paid for variance reduction and sparsity. Asymptotic theory by Knight and Fu (2000), Greenshtein and Ritov (2004), and others has shown that the lasso is risk-consistent in high dimensions: its prediction error converges to the optimal rate even when p grows with n.
Variable selection consistency is subtler. The lasso recovers the true sparsity pattern only under what Peng Zhao and Bin Yu in 2006 called the Irrepresentable Condition, a constraint on the correlation between relevant and irrelevant predictors. When the condition fails, the lasso may include irrelevant variables or miss relevant ones, no matter how large n becomes. Zou's adaptive lasso modifies the penalty using initial coefficient estimates so that selection consistency holds under weaker conditions. In the Bayesian view, the lasso estimator is the posterior mode of a regression model with a Laplace (double exponential) prior on each coefficient, which places more mass at zero than a Gaussian prior and gives a probabilistic explanation for the sparsity-inducing behavior.
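The adaptive lasso is easy to emulate with a standard lasso solver, because a weighted L1 penalty is equivalent to rescaling the columns by the inverse weights before fitting and rescaling the coefficients afterwards. A minimal sketch, using a ridge fit as the pilot estimate and arbitrary penalty values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: pilot estimate (ridge is a common choice when p is large).
beta_init = Ridge(alpha=1.0).fit(X, y).coef_
weights = 1.0 / (np.abs(beta_init) + 1e-6)      # penalty weights w_j = 1 / |beta_init_j|

# Step 2: rescale columns so a plain lasso applies the weighted penalty,
# then undo the scaling on the returned coefficients.
X_scaled = X / weights
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
beta_adaptive = lasso.coef_ / weights
print("non-zero coefficients:", (beta_adaptive != 0).sum())
```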
The lasso has spawned a large family of related estimators that adapt the basic idea to different data structures or fix specific weaknesses.
| Variant | Reference | Idea | Typical use |
|---|---|---|---|
| Elastic Net | Zou and Hastie 2005 | Mix of L1 and L2 penalties | Correlated predictors |
| Adaptive Lasso | Zou 2006 | Reweight L1 penalty using preliminary estimates | Consistent variable selection |
| Group Lasso | Yuan and Lin 2006 | Penalty acts on groups of coefficients jointly | Categorical variables, multitask learning |
| Sparse Group Lasso | Simon, Friedman, Hastie, Tibshirani 2013 | Mix of group and single-coefficient L1 penalties | Sparsity within and across groups |
| Fused Lasso | Tibshirani, Saunders, Rosset, Zhu, Knight 2005 | Penalize differences between consecutive coefficients | Ordered features such as genomic positions |
| Generalized Lasso | Tibshirani and Taylor 2011 | Penalize a linear transform of the coefficients | Trend filtering, image denoising |
| Graphical Lasso | Friedman, Hastie, Tibshirani 2008 | L1 penalty on the inverse covariance matrix | Sparse Gaussian graphical models |
| Square-root Lasso | Belloni, Chernozhukov, Wang 2011 | Removes the need to estimate noise variance for tuning | Robust theoretical guarantees |
| Bayesian Lasso | Park and Casella 2008 | Full Bayesian treatment with Laplace prior | Uncertainty quantification |
The lasso also extends beyond linear regression. The same L1 penalty is added to logistic regression for sparse classification, to Cox proportional hazards models for survival analysis, to Poisson regression for count data, and to deep neural networks as a regularizer that encourages weight sparsity.
High-dimensional problems with p greater than n. The defining application of the lasso is the regime where the number of candidate predictors is much larger than the number of observations. Classical least squares breaks down here because the system has more unknowns than equations and admits infinitely many solutions that fit the training data perfectly. The lasso, by selecting at most n variables out of p, returns a sparse, usable model. This is the main reason it became the dominant tool for analyzing genome-wide association studies, RNA sequencing data, and proteomic measurements, where p can be tens of thousands while n is at most a few hundred.
Feature selection in machine learning pipelines. The lasso is often used as a feature selection step in front of more complex models. By fitting a lasso to a large feature pool and keeping only variables with non-zero coefficients, an analyst can reduce the dimensionality of the input to a downstream tree-based model or neural network. Scikit-learn's SelectFromModel utility wraps this idea directly.
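A minimal pipeline of this kind might look like the following (synthetic data, arbitrary penalty, and a gradient-boosted regressor standing in for whatever downstream model is used):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=200, n_informative=15,
                       noise=10.0, random_state=0)

# The lasso keeps features with non-zero coefficients; the boosted trees see only those.
pipeline = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1)),
    GradientBoostingRegressor(random_state=0),
)
pipeline.fit(X, y)
print("features kept:", pipeline.named_steps["selectfrommodel"].get_support().sum())
```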
Interpretable models in regulated industries. In credit scoring, insurance underwriting, and clinical risk prediction, models must be auditable. A sparse lasso model with a handful of non-zero coefficients is easier to explain to regulators than a dense model with hundreds of small weights. Many production scoring models in banking and insurance are essentially L1-regularized logistic regressions.
Text mining and signal recovery. Bag-of-words representations of text produce extremely high-dimensional, mostly empty feature vectors, and the L1 penalty prunes irrelevant words. Logistic lasso models on document-term matrices were a baseline for sentiment classification and spam filtering throughout the 2000s. In imaging, audio compression, and radar processing, the lasso (also known as basis pursuit denoising) recovers signals from far fewer measurements than a Nyquist analysis would suggest, an insight that underpinned compressed sensing.
Time series and economics. Macroeconomic forecasters often have hundreds of candidate indicators but only a few decades of monthly observations. Lasso variants such as the adaptive lasso and the fused lasso are used to pick leading indicators or to detect change points.
| Tool | Language | Notes |
|---|---|---|
| glmnet | R, Python | Original coordinate descent implementation; supports linear, logistic, multinomial, Poisson, and Cox models |
| scikit-learn Lasso, LassoCV, LassoLarsCV, MultiTaskLasso | Python | Standard ML interface with cross-validation and warm starts built in |
| statsmodels | Python | Lasso for OLS and generalized linear models, with a focus on inference |
| Lasso.jl | Julia | High-performance lasso and elastic net for the Julia ecosystem |
| MATLAB lasso | MATLAB | Built into the Statistics and Machine Learning Toolbox |
| H2O.ai GLM | Java, Python, R | Distributed lasso and elastic net for large data |
| Apache Spark MLlib ElasticNet | Scala, Python, Java | Distributed L1-regularized regression for cluster-scale data |
| celer | Python | Fast lasso solver competitive with glmnet on dense and sparse data |
Glmnet remains the de facto standard in the R community and is widely cited as a benchmark for new solvers. In Python, scikit-learn's Lasso and LassoCV are the most common starting points.
Instability under correlated predictors. When two or more predictors are highly correlated, the lasso tends to pick one arbitrarily and shrink the rest to zero. The choice can flip with small changes to the data, which leads to unstable variable selection. This was the original motivation for the elastic net.
Bias toward zero. The lasso shrinks every non-zero coefficient toward zero, including ones it correctly identifies as important, which can hurt predictive performance when the true coefficients are large. A common remedy is to use the lasso for selection only, then refit ordinary least squares on the selected subset, a practice often referred to as the relaxed lasso; a sketch of the two-step refit appears below.
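A minimal sketch of the two-step refit, on synthetic data with an arbitrary penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=80, n_informative=6,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Step 1: the lasso chooses the support.
support = Lasso(alpha=1.0).fit(X, y).coef_ != 0

# Step 2: unpenalized OLS on the selected columns removes the shrinkage bias.
ols = LinearRegression().fit(X[:, support], y)
print("selected features:", support.sum())
print("refit coefficients:", np.round(ols.coef_, 2))
```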
Cap on selected variables when p exceeds n. In the high-dimensional regime the lasso can select at most n non-zero coefficients. If the underlying model has more active predictors, the lasso will miss some by construction. The elastic net relaxes this limit because the L2 component breaks ties differently.
Sensitivity to scaling. Because the L1 penalty depends on feature units, the lasso requires careful standardization. Feeding raw, unscaled features produces a model dominated by whichever variables happen to be measured in small units.
Tuning is critical. A poor choice of lambda can produce a model that is either essentially OLS or empty. Cross-validation is reliable but adds a constant factor of computation. Analysts should plot the regularization path and the cross-validation error before reporting a final model.
Inference is hard. Classical confidence intervals and p-values do not apply to lasso estimates because the model is selected adaptively. Specialized tools such as the post-selection inference framework of Lockhart, Taylor, Tibshirani, and Tibshirani (2014) and the desparsified lasso of Zhang and Zhang (2014) provide valid inference, but are more complex than standard regression output.
The lasso sits at the intersection of several broader ideas in modern statistics and machine learning. It is a canonical example of regularization, the practice of adding a penalty to an estimator to control its complexity, a theme that runs from ridge and lasso through support vector machines and deep learning (where weight decay, dropout, and early stopping all play similar roles). The lasso is also a paradigm case of sparsity-inducing methods, a design principle that recurs in compressed sensing, sparse coding, dictionary learning, and even attention-based neural network architectures. Through its variants the lasso connects to multitask learning (group lasso), graphical models (graphical lasso), change-point detection (fused lasso), and survival analysis (Cox lasso).