See also: Regularization, Linear regression
Elastic Net is a regularization and variable selection method for linear regression and other generalized linear models that combines the L1 penalty of Lasso regression with the L2 penalty of Ridge regression. The method was introduced by Hui Zou and Trevor Hastie in a 2005 paper published in the Journal of the Royal Statistical Society, Series B, and was designed to overcome two well-known shortcomings of pure Lasso: its instability when input features are correlated and its inability to select more variables than there are training observations when the number of features exceeds the sample size.[1]
The Elastic Net penalty is a convex combination of the absolute value penalty and the squared penalty applied to the model coefficients. The L1 component performs automatic feature selection by driving coefficients of unimportant variables exactly to zero, while the L2 component encourages a grouping effect in which strongly correlated predictors are kept together with similar coefficients. The resulting estimator behaves like Lasso when the data are uncorrelated and like Ridge when the predictors are highly collinear, and it interpolates smoothly between the two extremes through a single mixing hyperparameter.[1][2]
Elastic Net has become a standard tool in modern statistics and machine learning and is implemented in widely used libraries such as glmnet for R, scikit-learn for Python, H2O, Apache Spark MLlib, and statsmodels. It is particularly popular in problems with many more predictors than observations, including microarray gene expression analysis, genome-wide association studies, text classification, marketing analytics, and chemometrics.[3]
Given a response vector y of length n and a predictor matrix X of size n by p with standardized columns, ordinary least squares (OLS) estimates the coefficient vector beta by minimizing the residual sum of squares L(beta) = (1/2n) * ||y - X beta||^2. When n is much larger than p, OLS provides unbiased estimates with small variance. As p grows toward or beyond n, OLS estimates become unstable, the variance grows without bound, and predictive accuracy on new data deteriorates rapidly. Some form of regularization is then required to obtain a usable model.[2]
Ridge regression, developed by Hoerl and Kennard in 1970, adds a squared L2 penalty to the OLS loss: beta_ridge = argmin (1/2n) * ||y - X beta||^2 + (lambda / 2) * sum_j beta_j^2. The L2 penalty shrinks all coefficients toward zero in a smooth manner and yields a closed-form solution beta_ridge = (X^T X + n lambda I)^(-1) X^T y. Ridge stabilizes estimates in the presence of multicollinearity, but it never sets a coefficient exactly to zero, so it cannot perform variable selection on its own and produces dense models that include every input feature.[2]
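The closed-form solution is simple to compute directly. The following minimal numpy sketch (the function and variable names are illustrative, not from any library) mirrors the formula above for the (1/2n)-scaled loss.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution beta = (X^T X + n*lam*I)^(-1) X^T y,
    the minimizer of (1/2n)*||y - X beta||^2 + (lam/2)*||beta||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
```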
The Least Absolute Shrinkage and Selection Operator, or Lasso, was proposed by Robert Tibshirani in 1996 and replaces the squared penalty with an absolute value penalty: beta_lasso = argmin (1/2n) * ||y - X beta||^2 + lambda * sum_j |beta_j|. The L1 penalty has the geometrically useful property that the constraint region has corners along the coordinate axes, so the optimum often falls on a corner where one or more coefficients are exactly zero, and Lasso simultaneously performs continuous shrinkage and discrete variable selection. This made Lasso enormously influential as researchers turned to high-dimensional problems with thousands or millions of features.[4]
Lasso is not without limitations, however. Three of them are particularly important and motivated the development of Elastic Net:
- In the p > n setting, Lasso selects at most n variables before its solution path saturates.
- When a group of predictors is highly correlated, Lasso tends to pick one member of the group arbitrarily and discard the rest.
- When n > p and the predictors are strongly correlated, Ridge regression often achieves better prediction accuracy than Lasso.
Elastic Net was constructed precisely to retain Lasso's sparsity while inheriting Ridge's stability under correlation.[1]
Zou and Hastie defined the naive Elastic Net estimator as the minimizer of a loss function that adds both L1 and L2 penalties to the residual sum of squares:
beta_naive = argmin (1/2n) * ||y - X beta||^2 + lambda_1 * sum_j |beta_j| + lambda_2 * sum_j beta_j^2
The two non-negative tuning parameters lambda_1 and lambda_2 control the strength of the L1 and L2 components respectively. Setting lambda_1 = 0 recovers ridge regression, while setting lambda_2 = 0 recovers Lasso. When both parameters are positive, the penalty is strictly convex, which guarantees a unique solution even when X^T X is singular.
A second, equivalent parameterization is more common in software libraries. Define a single overall penalty strength lambda and a mixing parameter alpha in the interval [0, 1]:
P_alpha(beta) = alpha * sum_j |beta_j| + ((1 - alpha) / 2) * sum_j beta_j^2
beta_en = argmin (1/2n) * ||y - X beta||^2 + lambda * P_alpha(beta)
Under this parameterization, alpha = 1 corresponds to pure Lasso, alpha = 0 corresponds to pure Ridge, and intermediate values blend the two. The default in glmnet is alpha = 1 (Lasso), and most users tune both lambda and alpha by cross-validation.[3]
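The two parameterizations are related by a simple change of variables. The helper functions below (names are mine, not from any library) convert between them, consistent with the formulas above, where lambda * alpha = lambda_1 and lambda * (1 - alpha) / 2 = lambda_2.

```python
def naive_to_mixed(lambda_1, lambda_2):
    """Convert the naive parameters (lambda_1, lambda_2) to (lambda, alpha)."""
    lam = lambda_1 + 2.0 * lambda_2
    alpha = lambda_1 / lam if lam > 0 else 1.0
    return lam, alpha

def mixed_to_naive(lam, alpha):
    """Convert (lambda, alpha) back to the naive parameters (lambda_1, lambda_2)."""
    return lam * alpha, lam * (1.0 - alpha) / 2.0
```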
Zou and Hastie observed that the naive estimator described above suffers from a double shrinkage problem. Both the L1 and L2 components shrink the coefficients, and applying both simultaneously introduces more bias than necessary without a compensating reduction in variance. As a result, the naive estimator often had higher prediction error than either pure Lasso or pure Ridge in their experiments.
To correct this, the authors defined the final Elastic Net estimator by rescaling the naive solution:
beta_elastic_net = (1 + lambda_2) * beta_naive
This simple rescaling undoes the extra shrinkage introduced by the ridge term while preserving the variable-selection properties of the L1 component. The resulting estimator is what most software packages report when they refer to Elastic Net, although the distinction between naive and corrected versions is rarely emphasized in user-facing documentation.[1]
One of the most important theoretical results in the original paper is the grouping effect. Zou and Hastie proved that when two predictors x_i and x_j are highly correlated, with sample correlation rho_ij close to 1, the Elastic Net estimates for their coefficients satisfy:
|beta_i_hat - beta_j_hat| <= (1 / lambda_2) * sqrt(2 * (1 - rho_ij))
In other words, perfectly correlated predictors receive identical coefficients, and nearly perfectly correlated predictors receive nearly identical coefficients. This is exactly the behavior practitioners want when, for example, two genes are co-regulated and either could plausibly explain the response.
The grouping effect does not hold for pure Lasso, where the difference in coefficients between two highly correlated predictors can be arbitrarily large because the algorithm essentially picks one and discards the other. The grouping property is what makes Elastic Net especially well suited to genomic data, image data, and other settings where features are organized in natural clusters.[1]
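A small simulation makes the grouping effect concrete. The sketch below (illustrative only, using scikit-learn; the exact coefficient values depend on the random seed and penalty strengths) duplicates one predictor and compares how Lasso and Elastic Net distribute its coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1.copy()                      # perfectly correlated duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

# Lasso tends to concentrate the shared signal on a single member of the
# correlated pair, while Elastic Net (l1_ratio < 1) tends to split it.
print(Lasso(alpha=0.1).fit(X, y).coef_)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```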
A second structural advantage of Elastic Net is that it can select up to all p variables when the L2 penalty is positive, even in the p > n regime. The proof works by showing that the Elastic Net optimization is equivalent to a Lasso problem on an augmented data set with n + p observations, which is no longer in the p > n regime.
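The equivalence is easy to verify numerically. The sketch below (a minimal numpy check with illustrative names, written with un-normalized residual sums of squares for simplicity) confirms that the augmented Lasso objective equals the naive Elastic Net objective for an arbitrary coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 60                       # p > n regime
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)
lam1, lam2 = 0.3, 0.7

# Naive Elastic Net objective (un-normalized residual sum of squares)
en_obj = (np.sum((y - X @ beta) ** 2)
          + lam2 * np.sum(beta ** 2)
          + lam1 * np.sum(np.abs(beta)))

# Equivalent Lasso objective on an augmented data set with n + p observations
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug = np.sqrt(1 + lam2) * beta
lasso_obj = (np.sum((y_aug - X_aug @ beta_aug) ** 2)
             + lam1 / np.sqrt(1 + lam2) * np.sum(np.abs(beta_aug)))

print(np.isclose(en_obj, lasso_obj))   # True: the two objectives coincide
```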
The table below summarizes how the three methods behave under common scenarios.
| Scenario | Lasso (alpha = 1) | Elastic Net (0 < alpha < 1) | Ridge (alpha = 0) |
|---|---|---|---|
| Many irrelevant features | Sets them to zero, sparse model | Sets them to zero, sparse model | Keeps all, dense model |
| Highly correlated features | Picks one arbitrarily | Keeps the group with shrunken coefficients (grouping) | Keeps all, evenly shrunken |
| p much larger than n | Selects at most n variables | Can select up to p variables | All p variables retained |
| Prediction under correlation | Often loses to Ridge | Generally best of the three | Best when all features matter |
| Interpretability | High due to sparsity | High due to sparsity | Low, dense coefficients |
| Closed-form solution | No | No | Yes |
| Hyperparameters to tune | One (lambda) | Two (lambda and alpha) | One (lambda) |
The practical takeaway is that Elastic Net is a safer default than Lasso whenever predictors might be correlated, while still delivering interpretable sparse models that Ridge cannot provide. The cost is one additional hyperparameter, which is usually a worthwhile trade.[2][3]
In the original 2005 paper, Zou and Hastie proposed the LARS-EN algorithm, an extension of the Least Angle Regression (LARS) procedure of Efron, Hastie, Johnstone, and Tibshirani (2004). LARS computes the entire Lasso regularization path in roughly the same time as a single OLS fit. LARS-EN applies the same idea to Elastic Net by exploiting the augmented-data formulation. While elegant, LARS-EN scales poorly to very large problems because it must maintain a Cholesky factorization of an active-set Gram matrix.[5]
The dominant modern algorithm for fitting Elastic Net models is coordinate descent, introduced for this purpose by Friedman, Hastie, and Tibshirani in their 2010 Journal of Statistical Software paper. Coordinate descent cycles through the coefficients one at a time, updating each by minimizing the penalized loss with the others held fixed. For Elastic Net, the per-coefficient update has a closed-form soft-thresholding solution:
beta_j_new = S(z_j, lambda * alpha) / (1 + lambda * (1 - alpha))
where z_j is the partial residual correlation with the j-th feature and S(z, gamma) = sign(z) * max(|z| - gamma, 0) is the soft-thresholding operator.
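For concreteness, here is a minimal, unoptimized Python sketch of this cyclic update. It assumes standardized columns with mean zero and unit variance; the function and variable names are illustrative rather than taken from any library.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_coordinate_descent(X, y, lam, alpha, n_sweeps=100):
    """Cyclic coordinate descent for
    (1/2n)*||y - X beta||^2 + lam*(alpha*||beta||_1 + (1-alpha)/2*||beta||_2^2),
    assuming each column of X has mean 0 and variance 1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                       # full residual, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            # z_j: correlation of feature j with the partial residual
            z_j = X[:, j] @ resid / n + beta[j]
            b_new = soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            resid += X[:, j] * (beta[j] - b_new)
            beta[j] = b_new
    return beta
```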
The glmnet package combined coordinate descent with several engineering tricks: computing the entire regularization path on a grid of lambda values, using warm starts so each new lambda is initialized from the previous solution, applying strong rules to skip predictors unlikely to enter the active set, and supporting sparse design matrices through compressed column storage. The result is a solver that fits Elastic Net problems with millions of features in seconds on a single machine, and the package quickly became the de facto standard for L1 and L2 regularized generalized linear models. The approach was extended to logistic, multinomial, Cox proportional hazards, and Poisson regression in subsequent papers.[3][6]
Elastic Net has two main hyperparameters:
- lambda (called alpha in scikit-learn's ElasticNet class, confusingly using the opposite letter from the literature): the overall strength of regularization. Larger lambda shrinks all coefficients more aggressively and produces sparser, simpler models.
- alpha (called l1_ratio in scikit-learn): the mixing parameter between L1 and L2 penalties. alpha = 1 is pure Lasso, alpha = 0 is pure Ridge, and intermediate values blend the two.

Both parameters affect bias and variance and must be tuned to the specific problem.
The standard approach is k-fold cross-validation. For each candidate combination of lambda and alpha, the data are split into k folds, the model is fit on k - 1 folds, the prediction error is computed on the held-out fold, and the combination with the lowest average error is selected. In practice, alpha is typically searched over a small grid such as {0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0} and lambda over a logarithmic grid of perhaps 100 values from a maximum where all coefficients are zero down to a small fraction of that maximum. The glmnet and scikit-learn implementations both build the lambda grid automatically based on the data.
A popular variant is the one standard error rule, which selects the largest lambda whose cross-validated error is within one standard error of the minimum. This rule trades a small amount of predictive accuracy for a noticeably sparser, more interpretable model and is widely used in genomics applications.[2]
Elastic Net's combination of sparsity and stability under correlation has made it popular across many scientific and industrial domains. The table below summarizes the most prominent application areas.
| Domain | Typical setting | Why Elastic Net helps |
|---|---|---|
| Microarray gene expression | Thousands of probes, dozens of patients (p >> n), co-regulated genes | Selects whole pathways; can choose more than n features |
| Genome-wide association studies | Millions of SNPs in linkage disequilibrium | Groups correlated SNPs rather than picking one per haplotype block |
| Text classification (bag-of-words) | Many word and n-gram features, near-duplicates | Keeps related word features together; sparse output |
| Chemometrics and spectroscopy | Correlated wavelength channels | Stable coefficient estimates across spectral neighborhoods |
| Marketing mix modeling | Overlapping media channels with collinear spend | Distributes effect across correlated channels |
| Credit scoring | Many correlated borrower attributes | Sparse models for regulatory interpretability |
| Quantitative finance | Partially redundant candidate factors | Stable factor selection over time |
| Bioinformatics survival analysis | High-dimensional Cox regression on -omics data | Variable selection plus grouping for biomarker discovery |
| Neuroscience (fMRI decoding) | Thousands of voxel features per subject | Spatial smoothness across nearby voxels |
The original Zou and Hastie paper applied Elastic Net to the leukemia microarray data of Golub et al. (1999) and to a prostate cancer regression problem, showing in both cases that Elastic Net selected more features than Lasso and obtained lower prediction error. Subsequent papers have used Elastic Net for thousands of biomarker-discovery and prediction tasks across cancer, cardiovascular disease, agriculture, ecology, and the social sciences.[1][7]
The table below lists widely used implementations of Elastic Net.
| Software | Language | Key class or function | Notes |
|---|---|---|---|
| glmnet | R | glmnet::glmnet, glmnet::cv.glmnet | Reference implementation by the authors; supports binomial, multinomial, Cox, Poisson |
| scikit-learn | Python | sklearn.linear_model.ElasticNet, ElasticNetCV, MultiTaskElasticNet | Coordinate descent; cross-validated variant; sparse support |
| H2O | Java, Python, R | H2OGeneralizedLinearEstimator with alpha parameter | Distributed; scales to large data |
| Apache Spark MLlib | Scala, Python, Java | LinearRegression with elasticNetParam and regParam | Distributed across a Spark cluster |
| statsmodels | Python | OLS.fit_regularized(method='elastic_net') | Useful for econometric workflows |
| MATLAB | MATLAB | lasso with Alpha parameter | Exposes Elastic Net via the Alpha argument |
| Julia | Julia | Lasso.jl, MLJLinearModels.jl | Native Julia ports of coordinate descent |
| PyGLMNet | Python | pyglmnet.GLM | Python port of glmnet for GLMs |
Most users in industry and academia interact with Elastic Net through glmnet or scikit-learn. The two libraries use compatible parameterizations once the lambda-versus-alpha naming conventions are reconciled, and they generally produce indistinguishable models on the same data.
A typical workflow looks like this:
```python
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # Standardize predictors: the L1/L2 penalties are not scale-invariant.
    ('scale', StandardScaler()),
    # Cross-validated Elastic Net over a grid of mixing values and an
    # automatically constructed path of regularization strengths.
    ('en', ElasticNetCV(
        l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
        n_alphas=100,
        cv=5,
        random_state=0,
    )),
])

pipe.fit(X_train, y_train)
```
The ElasticNetCV class searches over a grid of l1_ratio values and a path of regularization strengths and returns the configuration with the lowest mean squared cross-validation error.
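After fitting, the selected configuration can be read off the fitted estimator; alpha_ and l1_ratio_ are attributes of scikit-learn's ElasticNetCV.

```python
en = pipe.named_steps['en']
print(en.alpha_)      # selected overall penalty strength (lambda in the notation above)
print(en.l1_ratio_)   # selected L1/L2 mixing parameter
```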
Zou and Zhang proposed the adaptive Elastic Net in 2009, which uses data-driven weights on each L1 penalty term. This estimator inherits the oracle property of the adaptive Lasso, meaning that under regularity conditions it asymptotically selects the true non-zero coefficients with probability one. Adaptive weights are typically derived from an initial Ridge or OLS fit and frozen for the second-stage Elastic Net optimization.[8]
When multiple related response variables share a common feature set, the multi-task Elastic Net jointly fits all responses while encouraging the same set of features to be selected across tasks. This is implemented in scikit-learn as MultiTaskElasticNet and uses an L1/L2 mixed norm on rows of the coefficient matrix.
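A minimal usage sketch (the shapes, parameter values, and data are illustrative) is shown below; note that the response matrix has one column per task and that features are selected jointly across all tasks.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 3))
W[:5] = 1.0                                   # 3 related tasks sharing 5 active features
Y = X @ W + 0.1 * rng.normal(size=(100, 3))

model = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
# coef_ has shape (n_tasks, n_features); a feature's coefficients are
# zero for all tasks or nonzero for all tasks because of the mixed norm.
print(model.coef_.shape)                      # (3, 20)
```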
Simon, Friedman, Hastie, and Tibshirani (2013) introduced the sparse group Lasso, which combines a group Lasso penalty with an L1 penalty to obtain both group-level and within-group sparsity. An analogous Elastic Net version replaces the group Lasso component with a group Elastic Net to obtain stability within groups.[9]
The Elastic Net penalty applies essentially unchanged to logistic regression, multinomial logistic regression, Poisson regression, and Cox proportional hazards regression. In each case the penalized log-likelihood replaces the penalized squared-error loss, and coordinate descent is still effective because the per-coefficient update remains soft-thresholded. These extensions are all implemented in glmnet and were among the principal contributions of the 2010 Friedman, Hastie, and Tibshirani paper.[3]
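In scikit-learn, for example, the same penalty is available for logistic regression through the saga solver; a minimal sketch (parameter values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Elastic Net-penalized logistic regression: C is the inverse of the overall
# penalty strength, and l1_ratio plays the role of the mixing parameter alpha.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
```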
A few practical points come up repeatedly when using Elastic Net. The predictors should always be standardized to have zero mean and unit variance before fitting. The L1 and L2 penalties are not scale-invariant, so unstandardized predictors will be penalized in proportion to their measurement units, which is rarely what the modeler intends. Some libraries, such as glmnet, standardize internally by default, while others, such as scikit-learn's ElasticNet, expect predictors that have already been standardized.
The response should usually be centered as well so that the intercept can be left unpenalized. Penalizing the intercept biases predictions toward the origin and is almost never desirable. The cross-validation error curve as a function of lambda is often very flat near its minimum, so selecting lambda by the one-standard-error rule rather than the strict minimum yields a sparser model with negligible loss in predictive accuracy and is recommended when interpretability matters.
Elastic Net coefficients are biased estimators because of the shrinkage they apply. If the goal is statistical inference about the coefficients (confidence intervals, hypothesis tests), additional procedures such as the post-selection inference framework or the debiased Lasso must be applied on top of the Elastic Net fit.
The Elastic Net paper was published in 2005 in the Journal of the Royal Statistical Society, Series B (Statistical Methodology), volume 67, part 2, pages 301 to 320. It was Hui Zou's PhD dissertation work at Stanford under Trevor Hastie's supervision and was originally circulated as a Stanford technical report in 2003. The paper has since been cited tens of thousands of times and is one of the most influential statistics papers of the early twenty-first century.
The glmnet R package, which made Elastic Net practical for very large problems, was released in 2010 with the paper of the same year by Jerome Friedman, Trevor Hastie, and Robert Tibshirani in the Journal of Statistical Software. The package is still maintained by Trevor Hastie and continues to be the reference implementation against which other Elastic Net solvers are compared. Elastic Net is part of the broader sparsity movement that includes Lasso (Tibshirani 1996), LARS (Efron et al. 2004), the Dantzig selector (Candes and Tao 2007), and the development of compressed sensing (Donoho 2006). It bridges the geometric ideas of L1 sparsity with the numerical stability of L2 regularization in a way that has proven both theoretically pleasing and practically robust.[10]