See also: Regularization, Linear regression
Elastic Net is a regularization and variable selection method for linear regression and other generalized linear models that combines the L1 penalty of Lasso regression with the L2 penalty of Ridge regression. The method was introduced by Hui Zou and Trevor Hastie in a 2005 paper published in the Journal of the Royal Statistical Society, Series B, and was designed to overcome two well-known shortcomings of pure Lasso: its instability when input features are correlated and its inability to select more variables than there are training observations when the number of features exceeds the sample size.[1]
The Elastic Net penalty is a convex combination of the absolute value penalty and the squared penalty applied to the model coefficients. The L1 component performs automatic feature selection by driving coefficients of unimportant variables exactly to zero, while the L2 component encourages a grouping effect in which strongly correlated predictors are kept together with similar coefficients. The resulting estimator behaves like Lasso when the data are uncorrelated and like Ridge when the predictors are highly collinear, and it interpolates smoothly between the two extremes through a single mixing hyperparameter.[1][2]
Elastic Net has become a standard tool in modern statistics and machine learning and is implemented in widely used libraries such as glmnet for R, scikit-learn for Python, H2O, Apache Spark MLlib, and statsmodels. It is particularly popular in problems with many more predictors than observations, including microarray gene expression analysis, genome-wide association studies, text classification, marketing analytics, and chemometrics.[3]
Given a response vector y of length n and a predictor matrix X of size n by p with standardized columns, ordinary least squares (OLS) estimates the coefficient vector beta by minimizing the residual sum of squares L(beta) = (1/2n) * ||y - X beta||^2. When n is much larger than p, OLS provides unbiased estimates with small variance. As p grows toward or beyond n, OLS estimates become unstable, the variance grows without bound, and predictive accuracy on new data deteriorates rapidly. Some form of regularization is then required to obtain a usable model.[2]
Ridge regression, developed by Hoerl and Kennard in 1970, adds a squared L2 penalty to the OLS loss: beta_ridge = argmin (1/2n) * ||y - X beta||^2 + (lambda / 2) * sum_j beta_j^2. The L2 penalty shrinks all coefficients toward zero in a smooth manner and yields a closed-form solution beta_ridge = (X^T X + n lambda I)^(-1) X^T y. Ridge stabilizes estimates in the presence of multicollinearity, but it never sets a coefficient exactly to zero, so it cannot perform variable selection on its own and produces dense models that include every input feature.[2]
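The closed-form solution is simple to compute directly. The following minimal numpy sketch (the function and variable names are illustrative, not from any library) mirrors the formula above for the (1/2n)-scaled loss.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution beta = (X^T X + n*lam*I)^(-1) X^T y,
    the minimizer of (1/2n)*||y - X beta||^2 + (lam/2)*||beta||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
```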
The Least Absolute Shrinkage and Selection Operator, or Lasso, was proposed by Robert Tibshirani in 1996 and replaces the squared penalty with an absolute value penalty: beta_lasso = argmin (1/2n) * ||y - X beta||^2 + lambda * sum_j |beta_j|. The L1 penalty has the geometrically useful property that the constraint region has corners along the coordinate axes, so the optimum often falls on a corner where one or more coefficients are exactly zero, and Lasso simultaneously performs continuous shrinkage and discrete variable selection. This made Lasso enormously influential as researchers turned to high-dimensional problems with thousands or millions of features.[4]
Lasso is not without limitations, however. Three of them are particularly important and motivated the development of Elastic Net:
- In the p > n setting, Lasso selects at most n variables before its solution path saturates.
- When a group of predictors is highly correlated, Lasso tends to pick one member of the group arbitrarily and discard the rest.
- When n > p and the predictors are strongly correlated, Ridge regression often achieves better prediction accuracy than Lasso.
Elastic Net was constructed precisely to retain Lasso's sparsity while inheriting Ridge's stability under correlation.[1]
Zou and Hastie defined the naive Elastic Net estimator as the minimizer of a loss function that adds both L1 and L2 penalties to the residual sum of squares:
beta_naive = argmin (1/2n) * ||y - X beta||^2 + lambda_1 * sum_j |beta_j| + lambda_2 * sum_j beta_j^2
The two non-negative tuning parameters lambda_1 and lambda_2 control the strength of the L1 and L2 components respectively. Setting lambda_1 = 0 recovers ridge regression, while setting lambda_2 = 0 recovers Lasso. When both parameters are positive, the penalty is strictly convex, which guarantees a unique solution even when X^T X is singular.
A second, equivalent parameterization is more common in software libraries. Define a single overall penalty strength lambda and a mixing parameter alpha in the interval [0, 1]:
P_alpha(beta) = alpha * sum_j |beta_j| + ((1 - alpha) / 2) * sum_j beta_j^2
beta_en = argmin (1/2n) * ||y - X beta||^2 + lambda * P_alpha(beta)
Under this parameterization, alpha = 1 corresponds to pure Lasso, alpha = 0 corresponds to pure Ridge, and intermediate values blend the two. The default in glmnet is alpha = 1 (Lasso), and most users tune both lambda and alpha by cross-validation.[3]
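The two parameterizations are related by a simple change of variables. The helper functions below (names are mine, not from any library) convert between them, consistent with the formulas above, where lambda * alpha = lambda_1 and lambda * (1 - alpha) / 2 = lambda_2.

```python
def naive_to_mixed(lambda_1, lambda_2):
    """Convert the naive parameters (lambda_1, lambda_2) to (lambda, alpha)."""
    lam = lambda_1 + 2.0 * lambda_2
    alpha = lambda_1 / lam if lam > 0 else 1.0
    return lam, alpha

def mixed_to_naive(lam, alpha):
    """Convert (lambda, alpha) back to the naive parameters (lambda_1, lambda_2)."""
    return lam * alpha, lam * (1.0 - alpha) / 2.0
```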
Zou and Hastie observed that the naive estimator described above suffers from a double shrinkage problem. Both the L1 and L2 components shrink the coefficients, and applying both simultaneously introduces more bias than necessary without a compensating reduction in variance. As a result, the naive estimator often had higher prediction error than either pure Lasso or pure Ridge in their experiments.
To correct this, the authors defined the final Elastic Net estimator by rescaling the naive solution:
beta_elastic_net = (1 + lambda_2) * beta_naive
This simple rescaling undoes the extra shrinkage introduced by the ridge term while preserving the variable-selection properties of the L1 component. The resulting estimator is what most software packages report when they refer to Elastic Net, although the distinction between naive and corrected versions is rarely emphasized in user-facing documentation.[1]
One of the most important theoretical results in the original paper is the grouping effect. Zou and Hastie proved that when two predictors x_i and x_j are highly correlated, with sample correlation rho_ij close to 1, the Elastic Net estimates for their coefficients satisfy:
|beta_i_hat - beta_j_hat| <= (1 / lambda_2) * sqrt(2 * (1 - rho_ij))
In other words, perfectly correlated predictors receive identical coefficients, and nearly perfectly correlated predictors receive nearly identical coefficients. This is exactly the behavior practitioners want when, for example, two genes are co-regulated and either could plausibly explain the response.
The grouping effect does not hold for pure Lasso, where the difference in coefficients between two highly correlated predictors can be arbitrarily large because the algorithm essentially picks one and discards the other. The grouping property is what makes Elastic Net especially well suited to genomic data, image data, and other settings where features are organized in natural clusters.[1]
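A small simulation makes the grouping effect concrete. The sketch below (illustrative only, using scikit-learn; the exact coefficient values depend on the random seed and penalty strengths) duplicates one predictor and compares how Lasso and Elastic Net distribute its coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1.copy()                      # perfectly correlated duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

# Lasso tends to concentrate the shared signal on a single member of the
# correlated pair, while Elastic Net (l1_ratio < 1) tends to split it.
print(Lasso(alpha=0.1).fit(X, y).coef_)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```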
A second structural advantage of Elastic Net is that it can select up to all p variables when the L2 penalty is positive, even in the p > n regime. The proof works by showing that the Elastic Net optimization is equivalent to a Lasso problem on an augmented data set with n + p observations, which is no longer in the p > n regime.
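The equivalence is easy to verify numerically. The sketch below (a minimal numpy check with illustrative names, written with un-normalized residual sums of squares for simplicity) confirms that the augmented Lasso objective equals the naive Elastic Net objective for an arbitrary coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 60                       # p > n regime
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)
lam1, lam2 = 0.3, 0.7

# Naive Elastic Net objective (un-normalized residual sum of squares)
en_obj = (np.sum((y - X @ beta) ** 2)
          + lam2 * np.sum(beta ** 2)
          + lam1 * np.sum(np.abs(beta)))

# Equivalent Lasso objective on an augmented data set with n + p observations
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug = np.sqrt(1 + lam2) * beta
lasso_obj = (np.sum((y_aug - X_aug @ beta_aug) ** 2)
             + lam1 / np.sqrt(1 + lam2) * np.sum(np.abs(beta_aug)))

print(np.isclose(en_obj, lasso_obj))   # True: the two objectives coincide
```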
The table below summarizes how the three methods behave under common scenarios.
| Scenario | Lasso (alpha = 1) | Elastic Net (0 < alpha < 1) | Ridge (alpha = 0) |
|---|---|---|---|
| Many irrelevant features | Sets them to zero, sparse model | Sets them to zero, sparse model | Keeps all, dense model |
| Highly correlated features | Picks one arbitrarily | Keeps the group with shrunken coefficients (grouping) | Keeps all, evenly shrunken |
| p much larger than n | Selects at most n variables | Can select up to p variables | All p variables retained |
| Prediction under correlation | Often loses to Ridge | Generally best of the three | Best when all features matter |
| Interpretability | High due to sparsity | High due to sparsity | Low, dense coefficients |
| Closed-form solution | No | No | Yes |
| Hyperparameters to tune | One (lambda) | Two (lambda and alpha) | One (lambda) |
The practical takeaway is that Elastic Net is a safer default than Lasso whenever predictors might be correlated, while still delivering interpretable sparse models that Ridge cannot provide. The cost is one additional hyperparameter, which is usually a worthwhile trade.[2][3]
In the original 2005 paper, Zou and Hastie proposed the LARS-EN algorithm, an extension of the Least Angle Regression (LARS) procedure of Efron, Hastie, Johnstone, and Tibshirani (2004). LARS computes the entire Lasso regularization path in roughly the same time as a single OLS fit. LARS-EN applies the same idea to Elastic Net by exploiting the augmented-data formulation. While elegant, LARS-EN scales poorly to very large problems because it must maintain a Cholesky factorization of an active-set Gram matrix.[5]
The dominant modern algorithm for fitting Elastic Net models is coordinate descent, introduced for this purpose by Friedman, Hastie, and Tibshirani in their 2010 Journal of Statistical Software paper. Coordinate descent cycles through the coefficients one at a time, updating each by minimizing the penalized loss with the others held fixed. For Elastic Net, the per-coefficient update has a closed-form soft-thresholding solution:
beta_j_new = S(z_j, lambda * alpha) / (1 + lambda * (1 - alpha))
where z_j is the partial residual correlation with the j-th feature and S(z, gamma) = sign(z) * max(|z| - gamma, 0) is the soft-thresholding operator.
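For concreteness, here is a minimal, unoptimized Python sketch of this cyclic update. It assumes standardized columns with mean zero and unit variance; the function and variable names are illustrative rather than taken from any library.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_coordinate_descent(X, y, lam, alpha, n_sweeps=100):
    """Cyclic coordinate descent for
    (1/2n)*||y - X beta||^2 + lam*(alpha*||beta||_1 + (1-alpha)/2*||beta||_2^2),
    assuming each column of X has mean 0 and variance 1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                       # full residual, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            # z_j: correlation of feature j with the partial residual
            z_j = X[:, j] @ resid / n + beta[j]
            b_new = soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            resid += X[:, j] * (beta[j] - b_new)
            beta[j] = b_new
    return beta
```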
The glmnet package combined coordinate descent with several engineering tricks: computing the entire regularization path on a grid of lambda values, using warm starts so each new lambda is initialized from the previous solution, applying strong rules to skip predictors unlikely to enter the active set, and supporting sparse design matrices through compressed column storage. The result is a solver that fits Elastic Net problems with millions of features in seconds on a single machine, and the package quickly became the de facto standard for L1 and L2 regularized generalized linear models. The approach was extended to logistic, multinomial, Cox proportional hazards, and Poisson regression in subsequent papers.[3][6]
Elastic Net has two main hyperparameters:
- lambda (called alpha in scikit-learn's ElasticNet class, confusingly using the opposite letter from the literature): the overall strength of regularization. Larger lambda shrinks all coefficients more aggressively and produces sparser, simpler models.
- alpha (called l1_ratio in scikit-learn): the mixing parameter between L1 and L2 penalties. alpha = 1 is pure Lasso, alpha = 0 is pure Ridge, and intermediate values blend the two.

Both parameters affect bias and variance and must be tuned to the specific problem.
The standard approach is k-fold cross-validation. For each candidate combination of lambda and alpha, the data are split into k folds, the model is fit on k - 1 folds, the prediction error is computed on the held-out fold, and the combination with the lowest average error is selected. In practice, alpha is typically searched over a small grid such as {0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0} and lambda over a logarithmic grid of perhaps 100 values from a maximum where all coefficients are zero down to a small fraction of that maximum. The glmnet and scikit-learn implementations both build the lambda grid automatically based on the data.
A popular variant is the one standard error rule, which selects the largest lambda whose cross-validated error is within one standard error of the minimum. This rule trades a small amount of predictive accuracy for a noticeably sparser, more interpretable model and is widely used in genomics applications.[2]
Elastic Net's combination of sparsity and stability under correlation has made it popular across many scientific and industrial domains. The table below summarizes the most prominent application areas.
| Domain | Typical setting | Why Elastic Net helps |
|---|---|---|
| Microarray gene expression | Thousands of probes, dozens of patients (p >> n), co-regulated genes | Selects whole pathways; can choose more than n features |
| Genome-wide association studies | Millions of SNPs in linkage disequilibrium | Groups correlated SNPs rather than picking one per haplotype block |
| Text classification (bag-of-words) | Many word and n-gram features, near-duplicates | Keeps related word features together; sparse output |
| Chemometrics and spectroscopy | Correlated wavelength channels | Stable coefficient estimates across spectral neighborhoods |
| Marketing mix modeling | Overlapping media channels with collinear spend | Distributes effect across correlated channels |
| Credit scoring | Many correlated borrower attributes | Sparse models for regulatory interpretability |
| Quantitative finance | Partially redundant candidate factors | Stable factor selection over time |
| Bioinformatics survival analysis | High-dimensional Cox regression on -omics data | Variable selection plus grouping for biomarker discovery |
| Neuroscience (fMRI decoding) | Thousands of voxel features per subject | Spatial smoothness across nearby voxels |
The original Zou and Hastie paper applied Elastic Net to the leukemia microarray data of Golub et al. (1999) and to a prostate cancer regression problem, showing in both cases that Elastic Net selected more features than Lasso and obtained lower prediction error. Subsequent papers have used Elastic Net for thousands of biomarker-discovery and prediction tasks across cancer, cardiovascular disease, agriculture, ecology, and the social sciences.[1][7]
The table below lists widely used implementations of Elastic Net.
| Software | Language | Key class or function | Notes |
|---|---|---|---|
| glmnet | R | glmnet::glmnet, glmnet::cv.glmnet | Reference implementation by the authors; supports binomial, multinomial, Cox, Poisson |
| scikit-learn | Python | sklearn.linear_model.ElasticNet, ElasticNetCV, MultiTaskElasticNet | Coordinate descent; cross-validated variant; sparse support |
| H2O | Java, Python, R | H2OGeneralizedLinearEstimator with alpha parameter | Distributed; scales to large data |
| Apache Spark MLlib | Scala, Python, Java | LinearRegression with elasticNetParam and regParam | Distributed across a Spark cluster |
| statsmodels | Python | OLS.fit_regularized(method='elastic_net') | Useful for econometric workflows |
| MATLAB | MATLAB | lasso with Alpha parameter | Exposes Elastic Net via the Alpha argument |
| Julia | Julia | Lasso.jl, MLJLinearModels.jl | Native Julia ports of coordinate descent |
| PyGLMNet | Python | pyglmnet.GLM | Python port of glmnet for GLMs |
Most users in industry and academia interact with Elastic Net through glmnet or scikit-learn. The two libraries use compatible parameterizations once the lambda-versus-alpha naming conventions are reconciled, and they generally produce indistinguishable models on the same data.
A typical workflow looks like this:
```python
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # Standardize predictors: the L1/L2 penalties are not scale-invariant.
    ('scale', StandardScaler()),
    # Cross-validated Elastic Net over a grid of mixing values and an
    # automatically constructed path of regularization strengths.
    ('en', ElasticNetCV(
        l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],
        n_alphas=100,
        cv=5,
        random_state=0,
    )),
])

pipe.fit(X_train, y_train)
```
The ElasticNetCV class searches over a grid of l1_ratio values and a path of regularization strengths and returns the configuration with the lowest mean squared cross-validation error.
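After fitting, the selected configuration can be read off the fitted estimator; alpha_ and l1_ratio_ are attributes of scikit-learn's ElasticNetCV.

```python
en = pipe.named_steps['en']
print(en.alpha_)      # selected overall penalty strength (lambda in the notation above)
print(en.l1_ratio_)   # selected L1/L2 mixing parameter
```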
Zou and Zhang proposed the adaptive Elastic Net in 2009, which uses data-driven weights on each L1 penalty term. This estimator inherits the oracle property of the adaptive Lasso, meaning that under regularity conditions it asymptotically selects the true non-zero coefficients with probability one. Adaptive weights are typically derived from an initial Ridge or OLS fit and frozen for the second-stage Elastic Net optimization.[8]
When multiple related response variables share a common feature set, the multi-task Elastic Net jointly fits all responses while encouraging the same set of features to be selected across tasks. This is implemented in scikit-learn as MultiTaskElasticNet and uses an L1/L2 mixed norm on rows of the coefficient matrix.
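A minimal usage sketch (the shapes, parameter values, and data are illustrative) is shown below; note that the response matrix has one column per task and that features are selected jointly across all tasks.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 3))
W[:5] = 1.0                                   # 3 related tasks sharing 5 active features
Y = X @ W + 0.1 * rng.normal(size=(100, 3))

model = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
# coef_ has shape (n_tasks, n_features); a feature's coefficients are
# zero for all tasks or nonzero for all tasks because of the mixed norm.
print(model.coef_.shape)                      # (3, 20)
```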
Simon, Friedman, Hastie, and Tibshirani (2013) introduced the sparse group Lasso, which combines a group Lasso penalty with an L1 penalty to obtain both group-level and within-group sparsity. An analogous Elastic Net version replaces the group Lasso component with a group Elastic Net to obtain stability within groups.[9]
The Elastic Net penalty applies essentially unchanged to logistic regression, multinomial logistic regression, Poisson regression, and Cox proportional hazards regression. In each case the penalized log-likelihood replaces the penalized squared-error loss, and coordinate descent is still effective because the per-coefficient update remains soft-thresholded. These extensions are all implemented in glmnet and were among the principal contributions of the 2010 Friedman, Hastie, and Tibshirani paper.[3]
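In scikit-learn, for example, the same penalty is available for logistic regression through the saga solver; a minimal sketch (parameter values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Elastic Net-penalized logistic regression: C is the inverse of the overall
# penalty strength, and l1_ratio plays the role of the mixing parameter alpha.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
```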
A few practical points come up repeatedly when using Elastic Net. The predictors should always be standardized to have zero mean and unit variance before fitting. The L1 and L2 penalties are not scale-invariant, so unstandardized predictors will be penalized in proportion to their measurement units, which is rarely what the modeler intends. Some libraries, such as glmnet, standardize internally by default, while others, such as scikit-learn's ElasticNet, expect predictors that have already been standardized.
The response should usually be centered as well so that the intercept can be left unpenalized. Penalizing the intercept biases predictions toward the origin and is almost never desirable. The cross-validation error curve as a function of lambda is often very flat near its minimum, so selecting lambda by the one-standard-error rule rather than the strict minimum yields a sparser model with negligible loss in predictive accuracy and is recommended when interpretability matters.
Elastic Net coefficients are biased estimators because of the shrinkage they apply. If the goal is statistical inference about the coefficients (confidence intervals, hypothesis tests), additional procedures such as the post-selection inference framework or the debiased Lasso must be applied on top of the Elastic Net fit.
The Elastic Net paper was published in 2005 in the Journal of the Royal Statistical Society, Series B (Statistical Methodology), volume 67, part 2, pages 301 to 320. It was Hui Zou's PhD dissertation work at Stanford under Trevor Hastie's supervision and was originally circulated as a Stanford technical report in 2003. The paper has since been cited tens of thousands of times and is one of the most influential statistics papers of the early twenty-first century.
The glmnet R package, which made Elastic Net practical for very large problems, was released in 2010 with the paper of the same year by Jerome Friedman, Trevor Hastie, and Robert Tibshirani in the Journal of Statistical Software. The package is still maintained by Trevor Hastie and continues to be the reference implementation against which other Elastic Net solvers are compared. Elastic Net is part of the broader sparsity movement that includes Lasso (Tibshirani 1996), LARS (Efron et al. 2004), the Dantzig selector (Candes and Tao 2007), and the development of compressed sensing (Donoho 2006). It bridges the geometric ideas of L1 sparsity with the numerical stability of L2 regularization in a way that has proven both theoretically pleasing and practically robust.[10]