# Elastic Net

> Source: https://aiwiki.ai/wiki/elastic_net
> Updated: 2026-06-23
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Regularization](/wiki/regularization), [Linear regression](/wiki/linear_regression)*

## Overview

**Elastic Net** is a [regularization](/wiki/regularization) and variable selection method for [linear regression](/wiki/linear_regression) and other generalized linear models that minimizes the residual sum of squares plus a penalty that linearly combines the L1 penalty of [Lasso regression](/wiki/lasso_regression) with the L2 penalty of [ridge regression](/wiki/ridge_regression). Introduced by Hui Zou and Trevor Hastie in 2005, it simultaneously performs automatic [feature selection](/wiki/feature_selection) like Lasso and groups correlated predictors like Ridge, which makes it a robust default for high-dimensional problems where the number of features p greatly exceeds the number of observations n. Zou and Hastie described it as a method that "often outperforms the lasso, while enjoying a similar sparsity of representation," likening it to "a stretchable fishing net that retains all the big fish."[1]

The Elastic Net penalty is a convex combination of the absolute value penalty and the squared penalty applied to the model coefficients. The L1 component performs automatic [feature selection](/wiki/feature_selection) by driving coefficients of unimportant variables exactly to zero, while the L2 component encourages a *grouping effect* in which strongly correlated predictors are kept together with similar coefficients. The resulting estimator behaves much like Lasso when the predictors are nearly uncorrelated and more like Ridge when they are highly collinear, and it interpolates smoothly between the two extremes through a single mixing hyperparameter.[1][2]

Elastic Net has become a standard tool in modern statistics and [machine learning](/wiki/machine_learning) and is implemented in widely used libraries such as `glmnet` for R, `scikit-learn` for Python, H2O, Apache Spark MLlib, and statsmodels. It is particularly popular in problems with many more predictors than observations, including microarray gene expression analysis, genome-wide association studies, text classification, marketing analytics, and chemometrics.[3]

## Mathematical background

### Ordinary least squares

Given a response vector y of length n and a predictor matrix X of size n by p with standardized columns, ordinary least squares (OLS) estimates the coefficient vector beta by minimizing the residual sum of squares, written here with the scaling used throughout this article as L(beta) = (1/2n) * ||y - X beta||^2. When n is much larger than p, OLS provides unbiased estimates with small variance. As p approaches n, the estimates become unstable and their variance inflates; once p exceeds n, X^T X is singular and the OLS solution is no longer unique, so predictive accuracy on new data deteriorates rapidly. Some form of [regularization](/wiki/regularization) is then required to obtain a usable model.[2]

### Ridge regression

[Ridge regression](/wiki/ridge_regression), developed by Hoerl and Kennard in 1970, adds a squared L2 penalty to the OLS loss: beta_ridge = argmin (1/2n) * ||y - X beta||^2 + (lambda / 2) * sum_j beta_j^2. The L2 penalty shrinks all coefficients toward zero in a smooth manner and yields a closed-form solution beta_ridge = (X^T X + n lambda I)^(-1) X^T y. Ridge stabilizes estimates in the presence of multicollinearity, but it never sets a coefficient exactly to zero, so it cannot perform variable selection on its own and produces dense models that include every input feature.[2]
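As a quick numerical check of the closed form above, the following sketch compares it with scikit-learn's `Ridge` on synthetic data. The data, the seed, and the value of `lam` are illustrative assumptions. scikit-learn's `Ridge` minimizes `||y - Xw||^2 + alpha * ||w||^2`, so its `alpha` corresponds to `n * lambda` under the scaling used in this article.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

lam = 0.3  # illustrative penalty strength (assumption)

# Closed form from the text: (X^T X + n * lambda * I)^(-1) X^T y
beta_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Same problem through scikit-learn; the intercept is disabled so the objectives match.
beta_sklearn = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y).coef_

print("closed form :", beta_closed)
print("scikit-learn:", beta_sklearn)
```

Every coefficient is shrunk but none is exactly zero, including those whose true value is zero, which is the dense behavior described above.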

### Lasso regression

The Least Absolute Shrinkage and Selection Operator, or [Lasso](/wiki/lasso_regression), was proposed by Robert Tibshirani in 1996 and replaces the squared penalty with an absolute value penalty: beta_lasso = argmin (1/2n) * ||y - X beta||^2 + lambda * sum_j |beta_j|. The L1 penalty has the geometrically useful property that the constraint region has corners along the coordinate axes, so the optimum often falls on a corner where one or more coefficients are exactly zero, and Lasso simultaneously performs continuous shrinkage and discrete variable selection. This made Lasso enormously influential as researchers turned to high-dimensional problems with thousands or millions of features.[4]

Lasso is not without limitations, however. Zou and Hastie identified three of them as the explicit motivation for Elastic Net:[1]

1. **Saturation in the p > n setting.** As the authors state, "in the p > n case, the lasso selects at most n variables before it saturates, because of the nature of the convex optimization problem."[1] In genomics, where p is often in the tens of thousands and n in the dozens, this is a serious bottleneck.
2. **Arbitrary selection from correlated groups.** When several predictors are highly correlated, "the lasso tends to select only one variable from the group and does not care which one is selected," with the choice depending on small numerical perturbations of the data.[1] The method gives no indication that the discarded variables are also relevant, which complicates interpretation.
3. **Suboptimal prediction under high correlation.** Empirical studies showed that when strong correlations exist among predictors, "the prediction performance of the lasso is dominated by ridge regression."[1] The discrete nature of the L1 selection can throw away useful signal.

Elastic Net was constructed precisely to retain Lasso's sparsity while inheriting Ridge's stability under correlation.[1]

## How is the Elastic Net penalty defined?

### Definition

Zou and Hastie defined the *naive Elastic Net* estimator as the minimizer of a loss function that adds both L1 and L2 penalties to the residual sum of squares:

beta_naive = argmin (1/2n) * ||y - X beta||^2 + lambda_1 * sum_j |beta_j| + lambda_2 * sum_j beta_j^2

The two non-negative tuning parameters lambda_1 and lambda_2 control the strength of the L1 and L2 components respectively. Setting lambda_1 = 0 recovers ridge regression, while setting lambda_2 = 0 recovers Lasso. Whenever lambda_2 is positive, the objective is strictly convex, which guarantees a unique solution even when X^T X is singular.[1]

A second, equivalent parameterization is more common in software libraries. Define a single overall penalty strength lambda and a mixing parameter alpha in the interval [0, 1]:

P_alpha(beta) = alpha * sum_j |beta_j| + ((1 - alpha) / 2) * sum_j beta_j^2

beta_en = argmin (1/2n) * ||y - X beta||^2 + lambda * P_alpha(beta)

Under this parameterization, alpha = 1 corresponds to pure Lasso, alpha = 0 corresponds to pure Ridge, and intermediate values blend the two. The `glmnet` documentation states that the penalty "bridges the gap between lasso regression (alpha = 1, the default) and ridge regression (alpha = 0)," and most users tune both lambda and alpha by [cross-validation](/wiki/cross-validation).[3] (Note that in the original 2005 paper, Zou and Hastie use the opposite convention, defining alpha = lambda_2 / (lambda_1 + lambda_2) so that alpha = 1 is Ridge; the glmnet and scikit-learn convention where alpha = 1 is Lasso is the one used throughout this article and in nearly all software.)[1]
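Matching the penalty terms of the two parameterizations gives lambda_1 = lambda * alpha and lambda_2 = lambda * (1 - alpha) / 2. The sketch below encodes that mapping in both directions; the helper names `to_lambda_alpha` and `to_lambda_1_2` are hypothetical and exist only for illustration.

```python
def to_lambda_alpha(lambda_1, lambda_2):
    """Map (lambda_1, lambda_2) to (lambda, alpha) in the glmnet-style convention."""
    lam = lambda_1 + 2.0 * lambda_2
    if lam == 0.0:
        raise ValueError("both penalties are zero, so alpha is undefined")
    return lam, lambda_1 / lam


def to_lambda_1_2(lam, alpha):
    """Inverse mapping: lambda_1 = lambda * alpha, lambda_2 = lambda * (1 - alpha) / 2."""
    return lam * alpha, lam * (1.0 - alpha) / 2.0


lam, alpha = to_lambda_alpha(lambda_1=0.3, lambda_2=0.1)
print(lam, alpha)
print(to_lambda_1_2(lam, alpha))
```

The factor of 2 appears because the mixed penalty halves the squared term while the two-parameter form does not.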

### Why is a corrected (non-naive) estimator needed?

Zou and Hastie observed that the naive estimator described above suffers from a *double shrinkage* problem. Both the L1 and L2 components shrink the coefficients, and applying both simultaneously biases the estimates more than the data warrant, especially in regression settings. As a result, the naive estimator often had higher prediction error than either pure Lasso or pure Ridge in their experiments.[1]

To correct this, the authors defined the final Elastic Net estimator by rescaling the naive solution:

beta_elastic_net = (1 + lambda_2) * beta_naive

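This simple rescaling undoes the extra shrinkage introduced by the ridge term while preserving the variable-selection properties of the L1 component. The factor 1 + lambda_2 is stated for the paper's own scaling of the loss, which omits the 1/(2n) factor and assumes standardized predictors. Most modern software does not apply the correction: `glmnet` and `scikit-learn` minimize the penalized objective directly, and Friedman, Hastie, and Tibshirani explicitly drop the distinction between the naive and corrected estimators, so the coefficients these packages report correspond to the naive form.[1][3]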

### What is the grouping effect?

One of the most important theoretical results in the original paper is the *grouping effect*. Informally, "a regression method exhibits the grouping effect if the regression coefficients of a group of highly correlated variables tend to be equal (up to a sign change if negatively correlated)."[1] Zou and Hastie made this precise in Theorem 1: for the naive Elastic Net estimate with response y and standardized predictors, the standardized difference between the coefficients of two predictors x_i and x_j is bounded by

D(i, j) = (1 / ||y||_1) * |beta_i_hat - beta_j_hat| <= (1 / lambda_2) * sqrt(2 * (1 - rho_ij))

where rho_ij = x_i^T x_j is the sample correlation. As the authors note, "if x_i and x_j are highly correlated, i.e. rho approaches 1 ... the difference between the coefficient paths of predictor i and predictor j is almost 0."[1] In the extreme case of perfectly correlated predictors, Lemma 2 of the paper shows that a strictly convex penalty (which the Elastic Net penalty is whenever lambda_2 > 0) assigns them identical coefficients. This is exactly the behavior practitioners want when, for example, two genes are co-regulated and either could plausibly explain the response.

The grouping effect does not hold for pure Lasso, which "does not even have a unique solution" in the identical-predictor case, so the difference in coefficients between two highly correlated predictors can be arbitrarily large because the algorithm essentially picks one and discards the other.[1] The grouping property is what makes Elastic Net especially well suited to genomic data, image data, and other settings where features are organized in natural clusters.
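The contrast can be seen in a small experiment. The sketch below duplicates a single predictor, which is the extreme case covered by Lemma 2, and fits scikit-learn's `Lasso` and `ElasticNet` with the same L1 weight. The data, seed, and penalty values are illustrative assumptions, and which of the equivalent solutions Lasso returns depends on the solver.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x])              # two identical predictors
y = 3.0 * x + 0.1 * rng.normal(size=n)

# Both models use an L1 weight of 0.5; the Elastic Net adds an L2 term on top.
lasso = Lasso(alpha=0.5, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

lasso_gap = abs(lasso.coef_[0] - lasso.coef_[1])
enet_gap = abs(enet.coef_[0] - enet.coef_[1])
print("Lasso coefficients:      ", lasso.coef_, "gap:", lasso_gap)
print("Elastic Net coefficients:", enet.coef_, "gap:", enet_gap)
```

With scikit-learn's cyclic coordinate descent, Lasso puts essentially all of the weight on the first copy, while the Elastic Net splits it evenly between the two.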

### Selecting more than n variables

A second structural advantage of Elastic Net is that it can select up to all p variables when the L2 penalty is positive, even in the p > n regime. The proof works by showing that the Elastic Net optimization is equivalent to a Lasso problem on an augmented data set with n + p observations, which is no longer in the p > n regime. This directly removes Lasso's ceiling of selecting at most n variables.[1]
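The augmented-data argument can be checked numerically. The sketch below folds the ridge term into p extra rows and confirms that a plain Lasso on the augmented data reproduces the Elastic Net fit. It uses scikit-learn's parameter names and objective scaling rather than the paper's, and the data and penalty values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 30, 60                                    # p > n
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=n)

alpha, l1_ratio = 0.1, 0.5                       # scikit-learn's strength and mixing names
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                  tol=1e-10, max_iter=200_000).fit(X, y)

# ||y - Xb||^2 + c * ||b||^2 equals ||y_aug - X_aug b||^2 for the stacked data below.
c = n * alpha * (1.0 - l1_ratio)
X_aug = np.vstack([X, np.sqrt(c) * np.eye(p)])   # shape (n + p, p)
y_aug = np.concatenate([y, np.zeros(p)])

# Lasso divides the squared loss by 2 * (n + p), so the L1 weight is rescaled to match.
lasso_alpha = n * alpha * l1_ratio / (n + p)
lasso_aug = Lasso(alpha=lasso_alpha, fit_intercept=False,
                  tol=1e-10, max_iter=200_000).fit(X_aug, y_aug)

max_diff = np.max(np.abs(enet.coef_ - lasso_aug.coef_))
aug_rank = np.linalg.matrix_rank(X_aug)
print("largest coefficient difference:", max_diff)
print("rank of augmented design:", aug_rank, "of", p, "columns")
```

Because the augmented design has n + p rows and full column rank, the Lasso solved on it is no longer limited to n non-zero coefficients.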

## How does Elastic Net compare with Lasso and Ridge?

The table below summarizes how the three methods behave under common scenarios.

| Scenario | [Lasso](/wiki/lasso_regression) (alpha = 1) | Elastic Net (0 < alpha < 1) | [Ridge](/wiki/ridge_regression) (alpha = 0) |
| --- | --- | --- | --- |
| Many irrelevant features | Sets them to zero, sparse model | Sets them to zero, sparse model | Keeps all, dense model |
| Highly correlated features | Picks one arbitrarily | Keeps the group with shrunken coefficients (grouping) | Keeps all, evenly shrunken |
| p much larger than n | Selects at most n variables | Can select up to p variables | All p variables retained |
| Prediction under correlation | Often loses to Ridge | Generally best of the three | Best when all features matter |
| Interpretability | High due to sparsity | High due to sparsity | Low, dense coefficients |
| Closed-form solution | No | No | Yes |
| Hyperparameters to tune | One (lambda) | Two (lambda and alpha) | One (lambda) |

The practical takeaway is that Elastic Net is a safer default than Lasso whenever predictors might be correlated, while still delivering interpretable sparse models that Ridge cannot provide. The cost is one additional hyperparameter, which is usually a worthwhile trade.[2][3]

## Computation

### LARS-EN algorithm

In the original 2005 paper, Zou and Hastie proposed the LARS-EN algorithm, an extension of the Least Angle Regression (LARS) procedure of Efron, Hastie, Johnstone, and Tibshirani (2004). The abstract notes that LARS-EN computes "elastic net regularization paths efficiently, much like the LARS algorithm does for the lasso," with roughly the computational effort of a single OLS fit.[1] While elegant, LARS-EN scales poorly to very large problems because it must maintain a Cholesky factorization of an active-set Gram matrix.[5]

### Coordinate descent and glmnet

The dominant modern algorithm for fitting Elastic Net models is coordinate descent, introduced for this purpose by Friedman, Hastie, and Tibshirani in their 2010 *Journal of Statistical Software* paper. Coordinate descent cycles through the coefficients one at a time, updating each by minimizing the penalized loss with the others held fixed. For Elastic Net, the per-coefficient update has a closed-form soft-thresholding solution:

beta_j_new = S(z_j, lambda * alpha) / (1 + lambda * (1 - alpha))

where z_j is the partial residual correlation with the j-th feature and S(z, gamma) = sign(z) * max(|z| - gamma, 0) is the soft-thresholding operator.
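A minimal NumPy sketch of this update is shown below. It is not the `glmnet` implementation: it assumes centered data with columns scaled so that (1/n) * x_j^T x_j = 1, runs a fixed number of sweeps instead of testing convergence, and omits warm starts and screening. The function names are hypothetical. The result is compared with scikit-learn's `ElasticNet`, whose `alpha` and `l1_ratio` play the roles of lambda and alpha.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def enet_coordinate_descent(X, y, lam, alpha, n_sweeps=500):
    """Cyclic coordinate descent; assumes columns of X have mean 0 and mean square 1."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()                      # equals y - X @ beta while beta is zero
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j's own fit.
            z_j = X[:, j] @ residual / n + beta[j]
            new = soft_threshold(z_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            residual -= X[:, j] * (new - beta[j])   # keep the residual in sync
            beta[j] = new
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the columns
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)
y = y - y.mean()

beta_cd = enet_coordinate_descent(X, y, lam=0.2, alpha=0.5)
beta_sk = ElasticNet(alpha=0.2, l1_ratio=0.5, fit_intercept=False, tol=1e-12).fit(X, y).coef_
print("sketch      :", beta_cd)
print("scikit-learn:", beta_sk)
```

Updating the residual incrementally keeps each coordinate step at O(n) cost, which is what makes full sweeps cheap.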

The `glmnet` package combined coordinate descent with several engineering tricks: computing the entire regularization path on a grid of lambda values, using warm starts so each new lambda is initialized from the previous solution, and supporting sparse design matrices through compressed column storage. Later versions also apply *strong rules* to skip predictors unlikely to enter the active set. The result is a solver that remains fast on very large and very sparse problems, and the package quickly became the de facto standard for L1 and L2 regularized generalized linear models. The 2010 paper covers linear, two-class logistic, and multinomial regression; Cox proportional hazards models were added in a subsequent paper, and the package also supports Poisson regression.[3][6]

## How are the hyperparameters chosen?

### The two knobs

Elastic Net has two main hyperparameters:

- **lambda** (often written as `alpha` in `scikit-learn`'s ElasticNet class, confusingly using the opposite letter from the literature): the overall strength of regularization. Larger lambda shrinks all coefficients more aggressively and produces sparser, simpler models.
- **alpha** (often written as `l1_ratio` in `scikit-learn`): the mixing parameter between L1 and L2 penalties. alpha = 1 is pure Lasso, alpha = 0 is pure Ridge, and intermediate values blend the two.

The scikit-learn `ElasticNet` class minimizes the objective `1 / (2 * n_samples) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2`, where its `alpha` is the overall penalty strength and `l1_ratio` is the L1/L2 mixing parameter.[7] Both parameters affect bias and variance and must be tuned to the specific problem.
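To make the parameter roles concrete, the sketch below writes that objective as a plain function and checks that a fitted model sits at a lower value than a slightly perturbed coefficient vector. The helper name `enet_objective`, the synthetic data, and the parameter values are illustrative assumptions; the intercept is switched off so the function matches the fitted problem exactly.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def enet_objective(X, y, w, alpha, l1_ratio):
    """Objective quoted above, in scikit-learn's parameter names (no intercept)."""
    n_samples = X.shape[0]
    residual = y - X @ w
    return (residual @ residual / (2 * n_samples)
            + alpha * l1_ratio * np.abs(w).sum()
            + 0.5 * alpha * (1 - l1_ratio) * (w @ w))


rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.5]) + 0.5 * rng.normal(size=80)

alpha, l1_ratio = 0.1, 0.7   # overall strength and L1 share, scikit-learn naming
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, tol=1e-10).fit(X, y)

best = enet_objective(X, y, model.coef_, alpha, l1_ratio)
nudged = enet_objective(X, y, model.coef_ + 0.01 * rng.normal(size=6), alpha, l1_ratio)
at_zero = enet_objective(X, y, np.zeros(6), alpha, l1_ratio)
print("fitted:", best, "perturbed:", nudged, "all-zero:", at_zero)
```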

### Cross-validation

The standard approach is k-fold cross-validation. The data are split into k folds; for each candidate combination of lambda and alpha, the model is fit k times, each time holding out one fold and measuring the prediction error on it, and the combination with the lowest error averaged over the k held-out folds is selected. In practice, alpha is typically searched over a small grid such as {0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0} and lambda over a logarithmic grid of perhaps 100 values from a maximum where all coefficients are zero down to a small fraction of that maximum. The `glmnet` and `scikit-learn` implementations both build the lambda grid automatically based on the data.

A popular variant is the *one standard error rule*, which selects the largest lambda whose cross-validated error is within one standard error of the minimum. This rule trades a small amount of predictive accuracy for a noticeably sparser, more interpretable model and is widely used in genomics applications.[2]
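scikit-learn's `ElasticNetCV` selects the minimum-error strength but does not apply the one standard error rule itself, so the rule has to be computed from the stored fold errors. The sketch below does this for a single mixing value; the synthetic data and the choice of `l1_ratio=0.5` are illustrative assumptions, and the regularization strength is called `alpha` in scikit-learn's naming.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=120, n_features=40, n_informative=5,
                       noise=10.0, random_state=0)

n_folds = 5
cv_model = ElasticNetCV(l1_ratio=0.5, cv=n_folds).fit(X, y)

# With one l1_ratio, mse_path_ holds one row of per-fold errors per candidate strength.
alphas = np.ravel(cv_model.alphas_)
fold_mse = cv_model.mse_path_.reshape(len(alphas), n_folds)
mean_mse = fold_mse.mean(axis=1)
se_mse = fold_mse.std(axis=1, ddof=1) / np.sqrt(n_folds)

i_min = np.argmin(mean_mse)
threshold = mean_mse[i_min] + se_mse[i_min]
alpha_1se = alphas[mean_mse <= threshold].max()   # largest strength within one SE

print("minimum-error strength:     ", alphas[i_min])
print("one-standard-error strength:", alpha_1se)
```

Refitting `ElasticNet` at `alpha_1se` then gives the sparser model that the rule is meant to produce.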

## What is Elastic Net used for?

Elastic Net's combination of sparsity and stability under correlation has made it popular across many scientific and industrial domains. The table below summarizes the most prominent application areas.

| Domain | Typical setting | Why Elastic Net helps |
| --- | --- | --- |
| Microarray gene expression | Thousands of probes, dozens of patients (p >> n), co-regulated genes | Selects whole pathways; can choose more than n features |
| Genome-wide association studies | Millions of SNPs in linkage disequilibrium | Groups correlated SNPs rather than picking one per haplotype block |
| Text classification (bag-of-words) | Many word and n-gram features, near-duplicates | Keeps related word features together; sparse output |
| Chemometrics and spectroscopy | Correlated wavelength channels | Stable coefficient estimates across spectral neighborhoods |
| Marketing mix modeling | Overlapping media channels with collinear spend | Distributes effect across correlated channels |
| Credit scoring | Many correlated borrower attributes | Sparse models for regulatory interpretability |
| Quantitative finance | Partially redundant candidate factors | Stable factor selection over time |
| Bioinformatics survival analysis | High-dimensional Cox regression on -omics data | Variable selection plus grouping for biomarker discovery |
| Neuroscience (fMRI decoding) | Thousands of voxel features per subject | Keeps groups of correlated neighboring voxels together instead of isolated picks |

The original Zou and Hastie paper demonstrated Elastic Net on a prostate cancer regression problem and on the leukemia microarray classification data of Golub et al. (1999), which comprises 7,129 genes across 72 samples, split into 38 training samples (27 acute lymphoblastic leukemia and 11 acute myeloid leukemia) and 34 test samples.[1] In that benchmark the Elastic Net selected more genes than Lasso and gave the best classification results among the methods compared, including Golub's procedure, the support vector machine, penalized logistic regression, and the nearest shrunken centroid.[1] Subsequent papers have used Elastic Net for thousands of biomarker-discovery and prediction tasks across cancer, cardiovascular disease, agriculture, ecology, and the social sciences.[1][8]

## Software implementations

The table below lists widely used implementations of Elastic Net.

| Software | Language | Key class or function | Notes |
| --- | --- | --- | --- |
| glmnet | R | `glmnet::glmnet`, `glmnet::cv.glmnet` | Reference implementation by Friedman, Hastie, and Tibshirani; supports binomial, multinomial, Cox, Poisson |
| scikit-learn | Python | `sklearn.linear_model.ElasticNet`, `ElasticNetCV`, `MultiTaskElasticNet` | Coordinate descent; cross-validated variant; sparse support |
| H2O | Java, Python, R | `H2OGeneralizedLinearEstimator` with `alpha` parameter | Distributed; scales to large data |
| Apache Spark MLlib | Scala, Python, Java | `LinearRegression` with `elasticNetParam` and `regParam` | Distributed across a Spark cluster |
| statsmodels | Python | `OLS.fit_regularized(method='elastic_net')` | Useful for econometric workflows |
| MATLAB | MATLAB | `lasso` with `Alpha` parameter | Exposes Elastic Net via the `Alpha` argument |
| Julia | Julia | `Lasso.jl`, `MLJLinearModels.jl` | Native Julia ports of coordinate descent |
| PyGLMNet | Python | `pyglmnet.GLM` | Python port of glmnet for GLMs |

Most users in industry and academia interact with Elastic Net through `glmnet` or `scikit-learn`. The two libraries use compatible parameterizations once the lambda-versus-alpha naming conventions are reconciled, and they should produce closely matching models on the same data provided that standardization, intercept handling, and convergence tolerances are also aligned.

### Code example in scikit-learn

A typical workflow looks like this:

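    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in data so the snippet runs as-is; replace with your own X and y.
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pipe = Pipeline([
        ('scale', StandardScaler()),  # the penalties are not scale-invariant
        ('en', ElasticNetCV(
            l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1.0],  # L1/L2 mixing grid
            cv=5,  # the path of strengths defaults to 100 values per l1_ratio
            random_state=0,
        )),
    ])
    pipe.fit(X_train, y_train)

    # Selected overall strength and mixing value, then held-out R^2.
    print(pipe.named_steps['en'].alpha_, pipe.named_steps['en'].l1_ratio_)
    print(pipe.score(X_test, y_test))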

The `ElasticNetCV` class searches over a grid of `l1_ratio` values and a path of regularization strengths and returns the configuration with the lowest mean squared cross-validation error.

## Extensions and variants

### Adaptive Elastic Net

Zou and Zhang proposed the adaptive Elastic Net in 2009, which uses data-driven weights on each L1 penalty term. This estimator inherits the *oracle property* of the adaptive Lasso, meaning that under regularity conditions it selects the true non-zero coefficients with probability tending to one as the sample size grows. In Zou and Zhang's construction the adaptive weights are computed from an initial Elastic Net fit and then held fixed for the second-stage optimization.[9]

### Multitask Elastic Net

When multiple related response variables share a common feature set, the multi-task Elastic Net jointly fits all responses while encouraging the same set of features to be selected across tasks. This is implemented in `scikit-learn` as `MultiTaskElasticNet` and uses an L1/L2 mixed norm on rows of the coefficient matrix.
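The shared selection pattern can be seen in a short sketch with `MultiTaskElasticNet`. The synthetic data, in which only the first three features matter for all three tasks, and the penalty values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
n, p, n_tasks = 100, 10, 3
X = rng.normal(size=(n, p))

# Only the first three features matter, and they matter for every task.
W_true = np.zeros((p, n_tasks))
W_true[:3] = rng.normal(size=(3, n_tasks))
Y = X @ W_true + 0.1 * rng.normal(size=(n, n_tasks))

model = MultiTaskElasticNet(alpha=0.2, l1_ratio=0.7).fit(X, Y)

# coef_ has shape (n_tasks, n_features); each column is one feature across all tasks.
nonzero = model.coef_ != 0
selected = np.flatnonzero(nonzero.any(axis=0))
print("coefficient matrix shape:", model.coef_.shape)
print("features selected in at least one task:", selected)
```

Because the mixed norm acts on a feature's coefficients across all tasks at once, a feature is either dropped for every task or kept for every task.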

### Sparse group Elastic Net

Simon, Friedman, Hastie, and Tibshirani (2013) introduced the sparse group Lasso, which combines a group Lasso penalty with an L1 penalty to obtain both group-level and within-group sparsity. An analogous Elastic Net version replaces the group Lasso component with a group Elastic Net to obtain stability within groups.[10]

### Generalized linear models

The Elastic Net penalty applies essentially unchanged to logistic regression, multinomial logistic regression, Poisson regression, and Cox proportional hazards regression. In each case the penalized log-likelihood replaces the penalized squared-error loss. `glmnet` handles these models by forming a quadratic approximation to the log-likelihood at each outer iteration and running coordinate descent on the resulting penalized weighted least-squares problem, so the per-coefficient update is still a soft-thresholding step. These extensions are all implemented in `glmnet`; the logistic and multinomial cases were among the principal contributions of the 2010 Friedman, Hastie, and Tibshirani paper, and the Cox model followed in 2011.[3][6]

## Practical considerations

A few practical points come up repeatedly when using Elastic Net. The predictors should be standardized to have zero mean and unit variance before fitting. The L1 and L2 penalties are not scale-invariant, so the amount of shrinkage applied to an unstandardized predictor depends on its measurement units, which is rarely what the modeler intends. Defaults differ between libraries: `glmnet` standardizes internally by default, whereas `scikit-learn`'s `ElasticNet` does not, which is why the scikit-learn example above places a `StandardScaler` in the pipeline.
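The effect of measurement units can be demonstrated directly. The sketch below fits scikit-learn's `ElasticNet` twice, once on the original data and once after multiplying one predictor by 1000, with and without a `StandardScaler` step. The data, the scaling factor, the penalty values, and the helper name `fitted_values` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0        # same information, different unit for feature 0


def fitted_values(data, standardize):
    """In-sample predictions from an Elastic Net, optionally preceded by scaling."""
    steps = [StandardScaler()] if standardize else []
    model = make_pipeline(*steps, ElasticNet(alpha=0.5, l1_ratio=0.5))
    return model.fit(data, y).predict(data)


raw_change = np.max(np.abs(fitted_values(X, False) - fitted_values(X_rescaled, False)))
std_change = np.max(np.abs(fitted_values(X, True) - fitted_values(X_rescaled, True)))
print("change in fitted values without standardization:", raw_change)
print("change in fitted values with standardization:   ", std_change)
```

Without scaling, the rescaled predictor needs only a tiny coefficient and so escapes most of the penalty, which changes the fit; with scaling, the two fits coincide up to floating-point error.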

The response should usually be centered as well so that the intercept can be left unpenalized. Penalizing the intercept biases predictions toward the origin and is almost never desirable. The cross-validation error curve as a function of lambda is often very flat near its minimum, so selecting lambda by the one-standard-error rule rather than the strict minimum yields a sparser model with negligible loss in predictive accuracy and is recommended when interpretability matters.

Elastic Net coefficients are biased estimators because of the shrinkage they apply. If the goal is statistical inference about the coefficients (confidence intervals, hypothesis tests), additional procedures such as the post-selection inference framework or the debiased Lasso must be applied on top of the Elastic Net fit.

## When was Elastic Net introduced?

The Elastic Net paper was published in 2005 in the *Journal of the Royal Statistical Society, Series B (Statistical Methodology)*, volume 67, part 2, pages 301 to 320.[1] It was Hui Zou's PhD dissertation work at Stanford under Trevor Hastie's supervision and was originally circulated as a Stanford technical report dated December 5, 2003 (revised August 2004).[1] The paper has since been cited tens of thousands of times and is one of the most influential statistics papers of the early twenty-first century.

The `glmnet` R package, which made Elastic Net practical for very large problems, was released in 2010 with the paper of the same year by Jerome Friedman, Trevor Hastie, and Robert Tibshirani in the *Journal of Statistical Software*. The package is still maintained by Trevor Hastie and continues to be the reference implementation against which other Elastic Net solvers are compared. Elastic Net is part of the broader sparsity movement that includes Lasso (Tibshirani 1996), LARS (Efron et al. 2004), the Dantzig selector (Candes and Tao 2007), and the development of compressed sensing (Donoho 2006). It bridges the geometric ideas of L1 sparsity with the numerical stability of L2 regularization in a way that has proven both theoretically pleasing and practically robust.[11]

## See also

- [Linear regression](/wiki/linear_regression)
- [Lasso regression](/wiki/lasso_regression)
- [Ridge regression](/wiki/ridge_regression)
- [Regularization](/wiki/regularization)
- [L1 regularization](/wiki/l1_regularization)
- [L2 regularization](/wiki/l2_regularization)
- [Feature selection](/wiki/feature_selection)
- [Machine learning](/wiki/machine_learning)
- [Cross-validation](/wiki/cross-validation)
- [Generalized linear model](/wiki/generalized_linear_model)
- [Logistic regression](/wiki/logistic_regression)
- Cox proportional hazards model

## References

1. Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic net." *Journal of the Royal Statistical Society, Series B (Statistical Methodology)*, 67(2), 301-320. Retrieved from [https://hastie.su.domains/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf](https://hastie.su.domains/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf).

2. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction.* Second edition. Springer Series in Statistics. New York: Springer. ISBN 978-0-387-84857-0. Retrieved from [https://hastie.su.domains/ElemStatLearn/](https://hastie.su.domains/ElemStatLearn/).

3. Friedman, J., Hastie, T., and Tibshirani, R. (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent." *Journal of Statistical Software*, 33(1), 1-22. Retrieved from [https://www.jstatsoft.org/article/view/v033i01](https://www.jstatsoft.org/article/view/v033i01).

4. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society, Series B (Methodological)*, 58(1), 267-288.

5. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least Angle Regression." *The Annals of Statistics*, 32(2), 407-499.

6. Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). "Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent." *Journal of Statistical Software*, 39(5), 1-13.

7. scikit-learn developers. "ElasticNet." scikit-learn API Reference. Retrieved from [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html).

8. Hastie, T. and Qian, J. (2014). "Glmnet Vignette." Stanford University. Retrieved from [https://hastie.su.domains/Papers/Glmnet_Vignette.pdf](https://hastie.su.domains/Papers/Glmnet_Vignette.pdf).

9. Zou, H. and Zhang, H. H. (2009). "On the adaptive elastic-net with a diverging number of parameters." *The Annals of Statistics*, 37(4), 1733-1751.

10. Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). "A Sparse-Group Lasso." *Journal of Computational and Graphical Statistics*, 22(2), 231-245.

11. Hastie, T., Tibshirani, R., and Wainwright, M. (2015). *Statistical Learning with Sparsity: The Lasso and Generalizations.* Boca Raton: CRC Press. ISBN 978-1-4987-1216-3.
