# Lambda

> Source: https://aiwiki.ai/wiki/lambda
> Updated: 2026-06-25
> Categories: Machine Learning, Mathematics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Lambda (the Greek letter λ) is a symbol used across machine learning, statistics, and computer science to denote several distinct quantities, the most common in machine learning being the strength of a [regularization](/wiki/regularization) penalty added to a loss function. The same letter also marks eigenvalues in linear algebra (where A v = λ v), the trace-decay parameter in TD(λ) reinforcement learning, the context-summarizing linear function in lambda layers, and [Lagrange multipliers](/wiki/lagrange_multiplier) in constrained optimization. The name additionally appears in unrelated computing contexts such as lambda calculus and AWS Lambda.

This article focuses on the regularization meaning of lambda, since that is the most frequent usage in machine learning, and then disambiguates the other common uses.

## What does lambda mean at a glance?

| Context | Symbol | Role |
| --- | --- | --- |
| Regularized regression | λ | Strength of the [L1 regularization](/wiki/l1_regularization) or [L2 regularization](/wiki/l2_regularization) penalty |
| [Elastic net](/wiki/elastic_net) | λ, α | Overall penalty strength and L1/L2 mix |
| Linear algebra | λ | [Eigenvalue](/wiki/eigenvalue) of a matrix |
| [Principal component analysis](/wiki/principal_component_analysis) | λ_i | Variance captured by the i-th component |
| Reinforcement learning | λ | Trace decay parameter in TD(λ) and eligibility traces |
| Constrained optimization | λ | [Lagrange multiplier](/wiki/lagrange_multiplier) on a constraint |
| LambdaNetworks | λ | A small linear function summarizing context for an input |
| Lambda calculus | λ | Binder for an anonymous function |
| AWS | Lambda | Serverless compute service from [AWS](/wiki/aws_lambda) |

## What is lambda as a regularization hyperparameter?

Lambda is most commonly used in machine learning as the multiplier on a penalty term added to a loss function. The penalty discourages large model coefficients, which tends to reduce overfitting and improve generalization. It is particularly relevant in [linear regression](/wiki/linear_regression) and [logistic regression](/wiki/logistic_regression) models, but the same idea applies to neural networks (where it is often called weight decay) and to many other estimators.

The regularized objective has the general form:

minimize  L(β) + λ · R(β)

where L(β) is the data fit term, R(β) is the penalty function, and λ ≥ 0 controls how much the penalty matters relative to the fit. When λ is zero the model reduces to the unregularized estimator. As λ grows the coefficients are pushed toward zero, which makes the model simpler and less sensitive to noise in the training data. In scikit-learn this exact behavior is documented: setting the penalty constant to 0 makes the objective equivalent to ordinary least squares, while larger values specify stronger regularization [7].
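The effect of λ on the coefficients can be illustrated with a small sketch. It assumes a squared-error fit term and a squared L2 penalty (ridge), for which the minimizer has a closed form; the helper name `ridge_fit` and the synthetic data are illustrative only, not part of any library.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 via the regularized normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression problem (assumed purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

# With lam = 0 the estimator is ordinary least squares.
ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]

# The coefficient norm shrinks as lam grows.
lambdas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
```

Printing `norms` should show a decreasing sequence, and `ridge_fit(X, y, 0.0)` should agree with `ols_beta` up to floating-point error.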

### What are the regularization formulas that use lambda?

The table below shows how lambda enters the three most common regularized least-squares estimators. Here β are the coefficients, y the targets, and X the design matrix.

| Method | Penalty R(β) | Objective minimized | Effect on coefficients |
| --- | --- | --- | --- |
| [Ridge regression](/wiki/ridge_regression) (L2) | Σ β_i² | ‖y − Xβ‖² + λ Σ β_i² | Shrinks all coefficients smoothly toward zero |
| [Lasso regression](/wiki/lasso_regression) (L1) | Σ \|β_i\| | ‖y − Xβ‖² + λ Σ \|β_i\| | Shrinks some coefficients exactly to zero, performing variable selection |
| [Elastic net](/wiki/elastic_net) | α Σ \|β_i\| + (1 − α) Σ β_i² | ‖y − Xβ‖² + λ (α Σ \|β_i\| + (1 − α) Σ β_i²) | Mixes L1 and L2; useful when predictors are correlated |

Lasso was introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the *Journal of the Royal Statistical Society, Series B*, volume 58, pages 267-288 [1]. The lasso minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, and because of the nature of this constraint it tends to produce some coefficients that are exactly 0, giving interpretable models [1]. Elastic net was proposed by Hui Zou and Trevor Hastie in 2005 in the same journal, volume 67, pages 301-320, as a way to keep groups of correlated features together rather than letting lasso pick one and drop the rest [2]. Zou and Hastie showed that the elastic net "encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together" [2].

### Why does lambda matter?

Fitting a model that minimizes only the data loss can give large coefficients that fit the training data well but generalize poorly. Adding a penalty multiplied by lambda biases the solution toward simpler, smaller coefficients. Small λ leaves the coefficients close to the unregularized fit, so the model has high variance and may overfit. Large λ pushes the coefficients close to zero, so the model has higher bias and may underfit. The right value depends on the data, the noise level, and how many features there are relative to the number of training examples.

For lasso, increasing λ also drives more coefficients to exactly zero, which gives a sparse model [1]. For ridge, all coefficients shrink toward zero, but for any finite λ they are generally not set exactly to zero. Elastic net interpolates between the two [2].
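The contrast between lasso and ridge is easiest to see in the special case of an orthonormal design (X^T X = I), which is an assumption made here only to obtain closed forms. Writing z = X^T y for the unregularized coefficients, the objectives in the table above give soft-thresholding at λ/2 for lasso and uniform rescaling by 1/(1 + λ) for ridge. The function names below are illustrative.

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Lasso solution for ||y - X b||^2 + lam * sum|b_i| when X^T X = I."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

def ridge_orthonormal(z, lam):
    """Ridge solution for ||y - X b||^2 + lam * sum b_i^2 when X^T X = I."""
    return z / (1.0 + lam)

# Unregularized coefficients (made-up values for illustration).
z = np.array([3.0, -1.5, 0.4, -0.2])

lasso_beta = lasso_orthonormal(z, lam=1.0)  # small entries are cut to exactly zero
ridge_beta = ridge_orthonormal(z, lam=1.0)  # every entry is halved, none becomes zero
```

With λ = 1 the lasso threshold is 0.5, so the two smallest coefficients are removed while the two largest are reduced by 0.5 in magnitude; ridge keeps all four.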

### How do you choose lambda?

Because the optimal value depends on the data, lambda is treated as a [hyperparameter](/wiki/hyperparameter) and tuned outside of the main fit. Common strategies:

| Strategy | How it works |
| --- | --- |
| Grid search with [cross-validation](/wiki/cross-validation) | Try a logarithmic grid of λ values; pick the one with the lowest mean validation error |
| Information criteria | Choose λ that minimizes AIC or BIC on the training set |
| LARS path | Compute the entire lasso solution path in one pass, then pick a λ along the path |
| [Bayesian optimization](/wiki/bayesian_optimization) | Treat λ as a black-box hyperparameter and search efficiently |

The LARS algorithm by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, published in *The Annals of Statistics* in 2004, volume 32, is a useful tool here [3]. A simple modification of LARS computes the entire lasso solution path as a piecewise linear function of λ, which lets you see how each coefficient enters or leaves the model as the penalty changes [3]. This is also why LARS implementations such as the R package `lars` and lasso path plots are common in regression workflows.

When the noise level σ is known approximately, theoretical work has suggested λ on the order of σ √(2 log p / n) for the lasso, where p is the number of features and n is the sample size. In practice σ is rarely known in advance, so [cross-validation](/wiki/cross-validation) is the more common choice.
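The grid-search strategy from the table above can be sketched as follows. This is a minimal illustration using ridge regression and NumPy only; the helpers `ridge_fit` and `cv_error`, the fold layout, and the synthetic data are all assumptions for the example, and a real workflow would more likely use a library routine.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate for ||y - X b||^2 + lam * ||b||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Mean squared validation error of ridge, averaged over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        beta = ridge_fit(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data: only 2 of 20 features carry signal (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=60)

# Logarithmic grid of candidate values, then pick the lowest mean validation error.
lambda_grid = np.logspace(-3, 3, 13)
cv_errors = [cv_error(X, y, lam) for lam in lambda_grid]
best_lambda = lambda_grid[int(np.argmin(cv_errors))]
```

A logarithmic grid is used because useful values of λ often span several orders of magnitude; if the best value lands on an edge of the grid, the grid should be widened.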

### What do scikit-learn and other libraries call lambda?

Different toolkits use different names for the same parameter. The math is the same; only the label changes.

| Library | Parameter name | Notes |
| --- | --- | --- |
| scikit-learn (`Ridge`, `Lasso`, `ElasticNet`) | `alpha` | Plays the role of textbook λ; default 1.0, must be ≥ 0, and `alpha = 0` recovers ordinary least squares [7] |
| scikit-learn (`ElasticNet`) | `l1_ratio` | The L1/L2 mix |
| scikit-learn (`LogisticRegression`, `LinearSVC`) | `C` | Inverse strength; `alpha` corresponds to 1 / (2C) [7] |
| glmnet (R) | `lambda` | Penalty strength, returned as a sequence along the path |
| glmnet (R) | `alpha` | L1/L2 mix |
| Keras / PyTorch | `weight_decay` | Plays the role of λ for L2 penalties on weights |

The scikit-learn convention of calling it `alpha` is a frequent source of confusion for people coming from a statistics background, where lambda is standard [7]. The two also disagree about scaling: as discussed in scikit-learn issue 21891, `Lasso(alpha)` and `Ridge(alpha)` use slightly different conventions for the relationship between the penalty constant and the per-sample loss.
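The scaling difference can be made concrete with a small conversion sketch. It assumes the objectives stated in the scikit-learn user guide, namely `‖y − Xw‖² + alpha ‖w‖²` for `Ridge` and `(1 / (2 n_samples)) ‖y − Xw‖² + alpha ‖w‖₁` for `Lasso`, and the textbook form `‖y − Xβ‖² + λ R(β)` used in this article. The helper names are hypothetical and are not part of scikit-learn.

```python
def sklearn_alpha_from_lambda(lam, n_samples, estimator):
    """Map textbook lambda to scikit-learn's alpha under the assumed objectives.

    Ridge: the loss is not rescaled, so alpha equals lambda.
    Lasso: the loss is divided by 2 * n_samples, so the penalty must be too.
    """
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    if estimator == "ridge":
        return float(lam)
    if estimator == "lasso":
        return lam / (2.0 * n_samples)
    raise ValueError(f"unknown estimator: {estimator!r}")

def c_from_alpha(alpha):
    """Inverse-strength parameter C, using the alpha = 1 / (2C) correspondence."""
    if alpha <= 0:
        raise ValueError("C is only defined for alpha > 0")
    return 1.0 / (2.0 * alpha)

ridge_alpha = sklearn_alpha_from_lambda(10.0, n_samples=100, estimator="ridge")
lasso_alpha = sklearn_alpha_from_lambda(10.0, n_samples=100, estimator="lasso")
```

Under these assumptions the same textbook λ = 10 on 100 samples maps to `alpha = 10` for ridge but `alpha = 0.05` for lasso, which is why values should not be copied between estimators without checking the objective.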

### How does lambda relate to a Bayesian prior?

The regularization parameter has a clean Bayesian reading. If you put an independent zero-mean Gaussian prior on each coefficient, the posterior mode is the ridge estimator and λ is proportional to the ratio of noise variance to prior variance [9]. A larger λ corresponds to a tighter prior, which is why it pulls the estimates more strongly toward zero.
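The Gaussian case can be checked numerically. The sketch below assumes a linear model with noise variance σ² and an independent N(0, τ²) prior on each coefficient; the specific variances and the synthetic data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.5 * rng.normal(size=40)

noise_var = 0.25  # sigma^2, assumed known for this illustration
prior_var = 2.0   # tau^2, variance of the zero-mean Gaussian prior

# Gaussian posterior: precision = X^T X / sigma^2 + I / tau^2.
# For a Gaussian posterior the mode and the mean coincide.
posterior_precision = X.T @ X / noise_var + np.eye(4) / prior_var
posterior_mode = np.linalg.solve(posterior_precision, X.T @ y / noise_var)

# Ridge estimate with lambda = sigma^2 / tau^2.
lam = noise_var / prior_var
ridge_beta = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```

The two vectors agree up to floating-point error, and shrinking `prior_var` (a tighter prior) increases `lam`.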

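For lasso the analogue is a Laplace (double exponential) prior on each coefficient. The mode of the posterior is then the lasso estimate, and the scale parameter of the prior is inversely proportional to λ [9]. Both priors are continuous, so neither places probability mass exactly at zero. The difference is that the Laplace log-density has a sharp, non-differentiable peak at zero, which allows the posterior mode to sit exactly at zero, whereas the smooth Gaussian log-density only shrinks the mode toward zero. This is the underlying reason lasso produces sparse solutions while ridge does not.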

### How does lambda connect to constrained optimization?

The penalized form `min L(β) + λ R(β)` is, up to the constant −λt, the Lagrangian of a constrained problem of the form `min L(β) subject to R(β) ≤ t`. The Lagrange multiplier on the inequality constraint is exactly λ, and the Karush-Kuhn-Tucker (KKT) conditions tie the two formulations together [10]. Increasing λ corresponds to tightening the constraint, that is, to a smaller budget t, which is why a larger penalty leads to smaller coefficients.

This duality is more than a notational coincidence. It is the reason solvers can switch between thinking about "how much penalty to add" and "how big a coefficient budget to allow" [10]. It also generalizes to many other ML settings, from constrained policy optimization in reinforcement learning to Lagrangian formulations of fairness constraints.
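The correspondence between penalty and budget can be checked numerically for ridge, where L(β) = ‖y − Xβ‖² and R(β) = ‖β‖². The sketch below is an illustration under those assumptions with synthetic data: for each λ it evaluates the stationarity condition of the Lagrangian at the ridge solution and records the implied budget t = R(β̂).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate for ||y - X b||^2 + lam * ||b||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + 0.3 * rng.normal(size=30)

residual_norms = {}
budgets = {}
for lam in (0.1, 1.0, 10.0):
    beta = ridge_fit(X, y, lam)
    # Stationarity: grad L(beta) + lam * grad R(beta) should vanish.
    stationarity = -2.0 * X.T @ (y - X @ beta) + 2.0 * lam * beta
    residual_norms[lam] = float(np.linalg.norm(stationarity))
    # With lam > 0 the constraint is active, so the budget is t = R(beta).
    budgets[lam] = float(beta @ beta)
```

The stationarity residuals are zero up to rounding, and the implied budget shrinks as λ grows, matching the statement that a larger multiplier corresponds to a tighter constraint.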

## What is lambda as an eigenvalue?

In linear algebra, lambda is the standard symbol for an eigenvalue. Given a square matrix A, an eigenvalue λ and eigenvector v satisfy Av = λv. Eigenvalues show up throughout machine learning whenever a method involves a covariance matrix, a kernel matrix, or a graph Laplacian.

| Application | Where eigenvalues appear |
| --- | --- |
| [Principal component analysis](/wiki/principal_component_analysis) | The eigenvalues of the sample covariance matrix give the variance captured by each principal component; ranking them gives the scree plot |
| [Singular value decomposition](/wiki/singular_value_decomposition) | The squared singular values of X are the eigenvalues of X^T X |
| Kernel methods | Eigenvalues of the kernel (Gram) matrix appear in kernel PCA and spectral clustering |
| Spectral clustering | Smallest eigenvalues of the graph Laplacian reveal cluster structure |
| Stability of training | Eigenvalues of the loss Hessian relate to step-size choice and curvature |

In PCA the eigenvalue structure has a particularly direct interpretation: each eigenvalue equals the variance of the projection of the data onto the corresponding eigenvector, so the proportion of variance explained by component i is λ_i divided by the sum of all eigenvalues [9]. The same matrix and the same lambdas show up under the spectral decomposition Σ = V Λ V^T, with Λ the diagonal matrix of eigenvalues.
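The PCA interpretation can be verified on toy data. This sketch assumes the sample covariance with the usual n − 1 denominator and uses synthetic data with deliberately unequal variances per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with most of the variance along the first axis (illustrative only).
data = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])

centered = data - data.mean(axis=0)
covariance = centered.T @ centered / (len(data) - 1)

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Proportion of variance explained by each component: lambda_i / sum of lambdas.
explained_ratio = eigenvalues / eigenvalues.sum()

# Variance of the data projected onto each eigenvector.
projected_variance = (centered @ eigenvectors).var(axis=0, ddof=1)
```

Here `projected_variance` matches `eigenvalues`, and `eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T` reconstructs the covariance matrix, which is the decomposition Σ = V Λ V^T.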

## What is TD(λ) in reinforcement learning?

In reinforcement learning, lambda is the trace-decay parameter in TD(λ), introduced by Richard Sutton in his 1988 paper "Learning to predict by the methods of temporal differences" in *Machine Learning*, volume 3, pages 9-44 [4]. TD(λ) uses an eligibility trace, a temporary memory of recently visited states, that decays geometrically at each step by a factor of γλ, where γ is the discount factor (so by λ alone in the undiscounted case). At every step the temporal-difference error is applied to all states in proportion to their current eligibility, so recently visited states receive most of the credit. Sutton's central idea was that the new methods "assign credit by means of the difference between temporally successive predictions" rather than by the difference between predicted and actual outcomes [4].

The parameter λ controls a continuous interpolation between two extreme algorithms:

| Value of λ | Equivalent algorithm | Behaviour |
| --- | --- | --- |
| λ = 0 | One-step TD (also called TD(0)) | Updates only the current state from the next state's estimate |
| 0 < λ < 1 | Mixed TD(λ) | Blends one-step bootstrapping with longer multi-step backups |
| λ = 1 | Monte Carlo (offline equivalence) | Updates from the full return, like Monte Carlo methods |

Intermediate values of λ often outperform either extreme [8]. Eligibility traces are now a standard tool in RL and appear in many algorithms, including SARSA(λ), Q(λ), and the True Online TD(λ) variant by Harm van Seijen and Sutton, published at ICML 2014 (pages 692-700), which makes the forward-view and backward-view equivalence exact for the online setting rather than only the offline one [5].
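A minimal sketch of the backward view with accumulating eligibility traces is given below. The environment (a five-state symmetric random walk with reward 1 for exiting on the right), the step size, and the function name are assumptions chosen for illustration; this is tabular TD(λ), not the True Online variant.

```python
import random

def td_lambda_random_walk(lam, n_episodes=200, alpha=0.1, gamma=1.0,
                          n_states=5, seed=0):
    """Tabular TD(lambda) with accumulating traces on a symmetric random walk.

    States are 0..n_states-1. Stepping off the right end yields reward 1,
    stepping off the left end yields reward 0, and both end the episode.
    """
    rng = random.Random(seed)
    values = [0.5] * n_states
    for _ in range(n_episodes):
        traces = [0.0] * n_states          # eligibility of each state
        state = n_states // 2              # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            done = next_state < 0 or next_state >= n_states
            reward = 1.0 if next_state >= n_states else 0.0
            next_value = 0.0 if done else values[next_state]
            td_error = reward + gamma * next_value - values[state]
            traces[state] += 1.0           # mark the current state as eligible
            for s in range(n_states):
                values[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam   # geometric decay of eligibility
            if done:
                break
            state = next_state
    return values

one_step_values = td_lambda_random_walk(lam=0.0)
mixed_values = td_lambda_random_walk(lam=0.8)
```

With `lam=0.0` the traces vanish after every step, so only the state just left is updated, which is one-step TD. With `lam` closer to 1, the final reward of an episode also updates states visited many steps earlier.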

## What are lambda layers and LambdaNetworks?

A different use of the symbol comes from the LambdaNetworks paper by Irwan Bello, presented at ICLR 2021 under the title "LambdaNetworks: Modeling Long-Range Interactions Without Attention" [6]. As the paper describes it, "lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately" [6]. A lambda layer therefore summarizes the surrounding context of an input position into a small linear function, called a lambda, and then applies that linear function to the query at that position.

The goal is to capture long-range interactions like self-attention does, but without the quadratic memory cost of attention maps. Lambda layers model both content-based and position-based interactions, which makes them practical for large structured inputs like images. The resulting LambdaResNets reach competitive accuracy on ImageNet and COCO benchmarks while running 3.2 to 4.4 times faster than the popular EfficientNets on modern machine learning accelerators, and up to 9.5 times faster when trained with 130 million extra pseudo-labeled images [6]. In this setting the symbol λ refers to the linear function itself, not to a scalar hyperparameter. The naming is a deliberate nod to lambda calculus, where λ binds an anonymous function. LambdaResNets are sometimes compared with [vision transformers](/wiki/vision_transformer_vit), since both aim to model long-range structure beyond convolution.
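The core idea can be sketched in a few lines of NumPy. This is a heavily simplified, content-only illustration based on the description above: position lambdas, multiple query heads, and normalization layers are omitted, and the projection matrices and shapes are arbitrary assumptions rather than the paper's configuration.

```python
import numpy as np

def content_lambda_layer(inputs, context, w_query, w_key, w_value):
    """Content-only lambda layer sketch (position-based interactions omitted)."""
    queries = inputs @ w_query    # (n, k): one query per input position
    keys = context @ w_key        # (m, k): one key per context position
    values = context @ w_value    # (m, v): one value per context position
    # Normalize keys with a softmax across the context positions (axis 0).
    keys = np.exp(keys - keys.max(axis=0, keepdims=True))
    keys = keys / keys.sum(axis=0, keepdims=True)
    # The content lambda: a (k, v) linear map that summarizes the whole context.
    content_lambda = keys.T @ values
    # Apply the same linear function to every query; no (n, m) map is formed.
    return queries @ content_lambda

rng = np.random.default_rng(0)
n, m, d, k, v = 6, 10, 8, 4, 5
inputs = rng.normal(size=(n, d))
context = rng.normal(size=(m, d))
w_query = rng.normal(size=(d, k))
w_key = rng.normal(size=(d, k))
w_value = rng.normal(size=(d, v))

output = content_lambda_layer(inputs, context, w_query, w_key, w_value)
```

The intermediate `content_lambda` has shape (k, v) regardless of how many context positions there are, which is the source of the memory saving relative to an attention map of shape (n, m).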

## What are the other uses of lambda in ML and CS?

A few other places where lambda or Lambda shows up around AI and ML, listed briefly because they are distinct from the regularization hyperparameter:

| Use | What it is |
| --- | --- |
| Stein-type shrinkage | Estimators of the form λ · target + (1 − λ) · sample, where λ pulls the sample mean toward a fixed target |
| Mixing weights | Convex combinations like λ · model_A + (1 − λ) · model_B for ensembling, interpolation, or distillation |
| [Lambda calculus](/wiki/lambda_calculus) | Formal model of computation developed by Alonzo Church in the 1930s and used in his 1936 paper "An Unsolvable Problem of Elementary Number Theory"; inspiration for anonymous functions [11] |
| Python `lambda` keyword | Defines an anonymous inline function, common in pandas and PyTorch pipelines (for example `df.apply(lambda x: x * 2)`) |
| [AWS Lambda](/wiki/aws_lambda) | Serverless compute service from Amazon Web Services, often used to deploy lightweight ML inference endpoints |

None of these are the same concept as the regularization hyperparameter. They share the symbol or the name and sometimes appear in the same code or paper, which is a frequent source of confusion for newcomers.
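A short sketch shows how two of these meanings can meet in the same few lines of Python. The variable names and numbers are made up for illustration.

```python
# `lambda` is a reserved keyword in Python, so a numeric mixing weight
# needs a different name; `lam` is a common choice.
lam = 0.25

# The `lambda` keyword defines an anonymous function. Here it builds a
# convex combination of two model outputs using the mixing weight above.
mix = lambda pred_a, pred_b: lam * pred_a + (1 - lam) * pred_b

blended = mix(8.0, 4.0)  # 0.25 * 8.0 + 0.75 * 4.0
```

The keyword and the weight are unrelated concepts: one is syntax for a function, the other is a number between 0 and 1.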

## Explain like I'm 5 (ELI5)

Lambda is like a helper in machine learning that makes sure a model doesn't get too focused on learning every tiny detail from the data. It helps the model to learn the most important patterns so it can make good predictions on new, unseen data. There are two popular ways to use lambda, called Lasso and Ridge, which have different ways of helping the model. Finding the right amount of help from lambda is important for the model to do its job well.

The same letter shows up in other parts of math and computer science too. In linear algebra it is a number that describes how a matrix stretches a special direction. In reinforcement learning it is a knob that controls how far back in time a robot remembers when it gets a reward. In Python the word lambda is used to write tiny one-line functions. They are all different ideas; they just borrowed the same Greek letter.

## References

1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. *Journal of the Royal Statistical Society, Series B*, 58(1), 267-288. https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1996.tb02080.x
2. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. *Journal of the Royal Statistical Society, Series B*, 67(2), 301-320. https://academic.oup.com/jrsssb/article/67/2/301/7109482
3. Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least Angle Regression. *The Annals of Statistics*, 32(2), 407-499. https://tibshirani.su.domains/ftp/lars.pdf
4. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. *Machine Learning*, 3, 9-44. https://link.springer.com/article/10.1007/BF00115009
5. van Seijen, H., & Sutton, R. S. (2014). True Online TD(λ). *Proceedings of the 31st International Conference on Machine Learning*, 692-700. https://proceedings.mlr.press/v32/seijen14.html
6. Bello, I. (2021). LambdaNetworks: Modeling Long-Range Interactions Without Attention. *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/2102.08602
7. scikit-learn developers. Lasso, Ridge, and ElasticNet documentation. https://scikit-learn.org/stable/modules/linear_model.html
8. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Chapter 12: Eligibility Traces. http://www.incompleteideas.net/book/the-book-2nd.html
9. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/
10. Boyd, S., & Vandenberghe, L. (2004). *Convex Optimization*. Cambridge University Press. Chapter 5: Duality. https://web.stanford.edu/~boyd/cvxbook/
11. Church, A. (1936). An Unsolvable Problem of Elementary Number Theory. *American Journal of Mathematics*, 58(2), 345-363. https://www.jstor.org/stable/2371045

