See also: Machine learning terms
In machine learning, statistics, and computer science, the Greek letter lambda (λ) shows up in several different roles. The most common meaning in supervised learning is the strength of a regularization penalty, but the same symbol also denotes eigenvalues in linear algebra, the mixing parameter in TD(λ) reinforcement learning, the linear functions in lambda layers, and Lagrange multipliers in constrained optimization. The name also appears in unrelated places such as AWS Lambda and lambda calculus.
This article focuses on the regularization meaning of lambda, since that is the most frequent usage in machine learning, and then disambiguates the other common uses.
| Context | Symbol | Role |
|---|---|---|
| Regularized regression | λ | Strength of the L1 regularization or L2 regularization penalty |
| Elastic net | λ, α | Overall penalty strength and L1/L2 mix |
| Linear algebra | λ | Eigenvalue of a matrix |
| Principal component analysis | λ_i | Variance captured by the i-th component |
| Reinforcement learning | λ | Trace decay parameter in TD(λ) and eligibility traces |
| Constrained optimization | λ | Lagrange multiplier on a constraint |
| LambdaNetworks | λ | A small linear function summarizing context for an input |
| Lambda calculus | λ | Binder for an anonymous function |
| AWS | Lambda | Serverless compute service from AWS |
Lambda is most commonly used in machine learning as the multiplier on a penalty term added to a loss function. The penalty discourages large model coefficients, which tends to reduce overfitting and improve generalization. It is particularly relevant in linear regression and logistic regression models, but the same idea applies to neural networks (where it is often called weight decay) and to many other estimators.
The regularized objective has the general form:
minimize L(β) + λ · R(β)
where L(β) is the data fit term, R(β) is the penalty function, and λ ≥ 0 controls how much the penalty matters relative to the fit. When λ is zero the model reduces to the unregularized estimator. As λ grows the coefficients are pushed toward zero, which makes the model simpler and less sensitive to noise in the training data.
The table below shows how lambda enters the three most common regularized least-squares estimators. Here β are the coefficients, y the targets, and X the design matrix.
| Method | Penalty R(β) | Objective minimized | Effect on coefficients |
|---|---|---|---|
| Ridge regression (L2) | Σ β_i² | ‖y − Xβ‖² + λ Σ β_i² | Shrinks all coefficients smoothly toward zero |
| Lasso regression (L1) | Σ |β_i| | ‖y − Xβ‖² + λ Σ |β_i| | Shrinks some coefficients exactly to zero, performing variable selection |
| Elastic net | α Σ |β_i| + (1 − α) Σ β_i² | ‖y − Xβ‖² + λ (α Σ |β_i| + (1 − α) Σ β_i²) | Mixes L1 and L2; useful when predictors are correlated |
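As a concrete sketch of the ridge row above, the closed-form solution (X^T X + λI)^{-1} X^T y can be computed directly in NumPy. The synthetic data and λ values here are illustrative only; the point is to see the coefficients shrink as λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    # Closed-form ridge estimator: (X^T X + lam I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    b = ridge(X, y, lam)
    print(f"lam={lam:6.1f}  coefficients={np.round(b, 3)}")
```

At λ = 0 this reproduces ordinary least squares; at λ = 100 every coefficient is visibly pulled toward zero, matching the "shrinks all coefficients smoothly" description in the table.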
Lasso was introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B. Elastic net was proposed by Hui Zou and Trevor Hastie in 2005 as a way to keep groups of correlated features together rather than letting lasso pick one and drop the rest.
Fitting a model that minimizes only the data loss can give large coefficients that fit the training data well but generalize poorly. Adding a penalty multiplied by lambda biases the solution toward simpler, smaller coefficients. Small λ leaves the coefficients close to the unregularized fit, so the model has high variance and may overfit. Large λ pushes the coefficients close to zero, so the model has higher bias and may underfit. The right value depends on the data, the noise level, and how many features there are relative to the number of training examples.
For lasso, increasing λ also drives more coefficients to exactly zero, which gives a sparse model. For ridge, no coefficient is ever exactly zero, but all of them shrink. Elastic net interpolates between the two.
Because the optimal value depends on the data, lambda is treated as a hyperparameter and tuned outside of the main fit. Common strategies:
| Strategy | How it works |
|---|---|
| Grid search with cross-validation | Try a logarithmic grid of λ values; pick the one with the lowest mean validation error |
| Information criteria | Choose λ that minimizes AIC or BIC on the training set |
| LARS path | Compute the entire lasso solution path in one pass, then pick a λ along the path |
| Bayesian optimization | Treat λ as a black-box hyperparameter and search efficiently |
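The first strategy in the table, a logarithmic grid with cross-validation, can be sketched in a few lines. This is a minimal hand-rolled k-fold loop over the ridge closed form, not a production tuner; the grid bounds and fold count are arbitrary choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # Mean validation MSE over k folds for a single lambda
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = y[test_idx] - X[test_idx] @ b
        errs.append(np.mean(resid ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 20))
y = 2.0 * X[:, 0] + rng.standard_normal(120)

lambdas = np.logspace(-3, 3, 13)            # logarithmic grid of candidates
scores = [cv_mse(X, y, lam) for lam in lambdas]
best = lambdas[int(np.argmin(scores))]
print(f"best lambda on the grid: {best:.3g}")
```

Library implementations such as scikit-learn's RidgeCV and glmnet's cv.glmnet automate exactly this loop, usually with smarter warm-starting along the λ path.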
The LARS algorithm by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, published in The Annals of Statistics in 2004, is a useful tool here. A simple modification of LARS computes the entire lasso solution path as a piecewise linear function of λ, which lets you see how each coefficient enters or leaves the model as the penalty changes. This is why lasso path plots, showing each coefficient as a function of λ, are common in regression workflows.
When the noise level σ is approximately known, theoretical work suggests λ on the order of σ √(2 log p / n) for the lasso, where p is the number of features and n is the sample size. In practice σ is rarely known, so cross-validation is almost always the method of choice.
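Plugging illustrative numbers into that formula (the values of σ, p, and n below are arbitrary examples, not recommendations):

```python
import numpy as np

# Theoretical lasso scale: lambda ~ sigma * sqrt(2 * log(p) / n)
sigma, p, n = 1.0, 100, 400
lam_theory = sigma * np.sqrt(2 * np.log(p) / n)
print(round(lam_theory, 4))  # approx. 0.1517
```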
Different toolkits use different names for the same parameter. The math is the same; only the label changes.
| Library | Parameter name | Notes |
|---|---|---|
| scikit-learn (Ridge, Lasso, ElasticNet) | alpha | Plays the role of textbook λ |
| scikit-learn (ElasticNet) | l1_ratio | The L1/L2 mix |
| glmnet (R) | lambda | Penalty strength, returned as a sequence along the path |
| glmnet (R) | alpha | L1/L2 mix |
| Keras / PyTorch | weight_decay | Plays the role of λ for L2 penalties on weights |
The scikit-learn convention of calling it alpha is a frequent source of confusion for people coming from a statistics background, where lambda is standard. The two also disagree about scaling: as discussed in scikit-learn issue 21891, Lasso(alpha) and Ridge(alpha) use slightly different conventions for the relationship between the penalty constant and the per-sample loss.
The regularization parameter has a clean Bayesian reading. If you put an independent zero-mean Gaussian prior on each coefficient, the posterior mode is the ridge estimator and λ is proportional to the ratio of noise variance to prior variance. A larger λ corresponds to a tighter prior, which is why it pulls the estimates more strongly toward zero.
For lasso the analogue is a Laplace (double exponential) prior on each coefficient. The mode of the posterior is then the lasso estimate, and 1/λ plays the role of the scale parameter of the prior. The Laplace prior puts more probability mass exactly at zero than the Gaussian, which is the underlying reason lasso produces sparse solutions while ridge does not.
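The Gaussian-prior claim for ridge can be checked numerically: minimizing the negative log posterior by gradient descent should land on the ridge closed form with λ = σ²/τ². The variances below are assumed known for the sake of the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.4 * rng.standard_normal(n)

sigma2, tau2 = 0.16, 1.0      # noise variance and prior variance (assumed known)
lam = sigma2 / tau2           # lambda = sigma^2 / tau^2

# Gradient descent on sigma^2 * (negative log posterior), i.e. on
# (1/2)||y - Xb||^2 + (lam/2)||b||^2 -- same minimizer, nicer scaling.
beta = np.zeros(p)
for _ in range(5000):
    grad = X.T @ (X @ beta - y) + lam * beta
    beta -= 1e-3 * grad

ridge_beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("max difference:", np.max(np.abs(beta - ridge_beta)))
```

The two solutions agree to numerical precision: the posterior mode under a zero-mean Gaussian prior is exactly the ridge estimator.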
The penalized form min L(β) + λ R(β) is the Lagrangian of a constrained problem of the form min L(β) subject to R(β) ≤ t. The Lagrange multiplier on the inequality constraint is exactly λ, and the Karush-Kuhn-Tucker (KKT) conditions tie the two formulations together. Increasing λ corresponds to tightening the constraint t, which is why a larger penalty leads to smaller coefficients.
This duality is more than a notational coincidence. It is the reason solvers can switch between thinking about "how much penalty to add" and "how big a coefficient budget to allow". It also generalizes to many other ML settings, from constrained policy optimization in reinforcement learning to Lagrangian formulations of fairness constraints.
In linear algebra, lambda is the standard symbol for an eigenvalue. Given a square matrix A, an eigenvalue λ and eigenvector v satisfy Av = λv. Eigenvalues show up throughout machine learning whenever a method involves a covariance matrix, a kernel matrix, or a graph Laplacian.
| Application | Where eigenvalues appear |
|---|---|
| Principal component analysis | The eigenvalues of the sample covariance matrix give the variance captured by each principal component; ranking them gives the scree plot |
| Singular value decomposition | The squared singular values of X are the eigenvalues of X^T X |
| Kernel methods | Eigenvalues of the kernel (Gram) matrix appear in kernel PCA and spectral clustering |
| Spectral clustering | Smallest eigenvalues of the graph Laplacian reveal cluster structure |
| Stability of training | Eigenvalues of the loss Hessian relate to step-size choice and curvature |
In PCA the eigenvalue structure has a particularly direct interpretation: each eigenvalue equals the variance of the projection of the data onto the corresponding eigenvector, so the proportion of variance explained by component i is λ_i divided by the sum of all eigenvalues. The same matrix and the same lambdas show up under the spectral decomposition Σ = V Λ V^T, with Λ the diagonal matrix of eigenvalues.
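A small NumPy sketch makes the variance-explained computation concrete. The data here are synthetic, deliberately stretched along the first axis so that the first eigenvalue dominates:

```python
import numpy as np

rng = np.random.default_rng(4)
# Anisotropic Gaussian data: per-axis standard deviations 3.0, 1.0, 0.5
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

cov = Xc.T @ Xc / (len(Xc) - 1)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]   # lambda_i, sorted largest first
explained = eigvals / eigvals.sum()       # proportion of variance per component
print(np.round(explained, 3))
```

The first component should capture most of the variance (the population proportions are 9 : 1 : 0.25), which is exactly what a scree plot of the λ_i would show.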
In reinforcement learning, lambda is the trace-decay parameter in TD(λ), introduced by Richard Sutton in his 1988 paper "Learning to predict by the methods of temporal differences" in Machine Learning. TD(λ) uses an eligibility trace, a temporary memory of recently visited states, that decays geometrically by a factor of λ at each step. When a reward arrives, all states in the trace get a credit update weighted by their current eligibility.
The parameter λ controls a continuous interpolation between two extreme algorithms:
| Value of λ | Equivalent algorithm | Behaviour |
|---|---|---|
| λ = 0 | One-step TD (also called TD(0)) | Updates only the current state from the next state's estimate |
| 0 < λ < 1 | Mixed TD(λ) | Blends one-step bootstrapping with longer multi-step backups |
| λ = 1 | Monte Carlo (offline equivalence) | Updates from the full return, like Monte Carlo methods |
Intermediate values of λ often outperform either extreme. Eligibility traces are now a standard tool in RL and appear in many algorithms, including SARSA(λ), Q(λ), and the True Online TD(λ) variant by Harm van Seijen and Sutton from 2014.
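A minimal tabular TD(λ) loop with accumulating eligibility traces, on the classic 5-state random-walk prediction task (start in the middle, reward 1 on the right exit, 0 on the left; the step size, λ, and episode count below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, lam, gamma, alpha = 5, 0.8, 1.0, 0.1
V = np.zeros(n_states)                  # state-value estimates

for _ in range(2000):
    s = n_states // 2                   # start in the middle state
    e = np.zeros(n_states)              # eligibility trace
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        done = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e *= gamma * lam                # decay every trace by gamma * lambda
        e[s] += 1.0                     # bump the just-visited state
        V += alpha * td_error * e       # credit all eligible states at once
        if done:
            break
        s = s_next

print(np.round(V, 2))  # true values are 1/6, 2/6, 3/6, 4/6, 5/6
```

Setting lam = 0.0 in this loop recovers one-step TD(0), since the trace then covers only the current state; lam = 1.0 gives the Monte-Carlo-like extreme.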
A different use of the symbol comes from the LambdaNetworks paper by Irwan Bello, presented at ICLR 2021 under the title "LambdaNetworks: Modeling Long-Range Interactions Without Attention". A lambda layer summarizes the surrounding context of an input position into a small linear function, called a lambda, and then applies that linear function to the query at that position.
The goal is to capture long-range interactions like self-attention does, but without the quadratic memory cost of attention maps. Lambda layers model both content-based and position-based interactions, which makes them practical for large structured inputs like images. The resulting LambdaResNets reach competitive accuracy on ImageNet and COCO benchmarks while running several times faster than comparable vision transformers. In this setting the symbol λ refers to the linear function itself, not to a scalar hyperparameter. The naming is a deliberate nod to lambda calculus, where λ binds an anonymous function.
A few other places where lambda or Lambda shows up around AI and ML, listed briefly because they are not statistical hyperparameters:
| Use | What it is |
|---|---|
| Stein-type shrinkage | Estimators of the form λ · target + (1 − λ) · sample, where λ pulls the sample mean toward a fixed target |
| Mixing weights | Convex combinations like λ · model_A + (1 − λ) · model_B for ensembling, interpolation, or distillation |
| Lambda calculus | Formal model of computation introduced by Alonzo Church in 1936; inspiration for anonymous functions |
| Python lambda keyword | Defines an anonymous inline function, common in pandas and PyTorch pipelines (for example df.apply(lambda x: x * 2)) |
| AWS Lambda | Serverless compute service from Amazon Web Services, often used to deploy lightweight ML inference endpoints |
None of these are the same concept as the regularization hyperparameter. They share the symbol or the name and sometimes appear in the same code or paper, which is a frequent source of confusion for newcomers.
Lambda is like a helper in machine learning that keeps a model from getting too focused on learning every tiny detail of its training data. It helps the model learn the most important patterns so it can make good predictions on new, unseen data. Two popular methods that use lambda, called Lasso and Ridge, help the model in slightly different ways. Finding the right amount of help from lambda is important for the model to do its job well.
The same letter shows up in other parts of math and computer science too. In linear algebra it is a number that describes how a matrix stretches a special direction. In reinforcement learning it is a knob that controls how far back in time a robot remembers when it gets a reward. In Python the word lambda is used to write tiny one-line functions. They are all different ideas; they just borrowed the same Greek letter.