See also: Machine learning terms
In machine learning, statistics, and computer science, the Greek letter lambda (λ) shows up in several different roles. The most common meaning in supervised learning is the strength of a regularization penalty, but the same symbol also denotes eigenvalues in linear algebra, the mixing parameter in TD(λ) reinforcement learning, the linear functions in lambda layers, and Lagrange multipliers in constrained optimization. The name also appears in unrelated places such as AWS Lambda and lambda calculus.
This article focuses on the regularization meaning of lambda, since that is the most frequent usage in machine learning, and then disambiguates the other common uses.
| Context | Symbol | Role |
|---|---|---|
| Regularized regression | λ | Strength of the L1 regularization or L2 regularization penalty |
| Elastic net | λ, α | Overall penalty strength and L1/L2 mix |
| Linear algebra | λ | Eigenvalue of a matrix |
| Principal component analysis | λ_i | Variance captured by the i-th component |
| Reinforcement learning | λ | Trace decay parameter in TD(λ) and eligibility traces |
| Constrained optimization | λ | Lagrange multiplier on a constraint |
| LambdaNetworks | λ | A small linear function summarizing context for an input |
| Lambda calculus | λ | Binder for an anonymous function |
| AWS | Lambda | Serverless compute service from AWS |
Lambda is most commonly used in machine learning as the multiplier on a penalty term added to a loss function. The penalty discourages large model coefficients, which tends to reduce overfitting and improve generalization. It is particularly relevant in linear regression and logistic regression models, but the same idea applies to neural networks (where it is often called weight decay) and to many other estimators.
The regularized objective has the general form:
minimize L(β) + λ · R(β)
where L(β) is the data fit term, R(β) is the penalty function, and λ ≥ 0 controls how much the penalty matters relative to the fit. When λ is zero the model reduces to the unregularized estimator. As λ grows the coefficients are pushed toward zero, which makes the model simpler and less sensitive to noise in the training data.
The table below shows how lambda enters the three most common regularized least-squares estimators. Here β are the coefficients, y the targets, and X the design matrix.
| Method | Penalty R(β) | Objective minimized | Effect on coefficients |
|---|---|---|---|
| Ridge regression (L2) | Σ β_i² | ‖y − Xβ‖² + λ Σ β_i² | Shrinks all coefficients smoothly toward zero |
| Lasso regression (L1) | Σ |β_i| | ‖y − Xβ‖² + λ Σ |β_i| | Shrinks some coefficients exactly to zero, performing variable selection |
| Elastic net | α Σ |β_i| + (1 − α) Σ β_i² | ‖y − Xβ‖² + λ (α Σ |β_i| + (1 − α) Σ β_i²) | Mixes L1 and L2; useful when predictors are correlated |
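As a concrete sketch of the ridge row above, the closed-form solution (X^T X + λI)^{-1} X^T y can be computed directly in NumPy. The synthetic data and λ values here are illustrative only; the point is to see the coefficients shrink as λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    # Closed-form ridge estimator: (X^T X + lam I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    b = ridge(X, y, lam)
    print(f"lam={lam:6.1f}  coefficients={np.round(b, 3)}")
```

At λ = 0 this reproduces ordinary least squares; at λ = 100 every coefficient is visibly pulled toward zero, matching the "shrinks all coefficients smoothly" description in the table.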
Lasso was introduced by Robert Tibshirani in his 1996 paper "Regression Shrinkage and Selection via the Lasso" in the Journal of the Royal Statistical Society, Series B. Elastic net was proposed by Hui Zou and Trevor Hastie in 2005 as a way to keep groups of correlated features together rather than letting lasso pick one and drop the rest.
Fitting a model that minimizes only the data loss can give large coefficients that fit the training data well but generalize poorly. Adding a penalty multiplied by lambda biases the solution toward simpler, smaller coefficients. Small λ leaves the coefficients close to the unregularized fit, so the model has high variance and may overfit. Large λ pushes the coefficients close to zero, so the model has higher bias and may underfit. The right value depends on the data, the noise level, and how many features there are relative to the number of training examples.
For lasso, increasing λ also drives more coefficients to exactly zero, which gives a sparse model. For ridge, no coefficient is ever exactly zero, but all of them shrink. Elastic net interpolates between the two.
Because the optimal value depends on the data, lambda is treated as a hyperparameter and tuned outside of the main fit. Common strategies:
| Strategy | How it works |
|---|---|
| Grid search with cross-validation | Try a logarithmic grid of λ values; pick the one with the lowest mean validation error |
| Information criteria | Choose λ that minimizes AIC or BIC on the training set |
| LARS path | Compute the entire lasso solution path in one pass, then pick a λ along the path |
| Bayesian optimization | Treat λ as a black-box hyperparameter and search efficiently |
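The first strategy in the table, a logarithmic grid with cross-validation, can be sketched in a few lines. This is a minimal hand-rolled k-fold loop over the ridge closed form, not a production tuner; the grid bounds and fold count are arbitrary choices:

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # Mean validation MSE over k folds for a single lambda
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = y[test_idx] - X[test_idx] @ b
        errs.append(np.mean(resid ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 20))
y = 2.0 * X[:, 0] + rng.standard_normal(120)

lambdas = np.logspace(-3, 3, 13)            # logarithmic grid of candidates
scores = [cv_mse(X, y, lam) for lam in lambdas]
best = lambdas[int(np.argmin(scores))]
print(f"best lambda on the grid: {best:.3g}")
```

Library implementations such as scikit-learn's RidgeCV and glmnet's cv.glmnet automate exactly this loop, usually with smarter warm-starting along the λ path.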
The LARS algorithm by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani, published in The Annals of Statistics in 2004, is a useful tool here. A simple modification of LARS computes the entire lasso solution path as a piecewise linear function of λ, which lets you see how each coefficient enters or leaves the model as the penalty changes. This is why lasso path plots, showing each coefficient as a function of λ, are common in regression workflows.
When the noise level σ is approximately known, theoretical work suggests λ on the order of σ √(2 log p / n) for the lasso, where p is the number of features and n is the sample size. In practice σ is rarely known, so cross-validation is almost always the method of choice.
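Plugging illustrative numbers into that formula (the values of σ, p, and n below are arbitrary examples, not recommendations):

```python
import numpy as np

# Theoretical lasso scale: lambda ~ sigma * sqrt(2 * log(p) / n)
sigma, p, n = 1.0, 100, 400
lam_theory = sigma * np.sqrt(2 * np.log(p) / n)
print(round(lam_theory, 4))  # approx. 0.1517
```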
Different toolkits use different names for the same parameter. The math is the same; only the label changes.
| Library | Parameter name | Notes |
|---|---|---|
| scikit-learn (Ridge, Lasso, ElasticNet) | alpha | Plays the role of textbook λ |
| scikit-learn (ElasticNet) | l1_ratio | The L1/L2 mix |
| glmnet (R) | lambda | Penalty strength, returned as a sequence along the path |
| glmnet (R) | alpha | L1/L2 mix |
| Keras / PyTorch | weight_decay | Plays the role of λ for L2 penalties on weights |
The scikit-learn convention of calling it alpha is a frequent source of confusion for people coming from a statistics background, where lambda is standard. The two also disagree about scaling: as discussed in scikit-learn issue 21891, Lasso(alpha) and Ridge(alpha) use slightly different conventions for the relationship between the penalty constant and the per-sample loss.
The regularization parameter has a clean Bayesian reading. If you put an independent zero-mean Gaussian prior on each coefficient, the posterior mode is the ridge estimator and λ is proportional to the ratio of noise variance to prior variance. A larger λ corresponds to a tighter prior, which is why it pulls the estimates more strongly toward zero.
For lasso the analogue is a Laplace (double exponential) prior on each coefficient. The mode of the posterior is then the lasso estimate, and 1/λ plays the role of the scale parameter of the prior. The Laplace prior puts more probability mass exactly at zero than the Gaussian, which is the underlying reason lasso produces sparse solutions while ridge does not.
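The Gaussian-prior claim for ridge can be checked numerically: minimizing the negative log posterior by gradient descent should land on the ridge closed form with λ = σ²/τ². The variances below are assumed known for the sake of the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.4 * rng.standard_normal(n)

sigma2, tau2 = 0.16, 1.0      # noise variance and prior variance (assumed known)
lam = sigma2 / tau2           # lambda = sigma^2 / tau^2

# Gradient descent on sigma^2 * (negative log posterior), i.e. on
# (1/2)||y - Xb||^2 + (lam/2)||b||^2 -- same minimizer, nicer scaling.
beta = np.zeros(p)
for _ in range(5000):
    grad = X.T @ (X @ beta - y) + lam * beta
    beta -= 1e-3 * grad

ridge_beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("max difference:", np.max(np.abs(beta - ridge_beta)))
```

The two solutions agree to numerical precision: the posterior mode under a zero-mean Gaussian prior is exactly the ridge estimator.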
The penalized form min L(β) + λ R(β) is the Lagrangian of a constrained problem of the form min L(β) subject to R(β) ≤ t. The Lagrange multiplier on the inequality constraint is exactly λ, and the Karush-Kuhn-Tucker (KKT) conditions tie the two formulations together. Increasing λ corresponds to tightening the constraint t, which is why a larger penalty leads to smaller coefficients.
This duality is more than a notational coincidence. It is the reason solvers can switch between thinking about "how much penalty to add" and "how big a coefficient budget to allow". It also generalizes to many other ML settings, from constrained policy optimization in reinforcement learning to Lagrangian formulations of fairness constraints.
In linear algebra, lambda is the standard symbol for an eigenvalue. Given a square matrix A, an eigenvalue λ and eigenvector v satisfy Av = λv. Eigenvalues show up throughout machine learning whenever a method involves a covariance matrix, a kernel matrix, or a graph Laplacian.
| Application | Where eigenvalues appear |
|---|---|
| Principal component analysis | The eigenvalues of the sample covariance matrix give the variance captured by each principal component; ranking them gives the scree plot |
| Singular value decomposition | The squared singular values of X are the eigenvalues of X^T X |
| Kernel methods | Eigenvalues of the kernel (Gram) matrix appear in kernel PCA and spectral clustering |
| Spectral clustering | Smallest eigenvalues of the graph Laplacian reveal cluster structure |
| Stability of training | Eigenvalues of the loss Hessian relate to step-size choice and curvature |
In PCA the eigenvalue structure has a particularly direct interpretation: each eigenvalue equals the variance of the projection of the data onto the corresponding eigenvector, so the proportion of variance explained by component i is λ_i divided by the sum of all eigenvalues. The same matrix and the same lambdas show up under the spectral decomposition Σ = V Λ V^T, with Λ the diagonal matrix of eigenvalues.
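A small NumPy sketch makes the variance-explained computation concrete. The data here are synthetic, deliberately stretched along the first axis so that the first eigenvalue dominates:

```python
import numpy as np

rng = np.random.default_rng(4)
# Anisotropic Gaussian data: per-axis standard deviations 3.0, 1.0, 0.5
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

cov = Xc.T @ Xc / (len(Xc) - 1)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]   # lambda_i, sorted largest first
explained = eigvals / eigvals.sum()       # proportion of variance per component
print(np.round(explained, 3))
```

The first component should capture most of the variance (the population proportions are 9 : 1 : 0.25), which is exactly what a scree plot of the λ_i would show.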
In reinforcement learning, lambda is the trace-decay parameter in TD(λ), introduced by Richard Sutton in his 1988 paper "Learning to predict by the methods of temporal differences" in Machine Learning. TD(λ) uses an eligibility trace, a temporary memory of recently visited states, that decays geometrically by a factor of λ at each step. When a reward arrives, all states in the trace get a credit update weighted by their current eligibility.
The parameter λ controls a continuous interpolation between two extreme algorithms:
| Value of λ | Equivalent algorithm | Behaviour |
|---|---|---|
| λ = 0 | One-step TD (also called TD(0)) | Updates only the current state from the next state's estimate |
| 0 < λ < 1 | Mixed TD(λ) | Blends one-step bootstrapping with longer multi-step backups |
| λ = 1 | Monte Carlo (offline equivalence) | Updates from the full return, like Monte Carlo methods |
Intermediate values of λ often outperform either extreme. Eligibility traces are now a standard tool in RL and appear in many algorithms, including SARSA(λ), Q(λ), and the True Online TD(λ) variant by Harm van Seijen and Sutton from 2014.
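A minimal tabular TD(λ) loop with accumulating eligibility traces, on the classic 5-state random-walk prediction task (start in the middle, reward 1 on the right exit, 0 on the left; the step size, λ, and episode count below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, lam, gamma, alpha = 5, 0.8, 1.0, 0.1
V = np.zeros(n_states)                  # state-value estimates

for _ in range(2000):
    s = n_states // 2                   # start in the middle state
    e = np.zeros(n_states)              # eligibility trace
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        done = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0
        td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e *= gamma * lam                # decay every trace by gamma * lambda
        e[s] += 1.0                     # bump the just-visited state
        V += alpha * td_error * e       # credit all eligible states at once
        if done:
            break
        s = s_next

print(np.round(V, 2))  # true values are 1/6, 2/6, 3/6, 4/6, 5/6
```

Setting lam = 0.0 in this loop recovers one-step TD(0), since the trace then covers only the current state; lam = 1.0 gives the Monte-Carlo-like extreme.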
A different use of the symbol comes from the LambdaNetworks paper by Irwan Bello, presented at ICLR 2021 under the title "LambdaNetworks: Modeling Long-Range Interactions Without Attention". A lambda layer summarizes the surrounding context of an input position into a small linear function, called a lambda, and then applies that linear function to the query at that position.
The goal is to capture long-range interactions like self-attention does, but without the quadratic memory cost of attention maps. Lambda layers model both content-based and position-based interactions, which makes them practical for large structured inputs like images. The resulting LambdaResNets reach competitive accuracy on ImageNet and COCO benchmarks while running several times faster than comparable vision transformers. In this setting the symbol λ refers to the linear function itself, not to a scalar hyperparameter. The naming is a deliberate nod to lambda calculus, where λ binds an anonymous function.
A few other places where lambda or Lambda shows up around AI and ML, listed briefly because they are not statistical hyperparameters:
| Use | What it is |
|---|---|
| Stein-type shrinkage | Estimators of the form λ · target + (1 − λ) · sample, where λ pulls the sample mean toward a fixed target |
| Mixing weights | Convex combinations like λ · model_A + (1 − λ) · model_B for ensembling, interpolation, or distillation |
| Lambda calculus | Formal model of computation introduced by Alonzo Church in 1936; inspiration for anonymous functions |
| Python lambda keyword | Defines an anonymous inline function, common in pandas and PyTorch pipelines (for example df.apply(lambda x: x * 2)) |
| AWS Lambda | Serverless compute service from Amazon Web Services, often used to deploy lightweight ML inference endpoints |
None of these are the same concept as the regularization hyperparameter. They share the symbol or the name and sometimes appear in the same code or paper, which is a frequent source of confusion for newcomers.
Lambda is like a helper in machine learning that keeps a model from getting too focused on learning every tiny detail of its training data. It helps the model learn the most important patterns so it can make good predictions on new, unseen data. Two popular methods that use lambda, called Lasso and Ridge, help the model in slightly different ways. Finding the right amount of help from lambda is important for the model to do its job well.
The same letter shows up in other parts of math and computer science too. In linear algebra it is a number that describes how a matrix stretches a special direction. In reinforcement learning it is a knob that controls how far back in time a robot remembers when it gets a reward. In Python the word lambda is used to write tiny one-line functions. They are all different ideas; they just borrowed the same Greek letter.