# L2 Loss

> Source: https://aiwiki.ai/wiki/l2_loss
> Updated: 2026-04-06
> Categories: Machine Learning, Statistics, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**L2 loss**, also known as **squared error loss**, **quadratic loss**, or **mean squared error (MSE) loss**, is one of the most widely used [loss functions](/wiki/loss_function) in [machine learning](/wiki/machine_learning) and statistics. It measures the average of the squared differences between predicted values and actual target values. Because squaring penalizes large errors more heavily than small ones, L2 loss is particularly sensitive to outliers but provides smooth, differentiable gradients that are well suited for [gradient descent](/wiki/gradient_descent) optimization. It serves as the default loss function for most [regression](/wiki/regression) tasks and plays a central role in [linear regression](/wiki/linear_regression), [neural networks](/wiki/neural_network), and many other predictive models.

## Explain like I'm 5 (ELI5)

Imagine you are throwing darts at a target on the wall. Every time you throw a dart, you measure how far it landed from the bullseye. L2 loss is like taking each of those distances, multiplying each one by itself (squaring it), and then finding the average. If most of your darts land close to the bullseye, the number is small. If you throw one dart way off into the corner, squaring that big distance makes the number jump up a lot. So L2 loss tells you, on average, how badly you are missing, and it really punishes the throws that are far off.

## Mathematical definition

### Squared error for a single prediction

For a single data point with true value *y* and predicted value *ŷ* (y-hat), the squared error is:

**SE = (y - ŷ)²**

### Mean squared error (MSE)

When computed over a dataset of *n* observations, the mean squared error averages the individual squared errors:

**MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ - ŷᵢ)²**

Here, *yᵢ* is the true value of the *i*-th observation and *ŷᵢ* is the model's prediction for that observation.

### Sum of squared errors (SSE)

Some formulations use the total (non-averaged) form:

**SSE = ∑ᵢ₌₁ⁿ (yᵢ - ŷᵢ)²**

SSE and MSE differ only by the constant factor *1/n*, so minimizing one is equivalent to minimizing the other. The MSE form is more common in machine learning because it keeps the loss magnitude independent of dataset size.
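As a minimal sketch of the definitions above, the following computes the per-observation squared errors, SSE, and MSE for four illustrative values (the same numbers used in the PyTorch example later in this article). The variable names are chosen here for illustration only.

```python
# Illustrative targets and predictions
y_true = [3.0, -0.5, 2.0, 7.5]
y_pred = [2.5, 0.0, 2.1, 7.8]

# Per-observation squared errors: (y_i - y_hat_i)^2
squared_errors = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]

sse = sum(squared_errors)          # sum of squared errors
mse = sse / len(squared_errors)    # mean squared error = SSE / n

print(f"SSE = {sse:.4f}, MSE = {mse:.4f}")
```

The squared errors are roughly 0.25, 0.25, 0.01, and 0.09, so SSE is about 0.6 and MSE about 0.15.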

### Matrix notation

In matrix form, with error vector **e** = **y** - **ŷ**:

**MSE = (1/n) eᵀe**

This compact notation is useful when deriving closed-form solutions in linear regression.

## Gradient and optimization

One of the main reasons L2 loss is popular is that its gradient has a simple, closed-form expression. The partial derivative of MSE with respect to a predicted value *ŷᵢ* is:

**∂MSE / ∂ŷᵢ = -(2/n)(yᵢ - ŷᵢ)**

This gradient is linear in the residual *(yᵢ - ŷᵢ)*. When the prediction is far from the target, the gradient is large, driving a strong update. When the prediction is close to the target, the gradient is small, allowing fine-grained convergence. During [backpropagation](/wiki/backpropagation), this gradient is propagated through the network using the chain rule to update all trainable parameters.
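One way to sanity-check the gradient formula is to compare it against central finite differences. The sketch below does this in plain Python; the helper names (`mse`, `mse_grad`, `numeric_grad`) and the step size `eps` are illustrative choices, not part of any library.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def mse_grad(y_true, y_pred):
    # Analytic gradient: dMSE/dy_hat_i = -(2/n) * (y_i - y_hat_i)
    n = len(y_true)
    return [-2.0 / n * (y - p) for y, p in zip(y_true, y_pred)]

def numeric_grad(y_true, y_pred, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(y_true, up) - mse(y_true, down)) / (2 * eps))
    return grads

y_true = [3.0, -0.5, 2.0, 7.5]
y_pred = [2.5, 0.0, 2.1, 7.8]
analytic = mse_grad(y_true, y_pred)
numeric = numeric_grad(y_true, y_pred)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, max_gap)
```

Because MSE is quadratic in each prediction, the two gradients should agree up to floating-point rounding.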

In the special case of [linear regression](/wiki/linear_regression) with MSE loss, the loss surface is a convex paraboloid. Setting the gradient to zero yields the **normal equation**, which provides a closed-form solution whenever **XᵀX** is invertible:

**w\* = (XᵀX)⁻¹ Xᵀy**
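The normal equation can be sketched with NumPy as follows. The synthetic data, the coefficient values, and the variable names are assumptions made for illustration. The code solves the linear system rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 2*x1 - 3*x2 + 1 + small noise
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([features, np.ones(n)])   # last column is the intercept
true_w = np.array([2.0, -3.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) w = X^T y rather than forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
```

Both routes should recover weights close to the generating coefficients.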

For [deep learning](/wiki/deep_learning) models, the loss surface is generally non-convex in the parameters due to the network's nonlinear [activation functions](/wiki/activation_function). However, the L2 loss itself is always convex with respect to the network's outputs, which contributes to stable training dynamics.

## Key mathematical properties

| Property | Description |
|---|---|
| Non-negativity | L2 loss is always greater than or equal to zero. It equals zero only when every prediction exactly matches its target. |
| Convexity | The function is convex with respect to the predictions, guaranteeing that any local minimum is also the global minimum (for linear models). |
| Differentiability | L2 loss is smooth and continuously differentiable everywhere, unlike [L1 loss](/wiki/l1_loss) which has a non-differentiable point at zero. |
| Symmetry | The loss is symmetric around zero: overestimating by *k* units incurs the same penalty as underestimating by *k* units. |
| Sensitivity to scale | Squaring amplifies errors larger than 1 and shrinks errors smaller than 1. An error of 10 contributes 100 to the loss, an error of 1 contributes 1, and an error of 0.1 contributes only 0.01. |
| Decomposability | The total MSE over a dataset is the average of independent per-sample terms, which makes it straightforward to compute in mini-batch settings. |
| Quadratic growth | The loss grows quadratically with the magnitude of the error, meaning doubling the error quadruples the loss. |

## Bias-variance decomposition

In statistical estimation theory, MSE admits a well-known decomposition into bias and variance components. For an estimator *ŷ* of a parameter *θ*:

**MSE(ŷ) = E[(ŷ - θ)²] = Bias(ŷ)² + Var(ŷ)**

The **bias** term measures the systematic deviation of the estimator's expected value from the true parameter, while the **variance** term measures how much the estimator fluctuates across different samples drawn from the same distribution. This decomposition is central to the [bias-variance tradeoff](/wiki/bias_variance_tradeoff): a model with high bias tends to underfit, while a model with high variance tends to overfit.

When irreducible noise (also called Bayes error) is present in the data, the full decomposition becomes:

**Expected MSE = Bias² + Variance + Irreducible Error**

The irreducible error represents noise inherent in the data that no model can eliminate. Understanding this decomposition helps practitioners diagnose whether a model's poor MSE stems from systematic errors (high bias), instability (high variance), or noisy data.
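The estimator-level identity MSE = Bias² + Variance can be checked with a small simulation. The sketch below uses a deliberately biased estimator (a shrunken sample mean). The true parameter, noise scale, shrinkage factor, and trial counts are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 5.0          # true parameter (hypothetical)
n_samples = 10       # observations per simulated dataset
n_trials = 20_000    # number of simulated datasets
shrink = 0.8         # shrinkage factor; introduces bias on purpose

# Each row is one dataset drawn from N(theta, 2^2)
data = rng.normal(loc=theta, scale=2.0, size=(n_trials, n_samples))
estimates = shrink * data.mean(axis=1)   # a deliberately biased estimator

mse = np.mean((estimates - theta) ** 2)
bias = estimates.mean() - theta
variance = estimates.var()               # population variance (ddof=0)

print(mse, bias ** 2 + variance)
```

Under these assumptions the theoretical bias is -1 and the theoretical variance is 0.256, so both printed values should be close to 1.256.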

## Connection to maximum likelihood estimation

L2 loss has a deep connection to probability theory through maximum likelihood estimation (MLE). If we assume the target variable follows a Gaussian (normal) distribution centered on the model's prediction, with constant variance *σ²*:

**yᵢ = f(xᵢ) + εᵢ, where εᵢ ~ N(0, σ²)**

Then the negative log-likelihood of the observed data is:

**-log L = (n/2) log(2πσ²) + (1/(2σ²)) ∑(yᵢ - f(xᵢ))²**

Since the first term does not depend on the model parameters and the factor 1/(2σ²) is a positive constant, minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors. This means that using L2 loss implicitly assumes that prediction errors are normally distributed with constant variance. When this assumption holds for a linear model, the least squares estimator is efficient: it is unbiased and attains the Cramér-Rao lower bound. When errors are not normally distributed (for example, when data contains heavy-tailed outliers), other loss functions such as [L1 loss](/wiki/l1_loss) or Huber loss may be more appropriate.
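The equivalence between minimizing the Gaussian negative log-likelihood and minimizing the sum of squared errors can be illustrated with the simplest possible model, a constant prediction *μ*. In this sketch the observations, the noise scale `sigma`, and the candidate grid are hypothetical.

```python
import math

def sse(ys, mu):
    return sum((y - mu) ** 2 for y in ys)

def gaussian_nll(ys, mu, sigma):
    # -log L = (n/2) log(2*pi*sigma^2) + SSE / (2*sigma^2)
    n = len(ys)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse(ys, mu) / (2 * sigma ** 2)

ys = [1.2, 0.7, 1.9, 1.4, 0.8]                  # hypothetical observations
sigma = 0.5                                     # assumed known noise scale
candidates = [i / 100 for i in range(0, 301)]   # constant predictions 0.00 .. 3.00

best_by_nll = min(candidates, key=lambda mu: gaussian_nll(ys, mu, sigma))
best_by_sse = min(candidates, key=lambda mu: sse(ys, mu))

print(best_by_nll, best_by_sse, sum(ys) / len(ys))
```

Both criteria pick the same candidate, which coincides with the sample mean of the observations.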

## Sensitivity to outliers

The quadratic nature of L2 loss makes it highly sensitive to outliers. Consider a dataset where most residuals are around 1, but one outlier has a residual of 50. The outlier contributes 2,500 to the sum of squared errors, while a typical point contributes only 1. This single outlier can dominate the total loss and pull the model's predictions toward it, degrading performance on the majority of the data.
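To make the arithmetic concrete, the sketch below assumes 99 typical residuals of size 1 and a single outlier of size 50 (the count of 99 is an assumption made here for illustration) and compares the outlier's share of the total loss under squared and absolute error.

```python
# Assumed setup: 99 typical residuals of size 1 and one outlier of size 50
residuals = [1.0] * 99 + [50.0]

squared = [r ** 2 for r in residuals]
absolute = [abs(r) for r in residuals]

outlier_share_l2 = squared[-1] / sum(squared)     # 2500 / 2599
outlier_share_l1 = absolute[-1] / sum(absolute)   # 50 / 149

print(f"share of squared loss:  {outlier_share_l2:.1%}")
print(f"share of absolute loss: {outlier_share_l1:.1%}")
```

Under these assumptions a single point accounts for roughly 96% of the squared loss but only about a third of the absolute loss.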

This behavior is a double-edged sword. In settings where large errors are genuinely costly (for example, predicting structural loads in engineering, where even one large miscalculation can cause failure), L2 loss appropriately assigns heavy penalties to big mistakes. In settings where outliers are merely noise or data entry errors, L2 loss can produce misleading models.

Strategies for dealing with outlier sensitivity include:

- **Data preprocessing:** Remove or cap extreme values before training.
- **Robust loss functions:** Use Huber loss or log-cosh loss, which behave quadratically near zero but linearly for large errors.
- **[Regularization](/wiki/regularization):** Apply [L2 regularization](/wiki/l2_regularization) (weight decay) to constrain model capacity and reduce sensitivity to individual data points.
- **Trimmed or Winsorized estimators:** Exclude or downweight a fixed percentage of extreme residuals.

## Comparison with other loss functions

### L2 loss vs. L1 loss

| Aspect | L2 loss (squared error) | L1 loss (absolute error) |
|---|---|---|
| Formula | (1/n) ∑(yᵢ - ŷᵢ)² | (1/n) ∑\|yᵢ - ŷᵢ\| |
| Gradient behavior | Gradient proportional to residual; large errors produce large gradients | Constant gradient magnitude (\u00b11); does not scale with error size |
| Outlier sensitivity | High; squaring amplifies large errors | Low; linear penalty on large errors |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Optimal prediction | Predicts the conditional mean | Predicts the conditional median |
| Noise assumption | Assumes Gaussian noise | Assumes Laplacian noise |
| Closed-form solution | Available for [linear regression](/wiki/linear_regression) (normal equation) | Not available; requires iterative methods |
| Convergence speed | Generally faster due to smooth gradient | Can be slower near the optimum due to constant gradient |
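| Sparsity | Does not produce sparse residuals | Tends to fit some points exactly (sparse residuals); sparse coefficients come from [L1 regularization](/wiki/l1_regularization), not from L1 loss |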

In practice, L2 loss is preferred when the data is clean, errors are roughly Gaussian, and the model should avoid any large individual errors. [L1 loss](/wiki/l1_loss) is preferred when robustness to outliers is needed or when the conditional median is a more meaningful prediction than the conditional mean.
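The "conditional mean versus conditional median" distinction can be seen in the simplest case of a single constant prediction. The following sketch searches a grid of candidate constants; the data and the grid are hypothetical.

```python
import statistics

def mean_squared(ys, c):
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mean_absolute(ys, c):
    return sum(abs(y - c) for y in ys) / len(ys)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]                # hypothetical targets with one outlier
candidates = [i / 10 for i in range(0, 1001)]   # constant predictions 0.0 .. 100.0

best_l2 = min(candidates, key=lambda c: mean_squared(ys, c))
best_l1 = min(candidates, key=lambda c: mean_absolute(ys, c))

print(best_l2, statistics.mean(ys))     # L2 minimizer vs. sample mean
print(best_l1, statistics.median(ys))   # L1 minimizer vs. sample median
```

The squared-error minimizer lands on the mean (22.0), which the outlier drags upward, while the absolute-error minimizer lands on the median (3.0).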

### L2 loss vs. Huber loss

Huber loss is a hybrid that combines the best properties of L2 and L1 loss. It is defined by a threshold parameter δ:

- For residuals with absolute value at most δ, Huber loss behaves like L2 loss (quadratic).
- For residuals with absolute value greater than δ, Huber loss behaves like L1 loss (linear).

This design gives Huber loss the smooth gradients of L2 loss near zero (enabling efficient convergence) while limiting the influence of outliers. The parameter δ is typically set via cross-validation.
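A minimal sketch of the piecewise definition, using the common convention in which the quadratic branch is ½r² (so it matches squared error up to a factor of one half). The function names and the default `delta=1.0` are illustrative assumptions.

```python
def huber(residual, delta=1.0):
    # Quadratic inside [-delta, delta], linear outside; the two pieces
    # meet with matching value and slope at |residual| == delta.
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

def huber_grad(residual, delta=1.0):
    # Derivative with respect to the residual: the residual clipped to [-delta, delta]
    return max(-delta, min(delta, residual))

for r in [0.1, 1.0, 5.0, 50.0]:
    print(r, huber(r), 0.5 * r ** 2, huber_grad(r))
```

The clipped gradient is what limits the influence of outliers: beyond δ, a larger residual no longer produces a larger update.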

### L2 loss vs. log-cosh loss

Log-cosh loss uses the logarithm of the hyperbolic cosine of the error: *log(cosh(y - ŷ))*. For small errors, it approximates *(y - ŷ)² / 2* (like L2 loss). For large errors, it approximates *|y - ŷ| - log(2)* (like L1 loss). Unlike Huber loss, log-cosh is twice differentiable everywhere, which can be advantageous for second-order optimization methods.
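Both approximations can be checked numerically. The sketch below uses an algebraically equivalent form of log(cosh(x)) that avoids overflow for large inputs; the function name `log_cosh` and the test points are illustrative.

```python
import math

def log_cosh(x):
    # Stable form of log(cosh(x)): |x| + log(1 + exp(-2|x|)) - log(2).
    # Calling math.cosh directly overflows for |x| above roughly 710.
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

small, large = 0.01, 1000.0
print(log_cosh(small), small ** 2 / 2)            # close to x^2 / 2
print(log_cosh(large), large - math.log(2.0))     # close to |x| - log(2)
```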

### Summary of regression loss functions

| Loss function | Outlier robustness | Differentiability | Gradient at zero | Typical use case |
|---|---|---|---|---|
| [L2 loss](/wiki/l2_loss) | Low | Smooth everywhere | Zero | Clean data, Gaussian noise |
| [L1 loss](/wiki/l1_loss) | High | Not differentiable at 0 | Undefined | Heavy-tailed noise, median regression |
| Huber loss | Medium-high | Continuous first derivative | Zero | Mixed noise, tunable threshold |
| Log-cosh loss | Medium-high | Twice differentiable | Zero | When second-order smoothness is needed |

## Relationship to related metrics

### Root mean squared error (RMSE)

RMSE is the square root of MSE:

**RMSE = √MSE = √[(1/n) ∑(yᵢ - ŷᵢ)²]**

RMSE has the advantage of being expressed in the same units as the target variable, which makes it easier to interpret. For example, if the target is measured in dollars, MSE is in "squared dollars" (a unit with no intuitive meaning), while RMSE is directly in dollars. Minimizing RMSE is equivalent to minimizing MSE, since the square root is a monotonically increasing function.

### Coefficient of determination (R²)

R² measures the proportion of variance in the target variable explained by the model:

**R² = 1 - (MSE / Var(y)) = 1 - [∑(yᵢ - ŷᵢ)² / ∑(yᵢ - ȳ)²]**

where *ȳ* is the mean of the observed values and Var(y) is their population variance (normalized by *n*, so that the two forms agree). R² ranges from negative infinity to 1, with 1 indicating a perfect fit. Unlike MSE, R² is dimensionless and scale-invariant, which makes it useful for comparing models across different datasets.
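The relationships among MSE, RMSE, and R² can be sketched in a few lines. The values below are illustrative, and the variance is normalized by *n* so that MSE / Var(y) matches the ratio of sums in the formula.

```python
import math

y_true = [3.0, -0.5, 2.0, 7.5]
y_pred = [2.5, 0.0, 2.1, 7.8]
n = len(y_true)

mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)

y_bar = sum(y_true) / n
var_y = sum((y - y_bar) ** 2 for y in y_true) / n   # population variance (divide by n)
r2 = 1.0 - mse / var_y

# A model that always predicts the mean has MSE equal to Var(y), so R^2 = 0
mse_baseline = sum((y - y_bar) ** 2 for y in y_true) / n
r2_baseline = 1.0 - mse_baseline / var_y

print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f} baseline R2={r2_baseline:.1f}")
```

The baseline line shows why R² = 0 corresponds to "no better than predicting the mean", and why a model worse than that baseline gets a negative R².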

### Squared loss vs. L2 norm

The terms can be confusing because "L2" appears in multiple contexts. The **L2 norm** (Euclidean norm) of a vector is the square root of the sum of squared elements: ||v||₂ = √(∑ vᵢ²). The **L2 loss** (squared error loss) is the square of the L2 norm of the residual vector (divided by n for the mean version). [L2 regularization](/wiki/l2_regularization) adds the squared L2 norm of the weight vector as a penalty term to the loss. These are related but distinct concepts.
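The three quantities can be placed side by side in a short sketch. The residuals come from illustrative targets and predictions, while `weights` and `lam` are hypothetical stand-ins for a model's parameters and regularization strength.

```python
import math

y_true = [3.0, -0.5, 2.0, 7.5]
y_pred = [2.5, 0.0, 2.1, 7.8]
weights = [0.5, -1.5]      # hypothetical model weights
lam = 0.1                  # hypothetical regularization strength

residuals = [y - p for y, p in zip(y_true, y_pred)]

l2_norm = math.sqrt(sum(r ** 2 for r in residuals))   # L2 norm of the residual vector
l2_loss = l2_norm ** 2 / len(residuals)                # L2 loss (mean version) = MSE
l2_penalty = lam * sum(w ** 2 for w in weights)        # L2 regularization term

print(l2_norm, l2_loss, l2_penalty)
```

The norm and the loss are computed from the residuals, whereas the penalty depends only on the weights.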

## Applications

### Linear regression

[Linear regression](/wiki/linear_regression) with L2 loss is the classical "ordinary least squares" (OLS) method, first published by Adrien-Marie Legendre in 1805 and developed independently by Carl Friedrich Gauss, who claimed to have used the method since 1795. The normal equation provides a closed-form solution, and the Gauss-Markov theorem guarantees that OLS produces the best linear unbiased estimator (BLUE) under certain conditions (linearity, exogeneity, homoscedasticity, no perfect multicollinearity).

### Neural network training

In [deep learning](/wiki/deep_learning), L2 loss is the standard choice for regression output layers. A [neural network](/wiki/neural_network) with a linear output neuron trained using MSE loss learns to predict the conditional mean of the target distribution. For multi-output regression (for example, predicting the x and y coordinates of an object), MSE is applied element-wise across all outputs and averaged.

### Object detection and localization

In [object detection](/wiki/object_detection) models, a squared-error or closely related loss is used to train the bounding box regression head: the original YOLO uses a sum-of-squared-errors localization term, while Faster R-CNN uses Smooth L1 loss, which is quadratic for small errors and linear for large ones. The model predicts four coordinates (x, y, width, height) for each detected object, and the error between predicted and ground-truth coordinates forms the localization loss.

### Image reconstruction and generation

Pixel-wise MSE is commonly used to train [autoencoders](/wiki/autoencoder) and [variational autoencoders](/wiki/variational_autoencoder) for image reconstruction. The loss measures the average squared difference between each pixel in the reconstructed image and the original. While effective for training, pixel-wise MSE tends to produce blurry outputs because it penalizes all pixel deviations equally, regardless of perceptual importance. For this reason, perceptual loss functions based on feature-space distances are often used alongside MSE in generative models.

### Time series forecasting

L2 loss is widely used in time series prediction tasks, where models forecast future values of a sequence. The squared error penalizes large forecast deviations, which is desirable in applications such as energy demand prediction and financial risk assessment where worst-case accuracy matters.

### Reinforcement learning

In [reinforcement learning](/wiki/reinforcement_learning), MSE is commonly used to train value function approximators. The temporal difference (TD) error, which measures the discrepancy between the current value estimate and the bootstrapped target, is often squared to form the loss for updating the value network.
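A minimal sketch of the squared TD error for a single transition. The function name, the discount factor `gamma`, and the numeric values are hypothetical; in practice the bootstrapped target is usually treated as a constant when differentiating, which this scalar example does not show.

```python
def td_squared_loss(value_s, reward, value_next, gamma=0.99, terminal=False):
    # Bootstrapped target: r + gamma * V(s'), or just r at a terminal state
    target = reward if terminal else reward + gamma * value_next
    td_error = target - value_s
    return td_error ** 2, td_error

loss, td_error = td_squared_loss(value_s=1.0, reward=0.5, value_next=2.0, gamma=0.9)
print(loss, td_error)
```

With these numbers the target is 0.5 + 0.9 × 2.0 = 2.3, so the TD error is about 1.3 and the squared loss about 1.69.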

### Signal processing and control

Outside of machine learning, L2 loss appears in signal processing (for example, Wiener filter design), control theory (linear-quadratic regulator), and communication systems (minimum mean squared error estimation). Its mathematical tractability and connection to Gaussian models make it a natural choice in these fields.

## Implementation in popular frameworks

### PyTorch

PyTorch provides MSE loss through `torch.nn.MSELoss` and the functional API `torch.nn.functional.mse_loss`. The class supports three reduction modes:

```python
import torch
import torch.nn as nn

# Create sample predictions and targets
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

# Default: mean reduction
criterion = nn.MSELoss(reduction='mean')
loss = criterion(predictions, targets)

# Sum reduction
criterion_sum = nn.MSELoss(reduction='sum')
loss_sum = criterion_sum(predictions, targets)

# No reduction (returns per-element loss)
criterion_none = nn.MSELoss(reduction='none')
loss_none = criterion_none(predictions, targets)
```

### TensorFlow / Keras

In TensorFlow, MSE loss is available as both a standalone function and a Keras loss class:

```python
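import tensorflow as tf

# Sample targets and predictions (illustrative values), shape (batch, outputs)
y_true = tf.constant([[3.0], [-0.5], [2.0], [7.5]])
y_pred = tf.constant([[2.5], [0.0], [2.1], [7.8]])

# As a Keras loss class (returns a scalar averaged over the batch by default)
loss_fn = tf.keras.losses.MeanSquaredError()
loss = loss_fn(y_true, y_pred)

# As a function (averages over the last axis; returns one value per sample)
per_sample_loss = tf.keras.losses.mean_squared_error(y_true, y_pred)

# In model compilation (a single Dense layer stands in for a real model)
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')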
```

### NumPy (manual implementation)

A simple MSE implementation from scratch:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    n = len(y_true)
    return -2 / n * (y_true - y_pred)
```
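As a usage sketch, the two functions above can drive a plain gradient descent loop for a one-feature linear model. The functions are repeated so the block is self-contained; the synthetic data, the initial values, the learning rate `lr`, and the iteration count are hypothetical choices.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    n = len(y_true)
    return -2 / n * (y_true - y_pred)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.5                      # noiseless line, chosen for illustration

w, b, lr = 0.0, 0.0, 0.5               # hypothetical initial values and step size
for _ in range(500):
    y_pred = w * x + b
    grad_pred = mse_gradient(y, y_pred)      # dMSE/dy_hat, one entry per sample
    w -= lr * np.sum(grad_pred * x)          # chain rule: dy_hat/dw = x
    b -= lr * np.sum(grad_pred)              # chain rule: dy_hat/db = 1

print(w, b, mse_loss(y, w * x + b))
```

Because the data is noiseless, the loop should recover a slope near 3.0 and an intercept near 0.5, with the loss approaching zero.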

## L2 loss with regularization

L2 loss is often combined with [regularization](/wiki/regularization) terms to prevent [overfitting](/wiki/overfitting). The two most common combinations are:

### Ridge regression (L2 regularization)

Ridge regression adds the squared L2 norm of the weight vector to the MSE loss:

**Loss = MSE + λ ||w||²₂ = (1/n) ∑(yᵢ - ŷᵢ)² + λ ∑ wⱼ²**

The [regularization](/wiki/regularization) parameter λ controls the strength of the penalty. [L2 regularization](/wiki/l2_regularization) shrinks all weights toward zero but does not set any to exactly zero, resulting in dense models. This is also called **weight decay** in the [deep learning](/wiki/deep_learning) literature.
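For a linear model, the ridge objective in the form written above has a closed-form minimizer. Setting the gradient -(2/n)Xᵀ(y - Xw) + 2λw to zero gives (XᵀX + nλI)w = Xᵀy. The sketch below solves that system with NumPy. The data and the value of `lam` are hypothetical, and no intercept is included (an intercept is normally left unpenalized).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # hypothetical data
lam = 0.1                                               # hypothetical penalty strength

# Minimizer of (1/n)||y - Xw||^2 + lam * ||w||^2:
#   (X^T X + n * lam * I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the ridge objective at w_ridge (should be close to zero)
grad = -(2 / n) * X.T @ (y - X @ w_ridge) + 2 * lam * w_ridge

print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols), np.abs(grad).max())
```

The ridge weights have a smaller norm than the unregularized least squares weights, which is the shrinkage effect described above.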

### Elastic net (L1 + L2 regularization)

Elastic net combines [L1 regularization](/wiki/l1_regularization) and [L2 regularization](/wiki/l2_regularization) with MSE loss:

**Loss = MSE + λ₁ ||w||₁ + λ₂ ||w||²₂**

This combination provides both the sparsity-inducing property of L1 and the grouping effect of L2, making it useful when features are correlated.

## Historical background

The method of least squares, which directly minimizes L2 loss, is one of the oldest techniques in statistical estimation. Adrien-Marie Legendre published the first clear exposition of the method in 1805, in an appendix to his work on determining cometary orbits. Carl Friedrich Gauss claimed in 1809 that he had been using the method since 1795, sparking a priority dispute that was never fully resolved.

Gauss made a contribution that went beyond Legendre's: he connected the method of least squares to the theory of probability by showing that if measurement errors follow a normal distribution, the least squares estimator is the maximum likelihood estimator. This connection gave the method a solid theoretical foundation and helped explain why it worked so well in practice.

The method gained rapid acceptance in the scientific community for two reasons. First, it was computationally tractable: minimizing squared error led to systems of linear equations that could be solved with pen and paper. Second, the resulting estimators had desirable statistical properties, including unbiasedness and minimum variance among linear estimators (as later formalized by the Gauss-Markov theorem in the early 20th century).

With the rise of [machine learning](/wiki/machine_learning) and [neural networks](/wiki/neural_network), MSE became the default regression loss function, and it remains one of the most commonly used loss functions in both research and production systems.

## Practical tips for using L2 loss

- **Check for outliers** before training. If the dataset contains extreme values, consider using Huber loss or preprocessing the data to cap outliers.
- **Normalize or standardize** input features so that gradient-based optimization is well conditioned; features with very different scales slow convergence. For multi-output regression, also scale the targets, because outputs with large scales can dominate the averaged MSE.
- **Monitor both MSE and RMSE.** RMSE is easier to interpret because it is in the same units as the target. MSE is easier to differentiate and optimize.
- **Use MSE for model training but consider other metrics for evaluation.** In some domains, mean absolute error (MAE) or domain-specific metrics may better reflect real-world performance.
- **Combine with [regularization](/wiki/regularization)** when training models with many parameters. [L2 regularization](/wiki/l2_regularization) is especially natural alongside L2 loss.
- **Be aware of the scale.** MSE values depend on the scale of the target variable. An MSE of 100 might be excellent for targets in the range of 10,000 but terrible for targets in the range of 1 to 10. Always interpret MSE relative to the data.
- **Consider the [learning rate](/wiki/learning_rate).** Because L2 loss gradients scale linearly with the residual, very large initial errors can produce very large gradients. If training is unstable, try reducing the learning rate or using gradient clipping.

## Limitations

- **Outlier sensitivity:** As discussed, L2 loss disproportionately weights large errors, which can be problematic with noisy or contaminated data.
- **Blurry outputs in generation tasks:** When used as a pixel-wise loss for image generation, MSE tends to average over multiple modes in the data distribution, producing blurry results rather than sharp, realistic images.
- **Assumption of Gaussian noise:** The implicit assumption that errors are normally distributed may not hold in all settings. For heavy-tailed or skewed error distributions, L2 loss produces suboptimal estimates.
- **Scale dependence:** MSE is not dimensionless. Its value depends on the scale of the target variable, which can make it difficult to compare across different tasks or datasets without normalization.
- **Insensitivity to direction:** L2 loss treats overestimation and underestimation equally. In applications where one type of error is more costly than the other (for example, predicting medication dosages), asymmetric loss functions may be more appropriate.

## See also

- [Loss function](/wiki/loss_function)
- [L1 loss](/wiki/l1_loss)
- [Squared loss](/wiki/squared_loss)
- [L2 regularization](/wiki/l2_regularization)
- [Linear regression](/wiki/linear_regression)
- [Gradient descent](/wiki/gradient_descent)
- [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)
- [Overfitting](/wiki/overfitting)
- [Cross-entropy loss](/wiki/cross_entropy_loss)

## References

1. Legendre, A. M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Appendix: "Sur la méthode des moindres quarrés."
2. Gauss, C. F. (1809). *Theoria motus corporum coelestium*. Hamburg: Perthes et Besser.
3. Stigler, S. M. (1981). "Gauss and the Invention of Least Squares." *Annals of Statistics*, 9(3), 465-474.
4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 2: Overview of Supervised Learning.
5. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 1.2.5: Loss Functions for Regression.
6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Section 6.2.2: Learning Conditional Statistics.
7. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
8. Lehmann, E. L. & Casella, G. (1998). *Theory of Point Estimation* (2nd ed.). Springer. Chapter 2: Unbiasedness.
9. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press. Section 5.1: Empirical Risk Minimization.
10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear Regression.
11. PyTorch Documentation. "MSELoss." https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
12. TensorFlow Documentation. "tf.keras.losses.MeanSquaredError." https://www.tensorflow.org/api_docs/python/tf/keras/losses/MeanSquaredError
13. Google Developers. "Linear Regression: Loss." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/linear-regression/loss
14. Raschka, S. (2022). "What is the Derivative of the Mean Squared Error?" https://sebastianraschka.com/faq/docs/mse-derivative.html
