# L2 Loss

> Source: https://aiwiki.ai/wiki/l2_loss
> Updated: 2026-07-11
> Categories: Machine Learning, Statistics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**L2 loss** is the squared-error [loss function](/wiki/loss_function): for a true value $$y$$ and a predicted value $$\hat{y}$$, it is the squared difference $$(y - \hat{y})^2$$, and averaging it across a dataset gives the [mean squared error](/wiki/mean_squared_error_mse) (MSE).[1][2] It is the default loss for most [regression](/wiki/regression) problems because squaring is smooth, differentiable, and yields a convex objective for [linear regression](/wiki/linear_regression), but the same squaring makes L2 loss highly sensitive to outliers.[2][3] Google's Machine Learning Crash Course defines it plainly: L2 loss is "the sum of the squared difference between the predicted values and the actual values," and "MSE is the average of L2 losses across a set of N examples."[2] L2 loss is also called **squared error loss**, **quadratic loss**, or **MSE loss**, and it is mathematically the same loss covered by the AI Wiki page on [squared loss](/wiki/squared_loss).[1][2]

L2 loss is a *loss function* applied to prediction errors, which is a separate idea from [L2 regularization](/wiki/l2_regularization) (also called weight decay): L2 regularization penalizes the size of a model's *weights* to control [overfitting](/wiki/overfitting), whereas L2 loss penalizes the size of the *prediction error*.[4] The two share the name "L2" because both are built from the squared [L2 norm](/wiki/regularization), but this page is about the loss function, not the regularizer.

## Explain like I'm 5 (ELI5)

Imagine you are throwing darts at a target on the wall. Every time you throw a dart, you measure how far it landed from the bullseye. L2 loss is like taking each of those distances, multiplying each one by itself (squaring it), and then finding the average. If most of your darts land close to the bullseye, the number is small. If you throw one dart way off into the corner, squaring that big distance makes the number jump up a lot. So L2 loss tells you, on average, how badly you are missing, and it really punishes the throws that are far off.

## What is L2 loss?

L2 loss measures error by squaring the gap between what a model predicts and what actually happened. For one prediction the value is $$(y - \hat{y})^2$$; over a dataset it is summed (sum of squared errors) or averaged (mean squared error).[1][2] Because the penalty grows with the square of the error, a prediction that is twice as far off contributes four times as much loss, so L2 loss focuses a model's attention disproportionately on its largest mistakes.[2][3] Google states the consequence directly: "When the difference between the prediction and label is large, squaring makes the loss even larger."[2]

The single most important statistical property is that the value which minimizes expected L2 loss is the **conditional mean** $$\mathbb{E}[y \mid x]$$.[5][6] A model with a linear output trained with L2 loss therefore learns to predict the mean of the target distribution for each input, which makes L2 loss the natural choice when the goal is to estimate the expected value of the target.[5][6]
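The mean-minimizing property can be checked numerically in the simplest case, where there are no input features and the model predicts a single constant. The sketch below is illustrative only: the target values and the helper `mean_squared_loss` are assumptions made for the example, not part of any library.

```python
import numpy as np

# Hypothetical targets; any 1-D array of numbers works.
y = np.array([2.0, 3.0, 5.0, 10.0])

def mean_squared_loss(c, targets):
    """Average squared error of the constant prediction c."""
    return np.mean((targets - c) ** 2)

# Evaluate the loss on a fine grid of candidate constants.
candidates = np.linspace(y.min(), y.max(), 8001)
losses = np.array([mean_squared_loss(c, y) for c in candidates])
best_constant = candidates[np.argmin(losses)]

print(best_constant, y.mean())
```

The grid minimizer lands on the sample mean (up to the grid spacing), which is the unconditional analogue of the conditional-mean result.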

### Mathematical definition

#### Squared error for a single prediction

For a single data point with true value $$y$$ and predicted value $$\hat{y}$$ (y-hat), the squared error is:

$$
\text{SE} = (y - \hat{y})^2
$$

#### Mean squared error (MSE)

When computed over a dataset of *n* observations, the mean squared error averages the individual squared errors:[1][2]

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Here, $$y_i$$ is the true value of the $$i$$-th observation and $$\hat{y}_i$$ is the model's prediction for that observation. The [mean squared error](/wiki/mean_squared_error_mse) is exactly the average of the per-example L2 losses.[2]

#### Sum of squared errors (SSE)

Some formulations use the total (non-averaged) form:

$$
\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

SSE and MSE differ only by the constant factor $$1/n$$, so minimizing one is equivalent to minimizing the other. The MSE form is more common in machine learning because it keeps the loss magnitude independent of dataset size.

#### Matrix notation

In matrix form, with error vector $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$:

$$
\text{MSE} = \frac{1}{n} \mathbf{e}^\top \mathbf{e}
$$

This compact notation is useful when deriving closed-form solutions in linear regression.

## Why is it called L2 loss? The L2-norm connection

The name comes from the **L2 norm**, also called the Euclidean norm, which measures the length of a vector as the square root of the sum of its squared elements: $$\lVert v \rVert_2 = \sqrt{\sum v_i^2}$$.[7] L2 loss is the **squared** L2 norm of the residual vector (the vector of errors), optionally divided by *n* for the mean version.[7] Squaring the residual is the same operation, applied componentwise, that defines Euclidean distance, which is why "L2 loss," "squared error," and "Euclidean loss" all refer to the same quantity.[7]
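A short NumPy sketch makes the relationship concrete. It assumes a small hand-picked residual vector and shows that the squared L2 norm, the matrix form $$\mathbf{e}^\top \mathbf{e}$$, and the sum of squared errors are the same number.

```python
import numpy as np

# Example values chosen for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])
residual = y_true - y_pred

l2_norm = np.linalg.norm(residual)       # Euclidean length of the residual vector
sse_from_norm = l2_norm ** 2             # squared L2 norm
sse_from_dot = residual @ residual       # e^T e, the matrix form
sse_direct = np.sum(residual ** 2)       # sum of squared errors

mse_from_norm = sse_from_norm / residual.size   # divide by n for the mean version
```

All three SSE values agree up to floating-point rounding, and dividing by the number of elements gives the MSE.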

This is also the source of the most common naming confusion. Three distinct objects share the "L2" label: the **L2 norm** (a way to measure a vector's length), the **L2 loss** (the squared L2 norm of the *error* vector, the subject of this page), and **L2 regularization** (a penalty equal to the squared L2 norm of the *weight* vector, added to a loss to discourage large weights).[4][7] They are related but distinct, and the section below contrasts the loss with the regularizer directly.

## Gradient and optimization

One of the main reasons L2 loss is popular is that its gradient has a simple, closed-form expression. The partial derivative of MSE with respect to a predicted value $$\hat{y}_i$$ is:[14]

$$
\frac{\partial \text{MSE}}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i)
$$

This gradient is linear in the residual $$(y_i - \hat{y}_i)$$. When the prediction is far from the target, the gradient is large, driving a strong update. When the prediction is close to the target, the gradient is small, allowing fine-grained convergence.[14] During [backpropagation](/wiki/backpropagation), this gradient is propagated through the network using the chain rule to update all trainable parameters.
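One way to sanity-check the gradient formula is to compare it with central finite differences. The following is a minimal sketch; the function names `mse` and `mse_grad` and the sample values are assumptions for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def mse_grad(y_true, y_pred):
    """Analytic gradient: -(2/n) * (y - y_hat) for each prediction."""
    return -2.0 / y_true.size * (y_true - y_pred)

y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

# Central finite differences, perturbing one prediction at a time.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.size):
    step = np.zeros_like(y_pred)
    step[i] = eps
    numeric[i] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * eps)

analytic = mse_grad(y_true, y_pred)
```

Because MSE is quadratic in each prediction, the central difference matches the analytic gradient up to rounding error.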

In the special case of [linear regression](/wiki/linear_regression) with MSE loss, the loss surface is a convex paraboloid. Setting the gradient to zero yields the **normal equation**, which provides a closed-form solution:

$$
w^* = (X^\top X)^{-1} X^\top y
$$
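As a rough illustration, the normal equation can be solved directly with NumPy. The data here are synthetic and the coefficients are assumptions chosen for the example. The sketch solves the linear system instead of forming the explicit inverse, which is usually the numerically safer choice, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2 + noise
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])   # first column models the intercept
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) w = X^T y rather than inverting X^T X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes recover weights close to the coefficients used to generate the data. When $$X^\top X$$ is singular or badly conditioned, `np.linalg.lstsq` is the more robust option.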

For [deep learning](/wiki/deep_learning) models, the loss surface is generally non-convex in the weights because of the network's nonlinear [activation functions](/wiki/activation_function). However, the L2 loss itself is always convex with respect to the network's outputs (the predictions), which contributes to stable training dynamics.[6]

## Key mathematical properties

| Property | Description |
|---|---|
| Non-negativity | L2 loss is always greater than or equal to zero. It equals zero only when every prediction exactly matches its target. |
| Convexity | The function is convex with respect to the predictions. For models that are linear in their parameters, this guarantees that any local minimum is also the global minimum.[6] |
| Differentiability | L2 loss is smooth and continuously differentiable everywhere, unlike [L1 loss](/wiki/l1_loss) which has a non-differentiable point at zero. |
| Symmetry | The loss is symmetric around zero: overestimating by *k* units incurs the same penalty as underestimating by *k* units. |
| Sensitivity to scale | Squaring amplifies errors larger than 1 and shrinks errors smaller than 1. An error of 10 contributes 100 to the loss, an error of 1 contributes 1, and an error of 0.1 contributes only 0.01.[2] |
| Decomposability | The total MSE over a dataset is the average of independent per-sample terms, which makes it straightforward to compute in mini-batch settings. |
| Quadratic growth | The loss grows quadratically with the magnitude of the error, meaning doubling the error quadruples the loss.[2] |
| Optimal prediction | The constant prediction that minimizes expected L2 loss is the (conditional) mean of the targets.[5][6] |

## Bias-variance decomposition

In statistical estimation theory, MSE admits a well-known decomposition into bias and variance components. For an estimator $$\hat{\theta}$$ of a parameter $$\theta$$:[8]

$$
\text{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})
$$

The **bias** term measures the systematic deviation of the estimator's expected value from the true parameter, while the **variance** term measures how much the estimator fluctuates across different samples drawn from the same distribution. This decomposition is central to the [bias-variance tradeoff](/wiki/bias_variance_tradeoff): a model with high bias tends to underfit, while a model with high variance tends to overfit.
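The decomposition can be verified with a small simulation. The sketch below assumes a deliberately biased estimator (a sample mean shrunk by a factor of 0.8) of a known parameter. The true value, sample size, and shrink factor are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                      # true parameter (assumed for the simulation)
n_trials, n_samples = 20_000, 10

# Each row is one dataset drawn from the same distribution.
samples = rng.normal(loc=theta, scale=1.0, size=(n_trials, n_samples))
estimates = 0.8 * samples.mean(axis=1)   # deliberately biased estimator

mse = np.mean((estimates - theta) ** 2)
bias = estimates.mean() - theta
variance = estimates.var()               # population variance (ddof=0)

print(mse, bias ** 2 + variance)
```

The two printed numbers agree up to rounding, because the identity holds exactly for the empirical distribution of the estimates. In this setup the bias is close to $$-0.4$$ and the variance close to $$0.064$$.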

When irreducible noise (also called Bayes error) is present in the data, the full decomposition becomes:

$$
\text{Expected MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
$$

The irreducible error represents noise inherent in the data that no model can eliminate. Understanding this decomposition helps practitioners diagnose whether a model's poor MSE stems from systematic errors (high bias), instability (high variance), or noisy data.

## How does L2 loss connect to maximum likelihood and the Gaussian?

L2 loss has a deep connection to probability theory through [maximum likelihood estimation](/wiki/maximum_likelihood_estimation) (MLE). If we assume the target variable follows a Gaussian (normal) distribution centered on the model's prediction, with constant variance $$\sigma^2$$:[5][6]

$$
y_i = f(x_i) + \epsilon_i, \text{ where } \epsilon_i \sim \mathcal{N}(0, \sigma^2)
$$

Then the negative log-likelihood of the observed data is:

$$
-\log L = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum (y_i - f(x_i))^2
$$

Because $$\sigma^2$$ is assumed fixed, the first term does not depend on the model $$f$$, so minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors.[5][6] Using L2 loss therefore corresponds to assuming that prediction errors are normally distributed with constant variance. When this assumption holds for a linear model, the least-squares estimator is the minimum-variance unbiased estimator (it achieves the Cramér-Rao lower bound).[8] When errors are not normally distributed (for example, when the data contains heavy-tailed outliers), other loss functions such as [L1 loss](/wiki/l1_loss) or Huber loss may be more appropriate.[7]
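The equivalence can be illustrated numerically. In this sketch the helper `gaussian_nll`, the noise level `sigma`, and the two candidate prediction vectors are assumptions made for the example. It checks that the gap in negative log-likelihood between two candidates equals the gap in their sums of squared errors divided by $$2\sigma^2$$.

```python
import math
import numpy as np

def gaussian_nll(y, f_x, sigma):
    """Negative log-likelihood of y under N(f_x, sigma^2), summed over points."""
    log_pdf = -0.5 * np.log(2 * math.pi * sigma ** 2) - (y - f_x) ** 2 / (2 * sigma ** 2)
    return -np.sum(log_pdf)

y = np.array([3.0, -0.5, 2.0, 7.5])
sigma = 1.5   # assumed fixed noise level

pred_a = np.array([2.5, 0.0, 2.1, 7.8])
pred_b = np.array([2.0, 1.0, 2.0, 9.0])

sse_a = np.sum((y - pred_a) ** 2)
sse_b = np.sum((y - pred_b) ** 2)

# The NLL gap between two candidates depends only on their SSE gap.
nll_gap = gaussian_nll(y, pred_a, sigma) - gaussian_nll(y, pred_b, sigma)
sse_gap = (sse_a - sse_b) / (2 * sigma ** 2)
```

Since the two gaps match, whichever candidate has the lower SSE also has the higher likelihood, regardless of the value chosen for `sigma`.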

## Why is L2 loss so sensitive to outliers?

The quadratic nature of L2 loss makes it highly sensitive to outliers. Consider a dataset where most residuals are around 1, but one outlier has a residual of 50. The outlier contributes 2,500 to the sum of squared errors, while a typical point contributes only 1. This single outlier can dominate the total loss and pull the model's predictions toward it, degrading performance on the majority of the data.[2][3] Google's Machine Learning Crash Course makes the practical effect explicit: "MSE moves the model more toward the outliers, while MAE doesn't," because "L2 loss incurs a much higher penalty for an outlier than L1 loss."[2]
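The arithmetic in this example can be made explicit. The sketch below assumes 99 typical points with residual 1 and a single outlier with residual 50 (the count of typical points is an assumption, since the text only says "most"), and compares the outlier's share of the total loss under squared and absolute error.

```python
import numpy as np

# Assumed setup: 99 typical residuals of size 1 plus one outlier of size 50.
residuals = np.concatenate([np.ones(99), [50.0]])

squared = residuals ** 2
absolute = np.abs(residuals)

outlier_share_l2 = squared[-1] / squared.sum()     # 2500 / 2599, about 0.96
outlier_share_l1 = absolute[-1] / absolute.sum()   # 50 / 149, about 0.34
```

Under these assumptions, one point in a hundred accounts for roughly 96% of the squared-error total but only about a third of the absolute-error total.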

This behavior is a double-edged sword. In settings where large errors are genuinely costly (for example, predicting structural loads in engineering, where even one large miscalculation can cause failure), L2 loss appropriately assigns heavy penalties to big mistakes. In settings where outliers are merely noise or data entry errors, L2 loss can produce misleading models.

Strategies for dealing with outlier sensitivity include:

- **Data preprocessing:** Remove or cap extreme values before training.
- **Robust loss functions:** Use Huber loss or log-cosh loss, which behave quadratically near zero but linearly for large errors.[7]
- **[Regularization](/wiki/regularization):** Apply [L2 regularization](/wiki/l2_regularization) (weight decay) to constrain model capacity and reduce sensitivity to individual data points.[4]
- **Trimmed or Winsorized estimators:** Exclude or downweight a fixed percentage of extreme residuals.

## L2 loss vs L1 loss: what is the difference?

The practical contrast between L2 loss and [L1 loss](/wiki/l1_loss) (absolute error) comes down to how each treats large errors and what statistic each one targets. Google defines L1 loss as "the sum of the absolute values of the difference between the predicted values and the actual values," against L2 loss as the sum of *squared* differences.[2] L1 loss applies a constant penalty per unit of error and so is robust to outliers and predicts the conditional **median**; L2 loss applies a growing penalty and predicts the conditional **mean**.[2][5]

| Aspect | L2 loss (squared error) | L1 loss (absolute error) |
|---|---|---|
| Formula | $$\frac{1}{n} \sum (y_i - \hat{y}_i)^2$$ | $$\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$$ |
| Gradient behavior | Gradient proportional to residual; large errors produce large gradients | Constant gradient magnitude (±1); does not scale with error size |
| Outlier sensitivity | High; squaring amplifies large errors | Low; linear penalty on large errors |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Optimal prediction | Predicts the conditional mean | Predicts the conditional median |
| Noise assumption | Assumes Gaussian noise | Assumes Laplacian noise |
| Closed-form solution | Available for [linear regression](/wiki/linear_regression) (normal equation) | Not available; requires iterative methods |
| Convergence speed | Generally faster due to smooth gradient | Can be slower near the optimum due to constant gradient |
| Sparsity | Does not encourage sparse residuals | Tends to fit some points exactly, giving sparse residuals (sparse *coefficients* come from L1 regularization, not from L1 loss) |

In practice, L2 loss is preferred when the data is clean, errors are roughly Gaussian, and the model should avoid any large individual errors. [L1 loss](/wiki/l1_loss) is preferred when robustness to outliers is needed or when the conditional median is a more meaningful prediction than the conditional mean.[2][5]
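The mean-versus-median contrast is easy to see with a constant predictor on data containing one outlier. The values and the helper names `l2` and `l1` below are illustrative assumptions.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last value is an outlier

mean_pred = y.mean()         # minimizes the L2 (squared) loss: 22.0
median_pred = np.median(y)   # minimizes the L1 (absolute) loss: 3.0

def l2(c):
    """Mean squared error of the constant prediction c."""
    return np.mean((y - c) ** 2)

def l1(c):
    """Mean absolute error of the constant prediction c."""
    return np.mean(np.abs(y - c))

print(l2(mean_pred), l2(median_pred))   # the mean wins under L2
print(l1(mean_pred), l1(median_pred))   # the median wins under L1
```

The single outlier drags the L2-optimal prediction to 22, far from the bulk of the data, while the L1-optimal prediction stays at 3.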

## Is L2 loss the same as L2 regularization?

No. Despite the shared name, L2 loss and [L2 regularization](/wiki/l2_regularization) do different jobs, and a model can use either, both, or neither.[4][7] L2 loss is a *loss function*: it measures how wrong predictions are by squaring the prediction error $$(y - \hat{y})^2$$. L2 regularization (weight decay) is a *penalty term added to a loss*: it adds the squared L2 norm of the model's *weights*, $$\lambda \lVert w \rVert_2^2$$, to discourage large weights and reduce [overfitting](/wiki/overfitting).[4] The error vector is the input to L2 loss; the weight vector is the input to L2 regularization. They can appear together, for example in [ridge regression](/wiki/least_squares_regression), where the training objective is L2 loss **plus** an L2 regularization term.

| | L2 loss (this page) | L2 regularization (weight decay) |
|---|---|---|
| What it acts on | The prediction error $$(y - \hat{y})$$ | The model weights $$w$$ |
| Role | A loss function (the objective to minimize) | A penalty added to the objective |
| Typical form | $$\frac{1}{n} \sum (y_i - \hat{y}_i)^2$$ | $$\lambda \sum w_j^2$$ |
| Purpose | Measure prediction accuracy | Control model complexity / [overfitting](/wiki/overfitting) |
| Built from | Squared [L2 norm](/wiki/regularization) of the residual vector | Squared L2 norm of the weight vector |

## L2 loss vs Huber loss

Huber loss is a hybrid that combines the best properties of L2 and L1 loss. It is defined by a threshold parameter $$\delta$$:[7]

- For residuals smaller than $$\delta$$, Huber loss behaves like L2 loss (quadratic).
- For residuals larger than $$\delta$$, Huber loss behaves like L1 loss (linear).

This design gives Huber loss the smooth gradients of L2 loss near zero (enabling efficient convergence) while limiting the influence of outliers. The parameter $$\delta$$ is typically set via cross-validation.
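A minimal NumPy sketch of the piecewise definition follows. It uses the common convention $$\tfrac{1}{2} r^2$$ for $$\lvert r \rvert \le \delta$$ and $$\delta(\lvert r \rvert - \tfrac{1}{2}\delta)$$ otherwise. The function name `huber_loss` and the sample values are assumptions for the example, not a specific library's API.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.mean(np.where(abs_r <= delta, quadratic, linear))

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 50.0])   # residuals 0.5, 1.0 and an outlier of 50

huber_value = huber_loss(y_true, y_pred, delta=1.0)
half_mse = 0.5 * np.mean((y_true - y_pred) ** 2)
```

With `delta=1.0` the outlier contributes 49.5 to the Huber total instead of the 1250 it contributes to half the squared error, so `huber_value` is far smaller than `half_mse`.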

## L2 loss vs log-cosh loss

Log-cosh loss uses the logarithm of the hyperbolic cosine of the error: $$\log(\cosh(y - \hat{y}))$$. For small errors, it approximates $$(y - \hat{y})^2 / 2$$ (like L2 loss). For large errors, it approximates $$\lvert y - \hat{y} \rvert - \log(2)$$ (like L1 loss). Unlike Huber loss, log-cosh is twice differentiable everywhere, which can be advantageous for second-order optimization methods.
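Both approximations can be checked with a numerically stable implementation. This sketch relies on the identity $$\log\cosh r = \lvert r \rvert + \log(1 + e^{-2\lvert r \rvert}) - \log 2$$, which avoids overflow in `cosh` for large inputs. The function name `log_cosh` is an assumption for the example.

```python
import numpy as np

def log_cosh(r):
    """Numerically stable log(cosh(r)), elementwise."""
    a = np.abs(np.asarray(r, dtype=float))
    # log(cosh(r)) = |r| + log(1 + exp(-2|r|)) - log(2); avoids overflow in cosh.
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

small = np.array([0.01, -0.02])
large = np.array([20.0, -30.0])

small_vs_quadratic = log_cosh(small) - small ** 2 / 2               # close to 0
large_vs_linear = log_cosh(large) - (np.abs(large) - np.log(2.0))   # close to 0
```

For moderate inputs the function agrees with the direct `np.log(np.cosh(r))`, and it stays finite for inputs large enough to overflow `np.cosh`.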

### Summary of regression loss functions

| Loss function | Outlier robustness | Differentiability | Gradient at zero | Typical use case |
|---|---|---|---|---|
| [L2 loss](/wiki/l2_loss) | Low | Smooth everywhere | Zero | Clean data, Gaussian noise |
| [L1 loss](/wiki/l1_loss) | High | Not differentiable at 0 | Undefined | Heavy-tailed noise, median prediction |
| Huber loss | Medium-high | Continuous first derivative | Zero | Mixed noise, tunable threshold |
| Log-cosh loss | Medium-high | Twice differentiable | Zero | When second-order smoothness is needed |

## Relationship to related metrics

### Root mean squared error (RMSE)

[RMSE](/wiki/root_mean_squared_error_rmse) is the square root of MSE:

$$
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum (y_i - \hat{y}_i)^2}
$$

RMSE has the advantage of being expressed in the same units as the target variable, which makes it easier to interpret. For example, if the target is measured in dollars, MSE is in "squared dollars" (a unit with no intuitive meaning), while RMSE is directly in dollars. Minimizing RMSE is equivalent to minimizing MSE, since the square root is a monotonically increasing function.

### Coefficient of determination (R²)

R² measures the proportion of variance in the target variable explained by the model:

$$
R^2 = 1 - \frac{\text{MSE}}{\text{Var}(y)} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$

where $$\bar{y}$$ is the mean of the observed values. R² ranges from $$-\infty$$ to 1, with 1 indicating a perfect fit. Unlike MSE, R² is dimensionless and scale-invariant, which makes it useful for comparing models across different datasets.
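Both metrics follow directly from the squared errors. The sketch below uses illustrative values and computes R² two ways to show that the ratio of sums of squares equals $$\text{MSE} / \text{Var}(y)$$ when the variance is the population variance (`np.var` with its default `ddof=0`).

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)   # same units as the target

# R^2 from sums of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_sums = 1.0 - ss_res / ss_tot

# R^2 from MSE and the population variance of the targets.
r2_var = 1.0 - mse / np.var(y_true)
```

Using the sample variance (`ddof=1`) in the second form would give a slightly different number, so the two expressions only coincide under the population-variance convention.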

### Squared loss vs L2 norm

The terms can be confusing because "L2" appears in multiple contexts. The **L2 norm** (Euclidean norm) of a vector is the square root of the sum of squared elements: $$\lVert v \rVert_2 = \sqrt{\sum v_i^2}$$.[7] The **L2 loss** (squared error loss) is the square of the L2 norm of the residual vector (divided by n for the mean version).[7] [L2 regularization](/wiki/l2_regularization) adds the squared L2 norm of the weight vector as a penalty term to the loss.[4] These are related but distinct concepts.

## What is L2 loss used for? Applications

### Linear regression

[Linear regression](/wiki/linear_regression) with L2 loss is the classical "ordinary least squares" (OLS) method, first published by Adrien-Marie Legendre in 1805 and developed independently by Carl Friedrich Gauss, who claimed to have used the method since 1795.[1][3] The normal equation provides a closed-form solution, and the Gauss-Markov theorem guarantees that OLS produces the best linear unbiased estimator (BLUE) under certain conditions (linearity, exogeneity, homoscedasticity, no perfect multicollinearity).

### Neural network training

In [deep learning](/wiki/deep_learning), L2 loss is the standard choice for regression output layers. A [neural network](/wiki/neural_network) with a linear output neuron trained using MSE loss learns to predict the conditional mean of the target distribution.[6] For multi-output regression (for example, predicting the x and y coordinates of an object), MSE is applied element-wise across all outputs and averaged.

### Object detection and localization

In [object detection](/wiki/object_detection) models, squared error and closely related losses are used to train the bounding box regression head. The original YOLO used a sum-squared error on the box coordinates, while Faster R-CNN uses Smooth L1 loss, a Huber-style loss that is quadratic near zero and linear for large errors. In both cases the model predicts four coordinates (x, y, width, height) for each detected object, and the discrepancy between predicted and ground-truth coordinates forms the localization loss.

### Image reconstruction and generation

Pixel-wise MSE is commonly used to train [autoencoders](/wiki/autoencoder) and [variational autoencoders](/wiki/variational_autoencoder) for image reconstruction. The loss measures the average squared difference between each pixel in the reconstructed image and the original. While effective for training, pixel-wise MSE tends to produce blurry outputs: the loss-minimizing output is an average over all plausible reconstructions, and every pixel deviation is penalized equally regardless of perceptual importance. For this reason, perceptual loss functions based on feature-space distances are often used alongside MSE in generative models.

### Time series forecasting

L2 loss is widely used in time series prediction tasks, where models forecast future values of a sequence. The squared error penalizes large forecast deviations, which is desirable in applications such as energy demand prediction and financial risk assessment where worst-case accuracy matters.

### Reinforcement learning

In [reinforcement learning](/wiki/reinforcement_learning), MSE is commonly used to train value function approximators. The temporal difference (TD) error, which measures the discrepancy between the current value estimate and the bootstrapped target, is often squared to form the loss for updating the value network.

### Signal processing and control

Outside of machine learning, L2 loss appears in signal processing (for example, Wiener filter design), control theory (linear-quadratic regulator), and communication systems (minimum mean squared error estimation). Its mathematical tractability and connection to Gaussian models make it a natural choice in these fields.

## Implementation in popular frameworks

### PyTorch

PyTorch provides MSE loss through `torch.nn.MSELoss` and the functional API `torch.nn.functional.mse_loss`. The class supports three reduction modes:[11]

```python
import torch
import torch.nn as nn

# Create sample predictions and targets
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

# Default: mean reduction
criterion = nn.MSELoss(reduction='mean')
loss = criterion(predictions, targets)

# Sum reduction
criterion_sum = nn.MSELoss(reduction='sum')
loss_sum = criterion_sum(predictions, targets)

# No reduction (returns per-element loss)
criterion_none = nn.MSELoss(reduction='none')
loss_none = criterion_none(predictions, targets)
```

### TensorFlow / Keras

In TensorFlow, MSE loss is available as both a standalone function and a Keras loss class:[12]

```python
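import tensorflow as tf

# Sample targets and predictions with shape (batch, outputs)
y_true = tf.constant([[3.0], [-0.5], [2.0], [7.5]])
y_pred = tf.constant([[2.5], [0.0], [2.1], [7.8]])

# As a Keras loss class (averages over the batch by default)
loss_fn = tf.keras.losses.MeanSquaredError()
loss = loss_fn(y_true, y_pred)

# As a function (returns one value per sample, averaged over the last axis)
per_sample_loss = tf.keras.losses.mean_squared_error(y_true, y_pred)

# In model compilation, using the string alias 'mse'
model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')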
```

### NumPy (manual implementation)

A simple MSE implementation from scratch:

```python
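import numpy as np

def mse_loss(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    """Gradient of the MSE with respect to each prediction."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    n = y_true.size  # number of elements being averaged
    return -2 / n * (y_true - y_pred)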
```

## L2 loss with regularization

L2 loss is often combined with [regularization](/wiki/regularization) terms to prevent [overfitting](/wiki/overfitting). The two most common combinations are:[4]

### Ridge regression (L2 regularization)

Ridge regression adds the squared L2 norm of the weight vector to the MSE loss:[4]

$$
\text{Loss} = \text{MSE} + \lambda \lVert w \rVert_2^2 = \frac{1}{n} \sum (y_i - \hat{y}_i)^2 + \lambda \sum w_j^2
$$

The [regularization](/wiki/regularization) parameter $$\lambda$$ controls the strength of the penalty. [L2 regularization](/wiki/l2_regularization) shrinks all weights toward zero but does not set any to exactly zero, resulting in dense models. This is also called **weight decay** in the [deep learning](/wiki/deep_learning) literature.[4]
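For a linear model, this objective also has a closed-form minimizer. Setting the gradient of $$\frac{1}{n}\lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2$$ to zero gives $$(X^\top X + n\lambda I)\,w = X^\top y$$. The sketch below checks this on synthetic data; the data, the value of `lam`, and the absence of an intercept term are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
lam = 0.1   # regularization strength (lambda), chosen arbitrarily here

def ridge_objective(w):
    """MSE plus lambda times the squared L2 norm of the weights."""
    return np.mean((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y.
w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of the objective should vanish at the ridge solution.
grad = -2.0 / n * X.T @ (y - X @ w_ridge) + 2.0 * lam * w_ridge
```

The ridge weights have a smaller norm than the unregularized least-squares weights, yet none of them is exactly zero, which matches the shrinkage behavior described above. The factor `n` appears because the loss is averaged; with a summed loss the system would be $$(X^\top X + \lambda I)\,w = X^\top y$$.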

### Elastic net (L1 + L2 regularization)

Elastic net combines [L1 regularization](/wiki/l1_regularization) and [L2 regularization](/wiki/l2_regularization) with MSE loss:

$$
\text{Loss} = \text{MSE} + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
$$

This combination provides both the sparsity-inducing property of L1 and the grouping effect of L2, making it useful when features are correlated.

## Historical background

The method of least squares, which directly minimizes L2 loss, is one of the oldest techniques in statistical estimation. Adrien-Marie Legendre published the first clear exposition of the method in 1805, in an appendix to his work on determining cometary orbits.[1] Carl Friedrich Gauss claimed in 1809 that he had been using the method since 1795, sparking a priority dispute that was never fully resolved.[3]

Gauss made a contribution that went beyond Legendre's: he connected the method of least squares to the theory of probability by showing that if measurement errors follow a normal distribution, the least squares estimator is the maximum likelihood estimator.[3][5] This connection gave the method a solid theoretical foundation and helped explain why it worked so well in practice.

The method gained rapid acceptance in the scientific community for two reasons. First, it was computationally tractable: minimizing squared error led to systems of linear equations that could be solved with pen and paper. Second, the resulting estimators had desirable statistical properties, including unbiasedness and minimum variance among linear estimators (as later formalized by the Gauss-Markov theorem in the early 20th century).

With the rise of [machine learning](/wiki/machine_learning) and [neural networks](/wiki/neural_network), MSE became the default regression loss function, and it remains one of the most commonly used loss functions in both research and production systems.

## Practical tips for using L2 loss

- **Check for outliers** before training. If the dataset contains extreme values, consider using Huber loss or preprocessing the data to cap outliers.[2]
- **Normalize or standardize** input features so that all features contribute roughly equally to the MSE. Without normalization, features with large scales can dominate the loss.
- **Monitor both MSE and RMSE.** RMSE is easier to interpret because it is in the same units as the target. MSE is easier to differentiate and optimize.
- **Use MSE for model training but consider other metrics for evaluation.** In some domains, mean absolute error (MAE) or domain-specific metrics may better reflect real-world performance.
- **Combine with [regularization](/wiki/regularization)** when training models with many parameters. [L2 regularization](/wiki/l2_regularization) is especially natural alongside L2 loss.[4]
- **Be aware of the scale.** MSE values depend on the scale of the target variable. An MSE of 100 might be excellent for targets in the range of 10,000 but terrible for targets in the range of 1 to 10. Always interpret MSE relative to the data.
- **Consider the [learning rate](/wiki/learning_rate).** Because L2 loss gradients scale linearly with the residual, very large initial errors can produce very large gradients. If training is unstable, try reducing the learning rate or using gradient clipping.[14]

## Limitations

- **Outlier sensitivity:** As discussed, L2 loss disproportionately weights large errors, which can be problematic with noisy or contaminated data.[2][3]
- **Blurry outputs in generation tasks:** When used as a pixel-wise loss for image generation, MSE tends to average over multiple modes in the data distribution, producing blurry results rather than sharp, realistic images.
- **Assumption of Gaussian noise:** The implicit assumption that errors are normally distributed may not hold in all settings. For heavy-tailed or skewed error distributions, L2 loss produces suboptimal estimates.[5]
- **Scale dependence:** MSE is not dimensionless. Its value depends on the scale of the target variable, which can make it difficult to compare across different tasks or datasets without normalization.
- **Insensitivity to direction:** L2 loss treats overestimation and underestimation equally. In applications where one type of error is more costly than the other (for example, predicting medication dosages), asymmetric loss functions may be more appropriate.

## See also

- [Loss function](/wiki/loss_function)
- [Squared loss](/wiki/squared_loss)
- [L1 loss](/wiki/l1_loss)
- [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse)
- [Root Mean Squared Error (RMSE)](/wiki/root_mean_squared_error_rmse)
- [L2 regularization](/wiki/l2_regularization)
- [Linear regression](/wiki/linear_regression)
- [Gradient descent](/wiki/gradient_descent)
- [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)
- [Overfitting](/wiki/overfitting)
- [Cross-entropy loss](/wiki/cross_entropy_loss)
- [Maximum likelihood estimation](/wiki/maximum_likelihood_estimation)

## References

1. Legendre, A. M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Appendix: "Sur la méthode des moindres quarrés."
2. Google Developers. "Linear Regression: Loss." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/linear-regression/loss
3. Stigler, S. M. (1981). "Gauss and the Invention of Least Squares." *Annals of Statistics*, 9(3), 465-474.
4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear Methods for Regression (ridge / L2 regularization).
5. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 1.2.5: Loss Functions for Regression; Section 3.1.1.
6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Section 6.2.2: Learning Conditional Statistics. https://www.deeplearningbook.org/contents/mlp.html
7. Murphy, K. P. (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press. Sections on norms, robust losses (Huber), and empirical risk minimization. https://probml.github.io/pml-book/book1.html
8. Lehmann, E. L. & Casella, G. (1998). *Theory of Point Estimation* (2nd ed.). Springer. Chapter 2: Unbiasedness; bias-variance and Cramér-Rao bound.
9. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.). Springer. Chapter 3: Linear Regression.
11. PyTorch Documentation. "MSELoss." https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
12. TensorFlow Documentation. "tf.keras.losses.MeanSquaredError." https://www.tensorflow.org/api_docs/python/tf/keras/losses/MeanSquaredError
13. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 2: Overview of Supervised Learning.
14. Raschka, S. (2022). "What is the Derivative of the Mean Squared Error?" https://sebastianraschka.com/faq/docs/mse-derivative.html