L2 loss, also known as squared error loss, quadratic loss, or mean squared error (MSE) loss, is one of the most widely used loss functions in machine learning and statistics. It measures the average of the squared differences between predicted values and actual target values. Because squaring penalizes large errors more heavily than small ones, L2 loss is particularly sensitive to outliers but provides smooth, differentiable gradients that are well suited for gradient descent optimization. It serves as the default loss function for most regression tasks and plays a central role in linear regression, neural networks, and many other predictive models.
Imagine you are throwing darts at a target on the wall. Every time you throw a dart, you measure how far it landed from the bullseye. L2 loss is like taking each of those distances, multiplying each one by itself (squaring it), and then finding the average. If most of your darts land close to the bullseye, the number is small. If you throw one dart way off into the corner, squaring that big distance makes the number jump up a lot. So L2 loss tells you, on average, how badly you are missing, and it really punishes the throws that are far off.
For a single data point with true value y and predicted value ŷ (y-hat), the squared error is:
SE = (y - ŷ)²
When computed over a dataset of n observations, the mean squared error averages the individual squared errors:
MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ - ŷᵢ)²
Here, yᵢ is the true value of the i-th observation and ŷᵢ is the model's prediction for that observation.
Some formulations use the total (non-averaged) form:
SSE = ∑ᵢ₌₁ⁿ (yᵢ - ŷᵢ)²
SSE and MSE differ only by the constant factor 1/n, so minimizing one is equivalent to minimizing the other. The MSE form is more common in machine learning because it keeps the loss magnitude independent of dataset size.
In matrix form, with error vector e = y - ŷ:
MSE = (1/n) eᵀe
This compact notation is useful when deriving closed-form solutions in linear regression.
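As a quick sanity check, here is a minimal NumPy sketch (with made-up values) confirming that the matrix form equals the element-wise mean of squared residuals:

```python
import numpy as np

# Illustrative values only
y = np.array([3.0, -0.5, 2.0, 7.5])       # true targets
y_hat = np.array([2.5, 0.0, 2.1, 7.8])    # predictions

e = y - y_hat                              # residual vector
mse_matrix = (e @ e) / len(y)              # (1/n) eᵀe
mse_elementwise = np.mean((y - y_hat) ** 2)

print(mse_matrix, mse_elementwise)         # both ≈ 0.15
```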
One of the main reasons L2 loss is popular is that its gradient has a simple, closed-form expression. The partial derivative of MSE with respect to a predicted value ŷᵢ is:
∂MSE / ∂ŷᵢ = -(2/n)(yᵢ - ŷᵢ)
This gradient is linear in the residual (yᵢ - ŷᵢ). When the prediction is far from the target, the gradient is large, driving a strong update. When the prediction is close to the target, the gradient is small, allowing fine-grained convergence. During backpropagation, this gradient is propagated through the network using the chain rule to update all trainable parameters.
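The following sketch (illustrative values, step size chosen arbitrarily) evaluates this gradient and applies one gradient-descent step directly to the predictions, showing the loss decrease:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.5])
y_hat = np.array([2.5, 0.0, 2.1, 7.8])

grad = -(2 / len(y)) * (y - y_hat)     # ∂MSE/∂ŷᵢ, linear in the residual
y_hat_new = y_hat - 0.5 * grad         # one step; 0.5 is an arbitrary step size

print(np.mean((y - y_hat) ** 2))       # 0.15
print(np.mean((y - y_hat_new) ** 2))   # ≈ 0.084, reduced by the step
```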
In the special case of linear regression with MSE loss, the loss surface is a convex paraboloid. Setting the gradient to zero yields the normal equation, which provides a closed-form solution:
w = (XᵀX)⁻¹ Xᵀy
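As an illustration, here is a minimal NumPy sketch that solves the normal equation on synthetic data (the feature matrix and noise level are made up; np.linalg.solve is used rather than forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)                     # small Gaussian noise

# Solve (XᵀX) w = Xᵀy directly
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # close to [1.0, 2.0, -3.0]
```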
For deep learning models, the loss surface is generally non-convex due to the network's nonlinear activation functions. However, the L2 loss component itself is always convex with respect to the network's output layer, which contributes to stable training dynamics.
| Property | Description |
|---|---|
| Non-negativity | L2 loss is always greater than or equal to zero. It equals zero only when every prediction exactly matches its target. |
| Convexity | The function is convex with respect to the predictions, guaranteeing that any local minimum is also the global minimum (for linear models). |
| Differentiability | L2 loss is smooth and continuously differentiable everywhere, unlike L1 loss which has a non-differentiable point at zero. |
| Symmetry | The loss is symmetric around zero: overestimating by k units incurs the same penalty as underestimating by k units. |
| Sensitivity to scale | Squaring amplifies large errors and diminishes small ones. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. |
| Decomposability | The total MSE over a dataset is the average of independent per-sample terms, which makes it straightforward to compute in mini-batch settings. |
| Quadratic growth | The loss grows quadratically with the magnitude of the error, meaning doubling the error quadruples the loss. |
In statistical estimation theory, MSE admits a well-known decomposition into bias and variance components. For an estimator ŷ of a parameter θ:
MSE(ŷ) = Bias(ŷ)² + Var(ŷ)
The bias term measures the systematic deviation of the estimator's expected value from the true parameter, while the variance term measures how much the estimator fluctuates across different samples drawn from the same distribution. This decomposition is central to the bias-variance tradeoff: a model with high bias tends to underfit, while a model with high variance tends to overfit.
When irreducible noise (also called Bayes error) is present in the data, the full decomposition becomes:
Expected MSE = Bias² + Variance + Irreducible Error
The irreducible error represents noise inherent in the data that no model can eliminate. Understanding this decomposition helps practitioners diagnose whether a model's poor MSE stems from systematic errors (high bias), instability (high variance), or noisy data.
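A small simulation can make the decomposition concrete. The sketch below (a deliberately biased, made-up estimator of a known parameter) checks empirically that MSE equals squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n_trials, n_samples = 2.0, 100_000, 10

# Many independent datasets, each of size n_samples
samples = rng.normal(loc=theta, scale=1.0, size=(n_trials, n_samples))
estimates = 0.8 * samples.mean(axis=1)     # a deliberately biased estimator

bias_sq = (estimates.mean() - theta) ** 2
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

print(mse, bias_sq + variance)             # the two values agree
```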
L2 loss has a deep connection to probability theory through maximum likelihood estimation (MLE). If we assume the target variable follows a Gaussian (normal) distribution centered on the model's prediction, with constant variance σ²:
yᵢ = f(xᵢ) + εᵢ, where εᵢ ~ N(0, σ²)
Then the negative log-likelihood of the observed data is:
-log L = (n/2) log(2πσ²) + (1/(2σ²)) ∑(yᵢ - f(xᵢ))²
Since the first term is a constant, minimizing the negative log-likelihood is equivalent to minimizing the sum of squared errors. This means that using L2 loss implicitly assumes that prediction errors are normally distributed. When this assumption holds, the L2 loss estimator is the most efficient unbiased estimator (it achieves the Cramer-Rao lower bound). When errors are not normally distributed (for example, when data contains heavy-tailed outliers), other loss functions such as L1 loss or Huber loss may be more appropriate.
The quadratic nature of L2 loss makes it highly sensitive to outliers. Consider a dataset where most residuals are around 1, but one outlier has a residual of 50. The outlier contributes 2,500 to the sum of squared errors, while a typical point contributes only 1. This single outlier can dominate the total loss and pull the model's predictions toward it, degrading performance on the majority of the data.
This behavior is a double-edged sword. In settings where large errors are genuinely costly (for example, predicting structural loads in engineering, where even one large miscalculation can cause failure), L2 loss appropriately assigns heavy penalties to big mistakes. In settings where outliers are merely noise or data entry errors, L2 loss can produce misleading models.
Strategies for dealing with outlier sensitivity include switching to a more robust loss function (such as L1 or Huber loss, discussed below), clipping or removing extreme values during preprocessing, and transforming the target variable (for example, predicting log values). The following table compares L2 loss with L1 loss:
| Aspect | L2 loss (squared error) | L1 loss (absolute error) |
|---|---|---|
| Formula | (1/n) ∑(yᵢ - ŷᵢ)² | (1/n) ∑\|yᵢ - ŷᵢ\| |
| Gradient behavior | Gradient proportional to residual; large errors produce large gradients | Constant gradient magnitude (±1); does not scale with error size |
| Outlier sensitivity | High; squaring amplifies large errors | Low; linear penalty on large errors |
| Differentiability | Smooth everywhere | Not differentiable at zero |
| Optimal prediction | Predicts the conditional mean | Predicts the conditional median |
| Noise assumption | Assumes Gaussian noise | Assumes Laplacian noise |
| Closed-form solution | Available for linear regression (normal equation) | Not available; requires iterative methods |
| Convergence speed | Generally faster due to smooth gradient | Can be slower near the optimum due to constant gradient |
| Sparsity | Does not encourage sparse solutions | Can produce sparse coefficients |
In practice, L2 loss is preferred when the data is clean, errors are roughly Gaussian, and the model should avoid any large individual errors. L1 loss is preferred when robustness to outliers is needed or when the conditional median is a more meaningful prediction than the conditional mean.
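The "optimal prediction" row can be verified numerically. The sketch below (made-up data with one outlier) searches over constant predictions and shows that MSE is minimized at the mean while mean absolute error is minimized at the median:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])      # one large outlier
candidates = np.linspace(0, 100, 100_001)       # candidate constant predictions

mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)

print(candidates[mse.argmin()], y.mean())       # 22.0 — pulled toward the outlier
print(candidates[mae.argmin()], np.median(y))   # 3.0 — robust to the outlier
```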
Huber loss is a hybrid that combines the best properties of L2 and L1 loss. It is defined by a threshold parameter δ: for a residual r = y - ŷ, the loss is r²/2 when |r| ≤ δ, and δ(|r| - δ/2) when |r| > δ.
This design gives Huber loss the smooth gradients of L2 loss near zero (enabling efficient convergence) while limiting the influence of outliers. The parameter δ is typically set via cross-validation.
Log-cosh loss uses the logarithm of the hyperbolic cosine of the error: log(cosh(y - ŷ)). For small errors, it approximates (y - ŷ)² / 2 (like L2 loss). For large errors, it approximates |y - ŷ| - log(2) (like L1 loss). Unlike Huber loss, log-cosh is twice differentiable everywhere, which can be advantageous for second-order optimization methods.
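A minimal sketch of both losses (residual values chosen for illustration) shows the quadratic behavior near zero and the roughly linear behavior for large residuals:

```python
import numpy as np

def huber(r, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond it
    return np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def log_cosh(r):
    return np.log(np.cosh(r))

r = np.array([0.1, 1.0, 10.0])
print(0.5 * r**2)    # L2 reference: [0.005, 0.5, 50.0]
print(huber(r))      # Huber:        [0.005, 0.5, 9.5]
print(log_cosh(r))   # log-cosh:     [≈0.005, ≈0.434, ≈9.307], i.e. |r| - log(2) for large r
```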
| Loss function | Outlier robustness | Differentiability | Gradient at zero | Typical use case |
|---|---|---|---|---|
| L2 loss | Low | Smooth everywhere | Zero | Clean data, Gaussian noise |
| L1 loss | High | Not differentiable at 0 | Undefined | Heavy-tailed noise, sparse models |
| Huber loss | Medium-high | Continuous first derivative | Zero | Mixed noise, tunable threshold |
| Log-cosh loss | Medium-high | Twice differentiable | Zero | When second-order smoothness is needed |
RMSE is the square root of MSE:
RMSE = √MSE = √[(1/n) ∑(yᵢ - ŷᵢ)²]
RMSE has the advantage of being expressed in the same units as the target variable, which makes it easier to interpret. For example, if the target is measured in dollars, MSE is in "squared dollars" (a unit with no intuitive meaning), while RMSE is directly in dollars. Minimizing RMSE is equivalent to minimizing MSE, since the square root is a monotonically increasing function.
R² measures the proportion of variance in the target variable explained by the model:
R² = 1 - (MSE / Var(y)) = 1 - [∑(yᵢ - ŷᵢ)² / ∑(yᵢ - ȳ)²]
where ȳ is the mean of the observed values. R² ranges from negative infinity to 1, with 1 indicating a perfect fit. Unlike MSE, R² is dimensionless and scale-invariant, which makes it useful for comparing models across different datasets.
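Both metrics are simple to compute from the residuals; the sketch below (illustrative values) reports MSE, RMSE, and R² side by side:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                                # same units as the target
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)                               # ≈ 0.15, ≈ 0.387, ≈ 0.982
```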
The terms can be confusing because "L2" appears in multiple contexts. The L2 norm (Euclidean norm) of a vector is the square root of the sum of squared elements: ||v||₂ = √(∑ vᵢ²). The L2 loss (squared error loss) is the square of the L2 norm of the residual vector (divided by n for the mean version). L2 regularization adds the squared L2 norm of the weight vector as a penalty term to the loss. These are related but distinct concepts.
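A short sketch (arbitrary residual and weight vectors) makes the distinction explicit:

```python
import numpy as np

residuals = np.array([0.5, -0.5, -0.1, -0.3])   # e = y - ŷ
weights = np.array([1.0, -2.0, 0.5])            # model weights

l2_norm = np.linalg.norm(residuals)      # ||e||₂ = √(∑ eᵢ²)         ≈ 0.775
l2_loss = np.mean(residuals ** 2)        # MSE = ||e||₂² / n          = 0.15
l2_penalty = np.sum(weights ** 2)        # ||w||₂², the regularizer   = 5.25

print(l2_norm, l2_loss, l2_penalty)
```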
Linear regression with L2 loss is the classical "ordinary least squares" (OLS) method, first developed by Adrien-Marie Legendre in 1805 and independently by Carl Friedrich Gauss, who claimed to have used the method since 1795. The normal equation provides a closed-form solution, and the Gauss-Markov theorem guarantees that OLS produces the best linear unbiased estimator (BLUE) under certain conditions (linearity, exogeneity, homoscedasticity, no perfect multicollinearity).
In deep learning, L2 loss is the standard choice for regression output layers. A neural network with a linear output neuron trained using MSE loss learns to predict the conditional mean of the target distribution. For multi-output regression (for example, predicting the x and y coordinates of an object), MSE is applied element-wise across all outputs and averaged.
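A minimal PyTorch sketch of this setup (synthetic data; the architecture and hyperparameters are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 3)                                          # synthetic features
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(256)  # noisy linear targets

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X).squeeze(-1), y)   # learns the conditional mean
    loss.backward()
    optimizer.step()

print(loss.item())   # approaches the noise floor (~0.01)
```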
In object detection models like YOLO and Faster R-CNN, L2 loss (or its variant, Smooth L1 loss) is used to train the bounding box regression head. The model predicts four coordinates (x, y, width, height) for each detected object, and the squared error between predicted and ground-truth coordinates forms the localization loss.
Pixel-wise MSE is commonly used to train autoencoders and variational autoencoders for image reconstruction. The loss measures the average squared difference between each pixel in the reconstructed image and the original. While effective for training, pixel-wise MSE tends to produce blurry outputs because it penalizes all pixel deviations equally, regardless of perceptual importance. For this reason, perceptual loss functions based on feature-space distances are often used alongside MSE in generative models.
L2 loss is widely used in time series prediction tasks, where models forecast future values of a sequence. The squared error penalizes large forecast deviations, which is desirable in applications such as energy demand prediction and financial risk assessment where worst-case accuracy matters.
In reinforcement learning, MSE is commonly used to train value function approximators. The temporal difference (TD) error, which measures the discrepancy between the current value estimate and the bootstrapped target, is often squared to form the loss for updating the value network.
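A sketch of the idea (shapes and networks are placeholders, not a full agent): the bootstrapped target is held fixed and the squared TD error is minimized with respect to the value network's parameters:

```python
import torch
import torch.nn as nn

value_net = nn.Linear(4, 1)        # stand-in for a value function approximator
state = torch.randn(32, 4)
next_state = torch.randn(32, 4)
reward = torch.randn(32)
gamma = 0.99

with torch.no_grad():              # the bootstrapped target is not differentiated through
    target = reward + gamma * value_net(next_state).squeeze(-1)

td_error = target - value_net(state).squeeze(-1)
loss = (td_error ** 2).mean()      # MSE over the batch
loss.backward()
```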
Outside of machine learning, L2 loss appears in signal processing (for example, Wiener filter design), control theory (linear-quadratic regulator), and communication systems (minimum mean squared error estimation). Its mathematical tractability and connection to Gaussian models make it a natural choice in these fields.
PyTorch provides MSE loss through torch.nn.MSELoss and the functional API torch.nn.functional.mse_loss. The class supports three reduction modes: 'mean' (the default), 'sum', and 'none':
```python
import torch
import torch.nn as nn

# Create sample predictions and targets
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

# Default: mean reduction
criterion = nn.MSELoss(reduction='mean')
loss = criterion(predictions, targets)

# Sum reduction
criterion_sum = nn.MSELoss(reduction='sum')
loss_sum = criterion_sum(predictions, targets)

# No reduction (returns per-element loss)
criterion_none = nn.MSELoss(reduction='none')
loss_none = criterion_none(predictions, targets)
```
In TensorFlow, MSE loss is available as both a standalone function and a Keras loss class:
```python
import tensorflow as tf

y_true = tf.constant([3.0, -0.5, 2.0, 7.5])   # sample targets
y_pred = tf.constant([2.5, 0.0, 2.1, 7.8])    # sample predictions

# As a Keras loss class
loss_fn = tf.keras.losses.MeanSquaredError()
loss = loss_fn(y_true, y_pred)

# As a function
loss = tf.keras.losses.mean_squared_error(y_true, y_pred)

# In model compilation ('mse' is a string alias, assuming an existing Keras model)
model.compile(optimizer='adam', loss='mse')
```
A simple MSE implementation from scratch:
```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    # Gradient of MSE with respect to the predictions: -(2/n)(y - ŷ)
    return -2 / len(y_true) * (y_true - y_pred)
```
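As an illustrative usage of these two functions (reusing the NumPy import above; targets and learning rate are made up), the loop below fits a set of predictions to the targets by plain gradient descent:

```python
y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.zeros_like(y_true)

for _ in range(200):
    y_pred -= 0.5 * mse_gradient(y_true, y_pred)   # step against the gradient

print(y_pred, mse_loss(y_true, y_pred))   # predictions converge toward the targets
```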
L2 loss is often combined with regularization terms to prevent overfitting. The two most common combinations are:
Ridge regression adds the squared L2 norm of the weight vector to the MSE loss:
Loss = MSE + λ ||w||²₂ = (1/n) ∑(yᵢ - ŷᵢ)² + λ ∑ wⱼ²
The regularization parameter λ controls the strength of the penalty. L2 regularization shrinks all weights toward zero but does not set any to exactly zero, resulting in dense models. This is also called weight decay in the deep learning literature.
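For the linear case, ridge regression also has a closed-form solution, w = (XᵀX + λI)⁻¹ Xᵀy. The sketch below (synthetic data, λ = 1.0 picked arbitrarily) compares the norms of the ridge and OLS weight vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + 0.1 * rng.normal(size=100)

lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)  # ridge solution
w_ols = np.linalg.solve(X.T @ X, X.T @ y)                               # OLS solution

print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))   # ridge weights have smaller norm
```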
Elastic net combines L1 regularization and L2 regularization with MSE loss:
Loss = MSE + λ₁ ||w||₁ + λ₂ ||w||²₂
This combination provides both the sparsity-inducing property of L1 and the grouping effect of L2, making it useful when features are correlated.
The method of least squares, which directly minimizes L2 loss, is one of the oldest techniques in statistical estimation. Adrien-Marie Legendre published the first clear exposition of the method in 1805, in an appendix to his work on determining cometary orbits. Carl Friedrich Gauss claimed in 1809 that he had been using the method since 1795, sparking a priority dispute that was never fully resolved.
Gauss made a contribution that went beyond Legendre's: he connected the method of least squares to the theory of probability by showing that if measurement errors follow a normal distribution, the least squares estimator is the maximum likelihood estimator. This connection gave the method a solid theoretical foundation and helped explain why it worked so well in practice.
The method gained rapid acceptance in the scientific community for two reasons. First, it was computationally tractable: minimizing squared error led to systems of linear equations that could be solved with pen and paper. Second, the resulting estimators had desirable statistical properties, including unbiasedness and minimum variance among linear estimators (as later formalized by the Gauss-Markov theorem in the early 20th century).
With the rise of machine learning and neural networks, MSE became the default regression loss function, and it remains one of the most commonly used loss functions in both research and production systems.