Squared loss, also called quadratic loss, L2 loss, or squared error loss, is a loss function that measures the discrepancy between a predicted value and a true value by squaring their difference. It is one of the oldest and most widely used error measures in statistics and machine learning, serving as the foundation for least squares regression, mean squared error (MSE), and many optimization algorithms. Minimizing squared loss is equivalent to performing maximum likelihood estimation under the assumption of additive Gaussian noise, a connection first established by Carl Friedrich Gauss in the early 19th century.
Imagine you are throwing beanbags at a target on the ground. After each throw, you measure how far your beanbag landed from the bullseye. Now, instead of just adding up all those distances, you multiply each distance by itself (that is what "squaring" means). A beanbag that lands 2 feet away counts as 4, and one that lands 3 feet away counts as 9. Then you take the average of all those squared numbers. This final number tells you how accurate your throws are. Because you squared the distances, throws that are really far off get penalized much more than throws that are only a little off. The goal is to make that final number as small as possible.
For a single observation with true value $y$ and predicted value $\hat{y}$, the squared loss is defined as:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
This function returns zero when the prediction exactly matches the true value and increases quadratically as the prediction deviates from the target.
When averaged over a dataset of $n$ observations, the squared loss produces the mean squared error:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Some formulations divide by $2n$ instead of $n$ to simplify the derivative during gradient descent optimization. This scaling does not change the location of the minimum.
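With the $\frac{1}{2n}$ convention, the factor of 2 produced by differentiating the square cancels, leaving a slightly cleaner gradient:

$$\frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{2n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2 \right] = -\frac{1}{n}(y_i - \hat{y}_i)$$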
Without averaging, the total squared loss over all data points is called the residual sum of squares:
$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Minimizing the RSS is the objective in ordinary least squares (OLS) regression.
The square root of MSE is known as the root mean squared error:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
RMSE is often preferred for reporting because it is expressed in the same units as the original target variable, making it easier to interpret.
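These three quantities differ only by normalization and a square root, as a small NumPy sketch (with arbitrary example values) makes explicit:

```python
import numpy as np

# Arbitrary example values, chosen only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])

rss = np.sum((y_true - y_pred) ** 2)   # residual sum of squares: 0.58
mse = rss / len(y_true)                # mean squared error: 0.145
rmse = np.sqrt(mse)                    # root mean squared error, same units as y: ~0.381
print(rss, mse, rmse)
```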
Squared loss has several mathematical properties that make it convenient for optimization and statistical analysis.
Because the error term is squared, the loss is always greater than or equal to zero. It equals zero only when the prediction exactly matches the true value.
An overestimate of a given magnitude produces the same loss as an underestimate of the same magnitude. For example, predicting 12 when the true value is 10 gives the same loss as predicting 8.
The squared loss function is continuous and differentiable everywhere. This property makes it well suited for gradient-based optimization methods. The gradient of the per-sample squared loss with respect to the prediction is:
$$\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$$
This simple linear gradient means that the magnitude of the update signal is proportional to the size of the error, which helps optimization converge smoothly.
The squared loss is strictly convex in the prediction. Its second derivative with respect to the prediction is a constant:
$$\frac{\partial^2 L}{\partial \hat{y}^2} = 2$$
Since this is always positive, the loss as a function of the prediction has a single global minimum and no other stationary points. For linear regression with squared loss, the objective in the parameters is convex (and strictly convex whenever the design matrix has full column rank), so gradient descent with a suitably small step size converges to a global optimum.
Because errors are squared, large deviations from the true value are penalized disproportionately. An outlier with twice the error magnitude of a typical observation contributes four times as much to the total loss. This property makes squared loss sensitive to outliers and heavy-tailed distributions, which can distort model parameter estimates.
One of the most important theoretical results involving squared loss is the bias-variance decomposition. For a model $\hat{f}(x)$ trained to predict a noisy target $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, the expected squared error at a point $x$ (averaged over the noise and over the random draw of the training set) decomposes into three additive components:
$$E\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big(\hat{f}(x)\big)^2 + \text{Var}\big(\hat{f}(x)\big) + \sigma^2$$
where $\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$ is the model's systematic error, $\text{Var}(\hat{f}(x)) = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]$ is the variability of its prediction across training sets, and $\sigma^2$ is the irreducible noise that no model can remove.
This decomposition reveals a fundamental tradeoff in model selection. Simple models tend to have high bias but low variance, while complex models tend to have low bias but high variance. The squared loss framework provides a natural way to quantify this tradeoff because MSE cleanly separates into these interpretable components. This decomposition does not hold as neatly for other loss functions such as mean absolute error.
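A rough simulation can make the tradeoff concrete. The sketch below uses a hypothetical setup (a sine target with Gaussian noise, polynomial fits of degree 1 and degree 9, and a single test point) and estimates the bias and variance of each model's prediction across many resampled training sets:

```python
import numpy as np

# Hypothetical setup: y = sin(2*pi*x) + Gaussian noise, fit by polynomials
# of degree 1 (high bias, low variance) and degree 9 (low bias, high variance).
rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x0, sigma, n_train, n_trials = 0.3, 0.3, 30, 2000
preds = {1: [], 9: []}                      # polynomial degree -> predictions at x0

for _ in range(n_trials):
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)   # least squares polynomial fit
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - true_f(x0)) ** 2
    var = p.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {var:.4f}, "
          f"total = {bias_sq + var + sigma**2:.4f}")
```

The low-degree fit typically shows large bias and small variance, the high-degree fit the reverse, while the $\sigma^2$ term is the same for both.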
Squared loss has a deep probabilistic justification. Suppose the true data-generating process follows:
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
where $f(x)$ is the true function and $\varepsilon$ is additive Gaussian noise with mean zero and variance $\sigma^2$. Under this model, the conditional distribution of $y$ given $x$ is:
$$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$$
For a dataset of $n$ independent observations, the log-likelihood is:
$$\log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - f(x_i))^2$$
Maximizing this log-likelihood with respect to the function $f$ is equivalent to minimizing the sum of squared errors $\sum(y_i - f(x_i))^2$, because the first term is a constant and the factor $\frac{1}{2\sigma^2}$ does not affect the location of the optimum.
This means that when you train a model by minimizing squared loss, you are implicitly assuming that the errors follow a normal distribution. If the noise distribution is actually Laplacian (heavy-tailed), then minimizing absolute loss (L1 loss) would be the correct maximum likelihood approach instead.
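A quick simulation illustrates the point. The sketch below uses hypothetical settings (location 5.0, unit-scale Laplace noise, 50 observations per trial) and compares the sample mean, which minimizes squared loss, with the sample median, which minimizes absolute loss:

```python
import numpy as np

# Hypothetical setup: estimate a location parameter under Laplace (heavy-tailed) noise.
rng = np.random.default_rng(42)
true_mu, n, n_trials = 5.0, 50, 5000

means, medians = [], []
for _ in range(n_trials):
    sample = true_mu + rng.laplace(loc=0.0, scale=1.0, size=n)
    means.append(sample.mean())        # minimizer of squared loss
    medians.append(np.median(sample))  # minimizer of absolute loss

print("MSE of sample mean:  ", np.mean((np.array(means) - true_mu) ** 2))
print("MSE of sample median:", np.mean((np.array(medians) - true_mu) ** 2))
```

Across many trials the median typically attains the lower estimation error here, consistent with its role as the maximum likelihood estimator under Laplacian noise.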
Squared loss is the simplest example of a Bregman divergence. A Bregman divergence is defined for a strictly convex, differentiable function $F$ as:
$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$$
When $F(x) = x^2$ (or, for vector arguments, $F(\mathbf{x}) = \|\mathbf{x}\|^2$), the Bregman divergence reduces to the squared Euclidean distance:
$$D_F(p, q) = (p - q)^2$$
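Substituting $F(x) = x^2$, so that $\nabla F(q) = 2q$, into the definition shows the reduction directly:

$$D_F(p, q) = p^2 - q^2 - 2q(p - q) = p^2 - 2pq + q^2 = (p - q)^2$$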
This connection is significant for several reasons. First, it places squared loss within a broader family of divergence measures that includes the KL divergence and the Itakura-Saito distance. Second, a key theorem states that the mean of a distribution minimizes the expected Bregman divergence, which generalizes the classical result that the arithmetic mean minimizes total squared error. Third, optimization techniques such as mirror descent can be understood as generalizations of gradient descent by replacing the squared Euclidean distance with other Bregman divergences.
Squared loss is particularly convenient for gradient-based optimization because its derivatives have simple closed-form expressions.
For a linear model $\hat{y} = wx + b$, the partial derivatives of the per-sample squared loss are:
$$\frac{\partial L}{\partial w} = -2(y - \hat{y})x$$
$$\frac{\partial L}{\partial b} = -2(y - \hat{y})$$
For a multivariate linear model $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$, the gradient of the MSE with respect to the parameter vector is:
$$\nabla_{\boldsymbol{\theta}} \text{MSE} = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$$
Setting this gradient to zero yields the normal equations, which provide the closed-form OLS solution:
$$\boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
This closed-form solution is one of the major computational advantages of squared loss. For problems where $\mathbf{X}^T\mathbf{X}$ is too large to invert, iterative methods such as stochastic gradient descent are used instead.
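The sketch below (synthetic data with hypothetical dimensions and learning rate) illustrates both routes: solving the normal equations directly and running batch gradient descent with the MSE gradient above. The two estimates should agree to several decimal places:

```python
import numpy as np

# Synthetic regression problem: 200 samples, 3 features plus an intercept column
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS via the normal equations
# (np.linalg.lstsq is the numerically safer choice in practice)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent using grad MSE = (2/n) X^T (X theta - y)
theta_gd = np.zeros(d + 1)
lr = 0.1
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print("normal equations:", np.round(theta_ols, 3))
print("gradient descent:", np.round(theta_gd, 3))
```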
The following table compares squared loss with other common loss functions used in regression and classification.
| Loss function | Formula | Optimal estimator | Sensitivity to outliers | Differentiable everywhere | Typical use case |
|---|---|---|---|---|---|
| Squared loss (L2) | $(y - \hat{y})^2$ | Conditional mean | High | Yes | Regression |
| Absolute loss (L1) | $\lvert y - \hat{y} \rvert$ | Conditional median | Low | No (at zero) | Robust regression |
| Huber loss | Quadratic for small errors, linear for large | Conditional mean (approx.) | Medium | Yes | Robust regression |
| Hinge loss | $\max(0, 1 - y\hat{y})$ | Decision boundary | Low | No (at 1) | Classification (SVM) |
| Cross-entropy loss | $-\sum y \log \hat{y}$ | Conditional probability | Low | Yes | Classification |
| Log loss | $-y\log(\hat{y}) - (1-y)\log(1-\hat{y})$ | Conditional probability | Low | Yes | Binary classification |
The primary tradeoff between squared loss and absolute loss involves sensitivity to outliers versus mathematical convenience. Squared loss penalizes large errors quadratically, which means a single extreme outlier can dominate the total loss. Absolute loss penalizes all errors linearly, making it more robust to outliers but introducing a non-differentiable point at zero error. In practice, squared loss converges faster during optimization because its gradient varies smoothly with the error magnitude, whereas absolute loss has a constant gradient magnitude (positive or negative 1), which can cause instability near the optimum.
Huber loss combines the best properties of both squared and absolute loss. For residuals smaller than a threshold $\delta$, it behaves like squared loss, providing smooth gradients and efficient convergence. For residuals larger than $\delta$, it switches to linear growth, reducing the influence of outliers. The parameter $\delta$ controls where this transition occurs. Huber loss is differentiable everywhere and is commonly used in robust regression when the data contains occasional outliers but the bulk of observations are well-behaved.
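A minimal NumPy sketch of the standard Huber definition (with threshold `delta` and arbitrary example values, the last of which is a deliberate outlier) shows how the outlier dominates MSE but enters the Huber loss only linearly:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for |residual| <= delta, linear beyond it
    r = np.asarray(y_true) - np.asarray(y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0, 50.0])   # last value is an outlier
y_pred = np.array([2.5, 5.2, 2.0, 6.8, 7.0])
print("MSE:  ", np.mean((y_true - y_pred) ** 2))  # dominated by the single outlier
print("Huber:", huber_loss(y_true, y_pred))       # outlier contributes only linearly
```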
Cross-entropy loss is the standard choice for classification tasks, while squared loss is used primarily for regression. When applied to classification, squared loss produces smaller gradients for confident but wrong predictions compared to cross-entropy, which leads to slower learning. Cross-entropy also has a natural probabilistic interpretation as the negative log-likelihood under a Bernoulli or categorical distribution, just as squared loss corresponds to Gaussian noise. Using squared loss for classification can work but generally leads to suboptimal convergence.
Squared loss is the defining loss function for ordinary least squares regression. In linear regression, the model parameters are chosen to minimize the sum of squared residuals, which yields the best linear unbiased estimator (BLUE) under the Gauss-Markov conditions.
In ridge regression, the squared loss is combined with an L2 regularization penalty on the model weights: $\text{Loss} = \text{MSE} + \lambda \lVert \boldsymbol{\theta} \rVert_2^2$. This prevents overfitting by shrinking large coefficients. Similarly, Lasso regression combines squared loss with an L1 regularization penalty, which encourages sparsity in the learned parameters.
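In scikit-learn, for instance, ridge and lasso are drop-in alternatives to ordinary least squares. The sketch below (synthetic data and arbitrary regularization strengths, with `alpha` playing the role of $\lambda$) typically shows lasso setting some coefficients exactly to zero while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: only the 1st and 4th features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```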
In deep learning, squared loss (MSE) is the standard loss function for regression tasks. When training a neural network to predict continuous values (such as house prices, temperature, or stock returns), the network's output is compared to the true value using MSE, and backpropagation computes the gradient of this loss with respect to all network weights. Most deep learning frameworks (PyTorch, TensorFlow, Keras) provide built-in MSE loss functions.
In Bayesian inference, the squared loss has a special status: the Bayes-optimal estimator under squared loss is the posterior mean. That is, among all possible estimators, the one that minimizes expected squared loss is the conditional expectation $E[\theta \mid \text{data}]$. This result follows from the strict convexity of the squared loss function.
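Concretely, for any candidate estimate $a$, differentiating the posterior expected loss and setting the derivative to zero (strict convexity guarantees this stationary point is the unique minimum) gives:

$$\frac{d}{da} \, E\big[(\theta - a)^2 \mid \text{data}\big] = -2\big(E[\theta \mid \text{data}] - a\big) = 0 \quad \Longrightarrow \quad a^* = E[\theta \mid \text{data}]$$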
Squared error is widely used in signal processing for filter design, system identification, and signal reconstruction. The Wiener filter, for example, is derived by minimizing the mean squared error between the desired signal and the filter output. Similarly, the minimum mean squared error (MMSE) estimator is a standard tool in communications engineering.
MSE and RMSE are standard evaluation metrics for time series forecasting models. Methods such as ARIMA, exponential smoothing, and recurrent neural networks are often trained and evaluated using squared loss. RMSE is particularly popular in forecasting because it penalizes large errors more heavily, which is desirable in applications where big misses are costly (for example, energy demand prediction or financial risk estimation).
Matrix factorization methods used in recommendation systems typically minimize the squared error between predicted and actual user ratings. The Netflix Prize competition (2006 to 2009), which advanced the state of the art in collaborative filtering, used RMSE as its primary evaluation metric.
The method of minimizing squared errors has a history spanning more than two centuries.
Adrien-Marie Legendre published the first formal description of the method of least squares in 1805, in an appendix to his book on determining the orbits of comets (Nouvelles méthodes pour la détermination des orbites des comètes). The nine-page appendix, titled "Sur la méthode des moindres quarrés," presented the idea of minimizing the sum of squared residuals as a principled way to fit a model to observational data.
Carl Friedrich Gauss later claimed in 1809 that he had been using the method since 1795, leading to one of the most well-known priority disputes in the history of statistics. Regardless of who used it first, Gauss made a contribution that went beyond Legendre: he connected the method of least squares to the theory of probability by showing that it arises naturally from the assumption of normally distributed errors. This connection between squared loss and the Gaussian distribution remains one of the cornerstones of modern statistical theory.
Within a decade of Legendre's publication, the method of least squares had become a standard tool in astronomy and geodesy across Europe. Pierre-Simon Laplace further developed the probabilistic foundations of the method and connected it to the central limit theorem. In the 20th century, Abraham Wald formalized the squared error loss within the framework of statistical decision theory, and the bias-variance decomposition emerged as a central concept in both frequentist statistics and machine learning.
While squared loss is mathematically convenient, it is not always the best choice. As discussed above, it is sensitive to outliers, implicitly assumes Gaussian noise, and targets the conditional mean; for heavy-tailed data, or when the conditional median is the quantity of interest, absolute or Huber loss is often the better option.
Squared loss is straightforward to implement in all major programming languages and frameworks.
```python
# NumPy: compute MSE directly from the definition
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])
print(mse_loss(y_true, y_pred))  # ≈ 0.145
```
```python
# PyTorch: built-in MSE loss, averaged over elements by default
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
y_true = torch.tensor([3.0, 5.0, 2.5, 7.0])
y_pred = torch.tensor([2.5, 5.2, 2.0, 6.8])
loss = loss_fn(y_pred, y_true)
print(loss.item())  # ≈ 0.145
```
```python
# scikit-learn: metric function for model evaluation
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.2, 2.0, 6.8]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # ≈ 0.145
```
| Quantity | Formula | Description |
|---|---|---|
| Per-sample squared loss | $(y - \hat{y})^2$ | Loss for a single observation |
| Mean squared error (MSE) | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Average loss over the dataset |
| Residual sum of squares (RSS) | $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Total loss (unnormalized) |
| Root mean squared error (RMSE) | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | MSE in original units |
| Gradient of MSE | $\frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$ | Direction of steepest ascent |
| OLS solution | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Closed-form optimal parameters |
| Bias-variance decomposition | $\text{Bias}^2 + \text{Var} + \sigma^2$ | MSE decomposition |