# Squared Loss

> Source: https://aiwiki.ai/wiki/squared_loss
> Updated: 2026-06-28
> Categories: Machine Learning, Statistics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Squared loss**, also called **quadratic loss**, **L2 loss**, or **squared error loss**, is a [loss function](/wiki/loss_function) that penalizes a prediction by the square of its error: for a true value $y$ and a prediction $\hat{y}$, the squared loss is $(y - \hat{y})^2$. Averaging the squared loss over a dataset gives the [mean squared error](/wiki/mean_squared_error_mse) (MSE), and summing it (without averaging) gives the sum of squared errors, also called the residual sum of squares (RSS). It is one of the oldest and most widely used error measures in statistics and [machine learning](/wiki/machine_learning), serving as the foundation for [least squares regression](/wiki/least_squares_regression) and many optimization algorithms. Because the error is squared, large deviations are penalized quadratically, which makes squared loss especially sensitive to outliers.[7][13] Minimizing squared loss is equivalent to performing [maximum likelihood estimation](/wiki/maximum_likelihood_estimation) under the assumption of additive Gaussian noise, a connection first established by Carl Friedrich Gauss in the early 19th century,[2] and the prediction that minimizes expected squared loss is the conditional mean $E[y \mid x]$.[5]

## Explain like I'm 5 (ELI5)

Imagine you are throwing beanbags at a target on the ground. After each throw, you measure how far your beanbag landed from the bullseye. Now, instead of just adding up all those distances, you multiply each distance by itself (that is what "squaring" means). A beanbag that lands 2 feet away counts as 4, and one that lands 3 feet away counts as 9. Then you take the average of all those squared numbers. This final number tells you how accurate your throws are. Because you squared the distances, throws that are really far off get penalized much more than throws that are only a little off. The goal is to make that final number as small as possible.

## What is squared loss?

Squared loss is the function that maps a prediction error to the square of that error. For a single observation, it is $(y - \hat{y})^2$, and it answers a simple question: how wrong was this prediction, weighted so that being very wrong is much worse than being a little wrong. Google's Machine Learning Crash Course defines the L2 loss of an example as "the squared difference between the predicted values and the actual values," and defines mean squared error as "the average of L2 losses across a set of N examples."[13]

Three closely related quantities are built directly from the per-sample squared loss:

- **MSE** is the squared loss *averaged* over all observations. It is the most common form used as a training objective and evaluation metric.
- **Sum of squared errors (SSE) / residual sum of squares (RSS)** is the squared loss *summed* (not averaged) over all observations. Minimizing RSS is the objective of ordinary least squares.
- **RMSE** is the square root of MSE, which restores the original units of the target.

## Mathematical definition

### Per-sample squared loss

For a single observation with true value $y$ and predicted value $\hat{y}$, the squared loss is defined as:

$$L(y, \hat{y}) = (y - \hat{y})^2$$

This function returns zero when the prediction exactly matches the true value and increases quadratically as the prediction deviates from the target.[4]

### Mean squared error (MSE)

When averaged over a dataset of $n$ observations, the squared loss produces the [mean squared error](/wiki/mean_squared_error_mse):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Some formulations divide by $2n$ instead of $n$ to simplify the derivative during [gradient descent](/wiki/stochastic_gradient_descent_sgd) optimization.[6] This scaling does not change the location of the minimum.

### Residual sum of squares (RSS)

Without averaging, the total squared loss over all data points is called the residual sum of squares (also called the sum of squared errors, SSE):

$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Minimizing the RSS is the objective in ordinary least squares (OLS) regression.[10]

### Root mean squared error (RMSE)

The square root of MSE is known as the root mean squared error:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

RMSE is often preferred for reporting because it is expressed in the same units as the original target variable, making it easier to interpret.
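As a small illustration of how the three aggregate quantities relate, the sketch below computes RSS, MSE, and RMSE from the same residuals. The data values are made up for the example.

```python
import numpy as np

# Illustrative values only.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)     # total squared error (unnormalized)
mse = rss / residuals.size       # average squared error
rmse = np.sqrt(mse)              # same units as the target

print(rss, mse, rmse)
```

Here RSS is about 0.58, MSE about 0.145, and RMSE about 0.38, so RMSE can be read directly as a typical error size in the units of $y$.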

## What are the properties of squared loss?

Squared loss has several mathematical properties that make it convenient for optimization and statistical analysis.

### Non-negativity

Because the error term is squared, the loss is always greater than or equal to zero. It equals zero only when the prediction exactly matches the true value.

### Symmetry

An overestimate of a given magnitude produces the same loss as an underestimate of the same magnitude. For example, predicting 12 when the true value is 10 gives the same loss as predicting 8.

### Smoothness and differentiability

The squared loss function is continuous and differentiable everywhere. This property makes it well suited for gradient-based optimization methods. The gradient of the per-sample squared loss with respect to the prediction is:

$$\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$$

This simple linear gradient means that the magnitude of the update signal is proportional to the size of the error, which helps optimization converge smoothly.
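One way to sanity-check the derivative above is to compare it with a central finite difference. The following is a minimal sketch; the function names and the test point are chosen for illustration.

```python
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def squared_loss_grad(y, y_hat):
    # Analytic derivative with respect to the prediction y_hat.
    return -2.0 * (y - y_hat)

y, y_hat, eps = 10.0, 12.0, 1e-6

# Central finite-difference approximation of dL/dy_hat.
numeric = (squared_loss(y, y_hat + eps) - squared_loss(y, y_hat - eps)) / (2 * eps)
analytic = squared_loss_grad(y, y_hat)

print(analytic, numeric)
```

With a true value of 10 and a prediction of 12, the analytic gradient is 4, and the numerical estimate agrees up to floating-point error.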

### Strict convexity

The squared loss function is strictly [convex](/wiki/convex_function). The second derivative (Hessian) with respect to the prediction is a constant:

$$\frac{\partial^2 L}{\partial \hat{y}^2} = 2$$

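Since this is always positive, the loss is strictly convex in the prediction and has a single global minimum with no other local minima. For [linear regression](/wiki/linear_regression) with squared loss, the objective is also convex in the model parameters, and strictly convex when the design matrix has full column rank; in that case gradient descent with a suitably small step size converges to the unique optimal solution.[6]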

### Sensitivity to outliers

Because errors are squared, large deviations from the true value are penalized disproportionately. An outlier with twice the error magnitude of a typical observation contributes four times as much to the total loss. This property makes squared loss sensitive to outliers and heavy-tailed distributions, which can distort model parameter estimates.[7]

## Why is squared loss sensitive to outliers?

The quadratic penalty is the source of squared loss's defining weakness: a single far-off point can dominate the total loss. If a typical residual is 1 and one outlier has a residual of 10, that outlier alone contributes $10^2 = 100$ to the sum, which is one hundred times the contribution of a typical point. To reduce that one term, the optimizer shifts the fitted model toward the outlier, at the expense of fitting the bulk of the data.

Google's Machine Learning Crash Course states the practical consequence directly: "L2 loss incurs a much higher penalty for an outlier than L1 loss," and "MSE moves the model more toward the outliers, while MAE doesn't."[13] In other words, a model trained with squared loss (MSE) ends up closer to the outliers but further away from most of the other data points, whereas a model trained with [absolute loss](/wiki/l1_loss) (MAE) is further from the outliers but closer to the majority.[13] This is why squared loss is a poor choice when the data contains heavy tails or contamination, and why robust alternatives such as absolute loss and [Huber loss](/wiki/huber_loss) are preferred in those settings.[7]
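The effect can be seen with the simplest possible model, a single constant prediction. Under squared loss the best constant is the mean, and under absolute loss it is the median. The sketch below uses a made-up five-point dataset in which one value is replaced by an outlier.

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
contaminated = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last point is an outlier

# Best constant under squared loss (mean) vs absolute loss (median).
print(clean.mean(), np.median(clean))                # 3.0 3.0
print(contaminated.mean(), np.median(contaminated))  # 22.0 3.0

# Fraction of the total squared loss around the mean that comes from the outlier.
squared_errors = (contaminated - contaminated.mean()) ** 2
outlier_share = squared_errors[-1] / squared_errors.sum()
print(outlier_share)
```

The single outlier drags the squared-loss optimum from 3 to 22 and accounts for roughly 80% of the total squared loss, while the absolute-loss optimum stays at 3.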

## Bias-variance decomposition

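One of the most important theoretical results involving squared loss is the bias-variance decomposition.[4] Suppose $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and let $\hat{f}$ be a model fitted to a randomly drawn training set. The expected squared loss of the prediction $\hat{f}(x)$ at a point $x$ decomposes into three additive components (for an estimator $\hat{\theta}$ of a fixed parameter $\theta$ there is no noise term, and only the first two appear):

$$E\left[(y - \hat{f}(x))^2\right] = \text{Bias}(\hat{f}(x))^2 + \text{Var}(\hat{f}(x)) + \sigma^2$$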

where:

- **Bias** is the systematic error, measuring how far the expected prediction is from the true value
- **Variance** is the variability of the estimator across different training sets
- **$\sigma^2$ (irreducible error)** is the inherent noise in the data that no model can eliminate

This decomposition reveals a fundamental tradeoff in model selection. Simple models tend to have high bias but low variance, while complex models tend to have low bias but high variance.[10] The squared loss framework provides a natural way to quantify this tradeoff because MSE cleanly separates into these interpretable components. This decomposition does not hold as neatly for other loss functions such as [mean absolute error](/wiki/mean_absolute_error_mae).
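The decomposition can be checked numerically for the prediction at a single input. The simulation below is a sketch under assumed settings that are not from any particular source: a sine curve as the true function, Gaussian noise with standard deviation 0.3, and a straight-line fit to 20 training points per trial.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin            # assumed true function
sigma = 0.3           # assumed noise standard deviation
x0 = 1.0              # input at which the decomposition is evaluated
n_train, n_trials, degree = 20, 2000, 1

preds = np.empty(n_trials)
sq_errors = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set and fit a polynomial of the given degree.
    x = rng.uniform(0.0, 3.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coeffs, x0)
    # Compare against a fresh noisy observation at x0.
    y0 = f(x0) + rng.normal(0.0, sigma)
    sq_errors[t] = (y0 - preds[t]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()

# Left: simulated expected squared loss. Right: bias^2 + variance + noise.
print(sq_errors.mean(), bias_sq + variance + sigma ** 2)
```

The two printed numbers should agree up to Monte Carlo error. Raising `degree` would typically shrink the bias term and inflate the variance term.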

## Why does minimizing squared loss assume Gaussian noise?

Squared loss has a deep probabilistic justification. Suppose the true data-generating process follows:

$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

where $f(x)$ is the true function and $\varepsilon$ is additive Gaussian noise with mean zero and variance $\sigma^2$. Under this model, the conditional distribution of $y$ given $x$ is:

$$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$$

For a dataset of $n$ independent observations, the log-likelihood is:

$$\log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - f(x_i))^2$$

Maximizing this log-likelihood with respect to the function $f$ is equivalent to minimizing the sum of squared errors $\sum(y_i - f(x_i))^2$, because the first term is a constant and the factor $\frac{1}{2\sigma^2}$ does not affect the location of the optimum.[5]

This means that when you train a model by minimizing squared loss, you are implicitly assuming that the errors follow a [normal distribution](/wiki/normal_distribution).[9] A second, equally important consequence is that the predictor minimizing expected squared loss is the **conditional mean** $E[y \mid x]$: among all functions of $x$, the conditional expectation is the unique minimizer of $E[(y - f(x))^2]$.[5] By contrast, if the noise distribution is actually Laplacian (heavy-tailed), then minimizing [absolute loss](/wiki/l1_loss) (L1 loss) would be the correct maximum likelihood approach, and its optimal predictor is the conditional median rather than the mean.[6]
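The equivalence between the Gaussian likelihood and squared loss can be illustrated with a grid search over constant predictions. This is a sketch with made-up data and an arbitrarily chosen $\sigma$; the helper name `gaussian_nll` is introduced here for illustration.

```python
import numpy as np

def gaussian_nll(y, pred, sigma):
    # Negative log-likelihood of y under N(pred, sigma^2), summed over samples.
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum((y - pred) ** 2) / (2 * sigma ** 2)

y = np.array([1.0, 2.0, 4.0])
sigma = 0.5
candidates = np.linspace(0.0, 5.0, 501)   # constant predictions to try

nll = np.array([gaussian_nll(y, c, sigma) for c in candidates])
sse = np.array([np.sum((y - c) ** 2) for c in candidates])

best_nll = candidates[np.argmin(nll)]
best_sse = candidates[np.argmin(sse)]
print(best_nll, best_sse, y.mean())
```

Both criteria select the same grid point, which is the one closest to the sample mean. This matches the claim that the negative log-likelihood is the sum of squared errors up to a positive scale and an additive constant.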

## Connection to Bregman divergence

Squared loss is the simplest example of a Bregman divergence.[8] A Bregman divergence is defined for a strictly convex, differentiable function $F$ as:

$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$$

When $F(x) = x^2$ (or more generally $F(x) = \|x\|^2$), the Bregman divergence reduces to the squared Euclidean distance:

$$D_F(p, q) = (p - q)^2$$

This connection is significant for several reasons. First, it places squared loss within a broader family of divergence measures that includes the [KL divergence](/wiki/kl_divergence) and the Itakura-Saito distance. Second, a key theorem states that the mean of a distribution minimizes the expected Bregman divergence, which generalizes the classical result that the arithmetic mean minimizes total squared error.[8] Third, optimization techniques such as mirror descent can be understood as generalizations of gradient descent by replacing the squared Euclidean distance with other Bregman divergences.
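The reduction to squared Euclidean distance can be verified directly from the definition. The sketch below implements a generic Bregman divergence and plugs in $F(x) = \|x\|^2$; the function names and the test vectors are illustrative.

```python
import numpy as np

def bregman(F, grad_F, p, q):
    # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

def sq_norm(x):
    return np.dot(x, x)       # F(x) = ||x||^2

def sq_norm_grad(x):
    return 2.0 * x            # grad F(x) = 2x

p = np.array([1.0, 2.0, 3.0])
q = np.array([0.5, 2.5, 1.0])

divergence = bregman(sq_norm, sq_norm_grad, p, q)
print(divergence, np.sum((p - q) ** 2))   # both equal 4.5
```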

## How is the gradient of squared loss computed?

Squared loss is particularly convenient for gradient-based optimization because its derivatives have simple closed-form expressions.

### Gradient for a single parameter

For a linear model $\hat{y} = wx + b$, the partial derivatives of the per-sample squared loss are:

$$\frac{\partial L}{\partial w} = -2(y - \hat{y})x$$

$$\frac{\partial L}{\partial b} = -2(y - \hat{y})$$

### Gradient for multiple parameters (vector form)

For a multivariate linear model $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$, the gradient of the MSE with respect to the parameter vector is:

$$\nabla_{\boldsymbol{\theta}} \text{MSE} = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$$

Setting this gradient to zero yields the normal equations, which provide the closed-form OLS solution:

$$\boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

This closed-form solution is one of the major computational advantages of squared loss.[5] For problems where $\mathbf{X}^T\mathbf{X}$ is too large to invert, iterative methods such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) are used instead.
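The two routes to the OLS solution can be compared on synthetic data. The following is a minimal sketch; the data, the "true" parameters, the learning rate, and the iteration count are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
theta_true = np.array([1.0, -2.0, 0.5])                     # assumed ground truth
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Closed form: solve the normal equations X^T X theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the MSE using the gradient (2/n) X^T (X theta - y).
theta_gd = np.zeros(3)
learning_rate = 0.1
for _ in range(1000):
    grad = (2.0 / n) * X.T @ (X @ theta_gd - y)
    theta_gd -= learning_rate * grad

print(theta_closed)
print(theta_gd)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is generally the more numerically stable choice. On this well-conditioned problem both approaches arrive at essentially the same parameters.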

## How does squared loss compare with other loss functions?

The following table compares squared loss with other common loss functions used in regression and classification.

| Loss function | Formula | Optimal estimator | Sensitivity to outliers | Differentiable everywhere | Typical use case |
|---|---|---|---|---|---|
| [Squared loss](/wiki/squared_loss) (L2) | $(y - \hat{y})^2$ | Conditional mean | High | Yes | [Regression](/wiki/regression_model) |
| [Absolute loss](/wiki/l1_loss) (L1) | $\|y - \hat{y}\|$ | Conditional median | Low | No (at zero) | Robust regression |
| [Huber loss](/wiki/huber_loss) | Quadratic for small errors, linear for large | Conditional mean (approx.) | Medium | Yes | Robust [regression](/wiki/regression_model) |
| [Hinge loss](/wiki/hinge_loss) | $\max(0, 1 - y\hat{y})$ | Decision boundary | Low | No (at 1) | [Classification](/wiki/classification_model) (SVM) |
| [Cross-entropy loss](/wiki/cross-entropy) | $-\sum y \log \hat{y}$ | Conditional probability | Low | Yes | [Classification](/wiki/classification_model) |
| [Log loss](/wiki/log_loss) | $-y\log(\hat{y}) - (1-y)\log(1-\hat{y})$ | Conditional probability | Low | Yes | Binary [classification](/wiki/classification_model) |

### Squared loss vs absolute loss: what is the difference?

The primary tradeoff between squared loss and [absolute loss](/wiki/l1_loss) involves sensitivity to outliers versus mathematical convenience. Squared loss penalizes large errors quadratically, so its optimal predictor is the conditional mean, and a single extreme outlier can dominate the total loss. Absolute loss (also called L1 loss or, when averaged, mean absolute error / MAE) penalizes all errors linearly, so its optimal predictor is the conditional median, which is far more robust to outliers. The cost of that robustness is a non-differentiable point at zero error.[7] In practice, squared loss converges faster during optimization because its gradient varies smoothly with the error magnitude, whereas absolute loss has a constant gradient magnitude (positive or negative 1), which can cause instability near the optimum. In short: squared loss predicts the mean and is sensitive to outliers; absolute loss predicts the median and is robust to them.

### Squared loss vs Huber loss: which is more robust?

Huber loss combines the best properties of both squared and absolute loss. For residuals smaller than a threshold $\delta$, it behaves like squared loss, providing smooth gradients and efficient convergence. For residuals larger than $\delta$, it switches to linear growth, reducing the influence of outliers. The parameter $\delta$ controls where this transition occurs: as $\delta \to 0$ Huber loss behaves like a rescaled absolute loss (MAE), and as $\delta \to \infty$ it behaves like squared loss (MSE). Huber loss is differentiable everywhere and is commonly used in robust regression when the data contains occasional outliers but the bulk of observations are well-behaved.[7]
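A minimal sketch of Huber loss is shown below. It assumes the common convention with a factor of one half in the quadratic branch, which the text above does not specify, and the residual values are illustrative.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond it; the two branches
    # meet with matching value and slope at |r| = delta.
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])
print(huber_loss(residuals))   # values: 0.125, 0.5, 9.5
print(residuals ** 2)          # values: 0.25, 1.0, 100.0
```

For the large residual of 10, squared loss contributes 100 while Huber loss contributes only 9.5, which is how the linear branch limits the influence of outliers.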

### Squared loss vs cross-entropy loss

[Cross-entropy loss](/wiki/cross-entropy) is the standard choice for classification tasks, while squared loss is used primarily for regression. When applied to classification, squared loss produces smaller gradients for confident but wrong predictions compared to cross-entropy, which leads to slower learning. Cross-entropy also has a natural probabilistic interpretation as the negative log-likelihood under a Bernoulli or categorical distribution, just as squared loss corresponds to Gaussian noise.[9] Using squared loss for classification can work but generally leads to suboptimal convergence.
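The gradient difference can be made concrete for a binary classifier. The sketch assumes a sigmoid output unit, which the paragraph above does not state explicitly, and compares the gradient with respect to the logit $z$ when the model is confidently wrong.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0     # true label
z = -5.0    # logit: the model is confidently predicting class 0
p = sigmoid(z)

# d/dz of (y - p)^2, using dp/dz = p * (1 - p).
grad_squared = -2.0 * (y - p) * p * (1.0 - p)
# d/dz of -[y log p + (1 - y) log(1 - p)], which simplifies to p - y.
grad_cross_entropy = p - y

print(grad_squared, grad_cross_entropy)
```

The squared-loss gradient is close to zero (about -0.013) because the sigmoid saturates, while the cross-entropy gradient stays near -1, so the cross-entropy update is far larger for the same mistake.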

## What is squared loss used for?

### Linear regression

Squared loss is the defining loss function for ordinary [least squares regression](/wiki/least_squares_regression). In [linear regression](/wiki/linear_regression), the model parameters are chosen to minimize the sum of squared residuals, which yields the best linear unbiased estimator (BLUE) under the Gauss-Markov conditions.[10]

### Ridge and Lasso regression

In [ridge regression](/wiki/l2_regularization), the squared loss is combined with an L2 [regularization](/wiki/regularization) penalty on the model weights: $\text{Loss} = \text{MSE} + \lambda \|\boldsymbol{\theta}\|^2$. This prevents [overfitting](/wiki/overfitting) by shrinking large coefficients. Similarly, [Lasso regression](/wiki/l1_regularization) combines squared loss with an L1 regularization penalty, which encourages sparsity in the learned parameters.
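For the ridge objective written above, setting the gradient to zero gives the linear system $(\mathbf{X}^T\mathbf{X} + n\lambda \mathbf{I})\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$, where the factor $n$ comes from the $1/n$ in the MSE. The sketch below uses synthetic data, omits the intercept for simplicity, and picks an arbitrary value of $\lambda$; all of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ np.array([3.0, -1.0, 0.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=n)

lam = 0.5  # hypothetical regularization strength

# Ordinary least squares for comparison.
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: minimizer of MSE + lam * ||theta||^2.
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Gradient of the ridge objective at the solution; it should be near zero.
ridge_grad = (2.0 / n) * X.T @ (X @ theta_ridge - y) + 2.0 * lam * theta_ridge

print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```

The ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage effect described above.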

### Neural network training

In [deep learning](/wiki/deep_neural_network), squared loss (MSE) is the standard loss function for regression tasks. When training a [neural network](/wiki/neural_network) to predict continuous values (such as house prices, temperature, or stock returns), the network's output is compared to the true value using MSE, and [backpropagation](/wiki/backpropagation) computes the gradient of this loss with respect to all network weights.[9] Most deep learning frameworks (PyTorch, TensorFlow, Keras) provide built-in MSE loss functions.

### Bayesian estimation

In [Bayesian inference](/wiki/bayesian_neural_network), the squared loss has a special status: the Bayes-optimal estimator under squared loss is the posterior mean. That is, among all possible estimators, the one that minimizes expected squared loss is the conditional expectation $E[\theta \mid \text{data}]$.[12] This result follows from the strict convexity of the squared loss function.

### Signal processing

Squared error is widely used in signal processing for filter design, system identification, and signal reconstruction. The Wiener filter, for example, is derived by minimizing the mean squared error between the desired signal and the filter output. Similarly, the minimum mean squared error (MMSE) estimator is a standard tool in communications engineering.

### Time series forecasting

MSE and RMSE are standard evaluation metrics for time series forecasting models. Methods such as ARIMA, exponential smoothing, and recurrent [neural networks](/wiki/neural_network) are often trained and evaluated using squared loss. RMSE is particularly popular in forecasting because it penalizes large errors more heavily, which is desirable in applications where big misses are costly (for example, energy demand prediction or financial risk estimation).

### Recommender systems

Matrix factorization methods used in [recommendation systems](/wiki/recommender_system) typically minimize the squared error between predicted and actual user ratings. The Netflix Prize competition (2006 to 2009), which advanced the state of the art in collaborative filtering, used RMSE as its primary evaluation metric.

## When was the method of squared errors invented?

The method of minimizing squared errors has a history spanning more than two centuries.

Adrien-Marie Legendre published the first formal description of the method of least squares in 1805, in an appendix to his book on determining the orbits of comets (*Nouvelles méthodes pour la détermination des orbites des comètes*). The nine-page appendix, titled "Sur la méthode des moindres quarrés," presented the idea of minimizing the sum of squared residuals as a principled way to fit a model to observational data.[1]

Carl Friedrich Gauss later claimed in 1809 that he had been using the method since 1795, leading to one of the most well-known priority disputes in the history of statistics.[3] Regardless of who used it first, Gauss made a contribution that went beyond Legendre: he connected the method of least squares to the theory of probability by showing that it arises naturally from the assumption of normally distributed errors.[2] This connection between squared loss and the Gaussian distribution remains one of the cornerstones of modern statistical theory.

Within a decade of Legendre's publication, the method of least squares had become a standard tool in astronomy and geodesy across Europe. Pierre-Simon Laplace further developed the probabilistic foundations of the method and connected it to the central limit theorem. In the 20th century, Abraham Wald formalized the squared error loss within the framework of statistical decision theory,[11] and the bias-variance decomposition emerged as a central concept in both frequentist statistics and machine learning.

## When should you avoid squared loss?

While squared loss is mathematically convenient, it is not always the best choice.

- **Outlier-prone data.** When the dataset contains significant outliers or the error distribution has heavy tails, squared loss can produce misleading estimates. In these cases, [absolute loss](/wiki/l1_loss), Huber loss, or other robust loss functions are preferred.[7]
- **Classification tasks.** Using squared loss for classification problems (predicting discrete class labels) generally performs worse than [cross-entropy](/wiki/cross-entropy) because the gradients become very small for confident but incorrect predictions, slowing down learning.
- **Heteroscedastic data.** When the variance of the noise changes across the input space (heteroscedasticity), unweighted squared loss gives equal importance to all regions. Weighted least squares or heteroscedasticity-robust methods should be used instead.
- **Non-Gaussian noise.** If the error distribution is known to be non-Gaussian (for example, Laplacian or Cauchy), the maximum likelihood justification for squared loss breaks down, and a loss function matched to the actual noise distribution will yield better estimates.
- **Scale sensitivity.** MSE values depend on the scale of the target variable, making it difficult to compare across different datasets or problems without normalization.

## Implementation examples

Squared loss is straightforward to implement in all major programming languages and frameworks.

### Python (NumPy)

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])
print(mse_loss(y_true, y_pred))  # approximately 0.145
```

### PyTorch

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()  # default reduction="mean" averages over all elements
y_true = torch.tensor([3.0, 5.0, 2.5, 7.0])
y_pred = torch.tensor([2.5, 5.2, 2.0, 6.8])
loss = loss_fn(y_pred, y_true)  # arguments are (input, target)
print(loss.item())  # approximately 0.145
```

### scikit-learn

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.2, 2.0, 6.8]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # approximately 0.145
```

## Summary of key formulas

| Quantity | Formula | Description |
|---|---|---|
| Per-sample [squared loss](/wiki/squared_loss) | $(y - \hat{y})^2$ | Loss for a single observation |
| [Mean squared error](/wiki/mean_squared_error_mse) (MSE) | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Average loss over the dataset |
| Residual sum of squares (RSS) | $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Total loss (unnormalized) |
| Root mean squared error (RMSE) | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | MSE in original units |
| [Gradient](/wiki/gradient) of MSE | $\frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$ | Direction of steepest ascent; gradient descent steps along its negative |
| OLS solution | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Closed-form optimal parameters |
| [Bias-variance decomposition](/wiki/bias_variance_tradeoff) | $\text{Bias}^2 + \text{Var} + \sigma^2$ | Decomposition of expected squared prediction error |

## See also

- [Loss function](/wiki/loss_function)
- [Mean squared error](/wiki/mean_squared_error_mse)
- [Absolute loss (L1 loss)](/wiki/l1_loss)
- [Huber loss](/wiki/huber_loss)
- [Least squares regression](/wiki/least_squares_regression)
- [Linear regression](/wiki/linear_regression)
- [Gradient descent](/wiki/gradient_descent)

## References

1. Legendre, A.-M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Firmin Didot, Paris.
2. Gauss, C. F. (1809). *Theoria motus corporum coelestium*. Perthes et Besser, Hamburg.
3. Stigler, S. M. (1981). "Gauss and the Invention of Least Squares." *Annals of Statistics*, 9(3), 465-474.
4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapter 2 and Chapter 7.
5. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 1.2.5 and Section 3.1.
6. Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press. Chapter 7.
7. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
8. Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). "Clustering with Bregman Divergences." *Journal of Machine Learning Research*, 6, 1705-1749.
9. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5.
10. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An Introduction to Statistical Learning*. Springer. Chapter 2 and Chapter 3.
11. Wald, A. (1950). *Statistical Decision Functions*. John Wiley & Sons.
12. Berger, J. O. (1985). *Statistical Decision Theory and Bayesian Analysis*, 2nd ed. Springer. Chapter 1.
13. Google. "Linear regression: Loss." *Machine Learning Crash Course*, Google for Developers. https://developers.google.com/machine-learning/crash-course/linear-regression/loss