Squared loss, also called quadratic loss, L2 loss, or squared error loss, is a loss function that measures the discrepancy between a predicted value and a true value by squaring their difference. It is one of the oldest and most widely used error measures in statistics and machine learning, serving as the foundation for least squares regression, mean squared error (MSE), and many optimization algorithms. Minimizing squared loss is equivalent to performing maximum likelihood estimation under the assumption of additive Gaussian noise, a connection first established by Carl Friedrich Gauss in the early 19th century.
Imagine you are throwing beanbags at a target on the ground. After each throw, you measure how far your beanbag landed from the bullseye. Now, instead of just adding up all those distances, you multiply each distance by itself (that is what "squaring" means). A beanbag that lands 2 feet away counts as 4, and one that lands 3 feet away counts as 9. Then you take the average of all those squared numbers. This final number tells you how accurate your throws are. Because you squared the distances, throws that are really far off get penalized much more than throws that are only a little off. The goal is to make that final number as small as possible.
For a single observation with true value $y$ and predicted value $\hat{y}$, the squared loss is defined as:
$$L(y, \hat{y}) = (y - \hat{y})^2$$
This function returns zero when the prediction exactly matches the true value and increases quadratically as the prediction deviates from the target.
When averaged over a dataset of $n$ observations, the squared loss produces the mean squared error:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Some formulations divide by $2n$ instead of $n$ to simplify the derivative during gradient descent optimization. This scaling does not change the location of the minimum.
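With the $\frac{1}{2n}$ convention, the factor of 2 produced by differentiating the square cancels, leaving a slightly cleaner gradient:

$$\frac{\partial}{\partial \hat{y}_i} \left[ \frac{1}{2n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2 \right] = -\frac{1}{n}(y_i - \hat{y}_i)$$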
Without averaging, the total squared loss over all data points is called the residual sum of squares:
$$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Minimizing the RSS is the objective in ordinary least squares (OLS) regression.
The square root of MSE is known as the root mean squared error:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
RMSE is often preferred for reporting because it is expressed in the same units as the original target variable, making it easier to interpret.
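These three quantities differ only by normalization and a square root, as a small NumPy sketch (with arbitrary example values) makes explicit:

```python
import numpy as np

# Arbitrary example values, chosen only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])

rss = np.sum((y_true - y_pred) ** 2)   # residual sum of squares: 0.58
mse = rss / len(y_true)                # mean squared error: 0.145
rmse = np.sqrt(mse)                    # root mean squared error, same units as y: ~0.381
print(rss, mse, rmse)
```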
Squared loss has several mathematical properties that make it convenient for optimization and statistical analysis.
Because the error term is squared, the loss is always greater than or equal to zero. It equals zero only when the prediction exactly matches the true value.
An overestimate of a given magnitude produces the same loss as an underestimate of the same magnitude. For example, predicting 12 when the true value is 10 gives the same loss as predicting 8.
The squared loss function is continuous and differentiable everywhere. This property makes it well suited for gradient-based optimization methods. The gradient of the per-sample squared loss with respect to the prediction is:
$$\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$$
This simple linear gradient means that the magnitude of the update signal is proportional to the size of the error, which helps optimization converge smoothly.
The squared loss is strictly convex in the prediction. Its second derivative with respect to the prediction is a constant:
$$\frac{\partial^2 L}{\partial \hat{y}^2} = 2$$
Since this is always positive, the loss as a function of the prediction has a single global minimum and no other stationary points. For linear regression with squared loss, the objective in the parameters is convex (and strictly convex whenever the design matrix has full column rank), so gradient descent with a suitably small step size converges to a global optimum.
Because errors are squared, large deviations from the true value are penalized disproportionately. An outlier with twice the error magnitude of a typical observation contributes four times as much to the total loss. This property makes squared loss sensitive to outliers and heavy-tailed distributions, which can distort model parameter estimates.
One of the most important theoretical results involving squared loss is the bias-variance decomposition. For a model $\hat{f}(x)$ trained to predict a noisy target $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, the expected squared error at a point $x$ (averaged over the noise and over the random draw of the training set) decomposes into three additive components:
$$E\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big(\hat{f}(x)\big)^2 + \text{Var}\big(\hat{f}(x)\big) + \sigma^2$$
where $\text{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x)$ is the model's systematic error, $\text{Var}(\hat{f}(x)) = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]$ is the variability of its prediction across training sets, and $\sigma^2$ is the irreducible noise that no model can remove.
This decomposition reveals a fundamental tradeoff in model selection. Simple models tend to have high bias but low variance, while complex models tend to have low bias but high variance. The squared loss framework provides a natural way to quantify this tradeoff because MSE cleanly separates into these interpretable components. This decomposition does not hold as neatly for other loss functions such as mean absolute error.
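A rough simulation can make the tradeoff concrete. The sketch below uses a hypothetical setup (a sine target with Gaussian noise, polynomial fits of degree 1 and degree 9, and a single test point) and estimates the bias and variance of each model's prediction across many resampled training sets:

```python
import numpy as np

# Hypothetical setup: y = sin(2*pi*x) + Gaussian noise, fit by polynomials
# of degree 1 (high bias, low variance) and degree 9 (low bias, high variance).
rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

x0, sigma, n_train, n_trials = 0.3, 0.3, 30, 2000
preds = {1: [], 9: []}                      # polynomial degree -> predictions at x0

for _ in range(n_trials):
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)   # least squares polynomial fit
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - true_f(x0)) ** 2
    var = p.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {var:.4f}, "
          f"total = {bias_sq + var + sigma**2:.4f}")
```

The low-degree fit typically shows large bias and small variance, the high-degree fit the reverse, while the $\sigma^2$ term is the same for both.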
Squared loss has a deep probabilistic justification. Suppose the true data-generating process follows:
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
where $f(x)$ is the true function and $\varepsilon$ is additive Gaussian noise with mean zero and variance $\sigma^2$. Under this model, the conditional distribution of $y$ given $x$ is:
$$p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f(x))^2}{2\sigma^2}\right)$$
For a dataset of $n$ independent observations, the log-likelihood is:
$$\log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - f(x_i))^2$$
Maximizing this log-likelihood with respect to the function $f$ is equivalent to minimizing the sum of squared errors $\sum(y_i - f(x_i))^2$, because the first term is a constant and the factor $\frac{1}{2\sigma^2}$ does not affect the location of the optimum.
This means that when you train a model by minimizing squared loss, you are implicitly assuming that the errors follow a normal distribution. If the noise distribution is actually Laplacian (heavy-tailed), then minimizing absolute loss (L1 loss) would be the correct maximum likelihood approach instead.
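A quick simulation illustrates the point. The sketch below uses hypothetical settings (location 5.0, unit-scale Laplace noise, 50 observations per trial) and compares the sample mean, which minimizes squared loss, with the sample median, which minimizes absolute loss:

```python
import numpy as np

# Hypothetical setup: estimate a location parameter under Laplace (heavy-tailed) noise.
rng = np.random.default_rng(42)
true_mu, n, n_trials = 5.0, 50, 5000

means, medians = [], []
for _ in range(n_trials):
    sample = true_mu + rng.laplace(loc=0.0, scale=1.0, size=n)
    means.append(sample.mean())        # minimizer of squared loss
    medians.append(np.median(sample))  # minimizer of absolute loss

print("MSE of sample mean:  ", np.mean((np.array(means) - true_mu) ** 2))
print("MSE of sample median:", np.mean((np.array(medians) - true_mu) ** 2))
```

Across many trials the median typically attains the lower estimation error here, consistent with its role as the maximum likelihood estimator under Laplacian noise.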
Squared loss is the simplest example of a Bregman divergence. A Bregman divergence is defined for a strictly convex, differentiable function $F$ as:
$$D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$$
When $F(x) = x^2$ (or, for vector arguments, $F(\mathbf{x}) = \|\mathbf{x}\|^2$), the Bregman divergence reduces to the squared Euclidean distance:
$$D_F(p, q) = (p - q)^2$$
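Substituting $F(x) = x^2$, so that $\nabla F(q) = 2q$, into the definition shows the reduction directly:

$$D_F(p, q) = p^2 - q^2 - 2q(p - q) = p^2 - 2pq + q^2 = (p - q)^2$$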
This connection is significant for several reasons. First, it places squared loss within a broader family of divergence measures that includes the KL divergence and the Itakura-Saito distance. Second, a key theorem states that the mean of a distribution minimizes the expected Bregman divergence, which generalizes the classical result that the arithmetic mean minimizes total squared error. Third, optimization techniques such as mirror descent can be understood as generalizations of gradient descent by replacing the squared Euclidean distance with other Bregman divergences.
Squared loss is particularly convenient for gradient-based optimization because its derivatives have simple closed-form expressions.
For a linear model $\hat{y} = wx + b$, the partial derivatives of the per-sample squared loss are:
$$\frac{\partial L}{\partial w} = -2(y - \hat{y})x$$
$$\frac{\partial L}{\partial b} = -2(y - \hat{y})$$
For a multivariate linear model $\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$, the gradient of the MSE with respect to the parameter vector is:
$$\nabla_{\boldsymbol{\theta}} \text{MSE} = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$$
Setting this gradient to zero yields the normal equations, which provide the closed-form OLS solution:
$$\boldsymbol{\theta}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
This closed-form solution is one of the major computational advantages of squared loss. For problems where $\mathbf{X}^T\mathbf{X}$ is too large to invert, iterative methods such as stochastic gradient descent are used instead.
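The sketch below (synthetic data with hypothetical dimensions and learning rate) illustrates both routes: solving the normal equations directly and running batch gradient descent with the MSE gradient above. The two estimates should agree to several decimal places:

```python
import numpy as np

# Synthetic regression problem: 200 samples, 3 features plus an intercept column
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS via the normal equations
# (np.linalg.lstsq is the numerically safer choice in practice)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent using grad MSE = (2/n) X^T (X theta - y)
theta_gd = np.zeros(d + 1)
lr = 0.1
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print("normal equations:", np.round(theta_ols, 3))
print("gradient descent:", np.round(theta_gd, 3))
```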
The following table compares squared loss with other common loss functions used in regression and classification.
| Loss function | Formula | Optimal estimator | Sensitivity to outliers | Differentiable everywhere | Typical use case |
|---|---|---|---|---|---|
| Squared loss (L2) | $(y - \hat{y})^2$ | Conditional mean | High | Yes | Regression |
| Absolute loss (L1) | $\lvert y - \hat{y} \rvert$ | Conditional median | Low | No (at zero) | Robust regression |
| Huber loss | Quadratic for small errors, linear for large | Conditional mean (approx.) | Medium | Yes | Robust regression |
| Hinge loss | $\max(0, 1 - y\hat{y})$ | Decision boundary | Low | No (at 1) | Classification (SVM) |
| Cross-entropy loss | $-\sum y \log \hat{y}$ | Conditional probability | Low | Yes | Classification |
| Log loss | $-y\log(\hat{y}) - (1-y)\log(1-\hat{y})$ | Conditional probability | Low | Yes | Binary classification |
The primary tradeoff between squared loss and absolute loss involves sensitivity to outliers versus mathematical convenience. Squared loss penalizes large errors quadratically, which means a single extreme outlier can dominate the total loss. Absolute loss penalizes all errors linearly, making it more robust to outliers but introducing a non-differentiable point at zero error. In practice, squared loss converges faster during optimization because its gradient varies smoothly with the error magnitude, whereas absolute loss has a constant gradient magnitude (positive or negative 1), which can cause instability near the optimum.
Huber loss combines the best properties of both squared and absolute loss. For residuals smaller than a threshold $\delta$, it behaves like squared loss, providing smooth gradients and efficient convergence. For residuals larger than $\delta$, it switches to linear growth, reducing the influence of outliers. The parameter $\delta$ controls where this transition occurs. Huber loss is differentiable everywhere and is commonly used in robust regression when the data contains occasional outliers but the bulk of observations are well-behaved.
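A minimal NumPy sketch of the standard Huber definition (with threshold `delta` and arbitrary example values, the last of which is a deliberate outlier) shows how the outlier dominates MSE but enters the Huber loss only linearly:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for |residual| <= delta, linear beyond it
    r = np.asarray(y_true) - np.asarray(y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0, 50.0])   # last value is an outlier
y_pred = np.array([2.5, 5.2, 2.0, 6.8, 7.0])
print("MSE:  ", np.mean((y_true - y_pred) ** 2))  # dominated by the single outlier
print("Huber:", huber_loss(y_true, y_pred))       # outlier contributes only linearly
```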
Cross-entropy loss is the standard choice for classification tasks, while squared loss is used primarily for regression. When applied to classification, squared loss produces smaller gradients for confident but wrong predictions compared to cross-entropy, which leads to slower learning. Cross-entropy also has a natural probabilistic interpretation as the negative log-likelihood under a Bernoulli or categorical distribution, just as squared loss corresponds to Gaussian noise. Using squared loss for classification can work but generally leads to suboptimal convergence.
Squared loss is the defining loss function for ordinary least squares regression. In linear regression, the model parameters are chosen to minimize the sum of squared residuals, which yields the best linear unbiased estimator (BLUE) under the Gauss-Markov conditions.
In ridge regression, the squared loss is combined with an L2 regularization penalty on the model weights: $\text{Loss} = \text{MSE} + \lambda \lVert \boldsymbol{\theta} \rVert_2^2$. This prevents overfitting by shrinking large coefficients. Similarly, Lasso regression combines squared loss with an L1 regularization penalty, which encourages sparsity in the learned parameters.
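In scikit-learn, for instance, ridge and lasso are drop-in alternatives to ordinary least squares. The sketch below (synthetic data and arbitrary regularization strengths, with `alpha` playing the role of $\lambda$) typically shows lasso setting some coefficients exactly to zero while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: only the 1st and 4th features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```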
In deep learning, squared loss (MSE) is the standard loss function for regression tasks. When training a neural network to predict continuous values (such as house prices, temperature, or stock returns), the network's output is compared to the true value using MSE, and backpropagation computes the gradient of this loss with respect to all network weights. Most deep learning frameworks (PyTorch, TensorFlow, Keras) provide built-in MSE loss functions.
In Bayesian inference, the squared loss has a special status: the Bayes-optimal estimator under squared loss is the posterior mean. That is, among all possible estimators, the one that minimizes expected squared loss is the conditional expectation $E[\theta \mid \text{data}]$. This result follows from the strict convexity of the squared loss function.
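Concretely, for any candidate estimate $a$, differentiating the posterior expected loss and setting the derivative to zero (strict convexity guarantees this stationary point is the unique minimum) gives:

$$\frac{d}{da} \, E\big[(\theta - a)^2 \mid \text{data}\big] = -2\big(E[\theta \mid \text{data}] - a\big) = 0 \quad \Longrightarrow \quad a^* = E[\theta \mid \text{data}]$$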
Squared error is widely used in signal processing for filter design, system identification, and signal reconstruction. The Wiener filter, for example, is derived by minimizing the mean squared error between the desired signal and the filter output. Similarly, the minimum mean squared error (MMSE) estimator is a standard tool in communications engineering.
MSE and RMSE are standard evaluation metrics for time series forecasting models. Methods such as ARIMA, exponential smoothing, and recurrent neural networks are often trained and evaluated using squared loss. RMSE is particularly popular in forecasting because it penalizes large errors more heavily, which is desirable in applications where big misses are costly (for example, energy demand prediction or financial risk estimation).
Matrix factorization methods used in recommendation systems typically minimize the squared error between predicted and actual user ratings. The Netflix Prize competition (2006 to 2009), which advanced the state of the art in collaborative filtering, used RMSE as its primary evaluation metric.
The method of minimizing squared errors has a history spanning more than two centuries.
Adrien-Marie Legendre published the first formal description of the method of least squares in 1805, in an appendix to his book on determining the orbits of comets (Nouvelles méthodes pour la détermination des orbites des comètes). The nine-page appendix, titled "Sur la méthode des moindres quarrés," presented the idea of minimizing the sum of squared residuals as a principled way to fit a model to observational data.
Carl Friedrich Gauss later claimed in 1809 that he had been using the method since 1795, leading to one of the most well-known priority disputes in the history of statistics. Regardless of who used it first, Gauss made a contribution that went beyond Legendre: he connected the method of least squares to the theory of probability by showing that it arises naturally from the assumption of normally distributed errors. This connection between squared loss and the Gaussian distribution remains one of the cornerstones of modern statistical theory.
Within a decade of Legendre's publication, the method of least squares had become a standard tool in astronomy and geodesy across Europe. Pierre-Simon Laplace further developed the probabilistic foundations of the method and connected it to the central limit theorem. In the 20th century, Abraham Wald formalized the squared error loss within the framework of statistical decision theory, and the bias-variance decomposition emerged as a central concept in both frequentist statistics and machine learning.
While squared loss is mathematically convenient, it is not always the best choice. As discussed above, it is sensitive to outliers, implicitly assumes Gaussian noise, and targets the conditional mean; for heavy-tailed data, or when the conditional median is the quantity of interest, absolute or Huber loss is often the better option.
Squared loss is straightforward to implement in all major programming languages and frameworks.
```python
# NumPy: compute MSE directly from the definition
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.2, 2.0, 6.8])
print(mse_loss(y_true, y_pred))  # ≈ 0.145
```
```python
# PyTorch: built-in MSE loss, averaged over elements by default
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
y_true = torch.tensor([3.0, 5.0, 2.5, 7.0])
y_pred = torch.tensor([2.5, 5.2, 2.0, 6.8])
loss = loss_fn(y_pred, y_true)
print(loss.item())  # ≈ 0.145
```
```python
# scikit-learn: metric function for model evaluation
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.2, 2.0, 6.8]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # ≈ 0.145
```
| Quantity | Formula | Description |
|---|---|---|
| Per-sample squared loss | $(y - \hat{y})^2$ | Loss for a single observation |
| Mean squared error (MSE) | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Average loss over the dataset |
| Residual sum of squares (RSS) | $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | Total loss (unnormalized) |
| Root mean squared error (RMSE) | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | MSE in original units |
| Gradient of MSE | $\frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$ | Direction of steepest ascent |
| OLS solution | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | Closed-form optimal parameters |
| Bias-variance decomposition | $\text{Bias}^2 + \text{Var} + \sigma^2$ | MSE decomposition |