Mean Squared Error (MSE), also called mean squared deviation (MSD), is one of the most widely used metrics for evaluating the performance of regression models in machine learning and statistics. It measures the average of the squared differences between predicted values and actual (observed) values. Because it squares each error before averaging, MSE penalizes large errors more heavily than small ones, making it particularly sensitive to outliers. A lower MSE indicates better predictive accuracy, with a perfect model achieving an MSE of zero.
MSE serves a dual role in machine learning. It functions both as an evaluation metric (measuring how well a trained model performs on held-out data) and as a loss function (the objective that a model minimizes during training via gradient descent or similar optimization algorithms).
Given a dataset of n observations, where yᵢ is the actual value and ŷᵢ (y-hat) is the predicted value for the i-th observation, the Mean Squared Error is defined as:
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
In expanded form:
MSE = (1/n) [(y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ... + (yₙ − ŷₙ)²]
Where:
| Symbol | Meaning |
|---|---|
| n | Number of data points (observations) |
| yᵢ | Actual (observed) value for the i-th data point |
| ŷᵢ | Predicted value for the i-th data point |
| (yᵢ − ŷᵢ) | Residual (error) for the i-th data point |
| Σ | Summation over all n data points |
The formula first computes the residual for each observation, squares it to eliminate negative signs and emphasize larger errors, then averages all squared residuals.
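As a worked example, take the actual values [3, −0.5, 2, 7] and the predictions [2.5, 0.0, 2, 8] used in the code samples later in this article. The residuals are 0.5, −0.5, 0, and −1; the squared residuals are 0.25, 0.25, 0, and 1; and their average is MSE = (0.25 + 0.25 + 0 + 1) / 4 = 0.375.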
In the context of estimation theory, for an estimator θ̂ of a true parameter θ, MSE is defined as the expected value of the squared error:
MSE(θ̂) = E[(θ̂ − θ)²]
MSE has several mathematical properties that make it useful for model training and evaluation:
| Property | Description |
|---|---|
| Non-negative | MSE is always greater than or equal to zero, since every squared term is non-negative. An MSE of zero indicates perfect predictions. |
| Differentiable | MSE is smooth and continuously differentiable everywhere, which makes it well-suited for gradient descent optimization. The gradient of MSE with respect to each prediction ŷᵢ is −(2/n)(yᵢ − ŷᵢ). |
| Convex (for linear models) | When used with linear regression or other linear models, MSE forms a convex loss surface with a single global minimum. This guarantees that gradient-based optimization converges to the optimal solution. |
| Non-convex (for neural networks) | For neural networks with nonlinear activation functions, the MSE loss surface becomes non-convex and may contain local minima and saddle points. |
| Scale-dependent | MSE is expressed in the squared units of the target variable, making direct interpretation less intuitive. For example, if the target is measured in dollars, MSE is in dollars squared. |
| Sensitive to outliers | Because errors are squared, a single large error has a disproportionate effect on the overall MSE. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. |
One of the most important theoretical results involving MSE is the bias-variance decomposition. For an estimator θ̂ of a true parameter θ, the MSE can be decomposed as:
MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²
Where:
| Symbol | Meaning |
|---|---|
| Var(θ̂) | Variance of the estimator: E[(θ̂ − E[θ̂])²] |
| Bias(θ̂) | Bias of the estimator: E[θ̂] − θ |
In a supervised learning context with irreducible noise σ², this extends to:
Expected MSE = Bias² + Variance + σ² (irreducible noise)
This decomposition is central to understanding the bias-variance tradeoff. A simple model (e.g., linear regression on complex data) may have high bias but low variance, leading to underfitting. A highly complex model (e.g., a deep neural network with many parameters) may have low bias but high variance, leading to overfitting. The goal is to find a model that balances both components to minimize overall MSE.
For an unbiased estimator (where Bias = 0), the MSE equals the variance. This is why unbiased estimators are sometimes favored in statistics, though biased estimators with lower variance can achieve a lower overall MSE.
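Both the decomposition and this point can be checked by simulation. The following is a minimal sketch, assuming a Gaussian sampling model; the parameter value, sample size, and 0.8 shrinkage factor are all illustrative choices, not from any particular source. It estimates a Gaussian mean with the unbiased sample mean and with a deliberately biased shrunken mean, and verifies that MSE ≈ Var + Bias² for each.

```python
import numpy as np

# Minimal simulation sketch (illustrative values): estimate a Gaussian mean
# with the unbiased sample mean and a biased "shrunken" mean, then check
# that MSE ≈ Var + Bias² for each estimator.
rng = np.random.default_rng(0)
true_theta = 0.5            # true parameter θ
n, trials = 10, 200_000     # sample size and number of simulated datasets

samples = rng.normal(true_theta, 1.0, size=(trials, n))
estimators = {
    "sample mean (unbiased)": samples.mean(axis=1),
    "shrunken mean (biased)": 0.8 * samples.mean(axis=1),
}
for name, est in estimators.items():
    bias_sq = (est.mean() - true_theta) ** 2
    var = est.var()
    mse = np.mean((est - true_theta) ** 2)
    print(f"{name}: bias²={bias_sq:.4f}  var={var:.4f}  mse={mse:.4f}")
# In this setup the biased estimator attains the lower MSE (≈0.074 vs ≈0.100),
# because its reduced variance outweighs its squared bias.
```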
Minimizing MSE is mathematically equivalent to maximum likelihood estimation (MLE) under the assumption that the errors follow a Gaussian (normal) distribution. This connection provides a probabilistic justification for using MSE.
Consider a model yᵢ = f(xᵢ; θ) + εᵢ, where εᵢ ~ N(0, σ²). Under this assumption, each observation yᵢ follows a normal distribution:
p(yᵢ | xᵢ, θ) = (1 / √(2πσ²)) · exp(−(yᵢ − f(xᵢ; θ))² / (2σ²))
The log-likelihood for the entire dataset is:
log L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(yᵢ − f(xᵢ; θ))²
Since σ² is a constant with respect to θ, maximizing the log-likelihood is equivalent to minimizing Σ(yᵢ − f(xᵢ; θ))², which is proportional to MSE. This means that when the noise in the data is truly Gaussian, MSE is the theoretically optimal loss function. When the noise distribution has heavier tails (more outliers than a Gaussian), alternatives like mean absolute error or Huber loss may be more appropriate.
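The equivalence can also be seen numerically. Below is an illustrative sketch (the data, noise level, and parameter grid are all assumptions): for a one-parameter model y = θx + ε, the θ that minimizes the sum of squared errors is the same θ that maximizes the Gaussian log-likelihood.

```python
import numpy as np

# Illustrative sketch: the θ minimizing the sum of squared errors also
# maximizes the Gaussian log-likelihood, since the two differ only by
# constants and a negative scale factor.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(0, 2.0, size=50)
sigma2 = 4.0  # assumed noise variance σ²

thetas = np.linspace(2.0, 4.0, 2001)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])
log_lik = -0.5 * len(x) * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

# Both criteria select the same θ on the grid
print(thetas[np.argmin(sse)], thetas[np.argmax(log_lik)])
```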
In supervised learning, MSE is the default loss function for regression tasks. During training, the model parameters are adjusted to minimize the MSE over the training data. The gradient of MSE with respect to the model's predicted output ŷᵢ is:
∂MSE/∂ŷᵢ = −(2/n)(yᵢ − ŷᵢ)
This gradient has a useful property: it scales linearly with the error. When the prediction is far from the true value, the gradient is large, pushing the model to make a bigger correction. As the prediction approaches the target, the gradient shrinks toward zero, allowing the model to fine-tune near the optimum. This adaptive behavior makes MSE effective for training with gradient descent.
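This behavior can be seen in a minimal gradient-descent sketch. The one-parameter model ŷ = w·x, the toy data, the learning rate, and the step count below are all illustrative assumptions.

```python
import numpy as np

# Minimal gradient-descent sketch for a one-parameter model ŷ = w·x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x
w, lr = 0.0, 0.01

for _ in range(500):
    y_pred = w * x
    # Chain rule through ∂MSE/∂ŷᵢ: ∂MSE/∂w = −(2/n) Σ(yᵢ − ŷᵢ)·xᵢ
    grad = -(2 / len(x)) * np.sum((y - y_pred) * x)
    w -= lr * grad  # large errors give large steps; small errors fine-tune

print(w)  # converges near the least-squares slope ≈ 1.99
```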
For linear regression, minimizing MSE yields the Ordinary Least Squares (OLS) solution. The closed-form solution is θ = (XᵀX)⁻¹Xᵀy, where X is the feature matrix and y is the target vector. This solution is guaranteed to be the global minimum because the MSE loss surface is convex for linear models.
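As a quick check, here is a sketch of the closed form on toy data (the data-generating coefficients are illustrative). In practice, numerically stabler routines such as np.linalg.lstsq are preferred over forming the inverse explicitly.

```python
import numpy as np

# OLS closed form θ = (XᵀX)⁻¹Xᵀy on toy data; a column of ones adds an
# intercept term. True coefficients [1.0, 2.5] are illustrative.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.uniform(0, 5, size=100)])
y = X @ np.array([1.0, 2.5]) + rng.normal(0, 0.5, size=100)

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # close to [1.0, 2.5]
```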
Although MSE can technically be used for classification tasks, cross-entropy loss is strongly preferred for several reasons:
| Issue | MSE for Classification | Cross-Entropy for Classification |
|---|---|---|
| Gradient behavior | When a sigmoid or softmax output is confidently wrong, the gradient of MSE nearly vanishes due to the flat regions of the sigmoid curve. This leads to extremely slow learning. | The logarithm in cross-entropy cancels the exponential in the sigmoid/softmax, producing a clean gradient proportional to the error (pᵢ − yᵢ). Learning remains fast even when the model is confidently wrong. |
| Probabilistic foundation | MSE assumes Gaussian-distributed errors, which does not match the Bernoulli or categorical distribution underlying classification. | Cross-entropy arises naturally from maximum likelihood estimation under Bernoulli (binary) or categorical (multiclass) distributions. |
| Loss surface | MSE creates a non-convex loss surface when combined with sigmoid or softmax outputs, making optimization harder. | Cross-entropy combined with sigmoid or softmax produces a convex loss surface (for logistic regression), ensuring reliable convergence. |
| Penalty for confidence | MSE penalizes confident wrong predictions only slightly more than uncertain ones. | Cross-entropy imposes a sharply increasing penalty as the model becomes more confidently wrong, driving faster corrections. |
For these reasons, cross-entropy is the standard loss function for both binary and multiclass classification in modern deep learning.
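The gradient issue in the first row of the table can be seen directly. The sketch below (label and logit values are illustrative) compares the derivative of each loss with respect to the pre-sigmoid logit z at a confidently wrong prediction.

```python
import numpy as np

# Gradients at a confidently wrong sigmoid prediction, for true label y = 1:
#   MSE loss:           d/dz (p − y)² = 2(p − y)·p·(1 − p)
#   Cross-entropy loss: d/dz −[y·log p + (1−y)·log(1−p)] = p − y
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -6.0   # confidently wrong: p = σ(−6) ≈ 0.0025
p = sigmoid(z)

grad_mse = 2 * (p - y) * p * (1 - p)  # ≈ −0.005: learning nearly stalls
grad_ce = p - y                       # ≈ −0.9975: strong corrective signal
print(grad_mse, grad_ce)
```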
MSE's quadratic penalty means it is highly sensitive to outliers. A single data point with a large error can dominate the total loss and disproportionately influence the model's parameters. For example, if most errors are around 1 but one outlier has an error of 100, the squared error for the outlier (10,000) is 10,000 times larger than a typical error (1). The model may distort its predictions to reduce this one large error at the expense of fitting the majority of the data well.
This sensitivity is both a strength and a weakness. In applications where large errors are genuinely costly (e.g., financial forecasting, safety-critical systems), penalizing them heavily is desirable. In applications where outliers represent noise or data entry errors, MSE can mislead the model. Practitioners should consider the data distribution and application requirements before choosing MSE over more robust alternatives.
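To make the arithmetic concrete, the sketch below constructs 100 errors, 99 of magnitude 1 and one of magnitude 100 (illustrative numbers in the spirit of the example above), and compares the mean squared error with the mean absolute error.

```python
import numpy as np

# 99 typical errors of 1 plus a single outlier error of 100
errors = np.array([1.0] * 99 + [100.0])

mse = np.mean(errors ** 2)     # (99·1 + 10000) / 100 = 100.99
mae = np.mean(np.abs(errors))  # (99·1 + 100) / 100 = 1.99
print(mse, mae)  # the single outlier dominates MSE but barely moves MAE
```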
Several alternative loss functions and metrics are commonly compared with MSE:
| Metric / Loss | Formula | Outlier Sensitivity | Differentiable at 0 | Units | When to Use |
|---|---|---|---|---|---|
| MSE | (1/n) Σ(yᵢ − ŷᵢ)² | High (quadratic penalty) | Yes | Squared units of target | Clean data, large errors are costly, need smooth optimization |
| MAE | (1/n) Σ\|yᵢ − ŷᵢ\| | Low (linear penalty) | No (kink at 0) | Same units as target | Noisy data with outliers, median prediction desired |
| RMSE | √[(1/n) Σ(yᵢ − ŷᵢ)²] | High (same as MSE) | Yes | Same units as target | Reporting and interpretation (units match target), comparing models |
| Huber Loss | Quadratic for \|error\| ≤ δ, linear for \|error\| > δ | Medium (configurable via δ) | Yes | Same units as target | Data with some outliers but you still want to penalize moderate errors quadratically |
Key differences:
- MSE and RMSE always rank models identically (RMSE is just the square root of MSE), but RMSE is reported in the target's original units, making it easier to interpret.
- Minimizing MSE pushes predictions toward the conditional mean of the target, while minimizing MAE pushes them toward the conditional median, which is why MAE is less distorted by outliers.
- Huber loss interpolates between the two: it behaves like MSE for small errors and like MAE for large ones, with the threshold δ controlling the transition; a sketch implementation follows below.
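Here is a minimal sketch of the Huber loss from the table, using one common parameterization (with the conventional ½ factor on the quadratic branch); δ = 1.0 is an illustrative default.

```python
import numpy as np

# Huber loss: 0.5·e² for |e| ≤ δ, δ·(|e| − 0.5·δ) otherwise, averaged.
def huber_loss(y_true, y_pred, delta=1.0):
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    small = np.abs(error) <= delta
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(small, quadratic, linear))

# On the running example, all errors fall in the quadratic region:
print(huber_loss([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))  # 0.1875
```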
Normalized Mean Squared Error addresses MSE's scale-dependence by dividing by the variance of the observed data:
NMSE = MSE / Var(y) = [Σ(yᵢ − ŷᵢ)²] / [Σ(yᵢ − ȳ)²]
Where ȳ is the mean of the observed values. NMSE is dimensionless and scale-independent, making it useful for comparing model performance across different datasets or target variables. An NMSE of 1.0 means the model performs no better than simply predicting the mean, while values below 1.0 indicate useful predictive power.
NMSE is closely related to the coefficient of determination (R²), with R² = 1 − NMSE. NMSE is used in fields such as wireless communications, signal processing, and air quality modeling where comparing predictions across different scales is important.
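Continuing with the running example from the code sections below, here is a quick check of the NMSE definition and its relationship to R²:

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# NMSE = residual sum of squares / total sum of squares
nmse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - nmse
print(nmse, r2)  # ≈ 0.0514, ≈ 0.9486
```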
Choosing the right loss function depends on the data characteristics and application requirements:
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Clean data, Gaussian-distributed errors | MSE | Theoretically optimal under Gaussian noise assumption; smooth gradients enable fast convergence |
| Data with frequent outliers or heavy-tailed errors | MAE or Huber Loss | MSE would give disproportionate weight to outliers, distorting the model |
| Need interpretable error in original units | RMSE | Same units as target variable, easier to communicate to stakeholders |
| Comparing models across different scales or datasets | NMSE or R\u00b2 | Scale-independent, allows fair comparison |
| Object detection bounding box regression | Huber Loss (Smooth L1) | Balances outlier robustness with smooth optimization during early training |
| Financial applications with asymmetric costs | Custom asymmetric loss | MSE treats over-predictions and under-predictions equally, which may not reflect real-world costs |
MSE is available in all major machine learning frameworks.
```python
from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # Output: 0.375
```
scikit-learn's mean_squared_error accepts sample_weight for weighted MSE and multioutput for multi-target regression. It can return per-output errors via multioutput='raw_values'.
```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

loss = loss_fn(y_pred, y_true)
print(loss.item())  # Output: 0.375
```
PyTorch's nn.MSELoss supports three reduction modes: 'mean' (default, computes MSE), 'sum' (returns total squared error), and 'none' (returns per-element squared errors).
```python
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
y_true = [[3.0, -0.5], [2.0, 7.0]]
y_pred = [[2.5, 0.0], [2.0, 8.0]]

loss = loss_fn(y_true, y_pred)
print(loss.numpy())  # Output: 0.375
```
Keras offers MeanSquaredError as both a loss class and a metric class (tf.keras.metrics.MeanSquaredError), and supports configurable reduction options including 'sum_over_batch_size', 'sum', and 'none'.
```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Mean of squared residuals: (1/n) Σ(yᵢ − ŷᵢ)²
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))  # Output: 0.375
```
Imagine you and your friends are trying to guess how many candies are in a jar. Each time someone guesses, they might be a little bit wrong. Mean Squared Error is a way to measure how wrong your guesses are, on average.
Here is what you do. Take the difference between each guess and the actual number of candies. Then multiply each difference by itself (that is called squaring it). Finally, add up all those squared numbers and divide by how many guesses there were.
Why do we square the differences? Two reasons. First, it gets rid of the minus sign, so it does not matter if you guessed too high or too low. Second, it makes bigger mistakes count a lot more. If you were off by 2, the squared error is 4. If you were off by 10, the squared error is 100. That is 25 times worse, not just 5 times worse.
If everyone's guesses are close to the real number, the MSE will be small. If the guesses are way off, the MSE will be large. So MSE tells us how good our guessing method is, and smaller is better.