# Mean Squared Error (MSE)

> Source: https://aiwiki.ai/wiki/mean_squared_error_mse
> Updated: 2026-06-21
> Categories: Machine Learning, Model Evaluation, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Mean Squared Error (MSE), also called mean squared deviation (MSD), is the average of the squared differences between predicted values and actual (observed) values, expressed as MSE = (1/n) Σ(yᵢ − ŷᵢ)². It is one of the most widely used metrics for evaluating the performance of [regression models](/wiki/regression_model) in [machine learning](/wiki/machine_learning) and statistics, and it is also the default training objective for continuous-output models.[14][4] Because it squares each error before averaging, MSE penalizes large errors more heavily than small ones, making it particularly sensitive to outliers: an error of 10 contributes 100 to the loss, while an error of 1 contributes only 1.[1] A lower MSE indicates better predictive accuracy, with a perfect model achieving an MSE of exactly zero.

MSE serves a dual role in machine learning. It functions both as an evaluation metric (measuring how well a trained model performs on held-out data) and as a [loss function](/wiki/loss_function) (the objective that a model minimizes during training via [gradient descent](/wiki/gradient_descent) or similar optimization algorithms).[2] Minimizing MSE drives a model's output toward the conditional expectation E[y | x] of the target given the input, which is the prediction that minimizes squared loss.[2] MSE also underpins many other quantities used in modern AI, including the peak signal-to-noise ratio (PSNR) for image reconstruction, the simplified training objective for [diffusion models](/wiki/diffusion_model), and the reconstruction loss for [autoencoders](/wiki/autoencoder).[3]

## How is MSE different from RMSE and MAE?

MSE, RMSE, and [mean absolute error (MAE)](/wiki/mean_absolute_error_mae) are three closely related ways of summarizing prediction error, and they answer different questions. MSE averages the squared residuals and is reported in the squared units of the target. RMSE is the square root of MSE, which returns the figure to the same units as the target and is therefore the value most practitioners actually report. MAE averages the absolute residuals, penalizes errors linearly rather than quadratically, and is more robust to outliers. A key statistical distinction is what each metric makes a model predict: minimizing MSE yields the conditional mean of the target, while minimizing MAE yields the conditional median.[2] The detailed comparison table appears in the section below on MSE vs. MAE vs. RMSE vs. Huber Loss.

## Historical Background: When was the method of least squares invented?

The idea of squaring residuals and minimizing their sum dates back more than two centuries. The method of least squares, which is the optimization principle that gives rise to MSE, was first published by Adrien-Marie Legendre in 1805 in his work *Nouvelles méthodes pour la détermination des orbites des comètes* (New Methods for the Determination of Comet Orbits), where he introduced the method and gave it its name.[7] Carl Friedrich Gauss published a probabilistic justification for it in 1809 in *Theoria motus corporum coelestium*, where he showed that least squares yields the most probable estimate when observation errors follow a normal distribution, and where he claimed to have been using the method since 1795.[7] That claim triggered a long-running priority dispute with Legendre, who held that priority is established only by publication.[7] Pierre-Simon Laplace independently extended the theory in 1810 by connecting it to the central limit theorem.[7]

The Gauss-Markov theorem, formalized in the 19th century, established that under standard assumptions (linear model, zero-mean errors, equal variance, uncorrelated errors), the ordinary least squares estimator has the lowest variance among all linear unbiased estimators. This result, often abbreviated as BLUE (Best Linear Unbiased Estimator), gave MSE-based estimation a strong theoretical foundation that carried into the 20th century.

In the 20th century, MSE became central to estimation theory through the work of statisticians such as Jerzy Neyman, Egon Pearson, Ronald Fisher, and Erich Lehmann.[6] Fisher's development of maximum likelihood estimation in the 1920s reinforced the connection between MSE and probabilistic modeling under Gaussian assumptions. The rise of digital computing in the second half of the century made least-squares regression practical at scale, and when [neural networks](/wiki/neural_network) re-emerged in the 1980s, MSE was adopted as the natural training objective for continuous-output models, a role it continues to play in modern deep learning.

## Mathematical Definition

Given a dataset of *n* observations, where *y_i* is the actual value and *ŷ_i* (y-hat) is the predicted value for the *i*-th observation, the Mean Squared Error is defined as:

**MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²**

In expanded form:

MSE = (1/n) [(y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ... + (yₙ − ŷₙ)²]

Where:

| Symbol | Meaning |
|---|---|
| n | Number of data points (observations) |
| yᵢ | Actual (observed) value for the i-th data point |
| ŷᵢ | Predicted value for the i-th data point |
| (yᵢ − ŷᵢ) | Residual (error) for the i-th data point |
| Σ | Summation over all n data points |

The formula first computes the residual for each observation, squares it to eliminate negative signs and emphasize larger errors, then averages all squared residuals.

In the context of estimation theory, for an estimator θ̂ of a true parameter θ, MSE is defined as the expected value of the squared error:[6]

**MSE(θ̂) = E[(θ̂ − θ)²]**

This population-level definition treats MSE as an expectation over the distribution of the estimator, while the sample-level definition treats it as a finite arithmetic mean over data points. The two definitions converge as the sample size grows, by the law of large numbers.

### Worked Numerical Example

Suppose a model predicts house prices in thousands of dollars. The actual prices for five houses are 250, 300, 180, 420, and 350. The model predicts 240, 310, 200, 400, and 360. The residuals (actual minus predicted) are 10, -10, -20, 20, and -10. Squaring gives 100, 100, 400, 400, and 100, which sum to 1100. Dividing by n = 5 yields an MSE of 220 (in squared thousands of dollars). The corresponding RMSE is the square root of 220, approximately 14.83 thousand dollars, which is the figure most practitioners would report.[4]
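
The arithmetic above can be checked with a few lines of standard-library Python. This is only a verification sketch; the variable names are illustrative.

```python
import math

# House prices in thousands of dollars, taken from the example above.
actual = [250, 300, 180, 420, 350]
predicted = [240, 310, 200, 400, 360]

# Residuals are actual minus predicted.
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
squared = [r ** 2 for r in residuals]

mse = sum(squared) / len(squared)
rmse = math.sqrt(mse)

print(residuals)       # [10, -10, -20, 20, -10]
print(mse)             # 220.0
print(round(rmse, 2))  # 14.83
```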

## Key Properties

MSE has several mathematical properties that make it useful for model training and evaluation:

| Property | Description |
|---|---|
| Non-negative | MSE is always greater than or equal to zero, since every squared term is non-negative. An MSE of zero indicates perfect predictions. |
| Differentiable | MSE is smooth and continuously differentiable everywhere, which makes it well-suited for [gradient descent](/wiki/gradient_descent) optimization. The partial derivative of MSE with respect to each prediction ŷᵢ is −(2/n)(yᵢ − ŷᵢ). |
| Convex (for linear models) | When used with [linear regression](/wiki/linear_regression) or other linear models, MSE is a convex function of the parameters, so every local minimum is a global minimum (and the minimizer is unique when the feature matrix has full column rank). With a suitable step size, gradient-based optimization converges to an optimal solution. |
| Non-convex (for neural networks) | For [neural networks](/wiki/neural_network) with nonlinear activation functions, the MSE loss surface becomes non-convex and may contain local minima and saddle points. |
| Scale-dependent | MSE is expressed in the squared units of the target variable, making direct interpretation less intuitive. For example, if the target is measured in dollars, MSE is in dollars squared. |
| Sensitive to outliers | Because errors are squared, a single large error has a disproportionate effect on the overall MSE. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. |
| Lipschitz-continuous gradient | The gradient of MSE with respect to the predictions is linear in the residuals and therefore Lipschitz continuous with constant 2/n, which is a useful condition for the convergence guarantees of many optimization algorithms. The loss itself is Lipschitz only when predictions and targets lie in a bounded set. |
| Twice differentiable | The Hessian of MSE with respect to predictions is constant, which simplifies second-order optimization methods such as Newton's method and natural gradient descent. |

## Bias-Variance Decomposition

One of the most important theoretical results involving MSE is the bias-variance decomposition. For an estimator θ̂ of a true parameter θ, the MSE can be decomposed as:[1]

**MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²**

Where:

- **Var(θ̂)** is the variance of the estimator, measuring how much the predictions fluctuate across different training sets.
- **Bias(θ̂)** is the bias, measuring the systematic difference between the average prediction and the true value.

In a supervised learning context with irreducible noise σ², this extends to:

**Expected MSE = Bias² + Variance + σ² (irreducible noise)**

This decomposition is central to understanding the [bias-variance tradeoff](/wiki/bias_variance_tradeoff).[4] A simple model (for example, linear regression on complex data) may have high bias but low variance, leading to [underfitting](/wiki/underfitting). A highly complex model (for example, a deep neural network with many parameters) may have low bias but high variance, leading to [overfitting](/wiki/overfitting). The goal is to find a model that balances both components to minimize overall MSE.

For an unbiased estimator (where Bias = 0), the MSE equals the variance. This is why unbiased estimators are sometimes favored in statistics, though biased estimators with lower variance can achieve a lower overall MSE.[6] A classic illustration is shrinkage, where slightly biased estimators such as [ridge regression](/wiki/ridge_regression) or the James-Stein estimator can produce lower MSE than the unbiased OLS estimator, particularly in high-dimensional settings.[5]
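
A small simulation can make the decomposition and the shrinkage argument concrete. The sketch below is illustrative only: the true mean, noise level, sample size, and shrinkage factor of 0.8 are arbitrary assumptions, and the estimator is a plain sample mean rather than ridge regression or James-Stein.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 1.0   # the parameter theta being estimated (illustrative)
NOISE_STD = 2.0   # standard deviation of each observation
SAMPLE_SIZE = 5
TRIALS = 20_000

def simulate(shrink):
    """Return (mse, variance, squared_bias) of the estimator shrink * sample_mean."""
    estimates = []
    for _ in range(TRIALS):
        sample = [random.gauss(TRUE_MEAN, NOISE_STD) for _ in range(SAMPLE_SIZE)]
        estimates.append(shrink * statistics.fmean(sample))
    mse = statistics.fmean((e - TRUE_MEAN) ** 2 for e in estimates)
    variance = statistics.pvariance(estimates)
    squared_bias = (statistics.fmean(estimates) - TRUE_MEAN) ** 2
    return mse, variance, squared_bias

unbiased = simulate(shrink=1.0)  # plain sample mean
shrunk = simulate(shrink=0.8)    # biased toward zero, lower variance

for name, (mse, var, bias2) in [("unbiased", unbiased), ("shrunk", shrunk)]:
    print(f"{name}: MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias2:.3f}")
```

In each row the MSE equals the variance plus the squared bias up to floating-point error. Under these assumptions the theoretical values are 0.8 for the sample mean (σ²/n) and 0.552 for the shrunken estimator (0.512 variance plus 0.04 squared bias), so the simulated numbers should land close to those.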

## Why does minimizing MSE equal maximum likelihood estimation?

Minimizing MSE is mathematically equivalent to maximum likelihood estimation (MLE) under the assumption that the errors follow a Gaussian (normal) distribution.[5] This connection provides a probabilistic justification for using MSE.[3]

Consider a model yᵢ = f(xᵢ; θ) + εᵢ, where εᵢ ~ N(0, σ²). Under this assumption, each observation yᵢ follows a normal distribution:

p(yᵢ | xᵢ, θ) = (1 / √(2πσ²)) · exp(−(yᵢ − f(xᵢ; θ))² / (2σ²))

The log-likelihood for the entire dataset is:

log L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(yᵢ − f(xᵢ; θ))²

Since σ² is a constant with respect to θ, maximizing the log-likelihood is equivalent to minimizing Σ(yᵢ − f(xᵢ; θ))², which is proportional to MSE.[5] This means that when the noise in the data is truly Gaussian, MSE is the theoretically optimal loss function. When the noise distribution has heavier tails (more outliers than a Gaussian), alternatives like [mean absolute error](/wiki/mean_absolute_error_mae) or [Huber loss](/wiki/huber_loss) may be more appropriate.[10]
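
The equivalence can be illustrated numerically. The following sketch assumes a toy one-parameter model f(x; θ) = θ·x, a small hand-made dataset, and a known σ; all of these are illustrative choices, not part of the derivation above.

```python
import math

# Toy data and a one-parameter model f(x; theta) = theta * x (both illustrative).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]
SIGMA = 1.5  # assumed known noise standard deviation

def sum_squared_error(theta):
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def negative_log_likelihood(theta):
    """Negative Gaussian log-likelihood: a constant plus SSE / (2 * sigma^2)."""
    n = len(xs)
    constant = (n / 2) * math.log(2 * math.pi * SIGMA ** 2)
    return constant + sum_squared_error(theta) / (2 * SIGMA ** 2)

# Evaluate both objectives on the same grid of candidate slopes.
grid = [i / 1000 for i in range(1000, 3001)]
best_by_sse = min(grid, key=sum_squared_error)
best_by_nll = min(grid, key=negative_log_likelihood)

print(best_by_sse, best_by_nll)  # 2.01 2.01
```

Changing `SIGMA` rescales the likelihood but leaves the minimizing slope unchanged, which is the point of the argument.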

The Gaussian-MSE connection also clarifies the meaning of the predictions a model produces. Minimizing MSE drives the model output toward the conditional expectation E[y | x] of the target given the input.[2] Minimizing MAE, by contrast, drives the model output toward the conditional median. Choosing between MSE and MAE is therefore not just a robustness choice but a statement about which summary of the conditional distribution the model is predicting.
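
The mean-versus-median distinction is easy to see in the simplest case of a constant prediction. The sketch below uses a made-up sample with one large value and searches over integer constants; it is an illustration, not a general proof.

```python
import statistics

targets = [1, 2, 3, 4, 100]  # illustrative sample with one large value

def mse_of_constant(c):
    return sum((y - c) ** 2 for y in targets) / len(targets)

def mae_of_constant(c):
    return sum(abs(y - c) for y in targets) / len(targets)

candidates = range(0, 101)  # integer constant predictions to try
best_for_mse = min(candidates, key=mse_of_constant)
best_for_mae = min(candidates, key=mae_of_constant)

print(best_for_mse, statistics.mean(targets))    # 22 22
print(best_for_mae, statistics.median(targets))  # 3 3
```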

## Gauss-Markov Theorem and Best Linear Unbiased Estimator

The [Gauss-Markov theorem](/wiki/gauss_markov_theorem) is a foundational result that gives MSE-based estimation a strong theoretical justification in the context of [linear regression](/wiki/linear_regression). The theorem states that under the following assumptions, the ordinary least squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE):

| Assumption | Meaning |
|---|---|
| Linearity | The relationship between predictors and target is linear in parameters. |
| Zero-mean errors | E[εᵢ] = 0 for all observations. |
| Homoscedasticity | All errors have the same variance, Var(εᵢ) = σ². |
| No autocorrelation | Errors are uncorrelated, Cov(εᵢ, εⱼ) = 0 for i ≠ j. |
| Exogeneity | Errors are uncorrelated with the predictors. |

Under these conditions, OLS achieves the lowest variance among all linear unbiased estimators.[6] Notably, the theorem does not require errors to be Gaussian; it only requires the five conditions above. Gaussian errors give OLS the additional properties of being the maximum likelihood estimator and the minimum-variance estimator among all unbiased estimators (not just linear ones). The extension of the theorem to correlated or heteroscedastic errors, via generalized least squares, is known as Aitken's theorem.[5]

## MSE as a Loss Function for Regression

In [supervised learning](/wiki/supervised_machine_learning), MSE is the default [loss function](/wiki/loss_function) for [regression](/wiki/regression) tasks.[2] During training, the model parameters are adjusted to minimize the MSE over the training data. The gradient of MSE with respect to the model's predicted output ŷᵢ is:

**∂MSE/∂ŷᵢ = −(2/n)(yᵢ − ŷᵢ)**

This gradient has a useful property: it scales linearly with the error. When the prediction is far from the true value, the gradient is large, pushing the model to make a bigger correction. As the prediction approaches the target, the gradient shrinks toward zero, allowing the model to fine-tune near the optimum. This adaptive behavior makes MSE effective for training with [gradient descent](/wiki/gradient_descent).
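
One way to sanity-check the gradient formula is to compare it against central finite differences. The helper names below are illustrative, and the data reuse the small example that appears in the Python sections of this article.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def mse_gradient(y_true, y_pred):
    """Analytic gradient: dMSE/dy_hat_i = -(2/n) * (y_i - y_hat_i)."""
    n = len(y_true)
    return [-(2 / n) * (y - p) for y, p in zip(y_true, y_pred)]

def numeric_gradient(y_true, y_pred, eps=1e-6):
    """Central finite differences, perturbing one prediction at a time."""
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(y_true, up) - mse(y_true, down)) / (2 * eps))
    return grads

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

analytic = mse_gradient(y_true, y_pred)
numeric = numeric_gradient(y_true, y_pred)
print(analytic)  # [-0.25, 0.25, -0.0, 0.5]
```

The largest gradient entry belongs to the prediction with the largest error, and the exact prediction has zero gradient, matching the linear scaling described above.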

For [linear regression](/wiki/linear_regression), minimizing MSE yields the Ordinary Least Squares (OLS) solution. The closed-form solution is θ = (XᵀX)⁻¹Xᵀy, where X is the feature matrix and y is the target vector.[5] This solution is guaranteed to be the global minimum because the MSE loss surface is convex for linear models. For [neural networks](/wiki/neural_network) and other nonlinear models, no closed-form solution exists and iterative optimization with backpropagation is used instead.[3]
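
A minimal NumPy sketch of the closed-form solution follows. The synthetic data and coefficient values are assumptions made for illustration. The code solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the more stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x1 - 1*x2 plus small Gaussian noise (illustrative).
n = 200
features = rng.normal(size=(n, 2))
true_theta = np.array([2.0, 3.0, -1.0])      # intercept, then two slopes
X = np.column_stack([np.ones(n), features])  # prepend a column of ones
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) theta = X^T y instead of inverting X^T X.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which also copes with a rank-deficient X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)
print(np.mean((y - X @ theta_normal) ** 2))  # training MSE at the optimum
```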

## Can MSE be used for classification?

Although MSE can technically be used for classification tasks, [cross-entropy loss](/wiki/cross-entropy) is strongly preferred for several reasons:

| Issue | MSE for Classification | Cross-Entropy for Classification |
|---|---|---|
| Gradient behavior | When a sigmoid or softmax output is confidently wrong, the gradient of MSE nearly vanishes due to the flat regions of the sigmoid curve. This leads to extremely slow learning. | The logarithm in cross-entropy cancels the exponential in the sigmoid/softmax, producing a clean gradient proportional to the error (pᵢ − yᵢ). Learning remains fast even when the model is confidently wrong. |
| Probabilistic foundation | MSE assumes Gaussian-distributed errors, which does not match the Bernoulli or categorical distribution underlying classification. | Cross-entropy arises naturally from maximum likelihood estimation under Bernoulli (binary) or categorical (multiclass) distributions. |
| Loss surface | MSE creates a non-convex loss surface when combined with sigmoid or softmax outputs, making optimization harder. | Cross-entropy combined with sigmoid or softmax produces a convex loss surface (for logistic regression), ensuring reliable convergence. |
| Penalty for confidence | MSE penalizes confident wrong predictions only slightly more than uncertain ones. | Cross-entropy imposes a sharply increasing penalty as the model becomes more confidently wrong, driving faster corrections. |

For these reasons, cross-entropy is the standard loss function for both binary and multiclass classification in modern deep learning.[3]
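
The gradient-behavior row in the table can be checked directly for a single sigmoid output. The sketch below assumes a true label of 1 and a logit of -10, that is, a confidently wrong prediction; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    """Derivative of (sigmoid(z) - y)^2 with respect to the logit z."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_cross_entropy_wrt_logit(z, y):
    """Derivative of binary cross-entropy with a sigmoid output; simplifies to p - y."""
    return sigmoid(z) - y

# True label is 1, but the logit says "almost certainly 0".
z, y = -10.0, 1.0
mse_grad = grad_mse_wrt_logit(z, y)
ce_grad = grad_cross_entropy_wrt_logit(z, y)

print(f"MSE gradient: {mse_grad:.2e}")
print(f"cross-entropy gradient: {ce_grad:.4f}")
```

The MSE gradient is on the order of 1e-4 in magnitude because of the p(1 − p) factor, while the cross-entropy gradient is close to -1, so the cross-entropy update is several orders of magnitude larger in this situation.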

## Sensitivity to Outliers

MSE's quadratic penalty means it is highly sensitive to outliers. A single data point with a large error can dominate the total loss and disproportionately influence the model's parameters. For example, if most errors are around 1 but one outlier has an error of 100, the squared error for the outlier (10,000) is 10,000 times larger than a typical squared error (1). The model may distort its predictions to reduce this one large error at the expense of fitting the majority of the data well.

This sensitivity is both a strength and a weakness. In applications where large errors are genuinely costly (for example, financial forecasting or safety-critical systems), penalizing them heavily is desirable. In applications where outliers represent noise or data entry errors, MSE can mislead the model. Practitioners should consider the data distribution and application requirements before choosing MSE over more robust alternatives.[10]
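
To put numbers on this, the sketch below takes nine errors of size 1 and one error of size 100 (an illustrative setup) and measures how much of the total loss comes from the single outlier under MSE and under MAE.

```python
errors = [1.0] * 9 + [100.0]  # nine typical errors and one outlier

squared = [e ** 2 for e in errors]
mse = sum(squared) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

# Fraction of each total that comes from the single outlier.
outlier_share_mse = squared[-1] / sum(squared)
outlier_share_mae = abs(errors[-1]) / sum(abs(e) for e in errors)

print(mse, mae)                     # 1000.9 10.9
print(round(outlier_share_mse, 4))  # 0.9991
print(round(outlier_share_mae, 4))  # 0.9174
```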

### Common Mitigations

Several practical strategies can reduce the harmful influence of outliers on MSE-based training:

| Strategy | Description |
|---|---|
| Outlier removal | Identify and remove data points outside a chosen percentile or beyond a threshold (for example, 3 standard deviations from the mean). |
| Target transformation | Apply a log, square root, or Box-Cox transformation to compress the tail of a heavy-tailed target before training. |
| Robust losses | Switch to [Huber loss](/wiki/huber_loss), Tukey's biweight, or log-cosh loss, which all dampen the influence of large errors. |
| Trimmed loss | Sort residuals and drop the largest k percent before averaging, similar to a trimmed mean. |
| Sample weighting | Down-weight suspected outliers in the loss using sample weights, reducing their gradient contribution. |
| Quantile regression | Optimize a pinball loss to predict a target quantile rather than the mean, which is naturally robust to outliers. |

## MSE vs. MAE vs. RMSE vs. Huber Loss

Several alternative loss functions and metrics are commonly compared with MSE:

| Metric / Loss | Formula | Outlier Sensitivity | Differentiable at 0 | Units | When to Use |
|---|---|---|---|---|---|
| **MSE** | (1/n) Σ(yᵢ − ŷᵢ)² | High (quadratic penalty) | Yes | Squared units of target | Clean data, large errors are costly, need smooth optimization |
| **[MAE](/wiki/mean_absolute_error_mae)** | (1/n) Σ\|yᵢ − ŷᵢ\| | Low (linear penalty) | No (kink at 0) | Same units as target | Noisy data with outliers, median prediction desired |
| **RMSE** | √[(1/n) Σ(yᵢ − ŷᵢ)²] | High (same as MSE) | Yes | Same units as target | Reporting and interpretation (units match target), comparing models |
| **[Huber Loss](/wiki/huber_loss)** | Quadratic for \|error\| ≤ δ, linear for \|error\| > δ | Medium (configurable via δ) | Yes | Squared units of target (δ is in target units) | Data with some outliers but you still want to penalize moderate errors quadratically |
| **MAPE** | (100/n) Σ \|(yᵢ − ŷᵢ)/yᵢ\| | Low | No | Percentage | Forecasting where relative errors matter, but undefined when yᵢ = 0 |
| **sMAPE** | (100/n) Σ \|yᵢ − ŷᵢ\| / ((\|yᵢ\| + \|ŷᵢ\|)/2) | Low | No | Percentage | Symmetric variant of MAPE that handles small or zero targets better |
| **MSLE** | (1/n) Σ (log(1 + yᵢ) − log(1 + ŷᵢ))² | Low for large values | Yes | Squared log-units | Skewed targets such as counts or revenue spanning several orders of magnitude |
| **R²** | 1 − (Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²) | High (driven by MSE) | Yes | Dimensionless | Reporting fraction of variance explained by the model |

**Key differences:**

- **MSE vs. MAE:** MSE penalizes large errors quadratically while [MAE](/wiki/mean_absolute_error_mae) penalizes them linearly. MSE yields the conditional mean as the optimal prediction, whereas MAE yields the conditional median.[2] MAE is more robust to outliers but produces non-smooth gradients (a kink at zero error), which can slow convergence during optimization.
- **MSE vs. RMSE:** RMSE is simply the square root of MSE. Minimizing RMSE is mathematically equivalent to minimizing MSE since the square root is a monotonically increasing function. RMSE is preferred for reporting because its units match the target variable, making it more interpretable.
- **MSE vs. Huber Loss:** [Huber loss](/wiki/huber_loss) is a hybrid that behaves like MSE for small errors (below a threshold δ) and like MAE for large errors (above δ). It combines the smooth optimization of MSE with the outlier robustness of MAE.[10] The trade-off is that Huber loss introduces an additional hyperparameter δ that must be tuned. Huber loss is the standard regression loss in object detection architectures such as Faster R-CNN and SSD, where it is also called Smooth L1.
- **MSE vs. MAPE and sMAPE:** MAPE and sMAPE express error in percentage terms, which is intuitive in business and forecasting contexts. They are scale-free, but MAPE is undefined when the target is zero and asymmetric in its treatment of over- and under-prediction. sMAPE mitigates both issues at the cost of a less intuitive interpretation, although it is itself undefined when the target and the prediction are both zero.
- **MSE vs. MSLE:** Mean Squared Logarithmic Error squares the difference of the log-transformed target and prediction. It is often used when the target spans several orders of magnitude (such as house prices, population counts, or revenue) and when predicting a value that is, say, twice the truth is more acceptable for large targets than for small ones.
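
The Huber comparison can be sketched in a few lines. The version below uses the common convention with a factor of ½ in the quadratic region, so that the two pieces join smoothly at δ; the residuals and the default δ = 1 are illustrative assumptions.

```python
def huber(error, delta=1.0):
    """Huber loss for a single residual: quadratic inside delta, linear outside."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error ** 2
    return delta * (magnitude - 0.5 * delta)

def mean_loss(errors, loss_fn):
    return sum(loss_fn(e) for e in errors) / len(errors)

errors = [0.5, -1.0, 0.2, 20.0]  # the last residual is an outlier

mse = mean_loss(errors, lambda e: e ** 2)
mae = mean_loss(errors, abs)
huber_mean = mean_loss(errors, huber)

print(round(mse, 4), round(mae, 4), round(huber_mean, 4))
```

With these residuals the MSE is about 100 while the MAE and the mean Huber loss are both near 5, because the outlier is penalized linearly by the latter two.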

## Normalized MSE (NMSE)

Normalized Mean Squared Error addresses MSE's scale-dependence by dividing by the variance of the observed data:

**NMSE = MSE / Var(y) = [Σ(yᵢ − ŷᵢ)²] / [Σ(yᵢ − ȳ)²]**

Where ȳ is the mean of the observed values. NMSE is dimensionless and scale-independent, making it useful for comparing model performance across different datasets or target variables. An NMSE of 1.0 means the model performs no better than simply predicting the mean, while values below 1.0 indicate useful predictive power.

NMSE is closely related to the coefficient of determination (R²), with R² = 1 − NMSE.[14] NMSE is used in fields such as wireless communications, signal processing, and air quality modeling where comparing predictions across different scales is important.
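
A short sketch of NMSE and its relationship to R², reusing the house-price numbers from the worked example earlier in the article. The function names are illustrative.

```python
def nmse(y_true, y_pred):
    """Sum of squared residuals divided by the total sum of squares about the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return ss_res / ss_tot

def r_squared(y_true, y_pred):
    return 1.0 - nmse(y_true, y_pred)

actual = [250, 300, 180, 420, 350]
predicted = [240, 310, 200, 400, 360]

print(round(nmse(actual, predicted), 4))       # 0.0325
print(round(r_squared(actual, predicted), 4))  # 0.9675

# Predicting the mean for every observation gives NMSE = 1 and R^2 = 0.
baseline = [sum(actual) / len(actual)] * len(actual)
print(nmse(actual, baseline))  # 1.0
```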

## When to Use MSE vs. Alternatives

Choosing the right loss function depends on the data characteristics and application requirements:

| Scenario | Recommended Metric | Reason |
|---|---|---|
| Clean data, Gaussian-distributed errors | MSE | Theoretically optimal under Gaussian noise assumption; smooth gradients enable fast convergence |
| Data with frequent outliers or heavy-tailed errors | MAE or Huber Loss | MSE would give disproportionate weight to outliers, distorting the model |
| Need interpretable error in original units | RMSE | Same units as target variable, easier to communicate to stakeholders |
| Comparing models across different scales or datasets | NMSE or R² | Scale-independent, allows fair comparison |
| Object detection bounding box regression | Huber Loss (Smooth L1) | Balances outlier robustness with smooth optimization during early training |
| Financial applications with asymmetric costs | Custom asymmetric loss | MSE treats over-predictions and under-predictions equally, which may not reflect real-world costs |
| Targets spanning orders of magnitude | MSLE | Penalizes relative error rather than absolute error |
| Time series forecasting at scale | MAPE or sMAPE | Scale-free percentage interpretation across many series |
| Quantile prediction (risk, intervals) | Pinball loss | MSE only produces conditional means and cannot directly target quantiles |

## Regularized Variants and Connections to Other Methods

MSE is rarely used in isolation in modern machine learning. It is almost always combined with regularization terms that constrain the model and improve generalization. The most common combinations are:

| Method | Objective | Notes |
|---|---|---|
| Ordinary Least Squares (OLS) | min Σ(yᵢ − ŷᵢ)² | Pure MSE without regularization. |
| [Ridge Regression](/wiki/ridge_regression) | min Σ(yᵢ − ŷᵢ)² + λ Σ βⱼ² | Adds an L2 penalty on coefficients. Equivalent to MAP estimation under a Gaussian prior on parameters. |
| Lasso Regression | min Σ(yᵢ − ŷᵢ)² + λ Σ \|βⱼ\| | Adds an L1 penalty that performs feature selection. Equivalent to MAP estimation under a Laplace prior. |
| Elastic Net | min Σ(yᵢ − ŷᵢ)² + λ₁ Σ \|βⱼ\| + λ₂ Σ βⱼ² | Combines L1 and L2 penalties, useful for correlated features. |
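| Weight decay (deep learning) | Loss + λ Σ \|\|θ\|\|² | Equivalent to L2 regularization under plain gradient descent or SGD (up to a rescaling of λ by the learning rate), but not under adaptive optimizers such as Adam, which motivated decoupled variants such as AdamW. Widely used in [neural networks](/wiki/neural_network). |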

These formulations show that MSE is the workhorse data-fitting term in a large family of methods.[1] Adding L2 to MSE produces ridge regression, which slightly biases coefficient estimates toward zero in exchange for substantially lower variance. This trade-off, again, follows directly from the bias-variance decomposition.[5]
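
The shrinkage effect of the L2 penalty can be seen with the closed-form ridge solution. The sketch below assumes no intercept term, synthetic data, and an arbitrary penalty of λ = 10; it is meant only to illustrate the table row, not to replace a library implementation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||b||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)  # a larger lam shrinks coefficients toward zero

print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```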

## MSE in Image Quality and PSNR

In image processing and computer vision, MSE is the basis for the peak signal-to-noise ratio (PSNR), which compares a reconstructed or compressed image to a reference. PSNR is defined as:

**PSNR = 20 · log₁₀(MAX / √MSE)**

where MAX is the maximum possible pixel value (for example, 255 for 8-bit images). PSNR is expressed in decibels and increases as MSE decreases.[15] Typical lossy image compression yields PSNR values between 30 and 50 dB, with higher numbers indicating closer reconstruction.[15]
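
As a quick illustration of the formula, the helper below converts an MSE value into PSNR. The function name and the choice to return infinity for identical images are assumptions of this sketch.

```python
import math

def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in decibels for a given MSE."""
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 20.0 * math.log10(max_value / math.sqrt(mse))

print(round(psnr(1.0), 2))    # 48.13
print(round(psnr(100.0), 2))  # 28.13
```

A hundredfold increase in MSE lowers PSNR by exactly 20 dB, which follows from the logarithm in the definition.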

Despite its widespread use, PSNR (and MSE on raw pixels) is a weak proxy for perceptual quality. Two images with the same MSE can look very different to a human observer because MSE treats every pixel independently and ignores spatial structure, edges, and texture.[9] For example, a small global brightness shift can produce a higher MSE than a large local distortion that would be more obvious to viewers. This limitation has motivated the development of perceptual metrics:[9]

| Metric | Approach | Strengths |
|---|---|---|
| MSE / PSNR | Pixel-wise squared error | Simple, fast, differentiable, basis for PSNR |
| SSIM | Compares luminance, contrast, structure in local windows | Better correlation with perceived quality |
| MS-SSIM | Multi-scale extension of SSIM | Robust across image resolutions |
| LPIPS | Distance in features of a pretrained deep network | Strong correlation with human judgments |
| FID and KID | Distribution-level metrics for generative models | Captures realism and diversity, not pixel exactness |

In modern image restoration and generative pipelines, MSE is often combined with perceptual losses (for example, a weighted sum of MSE and LPIPS) so the model benefits from both pixel accuracy and perceptual realism.

## How is MSE used in modern deep learning?

MSE remains a core component of many state-of-the-art systems even outside classical regression.

### Diffusion Models

[Diffusion models](/wiki/diffusion_model) such as DDPM (Ho et al., 2020), Stable Diffusion, Imagen, and DALL-E 2 use a remarkably simple training objective. The model learns to predict the noise that was added to a clean sample at a randomly chosen timestep, and the loss is the MSE between the true noise and the predicted noise:

**L_simple = E[ \|\|ε − εθ(xₜ, t)\|\|² ]**

The DDPM authors reported that this simplified, unweighted objective worked better in practice than the full variational bound, writing that "it is beneficial to sample quality (and simpler to implement) to train on the following variant of the variational bound" before introducing L_simple.[8] This simplification of the original variational lower bound was one of the key contributions of the DDPM paper and is often credited with making large-scale diffusion training practical.[8] Despite the complexity of the underlying generative model, the practical training loop is essentially a giant MSE regression problem.[8]
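
The shape of this objective can be sketched with NumPy. The forward-noising formula xₜ = √ᾱₜ·x₀ + √(1 − ᾱₜ)·ε follows the DDPM formulation and is not spelled out in this article; the random ᾱ values stand in for a real noise schedule, and `zero_predictor` is a hypothetical placeholder for the trained network εθ.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sample(x0, alpha_bar, eps):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def simple_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise, averaged over the batch."""
    return np.mean(np.sum((eps - eps_pred) ** 2, axis=1))

x0 = rng.normal(size=(8, 16))                # a batch of 8 toy "clean" samples
alpha_bar = rng.uniform(0.05, 0.95, (8, 1))  # stand-in for the schedule at random timesteps
eps = rng.normal(size=x0.shape)              # the noise the model must recover
x_t = noisy_sample(x0, alpha_bar, eps)

def zero_predictor(x_t, alpha_bar):
    """Hypothetical placeholder for eps_theta; a real model is a trained network."""
    return np.zeros_like(x_t)

print(simple_loss(eps, zero_predictor(x_t, alpha_bar)))  # loss of an uninformed predictor
print(simple_loss(eps, eps))                             # 0.0 for a perfect prediction
```

In a real training loop the predictor's parameters would be updated by gradient descent on this loss, with fresh timesteps and noise drawn for every batch.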

### Autoencoders

[Autoencoders](/wiki/autoencoder) and variational autoencoders (VAEs) often use MSE as the reconstruction loss between input and reconstructed output. For continuous-valued inputs such as natural images or audio waveforms, MSE corresponds to a Gaussian likelihood on the output and works well in practice.[3] For binary or near-binary inputs, binary cross-entropy is sometimes used instead.

### Self-supervised and Representation Learning

Methods such as masked autoencoders for vision (also abbreviated MAE, not to be confused with mean absolute error) use MSE on the reconstructed pixels of masked image patches as the pretraining objective. Some self-supervised models for speech and other modalities, such as data2vec, regress continuous teacher representations with MSE-style (L2 or Smooth L1) losses. MSE is also used as a feature distillation loss when matching student features to teacher features in [knowledge distillation](/wiki/knowledge_distillation).

### Reinforcement Learning

In value-based [reinforcement learning](/wiki/reinforcement_learning), the temporal-difference error in algorithms such as DQN is often optimized using a Huber-like or MSE loss between the predicted Q-value and the target Q-value. Pure MSE was used in the original DQN, while later work generally switched to Huber loss for stability.[10]

### Regression Heads in Foundation Models

Many foundation models add task-specific regression heads (for example, predicting bounding-box coordinates, depth values, or continuous attributes) trained with MSE or Huber loss on top of pretrained backbones. This pattern appears in object detection, depth estimation, pose estimation, and physical-property prediction.

## Numerical and Practical Considerations

Using MSE effectively requires attention to several practical issues:

| Concern | Recommendation |
|---|---|
| Target scaling | Standardize or normalize targets so that the scale of MSE is comparable across targets. Without scaling, large-magnitude targets dominate gradients and small-magnitude targets are essentially ignored. |
| Feature scaling | Scale input features to similar ranges. Although MSE is computed on outputs, gradient magnitudes through the network depend on input scale. |
| Numerical precision | In half-precision (float16) floating point, squaring very small errors can underflow and squaring large errors can overflow. Mixed-precision training frameworks usually keep loss accumulation in float32. |
| Sample weighting | When data is imbalanced or some observations are more reliable, use sample weights so MSE is a weighted average of squared errors. |
| Reduction mode | Frameworks let you choose whether MSE is averaged across all elements, summed, or returned per-element. The choice affects gradient magnitude and the appropriate learning rate. |
| Multi-output regression | When predicting several targets at once, MSE is averaged across both samples and outputs by default. Consider per-output normalization if outputs have very different scales. |
| Mini-batch noise | Stochastic gradients of MSE on small batches can be noisy. Larger batches or gradient accumulation reduce variance at the cost of compute. |
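
The reduction-mode and sample-weighting rows can be illustrated with a small standard-library helper. `reduce_mse` is a hypothetical function written for this sketch, not any framework's API, and normalizing the weighted mean by the sum of the weights is one common convention among several.

```python
def squared_errors(y_true, y_pred):
    return [(y - p) ** 2 for y, p in zip(y_true, y_pred)]

def reduce_mse(y_true, y_pred, reduction="mean", sample_weight=None):
    """Hypothetical helper mirroring the 'mean' / 'sum' / 'none' reduction options."""
    errors = squared_errors(y_true, y_pred)
    if sample_weight is None:
        sample_weight = [1.0] * len(errors)
    weighted = [w * e for w, e in zip(sample_weight, errors)]
    if reduction == "none":
        return weighted
    if reduction == "sum":
        return sum(weighted)
    if reduction == "mean":
        return sum(weighted) / sum(sample_weight)  # weighted average
    raise ValueError(f"unknown reduction: {reduction!r}")

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(reduce_mse(y_true, y_pred))                   # 0.375
print(reduce_mse(y_true, y_pred, reduction="sum"))  # 1.5
# Down-weighting the last (largest) error lowers the weighted MSE.
print(reduce_mse(y_true, y_pred, sample_weight=[1, 1, 1, 0.25]))
```

Because the `'sum'` result here is four times the `'mean'` result, switching reduction modes without adjusting the learning rate changes the effective step size by the same factor.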

## How do you compute MSE in Python?

MSE is available in all major machine learning frameworks.

### scikit-learn

```python
from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # Output: 0.375
```

scikit-learn's `mean_squared_error` accepts `sample_weight` for weighted MSE and `multioutput` for multi-target regression. It can return per-output errors via `multioutput='raw_values'`. As of scikit-learn 1.4, the historical `squared=False` argument that returned RMSE was deprecated and replaced by a dedicated `root_mean_squared_error` function; the `squared` parameter was removed in version 1.6.[11]

### PyTorch

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

loss = loss_fn(y_pred, y_true)
print(loss.item())  # Output: 0.375
```

[PyTorch](/wiki/pytorch)'s `nn.MSELoss` supports three reduction modes: `'mean'` (default, computes MSE), `'sum'` (returns total squared error), and `'none'` (returns per-element squared errors). The functional form `torch.nn.functional.mse_loss` is also available for use without instantiating a module.[12]

### TensorFlow / Keras

```python
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
y_true = [[3.0, -0.5], [2.0, 7.0]]
y_pred = [[2.5, 0.0], [2.0, 8.0]]

loss = loss_fn(y_true, y_pred)
print(loss.numpy())  # Output: 0.375
```

[Keras](/wiki/keras) (bundled with [TensorFlow](/wiki/tensorflow) as `tf.keras`) offers `MeanSquaredError` as both a loss class and a metric class (`tf.keras.metrics.MeanSquaredError`), and supports configurable reduction options including `'sum_over_batch_size'`, `'sum'`, and `'none'`. The string shortcut `'mse'` works wherever a loss is expected, for example `model.compile(loss='mse', optimizer='adam')`.[13]

### JAX

```python
import jax.numpy as jnp

def mse_loss(params, x, y, model_fn):
    preds = model_fn(params, x)
    return jnp.mean((preds - y) ** 2)
```

In [JAX](/wiki/jax), MSE is typically written as a plain function and combined with `jax.value_and_grad` for automatic differentiation. Libraries such as Optax provide `optax.l2_loss` (which returns half of the squared error per element) for convenience.
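
A minimal sketch of that pattern follows. The `linear_model` function, the toy data, and the step size of 0.05 are assumptions made for illustration, not part of any library.

```python
import jax
import jax.numpy as jnp

def linear_model(params, x):
    # Hypothetical model: a single linear layer, params = (weights, bias).
    w, b = params
    return x @ w + b

def mse_loss(params, x, y, model_fn):
    preds = model_fn(params, x)
    return jnp.mean((preds - y) ** 2)

# Toy data following y = 2x; parameters start at zero.
x = jnp.array([[1.0], [2.0], [3.0]])
y = jnp.array([2.0, 4.0, 6.0])
params = (jnp.array([0.0]), jnp.array(0.0))

# value_and_grad differentiates with respect to the first argument (params).
loss, grads = jax.value_and_grad(mse_loss)(params, x, y, linear_model)

# One plain gradient-descent step over the parameter tuple.
new_params = jax.tree_util.tree_map(lambda p, g: p - 0.05 * g, params, grads)
new_loss = mse_loss(new_params, x, y, linear_model)

print(loss, grads, new_loss)
```

The gradients come back in the same tuple structure as `params`, so the update can be applied leaf by leaf.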

### NumPy (from scratch)

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))  # Output: 0.375
```
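
One caveat with a hand-written version comes from NumPy itself: if one array has shape `(n,)` and the other `(n, 1)`, the subtraction broadcasts to an `(n, n)` matrix and the mean is silently wrong. A defensive variant might look like the sketch below, where the name `checked_mse` is hypothetical.

```python
import numpy as np

def checked_mse(y_true, y_pred):
    """MSE that rejects mismatched shapes instead of broadcasting them."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

correct = checked_mse(y_true, y_pred)

# Unchecked arithmetic with a column vector averages a 4x4 matrix of differences.
broadcast = np.mean((y_true - y_pred.reshape(-1, 1)) ** 2)

print(correct, broadcast)
```

The two printed values differ, even though both calls use the same numbers.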

### Framework Comparison

| Framework | API | Default Reduction | Sample Weights | RMSE Variant |
|---|---|---|---|---|
| scikit-learn | `mean_squared_error(y_true, y_pred)` | mean over all elements | `sample_weight` | `root_mean_squared_error` (1.4+) |
| PyTorch | `nn.MSELoss()` or `F.mse_loss` | `'mean'` | manual via element-wise multiply | manual `torch.sqrt` |
| TensorFlow / Keras | `tf.keras.losses.MeanSquaredError()` | `'sum_over_batch_size'` | `sample_weight` | `RootMeanSquaredError` metric |
| JAX / Optax | hand-written `jnp.mean((preds - y) ** 2)`, or `optax.l2_loss` (half the squared error) | none for `optax.l2_loss` (per-element; apply `jnp.mean` or `jnp.sum`) | manual broadcast | manual `jnp.sqrt` |

## Common Pitfalls

A few mistakes show up frequently when teams adopt MSE in production:

| Pitfall | Why It Matters | Fix |
|---|---|---|
| Reporting MSE without RMSE | Squared units are hard to interpret. | Report RMSE alongside MSE for stakeholder communication. |
| Comparing MSE across different target scales | A model whose target is in the millions will show a far larger MSE than one whose target is in the tens, even if it is better in relative terms. | Use NMSE, R², or normalize targets before training. |
| Ignoring outliers | A few extreme points can dominate the loss and hide systematic errors on the rest of the data. | Plot residuals, consider robust losses, and inspect data quality. |
| Using MSE for classification | With sigmoid or softmax outputs, MSE gradients shrink when predictions saturate, which slows training and tends to give poorly calibrated probabilities. | Use cross-entropy for classification tasks. |
| Mixing reduction modes between training and metrics | Sum and mean reductions differ by a factor of n, which changes the effective learning rate and makes logged values incomparable. | Pick one convention and stick with it. |
| Forgetting target standardization | Large-magnitude targets blow up gradients; in multi-output models, small-magnitude targets contribute little to the loss and are effectively ignored. | Standardize targets and inverse-transform predictions for reporting. |
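
The scale pitfall is easy to reproduce. In the sketch below the same predictions are expressed in two different units. The data and the factor of 1e5 are made up, and NMSE is taken to mean MSE divided by the variance of the target, which is one common convention among several.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative targets and predictions on a small scale.
y_small = np.array([10.0, 20.0, 30.0, 40.0])
pred_small = np.array([12.0, 18.0, 33.0, 39.0])

# The same problem in different units (hypothetical scale factor).
scale = 1e5
y_large, pred_large = y_small * scale, pred_small * scale

mse_small, mse_large = mse(y_small, pred_small), mse(y_large, pred_large)
rmse_small, rmse_large = np.sqrt(mse_small), np.sqrt(mse_large)
nmse_small = mse_small / np.var(y_small)  # assumed definition: MSE / Var(y)
nmse_large = mse_large / np.var(y_large)
r2_small, r2_large = r2(y_small, pred_small), r2(y_large, pred_large)

print(mse_small, mse_large)    # differ by a factor of scale**2
print(rmse_small, rmse_large)  # differ by a factor of scale
print(nmse_small, nmse_large)  # identical
print(r2_small, r2_large)      # identical
```

MSE grows with the square of the unit change and RMSE grows linearly, while NMSE and R² are unchanged, which is why they are the safer choice for comparisons across targets.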

## Explain Like I'm 5 (ELI5)

Imagine you and your friends are trying to guess how many candies are in a jar. Each time someone guesses, they might be a little bit wrong. Mean Squared Error is a way to measure how wrong your guesses are, on average.

Here is what you do. Take the difference between each guess and the actual number of candies. Then multiply each difference by itself (that is called squaring it). Finally, add up all those squared numbers and divide by how many guesses there were.

Why do we square the differences? Two reasons. First, it gets rid of the minus sign, so it does not matter if you guessed too high or too low. Second, it makes bigger mistakes count a lot more. If you were off by 2, the squared error is 4. If you were off by 10, the squared error is 100. That is 25 times worse, not just 5 times worse.

If everyone's guesses are close to the real number, the MSE will be small. If the guesses are way off, the MSE will be large. So MSE tells us how good our guessing method is, and smaller is better.

## See Also

- [Mean Absolute Error (MAE)](/wiki/mean_absolute_error_mae)
- [Huber loss](/wiki/huber_loss)
- [Cross-entropy loss](/wiki/cross-entropy)
- [Loss function](/wiki/loss_function)
- [Linear regression](/wiki/linear_regression)
- [Ridge regression](/wiki/ridge_regression)
- [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)
- [Gauss-Markov theorem](/wiki/gauss_markov_theorem)
- [Gradient descent](/wiki/gradient_descent)
- [Diffusion model](/wiki/diffusion_model)
- [Autoencoder](/wiki/autoencoder)

## References

1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. Chapter 2.4, "Statistical Decision Theory."
2. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 1.5.5, "Loss functions for regression," which shows the conditional mean E[t|x] minimizes the expected squared loss.
3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Section 5.5.1, "Maximum Likelihood Estimation," and Section 6.2.2, "Mean Squared Error."
4. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.). Springer. Chapter 2.2, "Assessing Model Accuracy."
5. Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press. Section 5.7.1, "MLE for Linear Regression."
6. Lehmann, E. L., & Casella, G. (1998). *Theory of Point Estimation* (2nd ed.). Springer. Chapter 1, "Preparations."
7. Stigler, S. M. (1986). *The History of Statistics: The Measurement of Uncertainty before 1900*. Harvard University Press. Chapters on Legendre (1805), Gauss (1809), Laplace, and the least-squares priority dispute.
8. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." *Advances in Neural Information Processing Systems* 33. arXiv:2006.11239. See Section 3.4 and Equation 14 for the simplified objective L_simple.
9. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). "Image Quality Assessment: From Error Visibility to Structural Similarity." *IEEE Transactions on Image Processing*, 13(4), 600-612.
10. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
11. scikit-learn documentation. "sklearn.metrics.mean_squared_error" and "sklearn.metrics.root_mean_squared_error" (added in 1.4). https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
12. PyTorch documentation. "MSELoss." https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
13. TensorFlow documentation. "tf.keras.losses.MeanSquaredError." https://www.tensorflow.org/api_docs/python/tf/keras/losses/MeanSquaredError
14. Wikipedia. "Mean squared error." https://en.wikipedia.org/wiki/Mean_squared_error
15. Wikipedia. "Peak signal-to-noise ratio." https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
