Mean Squared Error (MSE)

Mean Squared Error (MSE), also called mean squared deviation (MSD), is one of the most widely used metrics for evaluating the performance of regression models in machine learning and statistics. It measures the average of the squared differences between predicted values and actual (observed) values. Because it squares each error before averaging, MSE penalizes large errors more heavily than small ones, making it particularly sensitive to outliers. A lower MSE indicates better predictive accuracy, with a perfect model achieving an MSE of zero.

MSE serves a dual role in machine learning. It functions both as an evaluation metric (measuring how well a trained model performs on held-out data) and as a loss function (the objective that a model minimizes during training via gradient descent or similar optimization algorithms). It also underpins many other quantities used in modern AI, including the peak signal-to-noise ratio (PSNR) for image reconstruction, the simplified training objective for diffusion models, and the reconstruction loss for autoencoders.

Historical Background

The idea of minimizing a sum of squared residuals is more than two centuries old. The method of least squares, which is the optimization principle that gives rise to MSE, was first published by Adrien-Marie Legendre in 1805 in his work Nouvelles méthodes pour la détermination des orbites des comètes (New Methods for the Determination of Comet Orbits). Carl Friedrich Gauss claimed to have used the method as early as 1795 and published a probabilistic justification for it in 1809 in Theoria motus corporum coelestium, where he showed that least squares yields the most probable estimate when observation errors follow a normal distribution. Pierre-Simon Laplace independently extended the theory in 1810 by connecting it to the central limit theorem.

The Gauss-Markov theorem, formalized in the 19th century, established that under standard assumptions (linear model, zero-mean errors, equal variance, uncorrelated errors), the ordinary least squares estimator has the lowest variance among all linear unbiased estimators. This result, often abbreviated as BLUE (Best Linear Unbiased Estimator), gave MSE-based estimation a strong theoretical foundation that carried into the 20th century.

In the 20th century, MSE became central to estimation theory through the work of statisticians such as Jerzy Neyman, Egon Pearson, Ronald Fisher, and Erich Lehmann. Fisher's development of maximum likelihood estimation in the 1920s reinforced the connection between MSE and probabilistic modeling under Gaussian assumptions. The rise of digital computing in the second half of the century made least-squares regression practical at scale. When neural networks trained with backpropagation emerged in the 1980s, MSE was adopted as the natural training objective for continuous-output models, a role it continues to play in modern deep learning.

Mathematical Definition

Given a dataset of n observations, where y_i is the actual value and ŷ_i (y-hat) is the predicted value for the i-th observation, the Mean Squared Error is defined as:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

In expanded form:

MSE = (1/n) [(y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ... + (yₙ − ŷₙ)²]

Where:

Symbol	Meaning
n	Number of data points (observations)
yᵢ	Actual (observed) value for the i-th data point
ŷᵢ	Predicted value for the i-th data point
(yᵢ − ŷᵢ)	Residual (error) for the i-th data point
Σ	Summation over all n data points

The formula first computes the residual for each observation, squares it to eliminate negative signs and emphasize larger errors, then averages all squared residuals.

In the context of estimation theory, for an estimator θ̂ of a true parameter θ, MSE is defined as the expected value of the squared error:

MSE(θ̂) = E[(θ̂ − θ)²]

This population-level definition treats MSE as an expectation over the sampling distribution of the estimator, while the sample-level definition is a finite arithmetic mean over data points. For independent, identically distributed data, the sample average of squared prediction errors converges to the expected squared prediction error as the sample size grows, by the law of large numbers.

Worked Numerical Example

Suppose a model predicts house prices in thousands of dollars. The actual prices for five houses are 250, 300, 180, 420, and 350. The model predicts 240, 310, 200, 400, and 360. The residuals are 10, -10, -20, 20, and -10. Squaring gives 100, 100, 400, 400, and 100. The sum is 1100. Dividing by n equal to 5 yields an MSE of 220 (in squared thousands of dollars). The corresponding RMSE is the square root of 220, approximately 14.83 thousand dollars, which is the figure most practitioners would report.
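The arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch of the worked example; the variable names are illustrative.

```python
import math

actual = [250, 300, 180, 420, 350]      # observed prices, in thousands of dollars
predicted = [240, 310, 200, 400, 360]   # model predictions, same units

# Residual = actual minus predicted, matching the convention (y_i - y_hat_i).
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
squared = [r ** 2 for r in residuals]

mse = sum(squared) / len(squared)   # average of the squared residuals
rmse = math.sqrt(mse)               # back in thousands of dollars

print(residuals)        # [10, -10, -20, 20, -10]
print(mse)              # 220.0
print(round(rmse, 2))   # 14.83
```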

Key Properties

MSE has several mathematical properties that make it useful for model training and evaluation:

Property	Description
Non-negative	MSE is always greater than or equal to zero, since every squared term is non-negative. An MSE of zero indicates perfect predictions.
Differentiable	MSE is smooth and continuously differentiable everywhere, which makes it well-suited for gradient descent optimization. The partial derivative of MSE with respect to an individual prediction ŷᵢ is −(2/n)(yᵢ − ŷᵢ).
Convex (for linear models)	When used with linear regression or other linear models, MSE is a convex function of the parameters, so every local minimum is a global minimum (and the minimizer is unique when the feature matrix has full column rank). Gradient-based optimization with a suitable step size therefore converges to an optimal solution.
Non-convex (for neural networks)	For neural networks with nonlinear activation functions, the MSE loss surface becomes non-convex and may contain local minima and saddle points.
Scale-dependent	MSE is expressed in the squared units of the target variable, making direct interpretation less intuitive. For example, if the target is measured in dollars, MSE is in dollars squared.
Sensitive to outliers	Because errors are squared, a single large error has a disproportionate effect on the overall MSE. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1.
Lipschitz-bounded gradient (on bounded domains)	The MSE gradient with respect to predictions is linear in the error and therefore Lipschitz continuous (with constant 2/n), a condition used in the convergence guarantees of many optimization algorithms. When predictions and targets lie in a bounded set, the gradient itself is also bounded.
Twice differentiable	The Hessian of MSE with respect to predictions is constant, which simplifies second-order optimization methods such as Newton's method and natural gradient descent.

Bias-Variance Decomposition

One of the most important theoretical results involving MSE is the bias-variance decomposition. For an estimator θ̂ of a true parameter θ, the MSE can be decomposed as:

MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]²

Where:

Var(θ̂) is the variance of the estimator, measuring how much the predictions fluctuate across different training sets.
Bias(θ̂) is the bias, measuring the systematic difference between the average prediction and the true value.

In a supervised learning context with irreducible noise σ², this extends to:

Expected MSE = Bias² + Variance + σ² (irreducible noise)

This decomposition is central to understanding the bias-variance tradeoff. A simple model (for example, linear regression on complex data) may have high bias but low variance, leading to underfitting. A highly complex model (for example, a deep neural network with many parameters) may have low bias but high variance, leading to overfitting. The goal is to find a model that balances both components to minimize overall MSE.

For an unbiased estimator (where Bias = 0), the MSE equals the variance. This is why unbiased estimators are sometimes favored in statistics, though biased estimators with lower variance can achieve a lower overall MSE. A classic illustration is shrinkage, where slightly biased estimators such as ridge regression or the James-Stein estimator can produce lower MSE than the unbiased OLS estimator, particularly in high-dimensional settings.
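The decomposition and the benefit of shrinkage can be checked with a small simulation. The sketch below assumes a toy setting that is not from the text: the parameter is the mean of a normal distribution, the unbiased estimator is the sample mean, and the biased estimator multiplies it by a hypothetical shrinkage factor of 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0               # true mean (assumed for the simulation)
n, trials = 10, 100_000

# Each row is one simulated dataset of n noisy observations of theta.
samples = rng.normal(loc=theta, scale=3.0, size=(trials, n))

def decompose(estimates, truth):
    """Return (MSE, variance, squared bias) of a collection of estimates."""
    mse = np.mean((estimates - truth) ** 2)
    variance = np.var(estimates)                  # population variance (ddof=0)
    bias_sq = (np.mean(estimates) - truth) ** 2
    return mse, variance, bias_sq

unbiased = samples.mean(axis=1)   # sample mean: unbiased
shrunk = 0.8 * unbiased           # shrunk toward zero: biased, lower variance

for name, est in [("sample mean", unbiased), ("shrunk mean", shrunk)]:
    mse, var, bias_sq = decompose(est, theta)
    print(f"{name}: MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias_sq:.3f}")
```

In this setting the theoretical values are an MSE of 0.9 for the sample mean and 0.576 + 0.16 = 0.736 for the shrunk estimator, so the biased estimator wins. The simulated numbers should land close to these but will vary with the random seed.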

Connection to Maximum Likelihood Estimation

Minimizing MSE is mathematically equivalent to maximum likelihood estimation (MLE) under the assumption that the errors follow a Gaussian (normal) distribution. This connection provides a probabilistic justification for using MSE.

Consider a model yᵢ = f(xᵢ; θ) + εᵢ, where εᵢ ~ N(0, σ²). Under this assumption, each observation yᵢ follows a normal distribution:

p(yᵢ | xᵢ, θ) = (1 / √(2πσ²)) · exp(−(yᵢ − f(xᵢ; θ))² / (2σ²))

The log-likelihood for the entire dataset is:

log L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(yᵢ − f(xᵢ; θ))²

Since σ² is a constant with respect to θ, maximizing the log-likelihood is equivalent to minimizing Σ(yᵢ − f(xᵢ; θ))², which is proportional to MSE. This means that when the noise in the data is truly Gaussian, MSE is the theoretically optimal loss function. When the noise distribution has heavier tails (more outliers than a Gaussian), alternatives like mean absolute error or Huber loss may be more appropriate.

The Gaussian-MSE connection also clarifies the meaning of the predictions a model produces. Minimizing MSE drives the model output toward the conditional expectation E[y | x] of the target given the input. Minimizing MAE, by contrast, drives the model output toward the conditional median. Choosing between MSE and MAE is therefore not just a robustness choice but a statement about which summary of the conditional distribution the model is predicting.
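The mean-versus-median distinction is easy to verify numerically in the simplest case, where the model is a single constant prediction. The sketch below uses made-up targets with one outlier and a brute-force grid search; it illustrates the claim and is not an efficient way to compute either statistic.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # illustrative targets with one outlier
candidates = np.linspace(0.0, 100.0, 10001)   # constant predictions c, step 0.01

# Evaluate both losses for every candidate constant prediction.
errors = y[None, :] - candidates[:, None]     # shape (num_candidates, num_targets)
mse_per_c = np.mean(errors ** 2, axis=1)
mae_per_c = np.mean(np.abs(errors), axis=1)

best_mse_c = candidates[np.argmin(mse_per_c)]
best_mae_c = candidates[np.argmin(mae_per_c)]

print(round(float(best_mse_c), 2), np.mean(y))     # MSE minimizer matches the mean (22)
print(round(float(best_mae_c), 2), np.median(y))   # MAE minimizer matches the median (3)
```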

Gauss-Markov Theorem and Best Linear Unbiased Estimator

The Gauss-Markov theorem is a foundational result that gives MSE-based estimation a strong theoretical justification in the context of linear regression. The theorem states that under the following assumptions, the ordinary least squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE):

Assumption	Meaning
Linearity	The relationship between predictors and target is linear in parameters.
Zero-mean errors	E[εᵢ] = 0 for all observations.
Homoscedasticity	All errors have the same variance, Var(εᵢ) = σ².
No autocorrelation	Errors are uncorrelated, Cov(εᵢ, εⱼ) = 0 for i ≠ j.
Exogeneity	Errors are uncorrelated with the predictors.

Under these conditions, OLS achieves the lowest variance among all linear unbiased estimators. Notably, the theorem does not require errors to be Gaussian; it only requires the five conditions above. When the errors are additionally Gaussian, OLS coincides with the maximum likelihood estimator and has minimum variance among all unbiased estimators, linear or not. A separate generalization, Aitken's theorem, extends the result to errors with unequal variances or correlations, where generalized least squares (GLS) takes the place of OLS as the BLUE.

MSE as a Loss Function for Regression

In supervised learning, MSE is the default loss function for regression tasks. During training, the model parameters are adjusted to minimize the MSE over the training data. The gradient of MSE with respect to the model's predicted output ŷᵢ is:

∂MSE/∂ŷᵢ = −(2/n)(yᵢ − ŷᵢ)

This gradient has a useful property: it scales linearly with the error. When the prediction is far from the true value, the gradient is large, pushing the model to make a bigger correction. As the prediction approaches the target, the gradient shrinks toward zero, allowing the model to fine-tune near the optimum. This adaptive behavior makes MSE effective for training with gradient descent.
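A quick way to gain confidence in the gradient formula is to compare it against central finite differences. The sketch below does this with NumPy on four illustrative data points; the helper names mse and mse_grad are assumptions made for this example.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    """Analytic gradient of MSE with respect to each prediction."""
    return -(2.0 / y.size) * (y - y_hat)

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Central finite differences, perturbing one prediction at a time.
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(y_hat.size):
    step = np.zeros_like(y_hat)
    step[i] = eps
    numeric[i] = (mse(y, y_hat + step) - mse(y, y_hat - step)) / (2 * eps)

analytic = mse_grad(y, y_hat)   # values: -0.25, 0.25, 0, 0.5
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```

The largest gradient entry belongs to the prediction with the largest error, which is the linear scaling described above.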

For linear regression, minimizing MSE yields the Ordinary Least Squares (OLS) solution. The closed-form solution is θ = (XᵀX)⁻¹Xᵀy, where X is the feature matrix and y is the target vector, provided XᵀX is invertible (that is, the features are not perfectly collinear). This solution is guaranteed to be the global minimum because the MSE loss surface is convex for linear models. For neural networks and other nonlinear models, no closed-form solution exists in general and iterative optimization with backpropagation is used instead.
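As a minimal sketch of the closed-form solution, the example below fits synthetic data with NumPy. The data, the coefficient values, and the noise level are assumptions for the demonstration. It solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, and compares the result with np.linalg.lstsq, which is usually the safer choice when XᵀX is ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 features
true_theta = np.array([1.0, 2.0, -3.0])                      # assumed for the demo
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) theta = X^T y without computing the inverse.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver based on a matrix factorization of X itself.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(theta_normal, 2))
print(np.allclose(theta_normal, theta_lstsq))   # True
```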

MSE for Classification: Why Cross-Entropy Is Preferred

Although MSE can technically be used for classification tasks, cross-entropy loss is strongly preferred for several reasons:

Issue	MSE for Classification	Cross-Entropy for Classification
Gradient behavior	When a sigmoid or softmax output is confidently wrong, the gradient of MSE nearly vanishes due to the flat regions of the sigmoid curve. This leads to extremely slow learning.	The logarithm in cross-entropy cancels the exponential in the sigmoid/softmax, producing a clean gradient proportional to the error (pᵢ − yᵢ). Learning remains fast even when the model is confidently wrong.
Probabilistic foundation	MSE assumes Gaussian-distributed errors, which does not match the Bernoulli or categorical distribution underlying classification.	Cross-entropy arises naturally from maximum likelihood estimation under Bernoulli (binary) or categorical (multiclass) distributions.
Loss surface	MSE creates a non-convex loss surface when combined with sigmoid or softmax outputs, making optimization harder.	Cross-entropy combined with sigmoid or softmax produces a convex loss surface (for logistic regression), ensuring reliable convergence.
Penalty for confidence	MSE penalizes confident wrong predictions only slightly more than uncertain ones.	Cross-entropy imposes a sharply increasing penalty as the model becomes more confidently wrong, driving faster corrections.

For these reasons, cross-entropy is the standard loss function for both binary and multiclass classification in modern deep learning.
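The gradient behavior in the first table row can be made concrete for a single sigmoid output. The sketch below compares the derivative of each loss with respect to the logit z when the model is confidently wrong. The function names are illustrative, and the squared error is taken without a 1/2 factor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz of (p - y)^2 with p = sigmoid(z)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce_wrt_logit(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

z, y = -10.0, 1.0   # confidently wrong: p is about 4.5e-5 but the label is 1
print(grad_mse_wrt_logit(z, y))   # about -9e-5: almost no learning signal
print(grad_ce_wrt_logit(z, y))    # about -1.0: strong learning signal
```

The extra factor p(1 − p) in the MSE gradient is what flattens the signal in the saturated regions of the sigmoid.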

Sensitivity to Outliers

MSE's quadratic penalty means it is highly sensitive to outliers. A single data point with a large error can dominate the total loss and disproportionately influence the model's parameters. For example, if most errors are around 1 but one outlier has an error of 100, the squared error for the outlier (10,000) is 10,000 times larger than the squared error of a typical point (1). The model may distort its predictions to reduce this one large error at the expense of fitting the majority of the data well.

This sensitivity is both a strength and a weakness. In applications where large errors are genuinely costly (for example, financial forecasting or safety-critical systems), penalizing them heavily is desirable. In applications where outliers represent noise or data entry errors, MSE can mislead the model. Practitioners should consider the data distribution and application requirements before choosing MSE over more robust alternatives.
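To put numbers on this effect, the sketch below builds a hypothetical set of 100 errors, 99 of magnitude 1 and one of magnitude 100, and compares MSE with MAE.

```python
import numpy as np

errors = np.ones(100)   # 99 typical errors of magnitude 1 ...
errors[-1] = 100.0      # ... and one outlier of magnitude 100

mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
outlier_share = errors[-1] ** 2 / np.sum(errors ** 2)

print(mse)                       # 100.99
print(mae)                       # 1.99
print(round(outlier_share, 4))   # 0.9902
```

A single point out of 100 accounts for about 99% of the total squared error, while MAE moves only from 1.0 to 1.99.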

Common Mitigations

Several practical strategies can reduce the harmful influence of outliers on MSE-based training:

Strategy	Description
Outlier removal	Identify and remove data points outside a chosen percentile or beyond a threshold (for example, 3 standard deviations from the mean).
Target transformation	Apply a log, square root, or Box-Cox transformation to compress the tail of a heavy-tailed target before training.
Robust losses	Switch to Huber loss, Tukey's biweight, or log-cosh loss, which all dampen the influence of large errors.
Trimmed loss	Sort residuals and drop the largest k percent before averaging, similar to a trimmed mean.
Sample weighting	Down-weight suspected outliers in the loss using sample weights, reducing their gradient contribution.
Quantile regression	Optimize a pinball loss to predict a target quantile rather than the mean, which is naturally robust to outliers.

MSE vs. MAE vs. RMSE vs. Huber Loss

Several alternative loss functions and metrics are commonly compared with MSE:

Metric / Loss	Formula	Outlier Sensitivity	Differentiable at 0	Units	When to Use
MSE	(1/n) Σ(yᵢ − ŷᵢ)²	High (quadratic penalty)	Yes	Squared units of target	Clean data, large errors are costly, need smooth optimization
MAE	(1/n) Σ|yᵢ − ŷᵢ|	Low (linear penalty)	No (kink at 0)	Same units as target	Noisy data with outliers, median prediction desired
RMSE	√[(1/n) Σ(yᵢ − ŷᵢ)²]	High (same as MSE)	Yes	Same units as target	Reporting and interpretation (units match target), comparing models
Huber Loss	Quadratic for |error| ≤ δ, linear for |error| > δ	Medium (configurable via δ)	Yes	Mixed (squared units below δ, linear above)	Data with some outliers but you still want to penalize moderate errors quadratically
MAPE	(100/n) Σ |(yᵢ − ŷᵢ)/yᵢ|	Low	No	Percentage	Forecasting where relative errors matter, but undefined when yᵢ = 0
sMAPE	(100/n) Σ |yᵢ − ŷᵢ| / ((|yᵢ| + |ŷᵢ|)/2)	Low	No	Percentage	Symmetric variant of MAPE that handles small or zero targets better
MSLE	(1/n) Σ (log(1 + yᵢ) − log(1 + ŷᵢ))²	Low for large values	Yes	Squared log-units	Skewed targets such as counts or revenue spanning several orders of magnitude
R²	1 − (Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²)	High (driven by MSE)	Yes	Dimensionless	Reporting fraction of variance explained by the model

Key differences:

MSE vs. MAE: MSE penalizes large errors quadratically while MAE penalizes them linearly. MSE yields the conditional mean as the optimal prediction, whereas MAE yields the conditional median. MAE is more robust to outliers but produces non-smooth gradients (a kink at zero error), which can slow convergence during optimization.
MSE vs. RMSE: RMSE is simply the square root of MSE. Minimizing RMSE is mathematically equivalent to minimizing MSE since the square root is a monotonically increasing function. RMSE is preferred for reporting because its units match the target variable, making it more interpretable.
MSE vs. Huber Loss: Huber loss is a hybrid that behaves like MSE for small errors (below a threshold δ) and like MAE for large errors (above δ). It combines the smooth optimization of MSE with the outlier robustness of MAE. The trade-off is that Huber loss introduces an additional hyperparameter δ that must be tuned. A closely related variant, Smooth L1 (equivalent to Huber loss with δ = 1), is the standard bounding-box regression loss in object detection architectures such as Faster R-CNN and SSD.
MSE vs. MAPE and sMAPE: MAPE and sMAPE express error in percentage terms, which is intuitive in business and forecasting contexts. They are scale-free, but MAPE is undefined when the target is zero and asymmetric in its treatment of over- and under-prediction. sMAPE mitigates both issues, although it remains undefined when the target and the prediction are both zero, and its interpretation is slightly less intuitive.
MSE vs. MSLE: Mean Squared Logarithmic Error squares the difference of the log-transformed target and prediction. It is often used when the target spans several orders of magnitude (such as house prices, population counts, or revenue) and relative error matters more than absolute error: predicting twice the true value incurs roughly the same penalty whether the target is large or small. MSLE also penalizes under-prediction more heavily than over-prediction of the same absolute size.
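The piecewise behavior of Huber loss is short enough to write out directly. The sketch below uses the common convention with a factor of 1/2 on the quadratic branch, which keeps the two pieces continuous at δ. The function name and the test data are illustrative.

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(np.asarray(y) - np.asarray(y_hat))
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

y = np.array([0.0, 0.0, 0.0, 0.0])
y_hat = np.array([0.5, -1.0, 2.0, 10.0])   # the last prediction is an outlier

print(huber_loss(y, y_hat))          # 2.90625
print(np.mean((y - y_hat) ** 2))     # 26.3125 (MSE)
print(np.mean(np.abs(y - y_hat)))    # 3.375   (MAE)
```

With δ = 1 the outlier contributes 9.5 to the Huber sum, compared with 100 to the squared-error sum, which is why the Huber value stays close to the MAE.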

Normalized MSE (NMSE)

Normalized Mean Squared Error addresses MSE's scale-dependence by dividing by the variance of the observed data:

NMSE = MSE / Var(y) = [Σ(yᵢ − ŷᵢ)²] / [Σ(yᵢ − ȳ)²]

Where ȳ is the mean of the observed values. NMSE is dimensionless and scale-independent, making it useful for comparing model performance across different datasets or target variables. An NMSE of 1.0 means the model performs no better than simply predicting the mean, while values below 1.0 indicate useful predictive power.

NMSE is closely related to the coefficient of determination (R²), with R² = 1 − NMSE. NMSE is used in fields such as wireless communications, signal processing, and air quality modeling where comparing predictions across different scales is important.
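A minimal NumPy sketch of NMSE and its relationship to R² follows, reusing the house-price figures from the worked example. The helper name nmse is an assumption for this illustration.

```python
import numpy as np

def nmse(y, y_hat):
    """Sum of squared errors divided by the total sum of squares of y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

y = [250, 300, 180, 420, 350]
y_hat = [240, 310, 200, 400, 360]

score = nmse(y, y_hat)
r2 = 1.0 - score
baseline = nmse(y, [np.mean(y)] * len(y))   # always predicting the mean

print(round(score, 4))   # 0.0325
print(round(r2, 4))      # 0.9675
print(baseline)          # 1.0
```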

When to Use MSE vs. Alternatives

Choosing the right loss function depends on the data characteristics and application requirements:

Scenario	Recommended Metric	Reason
Clean data, Gaussian-distributed errors	MSE	Theoretically optimal under Gaussian noise assumption; smooth gradients enable fast convergence
Data with frequent outliers or heavy-tailed errors	MAE or Huber Loss	MSE would give disproportionate weight to outliers, distorting the model
Need interpretable error in original units	RMSE	Same units as target variable, easier to communicate to stakeholders
Comparing models across different scales or datasets	NMSE or R²	Scale-independent, allows fair comparison
Object detection bounding box regression	Huber Loss (Smooth L1)	Balances outlier robustness with smooth optimization during early training
Financial applications with asymmetric costs	Custom asymmetric loss	MSE treats over-predictions and under-predictions equally, which may not reflect real-world costs
Targets spanning orders of magnitude	MSLE	Penalizes relative error rather than absolute error
Time series forecasting at scale	MAPE or sMAPE	Scale-free percentage interpretation across many series
Quantile prediction (risk, intervals)	Pinball loss	MSE only produces conditional means and cannot directly target quantiles

Regularized Variants and Connections to Other Methods

MSE is rarely used in isolation in modern machine learning. It is almost always combined with regularization terms that constrain the model and improve generalization. The most common combinations are:

Method	Objective	Notes
Ordinary Least Squares (OLS)	min Σ(yᵢ − ŷᵢ)²	Pure MSE without regularization.
Ridge Regression	min Σ(yᵢ − ŷᵢ)² + λ Σ βⱼ²	Adds an L2 penalty on coefficients. Equivalent to MAP estimation under a Gaussian prior on parameters.
Lasso Regression	min Σ(yᵢ − ŷᵢ)² + λ Σ |βⱼ|	Adds an L1 penalty that performs feature selection. Equivalent to MAP estimation under a Laplace prior.
Elastic Net	min Σ(yᵢ − ŷᵢ)² + λ₁ Σ |βⱼ| + λ₂ Σ βⱼ²	Combines L1 and L2 penalties, useful for correlated features.
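Weight decay (deep learning)	Loss + λ ||θ||²	Equivalent to L2 regularization under plain (stochastic) gradient descent, up to a rescaling of λ by the learning rate. The two differ for adaptive optimizers such as Adam, which motivated decoupled weight decay (AdamW). Widely used in neural networks.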

These formulations show that MSE is the workhorse data-fitting term in a large family of methods. Adding L2 to MSE produces ridge regression, which slightly biases coefficient estimates toward zero in exchange for substantially lower variance. This trade-off, again, follows directly from the bias-variance decomposition.
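Ridge regression has a closed-form solution analogous to OLS, β = (XᵀX + λI)⁻¹Xᵀy, which makes the shrinkage effect easy to observe. The sketch below uses synthetic data and omits the intercept for brevity. The helper name ridge_fit and the coefficient values are assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2); lam = 0 recovers OLS."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=50)

# The coefficient norm shrinks toward zero as the penalty grows.
norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(lam, round(float(norms[-1]), 3))
```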

MSE in Image Quality and PSNR

In image processing and computer vision, MSE is the basis for the peak signal-to-noise ratio (PSNR), which compares a reconstructed or compressed image to a reference. PSNR is defined as:

PSNR = 20 · log₁₀(MAX / √MSE)

where MAX is the maximum possible pixel value (for example, 255 for 8-bit images). PSNR is expressed in decibels and increases as MSE decreases. Typical lossy image compression yields PSNR values between 30 and 50 dB, with higher numbers indicating closer reconstruction.
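A direct NumPy translation of the PSNR formula is sketched below. The function name and the tiny test image are illustrative. The cast to floating point before subtracting matters in practice, because differences of unsigned 8-bit arrays wrap around.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(max_value / np.sqrt(mse))

image = np.full((4, 4), 100, dtype=np.uint8)   # flat gray 8-bit image
shifted = image + 5                            # uniform brightness shift, MSE = 25

print(round(float(psnr(image, shifted)), 2))   # 34.15
print(psnr(image, image))                      # inf
```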

Despite its widespread use, PSNR (and MSE on raw pixels) is a weak proxy for perceptual quality. Two images with the same MSE can look very different to a human observer because MSE treats every pixel independently and ignores spatial structure, edges, and texture. For example, a small global brightness shift can produce a higher MSE than a large local distortion that would be more obvious to viewers. This limitation has motivated the development of perceptual metrics:

Metric	Approach	Strengths
MSE / PSNR	Pixel-wise squared error	Simple, fast, differentiable, basis for PSNR
SSIM	Compares luminance, contrast, structure in local windows	Better correlation with perceived quality
MS-SSIM	Multi-scale extension of SSIM	Robust across image resolutions
LPIPS	Distance in features of a pretrained deep network	Strong correlation with human judgments
FID and KID	Distribution-level metrics for generative models	Captures realism and diversity, not pixel exactness

In modern image restoration and generative pipelines, MSE is often combined with perceptual losses (for example, a weighted sum of MSE and LPIPS) so the model benefits from both pixel accuracy and perceptual realism.

MSE in Modern Deep Learning

MSE remains a core component of many state-of-the-art systems even outside classical regression.

Diffusion Models

Diffusion models such as DDPM (Ho et al., 2020), Stable Diffusion, Imagen, and DALL-E 2 use a remarkably simple training objective. The model learns to predict the noise that was added to a clean sample at a randomly chosen timestep, and the loss is the MSE between the true noise and the predicted noise:

L_simple = E[ ||ε − εθ(xₜ, t)||² ]

This simplification of the original variational lower bound was one of the key contributions of the DDPM paper and is often credited with making large-scale diffusion training practical. Despite the complexity of the underlying generative model, the practical training loop is essentially a giant MSE regression problem.
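The structure of this objective can be sketched in NumPy without a real network. The example below assumes the standard DDPM forward process, xₜ = √ᾱₜ · x₀ + √(1 − ᾱₜ) · ε, with a linear noise schedule. The schedule values, the function simple_loss, and the zero_predictor stand-in for a neural network are all assumptions for illustration, not details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)    # cumulative fraction of signal retained

def simple_loss(predict_noise, x0, rng):
    """One Monte Carlo estimate of L_simple for a batch of clean samples x0."""
    batch = x0.shape[0]
    t = rng.integers(0, T, size=batch)                 # random timestep per sample
    eps = rng.normal(size=x0.shape)                    # true noise
    a = alpha_bar[t][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps     # noised sample
    eps_hat = predict_noise(x_t, t)                    # model's noise estimate
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

# Hypothetical stand-in for a trained network: always predicts zero noise.
def zero_predictor(x_t, t):
    return np.zeros_like(x_t)

x0 = rng.normal(size=(512, 8))
loss = simple_loss(zero_predictor, x0, rng)
print(loss)   # close to 8, the data dimension, for this trivial predictor
```

A real training loop would replace zero_predictor with a neural network and minimize this quantity by gradient descent.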

Autoencoders

Autoencoders and variational autoencoders (VAEs) often use MSE as the reconstruction loss between input and reconstructed output. For continuous-valued inputs such as natural images or audio waveforms, MSE corresponds to a Gaussian likelihood on the output and works well in practice. For binary or near-binary inputs, binary cross-entropy is sometimes used instead.

Self-supervised and Representation Learning

Methods such as masked autoencoders for vision (abbreviated MAE, not to be confused with mean absolute error) use MSE on the reconstructed pixels of masked image patches as the pretraining objective. In speech, wav2vec and HuBERT are trained mainly with contrastive and cross-entropy-style prediction objectives rather than MSE, while methods that regress continuous target representations (for example, data2vec) use MSE-style losses. MSE is also used as a feature distillation loss when matching student features to teacher features in knowledge distillation.

Reinforcement Learning

In value-based reinforcement learning, algorithms such as DQN minimize the temporal-difference error with an MSE or Huber-style loss between the predicted Q-value and the target Q-value. The squared TD error is the textbook objective, while practical implementations commonly use Huber loss (or, equivalently, clip the error term) for stability.

Regression Heads in Foundation Models

Many foundation models add task-specific regression heads (for example, predicting bounding-box coordinates, depth values, or continuous attributes) trained with MSE or Huber loss on top of pretrained backbones. This pattern appears in object detection, depth estimation, pose estimation, and physical-property prediction.

Numerical and Practical Considerations

Using MSE effectively requires attention to several practical issues:

Concern	Recommendation
Target scaling	Standardize or normalize targets so that the scale of MSE is comparable across target variables. Without scaling, large-magnitude targets dominate gradients and small-magnitude targets are essentially ignored.
Feature scaling	Scale input features to similar ranges. Although MSE is computed on outputs, gradient magnitudes through the network depend on input scale.
Numerical precision	In half precision (float16), squaring very small errors can underflow and squaring large errors can overflow. Mixed-precision training frameworks usually compute and accumulate the loss in float32.
Sample weighting	When data is imbalanced or some observations are more reliable, use sample weights so MSE is a weighted average of squared errors.
Reduction mode	Frameworks let you choose whether MSE is averaged across all elements, summed, or returned per-element. The choice affects gradient magnitude and the appropriate learning rate.
Multi-output regression	When predicting several targets at once, MSE is averaged across both samples and outputs by default. Consider per-output normalization if outputs have very different scales.
Mini-batch noise	Stochastic gradients of MSE on small batches can be noisy. Larger batches or gradient accumulation reduce variance at the cost of compute.
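The reduction-mode and sample-weighting rows can be illustrated with plain NumPy. In the sketch below the weights are made up for the example, and the variable names mirror the reduction options that frameworks expose rather than any specific API.

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

sq_err = (y - y_hat) ** 2    # 'none': per-element squared errors
mse_mean = sq_err.mean()     # 'mean' reduction
mse_sum = sq_err.sum()       # 'sum' reduction, n times larger than the mean

# Down-weight the last observation (illustrative weights).
weights = np.array([1.0, 1.0, 1.0, 0.1])
weighted_mse = np.average(sq_err, weights=weights)

print(mse_mean)                 # 0.375
print(mse_sum)                  # 1.5
print(round(weighted_mse, 4))   # 0.1935
```

Because the sum is n times the mean, switching reduction modes without adjusting the learning rate changes the effective step size by the same factor.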

Implementation Examples

MSE is available in all major machine learning frameworks.

scikit-learn

from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # Output: 0.375

scikit-learn's mean_squared_error accepts sample_weight for weighted MSE and multioutput for multi-target regression. It can return per-output errors via multioutput='raw_values'. As of scikit-learn 1.4, the historical squared=False argument that returned RMSE is deprecated in favor of a dedicated root_mean_squared_error function.

PyTorch

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

loss = loss_fn(y_pred, y_true)
print(loss.item())  # Output: 0.375

PyTorch's nn.MSELoss supports three reduction modes: 'mean' (default, computes MSE), 'sum' (returns total squared error), and 'none' (returns per-element squared errors). The functional form torch.nn.functional.mse_loss is also available for use without instantiating a module.

TensorFlow / Keras

import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
y_true = [[3.0, -0.5], [2.0, 7.0]]
y_pred = [[2.5, 0.0], [2.0, 8.0]]

loss = loss_fn(y_true, y_pred)
print(loss.numpy())  # Output: 0.375

Keras (available in TensorFlow as tf.keras) offers MeanSquaredError as both a loss class and a metric class (tf.keras.metrics.MeanSquaredError), and supports configurable reduction options including 'sum_over_batch_size', 'sum', and 'none'. The string shortcut 'mse' works wherever a loss is expected, for example model.compile(loss='mse', optimizer='adam').

JAX

import jax.numpy as jnp

def mse_loss(params, x, y, model_fn):
    preds = model_fn(params, x)
    return jnp.mean((preds - y) ** 2)

In JAX, MSE is typically written as a plain function and combined with jax.value_and_grad for automatic differentiation. Libraries such as Optax provide optax.l2_loss (which returns half of the squared error per element) for convenience.

NumPy (from scratch)

import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))  # Output: 0.375

Framework Comparison

Framework	API	Default Reduction	Sample Weights	RMSE Variant
scikit-learn	`mean_squared_error(y_true, y_pred)`	mean over all elements	`sample_weight`	`root_mean_squared_error` (1.4+)
PyTorch	`nn.MSELoss()` or `F.mse_loss`	`'mean'`	manual via element-wise multiply	manual `torch.sqrt`
TensorFlow / Keras	`tf.keras.losses.MeanSquaredError()`	`'sum_over_batch_size'`	`sample_weight`	`RootMeanSquaredError` metric
JAX / Optax	`optax.l2_loss` (half the squared error) or hand-written	none (per-element; reduce with `jnp.mean` or `jnp.sum`)	manual broadcast	manual `jnp.sqrt`

Common Pitfalls

A few mistakes show up frequently when teams adopt MSE in production:

Pitfall	Why It Matters	Fix
Reporting MSE without RMSE	Squared units are hard to interpret.	Report RMSE alongside MSE for stakeholder communication.
Comparing MSE across different target scales	A model with target in millions will look much worse than one with target in tens, even if it is better in relative terms.	Use NMSE, R², or normalize targets before training.
Ignoring outliers	A few extreme points can dominate the loss and hide systematic errors on the rest of the data.	Plot residuals, consider robust losses, and inspect data quality.
Using MSE for classification	Slow training and poor calibration.	Use cross-entropy for classification tasks.
Mixing reduction modes between training and metrics	Sum and mean reductions differ by a factor of n, which changes effective learning rate.	Pick one convention and stick with it.
Forgetting target standardization	Large-magnitude targets blow up gradients; small-magnitude targets are ignored.	Standardize targets and inverse-transform predictions for reporting.

Explain Like I'm 5 (ELI5)

Imagine you and your friends are trying to guess how many candies are in a jar. Each time someone guesses, they might be a little bit wrong. Mean Squared Error is a way to measure how wrong your guesses are, on average.

Here is what you do. Take the difference between each guess and the actual number of candies. Then multiply each difference by itself (that is called squaring it). Finally, add up all those squared numbers and divide by how many guesses there were.

Why do we square the differences? Two reasons. First, it gets rid of the minus sign, so it does not matter if you guessed too high or too low. Second, it makes bigger mistakes count a lot more. If you were off by 2, the squared error is 4. If you were off by 10, the squared error is 100. That is 25 times worse, not just 5 times worse.

If everyone's guesses are close to the real number, the MSE will be small. If the guesses are way off, the MSE will be large. So MSE tells us how good our guessing method is, and smaller is better.

References

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. Chapter 2.4, "Statistical Decision Theory."
Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 1.2.5, "Loss Functions for Regression."
Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Section 5.5.1, "Maximum Likelihood Estimation," and Section 6.2.2, "Mean Squared Error."
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.). Springer. Chapter 2.2, "Assessing Model Accuracy."
Murphy, K. P. (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press. Section 5.7.1, "MLE for Linear Regression."
Lehmann, E. L., & Casella, G. (1998). *Theory of Point Estimation* (2nd ed.). Springer. Chapter 1, "Preparations."
Stigler, S. M. (1986). *The History of Statistics: The Measurement of Uncertainty before 1900*. Harvard University Press. Chapters on Legendre, Gauss, and Laplace.
Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." *Advances in Neural Information Processing Systems* 33. arXiv:2006.11239.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). "Image Quality Assessment: From Error Visibility to Structural Similarity." *IEEE Transactions on Image Processing*, 13(4), 600-612.
Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
scikit-learn documentation. "sklearn.metrics.mean_squared_error." https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
PyTorch documentation. "MSELoss." https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
TensorFlow documentation. "tf.keras.losses.MeanSquaredError." https://www.tensorflow.org/api_docs/python/tf/keras/losses/MeanSquaredError
Wikipedia. "Mean squared error." https://en.wikipedia.org/wiki/Mean_squared_error
Wikipedia. "Peak signal-to-noise ratio." https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio