See also: Mean Squared Error (MSE), Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE), also known as root mean square deviation (RMSD), is one of the most widely used metrics for evaluating the performance of regression models in machine learning, statistics, and the applied sciences. It measures the square root of the average of squared differences between predicted values and observed (actual) values. Because RMSE returns a result in the same units as the target variable, it provides an intuitive sense of the typical prediction error. At the same time, by squaring errors before averaging, RMSE gives disproportionately large weight to large errors, making it more sensitive to outliers than the mean absolute error (MAE).
RMSE is closely related to mean squared error (MSE): it is simply the square root of MSE. While MSE is often preferred as a loss function during model training because it is differentiable and well-suited for gradient descent, RMSE is preferred for reporting and interpretation because its units match the target variable. Minimizing RMSE is mathematically equivalent to minimizing MSE, since the square root function is monotonically increasing and preserves the ranking of models.
RMSE can also be understood as the standard deviation of the prediction errors (residuals) when the mean error is zero. In this sense, RMSE describes how spread out the residuals are around the predicted values.
Given a dataset of n observations, where yᵢ is the actual observed value and ŷᵢ (y-hat) is the predicted value for the i-th observation, the Root Mean Squared Error is defined as:
RMSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]
In expanded form:
RMSE = √[(1/n) ((y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ... + (yₙ − ŷₙ)²)]
The computation proceeds in three steps: square each residual, average the squared residuals over all n observations, and take the square root of that average. The symbols are summarized in the table below, and a short code sketch after the table walks through the steps:
| Symbol | Meaning |
|---|---|
| n | Number of observations (data points) |
| yᵢ | Actual (observed) value for the i-th data point |
| ŷᵢ | Predicted value for the i-th data point |
| (yᵢ − ŷᵢ) | Residual (error) for the i-th data point |
| Σ | Summation over all n data points |
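As a minimal sketch of the three steps in NumPy (the array values are arbitrary illustrative data):

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])     # y_i: observed values
y_predicted = np.array([2.5, 0.0, 2.0, 8.0])   # y-hat_i: predicted values

squared_residuals = (y_actual - y_predicted) ** 2  # step 1: square each residual
mse = squared_residuals.mean()                     # step 2: average them
rmse = np.sqrt(mse)                                # step 3: take the square root

print(rmse)  # 0.6123724356957945
```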
In estimation theory, for an estimator θ̂ of a true parameter θ, the RMSE is defined as:
RMSE(θ̂) = √[E((θ̂ − θ)²)] = √MSE(θ̂)
For an unbiased estimator (where the bias is zero), the RMSE is equal to the standard deviation of the estimator, also called the standard error.
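To make the estimator view concrete, the following sketch (with illustrative parameter values) simulates the sample mean as an unbiased estimator of a population mean and confirms that its empirical RMSE approaches the standard error σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 5.0, 2.0, 50   # true mean, noise scale, sample size

# Draw many independent samples and compute the sample-mean estimate for each
estimates = rng.normal(theta, sigma, size=(10_000, n)).mean(axis=1)

empirical_rmse = np.sqrt(np.mean((estimates - theta) ** 2))
standard_error = sigma / np.sqrt(n)

print(empirical_rmse, standard_error)  # both close to 0.283
```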
Suppose a model makes predictions for five data points:
| Observation (i) | Actual (yᵢ) | Predicted (ŷᵢ) | Error (yᵢ − ŷᵢ) | Squared error |
|---|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 | 0.25 |
| 2 | -0.5 | 0.0 | -0.5 | 0.25 |
| 3 | 2.0 | 2.0 | 0.0 | 0.00 |
| 4 | 7.0 | 8.0 | -1.0 | 1.00 |
| 5 | 4.5 | 4.0 | 0.5 | 0.25 |
Sum of squared errors = 0.25 + 0.25 + 0.00 + 1.00 + 0.25 = 1.75
MSE = 1.75 / 5 = 0.35
RMSE = √0.35 ≈ 0.592
This tells us the model's typical prediction error, in the root-mean-square sense, is about 0.59 units, expressed in the same units as the target variable.
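The arithmetic in the table can be verified directly in Python:

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_predicted = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)           # 0.35
print(np.sqrt(mse))  # 0.5916079783099616
```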
RMSE has several mathematical properties that make it a standard metric across scientific and engineering disciplines.
| Property | Description |
|---|---|
| Non-negative | RMSE is always greater than or equal to zero. A value of zero indicates perfect predictions with no error. |
| Same units as target | Unlike MSE, which is in squared units, RMSE is expressed in the same units as the target variable. This makes it directly interpretable. |
| Scale-dependent | RMSE depends on the scale of the data. An RMSE of 10 could be excellent for a variable ranging in the thousands and poor for a variable ranging from 0 to 20. Comparisons across datasets with different scales are not meaningful without normalization. |
| Sensitive to large errors | Because errors are squared before averaging, large errors receive disproportionate weight. An error of 10 contributes 100 to the squared sum, while an error of 1 contributes only 1. |
| Monotonic relationship with MSE | Since RMSE = √MSE, the model that minimizes MSE also minimizes RMSE. The square root is a monotonically increasing function, so it preserves model rankings. |
| Related to standard deviation | When the mean error (bias) is zero, RMSE equals the standard deviation of the residuals. In general, RMSE² = Bias² + Variance of residuals. |
| Satisfies the triangle inequality | RMSE satisfies the triangle inequality, meaning it qualifies as a proper distance metric in a mathematical sense (Chai and Draxler, 2014). |
| Geometric interpretation | RMSE can be interpreted as the L2 norm (Euclidean norm) of the residual vector divided by √n. This connects RMSE to the geometry of vector spaces; see the sketch after this table. |
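Two of these properties, the L2-norm identity and the relation RMSE² = Bias² + Variance of residuals, can be checked numerically; here is a small sketch with arbitrary residuals:

```python
import numpy as np

residuals = np.array([0.5, -0.5, 0.0, -1.0, 0.5])
n = len(residuals)

rmse = np.sqrt(np.mean(residuals ** 2))
via_l2_norm = np.linalg.norm(residuals) / np.sqrt(n)  # L2 norm divided by sqrt(n)

bias_sq_plus_var = residuals.mean() ** 2 + residuals.var()  # Bias^2 + Variance

print(rmse, via_l2_norm)            # identical: 0.5916...
print(rmse ** 2, bias_sq_plus_var)  # identical: 0.35
```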
Interpreting an RMSE value requires context. There is no universal threshold that separates a "good" from a "bad" RMSE, because the metric is scale-dependent. The same RMSE value can be excellent in one application and poor in another.
Compare to the range or mean of observed values. An RMSE of 5 is very different depending on whether the target variable ranges from 0 to 10 (RMSE is 50% of the range) or from 0 to 10,000 (RMSE is 0.05% of the range). Normalizing RMSE by the range, mean, or standard deviation of the observed data (see the Normalized RMSE section below) provides a scale-free measure for comparison.
Compare to a baseline model. A useful benchmark is to compare RMSE against a naive model such as one that always predicts the mean of the training data. If the model's RMSE is substantially lower than the baseline's RMSE, it has learned meaningful patterns.
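A minimal sketch of the baseline comparison (the arrays are placeholders for real data and predictions):

```python
import numpy as np

y_train = np.array([3.0, -0.5, 2.0, 7.0])  # training targets
y_test = np.array([4.5, 1.0, 6.0])         # held-out targets
y_pred = np.array([4.0, 1.5, 5.0])         # model predictions on the test set

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Naive baseline: always predict the mean of the training targets
baseline_pred = np.full_like(y_test, y_train.mean())

print(rmse(y_test, y_pred), rmse(y_test, baseline_pred))
# The model adds value only if its RMSE is clearly below the baseline's
```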
Approximate confidence interval. If the residuals are approximately normally distributed, roughly 68% of predictions fall within ±1 RMSE of the true value, and about 95% fall within ±2 RMSE. This property follows from the relationship between RMSE and the standard deviation of the residuals.
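Assuming roughly Gaussian residuals, the rule of thumb can be verified with a quick simulation on synthetic errors:

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.5, size=100_000)  # synthetic, roughly normal errors

rmse = np.sqrt(np.mean(residuals ** 2))
within_1 = np.mean(np.abs(residuals) <= rmse)
within_2 = np.mean(np.abs(residuals) <= 2 * rmse)

print(within_1, within_2)  # approximately 0.68 and 0.95
```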
Consider the application. In safety-critical applications (medical dosing, structural engineering), even a small RMSE might be unacceptable. In exploratory forecasting (long-range weather prediction, economic modeling), a larger RMSE may be perfectly reasonable.
The squared RMSE (i.e., MSE) can be decomposed into bias and variance components:
RMSE² = MSE = Bias² + Variance + σ²
Where Bias² is the squared systematic error of the model, Variance measures how much its predictions fluctuate across different training samples, and σ² is the irreducible noise in the data.
This decomposition reveals that reducing RMSE requires managing the bias-variance tradeoff. A model that is too simple will have high bias, while a model that is too complex will have high variance. Both scenarios inflate RMSE.
For model diagnostics, MSE can also be decomposed into systematic and unsystematic components. The systematic component (MSE_s) captures consistent over- or under-prediction and non-unity regression slopes. The unsystematic component (MSE_u) captures random scatter around the regression line. A well-calibrated model should have most of its MSE in the unsystematic component.
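A sketch of this diagnostic decomposition, following the common formulation (often attributed to Willmott) in which the predictions are regressed on the observations; the arrays are placeholder data:

```python
import numpy as np

y_obs = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

# Least-squares fit of the predictions on the observations
slope, intercept = np.polyfit(y_obs, y_pred, 1)
y_fit = intercept + slope * y_obs

mse_systematic = np.mean((y_fit - y_obs) ** 2)     # bias and slope error
mse_unsystematic = np.mean((y_pred - y_fit) ** 2)  # random scatter

# The two components sum to the total MSE
print(mse_systematic + mse_unsystematic, np.mean((y_pred - y_obs) ** 2))
```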
Minimizing RMSE (or equivalently MSE) is mathematically equivalent to maximum likelihood estimation when the errors follow a Gaussian (normal) distribution. Under the assumption that yᵢ = f(xᵢ; θ) + εᵢ where εᵢ ~ N(0, σ²), the log-likelihood is:
log L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(yᵢ − f(xᵢ; θ))²
Maximizing this expression is equivalent to minimizing the sum of squared errors. This provides the theoretical justification for using RMSE as the primary evaluation metric when errors are normally distributed (Hodson, 2022).
Conversely, when errors follow a Laplacian distribution (heavier tails, more outlier-prone), minimizing MAE is the optimal approach under maximum likelihood. The choice between RMSE and MAE therefore depends fundamentally on the error distribution of the data.
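A minimal numerical illustration of this correspondence: when fitting a single constant to data, the Gaussian maximum-likelihood estimate is the mean (which minimizes MSE/RMSE), while the Laplacian estimate is the median (which minimizes MAE). The grid-search sketch below assumes only NumPy and illustrative data:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])  # note the outlier at 10
c = np.linspace(0, 10, 10_001)            # candidate constant predictions

mse = ((y[:, None] - c) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - c).mean(axis=0)

print(c[mse.argmin()], y.mean())      # MSE minimizer equals the mean (3.7)
print(c[mae.argmin()], np.median(y))  # MAE minimizer equals the median (2.5)
```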
RMSE and MSE are closely related but serve different purposes:
| Aspect | MSE | RMSE |
|---|---|---|
| Formula | (1/n) Σ(yᵢ − ŷᵢ)² | √[(1/n) Σ(yᵢ − ŷᵢ)²] |
| Units | Squared units of target | Same units as target |
| Primary use | Loss function during training | Reporting and interpretation |
| Differentiability | Smooth, easy gradient computation | Same optimization behavior (monotonic transform of MSE) |
| Interpretability | Less intuitive (e.g., "dollars squared") | Directly interpretable (e.g., "dollars") |
An RMSE of 10 does not mean the model is off by 10 units on average. Because errors are squared before averaging, RMSE gives more weight to larger errors. The mean absolute error (MAE) provides the true average absolute error.
The relationship between RMSE and MAE has been the subject of extensive discussion in the scientific literature. Key differences are:
| Aspect | RMSE | MAE |
|---|---|---|
| Formula | √[(1/n) Σ(yᵢ − ŷᵢ)²] | (1/n) Σ\|yᵢ − ŷᵢ\| |
| Error weighting | Disproportionately penalizes large errors (quadratic) | Treats all errors equally (linear) |
| Outlier sensitivity | High | Low |
| Optimal prediction | Conditional mean | Conditional median |
| Optimal error distribution | Gaussian (normal) | Laplacian |
| Differentiability | Differentiable everywhere | Not differentiable at zero error |
| Mathematical bound | RMSE ≥ MAE always holds | MAE ≤ RMSE ≤ √n · MAE |
The inequality MAE ≤ RMSE always holds. The two metrics are equal only when all errors are identical in magnitude. The upper bound RMSE ≤ √n · MAE is reached when all of the error is concentrated in a single observation.
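Both extremes are easy to exhibit numerically with hand-picked error vectors:

```python
import numpy as np

def rmse(e): return np.sqrt(np.mean(e ** 2))
def mae(e):  return np.mean(np.abs(e))

equal_errors = np.array([2.0, -2.0, 2.0, -2.0])  # all errors same magnitude
one_big_error = np.array([4.0, 0.0, 0.0, 0.0])   # error concentrated in one point

print(rmse(equal_errors), mae(equal_errors))         # equal: 2.0, 2.0
print(rmse(one_big_error), np.sqrt(4) * mae(one_big_error))  # RMSE = sqrt(n) * MAE
```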
This debate has a notable history in the geosciences. Willmott and Matsuura (2005) argued that RMSE is not a good indicator of average model performance because it varies with the variability of the error distribution, not just the average error magnitude. They recommended MAE as a more natural and unambiguous measure of average error.
Chai and Draxler (2014) responded with a counterargument, demonstrating that RMSE is not ambiguous in its meaning and is more appropriate when the error distribution is expected to be Gaussian. They also showed that RMSE satisfies the triangle inequality, qualifying it as a proper distance metric.
Hodson (2022) provided a resolution to this debate by demonstrating that neither metric is inherently superior. The appropriate choice depends on the underlying error distribution: RMSE is optimal for Gaussian errors, and MAE is optimal for Laplacian errors. Hodson recommended that practitioners analyze the distributional properties of their errors before selecting a metric, rather than defaulting to convention.
| Scenario | Recommended metric | Reason |
|---|---|---|
| Errors are approximately normally distributed | RMSE | Theoretically optimal under the Gaussian assumption; equivalent to maximum likelihood |
| Large errors are disproportionately costly | RMSE | Quadratic penalty reflects the true cost structure |
| Model training via gradient descent | RMSE/MSE | Smooth, differentiable loss surface; MAE has a non-smooth gradient at zero |
| Errors are heavy-tailed or contain many outliers | MAE | Linear penalty prevents outliers from dominating the metric |
| You want the true average error magnitude | MAE | MAE directly represents the average absolute deviation |
| Error distribution is unknown or mixed | Report both | Presenting both gives a fuller picture (Chai and Draxler, 2014) |
Because RMSE is scale-dependent, comparing models across datasets with different scales requires normalization. Normalized RMSE (NRMSE) divides RMSE by a characteristic value of the observed data. There are several common normalization methods:
| Normalization method | Formula | Notes |
|---|---|---|
| By range | NRMSE = RMSE / (y_max − y_min) | Sensitive to extreme values; commonly used in environmental modeling |
| By mean | NRMSE = RMSE / ȳ | Also called the coefficient of variation of RMSD, or CV(RMSD); useful when the mean is meaningful |
| By standard deviation | NRMSE = RMSE / SD(y) | Compares prediction error to the natural variability of the data |
| By interquartile range | NRMSE = RMSE / IQR | Less sensitive to extreme values than range-based normalization |
NRMSE is dimensionless and typically expressed as a proportion or percentage. Lower values indicate better model fit. A range-normalized NRMSE below 0.1 (10%) is generally considered good, though this depends heavily on the application domain.
There is no single normalization method that is universally superior. The choice depends on the data characteristics and the purpose of the comparison.
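A sketch of the four normalizations from the table, using placeholder data:

```python
import numpy as np

y_obs = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
q75, q25 = np.percentile(y_obs, [75, 25])

print(rmse / (y_obs.max() - y_obs.min()))  # by range
print(rmse / y_obs.mean())                 # by mean (CV(RMSD))
print(rmse / y_obs.std())                  # by standard deviation
print(rmse / (q75 - q25))                  # by interquartile range
```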
While RMSE is most commonly used as an evaluation metric, it can also serve as a loss function. However, MSE (the square of RMSE) is more commonly used for optimization, because the square root changes nothing about the location of the minimum (it is a monotonic transform) while complicating the gradient: each gradient term of RMSE is divided by the current RMSE value, which adds computation and becomes numerically unstable as the error approaches zero.
For this reason, MSE is the standard regression loss function during training in frameworks such as PyTorch, TensorFlow, and scikit-learn, while RMSE is reported after training to communicate model performance in interpretable units.
RMSE is one of several metrics used to evaluate regression models. The table below compares the most common alternatives:
| Metric | Formula | Units | Outlier sensitivity | Best for |
|---|---|---|---|---|
| MSE | (1/n) Σ(yᵢ − ŷᵢ)² | Squared target units | High | Loss function during training |
| RMSE | √[(1/n) Σ(yᵢ − ŷᵢ)²] | Target units | High | Interpretable error reporting |
| MAE | (1/n) Σ\|yᵢ − ŷᵢ\| | Target units | Low | Average error with outlier robustness |
| MAPE | (100/n) Σ\|(yᵢ − ŷᵢ)/yᵢ\| | Percentage | Low | Percentage-based error across scales |
| R² | 1 − [Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²] | Dimensionless | High | Proportion of variance explained |
| Huber Loss | Quadratic for small errors, linear for large | Target units | Medium | Balancing sensitivity and robustness |
No single metric captures all aspects of model performance. It is common practice to report RMSE alongside R\u00b2 and/or MAE to provide a more complete picture.
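Most of these metrics are available in scikit-learn; the sketch below computes them side by side (root_mean_squared_error assumes scikit-learn 1.4 or later):

```python
from sklearn.metrics import (mean_squared_error, root_mean_squared_error,
                             mean_absolute_error, mean_absolute_percentage_error,
                             r2_score)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("MSE: ", mean_squared_error(y_true, y_pred))
print("RMSE:", root_mean_squared_error(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("R2:  ", r2_score(y_true, y_pred))
```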
RMSE is used across a broad range of scientific and engineering disciplines. Its versatility comes from the combination of intuitive units, mathematical tractability, and sensitivity to large errors.
RMSE is a standard evaluation metric for numerical weather prediction (NWP) models. Environment Canada and other meteorological agencies use latitude-weighted RMSE to assess forecast accuracy for temperature, pressure, wind speed, and other variables at different lead times. In ensemble forecasting, comparing the RMSE of the ensemble mean against the ensemble spread provides a diagnostic for whether the ensemble is properly calibrated.
In climate modeling, RMSE is used to evaluate downscaling methods, general circulation model outputs, and data-driven forecasting approaches such as those benchmarked in ClimateLearn (Nguyen et al., 2023).
In structural bioinformatics, RMSD measures the average distance between atoms of superimposed protein structures. This application is central to protein structure prediction, molecular docking, and drug design, where the RMSD between a crystallographic conformation and a predicted pose indicates prediction quality.
In control theory, RMSE (or RMSD) serves as a quality measure for state observer performance. In fluid dynamics, it quantifies flow uniformity for velocity, temperature, and concentration fields. In building energy simulation, RMSE is used to calibrate predicted energy consumption against measured data.
RMSE is used to assess accuracy in spatial analysis, digital elevation models, and satellite imagery classification. In hydrogeology, it is the primary metric for calibrating groundwater flow models.
The Netflix Prize competition (2006 to 2009) used RMSE as its primary evaluation criterion. Competitors aimed to reduce the RMSE of movie rating predictions by at least 10% compared to Netflix's existing Cinematch algorithm.
In imaging science, RMSE is a component of the Peak Signal-to-Noise Ratio (PSNR), which is widely used to measure the quality of reconstructed or compressed images relative to the original.
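The PSNR relationship is direct: for a signal with maximum possible value MAX, PSNR = 20·log₁₀(MAX) − 20·log₁₀(RMSE), so lower RMSE means higher PSNR. A small sketch with synthetic 8-bit image data:

```python
import numpy as np

rng = np.random.default_rng(2)
original = rng.integers(0, 256, size=(64, 64)).astype(float)  # synthetic 8-bit image
noisy = np.clip(original + rng.normal(0, 5, size=(64, 64)), 0, 255)

rmse = np.sqrt(np.mean((original - noisy) ** 2))
psnr = 20 * np.log10(255.0) - 20 * np.log10(rmse)  # MAX = 255 for 8-bit images

print(rmse, psnr)  # lower RMSE -> higher PSNR
```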
While RMSE is widely used, practitioners should be aware of its limitations:
Scale dependence. RMSE is not comparable across datasets with different target scales unless normalized.
Outlier sensitivity. A small number of large errors can inflate RMSE substantially, potentially giving a misleading impression of overall model performance. In such cases, MAE or the median absolute deviation (MAD) may be more representative.
No directional information. RMSE does not distinguish between over-prediction and under-prediction because errors are squared. A model that systematically overestimates by 5 units and a model with random errors averaging 5 units in either direction will have similar RMSE values.
Ambiguity as an "average" error. Willmott and Matsuura (2005) pointed out that RMSE varies with the variability of the error distribution, not just the average error. MAE is a more straightforward measure of the average error magnitude.
Sensitivity to sample size in comparisons. When comparing RMSE values across studies, differences in sample size and data composition can affect the metric, making direct comparisons unreliable without careful normalization.
Not a complete model assessment. RMSE alone does not capture model calibration, uncertainty quantification, or the spatial/temporal distribution of errors. It should be used alongside other metrics and diagnostic plots (residual plots, Q-Q plots).
The mathematical foundations of RMSE trace back to the development of the method of least squares. Adrien-Marie Legendre published the first clear exposition of least squares in 1805. Carl Friedrich Gauss claimed to have used the method since 1795, and in 1809 connected it to probability theory and the normal distribution. Within a decade of Legendre's publication, least squares had become the standard tool in astronomy and geodesy.
The modern use of RMSE as an evaluation metric grew out of this tradition. The squaring of residuals (central to both least squares and RMSE) is mathematically tied to the normal distribution through maximum likelihood theory. As statistical methods spread from the physical sciences into economics, psychology, engineering, and eventually machine learning, RMSE became one of the default measures of prediction accuracy.
RMSE can be computed easily in all major programming languages and machine learning frameworks.
```python
from sklearn.metrics import root_mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)  # Output: 0.6123724356957945
```
In older versions of scikit-learn (before v1.4), RMSE was computed with mean_squared_error, either by passing squared=False or by taking the square root manually:

```python
from sklearn.metrics import mean_squared_error
import math

rmse = math.sqrt(mean_squared_error(y_true, y_pred))
# Or, in scikit-learn versions that still accept the argument:
# rmse = mean_squared_error(y_true, y_pred, squared=False)
```
```python
import torch
import torch.nn as nn

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

mse_loss = nn.MSELoss()
rmse = torch.sqrt(mse_loss(y_pred, y_true))
print(rmse.item())  # Output: approximately 0.6123724 (float32 precision)
```
```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

print(rmse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))  # Output: 0.6123724356957945
```
```r
y_true <- c(3, -0.5, 2, 7)
y_pred <- c(2.5, 0.0, 2, 8)

rmse <- sqrt(mean((y_true - y_pred)^2))
print(rmse)  # Output: 0.6123724
```
Imagine you are trying to guess how many apples are in different baskets. After you guess, someone counts the actual number. Sometimes you guess too many, and sometimes you guess too few.
RMSE is a way to figure out how wrong your guesses are, on average. Here is what you do: take each guess and compare it to the real number. Figure out how far off you were. Then multiply each difference by itself (squaring it). Add up all those results, divide by how many baskets you guessed, and then take the square root of that number.
Why go through all this squaring and square-rooting? Because it makes big mistakes count extra. If you were off by 1 apple, the squared error is 1. If you were off by 5 apples, the squared error is 25. That is 25 times worse, not just 5 times worse. So RMSE punishes big blunders more than small slip-ups.
The final number you get is in apples (the same thing you were measuring), so it is easy to understand. If the RMSE is 2, it means your guesses are typically about 2 apples off from the real number. A smaller RMSE means better guessing.