See also: Mean Squared Error (MSE), Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE), also known as root mean square deviation (RMSD), is one of the most widely used metrics for evaluating the performance of regression models in machine learning, statistics, and the applied sciences. It measures the square root of the average of squared differences between predicted values and observed (actual) values. Because RMSE returns a result in the same units as the target variable, it provides an intuitive sense of the typical prediction error. At the same time, by squaring errors before averaging, RMSE gives disproportionately large weight to large errors, making it more sensitive to outliers than the mean absolute error (MAE).
RMSE is closely related to mean squared error (MSE): it is simply the square root of MSE. While MSE is often preferred as a loss function during model training because it is differentiable and well-suited for gradient descent, RMSE is preferred for reporting and interpretation because its units match the target variable. Minimizing RMSE is mathematically equivalent to minimizing MSE, since the square root function is monotonically increasing and preserves the ranking of models.
RMSE can also be understood as the standard deviation of the prediction errors (residuals) when the mean error is zero. In this sense, RMSE describes how spread out the residuals are around the predicted values.
Given a dataset of n observations, where yᵢ is the actual observed value and ŷᵢ (y-hat) is the predicted value for the i-th observation, the Root Mean Squared Error is defined as:
RMSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]
In expanded form:
RMSE = √[(1/n) ((y₁ − ŷ₁)² + (y₂ − ŷ₂)² + ... + (yₙ − ŷₙ)²)]
The computation proceeds in three steps: square each residual, average the squared residuals over all n observations, and take the square root of that average. The symbols are summarized in the table below, and a short code sketch after the table walks through the steps:
| Symbol | Meaning |
|---|---|
| n | Number of observations (data points) |
| yᵢ | Actual (observed) value for the i-th data point |
| ŷᵢ | Predicted value for the i-th data point |
| (yᵢ − ŷᵢ) | Residual (error) for the i-th data point |
| Σ | Summation over all n data points |
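As a minimal sketch of the three steps in NumPy (the array values are arbitrary illustrative data):

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])     # y_i: observed values
y_predicted = np.array([2.5, 0.0, 2.0, 8.0])   # y-hat_i: predicted values

squared_residuals = (y_actual - y_predicted) ** 2  # step 1: square each residual
mse = squared_residuals.mean()                     # step 2: average them
rmse = np.sqrt(mse)                                # step 3: take the square root

print(rmse)  # 0.6123724356957945
```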
In estimation theory, for an estimator θ̂ of a true parameter θ, the RMSE is defined as:
RMSE(θ̂) = √[E((θ̂ − θ)²)] = √MSE(θ̂)
For an unbiased estimator (where the bias is zero), the RMSE is equal to the standard deviation of the estimator, also called the standard error.
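To make the estimator view concrete, the following sketch (with illustrative parameter values) simulates the sample mean as an unbiased estimator of a population mean and confirms that its empirical RMSE approaches the standard error σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 5.0, 2.0, 50   # true mean, noise scale, sample size

# Draw many independent samples and compute the sample-mean estimate for each
estimates = rng.normal(theta, sigma, size=(10_000, n)).mean(axis=1)

empirical_rmse = np.sqrt(np.mean((estimates - theta) ** 2))
standard_error = sigma / np.sqrt(n)

print(empirical_rmse, standard_error)  # both close to 0.283
```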
Suppose a model makes predictions for five data points:
| Observation (i) | Actual (yᵢ) | Predicted (ŷᵢ) | Error (yᵢ − ŷᵢ) | Squared error |
|---|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 | 0.25 |
| 2 | -0.5 | 0.0 | -0.5 | 0.25 |
| 3 | 2.0 | 2.0 | 0.0 | 0.00 |
| 4 | 7.0 | 8.0 | -1.0 | 1.00 |
| 5 | 4.5 | 4.0 | 0.5 | 0.25 |
Sum of squared errors = 0.25 + 0.25 + 0.00 + 1.00 + 0.25 = 1.75
MSE = 1.75 / 5 = 0.35
RMSE = √0.35 ≈ 0.592
This tells us the model's typical prediction error, in the root-mean-square sense, is about 0.59 units, expressed in the same units as the target variable.
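The arithmetic in the table can be verified directly in Python:

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_predicted = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)           # 0.35
print(np.sqrt(mse))  # 0.5916079783099616
```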
RMSE has several mathematical properties that make it a standard metric across scientific and engineering disciplines.
| Property | Description |
|---|---|
| Non-negative | RMSE is always greater than or equal to zero. A value of zero indicates perfect predictions with no error. |
| Same units as target | Unlike MSE, which is in squared units, RMSE is expressed in the same units as the target variable. This makes it directly interpretable. |
| Scale-dependent | RMSE depends on the scale of the data. An RMSE of 10 could be excellent for a variable ranging in the thousands and poor for a variable ranging from 0 to 20. Comparisons across datasets with different scales are not meaningful without normalization. |
| Sensitive to large errors | Because errors are squared before averaging, large errors receive disproportionate weight. An error of 10 contributes 100 to the squared sum, while an error of 1 contributes only 1. |
| Monotonic relationship with MSE | Since RMSE = √MSE, the model that minimizes MSE also minimizes RMSE. The square root is a monotonically increasing function, so it preserves model rankings. |
| Related to standard deviation | When the mean error (bias) is zero, RMSE equals the standard deviation of the residuals. In general, RMSE² = Bias² + Variance of residuals. |
| Satisfies the triangle inequality | RMSE satisfies the triangle inequality, meaning it qualifies as a proper distance metric in a mathematical sense (Chai and Draxler, 2014). |
| Geometric interpretation | RMSE can be interpreted as the L2 norm (Euclidean norm) of the residual vector divided by √n. This connects RMSE to the geometry of vector spaces; see the sketch after this table. |
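Two of these properties, the L2-norm identity and the relation RMSE² = Bias² + Variance of residuals, can be checked numerically; here is a small sketch with arbitrary residuals:

```python
import numpy as np

residuals = np.array([0.5, -0.5, 0.0, -1.0, 0.5])
n = len(residuals)

rmse = np.sqrt(np.mean(residuals ** 2))
via_l2_norm = np.linalg.norm(residuals) / np.sqrt(n)  # L2 norm divided by sqrt(n)

bias_sq_plus_var = residuals.mean() ** 2 + residuals.var()  # Bias^2 + Variance

print(rmse, via_l2_norm)            # identical: 0.5916...
print(rmse ** 2, bias_sq_plus_var)  # identical: 0.35
```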
Interpreting an RMSE value requires context. There is no universal threshold that separates a "good" from a "bad" RMSE, because the metric is scale-dependent. The same RMSE value can be excellent in one application and poor in another.
Compare to the range or mean of observed values. An RMSE of 5 is very different depending on whether the target variable ranges from 0 to 10 (RMSE is 50% of the range) or from 0 to 10,000 (RMSE is 0.05% of the range). Normalizing RMSE by the range, mean, or standard deviation of the observed data (see the Normalized RMSE section below) provides a scale-free measure for comparison.
Compare to a baseline model. A useful benchmark is to compare RMSE against a naive model such as one that always predicts the mean of the training data. If the model's RMSE is substantially lower than the baseline's RMSE, it has learned meaningful patterns.
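A minimal sketch of the baseline comparison (the arrays are placeholders for real data and predictions):

```python
import numpy as np

y_train = np.array([3.0, -0.5, 2.0, 7.0])  # training targets
y_test = np.array([4.5, 1.0, 6.0])         # held-out targets
y_pred = np.array([4.0, 1.5, 5.0])         # model predictions on the test set

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Naive baseline: always predict the mean of the training targets
baseline_pred = np.full_like(y_test, y_train.mean())

print(rmse(y_test, y_pred), rmse(y_test, baseline_pred))
# The model adds value only if its RMSE is clearly below the baseline's
```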
Approximate confidence interval. If the residuals are approximately normally distributed, roughly 68% of predictions fall within ±1 RMSE of the true value, and about 95% fall within ±2 RMSE. This property follows from the relationship between RMSE and the standard deviation of the residuals.
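Assuming roughly Gaussian residuals, the rule of thumb can be verified with a quick simulation on synthetic errors:

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.5, size=100_000)  # synthetic, roughly normal errors

rmse = np.sqrt(np.mean(residuals ** 2))
within_1 = np.mean(np.abs(residuals) <= rmse)
within_2 = np.mean(np.abs(residuals) <= 2 * rmse)

print(within_1, within_2)  # approximately 0.68 and 0.95
```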
Consider the application. In safety-critical applications (medical dosing, structural engineering), even a small RMSE might be unacceptable. In exploratory forecasting (long-range weather prediction, economic modeling), a larger RMSE may be perfectly reasonable.
The squared RMSE (i.e., MSE) can be decomposed into bias and variance components:
RMSE² = MSE = Bias² + Variance + σ²
Where Bias² is the squared systematic error of the model, Variance measures how much its predictions fluctuate across different training samples, and σ² is the irreducible noise in the data.
This decomposition reveals that reducing RMSE requires managing the bias-variance tradeoff. A model that is too simple will have high bias, while a model that is too complex will have high variance. Both scenarios inflate RMSE.
For model diagnostics, MSE can also be decomposed into systematic and unsystematic components. The systematic component (MSE_s) captures consistent over- or under-prediction and non-unity regression slopes. The unsystematic component (MSE_u) captures random scatter around the regression line. A well-calibrated model should have most of its MSE in the unsystematic component.
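A sketch of this diagnostic decomposition, following the common formulation (often attributed to Willmott) in which the predictions are regressed on the observations; the arrays are placeholder data:

```python
import numpy as np

y_obs = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

# Least-squares fit of the predictions on the observations
slope, intercept = np.polyfit(y_obs, y_pred, 1)
y_fit = intercept + slope * y_obs

mse_systematic = np.mean((y_fit - y_obs) ** 2)     # bias and slope error
mse_unsystematic = np.mean((y_pred - y_fit) ** 2)  # random scatter

# The two components sum to the total MSE
print(mse_systematic + mse_unsystematic, np.mean((y_pred - y_obs) ** 2))
```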
Minimizing RMSE (or equivalently MSE) is mathematically equivalent to maximum likelihood estimation when the errors follow a Gaussian (normal) distribution. Under the assumption that yᵢ = f(xᵢ; θ) + εᵢ where εᵢ ~ N(0, σ²), the log-likelihood is:
log L(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ(yᵢ − f(xᵢ; θ))²
Maximizing this expression is equivalent to minimizing the sum of squared errors. This provides the theoretical justification for using RMSE as the primary evaluation metric when errors are normally distributed (Hodson, 2022).
Conversely, when errors follow a Laplacian distribution (heavier tails, more outlier-prone), minimizing MAE is the optimal approach under maximum likelihood. The choice between RMSE and MAE therefore depends fundamentally on the error distribution of the data.
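A minimal numerical illustration of this correspondence: when fitting a single constant to data, the Gaussian maximum-likelihood estimate is the mean (which minimizes MSE/RMSE), while the Laplacian estimate is the median (which minimizes MAE). The grid-search sketch below assumes only NumPy and illustrative data:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])  # note the outlier at 10
c = np.linspace(0, 10, 10_001)            # candidate constant predictions

mse = ((y[:, None] - c) ** 2).mean(axis=0)
mae = np.abs(y[:, None] - c).mean(axis=0)

print(c[mse.argmin()], y.mean())      # MSE minimizer equals the mean (3.7)
print(c[mae.argmin()], np.median(y))  # MAE minimizer equals the median (2.5)
```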
RMSE and MSE are closely related but serve different purposes:
| Aspect | MSE | RMSE |
|---|---|---|
| Formula | (1/n) Σ(yᵢ − ŷᵢ)² | √[(1/n) Σ(yᵢ − ŷᵢ)²] |
| Units | Squared units of target | Same units as target |
| Primary use | Loss function during training | Reporting and interpretation |
| Differentiability | Smooth, easy gradient computation | Same optimization behavior (monotonic transform of MSE) |
| Interpretability | Less intuitive (e.g., "dollars squared") | Directly interpretable (e.g., "dollars") |
An RMSE of 10 does not mean the model is off by 10 units on average. Because errors are squared before averaging, RMSE gives more weight to larger errors. The mean absolute error (MAE) provides the true average absolute error.
The relationship between RMSE and MAE has been the subject of extensive discussion in the scientific literature. Key differences are:
| Aspect | RMSE | MAE |
|---|---|---|
| Formula | √[(1/n) Σ(yᵢ − ŷᵢ)²] | (1/n) Σ\|yᵢ − ŷᵢ\| |
| Error weighting | Disproportionately penalizes large errors (quadratic) | Treats all errors equally (linear) |
| Outlier sensitivity | High | Low |
| Optimal prediction | Conditional mean | Conditional median |
| Optimal error distribution | Gaussian (normal) | Laplacian |
| Differentiability | Differentiable everywhere | Not differentiable at zero error |
| Mathematical bound | RMSE ≥ MAE always holds | MAE ≤ RMSE ≤ √n · MAE |
The inequality MAE ≤ RMSE always holds. The two metrics are equal only when all errors are identical in magnitude. The upper bound RMSE ≤ √n · MAE is reached when all of the error is concentrated in a single observation.
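Both extremes are easy to exhibit numerically with hand-picked error vectors:

```python
import numpy as np

def rmse(e): return np.sqrt(np.mean(e ** 2))
def mae(e):  return np.mean(np.abs(e))

equal_errors = np.array([2.0, -2.0, 2.0, -2.0])  # all errors same magnitude
one_big_error = np.array([4.0, 0.0, 0.0, 0.0])   # error concentrated in one point

print(rmse(equal_errors), mae(equal_errors))         # equal: 2.0, 2.0
print(rmse(one_big_error), np.sqrt(4) * mae(one_big_error))  # RMSE = sqrt(n) * MAE
```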
This debate has a notable history in the geosciences. Willmott and Matsuura (2005) argued that RMSE is not a good indicator of average model performance because it varies with the variability of the error distribution, not just the average error magnitude. They recommended MAE as a more natural and unambiguous measure of average error.
Chai and Draxler (2014) responded with a counterargument, demonstrating that RMSE is not ambiguous in its meaning and is more appropriate when the error distribution is expected to be Gaussian. They also showed that RMSE satisfies the triangle inequality, qualifying it as a proper distance metric.
Hodson (2022) provided a resolution to this debate by demonstrating that neither metric is inherently superior. The appropriate choice depends on the underlying error distribution: RMSE is optimal for Gaussian errors, and MAE is optimal for Laplacian errors. Hodson recommended that practitioners analyze the distributional properties of their errors before selecting a metric, rather than defaulting to convention.
| Scenario | Recommended metric | Reason |
|---|---|---|
| Errors are approximately normally distributed | RMSE | Theoretically optimal under the Gaussian assumption; equivalent to maximum likelihood |
| Large errors are disproportionately costly | RMSE | Quadratic penalty reflects the true cost structure |
| Model training via gradient descent | RMSE/MSE | Smooth, differentiable loss surface; MAE has a non-smooth gradient at zero |
| Errors are heavy-tailed or contain many outliers | MAE | Linear penalty prevents outliers from dominating the metric |
| You want the true average error magnitude | MAE | MAE directly represents the average absolute deviation |
| Error distribution is unknown or mixed | Report both | Presenting both gives a fuller picture (Chai and Draxler, 2014) |
Because RMSE is scale-dependent, comparing models across datasets with different scales requires normalization. Normalized RMSE (NRMSE) divides RMSE by a characteristic value of the observed data. There are several common normalization methods:
| Normalization method | Formula | Notes |
|---|---|---|
| By range | NRMSE = RMSE / (y_max − y_min) | Sensitive to extreme values; commonly used in environmental modeling |
| By mean | NRMSE = RMSE / ȳ | Also called the coefficient of variation of RMSD, or CV(RMSD); useful when the mean is meaningful |
| By standard deviation | NRMSE = RMSE / SD(y) | Compares prediction error to the natural variability of the data |
| By interquartile range | NRMSE = RMSE / IQR | Less sensitive to extreme values than range-based normalization |
NRMSE is dimensionless and typically expressed as a proportion or percentage. Lower values indicate better model fit. A range-normalized NRMSE below 0.1 (10%) is generally considered good, though this depends heavily on the application domain.
There is no single normalization method that is universally superior. The choice depends on the data characteristics and the purpose of the comparison.
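A sketch of the four normalizations from the table, using placeholder data:

```python
import numpy as np

y_obs = np.array([3.0, -0.5, 2.0, 7.0, 4.5])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 4.0])

rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
q75, q25 = np.percentile(y_obs, [75, 25])

print(rmse / (y_obs.max() - y_obs.min()))  # by range
print(rmse / y_obs.mean())                 # by mean (CV(RMSD))
print(rmse / y_obs.std())                  # by standard deviation
print(rmse / (q75 - q25))                  # by interquartile range
```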
While RMSE is most commonly used as an evaluation metric, it can also serve as a loss function. However, MSE (the square of RMSE) is more commonly used for optimization, because the square root changes nothing about the location of the minimum (it is a monotonic transform) while complicating the gradient: each gradient term of RMSE is divided by the current RMSE value, which adds computation and becomes numerically unstable as the error approaches zero.
For this reason, MSE is the standard regression loss function during training in frameworks such as PyTorch, TensorFlow, and scikit-learn, while RMSE is reported after training to communicate model performance in interpretable units.
RMSE is one of several metrics used to evaluate regression models. The table below compares the most common alternatives:
| Metric | Formula | Units | Outlier sensitivity | Best for |
|---|---|---|---|---|
| MSE | (1/n) Σ(yᵢ − ŷᵢ)² | Squared target units | High | Loss function during training |
| RMSE | √[(1/n) Σ(yᵢ − ŷᵢ)²] | Target units | High | Interpretable error reporting |
| MAE | (1/n) Σ\|yᵢ − ŷᵢ\| | Target units | Low | Average error with outlier robustness |
| MAPE | (100/n) Σ\|(yᵢ − ŷᵢ)/yᵢ\| | Percentage | Low | Percentage-based error across scales |
| R² | 1 − [Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²] | Dimensionless | High | Proportion of variance explained |
| Huber Loss | Quadratic for small errors, linear for large | Target units | Medium | Balancing sensitivity and robustness |
No single metric captures all aspects of model performance. It is common practice to report RMSE alongside R\u00b2 and/or MAE to provide a more complete picture.
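Most of these metrics are available in scikit-learn; the sketch below computes them side by side (root_mean_squared_error assumes scikit-learn 1.4 or later):

```python
from sklearn.metrics import (mean_squared_error, root_mean_squared_error,
                             mean_absolute_error, mean_absolute_percentage_error,
                             r2_score)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("MSE: ", mean_squared_error(y_true, y_pred))
print("RMSE:", root_mean_squared_error(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("R2:  ", r2_score(y_true, y_pred))
```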
RMSE is used across a broad range of scientific and engineering disciplines. Its versatility comes from the combination of intuitive units, mathematical tractability, and sensitivity to large errors.
RMSE is a standard evaluation metric for numerical weather prediction (NWP) models. Environment Canada and other meteorological agencies use latitude-weighted RMSE to assess forecast accuracy for temperature, pressure, wind speed, and other variables at different lead times. In ensemble forecasting, comparing the RMSE of the ensemble mean against the ensemble spread provides a diagnostic for whether the ensemble is properly calibrated.
In climate modeling, RMSE is used to evaluate downscaling methods, general circulation model outputs, and data-driven forecasting approaches such as those benchmarked in ClimateLearn (Nguyen et al., 2023).
In structural bioinformatics, RMSD measures the average distance between atoms of superimposed protein structures. This application is central to protein structure prediction, molecular docking, and drug design, where the RMSD between a crystallographic conformation and a predicted pose indicates prediction quality.
In control theory, RMSE (or RMSD) serves as a quality measure for state observer performance. In fluid dynamics, it quantifies flow uniformity for velocity, temperature, and concentration fields. In building energy simulation, RMSE is used to calibrate predicted energy consumption against measured data.
RMSE is used to assess accuracy in spatial analysis, digital elevation models, and satellite imagery classification. In hydrogeology, it is the primary metric for calibrating groundwater flow models.
The Netflix Prize competition (2006 to 2009) used RMSE as its primary evaluation criterion. Competitors aimed to reduce the RMSE of movie rating predictions by at least 10% compared to Netflix's existing Cinematch algorithm.
In imaging science, RMSE is a component of the Peak Signal-to-Noise Ratio (PSNR), which is widely used to measure the quality of reconstructed or compressed images relative to the original.
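The PSNR relationship is direct: for a signal with maximum possible value MAX, PSNR = 20·log₁₀(MAX) − 20·log₁₀(RMSE), so lower RMSE means higher PSNR. A small sketch with synthetic 8-bit image data:

```python
import numpy as np

rng = np.random.default_rng(2)
original = rng.integers(0, 256, size=(64, 64)).astype(float)  # synthetic 8-bit image
noisy = np.clip(original + rng.normal(0, 5, size=(64, 64)), 0, 255)

rmse = np.sqrt(np.mean((original - noisy) ** 2))
psnr = 20 * np.log10(255.0) - 20 * np.log10(rmse)  # MAX = 255 for 8-bit images

print(rmse, psnr)  # lower RMSE -> higher PSNR
```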
While RMSE is widely used, practitioners should be aware of its limitations:
Scale dependence. RMSE is not comparable across datasets with different target scales unless normalized.
Outlier sensitivity. A small number of large errors can inflate RMSE substantially, potentially giving a misleading impression of overall model performance. In such cases, MAE or the median absolute deviation (MAD) may be more representative.
No directional information. RMSE does not distinguish between over-prediction and under-prediction because errors are squared. A model that systematically overestimates by 5 units and a model with random errors averaging 5 units in either direction will have similar RMSE values.
Ambiguity as an "average" error. Willmott and Matsuura (2005) pointed out that RMSE varies with the variability of the error distribution, not just the average error. MAE is a more straightforward measure of the average error magnitude.
Sensitivity to sample size in comparisons. When comparing RMSE values across studies, differences in sample size and data composition can affect the metric, making direct comparisons unreliable without careful normalization.
Not a complete model assessment. RMSE alone does not capture model calibration, uncertainty quantification, or the spatial/temporal distribution of errors. It should be used alongside other metrics and diagnostic plots (residual plots, Q-Q plots).
The mathematical foundations of RMSE trace back to the development of the method of least squares. Adrien-Marie Legendre published the first clear exposition of least squares in 1805. Carl Friedrich Gauss claimed to have used the method since 1795, and in 1809 connected it to probability theory and the normal distribution. Within a decade of Legendre's publication, least squares had become the standard tool in astronomy and geodesy.
The modern use of RMSE as an evaluation metric grew out of this tradition. The squaring of residuals (central to both least squares and RMSE) is mathematically tied to the normal distribution through maximum likelihood theory. As statistical methods spread from the physical sciences into economics, psychology, engineering, and eventually machine learning, RMSE became one of the default measures of prediction accuracy.
RMSE can be computed easily in all major programming languages and machine learning frameworks.
```python
from sklearn.metrics import root_mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)  # Output: 0.6123724356957945
```
In older versions of scikit-learn (before v1.4), RMSE was computed with mean_squared_error, either by passing squared=False or by taking the square root manually:

```python
from sklearn.metrics import mean_squared_error
import math

rmse = math.sqrt(mean_squared_error(y_true, y_pred))
# Or, in scikit-learn versions that still accept the argument:
# rmse = mean_squared_error(y_true, y_pred, squared=False)
```
```python
import torch
import torch.nn as nn

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

mse_loss = nn.MSELoss()
rmse = torch.sqrt(mse_loss(y_pred, y_true))
print(rmse.item())  # Output: approximately 0.6123724 (float32 precision)
```
```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))

print(rmse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))  # Output: 0.6123724356957945
```
```r
y_true <- c(3, -0.5, 2, 7)
y_pred <- c(2.5, 0.0, 2, 8)

rmse <- sqrt(mean((y_true - y_pred)^2))
print(rmse)  # Output: 0.6123724
```
Imagine you are trying to guess how many apples are in different baskets. After you guess, someone counts the actual number. Sometimes you guess too many, and sometimes you guess too few.
RMSE is a way to figure out how wrong your guesses are, on average. Here is what you do: take each guess and compare it to the real number. Figure out how far off you were. Then multiply each difference by itself (squaring it). Add up all those results, divide by how many baskets you guessed, and then take the square root of that number.
Why go through all this squaring and square-rooting? Because it makes big mistakes count extra. If you were off by 1 apple, the squared error is 1. If you were off by 5 apples, the squared error is 25. That is 25 times worse, not just 5 times worse. So RMSE punishes big blunders more than small slip-ups.
The final number you get is in apples (the same thing you were measuring), so it is easy to understand. If the RMSE is 2, it means your guesses are typically about 2 apples off from the real number. A smaller RMSE means better guessing.