# Mean Absolute Error (MAE)

> Source: https://aiwiki.ai/wiki/mean_absolute_error_mae
> Updated: 2026-06-23
> Categories: Machine Learning, Model Evaluation, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Mean Absolute Error (MAE)** is a [regression](/wiki/regression) accuracy metric and [loss function](/wiki/loss_function) that measures the average absolute difference between predicted values and actual observed values, computed as MAE = (1 / n) Σ |y_i − ŷ_i|. It is the L1 loss: it averages the magnitude of errors without regard to their sign, expresses the result in the same units as the target variable, and is more robust to outliers than [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse) because each error contributes linearly rather than quadratically.[1] The single constant that minimizes MAE over a set of observations is the median, which is why MAE is the natural loss for median and [quantile regression](/wiki/quantile_regression).[11]

MAE is one of the most widely used metrics in statistics and [machine learning](/wiki/machine_learning) for evaluating the accuracy of predictions. Because it treats every error equally and expresses the result in the same units as the original data, MAE is valued for its simplicity, interpretability, and robustness.[1] In the context of [loss functions](/wiki/loss_function), MAE is equivalent to the L1 loss, which plays a central role in optimization, [regularization](/wiki/regularization), and robust estimation. The geographers Cort Willmott and Kenji Matsuura, who argued for MAE over RMSE in climate model evaluation, summarized its appeal directly: "the MAE is a more natural measure of average error, and (unlike the RMSE) is unambiguous."[1]

MAE is one of the oldest and most enduring measures of prediction quality. The use of absolute deviations as a measure of error pre-dates the [least squares](/wiki/least_squares) method itself: Roger Joseph Boscovich proposed minimizing the sum of absolute deviations in 1757, roughly 48 years before Adrien-Marie Legendre published the method of least squares in 1805 and Carl Friedrich Gauss popularized it.[26] Pierre-Simon Laplace also studied the L1 criterion in connection with the median. Although least squares dominated nineteenth-century practice because of its analytic tractability, the rise of computers and the demands of [robust statistics](/wiki/robust_statistics) brought L1 estimation back into mainstream use during the twentieth century.

## How is MAE calculated?

Given a set of *n* observations where *y_i* represents the actual value and *ŷ_i* represents the predicted value for the *i*-th observation, the Mean Absolute Error is defined as:

**MAE = (1 / n) × Σ |y_i − ŷ_i|**

Or equivalently, using the individual error terms *e_i = y_i − ŷ_i*:

**MAE = (1 / n) × Σ |e_i|**

The summation runs from *i = 1* to *i = n*. Each term |y_i − ŷ_i| computes the absolute difference between the true value and the prediction. By averaging these absolute differences, MAE provides a single number that summarizes the typical prediction error across all data points.

In vector notation, if **y** and **ŷ** are *n*-dimensional vectors of actual and predicted values, then

**MAE = (1 / n) × ‖y − ŷ‖_1**

where ‖·‖_1 denotes the L1 norm (the sum of absolute values of components). This connection between MAE and the L1 norm is the reason MAE and L1 loss are used interchangeably in [deep learning](/wiki/deep_learning) and optimization literature.

### worked example

Consider a simple regression scenario with five observations:

| Observation | Actual Value (y) | Predicted Value (ŷ) | Absolute Error \|y − ŷ\| |
|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 |
| 2 | 5.0 | 4.8 | 0.2 |
| 3 | 2.0 | 3.1 | 1.1 |
| 4 | 7.0 | 6.5 | 0.5 |
| 5 | 4.0 | 4.3 | 0.3 |

**MAE = (0.5 + 0.2 + 1.1 + 0.5 + 0.3) / 5 = 2.6 / 5 = 0.52**

This means the model's predictions are, on average, 0.52 units away from the actual values.
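The same calculation can be sketched in a few lines of Python. The helper name `mean_absolute_error` below is chosen for illustration and uses only the standard library; scikit-learn ships a function of the same name in `sklearn.metrics` for production use.

```python
# Actual and predicted values from the worked example table.
actual = [3.0, 5.0, 2.0, 7.0, 4.0]
predicted = [2.5, 4.8, 3.1, 6.5, 4.3]


def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))  # 0.52
```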

## What are the key properties of MAE?

### same-unit interpretability

Unlike metrics that square the errors, MAE is expressed in the same units as the target variable. If you are predicting house prices in dollars, an MAE of 15,000 means your model is off by $15,000 on average. This makes it straightforward to communicate results to non-technical stakeholders.

### robustness to outliers

MAE treats all errors with equal weight. A prediction that misses by 100 units contributes exactly 100 to the sum, whereas under [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse), the same error contributes 10,000 (the square of 100). This linear treatment means that MAE is significantly less sensitive to outliers than MSE or RMSE.[1] When your dataset contains extreme values or noisy observations, MAE provides a more stable and representative measure of typical model performance.[3]

In the language of [robust statistics](/wiki/robust_statistics), the influence function of an MAE-minimizing estimator (the median) is bounded, while the influence function of an MSE-minimizing estimator (the mean) grows linearly with the deviation. Bounded influence is the technical reason why MAE-based methods resist contamination: a small fraction of arbitrarily large outliers can shift the estimate only by a limited amount.[2]

### symmetry

MAE penalizes overestimations and underestimations equally. A prediction that is 5 units too high and one that is 5 units too low both contribute 5 to the error sum. This symmetric treatment is appropriate in many practical settings where the cost of over-predicting and under-predicting is the same. When asymmetric penalties are required, the [pinball loss](/wiki/quantile_regression) (also known as quantile loss or tilted absolute loss) generalizes MAE by weighting positive and negative residuals differently.

### scale dependence

Because MAE uses the same scale as the data, it cannot be directly compared across datasets with different measurement units. An MAE of 2.0 for a temperature forecast (in degrees Celsius) is not comparable to an MAE of 2.0 for a revenue forecast (in thousands of dollars). Normalized variants such as MAPE, sMAPE, NMAE, and MASE address this limitation; they are described in their own sections below.

### non-differentiability at zero

The absolute value function |x| has a sharp corner at x = 0, which means MAE is not differentiable when a prediction exactly matches the actual value. This creates challenges for optimization algorithms based on [gradient descent](/wiki/gradient_descent), since the gradient is undefined at that point. In practice, subgradient methods or smooth approximations such as Huber loss are used to handle this issue. The set-valued subdifferential of |e| at zero is the closed interval [−1, +1], so any element of this interval is a valid subgradient at the kink.

### convexity

The absolute value function is convex, and a non-negative weighted sum of convex functions is convex. Therefore MAE, viewed as a function of the predicted values, is convex (although not strictly convex). When combined with a [linear regression](/wiki/linear_regression) model, MAE minimization becomes a convex optimization problem and can be expressed as a [linear program](/wiki/linear_programming), which historically allowed L1 regression to be solved by the simplex method long before modern interior-point methods became available.

## Why is MAE called L1 loss?

In optimization and [deep learning](/wiki/deep_learning), MAE is commonly referred to as L1 loss. The name comes from its relationship to the L1 norm (also known as the Manhattan distance or taxicab norm), which computes the sum of absolute values of a vector's components.

Formally, minimizing the L1 loss over a training set is equivalent to minimizing:

**L1 Loss = Σ |y_i − f(x_i)|**

where *f(x_i)* is the model's prediction for input *x_i*.

L1 loss is closely related to [L1 regularization](/wiki/l1_regularization) (Lasso), which adds the sum of absolute values of model weights as a penalty term. The two apply the same norm to different quantities: L1 regularization applies it to the weights and encourages many of them to become exactly zero, whereas L1 loss applies it to the residuals and pulls the fit toward the median rather than the mean of the target distribution. The Lasso estimator, introduced by Robert Tibshirani in 1996, "minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant," and is one of the most influential applications of the L1 norm in modern statistics.[13]

The absolute-value form is sometimes called **least absolute deviations (LAD)** regression, **least absolute errors (LAE)**, or **least absolute residuals (LAR)** depending on the source. All of these terms refer to the same fundamental criterion: minimize the sum of absolute residuals.

## Why does minimizing MAE give the median?

One of the most important theoretical properties of MAE is its relationship to the median. If you need to choose a single constant value *m* that minimizes the mean absolute error across all observations, the optimal choice is the **median** of those observations.

Formally: *m* is a sample median if and only if *m* minimizes the expression Σ |y_i − m|.

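The proof is short. Away from the data points, the derivative of Σ |y_i − m| with respect to *m* exists, and each term contributes −1 when y_i > m and +1 when y_i < m. The derivative therefore equals the number of points below *m* minus the number of points above *m*, and it changes sign from negative to positive exactly where the two counts balance, which is the defining property of the median. With an even number of observations, every value between the two middle observations attains the minimum.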

This stands in contrast to MSE, where the optimal constant predictor is the **mean** of the observations. The distinction has practical consequences:

| Property | MAE (L1 Loss) | MSE (L2 Loss) |
|---|---|---|
| Optimal constant predictor | Median | Mean |
| Sensitivity to outliers | Low (linear penalty) | High (quadratic penalty) |
| Statistical assumption | Laplace-distributed errors | Gaussian-distributed errors |
| Maximum-likelihood interpretation | MLE for Laplace noise | MLE for Gaussian noise |
| Gradient magnitude | Constant (±1/n) | Proportional to error size |
| Differentiability | Not differentiable at zero | Differentiable everywhere |
| Strict convexity | No (only convex) | Yes (strictly convex) |
| Closed-form solution | None in general | Yes (normal equations) |

Because the median is more robust to extreme values than the mean, models trained with MAE loss tend to produce predictions that are more resistant to outlier contamination. From a probabilistic perspective, minimizing MAE is the [maximum likelihood](/wiki/maximum_likelihood_estimation) estimator under a Laplace (double-exponential) noise model, whereas minimizing MSE is the maximum likelihood estimator under a Gaussian noise model.[10][25] This Laplace-Gaussian duality explains both the heavy-tailed robustness of MAE and the analytic convenience of MSE.
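The median and mean properties can be checked numerically. The sketch below is illustrative only: the data set and the brute-force grid of candidate constants are assumptions made for the example, not part of any standard procedure.

```python
import statistics


def sum_abs_dev(values, m):
    """L1 criterion: sum of |y_i - m|."""
    return sum(abs(v - m) for v in values)


def sum_sq_dev(values, m):
    """L2 criterion: sum of (y_i - m)^2."""
    return sum((v - m) ** 2 for v in values)


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # four typical points and one large outlier

# Candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_l1 = min(candidates, key=lambda m: sum_abs_dev(data, m))
best_l2 = min(candidates, key=lambda m: sum_sq_dev(data, m))

print(best_l1, statistics.median(data))  # both 3.0
print(best_l2, statistics.fmean(data))   # both 22.0
```

The outlier drags the L2-optimal constant to 22.0, far from the bulk of the data, while the L1-optimal constant stays at 3.0.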

### conditional median for regression

The median property generalizes from constant predictors to full regression models. If a model *f* minimizes the expected MAE under the joint distribution of inputs *x* and targets *y*, then *f(x)* is the conditional median of *y* given *x*. In contrast, the MSE-minimizing regression function returns the conditional mean. This is why MAE is the natural loss for **median regression** and the basis of the more general [quantile regression](/wiki/quantile_regression) framework.[11]

## How does MAE relate to quantile regression?

Median regression is a special case of [quantile regression](/wiki/quantile_regression), introduced by Roger Koenker and Gilbert Bassett in 1978.[11] In their formulation, the estimator that minimizes a sum of absolute residuals is the central special case, and other conditional quantiles are estimated by minimizing an asymmetrically weighted sum of absolute errors. Quantile regression generalizes MAE by replacing the symmetric absolute value with an asymmetric loss called the **pinball loss** (also called **tilted absolute loss** or **quantile loss**):

**ρ_τ(e) = τ × e if e ≥ 0**

**ρ_τ(e) = (τ − 1) × e if e < 0**

where *τ ∈ (0, 1)* is the target quantile. When *τ = 0.5*, the pinball loss reduces to half the absolute error, so minimizing it is equivalent to minimizing MAE and yields the conditional median.

For *τ > 0.5*, positive residuals (under-predictions) are weighted more heavily than negative residuals (over-predictions), so the optimizer is pushed toward predicting larger values. For *τ < 0.5*, the asymmetry reverses. This makes pinball loss an essential tool for **probabilistic forecasting**: by training separate models or a single model with multiple output heads at quantiles 0.1, 0.5, and 0.9, practitioners can produce calibrated prediction intervals.[12]

| τ value | Interpretation | Symmetric? | Reduces to MAE? |
|---|---|---|---|
| 0.5 | Median forecast | Yes | Yes (up to a factor of 1/2) |
| 0.1 | 10th percentile (lower tail) | No | No |
| 0.9 | 90th percentile (upper tail) | No | No |
| τ → 0 or τ → 1 | Extreme tail | Highly asymmetric | No |
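A minimal sketch of the pinball loss follows, together with a brute-force check that the best constant forecast under it is the corresponding sample quantile. The function names and the toy data (the integers 1 to 99) are assumptions chosen so that each minimizer is unique.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball loss of one residual e = y_true - y_pred at quantile tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    e = y_true - y_pred
    return tau * e if e >= 0 else (tau - 1.0) * e


def mean_pinball_loss(ys, constant, tau):
    """Average pinball loss when every observation gets the same forecast."""
    return sum(pinball_loss(y, constant, tau) for y in ys) / len(ys)


ys = list(range(1, 100))  # the integers 1..99

# Brute-force search over integer forecasts for the best constant per quantile.
best = {
    tau: min(range(0, 101), key=lambda c: mean_pinball_loss(ys, c, tau))
    for tau in (0.1, 0.5, 0.9)
}
print(best)  # {0.1: 10, 0.5: 50, 0.9: 90}
```

At τ = 0.5 the loss of a single residual is half its absolute value, for example `pinball_loss(3.0, 5.0, 0.5)` returns 1.0 for a residual of −2.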

Quantile loss is built into modern forecasting libraries such as **GluonTS**, **Darts**, **NeuralProphet**, and **Prophet**, and is the backbone of probabilistic deep learning models including DeepAR (Salinas et al., 2020),[18] MQ-CNN, and Temporal Fusion Transformer (Lim et al., 2021).[19]

## gradient behavior

The gradient (or more precisely, the subgradient) of the MAE loss with respect to a single prediction is:

**∂MAE/∂ŷ_i = −sign(y_i − ŷ_i) / n**

where *sign(x)* returns +1 if x > 0, −1 if x < 0, and is undefined (or any value in [−1, +1]) if x = 0.

This means the gradient has a **constant magnitude** of 1/n regardless of how large or small the error is. By comparison, the gradient of MSE is proportional to the error itself, so it naturally shrinks as the prediction approaches the true value.

The constant-magnitude gradient has two practical implications for training [neural networks](/wiki/neural_network) and other models via gradient descent:

1. **Oscillation near the optimum**: Because the gradient does not decrease as the model gets closer to the correct answer, the optimization can overshoot the minimum and oscillate back and forth. This can make convergence slower or less stable compared to MSE.

2. **Equal treatment of all errors**: Large errors receive the same gradient magnitude as small errors. This is beneficial for robustness (outliers do not dominate the gradient), but it also means the model does not "rush" to fix its biggest mistakes first.

To mitigate these issues, practitioners often use learning rate scheduling, gradient clipping, or switch to a smooth approximation like Huber loss. Adaptive optimizers such as [Adam](/wiki/adam_optimizer) and AdamW partially compensate for the constant-gradient issue by maintaining per-parameter running averages of the squared gradients, which effectively rescale the step size based on each parameter's historical gradient magnitude. In practice, modern deep networks trained with Adam and L1 loss converge well, even though L1 lacks the smooth shrinkage that MSE provides.

### subgradient methods

When the gradient does not exist at zero, optimization theory provides the **subgradient**: any value in [−1, +1] is a valid subgradient of |x| at x = 0. Subgradient descent uses any element of the subdifferential as a search direction. Modern automatic differentiation libraries return 0 by convention at the kink, which works well in practice because exact zero residuals are rare on continuous data.
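The following sketch illustrates subgradient descent on the simplest possible MAE problem, fitting a single constant. The function names, the starting point of 0, the 1/√t step size, and the iteration count are all assumptions made for this example rather than recommended settings.

```python
def sign(x):
    """Return -1, 0, or +1; the 0 at x == 0 mirrors the usual autograd convention."""
    return (x > 0) - (x < 0)


def mae_subgradient(y_true, y_pred):
    """One valid subgradient of MAE with respect to each prediction."""
    n = len(y_true)
    return [-sign(y - y_hat) / n for y, y_hat in zip(y_true, y_pred)]


def fit_constant_by_subgradient(data, steps=5000):
    """Minimize mean |y - m| over a constant m with a 1/sqrt(t) step size."""
    m = 0.0
    for t in range(1, steps + 1):
        g = -sum(sign(y - m) for y in data) / len(data)  # subgradient w.r.t. m
        m -= g / t ** 0.5  # shrinking steps damp the oscillation around the kink
    return m


data = [1.0, 2.0, 3.0, 4.0, 100.0]
m_hat = fit_constant_by_subgradient(data)
print(round(m_hat, 2))  # close to the median, 3.0
```

Because the subgradient never shrinks on its own, the decreasing step size is what lets the iterate settle near the median instead of bouncing around it.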

## MAE vs. MSE: when should you use each?

Choosing between MAE and [MSE](/wiki/mean_squared_error_mse) (or its root, RMSE) depends on the problem context, data characteristics, and what types of errors matter most.[3][6] The choice is not purely technical: Willmott and Matsuura (2005) argued that because RMSE depends on the variance of the error distribution and the number of errors, "the MAE is a more natural measure of average error, and (unlike the RMSE) is unambiguous," and recommended MAE for reporting average model performance.[1]

| Consideration | Use MAE | Use MSE/RMSE |
|---|---|---|
| Outliers present | Yes, MAE is robust | No, MSE amplifies outlier effects |
| All errors equally costly | Yes | No, large errors are disproportionately costly |
| Interpretability needed | MAE is in original units | RMSE is in original units; MSE is in squared units |
| Optimization smoothness | MAE gradient is discontinuous at zero | MSE is smooth and differentiable everywhere |
| Error distribution | Laplace (heavy-tailed) | Gaussian (thin-tailed) |
| Gradient-based training | Constant gradient can cause oscillation | Gradient naturally decreases near optimum |

**Use MAE when:**
- Your dataset contains outliers or heavy-tailed noise, and you do not want extreme values to dominate the evaluation.
- The cost of errors is linear: a prediction off by 10 is exactly twice as bad as one off by 5.
- You need an interpretable metric to communicate with stakeholders (for example, "our model is off by $500 on average").
- You are building robust models for real-world applications like delivery time estimation, where typical performance matters more than worst-case performance.

**Use MSE when:**
- Large errors are disproportionately harmful and should be penalized more heavily (for example, in safety-critical systems or financial risk modeling).
- Your data is relatively clean and follows an approximately Gaussian error distribution.
- You need smooth gradients for efficient optimization during model training.
- You are using algorithms that assume or benefit from squared error minimization, such as ordinary least squares [linear regression](/wiki/linear_regression).

### empirical relationship

For any sample, MAE ≤ RMSE always holds (this follows from the Cauchy-Schwarz inequality), with equality only when all absolute errors are identical. The ratio RMSE / MAE is bounded between 1 and √n; values close to 1 indicate that errors are uniform in magnitude, while values closer to √n indicate that a few large errors dominate. This ratio itself can be a useful diagnostic for outlier presence.[6]
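The two extremes of the ratio can be reproduced with a small sketch. The error vectors below are made-up examples, and the helper names are chosen for illustration.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


uniform = [2.0, -2.0, 2.0, -2.0]  # all errors have the same magnitude
spiky = [0.0, 0.0, 0.0, 8.0]      # a single error carries everything

print(rmse(uniform) / mae(uniform))  # 1.0, the lower bound
print(rmse(spiky) / mae(spiky))      # 2.0, the upper bound sqrt(n) for n = 4
```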

## Huber loss: a smooth hybrid alternative

[Huber loss](/wiki/huber_loss), introduced by Peter Huber in 1964, combines the advantages of both MAE and MSE by switching between quadratic and linear behavior based on a threshold parameter delta (δ):[2]

- For errors smaller than δ: Huber loss behaves like MSE (quadratic), providing smooth gradients near the optimum.
- For errors larger than δ: Huber loss behaves like MAE (linear), limiting the influence of outliers.

Formally:

**L_δ(e) = 0.5 × e² if |e| ≤ δ**

**L_δ(e) = δ × (|e| − 0.5 × δ) if |e| > δ**

Huber loss is differentiable everywhere (unlike MAE), and its gradient transitions smoothly from being proportional to the error (like MSE) to having a constant magnitude (like MAE). This makes it a popular choice in robust regression, [reinforcement learning](/wiki/reinforcement_learning), and deep learning applications where both smooth optimization and outlier resistance are desired.

The δ parameter controls the transition point. A large δ makes Huber loss behave more like MSE, while a small δ makes it behave more like MAE. In [PyTorch](/wiki/pytorch), this is implemented as `torch.nn.SmoothL1Loss` (with some scaling differences) and `torch.nn.HuberLoss`. Huber loss is also closely associated with DeepMind's DQN paper (Mnih et al., 2015), which clips the error term of the Bellman update to [−1, 1]; this is equivalent to using an absolute-value loss outside that interval, that is, Huber loss with δ = 1, and it keeps large errors from destabilizing training in deep Q-learning.
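As a sketch of the piecewise definition above, the loss and its derivative with respect to the error can be written directly. The function names and the default δ = 1 are assumptions for illustration; they are not the PyTorch implementations.

```python
def huber_loss(e, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)


def huber_grad(e, delta=1.0):
    """Derivative with respect to e: e inside the band, +/- delta outside."""
    if abs(e) <= delta:
        return e
    return delta if e > 0 else -delta


for e in (0.5, 1.0, 3.0, -3.0):
    print(e, huber_loss(e), huber_grad(e))
```

At |e| = δ both branches give the same loss and the same slope, which is what makes the function continuously differentiable.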

### M-estimators and the broader robust statistics family

Huber's 1964 paper introduced the framework of **M-estimators**, which are estimators defined by minimizing a sum of a generic loss function ρ.[2] MAE corresponds to ρ(e) = |e|, MSE to ρ(e) = e²/2, and Huber loss to a piecewise combination. Other well-known robust M-estimators include:

| M-estimator | ρ(e) shape | Behavior at large \|e\| | Bounded influence? |
|---|---|---|---|
| L2 (MSE / OLS) | Quadratic | Quadratic | No |
| L1 (MAE / LAD) | Linear | Linear | Yes (bounded by 1) |
| Huber | Quadratic then linear | Linear | Yes |
| Tukey biweight | Bounded redescending | Constant beyond cutoff | Yes |
| Andrews wave | Bounded redescending | Periodic, capped | Yes |
| Cauchy | Logarithmic | Logarithmic | Yes |

The redescending estimators (Tukey biweight, Andrews wave) actually decrease the influence of points beyond a cutoff distance, making them even more outlier-resistant than MAE, but at the cost of non-convexity.

## MAE in time series forecasting

MAE is one of the most commonly used evaluation metrics for time series forecasting. Its properties make it particularly well-suited for this domain:

- **Direct interpretation**: An MAE of 56 on a revenue forecast in USD means the model is off by $56 on average. This is immediately meaningful to business users.
- **Robustness**: Time series data often contains spikes, anomalies, or seasonal outliers. MAE provides a stable summary of forecast quality even in the presence of such irregularities.
- **Simplicity**: MAE is easy to compute and does not require assumptions about the error distribution.

In forecasting practice, MAE is frequently reported alongside other metrics such as RMSE (which emphasizes large errors), MAPE (which normalizes errors as percentages), and MASE (Mean Absolute Scaled Error, which normalizes by the naive forecast's error).

### the Makridakis competitions

The Makridakis Competitions (M, M2, M3, M4, M5) are a long-running series of public forecasting contests organized by Spyros Makridakis and collaborators since 1982. These competitions have used MAE-derived metrics extensively to rank submissions and to draw broad empirical conclusions about forecasting practice.

| Competition | Year | Series count | Headline accuracy metrics |
|---|---|---|---|
| M | 1982 | 1,001 | MAPE, MSE |
| M2 | 1993 | 29 | MAPE |
| M3 | 2000 | 3,003 | MAPE, sMAPE |
| M4 | 2018 | 100,000 | sMAPE, MASE, OWA |
| M5 | 2020 | 42,840 (Walmart hierarchical sales) | RMSSE, WRMSSE, MAE-based scaled metrics |

The M4 competition introduced the **Overall Weighted Average (OWA)** metric, which averages sMAPE and MASE after dividing each by the corresponding score of the Naïve 2 benchmark (a naive forecast applied to seasonally adjusted data).[16] The M5 competition, run on Kaggle in 2020, used a weighted root-mean-squared scaled error and explicitly emphasized hierarchical demand forecasting at Walmart.[17] Across competitions, the empirical finding is consistent: scaled and absolute-error metrics correlate strongly, and MAE-based criteria reward stable, well-calibrated forecasts more than they reward occasional spectacular hits.

### MASE: mean absolute scaled error

MASE was proposed by Hyndman and Koehler (2006) to address the shortcomings of MAPE and sMAPE.[5] It scales MAE by the in-sample MAE of a naive forecast (the previous observation, or the same observation from the previous season for seasonal data):

**MASE = MAE / MAE_naive**

where MAE_naive is the in-sample mean absolute error from the one-step naive forecast on the training set.

MASE has several attractive properties:

- **Scale-free**: It can be compared across series with different units.
- **Always defined**: Unlike MAPE, it does not blow up when y = 0, as long as the training data has some variation.
- **Interpretable threshold**: MASE < 1 means the model beats the naive forecast on average; MASE > 1 means it does not.
- **Symmetric**: It treats over- and under-prediction equally.

MASE has become a de facto standard in academic forecasting research and is the recommended scale-free metric in the third edition of Hyndman and Athanasopoulos's textbook *Forecasting: Principles and Practice*.[23]
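A minimal sketch of the MASE calculation is shown below. The function signature, the `season` parameter (1 for the plain naive forecast, *m* for a seasonal naive forecast with period *m*), and the toy series are assumptions made for this example.

```python
def mase(train, actual, forecast, season=1):
    """Out-of-sample MAE divided by the in-sample MAE of a (seasonal) naive forecast."""
    if len(train) <= season:
        raise ValueError("training series is too short for the chosen season")
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant, so MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


train = [10.0, 12.0, 11.0, 13.0, 12.0]  # naive in-sample errors: 2, 1, 2, 1
actual = [14.0, 13.0]
forecast = [13.0, 13.5]                 # out-of-sample errors: 1.0, 0.5

print(mase(train, actual, forecast))  # 0.5, so the forecast beats the naive baseline
```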

### applications by domain

| Domain | Example Use Case | Why MAE Is Suitable |
|---|---|---|
| Retail and demand forecasting | Predicting daily product sales | Interpretable in units sold; robust to promotional spikes |
| Energy | Forecasting electricity demand | Error in kWh is directly actionable for grid operators |
| Finance | Predicting stock returns | Outlier-resistant; treats gains and losses symmetrically |
| Weather and climate | Temperature or precipitation forecasts | Error in degrees or millimeters is easy to communicate |
| Healthcare | Predicting patient wait times | Supports operational planning with average error estimates |
| Transportation | Estimating delivery or travel times | Linear error cost aligns with customer experience |
| Logistics | Estimated time of arrival (ETA) systems | Robust to traffic anomalies; minutes are interpretable |
| Cloud capacity planning | Predicting server load | Resists rare spike events; aggregates well over many machines |

## How is MAE used in modern machine learning?

Beyond its classic role as a regression metric, MAE and its L1 variant are used extensively across modern machine learning architectures. Several prominent areas rely on absolute-error losses:

### object detection and bounding-box regression

In [computer vision](/wiki/computer_vision), bounding-box regression has historically used L1 loss or its smooth variant. The Smooth L1 loss (also called Huber loss with δ = 1) was introduced by Ross Girshick for **Fast R-CNN** in 2015 to combine the smoothness of L2 near zero with the robustness of L1 for large errors.[20] Subsequent detectors including **Faster R-CNN**, **SSD**, **RetinaNet**, and **YOLO** variants have used L1 or Smooth L1 for box coordinate regression. More recent transformer-based detectors such as **DETR** (Carion et al., 2020) use a combination of L1 and Generalized IoU (GIoU) loss for box regression.[21] The DIoU and CIoU variants further refine this approach, but the L1 component remains a standard baseline in detection literature.

### image generation and reconstruction

Image-to-image translation networks such as **pix2pix** (Isola et al., 2017) and **CycleGAN** (Zhu et al., 2017) combine an adversarial loss with an L1 reconstruction loss because L1 produces sharper outputs than L2, which tends to blur high-frequency detail.[22] Many [diffusion models](/wiki/diffusion_model) and [autoencoders](/wiki/autoencoder) include an L1 reconstruction term for the same reason. The usual intuition is that L1 targets the conditional median at each pixel while L2 targets the conditional mean: when several sharp outputs are plausible, the mean averages them into a blurry blend, whereas the median is pulled less strongly toward such an average.

### depth estimation and dense prediction

Monocular depth estimation networks (such as **MiDaS**, **DPT**, and **ZoeDepth**) often use scale-invariant L1 or trimmed L1 losses on log-depth, because depth labels contain heavy-tailed errors and outliers from sensor noise. Optical flow networks similarly rely on robust L1-based criteria.

### speech and audio

Speech enhancement, denoising, and source separation systems frequently use L1 or weighted L1 losses on spectrograms because L1 is more perceptually correlated than L2 in many audio settings.

### loss combinations

In modern practice, MAE is rarely used alone. Common combinations include:

| Combination | Where used | Why |
|---|---|---|
| L1 + L2 | Image reconstruction | L1 sharpness with L2 smoothness |
| L1 + GIoU | Object detection (DETR) | Position accuracy plus shape accuracy |
| L1 + adversarial | pix2pix, CycleGAN | Pixel fidelity plus realism |
| Quantile loss at multiple τ | Probabilistic forecasting | Calibrated prediction intervals |
| Huber + entropy | Reinforcement learning | Stable Bellman updates with exploration |

## MAPE: the percentage variant

Mean Absolute Percentage Error (MAPE) expresses MAE as a percentage of the actual values:

**MAPE = (100% / n) × Σ |y_i − ŷ_i| / |y_i|**

MAPE is useful for comparing forecast accuracy across different scales (for example, comparing a model's accuracy on high-volume products versus low-volume products). However, it has several well-known limitations:

- **Undefined for zero actual values**: When any y_i = 0, the formula involves division by zero.
- **Asymmetric penalties**: Underpredictions are capped at 100% error, while overpredictions can yield arbitrarily large percentage errors. This means MAPE systematically favors forecasts that are biased low.
- **Inflated scores for small values**: When actual values are small, even minor absolute errors produce large percentage errors.

Alternatives include Symmetric MAPE (sMAPE), which addresses the asymmetry issue, and MASE (Mean Absolute Scaled Error), which normalizes by the naive forecast and avoids the division-by-zero problem entirely.

### symmetric MAPE (sMAPE)

sMAPE was designed to mitigate MAPE's asymmetry by placing the average of the actual and predicted values in the denominator:

**sMAPE = (200% / n) × Σ |y_i − ŷ_i| / (|y_i| + |ŷ_i|)**

The factor of 200 (rather than 100) comes from dividing by the average (|y_i| + |ŷ_i|) / 2, and it means that sMAPE ranges from 0% to 200% rather than 0% to 100%. sMAPE was the headline metric for the M3 and M4 competitions, but it has also been criticized: Goodwin and Lawton (1999) showed that sMAPE is still asymmetric in a more subtle way, and Hyndman (2014) recommended MASE over sMAPE for academic comparisons.
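The asymmetries of both metrics show up in a small numerical sketch. The helper names and the single-observation examples are assumptions for illustration; the formulas follow the definitions given above.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE, in percent, on the 0 to 200 scale."""
    terms = []
    for y, p in zip(y_true, y_pred):
        denom = abs(y) + abs(p)
        if denom == 0:
            raise ValueError("sMAPE is undefined when actual and forecast are both zero")
        terms.append(abs(y - p) / denom)
    return 200.0 * sum(terms) / len(terms)


# MAPE: a non-negative under-forecast cannot exceed 100%, an over-forecast can.
print(mape([100.0], [0.0]), mape([100.0], [300.0]))

# sMAPE: equal absolute misses of 50 are scored differently by direction.
print(smape([100.0], [50.0]), smape([100.0], [150.0]))
```

In the second line the under-forecast scores roughly 66.7% and the over-forecast 40%, so sMAPE penalizes under-prediction more heavily, the reverse of MAPE's bias.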

### normalized MAE (NMAE)

NMAE divides MAE by some normalization constant such as the range, mean, or standard deviation of the actual values:

**NMAE = MAE / (max(y) − min(y))** or **NMAE = MAE / mean(|y|)**

This is common in recommender systems and physical sciences, where comparing accuracy across heterogeneous datasets matters.

### weighted MAE

When some observations are more important than others (for example, recent observations in time series, or high-revenue products in retail), a weighted MAE is used:

**WMAE = (Σ w_i × |y_i − ŷ_i|) / (Σ w_i)**

where *w_i ≥ 0* is the weight assigned to observation *i*. Weighted RMSSE in the M5 competition is a hierarchical extension of this idea, scaling the squared-error variant of WMAE by a per-series naive baseline.
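A weighted MAE can be sketched as follows. The function name, the validation rules, and the example weights are assumptions for illustration.

```python
def weighted_mae(y_true, y_pred, weights):
    """Weighted mean of absolute errors; weights must be non-negative."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * abs(y - p) for w, y, p in zip(weights, y_true, y_pred)) / total


y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

print(weighted_mae(y_true, y_pred, [1.0, 1.0, 2.0]))  # last observation counts double
print(weighted_mae(y_true, y_pred, [1.0, 1.0, 1.0]))  # equal weights recover plain MAE
```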

## decomposition of MAE

Research by Willmott and Matsuura (2005) and subsequent work has shown that MAE can be decomposed into two meaningful components:[1]

1. **Quantity disagreement**: The absolute difference between the mean of predictions and the mean of actual values. This captures systematic bias.
2. **Allocation disagreement**: The remaining portion of MAE after removing quantity disagreement. This captures how well predictions are distributed across individual observations.

This decomposition is particularly useful in remote sensing and spatial analysis, where understanding whether errors stem from systematic bias or from misallocation of values across locations can guide model improvement. Pontius et al. (2008) extended the decomposition to include multiple resolutions, allowing analysts to track how errors aggregate or cancel as data is summarized at coarser spatial scales.[4]
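The two components can be computed directly from their definitions. The sketch below uses a hypothetical helper, `mae_decomposition`, and two made-up prediction vectors: one that is purely biased and one that is unbiased on average but misallocated.

```python
def mae_decomposition(y_true, y_pred):
    """Return (MAE, quantity disagreement, allocation disagreement)."""
    n = len(y_true)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    quantity = abs(sum(y_pred) / n - sum(y_true) / n)  # absolute difference of means
    allocation = mae - quantity                        # remainder, never negative
    return mae, quantity, allocation


y_true = [1.0, 2.0, 3.0, 4.0]
biased = [2.0, 3.0, 4.0, 5.0]    # every prediction is 1 too high
shuffled = [2.0, 1.0, 4.0, 3.0]  # correct mean, values swapped between neighbors

print(mae_decomposition(y_true, biased))    # (1.0, 1.0, 0.0)
print(mae_decomposition(y_true, shuffled))  # (1.0, 0.0, 1.0)
```

Both prediction sets have the same MAE, but the decomposition shows that the first error is entirely systematic bias and the second is entirely misallocation.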

## confidence intervals and statistical significance

When comparing two models by MAE, a single point estimate of MAE on a held-out test set can be misleading because of sampling variability. Several techniques are commonly used to quantify uncertainty:

- **Bootstrap resampling**: Repeatedly resample the test set with replacement, recompute MAE on each resample, and report the empirical 2.5% and 97.5% percentiles as a 95% confidence interval. This is the standard nonparametric approach: scikit-learn's `sklearn.utils.resample` can draw the resamples, and `scipy.stats.bootstrap` and the `arch` package compute bootstrap intervals directly.[7]
- **Diebold-Mariano test**: A formal hypothesis test for whether two competing forecasts have equal expected loss, originally designed for time series and accounting for autocorrelation in the loss differential.[14] The test statistic uses a HAC-corrected variance estimator.
- **Modified Diebold-Mariano (Harvey, Leybourne, Newbold 1997)**: A small-sample correction to the original DM test.[15]
- **Permutation tests**: Shuffle the labels assigned to two models' predictions and recompute the difference in MAE many times to build a null distribution.

In deep learning contexts, MAE differences across runs are often dominated by random initialization and data ordering, so reporting **mean ± standard deviation across multiple seeds** is the most common practice.

## optimization considerations

When MAE is used as a training loss rather than only as an evaluation metric, several practical points are worth knowing:

1. **Use sub-gradient at zero**: Most autograd frameworks define the derivative of |x| at 0 as 0. This convention is fine in practice because exact zero residuals are extremely rare on continuous-valued targets.
2. **Adam often handles L1 well**: Despite the constant gradient magnitude, [Adam](/wiki/adam_optimizer)'s second-moment scaling adapts the effective step size, which mitigates the oscillation that pure SGD with L1 loss can show.
3. **Learning rate scheduling helps**: A cosine or step schedule is especially useful with L1 loss, because the optimizer needs to take smaller steps near the optimum to avoid bouncing around.
4. **Smooth L1 / Huber loss is often a better choice**: When the goal is to keep training stable while still gaining outlier robustness, switching from pure L1 to Smooth L1 with a small δ is a one-line change that often improves convergence.
5. **Linear programming for LAD regression**: For [linear models](/wiki/linear_regression), MAE minimization can be cast as a linear program. This is the approach taken by scikit-learn's `QuantileRegressor` (with `quantile=0.5`) and by classic LAD regression code.[12]
6. **Iteratively reweighted least squares (IRLS)**: Another classical approach, used by `statsmodels.regression.quantile_regression.QuantReg` (with `q=0.5`). Each iteration solves a weighted least squares problem with weights *w_i = 1 / |e_i|*, then updates the residuals and repeats until convergence. Residuals close to zero must be floored at a small positive value to avoid division by zero.
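7. **Solvers for L1-regularized MAE**: The standard LASSO pairs squared error with an L1 penalty. When MAE itself is combined with L1 regularization (sometimes called LAD-LASSO), both the loss and the penalty are non-smooth, and the combined L1-L1 problem is solved by specialized methods such as linear programming, interior-point methods, or coordinate descent variants designed for it.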
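The IRLS approach from the list above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production solver: the function name `lad_irls`, the floor `eps`, the tolerance `tol`, and the synthetic data with injected outliers are all assumptions made for the example.

```python
import numpy as np

def lad_irls(X, y, n_iter=100, eps=1e-6, tol=1e-8):
    """Least absolute deviations fit via iteratively reweighted least squares."""
    # Start from the ordinary least squares solution.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        resid = y - X @ beta
        # w_i = 1 / |e_i|, with |e_i| floored at eps to avoid division by zero.
        w = 1.0 / np.maximum(np.abs(resid), eps)
        # Weighted least squares: solve (X^T W X) beta = X^T W y.
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data for illustration: y = 1 + 2x + Laplace noise, plus outliers.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.laplace(scale=0.5, size=200)
y[:10] += 20.0  # ten large positive outliers

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lad = lad_irls(X, y)
mae_ols = np.mean(np.abs(y - X @ beta_ols))
mae_lad = np.mean(np.abs(y - X @ beta_lad))
print(f"OLS coefficients: {beta_ols}, MAE: {mae_ols:.3f}")
print(f"LAD coefficients: {beta_lad}, MAE: {mae_lad:.3f}")
```

On data like this, the least squares intercept is pulled toward the outliers, while the LAD fit should stay close to the coefficients that generated the bulk of the data and reach a lower MAE.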

## How do you compute MAE in code?

MAE is available as a built-in function in all major machine learning frameworks.

| Framework | Class or function | Notes |
|---|---|---|
| scikit-learn | `sklearn.metrics.mean_absolute_error(y_true, y_pred)` | Returns a single float; supports `sample_weight` and `multioutput` |
| PyTorch (loss) | `torch.nn.L1Loss()` | Computes mean by default; supports `reduction='sum'` and `'none'` |
| PyTorch (functional) | `torch.nn.functional.l1_loss(input, target)` | Stateless functional form; accepts the same `reduction` argument |
| TensorFlow / Keras (loss) | `tf.keras.losses.MeanAbsoluteError()` | Loss class for `model.compile(loss=...)` |
| TensorFlow / Keras (metric) | `tf.keras.metrics.MeanAbsoluteError()` | Stateful metric for `model.compile(metrics=...)` |
| JAX / Optax | `optax.l1_loss(predictions, targets)` | Returns per-example loss; reduce manually |
| XGBoost | `objective='reg:absoluteerror'` | Available since XGBoost 1.7 |
| LightGBM | `objective='regression_l1'` or `'mae'` | Long-supported |
| CatBoost | `loss_function='MAE'` | Long-supported |
| statsmodels | `QuantReg(y, X).fit(q=0.5)` | Median regression via quantile regression |
| H2O.ai | `distribution='laplace'` for GBM | Equivalent to MAE under Laplace MLE |

### scikit-learn

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")  # Output: MAE: 0.52
```

### PyTorch

In PyTorch, MAE is implemented as `torch.nn.L1Loss`, reflecting its equivalence to the L1 loss function:[8]

```python
import torch
import torch.nn as nn

y_true = torch.tensor([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = torch.tensor([2.5, 4.8, 3.1, 6.5, 4.3])

criterion = nn.L1Loss()
mae = criterion(y_pred, y_true)
print(f"MAE: {mae.item():.2f}")  # Output: MAE: 0.52
```

PyTorch also exposes `torch.nn.SmoothL1Loss` (Huber-style, with a threshold argument named `beta` that defaults to 1.0) and `torch.nn.HuberLoss` (with a configurable `delta` argument) for the smooth hybrid loss described above.

### TensorFlow / Keras

TensorFlow provides MAE both as a metric and as a loss function through its Keras API:[9]

```python
import tensorflow as tf

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

# As a metric
mae = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae.numpy():.2f}")  # Output: MAE: 0.52

# As a loss function for training
loss_fn = tf.keras.losses.MeanAbsoluteError()
```

### NumPy (manual calculation)

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])

mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.2f}")  # Output: MAE: 0.52
```

### gradient boosting libraries

XGBoost, LightGBM, and CatBoost all support MAE as a regression objective. In LightGBM:

```python
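import lightgbm as lgb
import numpy as np

# Synthetic data for illustration; replace with your own features and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.laplace(scale=0.5, size=500)
train_data = lgb.Dataset(X, label=y)

params = {
    'objective': 'regression_l1',  # train with MAE (L1) loss
    'metric': 'mae',
    'learning_rate': 0.05,
    'num_leaves': 31,
}
model = lgb.train(params, train_data, num_boost_round=200)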
```

Note that gradient boosting with MAE differs from MAE-based [neural network](/wiki/neural_network) training: many tree-based learners rely on second-order Taylor approximations of the loss, and MAE has a Hessian of zero almost everywhere.[24] Libraries therefore work around the missing curvature, typically by using a smoothed approximation of the loss or by fitting the tree structure to the sign gradients and then setting each leaf's value to the median of the residuals that fall in that leaf.
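The zero-Hessian problem is easy to see on a single leaf. The following sketch uses hypothetical targets and predictions; it is not how any particular library is implemented, only an illustration of why a Newton-style leaf value is undefined for L1 and why the median is the natural replacement.

```python
import numpy as np

# Hypothetical targets and current ensemble predictions for the rows in one leaf.
y = np.array([3.0, 5.0, 2.0, 7.0, 40.0])
f = np.full_like(y, 4.0)

residual = y - f
grad = -np.sign(residual)        # dL/df for L = |y - f|
hess = np.zeros_like(residual)   # d2L/df2 is zero wherever the residual is nonzero

# A Newton leaf value -sum(grad) / sum(hess) would divide by zero here,
# so this sketch sets the leaf value to the median residual instead.
leaf_value = np.median(residual)

def leaf_mae(v):
    """MAE inside the leaf after adding the constant v to every prediction."""
    return np.mean(np.abs(residual - v))

print(f"median step: {leaf_value:.2f}, MAE after step: {leaf_mae(leaf_value):.2f}")
print(f"mean step:   {residual.mean():.2f}, MAE after step: {leaf_mae(residual.mean()):.2f}")
```

With these numbers the median step leaves an in-leaf MAE of 8.40, while the mean step, which is what a squared-error objective would choose, leaves 11.44 because it chases the single outlier.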

### quantile regression

For median regression and other quantiles, scikit-learn provides `QuantileRegressor` and statsmodels provides `QuantReg`:

```python
from sklearn.linear_model import QuantileRegressor
import numpy as np

X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.laplace(scale=1.0, size=200)

model = QuantileRegressor(quantile=0.5, alpha=0.0, solver='highs')
model.fit(X, y)
# With alpha=0.0 there is no penalty, so model.coef_ holds the L1-optimal
# coefficients, which should land close to the true values [1.0, -2.0, 0.5].
```

## common pitfalls

A few mistakes recur often enough to be worth listing:

- **Confusing MAE with MAPE in reports**: MAPE values look like MAE values but are percentages, not absolute units. Always label which metric is being reported.
- **Dividing by zero in MAPE**: Filter or floor zero actual values, or switch to MASE.
- **Comparing MAE across datasets with different scales**: A lower MAE does not necessarily mean a better model unless the targets share the same units and distribution. Use MASE or NMAE for cross-dataset comparisons.
- **Reporting MAE without sampling uncertainty**: A model that beats another by 0.01 MAE may not be statistically distinguishable. Bootstrap or Diebold-Mariano tests give a sense of significance.
- **Using MAE on classification probabilities**: MAE on probabilities is sometimes computed but is not a [proper scoring rule](/wiki/proper_scoring_rule) for classification. Brier score, log loss, or calibration metrics are preferred.
- **Forgetting that MAE optimizes for the median, not the mean**: A model trained with MAE produces median forecasts. If you need an unbiased estimate of the mean (for example, a budgeting forecast that must sum correctly across categories), MSE may be more appropriate.
- **Mixing up reduction modes**: PyTorch losses take a `reduction` argument (`'mean'`, `'sum'`, or `'none'`), while Optax losses return per-example values that you reduce yourself. The numerical scale of the loss differs between a mean and a sum, which can quietly invalidate learning rate tuning when switching frameworks or reduction modes.
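The median-versus-mean pitfall above can be checked numerically. The sketch below uses a small, hypothetical right-skewed dataset and a brute-force grid of constant predictions; the grid search is only for illustration, since the optimal constants are known in closed form.

```python
import numpy as np

# Hypothetical right-skewed target, e.g. spend per category.
y = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 60.0])

# Evaluate every constant prediction on a 0.01 grid between 0 and 60.
candidates = np.linspace(0.0, 60.0, 6001)
errors = y[None, :] - candidates[:, None]
mae = np.abs(errors).mean(axis=1)
mse = (errors ** 2).mean(axis=1)

best_mae = candidates[np.argmin(mae)]
best_mse = candidates[np.argmin(mse)]
print(f"MAE-optimal constant: {best_mae:.2f} (median = {np.median(y):.2f})")
print(f"MSE-optimal constant: {best_mse:.2f} (mean   = {np.mean(y):.2f})")
# MAE-optimal constant: 3.00 (median = 3.00)
# MSE-optimal constant: 11.00 (mean   = 11.00)
```

Predicting the MAE-optimal constant for all seven items gives a total of 21, far below the actual total of 77, which is why a mean-targeting loss is the safer choice when forecasts must add up.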

## explain like I'm 5 (ELI5)

Imagine you and your friends are guessing how many candies are in a jar. The jar actually has 50 candies. One friend guesses 47, another guesses 53, and a third guesses 45. To figure out how good your group's guesses are, you look at how far off each guess was: the first friend was off by 3, the second by 3, and the third by 5. You do not care whether someone guessed too high or too low; you just care about how far away the guess was. Then you take the average of those distances: (3 + 3 + 5) / 3 = about 3.7 candies. That average distance is the Mean Absolute Error. A smaller number means the guesses were closer to the real answer.

## see also

- [Loss Function](/wiki/loss_function)
- [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse)
- [Regression](/wiki/regression)
- [Huber Loss](/wiki/huber_loss)
- [Quantile Regression](/wiki/quantile_regression)
- [Robust Statistics](/wiki/robust_statistics)
- [Linear Regression](/wiki/linear_regression)
- [Gradient Descent](/wiki/gradient_descent)
- [Regression Model](/wiki/regression_model)
- [L1 Regularization](/wiki/l1_regularization)
- [Maximum Likelihood Estimation](/wiki/maximum_likelihood_estimation)
- [Adam Optimizer](/wiki/adam_optimizer)
- [PyTorch](/wiki/pytorch)
- [Overfitting](/wiki/overfitting)

## references

1. Willmott, C. J., & Matsuura, K. (2005). "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance." *Climate Research*, 30(1), 79-82.
2. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
3. Chai, T., & Draxler, R. R. (2014). "Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature." *Geoscientific Model Development*, 7(3), 1247-1250.
4. Pontius, R. G., Thontteh, O., & Chen, H. (2008). "Components of information for multiple resolution comparison between maps that share a real variable." *Environmental and Ecological Statistics*, 15, 111-142.
5. Hyndman, R. J., & Koehler, A. B. (2006). "Another look at measures of forecast accuracy." *International Journal of Forecasting*, 22(4), 679-688.
6. Hodson, T. O. (2022). "Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not." *Geoscientific Model Development*, 15(14), 5481-5487.
7. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
8. Paszke, A., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." *Advances in Neural Information Processing Systems*, 32.
9. Abadi, M., et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*, 265-283.
10. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5.5: Maximum Likelihood Estimation.
11. Koenker, R., & Bassett, G. (1978). "Regression Quantiles." *Econometrica*, 46(1), 33-50.
12. Koenker, R. (2005). *Quantile Regression*. Cambridge University Press.
13. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society. Series B*, 58(1), 267-288.
14. Diebold, F. X., & Mariano, R. S. (1995). "Comparing Predictive Accuracy." *Journal of Business & Economic Statistics*, 13(3), 253-263.
15. Harvey, D., Leybourne, S., & Newbold, P. (1997). "Testing the equality of prediction mean squared errors." *International Journal of Forecasting*, 13(2), 281-291.
16. Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020). "The M4 Competition: 100,000 time series and 61 forecasting methods." *International Journal of Forecasting*, 36(1), 54-74.
17. Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2022). "M5 accuracy competition: Results, findings, and conclusions." *International Journal of Forecasting*, 38(4), 1346-1364.
18. Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). "DeepAR: Probabilistic forecasting with autoregressive recurrent networks." *International Journal of Forecasting*, 36(3), 1181-1191.
19. Lim, B., Arik, S. O., Loeff, N., & Pfister, T. (2021). "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting." *International Journal of Forecasting*, 37(4), 1748-1764.
20. Girshick, R. (2015). "Fast R-CNN." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 1440-1448.
21. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). "End-to-End Object Detection with Transformers." *European Conference on Computer Vision (ECCV)*, 213-229.
22. Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). "Image-to-Image Translation with Conditional Adversarial Networks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 1125-1134.
23. Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: Principles and Practice* (3rd ed.). OTexts.
24. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 10.
25. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 1.5.5.
26. Dodge, Y. (Ed.) (1987). *Statistical Data Analysis Based on the L1-Norm and Related Methods*. North-Holland. (History of Boscovich's 1757 least absolute deviations criterion and its precedence over Legendre's 1805 least squares.)