Mean Absolute Error (MAE) is one of the most widely used metrics in statistics and machine learning for evaluating the accuracy of predictions. It measures the average magnitude of errors between predicted values and actual observed values, without considering the direction of the errors. Because it treats every error equally and expresses the result in the same units as the original data, MAE is valued for its simplicity, interpretability, and robustness. In the context of loss functions, MAE is equivalent to the L1 loss, which plays a central role in optimization, regularization, and robust estimation.
Given a set of n observations where y_i represents the actual value and ŷ_i represents the predicted value for the i-th observation, the Mean Absolute Error is defined as:
MAE = (1 / n) × Σ |y_i − ŷ_i|
Or equivalently, using the individual error terms e_i = y_i − ŷ_i:
MAE = (1 / n) × Σ |e_i|
The summation runs from i = 1 to i = n. Each term |y_i − ŷ_i| computes the absolute difference between the true value and the prediction. By averaging these absolute differences, MAE provides a single number that summarizes the typical prediction error across all data points.
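For readers who want to see the formula in code, here is a minimal from-scratch sketch in plain Python (the framework implementations shown later in this article are what you would use in practice):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actuals and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))
    return total / len(y_true)

# Using the five observations from the example below:
print(mean_absolute_error([3.0, 5.0, 2.0, 7.0, 4.0],
                          [2.5, 4.8, 3.1, 6.5, 4.3]))  # ≈ 0.52
```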
Consider a simple regression scenario with five observations:
| Observation | Actual Value (y) | Predicted Value (ŷ) | Absolute Error \|y − ŷ\| |
|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 |
| 2 | 5.0 | 4.8 | 0.2 |
| 3 | 2.0 | 3.1 | 1.1 |
| 4 | 7.0 | 6.5 | 0.5 |
| 5 | 4.0 | 4.3 | 0.3 |
MAE = (0.5 + 0.2 + 1.1 + 0.5 + 0.3) / 5 = 2.6 / 5 = 0.52
This means the model's predictions are, on average, 0.52 units away from the actual values.
Unlike metrics that square the errors, MAE is expressed in the same units as the target variable. If you are predicting house prices in dollars, an MAE of 15,000 means your model is off by $15,000 on average. This makes it straightforward to communicate results to non-technical stakeholders.
MAE treats all errors with equal weight. A prediction that misses by 100 units contributes exactly 100 to the sum, whereas under Mean Squared Error (MSE), the same error contributes 10,000 (the square of 100). This linear treatment means that MAE is significantly less sensitive to outliers than MSE or RMSE. When your dataset contains extreme values or noisy observations, MAE provides a more stable and representative measure of typical model performance.
MAE penalizes overestimations and underestimations equally. A prediction that is 5 units too high and one that is 5 units too low both contribute 5 to the error sum. This symmetric treatment is appropriate in many practical settings where the cost of over-predicting and under-predicting is the same.
Because MAE uses the same scale as the data, it cannot be directly compared across datasets with different measurement units. An MAE of 2.0 for a temperature forecast (in degrees Celsius) is not comparable to an MAE of 2.0 for a revenue forecast (in thousands of dollars). Normalized variants such as MAPE address this limitation.
The absolute value function |x| has a sharp corner at x = 0, which means MAE is not differentiable when a prediction exactly matches the actual value. This creates challenges for optimization algorithms based on gradient descent, since the gradient is undefined at that point. In practice, subgradient methods or smooth approximations such as Huber loss are used to handle this issue.
In optimization and deep learning, MAE is commonly referred to as L1 loss. The name comes from its relationship to the L1 norm (also known as the Manhattan distance or taxicab norm), which computes the sum of absolute values of a vector's components.
Formally, minimizing the L1 loss over a training set is equivalent to minimizing:
L1 Loss = Σ |y_i − f(x_i)|
where f(x_i) is the model's prediction for input x_i.
L1 loss is closely related to L1 regularization (Lasso), which adds the sum of absolute values of model weights as a penalty term. Both share the property of promoting sparsity: L1 regularization encourages many weights to become exactly zero, and L1 loss encourages the model to optimize toward the median rather than the mean of the target distribution.
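To make the sparsity claim concrete, here is a small sketch using scikit-learn's Lasso on synthetic data; the dataset shape and the alpha value are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 10 features, only 2 of them informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)         # strength of the L1 penalty (illustrative)
lasso.fit(X, y)

# The L1 penalty drives most coefficients exactly to zero
print(np.round(lasso.coef_, 3))
```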
One of the most important theoretical properties of MAE is its relationship to the median. If you need to choose a single constant value m that minimizes the mean absolute error across all observations, the optimal choice is the median of those observations.
Formally: m is a sample median if and only if m minimizes the expression Σ |y_i − m|.
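A quick numerical check makes this concrete. The sketch below (arbitrary sample values, including a deliberate outlier) sweeps candidate constants m and confirms that the absolute-error sum bottoms out at the median, while the squared-error sum bottoms out at the mean:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier at 100
m = np.linspace(0, 100, 100001)             # candidate constant predictors

abs_loss = np.abs(y[:, None] - m[None, :]).sum(axis=0)
sq_loss = ((y[:, None] - m[None, :]) ** 2).sum(axis=0)

print(m[abs_loss.argmin()])  # ≈ 3.0, the median
print(m[sq_loss.argmin()])   # ≈ 22.0, the mean
```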
This stands in contrast to MSE, where the optimal constant predictor is the mean of the observations. The distinction has practical consequences:
| Property | MAE (L1 Loss) | MSE (L2 Loss) |
|---|---|---|
| Optimal constant predictor | Median | Mean |
| Sensitivity to outliers | Low (linear penalty) | High (quadratic penalty) |
| Statistical assumption | Laplace-distributed errors | Gaussian-distributed errors |
| Gradient magnitude | Constant (±1/n) | Proportional to error size |
| Differentiability | Not differentiable at zero | Differentiable everywhere |
Because the median is more robust to extreme values than the mean, models trained with MAE loss tend to produce predictions that are more resistant to outlier contamination.
The gradient (or more precisely, the subgradient) of the MAE loss with respect to a single prediction is:
∂MAE/∂ŷ_i = −sign(y_i − ŷ_i) / n
where sign(x) returns +1 if x > 0, −1 if x < 0, and is undefined (or any value in [−1, +1]) if x = 0.
This means the gradient has a constant magnitude of 1/n regardless of how large or small the error is. By comparison, the gradient of MSE is proportional to the error itself, so it naturally shrinks as the prediction approaches the true value.
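The contrast is easy to verify with automatic differentiation. This sketch uses PyTorch's built-in l1_loss and mse_loss on arbitrary values:

```python
import torch

y_true = torch.tensor([0.0, 0.0, 0.0])
y_pred = torch.tensor([0.1, 1.0, 10.0], requires_grad=True)

# MAE: every prediction receives a gradient of magnitude 1/n = 1/3,
# no matter whether its error is 0.1 or 10
torch.nn.functional.l1_loss(y_pred, y_true).backward()
print(y_pred.grad)   # tensor([0.3333, 0.3333, 0.3333])

y_pred.grad = None   # reset before the second backward pass

# MSE: the gradient 2 * error / n grows with the error
torch.nn.functional.mse_loss(y_pred, y_true).backward()
print(y_pred.grad)   # tensor([0.0667, 0.6667, 6.6667])
```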
The constant-magnitude gradient has two practical implications for training neural networks and other models via gradient descent:
- Oscillation near the optimum: Because the gradient does not decrease as the model gets closer to the correct answer, the optimization can overshoot the minimum and oscillate back and forth. This can make convergence slower or less stable than with MSE.
- Equal treatment of all errors: Large errors receive the same gradient magnitude as small errors. This is beneficial for robustness (outliers do not dominate the gradient), but it also means the model does not "rush" to fix its biggest mistakes first.
To mitigate these issues, practitioners often use learning rate scheduling, gradient clipping, or switch to a smooth approximation like Huber loss.
Choosing between MAE and MSE (or its root, RMSE) depends on the problem context, data characteristics, and what types of errors matter most.
| Consideration | Use MAE | Use MSE/RMSE |
|---|---|---|
| Outliers present | Yes, MAE is robust | No, MSE amplifies outlier effects |
| All errors equally costly | Yes | No, large errors are disproportionately costly |
| Interpretability needed | MAE is in original units | RMSE is in original units; MSE is in squared units |
| Optimization smoothness | MAE gradient is discontinuous at zero | MSE is smooth and differentiable everywhere |
| Error distribution | Laplace (heavy-tailed) | Gaussian (thin-tailed) |
| Gradient-based training | Constant gradient can cause oscillation | Gradient naturally decreases near optimum |
Use MAE when:

- The data contains outliers or heavy-tailed noise that should not dominate the metric.
- All errors are roughly equally costly, regardless of their size.
- The metric must be reported in the original units and explained to non-technical stakeholders.
- Predictions should track the median of the target distribution.
Use MSE when:

- Large errors are disproportionately costly and should be penalized more heavily.
- Smooth, everywhere-differentiable optimization is a priority.
- Errors are approximately Gaussian and the data is largely free of outliers.
- Predictions should track the mean of the target distribution.
Huber loss, introduced by Peter Huber in 1964, combines the advantages of both MAE and MSE by switching between quadratic and linear behavior based on a threshold parameter delta (δ):
Formally:
L_δ(e) = 0.5 × e² if |e| ≤ δ
L_δ(e) = δ × (|e| − 0.5 × δ) if |e| > δ
Huber loss is differentiable everywhere (unlike MAE), and its gradient transitions smoothly from being proportional to the error (like MSE) to having a constant magnitude (like MAE). This makes it a popular choice in robust regression, reinforcement learning, and deep learning applications where both smooth optimization and outlier resistance are desired.
The δ parameter controls the transition point. A large δ makes Huber loss behave more like MSE, while a small δ makes it behave more like MAE. In PyTorch, this is implemented as torch.nn.SmoothL1Loss (with some scaling differences) and torch.nn.HuberLoss.
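For reference, here is a minimal NumPy sketch of the piecewise definition above (the delta value is illustrative):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for errors within delta, linear beyond it."""
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2                     # MSE-like region
    linear = delta * (np.abs(error) - 0.5 * delta)   # MAE-like region
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])
print(huber_loss(y_true, y_pred, delta=1.0))  # 0.183
```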
MAE is one of the most commonly used evaluation metrics for time series forecasting. Because it is expressed in the same units as the series, is robust to occasional spikes and anomalous observations, and treats over- and under-forecasts symmetrically, it is particularly well-suited for this domain.
In forecasting practice, MAE is frequently reported alongside other metrics such as RMSE (which emphasizes large errors), MAPE (which normalizes errors as percentages), and MASE (Mean Absolute Scaled Error, which normalizes by the naive forecast's error).
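As a sketch of how these metrics sit side by side, the snippet below computes MAE, RMSE, and MASE on a toy series; MASE is scaled here by the in-sample naive one-step forecast, a common convention:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 13.0, 12.0, 15.0, 16.0])
y_pred = np.array([11.0, 11.5, 13.5, 12.5, 14.0, 17.0])

mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MASE: scale MAE by the error of the naive "predict the previous value" forecast
naive_mae = np.mean(np.abs(np.diff(y_true)))
mase = mae / naive_mae

print(f"MAE:  {mae:.3f}")   # typical error in the original units
print(f"RMSE: {rmse:.3f}")  # emphasizes the large errors
print(f"MASE: {mase:.3f}")  # values below 1 beat the naive forecast
```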
| Domain | Example Use Case | Why MAE Is Suitable |
|---|---|---|
| Retail and demand forecasting | Predicting daily product sales | Interpretable in units sold; robust to promotional spikes |
| Energy | Forecasting electricity demand | Error in kWh is directly actionable for grid operators |
| Finance | Predicting stock returns | Outlier-resistant; treats gains and losses symmetrically |
| Weather and climate | Temperature or precipitation forecasts | Error in degrees or millimeters is easy to communicate |
| Healthcare | Predicting patient wait times | Supports operational planning with average error estimates |
| Transportation | Estimating delivery or travel times | Linear error cost aligns with customer experience |
Mean Absolute Percentage Error (MAPE) expresses MAE as a percentage of the actual values:
MAPE = (100% / n) × Σ |y_i − ŷ_i| / |y_i|
MAPE is useful for comparing forecast accuracy across different scales (for example, comparing a model's accuracy on high-volume products versus low-volume products). However, it has several well-known limitations:

- It is undefined when any actual value is zero, and it explodes when actual values are close to zero.
- It is asymmetric: for the same absolute error, over-forecasts are penalized more heavily than under-forecasts, which can bias model selection toward under-forecasting.
- Observations with small actual values dominate the average.
Alternatives include Symmetric MAPE (sMAPE), which addresses the asymmetry issue, and MASE (Mean Absolute Scaled Error), which normalizes by the naive forecast and avoids the division-by-zero problem entirely.
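A short sketch of the MAPE computation, with an explicit guard for the division-by-zero problem mentioned above:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        # MAPE is undefined whenever an actual value is zero
        raise ValueError("MAPE is undefined for zero actual values")
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mape([3.0, 5.0, 2.0, 7.0, 4.0],
           [2.5, 4.8, 3.1, 6.5, 4.3]))  # ≈ 18.1
```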
Research by Willmott and Matsuura (2005) and subsequent work has shown that MAE can be decomposed into two meaningful components:

- A systematic component that captures consistent over- or under-prediction (bias).
- An unsystematic component that captures the error remaining after the bias is removed, such as misallocation of values.
This decomposition is particularly useful in remote sensing and spatial analysis, where understanding whether errors stem from systematic bias or from misallocation of values across locations can guide model improvement.
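One simple formulation of this split (a sketch; published formulations differ in their details) separates the absolute mean error, which captures systematic bias, from the remaining error:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))
bias = np.abs(np.mean(errors))   # systematic over- or under-prediction
remainder = mae - bias           # error not explained by systematic bias

print(mae, bias, remainder)      # ≈ 0.52, 0.04, 0.48
```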
MAE is available as a built-in function in all major machine learning frameworks. In scikit-learn, it is provided by sklearn.metrics.mean_absolute_error:
```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae}")  # Output: MAE: 0.52
```
In PyTorch, MAE is implemented as torch.nn.L1Loss, reflecting its equivalence to the L1 loss function:
```python
import torch
import torch.nn as nn

y_true = torch.tensor([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = torch.tensor([2.5, 4.8, 3.1, 6.5, 4.3])

criterion = nn.L1Loss()
mae = criterion(y_pred, y_true)
print(f"MAE: {mae.item():.2f}")  # Output: MAE: 0.52 (at float32 precision)
```
TensorFlow provides MAE both as a metric and as a loss function through its Keras API:
```python
import tensorflow as tf

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

# As a metric
mae = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae.numpy():.2f}")  # Output: MAE: 0.52

# As a loss function for training
loss_fn = tf.keras.losses.MeanAbsoluteError()
print(f"Loss: {loss_fn(y_true, y_pred).numpy():.2f}")  # Output: Loss: 0.52
```
Plain NumPy computes MAE in a single expression:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])

mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae}")  # Output: MAE: 0.52
```
Imagine you and your friends are guessing how many candies are in a jar. The jar actually has 50 candies. One friend guesses 47, another guesses 53, and a third guesses 45. To figure out how good your group's guesses are, you look at how far off each guess was: the first friend was off by 3, the second by 3, and the third by 5. You do not care whether someone guessed too high or too low; you just care about how far away the guess was. Then you take the average of those distances: (3 + 3 + 5) / 3 = about 3.7 candies. That average distance is the Mean Absolute Error. A smaller number means the guesses were closer to the real answer.