Mean Absolute Error (MAE)
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,430 words
Mean Absolute Error (MAE) is one of the most widely used metrics in statistics and machine learning for evaluating the accuracy of predictions. It measures the average magnitude of errors between predicted values and actual observed values, without considering the direction of the errors. Because it treats every error equally and expresses the result in the same units as the original data, MAE is valued for its simplicity, interpretability, and robustness. In the context of loss functions, MAE is equivalent to the L1 loss, which plays a central role in optimization, regularization, and robust estimation.
MAE is one of the oldest and most enduring measures of prediction quality. The use of absolute deviations as a measure of error pre-dates the least squares method itself: Roger Joseph Boscovich proposed minimizing the sum of absolute deviations in 1757, decades before Adrien-Marie Legendre published the method of least squares in 1805 and Carl Friedrich Gauss popularized it. Pierre-Simon Laplace also studied the L1 criterion in connection with the median. Although least squares dominated nineteenth-century practice because of its analytic tractability, the rise of computers and the demands of robust statistics brought L1 estimation back into mainstream use during the twentieth century.
Given a set of n observations where y_i represents the actual value and ŷ_i represents the predicted value for the i-th observation, the Mean Absolute Error is defined as:
MAE = (1 / n) × Σ |y_i − ŷ_i|
Or equivalently, using the individual error terms e_i = y_i − ŷ_i:
MAE = (1 / n) × Σ |e_i|
The summation runs from i = 1 to i = n. Each term |y_i − ŷ_i| computes the absolute difference between the true value and the prediction. By averaging these absolute differences, MAE provides a single number that summarizes the typical prediction error across all data points.
In vector notation, if y and ŷ are n-dimensional vectors of actual and predicted values, then
MAE = (1 / n) × ‖y − ŷ‖_1
where ‖·‖_1 denotes the L1 norm (the sum of absolute values of components). This connection between MAE and the L1 norm is the reason MAE and L1 loss are used interchangeably in deep learning and optimization literature.
Consider a simple regression scenario with five observations:
| Observation | Actual Value (y) | Predicted Value (ŷ) | Absolute Error \|y − ŷ\| |
|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 |
| 2 | 5.0 | 4.8 | 0.2 |
| 3 | 2.0 | 3.1 | 1.1 |
| 4 | 7.0 | 6.5 | 0.5 |
| 5 | 4.0 | 4.3 | 0.3 |
MAE = (0.5 + 0.2 + 1.1 + 0.5 + 0.3) / 5 = 2.6 / 5 = 0.52
This means the model's predictions are, on average, 0.52 units away from the actual values.
Unlike metrics that square the errors, MAE is expressed in the same units as the target variable. If you are predicting house prices in dollars, an MAE of 15,000 means your model is off by $15,000 on average. This makes it straightforward to communicate results to non-technical stakeholders.
MAE treats all errors with equal weight. A prediction that misses by 100 units contributes exactly 100 to the sum, whereas under Mean Squared Error (MSE), the same error contributes 10,000 (the square of 100). This linear treatment means that MAE is significantly less sensitive to outliers than MSE or RMSE. When your dataset contains extreme values or noisy observations, MAE provides a more stable and representative measure of typical model performance.
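The difference in outlier sensitivity can be seen in a small sketch. The error values below are hypothetical and chosen only for illustration; the point is how each metric reacts when a single error of 100 is added.

```python
# Illustrative errors (hypothetical values, not from any real dataset).
clean_errors = [1.0, 2.0, 1.0, 2.0, 1.0]
with_outlier = clean_errors + [100.0]

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

print(f"MAE: {mae(clean_errors):.2f} -> {mae(with_outlier):.2f}")
print(f"MSE: {mse(clean_errors):.2f} -> {mse(with_outlier):.2f}")
```

In this toy case the outlier raises MAE from 1.4 to about 17.8, while MSE jumps from 2.2 to 1668.5, so the squared metric is dominated almost entirely by one observation.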
In the language of robust statistics, the influence function of an MAE-minimizing estimator (the median) is bounded, while the influence function of an MSE-minimizing estimator (the mean) grows linearly with the deviation. Bounded influence is the technical reason why MAE-based methods resist contamination by even a small fraction of arbitrarily large outliers.
MAE penalizes overestimations and underestimations equally. A prediction that is 5 units too high and one that is 5 units too low both contribute 5 to the error sum. This symmetric treatment is appropriate in many practical settings where the cost of over-predicting and under-predicting is the same. When asymmetric penalties are required, the pinball loss (also known as quantile loss or tilted absolute loss) generalizes MAE by weighting positive and negative residuals differently.
Because MAE uses the same scale as the data, it cannot be directly compared across datasets with different measurement units. An MAE of 2.0 for a temperature forecast (in degrees Celsius) is not comparable to an MAE of 2.0 for a revenue forecast (in thousands of dollars). Normalized variants such as MAPE, sMAPE, NMAE, and MASE address this limitation; they are described in their own sections below.
The absolute value function |x| has a sharp corner at x = 0, which means MAE is not differentiable when a prediction exactly matches the actual value. This creates challenges for optimization algorithms based on gradient descent, since the gradient is undefined at that point. In practice, subgradient methods or smooth approximations such as Huber loss are used to handle this issue. The set-valued subdifferential of |e| at zero is the closed interval [−1, +1], so any element of this interval is a valid subgradient at the kink.
The absolute value function is convex, and a non-negative weighted sum of convex functions is convex. Therefore MAE, viewed as a function of the predicted values, is convex (although not strictly convex). When combined with a linear regression model, MAE minimization becomes a convex optimization problem and can be expressed as a linear program, which historically allowed L1 regression to be solved by the simplex method long before modern interior-point methods became available.
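One standard way to write that linear program is sketched below. The coefficient vector β, the feature vectors x_i, and the auxiliary variables t_i are notation introduced here for illustration; each t_i bounds one absolute residual from above.

```latex
\begin{aligned}
\min_{\beta,\; t} \quad & \sum_{i=1}^{n} t_i \\
\text{subject to} \quad & -t_i \le y_i - x_i^{\top}\beta \le t_i, \qquad i = 1, \dots, n.
\end{aligned}
```

At an optimum each t_i equals |y_i − x_i^T β|, so the objective value is n times the MAE of the fitted linear model.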
In optimization and deep learning, MAE is commonly referred to as L1 loss. The name comes from its relationship to the L1 norm (also known as the Manhattan distance or taxicab norm), which computes the sum of absolute values of a vector's components.
Formally, minimizing the L1 loss over a training set is equivalent to minimizing:
L1 Loss = Σ |y_i − f(x_i)|
where f(x_i) is the model's prediction for input x_i.
L1 loss is closely related to L1 regularization (Lasso), which adds the sum of absolute values of model weights as a penalty term. The two apply the same norm to different quantities: L1 regularization promotes sparsity by driving many weights to exactly zero, whereas L1 loss pulls the fitted values toward the median rather than the mean of the target distribution. The Lasso estimator was popularized by Robert Tibshirani in 1996 and is one of the most influential applications of the L1 norm in modern statistics.
The absolute-value form is sometimes called least absolute deviations (LAD) regression, least absolute errors (LAE), or least absolute residuals (LAR) depending on the source. All of these terms refer to the same fundamental criterion: minimize the sum of absolute residuals.
One of the most important theoretical properties of MAE is its relationship to the median. If you need to choose a single constant value m that minimizes the mean absolute error across all observations, the optimal choice is the median of those observations.
Formally: m is a sample median if and only if m minimizes the expression Σ |y_i − m|.
The proof is straightforward. Take the derivative of Σ |y_i − m| with respect to m. Each term contributes −1 when y_i > m and +1 when y_i < m. Setting the sum of these signs to zero requires that the number of points above m equal the number of points below m, which is the defining property of the median.
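The median property can also be checked numerically. The following is a minimal sketch on a hypothetical five-point sample: it scans a grid of candidate constants and keeps the one with the smallest sum of absolute deviations.

```python
import statistics

# Hypothetical sample with one extreme value.
y = [2.0, 3.0, 5.0, 7.0, 50.0]

def sum_abs_dev(m, values):
    """Sum of absolute deviations of the values from the constant m."""
    return sum(abs(v - m) for v in values)

# Scan candidate constants on a grid from 0.0 to 60.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 601)]
best = min(candidates, key=lambda m: sum_abs_dev(m, y))

print(best, statistics.median(y), statistics.mean(y))
```

The grid search lands on 5.0, the sample median, rather than on the mean of 13.4, which is pulled upward by the value 50.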
This stands in contrast to MSE, where the optimal constant predictor is the mean of the observations. The distinction has practical consequences:
| Property | MAE (L1 Loss) | MSE (L2 Loss) |
|---|---|---|
| Optimal constant predictor | Median | Mean |
| Sensitivity to outliers | Low (linear penalty) | High (quadratic penalty) |
| Statistical assumption | Laplace-distributed errors | Gaussian-distributed errors |
| Maximum-likelihood interpretation | MLE for Laplace noise | MLE for Gaussian noise |
| Gradient magnitude | Constant (±1/n) | Proportional to error size |
| Differentiability | Not differentiable at zero | Differentiable everywhere |
| Strict convexity | No (only convex) | Yes (strictly convex) |
| Closed-form solution | None in general | Yes (normal equations) |
Because the median is more robust to extreme values than the mean, models trained with MAE loss tend to produce predictions that are more resistant to outlier contamination. From a probabilistic perspective, minimizing MAE is the maximum likelihood estimator under a Laplace (double-exponential) noise model, whereas minimizing MSE is the maximum likelihood estimator under a Gaussian noise model. This Laplace-Gaussian duality explains both the heavy-tailed robustness of MAE and the analytic convenience of MSE.
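The Laplace connection can be made explicit. Assuming independent residuals with a Laplace density and a fixed scale b (notation introduced here for illustration), the negative log-likelihood is:

```latex
p(y_i \mid x_i) = \frac{1}{2b} \exp\!\left(-\frac{|y_i - f(x_i)|}{b}\right)
\quad\Longrightarrow\quad
-\sum_{i=1}^{n} \log p(y_i \mid x_i)
  = n \log(2b) + \frac{1}{b} \sum_{i=1}^{n} |y_i - f(x_i)|
```

For fixed b the first term is constant, so maximizing the likelihood over f is the same as minimizing the sum of absolute residuals.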
The median property generalizes from constant predictors to full regression models. If a model f minimizes the expected MAE under the joint distribution of inputs x and targets y, then f(x) is the conditional median of y given x. In contrast, the MSE-minimizing regression function returns the conditional mean. This is why MAE is the natural loss for median regression and the basis of the more general quantile regression framework.
Median regression is a special case of quantile regression, introduced by Roger Koenker and Gilbert Bassett in 1978. Quantile regression generalizes MAE by replacing the symmetric absolute value with an asymmetric loss called the pinball loss (also called tilted absolute loss or quantile loss):
ρ_τ(e) = τ × e if e ≥ 0
ρ_τ(e) = (τ − 1) × e if e < 0
where τ ∈ (0, 1) is the target quantile. When τ = 0.5, the pinball loss reduces to half the absolute error, so minimizing it is equivalent to minimizing MAE and yields the conditional median.
For τ > 0.5, positive residuals (under-predictions) are weighted more heavily than negative residuals (over-predictions), so the optimizer is pushed toward predicting larger values. For τ < 0.5, the asymmetry reverses. This makes pinball loss an essential tool for probabilistic forecasting: by training separate models or a single model with multiple output heads at quantiles 0.1, 0.5, and 0.9, practitioners can produce calibrated prediction intervals.
| τ value | Interpretation | Symmetric? | Reduces to MAE? |
|---|---|---|---|
| 0.5 | Median forecast | Yes | Yes (up to a factor of 1/2) |
| 0.1 | 10th percentile (lower tail) | No | No |
| 0.9 | 90th percentile (upper tail) | No | No |
| τ → 0 or τ → 1 | Extreme tail | Highly asymmetric | No |
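The pinball loss is short enough to write out directly. The sketch below uses hypothetical data and helper names (`pinball_loss`, `mae`) that are not taken from any particular library; it checks the τ = 0.5 relationship and then finds the best constant prediction for τ = 0.8.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for target quantile tau, with residual e = y - y_hat."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        e = y - y_hat
        total += tau * e if e >= 0 else (tau - 1.0) * e
    return total / len(y_true)

def mae(y_true, y_pred):
    """Plain mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

# At tau = 0.5 the pinball loss equals half the MAE.
print(pinball_loss(y_true, y_pred, 0.5), mae(y_true, y_pred) / 2)

# Best constant prediction for tau = 0.8 among the values 1..9.
data = [float(v) for v in range(1, 10)]
best = min(data, key=lambda c: pinball_loss(data, [c] * len(data), 0.8))
print(best)
```

For the values 1 through 9 the best constant at τ = 0.8 is 8.0, well above the median of 5, which illustrates how a larger τ pushes the optimum toward the upper tail.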
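Quantile loss is available in modern forecasting libraries such as GluonTS, Darts, and NeuralProphet, and is the training objective of quantile-based deep forecasting models including MQ-CNN and Temporal Fusion Transformer (Lim et al., 2021). DeepAR (Salinas et al., 2020) takes a different route: it is trained by maximizing the likelihood of a parametric output distribution, and quantiles are read off the predicted distribution afterwards.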
The gradient (or more precisely, the subgradient) of the MAE loss with respect to a single prediction is:
∂MAE/∂ŷ_i = −sign(y_i − ŷ_i) / n
where sign(x) returns +1 if x > 0, −1 if x < 0, and is undefined (or any value in [−1, +1]) if x = 0.
This means the gradient has a constant magnitude of 1/n regardless of how large or small the error is. By comparison, the gradient of MSE is proportional to the error itself, so it naturally shrinks as the prediction approaches the true value.
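The contrast between the two gradients is easy to tabulate. This is a small sketch with a hypothetical batch size of 4; `mae_grad` and `mse_grad` are illustrative helper names, and the MSE here is taken as (1/n) Σ (y_i − ŷ_i)².

```python
def mae_grad(y, y_hat, n):
    """Subgradient of MAE with respect to one prediction (0 is chosen at the kink)."""
    e = y - y_hat
    sign = (e > 0) - (e < 0)
    return -sign / n

def mse_grad(y, y_hat, n):
    """Gradient of MSE = (1/n) * sum((y - y_hat)^2) with respect to one prediction."""
    return -2.0 * (y - y_hat) / n

n = 4  # hypothetical batch size
for error in (10.0, 1.0, 0.01):
    y, y_hat = error, 0.0  # the prediction is 0, so the residual equals `error`
    print(error, mae_grad(y, y_hat, n), mse_grad(y, y_hat, n))
```

The MAE column stays at −0.25 for every residual, while the MSE column shrinks from −5.0 to −0.005 as the residual approaches zero.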
The constant-magnitude gradient has two practical implications for training neural networks and other models via gradient descent:
Oscillation near the optimum: Because the gradient does not decrease as the model gets closer to the correct answer, the optimization can overshoot the minimum and oscillate back and forth. This can make convergence slower or less stable compared to MSE.
Equal treatment of all errors: Large errors receive the same gradient magnitude as small errors. This is beneficial for robustness (outliers do not dominate the gradient), but it also means the model does not "rush" to fix its biggest mistakes first.
To mitigate these issues, practitioners often use learning rate scheduling, gradient clipping, or switch to a smooth approximation like Huber loss. Adaptive optimizers such as Adam and AdamW partially compensate for the constant-gradient issue by maintaining per-parameter running averages of the squared gradients, which effectively rescale the step size based on each parameter's historical gradient magnitude. In practice, modern deep networks trained with Adam and L1 loss converge well, even though L1 lacks the smooth shrinkage that MSE provides.
When the gradient does not exist at zero, optimization theory provides the subgradient: any value in [−1, +1] is a valid subgradient of |x| at x = 0. Subgradient descent uses any element of the subdifferential as a search direction. Modern automatic differentiation libraries return 0 by convention at the kink, which works well in practice because exact zero residuals are rare on continuous data.
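Subgradient descent can be sketched for the simplest case, a single constant predictor m. The data, the starting point, and the 1/√t step schedule below are assumptions made for illustration, and 0 is used as the subgradient at exact ties.

```python
def subgradient(m, values):
    """A subgradient of sum(|v - m|) with respect to m, choosing 0 at exact ties."""
    return sum((m > v) - (m < v) for v in values)

values = [2.0, 3.0, 5.0, 7.0, 50.0]  # hypothetical data; the median is 5.0

m = 20.0  # arbitrary starting point
for t in range(1, 2001):
    step = 1.0 / t ** 0.5  # diminishing step size (an assumed schedule)
    m -= step * subgradient(m, values)

print(m)
```

With the diminishing step the iterate settles close to the median of 5.0. With a fixed step it would keep bouncing around the median by roughly one step length, which is the oscillation described above.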
Choosing between MAE and MSE (or its root, RMSE) depends on the problem context, data characteristics, and what types of errors matter most.
| Consideration | Use MAE | Use MSE/RMSE |
|---|---|---|
| Outliers present | Yes, MAE is robust | No, MSE amplifies outlier effects |
| All errors equally costly | Yes | No, large errors are disproportionately costly |
| Interpretability needed | MAE is in original units | RMSE is in original units; MSE is in squared units |
| Optimization smoothness | MAE gradient is discontinuous at zero | MSE is smooth and differentiable everywhere |
| Error distribution | Laplace (heavy-tailed) | Gaussian (thin-tailed) |
| Gradient-based training | Constant gradient can cause oscillation | Gradient naturally decreases near optimum |
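Use MAE when outliers are present, when all errors are equally costly, or when the result must be reported in the original units of the target.
Use MSE when large errors are disproportionately costly, when the errors are approximately Gaussian, or when smooth gradient-based optimization is a priority.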
For any sample, MAE ≤ RMSE always holds (this follows from the Cauchy-Schwarz inequality), with equality only when all absolute errors are identical. The ratio RMSE / MAE is bounded between 1 and √n; values close to 1 indicate that errors are uniform in magnitude, while values closer to √n indicate that a few large errors dominate. This ratio itself can be a useful diagnostic for outlier presence.
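The two extremes of the RMSE / MAE ratio can be reproduced with hand-picked errors. The values below are hypothetical and chosen so that the arithmetic is exact.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform = [2.0, -2.0, 2.0, -2.0]  # all absolute errors identical
spiky = [0.0, 0.0, 0.0, 8.0]      # a single error carries everything

for name, errors in (("uniform", uniform), ("spiky", spiky)):
    print(name, rmse(errors) / mae(errors))
```

The uniform case gives a ratio of 1.0 and the spiky case gives 2.0, which is √n for n = 4.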
Huber loss, introduced by Peter Huber in 1964, combines the advantages of both MAE and MSE by switching between quadratic and linear behavior based on a threshold parameter delta (δ): it is quadratic for residuals with |e| ≤ δ and linear for larger residuals.
Formally:
L_δ(e) = 0.5 × e² if |e| ≤ δ
L_δ(e) = δ × (|e| − 0.5 × δ) if |e| > δ
Huber loss is differentiable everywhere (unlike MAE), and its gradient transitions smoothly from being proportional to the error (like MSE) to having a constant magnitude (like MAE). This makes it a popular choice in robust regression, reinforcement learning, and deep learning applications where both smooth optimization and outlier resistance are desired.
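The piecewise definition translates directly into code. The following is a minimal per-residual sketch with illustrative function names (`huber`, `huber_grad`); it is not the implementation of any specific library.

```python
def huber(e, delta=1.0):
    """Huber loss for a single residual e with threshold delta."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e             # quadratic region, like MSE
    return delta * (a - 0.5 * delta)   # linear region, like MAE

def huber_grad(e, delta=1.0):
    """Derivative of the Huber loss with respect to e."""
    if abs(e) <= delta:
        return e                       # proportional to the residual
    return delta if e > 0 else -delta  # constant magnitude

for e in (0.5, 1.0, 3.0, -3.0):
    print(e, huber(e), huber_grad(e))
```

Both branches give 0.5 at |e| = δ = 1, so the loss is continuous at the threshold, and the derivative is capped at ±δ beyond it.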
The δ parameter controls the transition point. A large δ makes Huber loss behave more like MSE, while a small δ makes it behave more like MAE. PyTorch implements it as torch.nn.HuberLoss, and the closely related torch.nn.SmoothL1Loss differs only in scaling. The same idea appears in DeepMind's DQN paper (Mnih et al., 2015), which clips the error term of the Q-learning update to [−1, 1]; this corresponds to using Huber loss with δ = 1 for the Bellman residual, where reward outliers can otherwise destabilize training.
Huber's 1964 paper introduced the framework of M-estimators, which are estimators defined by minimizing a sum of a generic loss function ρ. MAE corresponds to ρ(e) = |e|, MSE to ρ(e) = e²/2, and Huber loss to a piecewise combination. Other well-known robust M-estimators include:
| M-estimator | ρ(e) shape | Behavior at large \|e\| | Bounded influence? |
|---|---|---|---|
| L2 (MSE / OLS) | Quadratic | Quadratic | No |
| L1 (MAE / LAD) | Linear | Linear | Yes (bounded by 1) |
| Huber | Quadratic then linear | Linear | Yes |
| Tukey biweight | Bounded redescending | Constant beyond cutoff | Yes |
| Andrews wave | Bounded redescending | Periodic, capped | Yes |
| Cauchy | Logarithmic | Logarithmic | Yes |
The redescending estimators (Tukey biweight, Andrews wave) actually decrease the influence of points beyond a cutoff distance, making them even more outlier-resistant than MAE, but at the cost of non-convexity.
MAE is one of the most commonly used evaluation metrics for time series forecasting. Its interpretability in the units of the series and its robustness to occasional spikes make it particularly well-suited for this domain.
In forecasting practice, MAE is frequently reported alongside other metrics such as RMSE (which emphasizes large errors), MAPE (which normalizes errors as percentages), and MASE (Mean Absolute Scaled Error, which normalizes by the naive forecast's error).
The Makridakis Competitions (M, M2, M3, M4, M5) are a long-running series of public forecasting contests organized by Spyros Makridakis and collaborators since 1982. These competitions have used MAE-derived metrics extensively to rank submissions and to draw broad empirical conclusions about forecasting practice.
| Competition | Year | Series count | Headline accuracy metrics |
|---|---|---|---|
| M | 1982 | 1,001 | MAPE, MSE |
| M2 | 1993 | 29 | MAPE |
| M3 | 2000 | 3,003 | MAPE, sMAPE |
| M4 | 2018 | 100,000 | sMAPE, MASE, OWA |
| M5 | 2020 | 42,840 (Walmart hierarchical sales) | RMSSE, WRMSSE, MAE-based scaled metrics |
The M4 competition introduced the Overall Weighted Average (OWA) metric, which combines sMAPE and MASE relative to a seasonal naive baseline. The M5 competition, run on Kaggle in 2020, used a weighted root-mean-squared scaled error and explicitly emphasized hierarchical demand forecasting at Walmart. Across competitions, the empirical finding is consistent: scaled and absolute-error metrics correlate strongly, and MAE-based criteria reward stable, well-calibrated forecasts more than they reward occasional spectacular hits.
MASE was proposed by Hyndman and Koehler (2006) to address the shortcomings of MAPE and sMAPE. It scales MAE by the in-sample MAE of a naive forecast (the previous observation, or the same observation from the previous season for seasonal data):
MASE = MAE / MAE_naive
where MAE_naive is the in-sample mean absolute error from the one-step naive forecast on the training set.
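The definition can be sketched in a few lines. The function name `mase`, its `season` argument, and the short series below are hypothetical; `season=1` gives the ordinary one-step naive forecast and a larger value gives the seasonal naive forecast.

```python
def mase(train, y_true, y_pred, season=1):
    """Test-set MAE scaled by the in-sample MAE of a (seasonal) naive forecast."""
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    mae_naive = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)
    return mae / mae_naive

# Hypothetical series: training history followed by two held-out points.
train = [10.0, 12.0, 11.0, 13.0, 12.0]
y_true = [14.0, 13.0]
y_pred = [13.0, 13.5]

print(mase(train, y_true, y_pred))
```

Here the in-sample naive MAE is 1.5 and the test MAE is 0.75, giving a MASE of 0.5. The sketch does not guard against a constant training series, for which the naive MAE would be zero.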
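MASE has several attractive properties: it is scale-free, so it can be compared across series with different units; a value below 1 means the forecast outperforms the in-sample naive forecast on average; and it does not divide by the actual values, so zeros in the series do not make it undefined.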
MASE has become a de facto standard in academic forecasting research and is the recommended scale-free metric in the third edition of Hyndman and Athanasopoulos's textbook Forecasting: Principles and Practice.
| Domain | Example Use Case | Why MAE Is Suitable |
|---|---|---|
| Retail and demand forecasting | Predicting daily product sales | Interpretable in units sold; robust to promotional spikes |
| Energy | Forecasting electricity demand | Error in kWh is directly actionable for grid operators |
| Finance | Predicting stock returns | Outlier-resistant; treats gains and losses symmetrically |
| Weather and climate | Temperature or precipitation forecasts | Error in degrees or millimeters is easy to communicate |
| Healthcare | Predicting patient wait times | Supports operational planning with average error estimates |
| Transportation | Estimating delivery or travel times | Linear error cost aligns with customer experience |
| Logistics | Estimated time of arrival (ETA) systems | Robust to traffic anomalies; minutes are interpretable |
| Cloud capacity planning | Predicting server load | Resists rare spike events; aggregates well over many machines |
Beyond its classic role as a regression metric, MAE and its L1 variant are used extensively across modern machine learning architectures. Several prominent areas rely on absolute-error losses:
In computer vision, bounding-box regression has historically used L1 loss or its smooth variant. The Smooth L1 loss (also called Huber loss with δ = 1) was introduced by Ross Girshick for Fast R-CNN in 2015 to combine the smoothness of L2 near zero with the robustness of L1 for large errors. Subsequent detectors including Faster R-CNN, SSD, RetinaNet, and YOLO variants have used L1 or Smooth L1 for box coordinate regression. More recent transformer-based detectors such as DETR (Carion et al., 2020) use a combination of L1 and Generalized IoU (GIoU) loss for box regression. The DIoU and CIoU variants further refine this approach, but the L1 component remains a standard baseline in detection literature.
Image-to-image translation networks such as pix2pix (Isola et al., 2017) and CycleGAN (Zhu et al., 2017) combine an adversarial loss with an L1 reconstruction loss because L1 produces sharper outputs than L2, which tends to blur high-frequency detail. Many diffusion models and autoencoders include an L1 reconstruction term for the same reason. The intuition is that L1 puts the conditional median at each pixel, and medians of multimodal distributions tend to be one of the modes rather than an average of modes (which would appear as a blurry blend).
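The median-versus-mean intuition can be illustrated with a toy bimodal target. The pixel values below are hypothetical: for the same input, 70% of the plausible targets are dark and 30% are bright.

```python
import statistics

# Hypothetical plausible intensities for one pixel: 0.0 is dark, 1.0 is bright.
targets = [0.0] * 7 + [1.0] * 3

l2_optimum = statistics.mean(targets)    # constant that minimizes squared error
l1_optimum = statistics.median(targets)  # constant that minimizes absolute error

print(l2_optimum, l1_optimum)
```

The L2 optimum is a grey value of about 0.3 that matches neither mode, while the L1 optimum is 0.0, the majority mode. This is only a sketch of the intuition: with an even 50/50 split the median is not unique, so the effect is not guaranteed in every case.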
Monocular depth estimation networks (such as MiDaS, DPT, and ZoeDepth) often use scale-invariant L1 or trimmed L1 losses on log-depth, because depth labels contain heavy-tailed errors and outliers from sensor noise. Optical flow networks similarly rely on robust L1-based criteria.
Speech enhancement, denoising, and source separation systems frequently use L1 or weighted L1 losses on spectrograms because L1 is more perceptually correlated than L2 in many audio settings.
In modern practice, MAE is rarely used alone. Common combinations include:
| Combination | Where used | Why |
|---|---|---|
| L1 + L2 | Image reconstruction | L1 sharpness with L2 smoothness |
| L1 + GIoU | Object detection (DETR) | Position accuracy plus shape accuracy |
| L1 + adversarial | pix2pix, CycleGAN | Pixel fidelity plus realism |
| Quantile loss at multiple τ | Probabilistic forecasting | Calibrated prediction intervals |
| Huber + entropy | Reinforcement learning | Stable Bellman updates with exploration |
Mean Absolute Percentage Error (MAPE) expresses MAE as a percentage of the actual values:
MAPE = (100% / n) × Σ |y_i − ŷ_i| / |y_i|
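MAPE is useful for comparing forecast accuracy across different scales (for example, comparing a model's accuracy on high-volume products versus low-volume products). However, it has several well-known limitations: it is undefined when an actual value is zero and unstable when actual values are close to zero, and it is asymmetric, because for non-negative forecasts an under-prediction can never exceed 100% error while an over-prediction is unbounded.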
Alternatives include Symmetric MAPE (sMAPE), which addresses the asymmetry issue, and MASE (Mean Absolute Scaled Error), which normalizes by the naive forecast and avoids the division-by-zero problem entirely.
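The behavior of MAPE on different scales, and its asymmetry, can be seen in a short sketch. The `mape` helper and the product volumes are hypothetical; raising an error on zero actuals is one possible design choice, not a universal convention.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error; raises if any actual value is zero."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)

# The same absolute error of 50 units on two hypothetical products.
print(mape([1000.0], [950.0]))  # high-volume product
print(mape([100.0], [50.0]))    # low-volume product

# Asymmetry: under-predicting down to zero versus over-predicting by 200 units.
print(mape([100.0], [0.0]), mape([100.0], [300.0]))
```

An error of 50 units is 5% on the high-volume product but 50% on the low-volume one. The last line shows that the worst non-negative under-prediction scores 100%, while an over-prediction can score 200% or more.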
sMAPE was designed to mitigate MAPE's asymmetry by placing the average of the actual and predicted values in the denominator:
sMAPE = (200% / n) × Σ |y_i − ŷ_i| / (|y_i| + |ŷ_i|)
Because the denominator stands for the average (|y_i| + |ŷ_i|) / 2, the leading factor is 200 rather than 100, and sMAPE ranges from 0% to 200% rather than 0% to 100%. sMAPE was the headline metric for the M3 and M4 competitions, but it has also been criticized: Goodwin and Lawton (1999) showed that sMAPE is still asymmetric in a more subtle way, and Hyndman (2014) recommended MASE over sMAPE for academic comparisons.
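A direct implementation of the formula makes the range and the residual asymmetry visible. The `smape` helper below is a sketch; treating a term with zero actual and zero forecast as contributing 0 is an assumption, since conventions for that case differ.

```python
def smape(y_true, y_pred):
    """Symmetric MAPE in percent (200% convention); 0/0 terms are counted as 0."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        denom = abs(y) + abs(p)
        total += 0.0 if denom == 0 else abs(y - p) / denom
    return 200.0 * total / len(y_true)

print(smape([100.0], [100.0]))  # perfect forecast
print(smape([100.0], [0.0]))    # upper end of the range

# Same absolute error of 50 units, in opposite directions.
print(smape([100.0], [150.0]), smape([100.0], [50.0]))
```

A forecast of zero against a non-zero actual gives the maximum of 200%. Over-predicting by 50 scores 40%, while under-predicting by 50 scores about 66.7%, so equal absolute errors are still not penalized equally.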
NMAE divides MAE by some normalization constant such as the range, mean, or standard deviation of the actual values:
NMAE = MAE / (max(y) − min(y)) or NMAE = MAE / mean(|y|)
This is common in recommender systems and physical sciences, where comparing accuracy across heterogeneous datasets matters.
When some observations are more important than others (for example, recent observations in time series, or high-revenue products in retail), a weighted MAE is used:
WMAE = (Σ w_i × |y_i − ŷ_i|) / (Σ w_i)
where w_i ≥ 0 is the weight assigned to observation i. Weighted RMSSE in the M5 competition is a hierarchical extension of this idea, scaling the squared-error variant of WMAE by a per-series naive baseline.
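The weighted form is a small change to the plain metric. The helper name `weighted_mae` and the numbers below are hypothetical.

```python
def weighted_mae(y_true, y_pred, weights):
    """Weighted mean absolute error with non-negative weights."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    num = sum(w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights))
    return num / sum(weights)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

print(weighted_mae(y_true, y_pred, [1, 1, 1, 1]))  # equal weights: plain MAE
print(weighted_mae(y_true, y_pred, [1, 1, 3, 1]))  # third observation counts triple
```

With equal weights the result is the ordinary MAE of 0.75. Tripling the weight of the third observation, which has the largest error, raises the score to 7/6, or about 1.17.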
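Research by Willmott and Matsuura (2005) and subsequent work has shown that MAE can be decomposed into two meaningful components: one that reflects systematic bias in the overall quantity predicted, and one that reflects misallocation of values across locations.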
This decomposition is particularly useful in remote sensing and spatial analysis, where understanding whether errors stem from systematic bias or from misallocation of values across locations can guide model improvement. Pontius et al. (2008) extended the decomposition to include multiple resolutions, allowing analysts to track how errors aggregate or cancel as data is summarized at coarser spatial scales.
When comparing two models by MAE, a single point estimate of MAE on a held-out test set can be misleading because of sampling variability. Several techniques are commonly used to quantify this uncertainty, with supporting tools available in libraries such as scikit-learn and arch.

In deep learning contexts, MAE differences across runs are often dominated by random initialization and data ordering, so reporting mean ± standard deviation across multiple seeds is the most common practice.
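One widely used way to quantify sampling variability in an MAE comparison is a paired bootstrap over test points. The sketch below is an assumption about how such a check might look, not a method prescribed by this article; the data are synthetic and the function name `bootstrap_mae_diff` is hypothetical.

```python
import random

def mae(y_true, y_pred):
    """Plain mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_mae_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Percentile 95% interval for MAE(A) - MAE(B) using paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Resample test indices with replacement; both models share the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        diff = mae(yt, [pred_a[i] for i in idx]) - mae(yt, [pred_b[i] for i in idx])
        diffs.append(diff)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic test set: model B is noisier than model A by construction.
rng = random.Random(1)
y_true = [rng.uniform(0, 10) for _ in range(200)]
pred_a = [y + rng.gauss(0, 1.0) for y in y_true]
pred_b = [y + rng.gauss(0, 2.0) for y in y_true]

low, high = bootstrap_mae_diff(y_true, pred_a, pred_b)
print(low, high)
```

If the interval excludes zero, as it should here because model B was built with twice the noise, the difference in MAE is unlikely to be an artifact of which test points happened to be sampled. For time series, a block bootstrap would be more appropriate than the independent resampling used in this sketch.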
When MAE is used as a training loss rather than only as an evaluation metric, several practical points are worth knowing. For linear models, median regression is available in statsmodels.regression.quantile_regression (with τ = 0.5) and in classic LAD regression code.

MAE is available as a built-in function in all major machine learning frameworks.
| Framework | Class or function | Notes |
|---|---|---|
| scikit-learn | sklearn.metrics.mean_absolute_error(y_true, y_pred) | Returns a single float; supports sample_weight and multioutput |
| PyTorch (loss) | torch.nn.L1Loss() | Computes mean by default; supports reduction='sum' and 'none' |
| PyTorch (functional) | torch.nn.functional.l1_loss(input, target) | Functional form, no parameters |
| TensorFlow / Keras (loss) | tf.keras.losses.MeanAbsoluteError() | Trainable loss class |
| TensorFlow / Keras (metric) | tf.keras.metrics.MeanAbsoluteError() | Stateful metric for model.compile(metrics=...) |
| JAX / Optax | optax.l1_loss(predictions, targets) | Returns per-example loss; reduce manually |
| XGBoost | objective='reg:absoluteerror' | Available since XGBoost 1.7 |
| LightGBM | objective='regression_l1' or 'mae' | Long-supported |
| CatBoost | loss_function='MAE' | Long-supported |
| statsmodels | QuantReg(y, X).fit(q=0.5) | Median regression via quantile regression |
| H2O.ai | distribution='laplace' for GBM | Equivalent to MAE under Laplace MLE |
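In scikit-learn, MAE is available as sklearn.metrics.mean_absolute_error:
from sklearn.metrics import mean_absolute_error
# Same five observations as the worked example above
y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]
mae = mean_absolute_error(y_true, y_pred)
# Format to two decimals so floating-point rounding does not affect the printed value
print(f"MAE: {mae:.2f}") # Output: MAE: 0.52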
In PyTorch, MAE is implemented as torch.nn.L1Loss, reflecting its equivalence to the L1 loss function:
import torch
import torch.nn as nn
y_true = torch.tensor([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = torch.tensor([2.5, 4.8, 3.1, 6.5, 4.3])
criterion = nn.L1Loss()
mae = criterion(y_pred, y_true)
# float32 arithmetic does not represent 0.52 exactly, so format to two decimals
print(f"MAE: {mae.item():.2f}") # Output: MAE: 0.52
PyTorch also exposes torch.nn.SmoothL1Loss (Huber-style with δ = 1 by default) and torch.nn.HuberLoss (with a configurable delta argument) for the smooth hybrid loss described above.
TensorFlow provides MAE both as a metric and as a loss function through its Keras API:
import tensorflow as tf
y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]
# As a metric
mae = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae.numpy():.2f}") # Output: MAE: 0.52
# As a loss function for training
loss_fn = tf.keras.losses.MeanAbsoluteError()
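With NumPy alone, MAE can be computed directly from the definition:
import numpy as np
y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])
# Mean of the element-wise absolute differences
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.2f}") # Output: MAE: 0.52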
XGBoost, LightGBM, and CatBoost all support MAE as a regression objective. In LightGBM:
import lightgbm as lgb
params = {
'objective': 'regression_l1',
'metric': 'mae',
'learning_rate': 0.05,
'num_leaves': 31,
}
# train_data is assumed to be an lgb.Dataset built beforehand from your features and targets
model = lgb.train(params, train_data, num_boost_round=200)
Note that gradient boosting with MAE is slightly different from MAE-based neural network training because tree-based learners use second-order Taylor approximations of the loss, and MAE has a Hessian of zero almost everywhere. Most libraries fall back to a smoothed approximation or to constant-gradient leaf values when training trees with L1 objective.
For median regression and other quantiles, scikit-learn provides QuantileRegressor and statsmodels provides QuantReg:
from sklearn.linear_model import QuantileRegressor
import numpy as np
X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.laplace(scale=1.0, size=200)
model = QuantileRegressor(quantile=0.5, alpha=0.0, solver='highs')
model.fit(X, y)
# model.coef_ approximates the L1-optimal coefficients
A few mistakes recur often enough to be worth listing. One concerns the 'mean', 'sum', and 'none' reductions that loss implementations offer: the numerical scale of the loss differs between modes, which can quietly invalidate learning rate tuning when switching frameworks.

Imagine you and your friends are guessing how many candies are in a jar. The jar actually has 50 candies. One friend guesses 47, another guesses 53, and a third guesses 45. To figure out how good your group's guesses are, you look at how far off each guess was: the first friend was off by 3, the second by 3, and the third by 5. You do not care whether someone guessed too high or too low; you just care about how far away the guess was. Then you take the average of those distances: (3 + 3 + 5) / 3 = about 3.7 candies. That average distance is the Mean Absolute Error. A smaller number means the guesses were closer to the real answer.