Mean Absolute Error (MAE) is one of the most widely used metrics in statistics and machine learning for evaluating the accuracy of predictions. It measures the average magnitude of errors between predicted values and actual observed values, without considering the direction of the errors. Because it treats every error equally and expresses the result in the same units as the original data, MAE is valued for its simplicity, interpretability, and robustness. In the context of loss functions, MAE is equivalent to the L1 loss, which plays a central role in optimization, regularization, and robust estimation.
Given a set of n observations where y_i represents the actual value and ŷ_i represents the predicted value for the i-th observation, the Mean Absolute Error is defined as:
MAE = (1 / n) × Σ |y_i − ŷ_i|
Or equivalently, using the individual error terms e_i = y_i − ŷ_i:
MAE = (1 / n) × Σ |e_i|
The summation runs from i = 1 to i = n. Each term |y_i − ŷ_i| computes the absolute difference between the true value and the prediction. By averaging these absolute differences, MAE provides a single number that summarizes the typical prediction error across all data points.
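For readers who want to see the formula in code, here is a minimal from-scratch sketch in plain Python (the framework implementations shown later in this article are what you would use in practice):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actuals and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))
    return total / len(y_true)

# Using the five observations from the example below:
print(mean_absolute_error([3.0, 5.0, 2.0, 7.0, 4.0],
                          [2.5, 4.8, 3.1, 6.5, 4.3]))  # ≈ 0.52
```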
Consider a simple regression scenario with five observations:
| Observation | Actual Value (y) | Predicted Value (ŷ) | Absolute Error \|y − ŷ\| |
|---|---|---|---|
| 1 | 3.0 | 2.5 | 0.5 |
| 2 | 5.0 | 4.8 | 0.2 |
| 3 | 2.0 | 3.1 | 1.1 |
| 4 | 7.0 | 6.5 | 0.5 |
| 5 | 4.0 | 4.3 | 0.3 |
MAE = (0.5 + 0.2 + 1.1 + 0.5 + 0.3) / 5 = 2.6 / 5 = 0.52
This means the model's predictions are, on average, 0.52 units away from the actual values.
Unlike metrics that square the errors, MAE is expressed in the same units as the target variable. If you are predicting house prices in dollars, an MAE of 15,000 means your model is off by $15,000 on average. This makes it straightforward to communicate results to non-technical stakeholders.
MAE treats all errors with equal weight. A prediction that misses by 100 units contributes exactly 100 to the sum, whereas under Mean Squared Error (MSE), the same error contributes 10,000 (the square of 100). This linear treatment means that MAE is significantly less sensitive to outliers than MSE or RMSE. When your dataset contains extreme values or noisy observations, MAE provides a more stable and representative measure of typical model performance.
MAE penalizes overestimations and underestimations equally. A prediction that is 5 units too high and one that is 5 units too low both contribute 5 to the error sum. This symmetric treatment is appropriate in many practical settings where the cost of over-predicting and under-predicting is the same.
Because MAE uses the same scale as the data, it cannot be directly compared across datasets with different measurement units. An MAE of 2.0 for a temperature forecast (in degrees Celsius) is not comparable to an MAE of 2.0 for a revenue forecast (in thousands of dollars). Normalized variants such as MAPE address this limitation.
The absolute value function |x| has a sharp corner at x = 0, which means MAE is not differentiable when a prediction exactly matches the actual value. This creates challenges for optimization algorithms based on gradient descent, since the gradient is undefined at that point. In practice, subgradient methods or smooth approximations such as Huber loss are used to handle this issue.
In optimization and deep learning, MAE is commonly referred to as L1 loss. The name comes from its relationship to the L1 norm (also known as the Manhattan distance or taxicab norm), which computes the sum of absolute values of a vector's components.
Formally, minimizing the L1 loss over a training set is equivalent to minimizing:
L1 Loss = Σ |y_i − f(x_i)|
where f(x_i) is the model's prediction for input x_i.
L1 loss is closely related to L1 regularization (Lasso), which adds the sum of absolute values of model weights as a penalty term. Both share the property of promoting sparsity: L1 regularization encourages many weights to become exactly zero, and L1 loss encourages the model to optimize toward the median rather than the mean of the target distribution.
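To make the sparsity claim concrete, here is a small sketch using scikit-learn's Lasso on synthetic data; the dataset shape and the alpha value are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # 10 features, only 2 of them informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)         # strength of the L1 penalty (illustrative)
lasso.fit(X, y)

# The L1 penalty drives most coefficients exactly to zero
print(np.round(lasso.coef_, 3))
```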
One of the most important theoretical properties of MAE is its relationship to the median. If you need to choose a single constant value m that minimizes the mean absolute error across all observations, the optimal choice is the median of those observations.
Formally: m is a sample median if and only if m minimizes the expression Σ |y_i − m|.
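A quick numerical check makes this concrete. The sketch below (arbitrary sample values, including a deliberate outlier) sweeps candidate constants m and confirms that the absolute-error sum bottoms out at the median, while the squared-error sum bottoms out at the mean:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier at 100
m = np.linspace(0, 100, 100001)             # candidate constant predictors

abs_loss = np.abs(y[:, None] - m[None, :]).sum(axis=0)
sq_loss = ((y[:, None] - m[None, :]) ** 2).sum(axis=0)

print(m[abs_loss.argmin()])  # ≈ 3.0, the median
print(m[sq_loss.argmin()])   # ≈ 22.0, the mean
```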
This stands in contrast to MSE, where the optimal constant predictor is the mean of the observations. The distinction has practical consequences:
| Property | MAE (L1 Loss) | MSE (L2 Loss) |
|---|---|---|
| Optimal constant predictor | Median | Mean |
| Sensitivity to outliers | Low (linear penalty) | High (quadratic penalty) |
| Statistical assumption | Laplace-distributed errors | Gaussian-distributed errors |
| Gradient magnitude | Constant (±1/n) | Proportional to error size |
| Differentiability | Not differentiable at zero | Differentiable everywhere |
Because the median is more robust to extreme values than the mean, models trained with MAE loss tend to produce predictions that are more resistant to outlier contamination.
The gradient (or more precisely, the subgradient) of the MAE loss with respect to a single prediction is:
∂MAE/∂ŷ_i = −sign(y_i − ŷ_i) / n
where sign(x) returns +1 if x > 0, −1 if x < 0, and is undefined (or any value in [−1, +1]) if x = 0.
This means the gradient has a constant magnitude of 1/n regardless of how large or small the error is. By comparison, the gradient of MSE is proportional to the error itself, so it naturally shrinks as the prediction approaches the true value.
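The contrast is easy to verify with automatic differentiation. This sketch uses PyTorch's built-in l1_loss and mse_loss on arbitrary values:

```python
import torch

y_true = torch.tensor([0.0, 0.0, 0.0])
y_pred = torch.tensor([0.1, 1.0, 10.0], requires_grad=True)

# MAE: every prediction receives a gradient of magnitude 1/n = 1/3,
# no matter whether its error is 0.1 or 10
torch.nn.functional.l1_loss(y_pred, y_true).backward()
print(y_pred.grad)   # tensor([0.3333, 0.3333, 0.3333])

y_pred.grad = None   # reset before the second backward pass

# MSE: the gradient 2 * error / n grows with the error
torch.nn.functional.mse_loss(y_pred, y_true).backward()
print(y_pred.grad)   # tensor([0.0667, 0.6667, 6.6667])
```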
The constant-magnitude gradient has two practical implications for training neural networks and other models via gradient descent:
- Oscillation near the optimum: Because the gradient does not decrease as the model gets closer to the correct answer, the optimization can overshoot the minimum and oscillate back and forth. This can make convergence slower or less stable than with MSE.
- Equal treatment of all errors: Large errors receive the same gradient magnitude as small errors. This is beneficial for robustness (outliers do not dominate the gradient), but it also means the model does not "rush" to fix its biggest mistakes first.
To mitigate these issues, practitioners often use learning rate scheduling, gradient clipping, or switch to a smooth approximation like Huber loss.
Choosing between MAE and MSE (or its root, RMSE) depends on the problem context, data characteristics, and what types of errors matter most.
| Consideration | Use MAE | Use MSE/RMSE |
|---|---|---|
| Outliers present | Yes, MAE is robust | No, MSE amplifies outlier effects |
| All errors equally costly | Yes | No, large errors are disproportionately costly |
| Interpretability needed | MAE is in original units | RMSE is in original units; MSE is in squared units |
| Optimization smoothness | MAE gradient is discontinuous at zero | MSE is smooth and differentiable everywhere |
| Error distribution | Laplace (heavy-tailed) | Gaussian (thin-tailed) |
| Gradient-based training | Constant gradient can cause oscillation | Gradient naturally decreases near optimum |
Use MAE when:

- The data contains outliers or heavy-tailed noise that should not dominate the metric.
- All errors are roughly equally costly, regardless of their size.
- The metric must be reported in the original units and explained to non-technical stakeholders.
- Predictions should track the median of the target distribution.
Use MSE when:

- Large errors are disproportionately costly and should be penalized more heavily.
- Smooth, everywhere-differentiable optimization is a priority.
- Errors are approximately Gaussian and the data is largely free of outliers.
- Predictions should track the mean of the target distribution.
Huber loss, introduced by Peter Huber in 1964, combines the advantages of both MAE and MSE by switching between quadratic and linear behavior based on a threshold parameter delta (δ):
Formally:
L_δ(e) = 0.5 × e² if |e| ≤ δ
L_δ(e) = δ × (|e| − 0.5 × δ) if |e| > δ
Huber loss is differentiable everywhere (unlike MAE), and its gradient transitions smoothly from being proportional to the error (like MSE) to having a constant magnitude (like MAE). This makes it a popular choice in robust regression, reinforcement learning, and deep learning applications where both smooth optimization and outlier resistance are desired.
The δ parameter controls the transition point. A large δ makes Huber loss behave more like MSE, while a small δ makes it behave more like MAE. In PyTorch, this is implemented as torch.nn.SmoothL1Loss (with some scaling differences) and torch.nn.HuberLoss.
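For reference, here is a minimal NumPy sketch of the piecewise definition above (the delta value is illustrative):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for errors within delta, linear beyond it."""
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2                     # MSE-like region
    linear = delta * (np.abs(error) - 0.5 * delta)   # MAE-like region
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])
print(huber_loss(y_true, y_pred, delta=1.0))  # 0.183
```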
MAE is one of the most commonly used evaluation metrics for time series forecasting. Because it is expressed in the same units as the series, is robust to occasional spikes and anomalous observations, and treats over- and under-forecasts symmetrically, it is particularly well-suited for this domain.
In forecasting practice, MAE is frequently reported alongside other metrics such as RMSE (which emphasizes large errors), MAPE (which normalizes errors as percentages), and MASE (Mean Absolute Scaled Error, which normalizes by the naive forecast's error).
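As a sketch of how these metrics sit side by side, the snippet below computes MAE, RMSE, and MASE on a toy series; MASE is scaled here by the in-sample naive one-step forecast, a common convention:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 13.0, 12.0, 15.0, 16.0])
y_pred = np.array([11.0, 11.5, 13.5, 12.5, 14.0, 17.0])

mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MASE: scale MAE by the error of the naive "predict the previous value" forecast
naive_mae = np.mean(np.abs(np.diff(y_true)))
mase = mae / naive_mae

print(f"MAE:  {mae:.3f}")   # typical error in the original units
print(f"RMSE: {rmse:.3f}")  # emphasizes the large errors
print(f"MASE: {mase:.3f}")  # values below 1 beat the naive forecast
```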
| Domain | Example Use Case | Why MAE Is Suitable |
|---|---|---|
| Retail and demand forecasting | Predicting daily product sales | Interpretable in units sold; robust to promotional spikes |
| Energy | Forecasting electricity demand | Error in kWh is directly actionable for grid operators |
| Finance | Predicting stock returns | Outlier-resistant; treats gains and losses symmetrically |
| Weather and climate | Temperature or precipitation forecasts | Error in degrees or millimeters is easy to communicate |
| Healthcare | Predicting patient wait times | Supports operational planning with average error estimates |
| Transportation | Estimating delivery or travel times | Linear error cost aligns with customer experience |
Mean Absolute Percentage Error (MAPE) expresses MAE as a percentage of the actual values:
MAPE = (100% / n) × Σ |y_i − ŷ_i| / |y_i|
MAPE is useful for comparing forecast accuracy across different scales (for example, comparing a model's accuracy on high-volume products versus low-volume products). However, it has several well-known limitations:

- It is undefined when any actual value is zero, and it explodes when actual values are close to zero.
- It is asymmetric: for the same absolute error, over-forecasts are penalized more heavily than under-forecasts, which can bias model selection toward under-forecasting.
- Observations with small actual values dominate the average.
Alternatives include Symmetric MAPE (sMAPE), which addresses the asymmetry issue, and MASE (Mean Absolute Scaled Error), which normalizes by the naive forecast and avoids the division-by-zero problem entirely.
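A short sketch of the MAPE computation, with an explicit guard for the division-by-zero problem mentioned above:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        # MAPE is undefined whenever an actual value is zero
        raise ValueError("MAPE is undefined for zero actual values")
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mape([3.0, 5.0, 2.0, 7.0, 4.0],
           [2.5, 4.8, 3.1, 6.5, 4.3]))  # ≈ 18.1
```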
Research by Willmott and Matsuura (2005) and subsequent work has shown that MAE can be decomposed into two meaningful components:

- A systematic component that captures consistent over- or under-prediction (bias).
- An unsystematic component that captures the error remaining after the bias is removed, such as misallocation of values.
This decomposition is particularly useful in remote sensing and spatial analysis, where understanding whether errors stem from systematic bias or from misallocation of values across locations can guide model improvement.
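One simple formulation of this split (a sketch; published formulations differ in their details) separates the absolute mean error, which captures systematic bias, from the remaining error:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))
bias = np.abs(np.mean(errors))   # systematic over- or under-prediction
remainder = mae - bias           # error not explained by systematic bias

print(mae, bias, remainder)      # ≈ 0.52, 0.04, 0.48
```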
MAE is available as a built-in function in all major machine learning frameworks. In scikit-learn, it is provided by sklearn.metrics.mean_absolute_error:
```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae}")  # Output: MAE: 0.52
```
In PyTorch, MAE is implemented as torch.nn.L1Loss, reflecting its equivalence to the L1 loss function:
```python
import torch
import torch.nn as nn

y_true = torch.tensor([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = torch.tensor([2.5, 4.8, 3.1, 6.5, 4.3])

criterion = nn.L1Loss()
mae = criterion(y_pred, y_true)
print(f"MAE: {mae.item():.2f}")  # Output: MAE: 0.52 (at float32 precision)
```
TensorFlow provides MAE both as a metric and as a loss function through its Keras API:
```python
import tensorflow as tf

y_true = [3.0, 5.0, 2.0, 7.0, 4.0]
y_pred = [2.5, 4.8, 3.1, 6.5, 4.3]

# As a metric
mae = tf.keras.metrics.mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae.numpy():.2f}")  # Output: MAE: 0.52

# As a loss function for training
loss_fn = tf.keras.losses.MeanAbsoluteError()
print(f"Loss: {loss_fn(y_true, y_pred).numpy():.2f}")  # Output: Loss: 0.52
```
Plain NumPy computes MAE in a single expression:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 4.8, 3.1, 6.5, 4.3])

mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae}")  # Output: MAE: 0.52
```
Imagine you and your friends are guessing how many candies are in a jar. The jar actually has 50 candies. One friend guesses 47, another guesses 53, and a third guesses 45. To figure out how good your group's guesses are, you look at how far off each guess was: the first friend was off by 3, the second by 3, and the third by 5. You do not care whether someone guessed too high or too low; you just care about how far away the guess was. Then you take the average of those distances: (3 + 3 + 5) / 3 = about 3.7 candies. That average distance is the Mean Absolute Error. A smaller number means the guesses were closer to the real answer.