L1 loss, also called mean absolute error (MAE) or least absolute deviations (LAD), is a loss function that measures the average of the absolute differences between predicted values and actual target values. It is one of the most widely used loss functions in regression tasks across machine learning and statistics. Unlike the L2 loss (mean squared error), which squares error terms and disproportionately penalizes large deviations, L1 loss applies a linear penalty to all errors regardless of magnitude. This property makes it more robust to outliers in training data.
L1 loss has deep historical roots, predating even the method of least squares. It connects to the Laplace distribution through maximum likelihood estimation, generalizes naturally to quantile regression, and plays a direct role in regularization techniques such as LASSO. In modern deep learning, L1 loss and its smooth variants appear in applications ranging from bounding box regression in object detection to image super-resolution and denoising.
Imagine you are playing a guessing game. Your friend hides some number of marbles in a box, and you try to guess how many there are. After each guess, you find out how far off you were. If the box had 10 marbles and you guessed 7, you were off by 3. If you guessed 13, you were also off by 3.
L1 loss is like adding up all those "how far off" numbers. It does not care whether you guessed too high or too low. It just looks at the distance between your guess and the real answer. The goal is to make that total distance as small as possible.
What makes L1 loss special is that one really bad guess does not ruin your score too much. If you are usually close but one time you guess wildly wrong, L1 loss treats that big mistake more gently than other scoring methods would. It is like a fair teacher who does not let one bad test destroy your whole grade.
The method of least absolute deviations was first proposed by Roger Joseph Boscovich in 1757, nearly fifty years before the method of least squares was introduced by Adrien-Marie Legendre in 1805. Boscovich developed the technique to reconcile inconsistencies in astronomical measurements of the shape of the Earth. Pierre-Simon Laplace further refined the approach in 1788, using a symmetric two-sided exponential distribution (now known as the Laplace distribution) to model measurement errors, and derived the sum of absolute deviations as the natural error measure under that assumption.
Despite its earlier origin, least absolute deviations saw limited adoption compared to least squares throughout much of the 19th and 20th centuries. The primary reason was computational: least squares has a closed-form analytical solution (the normal equations), while minimizing the sum of absolute deviations requires iterative numerical methods. With the advent of linear programming algorithms and modern computing, L1 loss became practical for large-scale problems. In 1978, Roger Koenker and Gilbert Bassett formalized the connection between LAD and quantile regression, showing that minimizing absolute deviations is equivalent to estimating the conditional median (the 50th percentile) of the response variable.
For a dataset of n observations, let y_i denote the true value and ŷ_i (y-hat) the predicted value for the i-th sample. The L1 loss is defined as:
Element-wise absolute error:
l_i = |y_i - ŷ_i|
Mean absolute error (MAE):
L1(y, ŷ) = (1/n) * Σ |y_i - ŷ_i|
where the summation runs from i = 1 to n.
Some formulations omit the 1/n factor and instead define L1 loss as the sum of absolute errors (SAE). In the context of gradient descent optimization, the distinction between sum and mean only affects the effective learning rate and does not change the location of the minimum.
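As a quick sanity check, the following NumPy sketch (with made-up example values) computes the element-wise absolute errors, the MAE, and the SAE; the two aggregates differ only by the constant factor 1/n.

```python
import numpy as np

# Illustrative values only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

abs_errors = np.abs(y_true - y_pred)  # element-wise |y_i - y_hat_i|
mae = abs_errors.mean()               # mean absolute error (with the 1/n factor)
sae = abs_errors.sum()                # sum of absolute errors (without 1/n)

print(abs_errors)  # [0.5 0.5 0.  1. ]
print(mae)         # 0.5
print(sae)         # 2.0
```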
The derivative of the absolute value function |x| is the sign function: +1 when x > 0, and -1 when x < 0. At x = 0, the function has a "kink" and is not differentiable in the classical sense. However, the concept of a subgradient extends the notion of derivative to this point. At x = 0, any value in the interval [-1, 1] is a valid subgradient.
For the L1 loss with respect to the predicted value ŷ_i:
dL1/dŷ_i = -sign(y_i - ŷ_i) = { -1 if ŷ_i < y_i; +1 if ŷ_i > y_i; any value in [-1, 1] if ŷ_i = y_i }
In practice, deep learning frameworks such as PyTorch and TensorFlow handle the non-differentiable point by assigning a subgradient of 0 at x = 0. Because the probability of any individual prediction exactly equaling the target is essentially zero in continuous-valued problems, this convention has negligible impact on training.
An important consequence of this gradient structure is that the magnitude of the gradient is constant (always 1 or -1) regardless of how large the error is. This stands in contrast to L2 loss, where the gradient magnitude scales linearly with the error. The constant gradient magnitude means that L1 loss does not accelerate updates for large errors and does not slow down updates for small errors.
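This behavior is easy to verify with automatic differentiation. The snippet below is a minimal sketch using PyTorch's functional l1_loss: the per-sample gradient has magnitude 1 whether the error is 0.1 or 100, and is 0 at an exact match (the subgradient convention mentioned above).

```python
import torch
import torch.nn.functional as F

targets = torch.zeros(4)
predictions = torch.tensor([0.0, 0.1, 10.0, -100.0], requires_grad=True)

# Sum reduction so each sample's gradient is visible without the 1/n scaling.
loss = F.l1_loss(predictions, targets, reduction='sum')
loss.backward()

print(predictions.grad)  # tensor([ 0.,  1.,  1., -1.])
```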
L1 loss is more robust to outliers than L2 loss. The reason is straightforward: L2 loss squares the residual, so a single data point with a large error contributes a disproportionately large value to the total loss. L1 loss only takes the absolute value, so outliers have a linear rather than quadratic effect.
As a concrete example, consider five data points with residuals [1, 2, 1, 3, 2]. The sum of absolute errors is 9, and the sum of squared errors is 19. Now suppose the point with residual 3 is instead an outlier with residual 30: the sum of absolute errors becomes 36 (a 4x increase), while the sum of squared errors becomes 910 (a 48x increase). The L2 loss is dominated by the single outlier, but the L1 loss is not.
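The same numbers can be reproduced in a few lines of NumPy (a throwaway illustration, not library code):

```python
import numpy as np

residuals = np.array([1.0, 2.0, 1.0, 3.0, 2.0])
print(np.abs(residuals).sum())  # 9.0
print((residuals ** 2).sum())   # 19.0

# Replace the residual of 3 with an outlier of 30.
residuals[3] = 30.0
print(np.abs(residuals).sum())  # 36.0  (about a 4x increase)
print((residuals ** 2).sum())   # 910.0 (about a 48x increase)
```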
From a statistical perspective, this robustness arises because minimizing L1 loss yields the conditional median of the response variable, while minimizing L2 loss yields the conditional mean. The median is a more robust measure of central tendency than the mean because it is less affected by extreme values.
Minimizing L1 loss is equivalent to performing maximum likelihood estimation under the assumption that the errors follow a Laplace distribution. The Laplace distribution has the probability density function:
f(x | μ, b) = (1/(2b)) * exp(-|x - μ| / b)
The log-likelihood for n independent observations is:
log L = -n * log(2b) - (1/b) * Σ |x_i - μ|
Maximizing the log-likelihood with respect to μ is equivalent to minimizing Σ |x_i - μ|, which is exactly the L1 loss. In contrast, L2 loss corresponds to maximum likelihood estimation under Gaussian (normal) errors. The Laplace distribution has heavier tails than the Gaussian, which explains why L1 loss is more tolerant of large residuals.
Minimizing the L1 loss over a set of observations produces the sample median as the optimal constant predictor. More generally, in a regression setting, L1 loss yields the conditional median of the response variable given the predictors.
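A small brute-force check makes this concrete. The sketch below (with arbitrary data containing one outlier) searches over constant predictions and confirms that the MAE is minimized at the sample median while the MSE is minimized at the sample mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier

candidates = np.linspace(0.0, 110.0, 2201)  # grid of constant predictions
mae = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)
mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)

print(candidates[np.argmin(mae)], np.median(y))  # ~3.0  3.0
print(candidates[np.argmin(mse)], np.mean(y))    # ~22.0 22.0
```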
L1 loss is a special case of the quantile loss (also called pinball loss), which is defined as:
L_τ(y, ŷ) = τ * |y - ŷ| if y >= ŷ, or (1 - τ) * |y - ŷ| if y < ŷ
When τ = 0.5, the quantile loss reduces to half the MAE (a constant scaling factor that does not affect the optimal parameters). This generalization allows practitioners to model any conditional quantile of the response distribution, not just the median.
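A direct implementation of the pinball loss (a short sketch, with τ passed as a parameter) makes the relationship to MAE explicit:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    # tau * (y - y_hat) when the target lies above the prediction,
    # (1 - tau) * (y_hat - y) when it lies below; tau = 0.5 gives MAE / 2.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(pinball_loss(y_true, y_pred, tau=0.5))   # 0.25
print(np.abs(y_true - y_pred).mean() / 2)      # 0.25 (half the MAE)
print(pinball_loss(y_true, y_pred, tau=0.9))   # 0.15: under-prediction penalized more
```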
The absolute value function has a sharp corner at zero, making L1 loss non-differentiable when a prediction exactly matches its target. While this property rarely causes issues in practice (continuous predictions almost never exactly match targets), it can lead to slower or oscillating convergence near the optimum compared to the smooth quadratic curvature of L2 loss.
This non-differentiability also has a beneficial side effect in regularization contexts. When L1 is used as a penalty on model weights (L1 regularization), the sharp corner at zero encourages weights to become exactly zero, producing sparse models. This is the mechanism behind the LASSO (Least Absolute Shrinkage and Selection Operator) method.
L1 loss is a convex function. This means that any local minimum is also a global minimum, and gradient-based optimization methods (or subgradient methods) are guaranteed to converge to the optimal solution for convex models such as linear regression. However, convexity of the loss function alone does not guarantee convexity of the overall training objective when the model itself is non-convex, as is the case with neural networks.
The following table summarizes the key differences between L1 loss and several related loss functions commonly used in regression tasks.
| Property | L1 loss (MAE) | L2 loss (MSE) | Huber loss | Smooth L1 loss |
|---|---|---|---|---|
| Formula (per sample) | \|y - ŷ\| | (y - ŷ)² | Piecewise: quadratic for small errors, linear for large errors | Equivalent to Huber(x)/β, with a different parameterization |
| Gradient magnitude | Constant (±1) | Proportional to error | Proportional to error (small), constant (large) | Similar to Huber |
| Differentiable everywhere | No (kink at 0) | Yes | Yes | Yes |
| Robustness to outliers | High | Low | High (tunable via δ) | High (tunable via β) |
| Optimal estimator | Conditional median | Conditional mean | Between median and mean (depends on δ) | Between median and mean (depends on β) |
| Corresponding error distribution | Laplace | Gaussian | N/A | N/A |
| Convergence near optimum | Can oscillate | Smooth, fast | Smooth, fast | Smooth, fast |
| Use in object detection | Less common | Less common | Less common | Very common (e.g., Fast R-CNN) |
The core difference between L1 and L2 loss is the squaring operation. L2 loss squares each residual, which amplifies large errors and shrinks small errors (those less than 1). This has several practical consequences: outliers dominate the L2 objective but affect L1 only linearly; L2 gradients shrink as predictions approach their targets, giving smooth convergence, while L1 gradients keep a constant magnitude; and minimizing L2 estimates the conditional mean, whereas minimizing L1 estimates the conditional median.
Huber loss, introduced by Peter Huber in 1964, combines the advantages of both L1 and L2 loss. It is defined piecewise with a threshold parameter δ:
L_δ(a) = (1/2) * a² if |a| <= δ
L_δ(a) = δ * (|a| - δ/2) if |a| > δ
For errors smaller than δ, the loss behaves like L2 (quadratic), providing smooth gradients and fast convergence. For errors larger than δ, the loss behaves like L1 (linear), providing robustness to outliers. The δ parameter is a hyperparameter that must be chosen by the practitioner, typically through cross-validation.
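For reference, a compact NumPy version of the piecewise definition (illustrative only, with δ exposed as a keyword argument):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it (matches the piecewise formula above).
    a = y_true - y_pred
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) <= delta, quadratic, linear))

residuals = np.array([0.2, 0.5, 3.0])
print(huber_loss(residuals, np.zeros_like(residuals), delta=1.0))
# The small residuals contribute 0.5 * a^2; the residual of 3.0 contributes 1.0 * (3.0 - 0.5).
```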
Smooth L1 loss was introduced by Ross Girshick in the Fast R-CNN paper (2015) for bounding box regression in object detection. It is closely related to Huber loss, and in PyTorch it is parameterized by a value β (beta):
SmoothL1(x) = 0.5 * x² / β if |x| < β
SmoothL1(x) = |x| - 0.5 * β if |x| >= β
As β approaches 0, Smooth L1 loss converges to standard L1 loss. The key advantage of Smooth L1 over plain L1 is differentiability at zero, which allows stable gradient-based optimization without the need for subgradient methods.
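In PyTorch this is available as torch.nn.SmoothL1Loss, whose beta argument plays the role of β. A brief sketch (using the same example values as the PyTorch section below) shows the convergence to plain L1 loss as β shrinks:

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

print(nn.L1Loss()(predictions, targets))                 # tensor(0.3500)
print(nn.SmoothL1Loss(beta=1.0)(predictions, targets))   # smaller: all errors fall in the quadratic zone
print(nn.SmoothL1Loss(beta=0.01)(predictions, targets))  # close to the plain L1 value
```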
L1 loss and L1 regularization are related but distinct concepts. L1 loss measures prediction error using absolute deviations. L1 regularization adds a penalty proportional to the sum of the absolute values of model weights to the training objective.
The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Robert Tibshirani in 1996, combines an L2 loss (least squares) data term with an L1 regularization penalty:
LASSO objective = (1/(2n)) * Σ (y_i - ŷ_i)² + λ * Σ |w_j|
where λ controls the strength of the penalty and w_j are the model weights. The L1 penalty encourages sparsity by driving some weights to exactly zero, effectively performing automatic feature selection. This differs from L2 regularization (Ridge regression), which shrinks weights toward zero but does not set them to exactly zero.
Geometrically, the L1 constraint region forms a diamond (or cross-polytope) in parameter space, while the L2 constraint region forms a sphere. The corners of the diamond lie on the coordinate axes, making it more likely that the optimal solution falls at a corner where one or more parameters are zero.
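The sparsity effect is easy to observe with scikit-learn (assuming it is installed; its alpha argument corresponds to λ above). On synthetic data where only two of ten features matter, the LASSO zeroes out most coefficients while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                       # only the first two features are relevant
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # typically 8: irrelevant features are driven exactly to zero
print(np.sum(ridge.coef_ == 0))  # typically 0: ridge shrinks coefficients but rarely zeroes them
```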
It is also possible to use L1 loss as the data term together with L1 regularization on the weights, producing a model that is robust both to outliers in the response variable and to irrelevant features among the predictors.
Because L1 loss is not differentiable everywhere, specialized optimization techniques are sometimes needed.
The most straightforward approach replaces the gradient with a subgradient at non-differentiable points. Subgradient descent is guaranteed to converge for convex problems, though typically at a slower rate than gradient descent on smooth functions. The convergence rate for Lipschitz-continuous convex functions (which includes L1 loss) is O(1/sqrt(T)) for T iterations with appropriately decreasing step sizes.
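A minimal subgradient-descent sketch on the constant-prediction problem (assuming a 1/√t step-size schedule, one choice of "appropriately decreasing" steps) illustrates the behavior: the iterate drifts toward the median and then oscillates within an ever-shrinking neighborhood of it.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # optimum of mean(|y - c|) is the median, 3.0

c = 0.0
for t in range(1, 5001):
    subgrad = np.mean(np.sign(c - y))       # a subgradient of mean(|y - c|) w.r.t. c
    c -= subgrad / np.sqrt(t)               # decreasing step size

print(c, np.median(y))                      # c ends up near 3.0
```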
Iteratively reweighted least squares (IRLS) approximates the L1 objective by solving a sequence of weighted least squares problems. At each iteration, the weights are set inversely proportional to the current residuals, so points with large residuals receive less weight. This approach leverages the efficient closed-form solution of weighted least squares while converging to the L1 solution.
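A sketch of IRLS for a least-absolute-deviations line fit (illustrative only; the residual floor eps guards against division by zero):

```python
import numpy as np

def irls_l1(X, y, n_iter=50, eps=1e-6):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # start from the least-squares fit
    for _ in range(n_iter):
        residuals = y - X @ beta
        w = 1.0 / np.maximum(np.abs(residuals), eps)       # down-weight points with large residuals
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # weighted least squares
    return beta

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] += 50.0                                               # one gross outlier
X = np.column_stack([np.ones_like(x), x])
print(irls_l1(X, y))                                       # close to [1.0, 2.0] despite the outlier
```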
Minimizing the sum of absolute deviations can be reformulated as a linear programming problem by introducing auxiliary variables. This allows the use of standard LP solvers, including the simplex method and interior-point methods. The Barrodale-Roberts algorithm is a specialized simplex-based method designed specifically for L1 regression.
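The reformulation itself is short. A sketch using SciPy's linprog (assuming SciPy with the HiGHS backend is available) introduces one auxiliary variable t_i per observation, constrained so that t_i >= |y_i - X_i w|, and minimizes the sum of the t_i:

```python
import numpy as np
from scipy.optimize import linprog

def lad_via_lp(X, y):
    n, p = X.shape
    # Decision variables z = [w (p entries), t (n entries)]; minimize sum(t).
    c = np.concatenate([np.zeros(p), np.ones(n)])
    # X w - t <= y  and  -X w - t <= -y  together enforce |y - X w| <= t.
    A_ub = np.block([[X, -np.eye(n)],
                     [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n   # w is free, t is non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] += 50.0                                        # same outlier example as above
X = np.column_stack([np.ones_like(x), x])
print(lad_via_lp(X, y))                             # approximately [1.0, 2.0]
```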
When training neural networks with L1 loss, standard stochastic gradient descent (SGD) or adaptive optimizers such as Adam are used. The subgradient at zero is conventionally set to 0 in automatic differentiation frameworks. In practice, this works well because the probability of hitting the exact non-differentiable point is negligible in floating-point arithmetic.
L1 loss is the standard choice when the training data contains outliers or heavy-tailed error distributions. In fields such as economics, environmental science, and sensor data analysis, measurements often include erroneous readings or extreme values. Using L1 loss prevents these anomalous points from dominating the fitted model.
Research by Zhao et al. (2017) at NVIDIA demonstrated that neural networks trained with L1 loss for image restoration tasks (super-resolution, JPEG artifact removal, demosaicing) produce higher-quality results than those trained with L2 loss, even when evaluated using L2-based metrics like PSNR. The reason is that L2 loss tends to produce blurry outputs by averaging over multiple plausible reconstructions, while L1 loss encourages sharper predictions. Combining L1 loss with perceptually motivated terms such as MS-SSIM can yield even better results.
In object detection frameworks such as Fast R-CNN, Faster R-CNN, and YOLO, bounding box coordinates are typically regressed using Smooth L1 loss rather than plain L1 loss. Smooth L1 provides robustness to large errors (from poorly matched anchor boxes) while maintaining smooth gradients for precise localization when errors are small.
MAE is commonly used as both a loss function and evaluation metric in time series forecasting. Because time series data frequently contains anomalous spikes or drops, L1 loss helps the model focus on the typical pattern rather than fitting extreme events. Many forecasting competitions (such as the M-competitions) report MAE or related metrics as primary evaluation criteria.
L1 loss appears in several generative modeling architectures. In image-to-image translation (e.g., Pix2Pix), L1 loss is combined with an adversarial loss to encourage the generated image to be close to the target while remaining visually realistic. The L1 term prevents mode collapse and provides a strong pixel-level supervision signal.
In compressed sensing, L1 minimization is used to recover sparse signals from a small number of linear measurements. The key theoretical result (by Candes, Romberg, and Tao, 2006) is that under certain conditions on the measurement matrix, the sparsest solution to an underdetermined system of equations can be found by solving an L1 minimization problem, which is convex and computationally tractable.
PyTorch provides torch.nn.L1Loss as a built-in module:
import torch
import torch.nn as nn
# Create loss function
loss_fn = nn.L1Loss(reduction='mean') # Options: 'none', 'mean', 'sum'
# Example usage
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])
loss = loss_fn(predictions, targets)
print(loss)  # tensor(0.3500)
The reduction parameter controls how individual losses are aggregated: 'none' returns the per-element loss, 'mean' (default) returns the average, and 'sum' returns the total.
TensorFlow and Keras offer MAE as both a loss function and a metric:
import tensorflow as tf
# As a loss function
loss_fn = tf.keras.losses.MeanAbsoluteError()
# Example usage
predictions = tf.constant([2.5, 0.0, 2.1, 7.8])
targets = tf.constant([3.0, -0.5, 2.0, 7.5])
loss = loss_fn(targets, predictions)
print(loss)  # approximately 0.35
A minimal implementation of L1 loss in NumPy:
import numpy as np
def l1_loss(y_true, y_pred):
    # Mean absolute error over all samples.
    return np.mean(np.abs(y_true - y_pred))

def l1_loss_gradient(y_true, y_pred):
    # Subgradient of the mean absolute error with respect to y_pred.
    return np.sign(y_pred - y_true) / len(y_true)
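Continuing from these definitions, a quick check with the same values as the PyTorch example above; note that each sample contributes a gradient of constant magnitude 1/n, pointing from the target toward the prediction.

```python
y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

print(l1_loss(y_true, y_pred))           # ~0.35
print(l1_loss_gradient(y_true, y_pred))  # [-0.25  0.25  0.25  0.25]
```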
L1 loss is a good default choice in the following situations:
- The training data contains outliers or the errors have a heavy-tailed distribution.
- The conditional median is a more appropriate summary of the target than the conditional mean.
- Sharp, non-blurry outputs are desired in image restoration or generation tasks.
- The evaluation metric for the task is itself MAE, as in many time series forecasting benchmarks.
L2 loss may be preferable when:
- The errors are approximately Gaussian and the data contains few or no outliers.
- Smooth gradients and fast, stable convergence near the optimum are a priority.
- The conditional mean is the quantity of interest and large errors should be penalized heavily.
Huber loss or Smooth L1 loss should be considered when:
- Robustness to outliers and smooth gradients near zero error are both required.
- Regressing bounding box coordinates in object detection, following Fast R-CNN and related frameworks.
- A tunable trade-off between L1 and L2 behavior (via δ or β) is useful.
| Criterion | L1 loss (MAE) | L2 loss (MSE) |
|---|---|---|
| Penalizes errors | Linearly | Quadratically |
| Effect of outliers | Moderate (linear) | Severe (quadratic) |
| Gradient at large errors | Constant magnitude | Large magnitude |
| Gradient near zero error | Constant magnitude | Near zero |
| Differentiable everywhere | No | Yes |
| Statistical estimator | Median | Mean |
| Probabilistic model | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May have multiple | Unique (full rank) |
| Convergence speed | Slower near minimum | Faster near minimum |
| Produces sparse solutions (as regularizer) | Yes | No |