L1 loss, also called mean absolute error (MAE) or least absolute deviations (LAD), is a loss function that measures the average of the absolute differences between predicted values and actual target values. It is one of the most widely used loss functions in regression tasks across machine learning and statistics. Unlike the L2 loss (mean squared error), which squares error terms and disproportionately penalizes large deviations, L1 loss applies a linear penalty to all errors regardless of magnitude. This property makes it more robust to outliers in training data.
L1 loss has deep historical roots, predating even the method of least squares. It connects to the Laplace distribution through maximum likelihood estimation, generalizes naturally to quantile regression, and plays a direct role in regularization techniques such as LASSO. In modern deep learning, L1 loss and its smooth variants appear in applications ranging from bounding box regression in object detection to image super-resolution and denoising.
Imagine you are playing a guessing game. Your friend hides some number of marbles in a box, and you try to guess how many there are. After each guess, you find out how far off you were. If the box had 10 marbles and you guessed 7, you were off by 3. If you guessed 13, you were also off by 3.
L1 loss is like adding up all those "how far off" numbers. It does not care whether you guessed too high or too low. It just looks at the distance between your guess and the real answer. The goal is to make that total distance as small as possible.
What makes L1 loss special is that one really bad guess does not ruin your score too much. If you are usually close but one time you guess wildly wrong, L1 loss treats that big mistake more gently than other scoring methods would. It is like a fair teacher who does not let one bad test destroy your whole grade.
The method of least absolute deviations was first proposed by Roger Joseph Boscovich in 1757, nearly fifty years before the method of least squares was introduced by Adrien-Marie Legendre in 1805. Boscovich developed the technique to reconcile inconsistencies in astronomical measurements of the shape of the Earth. Pierre-Simon Laplace further refined the approach in 1788, using a symmetric two-sided exponential distribution (now known as the Laplace distribution) to model measurement errors, and derived the sum of absolute deviations as the natural error measure under that assumption.
Despite its earlier origin, least absolute deviations saw limited adoption compared to least squares throughout much of the 19th and 20th centuries. The primary reason was computational: least squares has a closed-form analytical solution (the normal equations), while minimizing the sum of absolute deviations requires iterative numerical methods. With the advent of linear programming algorithms and modern computing, L1 loss became practical for large-scale problems. In 1978, Roger Koenker and Gilbert Bassett formalized the connection between LAD and quantile regression, showing that minimizing absolute deviations is equivalent to estimating the conditional median (the 50th percentile) of the response variable.
For a dataset of n observations, let y_i denote the true value and ŷ_i (y-hat) the predicted value for the i-th sample. The L1 loss is defined as:
Element-wise absolute error:
l_i = |y_i - ŷ_i|
Mean absolute error (MAE):
L1(y, ŷ) = (1/n) * Σ |y_i - ŷ_i|
where the summation runs from i = 1 to n.
Some formulations omit the 1/n factor and instead define L1 loss as the sum of absolute errors (SAE). In the context of gradient descent optimization, the distinction between sum and mean only affects the effective learning rate and does not change the location of the minimum.
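As a quick sanity check, the following NumPy sketch (with made-up example values) computes the element-wise absolute errors, the MAE, and the SAE; the two aggregates differ only by the constant factor 1/n.

```python
import numpy as np

# Illustrative values only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

abs_errors = np.abs(y_true - y_pred)  # element-wise |y_i - y_hat_i|
mae = abs_errors.mean()               # mean absolute error (with the 1/n factor)
sae = abs_errors.sum()                # sum of absolute errors (without 1/n)

print(abs_errors)  # [0.5 0.5 0.  1. ]
print(mae)         # 0.5
print(sae)         # 2.0
```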
The derivative of the absolute value function |x| is the sign function: +1 when x > 0, and -1 when x < 0. At x = 0, the function has a "kink" and is not differentiable in the classical sense. However, the concept of a subgradient extends the notion of derivative to this point. At x = 0, any value in the interval [-1, 1] is a valid subgradient.
For the L1 loss with respect to the predicted value ŷ_i:
dL1/dŷ_i = -sign(y_i - ŷ_i) = { -1 if ŷ_i < y_i; +1 if ŷ_i > y_i; any value in [-1, 1] if ŷ_i = y_i }
In practice, deep learning frameworks such as PyTorch and TensorFlow handle the non-differentiable point by assigning a subgradient of 0 at x = 0. Because the probability of any individual prediction exactly equaling the target is essentially zero in continuous-valued problems, this convention has negligible impact on training.
An important consequence of this gradient structure is that the magnitude of the gradient is constant (always 1 or -1) regardless of how large the error is. This stands in contrast to L2 loss, where the gradient magnitude scales linearly with the error. The constant gradient magnitude means that L1 loss does not accelerate updates for large errors and does not slow down updates for small errors.
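This behavior is easy to verify with automatic differentiation. The snippet below is a minimal sketch using PyTorch's functional l1_loss: the per-sample gradient has magnitude 1 whether the error is 0.1 or 100, and is 0 at an exact match (the subgradient convention mentioned above).

```python
import torch
import torch.nn.functional as F

targets = torch.zeros(4)
predictions = torch.tensor([0.0, 0.1, 10.0, -100.0], requires_grad=True)

# Sum reduction so each sample's gradient is visible without the 1/n scaling.
loss = F.l1_loss(predictions, targets, reduction='sum')
loss.backward()

print(predictions.grad)  # tensor([ 0.,  1.,  1., -1.])
```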
L1 loss is more robust to outliers than L2 loss. The reason is straightforward: L2 loss squares the residual, so a single data point with a large error contributes a disproportionately large value to the total loss. L1 loss only takes the absolute value, so outliers have a linear rather than quadratic effect.
As a concrete example, consider five data points with residuals [1, 2, 1, 3, 2]. The sum of absolute errors is 9, and the sum of squared errors is 19. Now suppose the point with residual 3 is instead an outlier with residual 30: the sum of absolute errors becomes 36 (a 4x increase), while the sum of squared errors becomes 910 (a 48x increase). The L2 loss is dominated by the single outlier, but the L1 loss is not.
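The same numbers can be reproduced in a few lines of NumPy (a throwaway illustration, not library code):

```python
import numpy as np

residuals = np.array([1.0, 2.0, 1.0, 3.0, 2.0])
print(np.abs(residuals).sum())  # 9.0
print((residuals ** 2).sum())   # 19.0

# Replace the residual of 3 with an outlier of 30.
residuals[3] = 30.0
print(np.abs(residuals).sum())  # 36.0  (about a 4x increase)
print((residuals ** 2).sum())   # 910.0 (about a 48x increase)
```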
From a statistical perspective, this robustness arises because minimizing L1 loss yields the conditional median of the response variable, while minimizing L2 loss yields the conditional mean. The median is a more robust measure of central tendency than the mean because it is less affected by extreme values.
Minimizing L1 loss is equivalent to performing maximum likelihood estimation under the assumption that the errors follow a Laplace distribution. The Laplace distribution has the probability density function:
f(x | μ, b) = (1/(2b)) * exp(-|x - μ| / b)
The log-likelihood for n independent observations is:
log L = -n * log(2b) - (1/b) * Σ |x_i - μ|
Maximizing the log-likelihood with respect to μ is equivalent to minimizing Σ |x_i - μ|, which is exactly the L1 loss. In contrast, L2 loss corresponds to maximum likelihood estimation under Gaussian (normal) errors. The Laplace distribution has heavier tails than the Gaussian, which explains why L1 loss is more tolerant of large residuals.
Minimizing the L1 loss over a set of observations produces the sample median as the optimal constant predictor. More generally, in a regression setting, L1 loss yields the conditional median of the response variable given the predictors.
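A small brute-force check makes this concrete. The sketch below (with arbitrary data containing one outlier) searches over constant predictions and confirms that the MAE is minimized at the sample median while the MSE is minimized at the sample mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier

candidates = np.linspace(0.0, 110.0, 2201)  # grid of constant predictions
mae = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)
mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)

print(candidates[np.argmin(mae)], np.median(y))  # ~3.0  3.0
print(candidates[np.argmin(mse)], np.mean(y))    # ~22.0 22.0
```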
L1 loss is a special case of the quantile loss (also called pinball loss), which is defined as:
L_τ(y, ŷ) = τ * |y - ŷ| if y >= ŷ, or (1 - τ) * |y - ŷ| if y < ŷ
When τ = 0.5, the quantile loss reduces to half the MAE (a constant scaling factor that does not affect the optimal parameters). This generalization allows practitioners to model any conditional quantile of the response distribution, not just the median.
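A direct implementation of the pinball loss (a short sketch, with τ passed as a parameter) makes the relationship to MAE explicit:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    # tau * (y - y_hat) when the target lies above the prediction,
    # (1 - tau) * (y_hat - y) when it lies below; tau = 0.5 gives MAE / 2.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(pinball_loss(y_true, y_pred, tau=0.5))   # 0.25
print(np.abs(y_true - y_pred).mean() / 2)      # 0.25 (half the MAE)
print(pinball_loss(y_true, y_pred, tau=0.9))   # 0.15: under-prediction penalized more
```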
The absolute value function has a sharp corner at zero, making L1 loss non-differentiable when a prediction exactly matches its target. While this property rarely causes issues in practice (continuous predictions almost never exactly match targets), it can lead to slower or oscillating convergence near the optimum compared to the smooth quadratic curvature of L2 loss.
This non-differentiability also has a beneficial side effect in regularization contexts. When L1 is used as a penalty on model weights (L1 regularization), the sharp corner at zero encourages weights to become exactly zero, producing sparse models. This is the mechanism behind the LASSO (Least Absolute Shrinkage and Selection Operator) method.
L1 loss is a convex function. This means that any local minimum is also a global minimum, and gradient-based optimization methods (or subgradient methods) are guaranteed to converge to the optimal solution for convex models such as linear regression. However, convexity of the loss function alone does not guarantee convexity of the overall training objective when the model itself is non-convex, as is the case with neural networks.
The following table summarizes the key differences between L1 loss and several related loss functions commonly used in regression tasks.
| Property | L1 loss (MAE) | L2 loss (MSE) | Huber loss | Smooth L1 loss |
|---|---|---|---|---|
| Formula (per sample) | \|y - ŷ\| | (y - ŷ)² | Piecewise: quadratic for small errors, linear for large errors | Equivalent to Huber(x)/β, with a different parameterization |
| Gradient magnitude | Constant (±1) | Proportional to error | Proportional to error (small), constant (large) | Similar to Huber |
| Differentiable everywhere | No (kink at 0) | Yes | Yes | Yes |
| Robustness to outliers | High | Low | High (tunable via δ) | High (tunable via β) |
| Optimal estimator | Conditional median | Conditional mean | Between median and mean (depends on δ) | Between median and mean (depends on β) |
| Corresponding error distribution | Laplace | Gaussian | N/A | N/A |
| Convergence near optimum | Can oscillate | Smooth, fast | Smooth, fast | Smooth, fast |
| Use in object detection | Less common | Less common | Less common | Very common (e.g., Fast R-CNN) |
The core difference between L1 and L2 loss is the squaring operation. L2 loss squares each residual, which amplifies large errors and shrinks small errors (those less than 1). This has several practical consequences: outliers dominate the L2 objective but affect L1 only linearly; L2 gradients shrink as predictions approach their targets, giving smooth convergence, while L1 gradients keep a constant magnitude; and minimizing L2 estimates the conditional mean, whereas minimizing L1 estimates the conditional median.
Huber loss, introduced by Peter Huber in 1964, combines the advantages of both L1 and L2 loss. It is defined piecewise with a threshold parameter δ:
L_δ(a) = (1/2) * a² if |a| <= δ
L_δ(a) = δ * (|a| - δ/2) if |a| > δ
For errors smaller than δ, the loss behaves like L2 (quadratic), providing smooth gradients and fast convergence. For errors larger than δ, the loss behaves like L1 (linear), providing robustness to outliers. The δ parameter is a hyperparameter that must be chosen by the practitioner, typically through cross-validation.
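For reference, a compact NumPy version of the piecewise definition (illustrative only, with δ exposed as a keyword argument):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it (matches the piecewise formula above).
    a = y_true - y_pred
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) <= delta, quadratic, linear))

residuals = np.array([0.2, 0.5, 3.0])
print(huber_loss(residuals, np.zeros_like(residuals), delta=1.0))
# The small residuals contribute 0.5 * a^2; the residual of 3.0 contributes 1.0 * (3.0 - 0.5).
```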
Smooth L1 loss was introduced by Ross Girshick in the Fast R-CNN paper (2015) for bounding box regression in object detection. It is closely related to Huber loss, and in PyTorch it is parameterized by a value β (beta):
SmoothL1(x) = 0.5 * x² / β if |x| < β
SmoothL1(x) = |x| - 0.5 * β if |x| >= β
As β approaches 0, Smooth L1 loss converges to standard L1 loss. The key advantage of Smooth L1 over plain L1 is differentiability at zero, which allows stable gradient-based optimization without the need for subgradient methods.
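In PyTorch this is available as torch.nn.SmoothL1Loss, whose beta argument plays the role of β. A brief sketch (using the same example values as the PyTorch section below) shows the convergence to plain L1 loss as β shrinks:

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

print(nn.L1Loss()(predictions, targets))                 # tensor(0.3500)
print(nn.SmoothL1Loss(beta=1.0)(predictions, targets))   # smaller: all errors fall in the quadratic zone
print(nn.SmoothL1Loss(beta=0.01)(predictions, targets))  # close to the plain L1 value
```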
L1 loss and L1 regularization are related but distinct concepts. L1 loss measures prediction error using absolute deviations. L1 regularization adds a penalty proportional to the sum of the absolute values of model weights to the training objective.
The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Robert Tibshirani in 1996, combines an L2 loss (least squares) data term with an L1 regularization penalty:
LASSO objective = (1/(2n)) * Σ (y_i - ŷ_i)² + λ * Σ |w_j|
where λ controls the strength of the penalty and w_j are the model weights. The L1 penalty encourages sparsity by driving some weights to exactly zero, effectively performing automatic feature selection. This differs from L2 regularization (Ridge regression), which shrinks weights toward zero but does not set them to exactly zero.
Geometrically, the L1 constraint region forms a diamond (or cross-polytope) in parameter space, while the L2 constraint region forms a sphere. The corners of the diamond lie on the coordinate axes, making it more likely that the optimal solution falls at a corner where one or more parameters are zero.
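The sparsity effect is easy to observe with scikit-learn (assuming it is installed; its alpha argument corresponds to λ above). On synthetic data where only two of ten features matter, the LASSO zeroes out most coefficients while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                       # only the first two features are relevant
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ == 0))  # typically 8: irrelevant features are driven exactly to zero
print(np.sum(ridge.coef_ == 0))  # typically 0: ridge shrinks coefficients but rarely zeroes them
```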
It is also possible to use L1 loss as the data term together with L1 regularization on the weights, producing a model that is robust both to outliers in the response variable and to irrelevant features among the predictors.
Because L1 loss is not differentiable everywhere, specialized optimization techniques are sometimes needed.
The most straightforward approach replaces the gradient with a subgradient at non-differentiable points. Subgradient descent is guaranteed to converge for convex problems, though typically at a slower rate than gradient descent on smooth functions. The convergence rate for Lipschitz-continuous convex functions (which includes L1 loss) is O(1/sqrt(T)) for T iterations with appropriately decreasing step sizes.
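A minimal subgradient-descent sketch on the constant-prediction problem (assuming a 1/√t step-size schedule, one choice of "appropriately decreasing" steps) illustrates the behavior: the iterate drifts toward the median and then oscillates within an ever-shrinking neighborhood of it.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # optimum of mean(|y - c|) is the median, 3.0

c = 0.0
for t in range(1, 5001):
    subgrad = np.mean(np.sign(c - y))       # a subgradient of mean(|y - c|) w.r.t. c
    c -= subgrad / np.sqrt(t)               # decreasing step size

print(c, np.median(y))                      # c ends up near 3.0
```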
Iteratively reweighted least squares (IRLS) approximates the L1 objective by solving a sequence of weighted least squares problems. At each iteration, the weights are set inversely proportional to the current residuals, so points with large residuals receive less weight. This approach leverages the efficient closed-form solution of weighted least squares while converging to the L1 solution.
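A sketch of IRLS for a least-absolute-deviations line fit (illustrative only; the residual floor eps guards against division by zero):

```python
import numpy as np

def irls_l1(X, y, n_iter=50, eps=1e-6):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # start from the least-squares fit
    for _ in range(n_iter):
        residuals = y - X @ beta
        w = 1.0 / np.maximum(np.abs(residuals), eps)       # down-weight points with large residuals
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # weighted least squares
    return beta

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] += 50.0                                               # one gross outlier
X = np.column_stack([np.ones_like(x), x])
print(irls_l1(X, y))                                       # close to [1.0, 2.0] despite the outlier
```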
Minimizing the sum of absolute deviations can be reformulated as a linear programming problem by introducing auxiliary variables. This allows the use of standard LP solvers, including the simplex method and interior-point methods. The Barrodale-Roberts algorithm is a specialized simplex-based method designed specifically for L1 regression.
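The reformulation itself is short. A sketch using SciPy's linprog (assuming SciPy with the HiGHS backend is available) introduces one auxiliary variable t_i per observation, constrained so that t_i >= |y_i - X_i w|, and minimizes the sum of the t_i:

```python
import numpy as np
from scipy.optimize import linprog

def lad_via_lp(X, y):
    n, p = X.shape
    # Decision variables z = [w (p entries), t (n entries)]; minimize sum(t).
    c = np.concatenate([np.zeros(p), np.ones(n)])
    # X w - t <= y  and  -X w - t <= -y  together enforce |y - X w| <= t.
    A_ub = np.block([[X, -np.eye(n)],
                     [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n   # w is free, t is non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[7] += 50.0                                        # same outlier example as above
X = np.column_stack([np.ones_like(x), x])
print(lad_via_lp(X, y))                             # approximately [1.0, 2.0]
```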
When training neural networks with L1 loss, standard stochastic gradient descent (SGD) or adaptive optimizers such as Adam are used. The subgradient at zero is conventionally set to 0 in automatic differentiation frameworks. In practice, this works well because the probability of hitting the exact non-differentiable point is negligible in floating-point arithmetic.
L1 loss is the standard choice when the training data contains outliers or heavy-tailed error distributions. In fields such as economics, environmental science, and sensor data analysis, measurements often include erroneous readings or extreme values. Using L1 loss prevents these anomalous points from dominating the fitted model.
Research by Zhao et al. (2017) at NVIDIA demonstrated that neural networks trained with L1 loss for image restoration tasks (super-resolution, JPEG artifact removal, demosaicing) produce higher-quality results than those trained with L2 loss, even when evaluated using L2-based metrics like PSNR. The reason is that L2 loss tends to produce blurry outputs by averaging over multiple plausible reconstructions, while L1 loss encourages sharper predictions. Combining L1 loss with perceptually motivated terms such as MS-SSIM can yield even better results.
In object detection frameworks such as Fast R-CNN, Faster R-CNN, and YOLO, bounding box coordinates are typically regressed using Smooth L1 loss rather than plain L1 loss. Smooth L1 provides robustness to large errors (from poorly matched anchor boxes) while maintaining smooth gradients for precise localization when errors are small.
MAE is commonly used as both a loss function and evaluation metric in time series forecasting. Because time series data frequently contains anomalous spikes or drops, L1 loss helps the model focus on the typical pattern rather than fitting extreme events. Many forecasting competitions (such as the M-competitions) report MAE or related metrics as primary evaluation criteria.
L1 loss appears in several generative modeling architectures. In image-to-image translation (e.g., Pix2Pix), L1 loss is combined with an adversarial loss to encourage the generated image to be close to the target while remaining visually realistic. The L1 term prevents mode collapse and provides a strong pixel-level supervision signal.
In compressed sensing, L1 minimization is used to recover sparse signals from a small number of linear measurements. The key theoretical result (by Candes, Romberg, and Tao, 2006) is that under certain conditions on the measurement matrix, the sparsest solution to an underdetermined system of equations can be found by solving an L1 minimization problem, which is convex and computationally tractable.
PyTorch provides torch.nn.L1Loss as a built-in module:
import torch
import torch.nn as nn
# Create loss function
loss_fn = nn.L1Loss(reduction='mean') # Options: 'none', 'mean', 'sum'
# Example usage
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])
loss = loss_fn(predictions, targets)
print(loss)  # tensor(0.3500)
The reduction parameter controls how individual losses are aggregated: 'none' returns the per-element loss, 'mean' (default) returns the average, and 'sum' returns the total.
TensorFlow and Keras offer MAE as both a loss function and a metric:
import tensorflow as tf
# As a loss function
loss_fn = tf.keras.losses.MeanAbsoluteError()
# Example usage
predictions = tf.constant([2.5, 0.0, 2.1, 7.8])
targets = tf.constant([3.0, -0.5, 2.0, 7.5])
loss = loss_fn(targets, predictions)
print(loss)  # approximately 0.35
A minimal implementation of L1 loss in NumPy:
import numpy as np
def l1_loss(y_true, y_pred):
    # Mean absolute error over all samples.
    return np.mean(np.abs(y_true - y_pred))

def l1_loss_gradient(y_true, y_pred):
    # Subgradient of the mean absolute error with respect to y_pred.
    return np.sign(y_pred - y_true) / len(y_true)
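Continuing from these definitions, a quick check with the same values as the PyTorch example above; note that each sample contributes a gradient of constant magnitude 1/n, pointing from the target toward the prediction.

```python
y_true = np.array([3.0, -0.5, 2.0, 7.5])
y_pred = np.array([2.5, 0.0, 2.1, 7.8])

print(l1_loss(y_true, y_pred))           # ~0.35
print(l1_loss_gradient(y_true, y_pred))  # [-0.25  0.25  0.25  0.25]
```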
L1 loss is a good default choice in the following situations:
- The training data contains outliers or the errors have a heavy-tailed distribution.
- The conditional median is a more appropriate summary of the target than the conditional mean.
- Sharp, non-blurry outputs are desired in image restoration or generation tasks.
- The evaluation metric for the task is itself MAE, as in many time series forecasting benchmarks.
L2 loss may be preferable when:
- The errors are approximately Gaussian and the data contains few or no outliers.
- Smooth gradients and fast, stable convergence near the optimum are a priority.
- The conditional mean is the quantity of interest and large errors should be penalized heavily.
Huber loss or Smooth L1 loss should be considered when:
- Robustness to outliers and smooth gradients near zero error are both required.
- Regressing bounding box coordinates in object detection, following Fast R-CNN and related frameworks.
- A tunable trade-off between L1 and L2 behavior (via δ or β) is useful.
| Criterion | L1 loss (MAE) | L2 loss (MSE) |
|---|---|---|
| Penalizes errors | Linearly | Quadratically |
| Effect of outliers | Moderate (linear) | Severe (quadratic) |
| Gradient at large errors | Constant magnitude | Large magnitude |
| Gradient near zero error | Constant magnitude | Near zero |
| Differentiable everywhere | No | Yes |
| Statistical estimator | Median | Mean |
| Probabilistic model | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May have multiple | Unique (full rank) |
| Convergence speed | Slower near minimum | Faster near minimum |
| Produces sparse solutions (as regularizer) | Yes | No |