# L1 Loss

> Source: https://aiwiki.ai/wiki/l1_loss
> Updated: 2026-07-12
> Categories: Machine Learning, Statistics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**L1 loss** is a [regression](/wiki/regression) [loss function](/wiki/loss_function) equal to the average of the absolute differences between predicted values and target values, written as $$\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$$. It is also called the **mean absolute error** (MAE) or **least absolute deviations** (LAD). Because the penalty grows linearly with the size of an error rather than quadratically, L1 loss is more robust to outliers than the [L2 loss](/wiki/l2_loss) (mean squared error), its gradient has constant magnitude (the sign of the residual), it is non-differentiable at zero, and its minimizer is the conditional median of the target rather than the mean [1][4][7]. The PyTorch reference describes its `torch.nn.L1Loss` module as a "criterion that measures the mean absolute error (MAE) between each element in the input x and target y" [13].

L1 loss is one of the most widely used loss functions in regression across [machine learning](/wiki/machine_learning) and statistics. It has deep historical roots, predating even the method of least squares: Roger Joseph Boscovich proposed least absolute deviations in 1757, about fifty years before Adrien-Marie Legendre published least squares in 1805 [1]. It connects to the Laplace distribution through maximum likelihood estimation, generalizes naturally to quantile regression [4], and underlies sparsity-inducing [regularization](/wiki/regularization) such as LASSO [3]. In modern [deep learning](/wiki/deep_learning), L1 loss and its smooth variants appear in applications ranging from bounding box regression in [object detection](/wiki/object_detection) [6] to image super-resolution and denoising [7].

## Explain like I'm 5 (ELI5)

Imagine you are playing a guessing game. Your friend hides some number of marbles in a box, and you try to guess how many there are. After each guess, you find out how far off you were. If the box had 10 marbles and you guessed 7, you were off by 3. If you guessed 13, you were also off by 3.

L1 loss is like adding up all those "how far off" numbers. It does not care whether you guessed too high or too low. It just looks at the distance between your guess and the real answer. The goal is to make that total distance as small as possible.

What makes L1 loss special is that one really bad guess does not ruin your score too much. If you are usually close but one time you guess wildly wrong, L1 loss treats that big mistake more gently than other scoring methods would. It is like a fair teacher who does not let one bad test destroy your whole grade.

## What is L1 loss?

L1 loss measures prediction error as the (mean) absolute difference between predictions and targets. For a single sample, the loss is the absolute residual $$\lvert y - \hat{y} \rvert$$; over a dataset, the standard reduction is the mean of those absolute residuals, which is exactly the mean absolute error. The name "L1" comes from the L1 norm (the sum of absolute values) of the residual vector.

Three facts capture why L1 loss behaves the way it does. First, the penalty it assigns to an error is proportional to the size of that error (linear), not the square of it. Second, its derivative with respect to a prediction is just the sign of the residual, so the gradient magnitude is a constant 1 and does not grow with the error. Third, the value that minimizes the sum of absolute deviations is the median of the data, not the mean. Each of these is developed in the sections below, and each is the source of a practical tradeoff against [L2 loss](/wiki/l2_loss).

## Historical background

The method of least absolute deviations was first proposed by Roger Joseph Boscovich in 1757, nearly fifty years before the method of least squares was introduced by Adrien-Marie Legendre in 1805 [1]. Boscovich developed the technique to reconcile inconsistencies in astronomical measurements of the shape of the Earth. Pierre-Simon Laplace further refined the approach in 1788 (and again in later work), using a symmetric two-sided exponential distribution (now known as the Laplace distribution) to model measurement errors, and derived the sum of absolute deviations as the natural error measure under that assumption [2].

Despite its earlier origin, least absolute deviations saw limited adoption compared to least squares throughout much of the 19th and 20th centuries. The primary reason was computational: least squares has a closed-form analytical solution (the normal equations), while minimizing the sum of absolute deviations requires iterative numerical methods. With the advent of linear programming algorithms and modern computing, L1 loss became practical for large-scale problems. In 1978, Roger Koenker and Gilbert Bassett formalized the connection between LAD and quantile regression, showing that minimizing absolute deviations is equivalent to estimating the conditional median (the 50th percentile) of the response variable and that the LAD estimator is "an important special case" of their broader class of regression quantiles [4].

## Mathematical definition

For a dataset of *n* observations, let $$y_i$$ denote the true value and $$\hat{y}_i$$ (y-hat) the predicted value for the $$i$$-th sample. The L1 loss is defined as:

**Element-wise absolute error:**

$$
l_i = \lvert y_i - \hat{y}_i \rvert
$$

**Mean absolute error (MAE):**

$$
\text{L1}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

where the summation runs from $$i = 1$$ to $$n$$.

Some formulations omit the 1/n factor and instead define L1 loss as the sum of absolute errors (SAE). In the context of [gradient descent](/wiki/gradient_descent) optimization, the distinction between sum and mean only affects the effective [learning rate](/wiki/learning_rate) and does not change the location of the minimum.

### How is the gradient of L1 loss computed?

The derivative of the absolute value function $$\lvert x \rvert$$ is the sign function: $$+1$$ when $$x > 0$$, and $$-1$$ when $$x < 0$$. At $$x = 0$$, the function has a "kink" and is not differentiable in the classical sense. However, the concept of a **subgradient** extends the notion of derivative to this point. At $$x = 0$$, any value in the interval $$[-1, 1]$$ is a valid subgradient [12].

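For the per-sample loss $$l_i = \lvert y_i - \hat{y}_i \rvert$$, the (sub)derivative with respect to the predicted value $$\hat{y}_i$$ is given below; under the mean reduction, each of these derivatives is additionally scaled by $$1/n$$:

$$
\frac{\partial l_i}{\partial \hat{y}_i} = -\operatorname{sign}(y_i - \hat{y}_i) = \begin{cases} -1 & \text{if } \hat{y}_i < y_i \\ +1 & \text{if } \hat{y}_i > y_i \\ \text{any value in } [-1, 1] & \text{if } \hat{y}_i = y_i \end{cases}
$$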

In practice, deep learning frameworks such as [PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow) handle the non-differentiable point by assigning a subgradient of 0 at x = 0. Because the probability of any individual prediction exactly equaling the target is essentially zero in continuous-valued problems, this convention has negligible impact on training.

An important consequence of this gradient structure is that the magnitude of the per-sample gradient is constant: away from the kink, the derivative is always +1 or -1, regardless of how large the error is. This stands in contrast to L2 loss, where the gradient magnitude scales linearly with the error. The constant gradient magnitude means that L1 loss does not accelerate updates for large errors and does not slow down updates for small errors.
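The contrast between the two gradients can be illustrated numerically. The following is a minimal NumPy sketch; the helper names `l1_grad` and `l2_grad` are hypothetical, and it assumes the per-sample losses $$\lvert y - \hat{y} \rvert$$ and $$(y - \hat{y})^2$$ with the subgradient at the kink chosen as 0.

```python
import numpy as np

def l1_grad(y_true, y_pred):
    """Per-sample (sub)gradient of |y_true - y_pred| with respect to y_pred."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    # np.sign returns 0 at exactly zero, i.e. the subgradient 0 at the kink.
    return np.sign(diff)

def l2_grad(y_true, y_pred):
    """Per-sample gradient of (y_true - y_pred)**2 with respect to y_pred."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return 2.0 * diff

y_true = np.zeros(4)
y_pred = np.array([-100.0, -0.01, 0.0, 5.0])

g1 = l1_grad(y_true, y_pred)  # values: -1, -1, 0, 1 (same size for tiny and huge errors)
g2 = l2_grad(y_true, y_pred)  # values: -200, -0.02, 0, 10 (scales with the error)
print(g1)
print(g2)
```

The L1 gradient carries only the direction of the error, while the L2 gradient also carries its size.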

## Properties

### Why is L1 loss robust to outliers?

L1 loss is more robust to outliers than L2 loss. The reason is straightforward: L2 loss squares the residual, so a single data point with a large error contributes a disproportionately large value to the total loss. L1 loss only takes the absolute value, so outliers have a linear rather than quadratic effect.

As a concrete example, consider five data points with residuals [1, 2, 1, 3, 2]. The sum of absolute errors is 9, and the sum of squared errors is 19. Now suppose the point with residual 3 is instead an outlier with residual 30: the sum of absolute errors becomes 36 (a 4x increase), while the sum of squared errors becomes 910 (roughly a 48x increase). The L2 loss is dominated by the single outlier, whereas the L1 loss grows only in proportion to it.
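This arithmetic can be checked directly. The sketch below uses plain Python and assumes that the outlier replaces the point whose residual was 3; the function names `sae` and `sse` are illustrative only.

```python
def sae(residuals):
    """Sum of absolute errors."""
    return sum(abs(r) for r in residuals)

def sse(residuals):
    """Sum of squared errors."""
    return sum(r * r for r in residuals)

clean = [1, 2, 1, 3, 2]
with_outlier = [1, 2, 1, 30, 2]  # assumption: the residual 3 becomes 30

print(sae(clean), sse(clean))                    # 9 19
print(sae(with_outlier), sse(with_outlier))      # 36 910
print(sae(with_outlier) / sae(clean))            # 4.0
print(round(sse(with_outlier) / sse(clean), 1))  # 47.9
```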

From a statistical perspective, this robustness arises because minimizing L1 loss yields the conditional median of the response variable, while minimizing L2 loss yields the conditional mean [4]. The median is a more robust measure of central tendency than the mean because it is less affected by extreme values. This is the same intuition behind Peter Huber's foundational work on robust estimation [5].

### Connection to the Laplace distribution

Minimizing L1 loss is equivalent to performing maximum likelihood estimation under the assumption that the errors follow a Laplace distribution (the double exponential distribution Laplace used in his 1788 error analysis) [2]. The Laplace distribution has the probability density function:

$$
f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{\lvert x - \mu \rvert}{b}\right)
$$

The log-likelihood for n independent observations is:

$$
\log L = -n \log(2b) - \frac{1}{b} \sum \lvert x_i - \mu \rvert
$$

Maximizing the log-likelihood with respect to $$\mu$$ is equivalent to minimizing $$\sum \lvert x_i - \mu \rvert$$, which is exactly the L1 loss. In contrast, L2 loss corresponds to maximum likelihood estimation under Gaussian (normal) errors. The Laplace distribution has heavier tails than the Gaussian, which explains why L1 loss is more tolerant of large residuals.

### Connection to median regression and quantile regression

Minimizing the L1 loss over a set of observations produces the sample median as the optimal constant predictor. More generally, in a regression setting, L1 loss yields the conditional median of the response variable given the predictors [4].
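A brute-force check makes the median property concrete. The sketch below is illustrative only: it assumes a small made-up dataset with one outlier and searches a grid of constant predictions for the minimizer of each objective.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with one large outlier

# Evaluate both objectives for a grid of candidate constant predictions c.
candidates = np.linspace(0.0, 100.0, 10001)  # grid step of 0.01
diffs = data[None, :] - candidates[:, None]
abs_objective = np.abs(diffs).sum(axis=1)
sq_objective = (diffs ** 2).sum(axis=1)

best_l1 = candidates[abs_objective.argmin()]
best_l2 = candidates[sq_objective.argmin()]

print(best_l1, np.median(data))  # both close to 3.0
print(best_l2, np.mean(data))    # both close to 22.0
```

The outlier drags the squared-error minimizer far from the bulk of the data, while the absolute-error minimizer stays at the middle observation.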

L1 loss is a special case of the quantile loss (also called pinball loss), which is defined as:

$$
L_\tau(y, \hat{y}) = \begin{cases} \tau \lvert y - \hat{y} \rvert & \text{if } y \ge \hat{y} \\ (1 - \tau) \lvert y - \hat{y} \rvert & \text{if } y < \hat{y} \end{cases}
$$

When $$\tau = 0.5$$, the quantile loss reduces to half the MAE (a constant scaling factor that does not affect the optimal parameters). This generalization, formalized by Koenker and Bassett, allows practitioners to model any conditional quantile of the response distribution, not just the median [4].
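The piecewise definition translates almost line for line into code. The following is a minimal single-sample sketch; the function name `pinball_loss` and the example numbers are chosen here for illustration.

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for a single sample, with 0 < tau < 1."""
    diff = y_true - y_pred
    if diff >= 0:
        return tau * diff          # under-prediction, weighted by tau
    return (tau - 1.0) * diff      # over-prediction, weighted by (1 - tau)

# With tau = 0.5 the loss is half the absolute error, on either side.
print(pinball_loss(10.0, 7.0, 0.5))   # 1.5
print(pinball_loss(10.0, 13.0, 0.5))  # 1.5

# With tau = 0.9, under-prediction costs about nine times more than over-prediction.
print(pinball_loss(10.0, 7.0, 0.9))   # about 2.7
print(pinball_loss(10.0, 13.0, 0.9))  # about 0.3
```

The asymmetry for $$\tau \ne 0.5$$ is what pushes the fitted value toward the corresponding quantile instead of the median.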

### Why is L1 loss non-differentiable at zero?

The absolute value function has a sharp corner at zero, making L1 loss non-differentiable when a prediction exactly matches its target. While this property rarely causes issues in practice (continuous predictions almost never exactly match targets), it can lead to slower or oscillating convergence near the optimum compared to the smooth quadratic curvature of L2 loss.

This non-differentiability also has a beneficial side effect in regularization contexts. When L1 is used as a penalty on model weights (L1 regularization), the sharp corner at zero encourages weights to become exactly zero, producing sparse models. This is the mechanism behind the LASSO (Least Absolute Shrinkage and Selection Operator) method [3].

### Convexity

L1 loss is a convex function [12]. For convex models such as [linear regression](/wiki/linear_regression), this means that any local minimum is also a global minimum, and subgradient methods with suitably decreasing step sizes converge to an optimal solution. However, convexity of the loss function alone does not guarantee convexity of the overall training objective when the model itself is non-convex, as is the case with [neural networks](/wiki/neural_network).

## How is L1 loss different from L2 loss?

The following table summarizes the key differences between L1 loss and several related loss functions commonly used in regression tasks.

| Property | L1 loss (MAE) | L2 loss (MSE) | Huber loss | Smooth L1 loss |
|---|---|---|---|---|
| Formula (per sample) | $$\lvert y - \hat{y} \rvert$$ | $$(y - \hat{y})^2$$ | Piecewise: quadratic for small errors, linear for large errors | Equivalent to Huber(x)/β, with different parameterization |
| Gradient magnitude | Constant (±1) | Proportional to error | Proportional to error (small), constant (large) | Similar to Huber |
| Differentiable everywhere | No (kink at 0) | Yes | Yes | Yes |
| Robustness to outliers | High | Low | High (tunable via δ) | High (tunable via β) |
| Optimal estimator | Conditional median | Conditional mean | Between mean and median (M-estimator) | Between mean and median (M-estimator) |
| Corresponding error distribution | Laplace | Gaussian | N/A | N/A |
| Convergence near optimum | Can oscillate | Smooth, fast | Smooth, fast | Smooth, fast |
| Use in [object detection](/wiki/object_detection) | Less common | Less common | Less common | Very common (e.g., Fast R-CNN) |

### L1 loss vs. L2 loss

The core difference between L1 and L2 loss is the squaring operation. L2 loss squares each residual, which amplifies large errors and shrinks small errors (those with magnitude less than 1). This has several practical consequences:

- **Outlier sensitivity:** L2 loss pulls the model toward fitting outliers at the expense of the majority of data points. L1 loss distributes its penalty more evenly.
- **Gradient behavior:** The L2 gradient is proportional to the residual, so updates are larger when errors are large and smaller when errors are small. The L1 gradient has constant magnitude, so updates are the same size regardless of error magnitude. This means L1 loss can be slower to converge when errors are very small.
- **Solution stability:** Small changes to the data can cause the L1 optimal solution to jump between different sets of data points that the fit passes through. L2 solutions vary smoothly with the data because, for a linear model with a full-rank design matrix, the objective is strictly convex (the Hessian is positive definite).
- **Multiple solutions:** L1 loss can have multiple optimal solutions, especially in low-dimensional problems. L2 loss has a unique optimum (assuming the design matrix has full rank).
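The non-uniqueness is easy to see for a constant predictor on an even number of points. In the sketch below (made-up data, illustrative function names), every value between the two middle observations attains the same sum of absolute deviations, whereas the sum of squares has a single minimizer at the mean.

```python
def sum_abs_dev(data, c):
    """Sum of absolute deviations of the data from a constant c."""
    return sum(abs(x - c) for x in data)

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from a constant c."""
    return sum((x - c) ** 2 for x in data)

data = [1, 2, 4, 10]  # even number of points; the middle two are 2 and 4

# Any c in [2, 4] is an L1 minimizer: the objective is flat on that interval.
print([sum_abs_dev(data, c) for c in (2, 2.5, 3, 4)])  # [11, 11.0, 11, 11]
print(sum_abs_dev(data, 1.5), sum_abs_dev(data, 5))    # 12.0 13

# The L2 objective is strictly smallest at the mean, 4.25.
print([sum_sq_dev(data, c) for c in (4, 4.25, 4.5)])   # [49, 48.75, 49.0]
```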

### What is the Huber loss?

Huber loss, introduced by Peter Huber in 1964, combines the advantages of both L1 and L2 loss [5]. It is defined piecewise with a threshold parameter $$\delta$$:

$$
L_\delta(a) = \begin{cases} \frac{1}{2} a^2 & \text{if } \lvert a \rvert \le \delta \\ \delta (\lvert a \rvert - \delta/2) & \text{if } \lvert a \rvert > \delta \end{cases}
$$

For errors smaller than $$\delta$$, the loss behaves like L2 (quadratic), providing smooth gradients and fast convergence. For errors larger than $$\delta$$, the loss behaves like L1 (linear), providing robustness to outliers. The $$\delta$$ parameter is a [hyperparameter](/wiki/hyperparameter) that must be chosen by the practitioner, typically through cross-validation.
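The piecewise formula can be written as a short vectorized function. This is a minimal NumPy sketch of the definition above, not a reference implementation; the function name `huber_loss` and the default `delta=1.0` are choices made here.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Element-wise Huber loss of a residual a = y - y_hat."""
    a = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * a ** 2              # used where |a| <= delta
    linear = delta * (a - 0.5 * delta)    # used where |a| > delta
    return np.where(a <= delta, quadratic, linear)

r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = huber_loss(r, delta=1.0)
print(losses)  # values: 2.5, 0.125, 0.0, 0.125, 2.5
```

The two branches agree at $$\lvert a \rvert = \delta$$ (both give $$\delta^2 / 2$$), so the loss and its first derivative are continuous there.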

### What is the Smooth L1 loss?

Smooth L1 loss was introduced by Ross Girshick in the Fast R-CNN paper (2015) for bounding box regression in object detection [6]. It is closely related to Huber loss, and in [PyTorch](/wiki/pytorch) it is parameterized by a value $$\beta$$ (beta):

$$
\text{SmoothL1}(x) = \begin{cases} \dfrac{0.5 x^2}{\beta} & \text{if } \lvert x \rvert < \beta \\ \lvert x \rvert - 0.5\beta & \text{if } \lvert x \rvert \ge \beta \end{cases}
$$

The PyTorch documentation states that Smooth L1 loss "uses a squared term if the absolute element-wise error falls below beta and an L1 term otherwise," that it "is less sensitive to outliers than torch.nn.MSELoss and in some cases prevents exploding gradients," and that it is "equivalent to huber(x,y)/beta" [14]. As $$\beta$$ approaches 0, Smooth L1 loss converges to standard L1Loss; as $$\beta$$ approaches $$\infty$$, it converges to a constant zero loss (while the closely related HuberLoss converges to MSELoss) [14]. The key advantage of Smooth L1 over plain L1 is differentiability at zero, which allows stable gradient-based optimization without the need for subgradient methods.
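The relationship to Huber loss and the small-$$\beta$$ limit can be verified numerically. The sketch below re-implements both formulas in NumPy for illustration (it does not call PyTorch); the function names are hypothetical and it assumes $$\beta > 0$$.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Element-wise Smooth L1 loss of a residual x (requires beta > 0)."""
    a = np.abs(np.asarray(x, dtype=float))
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def huber(x, delta=1.0):
    """Element-wise Huber loss of a residual x."""
    a = np.abs(np.asarray(x, dtype=float))
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

x = np.linspace(-3.0, 3.0, 13)
beta = 0.5

# Smooth L1 with parameter beta equals Huber with delta = beta, divided by beta.
same_up_to_scale = np.allclose(smooth_l1(x, beta), huber(x, beta) / beta)
print(same_up_to_scale)  # True

# For a very small beta, the loss is within 0.5 * beta of the plain absolute error.
gap = np.max(np.abs(smooth_l1(x, 1e-6) - np.abs(x)))
print(gap <= 0.5e-6 + 1e-12)  # True
```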

## Connection to L1 regularization and LASSO

L1 loss and L1 regularization are related but distinct concepts. L1 loss measures prediction error using absolute deviations. L1 regularization adds a penalty proportional to the sum of the absolute values of model weights to the training objective.

The LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Robert Tibshirani in 1996, combines an L2 loss (least squares) data term with an L1 regularization penalty [3]:

$$
\text{LASSO objective} = \frac{1}{2n} \sum (y_i - \hat{y}_i)^2 + \lambda \sum \lvert w_j \rvert
$$

where $$\lambda$$ controls the strength of the penalty and $$w_j$$ are the model weights. The L1 penalty encourages sparsity by driving some weights to exactly zero, effectively performing automatic [feature selection](/wiki/feature_engineering) [3]. This differs from L2 regularization (Ridge regression), which shrinks weights toward zero but does not set them to exactly zero.
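The difference between the two penalties is clearest in one dimension, where both penalized problems have closed-form solutions. The sketch below is an illustration under simplifying assumptions, not the full LASSO algorithm: it assumes a single weight $$w$$ and the objectives $$\tfrac{1}{2}(w - z)^2 + \lambda \lvert w \rvert$$ (L1 penalty) and $$\tfrac{1}{2}(w - z)^2 + \tfrac{\lambda}{2} w^2$$ (L2 penalty), where $$z$$ is the unpenalized solution. The function names are hypothetical.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 (proportional shrinkage)."""
    return np.asarray(z, dtype=float) / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # made-up unpenalized weights
w_l1 = l1_shrink(z, lam=0.5)
w_l2 = l2_shrink(z, lam=0.5)

print(w_l1)  # weights with |z| <= 0.5 become exactly zero
print(w_l2)  # every weight shrinks, but none becomes zero
print(np.count_nonzero(w_l1), np.count_nonzero(w_l2))  # 3 5
```

Under the L1 penalty any weight whose unpenalized value lies within $$\lambda$$ of zero is set exactly to zero, which is the one-dimensional version of the sparsity effect described above.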

Geometrically, the L1 constraint region forms a diamond (or cross-polytope) in parameter space, while the L2 constraint region forms a ball (a disk in two dimensions). The corners of the diamond lie on the coordinate axes, making it more likely that the optimal solution falls at a corner where one or more parameters are zero [11].

It is also possible to use L1 loss as the data term together with L1 regularization on the weights, producing a model that is resistant both to outliers in the response variable and to irrelevant features in the predictors.

## Optimization methods

Because L1 loss is not differentiable everywhere, specialized optimization techniques are sometimes needed.

### Subgradient descent

The most straightforward approach replaces the gradient with a subgradient at non-differentiable points. Subgradient descent is guaranteed to converge for convex problems, though typically at a slower rate than gradient descent on smooth functions. The convergence rate for Lipschitz-continuous convex functions (which includes L1 loss) is $$O(1/\sqrt{T})$$ for $$T$$ iterations with appropriately decreasing step sizes [12].
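A minimal sketch of subgradient descent for least absolute deviations regression is shown below. It is an illustration under stated assumptions, not a tuned solver: the synthetic data, the step size $$0.5/\sqrt{t}$$, and the iteration count are arbitrary choices made here. Because a subgradient step is not guaranteed to decrease the loss, the sketch keeps track of the best iterate seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
# Synthetic data (assumption): intercept plus one feature, Laplace noise.
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
w_true = np.array([1.0, 2.0])
y = X @ w_true + rng.laplace(scale=0.1, size=n)

def mae(w):
    """Mean absolute error of the linear model X @ w."""
    return np.mean(np.abs(y - X @ w))

w = np.zeros(p)
initial_loss = mae(w)
best_w, best_loss = w.copy(), initial_loss

for t in range(1, 2001):
    residual = y - X @ w
    # Subgradient of the MAE with respect to w; np.sign gives 0 at the kink.
    subgrad = -(X.T @ np.sign(residual)) / n
    w = w - (0.5 / np.sqrt(t)) * subgrad  # decreasing step size
    loss = mae(w)
    if loss < best_loss:
        best_w, best_loss = w.copy(), loss

print(best_w)                   # close to w_true
print(initial_loss, best_loss)  # the loss drops substantially
```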

### Iteratively reweighted least squares (IRLS)

IRLS approximates the L1 objective by solving a sequence of weighted least squares problems. At each iteration, each point's weight is set inversely proportional to the absolute value of its current residual (usually with a small floor on the residual to avoid division by zero), so points with large residuals receive less weight. This approach reuses the efficient closed-form solution of weighted least squares while iterating toward the L1 solution.
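The iteration can be sketched in a few lines of NumPy. This is a simplified illustration rather than a production solver: the function name `lad_irls`, the residual floor `eps`, the fixed iteration count, and the synthetic data are all assumptions made here.

```python
import numpy as np

def lad_irls(X, y, n_iter=50, eps=1e-6):
    """Approximate least-absolute-deviations fit via reweighted least squares."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        residual = y - X @ w
        # Weight each point by 1 / |residual|, floored at eps to avoid division by zero.
        weights = 1.0 / np.maximum(np.abs(residual), eps)
        # Weighted least squares: solve (X^T W X) w = X^T W y.
        XtW = X.T * weights
        w = np.linalg.solve(XtW @ X, XtW @ y)
    return w

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.01, size=x.size)
y[::10] += 20.0  # corrupt 5 of the 50 targets with gross outliers

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_lad = lad_irls(X, y)
print(w_ols)  # pulled far away from (1, 2) by the outliers
print(w_lad)  # close to (1, 2)
```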

### Linear programming

Minimizing the sum of absolute deviations can be reformulated as a linear programming problem by introducing auxiliary variables. This allows the use of standard LP solvers, including the simplex method and interior-point methods. The Barrodale-Roberts algorithm is a specialized simplex-based method designed specifically for L1 regression.
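One common way to write this reformulation, for a linear model with feature vectors $$x_i$$ and weight vector $$w$$, is sketched below. The symbols $$u_i$$ and $$v_i$$ for the auxiliary variables are notation chosen here, not taken from the cited sources; they stand for the positive and negative parts of each residual.

$$
\min_{w,\,u,\,v} \; \sum_{i=1}^{n} (u_i + v_i) \quad \text{subject to} \quad x_i^\top w + u_i - v_i = y_i, \quad u_i \ge 0, \quad v_i \ge 0, \quad i = 1, \dots, n
$$

At an optimum, at most one of $$u_i$$ and $$v_i$$ is nonzero for each $$i$$, so $$u_i + v_i = \lvert y_i - x_i^\top w \rvert$$ and the objective equals the sum of absolute deviations. The objective and all constraints are linear in $$(w, u, v)$$, which is what makes standard LP solvers applicable.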

### In [deep learning](/wiki/deep_learning)

When training neural networks with L1 loss, standard [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) or adaptive optimizers such as [Adam](/wiki/adam_optimizer) are used. The subgradient at zero is conventionally set to 0 in automatic differentiation frameworks. In practice, this works well because the probability of hitting the exact non-differentiable point is negligible in floating-point arithmetic.

## What is L1 loss used for?

### Robust regression

L1 loss is a standard choice when the training data contains outliers or heavy-tailed error distributions. In fields such as economics, environmental science, and sensor data analysis, measurements often include erroneous readings or extreme values. Using L1 loss prevents these anomalous points from dominating the fitted model.

### Image super-resolution and restoration

Research by Zhao et al. (2017) at NVIDIA demonstrated that neural networks trained with L1 loss for image restoration tasks (super-resolution, JPEG artifact removal, demosaicing) produce higher-quality results than those trained with L2 loss, even when evaluated using L2-based metrics like PSNR [7]. The authors note that "the L2 loss has been the de facto standard" for image processing networks but that "the quality of the results improves significantly with better choices for the loss function, even when the network architecture is left unchanged" [7]. The reason is that L2 loss tends to produce blurry outputs by averaging over multiple plausible reconstructions, while L1 loss encourages sharper predictions. Combinations of L1 loss with perceptual losses (such as MS-SSIM) can yield even better results [7].

### Object detection (bounding box regression)

In [object detection](/wiki/object_detection) frameworks such as Fast R-CNN and Faster R-CNN, bounding box coordinates are typically regressed using Smooth L1 loss rather than plain L1 loss [6]. Smooth L1 provides robustness to large errors (from poorly matched anchor boxes) while maintaining smooth gradients for precise localization when errors are small. Other detector families, such as [YOLO](/wiki/yolo), have used different localization losses; the original YOLO formulation used a sum-of-squared-errors term.

### Time series forecasting

MAE is commonly used as both a loss function and evaluation metric in time series forecasting. Because time series data frequently contains anomalous spikes or drops, L1 loss helps the model focus on the typical pattern rather than fitting extreme events. Many forecasting competitions (such as the M-competitions) report MAE or related metrics as primary evaluation criteria.

### Generative models

L1 loss appears in several generative modeling architectures. In image-to-image translation (e.g., Pix2Pix), L1 loss is combined with an adversarial loss to encourage the generated image to be close to the target while remaining visually realistic [9]. The L1 term ties the output to the ground truth and provides a strong pixel-level supervision signal; Isola et al. report that they use L1 rather than L2 specifically because "L1 encourages less blurring" [9].

### Signal processing and compressed sensing

In compressed sensing, L1 minimization is used to recover sparse signals from a small number of linear measurements. The key theoretical result (by Candes, Romberg, and Tao, 2006) is that under certain conditions on the measurement matrix, the sparsest solution to an underdetermined system of equations can be found by solving an L1 minimization problem, which is convex and computationally tractable [8].

## Implementation in popular frameworks

### [PyTorch](/wiki/pytorch)

PyTorch provides `torch.nn.L1Loss` as a built-in module, documented as a "criterion that measures the mean absolute error (MAE) between each element in the input x and target y" [13]:

```python
import torch
import torch.nn as nn

# Create loss function
loss_fn = nn.L1Loss(reduction='mean')  # Options: 'none', 'mean', 'sum'

# Example usage
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8])
targets = torch.tensor([3.0, -0.5, 2.0, 7.5])

loss = loss_fn(predictions, targets)
print(loss)  # tensor(0.3500): mean of |0.5|, |0.5|, |0.1|, |0.3|
```

The `reduction` parameter controls how individual losses are aggregated: `'none'` returns the per-element loss, `'mean'` (the default) returns the average, and `'sum'` returns the total [13].

### [TensorFlow](/wiki/tensorflow) / [Keras](/wiki/keras)

TensorFlow and Keras offer MAE as both a loss function and a metric:

```python
import tensorflow as tf

# As a loss function
loss_fn = tf.keras.losses.MeanAbsoluteError()

# Example usage
predictions = tf.constant([2.5, 0.0, 2.1, 7.8])
targets = tf.constant([3.0, -0.5, 2.0, 7.5])

loss = loss_fn(targets, predictions)
print(loss)  # scalar float32 tensor, approximately 0.35
```

### scikit-learn

For evaluation outside a training loop, scikit-learn exposes `sklearn.metrics.mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')`, described in its documentation as the "mean absolute error regression loss," a non-negative value whose "best value is 0.0" [15]:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.5]
y_pred = [2.5, 0.0, 2.1, 7.8]

mean_absolute_error(y_true, y_pred)  # approximately 0.35
```

### NumPy (from scratch)

A minimal implementation of L1 loss in NumPy:

```python
import numpy as np

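def l1_loss(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)  # also accept plain Python lists
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def l1_loss_gradient(y_true, y_pred):
    """Subgradient of the mean absolute error with respect to y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.sign returns 0 where y_pred == y_true, a valid subgradient at the kink.
    return np.sign(y_pred - y_true) / y_true.size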
```

## Practical considerations

### When should you use L1 loss?

L1 loss is a good default choice in the following situations:

- The training data contains outliers or heavy-tailed noise.
- The goal is to estimate the conditional median rather than the conditional mean.
- Robustness is more important than achieving the lowest possible error on clean data.
- The application involves image restoration, where L1 loss tends to produce sharper outputs than L2 loss [7].

### When should you prefer L2 loss?

L2 loss may be preferable when:

- The data is clean and errors are approximately Gaussian.
- Smooth, fast convergence is desired (the quadratic curvature of L2 loss helps optimizers converge more quickly near the minimum).
- A unique, stable solution is needed.
- The application specifically requires minimizing variance rather than deviation from the median.

### When should you consider Huber or Smooth L1 loss?

Huber loss or Smooth L1 loss should be considered when:

- Robustness to outliers is needed, but smooth gradients are also desired for stable training.
- The application involves bounding box regression or other tasks where both large and small errors are common [6].
- The practitioner is willing to tune the threshold parameter ($$\delta$$ or $$\beta$$).

### Common pitfalls

1. **Constant gradient magnitude:** Because the L1 gradient has constant magnitude, the optimizer does not automatically slow down as it approaches the minimum. This can cause oscillation around the optimum. Using a learning rate schedule or an adaptive optimizer like [Adam](/wiki/adam_optimizer) can mitigate this issue.
2. **Scale sensitivity:** Like most regression losses, L1 loss is sensitive to the scale of the target variable. Normalizing or standardizing targets before training can improve optimization stability.
3. **Confusing L1 loss with L1 regularization:** L1 loss is about prediction error, while L1 regularization is about constraining model weights. They are often used together but serve different purposes.
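The first pitfall can be reproduced on a one-dimensional toy problem. The sketch below is illustrative only: it minimizes $$\lvert x - 3 \rvert$$ and $$\tfrac{1}{2}(x - 3)^2$$ by plain gradient descent, and the step sizes, iteration counts, and helper names are arbitrary choices made here.

```python
def run_gd(grad, x0, lr_schedule, steps):
    """Plain gradient descent with a step size that may depend on the step index t."""
    x = x0
    for t in range(1, steps + 1):
        x = x - lr_schedule(t) * grad(x)
    return x

target = 3.0

def l1_grad(x):
    """(Sub)derivative of |x - target|: -1, 0, or +1."""
    return (x > target) - (x < target)

def l2_grad(x):
    """Derivative of 0.5 * (x - target)**2."""
    return x - target

# Fixed step size: L1 keeps jumping back and forth across the minimum.
x_l1_fixed = run_gd(l1_grad, 0.0, lambda t: 0.4, 100)
# Same fixed step size on L2: the updates shrink with the error and converge.
x_l2_fixed = run_gd(l2_grad, 0.0, lambda t: 0.4, 100)
# Decaying step size: the L1 oscillation shrinks over time.
x_l1_decay = run_gd(l1_grad, 0.0, lambda t: 0.4 / t ** 0.5, 1000)

print(abs(x_l1_fixed - target))  # about 0.2, and it never improves
print(abs(x_l2_fixed - target))  # essentially 0
print(abs(x_l1_decay - target))  # small, bounded by the current step size
```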

## Summary comparison table

| Criterion | L1 loss (MAE) | L2 loss (MSE) |
|---|---|---|
| Penalizes errors | Linearly | Quadratically |
| Effect of outliers | Moderate (linear) | Severe (quadratic) |
| Gradient at large errors | Constant magnitude | Large magnitude |
| Gradient near zero error | Constant magnitude | Near zero |
| Differentiable everywhere | No | Yes |
| Statistical estimator | Median | Mean |
| Probabilistic model | Laplace distribution | Gaussian distribution |
| Solution uniqueness | May have multiple | Unique (full rank) |
| Convergence speed | Slower near minimum | Faster near minimum |
| Produces sparse solutions (as regularizer) | Yes | No |

## See also

- [Loss function](/wiki/loss_function)
- [L2 loss](/wiki/l2_loss)
- [Cross-entropy loss](/wiki/cross_entropy_loss)
- [Gradient descent](/wiki/gradient_descent)
- [Regularization](/wiki/regularization)
- [Linear regression](/wiki/linear_regression)
- [Convex optimization](/wiki/convex_optimization)

## References

1. Boscovich, R. J. (1757). *De Litteraria Expeditione per Pontificiam ditionem*. Bononiensi Scientiarum et Artium Instituto Atque Academia Commentarii, 4, 353-396.
2. Laplace, P. S. (1788). *Memoire sur la theorie de Jupiter et de Saturne*. Memoires de l'Academie Royale des Sciences de Paris.
3. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society: Series B*, 58(1), 267-288.
4. Koenker, R., & Bassett, G. (1978). Regression quantiles. *Econometrica*, 46(1), 33-50.
5. Huber, P. J. (1964). Robust estimation of a location parameter. *Annals of Mathematical Statistics*, 35(1), 73-101.
6. Girshick, R. (2015). Fast R-CNN. *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 1440-1448.
7. Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2017). Loss functions for image restoration with neural networks. *IEEE Transactions on Computational Imaging*, 3(1), 47-57.
8. Candes, E. J., Romberg, J. K., & Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. *Communications on Pure and Applied Mathematics*, 59(8), 1207-1223.
9. Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 1125-1134.
10. Barron, J. T. (2019). A general and adaptive robust loss function. *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 4331-4339.
11. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.
12. Boyd, S., & Vandenberghe, L. (2004). *Convex Optimization*. Cambridge University Press.
13. PyTorch Documentation. *torch.nn.L1Loss*. PyTorch Foundation. https://docs.pytorch.org/docs/stable/generated/torch.nn.L1Loss.html
14. PyTorch Documentation. *torch.nn.SmoothL1Loss*. PyTorch Foundation. https://docs.pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html
15. scikit-learn Documentation. *sklearn.metrics.mean_absolute_error*. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html