Loss
Last reviewed
May 9, 2026
Sources
21 citations
Review status
Source-backed
Revision
v5 · 6,352 words
See also: Machine learning, Loss function, Gradient descent
In machine learning, loss (sometimes called error) is a scalar value that quantifies how far a model's prediction deviates from the true or expected output for a single training example. During the training process, the model's parameters are iteratively adjusted to reduce this value, driving the model toward more accurate predictions. Loss is the fundamental signal that enables learning: without it, optimization algorithms like gradient descent would have no direction in which to update a model's weights.
The concept of loss sits at the intersection of statistics, optimization theory, and machine learning engineering. It is central to nearly every supervised and many unsupervised learning workflows, from simple linear regression to large-scale deep learning systems like transformers. Where the loss function article focuses on specific mathematical formulas and their derivations, this article focuses on loss as an operational quantity: how it is produced, monitored, interpreted, and acted upon during the lifecycle of a model.
Formally, the loss for a single example is defined by a loss function L(y, ŷ), where y is the ground-truth label and ŷ is the model's prediction. The output is a non-negative real number (a scalar) that represents the penalty for that prediction. A loss of zero means the model's prediction exactly matches the true value; larger values indicate greater error.
For example, if a model predicts that a house costs $310,000 and the actual price is $300,000, the squared error loss for that example would be (310,000 - 300,000)^2 = 100,000,000. This single-example loss is the atomic unit of the learning signal in supervised learning.
When training proceeds, the per-example losses are aggregated. Most practitioners report a batch loss (the mean loss over a mini-batch), an epoch loss (the mean over all batches in one pass through the dataset), and sometimes a running loss (a moving average across recent steps). These aggregates are the numbers that appear on training dashboards, and they form the loss curves that practitioners read to diagnose training health.
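The aggregation can be made concrete with a small sketch. The function names and the two mini-batches below are illustrative assumptions, not part of any framework; the first value reproduces the house-price example above.

```python
def squared_error(y_true, y_pred):
    """Loss for one example."""
    return (y_true - y_pred) ** 2

def batch_loss(targets, predictions):
    """Mean of the per-example losses in one mini-batch."""
    losses = [squared_error(y, p) for y, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

# The house-price example from the text.
house_loss = squared_error(300_000, 310_000)

# Two hypothetical mini-batches of (targets, predictions).
batches = [
    ([3.0, 5.0], [2.0, 5.0]),  # per-example losses: 1.0, 0.0
    ([1.0, 4.0], [3.0, 4.0]),  # per-example losses: 4.0, 0.0
]
batch_losses = [batch_loss(t, p) for t, p in batches]

# Epoch loss as the mean over batches (this equals the mean over
# examples only when every batch has the same size).
epoch_loss = sum(batch_losses) / len(batch_losses)
```

Here the batch losses are 0.5 and 2.0 and the epoch loss is 1.25.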
The machine learning concept of loss has a deep root in statistical decision theory. In that framework, given a true data distribution p(x, y), the true risk (also called the expected loss or population risk) of a model f with parameters θ is the expected value of the loss across the entire distribution:
R(θ) = E[L(y, f(x; θ))]
This quantity is what a model designer ultimately cares about, because it measures performance on the underlying task in the real world. Unfortunately, p(x, y) is unknown. Practitioners only have access to a finite training set drawn from that distribution.
To make progress, machine learning systems minimize the empirical risk instead, which is the average loss over the training set:
R̂(θ) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i; θ))
This substitution is the empirical risk minimization (ERM) principle, a cornerstone of the statistical learning theory of Vladimir Vapnik and Alexey Chervonenkis that Vapnik's books brought to a wide audience in the 1990s. For any fixed θ, the law of large numbers ensures that as the number of training examples n grows, the empirical risk converges to the true risk. For a sufficiently large training set drawn independently from the same distribution as the test data, minimizing empirical loss is therefore a reasonable proxy for minimizing the true risk, provided the model class is not so flexible that it can fit noise.
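A minimal sketch of the empirical risk for a one-dimensional linear model with squared error follows. The model form, the data points, and the function names are assumptions chosen for illustration.

```python
def predict(x, theta):
    """Hypothetical model f(x; theta) = w * x + b."""
    w, b = theta
    return w * x + b

def empirical_risk(data, theta):
    """Average squared-error loss over the training set."""
    return sum((y - predict(x, theta)) ** 2 for x, y in data) / len(data)

# Toy training set generated by y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

risk_at_truth = empirical_risk(data, (2.0, 1.0))  # parameters that fit exactly
risk_off = empirical_risk(data, (1.0, 1.0))       # residuals 0, 1, 2
```

The parameters that generated the data give an empirical risk of zero, while the second setting gives (0 + 1 + 4) / 3.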
The gap between empirical risk and true risk is the generalization gap, and much of modern machine learning theory is concerned with bounding this gap. Tools such as VC dimension, Rademacher complexity, and PAC-Bayes bounds all attempt to characterize when minimizing training loss will produce a model that performs well on unseen data.
The terms "loss," "cost," and "objective function" are closely related and sometimes used interchangeably in practice. However, many textbooks draw the following distinctions:
| Term | Scope | Description |
|---|---|---|
| Loss function | Single example | Measures prediction error for one data point. Example: the squared error for one sample. |
| Cost function | Entire dataset | Aggregates the loss over all training examples, typically as an average or sum. May also include regularization terms. |
| Objective function | Optimization goal | The broadest term. Can refer to any function being optimized (minimized or maximized), including cost functions, reward functions, or likelihood functions. |
Some authors, including Sebastian Raschka, treat "loss" and "cost" as synonyms, noting that there is no universal consensus on the distinction.[1] In everyday conversation among practitioners, "loss" frequently refers to the aggregated value reported per batch or per epoch during training, even though that is technically a cost function.
A related word is error. In statistics, "error" usually refers to the residual y - ŷ, while "loss" refers to the function applied to that residual. In casual usage these are often blurred, and loss is sometimes called "training error" or "validation error." The Goodfellow, Bengio, and Courville textbook Deep Learning uses "cost function" and "loss function" interchangeably and notes that the choice depends on whichever phrasing is most natural in a given context.[10]
The training loop in most machine learning systems follows a repeated cycle:
This cycle repeats across many batches and epochs until the loss converges to an acceptable level or training is stopped by other criteria.
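As a rough illustration of such a cycle, the sketch below fits a one-dimensional linear model with full-batch gradient descent in plain Python. The data, learning rate, and epoch count are assumptions, and the gradients are written out by hand where a framework would compute them automatically.

```python
def train(data, lr=0.05, epochs=500):
    w, b = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(epochs):
        # 1. Forward pass: compute predictions.
        preds = [w * x + b for x, _ in data]
        # 2. Loss: mean squared error over the batch.
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / n
        history.append(loss)
        # 3. Backward pass: gradients of the loss w.r.t. w and b.
        grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / n
        grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / n
        # 4. Update: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Toy data generated by y = 2x + 1.
w, b, history = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

On this toy data the parameters approach w = 2 and b = 1, and the recorded loss falls from its initial value toward zero.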
The loss value is the only signal that connects the data to the model's parameters. A model with poorly chosen loss can train without numerical errors and still fail to learn the intended task, because the gradient produced by the loss does not point in a useful direction. This is why loss design is treated as a first-class concern in modern machine learning research.
In deep learning frameworks like PyTorch, TensorFlow, and JAX, the loss is the entry point for computing gradients via reverse-mode automatic differentiation (autodiff). During the forward pass, the framework records or traces every operation applied to the model's tensors, building a computational graph. In PyTorch, calling .backward() on the loss tensor traverses this graph in reverse; TensorFlow and JAX expose the same computation through tf.GradientTape and jax.grad. In each case the chain rule is applied at every node to compute the partial derivative of the loss with respect to every parameter that affected it.[2]
Because the gradient is computed with respect to the loss, the loss must be a single scalar. If the model produces a vector or tensor output, that output has to be reduced to one number (typically by summing or averaging the per-example losses) before the backward pass. Many subtle bugs in training code can be traced to losses that were not properly reduced, producing gradients that depend on batch size in unintended ways.
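The batch-size dependence can be seen in a small hand-computed sketch. It assumes a hypothetical one-parameter model ŷ = w · x with squared error; the `reduction` argument mirrors the naming many frameworks use but is defined here only for illustration.

```python
def grad_w(batch, w, reduction):
    """Gradient of the reduced squared-error loss with respect to w."""
    per_example = [2 * (w * x - y) * x for x, y in batch]
    total = sum(per_example)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(batch)
    raise ValueError(f"unknown reduction: {reduction}")

small = [(1.0, 2.0), (2.0, 4.0)]
large = small * 2  # same examples, batch twice as big

sum_small, sum_large = grad_w(small, 0.0, "sum"), grad_w(large, 0.0, "sum")
mean_small, mean_large = grad_w(small, 0.0, "mean"), grad_w(large, 0.0, "mean")
```

With sum reduction the gradient doubles when the batch doubles, so the effective step size silently changes with batch size; with mean reduction it stays the same.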
Autodiff also imposes a practical constraint on loss design: the function must be composed of operations the framework can differentiate. Most numerical operations qualify, but operations like sorting, top-k selection, or decoding through a beam search require special handling, smooth relaxations, or surrogate losses that approximate the true objective.
Different tasks call for different loss functions, each with its own range of output values. The summary below lists the most common families; for full mathematical treatment of each function, see the dedicated loss function article and the linked subpages.
| Loss Function | Formula (single example) | Range | Typical Use Case |
|---|---|---|---|
| Mean Squared Error (MSE) | (y - ŷ)² | [0, +∞) | General regression; penalizes large errors heavily |
| Mean Absolute Error (MAE) | \|y - ŷ\| | [0, +∞) | Regression with outlier robustness |
| Huber Loss | Quadratic near zero, linear far from zero | [0, +∞) | Combines benefits of MSE and MAE |
| Log-cosh loss | log(cosh(y - ŷ)) | [0, +∞) | Smooth alternative to Huber, twice differentiable |
| Quantile (pinball) loss | max(q(y-ŷ), (q-1)(y-ŷ)) | [0, +∞) | Probabilistic forecasting, prediction intervals |
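The regression losses in the table can be sketched as single-example functions. The Huber threshold `delta` and its 1/2 scaling follow a common convention and are assumptions here, as is the rewritten form of log-cosh, which is algebraically equal to log(cosh(r)) but avoids overflow for large residuals.

```python
import math

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def log_cosh(y, y_hat):
    """log(cosh(r)) written as r + log1p(exp(-2r)) - log(2) for stability."""
    r = abs(y - y_hat)
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)

def pinball(y, y_hat, q):
    """Quantile loss for quantile level q in (0, 1)."""
    d = y - y_hat
    return max(q * d, (q - 1.0) * d)
```

For a residual of 2, squared error gives 4, absolute error gives 2, and Huber with delta = 1 gives 1.5, which shows how the robust losses grow more slowly on large errors.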
| Loss Function | Formula (single example) | Range | Typical Use Case |
|---|---|---|---|
| Cross-entropy (log loss) | -y log(ŷ) - (1-y) log(1-ŷ) | [0, +∞) | Binary and multi-class classification |
| Hinge loss | max(0, 1 - y·ŷ) | [0, +∞) | Support vector machines, binary classification |
| Focal loss | -α(1-ŷ)^γ log(ŷ) | [0, +∞) | Imbalanced classification, object detection |
| Kullback-Leibler (KL divergence) | Σ p(x) log(p(x)/q(x)) | [0, +∞) | Comparing probability distributions; variational autoencoders, distillation |
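A sketch of the classification losses in the table follows. It assumes binary labels in {0, 1} for cross-entropy, labels in {-1, +1} for hinge loss, and that the focal loss receives the probability assigned to the true class; the `eps` clamp and the defaults for `alpha` and `gamma` are illustrative choices.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """y in {0, 1}; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hinge(y, score):
    """y in {-1, +1}; score is the raw (unbounded) model output."""
    return max(0.0, 1.0 - y * score)

def focal(p_true, alpha=0.25, gamma=2.0):
    """p_true is the probability assigned to the correct class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Focal loss shrinks the contribution of well-classified examples: at p_true = 0.9 the factor (1 - p_true)^γ is 0.01, so the example contributes far less than it would under plain cross-entropy.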
| Loss Function | Domain | Typical Use Case |
|---|---|---|
| Sequence cross-entropy | Token-level | Language model next-token prediction |
| CTC loss | Audio, handwriting | Sequence labeling with unknown alignment |
| Triplet / contrastive loss | Embeddings | Face recognition, retrieval, SimCLR, CLIP |
| Dice / IoU loss | Segmentation | Medical imaging, foreground segmentation |
There is no universal "good" loss value that applies across tasks. A cross-entropy loss of 0.3 might be excellent for one dataset and mediocre for another. What matters is the trajectory of the loss over time and how it compares to a baseline or published benchmark for the same task.
Rough rules of thumb help readers calibrate what they see in practice:
These ranges are heuristics, not guarantees. The right way to interpret a loss number is always to compare it to a baseline and to the loss of competing models on the same data.
A single number called "loss" usually decomposes into three closely watched quantities: training loss, validation loss, and test loss.
| Loss type | Computed on | Used for | Notes |
|---|---|---|---|
| Training loss | Training set | Backpropagation, parameter updates | Falls steadily during healthy training; used by the optimizer at every step |
| Validation loss | Held-out validation set | Hyperparameter tuning, early stopping, model selection | The model is not updated against this loss directly; checking it once per epoch is common |
| Test loss | Held-out test set | Final reporting | Should be evaluated only after model selection is complete to give an unbiased estimate of generalization |
Mixing these roles is one of the most common methodological mistakes. Tuning hyperparameters against the test set, for instance, leaks information from the test set into the training process and inflates the reported performance. Best practice is to lock the test set away until the model is finalized and to use a separate validation split for all decisions made during development.
In smaller datasets, k-fold cross-validation averages the loss over multiple train/validation splits, producing a more stable estimate of generalization at the cost of additional compute. In very large datasets typical of deep learning, a single fixed validation split is usually adequate.
A loss curve (also called a learning curve) is a plot of loss values on the y-axis against training steps or epochs on the x-axis. Practitioners routinely plot two curves together: training loss and validation loss. Analyzing these curves is one of the most important diagnostic tools in machine learning.[3]
| Pattern | Training Loss | Validation Loss | Diagnosis |
|---|---|---|---|
| Good fit | Decreases and converges | Decreases and converges near training loss | Model is learning well and generalizing |
| Overfitting | Continues to decrease | Decreases, then increases or plateaus | Model memorizes training data instead of learning general patterns |
| Underfitting | Remains high | Remains high (close to training loss) | Model lacks capacity or has not trained long enough |
| Oscillating loss | Fluctuates erratically | Fluctuates erratically | Learning rate too high, poor data quality, or insufficient shuffling |
| Diverging loss | Increases or explodes to NaN/Inf | Increases or explodes | Severe numerical instability (see section below) |
| Plateau | Stagnates after early drop | Stagnates after early drop | Saddle point or vanishing gradient; try restart, warmup, or optimizer change |
| Loss spike | Sudden upward jump, often recovers | May or may not move | Bad batch, gradient norm explosion, Adam state corruption |
An ideal loss curve shows an exponential-like decrease that gradually flattens, indicating the model has extracted most of the learnable patterns from the data. The gap between training and validation curves is critical: a small gap indicates good generalization, while a growing gap signals overfitting.[3]
Raw per-step losses are noisy because each batch contains a different sample of data. Visualization tools such as TensorBoard and Weights & Biases offer smoothing controls that apply an exponential moving average (EMA) to the loss series, making trends easier to read. The EMA update is y_i = α · y_{i-1} + (1 - α) · x_i, with smoothing factor α (often 0.9 to 0.99 in practice). Heavy smoothing can mask short-lived spikes, so it is good practice to inspect both the smoothed and the raw curves.[4]
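The EMA update can be sketched in a few lines. Seeding the average with the first observation is an assumption made here for simplicity; visualization tools may initialize or debias the average differently.

```python
def ema_smooth(values, alpha=0.9):
    """Apply y_i = alpha * y_{i-1} + (1 - alpha) * x_i to a loss series."""
    if not values:
        return []
    smoothed = []
    prev = values[0]  # assumed seed: the first raw value
    for x in values:
        prev = alpha * prev + (1.0 - alpha) * x
        smoothed.append(prev)
    return smoothed

raw = [1.0, 1.0, 10.0, 1.0]  # a one-step spike in an otherwise flat series
smooth = ema_smooth(raw, alpha=0.9)
```

The spike of 10.0 appears as only about 1.9 in the smoothed series, which illustrates how heavy smoothing can hide short-lived events.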
Other useful diagnostic plots include the loss histogram (the distribution of per-example losses inside a batch, which can reveal outliers dominating training) and the gradient norm curve (a companion plot that exposes exploding or vanishing gradient regimes that are not always visible in the loss itself).
Effective loss monitoring is essential for producing well-trained models. Several strategies help practitioners get the most from their loss curves:
In very large models, including modern large language models, the training loss occasionally exhibits sudden upward spikes that can be hundreds of times larger than the typical loss. Research on stabilizing pre-training has linked these spikes to abrupt growth in gradient norms, often triggered by a single "bad" batch with numerical outliers, or by parameter norms drifting non-uniformly across layers.[6] Common mitigations include:
Most loss spikes recover on their own within a few iterations. Persistent spikes that fail to recover are often the first sign that hyperparameters need adjustment or that the data pipeline has produced corrupted samples.
One of the most frustrating problems during training is loss divergence, where the loss increases without bound, eventually reaching Inf (infinity) or NaN (not a number). Common causes include:[7]
| Cause | Mechanism | Typical Fix |
|---|---|---|
| Exploding gradients | Gradients grow exponentially through deep layers during backpropagation | Gradient clipping; reduce learning rate; use batch normalization |
| Learning rate too high | Parameter updates overshoot the minimum | Lower the learning rate; use a learning rate scheduler |
| Numerical instability | Operations like log(0), division by zero, or overflow in activations | Add small epsilon values (e.g., log(ŷ + 1e-7)); use mixed-precision training carefully |
| Corrupt or unnormalized data | NaN or extreme values in the input features or labels | Scan data for NaN/Inf; normalize or standardize features |
| Inappropriate loss function | Loss function does not match the task or output range | Verify that loss function assumptions (e.g., probability outputs for cross-entropy) are met |
| Mixed-precision overflow | FP16 activations exceed the representable range | Use loss scaling, switch to bfloat16, or keep certain layers in FP32 |
When NaN appears in the loss, training should be stopped immediately. Continuing after NaN values propagate through the network will corrupt all parameters. Debugging typically involves checking the last few batches before the NaN appeared, inspecting gradient magnitudes, and validating the input data pipeline.
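Two of the safeguards mentioned above, clipping gradients by their global norm and halting on a non-finite loss, can be sketched in plain Python. The function names and the choice of exception are assumptions; frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads], norm
    return list(grads), norm

def check_finite(loss, step):
    """Stop training as soon as the loss is NaN or infinite."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")

clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

The gradient [3, 4] has norm 5 and is rescaled to roughly [0.6, 0.8], preserving its direction while bounding the update size. Logging the pre-clipping norm alongside the loss gives an early warning before the loss itself diverges.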
A helpful diagnostic technique is to wrap the forward and backward passes in an anomaly detector, such as PyTorch's torch.autograd.detect_anomaly(), which raises an error at the operation that produces a non-finite gradient. This is too slow to leave on during full training but is invaluable when investigating a divergence.
The loss landscape is the surface formed by plotting the loss as a function of the model's parameters. For a model with two parameters, this produces a 3D surface; for real neural networks with millions of parameters, the landscape exists in extremely high-dimensional space and can only be visualized through projections.
In their influential 2018 paper "Visualizing the Loss Landscape of Neural Nets," Li et al. introduced filter normalization methods to produce meaningful 2D cross-sections of high-dimensional loss surfaces.[8] Their key findings include:
Loss landscape analysis has become an important research area for understanding why certain architectures and hyperparameter choices lead to better training outcomes.[9]
Research by Dauphin et al. (2014) showed that in very high-dimensional spaces, almost all critical points of the loss are saddle points, not local minima.[15] At a saddle point, the gradient is zero but the loss can decrease in some directions and increase in others. Vanilla gradient descent tends to stall at these points because it has nothing to push it off. Momentum, adaptive optimizers like Adam, and stochastic noise from mini-batching all help models escape saddles. This is why pure batch gradient descent is rarely used in modern deep learning: the noise in stochastic updates is itself a useful feature.
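A toy example makes the stalling behavior visible. The function f(x, y) = x² - y² is an assumed stand-in for a loss surface with a saddle at the origin, and the tiny initial offset in y plays the role of the noise that stochastic updates provide.

```python
def gradient_descent(x, y, lr=0.1, steps=150):
    """Run plain gradient descent on f(x, y) = x**2 - y**2."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x -= lr * grad_x
        y -= lr * grad_y
    return x, y

# Started exactly on the x-axis, the iterate converges to the saddle.
stalled = gradient_descent(1.0, 0.0)

# A perturbation of 1e-8 in y is amplified each step and escapes.
escaped = gradient_descent(1.0, 1e-8)
```

Without any perturbation, y stays exactly zero and the iterate settles at the saddle; with the perturbation, y grows by a factor of 1.2 per step and leaves the saddle region.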
Not all minima are equal. Solutions in flat regions of the loss landscape, where the loss changes slowly with small parameter perturbations, tend to generalize better than solutions in sharp minima where small weight changes produce large loss increases.[8] Several training techniques are designed to bias the optimizer toward flat minima:
A common source of confusion is the distinction between loss and evaluation metrics (such as accuracy, precision, recall, or F1 score).
| Property | Loss | Metric |
|---|---|---|
| Differentiable | Yes (required for gradient-based optimization) | Often not (e.g., accuracy is a step function) |
| Used during training | Yes (drives parameter updates) | Sometimes logged, but not used for optimization |
| Interpretability | Scale depends on the loss function; not always intuitive | Often more interpretable (e.g., "93% accuracy") |
| Relationship to task goal | Proxy for the real objective | Often more directly aligned with the real objective |
Loss must be differentiable so that gradient descent can compute the direction and magnitude of parameter updates. Metrics like accuracy produce discrete (correct/incorrect) outputs, making them non-differentiable and unsuitable as optimization targets. Cross-entropy loss, for instance, serves as a smooth, differentiable proxy for accuracy in classification tasks.[11]
It is also worth noting that loss and accuracy do not always move in opposite directions. A model can achieve lower loss while accuracy stays flat, or vice versa. This happens because loss captures confidence in predictions (rewarding confident correct predictions, charging a small penalty for hesitant correct ones, and a large penalty for confident wrong ones), while accuracy only counts whether the final prediction was correct or not.[12]
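A small sketch shows two hypothetical models with identical accuracy but different loss. The labels, probabilities, and 0.5 decision threshold are illustrative assumptions.

```python
import math

def mean_cross_entropy(labels, probs):
    """Mean binary cross-entropy; probs are predicted P(class = 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(labels, probs)
    ) / len(labels)

def accuracy(labels, probs, threshold=0.5):
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)

labels = [1, 0, 1]
hesitant = [0.6, 0.4, 0.6]   # correct, but barely
confident = [0.9, 0.1, 0.9]  # correct and confident

acc_hesitant, acc_confident = accuracy(labels, hesitant), accuracy(labels, confident)
loss_hesitant = mean_cross_entropy(labels, hesitant)
loss_confident = mean_cross_entropy(labels, confident)
```

Both models classify every example correctly, yet the hesitant model's loss is about five times larger, so a falling loss with flat accuracy can simply mean the model is becoming more confident.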
In ranking and search applications, the gap is even larger. Metrics like NDCG and mean average precision involve sorting and rank cutoffs that cannot be differentiated directly, so practitioners minimize surrogate losses (such as pairwise ranking loss or list-wise approximations) that correlate with the metric of interest. Choosing the surrogate is itself a research problem, and progress on a surrogate does not always translate into progress on the metric.
For large language models, the dominant training loss is token-level cross-entropy under a next-token prediction objective. Given a sequence of tokens x_1, x_2, ..., x_T, the model produces a probability distribution over the vocabulary at each position, and the loss for the sequence is:
L = -(1/T) Σ_{t=1}^{T} log p_θ(x_t | x_1, ..., x_{t-1})
This is the negative log-likelihood of the observed sequence under the model. Minimizing this loss is mathematically equivalent to maximum likelihood estimation. A loss of zero would mean the model assigns probability 1 to every token in the training corpus, which is impossible for real natural language.
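The sequence loss can be sketched directly from the formula. The input is assumed to be the list of probabilities the model assigned to each observed token given its prefix; the example values are hypothetical.

```python
import math

def sequence_loss(token_probs):
    """Mean negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities assigned to three observed tokens.
loss = sequence_loss([0.5, 0.25, 0.125])
```

The per-token terms are ln 2, 2 ln 2, and 3 ln 2, so the mean is 2 ln 2, roughly 1.386 nats.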
During training, transformer language models use teacher forcing: at each position, the input includes the ground-truth previous tokens rather than the model's own past predictions. This allows efficient parallel computation of the loss across all positions in a sequence at once. At inference time, however, the model decodes autoregressively from its own outputs, which can introduce exposure bias: the model has never seen its own (sometimes incorrect) outputs as input during training, so errors can compound during generation.[13]
For language models, the loss is often reported as perplexity rather than raw cross-entropy. Perplexity is the exponential of the cross-entropy loss:
PPL = exp(L)
Intuitively, perplexity is the effective number of equally likely choices the model is uncertain between at each token. A perplexity of 10 means the model is, on average, as confused as if it had to choose uniformly among 10 possibilities. For tokenized text, modern frontier models reach perplexities in the single digits on diverse benchmarks. Perplexity is convenient because it is easier to communicate than nat-valued cross-entropy, but it depends on vocabulary size and tokenizer choice, so it is only directly comparable between models that share a tokenizer.
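The "equally likely choices" reading can be checked numerically. The sketch assumes the loss is measured in nats, as in the formula above.

```python
import math

def perplexity(mean_loss_nats):
    """Exponential of the mean per-token cross-entropy."""
    return math.exp(mean_loss_nats)

# A model that spreads probability uniformly over 10 tokens assigns
# 1/10 to the observed token, so its per-token loss is ln(10).
uniform_loss = -math.log(1.0 / 10.0)
ppl = perplexity(uniform_loss)
```

The result is 10 up to floating-point rounding, matching the interpretation of perplexity as an effective branching factor.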
A major empirical finding from research on large models is that the achievable loss follows power laws with respect to model size, dataset size, and compute. Kaplan et al. (2020) showed that for transformer language models, test loss scales as approximately L(N) ∝ N^{-α} with model size N, L(D) ∝ D^{-β} with dataset size D, and similarly with compute C.[16] These trends span many orders of magnitude.
Hoffmann et al. (2022), the Chinchilla paper, refined these findings and argued that for compute-optimal training, model size and training tokens should be scaled in roughly equal proportion: each doubling of parameters should be paired with a doubling of training tokens.[17] The original analysis trained more than 400 models ranging from 70 million to 16 billion parameters and concluded that many earlier large models, including GPT-3 and Gopher, were undertrained relative to their parameter count. The Chinchilla model itself, with 70 billion parameters trained on roughly 1.4 trillion tokens, achieved lower loss and better downstream performance than larger models trained on smaller datasets. See scaling laws for more on this topic.
When comparing language models that use different tokenizers, raw cross-entropy or perplexity becomes hard to interpret. To make comparisons tokenizer-agnostic, researchers often report bits per byte (BPB) or bits per character (BPC): the cross-entropy loss converted to base 2, summed over the text, and divided by the number of bytes or characters in the original text. For reference, Shannon's 1951 experiments estimated the entropy of printed English at roughly 0.6 to 1.3 bits per character, so a model at 1.0 BPB on plain English text (about one byte per character) sits inside that range.
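The conversion is a change of units followed by a change of denominator. The sketch below assumes the loss is a mean per-token value in nats and that bytes are counted on the UTF-8 encoding of the original text; the token count and loss value are hypothetical.

```python
import math

def bits_per_byte(mean_loss_nats, num_tokens, text):
    """Convert mean per-token loss (nats) to bits per UTF-8 byte."""
    total_bits = mean_loss_nats * num_tokens / math.log(2.0)
    num_bytes = len(text.encode("utf-8"))
    return total_bits / num_bytes

# Hypothetical case: a 12-byte string split into 3 tokens, with an
# average loss of 4 bits (4 * ln 2 nats) per token.
bpb = bits_per_byte(4.0 * math.log(2.0), num_tokens=3, text="hello world!")
```

Twelve bits spread over twelve bytes gives 1.0 BPB. A tokenizer that split the same text into more tokens would report a lower per-token loss for the same model quality, which is why the per-byte figure is the fairer comparison.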
In practice, the quantity minimized during training is often not just the raw data loss. Regularization techniques add penalty terms to the loss to discourage overly complex models:
The total loss with regularization can be written as:
Total Loss = Data Loss + λ · Regularization Term
where λ is a hyperparameter that controls the strength of regularization. When multiple regularizers are combined, each typically gets its own coefficient, and tuning these coefficients can have as much impact on final performance as choosing the architecture.
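The formula above can be sketched with L2 and L1 penalties as the regularization term. The function names and the example numbers are assumptions for illustration.

```python
def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def total_loss(data_loss, weights, lam, penalty=l2_penalty):
    """Total Loss = Data Loss + lambda * Regularization Term."""
    return data_loss + lam * penalty(weights)

weights = [1.0, -2.0]
with_l2 = total_loss(0.5, weights, lam=0.1)                      # 0.5 + 0.1 * 5
with_l1 = total_loss(0.5, weights, lam=0.1, penalty=l1_penalty)  # 0.5 + 0.1 * 3
```

Because the penalty is part of the scalar being minimized, a logged "loss" that includes it will not reach zero even when the data loss does; logging the two terms separately makes curves easier to interpret.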
From a Bayesian standpoint, adding L2 regularization corresponds to placing a Gaussian prior on the weights, and adding L1 regularization corresponds to a Laplace prior. Maximum a posteriori (MAP) estimation under these priors yields exactly the same gradient updates as minimizing the regularized loss.
Many real systems optimize more than one objective at once. A self-driving perception network might predict depth, semantic segmentation, and lane lines simultaneously, with each task contributing its own loss term:
L_total = w_1 · L_depth + w_2 · L_segmentation + w_3 · L_lanes
The weights w_i control the relative importance of each task. Picking these by hand is brittle: tasks have different units, scales, and convergence rates, and good values vary across architectures and datasets.
Kendall, Gal, and Cipolla (2018) proposed learning the task weights automatically by treating each task's loss as a Gaussian or Laplacian likelihood with a learned homoscedastic uncertainty σ_i.[19] Each task's loss is weighted by 1 / (2σ_i²) (for regression), and a log σ_i term is added per task so that the model cannot shrink every weight toward zero simply by inflating σ_i. This single change made multi-task learning meaningfully easier to apply on real perception problems, because the network adapts its own task balance instead of relying on hand-tuned ratios.
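A minimal sketch of the regression form of this weighting follows. It parameterizes each task by s_i = log σ_i², a common implementation choice assumed here for numerical convenience, so that 1 / (2σ_i²) becomes 0.5 · exp(-s_i) and log σ_i becomes 0.5 · s_i. In a real system the s_i would be trainable parameters; here they are plain numbers.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of L_i / (2 * sigma_i**2) + log(sigma_i),
    with log_vars[i] = log(sigma_i**2)."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# With sigma_i = 1 for every task, the total is half the plain sum.
equal = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])

# For a single task with loss 4, the objective is lowest near s = ln(4).
at_zero = uncertainty_weighted_loss([4.0], [0.0])
at_optimum = uncertainty_weighted_loss([4.0], [math.log(4.0)])
at_large = uncertainty_weighted_loss([4.0], [3.0])
```

Setting the derivative with respect to s_i to zero gives σ_i² = L_i, so tasks with larger losses are automatically down-weighted, while the log term stops σ_i from growing without limit.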
Other weighting schemes include GradNorm, which equalizes the gradient norms produced by each task, and dynamic weight averaging, which balances the rate of loss decrease across tasks. These methods all reduce the burden of manual tuning and have become standard tools in multi-objective deep learning.
The ecosystem of loss formulations has expanded sharply with the rise of self-supervised learning, generative modeling, and alignment work.
Each of these formulations is the subject of an active research literature, and choosing the right loss is often the single most important decision when designing a new system.
When no standard loss aligns well with the task, practitioners design custom losses. Common motivations include:
When designing a custom loss, three checks are essential. First, the loss must be differentiable (or have well-defined subgradients). Second, it should be tested with extreme inputs (zeros, large values, NaN propagation) so it does not blow up under unusual data. Third, the loss should be checked against the actual evaluation metric on a small benchmark to verify that progress on the loss correlates with progress on the metric. A loss that improves while the metric stagnates is usually a sign of misalignment, sometimes called "goodharting," where the optimizer exploits a feature of the loss that does not reflect the underlying goal.
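The first check can be automated by comparing a hand-derived gradient against a finite-difference estimate. The asymmetric loss below is a hypothetical custom loss that penalizes under-prediction more heavily; its name, the `under_weight` parameter, and the step size `h` are assumptions.

```python
def asymmetric_squared_loss(y, y_hat, under_weight=3.0):
    """Squared error that costs more when the model under-predicts."""
    r = y - y_hat
    weight = under_weight if r > 0 else 1.0
    return weight * r * r

def analytic_grad(y, y_hat, under_weight=3.0):
    """Hand-derived d(loss)/d(y_hat)."""
    r = y - y_hat
    weight = under_weight if r > 0 else 1.0
    return -2.0 * weight * r

def numeric_grad(fn, y, y_hat, h=1e-6):
    """Central finite-difference estimate of d(loss)/d(y_hat)."""
    return (fn(y, y_hat + h) - fn(y, y_hat - h)) / (2.0 * h)

under = (analytic_grad(2.0, 1.0), numeric_grad(asymmetric_squared_loss, 2.0, 1.0))
over = (analytic_grad(1.0, 3.0), numeric_grad(asymmetric_squared_loss, 1.0, 3.0))
```

The two estimates should agree to several decimal places away from the kink at y = ŷ; a mismatch usually points to a sign error or a missing factor in the hand-written gradient.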
Most deep learning practitioners interact with loss functions through framework-provided APIs. A few of the most common follow:
| Framework | Class or function | Purpose |
|---|---|---|
| PyTorch | torch.nn.MSELoss | Mean squared error for regression |
| PyTorch | torch.nn.L1Loss | Mean absolute error for regression |
| PyTorch | torch.nn.HuberLoss | Huber loss for robust regression |
| PyTorch | torch.nn.CrossEntropyLoss | Combined log-softmax and negative log-likelihood for multi-class classification |
| PyTorch | torch.nn.BCELoss / BCEWithLogitsLoss | Binary cross-entropy, with the logits version preferred for numerical stability |
| PyTorch | torch.nn.NLLLoss | Negative log-likelihood; pair with log_softmax outputs |
| PyTorch | torch.nn.KLDivLoss | Kullback-Leibler divergence |
| PyTorch | torch.nn.CTCLoss | Connectionist temporal classification for sequence labeling |
| TensorFlow / Keras | tf.keras.losses.MeanSquaredError | Mean squared error |
| TensorFlow / Keras | tf.keras.losses.SparseCategoricalCrossentropy | Cross-entropy with integer-encoded class labels |
| TensorFlow / Keras | tf.keras.losses.BinaryCrossentropy | Binary cross-entropy |
| TensorFlow / Keras | tf.keras.losses.Huber | Huber loss |
| JAX / Optax | optax.softmax_cross_entropy | Functional cross-entropy on logits |
A recurring source of bugs is using torch.nn.BCELoss on outputs that have not yet passed through a sigmoid activation, or passing softmax probabilities or log-probabilities to torch.nn.CrossEntropyLoss, which expects raw logits and applies log-softmax internally. The "with logits" variants combine the activation and the loss in a numerically stable way and are usually the right default.
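The stability benefit of fusing the activation into the loss can be sketched in plain Python. The stable form below is the standard algebraic rewrite of binary cross-entropy on a logit z; it illustrates the idea and is not a copy of any framework's implementation.

```python
import math

def naive_bce(y, z):
    """Sigmoid followed by cross-entropy, computed separately."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_with_logits(y, z):
    """Fused form: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# The two agree for moderate logits.
moderate = (naive_bce(1, 0.3), bce_with_logits(1, 0.3))

# For a large logit with the wrong label, the sigmoid rounds to
# exactly 1.0 and the naive version tries to take log(0).
stable_value = bce_with_logits(0, 40.0)
```

The fused version returns approximately 40 for the confidently wrong prediction, whereas the naive version fails on the same input because 1 - p underflows to zero.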
Across years of practice, several rules of thumb have emerged:
log_softmax operation rather than computing them separately.
Imagine you are learning to throw a ball into a bucket. Every time you throw, someone tells you how far the ball landed from the bucket. That distance is the "loss." If the ball lands right in the bucket, your loss is zero. If it lands far away, your loss is big.
Your goal is to practice throwing until the loss gets as small as possible. Each time, you adjust how hard you throw and at what angle based on how far off you were last time. That adjustment is exactly what a computer does when it trains a model: it looks at the loss and tweaks its settings to do better next time.
Sometimes you practice with your friends watching (that is like "validation"). If you do great when practicing alone but poorly when your friends watch, it means you are only good at one specific setup and have not really learned. In machine learning, that is called overfitting.