# Loss

> Source: https://aiwiki.ai/wiki/loss
> Updated: 2026-06-21
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning](/wiki/machine_learning), [Loss function](/wiki/loss_function), [Gradient descent](/wiki/gradient_descent)*

## Introduction

In [machine learning](/wiki/machine_learning), **loss** (sometimes called **error**) is a single number, usually non-negative, that measures how far a model's prediction is from the correct answer. Training works by repeatedly adjusting the model's [parameters](/wiki/parameter) to make this number as small as possible. For most loss functions, a loss of zero means a perfect prediction, and larger values mean greater error. Loss is the fundamental signal that enables learning: without it, optimization algorithms like [gradient descent](/wiki/gradient_descent) would have no direction in which to update a model's [weights](/wiki/weight).

More precisely, loss is the scalar output of a [loss function](/wiki/loss_function) applied to one [training](/wiki/training) example, comparing the model's prediction to the true or expected output. During [training](/wiki/training), per-example losses are aggregated and the model's parameters are iteratively updated to reduce the aggregate, driving the model toward more accurate predictions.

The concept of loss sits at the intersection of statistics, optimization theory, and machine learning engineering. It is central to nearly every supervised and many unsupervised learning workflows, from simple [linear regression](/wiki/linear_regression) to large-scale [deep learning](/wiki/deep_learning) systems like [transformers](/wiki/transformer). Where the [loss function](/wiki/loss_function) article focuses on specific mathematical formulas and their derivations, this article focuses on loss as an operational quantity: how it is produced, monitored, interpreted, and acted upon during the lifecycle of a model.

## Definition and formal meaning

Formally, the loss for a single example is defined by a [loss function](/wiki/loss_function) *L(y, ŷ)*, where *y* is the ground-truth label and *ŷ* is the model's prediction. The output is a real number (a scalar) that represents the penalty for that prediction, and for most common loss functions it is non-negative. In that case, a loss of zero means the model's prediction exactly matches the true value; larger values indicate greater error.

For example, if a model predicts that a house costs $310,000 and the actual price is $300,000, the squared error loss for that example would be (310,000 - 300,000)^2 = 100,000,000. This single-example loss is the atomic unit of the learning signal in supervised learning.

When training proceeds, the per-example losses are aggregated. Most practitioners report a **batch loss** (the mean loss over a mini-batch), an **epoch loss** (the mean over all batches in one pass through the dataset), and sometimes a **running loss** (a moving average across recent steps). These aggregates are the numbers that appear on training dashboards, and they form the loss curves that practitioners read to diagnose training health.
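The aggregation described above can be sketched in a few lines of plain Python. The per-example loss values and the `momentum` constant are made up for illustration, and the running loss is implemented here as an exponential moving average, which is only one of several possible conventions.

```python
def squared_error(y_true, y_pred):
    """Per-example squared error loss."""
    return (y_true - y_pred) ** 2

# The house-price example: actual price 300,000, predicted 310,000.
house_loss = squared_error(300_000, 310_000)  # 100,000,000

# Hypothetical per-example losses, grouped into two mini-batches.
batches = [[4.0, 2.0, 6.0], [1.0, 3.0, 2.0]]

# Batch loss: mean of the per-example losses in each mini-batch.
batch_losses = [sum(b) / len(b) for b in batches]

# Epoch loss: mean over all batches in one pass through the data.
epoch_loss = sum(batch_losses) / len(batch_losses)

# Running loss: a moving average across recent steps (assumed EMA form).
momentum = 0.9
running_loss = batch_losses[0]
for step_loss in batch_losses[1:]:
    running_loss = momentum * running_loss + (1 - momentum) * step_loss

print(house_loss, batch_losses, epoch_loss, running_loss)
```

Note that the epoch loss computed as a mean of batch means equals the mean over all examples only when every batch has the same size.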

## The statistical view: empirical risk

The machine learning concept of loss has a deep root in statistical decision theory. In that framework, given a true data distribution *p(x, y)*, the **true risk** (also called the expected loss or population risk) of a model *f* with parameters θ is the expected value of the loss across the entire distribution:

*R(θ) = E\[L(y, f(x; θ))\]*

This quantity is what a model designer ultimately cares about, because it measures performance on the underlying task in the real world. Unfortunately, *p(x, y)* is unknown. Practitioners only have access to a finite training set drawn from that distribution.

To make progress, machine learning systems minimize the **empirical risk** instead, which is the average loss over the training set:

*R̂(θ) = (1/n) Σ L(y_i, f(x_i; θ))*

This substitution is the [empirical risk minimization](/wiki/empirical_risk_minimization) (ERM) principle, developed by Vladimir Vapnik and Alexey Chervonenkis from the 1970s onward and popularized by Vapnik's work on statistical learning theory in the 1990s. For any fixed θ, the law of large numbers ensures that the empirical risk converges to the true risk as the number of training examples *n* grows. For a sufficiently large training set drawn independently from the same distribution as the test data, minimizing empirical loss is therefore a reasonable proxy for minimizing the true risk.

The gap between empirical risk and true risk is the **generalization gap**, and much of modern machine learning theory is concerned with bounding this gap. Tools such as VC dimension, Rademacher complexity, and PAC-Bayes bounds all attempt to characterize when minimizing training loss will produce a model that performs well on unseen data.
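As a rough illustration of empirical risk approaching true risk, the sketch below uses a synthetic data distribution (an assumption, not anything from a real dataset): *y = 2x + noise* with Gaussian noise of standard deviation 0.5. For the fixed predictor *f(x) = 2x* under squared error, the true risk equals the noise variance, 0.25, so the empirical risk should settle near that value as *n* grows.

```python
import random

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(xs, ys, predict, loss):
    """Average loss of `predict` over a finite sample."""
    return sum(loss(y, predict(x)) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)   # fixed seed so the run is repeatable
NOISE_STD = 0.5          # hypothetical noise level; true risk is NOISE_STD ** 2

def sample(n):
    """Draw n (x, y) pairs from the synthetic distribution."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, NOISE_STD) for x in xs]
    return xs, ys

def predict(x):
    """A fixed model f(x; theta) with theta = 2."""
    return 2.0 * x

for n in (10, 1_000, 100_000):
    xs, ys = sample(n)
    print(n, round(empirical_risk(xs, ys, predict, squared_error), 4))
```

The predictor is held fixed here. When θ is chosen by minimizing the empirical risk on the same sample, the gap to the true risk is no longer governed by the law of large numbers alone, which is where the bounds mentioned above come in.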

## What is the difference between loss, cost, and objective function?

The terms "loss," "[cost](/wiki/cost)," and "objective function" are closely related and sometimes used interchangeably in practice. However, many textbooks draw the following distinctions:

| Term | Scope | Description |
|------|-------|-------------|
| **Loss function** | Single example | Measures prediction error for one data point. Example: the squared error for one sample. |
| **Cost function** | Entire dataset | Aggregates the loss over all training examples, typically as an average or sum. May also include [regularization](/wiki/regularization) terms. |
| **Objective function** | Optimization goal | The broadest term. Can refer to any function being optimized (minimized or maximized), including cost functions, reward functions, or likelihood functions. |

Some authors, including Sebastian Raschka, treat "loss" and "cost" as synonyms, noting that there is no universal consensus on the distinction.[1] In everyday conversation among practitioners, "loss" frequently refers to the aggregated value reported per batch or per epoch during training, even though that is technically a cost function.

A related word is **error**. In statistics, "error" usually refers to the residual *y - ŷ*, while "loss" refers to the function applied to that residual. In casual usage these are often blurred, and loss is sometimes called "training error" or "validation error." The Goodfellow, Bengio, and Courville textbook *Deep Learning* (2016) treats these terms as interchangeable: "The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function."[10] In other words, the distinctions above are conventions, not formal definitions, and the choice depends on whichever phrasing is most natural in a given context.

## Role of loss in training

The training loop in most machine learning systems follows a repeated cycle:

1. **Forward pass.** The model processes an input and produces a prediction.
2. **Loss computation.** The [loss function](/wiki/loss_function) compares the prediction to the ground truth and outputs a scalar loss value.
3. **Backward pass ([backpropagation](/wiki/backpropagation)).** The gradient of the loss with respect to each model parameter is computed.
4. **Parameter update.** An optimizer (such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd), [Adam](/wiki/adam_optimizer), or RMSProp) adjusts the parameters in the direction that reduces the loss.

This cycle repeats across many [batches](/wiki/batch) and [epochs](/wiki/epoch) until the loss converges to an acceptable level or training is stopped by other criteria.
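The four steps can be written out by hand for a one-feature linear model. This is a minimal sketch under stated assumptions: toy data generated from *y = 2x + 1*, mean squared error as the loss, analytically derived gradients in place of backpropagation, and plain full-batch gradient descent as the optimizer.

```python
# Toy data generated by y = 2x + 1 (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0   # model parameters
lr = 0.05         # learning rate (assumed value)
history = []      # loss recorded at every step

for step in range(1000):
    # 1. Forward pass: compute predictions.
    preds = [w * x + b for x in xs]

    # 2. Loss computation: mean squared error over the batch.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    history.append(loss)

    # 3. Backward pass: gradients of the loss with respect to w and b.
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)

    # 4. Parameter update: step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 4), round(b, 4), history[0], history[-1])
```

In a deep learning framework, step 3 is handled by automatic differentiation and step 4 by an optimizer object, but the structure of the loop is the same.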

The loss value is the only signal that connects the data to the model's parameters. A model with poorly chosen loss can train without numerical errors and still fail to learn the intended task, because the gradient produced by the loss does not point in a useful direction. This is why loss design is treated as a first-class concern in modern machine learning research.

### Loss and automatic differentiation

In deep learning frameworks like [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax), the loss is the entry point for computing gradients via reverse-mode **automatic differentiation** (autodiff). When the framework executes the forward pass, it builds a computational graph that records every operation applied to the model's tensors. Calling `.backward()` on the loss tensor traverses this graph in reverse, applying the chain rule at each node to compute the partial derivative of the loss with respect to every parameter that affected it.[2]

Because the gradient is computed with respect to the loss, the loss must be a single scalar. If the model produces a vector or tensor output, that output has to be reduced to one number (typically by summing or averaging the per-example losses) before the backward pass. Many subtle bugs in training code can be traced to losses that were not properly reduced, producing gradients that depend on batch size in unintended ways.
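The batch-size dependence mentioned above can be seen without any framework. The sketch below computes the gradient of a squared-error loss for the hypothetical model *ŷ = w · x* under two reductions. Repeating the same data four times leaves the mean-reduced gradient unchanged but multiplies the sum-reduced gradient by four.

```python
def grad_w(xs, ys, w, reduction):
    """Gradient of the squared-error loss of y_hat = w * x with respect to w."""
    per_example = [2 * (w * x - y) * x for x, y in zip(xs, ys)]
    total = sum(per_example)
    return total if reduction == "sum" else total / len(per_example)

xs, ys = [1.0, 2.0], [3.0, 6.0]
small = (xs, ys)            # batch of 2
large = (xs * 4, ys * 4)    # the same data repeated: batch of 8

mean_small = grad_w(*small, w=0.0, reduction="mean")
mean_large = grad_w(*large, w=0.0, reduction="mean")
sum_small = grad_w(*small, w=0.0, reduction="sum")
sum_large = grad_w(*large, w=0.0, reduction="sum")

print(mean_small, mean_large)  # identical
print(sum_small, sum_large)    # differ by the batch-size ratio
```

With a sum reduction, changing the batch size silently changes the effective learning rate, which is one way the bugs described above arise.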

Autodiff also imposes a practical constraint on loss design: the function must be composed of operations the framework can differentiate. Most numerical operations qualify, but operations like sorting, top-k selection, or decoding through a beam search require special handling, smooth relaxations, or surrogate losses that approximate the true objective.

## Common loss functions and their typical values

Different tasks call for different loss functions, each with its own range of output values. The summary below lists the most common families; for full mathematical treatment of each function, see the dedicated [loss function](/wiki/loss_function) article and the linked subpages.

### Regression losses

| Loss Function | Formula (single example) | Range | Typical Use Case |
|---|---|---|---|
| [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse) | (y - ŷ)² | [0, +∞) | General regression; penalizes large errors heavily |
| [Mean Absolute Error (MAE)](/wiki/mean_absolute_error_mae) | \|y - ŷ\| | [0, +∞) | Regression with outlier robustness |
| Huber Loss | Quadratic near zero, linear far from zero | [0, +∞) | Combines benefits of MSE and MAE |
| Log-cosh loss | log(cosh(y - ŷ)) | [0, +∞) | Smooth alternative to Huber, twice differentiable |
| Quantile (pinball) loss | max(q(y-ŷ), (q-1)(y-ŷ)) | [0, +∞) | Probabilistic forecasting, prediction intervals |
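The single-example formulas in the table translate directly into code. The sketch below is illustrative rather than a reference implementation: the Huber threshold `delta` and the quantile `q` are assumed defaults, and the Huber form with a leading factor of one half is one common convention.

```python
import math

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    """Quadratic for residuals up to delta, linear beyond it."""
    r = abs(y - y_hat)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def log_cosh(y, y_hat):
    # math.cosh overflows for very large residuals; fine for a sketch.
    return math.log(math.cosh(y - y_hat))

def pinball(y, y_hat, q=0.9):
    """Quantile loss: under-prediction costs q, over-prediction costs 1 - q."""
    diff = y - y_hat
    return max(q * diff, (q - 1) * diff)

# Compare how each loss treats a small and a large residual.
for residual in (0.5, 3.0):
    print(residual,
          squared_error(residual, 0.0),
          absolute_error(residual, 0.0),
          huber(residual, 0.0),
          round(log_cosh(residual, 0.0), 4),
          round(pinball(residual, 0.0), 4))
```

The large residual is penalized quadratically by squared error but only linearly by MAE, Huber, and (asymptotically) log-cosh, which is the outlier-robustness trade-off listed in the table.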

### Classification losses

| Loss Function | Formula (single example) | Range | Typical Use Case |
|---|---|---|---|
| [Cross-entropy](/wiki/cross-entropy) (log loss) | -y log(ŷ) - (1-y) log(1-ŷ) (binary form) | [0, +∞) | Binary and multi-class classification |
| [Hinge loss](/wiki/hinge_loss) | max(0, 1 - y·ŷ) | [0, +∞) | [Support vector machines](/wiki/support_vector_machine_svm), binary classification |
| Focal loss | -α(1-ŷ)^γ log(ŷ) | [0, +∞) | Imbalanced classification, object detection |
| Kullback-Leibler ([KL divergence](/wiki/kl_divergence)) | Σ p(x) log(p(x)/q(x)) | [0, +∞) | Comparing probability distributions; [variational autoencoders](/wiki/vae), distillation |
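A minimal sketch of the classification losses in the table follows. Several details are assumptions made for illustration: the clamping constant `eps`, labels in {-1, +1} for hinge loss, `p_true` as the predicted probability of the correct class for focal loss, and the focal-loss values `alpha=0.25`, `gamma=2.0`, which are commonly used defaults rather than anything fixed by the formula.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """y is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge(y, score):
    """y is -1 or +1; score is the raw (unbounded) model output."""
    return max(0.0, 1.0 - y * score)

def focal(p_true, alpha=0.25, gamma=2.0):
    """Down-weights examples the model already classifies confidently."""
    return -alpha * (1 - p_true) ** gamma * math.log(p_true)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(binary_cross_entropy(1, 0.9))            # small: confident and correct
print(binary_cross_entropy(1, 0.1))            # large: confident and wrong
print(hinge(+1, 2.0), hinge(-1, 0.5))          # zero inside the margin rule
print(focal(0.9))                              # far smaller than the BCE above
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # positive unless p == q
```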

### Sequence and structured losses

| Loss Function | Domain | Typical Use Case |
|---|---|---|
| Sequence cross-entropy | Token-level | [Language model](/wiki/large_language_model) next-token prediction |
| CTC loss | Audio, handwriting | Sequence labeling with unknown alignment |
| Triplet / contrastive loss | Embeddings | Face recognition, retrieval, [SimCLR](/wiki/simclr), [CLIP](/wiki/clip) |
| Dice / IoU loss | Segmentation | Medical imaging, foreground segmentation |

There is no universal "good" loss value that applies across tasks. A cross-entropy loss of 0.3 might be excellent for one dataset and mediocre for another. What matters is the trajectory of the loss over time and how it compares to a baseline or published benchmark for the same task.

### Loss values to expect in practice

Rough rules of thumb help readers calibrate what they see in practice:

- For binary cross-entropy, a random baseline gives loss ≈ 0.693 (which is ln(2)). Strong models on clean data drop well below 0.1, while values stuck near 0.69 indicate the model has learned almost nothing.
- For categorical cross-entropy with *C* classes, the random baseline is ln(*C*). A well-trained ten-class image classifier might converge to a loss of 0.05 to 0.2 on the training set.
- For MSE on standardized targets (zero mean, unit variance), a model that always predicts the mean achieves loss ≈ 1.0. Useful regressors push this well below 0.5.
- For language modeling, modern transformer language models on diverse internet text typically reach token-level cross-entropy in the range of 1.5 to 2.5 nats, corresponding to perplexities between roughly 4.5 and 12.

These ranges are heuristics, not guarantees. The right way to interpret a loss number is always to compare it to a baseline and to the loss of competing models on the same data.
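The baseline numbers quoted above follow from a few lines of arithmetic. The four-element target list is a made-up example that happens to have zero mean and unit variance.

```python
import math

# Binary cross-entropy when always predicting probability 0.5.
binary_baseline = -math.log(0.5)        # ln(2)

# Categorical cross-entropy for a uniform guess over C = 10 classes.
ten_class_baseline = -math.log(1 / 10)  # ln(10)

# MSE of a mean predictor on standardized targets.
targets = [-1.0, 1.0, -1.0, 1.0]        # mean 0, variance 1
mean_prediction = sum(targets) / len(targets)
mse_baseline = sum((t - mean_prediction) ** 2 for t in targets) / len(targets)

# Perplexities corresponding to 1.5 and 2.5 nats of cross-entropy.
ppl_low, ppl_high = math.exp(1.5), math.exp(2.5)

print(round(binary_baseline, 3), round(ten_class_baseline, 3),
      mse_baseline, round(ppl_low, 2), round(ppl_high, 2))
```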

## What is the difference between training loss, validation loss, and test loss?

A single number called "loss" usually decomposes into three closely watched quantities: training loss, validation loss, and test loss.

| Loss type | Computed on | Used for | Notes |
|---|---|---|---|
| Training loss | Training set | Backpropagation, parameter updates | Falls steadily during healthy training; used by the optimizer at every step |
| Validation loss | Held-out validation set | Hyperparameter tuning, [early stopping](/wiki/early_stopping), model selection | The model is not updated against this loss directly; checking it once per epoch is common |
| Test loss | Held-out test set | Final reporting | Should be evaluated only after model selection is complete to give an unbiased estimate of generalization |

Mixing these roles is one of the most common methodological mistakes. Tuning hyperparameters against the test set, for instance, leaks information from the test set into the training process and inflates the reported performance. Best practice is to lock the test set away until the model is finalized and to use a separate validation split for all decisions made during development.

In smaller datasets, **k-fold cross-validation** averages the loss over multiple train/validation splits, producing a more stable estimate of generalization at the cost of additional compute. In very large datasets typical of deep learning, a single fixed validation split is usually adequate.

## How do you read a loss curve?

A **loss curve** (also called a **learning curve**) is a plot of loss values on the y-axis against training steps or epochs on the x-axis. Practitioners routinely plot two curves together: **training loss** and **validation loss**. Analyzing these curves is one of the most important diagnostic tools in machine learning.[3]

### Interpreting loss curve patterns

| Pattern | Training Loss | Validation Loss | Diagnosis |
|---|---|---|---|
| **Good fit** | Decreases and converges | Decreases and converges near training loss | Model is learning well and generalizing |
| **[Overfitting](/wiki/overfitting)** | Continues to decrease | Decreases, then increases or plateaus | Model memorizes training data instead of learning general patterns |
| **[Underfitting](/wiki/underfitting)** | Remains high | Remains high (close to training loss) | Model lacks capacity or has not trained long enough |
| **Oscillating loss** | Fluctuates erratically | Fluctuates erratically | [Learning rate](/wiki/learning_rate) too high, poor data quality, or insufficient shuffling |
| **Diverging loss** | Increases or explodes to NaN/Inf | Increases or explodes | Severe numerical instability (see section below) |
| **Plateau** | Stagnates after early drop | Stagnates after early drop | Saddle point or vanishing gradient; try restart, warmup, or optimizer change |
| **Loss spike** | Sudden upward jump, often recovers | May or may not move | Bad batch, gradient norm explosion, [Adam](/wiki/adam_optimizer) state corruption |

An ideal loss curve shows an exponential-like decrease that gradually flattens, indicating the model has extracted most of the learnable patterns from the data. The gap between training and validation curves is critical: a small gap indicates good generalization, while a growing gap signals [overfitting](/wiki/overfitting).[3]

### Smoothing noisy curves

Raw per-step losses are noisy because each batch contains a different sample of data. Visualization tools such as [TensorBoard](/wiki/tensorboard) and Weights & Biases offer **smoothing controls** that apply an exponential moving average (EMA) to the loss series, making trends easier to read. The EMA update is *y_i = α · y_{i-1} + (1 - α) · x_i*, with smoothing factor α (often 0.9 to 0.99 in practice). Heavy smoothing can mask short-lived spikes, so it is good practice to inspect both the smoothed and the raw curves.[4]
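The EMA update above can be sketched as follows. Initializing the average with the first raw value is an assumption made here for simplicity; visualization tools may initialize or debias differently.

```python
def ema_smooth(values, alpha=0.9):
    """Apply y_i = alpha * y_{i-1} + (1 - alpha) * x_i to a loss series."""
    smoothed = []
    prev = values[0]  # assumed initialization: start from the first raw value
    for x in values:
        prev = alpha * prev + (1 - alpha) * x
        smoothed.append(prev)
    return smoothed

raw = [1.0, 1.0, 10.0, 1.0]   # hypothetical losses with one spike
smooth = ema_smooth(raw, alpha=0.9)
print([round(v, 3) for v in smooth])
```

In this made-up series the raw spike of 10.0 appears as only about 1.9 in the smoothed curve, which is why inspecting the raw values alongside the smoothed ones matters.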

Other useful diagnostic plots include the **loss histogram** (the distribution of per-example losses inside a batch, which can reveal outliers dominating training) and the **gradient norm curve** (a companion plot that exposes exploding or vanishing gradient regimes that are not always visible in the loss itself).

## Monitoring loss during training

Effective loss monitoring is essential for producing well-trained models. Several strategies help practitioners get the most from their loss curves:

- **Log both training and validation loss at every epoch.** Tools such as TensorBoard, Weights & Biases, and MLflow provide real-time dashboards that track loss and other metrics.
- **Use [early stopping](/wiki/early_stopping).** If the validation loss does not improve for a specified number of consecutive epochs (called the "patience" period, commonly set between 5 and 10 epochs), halt training automatically. This saves computational resources and helps prevent [overfitting](/wiki/overfitting).[5]
- **Save model checkpoints.** Periodically save the model's parameters so that you can restore the best-performing version, typically the one with the lowest validation loss.
- **Compare against baselines.** Before evaluating whether a loss value is "good," establish a baseline, such as a random predictor, a simple heuristic, or a previously published result on the same dataset.
- **Track auxiliary metrics alongside loss.** Accuracy, F1, BLEU, IoU, or domain-specific metrics often tell a different story than loss. A model whose loss falls but whose accuracy stalls may be becoming more confident on already-correct predictions while producing the same final answers.
- **Watch for distribution drift in validation data.** If the validation loss rises while the training loss is stable, it may not be overfitting; it may indicate that the validation set has been corrupted, reshuffled, or now contains examples from a different distribution.
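Early stopping with a patience counter, combined with tracking the best checkpoint, can be sketched as below. The function name, the return convention, and the validation-loss series are all hypothetical; a real training loop would save model parameters at the point marked in the comment.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # patience exhausted: stop here

    return len(val_losses) - 1, best_epoch  # ran to completion

val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Training halts three epochs after the last improvement, and the parameters restored afterwards are those from `best_epoch`, not from the final epoch.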

### Loss spikes in large-model training

In very large models, including modern [large language models](/wiki/large_language_model), the training loss occasionally exhibits sudden upward spikes that can be hundreds of times larger than the typical loss. Research on stabilizing pre-training has linked these spikes to abrupt growth in gradient norms, often triggered by a single "bad" batch with numerical outliers, or by parameter norms drifting non-uniformly across layers.[6] Common mitigations include:

- **Gradient clipping** by global norm, which rescales the gradient whenever its overall norm exceeds a threshold and so limits the size of the resulting parameter update.
- **Learning rate warmup** at the start of training, preventing large early updates that destabilize Adam-style optimizers.
- **Skipping bad batches** when the loss exceeds a threshold, sometimes paired with a momentum reset (the SPAM optimizer).
- **Resuming from a recent checkpoint** when a spike fails to recover, and replaying the data in a different order to avoid the offending batch.

Most loss spikes recover on their own within a few iterations. Persistent spikes that fail to recover are often the first sign that hyperparameters need adjustment or that the data pipeline has produced corrupted samples.
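Of the mitigations listed above, clipping by global norm is the simplest to show in isolation. The sketch below represents each parameter tensor's gradient as a flat Python list, which is an assumption for illustration; frameworks ship their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm.

    grads: list of lists, one inner list per parameter tensor.
    Returns (clipped_grads, original_global_norm).
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(t) for t in grads], global_norm  # nothing to do
    scale = max_norm / global_norm
    return [[g * scale for g in t] for t in grads], global_norm

grads = [[3.0], [4.0]]   # hypothetical gradients; global norm is 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, clipped)
```

Because every component is multiplied by the same factor, the direction of the gradient is preserved and only its length changes.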

## Why does loss become NaN or Inf?

One of the most frustrating problems during [training](/wiki/training) is **loss divergence**, where the loss increases without bound, eventually reaching Inf (infinity) or NaN (not a number). Common causes include:[7]

| Cause | Mechanism | Typical Fix |
|---|---|---|
| Exploding gradients | Gradients grow exponentially through deep layers during [backpropagation](/wiki/backpropagation) | Gradient clipping; reduce learning rate; use batch normalization |
| Learning rate too high | Parameter updates overshoot the minimum | Lower the learning rate; use a learning rate scheduler |
| Numerical instability | Operations like log(0), division by zero, or overflow in activations | Add small epsilon values (e.g., log(ŷ + 1e-7)); use mixed-precision training carefully |
| Corrupt or unnormalized data | NaN or extreme values in the input features or labels | Scan data for NaN/Inf; normalize or standardize features |
| Inappropriate loss function | Loss function does not match the task or output range | Verify that loss function assumptions (e.g., probability outputs for cross-entropy) are met |
| Mixed-precision overflow | FP16 activations exceed the representable range | Use loss scaling, switch to bfloat16, or keep certain layers in FP32 |

When NaN appears in the loss, training should be stopped immediately. Continuing after NaN values propagate through the network will corrupt all parameters. Debugging typically involves checking the last few batches before the NaN appeared, inspecting gradient magnitudes, and validating the input data pipeline.

A helpful diagnostic technique is to run the forward and backward passes inside an anomaly-detection context, such as PyTorch's `torch.autograd.detect_anomaly()`, which raises an error when a backward computation produces NaN and reports the forward operation responsible. This is too slow to leave on during full training but is invaluable when investigating a divergence.
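Two of the cheaper safeguards discussed in this section, an epsilon inside the logarithm and an immediate stop on a non-finite loss, can be sketched in framework-free Python. The helper names `safe_log` and `check_finite` are hypothetical, and the epsilon value mirrors the one in the table above.

```python
import math

def safe_log(p, eps=1e-7):
    """log(p + eps), so that p == 0 yields a large finite value instead of failing."""
    return math.log(p + eps)

def check_finite(loss, step):
    """Stop at the first NaN or Inf loss rather than letting it propagate."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss!r} at step {step}")
    return loss

print(safe_log(0.0))        # about -16.1 rather than an error
print(check_finite(0.42, step=0))

try:
    check_finite(float("nan"), step=3)
except FloatingPointError as err:
    print("training halted:", err)
```

The step number in the error message points to the batch to inspect first when debugging.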

## Loss landscape visualization

The **loss landscape** is the surface formed by plotting the loss as a function of the model's parameters. For a model with two parameters, this produces a 3D surface; for real neural networks with millions of parameters, the landscape exists in extremely high-dimensional space and can only be visualized through projections.

In their influential 2018 paper "Visualizing the Loss Landscape of Neural Nets," Li et al. introduced filter normalization methods to produce meaningful 2D cross-sections of high-dimensional loss surfaces.[8] Their key findings include:

- **Shallow networks** tend to have smooth, convex-like loss landscapes with wide minima that are easy for optimizers to find.
- **Deeper networks without skip connections** have chaotic, highly non-convex landscapes with many sharp minima and saddle points, making training difficult.
- **Skip connections** (as in [ResNet](/wiki/resnet)) dramatically smooth the loss landscape, explaining why residual networks are easier to train than plain deep networks.
- **Wider networks** also produce smoother landscapes, correlating with better trainability and generalization.

Loss landscape analysis has become an important research area for understanding why certain architectures and hyperparameter choices lead to better training outcomes.[9]

### Saddle points and high-dimensional optimization

Research by Dauphin et al. (2014) showed that in very high-dimensional spaces, almost all critical points of the loss are saddle points, not local minima.[15] At a saddle point, the gradient is zero but the loss can decrease in some directions and increase in others. Vanilla [gradient descent](/wiki/gradient_descent) tends to stall at these points because it has nothing to push it off. Momentum, adaptive optimizers like [Adam](/wiki/adam_optimizer), and stochastic noise from mini-batching all help models escape saddles. This is why pure batch gradient descent is rarely used in modern deep learning: the noise in stochastic updates is itself a useful feature.
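The stalling behavior can be reproduced on the textbook saddle *f(x, y) = x² - y²*, used here as a toy stand-in for a loss surface (it is unbounded below, so it only illustrates the escape, not convergence). Started exactly on the x-axis, gradient descent slides into the saddle at the origin and stays there; a tiny perturbation in *y*, standing in for mini-batch noise, is enough to escape.

```python
def gradient_descent(x, y, lr=0.1, steps=100):
    """Plain gradient descent on f(x, y) = x**2 - y**2 (saddle at the origin)."""
    for _ in range(steps):
        grad_x, grad_y = 2 * x, -2 * y   # partial derivatives of f
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y

stalled = gradient_descent(1.0, 0.0)    # no component along the escape direction
escaped = gradient_descent(1.0, 1e-6)   # tiny perturbation in y

print(stalled)   # x shrinks toward 0, y stays exactly 0
print(escaped)   # y has grown far away from the saddle
```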

### Flat versus sharp minima

Not all minima are equal. Solutions in **flat** regions of the loss landscape, where the loss changes slowly with small parameter perturbations, tend to generalize better than solutions in **sharp** minima where small weight changes produce large loss increases.[8] Several training techniques are designed to bias the optimizer toward flat minima:

- **Stochastic weight averaging (SWA)** averages model weights over the last portion of training to settle into a wider basin.
- **Sharpness-aware minimization (SAM)**, introduced by Foret et al. in 2020, explicitly minimizes both the loss value and the loss sharpness in a neighborhood of the current parameters, by performing a small ascent step to find the worst-case nearby loss before the gradient descent step.[14]
- **Larger batch sizes paired with appropriate learning-rate scaling**, or smaller batch sizes that inject useful gradient noise, both influence the geometry of the minima the optimizer ends up in.
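The two-stage update used by sharpness-aware minimization can be sketched on a toy quadratic. Everything concrete here is an assumption for illustration: the loss *L(w) = 0.5 · (10 w₀² + w₁²)*, the learning rate `lr`, and the neighborhood radius `rho`. The ascent step uses the normalized gradient as a first-order approximation of the worst-case nearby point.

```python
import math

def loss(w):
    """Toy loss: steep along w[0], shallow along w[1]."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def loss_grad(w):
    return [10.0 * w[0], w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM-style update: ascend to a nearby high-loss point, then descend."""
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a critical point

    # Ascent step: move distance rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]

    # Descent step: apply the gradient taken at the perturbed point
    # to the original parameters.
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w0 = [1.0, 1.0]
w1 = sam_step(w0)
print(w1, loss(w0), loss(w1))
```

Each update costs two gradient evaluations instead of one, which is the main practical overhead of this family of methods.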

## How is loss different from accuracy and other metrics?

A common source of confusion is the distinction between **loss** and **evaluation metrics** (such as [accuracy](/wiki/accuracy), [precision](/wiki/precision), [recall](/wiki/recall), or [F1 score](/wiki/f1_score)).

| Property | Loss | Metric |
|---|---|---|
| **Differentiable** | Yes (required for gradient-based optimization) | Often not (e.g., accuracy is a step function) |
| **Used during training** | Yes (drives parameter updates) | Sometimes logged, but not used for optimization |
| **Interpretability** | Scale depends on the loss function; not always intuitive | Often more interpretable (e.g., "93% accuracy") |
| **Relationship to task goal** | Proxy for the real objective | Often more directly aligned with the real objective |

Loss must be differentiable, at least almost everywhere (hinge loss and MAE each have a kink), so that [gradient descent](/wiki/gradient_descent) can compute the direction and magnitude of parameter updates. Metrics like accuracy produce discrete (correct/incorrect) outputs, so their gradient is zero almost everywhere and they are unsuitable as optimization targets. Cross-entropy loss, for instance, serves as a smooth, differentiable proxy for accuracy in classification tasks.[11]

It is also worth noting that loss and accuracy do not always move in opposite directions. A model can achieve lower loss while accuracy stays flat, or vice versa. This happens because loss captures confidence in predictions (penalizing hesitant correct predictions mildly and confident wrong ones heavily), while accuracy only counts whether the final prediction was correct or not.[12]
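A small made-up example shows loss and accuracy decoupling. Two hypothetical models classify the same four examples correctly, so their accuracy is identical, yet their cross-entropy differs by an order of magnitude because one is far more confident.

```python
import math

def mean_bce(labels, probs):
    """Mean binary cross-entropy; probs are predicted probabilities of class 1."""
    return sum(-(y * math.log(p) + (1 - y) * math.log(1 - p))
               for y, p in zip(labels, probs)) / len(labels)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y)
               for y, p in zip(labels, probs)) / len(labels)

labels = [1, 1, 0, 0]
hesitant = [0.6, 0.6, 0.4, 0.4]        # correct but unsure
confident = [0.95, 0.95, 0.05, 0.05]   # correct and sure

for name, probs in (("hesitant", hesitant), ("confident", confident)):
    print(name, accuracy(labels, probs), round(mean_bce(labels, probs), 4))
```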

In ranking and search applications, the gap is even larger. Metrics like NDCG and mean average precision involve sorting and rank cutoffs that cannot be differentiated directly, so practitioners minimize **surrogate losses** (such as pairwise ranking loss or list-wise approximations) that correlate with the metric of interest. Choosing the surrogate is itself a research problem, and progress on a surrogate does not always translate into progress on the metric.

## Loss in language models

For [large language models](/wiki/large_language_model), the dominant training loss is **token-level cross-entropy** under a next-token prediction objective. Given a sequence of tokens *x_1, x_2, ..., x_T*, the model produces a probability distribution over the vocabulary at each position, and the loss for the sequence is:

*L = -(1/T) Σ_{t=1}^{T} log p_θ(x_t | x_1, ..., x_{t-1})*

This is the average negative log-likelihood per token of the observed sequence under the model. Minimizing this loss is mathematically equivalent to maximum likelihood estimation. A loss of zero would mean the model assigns probability 1 to every token in the training corpus, which is impossible for real natural language.
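Given the probability the model assigned to each observed token, the sequence loss is a one-line computation. The probabilities below are hypothetical; in a real model they come from a softmax over the vocabulary at each position.

```python
import math

def sequence_loss(token_probs):
    """Mean negative log-likelihood in nats, given p(x_t | x_1..x_{t-1}) per token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probability the model gave to each token that actually occurred.
token_probs = [0.5, 0.25, 0.125, 0.5]
loss = sequence_loss(token_probs)
print(round(loss, 4))

# A model that is certain of every token reaches the (unattainable) floor.
print(sequence_loss([1.0, 1.0, 1.0]) == 0.0)
```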

During training, transformer language models use **teacher forcing**: at each position, the input includes the ground-truth previous tokens rather than the model's own past predictions. This allows efficient parallel computation of the loss across all positions in a sequence at once. At inference time, however, the model decodes autoregressively from its own outputs, which can introduce **exposure bias**: the model has never seen its own (sometimes incorrect) outputs as input during training, so errors can compound during generation.[13]

### Perplexity

For language models, the loss is often reported as **perplexity** rather than raw cross-entropy. Perplexity is the exponential of the cross-entropy loss:

*PPL = exp(L)*

Intuitively, perplexity is the effective number of equally likely choices the model is uncertain between at each token. A perplexity of 10 means the model is, on average, as confused as if it had to choose uniformly among 10 possibilities. For tokenized text, modern frontier models reach perplexities in the single digits on diverse benchmarks. Perplexity is convenient because it is easier to communicate than nat-valued cross-entropy, but it depends on vocabulary size and tokenizer choice, so values are directly comparable only between systems that share a tokenizer.
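The "equally likely choices" reading can be checked directly: a model that spreads probability uniformly over 10 options has cross-entropy ln(10), and exponentiating recovers 10.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(cross_entropy_nats)

uniform_over_10 = -math.log(1 / 10)   # loss of a uniform guess among 10 options
print(perplexity(uniform_over_10))    # approximately 10
print(perplexity(0.0))                # a perfect model has perplexity 1
```

If the cross-entropy is measured in bits rather than nats, the base of the exponential changes to 2 accordingly.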

### Scaling laws

A major empirical finding from research on large models is that the achievable loss follows **power laws** with respect to model size, dataset size, and compute. Kaplan et al. (2020) showed that for transformer language models, test loss scales as approximately *L(N) ∝ N^{-α}* with model size *N*, *L(D) ∝ D^{-β}* with dataset size *D*, and similarly with compute *C*.[16] These trends span many orders of magnitude.
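A pure power law is a straight line in log-log space, so its exponent can be recovered with ordinary least squares on the logarithms. The sketch below fits *L(N) = a · N^{-α}* to synthetic points generated from made-up constants; the numbers are not taken from any published fit.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares fit of log L = log a - alpha * log N. Returns (a, alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data from hypothetical constants a = 10, alpha = 0.08.
true_a, true_alpha = 10.0, 0.08
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [true_a * n ** -true_alpha for n in model_sizes]

a, alpha = fit_power_law(model_sizes, losses)
print(round(a, 4), round(alpha, 4))
```

Real measurements are noisy and may flatten toward an irreducible loss, in which case a straight-line fit in log-log space is only an approximation over a limited range.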

Hoffmann et al. (2022), the Chinchilla paper, refined these findings and argued that for compute-optimal training, model size and training tokens should be scaled in roughly equal proportion: each doubling of parameters should be paired with a doubling of training tokens.[17] The original analysis trained more than 400 models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, and concluded that many earlier large models, including GPT-3 (175 billion parameters) and Gopher (280 billion), were undertrained relative to their parameter count.[17] The Chinchilla model itself, with 70 billion parameters trained on roughly 1.4 trillion tokens (about 20 tokens per parameter), achieved lower loss and better downstream performance than the much larger Gopher trained on the same compute budget. See [scaling laws](/wiki/scaling_laws) for more on this topic.

### Bits per byte and bits per character

When comparing language models that use different tokenizers, raw cross-entropy or perplexity becomes hard to compare, because the per-token average depends on how the text was split into tokens. To make comparisons tokenizer-agnostic, researchers often report **bits per byte (BPB)** or **bits per character (BPC)**: the total cross-entropy loss converted to base-2 and divided by the number of bytes or characters in the original text. These units connect directly to information theory: in his 1951 paper "Prediction and Entropy of Printed English," Claude Shannon estimated the entropy of written English at roughly 0.6 to 1.3 bits per character, which gives an approximate floor for what a character-level language model could reach.[22]
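Converting a per-token loss in nats to bits per byte takes three steps: recover the total loss in nats, convert to bits by dividing by ln(2), and divide by the byte length of the original text. The token count and loss value below are hypothetical, and UTF-8 is assumed as the byte encoding.

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, text):
    """Convert a mean per-token cross-entropy (nats) to bits per UTF-8 byte."""
    total_nats = mean_loss_nats * n_tokens   # undo the per-token average
    total_bits = total_nats / math.log(2)    # nats -> bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes

text = "hello world!"            # 12 bytes in UTF-8
n_tokens = 3                     # hypothetical tokenization into 3 tokens
mean_loss = 4 * math.log(2)      # 4 bits per token, expressed in nats

print(bits_per_byte(mean_loss, n_tokens, text))   # 12 bits over 12 bytes
```

A tokenizer that split the same text into more tokens would report a lower per-token loss for the same total, which is exactly the distortion this normalization removes.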

## Regularization and loss

In practice, the quantity minimized during training is often not just the raw data loss. [Regularization](/wiki/regularization) techniques add penalty terms to the loss to discourage overly complex models:

- **[L1 regularization](/wiki/l1_regularization)** adds the sum of absolute parameter values, encouraging sparsity.
- **[L2 regularization](/wiki/l2_regularization)** (closely related to weight decay, and equivalent to it under plain SGD) adds the sum of squared parameter values, discouraging large weights.
- **[Dropout](/wiki/dropout_regularization)** randomly zeroes out neurons during training, which implicitly regularizes by preventing co-adaptation.
- **Label smoothing** replaces hard one-hot training labels with a softer distribution (typically *(1 - ε)* on the true class and *ε/(K-1)* spread over the others), preventing the model from becoming overconfident and improving calibration.[18]
- **Mixup and CutMix** create training examples by interpolating pairs of inputs and their labels, effectively regularizing the loss by training on a smoothed version of the data distribution.
- **Spectral norm penalties** and **gradient penalties** add terms based on the model's Lipschitz constant, used in stable GAN training and in some certified-robustness work.
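To make one item from the list concrete, the sketch below builds label-smoothed targets using the *ε/(K-1)* form mentioned above and compares the resulting cross-entropy with the hard-label loss. It uses only the standard library; the function names are illustrative rather than a framework API.

```python
import math


def smoothed_targets(true_class: int, num_classes: int, eps: float) -> list[float]:
    """(1 - eps) on the true class, eps / (K - 1) on every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off_value for k in range(num_classes)]


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(target_probs: list[float], logits: list[float]) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))


confident_logits = [10.0, 0.0, 0.0, 0.0]
hard = cross_entropy([1.0, 0.0, 0.0, 0.0], confident_logits)
soft = cross_entropy(smoothed_targets(0, 4, eps=0.1), confident_logits)
print(hard, soft)  # the smoothed loss is larger for a very confident prediction
```

With smoothed targets, the loss no longer approaches zero as the true-class logit grows, which is what discourages extreme confidence.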

The total loss with regularization can be written as:

*Total Loss = Data Loss + λ · Regularization Term*

where λ is a [hyperparameter](/wiki/hyperparameter) that controls the strength of regularization. When multiple regularizers are combined, each typically gets its own coefficient, and tuning these coefficients can have as much impact on final performance as choosing the architecture.
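A minimal sketch of this formula with one coefficient per regularizer follows. The coefficient names `l1_coef` and `l2_coef` are illustrative, and the weights are a plain list of floats rather than framework tensors.

```python
def l1_penalty(weights: list[float]) -> float:
    """Sum of absolute parameter values (encourages sparsity)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights: list[float]) -> float:
    """Sum of squared parameter values (discourages large weights)."""
    return sum(w * w for w in weights)


def total_loss(data_loss: float, weights: list[float],
               l1_coef: float = 0.0, l2_coef: float = 0.0) -> float:
    """Data loss plus each regularizer scaled by its own coefficient."""
    return data_loss + l1_coef * l1_penalty(weights) + l2_coef * l2_penalty(weights)


weights = [1.0, -2.0, 3.0]
print(total_loss(0.5, weights, l1_coef=0.01, l2_coef=0.001))
```

In a real training loop, the penalties would be computed on the model's trainable parameters so that their gradients flow into the update alongside the data loss.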

From a Bayesian standpoint, adding L2 regularization corresponds to placing a zero-mean Gaussian prior on the weights, and adding L1 regularization corresponds to a Laplace prior. When the data loss is a negative log-likelihood, maximum a posteriori (MAP) estimation under these priors minimizes the same objective as the regularized loss, with λ determined by the scale of the prior.

## Multi-task and weighted loss

Many real systems optimize more than one objective at once. A self-driving perception network might predict depth, semantic segmentation, and lane lines simultaneously, with each task contributing its own loss term:

*L_total = w_1 · L_depth + w_2 · L_segmentation + w_3 · L_lanes*

The weights *w_i* control the relative importance of each task. Picking these by hand is brittle: tasks have different units, scales, and convergence rates, and good values vary across architectures and datasets.

Kendall, Gal, and Cipolla (2018) proposed learning the task weights automatically by treating each task's loss as a Gaussian or Laplacian likelihood with a learned **homoscedastic uncertainty** σ_i.[19] For a regression task, the loss is scaled by *1 / (2σ_i²)*, and a *log σ_i* term is added so that the model cannot shrink every weight toward zero simply by inflating σ_i. This single change made multi-task learning meaningfully easier to apply on real perception problems, because the network adapts its own task balance instead of relying on hand-tuned ratios.
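The regression form of this weighting can be sketched as follows. The code parameterizes each task by a log-variance *s_i = log σ_i²*, a common implementation choice assumed here for numerical convenience; in a real model these values would be trainable parameters updated by the optimizer rather than fixed numbers.

```python
import math


def uncertainty_weighted_loss(task_losses: list[float], log_vars: list[float]) -> float:
    """Sum over tasks of L_i / (2 * sigma_i^2) + log(sigma_i), with s_i = log(sigma_i^2)."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        precision = math.exp(-s)      # 1 / sigma_i^2
        total += 0.5 * precision * loss + 0.5 * s  # 0.5 * s equals log(sigma_i)
    return total


# Hypothetical per-task losses for depth and segmentation.
losses = [1.0, 4.0]
print(uncertainty_weighted_loss(losses, [0.0, 0.0]))           # equal weights of 1/2
print(uncertainty_weighted_loss(losses, [0.0, math.log(4.0)]))  # second task down-weighted
```

For a fixed task loss, the objective is minimized at σ_i² equal to that loss, so tasks with larger or noisier losses are down-weighted automatically.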

Other weighting schemes include **GradNorm**, which equalizes the gradient norms produced by each task, and **dynamic weight averaging**, which balances the rate of loss decrease across tasks. These methods all reduce the burden of manual tuning and have become standard tools in multi-objective deep learning.

## Specialized losses in modern systems

The ecosystem of loss formulations has expanded sharply with the rise of self-supervised learning, generative modeling, and alignment work.

- **Contrastive losses** like InfoNCE, used in [SimCLR](/wiki/simclr), MoCo, and [CLIP](/wiki/clip), compare positive pairs (different views of the same image, or matched image-text pairs) to a large set of negatives. The model learns embeddings without labels by pulling positives together and pushing negatives apart.
- **Reconstruction losses** in [autoencoders](/wiki/autoencoder) measure how closely the output reproduces the input, for example with a per-pixel error. [Variational autoencoders](/wiki/variational_autoencoder) combine this data term with a [KL divergence](/wiki/kl_divergence) term that regularizes the latent distribution toward a prior.
- **Adversarial losses** in GANs cast generation as a minimax game between a generator and a discriminator, with several stable variants such as the Wasserstein and hinge formulations.
- **Reward modeling losses** in [reinforcement learning from human feedback](/wiki/rlhf) train a reward model on pairwise preference data, then optimize the language model against this reward with a KL penalty to prevent reward hacking and excessive drift.
- **Direct preference optimization (DPO)** and its variants (IPO, KTO, ORPO) bypass the separate reward model by deriving a classification-style loss directly from preference data, usually pairs of preferred and rejected responses (KTO uses per-response binary feedback instead).
- **Diffusion model losses** train a network to predict the noise added to data at each diffusion step, typically using mean squared error in the noise space, with optional weighting that emphasizes signal-to-noise regimes most useful for sample quality.

Each of these formulations is the subject of an active research literature, and choosing the right loss is often the single most important decision when designing a new system.
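As one illustration from the list above, the pairwise losses used for reward modeling and for DPO can be written on scalar inputs. This is a sketch under stated assumptions: rewards and sequence log-probabilities are passed in as plain floats, the function names are illustrative, and `beta=0.1` is just an example value.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair, given policy and reference log-probabilities."""
    chosen_log_ratio = logp_chosen - ref_logp_chosen
    rejected_log_ratio = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))


# A policy identical to the reference gives ln(2); shifting probability
# toward the chosen response lowers the loss.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

Both losses have the same shape: a logistic loss on a margin, where the margin is either a difference of learned rewards or a scaled difference of log-probability ratios.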

## Custom and surrogate losses

When no standard loss aligns well with the task, practitioners design **custom losses**. Common motivations include:

- **Asymmetric costs.** In medical screening, missing a positive case is far worse than a false alarm, so the loss can weight false negatives more heavily.
- **Business metrics.** Forecasting demand for a retailer may use a quantile loss or a custom under-/over-prediction penalty that mirrors actual stockout and overstock costs.
- **Perceptual quality.** In image generation and super-resolution, perceptual losses compare features extracted by a pretrained network rather than raw pixels, producing outputs that look more natural to humans.
- **Soft constraints.** Adding a penalty term that grows when the model violates a physical or domain-specific constraint (a soft equality, a non-negativity bound, or a known invariant) lets the optimizer steer toward feasible solutions.
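The quantile (pinball) loss mentioned under business metrics is a compact example of an asymmetric cost. The sketch below uses hypothetical demand figures; a quantile of 0.9 penalizes under-prediction nine times as heavily as over-prediction.

```python
def pinball_loss(y_true: float, y_pred: float, quantile: float) -> float:
    """Quantile loss: under-prediction costs `quantile`, over-prediction costs 1 - quantile."""
    error = y_true - y_pred
    return max(quantile * error, (quantile - 1.0) * error)


def mean_pinball_loss(y_true: list[float], y_pred: list[float], quantile: float) -> float:
    """Average pinball loss over a batch."""
    return sum(pinball_loss(t, p, quantile) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical demand of 10 units, forecast 2 units too low and 2 units too high.
print(pinball_loss(10.0, 8.0, quantile=0.9))   # under-prediction (stockout risk)
print(pinball_loss(10.0, 12.0, quantile=0.9))  # over-prediction (overstock)
```

Minimizing this loss pushes the forecast toward the chosen quantile of the demand distribution rather than its mean, which is how the asymmetry in business costs enters training.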

When designing a custom loss, three checks are essential. First, the loss must be differentiable (or have well-defined subgradients). Second, it should be tested with extreme inputs (zeros, large values, NaN propagation) so it does not blow up under unusual data. Third, the loss should be checked against the actual evaluation metric on a small benchmark to verify that progress on the loss correlates with progress on the metric. A loss that improves while the metric stagnates is usually a sign of misalignment, sometimes called "goodharting," where the optimizer exploits a feature of the loss that does not reflect the underlying goal.

## Loss in PyTorch and TensorFlow

Most deep learning practitioners interact with loss functions through framework-provided APIs. A few of the most common follow:

| Framework | Class or function | Purpose |
|---|---|---|
| PyTorch | `torch.nn.MSELoss` | Mean squared error for regression |
| PyTorch | `torch.nn.L1Loss` | Mean absolute error for regression |
| PyTorch | `torch.nn.HuberLoss` | Huber loss for robust regression |
| PyTorch | `torch.nn.CrossEntropyLoss` | Combined log-softmax and negative log-likelihood for multi-class classification |
| PyTorch | `torch.nn.BCELoss` / `BCEWithLogitsLoss` | Binary cross-entropy, with the logits version preferred for numerical stability |
| PyTorch | `torch.nn.NLLLoss` | Negative log-likelihood; pair with `log_softmax` outputs |
| PyTorch | `torch.nn.KLDivLoss` | Kullback-Leibler divergence |
| PyTorch | `torch.nn.CTCLoss` | Connectionist temporal classification for sequence labeling |
| TensorFlow / Keras | `tf.keras.losses.MeanSquaredError` | Mean squared error |
| TensorFlow / Keras | `tf.keras.losses.SparseCategoricalCrossentropy` | Cross-entropy with integer-encoded class labels |
| TensorFlow / Keras | `tf.keras.losses.BinaryCrossentropy` | Binary cross-entropy |
| TensorFlow / Keras | `tf.keras.losses.Huber` | Huber loss |
| JAX / Optax | `optax.softmax_cross_entropy` | Functional cross-entropy on logits |

A recurring source of bugs is using `torch.nn.BCELoss` on outputs that have not yet passed through a [sigmoid](/wiki/sigmoid_function) activation, or applying `torch.nn.CrossEntropyLoss` to outputs that have already passed through softmax (it expects raw logits and applies log-softmax internally). `BCEWithLogitsLoss` and `CrossEntropyLoss` both fold the activation into the loss in a numerically stable way and are usually the right default.
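Why folding the activation into the loss helps can be seen without any framework. The sketch below is not PyTorch's implementation; it contrasts a naive sigmoid-then-log computation with a standard stable rearrangement of binary cross-entropy on logits, using only the standard library.

```python
import math


def naive_bce(logit: float, target: float) -> float:
    """Sigmoid followed by log; breaks down for large-magnitude logits."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit: float, target: float) -> float:
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


print(bce_with_logits(0.0, 1.0))   # ln(2) for an uninformative logit
print(bce_with_logits(40.0, 0.0))  # finite, close to 40

try:
    naive_bce(40.0, 0.0)           # sigmoid rounds to 1.0, so log(1 - p) is log(0)
except ValueError as exc:
    print("naive version failed:", exc)
```

The stable form never evaluates `exp` on a positive argument or `log` on zero, so it stays finite for any finite logit.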

## Practical advice for working with loss

Across years of practice, several rules of thumb have emerged:

- **Always sanity-check the loss at step zero.** A randomly initialized classifier trained with cross-entropy should produce a loss close to ln(*C*) for *C* classes, because its predictions start out near uniform. If the initial loss is far from this baseline, the loss function, the labels, or the initialization are likely wrong.
- **Overfit a single batch first.** Before launching a full training run, train on one tiny batch repeatedly and confirm the loss drops to nearly zero. Failure to overfit a batch indicates a bug in the model, the loss, or the data pipeline.
- **Prefer numerically stable variants.** Use the "with logits" form of cross-entropy and binary cross-entropy when available. Combine softmax and log into a single `log_softmax` operation rather than computing them separately.
- **Average, do not sum.** Unless a sum is intentional, divide losses by the batch size so the gradient magnitude is independent of how many examples happen to land in a batch.
- **Track per-component losses in multi-loss models.** If the total loss is a sum of several terms, log each one separately. A drop in total loss can hide a regression in a single component.
- **Be patient with validation noise.** A small uptick in validation loss is often noise. Use the patience parameter in early stopping rather than reacting to a single bad epoch.
- **Document the loss recipe.** Custom weights, regularization strengths, and label smoothing factors should be saved alongside model checkpoints, because they are part of the model and will be needed to reproduce results.
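The step-zero check from the list above can be sketched in plain Python. The helper names and the 20% tolerance are illustrative assumptions, not a standard; the cross-entropy uses the log-sum-exp trick so it also serves as a small example of a numerically stable variant.

```python
import math


def cross_entropy_from_logits(logits: list[float], label: int) -> float:
    """Stable softmax cross-entropy: logsumexp(logits) - logits[label]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[label]


def mean_cross_entropy(batch_logits: list[list[float]], labels: list[int]) -> float:
    """Average (not sum) over the batch, so the scale is independent of batch size."""
    losses = [cross_entropy_from_logits(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def initial_loss_looks_sane(observed: float, num_classes: int, rel_tol: float = 0.2) -> bool:
    """True if the observed loss is within rel_tol of ln(C); the tolerance is arbitrary."""
    expected = math.log(num_classes)
    return abs(observed - expected) <= rel_tol * expected


# An untrained 10-class model with near-uniform outputs (all-zero logits here).
batch = [[0.0] * 10 for _ in range(4)]
loss = mean_cross_entropy(batch, labels=[3, 1, 7, 0])
print(loss, math.log(10), initial_loss_looks_sane(loss, num_classes=10))
```

If the same check is run on a real model, the logits would come from a forward pass on the first batch before any optimizer step.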

## Explain Like I'm 5 (ELI5)

Imagine you are learning to throw a ball into a bucket. Every time you throw, someone tells you how far the ball landed from the bucket. That distance is the "loss." If the ball lands right in the bucket, your loss is zero. If it lands far away, your loss is big.

Your goal is to practice throwing until the loss gets as small as possible. Each time, you adjust how hard you throw and at what angle based on how far off you were last time. That adjustment is exactly what a computer does when it trains a model: it looks at the loss and tweaks its settings to do better next time.

Sometimes you practice with your friends watching (that is like "validation"). If you do great when practicing alone but poorly when your friends watch, it means you are only good at one specific setup and have not really learned. In machine learning, that is called [overfitting](/wiki/overfitting).

## References

1. Raschka, Sebastian. "What is the difference between a cost function and a loss function in machine learning?" sebastianraschka.com.
2. PyTorch documentation. "Automatic Differentiation with torch.autograd." docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html.
3. Brownlee, Jason. "How to use Learning Curves to Diagnose Machine Learning Model Performance." MachineLearningMastery.com, 2019.
4. "Formula for the smoothing algorithm." Weights & Biases documentation. docs.wandb.ai/support/formula_smoothing_algorithm.
5. Brownlee, Jason. "Use Early Stopping to Halt the Training of Neural Networks At the Right Time." MachineLearningMastery.com, 2018.
6. Takase, Sho; Kiyono, Shun; Kobayashi, Sosuke; Suzuki, Jun. "Spike No More: Stabilizing the Pre-training of Large Language Models." arXiv:2312.16903.
7. "Common Causes of NaNs During Training." Baeldung on Computer Science. baeldung.com.
8. Li, Hao; Xu, Zheng; Taylor, Gavin; Studer, Christoph; Goldstein, Tom. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018. arXiv:1712.09913.
9. "Loss Functions and Metrics in Deep Learning." arXiv:2307.02694.
10. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. *Deep Learning*. MIT Press, 2016.
11. "Linear regression: Loss." Google Machine Learning Crash Course. developers.google.com.
12. Na, Youngjin. "Understanding the Difference Between Loss and Accuracy." Medium, 2023.
13. Bachmann, Gregor; Nagarajan, Vaishnavh. "The Pitfalls of Next-Token Prediction." arXiv:2403.06963.
14. Foret, Pierre; Kleiner, Ariel; Mobahi, Hossein; Neyshabur, Behnam. "Sharpness-Aware Minimization for Efficiently Improving Generalization." ICLR 2021. arXiv:2010.01412.
15. Dauphin, Yann; Pascanu, Razvan; Gulcehre, Caglar; Cho, Kyunghyun; Ganguli, Surya; Bengio, Yoshua. "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization." NeurIPS 2014. arXiv:1406.2572.
16. Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020.
17. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. "Training Compute-Optimal Large Language Models" (Chinchilla). arXiv:2203.15556, 2022.
18. Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Zbigniew. "Rethinking the Inception Architecture for Computer Vision." CVPR 2016 (introduced label smoothing).
19. Kendall, Alex; Gal, Yarin; Cipolla, Roberto. "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics." CVPR 2018. arXiv:1705.07115.
20. Vapnik, Vladimir. *The Nature of Statistical Learning Theory*. Springer, 1995. (Foundational reference for empirical risk minimization.)
21. "Training and Validation Loss in Deep Learning." GeeksforGeeks. geeksforgeeks.org.
22. Shannon, Claude E. "Prediction and Entropy of Printed English." Bell System Technical Journal, vol. 30, no. 1, 1951, pp. 50-64.

