Loss

Introduction

In machine learning, loss (sometimes called error) is a scalar value that quantifies how far a model's prediction deviates from the true or expected output for a single training example. During the training process, the model's parameters are iteratively adjusted to reduce this value, driving the model toward more accurate predictions. Loss is the fundamental signal that enables learning: without it, optimization algorithms like gradient descent would have no direction in which to update a model's weights.

The concept of loss sits at the intersection of statistics, optimization theory, and machine learning engineering. It is central to nearly every supervised learning workflow and to many unsupervised ones, from simple linear regression to large-scale deep learning systems like transformers. Where the loss function article focuses on specific mathematical formulas and their derivations, this article focuses on loss as an operational quantity: how it is produced, monitored, interpreted, and acted upon during the lifecycle of a model.

Definition and formal meaning

Formally, the loss for a single example is defined by a loss function L(y, ŷ), where y is the ground-truth label and ŷ is the model's prediction. The output is a non-negative real number (a scalar) that represents the penalty for that prediction. A loss of zero means the model's prediction exactly matches the true value; larger values indicate greater error.

For example, if a model predicts that a house costs $310,000 and the actual price is $300,000, the squared error loss for that example would be (310,000 - 300,000)^2 = 100,000,000. This single-example loss is the atomic unit of the learning signal in supervised learning.

As training proceeds, the per-example losses are aggregated. Most practitioners report a batch loss (the mean loss over a mini-batch), an epoch loss (the mean over all batches in one pass through the dataset), and sometimes a running loss (a moving average across recent steps). These aggregates are the numbers that appear on training dashboards, and they form the loss curves that practitioners read to diagnose training health.
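
The relationship between a single-example loss and a batch loss can be sketched in a few lines of Python. The first value reproduces the house-price example above; the other two rows of the mini-batch are hypothetical numbers chosen only for illustration.

```python
def squared_error(y_true, y_pred):
    """Per-example squared error loss."""
    return (y_true - y_pred) ** 2

# The house-price example from the text: predicted 310,000, actual 300,000.
single = squared_error(300_000, 310_000)

# Hypothetical mini-batch of (actual, predicted) prices.
batch = [(300_000, 310_000), (250_000, 245_000), (410_000, 410_000)]
per_example = [squared_error(y, y_hat) for y, y_hat in batch]
batch_loss = sum(per_example) / len(per_example)  # mean over the mini-batch

print(single)       # 100000000
print(per_example)  # [100000000, 25000000, 0]
print(batch_loss)
```

An epoch loss would be computed the same way, averaging over every batch seen in one pass through the data.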

The statistical view: empirical risk

The machine learning concept of loss has deep roots in statistical decision theory. In that framework, given a true data distribution p(x, y), the true risk (also called the expected loss or population risk) of a model f with parameters θ is the expected value of the loss across the entire distribution:

R(θ) = E[L(y, f(x; θ))]

This quantity is what a model designer ultimately cares about, because it measures performance on the underlying task in the real world. Unfortunately, p(x, y) is unknown. Practitioners only have access to a finite training set drawn from that distribution.

To make progress, machine learning systems minimize the empirical risk instead, which is the average loss over the training set:

R̂(θ) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i; θ))

This substitution is the empirical risk minimization (ERM) principle, developed by Vladimir Vapnik and Alexey Chervonenkis from the late 1960s onward and popularized by Vapnik's statistical learning theory texts in the 1990s. For any fixed setting of the parameters θ, the law of large numbers ensures that as the number of training examples n grows, the empirical risk converges to the true risk. For a sufficiently large training set drawn independently from the same distribution as the test data, minimizing empirical loss is a reasonable proxy for minimizing the true risk.

The gap between empirical risk and true risk is the generalization gap, and much of modern machine learning theory is concerned with bounding this gap. Tools such as VC dimension, Rademacher complexity, and PAC-Bayes bounds all attempt to characterize when minimizing training loss will produce a model that performs well on unseen data.

Loss vs. cost vs. objective function

The terms "loss," "cost," and "objective function" are closely related and sometimes used interchangeably in practice. However, many textbooks draw the following distinctions:

Term	Scope	Description
Loss function	Single example	Measures prediction error for one data point. Example: the squared error for one sample.
Cost function	Entire dataset	Aggregates the loss over all training examples, typically as an average or sum. May also include regularization terms.
Objective function	Optimization goal	The broadest term. Can refer to any function being optimized (minimized or maximized), including cost functions, reward functions, or likelihood functions.

Some authors, including Sebastian Raschka, treat "loss" and "cost" as synonyms, noting that there is no universal consensus on the distinction.^[1] In everyday conversation among practitioners, "loss" frequently refers to the aggregated value reported per batch or per epoch during training, even though that is technically a cost function.

A related word is error. In statistics, "error" usually refers to the residual y - ŷ, while "loss" refers to the function applied to that residual. In casual usage these are often blurred, and loss is sometimes called "training error" or "validation error." The Goodfellow, Bengio, and Courville textbook Deep Learning uses "cost function" and "loss function" interchangeably and notes that the choice depends on whichever phrasing is most natural in a given context.^[10]

Role of loss in training

The training loop in most machine learning systems follows a repeated cycle:

Forward pass. The model processes an input and produces a prediction.
Loss computation. The loss function compares the prediction to the ground truth and outputs a scalar loss value.
Backward pass (backpropagation). The gradient of the loss with respect to each model parameter is computed.
Parameter update. An optimizer (such as stochastic gradient descent, Adam, or RMSProp) adjusts the parameters in the direction that reduces the loss.

This cycle repeats across many batches and epochs until the loss converges to an acceptable level or training is stopped by other criteria.
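
The four steps above can be sketched without any framework. The example below is a minimal illustration, assuming a one-feature linear model, hand-derived gradients, toy data generated from y = 2x + 1, and an arbitrarily chosen learning rate; real systems delegate step 3 to automatic differentiation and step 4 to an optimizer class.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # model parameters
lr = 0.05         # learning rate (assumed value)
history = []      # loss recorded at every step

for step in range(2000):
    # 1. Forward pass: compute predictions.
    preds = [w * x + b for x in xs]
    # 2. Loss computation: mean squared error over the batch.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    history.append(loss)
    # 3. Backward pass: analytic gradients of the loss w.r.t. w and b.
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    # 4. Parameter update: plain gradient descent step.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Plotting `history` against the step index gives exactly the kind of loss curve discussed later in this article.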

The loss value is the only signal that connects the data to the model's parameters. A model with poorly chosen loss can train without numerical errors and still fail to learn the intended task, because the gradient produced by the loss does not point in a useful direction. This is why loss design is treated as a first-class concern in modern machine learning research.

Loss and automatic differentiation

In deep learning frameworks like PyTorch, TensorFlow, and JAX, the loss is the entry point for computing gradients via reverse-mode automatic differentiation (autodiff). When the framework executes the forward pass, it builds a computational graph that records every operation applied to the model's tensors. Calling .backward() on the loss tensor traverses this graph in reverse, applying the chain rule at each node to compute the partial derivative of the loss with respect to every parameter that affected it.^[2]

Because the gradient is computed with respect to the loss, the loss must be a single scalar. If the model produces a vector or tensor output, that output has to be reduced to one number (typically by summing or averaging the per-example losses) before the backward pass. Many subtle bugs in training code can be traced to losses that were not properly reduced, producing gradients that depend on batch size in unintended ways.
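
The batch-size dependence mentioned above is easy to reproduce by hand. The sketch below assumes a one-parameter model ŷ = w · x with squared error and compares a mean reduction with a sum reduction on the same data repeated four times; the function name and data are hypothetical.

```python
def grad_w(xs, ys, w, reduction):
    """Gradient of the squared-error loss for the model y_hat = w * x."""
    per_example_grads = [2 * (w * x - y) * x for x, y in zip(xs, ys)]
    total = sum(per_example_grads)
    return total / len(xs) if reduction == "mean" else total

xs, ys = [1.0, 2.0], [2.0, 4.0]
small = (xs, ys)
large = (xs * 4, ys * 4)  # the same examples repeated: a 4x larger batch

for name in ("mean", "sum"):
    g_small = grad_w(*small, w=0.0, reduction=name)
    g_large = grad_w(*large, w=0.0, reduction=name)
    print(name, g_small, g_large)
# mean -10.0 -10.0
# sum -20.0 -80.0
```

With a sum reduction the gradient grows with the batch size, so a learning rate tuned for one batch size silently becomes four times more aggressive on the larger batch.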

Autodiff also imposes a practical constraint on loss design: the function must be composed of operations the framework can differentiate. Most numerical operations qualify, but operations like sorting, top-k selection, or decoding through a beam search require special handling, smooth relaxations, or surrogate losses that approximate the true objective.

Common loss functions and their typical values

Different tasks call for different loss functions, each with its own range of output values. The summary below lists the most common families; for full mathematical treatment of each function, see the dedicated loss function article and the linked subpages.

Regression losses

Loss Function	Formula (single example)	Range	Typical Use Case
Mean Squared Error (MSE)	(y - ŷ)²	[0, +∞)	General regression; penalizes large errors heavily
Mean Absolute Error (MAE)	|y - ŷ|	[0, +∞)	Regression with outlier robustness
Huber Loss	Quadratic near zero, linear far from zero	[0, +∞)	Combines benefits of MSE and MAE
Log-cosh loss	log(cosh(y - ŷ))	[0, +∞)	Smooth alternative to Huber, twice differentiable
Quantile (pinball) loss	max(q(y-ŷ), (q-1)(y-ŷ))	[0, +∞)	Probabilistic forecasting, prediction intervals
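
The single-example formulas in the table translate directly into code. The sketch below is illustrative only: the Huber threshold `delta` and its default of 1.0 are assumptions, since the table describes Huber loss qualitatively, and the piecewise form used is the common textbook one.

```python
import math

def mse(y, y_hat):
    return (y - y_hat) ** 2

def mae(y, y_hat):
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    """Quadratic for residuals up to delta, linear beyond it."""
    r = abs(y - y_hat)
    return 0.5 * r ** 2 if r <= delta else delta * (r - 0.5 * delta)

def log_cosh(y, y_hat):
    # math.cosh overflows for very large residuals; fine for a sketch.
    return math.log(math.cosh(y - y_hat))

def pinball(y, y_hat, q):
    """Quantile loss for target quantile q in (0, 1)."""
    r = y - y_hat
    return max(q * r, (q - 1) * r)

print(mse(3.0, 1.0), mae(3.0, 1.0), huber(3.0, 1.0))  # 4.0 2.0 1.5
```

For the same residual of 2, MSE charges 4 while Huber charges 1.5, which is the outlier-damping behaviour the table refers to.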

Classification losses

Loss Function	Formula (single example)	Range	Typical Use Case
Cross-entropy (log loss)	-y log(ŷ) - (1-y) log(1-ŷ) (binary form)	[0, +∞)	Binary and multi-class classification
Hinge loss	max(0, 1 - y·ŷ)	[0, +∞)	Support vector machines, binary classification
Focal loss	-α(1-ŷ)^γ log(ŷ)	[0, +∞)	Imbalanced classification, object detection
Kullback-Leibler (KL divergence)	Σ p(x) log(p(x)/q(x))	[0, +∞)	Comparing probability distributions; variational autoencoders, distillation
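
A minimal sketch of three of the classification losses above follows. It assumes probabilities strictly between 0 and 1 (so that the logarithms are defined), labels in {0, 1} for cross-entropy and in {-1, +1} for hinge loss, and uses illustrative default values for the focal-loss parameters α and γ.

```python
import math

def binary_cross_entropy(y, p):
    """y in {0, 1}; p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge(y, score):
    """y in {-1, +1}; score is the raw (unbounded) model output."""
    return max(0.0, 1.0 - y * score)

def focal(p_true, alpha=0.25, gamma=2.0):
    """p_true is the probability the model assigns to the correct class."""
    return -alpha * (1 - p_true) ** gamma * math.log(p_true)

print(hinge(1, 2.0), hinge(-1, 0.5))  # 0.0 1.5
print(focal(0.9) < focal(0.5))        # True: easy examples are down-weighted
```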

Sequence and structured losses

Loss Function	Domain	Typical Use Case
Sequence cross-entropy	Token-level	Language model next-token prediction
CTC loss	Audio, handwriting	Sequence labeling with unknown alignment
Triplet / contrastive loss	Embeddings	Face recognition, retrieval, SimCLR, CLIP
Dice / IoU loss	Segmentation	Medical imaging, foreground segmentation

There is no universal "good" loss value that applies across tasks. A cross-entropy loss of 0.3 might be excellent for one dataset and mediocre for another. What matters is the trajectory of the loss over time and how it compares to a baseline or published benchmark for the same task.

Loss values to expect in practice

Rough rules of thumb help readers calibrate what they see in practice:

For binary cross-entropy, a random baseline gives loss ≈ 0.693 (which is ln(2)). Strong models on clean data drop well below 0.1, while values stuck near 0.69 indicate the model has learned almost nothing.
For categorical cross-entropy with C classes, the random baseline is ln(C). A well-trained ten-class image classifier might converge to a loss of 0.05 to 0.2 on the training set.
For MSE on standardized targets (zero mean, unit variance), a model that always predicts the mean achieves loss ≈ 1.0. Useful regressors push this well below 0.5.
For language model perplexity, modern transformer language models on diverse internet text typically reach token-level cross-entropy in the range of 1.5 to 2.5 nats, corresponding to perplexities between roughly 4.5 and 12.

These ranges are heuristics, not guarantees. The right way to interpret a loss number is always to compare it to a baseline and to the loss of competing models on the same data.
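
The baseline figures quoted above follow from two one-line formulas, shown here as a small sketch. The helper names are hypothetical.

```python
import math

def uniform_baseline(num_classes):
    """Cross-entropy (in nats) of a predictor that assigns 1/C to every class."""
    return math.log(num_classes)

def perplexity(cross_entropy_nats):
    """Perplexity corresponding to a cross-entropy measured in nats."""
    return math.exp(cross_entropy_nats)

print(round(uniform_baseline(2), 3))   # 0.693
print(round(uniform_baseline(10), 3))  # 2.303
print(round(perplexity(1.5), 2))       # 4.48
print(round(perplexity(2.5), 2))       # 12.18
```

A ten-class classifier whose loss hovers around 2.3 is therefore doing no better than guessing uniformly.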

Training, validation, and test loss

The single word "loss" usually covers three separately tracked quantities: training loss, validation loss, and test loss.

Loss type	Computed on	Used for	Notes
Training loss	Training set	Backpropagation, parameter updates	Falls steadily during healthy training; used by the optimizer at every step
Validation loss	Held-out validation set	Hyperparameter tuning, early stopping, model selection	The model is not updated against this loss directly; checking it once per epoch is common
Test loss	Held-out test set	Final reporting	Should be evaluated only after model selection is complete to give an unbiased estimate of generalization

Mixing these roles is one of the most common methodological mistakes. Tuning hyperparameters against the test set, for instance, leaks information from the test set into the training process and inflates the reported performance. Best practice is to lock the test set away until the model is finalized and to use a separate validation split for all decisions made during development.

In smaller datasets, k-fold cross-validation averages the loss over multiple train/validation splits, producing a more stable estimate of generalization at the cost of additional compute. In very large datasets typical of deep learning, a single fixed validation split is usually adequate.

Loss curves

A loss curve (also called a learning curve) is a plot of loss values on the y-axis against training steps or epochs on the x-axis. Practitioners routinely plot two curves together: training loss and validation loss. Analyzing these curves is one of the most important diagnostic tools in machine learning.^[3]

Interpreting loss curve patterns

Pattern	Training Loss	Validation Loss	Diagnosis
Good fit	Decreases and converges	Decreases and converges near training loss	Model is learning well and generalizing
Overfitting	Continues to decrease	Decreases, then increases or plateaus	Model memorizes training data instead of learning general patterns
Underfitting	Remains high	Remains high (close to training loss)	Model lacks capacity or has not trained long enough
Oscillating loss	Fluctuates erratically	Fluctuates erratically	Learning rate too high, poor data quality, or insufficient shuffling
Diverging loss	Increases or explodes to NaN/Inf	Increases or explodes	Severe numerical instability (see section below)
Plateau	Stagnates after early drop	Stagnates after early drop	Saddle point or vanishing gradient; try restart, warmup, or optimizer change
Loss spike	Sudden upward jump, often recovers	May or may not move	Bad batch, gradient norm explosion, Adam state corruption

An ideal loss curve shows an exponential-like decrease that gradually flattens, indicating the model has extracted most of the learnable patterns from the data. The gap between training and validation curves is critical: a small gap indicates good generalization, while a growing gap signals overfitting.^[3]

Smoothing noisy curves

Raw per-step losses are noisy because each batch contains a different sample of data. Visualization tools such as TensorBoard and Weights & Biases offer smoothing controls that apply an exponential moving average (EMA) to the loss series, making trends easier to read. The EMA update is y_i = α · y_{i-1} + (1 - α) · x_i, with smoothing factor α (often 0.9 to 0.99 in practice). Heavy smoothing can mask short-lived spikes, so it is good practice to inspect both the smoothed and the raw curves.^[4]
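
The EMA update can be sketched as follows. Initializing the running value with the first raw loss is an assumption made here for simplicity; dashboard tools may initialize or debias differently.

```python
def ema_smooth(values, alpha=0.9):
    """Apply y_i = alpha * y_{i-1} + (1 - alpha) * x_i to a non-empty loss series."""
    smoothed = []
    prev = values[0]  # assumption: start from the first raw value
    for x in values:
        prev = alpha * prev + (1 - alpha) * x
        smoothed.append(prev)
    return smoothed

raw = [1.0, 0.8, 3.0, 0.6, 0.5]  # hypothetical losses with one spike
print(ema_smooth(raw, alpha=0.5))
```

Even with a mild smoothing factor of 0.5, the spike of 3.0 is flattened to under 2.0 in the smoothed series, which illustrates why the raw curve should be inspected as well.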

Other useful diagnostic plots include the loss histogram (the distribution of per-example losses inside a batch, which can reveal outliers dominating training) and the gradient norm curve (a companion plot that exposes exploding or vanishing gradient regimes that are not always visible in the loss itself).

Monitoring loss during training

Effective loss monitoring is essential for producing well-trained models. Several strategies help practitioners get the most from their loss curves:

Log both training and validation loss at every epoch. Tools such as TensorBoard, Weights & Biases, and MLflow provide real-time dashboards that track loss and other metrics.
Use early stopping. If the validation loss does not improve for a specified number of consecutive epochs (called the "patience" period, commonly set between 5 and 10 epochs), halt training automatically. This saves computational resources and helps prevent overfitting.^[5]
Save model checkpoints. Periodically save the model's parameters so that you can restore the best-performing version, typically the one with the lowest validation loss.
Compare against baselines. Before evaluating whether a loss value is "good," establish a baseline, such as a random predictor, a simple heuristic, or a previously published result on the same dataset.
Track auxiliary metrics alongside loss. Accuracy, F1, BLEU, IoU, or domain-specific metrics often tell a different story than loss. A model whose loss falls but whose accuracy stalls may be becoming more confident on already-correct predictions while producing the same final answers.
Watch for distribution drift in validation data. If the validation loss rises while the training loss is stable, it may not be overfitting; it may indicate that the validation set has been corrupted, reshuffled, or now contains examples from a different distribution.
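
Early stopping with a patience counter, combined with tracking the best checkpoint, can be sketched as below. The function operates on a pre-recorded list of validation losses purely for illustration; in a real training loop the same logic would run once per epoch, and the checkpoint would be written where the comment indicates.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a checkpoint would be saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7]  # hypothetical validation losses
print(early_stopping_epoch(losses, patience=3))  # (5, 2)
```

Training halts at epoch 5, and the model restored afterwards is the one saved at epoch 2, where validation loss was lowest.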

Loss spikes in large-model training

In very large models, including modern large language models, the training loss occasionally exhibits sudden upward spikes that can be hundreds of times larger than the typical loss. Research on stabilizing pre-training has linked these spikes to abrupt growth in gradient norms, often triggered by a single "bad" batch with numerical outliers, or by parameter norms drifting non-uniformly across layers.^[6] Common mitigations include:

Gradient clipping by global norm, which caps the magnitude of the gradient passed to the optimizer and so limits the size of the resulting update.
Learning rate warmup at the start of training, preventing large early updates that destabilize Adam-style optimizers.
Skipping bad batches when the loss exceeds a threshold, sometimes paired with a momentum reset (as in the SPAM optimizer).
Resuming from a recent checkpoint when a spike fails to recover, and replaying the data in a different order to avoid the offending batch.

Most loss spikes recover on their own within a few iterations. Persistent spikes that fail to recover are often the first sign that hyperparameters need adjustment or that the data pipeline has produced corrupted samples.
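
Of the mitigations listed, clipping by global norm is the simplest to show in isolation. The sketch below represents gradients as plain nested lists, one inner list per parameter tensor; deep learning frameworks provide their own utilities for this, which are not reproduced here.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # joint norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # roughly [[0.6, 0.0], [0.0, 0.8]]
```

Because every component is scaled by the same factor, the direction of the gradient is preserved and only its length changes. Logging the pre-clipping norm gives the gradient norm curve mentioned earlier.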

Loss divergence: NaN and Inf

One of the most frustrating problems during training is loss divergence, where the loss increases without bound, eventually reaching Inf (infinity) or NaN (not a number). Common causes include:^[7]

Cause	Mechanism	Typical Fix
Exploding gradients	Gradients grow exponentially through deep layers during backpropagation	Gradient clipping; reduce learning rate; use batch normalization
Learning rate too high	Parameter updates overshoot the minimum	Lower the learning rate; use a learning rate scheduler
Numerical instability	Operations like log(0), division by zero, or overflow in activations	Add small epsilon values (e.g., log(ŷ + 1e-7)); use mixed-precision training carefully
Corrupt or unnormalized data	NaN or extreme values in the input features or labels	Scan data for NaN/Inf; normalize or standardize features
Inappropriate loss function	Loss function does not match the task or output range	Verify that loss function assumptions (e.g., probability outputs for cross-entropy) are met
Mixed-precision overflow	FP16 activations exceed the representable range	Use loss scaling, switch to bfloat16, or keep certain layers in FP32

When NaN appears in the loss, training should be stopped immediately. Continuing after NaN values propagate through the network will corrupt all parameters. Debugging typically involves checking the last few batches before the NaN appeared, inspecting gradient magnitudes, and validating the input data pipeline.
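
Two of the fixes above can be sketched together: computing cross-entropy from raw logits with the log-sum-exp trick, which avoids the overflow that a naive softmax would hit, and a guard that halts as soon as the loss stops being finite. Both helper names are hypothetical.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example, computed with the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def check_finite(loss, step):
    """Raise as soon as the loss is NaN or Inf so training halts early."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss {loss} at step {step}")
    return loss

# math.exp(1000.0) alone would overflow; the stable form does not.
loss = check_finite(cross_entropy_from_logits([1000.0, 0.0], target=1), step=0)
print(loss)  # 1000.0
```

The loss here is large because the model is confidently wrong, but it remains a finite number that still yields a usable gradient.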

A helpful diagnostic technique is to wrap the forward and backward pass in an anomaly detector, such as PyTorch's torch.autograd.detect_anomaly(), which raises an error when a backward computation produces NaN and reports the traceback of the forward operation responsible. This is too slow to leave on during full training but is invaluable when investigating a divergence.

Loss landscape visualization

The loss landscape is the surface formed by plotting the loss as a function of the model's parameters. For a model with two parameters, this produces a 3D surface; for real neural networks with millions of parameters, the landscape exists in extremely high-dimensional space and can only be visualized through projections.

In their influential 2018 paper "Visualizing the Loss Landscape of Neural Nets," Li et al. introduced filter normalization methods to produce meaningful 2D cross-sections of high-dimensional loss surfaces.^[8] Their key findings include:

Shallow networks tend to have smooth, convex-like loss landscapes with wide minima that are easy for optimizers to find.
Deeper networks without skip connections have chaotic, highly non-convex landscapes with many sharp minima and saddle points, making training difficult.
Skip connections (as in ResNet) dramatically smooth the loss landscape, explaining why residual networks are easier to train than plain deep networks.
Wider networks also produce smoother landscapes, correlating with better trainability and generalization.

Loss landscape analysis has become an important research area for understanding why certain architectures and hyperparameter choices lead to better training outcomes.^[9]

Saddle points and high-dimensional optimization

Research by Dauphin et al. (2014) showed that in very high-dimensional spaces, almost all critical points of the loss are saddle points, not local minima.^[15] At a saddle point, the gradient is zero but the loss can decrease in some directions and increase in others. Vanilla gradient descent tends to stall at these points because it has nothing to push it off. Momentum, adaptive optimizers like Adam, and stochastic noise from mini-batching all help models escape saddles. This is why pure batch gradient descent is rarely used in modern deep learning: the noise in stochastic updates is itself a useful feature.
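
The stalling behaviour can be reproduced on the textbook saddle f(x, y) = x² - y², a toy function chosen here for illustration rather than taken from the cited work. A tiny perturbation stands in for the noise that mini-batching would provide.

```python
def descend(x, y, lr=0.1, steps=100):
    """Gradient descent on f(x, y) = x**2 - y**2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2 * x, -2 * y
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y

# Started exactly on the x-axis, the iterate converges to the saddle and stays there.
print(descend(1.0, 0.0))
# A tiny perturbation in y (a stand-in for mini-batch noise) grows and escapes.
print(descend(1.0, 1e-6))
```

In the first run the y-coordinate never moves because its gradient is exactly zero; in the second, the same update rule amplifies the perturbation by a factor of 1.2 per step and carries the iterate away from the saddle.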

Flat versus sharp minima

Not all minima are equal. Solutions in flat regions of the loss landscape, where the loss changes slowly under small parameter perturbations, tend to generalize better than solutions in sharp minima, where small weight changes produce large loss increases.^[8] Several training techniques are designed to bias the optimizer toward flat minima:

Stochastic weight averaging (SWA) averages model weights over the last portion of training to settle into a wider basin.
Sharpness-aware minimization (SAM), introduced by Foret et al. in 2020, explicitly minimizes both the loss value and the loss sharpness in a neighborhood of the current parameters, by performing a small ascent step to find the worst-case nearby loss before the gradient descent step.^[14]
Batch size also shapes the geometry of the minimum the optimizer reaches: smaller batches inject useful gradient noise, while larger batches need appropriate learning-rate scaling.
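
The two-step structure of sharpness-aware minimization (an ascent step to a nearby worst-case point, then a descent step using the gradient found there) can be sketched on a toy quadratic. This is a simplified illustration, not the authors' reference implementation; the loss, learning rate, and radius `rho` are all assumed values.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update for a parameter vector w (a list of floats)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def grad(w):
    """Gradient of the hypothetical loss L(w) = w0**2 + 4 * w1**2."""
    return [2 * w[0], 8 * w[1]]

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, grad)
print(w)  # stays within a small neighbourhood of the minimum at (0, 0)
```

Each update costs two gradient evaluations instead of one, which is the main practical overhead of this family of methods.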

Relationship between loss and metrics

A common source of confusion is the distinction between loss and evaluation metrics (such as accuracy, precision, recall, or F1 score).

Property	Loss	Metric
Differentiable	Yes (required for gradient-based optimization)	Often not (e.g., accuracy is a step function)
Used during training	Yes (drives parameter updates)	Sometimes logged, but not used for optimization
Interpretability	Scale depends on the loss function; not always intuitive	Often more interpretable (e.g., "93% accuracy")
Relationship to task goal	Proxy for the real objective	Often more directly aligned with the real objective

Loss must be differentiable so that gradient descent can compute the direction and magnitude of parameter updates. Metrics like accuracy produce discrete (correct/incorrect) outputs, making them non-differentiable and unsuitable as optimization targets. Cross-entropy loss, for instance, serves as a smooth, differentiable proxy for accuracy in classification tasks.^[11]

It is also worth noting that loss and accuracy are not always inversely proportional. A model can achieve lower loss while accuracy stays flat, or vice versa. This happens because loss captures confidence in predictions (penalizing uncertain correct predictions and rewarding confident correct ones), while accuracy only counts whether the final prediction was correct or not.^[12]
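
A small numerical example makes the decoupling concrete. The two hypothetical sets of predictions below produce identical decisions at a 0.5 threshold, and therefore identical accuracy, but very different cross-entropy.

```python
import math

def mean_bce(labels, probs):
    """Mean binary cross-entropy; probs must lie strictly between 0 and 1."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

def accuracy(labels, probs, threshold=0.5):
    return sum((p >= threshold) == bool(y) for y, p in zip(labels, probs)) / len(labels)

labels = [1, 1, 0, 0]
hesitant = [0.6, 0.6, 0.4, 0.4]       # correct but unsure
confident = [0.95, 0.95, 0.05, 0.05]  # same decisions, higher confidence

print(accuracy(labels, hesitant), accuracy(labels, confident))  # 1.0 1.0
print(round(mean_bce(labels, hesitant), 3))   # 0.511
print(round(mean_bce(labels, confident), 3))  # 0.051
```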

In ranking and search applications, the gap is even larger. Metrics like NDCG and mean average precision involve sorting and rank cutoffs that cannot be differentiated directly, so practitioners minimize surrogate losses (such as pairwise ranking loss or list-wise approximations) that correlate with the metric of interest. Choosing the surrogate is itself a research problem, and progress on a surrogate does not always translate into progress on the metric.

Loss in language models

For large language models, the dominant training loss is token-level cross-entropy under a next-token prediction objective. Given a sequence of tokens x_1, x_2, ..., x_T, the model produces a probability distribution over the vocabulary at each position, and the loss for the sequence is:

L = -(1/T) Σ_{t=1}^{T} log p_θ(x_t | x_1, ..., x_{t-1})

This is the negative log-likelihood of the observed sequence under the model. Minimizing this loss is mathematically equivalent to maximum likelihood estimation. A loss of zero would mean the model assigns probability 1 to every token in the training corpus, which is impossible for real natural language.

During training, transformer language models use teacher forcing: at each position, the input includes the ground-truth previous tokens rather than the model's own past predictions. This allows efficient parallel computation of the loss across all positions in a sequence at once. At inference time, however, the model decodes autoregressively from its own outputs, which can introduce exposure bias: the model has never seen its own (sometimes incorrect) outputs as input during training, so errors can compound during generation.^[13]

Perplexity

For language models, the loss is often reported as perplexity rather than raw cross-entropy. Perplexity is the exponential of the cross-entropy loss, with the loss measured in nats:

PPL = exp(L)

Intuitively, perplexity is the effective number of equally likely choices the model is uncertain between at each token. A perplexity of 10 means the model is, on average, as confused as if it had to choose uniformly among 10 possibilities. For tokenized text, modern frontier models reach perplexities in the single digits on diverse benchmarks. Perplexity is convenient because it is easier to communicate than nat-valued cross-entropy, but it is a per-token quantity that depends on vocabulary size and tokenizer choice, so it is only directly comparable between systems that share a tokenizer.
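
The link between the sequence loss and perplexity can be checked numerically. The sketch below takes the probabilities a model assigned to each observed token (hypothetical values) and applies the two formulas above.

```python
import math

def sequence_loss(token_probs):
    """Mean negative log-likelihood in nats, given p(x_t | x_<t) for each observed token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(loss_nats):
    return math.exp(loss_nats)

uniform_over_ten = [0.1] * 5  # the model spreads its mass over 10 equally likely tokens
print(perplexity(sequence_loss(uniform_over_ten)))  # approximately 10
```

A model that assigns probability 0.1 to every observed token has a perplexity of 10, matching the "choosing uniformly among 10 possibilities" interpretation.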

Scaling laws

A major empirical finding from research on large models is that the achievable loss follows power laws with respect to model size, dataset size, and compute. Kaplan et al. (2020) showed that for transformer language models, test loss scales as approximately L(N) ∝ N^{-α} with model size N, L(D) ∝ D^{-β} with dataset size D, and similarly with compute C.^[16] These trends span many orders of magnitude.

Hoffmann et al. (2022), the Chinchilla paper, refined these findings and argued that for compute-optimal training, model size and training tokens should be scaled in roughly equal proportion: each doubling of parameters should be paired with a doubling of training tokens.^[17] The original analysis trained more than 400 models ranging from 70 million to 16 billion parameters and concluded that many earlier large models, including GPT-3 and Gopher, were undertrained relative to their parameter count. The Chinchilla model itself, with 70 billion parameters trained on roughly 1.4 trillion tokens, achieved lower loss and better downstream performance than larger models trained on smaller datasets. See scaling laws for more on this topic.

Bits per byte and bits per character

When comparing language models that use different tokenizers, raw cross-entropy and perplexity are hard to compare because each is normalized per token. To make comparisons tokenizer-agnostic, researchers often report bits per byte (BPB) or bits per character (BPC): the total cross-entropy of the text converted to base 2 and divided by the number of bytes or characters in the original text. For reference, Shannon's 1951 experiments estimated the entropy of printed English at roughly 0.6 to 1.3 bits per character, so a model at 1.0 BPB on plain English text sits within that range.
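
The conversion is a matter of changing the logarithm base and the normalizer. The sketch below assumes the input is the summed (not averaged) negative log-likelihood over all tokens of the text, in nats, and that bytes are counted in UTF-8; the function name is hypothetical.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed token-level NLL (in nats) into bits per UTF-8 byte."""
    num_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / num_bytes

text = "hello world"          # 11 bytes in UTF-8
total_nll = 11 * math.log(2)  # hypothetical total: exactly 1 bit per byte
print(bits_per_byte(total_nll, text))  # approximately 1.0
```

Because the denominator is a property of the text rather than of the tokenizer, two models with different vocabularies can be compared on the same footing.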

Regularization and loss

In practice, the quantity minimized during training is often not just the raw data loss. Regularization techniques add penalty terms to the loss to discourage overly complex models:

L1 regularization adds the sum of absolute parameter values, encouraging sparsity.
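L2 regularization (often called weight decay) adds the sum of squared parameter values, discouraging large weights. The two coincide for plain SGD but differ for adaptive optimizers such as Adam, where decoupled weight decay (AdamW) is applied separately from the loss gradient.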
Dropout randomly zeroes out neurons during training, which implicitly regularizes by preventing co-adaptation.
Label smoothing replaces hard one-hot training labels with a softer distribution (typically (1 - ε) on the true class and ε/(K-1) spread over the others), preventing the model from becoming overconfident and improving calibration.^[18]
Mixup and CutMix create training examples by interpolating pairs of inputs and their labels, effectively regularizing the loss by training on a smoothed version of the data distribution.
Spectral norm penalties and gradient penalties add terms based on the model's Lipschitz constant, used in stable GAN training and in some certified-robustness work.

The total loss with regularization can be written as:

Total Loss = Data Loss + λ · Regularization Term

where λ is a hyperparameter that controls the strength of regularization. When multiple regularizers are combined, each typically gets its own coefficient, and tuning these coefficients can have as much impact on final performance as choosing the architecture.
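
As a minimal sketch of the formula above with two regularizers, each with its own coefficient (the function and argument names are hypothetical):

```python
def total_loss(data_loss, weights, l1_coef=0.0, l2_coef=0.0):
    """Data loss plus separately weighted L1 and L2 penalty terms."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return data_loss + l1_coef * l1_term + l2_coef * l2_term

weights = [0.5, -2.0, 0.0]
print(total_loss(1.0, weights, l2_coef=0.01))  # 1.0 + 0.01 * 4.25
print(total_loss(1.0, weights, l1_coef=0.1))   # 1.0 + 0.1 * 2.5
```

When reading a training dashboard, it is worth knowing whether the logged number is the data loss alone or this regularized total, since the two can diverge noticeably for large λ.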

From a Bayesian standpoint, adding L2 regularization corresponds to placing a Gaussian prior on the weights, and adding L1 regularization corresponds to a Laplace prior. Maximum a posteriori (MAP) estimation under these priors yields exactly the same gradient updates as minimizing the regularized loss.

Multi-task and weighted loss

Many real systems optimize more than one objective at once. A self-driving perception network might predict depth, semantic segmentation, and lane lines simultaneously, with each task contributing its own loss term:

L_total = w_1 · L_depth + w_2 · L_segmentation + w_3 · L_lanes

The weights w_i control the relative importance of each task. Picking these by hand is brittle: tasks have different units, scales, and convergence rates, and good values vary across architectures and datasets.

Kendall, Gal, and Cipolla (2018) proposed learning the task weights automatically by treating each task's loss as a Gaussian or Laplacian likelihood with a learned homoscedastic uncertainty σ_i.^[19] Each regression task's loss is weighted by 1 / (2σ_i²), and a log σ_i term is added to the objective so that the model cannot drive every weight to zero simply by inflating σ_i. This made multi-task learning easier to apply on real perception problems, because the network adapts its own task balance instead of relying on hand-tuned ratios.
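
The regression form of this weighting can be sketched as a plain function. Parameterizing each task by s_i = log σ_i² rather than by σ_i directly is an assumption made here for numerical convenience; in a real model the s_i would be learnable parameters updated alongside the network weights rather than fixed floats.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum over tasks of L_i / (2 * sigma_i**2) + log(sigma_i).

    log_vars[i] holds s_i = log(sigma_i**2) for task i.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        precision = math.exp(-s)                   # 1 / sigma_i**2
        total += 0.5 * precision * loss + 0.5 * s  # 0.5 * s equals log(sigma_i)
    return total

# With sigma_i = 1 for both tasks, every weight is 1/2 and the log terms vanish.
print(uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0]))  # 3.0
```

Raising s_i shrinks that task's weight but pays a penalty through the 0.5 · s_i term, which is what keeps the balance from collapsing.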

Other weighting schemes include GradNorm, which equalizes the gradient norms produced by each task, and dynamic weight averaging, which balances the rate of loss decrease across tasks. These methods all reduce the burden of manual tuning and have become standard tools in multi-objective deep learning.

Specialized losses in modern systems

The ecosystem of loss formulations has expanded sharply with the rise of self-supervised learning, generative modeling, and alignment work.

Contrastive losses like InfoNCE, used in SimCLR, MoCo, and CLIP, compare positive pairs (different views of the same image, or matched image-text pairs) to a large set of negatives. The model learns embeddings without labels by pulling positives together and pushing negatives apart.
Reconstruction losses in autoencoders and variational autoencoders include both a per-pixel data term and a KL divergence term that regularizes the latent distribution toward a prior.
Adversarial losses in GANs cast generation as a minimax game between a generator and a discriminator, with several stable variants such as the Wasserstein and hinge formulations.
Reward modeling losses in reinforcement learning from human feedback train a reward model on pairwise preference data, then optimize the language model against this reward with a KL penalty to prevent reward hacking and excessive drift.
Direct preference optimization (DPO) and its variants (IPO, KTO, ORPO) bypass the reward model entirely by deriving a closed-form classification-style loss directly from preference pairs.
Diffusion model losses train a network to predict the noise added to data at each diffusion step, typically using mean squared error in the noise space, with optional weighting that emphasizes signal-to-noise regimes most useful for sample quality.

Each of these formulations is the subject of an active research literature, and choosing the right loss is often among the most consequential decisions when designing a new system.

Custom and surrogate losses

When no standard loss aligns well with the task, practitioners design custom losses. Common motivations include:

Asymmetric costs. In medical screening, missing a positive case is far worse than a false alarm, so the loss can weight false negatives more heavily.
Business metrics. Forecasting demand for a retailer may use a quantile loss or a custom under-/over-prediction penalty that mirrors actual stockout and overstock costs.
Perceptual quality. In image generation and super-resolution, perceptual losses compare features extracted by a pretrained network rather than raw pixels, producing outputs that look more natural to humans.
Soft constraints. Adding a penalty term that grows when the model violates a physical or domain-specific constraint (a soft equality, a non-negativity bound, or a known invariant) lets the optimizer steer toward feasible solutions.
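
The quantile (pinball) loss mentioned under business metrics is a compact example of an asymmetric penalty. The sketch below assumes a target quantile q = 0.9, an illustrative value under which under-prediction (a stockout) costs nine times as much as over-prediction of the same size.

```python
def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss for a single prediction at quantile q (0 < q < 1)."""
    diff = y_true - y_pred
    # diff > 0: the model under-predicted, penalized at rate q.
    # diff < 0: the model over-predicted, penalized at rate (1 - q).
    return max(q * diff, (q - 1.0) * diff)

under = quantile_loss(y_true=100.0, y_pred=90.0)   # predicted too little
over = quantile_loss(y_true=100.0, y_pred=110.0)   # predicted too much
```

Minimizing this loss over a dataset pushes predictions toward the q-th quantile of demand rather than the mean, so the ratio of the two penalty rates can be set to mirror the actual cost ratio.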

When designing a custom loss, three checks are essential. First, the loss must be differentiable (or have well-defined subgradients). Second, it should be tested with extreme inputs (zeros, large values, NaN propagation) so it does not blow up under unusual data. Third, the loss should be checked against the actual evaluation metric on a small benchmark to verify that progress on the loss correlates with progress on the metric. A loss that improves while the metric stagnates is usually a sign of misalignment, sometimes called "goodharting," where the optimizer exploits a feature of the loss that does not reflect the underlying goal.
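
The first two checks are easy to automate. The sketch below uses a hypothetical asymmetric squared error (under-prediction weighted five times more heavily) and compares its hand-written gradient against a central finite difference, then probes it with extreme inputs. The third check depends on the task's evaluation metric and is not shown.

```python
import math

def asymmetric_squared_loss(y_true, y_pred, under_weight=5.0):
    """Squared error that penalizes under-prediction more heavily."""
    error = y_true - y_pred
    weight = under_weight if error > 0 else 1.0
    return weight * error * error

def analytic_grad(y_true, y_pred, under_weight=5.0):
    """Derivative of the loss with respect to y_pred."""
    error = y_true - y_pred
    weight = under_weight if error > 0 else 1.0
    return -2.0 * weight * error

def numeric_grad(loss_fn, y_true, y_pred, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(y_pred)."""
    upper = loss_fn(y_true, y_pred + eps)
    lower = loss_fn(y_true, y_pred - eps)
    return (upper - lower) / (2.0 * eps)

# Check 1: analytic and numeric gradients agree on both sides of the target.
for y_pred in (-3.0, 0.5, 2.0):
    expected = numeric_grad(asymmetric_squared_loss, 1.0, y_pred)
    assert abs(analytic_grad(1.0, y_pred) - expected) < 1e-4

# Check 2: zeros and very large values stay finite, and a NaN input
# surfaces as NaN instead of being silently masked.
assert math.isfinite(asymmetric_squared_loss(0.0, 0.0))
assert math.isfinite(asymmetric_squared_loss(1e100, -1e100))
assert math.isnan(asymmetric_squared_loss(float("nan"), 1.0))
```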

Loss in PyTorch and TensorFlow

Most deep learning practitioners interact with loss functions through framework-provided APIs. A few of the most common follow:

Framework	Class or function	Purpose
PyTorch	`torch.nn.MSELoss`	Mean squared error for regression
PyTorch	`torch.nn.L1Loss`	Mean absolute error for regression
PyTorch	`torch.nn.HuberLoss`	Huber loss for robust regression
PyTorch	`torch.nn.CrossEntropyLoss`	Combined log-softmax and negative log-likelihood for multi-class classification
PyTorch	`torch.nn.BCELoss` / `BCEWithLogitsLoss`	Binary cross-entropy, with the logits version preferred for numerical stability
PyTorch	`torch.nn.NLLLoss`	Negative log-likelihood; pair with `log_softmax` outputs
PyTorch	`torch.nn.KLDivLoss`	Kullback-Leibler divergence
PyTorch	`torch.nn.CTCLoss`	Connectionist temporal classification for sequence labeling
TensorFlow / Keras	`tf.keras.losses.MeanSquaredError`	Mean squared error
TensorFlow / Keras	`tf.keras.losses.SparseCategoricalCrossentropy`	Cross-entropy with integer-encoded class labels
TensorFlow / Keras	`tf.keras.losses.BinaryCrossentropy`	Binary cross-entropy
TensorFlow / Keras	`tf.keras.losses.Huber`	Huber loss
JAX / Optax	`optax.softmax_cross_entropy`	Functional cross-entropy on logits

A recurring source of bugs is using torch.nn.BCELoss on outputs that have not yet passed through a sigmoid activation, or applying torch.nn.CrossEntropyLoss to outputs that have already passed through softmax or log_softmax (it expects raw logits and applies log-softmax internally). The "with logits" variants combine the activation and the loss in a numerically stable way and are usually the right default.
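
The stability argument can be seen without any framework. The sketch below is plain Python, not the PyTorch implementation: it compares a naive sigmoid-then-log binary cross-entropy with the standard rearrangement max(x, 0) - x·y + log(1 + e^(-|x|)) that "with logits" losses are generally based on.

```python
import math

def bce_naive(logit, target):
    """Sigmoid followed by log: breaks once the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Algebraically equivalent form that never takes log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For moderate logits the two forms agree.
moderate_gap = abs(bce_naive(0.3, 1.0) - bce_with_logits(0.3, 1.0))

# A confident wrong prediction: the sigmoid rounds to exactly 1.0 in
# floating point, so the naive form attempts log(0).
try:
    bce_naive(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_value = bce_with_logits(40.0, 0.0)
```

In plain Python the naive version raises an error; in an array framework the same computation typically shows up as an infinite or NaN loss, which is harder to trace back to its cause.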

Practical advice for working with loss

Across years of practice, several rules of thumb have emerged:

Always sanity-check the loss at step zero. A randomly initialized classifier trained with cross-entropy should produce a loss close to ln(C) for C classes, because its predictions are roughly uniform. If the initial loss is far from this baseline, the loss function, the labels, or the initialization is likely wrong.
Overfit a single batch first. Before launching a full training run, train on one tiny batch repeatedly and confirm the loss drops to nearly zero. Failure to overfit a batch indicates a bug in the model, the loss, or the data pipeline.
Prefer numerically stable variants. Use the "with logits" form of cross-entropy and binary cross-entropy when available. Combine softmax and log into a single log_softmax operation rather than computing them separately.
Average, do not sum. Unless a sum is intentional, divide losses by the batch size so the gradient magnitude is independent of how many examples happen to land in a batch.
Track per-component losses in multi-loss models. If the total loss is a sum of several terms, log each one separately. A drop in total loss can hide a regression in a single component.
Be patient with validation noise. A small uptick in validation loss is often noise. Use the patience parameter in early stopping rather than reacting to a single bad epoch.
Document the loss recipe. Custom weights, regularization strengths, and label smoothing factors should be saved alongside model checkpoints, because they are part of the model and will be needed to reproduce results.
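
The step-zero check and the log-softmax advice from the list above can be combined in one small sketch. It assumes a 10-class problem and models an untrained classifier as one that emits identical logits for every class, which is an idealization of random initialization.

```python
import math

def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy computed through a single stable log-softmax."""
    m = max(logits)
    # Subtracting the max keeps exp() from overflowing on large logits.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

num_classes = 10
uniform_logits = [0.0] * num_classes
initial_loss = cross_entropy_from_logits(uniform_logits, target_index=3)

# An uninformed classifier should sit at ln(C).
assert abs(initial_loss - math.log(num_classes)) < 1e-12

# Large logits would overflow a separate softmax-then-log computation,
# but the fused form handles them.
confident_loss = cross_entropy_from_logits([1000.0, 0.0], target_index=0)
```

A real network will not land exactly on ln(C), but a value several times larger or smaller at step zero is a strong hint that labels, logits, or reduction settings are mismatched.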

Explain Like I'm 5 (ELI5)

Imagine you are learning to throw a ball into a bucket. Every time you throw, someone tells you how far the ball landed from the bucket. That distance is the "loss." If the ball lands right in the bucket, your loss is zero. If it lands far away, your loss is big.

Your goal is to practice throwing until the loss gets as small as possible. Each time, you adjust how hard you throw and at what angle based on how far off you were last time. That adjustment is exactly what a computer does when it trains a model: it looks at the loss and tweaks its settings to do better next time.

Sometimes you practice with your friends watching (that is like "validation"). If you do great when practicing alone but poorly when your friends watch, it means you are only good at one specific setup and have not really learned. In machine learning, that is called overfitting.

References

Raschka, Sebastian. "What is the difference between a cost function and a loss function in machine learning?" sebastianraschka.com.
PyTorch documentation. "Automatic Differentiation with torch.autograd." docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html.
Brownlee, Jason. "How to use Learning Curves to Diagnose Machine Learning Model Performance." MachineLearningMastery.com, 2019.
"Formula for the smoothing algorithm." Weights & Biases documentation. docs.wandb.ai/support/formula_smoothing_algorithm.
Brownlee, Jason. "Use Early Stopping to Halt the Training of Neural Networks At the Right Time." MachineLearningMastery.com, 2018.
Takase, Sho; Kiyono, Shun; Kobayashi, Sosuke; Suzuki, Jun. "Spike No More: Stabilizing the Pre-training of Large Language Models." arXiv:2312.16903.
"Common Causes of NaNs During Training." Baeldung on Computer Science. baeldung.com.
Li, Hao; Xu, Zheng; Taylor, Gavin; Studer, Christoph; Goldstein, Tom. "Visualizing the Loss Landscape of Neural Nets." NeurIPS 2018. arXiv:1712.09913.
"Loss Functions and Metrics in Deep Learning." arXiv:2307.02694.
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. *Deep Learning*. MIT Press, 2016.
"Linear regression: Loss." Google Machine Learning Crash Course. developers.google.com.
Na, Youngjin. "Understanding the Difference Between Loss and Accuracy." Medium, 2023.
Bachmann, Gregor; Nagarajan, Vaishnavh. "The Pitfalls of Next-Token Prediction." arXiv:2403.06963.
Foret, Pierre; Kleiner, Ariel; Mobahi, Hossein; Neyshabur, Behnam. "Sharpness-Aware Minimization for Efficiently Improving Generalization." ICLR 2021. arXiv:2010.01412.
Dauphin, Yann; Pascanu, Razvan; Gulcehre, Caglar; Cho, Kyunghyun; Ganguli, Surya; Bengio, Yoshua. "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization." NeurIPS 2014. arXiv:1406.2572.
Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020.
Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. "Training Compute-Optimal Large Language Models" (Chinchilla). arXiv:2203.15556, 2022.
Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Zbigniew. "Rethinking the Inception Architecture for Computer Vision." CVPR 2016 (introduced label smoothing).
Kendall, Alex; Gal, Yarin; Cipolla, Roberto. "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics." CVPR 2018. arXiv:1705.07115.
Vapnik, Vladimir. *The Nature of Statistical Learning Theory*. Springer, 1995. (Foundational reference for empirical risk minimization.)
"Training and Validation Loss in Deep Learning." GeeksforGeeks. geeksforgeeks.org.
