See also: Machine learning terms
In machine learning, a loss function (also called a cost function or objective function) is a mathematical formula that measures the difference between a model's predicted output and the actual target value. During training, the model adjusts its parameters to minimize this function. The choice of loss function directly shapes what the model learns: it defines what "good" and "bad" predictions look like in numerical terms. Two models with identical architectures trained on the same data can behave very differently if they optimize different loss functions.
The terms "loss function," "cost function," and "objective function" are related but carry slightly different meanings in practice.
| Term | Scope | Definition |
|---|---|---|
| Loss function | Single example | Measures the error for one training example. For instance, the squared error for a single prediction. |
| Cost function | Entire dataset | The average (or sum) of individual losses across all training examples. This is the quantity actually minimized during training. |
| Objective function | Training process | The most general term. It includes the cost function plus any additional terms such as regularization penalties. The objective function is what the optimizer directly minimizes. |
In practice, most practitioners use these terms interchangeably. Research papers sometimes distinguish them precisely, but colloquial usage rarely does.
Loss functions sit at the center of the training loop. At each step, the model produces a prediction, the loss function scores that prediction against the ground truth, and an optimizer (such as stochastic gradient descent or Adam) uses the loss's gradient to update the model's weights. This cycle repeats for thousands or millions of iterations.
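As an illustration of this cycle, the sketch below trains a one-variable linear model with MSE using plain NumPy gradient descent. The synthetic data, learning rate, and step count are arbitrary choices for the example, not a recommended recipe.

```python
import numpy as np

# Toy data: y = 3x + 2 plus noise (synthetic, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # model parameters
lr = 0.1          # learning rate

for step in range(500):
    y_hat = w * x + b                        # 1. predict
    loss = np.mean((y - y_hat) ** 2)         # 2. score with the loss (MSE)
    grad_w = -2 * np.mean((y - y_hat) * x)   # 3. gradient of the loss
    grad_b = -2 * np.mean(y - y_hat)
    w -= lr * grad_w                         # 4. update the weights
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}, final MSE={loss:.4f}")
```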
Because the optimizer relies on the gradient of the loss, the function needs to be differentiable (or at least sub-differentiable) with respect to the model's parameters. Functions that are flat over large regions give vanishing gradients, making training slow. Functions with very sharp, narrow minima can make training unstable. A well-chosen loss creates a smooth optimization landscape that guides the model toward generalizable solutions.
Not every mathematical function works well as a training objective. Several properties make a loss function practical and effective.
Differentiability. Gradient descent and its variants require computing partial derivatives of the loss with respect to every model parameter. If the loss is not differentiable at certain points, the optimizer cannot compute a gradient there. Some losses (like MAE) are not differentiable everywhere but have well-defined subgradients that work in practice. Smooth losses such as MSE and cross-entropy produce clean gradient signals throughout the parameter space.
Convexity. A convex loss function has a single global minimum and no local minima, guaranteeing that gradient descent converges to the best solution. MSE for linear models and logistic loss for logistic regression are convex. When a convex loss is paired with a neural network, the overall optimization problem becomes non-convex because of the network's nonlinear activations, but the convexity of the loss itself still contributes to better-behaved gradients.
Bounded below. The loss should have a finite lower bound (typically zero) so the optimizer has a clear target and training does not diverge toward negative infinity.
Sensitivity to prediction errors. The loss should increase meaningfully when predictions worsen. A loss that changes very slowly in response to errors provides weak training signal, making learning inefficient.
Alignment with the evaluation metric. Ideally, minimizing the loss during training should correlate with improving the metric used to evaluate the model (such as accuracy, F1-score, or BLEU). Mismatches between the training loss and evaluation metric can cause models to optimize one objective while performing poorly on the other.
The table below summarizes the most widely used loss functions across different problem types.
| Loss function | Formula | Problem type | When to use |
|---|---|---|---|
| Mean Squared Error (MSE / L2) | L = (1/n) * sum((y - y_hat)^2) | Regression | Default for regression; penalizes large errors heavily |
| Mean Absolute Error (MAE / L1) | L = (1/n) * sum(abs(y - y_hat)) | Regression | When outlier robustness matters more than penalizing large errors |
| Huber Loss | Quadratic for small errors, linear for large errors | Regression | Balances MSE sensitivity with MAE robustness; delta controls the transition |
| Log-Cosh Loss | L = (1/n) * sum(log(cosh(y - y_hat))) | Regression | Smooth alternative to Huber; twice differentiable everywhere |
| Quantile Loss | L = max(q*(y - y_hat), (q-1)*(y - y_hat)) | Regression | Prediction intervals; asymmetric penalties for over/under-prediction |
| Cross-entropy (Log Loss) | L = -sum(y * log(y_hat)) | Multi-class classification | Default for classification with softmax output |
| Binary Cross-Entropy | L = -(y*log(y_hat) + (1-y)*log(1-y_hat)) | Binary classification | Two-class problems with sigmoid output |
| Focal Loss | L = -alpha*(1-p_t)^gamma * log(p_t) | Classification (imbalanced) | Imbalanced datasets; down-weights easy examples |
| Hinge Loss | L = max(0, 1 - y*y_hat) | Classification | Support vector machines; encourages margin-based separation |
| Squared Hinge Loss | L = max(0, 1 - y*y_hat)^2 | Classification | Smoother alternative to hinge; differentiable everywhere |
| KL Divergence | D_KL(P || Q) = sum(P(x)*log(P(x)/Q(x))) | Distribution matching | VAEs, knowledge distillation, comparing distributions |
| Contrastive Loss | L = (1-Y)*D^2 + Y*max(0, margin-D)^2 | Metric learning | Siamese networks; pulls similar pairs together |
| Triplet Loss | L = max(0, d(a,p) - d(a,n) + margin) | Metric learning | Learning embeddings for face recognition and retrieval |
| InfoNCE Loss | L = -log(exp(sim(a,p)/tau) / sum(exp(sim(a,n_i)/tau))) | Contrastive learning | Self-supervised learning (CLIP, SimCLR, MoCo) |
| CTC Loss | Marginalizes over all valid alignments | Sequence labeling | Speech recognition, handwriting recognition |
| Dice Loss | L = 1 - 2*sum(y*y_hat) / (sum(y) + sum(y_hat)) | Segmentation | Medical image segmentation with class imbalance |
| Wasserstein Loss | L = E[D(x_real)] - E[D(x_fake)] | Generative models | WGANs; stable GAN training |
Mean Squared Error squares each residual before averaging, which means large errors get disproportionately penalized. This property makes MSE sensitive to outliers: a single data point with a large error can dominate the loss and pull the model toward fitting that outlier. On the other hand, the squaring operation produces a smooth, convex surface that is easy to optimize with gradient-based methods. MSE is the default loss for most regression tasks and corresponds to maximum likelihood estimation under the assumption of Gaussian noise.
The gradient of MSE with respect to predictions is proportional to the residual (y - y_hat), which means that larger errors produce larger gradients, speeding up correction of big mistakes.
Mean Absolute Error uses absolute values instead of squares, treating all errors linearly. A prediction that is off by 10 units contributes exactly 10 times as much loss as one that is off by 1 unit (compared to 100 times under MSE). This makes MAE more robust to outliers. The downside is that MAE is not differentiable at zero, which can cause small numerical issues during optimization, though in practice subgradient methods handle this without trouble. MAE corresponds to maximum likelihood estimation under the assumption of Laplace-distributed noise.
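A quick numerical comparison (with invented values) makes the difference concrete: a single outlier dominates MSE while contributing only one linear term to MAE.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 2.1, 2.9, 4.2, 5.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

# The outlier's squared residual (95^2 = 9025) dominates MSE,
# while its absolute residual (95) is just one term of five in MAE.
print(f"MSE = {mse:.2f}")   # ~1805.01
print(f"MAE = {mae:.2f}")   # ~19.10
```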
Proposed by Peter Huber in 1964, the Huber loss is a hybrid. For small residuals (below a threshold delta), it behaves like MSE, giving smooth gradients. For large residuals (above delta), it switches to a linear penalty like MAE, preventing outliers from dominating. The delta parameter lets practitioners tune the transition point. Huber loss is commonly used in reinforcement learning (for instance, in the DQN algorithm) and in robust regression tasks.
The Huber loss is convex and differentiable everywhere (unlike MAE), but its second derivative is discontinuous at the transition point where it switches between quadratic and linear behavior.
Log-cosh loss computes the logarithm of the hyperbolic cosine of the prediction error: L = (1/n) * sum(log(cosh(y_i - y_hat_i))). For small errors, log(cosh(x)) approximates x^2 / 2, behaving like MSE. For large errors, it approximates abs(x) - log(2), behaving like MAE. This means log-cosh naturally blends MSE-like behavior near zero with MAE-like robustness for outliers, similar to Huber loss.
The key advantage of log-cosh over Huber loss is that it is twice differentiable everywhere, producing a fully smooth gradient landscape with no discontinuities in any derivative. This can improve convergence with second-order optimizers or in settings where smooth gradient flow is important. Unlike Huber loss, log-cosh does not require choosing a delta hyperparameter; the transition between quadratic and linear behavior happens automatically based on the magnitude of the error.
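A minimal NumPy sketch of both losses, with delta set arbitrarily to 1.0:

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = y - y_hat
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

def log_cosh(y, y_hat):
    """Smooth Huber-like loss with no delta hyperparameter."""
    return np.mean(np.log(np.cosh(y - y_hat)))

y = np.array([0.0, 0.0, 0.0])
y_hat = np.array([0.1, 1.0, 10.0])   # small, medium, and large errors
print(huber(y, y_hat), log_cosh(y, y_hat))
```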
Quantile loss (also called pinball loss) is used in quantile regression, where the goal is to predict a specific quantile of the target distribution rather than the mean. For a target quantile q (between 0 and 1), the loss is:
L = max(q * (y - y_hat), (q - 1) * (y - y_hat))
When q = 0.5, quantile loss reduces to half the mean absolute error, so minimizing it predicts the median. Setting q = 0.9 produces a loss that penalizes under-predictions (where y > y_hat) nine times more heavily than over-predictions, pushing the model to predict values near the 90th percentile.
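A minimal sketch of the pinball loss in NumPy, with invented values:

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Pinball loss: asymmetric penalty controlled by the target quantile q."""
    e = y - y_hat
    return np.mean(np.maximum(q * e, (q - 1) * e))

y = np.array([10.0, 12.0, 9.0])
y_hat = np.array([11.0, 11.0, 11.0])

print(quantile_loss(y, y_hat, q=0.5))  # 0.5 * MAE: minimizing it fits the median
print(quantile_loss(y, y_hat, q=0.9))  # under-predictions cost 9x more
```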
Quantile loss is widely used in probabilistic forecasting to construct prediction intervals. A common approach trains two models (or two output heads): one at q = 0.05 and one at q = 0.95, producing a 90% prediction interval. This is especially valuable in financial risk management, supply chain planning, and energy demand forecasting, where understanding the range of possible outcomes matters more than a single point estimate.
Cross-entropy measures the difference between two probability distributions: the true label distribution and the model's predicted distribution. For a multi-class problem with C classes, the loss for a single example is L = -sum over c of y_c * log(y_hat_c), where y_c is 1 for the correct class and 0 otherwise, and y_hat_c is the predicted probability for class c. Minimizing cross-entropy is mathematically equivalent to maximizing the log-likelihood of the correct class, and it is also equivalent to minimizing the KL divergence between the true and predicted distributions (since the entropy of one-hot labels is zero).
Cross-entropy strongly penalizes confident wrong predictions. If the model assigns near-zero probability to the true class, the log term sends the loss toward infinity. This behavior pushes the model to assign high probability to correct answers.
Binary cross-entropy (BCE) is the special case for two-class problems. The model outputs a single probability p through a sigmoid activation, and the loss is L = -(y * log(p) + (1-y) * log(1-p)). BCE is also used in multi-label classification problems, where each label is treated as an independent binary prediction.
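A minimal BCE sketch in NumPy; the clipping constant eps is an arbitrary guard against log(0):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE over a batch; clipping keeps log() finite for p near 0 or 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y, p))   # confident correct predictions -> low loss
```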
Introduced by Lin et al. in 2017 for object detection (in the RetinaNet paper), focal loss modifies cross-entropy by adding a modulating factor (1 - p_t)^gamma, where p_t is the predicted probability for the true class. When gamma > 0, well-classified examples (high p_t) contribute very little loss, while misclassified examples (low p_t) contribute much more. This focuses training on hard examples and is particularly effective for datasets with extreme class imbalance, such as object detection where background examples vastly outnumber foreground objects. A gamma value of 2 is commonly used. The alpha parameter provides additional class-level weighting.
Focal loss has since been adopted far beyond object detection. It appears in medical imaging, natural language processing, and any classification setting with severe imbalance.
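The binary form can be sketched in a few lines of NumPy; gamma = 2 and alpha = 0.25 follow the commonly cited RetinaNet defaults:

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 1, 0])
p = np.array([0.95, 0.3, 0.05])   # easy positive, hard positive, easy negative
# Well-classified examples (p_t near 1) are heavily down-weighted.
print(focal_loss(y, p))
```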
Hinge loss is the standard loss for support vector machines. For a binary classification with labels in {-1, +1}, the loss is L = max(0, 1 - y * f(x)), where f(x) is the raw model output (not a probability). Hinge loss is zero when the prediction has the correct sign and a magnitude of at least 1 (meaning it falls on the correct side of the margin). Predictions within or on the wrong side of the margin incur a linear penalty. This margin-based formulation is what gives SVMs their maximum-margin property.
Hinge loss is not differentiable at the point where y * f(x) = 1, but subgradient methods handle this effectively.
Squared hinge loss replaces the linear penalty of standard hinge loss with a quadratic one: L = max(0, 1 - y * f(x))^2. This modification has two practical benefits. First, the function is differentiable everywhere (including at the margin boundary), which gives cleaner gradient signals for optimization. Second, the quadratic penalty punishes large margin violations more aggressively than the linear version, which can improve performance on some datasets.
The tradeoff is that squared hinge loss is more sensitive to outliers and misclassified points than standard hinge loss, since errors beyond the margin grow quadratically rather than linearly.
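Both variants are one-liners; the example scores (invented) show a correct prediction outside the margin, one inside it, and one on the wrong side:

```python
import numpy as np

def hinge(y, f):
    """Standard hinge loss; labels y in {-1, +1}, f is the raw model score."""
    return np.mean(np.maximum(0.0, 1.0 - y * f))

def squared_hinge(y, f):
    """Quadratic penalty: smoother near the margin, harsher on big violations."""
    return np.mean(np.maximum(0.0, 1.0 - y * f) ** 2)

y = np.array([+1, -1, +1])
f = np.array([2.0, -0.5, -1.0])   # outside margin, inside margin, wrong side
print(hinge(y, f), squared_hinge(y, f))   # 0.8333..., 1.4166...
```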
Kullback-Leibler divergence measures how one probability distribution Q diverges from a reference distribution P. It is defined as D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). KL divergence is not symmetric: D_KL(P || Q) is generally not equal to D_KL(Q || P). It is always non-negative and equals zero only when P and Q are identical.
In machine learning, KL divergence appears in several places. In variational autoencoders (VAEs), the loss includes a KL term that regularizes the learned latent distribution to stay close to a prior (typically a standard Gaussian). In knowledge distillation, KL divergence measures how closely the student model's output distribution matches the teacher model's. It is also used in policy optimization methods in reinforcement learning to prevent the updated policy from deviating too far from the old policy.
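A direct NumPy translation of the definition, with a small eps guard (an arbitrary choice) against log(0):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # note the asymmetry:
print(kl_divergence(q, p))   # generally a different value
```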
Contrastive loss, introduced by Hadsell, Chopra, and LeCun in 2006, operates on pairs of examples. Given two inputs and a label indicating whether they are similar (Y=0) or dissimilar (Y=1), the loss pulls similar pairs closer in embedding space and pushes dissimilar pairs apart (up to a margin). The formula is:
L = (1-Y) * D^2 + Y * max(0, margin - D)^2
where D is the Euclidean distance between the two embeddings. Contrastive loss is widely used in Siamese networks for tasks like signature verification and face matching.
Triplet loss, popularized by Schroff et al. in the FaceNet paper (2015), works with triplets: an anchor, a positive example (same class), and a negative example (different class). The loss is L = max(0, d(anchor, positive) - d(anchor, negative) + margin). It encourages the distance from the anchor to the positive to be smaller than the distance to the negative by at least the margin.
A practical challenge with triplet loss is mining informative triplets. Random triplets often produce zero loss and contribute nothing to learning. Hard negative mining, where the negative is chosen to be close to the anchor, significantly improves training efficiency. Semi-hard mining (selecting negatives that are farther than the positive but still within the margin) provides a useful middle ground that avoids collapsed training.
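A sketch for a single triplet; note how an "easy" triplet (negative already far away) yields exactly zero loss, illustrating the mining problem described above:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same identity: close to the anchor
n = np.array([1.0, 1.0])   # different identity: far away
print(triplet_loss(a, p, n))   # 0.0 -> an easy triplet contributes nothing
```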
InfoNCE (information noise-contrastive estimation) loss, introduced by van den Oord et al. in 2018 as part of the Contrastive Predictive Coding (CPC) framework, generalizes contrastive learning to work with large batches of negative examples. Given an anchor and one positive example drawn from a batch of N samples, the loss is:
L = -log(exp(sim(anchor, positive) / tau) / sum_i(exp(sim(anchor, sample_i) / tau)))
where sim() is typically cosine similarity and tau is a temperature hyperparameter that controls the sharpness of the distribution. InfoNCE is essentially a softmax cross-entropy over the similarity scores, treating the positive pair as the correct "class" among all N candidates.
InfoNCE became the foundation for many modern self-supervised learning methods. SimCLR (Chen et al., 2020) applies it to augmented image pairs. MoCo (He et al., 2020) uses a momentum encoder to maintain a large queue of negatives. CLIP (Radford et al., 2021) extends it to image-text pairs, computing a symmetric InfoNCE loss over all possible pairings in a batch. The temperature parameter tau is critical: too low and the loss focuses exclusively on the hardest negatives, too high and the contrastive signal becomes weak.
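A single-anchor sketch in NumPy; the embeddings are random and tau = 0.07 is a commonly used but arbitrary temperature:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE for one anchor: softmax cross-entropy over similarity scores,
    treating the positive pair as the correct class."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / tau
    # -log(exp(s_pos) / sum(exp(s_i))) == -s_pos + logsumexp(s)
    return -sims[0] + np.log(np.sum(np.exp(sims)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)   # augmented "view" of the anchor
negatives = [rng.normal(size=8) for _ in range(5)]
print(info_nce(anchor, positive, negatives))
```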
Connectionist Temporal Classification (CTC) loss, introduced by Graves et al. in 2006, solves a fundamental problem in sequence-to-sequence tasks: how to train a model when the alignment between input and output is unknown. In speech recognition, for example, a 3-second audio clip might correspond to the word "hello," but there is no label telling the model which millisecond corresponds to which letter.
CTC addresses this by introducing a special blank token and marginalizing over all possible alignments between the input sequence and the output label. The model predicts a probability distribution over the vocabulary (including the blank token) at each time step, and CTC computes the total probability of the correct output by summing over every valid alignment path. The loss is the negative log of this total probability.
CTC was originally developed for handwriting and phoneme recognition, and it remains a core component in modern automatic speech recognition systems. It is used in architectures like DeepSpeech and as one training objective in hybrid speech models.
Dice loss, introduced by Milletari et al. in 2016 in the V-Net paper, is based on the Sørensen–Dice coefficient, a set similarity measure. For binary segmentation, the Dice loss is:
L = 1 - (2 * sum(y * y_hat) + epsilon) / (sum(y) + sum(y_hat) + epsilon)
where y is the ground-truth mask, y_hat is the predicted mask, and epsilon is a small constant for numerical stability. Dice loss directly optimizes the overlap between predicted and ground-truth regions, making it particularly effective for image segmentation tasks where the foreground class occupies a small fraction of the image.
In medical image segmentation, Dice loss has become the standard choice because it handles the severe class imbalance that arises when a tumor or organ occupies only a few percent of the total image volume. It is often combined with cross-entropy in a weighted sum to get both the overlap optimization from Dice and the pixel-wise probability calibration from cross-entropy.
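A minimal NumPy sketch for a flattened binary mask; eps is an arbitrary stability constant:

```python
import numpy as np

def dice_loss(y, y_hat, eps=1e-6):
    """1 - Dice coefficient; eps keeps the ratio defined on empty masks."""
    intersection = np.sum(y * y_hat)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y) + np.sum(y_hat) + eps)

y = np.array([0, 0, 1, 1, 0], dtype=float)     # ground-truth mask (flattened)
y_hat = np.array([0.1, 0.0, 0.9, 0.6, 0.2])    # predicted probabilities
print(dice_loss(y, y_hat))   # small when predicted and true regions overlap
```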
Wasserstein loss is the training objective used in Wasserstein Generative Adversarial Networks (WGANs), proposed by Arjovsky et al. in 2017. Standard GANs use a minimax loss based on the Jensen-Shannon divergence between the real and generated data distributions, which can cause training instability and mode collapse. The Wasserstein loss replaces this with the Earth Mover's Distance (also called the Wasserstein-1 distance), which measures the minimum "cost" of transforming one probability distribution into another.
In practice, the WGAN critic (replacing the discriminator) outputs a real-valued score rather than a probability. Written as quantities to minimize, the two losses are:

L_critic = E[D(x_fake)] - E[D(x_real)]

L_generator = -E[D(x_fake)]

so the critic is trained to maximize E[D(x_real)] - E[D(x_fake)], an estimate of the Wasserstein distance between the real and generated distributions.
The Wasserstein distance provides meaningful gradients even when the real and generated distributions do not overlap, which is a common early-training scenario where standard GAN loss gives zero or uninformative gradients. To ensure the critic satisfies the Lipschitz constraint required by the Wasserstein distance, the original paper uses weight clipping. The improved WGAN-GP variant (Gulrajani et al., 2017) replaces weight clipping with a gradient penalty term.
The loss function defines the optimization landscape. Every set of model weights corresponds to a point on this landscape, and the loss value at that point is the "altitude." Gradient-based optimizers attempt to walk downhill on this surface.
The shape of the landscape matters enormously. Convex loss functions (like MSE for linear models) have a single global minimum that gradient descent will find. Non-convex losses (common with deep neural networks) have many local minima, saddle points, and flat regions.
Research by Dauphin et al. (2014) demonstrated that in high-dimensional parameter spaces, saddle points are far more common than local minima. With millions or billions of parameters, most critical points (where the gradient is zero) are saddle points rather than true minima. A saddle point is a minimum in some directions and a maximum in others. Near saddle points, vanilla gradient descent stalls because gradients are nearly zero in all directions.
Modern optimizers like Adam, AdaGrad, and RMSProp use adaptive learning rates to navigate these complex landscapes more effectively. Momentum-based methods help escape saddle points by accumulating velocity in consistent gradient directions, allowing the optimizer to "roll through" flat regions rather than stopping at them.
Not all minima are created equal. Research has shown that flat minima (wide basins in the loss landscape) tend to generalize better than sharp minima (narrow valleys). Models that converge to sharp minima may achieve very low training loss but perform poorly on unseen data, because small perturbations to the weights cause large changes in loss. Techniques like reducing the batch size, stochastic weight averaging, and sharpness-aware minimization (SAM) are designed to steer optimization toward flatter regions.
The loss function also determines the gradients that flow through the network during backpropagation. Some loss functions produce more informative gradients than others. For example, cross-entropy combined with softmax produces a gradient that is simply (predicted - true), which is well-behaved and informative. MSE combined with sigmoid, by contrast, can produce very small gradients when the output is near 0 or 1, slowing down learning.
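The softmax-plus-cross-entropy gradient identity can be verified in a few lines; the logits here are invented:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])   # one-hot true label

probs = softmax(logits)
# Gradient of cross-entropy w.r.t. the logits is simply (predicted - true):
grad = probs - y
print(grad)
```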
In practice, the total loss function often includes regularization terms that penalize model complexity, reducing overfitting. The two most common forms are L1 and L2 regularization.
L1 regularization (also called Lasso) adds the sum of the absolute values of the weights to the loss: L_total = L_data + lambda * sum(abs(w)). L1 tends to push weights to exactly zero, producing sparse models. This acts as a form of automatic feature selection, since features with zero-weight coefficients are effectively ignored.
L2 regularization (also called Ridge, or weight decay in deep learning) adds the sum of the squared weights: L_total = L_data + lambda * sum(w^2). L2 penalizes large weights but rarely drives them to exactly zero. Instead, it encourages many small weights spread across features. L2 regularization corresponds to a Gaussian prior on the weights in a Bayesian interpretation.
The hyperparameter lambda controls the strength of regularization. Too much regularization constrains the model excessively (underfitting); too little provides no benefit.
Elastic Net combines L1 and L2 regularization, adding both penalty terms with separate coefficients. This can be useful when there are correlated features, as L1 alone might arbitrarily select one feature from a correlated group.
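A sketch showing how the penalties combine with a data loss; the lambda values are arbitrary:

```python
import numpy as np

def regularized_loss(data_loss, w, lam_l1=0.0, lam_l2=0.0):
    """Adds L1 and/or L2 penalties on the weights to a precomputed data loss."""
    return data_loss + lam_l1 * np.sum(np.abs(w)) + lam_l2 * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(regularized_loss(1.25, w, lam_l1=0.01))               # Lasso-style
print(regularized_loss(1.25, w, lam_l2=0.01))               # Ridge-style
print(regularized_loss(1.25, w, lam_l1=0.01, lam_l2=0.01))  # Elastic Net-style
```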
Standard loss functions do not always capture what matters for a particular application. In such cases, practitioners design custom losses; a simple example follows the design notes below.
When designing a custom loss, the function must remain differentiable (or have usable subgradients). It should also be tested carefully, since unusual loss surfaces can cause training instability or convergence to poor solutions.
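As a hedged illustration, here is a hypothetical asymmetric regression loss for a setting where under-prediction is costlier than over-prediction (the 3x weight is invented for the example):

```python
import numpy as np

def asymmetric_mse(y, y_hat, under_weight=3.0):
    """Hypothetical custom loss: under-predictions cost 3x more than
    over-predictions (e.g., when a stockout is worse than overstock)."""
    e = y - y_hat
    weights = np.where(e > 0, under_weight, 1.0)   # e > 0: under-prediction
    return np.mean(weights * e ** 2)

y = np.array([10.0, 10.0])
y_hat = np.array([8.0, 12.0])      # one under-, one over-prediction, same size
print(asymmetric_mse(y, y_hat))    # the under-prediction dominates the loss
```

Because the weighting is applied outside the squared term, the function keeps usable gradients everywhere, satisfying the differentiability requirement above.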
The rise of large language models has introduced new loss functions designed to align model behavior with human preferences.
Reinforcement learning from human feedback (RLHF), as used in systems like InstructGPT and ChatGPT, involves training a reward model on human preference data. The reward model is trained with a pairwise ranking loss: given two outputs for the same prompt, the model learns to assign a higher score to the output that humans preferred. The language model is then fine-tuned using PPO (Proximal Policy Optimization) to maximize this reward, with a KL penalty that prevents the model from drifting too far from its original behavior.
Introduced by Rafailov et al. in 2023, Direct Preference Optimization (DPO) bypasses the need for a separate reward model entirely. The key insight is that the optimal policy under the RLHF objective can be expressed in closed form as a function of the language model itself. This allows the preference data to be used directly to optimize the language model with a simple classification-style loss:
L_DPO = -E[log sigmoid(beta * (log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x))))]
Here, y_w is the preferred (winning) response, y_l is the rejected (losing) response, pi_theta is the model being trained, pi_ref is the reference model, and beta controls how much the model can deviate from the reference. DPO is simpler to implement than full RLHF, requires less compute, and avoids the instability of reinforcement learning. It has been widely adopted, with variants like IPO, KTO, and ORPO building on the same principle.
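A sketch of the DPO loss for a single preference pair, assuming sequence-level log-probabilities have already been computed; all numbers and beta = 0.1 are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given log-probs of the winning (w)
    and losing (l) responses under the policy and reference models."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Invented log-probabilities for illustration:
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
```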
Plotting the loss value over training iterations produces a loss curve. Reading loss curves is one of the most practical skills in model development.
A healthy training curve starts high and decreases sharply before gradually leveling off. If you plot both training loss and validation loss, several patterns are informative:
| Pattern | Training loss | Validation loss | Likely diagnosis |
|---|---|---|---|
| Both decreasing together | Falling | Falling | Training is progressing well; model is learning generalizable features |
| Training falls, validation rises | Falling | Rising after initial drop | Overfitting; model memorizes training data. Try regularization, dropout, more data, or early stopping |
| Both remain high | Flat or slowly falling | Flat or slowly falling | Underfitting; model lacks capacity. Try a larger model, different architecture, or lower regularization |
| Loss oscillates wildly | Jumping up and down | Jumping up and down | Learning rate too high; reduce it or use a learning rate scheduler |
| Loss drops then plateaus | Flat after initial drop | Flat after initial drop | Possible local minimum or saddle point; try a different optimizer, learning rate warm-up, or restart |
| Validation loss is noisy | Smooth | Noisy or spiky | Batch size may be too small; increase it or use gradient accumulation |
Loss curves can also reveal data issues. A sudden jump in validation loss might indicate corrupted batches or a problem in the data pipeline. If training loss drops to near zero very quickly, the task might be too easy or there could be label leakage.
Selecting a loss function is one of the first decisions in any machine learning project. The following guidelines cover the most common scenarios:
For regression tasks: Start with MSE. If outliers are a concern, try MAE or Huber loss. If you need a smooth Huber-like alternative without the delta hyperparameter, consider log-cosh loss. For probabilistic forecasts or prediction intervals, use quantile loss.
For binary classification: Binary cross-entropy is the standard choice. If classes are highly imbalanced, consider focal loss or weighting the positive class more heavily in BCE.
For multi-class classification: Categorical cross-entropy with softmax is the default. For multi-label problems (where an example can belong to multiple classes), use binary cross-entropy applied independently to each label.
For ranking and retrieval: Triplet loss or contrastive loss for learning embeddings. InfoNCE loss (used in CLIP and SimCLR) is a popular modern alternative that scales well to large batch sizes.
For image segmentation: Dice loss for class-imbalanced segmentation tasks, especially in medical imaging. Combining Dice with cross-entropy often outperforms either loss alone.
For sequence labeling: CTC loss when alignment between input and output is unknown (speech recognition, handwriting). Standard cross-entropy when alignment is known (part-of-speech tagging).
For generative models: Reconstruction loss (MSE or BCE) plus a KL divergence term for VAEs. Wasserstein loss for stable GAN training. Perceptual loss for high-quality image generation.
For LLM alignment: DPO or its variants for preference-based fine-tuning. RLHF with a reward model and PPO for more flexible alignment objectives.
The loss function should match the evaluation metric as closely as possible. If the final evaluation uses accuracy, a classification loss is appropriate; if it uses BLEU or ROUGE for text generation, consider a loss that correlates with those metrics. Mismatches between the training loss and the evaluation metric can lead to models that optimize well on paper but perform poorly on the actual task.
Imagine you are trying to learn how to throw a ball into a basket. Each time you throw the ball, you can see whether you missed the basket, hit the rim, or got the ball in. You try to change the way you throw the ball to get better results. In machine learning, the loss function is like a scorekeeper that tells you how far your throw was from the basket. A high score means you missed badly; a low score means you were close. The computer uses this score to figure out how to adjust its predictions to get better and better, just like you adjust your throw to get the ball in the basket.
Different loss functions are like different scorekeepers with different rules. One scorekeeper might punish big misses extra harshly (that is MSE). Another treats every miss the same no matter how far off (that is MAE). Picking the right scorekeeper helps the computer learn the right lessons from its mistakes.