See also: Loss function, Validation loss, Learning curve
In machine learning, training loss is the value of the loss function evaluated on the training data during model training. It is the quantity that the optimizer actually pushes downward at each step of gradient descent, and tracking it over iterations or epochs is the primary signal practitioners use to judge whether training is working. A well-behaved training loss curve generally starts high, drops sharply in the first few thousand steps, then settles into a long, gentler decline as the model approaches a minimum of its training objective.
The distinction matters because "loss function" refers to a mathematical formula (such as squared error or cross-entropy), while training loss is the numerical value that formula produces when fed the model's current predictions on the training set. The same loss function generates a different training loss at every step, and that sequence of numbers is what gets logged, plotted, and analyzed.
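A minimal sketch of this distinction, using PyTorch's mean-squared-error helper and made-up numbers: the formula stays fixed, but the value it produces changes as the model's predictions change.

```python
import torch
import torch.nn.functional as F

targets = torch.tensor([1.0, 0.0, 1.0])

# Hypothetical predictions from the same model at two points in training.
early_predictions = torch.tensor([0.2, 0.9, 0.4])
late_predictions = torch.tensor([0.9, 0.1, 0.8])

# Same loss function (MSE), two different training-loss values.
print(F.mse_loss(early_predictions, targets).item())  # ~0.60
print(F.mse_loss(late_predictions, targets).item())   # ~0.02
```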
A machine learning workflow normally tracks loss on three different data partitions, each serving a separate purpose.
| Quantity | Data used | What it measures | When it is computed |
|---|---|---|---|
| Training loss | Training set | How well the model fits the data it is being optimized on | Continuously during training, often per batch and per epoch |
| Validation loss | Held-out validation set | How well the model generalizes to data it has not seen during gradient updates | Periodically during training (every epoch or every N steps) |
| Test loss | Held-out test set | Final, unbiased estimate of generalization error | Once, after training and hyperparameter tuning are finished |
Training loss is the easiest to drive toward zero because the model has direct access to those examples through gradient updates. Validation loss is the diagnostic that reveals whether the model is learning generalizable patterns or just memorizing the training set. Test loss exists to keep the validation set honest: if the validation loss is repeatedly used to select hyperparameters, it begins to leak information into the model and stops being a fair estimate of true generalization.
A common mistake is conflating training loss with the loss function itself. The loss function is a mathematical object. Training loss is one of many values that function takes on as the model evolves.
The choice of loss function depends on the task. The table below summarizes the most common training objectives and where each is used.
| Loss | Typical task | Notes |
|---|---|---|
| Mean squared error (MSE, L2) | Regression | Penalizes large errors quadratically; smooth and easy to optimize |
| Mean absolute error (MAE, L1) | Regression | Robust to outliers; not differentiable at zero |
| Huber loss | Regression | Quadratic for small residuals, linear for large ones; balances MSE and MAE |
| Cross-entropy (categorical) | Multi-class classification | Default loss for softmax outputs |
| Binary cross-entropy | Binary or multi-label classification | Used with sigmoid outputs |
| Hinge loss | Margin-based classification | Standard objective for support vector machines |
| Focal loss | Imbalanced classification | Down-weights easy examples; introduced for dense object detection |
| Triplet loss | Metric learning | Pulls anchor toward positive, pushes from negative |
| Contrastive (InfoNCE) loss | Self-supervised learning | Used in CLIP, SimCLR, MoCo |
| CTC loss | Speech recognition | Marginalizes over alignments between audio and text |
| Reconstruction loss | Autoencoders | Squared error or BCE between input and reconstruction |
| KL divergence | VAEs, knowledge distillation | Measures how one distribution diverges from another; asymmetric, so not a true distance |
| GAN minimax loss | Generative adversarial networks | Discriminator and generator compete |
| Noise prediction (MSE) | Diffusion models | Predicts the noise added at each step (DDPM) |
| Per-token cross-entropy | LLM pretraining | Negative log-likelihood of the next token |
For LLM pretraining, the training loss is almost always the average per-token negative log-likelihood across the batch, which is the same quantity used to compute perplexity. The relationship is simply perplexity = exp(per-token NLL), so a training loss of 2.0 corresponds to a perplexity of about 7.39.
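A short sketch of that relationship in PyTorch, with randomly generated logits and targets standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(4, vocab_size)            # illustrative model outputs for 4 token positions
targets = torch.randint(0, vocab_size, (4,))   # illustrative next-token ids

# Average per-token negative log-likelihood: the quantity LLM loss curves plot.
per_token_nll = F.cross_entropy(logits, targets)

# Perplexity is the exponential of the per-token loss.
perplexity = torch.exp(per_token_nll)
print(per_token_nll.item(), perplexity.item())  # a loss of 2.0 would give perplexity ~7.39
```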
Training loss is reported at two granularities, and confusing them causes a lot of frustration when reading curves.
Per-batch loss is the loss computed on a single mini-batch right after the gradient update. It is logged every step (sometimes every few steps to reduce I/O) and is inherently noisy, because each batch contains a different sample of training examples. A per-batch curve that bounces up and down by 10 to 20 percent from step to step is not necessarily a problem; it is mostly sampling noise.
Per-epoch loss is the average loss over an entire pass through the training set. It is much smoother because the noise from individual batches gets averaged out. Most published learning curves show per-epoch loss for clarity, but per-batch logs are valuable for catching short-lived issues like a corrupted batch or a sudden divergence event.
When training runs are long (LLM pretraining can last for weeks across tens of thousands of steps), people often plot a moving average of the per-batch loss using a window of 100 to 500 steps. This gives the smoothness of per-epoch loss with the temporal resolution of per-batch logging.
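One way to produce such a curve, sketched here with NumPy over an already-logged list of per-batch losses (the window size of 200 is an illustrative choice within that range):

```python
import numpy as np

def smooth(per_batch_losses, window=200):
    """Moving average of per-batch losses over a fixed window."""
    losses = np.asarray(per_batch_losses, dtype=float)
    if len(losses) < window:
        return losses  # too few points to smooth meaningfully
    kernel = np.ones(window) / window
    # mode="valid" starts the smoothed curve window-1 steps in, avoiding edge artifacts.
    return np.convolve(losses, kernel, mode="valid")

# Usage: smoothed = smooth(logged_losses, window=200)
```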
The shape of the training loss curve carries a lot of diagnostic information. A few common patterns and what they typically mean:
A steady downward curve that gradually flattens is the textbook healthy training run. The model is converging toward a minimum.
A sudden plateau early in training can mean the learning rate is too small, the model is stuck near a saddle point, or the data is not being shuffled properly so each epoch presents the same easy examples first.
Loss spikes (the curve jumps up sharply, then comes back down) usually indicate that a batch contained unusual examples or that the learning rate is too aggressive for that point in training. A single spike is not always fatal: many training runs recover. Repeated spikes are a warning sign.
Divergence (loss climbs steadily or jumps to NaN) means something is broken. Common causes include a learning rate that is too high, exploding gradients, numerical overflow in mixed-precision training, or a data preprocessing bug that is feeding garbage into the model.
A training loss that drops to essentially zero very quickly suggests the model has memorized the training set or, more concerning, that there is data leakage. If the task is genuinely hard and the model is small, near-zero training loss within a few epochs probably means something is wrong with how the data is being constructed.
Reading training loss in isolation is rarely enough. Plotting it alongside validation loss reveals whether the model is generalizing.
| Pattern | Diagnosis | Typical fix |
|---|---|---|
| Both training and validation loss decreasing | Healthy training, model is generalizing | Continue training |
| Training loss decreasing, validation loss flat or rising | Overfitting; model is memorizing the training set | Add regularization, dropout, data augmentation, or use early stopping |
| Both losses high and flat | Underfitting; model lacks capacity or is poorly initialized | Increase model size, train longer, lower regularization, or tune the learning rate |
| Training loss near zero, validation loss high | Severe overfitting | Significantly reduce model capacity or get more training data |
| Both losses oscillate wildly | Learning rate too high or batch size too small | Reduce learning rate, use a scheduler, or increase batch size |
| Validation loss spikes while training loss stays smooth | Distribution shift between train and val splits, or mismatched batch-normalization statistics between training and evaluation | Audit dataset splits and model.eval() calls |
The gap between training and validation loss (sometimes called the generalization gap) is itself a metric worth watching. A small gap suggests the model would generalize even further with more data or longer training. A large gap suggests the model has reached the limits of what its current capacity can extract from the available training data without overfitting.
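A sketch of how both ideas are commonly wired into a training loop: log the gap every epoch and stop when validation loss stops improving. The helpers `train_one_epoch` and `evaluate` are hypothetical, and the patience of 5 epochs is an illustrative choice, not a recommendation.

```python
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5  # illustrative; tune per task

for epoch in range(max_epochs):  # max_epochs, model, loaders, optimizer assumed from context
    train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = evaluate(model, val_loader)                        # hypothetical helper

    gap = val_loss - train_loss  # generalization gap
    print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} gap={gap:.4f}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("validation loss stopped improving; early stopping")
            break
```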
Training loss alone is rarely sufficient to debug a training run. Modern experiment trackers log a constellation of related metrics that together paint a much richer picture.
| Metric | Why it matters |
|---|---|
| Validation loss | Direct measure of generalization |
| Validation metrics (accuracy, F1, BLEU, mAP, perplexity) | Task-specific quality, not just loss value |
| Learning rate | Helps correlate loss spikes or plateaus with scheduler events |
| Gradient norm | Sudden growth signals exploding gradients; collapse to zero signals vanishing gradients |
| Weight norm | Tracks weight growth; helpful when diagnosing regularization or weight decay |
| Activation statistics | Mean and standard deviation per layer reveal saturation or dead neurons |
| Time per step | Spikes indicate hardware contention, data loading bottlenecks, or memory swapping |
| GPU memory and utilization | Surfaces inefficiency in data pipelines |
| Per-token loss (LLMs) | Used to compute perplexity and to detect bad batches |
Logging gradient norm in particular is one of the most useful early-warning signals. A gradient norm that suddenly doubles a few steps before a loss spike often makes the failure predictable.
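In PyTorch, a convenient way to obtain this number is that `torch.nn.utils.clip_grad_norm_` returns the total gradient norm before clipping, so the same call can both stabilize training and feed the log. A sketch of the relevant lines inside a training step (`model`, `loss`, `optimizer`, `writer`, and `global_step` are assumed from the surrounding loop):

```python
import torch

loss.backward()

# Returns the global gradient norm *before* clipping; a clip threshold of 1.0 is a common default.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
writer.add_scalar("Gradients/global_norm", grad_norm.item(), global_step)

optimizer.step()
optimizer.zero_grad()
```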
Several platforms have become standard for tracking training loss and related metrics across the deep learning ecosystem.
| Tool | Origin | Notable features |
|---|---|---|
| TensorBoard | Google, 2015 | Native to TensorFlow, widely used with PyTorch via SummaryWriter; local-first |
| Weights and Biases (W&B) | Lukas Biewald, 2017 | Cloud-hosted, rich UI, sweep-based hyperparameter search |
| MLflow | Databricks, 2018 | Open-source, integrates with Spark and broader MLOps stack |
| Neptune.ai | Neptune Labs, 2018 | Strong focus on metadata tracking and team collaboration |
| Aim | Aimhub, 2020 | Open-source, fast UI for comparing thousands of runs |
| Comet ML | Comet, 2017 | Hosted experiment tracking with model registry |
TensorBoard remains the most universally supported option. In PyTorch, the typical pattern is to instantiate a SummaryWriter and call writer.add_scalar("Loss/train", loss.item(), global_step) inside the training loop. Hierarchical tag naming ("Loss/train", "Loss/val") groups related curves automatically in the UI. The TensorBoard documentation recommends calling writer.close() at the end of training so all buffered events are flushed to disk.
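A minimal sketch of that pattern (the log directory name and helper functions such as `training_step` and `evaluate` are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment-1")  # illustrative log directory

global_step = 0
for epoch in range(num_epochs):  # num_epochs, model, optimizer, loaders assumed
    for batch in train_loader:
        loss = training_step(model, batch, optimizer)       # hypothetical helper
        writer.add_scalar("Loss/train", loss, global_step)  # per-batch training loss
        global_step += 1

    val_loss = evaluate(model, val_loader)                  # hypothetical helper
    writer.add_scalar("Loss/val", val_loss, global_step)    # same step axis keeps curves aligned

writer.close()  # flush buffered events to disk
```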
Weights and Biases became popular for cloud-hosted runs because it makes it trivial to compare hundreds of experiments side by side, run hyperparameter sweeps, and share results with collaborators via a public URL. MLflow is more common in enterprise settings that already use Databricks or need to integrate with broader MLOps tooling.
Large language model pretraining stretches every assumption about loss curves. Runs last days or weeks, models have hundreds of billions of parameters, and even small instabilities can cost tens of thousands of dollars in wasted compute.
Loss spikes during LLM pretraining are well documented. The PaLM paper (Chowdhery et al., 2022) reported that the training loss spiked roughly 20 times during the run for the 540B parameter model, even though gradient clipping was active. The team's mitigation was practical rather than theoretical: when a spike occurred, they restarted training from a checkpoint about 100 steps before the spike and skipped the next 200 to 500 batches that the model had been about to consume. Skipping the batches almost always prevented the spike from recurring at the same point, suggesting the spikes were caused by specific combinations of data and model state rather than fundamental optimizer failure.
The Spike No More paper (Takase et al., 2023) and the Llama 3 technical report both document similar phenomena and propose more principled mitigations including improved initialization, layer normalization placement, and adaptive gradient clipping.
Gradient clipping (typically clipping the global norm to 1.0) is standard in LLM training to prevent rare large gradients from blowing up the loss. BF16 (bfloat16) has largely replaced FP16 for large-scale training because BF16 has the same exponent range as FP32, eliminating the overflow issues that required loss scaling in FP16. Mixed-precision training in BF16 produces noticeably smoother loss curves than FP16 at the same model scale.
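A sketch of what a BF16 training step with global-norm clipping typically looks like in PyTorch (`compute_loss` and the surrounding objects are assumed; note that BF16, unlike FP16, needs no loss scaling):

```python
import torch

for batch in train_loader:  # train_loader, model, optimizer assumed from context
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, batch)  # hypothetical helper returning a scalar loss

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global-norm clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
```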
LLM training loss is the per-token cross-entropy. Perplexity, the more interpretable quantity, is just exp(per-token loss). A loss of 2.5 corresponds to a perplexity of about 12.2, meaning the model is on average about as uncertain as if it were choosing uniformly among 12 possible next tokens. Both quantities are reported in practice, with loss preferred during optimization (because gradients flow through it directly) and perplexity preferred when discussing model quality.
The training loss can be thought of as a surface in the high-dimensional space of model parameters. Each combination of weights produces one loss value, and the optimizer's job is to walk downhill on this surface. Visualizing this landscape is mathematically tricky because deep models have millions to billions of parameters, but Li et al. (NeurIPS 2018) introduced a filter normalization technique that produces meaningful 2D and 3D visualizations of loss landscapes. Their key empirical finding was that flatter minima tend to generalize better than sharp ones, supporting earlier theoretical work by Hochreiter and Schmidhuber on the relationship between minimum sharpness and generalization.
The paper also showed that skip connections (as in ResNet architectures) dramatically smooth the loss landscape, which helps explain why very deep networks became trainable once skip connections were introduced. Without skip connections, the loss landscape of a deep network is full of sharp ridges and chaotic regions; with them, the landscape becomes much closer to the gentle bowl that gradient descent handles well.
This line of work also clarified that the training loss landscape exhibits implicit regularization: stochastic gradient descent with reasonable hyperparameters tends to find flat minima that generalize, even when sharper minima with the same training loss exist.
A few patterns trip up practitioners often enough to be worth flagging:
Comparing losses across different scales. MSE on raw pixel values produces numbers in the thousands, while cross-entropy on a 50,000-token vocabulary produces numbers around 10. Comparing absolute values across loss functions is meaningless. Compare relative changes within the same loss function.
Reporting training loss inconsistently. Some pipelines include the regularization term (weight decay, L2 penalty) in the reported training loss, others do not. When comparing runs across codebases, check whether "training loss" includes the regularizer.
Forgetting to call model.eval() when computing validation loss. In PyTorch, this leaves dropout active and uses batch statistics rather than running statistics in batch normalization layers. The result is a noisy, biased validation loss that does not reflect inference behavior; a correct evaluation pass is sketched below.
Not normalizing by batch size in distributed training. When gradients are summed across workers, the effective batch size grows, and the loss should be averaged appropriately. Forgetting this leads to learning rates that are effectively too large or too small.
Mixing per-batch and per-epoch loss in the same plot without saying which is which. The two have very different smoothness, and putting them on the same axes invites misinterpretation.
Reading too much into a single noisy curve. A loss that bounces around within a 10 percent band over 100 steps is normal noise, not evidence of instability. Take a moving average before drawing conclusions.
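Returning to the model.eval() pitfall above, a minimal sketch of a correct validation pass (all names are illustrative):

```python
import torch

def validation_loss(model, val_loader, loss_fn, device="cuda"):
    """Average validation loss with dropout and batch-norm statistics handled correctly."""
    model.eval()  # disables dropout; batch norm switches to running statistics
    total_loss, total_examples = 0.0, 0
    with torch.no_grad():  # no autograd graph is needed for evaluation
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = loss_fn(model(inputs), targets)
            total_loss += loss.item() * inputs.size(0)
            total_examples += inputs.size(0)
    model.train()  # restore training mode before the next epoch
    return total_loss / total_examples
```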
For LLM-scale runs, monitoring tooling has split into two functional categories. Crash diagnostics are designed to detect divergence as early as possible: rapidly rising loss, gradient norm spikes, NaN detection, and weight norm growth all qualify. The faster these signals trip an alert, the less compute is wasted on a doomed run. Convergence diagnostics, by contrast, are slower-moving statistics that help judge whether a run that is technically training is also making efficient progress: comparison against scaling-law predictions, downstream evaluation metrics, and per-domain perplexity all fall into this bucket. A well-designed monitoring setup logs both categories and surfaces them in different dashboards so on-call engineers can react to crashes without being overwhelmed by slow trends.
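A minimal sketch of what a crash-diagnostic check might look like; the window size and spike threshold here are illustrative, not recommendations:

```python
import math
from collections import deque

class CrashMonitor:
    """Flags non-finite losses immediately and loss spikes relative to a recent moving average."""

    def __init__(self, window=200, spike_factor=2.0):
        self.recent = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, loss_value):
        if not math.isfinite(loss_value):
            return "crash: loss is NaN or inf"
        if self.recent:
            baseline = sum(self.recent) / len(self.recent)
            if loss_value > self.spike_factor * baseline:
                return f"warning: loss spike ({loss_value:.3f} vs recent average {baseline:.3f})"
        self.recent.append(loss_value)
        return None
```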
Imagine you are practicing free throws in basketball. Each time you shoot, the coach writes down how far the ball was from the center of the hoop. The numbers usually start big (you keep missing badly) and get smaller as you practice. Training loss is the same idea for a computer program that is learning. Every time the program tries to make a guess, the training loss says how wrong the guess was. The program changes itself a little bit to be less wrong next time, and the training loss number gets smaller and smaller.
If the number ever stops going down, that might mean you have learned everything you can from this practice or that something is going wrong, like the gym lights flickering and confusing you. And if the number suddenly jumps up, that probably means a weird ball came along that messed you up. The score by itself does not tell you everything; you also need to try shooting in a different gym (the validation set) to know whether you are really getting better or just memorizing the lines on the floor of one specific court.