See also: Loss function, Validation loss, Learning curve
In machine learning, training loss is the value of the loss function evaluated on the training data during model training. It is the quantity that the optimizer actually pushes downward at each step of gradient descent, and tracking it over iterations or epochs is the primary signal practitioners use to judge whether training is working. A well-behaved training loss curve generally starts high, drops sharply in the first few thousand steps, then settles into a long, gentler decline as the model approaches a minimum of its training objective.
The distinction matters because "loss function" refers to a mathematical formula (such as squared error or cross-entropy), while training loss is the numerical value that formula produces when fed the model's current predictions on the training set. The same loss function generates a different training loss at every step, and that sequence of numbers is what gets logged, plotted, and analyzed.
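A minimal sketch of this distinction, using PyTorch's mean-squared-error helper and made-up numbers: the formula stays fixed, but the value it produces changes as the model's predictions change.

```python
import torch
import torch.nn.functional as F

targets = torch.tensor([1.0, 0.0, 1.0])

# Hypothetical predictions from the same model at two points in training.
early_predictions = torch.tensor([0.2, 0.9, 0.4])
late_predictions = torch.tensor([0.9, 0.1, 0.8])

# Same loss function (MSE), two different training-loss values.
print(F.mse_loss(early_predictions, targets).item())  # ~0.60
print(F.mse_loss(late_predictions, targets).item())   # ~0.02
```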
A machine learning workflow normally tracks loss on three different data partitions, each serving a separate purpose.
| Quantity | Data used | What it measures | When it is computed |
|---|---|---|---|
| Training loss | Training set | How well the model fits the data it is being optimized on | Continuously during training, often per batch and per epoch |
| Validation loss | Held-out validation set | How well the model generalizes to data it has not seen during gradient updates | Periodically during training (every epoch or every N steps) |
| Test loss | Held-out test set | Final, unbiased estimate of generalization error | Once, after training and hyperparameter tuning are finished |
Training loss is the easiest to drive toward zero because the model has direct access to those examples through gradient updates. Validation loss is the diagnostic that reveals whether the model is learning generalizable patterns or just memorizing the training set. Test loss exists to keep the validation set honest: if the validation loss is repeatedly used to select hyperparameters, it begins to leak information into the model and stops being a fair estimate of true generalization.
A common mistake is conflating training loss with the loss function itself. The loss function is a mathematical object. Training loss is one of many values that function takes on as the model evolves.
The choice of loss function depends on the task. The table below summarizes the most common training objectives and where each is used.
| Loss | Typical task | Notes |
|---|---|---|
| Mean squared error (MSE, L2) | Regression | Penalizes large errors quadratically; smooth and easy to optimize |
| Mean absolute error (MAE, L1) | Regression | Robust to outliers; not differentiable at zero |
| Huber loss | Regression | Quadratic for small residuals, linear for large ones; balances MSE and MAE |
| Cross-entropy (categorical) | Multi-class classification | Default loss for softmax outputs |
| Binary cross-entropy | Binary or multi-label classification | Used with sigmoid outputs |
| Hinge loss | Margin-based classification | Standard objective for support vector machines |
| Focal loss | Imbalanced classification | Down-weights easy examples; introduced for dense object detection |
| Triplet loss | Metric learning | Pulls anchor toward positive, pushes from negative |
| Contrastive (InfoNCE) loss | Self-supervised learning | Used in CLIP, SimCLR, MoCo |
| CTC loss | Speech recognition | Marginalizes over alignments between audio and text |
| Reconstruction loss | Autoencoders | Squared error or BCE between input and reconstruction |
| KL divergence | VAEs, knowledge distillation | Measures how one distribution diverges from another; asymmetric, so not a true distance |
| GAN minimax loss | Generative adversarial networks | Discriminator and generator compete |
| Noise prediction (MSE) | Diffusion models | Predicts the noise added at each step (DDPM) |
| Per-token cross-entropy | LLM pretraining | Negative log-likelihood of the next token |
For LLM pretraining, the training loss is almost always the average per-token negative log-likelihood across the batch, which is the same quantity used to compute perplexity. The relationship is simply perplexity = exp(per-token NLL), so a training loss of 2.0 corresponds to a perplexity of about 7.39.
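A short sketch of that relationship in PyTorch, with randomly generated logits and targets standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(4, vocab_size)            # illustrative model outputs for 4 token positions
targets = torch.randint(0, vocab_size, (4,))   # illustrative next-token ids

# Average per-token negative log-likelihood: the quantity LLM loss curves plot.
per_token_nll = F.cross_entropy(logits, targets)

# Perplexity is the exponential of the per-token loss.
perplexity = torch.exp(per_token_nll)
print(per_token_nll.item(), perplexity.item())  # a loss of 2.0 would give perplexity ~7.39
```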
Training loss is reported at two granularities, and confusing them causes a lot of frustration when reading curves.
Per-batch loss is the loss computed on a single mini-batch right after the gradient update. It is logged every step (sometimes every few steps to reduce I/O) and is inherently noisy, because each batch contains a different sample of training examples. A per-batch curve that bounces up and down by 10 to 20 percent from step to step is not necessarily a problem; it is mostly sampling noise.
Per-epoch loss is the average loss over an entire pass through the training set. It is much smoother because the noise from individual batches gets averaged out. Most published learning curves show per-epoch loss for clarity, but per-batch logs are valuable for catching short-lived issues like a corrupted batch or a sudden divergence event.
When training runs are long (LLM pretraining can last for weeks across tens of thousands of steps), people often plot a moving average of the per-batch loss using a window of 100 to 500 steps. This gives the smoothness of per-epoch loss with the temporal resolution of per-batch logging.
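One way to produce such a curve, sketched here with NumPy over an already-logged list of per-batch losses (the window size of 200 is an illustrative choice within that range):

```python
import numpy as np

def smooth(per_batch_losses, window=200):
    """Moving average of per-batch losses over a fixed window."""
    losses = np.asarray(per_batch_losses, dtype=float)
    if len(losses) < window:
        return losses  # too few points to smooth meaningfully
    kernel = np.ones(window) / window
    # mode="valid" starts the smoothed curve window-1 steps in, avoiding edge artifacts.
    return np.convolve(losses, kernel, mode="valid")

# Usage: smoothed = smooth(logged_losses, window=200)
```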
The shape of the training loss curve carries a lot of diagnostic information. A few common patterns and what they typically mean:
A steady downward curve that gradually flattens is the textbook healthy training run. The model is converging toward a minimum.
A sudden plateau early in training can mean the learning rate is too small, the model is stuck near a saddle point, or the data is not being shuffled properly so each epoch presents the same easy examples first.
Loss spikes (the curve jumps up sharply, then comes back down) usually indicate that a batch contained unusual examples or that the learning rate is too aggressive for that point in training. A single spike is not always fatal: many training runs recover. Repeated spikes are a warning sign.
Divergence (loss climbs steadily or jumps to NaN) means something is broken. Common causes include a learning rate that is too high, exploding gradients, numerical overflow in mixed-precision training, or a data preprocessing bug that is feeding garbage into the model.
A training loss that drops to essentially zero very quickly suggests the model has memorized the training set or, more concerning, that there is data leakage. If the task is genuinely hard and the model is small, near-zero training loss within a few epochs probably means something is wrong with how the data is being constructed.
Reading training loss in isolation is rarely enough. Plotting it alongside validation loss reveals whether the model is generalizing.
| Pattern | Diagnosis | Typical fix |
|---|---|---|
| Both training and validation loss decreasing | Healthy training, model is generalizing | Continue training |
| Training loss decreasing, validation loss flat or rising | Overfitting; model is memorizing the training set | Add regularization, dropout, data augmentation, or use early stopping |
| Both losses high and flat | Underfitting; model lacks capacity or is poorly initialized | Increase model size, train longer, lower regularization, or tune the learning rate |
| Training loss near zero, validation loss high | Severe overfitting | Significantly reduce model capacity or get more training data |
| Both losses oscillate wildly | Learning rate too high or batch size too small | Reduce learning rate, use a scheduler, or increase batch size |
| Validation loss spikes while training loss stays smooth | Distribution shift between train and val splits, or mismatched batch-normalization statistics between training and evaluation | Audit dataset splits and model.eval() calls |
The gap between training and validation loss (sometimes called the generalization gap) is itself a metric worth watching. A small gap suggests the model would generalize even further with more data or longer training. A large gap suggests the model has reached the limits of what its current capacity can extract from the available training data without overfitting.
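A sketch of how both ideas are commonly wired into a training loop: log the gap every epoch and stop when validation loss stops improving. The helpers `train_one_epoch` and `evaluate` are hypothetical, and the patience of 5 epochs is an illustrative choice, not a recommendation.

```python
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5  # illustrative; tune per task

for epoch in range(max_epochs):  # max_epochs, model, loaders, optimizer assumed from context
    train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = evaluate(model, val_loader)                        # hypothetical helper

    gap = val_loss - train_loss  # generalization gap
    print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} gap={gap:.4f}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print("validation loss stopped improving; early stopping")
            break
```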
Training loss alone is rarely sufficient to debug a training run. Modern experiment trackers log a constellation of related metrics that together paint a much richer picture.
| Metric | Why it matters |
|---|---|
| Validation loss | Direct measure of generalization |
| Validation metrics (accuracy, F1, BLEU, mAP, perplexity) | Task-specific quality, not just loss value |
| Learning rate | Helps correlate loss spikes or plateaus with scheduler events |
| Gradient norm | Sudden growth signals exploding gradients; collapse to zero signals vanishing gradients |
| Weight norm | Tracks weight growth; helpful when diagnosing regularization or weight decay |
| Activation statistics | Mean and standard deviation per layer reveal saturation or dead neurons |
| Time per step | Spikes indicate hardware contention, data loading bottlenecks, or memory swapping |
| GPU memory and utilization | Surfaces inefficiency in data pipelines |
| Per-token loss (LLMs) | Used to compute perplexity and to detect bad batches |
Logging gradient norm in particular is one of the most useful early-warning signals. A gradient norm that suddenly doubles a few steps before a loss spike often makes the failure predictable.
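In PyTorch, a convenient way to obtain this number is that `torch.nn.utils.clip_grad_norm_` returns the total gradient norm before clipping, so the same call can both stabilize training and feed the log. A sketch of the relevant lines inside a training step (`model`, `loss`, `optimizer`, `writer`, and `global_step` are assumed from the surrounding loop):

```python
import torch

loss.backward()

# Returns the global gradient norm *before* clipping; a clip threshold of 1.0 is a common default.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
writer.add_scalar("Gradients/global_norm", grad_norm.item(), global_step)

optimizer.step()
optimizer.zero_grad()
```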
Several platforms have become standard for tracking training loss and related metrics across the deep learning ecosystem.
| Tool | Origin | Notable features |
|---|---|---|
| TensorBoard | Google, 2015 | Native to TensorFlow, widely used with PyTorch via SummaryWriter; local-first |
| Weights and Biases (W&B) | Lukas Biewald, 2017 | Cloud-hosted, rich UI, sweep-based hyperparameter search |
| MLflow | Databricks, 2018 | Open-source, integrates with Spark and broader MLOps stack |
| Neptune.ai | Neptune Labs, 2018 | Strong focus on metadata tracking and team collaboration |
| Aim | Aimhub, 2020 | Open-source, fast UI for comparing thousands of runs |
| Comet ML | Comet, 2017 | Hosted experiment tracking with model registry |
TensorBoard remains the most universally supported option. In PyTorch, the typical pattern is to instantiate a SummaryWriter and call writer.add_scalar("Loss/train", loss.item(), global_step) inside the training loop. Hierarchical tag naming ("Loss/train", "Loss/val") groups related curves automatically in the UI. The TensorBoard documentation recommends calling writer.close() at the end of training so all buffered events are flushed to disk.
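A minimal sketch of that pattern (the log directory name and helper functions such as `training_step` and `evaluate` are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment-1")  # illustrative log directory

global_step = 0
for epoch in range(num_epochs):  # num_epochs, model, optimizer, loaders assumed
    for batch in train_loader:
        loss = training_step(model, batch, optimizer)       # hypothetical helper
        writer.add_scalar("Loss/train", loss, global_step)  # per-batch training loss
        global_step += 1

    val_loss = evaluate(model, val_loader)                  # hypothetical helper
    writer.add_scalar("Loss/val", val_loss, global_step)    # same step axis keeps curves aligned

writer.close()  # flush buffered events to disk
```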
Weights and Biases became popular for cloud-hosted runs because it makes it trivial to compare hundreds of experiments side by side, run hyperparameter sweeps, and share results with collaborators via a public URL. MLflow is more common in enterprise settings that already use Databricks or need to integrate with broader MLOps tooling.
Large language model pretraining stretches every assumption about loss curves. Runs last days or weeks, models have hundreds of billions of parameters, and even small instabilities can cost tens of thousands of dollars in wasted compute.
Loss spikes during LLM pretraining are well documented. The PaLM paper (Chowdhery et al., 2022) reported that the training loss spiked roughly 20 times during the run for the 540B parameter model, even though gradient clipping was active. The team's mitigation was practical rather than theoretical: when a spike occurred, they restarted training from a checkpoint about 100 steps before the spike and skipped the next 200 to 500 batches that the model had been about to consume. Skipping the batches almost always prevented the spike from recurring at the same point, suggesting the spikes were caused by specific combinations of data and model state rather than fundamental optimizer failure.
The Spike No More paper (Takase et al., 2023) and the Llama 3 technical report both document similar phenomena and propose more principled mitigations including improved initialization, layer normalization placement, and adaptive gradient clipping.
Gradient clipping (typically clipping the global norm to 1.0) is standard in LLM training to prevent rare large gradients from blowing up the loss. BF16 (bfloat16) has largely replaced FP16 for large-scale training because BF16 has the same exponent range as FP32, eliminating the overflow issues that required loss scaling in FP16. Mixed-precision training in BF16 produces noticeably smoother loss curves than FP16 at the same model scale.
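A sketch of what a BF16 training step with global-norm clipping typically looks like in PyTorch (`compute_loss` and the surrounding objects are assumed; note that BF16, unlike FP16, needs no loss scaling):

```python
import torch

for batch in train_loader:  # train_loader, model, optimizer assumed from context
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, batch)  # hypothetical helper returning a scalar loss

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global-norm clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
```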
LLM training loss is the per-token cross-entropy. Perplexity, the more interpretable quantity, is just exp(per-token loss). A loss of 2.5 corresponds to a perplexity of about 12.2, meaning the model is on average about as uncertain as if it were choosing uniformly among 12 possible next tokens. Both quantities are reported in practice, with loss preferred during optimization (because gradients flow through it directly) and perplexity preferred when discussing model quality.
The training loss can be thought of as a surface in the high-dimensional space of model parameters. Each combination of weights produces one loss value, and the optimizer's job is to walk downhill on this surface. Visualizing this landscape is mathematically tricky because deep models have millions to billions of parameters, but Li et al. (NeurIPS 2018) introduced a filter normalization technique that produces meaningful 2D and 3D visualizations of loss landscapes. Their key empirical finding was that flatter minima tend to generalize better than sharp ones, supporting earlier theoretical work by Hochreiter and Schmidhuber on the relationship between minimum sharpness and generalization.
The paper also showed that skip connections (as in ResNet architectures) dramatically smooth the loss landscape, which helps explain why very deep networks became trainable once skip connections were introduced. Without skip connections, the loss landscape of a deep network is full of sharp ridges and chaotic regions; with them, the landscape becomes much closer to the gentle bowl that gradient descent handles well.
This line of work also clarified that the training loss landscape exhibits implicit regularization: stochastic gradient descent with reasonable hyperparameters tends to find flat minima that generalize, even when sharper minima with the same training loss exist.
A few patterns trip up practitioners often enough to be worth flagging:
Comparing losses across different scales. MSE on raw pixel values produces numbers in the thousands, while cross-entropy on a 50,000-token vocabulary produces numbers around 10. Comparing absolute values across loss functions is meaningless. Compare relative changes within the same loss function.
Reporting training loss inconsistently. Some pipelines include the regularization term (weight decay, L2 penalty) in the reported training loss, others do not. When comparing runs across codebases, check whether "training loss" includes the regularizer.
Forgetting to call model.eval() when computing validation loss. In PyTorch, this leaves dropout active and uses batch statistics rather than running statistics in batch normalization layers. The result is a noisy, biased validation loss that does not reflect inference behavior; a correct evaluation pass is sketched below.
Not normalizing by batch size in distributed training. When gradients are summed across workers, the effective batch size grows, and the loss should be averaged appropriately. Forgetting this leads to learning rates that are effectively too large or too small.
Mixing per-batch and per-epoch loss in the same plot without saying which is which. The two have very different smoothness, and putting them on the same axes invites misinterpretation.
Reading too much into a single noisy curve. A loss that bounces around within a 10 percent band over 100 steps is normal noise, not evidence of instability. Take a moving average before drawing conclusions.
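Returning to the model.eval() pitfall above, a minimal sketch of a correct validation pass (all names are illustrative):

```python
import torch

def validation_loss(model, val_loader, loss_fn, device="cuda"):
    """Average validation loss with dropout and batch-norm statistics handled correctly."""
    model.eval()  # disables dropout; batch norm switches to running statistics
    total_loss, total_examples = 0.0, 0
    with torch.no_grad():  # no autograd graph is needed for evaluation
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = loss_fn(model(inputs), targets)
            total_loss += loss.item() * inputs.size(0)
            total_examples += inputs.size(0)
    model.train()  # restore training mode before the next epoch
    return total_loss / total_examples
```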
For LLM-scale runs, monitoring tooling has split into two functional categories. Crash diagnostics are designed to detect divergence as early as possible: rapidly rising loss, gradient norm spikes, NaN detection, and weight norm growth all qualify. The faster these signals trip an alert, the less compute is wasted on a doomed run. Convergence diagnostics, by contrast, are slower-moving statistics that help judge whether a run that is technically training is also making efficient progress: comparison against scaling-law predictions, downstream evaluation metrics, and per-domain perplexity all fall into this bucket. A well-designed monitoring setup logs both categories and surfaces them in different dashboards so on-call engineers can react to crashes without being overwhelmed by slow trends.
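A minimal sketch of what a crash-diagnostic check might look like; the window size and spike threshold here are illustrative, not recommendations:

```python
import math
from collections import deque

class CrashMonitor:
    """Flags non-finite losses immediately and loss spikes relative to a recent moving average."""

    def __init__(self, window=200, spike_factor=2.0):
        self.recent = deque(maxlen=window)
        self.spike_factor = spike_factor

    def check(self, loss_value):
        if not math.isfinite(loss_value):
            return "crash: loss is NaN or inf"
        if self.recent:
            baseline = sum(self.recent) / len(self.recent)
            if loss_value > self.spike_factor * baseline:
                return f"warning: loss spike ({loss_value:.3f} vs recent average {baseline:.3f})"
        self.recent.append(loss_value)
        return None
```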
Imagine you are practicing free throws in basketball. Each time you shoot, the coach writes down how far the ball was from the center of the hoop. The numbers usually start big (you keep missing badly) and get smaller as you practice. Training loss is the same idea for a computer program that is learning. Every time the program tries to make a guess, the training loss says how wrong the guess was. The program changes itself a little bit to be less wrong next time, and the training loss number gets smaller and smaller.
If the number ever stops going down, that might mean you have learned everything you can from this practice or that something is going wrong, like the gym lights flickering and confusing you. And if the number suddenly jumps up, that probably means a weird ball came along that messed you up. The score by itself does not tell you everything; you also need to try shooting in a different gym (the validation set) to know whether you are really getting better or just memorizing the lines on the floor of one specific court.