Training in machine learning is the process of fitting a model's parameters to data so that the model can make accurate predictions or generate useful outputs. During training, an algorithm iteratively adjusts the model's internal weights and biases by exposing it to examples, computing how far its predictions are from the correct answers, and updating the parameters to reduce that error. Training is the most computationally intensive phase of the machine learning pipeline, and its effectiveness determines how well a model will perform during inference.
At the core of training neural networks is the training loop, a cycle that repeats thousands or millions of times until the model reaches acceptable performance. Each iteration of the loop consists of four steps.
In the forward pass, input data flows through the network layer by layer. At each layer, the inputs are multiplied by the layer's weights, a bias term is added, and the result is passed through an activation function (such as ReLU or sigmoid). The output of one layer becomes the input to the next. The final layer produces the model's prediction for the given input.
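In symbols, this single-layer computation is commonly written as follows, where \(a^{(l-1)}\) is the input arriving from the previous layer, \(W^{(l)}\) and \(b^{(l)}\) are the layer's weights and bias, and \(f\) is the activation function:

```latex
a^{(l)} = f\left( W^{(l)} a^{(l-1)} + b^{(l)} \right)
```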
Once the forward pass produces a prediction, a loss function (also called a cost function or objective function) quantifies how far that prediction is from the true target. Common loss functions include mean squared error for regression tasks and cross-entropy loss for classification tasks. The loss is a single scalar value that the training process aims to minimize.
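For concreteness, the two losses mentioned above take these standard forms for \(n\) examples with true targets \(y_i\) and predictions \(\hat{y}_i\) (the cross-entropy shown is the binary case):

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```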
Backpropagation computes the gradient of the loss function with respect to every parameter in the network. Using the chain rule from calculus, it works backward from the output layer to the input layer, calculating how much each weight contributed to the error. These gradients indicate the direction and magnitude of change needed to reduce the loss.
With the gradients computed, an optimizer applies gradient descent (or a variant like Adam, SGD with momentum, or AdaGrad) to update each parameter. The optimizer subtracts a fraction of the gradient from each weight, with that fraction determined by the learning rate. This step moves the model's parameters closer to a configuration that minimizes the loss.
This four-step cycle repeats for every batch of training data. Over many iterations, the model's predictions improve as the loss decreases.
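A minimal PyTorch sketch of the four-step loop, using synthetic data in place of a real dataset (all names here are illustrative, not drawn from any particular codebase):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# A toy regression model and random stand-in data.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32)

for epoch in range(10):
    for x, y in loader:
        prediction = model(x)                         # 1. forward pass
        loss = nn.functional.mse_loss(prediction, y)  # 2. loss computation
        optimizer.zero_grad()
        loss.backward()                               # 3. backpropagation
        optimizer.step()                              # 4. parameter update
```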
Training data is typically processed in structured units.
| Term | Definition |
|---|---|
| Epoch | One complete pass through the entire training dataset. Models are usually trained for multiple epochs. |
| Batch | A subset of the training dataset processed together in one forward and backward pass. Batch sizes commonly range from 16 to several thousand. |
| Iteration | One forward pass, loss computation, backward pass, and parameter update on a single batch. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up when the final batch is partial. |
| Mini-batch | A term used interchangeably with batch in most modern contexts, distinguishing it from full-batch gradient descent (which uses the entire dataset at once). |
For example, a dataset with 50,000 samples and a batch size of 256 requires 196 iterations to complete one epoch: 195 full batches of 256 samples plus one final batch of 80 (or 195 iterations if the partial batch is dropped).
The quantity and quality of training data directly affect model performance. Insufficient data leads to overfitting, where the model memorizes the training examples rather than learning generalizable patterns. Too much noise or mislabeled data degrades performance regardless of data volume.
The scaling laws discovered by researchers at OpenAI and DeepMind have formalized the relationship between data, model size, and compute. The Chinchilla scaling laws (Hoffmann et al., 2022) found that for compute-optimal training, model size and training tokens should be scaled roughly equally: approximately 20 tokens per parameter. A 70-billion-parameter model should therefore be trained on about 1.4 trillion tokens. However, more recent practice has shifted toward training on far more data than the Chinchilla optimum suggests. Meta's Llama 3 70B was trained on roughly 200 tokens per parameter (about 15 trillion tokens), roughly 10 times the Chinchilla-optimal ratio. This trend is driven by economics: models trained on more data per parameter can achieve comparable performance with fewer parameters, making them cheaper to serve at inference time.
Training and inference are the two primary phases of a model's lifecycle. They differ significantly in computational profile and cost structure.
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn model parameters from data | Apply the trained model to new inputs |
| Compute | Massive (thousands of GPUs for weeks) | Modest per request (single GPU or CPU) |
| Frequency | Periodic (once or a few times) | Continuous (every time the model is used) |
| Bottleneck | Compute-bound (FLOPs) | Often latency-bound or memory-bound |
| Cost profile | High one-time cost | Low per-query cost, but can exceed training cost at scale |
Because inference runs continuously and serves potentially millions of users, its cumulative operational cost often exceeds the one-time training cost. This explains why reducing inference costs has become a priority, even if it requires additional training compute.
Training compute is measured in FLOPs (floating-point operations) and is one of the most important factors determining model quality. The compute required for frontier models has grown exponentially.
| Model | Year | Estimated Training FLOPs | Estimated Cost |
|---|---|---|---|
| GPT-3 | 2020 | 3.1 x 10^23 | ~$4.6M |
| GPT-4 | 2023 | 2.1 x 10^25 | ~$78-100M+ |
| Gemini Ultra | 2023 | 5.0 x 10^25 | ~$191M |
| Llama 3.1 405B | 2024 | 3.8 x 10^25 | ~$25M |
| DeepSeek V3 | 2024 | ~5 x 10^24 | ~$5.6M |
| Claude 3 Opus | 2024 | 1.6 x 10^25 | Tens of millions |
As of 2025, over 30 publicly announced models from 12 different developers have exceeded the 10^25 FLOP training threshold. Training costs for frontier models have been growing at roughly 2.5 times per year, though the cost per unit of compute continues to drop due to hardware and algorithmic improvements.
GPT-4, with an estimated 1.76 trillion parameters, required approximately 25,000 NVIDIA A100 GPUs running for about 90 days. By contrast, DeepSeek V3 achieved frontier-level performance for approximately $5.6 million through optimized algorithms and strategic compute allocation, demonstrating that algorithmic efficiency can dramatically reduce costs.
Machine learning employs several distinct training paradigms, each suited to different kinds of data and tasks.
Supervised learning is the most common training paradigm. The model receives labeled examples (input-output pairs) and learns to map inputs to their corresponding outputs. The loss function measures the discrepancy between the model's predictions and the true labels. Common applications include image classification, spam detection, and medical diagnosis.
Unsupervised learning trains on data without labels. The model identifies patterns, structures, or groupings within the data on its own. Common techniques include clustering, dimensionality reduction, and density estimation.
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data. The labeled examples guide the learning process while the unlabeled data provides additional structural information, making this approach practical when labeling is expensive.
Self-supervised learning generates its own supervisory signal from the data itself, without requiring human-provided labels. The model is trained on pretext tasks such as predicting masked tokens (as in BERT), predicting the next token (as in GPT), or predicting future frames in video. This paradigm powers the pre-training phase of most modern large language models.
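A sketch of the next-token objective in PyTorch; `model` here is a hypothetical network that maps token ids to one logit per vocabulary entry, and the tensor shapes in the comments are the only assumptions made:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Self-supervised next-token prediction loss.

    tokens: integer tensor of shape (batch, seq_len); the labels are the
    input sequence itself, shifted by one position.
    """
    inputs = tokens[:, :-1]   # model sees every token except the last
    targets = tokens[:, 1:]   # ...and must predict each position's successor
    logits = model(inputs)    # expected shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```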
Reinforcement learning trains an agent to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and adjusts its policy to maximize cumulative reward over time.
Training modern large language models (LLMs) involves a multi-stage pipeline that goes beyond a single training phase.
Pre-training is the most computationally expensive stage. The model learns language patterns, factual knowledge, and reasoning abilities from vast corpora of text (often trillions of tokens drawn from web crawls, books, and code repositories). The training objective is typically next-token prediction: given a sequence of tokens, predict what comes next. Pre-training establishes the model's general capabilities and accounts for the vast majority of total training compute.
After pre-training, the model undergoes supervised fine-tuning (SFT) on curated datasets of high-quality input-output pairs. These datasets contain examples of desired behavior: following instructions, answering questions accurately, writing code, and refusing harmful requests. SFT is computationally cheap relative to pre-training (often 100 times less expensive or more) but has a large impact on the model's usefulness and safety.
Reinforcement learning from human feedback (RLHF) further aligns the model with human preferences. The process involves three sub-steps:

1. Collecting preference data: human annotators rank multiple model responses to the same prompt from best to worst.
2. Training a reward model: a separate model learns from those rankings to score any response the way a human rater would.
3. Optimizing the policy: the language model is updated with a reinforcement learning algorithm (commonly PPO) to maximize the reward model's score, typically with a penalty that keeps its behavior close to the SFT model's.
RLHF helps the model produce responses that are more helpful, harmless, and honest, addressing aspects of quality that are difficult to capture in supervised training data alone.
Training deep neural networks, especially at scale, is prone to several instability issues.
Loss spikes are sudden, sharp increases in the training loss that can occur during otherwise stable training. They are often caused by exploding gradients, bad data batches, or numerical overflow. In large-scale LLM training runs costing millions of dollars, loss spikes can waste significant compute if not handled properly. Common mitigations include gradient clipping, learning rate warmup, and skipping corrupted data batches.
The vanishing gradient problem occurs when gradients become extremely small during backpropagation through many layers, effectively preventing the earlier layers from learning. The exploding gradient problem is the opposite: gradients grow exponentially, causing unstable parameter updates and potentially NaN (not a number) values. Solutions include careful weight initialization (such as Xavier or He initialization), batch normalization, residual connections, and gradient clipping.
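A self-contained PyTorch sketch of gradient clipping; the threshold of 1.0 is a common default rather than a universal recommendation:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0,
# preventing a single bad batch from producing a huge update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```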
A NaN loss indicates a catastrophic numerical failure, usually caused by division by zero, taking the logarithm of zero or negative numbers, or unchecked gradient explosions. NaN loss typically requires restarting training from a checkpoint. Using BF16 mixed precision instead of FP16 reduces the risk because BF16's wider dynamic range prevents many overflow and underflow scenarios.
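One simple defensive pattern is to test the loss before backpropagating, so a non-finite value never reaches the weights; this is a minimal sketch with placeholder model and data:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1))

# Back-propagate only if the loss is a finite number; a NaN/Inf loss
# would otherwise poison every parameter in a single step.
if torch.isfinite(loss):
    loss.backward()
    optimizer.step()
else:
    # Skip this batch's update entirely; if non-finite losses recur,
    # restart from the most recent checkpoint.
    optimizer.zero_grad()
```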
Modern large-scale models cannot be trained on a single GPU. Distributed training spreads the workload across many devices, sometimes thousands.
In data parallelism, each GPU holds a complete copy of the model but processes a different subset of the training data. After each forward and backward pass, gradients are synchronized across all GPUs (typically via an all-reduce operation) so that every copy stays in sync. This is the simplest form of distributed training and works well when the model fits in a single GPU's memory.
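A minimal data-parallel sketch using PyTorch's DistributedDataParallel; it assumes a multi-GPU machine and launch via `torchrun`, and uses random tensors where a real job would use a sharded dataset:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(10, 1).to(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# In a real job a DistributedSampler would give each rank its own shard;
# random tensors stand in for that here.
x = torch.randn(32, 10, device=local_rank)
y = torch.randn(32, 1, device=local_rank)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()    # DDP all-reduces (averages) gradients across ranks here
optimizer.step()   # every replica applies the same synchronized update

dist.destroy_process_group()
```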
When a model is too large to fit on a single GPU, model parallelism splits the model across multiple devices. Tensor parallelism divides individual layers (such as large matrix multiplications) across GPUs, while pipeline parallelism assigns different layers to different GPUs, processing micro-batches in a pipelined fashion to reduce idle time.
Frontier models typically combine all three strategies. For example, Meta's Llama 3.1 405B was trained using tensor parallelism of 8, pipeline parallelism of 16, and data parallelism ranging from 8 to 128, distributed across 16,384 NVIDIA H100 GPUs. Frameworks like DeepSpeed, Megatron-LM, and PyTorch FSDP (Fully Sharded Data Parallel) provide implementations of these strategies.
Practitioners use a variety of techniques to improve training efficiency, stability, and final model quality.
Mixed-precision training uses lower-precision numerical formats (FP16 or BF16) for most computations while keeping a master copy of the weights in FP32. This cuts memory usage roughly in half and speeds up training by 30 to 50 percent on GPUs with tensor cores, with minimal impact on model quality. BF16 (Brain Float 16) is generally preferred over FP16 for training because it has the same dynamic range as FP32, eliminating the need for loss scaling in most cases.
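A minimal sketch of BF16 mixed precision with PyTorch autocast; it assumes a CUDA GPU with BF16 support, and the model and data are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()   # master weights stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda")

# Run the forward pass in bfloat16; because BF16 shares FP32's dynamic
# range, no loss scaler is needed (FP16 would require GradScaler).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
optimizer.step()
```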
Starting training with a high learning rate can cause instability because the randomly initialized parameters produce large, noisy gradients. Learning rate warmup begins with a very small learning rate and linearly increases it over a set number of steps (typically a few hundred to a few thousand). After warmup, a schedule such as cosine annealing gradually reduces the learning rate over the remainder of training. This warmup-then-decay pattern has become standard practice for training transformers.
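One way to implement this pattern is a `LambdaLR` schedule that scales the base learning rate by a warmup-then-cosine factor; the step counts below are illustrative:

```python
import math
import torch

warmup_steps, total_steps = 1_000, 100_000  # illustrative values

def lr_lambda(step):
    # Linear warmup from 0 up to the base learning rate...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ...then cosine decay from the base learning rate toward zero.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step().
```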
Gradient accumulation allows training with effectively larger batch sizes when GPU memory is limited. Instead of updating the model after every mini-batch, gradients are accumulated over several mini-batches, and a single parameter update is applied once the desired effective batch size has been reached. With the per-batch loss scaled by the number of accumulation steps, this closely matches training with the larger batch on more powerful hardware, as sketched below.
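A self-contained sketch with a mini-batch size of 8 and 4 accumulation steps, giving an effective batch size of 32 (all sizes illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
                    batch_size=8)

accum_steps = 4  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    # Scale each mini-batch loss so the accumulated gradient matches
    # the average over the full effective batch.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                  # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # one update per effective batch
        optimizer.zero_grad()
```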
During the forward pass, intermediate activations are stored in memory for use during backpropagation. Gradient checkpointing saves only a subset of these activations and recomputes the rest as needed during the backward pass. This trades roughly 20 percent additional compute time for a large reduction in memory usage, making it possible to train larger models or use larger batch sizes on the same hardware.
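PyTorch exposes this through `torch.utils.checkpoint`; the sketch below checkpoints a toy stack of layers in two segments (layer sizes are illustrative):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be stored.
layers = [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
model = nn.Sequential(*layers)
x = torch.randn(16, 512, requires_grad=True)

# Store activations only at the 2 segment boundaries; everything in
# between is recomputed during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, segments=2, input=x, use_reentrant=False)
out.sum().backward()
```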
Curriculum learning arranges training data so that the model encounters easier examples first and progressively harder examples later, mimicking how humans learn. This structured ordering can improve convergence speed, produce better local minima, and improve generalization, all without additional computational cost. The main challenges are defining a difficulty metric for the data and designing the schedule for increasing difficulty.
Tracking metrics during training is essential for diagnosing problems and ensuring the model is learning effectively. Two platforms dominate this space.
| Platform | Type | Key Features |
|---|---|---|
| TensorBoard | Open-source (Google) | Real-time visualization of loss, accuracy, and learning rate curves; histograms of weight and gradient distributions; model graph visualization; embedding projections. Integrates with both TensorFlow and PyTorch. |
| Weights & Biases (W&B) | Cloud-based (free tier available) | Experiment tracking with automatic logging of hyperparameters, git state, and hardware metrics; rich dashboards for comparing runs; gradient and parameter distribution tracking; team collaboration features; integration with most major frameworks. |
Key metrics to monitor during training include:

- Training loss and validation loss, to confirm the model is learning and to detect overfitting when the two curves diverge.
- Learning rate, to verify that warmup and decay schedules are behaving as configured.
- Gradient norm, an early indicator of exploding gradients and impending loss spikes.
- Throughput (samples or tokens per second) and hardware utilization, to catch data-loading or communication bottlenecks.
- Task-specific evaluation metrics, such as accuracy or perplexity on a held-out set.
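As a concrete example, logging scalar metrics to TensorBoard takes only a few lines (the metric values below are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # illustrative directory

for step in range(100):
    loss, lr, grad_norm = 1.0 / (step + 1), 3e-4, 0.5  # placeholder values
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/learning_rate", lr, step)
    writer.add_scalar("train/grad_norm", grad_norm, step)

writer.close()  # flush events; view with `tensorboard --logdir runs`
```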
Reproducing a training run exactly is one of the most persistent challenges in machine learning. Sources of non-determinism include:

- Random seeds governing weight initialization, data shuffling, and dropout.
- Non-deterministic GPU kernels, where floating-point additions are reordered across threads and addition is not associative.
- Differences in library, driver, and hardware versions between runs.
- Asynchronous data loading and other scheduling variability in parallel code.
Best practices for improving reproducibility include pinning all software versions, recording hyperparameters and random seeds, using deterministic algorithms where available, and leveraging experiment tracking tools to log every detail of each run.
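A common seeding routine in PyTorch looks like the following sketch; note that full bit-for-bit reproducibility additionally depends on hardware and library versions staying fixed:

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                        # Python's built-in RNG
    np.random.seed(seed)                     # NumPy's RNG
    torch.manual_seed(seed)                  # PyTorch CPU and GPU RNGs
    # Prefer deterministic kernels; ops without one will raise an error.
    torch.use_deterministic_algorithms(True)
    # Required by some CUDA matmul kernels in deterministic mode; must be
    # set before the first CUDA operation runs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

set_seed(42)
```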
The economics of training have shifted rapidly. Frontier model training costs have grown at roughly 2.5 times per year since 2018, with the largest runs in 2025 estimated at $200 million to $400 million. At the same time, the cost per unit of compute drops roughly 10 times per decade due to improvements in hardware (such as NVIDIA's H100 and B200 GPUs), software frameworks, and algorithmic efficiency.
| Year | Representative Model | Approximate Training Cost |
|---|---|---|
| 2018 | BERT-Large | ~$7,000 |
| 2020 | GPT-3 175B | ~$4.6M |
| 2023 | GPT-4 | ~$78-100M+ |
| 2024 | Llama 3.1 405B | ~$25M |
| 2024 | DeepSeek V3 | ~$5.6M |
| 2025 | Frontier 405B+ | $80M-$400M |
Notably, DeepSeek V3 demonstrated that algorithmic innovations (such as a Mixture-of-Experts architecture and optimized training procedures) can achieve frontier-level performance at a fraction of the cost of brute-force scaling. This suggests that raw compute spending is not the only path to better models.
GPU cloud pricing has also evolved. In mid-2025, AWS reduced H100 instance pricing by approximately 44 percent, bringing on-demand costs to roughly $3.90 per GPU-hour, with reserved pricing falling below $2.00 per GPU-hour.
Imagine you are learning to throw a basketball into a hoop. Every time you throw the ball, you watch where it lands. If it went too far to the left, you adjust and throw a little more to the right next time. If it fell short, you throw harder. Each throw is like one training step: you try something, see how wrong you were, and make a small correction.
Training a computer works the same way. You show it thousands of examples (like photos of cats and dogs) and each time it guesses wrong, it adjusts its internal settings a tiny bit. After seeing enough examples and making enough adjustments, it gets really good at telling cats from dogs, even when it sees a photo it has never seen before.