Model training
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,500 words
See also: Machine learning terms
Model training in machine learning is the process of fitting a model's parameters to a training dataset so that the model can make accurate predictions or decisions on new inputs. During training, the algorithm repeatedly compares the model's current output to the desired output, measures the gap using a loss function, and adjusts the parameters to shrink that gap. The goal is a model that generalizes from examples it has seen to data it has never seen before, without memorizing quirks of the training set, a failure mode known as overfitting.
Training is the most expensive part of building most modern AI systems. Frontier large language models consume thousands of GPUs for weeks or months, while smaller computer vision classifiers can be trained on a laptop in minutes. The mathematical machinery is the same in both cases: a loss function, a procedure to estimate the gradient of that loss with respect to the parameters, and an optimizer that uses the gradient to update the parameters.
Nearly all deep learning systems use the same iterative training loop. Each iteration, called a step, runs four substeps in order.
First, the model performs a forward pass. The current batch of inputs flows through the network layer by layer. Each layer applies its weights, biases, and activation function, and the final layer produces a prediction. Intermediate activations are cached for the backward pass.
Second, the loss function compares predictions to the ground truth labels. For regression problems, the typical choice is mean squared error. For classification problems, cross-entropy loss is standard. Modern transformer language models also use cross-entropy, computed token by token against the next-token target.
Third, backpropagation computes the gradient of the loss with respect to every parameter. Backpropagation is an efficient application of the chain rule from calculus, propagating partial derivatives from the output layer backward through the network. Frameworks like PyTorch, TensorFlow, and JAX build a computation graph during the forward pass and walk it in reverse, so users rarely write gradient code by hand.
Fourth, the optimizer applies the gradients to the parameters. The simplest update rule subtracts the gradient times a learning rate, but most optimizers also maintain running estimates of gradient statistics and use those to scale the update. The next batch is drawn and the loop repeats.
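The four substeps can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any framework implements training: the model is a one-variable linear regression, the gradients are derived by hand instead of by automatic differentiation, and the function name `train_linear` and its default hyperparameters are assumptions chosen for the example.

```python
import random

def train_linear(xs, ys, lr=0.05, epochs=500, batch_size=4, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b, loss = 0.0, 0.0, float("inf")
    data = list(zip(xs, ys))
    for _ in range(epochs):  # one epoch is one full pass over the data
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # 1. Forward pass: compute predictions for the batch.
            preds = [w * x + b for x, _ in batch]
            # 2. Loss: mean squared error against the targets.
            errors = [p - y for p, (_, y) in zip(preds, batch)]
            loss = sum(e * e for e in errors) / len(batch)
            # 3. Backward pass: gradients of the loss, derived by hand here.
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # 4. Optimizer step: plain gradient descent.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b, loss

# Toy data generated from y = 3x + 1, so training should recover w = 3 and b = 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3.0 * x + 1.0 for x in xs]
w, b, final_loss = train_linear(xs, ys)
```

In a deep learning framework, step 3 is replaced by automatic differentiation and step 4 by a library optimizer, but the loop has the same shape.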
A full pass through the training set is called an epoch. Small models often train for many epochs. Frontier language models typically see each document only once or a few times because the dataset is so large.
The loss function turns model errors into a single number the optimizer can minimize. The choice of loss shapes what the model learns.
For regression, mean squared error (MSE) penalizes the squared difference between prediction and target, and mean absolute error (MAE) penalizes the absolute difference. MAE is more robust to outliers. For classification, cross-entropy loss measures the negative log-probability that the model assigns to the correct class, and it gives much sharper gradients than MSE when the model is confidently wrong, which is why it dominates classification. Binary cross-entropy handles two-class problems; categorical cross-entropy handles multi-class problems.
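The difference between these losses is easy to see numerically. The sketch below uses made-up numbers: a single outlier target for the regression losses, and a two-class prediction that is confidently wrong for cross-entropy. The function names are illustrative.

```python
import math

def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, correct_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[correct_index])

# The last target is an outlier; it dominates MSE far more than MAE.
preds = [1.0, 2.0, 3.0, 4.0]
targets = [1.0, 2.0, 3.0, 14.0]
print(mse(preds, targets))  # 25.0
print(mae(preds, targets))  # 2.5

# Confidently wrong: the correct class (index 1) gets probability 0.01.
print(cross_entropy([0.99, 0.01], correct_index=1))  # about 4.61
```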
Other losses include hinge loss for support vector machines, contrastive losses for self-supervised representation learning, and KL divergence for matching probability distributions. Alignment recipes such as direct preference optimization and proximal policy optimization define their own objectives, computed from the model's token log-probabilities rather than from a plain cross-entropy against fixed labels.
The optimizer turns gradients into parameter updates. A handful of methods dominate.
Stochastic gradient descent (SGD) is the foundational optimizer. Rather than computing the gradient over the entire training set on each step, SGD estimates it from a single sample or a small batch. The estimate is noisy, but each step is cheap, and the noise sometimes helps the optimizer escape bad local minima. SGD with momentum, which adds a moving average of past gradients to smooth the trajectory, remains common in computer vision.
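A momentum update can be sketched as follows. Conventions differ between libraries; this version accumulates gradients into a velocity term and steps along it, which is one common formulation and an assumption of this example. The toy objective f(x) = x² and the hyperparameter values are also illustrative.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update; returns new parameter and velocity lists."""
    # The velocity is a decaying accumulation of past gradients.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # Parameters move along the smoothed direction, not the raw gradient.
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
params, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * params[0]]
    params, velocity = sgd_momentum_step(params, grads, velocity)
```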
Adam, short for Adaptive Moment Estimation, was introduced by Diederik Kingma and Jimmy Ba in 2014. It tracks both a running mean of gradients and a running mean of squared gradients for each parameter, giving each parameter its own effective learning rate. Adam converges quickly with little tuning and became the default optimizer for deep learning by the late 2010s.
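AdamW, proposed by Ilya Loshchilov and Frank Hutter in 2017, corrects how Adam handles weight decay. In the original formulation, an L2 penalty is added to the gradient and is then rescaled by Adam's per-parameter learning rates, so parameters with large gradient histories are regularized less than intended. In AdamW, weight decay is decoupled from the gradient update and applied directly to the parameters, which tends to produce better generalization. AdamW is now the standard choice for training transformer models.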
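The update rule can be written out for a single scalar parameter. This is a sketch of the published AdamW algorithm, not a copy of any library's implementation; the function name `adamw_step` and the default hyperparameter values are assumptions for illustration.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param - lr * weight_decay * param
    # Running mean of gradients and of squared gradients.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized running means.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size via sqrt(v_hat).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adamw_step(param, grad=0.5, m=m, v=v, t=1)
```

On the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` is close to 1 in magnitude, so the parameter moves by roughly `lr` regardless of the gradient's scale.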
Adam stores two extra tensors the size of the parameters themselves. For a 70B parameter model with the optimizer state kept in bfloat16 (2 bytes per value), that adds about 280 GB; keeping the state in float32, as is common, doubles the figure. Adafactor, introduced by Google researchers in 2018 and later used to train the T5 model, factorizes the second moment estimate into row and column statistics, dramatically reducing memory at a small cost in convergence speed. Variants such as Lion and the 8-bit optimizers from bitsandbytes make similar trade-offs and are popular for fine-tuning large models on limited hardware.
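The memory figures follow from simple arithmetic, sketched below. The helper names are hypothetical, 1 GB is taken as 10^9 bytes, and the 4096 by 4096 weight matrix is an arbitrary example size.

```python
def adam_state_gigabytes(num_params, bytes_per_value):
    """Memory for Adam's two per-parameter moment tensors, in GB (1 GB = 1e9 bytes)."""
    return 2 * num_params * bytes_per_value / 1e9

def second_moment_values(rows, cols, factored):
    """Values stored for one weight matrix's second-moment estimate."""
    # Factored storage keeps one statistic per row and one per column.
    return rows + cols if factored else rows * cols

print(adam_state_gigabytes(70e9, 2))  # 280.0 (bfloat16 state)
print(adam_state_gigabytes(70e9, 4))  # 560.0 (float32 state)

full = second_moment_values(4096, 4096, factored=False)      # 16777216
factored = second_moment_values(4096, 4096, factored=True)   # 8192
```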
Hyperparameters are settings chosen before training that control the optimization process itself. Three matter the most.
The learning rate determines the size of each parameter update. Too high, and the loss diverges; too low, and training crawls or gets stuck. Modern practice uses a learning rate schedule that warms up over the first few hundred or thousand steps, then decays (often as a cosine curve) toward zero. AdamW with warmup and cosine decay is the de facto recipe for transformer training.
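A warmup-then-cosine schedule can be expressed as a function of the step number. The sketch below uses linear warmup; the peak rate, warmup length, and total step count are placeholder values, not recommendations.

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from max_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_step(s) for s in (0, 999, 1000, 50_500, 100_000)]
```

The sampled steps cover the start of warmup, the peak, the midpoint of the decay (half the peak rate), and the end of training.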
The batch size controls how many examples contribute to each gradient estimate. Larger batches give smoother gradients and use accelerator hardware more efficiently, but they reduce the number of update steps per epoch and can hurt generalization if pushed too far. Learning rate and batch size are linked: when batch size doubles, the optimal learning rate typically rises as well. Common heuristics scale the learning rate linearly with batch size for SGD, or with its square root for adaptive optimizers such as Adam.
The number of epochs (or, for language models, total tokens seen) sets how long training runs. Other hyperparameters include dropout rate, weight decay strength, gradient clipping threshold, and architecture-specific choices like the number of layers or attention heads. Practitioners use grid search, random search, or Bayesian optimization, but for the largest models, hyperparameters are often inherited from smaller pilot runs.
Different problems call for different supervision signals.
In supervised learning, every training example comes with a target label. The model learns the mapping from input to label by minimizing a loss between its prediction and the label. This is the dominant paradigm for image classification, speech recognition, and most tabular machine learning. Subtypes include regression (continuous targets) and classification (discrete targets).
Unsupervised learning discovers structure in data without explicit labels. Clustering algorithms group similar examples, dimensionality reduction methods compress data into a smaller representation, and density estimation learns the probability distribution of the inputs.
Self-supervised learning is the workhorse behind modern foundation models. The model generates its own labels from the raw data. Language models predict the next token given the previous tokens. Masked image models recover hidden patches. Contrastive methods learn to pull together two augmented views of the same image and push apart views from different images. Self-supervised pretraining produces general-purpose representations that can be reused across many downstream tasks. BERT, GPT, wav2vec, and most modern vision models all rely on it.
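For next-token prediction, "generating its own labels" amounts to shifting the token sequence by one position. A minimal sketch, with made-up integer token IDs and a hypothetical helper name:

```python
def next_token_pairs(token_ids, context_length):
    """Build (input, target) windows where the target is the input shifted by one token."""
    pairs = []
    for start in range(len(token_ids) - context_length):
        inputs = token_ids[start:start + context_length]
        # The label for each position is simply the token that follows it.
        targets = token_ids[start + 1:start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs

tokens = [10, 11, 12, 13, 14]
pairs = next_token_pairs(tokens, context_length=3)
# pairs == [([10, 11, 12], [11, 12, 13]), ([11, 12, 13], [12, 13, 14])]
```

No human annotation is involved: every training target is already present in the raw text.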
In reinforcement learning (RL), an agent interacts with an environment, takes actions, and receives rewards or penalties. There are no labels; the training signal comes from a reward function. The agent learns a policy that maximizes expected cumulative reward. RL trained AlphaGo to play Go and powers many robotic control systems. Modern reasoning models also use RL with verifiable rewards to learn step-by-step problem solving in math and code.
Frontier large language models go through a pipeline of stages, each with its own data, loss, and goals.
Pretraining is the heavy lifting. The model is trained on trillions of tokens of internet text, code, and books using a self-supervised next-token prediction objective. This stage instills the broad world knowledge, grammar, and rough reasoning ability that the model will later refine. Pretraining a flagship model can cost tens of millions of dollars in compute and run for months on thousands of GPUs.
Supervised fine-tuning (SFT) teaches the pretrained model how to behave. The training data consists of curated input-output pairs written or vetted by humans: instructions with high-quality responses, dialogues, chain-of-thought solutions to math problems, and so on. SFT shapes the model into a useful assistant that follows instructions. The optimization objective is still next-token prediction, but the data is small and carefully chosen.
After SFT, the model is aligned with human preferences. Classical RLHF (reinforcement learning from human feedback) was popularized for large language models by OpenAI's InstructGPT work in 2022, building on earlier preference-learning research. Annotators rank pairs of model outputs, a reward model is trained on these rankings, and the language model is then optimized to produce responses the reward model scores highly, using a policy gradient method such as PPO. KL divergence regularization keeps the policy from drifting too far from the SFT model.
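Reward models are commonly trained with a pairwise logistic loss that pushes the score of the preferred response above the score of the rejected one. The sketch below shows that loss for a single pair; the scalar rewards stand in for the outputs of a reward model, and the function name is an assumption for illustration.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals softplus(-d) = max(-d, 0) + log(1 + exp(-|d|)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

agree = pairwise_reward_loss(2.0, 0.0)     # reward model agrees with the annotator
tie = pairwise_reward_loss(1.0, 1.0)       # no preference expressed: log(2)
disagree = pairwise_reward_loss(0.0, 2.0)  # reward model ranks the pair backwards
```

The loss is small when the reward model already ranks the pair the way the annotator did and grows roughly linearly as it ranks them the wrong way round.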
Direct preference optimization (DPO), introduced in 2023, achieves similar results without an explicit reward model or RL loop. It treats the language model being trained as an implicit reward model, measured relative to a frozen reference model (usually the SFT model), and optimizes a closed-form loss over preference pairs. DPO is generally simpler to implement and more stable to train than PPO. Variants such as SimPO, KTO, and IPO offer further trade-offs.
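The DPO loss for one preference pair can be sketched from four sequence log-probabilities: those of the chosen and rejected responses under the policy being trained and under a frozen reference model, typically the SFT model. The argument names and the value of `beta` below are illustrative, and the log-probabilities are made-up numbers.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits), computed stably as softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss starts at log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```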
Throughout 2025, reinforcement learning with verifiable rewards became the preferred approach for reasoning tasks. Methods such as GRPO (used in DeepSeek-R1) and DAPO use automatically checkable signals, such as whether a math answer is correct or whether code passes unit tests, skipping the human preference step for those domains.
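A simplified sketch of the idea: sample several answers to the same prompt, score each with an automatic checker, and normalize the scores within the group so that better-than-average answers receive positive advantages. The exact-match checker and the helper names are assumptions for illustration; real systems use more careful answer parsing and combine these advantages with a clipped policy-gradient objective.

```python
import statistics

def verifiable_reward(model_answer, reference_answer):
    """1.0 if the answer matches the reference exactly (ignoring whitespace), else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

answers = ["42", "41", " 42 ", "7"]  # four sampled answers to one math prompt
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```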
A single accelerator is not enough for any modern frontier model. Training is parallelized across many GPUs or TPUs using several strategies that are often combined.
Data parallelism is the simplest. Each device holds a full copy of the model and processes a different shard of the batch. After each backward pass, gradients are averaged across devices using an all-reduce operation, so every replica stays in sync. PyTorch DDP and DeepSpeed ZeRO implement this, with ZeRO further sharding optimizer state, gradients, and parameters across replicas to fit larger models.
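The effect of the all-reduce can be simulated on one machine with plain lists. This sketch only mimics the arithmetic; it does no actual communication, and the per-device gradient values are made up.

```python
def all_reduce_mean(per_device_grads):
    """Average gradients element-wise across devices, as an all-reduce would."""
    num_devices = len(per_device_grads)
    averaged = [sum(vals) / num_devices for vals in zip(*per_device_grads)]
    # Every device ends up holding its own copy of the same averaged gradient.
    return [list(averaged) for _ in range(num_devices)]

# Four devices, each with a gradient for the same two parameters,
# computed from a different shard of the batch.
grads = [[0.2, -0.4], [0.4, 0.0], [0.0, 0.2], [0.2, 0.2]]
synced = all_reduce_mean(grads)
```

Because every replica applies the same averaged gradient, the model copies remain identical after the optimizer step.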
Tensor (model) parallelism splits individual layers across devices. Each GPU computes a slice of a matrix multiplication, and partial results are reduced across the group. This is essential when a single layer's weights exceed one device's memory.
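One way to split a matrix multiplication is along its inner dimension: each device holds a block of the weight's rows and the matching slice of the input, computes a partial product, and the partial products are summed. The sketch below simulates this on one machine with nested lists; the function names are illustrative and the inner dimension is assumed to divide evenly across devices.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_parallel_matmul(x, weight, num_devices):
    """Compute x @ weight by sharding the inner dimension across devices."""
    shard = len(weight) // num_devices  # assumes an even split
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        x_slice = [row[lo:hi] for row in x]  # input columns owned by device d
        w_slice = weight[lo:hi]              # weight rows owned by device d
        partials.append(matmul(x_slice, w_slice))
    # The reduction step: sum the partial outputs from all devices.
    rows, cols = len(x), len(weight[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]

x = [[1, 2, 3, 4]]
weight = [[1, 0], [0, 1], [1, 1], [2, 0]]
result = row_parallel_matmul(x, weight, num_devices=2)  # [[12, 5]]
```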
Pipeline parallelism splits the model layer-wise across devices. Activations flow forward stage by stage and gradients flow backward. Schemes such as GPipe and PipeDream keep all stages busy by processing multiple micro-batches concurrently. Sequence and expert parallelism extend these ideas to long contexts and mixture-of-experts models. Frameworks such as Megatron-LM, DeepSpeed, FSDP, and Colossal-AI compose multiple parallelism axes into a single training job.
Matrix multiplications dominate training compute, so accelerators with strong tensor performance are the workhorses. NVIDIA GPUs have led the market for nearly a decade. The A100 (Ampere, 2020) was the most-used training chip from 2020 to 2022; the H100 (Hopper, 2022) introduced FP8 tensor cores and the Transformer Engine, delivering up to roughly four times the training throughput of the A100 on large language models by NVIDIA's own figures. The B200 (Blackwell, 2024) extended this further. According to Epoch AI, the A100 has been the single most popular chip for training notable ML models.
Google's Tensor Processing Units (TPUs) are an alternative used heavily inside Google and via Google Cloud. TPU v4, v5e, and v5p pods scale to thousands of chips connected by a custom interconnect, and they trained Google's Gemini models. AWS Trainium and various startup accelerators round out the landscape, though NVIDIA still trains the majority of frontier systems.
The goal of training is to generalize to new data, not just fit the training data. A model that memorizes the training set but fails on held-out examples has overfit. The opposite failure, underfitting, happens when the model is too simple or undertrained to capture the underlying pattern.
Practitioners detect overfitting by holding out a validation set and tracking validation loss alongside training loss. When validation loss stops improving while training loss keeps falling, the model is overfitting. Techniques such as early stopping, dropout, weight decay (closely related to L2 regularization), data augmentation, and ensemble methods help close the gap. K-fold cross-validation rotates the held-out set across the data for a more reliable estimate.
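Early stopping is usually implemented with a patience counter: training halts once validation loss has failed to improve for a set number of consecutive evaluations. A minimal sketch, with a made-up validation curve and an illustrative patience of 3:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation loss bottoms out at epoch 3, then rises as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]
stop_epoch = early_stopping_epoch(val_losses)  # 6
```

In practice the weights from the best epoch (epoch 3 here) are restored after stopping, which requires checkpointing whenever the validation loss improves.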
Full retraining of a pretrained model is expensive and often unnecessary. Fine-tuning reuses pretrained weights and continues training on a smaller dataset for a specific task.
Parameter-efficient fine-tuning (PEFT) methods update only a small fraction of the model's parameters. The most popular technique is LoRA (Low-Rank Adaptation), introduced by Microsoft researchers in 2021. LoRA freezes the original weights and learns a pair of small low-rank matrices whose product is added to the frozen weights; after training, the update can be merged into the weights so inference incurs no extra cost. Trainable parameter counts drop by orders of magnitude (often under 1% of the full model) with little loss in task performance. QLoRA combines LoRA with 4-bit quantization to fine-tune very large models on a single GPU.
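The parameter savings follow from the shapes involved. For a weight matrix of shape d_out by d_in, a rank-r update needs one matrix of shape d_out by r and one of shape r by d_in. The sketch below counts both cases; the layer size and rank are example values and the helper name is hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable values for full fine-tuning vs. a rank-`rank` LoRA update of one matrix."""
    full = d_out * d_in
    # Two low-rank factors: (d_out x rank) and (rank x d_in).
    lora = rank * (d_out + d_in)
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)
fraction = lora / full
# full == 16777216, lora == 65536, fraction == 0.00390625 (about 0.39%)
```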
Imagine teaching a robot to tell cats apart from dogs. You show the robot a picture and ask, "Cat or dog?" The robot guesses. If it gets it wrong, you say, "Nope," and it tweaks the little dials in its brain. Show it enough pictures, and the dials end up in just the right positions so the robot guesses correctly almost every time, even on cats and dogs it has never seen. The pictures and answers are the training data, the dials are the parameters, the "nope" signal is the loss function, and the rule for how much to turn the dials is the optimizer.