Model training

Updated Jun 24, 2026

See also: Machine learning terms

Model training is the process of fitting a machine learning model's parameters to data so that it makes accurate predictions on new, unseen inputs. The training algorithm repeatedly runs a four-step loop: a forward pass produces a prediction, a loss function measures the gap between that prediction and the desired output, backpropagation computes the gradient of the loss with respect to every parameter, and an optimizer nudges the parameters down that gradient. Repeated over millions of steps, this procedure of gradient descent turns a randomly initialized network into a system that generalizes from its examples. Training is the most expensive stage of building a modern AI system: Stanford's 2024 AI Index estimated that the compute to train OpenAI's GPT-4 cost about 78 million dollars, while Google's Gemini Ultra cost roughly 191 million dollars.^[19]

Introduction

Model training in machine learning is the process of fitting a model's parameters to a training dataset so that the model can make accurate predictions or decisions on new inputs. During training, the algorithm repeatedly compares the model's current output to the desired output, measures the gap using a loss function, and adjusts the parameters to shrink that gap. The goal is a model that generalizes from examples it has seen to data it has never seen before, without memorizing quirks of the training set, a failure mode known as overfitting.

Training is the most expensive part of building most modern AI systems. Frontier large language models consume thousands of GPUs for weeks or months, while smaller computer vision classifiers can be trained on a laptop in minutes. The mathematical machinery is the same in both cases: a loss function, a procedure to estimate the gradient of that loss with respect to the parameters, and an optimizer that uses the gradient to update the parameters.

How does the training loop work?

Nearly all deep learning systems use the same iterative training loop. Each iteration, called a step, runs four substeps in order.

First, the model performs a forward pass. The current batch of inputs flows through the network layer by layer. Each layer applies its weights, biases, and activation function, and the final layer produces a prediction. Intermediate activations are cached for the backward pass.

Second, the loss function compares predictions to the ground truth labels. For regression problems, the typical choice is mean squared error. For classification problems, cross-entropy loss is standard. Modern transformer language models also use cross-entropy, computed token by token against the next-token target.

Third, backpropagation computes the gradient of the loss with respect to every parameter. Backpropagation is an efficient application of the chain rule from calculus, propagating partial derivatives from the output layer backward through the network.^[1] Frameworks like PyTorch, TensorFlow, and JAX build a computation graph during the forward pass and walk it in reverse, so users rarely write gradient code by hand.

Fourth, the optimizer applies the gradients to the parameters. The simplest update rule subtracts the gradient times a learning rate, but most optimizers also maintain running estimates of gradient statistics and use those to scale the update. The next batch is drawn and the loop repeats.

A full pass through the training set is called an epoch. Small models often train for many epochs. Frontier language models typically see each document only once or a few times because the dataset is so large.
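The four substeps and the notion of an epoch can be sketched on a deliberately tiny problem. The example below is a minimal illustration, not how any framework implements training: it fits a one-variable linear model with hand-written gradients in place of automatic backpropagation, and the function name `train_linear`, the data, and the hyperparameter values are all made up for the example.

```python
import random

def train_linear(data, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)    # random initialization
    data = list(data)
    loss = float("nan")
    for _ in range(epochs):                           # one epoch = one full pass
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):     # one iteration = one step
            batch = data[i:i + batch_size]
            n = len(batch)
            # 1. Forward pass: compute predictions for the batch.
            preds = [w * x + b for x, _ in batch]
            # 2. Loss: mean squared error against the targets.
            loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
            # 3. Gradients (written by hand; frameworks derive these automatically).
            grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
            grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / n
            # 4. Optimizer step: plain gradient descent.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b, loss

# Noise-free synthetic data generated from y = 3x - 1.
points = [(k / 10, 3.0 * (k / 10) - 1.0) for k in range(-10, 11)]
w, b, final_loss = train_linear(points)
print(round(w, 3), round(b, 3))
```

Because the data here is noise-free, the learned `w` and `b` should land very close to 3 and -1.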

Loss functions

The loss function turns model errors into a single number the optimizer can minimize. The choice of loss shapes what the model learns.

For regression, mean squared error (MSE) penalizes the squared difference between prediction and target, and mean absolute error (MAE) penalizes the absolute difference. MAE is more robust to outliers. For classification, cross-entropy loss measures the negative log-probability that the model assigns to the correct class, and it gives much sharper gradients than MSE when the model is confidently wrong, which is why it dominates classification. Binary cross-entropy handles two-class problems; categorical cross-entropy handles multi-class problems.
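A small numeric sketch can make these comparisons concrete. It assumes a single sigmoid output `p` for the binary case, so the gradients are taken with respect to the pre-sigmoid logit; the helper names are illustrative only.

```python
import math

def mse(pred, target):
    return (pred - target) ** 2

def mae(pred, target):
    return abs(pred - target)

def binary_cross_entropy(p, label):
    """Negative log-probability assigned to the correct class (label is 0 or 1)."""
    return -math.log(p if label == 1 else 1.0 - p)

# Gradients with respect to the logit z, where p = sigmoid(z).
def grad_logit_ce(p, label):
    return p - label

def grad_logit_mse(p, label):
    return 2 * (p - label) * p * (1 - p)

# An outlier that is off by 10 costs 100 under MSE but only 10 under MAE.
print(mse(10.0, 0.0), mae(10.0, 0.0))

# Confidently wrong: the true label is 1 but the model predicts p = 0.01.
print(grad_logit_ce(0.01, 1))    # about -0.99: a strong corrective signal
print(grad_logit_mse(0.01, 1))   # about -0.02: the sigmoid has saturated
```

In this setting the cross-entropy gradient is roughly fifty times larger than the MSE gradient, which is the sharper signal the paragraph above refers to.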

Other losses include hinge loss for support vector machines, contrastive losses for self-supervised representation learning, and KL divergence for matching probability distributions. Alignment recipes such as direct preference optimization and proximal policy optimization define their own objectives, which are applied to a model that has already been trained with standard cross-entropy.

Optimizers

The optimizer turns gradients into parameter updates. A handful of methods dominate.

Stochastic gradient descent

Stochastic gradient descent (SGD) is the foundational optimizer. Rather than computing the gradient over the entire training set on each step, SGD estimates it from a single sample or a small batch.^[2] The estimate is noisy, but each step is cheap, and the noise sometimes helps the optimizer escape bad local minima. SGD with momentum, which adds a moving average of past gradients to smooth the trajectory, remains common in computer vision.
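As a rough sketch of the momentum variant, the update below keeps one velocity value per parameter and folds each new gradient into it. The formulation (velocity accumulates raw gradients, then the learning rate scales the step) is one common convention; other texts scale the gradient before accumulating. The function name and the toy objective are assumptions made for the example.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """Apply one in-place SGD-with-momentum update to a list of scalar parameters."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + g   # moving accumulation of past gradients
        params[i] -= lr * velocity[i]              # step along the smoothed direction
    return params, velocity

# Toy objective f(x) = x^2, whose gradient is 2x; the minimum is at x = 0.
params, velocity = [5.0], [0.0]
for _ in range(300):
    grads = [2 * params[0]]
    sgd_momentum_step(params, grads, velocity)
print(params[0])
```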

Adam and AdamW

Adam, short for Adaptive Moment Estimation, was introduced by Diederik Kingma and Jimmy Ba in 2014.^[4] It tracks both a running mean of gradients and a running mean of squared gradients for each parameter, giving each parameter its own effective learning rate.^[4] Adam converges quickly with little tuning and became the default optimizer for deep learning by the late 2010s.^[15]

AdamW, proposed by Ilya Loshchilov and Frank Hutter in 2017, fixes a subtle flaw in how weight decay was usually combined with Adam: implementing decay as an L2 penalty in the loss sends the decay term through Adam's per-parameter scaling, so it no longer behaves like true weight decay.^[5] In AdamW, weight decay is decoupled from the gradient update and applied directly to the parameters, which produces better generalization.^[5] AdamW is now the standard choice for training transformer models.^[15]
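The two running averages and the decoupled decay can be written out for a single scalar parameter. This is a minimal sketch under the usual textbook formulation with bias correction; the default values shown are common choices rather than anything prescribed by this article, and `adamw_step` is a hypothetical helper name.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * weight_decay * p            # decoupled weight decay, applied to p directly
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter scaled step
    return p, m, v

# The first step has size about lr regardless of the gradient's magnitude.
small, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.0)
large, _, _ = adamw_step(1.0, 50.0, 0.0, 0.0, t=1, weight_decay=0.0)
print(small, large)
```

Setting `weight_decay=0.0` recovers plain Adam, which makes the per-parameter normalization easy to see: gradients of 0.5 and 50 produce nearly the same step.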

Adafactor and memory-efficient variants

Adam stores two extra tensors the size of the parameters themselves. For a 70B parameter model in bfloat16 (2 bytes per value), that adds about 280 GB of optimizer state. Adafactor, introduced by Google researchers in 2018 and later used to train the T5 model, factorizes the second moment estimate into row and column statistics, dramatically reducing memory at a small cost in convergence speed. Other optimizers such as Lion and the 8-bit optimizers from bitsandbytes make similar memory trade-offs and are popular for fine-tuning large models on limited hardware.
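The memory figures follow from simple arithmetic, sketched below. The 8192 by 8192 weight matrix is an arbitrary illustrative shape, and the factored count reflects only the row-plus-column idea described above, not every detail of Adafactor.

```python
def adam_state_bytes(n_params, bytes_per_value=2):
    """Adam keeps two extra values (first and second moment) per parameter."""
    return 2 * n_params * bytes_per_value

def second_moment_values(rows, cols, factored):
    """Values stored for the second moment of one rows x cols weight matrix."""
    return rows + cols if factored else rows * cols

print(adam_state_bytes(70e9) / 1e9)                      # 280.0 (GB)
print(second_moment_values(8192, 8192, factored=False))  # 67108864
print(second_moment_values(8192, 8192, factored=True))   # 16384
```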

Hyperparameters

Hyperparameters are settings chosen before training that control the optimization process itself. Three matter the most.

The learning rate determines the size of each parameter update. Too high, and the loss diverges; too low, and training crawls or gets stuck.^[6] Modern practice uses a learning rate schedule that warms up over the first few hundred or thousand steps, then decays (often as a cosine curve) toward zero. AdamW with warmup and cosine decay is the de facto recipe for transformer training.
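A warmup-then-cosine schedule is short enough to write down directly. The sketch below assumes a linear warmup and decay all the way to `min_lr`; the peak value, warmup length, and function name are placeholders, and real training recipes vary in these details.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # ramp up from near zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                            # clamp past the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 1000, 5500, 10000):
    print(s, lr_at(s, total_steps=10000))
```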

The batch size controls how many examples contribute to each gradient estimate. Larger batches give smoother gradients and use accelerator hardware more efficiently, but they reduce the number of update steps per epoch and can hurt generalization if pushed too far. Learning rate and batch size are linked: when batch size doubles, the optimal learning rate typically rises as well, and common heuristics scale it either linearly or with the square root of the batch size.

The number of epochs (or, for language models, total tokens seen) sets how long training runs. Other hyperparameters include dropout rate, weight decay strength, gradient clipping threshold, and architecture-specific choices like the number of layers or attention heads. Practitioners use grid search, random search, or Bayesian optimization, but for the largest models, hyperparameters are often inherited from smaller pilot runs.

What are the main training paradigms?

Different problems call for different supervision signals.

Supervised learning

In supervised learning, every training example comes with a target label. The model learns the mapping from input to label by minimizing a loss between its prediction and the label. This is the dominant paradigm for image classification, speech recognition, and most tabular machine learning. Subtypes include regression (continuous targets) and classification (discrete targets).

Unsupervised learning

Unsupervised learning discovers structure in data without explicit labels. Clustering algorithms group similar examples, dimensionality reduction methods compress data into a smaller representation, and density estimation learns the probability distribution of the inputs.

Self-supervised learning

Self-supervised learning is the workhorse behind modern foundation models. The model generates its own labels from the raw data. Language models predict the next token given the previous tokens. Masked image models recover hidden patches. Contrastive methods learn to pull together two augmented views of the same image and push apart views from different images. Self-supervised pretraining produces general-purpose representations that can be reused across many downstream tasks.^[18] BERT, GPT, wav2vec, and most modern vision models all rely on it.^[18]
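The idea that the model generates its own labels can be shown for next-token prediction: every position in a sequence yields one training example whose target is simply the following token. The sketch uses whole words as tokens and a hypothetical `next_token_examples` helper; real systems use subword tokenizers and process all positions in parallel.

```python
def next_token_examples(tokens, context_len):
    """Turn a raw token sequence into (context, target) pairs with no human labels."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_len):i]   # the previous tokens
        examples.append((context, tokens[i]))         # the target is the next token
    return examples

for context, target in next_token_examples(["the", "cat", "sat", "down"], context_len=2):
    print(context, "->", target)
```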

Reinforcement learning

In reinforcement learning (RL), an agent interacts with an environment, takes actions, and receives rewards or penalties. There are no labels; the training signal comes from a reward function. The agent learns a policy that maximizes expected cumulative reward. RL trained AlphaGo to play Go and powers many robotic control systems. Modern reasoning models also use RL with verifiable rewards to learn step-by-step problem solving in math and code.

How are large language models trained?

Frontier large language models go through a pipeline of stages, each with its own data, loss, and goals.

Pretraining

Pretraining is the heavy lifting. The model is trained on trillions of tokens of internet text, code, and books using a self-supervised next-token prediction objective. This stage instills the broad world knowledge, grammar, and rough reasoning ability that the model will later refine. Pretraining a flagship model can cost tens of millions of dollars in compute and run for months on thousands of GPUs. The scale is enormous: training GPT-3, the 175-billion-parameter model OpenAI released in 2020, required roughly 3.14e23 floating-point operations, equivalent to about 3,640 petaFLOP/s-days of compute.^[20]
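The two compute figures for GPT-3 are the same quantity in different units, which a short conversion confirms. One petaFLOP/s-day is taken here to mean 10^15 operations per second sustained for 86,400 seconds.

```python
PETAFLOP_PER_SECOND = 1e15
SECONDS_PER_DAY = 86_400

def pfs_days_to_flops(pfs_days):
    """Convert petaFLOP/s-days into a total count of floating-point operations."""
    return pfs_days * PETAFLOP_PER_SECOND * SECONDS_PER_DAY

gpt3_flops = pfs_days_to_flops(3640)
print(f"{gpt3_flops:.3e}")   # about 3.14e23
```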

Supervised fine-tuning

Supervised fine-tuning (SFT) teaches the pretrained model how to behave. The training data consists of curated input-output pairs written or vetted by humans: instructions with high-quality responses, dialogues, chain-of-thought solutions to math problems, and so on. SFT shapes the model into a useful assistant that follows instructions.^[14] The optimization objective is still next-token prediction, but the data is small and carefully chosen.

Preference optimization and RLHF

After SFT, the model is aligned with human preferences. Classical RLHF (reinforcement learning from human feedback) was popularized for large language models by OpenAI's InstructGPT work in 2022.^[3] Annotators rank pairs of model outputs, a reward model is trained on these rankings, and the language model is then optimized to produce responses the reward model scores highly, using a policy gradient method such as PPO.^[3] KL divergence regularization keeps the policy from drifting too far from the SFT model.

Direct preference optimization (DPO), introduced in 2023, achieves similar results without an explicit reward model or RL loop. It treats the policy being trained as an implicit reward model, measured relative to the frozen SFT model, and optimizes a closed-form loss over preference pairs.^[14] DPO is generally simpler to implement and often more stable than PPO. Variants such as SimPO, KTO, and IPO offer further trade-offs.^[17]
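For a single preference pair, the DPO loss reduces to a few lines. The sketch below follows the standard published form of the objective and assumes the four inputs are summed log-probabilities of each full response under the trained policy and under a frozen reference model; the argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen margin - rejected margin))) for one pair."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen        # implicit reward, chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected  # implicit reward, rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable softplus(-logits), which equals -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference is expressed, loss = log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss falls.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```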

Throughout 2025, reinforcement learning with verifiable rewards became the preferred approach for reasoning tasks.^[17] Methods such as GRPO (used in DeepSeek-R1) and DAPO use automatically checkable signals, such as whether a math answer is correct or whether code passes unit tests, skipping the human preference step for those domains.^[17]
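An automatically checkable reward can be as simple as an exact-match test on the final answer. The sketch below pairs such a check with group-normalized advantages, which is the general idea behind GRPO as commonly described: several responses to the same prompt are scored, and each is judged relative to the group. The exact normalization used by any particular system is an assumption here, as are the function names.

```python
from statistics import mean, pstdev

def verifiable_reward(answer, expected):
    """Reward 1.0 if the final answer matches exactly (ignoring whitespace), else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to its group's mean and spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same math prompt whose correct answer is "42".
samples = ["42", " 42 ", "41", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Correct answers receive positive advantages and incorrect ones negative, with no human ranking involved.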

How much data and compute does training need? Scaling laws

The amount of data and compute a model needs is governed by empirical scaling laws. The most influential result is the Chinchilla study, published by DeepMind's Jordan Hoffmann and colleagues in 2022, which found that earlier large models had been badly undertrained on data. The paper's central conclusion is that "the model size and the number of training tokens should be scaled equally" for a fixed compute budget, and that "for every doubling of model size the number of training tokens should also be doubled."^[21] In practice this implies a compute-optimal ratio of roughly 20 training tokens per parameter.^[21]

To prove the point, DeepMind trained a 70-billion-parameter model called Chinchilla on 1.4 trillion tokens. Despite being four times smaller than its 280-billion-parameter predecessor Gopher and using the same compute budget, Chinchilla outperformed Gopher, GPT-3 (175B), and Megatron-Turing NLG (530B) across a wide range of benchmarks.^[21] The result reshaped how labs allocate budget between model size and dataset size, and "Chinchilla-optimal" became shorthand for training a model on enough data to justify its parameter count.
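The 20-tokens-per-parameter rule can be checked against Chinchilla's own numbers. The token rule comes from the text above; the compute estimate of 6 operations per parameter per token is a widely used approximation that is not stated in this article and should be read as an assumption.

```python
TOKENS_PER_PARAM = 20   # approximate compute-optimal ratio

def compute_optimal_tokens(n_params):
    """Training tokens suggested by the roughly 20-tokens-per-parameter rule."""
    return TOKENS_PER_PARAM * n_params

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token (assumption)."""
    return 6 * n_params * n_tokens

chinchilla_tokens = compute_optimal_tokens(70e9)
print(f"{chinchilla_tokens:.2e} tokens")                            # 1.40e+12
print(f"{approx_training_flops(70e9, chinchilla_tokens):.2e} FLOPs")
```

The first line reproduces the 1.4 trillion tokens on which the 70-billion-parameter Chinchilla was trained.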

Scaling laws also describe how loss falls predictably as a power law in model size, dataset size, and compute. This predictability is why labs run small pilot models to extrapolate the loss of a far larger run before committing to it, and why frontier training is planned around a target compute budget rather than a target accuracy.

How much does it cost to train a frontier model?

Training costs have risen sharply as models have scaled. Stanford's 2024 AI Index, produced with Epoch AI, estimated the compute cost of the final training run at roughly 78 million dollars for GPT-4 and roughly 191 million dollars for Google's Gemini Ultra; OpenAI CEO Sam Altman has separately said GPT-4 cost "more than 100 million dollars" to train, a broader figure that includes more than the final run.^[19] By comparison, the 2024 AI Index estimated that the original 2017 Transformer model cost only about 900 dollars to train.^[19] Meta's Llama 3.1 405B has been estimated at roughly 170 million dollars in amortized hardware and energy.^[22]

The trend is steep and well documented. In a 2024 analysis, Epoch AI's Ben Cottier and colleagues found that "the amortized cost to train the most compute-intensive models has grown precipitously at a rate of 2.4x per year since 2016," and projected that "if the trend of growing development costs continues, the largest training runs will cost more than a billion dollars by 2027."^[22] The same study found that the largest single line items are AI accelerator chips and R&D staff, with hardware at 47 to 67 percent of cost, staff at 29 to 49 percent, and energy at only 2 to 6 percent.^[22] The cost table below summarizes representative estimates.

Model	Year	Estimated training cost	Source
2017 Transformer	2017	~$900	Stanford AI Index 2024 / Epoch AI^[19]
GPT-4	2023	~$78M (compute); Altman: ">$100M"	Stanford AI Index 2024 / Epoch AI^[19]
Llama 3.1 405B	2024	~$170M	Cottier et al. (Epoch AI)^[22]
Gemini Ultra	2024	~$191M	Stanford AI Index 2024 / Epoch AI^[19]
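The 2.4x annual growth rate can be turned into a rough projection. The sketch below is illustrative arithmetic only: it applies the growth rate to the roughly 170 million dollar Llama 3.1 405B estimate as a starting point, which is a choice made for this example and not a forecast from the cited study.

```python
def projected_cost(base_cost, base_year, year, growth_per_year=2.4):
    """Extrapolate a training cost forward at a constant multiplicative growth rate."""
    return base_cost * growth_per_year ** (year - base_year)

for year in (2024, 2025, 2026, 2027):
    cost = projected_cost(170e6, 2024, year)
    print(year, f"${cost / 1e6:,.0f}M")
```

Under these assumptions the extrapolated figure passes one billion dollars by 2027, in line with the projection quoted above.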

How is training split across many GPUs? Distributed training

A single accelerator is not enough for any modern frontier model. Training is parallelized across many GPUs or TPUs using several strategies that are often combined.

Data parallelism is the simplest. Each device holds a full copy of the model and processes a different shard of the batch.^[9] After each backward pass, gradients are averaged across devices using an all-reduce operation, so every replica stays in sync. PyTorch DDP and DeepSpeed ZeRO implement this, with ZeRO further sharding optimizer state, gradients, and parameters across replicas to fit larger models.
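The gradient averaging step can be simulated on one machine. In this sketch each "device" is just a list entry, and `all_reduce_mean` is a stand-in written for the example and not a real communication library call. With equal-sized shards, the averaged gradient matches the gradient computed over the whole batch.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(per_device_grads):
    """Average gradients element-wise and hand every device the same result."""
    n = len(per_device_grads)
    averaged = [sum(values) / n for values in zip(*per_device_grads)]
    return [list(averaged) for _ in per_device_grads]

w = 0.5
batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
shards = [batch[:2], batch[2:]]                         # one shard per device
local_grads = [[grad_mse(w, shard)] for shard in shards]
synced = all_reduce_mean(local_grads)

print(local_grads)          # each device sees a different local gradient
print(synced)               # after all-reduce, every replica holds the same value
print(grad_mse(w, batch))   # equals the full-batch gradient
```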

Tensor (model) parallelism splits individual layers across devices. Each GPU computes a slice of a matrix multiplication, and partial results are reduced across the group.^[9] This is essential when a single layer's weights exceed one device's memory.

Pipeline parallelism splits the model layer-wise across devices. Activations flow forward stage by stage and gradients flow backward.^[10] Schemes such as GPipe and PipeDream keep all stages busy by processing multiple micro-batches concurrently.^[10] Sequence and expert parallelism extend these ideas to long contexts and mixture-of-experts models. Frameworks such as Megatron-LM, DeepSpeed, FSDP, and Colossal-AI compose multiple parallelism axes into a single training job.
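The benefit of micro-batches shows up in a simple timeline. The sketch below models only the forward direction of an idealized pipeline in which every stage takes one time slot per micro-batch; the scheduling rule and function names are simplifications for illustration and do not reproduce the actual GPipe or PipeDream schedules.

```python
def forward_schedule(n_stages, n_microbatches):
    """schedule[t][s] is the micro-batch stage s handles at time slot t, or None if idle."""
    n_slots = n_stages + n_microbatches - 1
    return [[t - s if 0 <= t - s < n_microbatches else None
             for s in range(n_stages)]
            for t in range(n_slots)]

def idle_fraction(schedule):
    """Share of (time slot, stage) cells in which a stage has nothing to do."""
    cells = [cell for row in schedule for cell in row]
    return sum(cell is None for cell in cells) / len(cells)

print(idle_fraction(forward_schedule(4, 1)))   # 0.75: one batch leaves stages mostly idle
print(idle_fraction(forward_schedule(4, 8)))   # about 0.27 with eight micro-batches
```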

What hardware is used to train AI models?

Matrix multiplications dominate training compute, so accelerators with strong tensor performance are the workhorses. NVIDIA GPUs have led the market for nearly a decade. The A100 (Ampere, 2020) was the most-used training chip from 2020 to 2022; the H100 (Hopper, 2022) introduced FP8 tensor cores and the Transformer Engine, which NVIDIA says deliver up to roughly four times the training throughput of the A100 on large language models.^[11] The B200 (Blackwell, 2024) extended this further. According to Epoch AI, the A100 has been the single most popular chip for training notable ML models.^[16]

Google's Tensor Processing Units (TPUs) are an alternative used heavily inside Google and via Google Cloud. TPU v4, v5e, and v5p pods scale to thousands of chips connected by a custom interconnect, and they trained Google's Gemini models. AWS Trainium and various startup accelerators round out the landscape, though NVIDIA still trains the majority of frontier systems.

Generalization and regularization

The goal of training is to generalize to new data, not just fit the training data. A model that memorizes the training set but fails on held-out examples has overfit. The opposite failure, underfitting, happens when the model is too simple or undertrained to capture the underlying pattern.^[7]

Practitioners detect overfitting by holding out a validation set and tracking validation loss alongside training loss. When validation loss stops improving while training loss keeps falling, the model is overfitting.^[8] Tools such as early stopping, dropout, weight decay (equivalent to L2 regularization under plain SGD), data augmentation, and ensemble methods help close the gap.^[8] K-fold cross-validation rotates the held-out set across the data for a more reliable estimate.
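Two of these tools are easy to sketch without any framework. The early-stopping rule below uses a "patience" counter, a common convention that is assumed here, and the fold assignment in `k_fold_indices` is one simple choice among many; the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once it stalls."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for `patience` epochs
    return best_epoch

def k_fold_indices(n_examples, k):
    """Yield (train_indices, val_indices) so each example is held out exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        yield train, folds[i]

# Validation loss bottoms out at epoch 2, then rises as the model overfits.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 0.6]))   # 2
for train, val in k_fold_indices(6, 3):
    print(train, val)
```

Note that the stopping rule never sees the late improvement at the final epoch, which is the usual trade-off of a small patience value.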

Fine-tuning and parameter-efficient training

Full retraining of a pretrained model is expensive and often unnecessary. Fine-tuning reuses pretrained weights and continues training on a smaller dataset for a specific task.

Parameter-efficient fine-tuning (PEFT) methods update only a small fraction of the model's parameters.^[12] The most popular technique is LoRA (Low-Rank Adaptation), introduced by Microsoft researchers in 2021. LoRA freezes the original weights and learns a pair of small low-rank matrices whose product is added to the frozen weights; after training, the update can be merged into the weights so inference incurs no extra cost.^[13] Trainable parameter counts drop by orders of magnitude (often under 1% of the full model) with little loss in task performance.^[13] QLoRA combines LoRA with 4-bit quantization to fine-tune very large models on a single consumer GPU.
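The parameter savings and the low-rank update itself can be sketched with plain lists. The layer shape, rank, and scaling factor are illustrative assumptions, and the helper names are not taken from any library; `A` has shape rank by input size and `B` has shape output size by rank.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters in a full weight matrix versus its two low-rank factors."""
    return d_out * d_in, rank * (d_out + d_in)

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W @ x plus the trainable low-rank path scale * B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge(W, A, B, scale=1.0):
    """Fold the low-rank update into the weights: W + scale * (B @ A)."""
    rank = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))]
            for i in range(len(W))]

full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}%")   # 16777216 65536 0.39%

W = [[1, 2, 0], [0, 1, 3]]      # frozen 2 x 3 weights
A = [[1, 0, 2]]                 # rank-1 factor, 1 x 3
B = [[2], [-1]]                 # rank-1 factor, 2 x 1
x = [1, 1, 1]
print(lora_forward(W, A, B, x), matvec(merge(W, A, B), x))
```

Both forward paths give the same output, which is why a trained update can be folded into the base weights.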

Explain like I'm 5 (ELI5)

Imagine teaching a robot to tell cats apart from dogs. You show the robot a picture and ask, "Cat or dog?" The robot guesses. If it gets it wrong, you say, "Nope," and it tweaks the little dials in its brain. Show it enough pictures, and the dials end up in just the right positions so the robot guesses correctly almost every time, even on cats and dogs it has never seen. The pictures and answers are the training data, the dials are the parameters, the "nope" signal is the loss function, and the rule for how much to turn the dials is the optimizer.

References

1. "Backpropagation." Wikipedia. https://en.wikipedia.org/wiki/Backpropagation
2. "Stochastic gradient descent." Wikipedia. https://en.wikipedia.org/wiki/Stochastic_gradient_descent
3. "Reinforcement learning from human feedback." Wikipedia. https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
4. Kingma, D. and Ba, J. "Adam: A Method for Stochastic Optimization." 2014. https://arxiv.org/abs/1412.6980
5. Loshchilov, I. and Hutter, F. "Decoupled Weight Decay Regularization." 2017. https://arxiv.org/abs/1711.05101
6. IBM. "What is Learning Rate in Machine Learning?" https://www.ibm.com/think/topics/learning-rate
7. IBM. "What is Overfitting vs. Underfitting?" https://www.ibm.com/think/topics/overfitting-vs-underfitting
8. AWS. "What is Overfitting?" https://aws.amazon.com/what-is/overfitting/
9. Colossal-AI documentation. "Paradigms of Parallelism." https://colossalai.org/docs/concepts/paradigms_of_parallelism/
10. DeepSpeed. "Pipeline Parallelism." https://www.deepspeed.ai/tutorials/pipeline/
11. NVIDIA. "H100 Tensor Core GPU." https://www.nvidia.com/en-us/data-center/h100/
12. Hugging Face. "PEFT." https://github.com/huggingface/peft
13. Hugging Face. "Parameter-Efficient Fine-Tuning using PEFT." https://huggingface.co/blog/peft
14. PyTorch. "A Primer on LLM Post-Training." https://pytorch.org/blog/a-primer-on-llm-post-training/
15. Kempner Institute, Harvard. "Anything but SGD: Evaluating Optimizers for LLM Training." https://kempnerinstitute.harvard.edu/research/deeper-learning/anything-but-sgd-evaluating-optimizers-for-llm-training/
16. Epoch AI. "The NVIDIA A100 has been the most popular hardware for training notable machine learning models." https://epoch.ai/data-insights/models-by-hardware
17. Red Hat Developer. "Post-training methods for language models." https://developers.redhat.com/articles/2025/11/04/post-training-methods-language-models
18. "Self-supervised learning." Wikipedia. https://en.wikipedia.org/wiki/Self-supervised_learning
19. Stanford HAI. "The 2024 AI Index Report" (training cost estimates produced with Epoch AI); reported in Fortune, "Google's Gemini Ultra AI model may have cost $191 million." https://hai.stanford.edu/ai-index/2024-ai-index-report and https://fortune.com/2024/04/18/google-gemini-cost-191-million-to-train-stanford-university-report-estimates/
20. Brown, T. et al. "Language Models are Few-Shot Learners" (GPT-3). 2020. https://arxiv.org/abs/2005.14165
21. Hoffmann, J. et al. "Training Compute-Optimal Large Language Models" (Chinchilla). DeepMind, 2022. https://arxiv.org/abs/2203.15556
22. Cottier, B., Rahman, R. et al. "The rising costs of training frontier AI models." Epoch AI, 2024. https://arxiv.org/abs/2405.21015
