Training in machine learning is the process of fitting a model's parameters to data so that the model can make accurate predictions or generate useful outputs. During training, an algorithm iteratively adjusts the model's internal weights and biases by exposing it to examples, computing how far its predictions are from the correct answers, and updating the parameters to reduce that error. Training is the most computationally intensive phase of the machine learning pipeline, and its effectiveness determines how well a model will perform during inference.
At the core of training neural networks is the training loop, a cycle that repeats thousands or millions of times until the model reaches acceptable performance. Each iteration of the loop consists of four steps.
In the forward pass, input data flows through the network layer by layer. At each layer, the inputs are multiplied by the layer's weights, a bias term is added, and the result is passed through an activation function (such as ReLU or sigmoid). The output of one layer becomes the input to the next. The final layer produces the model's prediction for the given input.
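In symbols, this single-layer computation is commonly written as follows, where \(a^{(l-1)}\) is the input arriving from the previous layer, \(W^{(l)}\) and \(b^{(l)}\) are the layer's weights and bias, and \(f\) is the activation function:

```latex
a^{(l)} = f\left( W^{(l)} a^{(l-1)} + b^{(l)} \right)
```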
Once the forward pass produces a prediction, a loss function (also called a cost function or objective function) quantifies how far that prediction is from the true target. Common loss functions include mean squared error for regression tasks and cross-entropy loss for classification tasks. The loss is a single scalar value that the training process aims to minimize.
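For concreteness, the two losses mentioned above take these standard forms for \(n\) examples with true targets \(y_i\) and predictions \(\hat{y}_i\) (the cross-entropy shown is the binary case):

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```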
Backpropagation computes the gradient of the loss function with respect to every parameter in the network. Using the chain rule from calculus, it works backward from the output layer to the input layer, calculating how much each weight contributed to the error. These gradients indicate the direction and magnitude of change needed to reduce the loss.
With the gradients computed, an optimizer applies gradient descent (or a variant like Adam, SGD with momentum, or AdaGrad) to update each parameter. The optimizer subtracts a fraction of the gradient from each weight, with that fraction determined by the learning rate. This step moves the model's parameters closer to a configuration that minimizes the loss.
This four-step cycle repeats for every batch of training data. Over many iterations, the model's predictions improve as the loss decreases.
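A minimal PyTorch sketch of the four-step loop, using synthetic data in place of a real dataset (all names here are illustrative, not drawn from any particular codebase):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# A toy regression model and random stand-in data.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32)

for epoch in range(10):
    for x, y in loader:
        prediction = model(x)                         # 1. forward pass
        loss = nn.functional.mse_loss(prediction, y)  # 2. loss computation
        optimizer.zero_grad()
        loss.backward()                               # 3. backpropagation
        optimizer.step()                              # 4. parameter update
```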
Training data is typically processed in structured units.
| Term | Definition |
|---|---|
| Epoch | One complete pass through the entire training dataset. Models are usually trained for multiple epochs. |
| Batch | A subset of the training dataset processed together in one forward and backward pass. Batch sizes commonly range from 16 to several thousand. |
| Iteration | One forward pass, loss computation, backward pass, and parameter update on a single batch. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up when the final batch is partial. |
| Mini-batch | A term used interchangeably with batch in most modern contexts, distinguishing it from full-batch gradient descent (which uses the entire dataset at once). |
For example, a dataset with 50,000 samples and a batch size of 256 requires 196 iterations to complete one epoch: 195 full batches of 256 samples plus one final batch of 80 (or 195 iterations if the partial batch is dropped).
The quantity and quality of training data directly affect model performance. Insufficient data leads to overfitting, where the model memorizes the training examples rather than learning generalizable patterns. Too much noise or mislabeled data degrades performance regardless of data volume.
The scaling laws discovered by researchers at OpenAI and DeepMind have formalized the relationship between data, model size, and compute. The Chinchilla scaling laws (Hoffmann et al., 2022) found that for compute-optimal training, model size and training tokens should be scaled roughly equally: approximately 20 tokens per parameter. A 70-billion-parameter model should therefore be trained on about 1.4 trillion tokens. However, more recent practice has shifted toward training on far more data than the Chinchilla optimum suggests. Meta's Llama 3 70B was trained on roughly 200 tokens per parameter (about 15 trillion tokens), roughly 10 times the Chinchilla-optimal ratio. This trend is driven by economics: models trained on more data per parameter can achieve comparable performance with fewer parameters, making them cheaper to serve at inference time.
Training and inference are the two primary phases of a model's lifecycle. They differ significantly in computational profile and cost structure.
| Aspect | Training | Inference |
|---|---|---|
| Purpose | Learn model parameters from data | Apply the trained model to new inputs |
| Compute | Massive (thousands of GPUs for weeks) | Modest per request (single GPU or CPU) |
| Frequency | Periodic (once or a few times) | Continuous (every time the model is used) |
| Bottleneck | Compute-bound (FLOPs) | Often latency-bound or memory-bound |
| Cost profile | High one-time cost | Low per-query cost, but can exceed training cost at scale |
Because inference runs continuously and serves potentially millions of users, its cumulative operational cost often exceeds the one-time training cost. This explains why reducing inference costs has become a priority, even if it requires additional training compute.
Training compute is measured in FLOPs (floating-point operations) and is one of the most important factors determining model quality. The compute required for frontier models has grown exponentially.
| Model | Year | Estimated Training FLOPs | Estimated Cost |
|---|---|---|---|
| GPT-3 | 2020 | 3.1 x 10^23 | ~$4.6M |
| GPT-4 | 2023 | 2.1 x 10^25 | ~$78-100M+ |
| Gemini Ultra | 2023 | 5.0 x 10^25 | ~$191M |
| Llama 3.1 405B | 2024 | 3.8 x 10^25 | ~$25M |
| DeepSeek V3 | 2024 | ~5 x 10^24 | ~$5.6M |
| Claude 3 Opus | 2024 | 1.6 x 10^25 | Tens of millions |
As of 2025, over 30 publicly announced models from 12 different developers have exceeded the 10^25 FLOP training threshold. Training costs for frontier models have been growing at roughly 2.5 times per year, though the cost per unit of compute continues to drop due to hardware and algorithmic improvements.
GPT-4, with an estimated 1.76 trillion parameters, required approximately 25,000 NVIDIA A100 GPUs running for about 90 days. By contrast, DeepSeek V3 achieved frontier-level performance for approximately $5.6 million through optimized algorithms and strategic compute allocation, demonstrating that algorithmic efficiency can dramatically reduce costs.
Machine learning employs several distinct training paradigms, each suited to different kinds of data and tasks.
Supervised learning is the most common training paradigm. The model receives labeled examples (input-output pairs) and learns to map inputs to their corresponding outputs. The loss function measures the discrepancy between the model's predictions and the true labels. Common applications include image classification, spam detection, and medical diagnosis.
Unsupervised learning trains on data without labels. The model identifies patterns, structures, or groupings within the data on its own. Common techniques include clustering, dimensionality reduction, and density estimation.
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data. The labeled examples guide the learning process while the unlabeled data provides additional structural information, making this approach practical when labeling is expensive.
Self-supervised learning generates its own supervisory signal from the data itself, without requiring human-provided labels. The model is trained on pretext tasks such as predicting masked tokens (as in BERT), predicting the next token (as in GPT), or predicting future frames in video. This paradigm powers the pre-training phase of most modern large language models.
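A sketch of the next-token objective in PyTorch; `model` here is a hypothetical network that maps token ids to one logit per vocabulary entry, and the tensor shapes in the comments are the only assumptions made:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Self-supervised next-token prediction loss.

    tokens: integer tensor of shape (batch, seq_len); the labels are the
    input sequence itself, shifted by one position.
    """
    inputs = tokens[:, :-1]   # model sees every token except the last
    targets = tokens[:, 1:]   # ...and must predict each position's successor
    logits = model(inputs)    # expected shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```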
Reinforcement learning trains an agent to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and adjusts its policy to maximize cumulative reward over time.
Training modern large language models (LLMs) involves a multi-stage pipeline that goes beyond a single training phase.
Pre-training is the most computationally expensive stage. The model learns language patterns, factual knowledge, and reasoning abilities from vast corpora of text (often trillions of tokens drawn from web crawls, books, and code repositories). The training objective is typically next-token prediction: given a sequence of tokens, predict what comes next. Pre-training establishes the model's general capabilities and accounts for the vast majority of total training compute.
After pre-training, the model undergoes supervised fine-tuning (SFT) on curated datasets of high-quality input-output pairs. These datasets contain examples of desired behavior: following instructions, answering questions accurately, writing code, and refusing harmful requests. SFT is computationally cheap relative to pre-training (often 100 times less expensive or more) but has a large impact on the model's usefulness and safety.
Reinforcement learning from human feedback (RLHF) further aligns the model with human preferences. The process involves three sub-steps:

1. Collecting preference data: human annotators rank multiple model responses to the same prompt from best to worst.
2. Training a reward model: a separate model learns from those rankings to score any response the way a human rater would.
3. Optimizing the policy: the language model is updated with a reinforcement learning algorithm (commonly PPO) to maximize the reward model's score, typically with a penalty that keeps its behavior close to the SFT model's.
RLHF helps the model produce responses that are more helpful, harmless, and honest, addressing aspects of quality that are difficult to capture in supervised training data alone.
Training deep neural networks, especially at scale, is prone to several instability issues.
Loss spikes are sudden, sharp increases in the training loss that can occur during otherwise stable training. They are often caused by exploding gradients, bad data batches, or numerical overflow. In large-scale LLM training runs costing millions of dollars, loss spikes can waste significant compute if not handled properly. Common mitigations include gradient clipping, learning rate warmup, and skipping corrupted data batches.
The vanishing gradient problem occurs when gradients become extremely small during backpropagation through many layers, effectively preventing the earlier layers from learning. The exploding gradient problem is the opposite: gradients grow exponentially, causing unstable parameter updates and potentially NaN (not a number) values. Solutions include careful weight initialization (such as Xavier or He initialization), batch normalization, residual connections, and gradient clipping.
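A self-contained PyTorch sketch of gradient clipping; the threshold of 1.0 is a common default rather than a universal recommendation:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0,
# preventing a single bad batch from producing a huge update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```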
A NaN loss indicates a catastrophic numerical failure, usually caused by division by zero, taking the logarithm of zero or negative numbers, or unchecked gradient explosions. NaN loss typically requires restarting training from a checkpoint. Using BF16 mixed precision instead of FP16 reduces the risk because BF16's wider dynamic range prevents many overflow and underflow scenarios.
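One simple defensive pattern is to test the loss before backpropagating, so a non-finite value never reaches the weights; this is a minimal sketch with placeholder model and data:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = nn.functional.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1))

# Back-propagate only if the loss is a finite number; a NaN/Inf loss
# would otherwise poison every parameter in a single step.
if torch.isfinite(loss):
    loss.backward()
    optimizer.step()
else:
    # Skip this batch's update entirely; if non-finite losses recur,
    # restart from the most recent checkpoint.
    optimizer.zero_grad()
```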
Modern large-scale models cannot be trained on a single GPU. Distributed training spreads the workload across many devices, sometimes thousands.
In data parallelism, each GPU holds a complete copy of the model but processes a different subset of the training data. After each forward and backward pass, gradients are synchronized across all GPUs (typically via an all-reduce operation) so that every copy stays in sync. This is the simplest form of distributed training and works well when the model fits in a single GPU's memory.
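A minimal data-parallel sketch using PyTorch's DistributedDataParallel; it assumes a multi-GPU machine and launch via `torchrun`, and uses random tensors where a real job would use a sharded dataset:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(10, 1).to(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# In a real job a DistributedSampler would give each rank its own shard;
# random tensors stand in for that here.
x = torch.randn(32, 10, device=local_rank)
y = torch.randn(32, 1, device=local_rank)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()    # DDP all-reduces (averages) gradients across ranks here
optimizer.step()   # every replica applies the same synchronized update

dist.destroy_process_group()
```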
When a model is too large to fit on a single GPU, model parallelism splits the model across multiple devices. Tensor parallelism divides individual layers (such as large matrix multiplications) across GPUs, while pipeline parallelism assigns different layers to different GPUs, processing micro-batches in a pipelined fashion to reduce idle time.
Frontier models typically combine all three strategies. For example, Meta's Llama 3.1 405B was trained using tensor parallelism of 8, pipeline parallelism of 16, and data parallelism ranging from 8 to 128, distributed across 16,384 NVIDIA H100 GPUs. Frameworks like DeepSpeed, Megatron-LM, and PyTorch FSDP (Fully Sharded Data Parallel) provide implementations of these strategies.
Practitioners use a variety of techniques to improve training efficiency, stability, and final model quality.
Mixed-precision training uses lower-precision numerical formats (FP16 or BF16) for most computations while keeping a master copy of the weights in FP32. This cuts memory usage roughly in half and speeds up training by 30 to 50 percent on GPUs with tensor cores, with minimal impact on model quality. BF16 (Brain Float 16) is generally preferred over FP16 for training because it has the same dynamic range as FP32, eliminating the need for loss scaling in most cases.
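A minimal sketch of BF16 mixed precision with PyTorch autocast; it assumes a CUDA GPU with BF16 support, and the model and data are placeholders:

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()   # master weights stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda")

# Run the forward pass in bfloat16; because BF16 shares FP32's dynamic
# range, no loss scaler is needed (FP16 would require GradScaler).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
optimizer.step()
```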
Starting training with a high learning rate can cause instability because the randomly initialized parameters produce large, noisy gradients. Learning rate warmup begins with a very small learning rate and linearly increases it over a set number of steps (typically a few hundred to a few thousand). After warmup, a schedule such as cosine annealing gradually reduces the learning rate over the remainder of training. This warmup-then-decay pattern has become standard practice for training transformers.
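One way to implement this pattern is a `LambdaLR` schedule that scales the base learning rate by a warmup-then-cosine factor; the step counts below are illustrative:

```python
import math
import torch

warmup_steps, total_steps = 1_000, 100_000  # illustrative values

def lr_lambda(step):
    # Linear warmup from 0 up to the base learning rate...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ...then cosine decay from the base learning rate toward zero.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step().
```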
Gradient accumulation allows training with effectively larger batch sizes when GPU memory is limited. Instead of updating the model after every mini-batch, gradients are accumulated over several mini-batches, and a single parameter update is applied once the desired effective batch size has been reached. With the per-batch loss scaled by the number of accumulation steps, this closely matches training with the larger batch on more powerful hardware, as sketched below.
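A self-contained sketch with a mini-batch size of 8 and 4 accumulation steps, giving an effective batch size of 32 (all sizes illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
                    batch_size=8)

accum_steps = 4  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    # Scale each mini-batch loss so the accumulated gradient matches
    # the average over the full effective batch.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                  # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # one update per effective batch
        optimizer.zero_grad()
```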
During the forward pass, intermediate activations are stored in memory for use during backpropagation. Gradient checkpointing saves only a subset of these activations and recomputes the rest as needed during the backward pass. This trades roughly 20 percent additional compute time for a large reduction in memory usage, making it possible to train larger models or use larger batch sizes on the same hardware.
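PyTorch exposes this through `torch.utils.checkpoint`; the sketch below checkpoints a toy stack of layers in two segments (layer sizes are illustrative):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose intermediate activations would normally all be stored.
layers = [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
model = nn.Sequential(*layers)
x = torch.randn(16, 512, requires_grad=True)

# Store activations only at the 2 segment boundaries; everything in
# between is recomputed during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, segments=2, input=x, use_reentrant=False)
out.sum().backward()
```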
Curriculum learning arranges training data so that the model encounters easier examples first and progressively harder examples later, mimicking how humans learn. This structured ordering can improve convergence speed, produce better local minima, and improve generalization, all without additional computational cost. The main challenges are defining a difficulty metric for the data and designing the schedule for increasing difficulty.
Tracking metrics during training is essential for diagnosing problems and ensuring the model is learning effectively. Two platforms dominate this space.
| Platform | Type | Key Features |
|---|---|---|
| TensorBoard | Open-source (Google) | Real-time visualization of loss, accuracy, and learning rate curves; histograms of weight and gradient distributions; model graph visualization; embedding projections. Integrates with both TensorFlow and PyTorch. |
| Weights & Biases (W&B) | Cloud-based (free tier available) | Experiment tracking with automatic logging of hyperparameters, git state, and hardware metrics; rich dashboards for comparing runs; gradient and parameter distribution tracking; team collaboration features; integration with most major frameworks. |
Key metrics to monitor during training include:

- Training loss and validation loss, to confirm the model is learning and to detect overfitting when the two curves diverge.
- Learning rate, to verify that warmup and decay schedules are behaving as configured.
- Gradient norm, an early indicator of exploding gradients and impending loss spikes.
- Throughput (samples or tokens per second) and hardware utilization, to catch data-loading or communication bottlenecks.
- Task-specific evaluation metrics, such as accuracy or perplexity on a held-out set.
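As a concrete example, logging scalar metrics to TensorBoard takes only a few lines (the metric values below are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # illustrative directory

for step in range(100):
    loss, lr, grad_norm = 1.0 / (step + 1), 3e-4, 0.5  # placeholder values
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/learning_rate", lr, step)
    writer.add_scalar("train/grad_norm", grad_norm, step)

writer.close()  # flush events; view with `tensorboard --logdir runs`
```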
Reproducing a training run exactly is one of the most persistent challenges in machine learning. Sources of non-determinism include:

- Random seeds governing weight initialization, data shuffling, and dropout.
- Non-deterministic GPU kernels, where floating-point additions are reordered across threads and addition is not associative.
- Differences in library, driver, and hardware versions between runs.
- Asynchronous data loading and other scheduling variability in parallel code.
Best practices for improving reproducibility include pinning all software versions, recording hyperparameters and random seeds, using deterministic algorithms where available, and leveraging experiment tracking tools to log every detail of each run.
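A common seeding routine in PyTorch looks like the following sketch; note that full bit-for-bit reproducibility additionally depends on hardware and library versions staying fixed:

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                        # Python's built-in RNG
    np.random.seed(seed)                     # NumPy's RNG
    torch.manual_seed(seed)                  # PyTorch CPU and GPU RNGs
    # Prefer deterministic kernels; ops without one will raise an error.
    torch.use_deterministic_algorithms(True)
    # Required by some CUDA matmul kernels in deterministic mode; must be
    # set before the first CUDA operation runs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

set_seed(42)
```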
The economics of training have shifted rapidly. Frontier model training costs have grown at roughly 2.5 times per year since 2018, with the largest runs in 2025 estimated at $200 million to $400 million. At the same time, the cost per unit of compute drops roughly 10 times per decade due to improvements in hardware (such as NVIDIA's H100 and B200 GPUs), software frameworks, and algorithmic efficiency.
| Year | Representative Model | Approximate Training Cost |
|---|---|---|
| 2018 | BERT-Large | ~$7,000 |
| 2020 | GPT-3 175B | ~$4.6M |
| 2023 | GPT-4 | ~$78-100M+ |
| 2024 | Llama 3.1 405B | ~$25M |
| 2024 | DeepSeek V3 | ~$5.6M |
| 2025 | Frontier 405B+ | $80M-$400M |
Notably, DeepSeek V3 demonstrated that algorithmic innovations (such as a Mixture-of-Experts architecture and optimized training procedures) can achieve frontier-level performance at a fraction of the cost of brute-force scaling. This suggests that raw compute spending is not the only path to better models.
GPU cloud pricing has also evolved. In mid-2025, AWS reduced H100 instance pricing by approximately 44 percent, bringing on-demand costs to roughly $3.90 per GPU-hour, with reserved pricing falling below $2.00 per GPU-hour.
Imagine you are learning to throw a basketball into a hoop. Every time you throw the ball, you watch where it lands. If it went too far to the left, you adjust and throw a little more to the right next time. If it fell short, you throw harder. Each throw is like one training step: you try something, see how wrong you were, and make a small correction.
Training a computer works the same way. You show it thousands of examples (like photos of cats and dogs) and each time it guesses wrong, it adjusts its internal settings a tiny bit. After seeing enough examples and making enough adjustments, it gets really good at telling cats from dogs, even when it sees a photo it has never seen before.