# Training

> Source: https://aiwiki.ai/wiki/training
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Training** in [machine learning](/wiki/machine_learning) is the process of fitting a [model's](/wiki/model) parameters to data so that the model can make accurate predictions or generate useful outputs. During training, an algorithm iteratively adjusts the model's internal [weights](/wiki/weight) and [biases](/wiki/bias) by exposing it to examples, computing how far its predictions are from the correct answers, and updating the parameters to reduce that error.[1] Training is the most computationally intensive phase of the machine learning pipeline, and its effectiveness determines how well a model will perform during [inference](/wiki/inference). Frontier model training has become an industrial undertaking: the compute used to train notable AI models has grown roughly 4 to 5 times per year since 2010, and the largest 2025 runs are estimated to cost between $100 million and $400 million.[14][7]

The mechanism behind modern training dates to a 1986 paper by David Rumelhart, [Geoffrey Hinton](/wiki/geoffrey_hinton), and Ronald Williams, which popularized [backpropagation](/wiki/backpropagation) as a way to train multi-layer networks. The authors described it as a procedure that "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector," causing hidden units to "come to represent important features of the task domain."[1]

## The Training Loop

At the core of training [neural networks](/wiki/neural_network) is the **training loop**, a cycle that repeats thousands or millions of times until the model reaches acceptable performance. Each iteration of the loop consists of four steps.

### Forward Pass

In the forward pass, input data flows through the network layer by layer. At each layer, the inputs are multiplied by the layer's weights, a bias term is added, and the result is passed through an [activation function](/wiki/activation_function) (such as [ReLU](/wiki/rectified_linear_unit_relu) or sigmoid). The output of one layer becomes the input to the next. The final layer produces the model's prediction for the given input.

### Loss Computation

Once the forward pass produces a prediction, a [loss function](/wiki/loss_function) (also called a cost function or objective function) quantifies how far that prediction is from the true target. Common loss functions include mean squared error for regression tasks and [cross-entropy](/wiki/cross-entropy) loss for classification tasks. The loss is a single scalar value that the training process aims to minimize.
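As a rough illustration of the two loss functions named above, the following sketch computes mean squared error and cross-entropy by hand. The function names and sample values are hypothetical and chosen only for this example.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared difference, the usual regression loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Negative log-probability the model assigned to the correct class."""
    eps = 1e-12  # guard against log(0)
    return -math.log(max(probabilities[true_class], eps))

# Regression: two predictions against two targets -> (0.25 + 1.0) / 2
mse = mean_squared_error([2.5, 0.0], [3.0, -1.0])

# Classification: the model puts 70% probability on the correct class -> -ln(0.7)
ce = cross_entropy([0.7, 0.2, 0.1], true_class=0)
```

Both functions return a single scalar, which is what the rest of the training loop minimizes.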

### Backward Pass (Backpropagation)

[Backpropagation](/wiki/backpropagation) computes the gradient of the loss function with respect to every parameter in the network.[1] Using the chain rule from calculus, it works backward from the output layer to the input layer, calculating how much each weight contributed to the error. These gradients indicate the direction and magnitude of change needed to reduce the loss.

### Parameter Update

With the gradients computed, an optimizer applies [gradient descent](/wiki/gradient_descent) (or a variant like Adam, SGD with momentum, or AdaGrad) to update each parameter.[1] In plain gradient descent, the optimizer subtracts the gradient scaled by the [learning rate](/wiki/learning_rate) from each weight; variants such as Adam additionally rescale the step using running statistics of past gradients. This step moves the model's parameters closer to a configuration that minimizes the loss.

This four-step cycle repeats for every [batch](/wiki/batch) of training data. Over many iterations, the model's predictions improve as the loss decreases.
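The four steps can be sketched for the smallest possible model, a single weight and bias fitted with plain gradient descent. This is a minimal illustration, not how production frameworks implement training: the toy data (generated from y = 2x + 1), the learning rate, and the step count are all assumptions made for the example, and the gradients are written out by hand instead of being derived by automatic differentiation.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # parameters: one weight, one bias
learning_rate = 0.05   # hypothetical value
losses = []

for step in range(2000):
    # 1. Forward pass: compute predictions from the current parameters
    preds = [w * x + b for x in xs]

    # 2. Loss computation: mean squared error over the full (tiny) batch
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)
    losses.append(loss)

    # 3. Backward pass: gradients of the loss with respect to w and b
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)

    # 4. Parameter update: step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After the loop, `w` and `b` should sit very close to 2 and 1, and `losses` records the decreasing loss curve.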

## Key Training Concepts

### Epochs, Batches, and Iterations

Training data is typically processed in structured units.

| Term | Definition |
|---|---|
| [Epoch](/wiki/epoch) | One complete pass through the entire training dataset. Models are usually trained for multiple epochs. |
| [Batch](/wiki/batch) | A subset of the training dataset processed together in one forward and backward pass. Batch sizes commonly range from 16 to several thousand. |
| Iteration | One forward pass, loss computation, backward pass, and parameter update on a single batch. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up if the final partial batch is kept. |
| Mini-batch | A term used interchangeably with batch in most modern contexts, distinguishing it from full-batch gradient descent (which uses the entire dataset at once). |

For example, a dataset with 50,000 samples and a batch size of 256 yields 195 full batches plus a final partial batch of 80 samples, so one epoch takes 195 or 196 iterations depending on whether the partial batch is dropped.
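The arithmetic for this example can be checked in a few lines. The variable names are illustrative only.

```python
import math

dataset_size = 50_000
batch_size = 256

full_batches = dataset_size // batch_size                    # 195
iterations = math.ceil(dataset_size / batch_size)            # 196, keeping the partial batch
last_batch_size = dataset_size - full_batches * batch_size   # 80 leftover samples
```

Whether the leftover samples form a final smaller batch or are dropped is a data-loader setting (PyTorch's `DataLoader`, for instance, exposes it as `drop_last`).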

### How much training data does a model need?

The quantity and quality of training data directly affect model performance. Insufficient data leads to [overfitting](/wiki/overfitting), where the model memorizes the training examples rather than learning generalizable patterns. Too much noise or mislabeled data degrades performance regardless of data volume.

The [scaling laws](/wiki/scaling_laws) discovered by researchers at OpenAI and DeepMind have formalized the relationship between data, model size, and compute.[3] The Chinchilla scaling laws (Hoffmann et al., 2022) found that for compute-optimal training, model size and training tokens should be scaled roughly equally: approximately 20 tokens per parameter.[2] The paper concluded that "current large language models are significantly undertrained" and that "for every doubling of model size the number of training tokens should also be doubled."[2] A 70-billion-parameter model should therefore be trained on about 1.4 trillion tokens.[2] To demonstrate the point, the authors trained Chinchilla, a 70B-parameter model, on 4 times more data than the 280B Gopher using the same compute budget, and Chinchilla "uniformly and significantly" outperformed Gopher, GPT-3 (175B), and Megatron-Turing NLG (530B) across a wide range of evaluations.[2]

However, more recent practice has shifted toward training on far more data than the Chinchilla optimum suggests. Meta's Llama 3 70B was trained on about 15 trillion tokens, or roughly 200 tokens per parameter, which is about 10 times the Chinchilla-optimal ratio.[2] This trend is driven by economics: models trained on more data per parameter can achieve comparable performance with fewer parameters, making them cheaper to serve at inference time.[2]
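The ratios quoted in this section follow from simple division. The sketch below uses the approximate figures from the text (20 tokens per parameter, 70 billion parameters, 15 trillion tokens); the function name is hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate ratio quoted above

def compute_optimal_tokens(num_params):
    """Rough Chinchilla-style token budget for a given parameter count."""
    return CHINCHILLA_TOKENS_PER_PARAM * num_params

chinchilla_tokens = compute_optimal_tokens(70e9)    # 1.4e12, i.e. 1.4 trillion tokens

llama3_ratio = 15e12 / 70e9                         # about 214 tokens per parameter
overtraining_factor = llama3_ratio / CHINCHILLA_TOKENS_PER_PARAM   # about 10.7
```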

### How is training different from inference?

Training and inference are the two primary phases of a model's lifecycle. They differ significantly in computational profile and cost structure.

| Aspect | Training | [Inference](/wiki/inference) |
|---|---|---|
| Purpose | Learn model parameters from data | Apply the trained model to new inputs |
| Compute | Massive (thousands of GPUs for weeks) | Modest per request (single GPU or CPU) |
| Frequency | Periodic (once or a few times) | Continuous (every time the model is used) |
| Bottleneck | Compute-bound (FLOPs) | Often latency-bound or memory-bound |
| Cost profile | High one-time cost | Low per-query cost, but can exceed training cost at scale |

Because inference runs continuously and serves potentially millions of users, its cumulative operational cost often exceeds the one-time training cost. This explains why reducing inference costs has become a priority, even if it requires additional training compute.

## Training Compute

Training compute is measured in **FLOPs** (floating-point operations) and is one of the most important factors determining model quality. The compute required for frontier models has grown exponentially. Epoch AI estimates that the training compute of notable models grew about 4.1 times per year between 2010 and May 2024 (90% confidence interval: 3.7x to 4.6x), with frontier language models accelerating to roughly 5 times per year since 2020.[14] This outpaces historically fast technology rollouts such as mobile-phone adoption (about 2 times per year in the 1980s).[14]

| Model | Year | Estimated Training FLOPs | Estimated Cost |
|---|---|---|---|
| GPT-3 | 2020 | 3.1 x 10^23 | ~$4.6M |
| [GPT-4](/wiki/gpt-4) | 2023 | 2.1 x 10^25 | ~$78-100M+ |
| [Gemini](/wiki/gemini) Ultra | 2023 | 5.0 x 10^25 | ~$191M |
| [Llama](/wiki/llama) 3.1 405B | 2024 | 3.8 x 10^25 | ~$25M |
| [DeepSeek](/wiki/deepseek) V3 | 2024 | ~5 x 10^24 | ~$5.6M |
| [Claude](/wiki/claude) 3 Opus | 2024 | 1.6 x 10^25 | Tens of millions |

As of June 2025, over 30 publicly announced models from 12 different developers had exceeded the 10^25 FLOP training threshold, a scale first reached by GPT-4 in March 2023.[6] Epoch AI projects this count could reach a median of about 165 models (90% CI: 103 to 306) by the end of 2028.[6] Training costs for frontier models have been growing at roughly 2.4 times per year, though the cost per unit of compute continues to drop due to hardware and algorithmic improvements.[7]

GPT-4, with an estimated 1.76 trillion parameters, reportedly required approximately 25,000 NVIDIA A100 GPUs running for about 90 days.[7] By contrast, DeepSeek V3 achieved frontier-level performance at a reported GPU cost of approximately $5.6 million for its final training run, through a Mixture-of-Experts architecture and optimized training procedures, demonstrating that algorithmic efficiency can dramatically reduce costs.
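A common back-of-the-envelope estimate for dense transformer training compute is about 6 FLOPs per parameter per training token, an approximation used in the scaling-law literature.[3] The sketch below applies it to two models. The 300-billion-token figure for GPT-3 is an assumption taken from outside this article, the function name is hypothetical, and the rule does not transfer directly to Mixture-of-Experts models, where only a subset of parameters is active per token.

```python
def approx_training_flops(num_params, num_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens

# GPT-3: 175B parameters, assumed ~300B training tokens
gpt3_flops = approx_training_flops(175e9, 300e9)         # about 3.15e23

# Chinchilla: 70B parameters, ~1.4T tokens
chinchilla_flops = approx_training_flops(70e9, 1.4e12)   # about 5.9e23

# GPU time implied by the reported GPT-4 figures: 25,000 GPUs for 90 days
gpt4_gpu_hours = 25_000 * 90 * 24                        # 54,000,000 GPU-hours
```

The GPT-3 estimate lands close to the 3.1 x 10^23 figure in the table, which suggests the rule is adequate for order-of-magnitude comparisons.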

## Types of Training

Machine learning employs several distinct training paradigms, each suited to different kinds of data and tasks.

### Supervised Learning

[Supervised learning](/wiki/supervised_learning) is the most common training paradigm. The model receives labeled examples (input-output pairs) and learns to map inputs to their corresponding outputs. The loss function measures the discrepancy between the model's predictions and the true labels. Common applications include image classification, spam detection, and medical diagnosis.

### Unsupervised Learning

[Unsupervised learning](/wiki/unsupervised_learning) trains on data without labels. The model identifies patterns, structures, or groupings within the data on its own. Common techniques include [clustering](/wiki/clustering), dimensionality reduction, and density estimation.

### Semi-Supervised Learning

[Semi-supervised learning](/wiki/semi-supervised_learning) combines a small amount of labeled data with a large amount of unlabeled data. The labeled examples guide the learning process while the unlabeled data provides additional structural information, making this approach practical when labeling is expensive.

### Self-Supervised Learning

[Self-supervised learning](/wiki/self-supervised_learning) generates its own supervisory signal from the data itself, without requiring human-provided labels. The model is trained on pretext tasks such as predicting masked tokens (as in [BERT](/wiki/bert)), predicting the next token (as in [GPT](/wiki/gpt)), or predicting future frames in video. This paradigm powers the pre-training phase of most modern [large language models](/wiki/large_language_model).

### Reinforcement Learning

[Reinforcement learning](/wiki/reinforcement_learning) trains an agent to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and adjusts its policy to maximize cumulative reward over time.

## Training Large Language Models

Training modern [large language models](/wiki/large_language_model) (LLMs) involves a multi-stage pipeline that goes beyond a single training phase.

### Stage 1: Pre-training

[Pre-training](/wiki/pre-training) is the most computationally expensive stage. The model learns language patterns, factual knowledge, and reasoning abilities from vast corpora of text (often trillions of tokens drawn from web crawls, books, and code repositories). The training objective is typically next-token prediction: given a sequence of tokens, predict what comes next. Pre-training establishes the model's general capabilities and accounts for the vast majority of total training compute. DeepSeek V3, for example, was pre-trained on 14.8 trillion tokens at a reported cost of 2.788 million H800 GPU hours, which the team estimated at $5.576 million assuming a rental price of $2 per GPU hour.[15]

### Stage 2: Supervised Fine-Tuning (SFT)

After pre-training, the model undergoes [supervised fine-tuning](/wiki/fine_tuning) on curated datasets of high-quality input-output pairs. These datasets contain examples of desired behavior: following instructions, answering questions accurately, writing code, and refusing harmful requests.[8] SFT is computationally cheap relative to pre-training (often 100 times less expensive or more) but has a large impact on the model's usefulness and safety.

### Stage 3: Reinforcement Learning from Human Feedback (RLHF)

[RLHF](/wiki/rlhf) further aligns the model with human preferences.[8] The InstructGPT paper that introduced the modern recipe opened with the observation that "making language models bigger does not inherently make them better at following a user's intent," and showed that a 1.3-billion-parameter aligned model could be preferred by human raters over the 175-billion-parameter GPT-3, despite having 100 times fewer parameters.[8] The process involves three sub-steps:

1. **Collecting comparisons.** Human annotators compare multiple model responses to the same prompt and rank them by quality.
2. **Training a reward model.** A separate model learns to predict human preferences based on the comparison data.
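3. **Policy optimization.** The language model is fine-tuned with a reinforcement learning algorithm (PPO in the InstructGPT recipe) that maximizes the reward model's score while staying close to the SFT model.[8] Direct Preference Optimization (DPO) is a later alternative that is not itself a reinforcement learning algorithm: it optimizes the policy directly on the comparison data without a separate reward model.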

RLHF helps the model produce responses that are more helpful, harmless, and honest, addressing aspects of quality that are difficult to capture in supervised training data alone.
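The reward-model step is typically trained with a pairwise loss: given scalar rewards for a preferred and a rejected response, the loss is the negative log-sigmoid of their difference. The sketch below shows only that loss on hand-picked reward values; it is an illustration under those assumptions, not a full RLHF implementation, and the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow in exp
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

agree = pairwise_preference_loss(2.0, -1.0)      # reward model agrees with the annotator
disagree = pairwise_preference_loss(-1.0, 2.0)   # reward model ranks the pair the wrong way
tie = pairwise_preference_loss(0.5, 0.5)         # equal rewards give ln(2)
```

The loss is small when the reward model scores the human-preferred response higher and grows roughly linearly as it gets the ranking more badly wrong.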

## Training Instabilities

Training deep neural networks, especially at scale, is prone to several instability issues.

### Loss Spikes

Loss spikes are sudden, sharp increases in the training loss that can occur during otherwise stable training. They are often caused by exploding gradients, bad data batches, or numerical overflow. In large-scale LLM training runs costing millions of dollars, loss spikes can waste significant compute if not handled properly. Common mitigations include gradient clipping, learning rate warmup, and skipping corrupted data batches.

### Vanishing and Exploding Gradients

The [vanishing gradient problem](/wiki/vanishing_gradient_problem) occurs when gradients become extremely small during backpropagation through many layers, effectively preventing the earlier layers from learning. The [exploding gradient problem](/wiki/exploding_gradient_problem) is the opposite: gradients grow exponentially, causing unstable parameter updates and potentially NaN (not a number) values. Solutions include careful weight initialization (such as Xavier or He initialization), batch normalization, residual connections, and gradient clipping.
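Gradient clipping, one of the mitigations listed above, is often applied to the global norm of all gradients. The following is a minimal sketch over a flat list of numbers; deep learning frameworks provide their own versions (for example `torch.nn.utils.clip_grad_norm_` in PyTorch), and the function name and threshold here are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# An "exploding" gradient with norm 13 is scaled down to norm 1
clipped, norm_before = clip_by_global_norm([3.0, 4.0, 12.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Clipping changes the length of the update but not its direction, which is why it limits the damage of a spike without otherwise redirecting training.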

### NaN Loss

A NaN loss indicates a catastrophic numerical failure, usually caused by division by zero, taking the logarithm of zero or negative numbers, or unchecked gradient explosions. NaN loss typically requires restarting training from a checkpoint. Using BF16 mixed precision instead of FP16 reduces the risk because BF16's wider dynamic range prevents many overflow and underflow scenarios.[4]

## Distributed Training

Modern large-scale models cannot be trained on a single GPU. Distributed training spreads the workload across many devices, sometimes thousands.

### Data Parallelism

In [data parallelism](/wiki/data_parallelism), each GPU holds a complete copy of the model but processes a different subset of the training data. After each forward and backward pass, gradients are synchronized across all GPUs (typically via an all-reduce operation) so that every copy stays in sync.[10] This is the simplest form of distributed training and works well when the model fits in a single GPU's memory.
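The gradient synchronization step can be simulated on a single machine. In the sketch below, four hypothetical workers each compute a gradient on their own shard, and a stand-in for all-reduce averages them. It assumes equal shard sizes and a toy one-weight model; real systems perform the reduction over a network with a collective communication library.

```python
def mse_gradient(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up holding the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

num_workers = 4
shard = len(xs) // num_workers

# Each worker holds the same w but sees a different slice of the batch
local_grads = [
    mse_gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_workers)
]

synced = all_reduce_mean(local_grads)
full_batch_grad = mse_gradient(w, xs, ys)
```

With equal shards, the averaged gradient matches the gradient a single device would compute on the whole batch, so every replica applies the same update.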

### Model Parallelism

When a model is too large to fit on a single GPU, [model parallelism](/wiki/model_parallelism) splits the model across multiple devices. **Tensor parallelism** divides individual layers (such as large matrix multiplications) across GPUs, while **pipeline parallelism** assigns different layers to different GPUs, processing micro-batches in a pipelined fashion to reduce idle time.

### Hybrid Parallelism

Frontier models typically combine all three strategies. For example, Meta's [Llama 3.1](/wiki/llama_3) 405B was trained using tensor parallelism of 8, pipeline parallelism of 16, and data parallelism ranging from 8 to 128, distributed across 16,384 [NVIDIA H100](/wiki/nvidia_h100) GPUs.[16] Meta reported a cumulative 39.3 million H100 GPU hours for the Llama 3.1 family (8B, 70B, and 405B), of which about 30.8 million went to the 405B model.[16] Frameworks like DeepSpeed, Megatron-LM, and PyTorch FSDP (Fully Sharded Data Parallel) provide implementations of these strategies.[10]
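The parallelism degrees multiply to give the number of GPUs in use. A quick check with the figures quoted above, taking the upper end of the data-parallel range (the variable names are illustrative):

```python
tensor_parallel = 8
pipeline_parallel = 16
data_parallel = 128   # upper end of the 8 to 128 range quoted above

gpus_per_model_replica = tensor_parallel * pipeline_parallel   # 128 GPUs hold one copy of the model
total_gpus = gpus_per_model_replica * data_parallel            # 16,384
```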

### Why is large-scale training so unreliable?

At the scale of tens of thousands of accelerators, hardware failures become routine rather than exceptional. Meta reported that during a 54-day snapshot of Llama 3 405B pre-training on 16,384 H100 GPUs, the cluster experienced 466 job interruptions, 419 of them unexpected, which works out to roughly one unexpected failure every three hours.[16] About 78 percent of those unexpected interruptions were attributed to hardware issues, with GPU faults and HBM3 memory failures the leading causes.[16] Despite this, Meta maintained over 90 percent effective training time through automated detection and recovery from checkpoints.[16] This reliability challenge is a primary reason that checkpointing, redundancy, and rapid fault recovery are central to modern training infrastructure.
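The failure rate follows directly from the reported counts. The variable names below are illustrative.

```python
snapshot_days = 54
total_interruptions = 466
unexpected_interruptions = 419

snapshot_hours = snapshot_days * 24                                      # 1,296 hours
hours_between_failures = snapshot_hours / unexpected_interruptions       # about 3.1 hours
planned_interruptions = total_interruptions - unexpected_interruptions   # 47
```

An interval of about three hours between unexpected failures puts a practical upper bound on how much work can be lost per incident if checkpoints are written more often than that.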

## Training Techniques and Tricks

Practitioners use a variety of techniques to improve training efficiency, stability, and final model quality.

### Mixed-Precision Training

Mixed-precision training uses lower-precision numerical formats (FP16 or BF16) for most computations while keeping a master copy of the weights in FP32.[4] This cuts memory usage roughly in half and speeds up training by 30 to 50 percent on GPUs with tensor cores, with minimal impact on model quality.[4] BF16 (Brain Float 16) is generally preferred over FP16 for training because it has the same dynamic range as FP32, eliminating the need for loss scaling in most cases.
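The difference in dynamic range can be demonstrated with Python's standard library. In this sketch, FP16 is handled through the `struct` module's half-precision format, and BF16 is emulated by keeping only the top 16 bits of the FP32 encoding. The emulation is an assumption made for illustration: it truncates, whereas real hardware typically rounds to nearest.

```python
import struct

def to_fp16(value):
    """Round-trip through IEEE half precision; raises OverflowError if too large."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def to_bf16_truncated(value):
    """Emulate BF16 by keeping the sign, 8 exponent bits, and top 7 mantissa bits of FP32."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

big = 70000.0   # above FP16's largest finite value (65504)
try:
    to_fp16(big)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True

bf16_value = to_bf16_truncated(big)    # finite, but only accurate to about 1%

tiny_fp16 = to_fp16(1e-8)              # underflows to 0.0 in FP16
tiny_bf16 = to_bf16_truncated(1e-8)    # still nonzero in BF16
```

BF16 trades mantissa precision for exponent range, which is why large and small gradient values survive in BF16 where FP16 would need loss scaling.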

### Learning Rate Warmup and Scheduling

Starting training with a high learning rate can cause instability because the randomly initialized parameters produce large, noisy gradients. Learning rate warmup begins with a very small learning rate and linearly increases it over a set number of steps (typically a few hundred to a few thousand). After warmup, a schedule such as cosine annealing gradually reduces the [learning rate](/wiki/learning_rate) over the remainder of training.[9] This warmup-then-decay pattern has become standard practice for training [transformers](/wiki/transformer).
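A warmup-then-cosine schedule can be written as a single function of the step number. This is one common formulation, not the only one, and the peak learning rate, warmup length, and total step count below are hypothetical values.

```python
import math

def learning_rate_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [
    learning_rate_at(s, peak_lr=3e-4, warmup_steps=100, total_steps=1000)
    for s in range(1001)
]
```

Plotting `schedule` shows a short linear ramp over the first 100 steps followed by a smooth decay toward zero at step 1000.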

### Gradient Accumulation

Gradient accumulation allows training with effectively larger batch sizes when GPU memory is limited. Instead of updating the model after every batch, gradients are accumulated over several mini-batches, and the parameter update is applied once the number of accumulated examples reaches the desired effective batch size. Provided the accumulated gradients are averaged over the effective batch, this gives the same update as a single larger batch, apart from layers such as batch normalization whose statistics depend on the per-step batch.
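A minimal sketch of accumulation on a toy one-weight model follows. The model, data, and sizes are assumptions for illustration; the key detail is that each micro-batch contribution is divided by the effective batch size so the accumulated value is a mean.

```python
def summed_gradient(w, xs, ys):
    """Sum over examples of d/dw (w*x - y)^2."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

micro_batch_size = 2
accumulation_steps = 4
effective_batch_size = micro_batch_size * accumulation_steps   # 8

accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Only one micro-batch is "in memory" at a time
    accumulated += summed_gradient(w, xs[lo:hi], ys[lo:hi]) / effective_batch_size

# A single optimizer step after all micro-batches have been processed
learning_rate = 0.01
w_accumulated = w - learning_rate * accumulated

# Reference: one step on the full batch of 8
large_batch_grad = summed_gradient(w, xs, ys) / len(xs)
w_large_batch = w - learning_rate * large_batch_grad
```

Both paths arrive at the same updated weight, while the accumulated path never holds more than two examples at once.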

### Gradient Checkpointing

During the forward pass, intermediate activations are stored in memory for use during backpropagation. [Gradient checkpointing](/wiki/checkpoint) saves only a subset of these activations and recomputes the rest as needed during the backward pass.[11] This trades roughly 20 percent additional compute time for a large reduction in memory usage, making it possible to train larger models or use larger batch sizes on the same hardware.[11]
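The idea can be illustrated on a toy chain of scalar layers, where each layer's backward step needs that layer's input. The sketch compares storing every input against storing one checkpoint per segment and recomputing the rest. The layer definition, weights, and segment length are hypothetical, and the storage counts are a simplified proxy for memory use.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad(w, x):
    """Derivative of the layer output with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_full_storage(weights, x0):
    inputs = [x0]                          # keep every layer input
    for w in weights:
        inputs.append(layer(w, inputs[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(inputs[:-1])):
        grad *= layer_grad(w, x)
    return grad, len(inputs) - 1           # number of stored layer inputs

def grad_checkpointed(weights, x0, segment):
    checkpoints = {}                       # forward: keep only each segment's input
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grad = 1.0
    peak_stored = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        seg_inputs = [checkpoints[start]]  # backward: recompute this segment's inputs
        for w in seg_weights[:-1]:
            seg_inputs.append(layer(w, seg_inputs[-1]))
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for w, x in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= layer_grad(w, x)
    return grad, peak_stored

weights = [1.0 + 0.05 * (i % 4) for i in range(16)]
g_full, stored_full = grad_full_storage(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)
```

Both functions return the same gradient; the checkpointed version holds at most 8 values at a time instead of 16, at the price of re-running part of the forward pass.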

### Curriculum Learning

Curriculum learning arranges training data so that the model encounters easier examples first and progressively harder examples later, mimicking how humans learn.[5] This structured ordering can improve convergence speed, produce better local minima, and improve generalization, all without additional computational cost.[5] The main challenges are defining a difficulty metric for the data and designing the schedule for increasing difficulty.
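One simple way to build a curriculum is to sort examples by a difficulty score and train on a pool that grows stage by stage. In the sketch below, character length stands in for difficulty, which is purely an assumption for illustration; choosing a meaningful metric is the hard part in practice. The function name and stage count are hypothetical.

```python
def curriculum_stages(examples, difficulty, num_stages):
    """Yield growing training pools, easiest examples first."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = max(1, len(ordered) * stage // num_stages)
        yield ordered[:cutoff]

sentences = [
    "a cat",
    "the cat sat on the mat",
    "cats",
    "a cat sat",
    "the old cat sat quietly on the mat",
]

# Hypothetical difficulty metric: longer sentences count as harder
stages = list(curriculum_stages(sentences, difficulty=len, num_stages=3))
```

Each later stage keeps the easier examples and adds harder ones, so the final stage covers the full dataset.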

## Monitoring Training

Tracking metrics during training is essential for diagnosing problems and ensuring the model is learning effectively. Two platforms dominate this space.

| Platform | Type | Key Features |
|---|---|---|
| TensorBoard | Open-source (Google) | Real-time visualization of loss, accuracy, and learning rate curves; histograms of weight and gradient distributions; model graph visualization; embedding projections. Integrates with both [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch). |
| Weights & Biases (W&B) | Cloud-based (free tier available) | Experiment tracking with automatic logging of hyperparameters, git state, and hardware metrics; rich dashboards for comparing runs; gradient and parameter distribution tracking; team collaboration features; integration with most major frameworks. |

Key metrics to monitor during training include:

- **Training loss and validation loss.** A validation loss that plateaus or rises while the training loss keeps falling indicates [overfitting](/wiki/overfitting).
- **Gradient norms.** A sudden spike in the gradient norm often precedes or coincides with a loss spike.
- **Learning rate.** Confirms the schedule is progressing as intended.
- **GPU utilization and throughput.** Ensures hardware is being used efficiently.
- **Evaluation metrics.** Task-specific measures (accuracy, [F1 score](/wiki/f1_score), BLEU, perplexity) computed periodically on a held-out validation set.

## Reproducibility

Reproducing a training run exactly is one of the most persistent challenges in machine learning. Sources of non-determinism include:

- **Random seeds.** Weight initialization, data shuffling, and dropout all depend on random number generators. Setting fixed seeds is necessary but not sufficient for reproducibility.
- **Hardware differences.** Results can differ between CPU and GPU execution, and between different GPU architectures, because floating-point operations are not strictly associative.[12]
- **Software versions.** Different versions of frameworks, CUDA libraries, and operating systems can produce different results.
- **Parallelism.** In distributed training, the order in which gradient reductions happen may vary across runs, introducing small numerical differences that can compound over many steps.

Best practices for improving reproducibility include pinning all software versions, recording hyperparameters and random seeds, using deterministic algorithms where available, and leveraging experiment tracking tools to log every detail of each run.[12]
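Two of the points above can be shown with the standard library alone: a fixed seed makes a shuffle repeatable, and floating-point addition gives different results depending on the order of operations. The function name is hypothetical, and a real training setup would also seed its framework's generators and enable that framework's deterministic modes.

```python
import random

def shuffled_indices(seed, n=10):
    """Shuffle with a private generator so global random state does not interfere."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

same_seed_matches = shuffled_indices(42) == shuffled_indices(42)   # True

# Floating-point addition is not associative, so reduction order matters
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left                     # True
```

The difference between the two sums is tiny, but over millions of parameter updates such discrepancies can compound into visibly different training runs.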

## Training Cost Trends

The economics of training have shifted rapidly. According to Cottier et al. (2024), the amortized cost to train the most compute-intensive models has grown at about 2.4 times per year since 2016 (90% CI: 2.0x to 2.9x), and the authors project that "the largest training runs will cost more than a billion dollars by 2027" if the trend continues.[7] The largest runs in 2025 were estimated at $100 million to $400 million.[7] At the same time, the cost per unit of compute drops roughly 10 times per decade due to improvements in hardware (such as NVIDIA's H100 and B200 GPUs), software frameworks, and algorithmic efficiency.[7]

| Year | Representative Model | Approximate Training Cost |
|---|---|---|
| 2018 | BERT-Large | ~$7,000 |
| 2020 | [GPT-3](/wiki/gpt-3) 175B | ~$4.6M |
| 2023 | [GPT-4](/wiki/gpt-4) | ~$78-100M+ |
| 2024 | [Llama](/wiki/llama) 3.1 405B | ~$25M |
| 2024 | [DeepSeek](/wiki/deepseek_v3) V3 | ~$5.6M |
| 2025 | Frontier 405B+ | $100M-$400M |

Notably, DeepSeek V3 demonstrated that algorithmic innovations (such as a [Mixture-of-Experts](/wiki/mixture_of_experts) architecture and optimized training procedures) can achieve frontier-level performance at a fraction of the cost of brute-force scaling.[7][15] This suggests that raw compute spending is not the only path to better models.

GPU cloud pricing has also evolved. In mid-2025, AWS reduced H100 instance pricing by approximately 44 percent, bringing on-demand costs to roughly $3.90 per GPU-hour, with reserved pricing falling below $2.00 per GPU-hour.[13]

## Explain Like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into a hoop. Every time you throw the ball, you watch where it lands. If it went too far to the left, you adjust and throw a little more to the right next time. If it fell short, you throw harder. Each throw is like one training step: you try something, see how wrong you were, and make a small correction.

Training a computer works the same way. You show it thousands of examples (like photos of cats and dogs) and each time it guesses wrong, it adjusts its internal settings a tiny bit. After seeing enough examples and making enough adjustments, it gets really good at telling cats from dogs, even when it sees a photo it has never seen before.

## References

1. Rumelhart, D., Hinton, G., & Williams, R. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
2. Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models." *arXiv:2203.15556*.
3. Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models." *arXiv:2001.08361*.
4. Micikevicius, P., et al. (2018). "Mixed Precision Training." *ICLR 2018*.
5. Bengio, Y., et al. (2009). "Curriculum Learning." *Proceedings of ICML*.
6. Epoch AI. (2025). "Over 30 AI models have been trained at the scale of GPT-4." *epoch.ai*.
7. Cottier, B., Rahman, R., et al. (2024). "The rising costs of training frontier AI models." *arXiv:2405.21015*.
8. Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." *NeurIPS 2022 (arXiv:2203.02155)*.
9. Loshchilov, I. & Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." *ICLR 2017*.
10. Rajbhandari, S., et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." *SC20*.
11. Chen, T., et al. (2016). "Training Deep Nets with Sublinear Memory Cost." *arXiv:1604.06174*.
12. PyTorch Documentation. (2025). "Reproducibility." *pytorch.org*.
13. Amazon Web Services. (2025). "Announcing up to 45% price reduction for Amazon EC2 NVIDIA GPU-accelerated instances." *aws.amazon.com*.
14. Sevilla, J., et al. (2024). "Training Compute of Frontier AI Models Grows by 4-5x per Year." *Epoch AI, epoch.ai*.
15. DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report." *arXiv:2412.19437*.
16. Grattafiori, A., et al. (2024). "The Llama 3 Herd of Models." *arXiv:2407.21783*.