Training run
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,190 words
A training run is a single, deliberate instance of training a neural network from scratch (or from a prior checkpoint) on a specified dataset, with a fixed compute budget, hardware allocation, and time horizon. In the parlance of large AI labs, the phrase typically refers to large-scale pretraining of foundation models: a single run might consume tens of thousands of GPUs continuously for months and cost tens or hundreds of millions of dollars in compute alone.[1] Modern frontier training runs are major engineering operations that combine data preparation, distributed optimization, checkpointing and failure recovery, and a long tail of post-training stages such as supervised fine-tuning and preference optimization. The discipline grew out of academic deep learning training around 2015, but acquired its current scale and operational complexity with the arrival of GPT-3 in 2020 and successor frontier models from 2022 onward.[2][3]
The phrase "training run" predates the deep learning era. Researchers training small statistical models would speak of a run as a single execution of an optimizer over a dataset, distinct from sweeps over hyperparameters or repeated trials for variance estimation. With the rise of deep neural networks in the early 2010s, runs lengthened from minutes to days, and from single GPUs to small clusters. The publication of "Attention Is All You Need" in 2017 and the GPT family that followed produced models whose training durations were measured in weeks, then months, on bespoke supercomputers.[4]
Two developments changed the operational character of the training run. The first was the publication of scaling laws by Kaplan and colleagues in 2020 and of the Chinchilla scaling analysis by Hoffmann and colleagues at DeepMind in 2022, which made it possible to predict, before launch, what model size and token budget would extract the most capability per FLOP.[2][5] The second was the construction of GPU and TPU clusters with tens of thousands of accelerators, which transformed a training run from a software exercise into a hardware reliability problem: at sufficient scale, component failures become frequent enough that every run must plan for many interruptions, and engineering effort shifts from optimization to checkpointing, monitoring, and recovery.[6]
By 2024 a frontier pretraining run, such as the one Meta executed for Llama 3.1 405B, occupied 16,384 H100 GPUs for 54 days of continuous operation and experienced 466 interruptions, of which 419 were unplanned.[6] The DeepSeek V3 technical report described a single pretraining run lasting under two months on 2,048 H800 GPUs and consuming 2.788 million GPU hours, at an estimated rental cost of about USD 5.576 million.[7] These figures define the contemporary reference points for what "a training run" denotes.
A frontier-class training run typically follows a defined sequence of phases. The order and naming vary by lab, but the underlying flow is broadly consistent.
Before any GPU is allocated to the main run, engineers complete several preparatory steps:
The compute budget is estimated with the heuristic 6 * N * D (where N is the non-embedding parameter count and D is the number of training tokens). Combined with peak hardware throughput, this yields an expected wall-clock duration and dollar cost.[8] The 6 in the heuristic captures two FLOPs per parameter for the forward pass and four for the backward pass; sequence-length-dependent attention terms add a quadratic correction that is small for short contexts and large for the long contexts used in current frontier runs.[8]
The forward and backward passes are parallelized across the cluster using a combination of strategies, often called 3D parallelism:
To fit very large models into accelerator memory, optimizer states, gradients, and parameters are also sharded. FSDP, the PyTorch implementation of Microsoft's ZeRO Stage 3 protocol, partitions all three across data-parallel ranks; only the parameters needed for the current forward or backward step are gathered on demand.[9] Adam-family optimizers, the standard choice since the Adam optimizer paper of 2015, add two state values per parameter (a first-moment and a second-moment estimate) on top of the weights themselves; in mixed precision the optimizer state typically dominates the memory bill.[10]
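The memory arithmetic behind this paragraph can be sketched as follows. The byte counts are an assumption based on the commonly cited mixed-precision Adam layout (16-bit weights and gradients, plus fp32 master weights and two fp32 moments, for 16 bytes per parameter); activations and temporary buffers are ignored, and the function name is illustrative rather than part of any library.

```python
def training_state_bytes(n_params, n_shards=1):
    """Approximate bytes of model state held per rank.

    Assumes mixed-precision Adam (an assumption, not a measured figure):
    16-bit weights and gradients, fp32 master weights, two fp32 moments.
    n_shards > 1 models full sharding of all three state families.
    """
    bytes_per_param = {
        "weights_16bit": 2,
        "grads_16bit": 2,
        "master_weights_fp32": 4,
        "adam_first_moment_fp32": 4,
        "adam_second_moment_fp32": 4,
    }
    total = n_params * sum(bytes_per_param.values())
    return total / n_shards


# Hypothetical 7-billion-parameter model, unsharded vs. sharded over 64 ranks.
unsharded_gb = training_state_bytes(7e9) / 1e9
sharded_gb = training_state_bytes(7e9, n_shards=64) / 1e9
print(f"unsharded: {unsharded_gb:.0f} GB, per rank over 64 shards: {sharded_gb:.2f} GB")
```

Under these assumptions 12 of the 16 bytes per parameter belong to the optimizer (master weights and moments), which is why sharding optimizer state is usually the first memory optimization applied.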
Once launched, training proceeds in steps: each global batch is loaded, a forward pass computes the loss (almost always next-token cross-entropy for pretraining), a backward pass produces gradients, and the optimizer updates the weights. A learning-rate schedule, usually a linear warmup followed by cosine decay, drives the magnitude of updates. Llama 3 405B used a peak learning rate of 8e-5, an 8,000-step warmup, and a cosine decay to 8e-7 over roughly 1.2 million steps.[6] Batch sizes are typically warmed up alongside the learning rate, starting from a few hundred thousand tokens per step and ramping to the target value (Llama 3 reported 16 million tokens per batch at the long-sequence stage) so that the optimizer reaches its high-batch regime smoothly.[6]
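A minimal sketch of the schedule described above, using the Llama 3 405B values quoted in the text as defaults. The function name is illustrative, and the choice to run the cosine over the steps remaining after warmup (rather than over all steps) is an assumption; implementations differ on this detail.

```python
import math


def lr_at_step(step, peak_lr=8e-5, min_lr=8e-7,
               warmup_steps=8_000, total_steps=1_200_000):
    """Linear warmup from 0 to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped so that steps past
    # total_steps stay at min_lr.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 4_000, 8_000, 600_000, 1_200_000):
    print(s, f"{lr_at_step(s):.3e}")
```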
During steady state, the engineering team monitors a small set of dashboards almost continuously:
The PaLM 540B run sustained 46.2 percent model FLOPs utilization and 57.8 percent hardware FLOPs utilization on 6,144 TPU v4 chips, which were reported as the highest figures yet achieved at that scale for a dense Transformer.[11]
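Model FLOPs utilization can be approximated from observed token throughput. The sketch below uses the 6 * N FLOPs-per-token heuristic from earlier in this article and omits attention FLOPs, which published figures may additionally count, so it is an approximation rather than the method behind any specific reported number. All inputs in the example are hypothetical.

```python
def model_flops_utilization(n_params, tokens_per_second,
                            n_chips, peak_flops_per_chip):
    """Ratio of useful model FLOP/s to the cluster's theoretical peak.

    Uses 6 * n_params FLOPs per token (forward plus backward) and does
    not count recomputation, which is what separates MFU from hardware
    FLOPs utilization.
    """
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    return achieved_flops_per_second / (n_chips * peak_flops_per_chip)


# Hypothetical: 1B-parameter model, 100k tokens/s on 8 chips of 150 TFLOP/s peak.
mfu = model_flops_utilization(1e9, 1e5, 8, 1.5e14)
print(f"MFU = {mfu:.1%}")
```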
Because failures are routine, the training process serializes its full state (model weights, optimizer moments, learning-rate schedule position, data loader cursor, RNG state) to durable storage at regular intervals. For BLOOM the optimizer-inclusive checkpoint weighed 2.3 TB, compared with 329 GB for the weights alone.[12] Checkpoint cadence trades off recovery cost against the wall-clock pause incurred during writing; the Megatron-LM framework and its successors implement asynchronous and in-memory checkpoint protocols to hide most of this cost.[13]
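The state listed above can be illustrated with a toy, single-process sketch that uses only the Python standard library. It is not how any production framework stores checkpoints (those are sharded and often asynchronous); the helper names and the plain-list "weights" are hypothetical. The one design point it does show is writing to a temporary file and renaming, so a crash during a write cannot leave a truncated checkpoint in place.

```python
import os
import pickle
import random
import tempfile


def make_state(step, weights, adam_m, adam_v, data_cursor):
    """Bundle everything needed to resume deterministically."""
    return {
        "step": step,                 # also fixes the LR schedule position
        "weights": weights,
        "adam_m": adam_m,             # first-moment estimates
        "adam_v": adam_v,             # second-moment estimates
        "data_cursor": data_cursor,   # where the data loader should resume
        "rng_state": random.getstate(),
    }


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

On resume, the caller would restore the RNG with `random.setstate(state["rng_state"])` and seek the data loader to `state["data_cursor"]` before taking the next step.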
When a job fails, an on-call engineer (or, increasingly, automation) identifies the affected rank, reschedules around the failed node, and resumes from the last checkpoint, sometimes also skipping a window of training batches if a loss spike is suspected. The OPT-175B logbook documented job uptime of 51.7 to 58.9 percent over the run, with over 100 restarts during the roughly two-month run.[14][13]
Once pretraining concludes, the model enters a fine-tuning pipeline collectively called post-training. The modern recipe, established by InstructGPT in 2022 and refined since, comprises three families of operations applied in alternating rounds:[15]
For Llama 3 the post-training stage used iterative rounds of SFT, rejection sampling against a reward model, and DPO with a beta of 0.1, with the SFT stage running at a learning rate of 1e-5 over 8,500 to 9,000 steps.[6]
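For reference, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch of the textbook loss with the beta of 0.1 quoted above as the default; it does not attempt to reproduce any lab-specific modifications, and the argument names are illustrative. Each argument is the summed log-probability of a full response under the trained policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```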
A final evaluation suite, covering capability benchmarks (such as MMLU, GSM8K, HumanEval, IFEval), safety probes, and red-team tests, is run against intermediate and final checkpoints. The decision to release is made on the basis of these results plus internal qualitative review.
The dominant cost driver is GPU-hours multiplied by the rental or amortized cost per GPU-hour. The standard Transformer FLOP heuristic gives an approximate FLOP budget of 6 * N * D for a dense model with N parameters trained on D tokens (and roughly 2 * N * D for inference). Dividing by sustained throughput in FLOPs per second yields wall-clock time; multiplying by the relevant accelerator-hour cost yields a compute estimate.[8]
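The estimate described above can be sketched as three small functions. The parameter and token counts in the example are the Llama 3.1 405B figures from this article; the peak throughput, utilization, and hourly price are placeholder assumptions chosen only to make the example run, not reported values.

```python
def training_flops(n_params, n_tokens):
    """Dense-Transformer heuristic: 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def wall_clock_days(total_flops, n_accelerators,
                    peak_flops_per_accelerator, utilization):
    """Divide the FLOP budget by sustained cluster throughput."""
    sustained = n_accelerators * peak_flops_per_accelerator * utilization
    return total_flops / sustained / 86_400  # seconds per day


def compute_cost_usd(days, n_accelerators, usd_per_accelerator_hour):
    return days * 24 * n_accelerators * usd_per_accelerator_hour


flops = training_flops(405e9, 15.6e12)
print(f"{flops:.2e} FLOPs")  # about 3.79e25, in line with the 3.8e25 budget cited

# Assumed values (hypothetical): 1e15 FLOP/s peak, 40% utilization, USD 2/hour.
days = wall_clock_days(flops, 16_384, 1e15, 0.40)
print(f"{days:.0f} days, USD {compute_cost_usd(days, 16_384, 2.0):,.0f}")
```

The result is sensitive to the utilization figure, which is why sustained throughput, not peak, is the number to measure before committing to a budget.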
Several training runs have become reference points in the literature because their costs and durations are explicitly documented or widely estimated:
| Model | Year | Parameters | Training tokens | Hardware | Wall clock | Documented compute or cost |
|---|---|---|---|---|---|---|
| GPT-3 | 2020 | 175B | 300B | V100 GPUs | months | ~3.14e23 FLOPs; estimated ~USD 4.6M[18][19] |
| PaLM | 2022 | 540B | 780B | 6,144 TPU v4 | ~50 days | 46.2% MFU; 57.8% HFU[11][20] |
| OPT-175B | 2021 to 2022 | 175B | 180B | 992 to 1,024 A100 80GB | 56 days | 4.30e23 FLOPs; ~147 TFLOP/s/GPU[14] |
| BLOOM | 2022 | 176B | 366B | 384 A100 80GB (Jean Zay) | 3 to 4 months | ~150 TFLOP/s sustained[12] |
| GPT-4 | 2023 | undisclosed | undisclosed | A100 GPUs | months | estimated USD 40M to USD 100M[21][22] |
| Llama 3.1 405B | 2024 | 405B | 15.6T | 16,384 H100 | 54 days (snapshot) | 3.8e25 FLOPs pretraining budget[6] |
| DeepSeek V3 | 2024 | 671B (37B active) | 14.8T | 2,048 H800 | <2 months | 2.788M GPU hours; ~USD 5.576M at USD 2/hour[7] |
The DeepSeek V3 figure of about USD 5.576 million has attracted particular attention because it is roughly an order of magnitude lower than the prevailing assumption for a frontier dense-equivalent run, an effect attributable mainly to the use of a sparse mixture of experts architecture, FP8 mixed-precision training, and several systems-level optimizations described in the technical report.[7] The DeepSeek authors explicitly note that the figure covers only the official run and excludes prior research, ablations, and post-training data labelling.[7] Independent commentary has emphasized that the marginal compute cost of a single run is a fraction of total program cost.[23]
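The headline figure is simple arithmetic on the numbers already quoted in the table, which the following lines reproduce. The only derived quantity is the implied duration, which assumes all 2,048 GPUs were occupied for the whole run.

```python
gpu_hours = 2.788e6        # reported GPU hours for the official run
n_gpus = 2048              # H800 GPUs
usd_per_gpu_hour = 2.0     # rental price assumed in the report

cost_usd = gpu_hours * usd_per_gpu_hour
days_if_fully_occupied = gpu_hours / n_gpus / 24

print(f"USD {cost_usd:,.0f}")                 # USD 5,576,000
print(f"{days_if_fully_occupied:.1f} days")   # 56.7 days
```

The implied duration of just under 57 days is consistent with the "under two months" description, and the cost scales linearly with whatever hourly price is assumed.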
A run at frontier scale is too expensive to debug in flight. Labs run a sweep of much smaller models in the same architectural family to fix learning rate, batch size, weight decay, warmup duration, and other hyperparameters. The naive approach assumes that the smaller models have the same optimum as the target, which is rarely true.
Maximal Update Parametrization (muP), introduced by Yang and colleagues in "Tensor Programs V" in 2022, is a reparametrization of the network that makes the optimal learning rate, momentum coefficients, and per-layer multipliers invariant to network width. Under muP, hyperparameters tuned on a 40-million-parameter proxy can be transferred zero-shot to a 6.7-billion-parameter target model, with the authors reporting that the resulting model outperformed the published 6.7B GPT-3 baseline using tuning cost equivalent to about 7 percent of the full pretraining budget.[24] Variants of muP have since been adopted, in whole or in modified form, by several labs operating at scale.
Beyond muP, the standard pre-launch protocol includes scaling-law fits across the proxy sweep, often using the Chinchilla relation between parameter count and training tokens at a fixed compute budget, to choose the target shape of the full run.[5]
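As a rough illustration of how a compute budget translates into a model shape, the sketch below combines the 6 * N * D heuristic with a fixed tokens-per-parameter ratio. The default of 20 tokens per parameter is a commonly quoted rule of thumb derived from the Chinchilla results and is an assumption here; in practice labs refit the relation on their own proxy sweep rather than using a fixed constant.

```python
import math


def compute_optimal_shape(flop_budget, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Hypothetical budget of 1.2e23 FLOPs.
n, d = compute_optimal_shape(1.2e23)
print(f"N = {n:.2e} parameters, D = {d:.2e} tokens")
```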
Loss spikes are the most visible failure mode of large training runs. A spike is a sudden, multi-step increase in training loss, often by an order of magnitude or more, attributable to a small set of pathological batches interacting with momentum-driven optimizer dynamics. Spikes that do not recover within a few thousand steps require restart from a prior checkpoint. Standard mitigations include:
Hardware failures dominate the operational logbook. Meta's 54-day Llama 3 snapshot reported that approximately 78 percent of unexpected interruptions were attributable to confirmed or suspected hardware issues, with GPU components accounting for roughly 58.7 percent of unexpected interruptions in total. The team nevertheless achieved over 90 percent effective training time on a 16,384-GPU job by automating the rescheduling of failed ranks, with only three incidents requiring significant manual intervention.[6] The OPT-175B logbook released by Meta in 2022 documents an earlier era when manual intervention was the norm: the 114-page log records the on-call engineers' day-to-day handling of hangs, crashes, NaN values, and configuration errors.[14][27]
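The interruption counts reported earlier in this article (419 unplanned interruptions in the 54-day snapshot) imply a mean time between unplanned interruptions of about three hours, which in turn bounds a sensible checkpoint cadence. The sketch below computes that rate and then applies Young's classical first-order approximation for the checkpoint interval. Young's formula is brought in here for illustration and is not described in the cited reports, and the checkpoint write cost in the example is hypothetical.

```python
import math


def mean_hours_between_interruptions(run_days, n_interruptions):
    return run_days * 24 / n_interruptions


def young_checkpoint_interval_hours(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation: interval ~ sqrt(2 * write cost * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)


mtbf = mean_hours_between_interruptions(54, 419)
print(f"about {mtbf:.1f} hours between unplanned interruptions")

# Hypothetical: each blocking checkpoint write costs 2 minutes of wall clock.
interval = young_checkpoint_interval_hours(2 / 60, mtbf)
print(f"suggested checkpoint interval: about {interval * 60:.0f} minutes")
```

Asynchronous checkpointing lowers the effective write cost, which shortens the suggested interval and reduces the work lost on each failure.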
Silent data corruption, in which a GPU produces a numerically wrong result without raising a fault, has emerged as a distinct concern at frontier scale and motivates additional checksum and replay protocols beyond the standard checkpointing approach.[28]
The term training run is sometimes qualified to distinguish the major categories of work:
The distinction between these categories matters for cost accounting and for arguments about regulatory thresholds expressed in training compute, since most policy thresholds apply to the pretraining run alone.
Open-source frameworks have made the mechanics of a training run accessible at smaller scale. Megatron-LM, NVIDIA's reference implementation of 3D parallelism for Transformers, is widely used as the basis for production training stacks.[13] DeepSpeed (ZeRO), Microsoft's optimizer-state sharding library, and the PyTorch FSDP API offer complementary approaches.[9] Hugging Face Accelerate and Transformers wrap these for application teams.
For educational purposes, Andrej Karpathy's nanoGPT repository implements a full character-level GPT pretraining loop in roughly 300 lines of PyTorch, intended as a transparent reference for the same primitives used in production runs (data sharding, gradient accumulation, mixed precision, learning-rate scheduling).[29] Karpathy's "Let's reproduce GPT-2 (124M)" video and the accompanying code build directly on this and remain a common entry point for engineers preparing to participate in larger runs.
Two unusually detailed public artifacts have shaped community understanding of the operational character of a training run:
These releases changed the published norm for frontier runs from a single short technical report to a multi-document release including a logbook, a model card, weights, and increasingly a public TensorBoard.
Estimates of frontier training run cost have grown rapidly. The estimated USD 4.6 million for GPT-3 in 2020 grew to estimates in the USD 40 million to USD 100 million range for GPT-4 by 2023, with several sources placing the compute alone above USD 78 million.[21][22] Subsequent runs by Llama 3.1 and competing models have been characterized by industry analysts as still well below the USD 1 billion threshold per run, though forward projections for the late 2020s anticipate per-run costs in that range as cluster sizes increase by roughly an order of magnitude every two years.[30]
The cost of a single run can also be expressed as the product of two more stable quantities: training compute, measured in FLOPs, and a price per FLOP that has historically declined by roughly 30 percent per year due to hardware improvements and software efficiency gains. Combined with order-of-magnitude growth in cluster size, the net effect is that frontier training compute has grown by roughly 4 to 5 times per year through the early 2020s, with run cost growing at a slower but still rapid pace.[1][30]
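The relationship between the two growth rates can be made explicit: if compute grows by a factor g per year while the price per FLOP falls by a fraction p, run cost grows by g * (1 - p). The sketch below applies this to the approximate figures in the paragraph above; it is a back-of-the-envelope illustration, not a forecast.

```python
def annual_cost_growth(compute_growth_per_year, price_decline_per_year):
    """Cost multiplier per year = compute multiplier * remaining price fraction."""
    return compute_growth_per_year * (1.0 - price_decline_per_year)


low = annual_cost_growth(4.0, 0.30)
high = annual_cost_growth(5.0, 0.30)
print(f"run cost grows roughly {low:.1f}x to {high:.1f}x per year")
```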
These figures have entered policy debates: the United States executive order on AI of 2023 and subsequent EU AI Act provisions reference training compute (measured in FLOPs) as a regulatory threshold for additional reporting requirements. A frontier training run, whose compute budget is committed before launch, has therefore become a unit of analysis not only for technical capability prediction but also for legal classification of the resulting model. The Llama 3 405B pretraining budget of 3.8e25 FLOPs falls below, and the DeepSeek V3 budget at roughly 3.4e24 FLOPs falls well below, the 1e26 FLOPs threshold initially proposed in the 2023 executive order.[6][7]
The dominant unsolved questions for training runs reflect the move toward ever larger scale: