# Training run

> Source: https://aiwiki.ai/wiki/training_run
> Updated: 2026-06-07
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **training run** is a single, deliberate instance of training a neural network from scratch (or from a prior checkpoint) on a specified dataset, with a fixed compute budget, hardware allocation, and time horizon. In the parlance of large AI labs, the phrase typically refers to large-scale pretraining of [foundation models](/wiki/foundation_models): a single run might consume tens of thousands of GPUs continuously for months and cost tens or hundreds of millions of dollars in compute alone.[^1] Modern frontier training runs are major engineering operations that combine data preparation, distributed optimization, checkpointing and failure recovery, and a long tail of post-training stages such as supervised fine-tuning and preference optimization. The discipline grew out of academic deep learning training around 2015, but acquired its current scale and operational complexity with the arrival of [GPT-3](/wiki/gpt-3) in 2020 and successor [frontier models](/wiki/frontier_models) from 2022 onward.[^2][^3]

## Background

The phrase "training run" predates the deep learning era. Researchers training small statistical models would speak of a run as a single execution of an optimizer over a dataset, distinct from sweeps over hyperparameters or repeated trials for variance estimation. With the rise of deep neural networks in the early 2010s, runs lengthened from minutes to days, and from single GPUs to small clusters. The publication of "Attention Is All You Need" in 2017 and the GPT family that followed produced models whose training durations were measured in weeks, then months, on bespoke supercomputers.[^4]

Two developments changed the operational character of the training run. The first was the publication of [scaling laws](/wiki/scaling_laws), first in the [scaling laws paper](/wiki/scaling_laws_paper) by Kaplan and colleagues in 2020 and then in the [Chinchilla scaling](/wiki/chinchilla_scaling) analysis by Hoffmann and colleagues at DeepMind in 2022, which made it possible to predict, before launch, what model size and token budget would extract the most capability per FLOP.[^2][^5] The second was the construction of GPU and TPU clusters with tens of thousands of accelerators, which transformed a training run from a software exercise into a hardware reliability problem: at sufficient scale, individual component failures become routine events that recur every few hours, and engineering effort shifts from optimization to checkpointing, monitoring, and recovery.[^6]

By 2024 a frontier pretraining run, such as the one Meta executed for [Llama 3.1](/wiki/llama_3_1) 405B, occupied 16,384 H100 GPUs; over a 54-day snapshot of that run the job experienced 466 interruptions, of which 419 were unplanned.[^6] The [DeepSeek V3](/wiki/deepseek_v3) technical report, released in December 2024, described a pretraining run lasting under two months on 2,048 H800 GPUs, with full training (pretraining, context extension, and post-training) consuming 2.788 million GPU hours at an estimated rental cost of about USD 5.576 million.[^7] These figures define the contemporary reference points for what "a training run" denotes.

## Anatomy of a training run

A frontier-class training run typically follows a defined sequence of phases. The order and naming vary by lab, but the underlying flow is broadly consistent.

### Pre-launch preparation

Before any GPU is allocated to the main run, engineers complete several preparatory steps:

1. **Data collection and curation.** Web crawls, code repositories, books, and licensed corpora are deduplicated, filtered for quality, classified by domain, and balanced into a target data mix. Meta reported a final [llama 3](/wiki/llama_3) mix of roughly 50 percent general web tokens, 25 percent mathematics and reasoning, 17 percent code, and 8 percent multilingual content.[^6] Data quality work typically dwarfs the engineering effort of the run itself: web-scraped text is filtered with model-based classifiers, deduplicated at document and paragraph level, and stripped of personally identifiable information, with each stage validated against held-out probes.
2. **Tokenization.** A [tokenization](/wiki/tokenization) vocabulary is fit on a representative sample, usually using [byte pair encoding](/wiki/byte_pair_encoding) or a SentencePiece variant, after which the entire corpus is tokenized once and stored in memory-mapped binary shards. Vocabulary size, special tokens, and handling of code and non-Latin scripts are decided at this stage and effectively frozen for the duration of the run.[^6] The Llama 3 vocabulary was expanded to 128,000 tokens from the 32,000 used in Llama 2, with the larger vocabulary improving compression on non-English text.[^6]
3. **Architecture and hyperparameter selection.** A model architecture (number of layers, hidden dimension, attention head count, [mixture of experts](/wiki/mixture_of_experts) configuration if any) is fixed, and key hyperparameters are tuned on smaller proxy models.
4. **Compute budget estimation.** A FLOP count is computed from model size, sequence length, and target token count using approximations such as `6 * N * D` (where N is non-embedding parameter count and D is training tokens). Dividing by sustained hardware throughput (peak throughput multiplied by the expected utilization) yields an expected wall-clock duration, and multiplying the resulting accelerator-hours by their price yields a dollar cost.[^8] The 6 in the heuristic captures two FLOPs per parameter per token for the forward pass and four for the backward pass; sequence-length-dependent attention terms add a quadratic correction that is small for short contexts and large for the long contexts used in current frontier runs.[^8]
5. **Dry runs at small scale.** Engineers run the full training pipeline on smaller models to confirm data loaders, distributed communication, gradient clipping, checkpoint serialization, and resume-from-checkpoint logic all behave correctly under failure conditions. Common pre-launch tests include deliberately killing a rank mid-step to confirm clean restart, corrupting a checkpoint to confirm fallback to an earlier one, and replaying the same batch on two ranks to confirm bit-for-bit reproducibility within a single all-reduce.
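
The compute budget estimate in step 4 can be sketched in a few lines. This is a minimal illustration, not any lab's planning tool: the function names are hypothetical, the peak throughput of 989 TFLOP/s per H100 and the 40 percent utilization are assumptions chosen for the example, and the attention correction mentioned above is ignored.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense Transformer: 6 * N * D."""
    return 6.0 * n_params * n_tokens


def wall_clock_days(total_flops: float, n_accelerators: int,
                    peak_flops_per_accel: float, utilization: float) -> float:
    """Divide the FLOP budget by sustained cluster throughput."""
    sustained_flops_per_s = n_accelerators * peak_flops_per_accel * utilization
    return total_flops / sustained_flops_per_s / 86_400  # seconds per day


# Llama 3.1 405B shape from this article: 405B parameters, 15.6T tokens.
flops = training_flops(405e9, 15.6e12)

# Assumed hardware figures (not from the article): 989e12 peak FLOP/s, 40% utilization.
days = wall_clock_days(flops, n_accelerators=16_384,
                       peak_flops_per_accel=989e12, utilization=0.40)

print(f"{flops:.3e} FLOPs, about {days:.0f} days at the assumed throughput")
```

The FLOP figure lands close to the 3.8e25 budget reported for Llama 3.1 405B; the duration is only as good as the assumed utilization.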

### Distributed setup

The forward and backward passes are parallelized across the cluster using a combination of strategies, often called 3D parallelism:

- **[Data parallelism](/wiki/data_parallelism)** splits the global batch across replicas, each holding a full copy of the model.
- **[Tensor parallelism](/wiki/tensor_parallelism)** splits individual matrix multiplications across GPUs within a node, typically along the model's hidden dimension.
- **[Pipeline parallelism](/wiki/pipeline_parallelism)** assigns successive layers to different stages of an inter-node pipeline.
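
One way to picture the three axes is as a coordinate system over ranks. The sketch below maps a flat rank to (data, pipeline, tensor) coordinates; the ordering, with tensor-parallel ranks adjacent so that they share a node, is an assumption for illustration, and real frameworks make this layout configurable.

```python
def parallel_coords(rank: int, tp_size: int, pp_size: int, dp_size: int):
    """Map a flat rank to (dp_rank, pp_rank, tp_rank).

    Assumed layout: tensor-parallel index varies fastest, then pipeline
    stage, then data-parallel replica.
    """
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    tp_rank = rank % tp_size
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // (tp_size * pp_size)
    return dp_rank, pp_rank, tp_rank


# Hypothetical 64-GPU job: 8-way tensor, 4-way pipeline, 2-way data parallel.
for r in (0, 9, 63):
    print(r, parallel_coords(r, tp_size=8, pp_size=4, dp_size=2))
```

With eight GPUs per node, this layout keeps each tensor-parallel group on a single node, which matches the bandwidth requirements described above.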

To fit very large models into accelerator memory, optimizer states, gradients, and parameters are also sharded. [FSDP](/wiki/fsdp), PyTorch's implementation of the sharding scheme introduced as Stage 3 of Microsoft's ZeRO, partitions all three across data-parallel ranks; only the parameters needed for the current forward or backward step are gathered on demand.[^9] Adam-family optimizers, the standard choice since the [Adam optimizer](/wiki/adam_optimizer) paper of 2015, retain a first-moment and a second-moment estimate for every parameter, adding two extra values per weight; in mixed precision, where these moments are usually kept in 32-bit floats, the optimizer state typically dominates the memory bill.[^10]
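
The memory arithmetic behind sharding can be sketched as follows. The byte counts assume one common mixed-precision layout (16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam moments); this layout is an assumption for illustration, actual layouts vary by framework, and activations are not counted.

```python
# Assumed bytes held per parameter under mixed-precision Adam training.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
}


def model_state_gib(n_params: float, n_shards: int = 1) -> float:
    """Model-state memory per rank in GiB when sharded evenly over n_shards."""
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / n_shards / 2**30


unsharded = model_state_gib(175e9)               # every rank holds everything
sharded = model_state_gib(175e9, n_shards=1024)  # fully sharded over 1,024 ranks

print(f"unsharded: {unsharded:,.0f} GiB per rank")
print(f"sharded over 1,024 ranks: {sharded:.2f} GiB per rank")
```

Under these assumptions the 32-bit optimizer-side state accounts for 12 of the 16 bytes per parameter, which is why sharding it yields the largest saving.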

### Launch and steady-state training

Once launched, training proceeds in steps: each global batch is loaded, a forward pass computes the loss (almost always next-token cross-entropy for [pretraining](/wiki/pretraining)), a backward pass produces gradients, and the optimizer updates the weights. A learning-rate schedule, usually a linear warmup followed by cosine decay, drives the magnitude of updates. Llama 3 405B used a peak learning rate of 8e-5, an 8,000-step warmup, and a cosine decay to 8e-7 over roughly 1.2 million steps.[^6] Batch sizes are typically warmed up alongside the learning rate, starting from a few hundred thousand tokens per step and ramping to the target value (Llama 3 reported 16 million tokens per batch at the long-sequence stage) so that the optimizer reaches its high-batch regime smoothly.[^6]
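
A linear-warmup, cosine-decay schedule of this kind can be written in a few lines. The defaults below are the Llama 3 405B values quoted above; the function name is hypothetical, and the choice to measure the cosine phase from the end of warmup to the final step is an assumption, since the text does not pin that detail down.

```python
import math


def learning_rate(step: int, peak_lr: float = 8e-5, min_lr: float = 8e-7,
                  warmup_steps: int = 8_000, total_steps: int = 1_200_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to 1 after total_steps.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 4_000, 8_000, 600_000, 1_200_000):
    print(s, f"{learning_rate(s):.3e}")
```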

During steady state, the engineering team monitors a small set of dashboards almost continuously:

- **Training loss curves** at multiple smoothings, watched for departures from the expected scaling-law fit.
- **Gradient norm, parameter norm, and weight RMS** by layer, watched for the early signature of an impending loss spike.
- **Hardware health**: GPU temperatures, error rates from memory and inter-node links, and step times by rank.
- **Throughput**: tokens per second, model FLOPs utilization (MFU), and hardware FLOPs utilization (HFU).

The PaLM 540B run sustained 46.2 percent model FLOPs utilization and 57.8 percent hardware FLOPs utilization on 6,144 TPU v4 chips, which were reported as the highest figures yet achieved at that scale for a dense Transformer.[^11]
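
The two utilization metrics differ only in which FLOPs count as useful. The sketch below assumes the `6 * N` FLOPs-per-token approximation for the model's required work (ignoring attention terms) and treats any recomputed work, for example from activation rematerialization, as an explicit extra term; the function names and the example inputs are hypothetical.

```python
def model_flops_utilization(tokens_per_second: float, n_params: float,
                            n_accelerators: int, peak_flops_per_accel: float) -> float:
    """MFU: FLOPs the model mathematically requires, over peak cluster FLOPs."""
    required_flops_per_s = tokens_per_second * 6.0 * n_params
    return required_flops_per_s / (n_accelerators * peak_flops_per_accel)


def hardware_flops_utilization(tokens_per_second: float, n_params: float,
                               n_accelerators: int, peak_flops_per_accel: float,
                               recompute_flops_per_token: float) -> float:
    """HFU: FLOPs actually executed, including recomputation, over peak."""
    executed_flops_per_s = tokens_per_second * (6.0 * n_params + recompute_flops_per_token)
    return executed_flops_per_s / (n_accelerators * peak_flops_per_accel)


# Hypothetical job: 70B parameters, 2,048 accelerators at 989e12 peak FLOP/s,
# 1M tokens per second, and one full extra forward pass (2 * N) recomputed.
mfu = model_flops_utilization(1.0e6, 70e9, 2_048, 989e12)
hfu = hardware_flops_utilization(1.0e6, 70e9, 2_048, 989e12,
                                 recompute_flops_per_token=2.0 * 70e9)
print(f"MFU {mfu:.1%}, HFU {hfu:.1%}")
```

HFU is never lower than MFU under this definition, which is consistent with the PaLM figures above.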

### Checkpointing and recovery

Because failures are routine, the training process serializes its full state (model weights, optimizer moments, learning-rate schedule position, data loader cursor, RNG state) to durable storage at regular intervals. For BLOOM the optimizer-inclusive checkpoint weighed 2.3 TB, compared with 329 GB for the weights alone.[^12] Checkpoint cadence trades off recovery cost against the wall-clock pause incurred during writing; the [Megatron-LM](/wiki/megatron_lm) framework and successors implement asynchronous and in-memory checkpoint protocols to hide most of this cost.[^13]

When a job fails, an on-call engineer (or, increasingly, automation) identifies the affected rank, reschedules around the failed node, and resumes from the last checkpoint, sometimes also skipping a window of training batches if a loss spike is suspected. The OPT-175B logbook documented job uptime of 51.7 to 58.9 percent, with over 100 restarts during a run of roughly 60 days.[^14][^13]
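
The cadence trade-off can be made concrete with Young's classical approximation, which places the checkpoint interval near the square root of twice the checkpoint cost times the mean time between interruptions. This formula comes from the general fault-tolerance literature and is not attributed to any lab in this article; the 60-second write pause is a hypothetical value, while the interruption count is the Llama 3 snapshot figure quoted earlier.

```python
import math


def mean_time_between_interruptions_s(n_interruptions: int, window_days: float) -> float:
    """Average gap between interruptions, in seconds."""
    return window_days * 86_400 / n_interruptions


def young_checkpoint_interval_s(checkpoint_cost_s: float, mtbi_s: float) -> float:
    """Young's approximation: interval ~ sqrt(2 * checkpoint cost * MTBI)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbi_s)


# 466 interruptions over a 54-day window (Llama 3 snapshot).
mtbi_s = mean_time_between_interruptions_s(466, 54)

# Hypothetical blocking cost of 60 seconds per checkpoint write.
interval_s = young_checkpoint_interval_s(60.0, mtbi_s)

print(f"interruption every {mtbi_s / 3600:.2f} h; "
      f"checkpoint about every {interval_s / 60:.0f} min")
```

Asynchronous checkpointing lowers the effective cost term, which under this approximation shortens the interval and reduces the work lost per failure.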

### Post-training

Once pretraining concludes, the model enters a [fine-tuning](/wiki/fine_tuning) pipeline collectively called post-training. The modern recipe, established by [InstructGPT](/wiki/instructgpt) in 2022 and refined since, comprises three families of operations applied in alternating rounds:[^15]

1. **Supervised fine-tuning** ([SFT](/wiki/sft)): a relatively small, curated corpus of instruction-and-response pairs is used to teach the base model to follow instructions and adhere to a target format.
2. **Preference data collection**: human annotators (or, increasingly, model-based judges in RLAIF schemes) rank pairs of model outputs.
3. **Preference optimization**: a policy is updated against this preference data. The classical approach is [RLHF](/wiki/rlhf) using [PPO](/wiki/ppo) against a learned reward model, introduced by Christiano and colleagues in 2017 for reinforcement learning and adapted to language models by OpenAI.[^16][^15] [Direct preference optimization (DPO)](/wiki/direct_preference_optimization_dpo), proposed by Rafailov and colleagues in 2023, eliminates the explicit reward model and trains directly against pairwise preferences.[^17] More recent methods such as [GRPO](/wiki/grpo) (used by DeepSeek for V3 and successors) further reduce the cost and instability of the RL stage.

For Llama 3 the post-training stage used iterative rounds of SFT, rejection sampling against a reward model, and DPO with a beta of 0.1, with the SFT stage running at a learning rate of 1e-5 over 8,500 to 9,000 steps.[^6]
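
The DPO objective for a single preference pair can be sketched without any deep learning framework. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; the function name and the example values are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has shifted probability toward the chosen response: the loss falls.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The `beta` default of 0.1 matches the value reported for Llama 3 above; larger values penalize divergence from the reference model more strongly.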

### Evaluation and decision to release

A final evaluation suite, covering capability benchmarks (such as MMLU, GSM8K, HumanEval, IFEval), safety probes, and red-team tests, is run against intermediate and final checkpoints. The decision to release is made on the basis of these results plus internal qualitative review.

## Compute, cost, and reported run figures

The dominant cost driver is GPU-hours multiplied by the rental or amortized cost per GPU-hour. The standard Transformer FLOP heuristic gives an approximate FLOP budget of `6 * N * D` for a dense model with N parameters trained on D tokens (and roughly `2 * N * D` for inference). Dividing by sustained throughput in FLOPs per second yields wall-clock time; multiplying by the relevant accelerator-hour cost yields a compute estimate.[^8]

Several training runs have become reference points in the literature because their costs and durations are explicitly documented or widely estimated:

| Model | Year | Parameters | Training tokens | Hardware | Wall clock | Documented compute or cost |
|---|---|---|---|---|---|---|
| [GPT-3](/wiki/gpt-3) | 2020 | 175B | 300B | V100 GPUs | months | ~3.14e23 FLOPs; estimated ~USD 4.6M[^18][^19] |
| [PaLM](/wiki/palm) | 2022 | 540B | 780B | 6,144 TPU v4 | ~50 days | 46.2% MFU; 57.8% HFU[^11][^20] |
| OPT-175B | 2021 to 2022 | 175B | 180B | 992 to 1,024 A100 80GB | 56 days | 4.30e23 FLOPs; ~147 TFLOP/s/GPU[^14] |
| [BLOOM](/wiki/bloom) | 2022 | 176B | 366B | 384 A100 80GB (Jean Zay) | 3 to 4 months | ~150 TFLOP/s/GPU sustained[^12] |
| [GPT-4](/wiki/gpt-4) | 2023 | undisclosed | undisclosed | A100 GPUs | months | estimated USD 40M to USD 100M[^21][^22] |
| [Llama 3.1](/wiki/llama_3_1) 405B | 2024 | 405B | 15.6T | 16,384 H100 | 54 days (snapshot) | 3.8e25 FLOPs pretraining budget[^6] |
| [DeepSeek V3](/wiki/deepseek_v3) | 2024 | 671B (37B active) | 14.8T | 2,048 H800 | <2 months | 2.788M GPU hours; ~USD 5.576M at USD 2/hour[^7] |

The DeepSeek V3 figure of about USD 5.576 million has attracted particular attention because it is roughly an order of magnitude lower than the prevailing assumption for a frontier dense-equivalent run, an effect attributable mainly to the use of a sparse [mixture of experts](/wiki/mixture_of_experts) architecture, FP8 mixed-precision training, and several systems-level optimizations described in the technical report.[^7] The DeepSeek authors explicitly note that the figure covers only the official run and excludes prior research, ablations, and post-training data labelling.[^7] Independent commentary has emphasized that the marginal compute cost of a single run is a fraction of total program cost.[^23]
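
Several of the table's figures can be cross-checked with the `6 * N * D` heuristic and simple multiplication. The sketch below is a sanity check rather than a reconstruction of any lab's accounting; using the active parameter count for the mixture-of-experts model is an assumption, since the heuristic was stated for dense models.

```python
def dense_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs: 6 * N * D."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) taken from the table above.
runs = {
    "GPT-3": (175e9, 300e9),
    "Llama 3.1 405B": (405e9, 15.6e12),
    "DeepSeek V3 (37B active, assumed)": (37e9, 14.8e12),
}
estimates = {name: dense_training_flops(n, d) for name, (n, d) in runs.items()}
for name, flops in estimates.items():
    print(f"{name}: {flops:.2e} FLOPs")

# DeepSeek V3 cost: reported GPU hours times the report's USD 2 per GPU hour.
gpu_hours = 2.788e6
cost_usd = gpu_hours * 2.0
print(f"DeepSeek V3 rental estimate: USD {cost_usd:,.0f}")
```

The GPT-3 and Llama 3.1 estimates land within about one percent of the documented 3.14e23 and 3.8e25 FLOPs.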

## Hyperparameter tuning before scale-up

A run at frontier scale is too expensive to debug in flight. Labs therefore run a sweep of much smaller models in the same architectural family to fix learning rate, batch size, weight decay, warmup duration, and other hyperparameters. The naive approach assumes that the smaller models have the same optimum as the target, which is rarely true.

**Maximal Update Parametrization** (muP), introduced by Yang and colleagues in "Tensor Programs V" in 2022, is a reparametrization of the network that makes the optimal learning rate, momentum coefficients, and per-layer multipliers invariant to network width. Under muP, hyperparameters tuned on a 40-million-parameter proxy can be transferred zero-shot to a 6.7-billion-parameter target model, with the authors reporting that the resulting model outperformed the published 6.7B GPT-3 baseline using tuning cost equivalent to about 7 percent of the full pretraining budget.[^24] Variants of muP have since been adopted, in whole or in modified form, by several labs operating at scale.

Beyond muP, the standard pre-launch protocol includes scaling-law fits across the proxy sweep, often using the Chinchilla relation between parameter count and training tokens at a fixed compute budget, to choose the target shape of the full run.[^5]
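
Choosing the target shape at a fixed compute budget can be illustrated with the `6 * N * D` cost model and a fixed tokens-per-parameter ratio. The ratio of 20 used below is a commonly quoted rule of thumb associated with the Chinchilla results and is an assumption here, not a figure from this article; in practice labs fit the relation to their own proxy sweep.

```python
import math


def compute_optimal_shape(flop_budget: float, tokens_per_param: float = 20.0):
    """Solve 6 * N * D = budget with D = tokens_per_param * N."""
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Hypothetical budget of 5.76e23 FLOPs.
n_params, n_tokens = compute_optimal_shape(5.76e23)
print(f"about {n_params / 1e9:.0f}B parameters on {n_tokens / 1e12:.2f}T tokens")
```

Because both quantities scale with the square root of the budget under this assumption, a hundredfold increase in compute implies roughly a tenfold increase in both model size and token count.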

## Loss spikes, divergence, and other failure modes

Loss spikes are the most visible failure mode of large training runs. A spike is a sudden, multi-step increase in training loss, often by an order of magnitude or more, attributable to a small set of pathological batches interacting with momentum-driven optimizer dynamics. Spikes that do not recover within a few thousand steps require restart from a prior checkpoint. Standard mitigations include:

- **Restart and skip.** PaLM's authors documented that, when rare loss spikes occurred, they restarted training from a recent checkpoint and skipped the data batches they suspected of triggering the spike. The widely cited protocol is to roll back roughly 100 steps before the spike and skip 200 to 500 batches.[^25]
- **Gradient clipping.** The global gradient norm is clipped to a fixed value (commonly 1.0). If spikes recur, the clip can be reduced (for example to 0.3).[^25]
- **Lowering the learning rate** or adjusting the optimizer epsilon. The former shrinks every update; the latter changes the floor of the second-moment denominator in Adam and AdamW and therefore the largest step the optimizer can take.
- **Reparametrizations** such as scaled embeddings or muP that, by construction, prevent uncontrolled gradient norm growth at scale.[^26]
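
The first two mitigations are simple enough to sketch in plain Python. Gradients are represented as lists of floats to keep the example self-contained (a real run would use framework tensors and a fused norm computation), and the restart helper is a hypothetical illustration of the roll-back-and-skip protocol with the step counts quoted above.

```python
import math


def clip_by_global_norm(grads, max_norm: float = 1.0, eps: float = 1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


def restart_plan(spike_step: int, rollback_steps: int = 100, skip_batches: int = 200):
    """Return the step to resume from and the range of batch indices to skip."""
    resume_step = max(0, spike_step - rollback_steps)
    return resume_step, range(resume_step, resume_step + skip_batches)


clipped, norm_before = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)
print(norm_before, clipped)

resume_step, skipped = restart_plan(spike_step=10_000)
print(resume_step, skipped)
```

Logging the pre-clip norm returned by the first function is what makes the gradient-norm dashboards described earlier possible.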

Hardware failures dominate the operational logbook. Meta's 54-day Llama 3 snapshot reported that approximately 78 percent of unexpected interruptions were attributable to confirmed or suspected hardware issues, with GPU components accounting for roughly 58.7 percent of unexpected interruptions in total. The team nevertheless achieved over 90 percent effective training time on a 16,384-GPU job by automating the rescheduling of failed ranks, with only three incidents requiring significant manual intervention.[^6] The OPT-175B logbook released by Meta in 2022 documents an earlier era when manual intervention was the norm: the 114-page log records the on-call engineers' day-to-day handling of hangs, crashes, NaN values, and configuration errors.[^14][^27]

Silent data corruption, in which a GPU produces a numerically wrong result without raising a fault, has emerged as a distinct concern at frontier scale and motivates additional checksum and replay protocols beyond the standard checkpointing approach.[^28]

## Pretraining, continued pretraining, and fine-tuning runs

The term training run is sometimes qualified to distinguish the major categories of work:

- **Pretraining run.** The principal run, consuming the bulk of compute, in which a randomly initialized model is trained on a large general corpus using next-token prediction.[^6]
- **Continued pretraining** (also called domain-adaptive pretraining). The pretrained base is trained further on a specialized corpus, such as code or biomedical text, often with a lower learning rate and a different data mix. Llama 3's long-context extension to 128K tokens, which used approximately 800 billion additional tokens after the main pretraining, is a published example.[^6]
- **Annealing.** A final brief phase that trains on the highest-quality slice of the corpus while annealing the learning rate to a very low value. Llama 3 used roughly 40 million tokens for this phase.[^6]
- **Fine-tuning runs.** A family of post-training runs (SFT, DPO, RLHF) applied to the pretrained base. These are dramatically cheaper than pretraining; DeepSeek V3 reported only 5,000 GPU hours for post-training, compared with 2,664,000 for pretraining and 119,000 for context extension.[^7]

The distinction between these categories matters for cost accounting and for arguments about regulatory thresholds expressed in training compute, since most policy thresholds apply to the pretraining run alone.

## Implementation tools and educational reference

Open-source frameworks have made the mechanics of a training run accessible at smaller scale. [Megatron-LM](/wiki/megatron_lm), NVIDIA's reference implementation of 3D parallelism for Transformers, is widely used as the basis for production training stacks.[^13] DeepSpeed (ZeRO), Microsoft's optimizer-state sharding library, and the PyTorch [FSDP](/wiki/fsdp) API offer complementary approaches.[^9] Hugging Face Accelerate and Transformers wrap these for application teams.

For educational purposes, Andrej Karpathy's nanoGPT repository implements a full character-level GPT pretraining loop in roughly 300 lines of PyTorch, intended as a transparent reference for the same primitives used in production runs (data sharding, gradient accumulation, mixed precision, learning-rate scheduling).[^29] Karpathy's "Let's reproduce GPT-2 (124M)" video and the accompanying code build directly on this and remain a common entry point for engineers preparing to participate in larger runs.

## Operational logbooks

Two unusually detailed public artifacts have shaped community understanding of the operational character of a training run:

- The **OPT-175B logbook**, released by Meta in 2022 as a 114-page PDF accompanying the OPT model weights, records the day-to-day chronology of the November 2021 to January 2022 run on 992 to 1,024 A100 80GB GPUs on Microsoft Azure. It documents job uptime in the high-50 percent range, frequent hangs, more than 100 restarts, hardware swaps, and changes to gradient clipping and optimizer epsilon in response to loss spikes.[^14][^27]
- The **BLOOM training chronicles**, published by the BigScience workshop alongside the 176B model, document the run that began on 11 March 2022 on 384 A100 80GB GPUs of the French Jean Zay supercomputer and continued for three to four months at roughly 150 TFLOP/s per GPU.[^12] The chronicles, combined with a public TensorBoard, made BLOOM the first frontier-scale run whose training curves were openly readable in near real time.

These releases changed the published norm for frontier runs from a single short technical report to a multi-document release including a logbook, a model card, weights, and increasingly a public TensorBoard.

## Cost trends and policy implications

Estimates of frontier training run cost have grown rapidly. The estimated USD 4.6 million for [GPT-3](/wiki/gpt-3) in 2020 grew to estimates in the USD 40 million to USD 100 million range for [GPT-4](/wiki/gpt-4) by 2023, with several sources placing the compute alone above USD 78 million.[^21][^22] Subsequent runs for [Llama 3.1](/wiki/llama_3_1) and competing models have been characterized by industry analysts as still well below the USD 1 billion threshold per run, though forward projections for the late 2020s anticipate per-run costs in that range as cluster sizes increase by roughly an order of magnitude every two years.[^30]

The cost of a single run can also be expressed as the product of two more stable quantities: training compute, measured in FLOPs, and a price per FLOP that has historically declined by roughly 30 percent per year due to hardware improvements and software efficiency gains. Combined with order-of-magnitude growth in cluster size, the net effect is that frontier training compute has grown by roughly 4 to 5 times per year through the early 2020s, with run cost growing at a slower but still rapid pace.[^1][^30]

These figures have entered policy debates: the United States executive order on AI of 2023 and subsequent EU AI Act provisions reference training compute (measured in FLOPs) as a regulatory threshold for additional reporting requirements. A frontier training run, whose compute budget is committed before launch, has therefore become a unit of analysis not only for technical capability prediction but also for legal classification of the resulting model. The Llama 3 405B pretraining budget of 3.8e25 FLOPs falls below, and the DeepSeek V3 budget at roughly 3.4e24 FLOPs falls well below, the 1e26 FLOPs threshold initially proposed in the 2023 executive order.[^6][^7]

## Limitations and open problems

The dominant unsolved questions for training runs reflect the move toward ever larger scale:

- **Stability at very low precision.** The DeepSeek V3 report describes its run as the first to validate FP8 mixed-precision training on an extremely large-scale model, which established a new lower bound for the precision at which large runs remain numerically tractable; whether this generalizes to dense models or to FP4 remains an open question.[^7]
- **Fault rate at next-generation cluster sizes.** Independent analyses of the Llama 3 figures put the average gap between job interruptions at approximately 2.78 hours on a 16,000-GPU job (466 interruptions over a 54-day window); clusters one order of magnitude larger will need failure budgets and automation that have not yet been demonstrated.[^31]
- **Hyperparameter transfer across depth and architecture.** muP solves width transfer for dense Transformers, but transfer across depth, batch size, and architectural changes (such as introducing or removing experts in an MoE) remains imperfect and is the subject of active research, with follow-up work in late 2025 reporting completed transfer across modules, depth, and duration.[^32]
- **Reproducibility.** Even with identical code, seeds, and data, a frontier run is not bit-exact reproducible across cluster reschedulings because the order of parallel reductions affects floating-point accumulation. Labs accept this and aim instead for statistical reproducibility of the final loss and downstream metrics.
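
The reproducibility point rests on the fact that floating-point addition is not associative, so summing the same values in a different order can change the result. The toy values below are chosen to make the effect visible in 64-bit floats; in a real run the differences are far smaller per step but accumulate over many steps.

```python
def sum_in_order(values, order):
    """Accumulate values sequentially in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Grouping alone changes the last bit of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)

# The same four numbers reduced in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
print(sum_in_order(values, [0, 1, 2, 3]))  # the first 1.0 is absorbed by 1e16
print(sum_in_order(values, [0, 2, 1, 3]))  # the large terms cancel first
```

An all-reduce across thousands of ranks is a reduction of exactly this kind, and its summation order can change whenever the job is rescheduled onto different hardware.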

## See also

- [Pretraining](/wiki/pretraining)
- [Fine-tuning](/wiki/fine_tuning)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [RLHF](/wiki/rlhf)
- [Direct preference optimization (DPO)](/wiki/direct_preference_optimization_dpo)
- [GRPO](/wiki/grpo)
- [PPO](/wiki/ppo)
- [Scaling laws](/wiki/scaling_laws)
- [Chinchilla scaling](/wiki/chinchilla_scaling)
- [Foundation models](/wiki/foundation_models)
- [Frontier models](/wiki/frontier_models)
- [NVIDIA H100](/wiki/nvidia_h100)
- [FSDP](/wiki/fsdp)
- [Megatron-LM](/wiki/megatron_lm)
- [Mixed precision training](/wiki/mixed_precision_training)
- [Adam optimizer](/wiki/adam_optimizer)
- [Tokenization](/wiki/tokenization)
- [Data parallelism](/wiki/data_parallelism)
- [Tensor parallelism](/wiki/tensor_parallelism)
- [Pipeline parallelism](/wiki/pipeline_parallelism)
- [GPT-3](/wiki/gpt-3)
- [GPT-4](/wiki/gpt-4)
- [PaLM](/wiki/palm)
- [Llama 3.1](/wiki/llama_3_1)
- [DeepSeek V3](/wiki/deepseek_v3)
- [BLOOM](/wiki/bloom)
- [Mixture of experts](/wiki/mixture_of_experts)
- [InstructGPT](/wiki/instructgpt)

## References

[^1]: Cottier, Ben et al., "The rising costs of training frontier AI models", arXiv:2405.21015, 2024-05-31. https://arxiv.org/html/2405.21015v1. Accessed 2026-05-25.
[^2]: Brown, Tom B. et al., "Language Models are Few-Shot Learners", arXiv:2005.14165, 2020-05-28. https://arxiv.org/abs/2005.14165. Accessed 2026-05-25.
[^3]: Patel, Dylan and Ahmed Ali, "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE", SemiAnalysis, 2023-07-10. https://newsletter.semianalysis.com/p/gpt-4-architecture-infrastructure. Accessed 2026-05-25.
[^4]: Vaswani, Ashish et al., "Attention Is All You Need", arXiv:1706.03762, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-25.
[^5]: Hoffmann, Jordan et al., "Training Compute-Optimal Large Language Models", arXiv:2203.15556, 2022-03-29. https://arxiv.org/abs/2203.15556. Accessed 2026-05-25.
[^6]: Llama Team, AI @ Meta, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://ar5iv.labs.arxiv.org/html/2407.21783. Accessed 2026-05-25.
[^7]: DeepSeek-AI, "DeepSeek-V3 Technical Report", arXiv:2412.19437, 2024-12-27. https://arxiv.org/html/2412.19437v1. Accessed 2026-05-25.
[^8]: Bahdanau, Dzmitry, "The FLOPs Calculus of Language Model Training", Medium, 2022-01-09. https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4. Accessed 2026-05-25.
[^9]: Ott, Myle et al., "Fully Sharded Data Parallel: faster AI training with fewer GPUs", Engineering at Meta, 2021-07-15. https://engineering.fb.com/2021/07/15/open-source/fsdp/. Accessed 2026-05-25.
[^10]: Kingma, Diederik P. and Jimmy Ba, "Adam: A Method for Stochastic Optimization", arXiv:1412.6980, 2014-12-22. https://arxiv.org/abs/1412.6980. Accessed 2026-05-25.
[^11]: Chowdhery, Aakanksha et al., "PaLM: Scaling Language Modeling with Pathways", arXiv:2204.02311, 2022-04-05. https://arxiv.org/abs/2204.02311. Accessed 2026-05-25.
[^12]: BigScience Workshop, "tr11-176B-logs README", Hugging Face, 2022-07-12. https://huggingface.co/bigscience/tr11-176B-logs. Accessed 2026-05-25.
[^13]: Jiang, Ziheng et al., "MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs", USENIX NSDI, 2024-04-16. https://www.usenix.org/system/files/nsdi24-jiang-ziheng.pdf. Accessed 2026-05-25.
[^14]: Lockwood, Glenn K., "OPT-175B", glennklockwood.com, 2023-12-29. https://glennklockwood.com/garden/OPT-175B. Accessed 2026-05-25.
[^15]: Ouyang, Long et al., "Training language models to follow instructions with human feedback", arXiv:2203.02155, 2022-03-04. https://arxiv.org/abs/2203.02155. Accessed 2026-05-25.
[^16]: Christiano, Paul F. et al., "Deep reinforcement learning from human preferences", arXiv:1706.03741, 2017-06-12. https://arxiv.org/abs/1706.03741. Accessed 2026-05-25.
[^17]: Rafailov, Rafael et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv:2305.18290, 2023-05-29. https://arxiv.org/abs/2305.18290. Accessed 2026-05-25.
[^18]: Hacker News discussion, "GPT-3/175B model required 3.14E23 flops of compute for training", 2020-05-29. https://news.ycombinator.com/item?id=23346789. Accessed 2026-05-25.
[^19]: Li, Chuan, "OpenAI's GPT-3 Language Model: A Technical Overview", Lambda Labs, 2020-06-03. https://lambda.ai/blog/demystifying-gpt-3. Accessed 2026-05-25.
[^20]: Narang, Sharan and Aakanksha Chowdhery, "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance", Google Research, 2022-04-04. https://research.google/blog/pathways-language-model-palm-scaling-to-540-billion-parameters-for-breakthrough-performance/. Accessed 2026-05-25.
[^21]: Patel, Dylan, "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE", SemiAnalysis, 2023-07-10. https://newsletter.semianalysis.com/p/gpt-4-architecture-infrastructure. Accessed 2026-05-25.
[^22]: Stanford HAI, "2025 AI Index Report (training cost estimates)", Stanford University, 2025-04-07. https://aiindex.stanford.edu/report/. Accessed 2026-05-25.
[^23]: Lambert, Nathan, "DeepSeek V3 and the actual cost of frontier AI models", Interconnects, 2025-01-13. https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of. Accessed 2026-05-25.
[^24]: Yang, Greg et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer", arXiv:2203.03466, 2022-03-07. https://arxiv.org/abs/2203.03466. Accessed 2026-05-25.
[^25]: Ray, Jaideep, "Loss spikes in training: causes, detection, and mitigations", Better ML on Medium, 2023-08-21. https://medium.com/better-ml/loss-spikes-in-training-causes-detection-and-mitigations-ed66e591b1a1. Accessed 2026-05-25.
[^26]: Takase, Sho et al., "Spike No More: Stabilizing the Pre-training of Large Language Models", arXiv:2312.16903, 2023-12-28. https://arxiv.org/abs/2312.16903. Accessed 2026-05-25.
[^27]: Zhang, Susan et al., "OPT: Open Pre-trained Transformer Language Models", arXiv:2205.01068, 2022-05-02. https://arxiv.org/abs/2205.01068. Accessed 2026-05-25.
[^28]: He, Yi et al., "Understanding Silent Data Corruption in LLM Training", arXiv:2502.12340, 2025-02-17. https://arxiv.org/pdf/2502.12340. Accessed 2026-05-25.
[^29]: Karpathy, Andrej, "nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs", GitHub, 2023-01-02. https://github.com/karpathy/nanoGPT. Accessed 2026-05-25.
[^30]: Cottier, Ben et al., "The rising costs of training frontier AI models", arXiv:2405.21015v2, 2024-08-15. https://arxiv.org/html/2405.21015v2. Accessed 2026-05-25.
[^31]: Cochard, David, "Takeaways From the Llama 3 Release Paper", Medium (ailia Tech BLOG), 2024-07-26. https://medium.com/axinc-ai/takeaways-from-the-llama-3-release-paper-90428875b2d4. Accessed 2026-05-25.
[^32]: Bordelon, Blake et al., "Completed Hyperparameter Transfer across Modules, Width, Depth, Batch & Duration", arXiv:2512.22382, 2025-12-30. https://arxiv.org/html/2512.22382v1. Accessed 2026-05-25.

