See also: Machine learning terms
In machine learning, a checkpoint is a snapshot of a model's state captured at a specific point during the training process. A checkpoint typically stores the model's learned parameters (weights and biases), optimizer state, the current epoch or step count, and other metadata needed to resume or reproduce training. Checkpoints allow practitioners to save progress during long training runs, recover from hardware or software failures, and select the best-performing version of a model for deployment or further fine-tuning.
The concept of checkpointing predates modern deep learning. Early uses appeared in high-performance computing, where long-running simulations saved intermediate state to disk as insurance against system crashes. As neural networks grew larger and training runs stretched from hours to weeks or months, checkpointing became a standard part of every training pipeline.
Imagine you are building a giant LEGO tower, and it takes a really long time to build. Every now and then, you take a picture of your tower so that if it falls over or someone bumps it, you can look at the picture and rebuild it from where you left off instead of starting all over.
Checkpoints in machine learning work the same way. Training a computer to learn things can take days or even weeks. A checkpoint is like that picture of your LEGO tower: it saves all the progress the computer has made so far. If the power goes out or something breaks, the computer can load the last checkpoint and keep going from there instead of starting from scratch.
Sometimes you also want to remember which version of your tower looked the best. Maybe at one point it was really tall and cool, but then you added some pieces that made it wobbly. With checkpoints, you can go back to the version that looked best and use that one.
A minimal checkpoint stores only the model weights, but production checkpoints almost always include additional information. The exact contents depend on the framework and the practitioner's needs, but common components include the following.
| Component | Description | Why it matters |
|---|---|---|
| Model weights (parameters) | The learned weight matrices and bias vectors for every layer | Required to resume training or run inference |
| Optimizer state | Momentum buffers, adaptive learning rate accumulators (e.g., first and second moment estimates in Adam) | Without these, resuming training resets the optimizer, causing loss spikes and slower convergence |
| Learning rate schedule state | The current step in a learning rate warmup, decay, or cosine annealing schedule | Ensures the learning rate picks up where it left off |
| Epoch and step counters | The global training step or epoch number | Allows logging and scheduling to continue seamlessly |
| Random number generator (RNG) states | RNG seeds for Python, NumPy, and the framework's own generator | Needed for bitwise-reproducible training |
| Training configuration | Hyperparameters, batch size, model architecture description | Helps reproduce the experiment months or years later |
| Metrics | Best validation loss, accuracy, or other tracked quantities | Useful for deciding which checkpoint to deploy |
| Data loader state | The current position in the dataset or the shuffle order | Ensures no training examples are repeated or skipped on resume |
When Tianqi Chen and colleagues demonstrated that a 1,000-layer ResNet could be trained with sublinear memory by strategically discarding and recomputing activations (Chen et al., 2016), the term "checkpoint" gained a second meaning in the deep learning vocabulary (see the Gradient checkpointing section below). The two concepts, saving model state to disk and saving activations in memory, are related by the same core idea: store a subset of information now so you can reconstruct the rest later.
Different frameworks and tools use different serialization formats. The choice of format affects file size, loading speed, security, and cross-framework compatibility.
PyTorch saves checkpoints using Python's pickle module through the torch.save() function. The resulting files typically carry a .pt or .pth extension. A common convention is to save a dictionary that bundles the model's state_dict (a mapping of layer names to weight tensors), the optimizer's state_dict, and any additional metadata such as the epoch number or best validation loss.
To load a checkpoint, the user creates a fresh model instance, calls torch.load() to deserialize the dictionary, and then calls model.load_state_dict() to populate the model's parameters. PyTorch's documentation recommends saving only the state_dict rather than the full model object, because the latter embeds the Python class definition via pickle and breaks if the source code changes.
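A minimal sketch of this convention, assuming PyTorch is installed and using a tiny stand-in model (the dictionary keys and metadata fields here are illustrative, not a fixed API):

```python
import os
import tempfile

import torch
import torch.nn as nn

# A small stand-in model; any nn.Module follows the same pattern.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save a dictionary bundling the state_dicts and any metadata.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": 7,
        "best_val_loss": 0.0312,
    },
    path,
)

# To resume: build fresh instances, then restore their state.
model2 = nn.Linear(4, 2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load(path, weights_only=True)  # restrict unpickling to tensor data
model2.load_state_dict(ckpt["model_state_dict"])
optimizer2.load_state_dict(ckpt["optimizer_state_dict"])
```

The `weights_only=True` flag limits deserialization to tensors and primitive containers, which mitigates the pickle security issue described below.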
One important limitation is security. Because pickle can execute arbitrary Python code during deserialization, a maliciously crafted .pth file can run shell commands, exfiltrate data, or install backdoors the moment it is loaded with torch.load(). In 2024 and 2025, security researchers at JFrog and Rapid7 demonstrated multiple bypasses of PickleScan, a tool Hugging Face uses to detect malicious pickle payloads, showing that attackers could embed hidden pickle files inside model archives that evade scanning. This vulnerability is one of the main reasons the community has moved toward the safetensors format.
TensorFlow offers two related but distinct persistence mechanisms.
TF checkpoints are created through tf.train.Checkpoint and store the exact values of all tf.Variable objects used by a model. A TensorFlow checkpoint consists of an index file and one or more data shard files. Checkpoints do not store the computation graph, so the original Python source code must be available to rebuild the model before loading the weights. This format is primarily used for resuming training.
SavedModel is TensorFlow's recommended format for deployment and sharing. A SavedModel directory contains a Protocol Buffer file describing the computation graph plus a subdirectory holding a TF checkpoint with the variable values. Because the SavedModel bundles both the graph and the weights, it can be loaded and run without access to the original source code. SavedModels can be served directly through TensorFlow Serving, converted to TensorFlow Lite for mobile deployment, or run in the browser via TensorFlow.js.
Safetensors is a file format developed by Hugging Face specifically to address the security risks of pickle-based formats. It stores tensors in a flat binary layout with a small JSON header that records each tensor's name, data type, and byte offset. The format does not support arbitrary code execution by design; it can only store numerical arrays and their metadata.
In addition to its security benefits, safetensors offers practical performance advantages. It supports zero-copy deserialization, meaning tensors can be memory-mapped directly from disk without copying data into a separate buffer. This makes loading significantly faster, especially when loading weights to CPU. A 2023 security audit by Trail of Bits, commissioned by Hugging Face, EleutherAI, and Stability AI, found no critical vulnerabilities in the format. As of March 2025, over 621,000 of the roughly 1.5 million models on the Hugging Face Hub use the safetensors format, and it has become the default for most newly uploaded models.
GGUF (GPT-Generated Unified Format) is a binary format created by the llama.cpp project in August 2023 to package quantized large language models for efficient CPU and mixed CPU/GPU inference. GGUF superseded the earlier GGML format and was designed for extensibility and backward compatibility.
A GGUF file consists of three sections: a header with a magic number and format version, a metadata section containing model configuration (architecture type, context length, vocabulary, tokenizer settings, RoPE scaling parameters), and a tensor data section with the quantized weights. All metadata and tensor descriptors live in a single file, which simplifies distribution. GGUF supports a range of quantization types, from 2-bit to 8-bit integers along with float16, bfloat16, and float32, enabling users to trade model quality for smaller file sizes and faster inference on consumer hardware.
ONNX (Open Neural Network Exchange) is an open format for representing machine learning models as computation graphs. Originally developed by Facebook and Microsoft in 2017, ONNX defines a standardized set of operators so that a model trained in one framework can be exported and run in another. A model trained in PyTorch can be exported to ONNX using torch.onnx.export(), and a TensorFlow model can be converted using the tf2onnx tool. ONNX is primarily an inference format rather than a training checkpoint format. It captures the model architecture and weights but typically does not include optimizer state or training metadata. The ONNX Runtime, maintained by Microsoft, applies graph-level optimizations such as operator fusion and constant folding to speed up inference.
| Format | Framework | Stores optimizer state | Security | Typical use case |
|---|---|---|---|---|
| .pt / .pth (pickle) | PyTorch | Yes | Vulnerable to arbitrary code execution | Training checkpoints |
| TF checkpoint | TensorFlow | Yes | No known code execution risk | Resuming training |
| SavedModel | TensorFlow | No (inference-only) | No known code execution risk | Deployment, serving |
| Safetensors | Framework-agnostic | No (weights only) | Designed to prevent code execution | Model sharing, deployment |
| GGUF | llama.cpp | No (weights only) | No known code execution risk | Quantized LLM inference |
| ONNX | Framework-agnostic | No (weights only) | No known code execution risk | Cross-framework inference |
Modern large language models can have tens or hundreds of billions of parameters, which makes checkpoint management a significant engineering challenge. The storage requirements scale directly with model size and the precision used.
For training checkpoints that include optimizer state (such as Adam's first and second moment estimates), each parameter requires approximately 12 bytes when using float32 for the weights (4 bytes for the parameter plus 8 bytes for two optimizer accumulators). For inference-only checkpoints that store just the weights, the cost drops to 2 bytes per parameter when using float16 or bfloat16.
| Model size | Inference checkpoint (float16) | Training checkpoint (float32 + Adam) |
|---|---|---|
| 7B parameters | ~14 GB | ~84 GB |
| 13B parameters | ~26 GB | ~156 GB |
| 70B parameters | ~140 GB | ~840 GB |
| 405B parameters | ~810 GB | ~4.9 TB |
Quantization reduces inference checkpoint sizes further. A 70B model quantized to 4-bit precision (Q4 in GGUF format) occupies roughly 40 to 45 GB, compared to 140 GB in float16. For the largest training runs, checkpoint I/O becomes a meaningful fraction of total wall-clock time, which has driven the development of asynchronous and sharded checkpointing systems.
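The arithmetic behind these estimates is simple multiplication, using the per-parameter byte counts given above (approximately 12 bytes per parameter for float32 weights plus Adam state, 2 bytes for float16 weights, and roughly 0.5 bytes for 4-bit quantization, before format overhead):

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

# Inference-only, float16/bfloat16: 2 bytes per parameter.
inference_70b = checkpoint_size_gb(70e9, 2)    # 140.0 GB

# Training with float32 weights + Adam moments: ~12 bytes per parameter.
training_70b = checkpoint_size_gb(70e9, 12)    # 840.0 GB

# 4-bit quantized weights: ~0.5 bytes per parameter, before metadata overhead.
quantized_70b = checkpoint_size_gb(70e9, 0.5)  # 35.0 GB
```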
Deciding when and how to save checkpoints involves balancing storage costs, training overhead, and the risk of losing progress.
The simplest strategy saves a checkpoint at fixed intervals, for example every N training steps or at the end of every epoch. This approach is predictable and easy to implement. The main trade-off is between checkpoint frequency and storage consumption. Saving every 100 steps on a large language model training run can generate terabytes of checkpoint data, while saving only once per epoch risks losing hours of compute if a failure occurs between saves.
Most frameworks support periodic checkpointing out of the box. PyTorch Lightning's ModelCheckpoint callback, TensorFlow's tf.keras.callbacks.ModelCheckpoint, and Hugging Face Transformers' Trainer class all accept arguments for checkpoint frequency and how many recent checkpoints to retain.
Instead of saving at fixed intervals, this strategy monitors a metric on a validation set (typically validation loss or accuracy) and saves a checkpoint only when the metric improves. The result is a single checkpoint file that always holds the best model seen so far.
Best-validation checkpointing is often combined with periodic checkpointing. The periodic checkpoints provide fault tolerance, and the best-validation checkpoint provides the model that will ultimately be deployed or evaluated.
Early stopping extends best-validation checkpointing by halting training entirely when the validation metric has not improved for a specified number of epochs, called the patience. For example, with a patience of 5, training ends if five consecutive epochs pass without the validation loss decreasing.
Early stopping serves as a form of regularization because it prevents the model from continuing to memorize the training data after it has stopped learning generalizable patterns. When training stops, the practitioner loads the best-validation checkpoint rather than the final checkpoint, since the final checkpoint may correspond to a higher validation loss.
The patience value requires tuning. Too small a patience can stop training prematurely, especially on noisy metrics. Too large a patience wastes compute. Common starting points range from 3 to 10 epochs, depending on the dataset size and model complexity.
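The patience logic can be sketched in a few lines of framework-agnostic Python (the class name and interface here are illustrative; real implementations such as Keras's EarlyStopping callback add options like `min_delta` and metric direction):

```python
class EarlyStopping:
    """Stop when the monitored metric hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record this epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss                  # improvement: reset the counter
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
stopped_at = next(i for i, loss in enumerate(losses) if stopper.step(loss))
# Training stops after three consecutive epochs without improving on 0.6;
# the best-validation checkpoint (loss 0.6) is what gets loaded afterward.
```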
To limit disk usage, many pipelines keep only the most recent K checkpoints and delete older ones automatically. This ensures that storage consumption stays bounded while still allowing the practitioner to roll back a few steps if a recent checkpoint turns out to be corrupted or suboptimal.
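A keep-last-K rotation is easy to implement by hand when a framework's built-in retention option is not available; a sketch with placeholder files (the naming pattern is an assumption, chosen so lexicographic order matches save order):

```python
import tempfile
from pathlib import Path

def rotate_checkpoints(ckpt_dir: Path, keep_last: int = 3) -> None:
    """Delete all but the `keep_last` most recent checkpoint files."""
    # Zero-padded step numbers make lexicographic order equal save order.
    ckpts = sorted(ckpt_dir.glob("checkpoint-step-*.pt"))
    stale = ckpts[:-keep_last] if keep_last else ckpts
    for path in stale:
        path.unlink()

# Demo with empty placeholder files standing in for real checkpoints:
d = Path(tempfile.mkdtemp())
for step in range(0, 5000, 1000):
    (d / f"checkpoint-step-{step:07d}.pt").touch()
rotate_checkpoints(d, keep_last=2)
remaining = sorted(p.name for p in d.glob("*.pt"))
# Only the two most recent checkpoints survive.
```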
Exponential moving average checkpointing maintains a separate "shadow" copy of model parameters that is a smoothed exponential average of the training parameters over time. At each training step, the shadow weights are updated as:
shadow = decay * shadow + (1 - decay) * current_weights
A typical decay value is 0.999 or 0.9999, meaning the shadow moves slowly and smooths over many steps. The EMA weights are only applied at checkpoint saving and inference time, not during the training forward/backward pass itself.
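The update rule above can be demonstrated with scalar "weights" (real implementations apply the same update elementwise to every parameter tensor; the class below is an illustrative sketch, not a library API):

```python
class EMAWeights:
    """Maintain an exponential moving average ("shadow") copy of model weights."""

    def __init__(self, weights: dict, decay: float = 0.999):
        self.decay = decay
        self.shadow = dict(weights)

    def update(self, weights: dict) -> None:
        for name, w in weights.items():
            # shadow = decay * shadow + (1 - decay) * current_weights
            self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * w

# A small decay (0.9) is used here so the smoothing is visible in few steps;
# real training uses 0.999 or 0.9999 as noted above.
ema = EMAWeights({"w": 0.0}, decay=0.9)
for _ in range(4):
    ema.update({"w": 1.0})
# After 4 updates toward 1.0: shadow = 1 - 0.9**4 = 0.3439
```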
EMA checkpoints tend to generalize better than the raw training weights because averaging smooths out the noise inherent in stochastic gradient descent. Research has shown that EMA models exhibit improved robustness to noisy labels, better calibration, and stronger transfer learning performance. EMA also reduces the need for aggressive learning rate decay, since the averaging provides a form of implicit regularization.
A related technique is Stochastic Weight Averaging (SWA), which keeps a uniform average of checkpoints saved during the final epochs of a training run. Unlike EMA, which continuously updates a running average, SWA averages snapshots taken at a constant or cyclical learning rate. Both methods improve generalization, but they do not compose well together; practitioners typically use one or the other.
For training runs that span days or weeks across large GPU clusters, hardware and software failures are not exceptional events but statistical certainties. When Meta trained Llama 3.1 405B on 16,384 NVIDIA H100 GPUs, the cluster experienced 466 job interruptions over 54 days, roughly one failure every three hours. Of these, 419 were unexpected: 148 were caused by faulty GPUs, 72 by GPU HBM3 memory errors, and 35 by network switch and cable problems.
As clusters scale further, the challenge intensifies. A single H100 GPU fails on average roughly once every 50,000 hours (about 6 years). At the scale of 100,000 GPUs, that translates to a failure every 30 minutes; at one million GPUs, a failure strikes every 3 minutes. Without robust checkpointing, the expected amount of lost work between failures would make training economically infeasible.
Checkpointing for fault tolerance follows a straightforward pattern: periodically save the full training state (model weights, optimizer state, learning rate schedule, data loader position, and RNG states) to durable storage. After a failure, the job restarts and loads the most recent valid checkpoint, replaying only the training steps that occurred after that checkpoint was saved. The key engineering challenge is minimizing both the frequency of checkpoints (to avoid wasting compute on I/O) and the amount of work lost per failure.
A reasonable checkpointing frequency for large-scale runs is approximately once per hour. More frequent checkpoints waste training throughput on I/O, while less frequent ones risk losing too much work when failures occur. The optimal frequency depends on the mean time between failures for the specific cluster and the time required to write a checkpoint.
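One classical rule of thumb for that trade-off, not specific to machine learning, is the Young–Daly formula from the HPC checkpointing literature, which approximates the optimal interval as the square root of twice the checkpoint write time times the mean time between failures. A sketch (the 5-minute write time is an illustrative assumption):

```python
import math

def young_daly_interval_hours(write_time_hours: float, mtbf_hours: float) -> float:
    """Young-Daly approximation of the optimal checkpoint interval."""
    return math.sqrt(2 * write_time_hours * mtbf_hours)

# A 5-minute checkpoint write and a 3-hour mean time between failures
# (roughly the Llama 3.1 405B figure) suggest checkpointing about hourly.
interval = young_daly_interval_hours(5 / 60, 3.0)  # ~0.71 hours
```

This back-of-the-envelope estimate is consistent with the roughly-hourly frequency used in practice for large runs.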
Modern large language models can have tens or hundreds of billions of parameters. A single checkpoint for a 70-billion-parameter model, stored in float32 with optimizer state, can exceed 500 GB. Saving this much data to a single file on a single machine is impractical, so distributed training frameworks split checkpoints across multiple files and machines.
PyTorch's FSDP shards model parameters, gradients, and optimizer states across all participating GPUs. When saving a checkpoint, FSDP offers two modes. In the FULL_STATE_DICT mode, all shards are gathered to a single rank (typically rank 0), which writes a single consolidated checkpoint. This produces a file that can be loaded on any number of GPUs but requires that rank 0 have enough CPU memory to hold the entire model. In the SHARDED_STATE_DICT mode, each rank writes its own shard independently, which is faster and uses less memory but produces a checkpoint that can only be loaded with the same sharding configuration.
DeepSpeed's ZeRO (Zero Redundancy Optimizer) stages progressively shard more of the training state across GPUs. At ZeRO Stage 3, parameters, gradients, and optimizer states are all distributed. DeepSpeed checkpoints write per-rank files by default, though a consolidation script can merge them into a single file for inference. DeepSpeed also supports saving 16-bit model weights separately, which produces smaller checkpoint files that are easier to share and deploy.
PyTorch's Distributed Checkpoint library (torch.distributed.checkpoint) is a more recent system designed for large-scale training. DCP supports load-time resharding, meaning a checkpoint saved with one parallelism configuration can be loaded into a different one without manual conversion. Each rank writes its own shard in parallel, and DCP coordinates the metadata so that any rank can reconstruct any part of the model state during loading.
Recent improvements to DCP include process-based asynchronous checkpointing (which offloads data saving to CPU threads, keeping GPU blocking time minimal), save plan caching (which amortizes the planning cost across repeated saves), and local checkpointing (which writes to fast local storage for frequent saves, reducing the cost of remote storage I/O). Google and PyTorch collaborated on the local checkpointing solution, which significantly improves training goodput by enabling more frequent saves without the overhead of writing to network-attached storage.
A major challenge in large-scale training is that checkpoint I/O can block the training loop for minutes. Synchronous checkpointing forces all GPUs to pause while data is written to storage. To reduce this overhead, several systems implement asynchronous checkpointing, which copies the model state to a fast memory tier (such as host RAM or a local NVMe drive) and then flushes it to persistent storage in the background while training continues.
ByteCheckpoint, developed by ByteDance and presented at NSDI 2025, is a unified checkpointing system that achieves an average reduction of 54x in runtime checkpoint stalls compared to existing open-source systems, with saving and loading speedups of up to 9.96x and 8.80x respectively. It provides a parallelism-agnostic checkpoint representation that enables efficient load-time resharding across different parallelism configurations.
DataStates-LLM, presented at HPDC 2024, introduced a lazy asynchronous approach where only modified portions of the state are written. DLRover's Flash Checkpoint system takes a similar approach, persisting checkpoints asynchronously to shared memory and then to storage, reducing recovery to seconds by loading checkpoints directly from shared memory.
One practical headache in distributed training is that changing the parallelism configuration (for example, moving from 8 GPUs to 16 GPUs, or switching from tensor parallelism to pipeline parallelism) often requires converting the checkpoint format. Universal Checkpointing, proposed by Microsoft Research in 2024, addresses this by storing checkpoints in a canonical format that can be resharded on the fly during loading, allowing practitioners to reconfigure their parallelism strategy without manual checkpoint conversion.
Gradient checkpointing (also called activation checkpointing) is a memory optimization technique that is conceptually distinct from saving model state to disk, even though both share the word "checkpoint."
During the forward pass of backpropagation, a neural network must store intermediate activations at each layer so they can be used to compute gradients during the backward pass. For deep networks, these stored activations can consume more GPU memory than the model parameters themselves.
Gradient checkpointing reduces this memory cost by discarding most intermediate activations during the forward pass and recomputing them on the fly during the backward pass. Only the activations at designated "checkpoint" layers are retained. When the backward pass reaches a checkpointed layer, it reruns the forward computation from that checkpoint to regenerate the missing activations, then proceeds with gradient calculation as usual.
The foundational paper for this technique is "Training Deep Nets with Sublinear Memory Cost" by Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin, published in 2016. The authors showed that by checkpointing every sqrt(n)-th layer in a network with n layers, memory usage for activations drops from O(n) to O(sqrt(n)), at the cost of one additional forward pass per mini-batch (roughly a 33% increase in computation time). In the extreme case, memory can be reduced to O(log n) with O(n log n) extra forward computation.
In practice, activation checkpointing typically reduces peak activation memory by 50 to 60 percent while increasing training time by 20 to 35 percent. The exact numbers depend on the model architecture and which layers are checkpointed. For transformer-based models, a common strategy is to checkpoint each transformer block, since these blocks are uniform in structure and each one stores a substantial amount of intermediate activation data.
PyTorch provides torch.utils.checkpoint.checkpoint() and torch.utils.checkpoint.checkpoint_sequential() for applying gradient checkpointing. In 2024, PyTorch introduced Selective Activation Checkpointing (SAC), which allows practitioners to specify a policy function that decides on a per-operation basis which activations to save and which to recompute, giving finer-grained control than the all-or-nothing approach of earlier APIs.
TensorFlow supports gradient checkpointing through tf.recompute_grad(), and the Hugging Face Transformers library exposes a gradient_checkpointing_enable() method that applies the technique to supported model architectures with a single function call.
One of the most common uses of checkpoints is as a starting point for transfer learning and fine-tuning. Instead of training a model from randomly initialized weights, practitioners load a pretrained checkpoint and continue training on a smaller, task-specific dataset. This approach works because the features learned during pretraining (such as edge detectors in vision models or syntactic patterns in language models) transfer to many downstream tasks.
In full fine-tuning, all parameters in the pretrained checkpoint are unfrozen and updated on the new dataset. This gives the model maximum flexibility to adapt but requires enough GPU memory to store the full optimizer state for every parameter. Full fine-tuning is standard for moderate-sized models but becomes prohibitively expensive for large language models with billions of parameters.
Parameter-efficient fine-tuning methods freeze most of the pretrained weights and add or modify only a small number of parameters. This dramatically reduces memory requirements and checkpoint sizes.
LoRA (Low-Rank Adaptation) is the most widely used PEFT method. It inserts pairs of small low-rank matrices alongside the frozen weight matrices in the model. During fine-tuning, only these low-rank matrices are updated. Because the rank is typically small (4 to 64), the number of trainable parameters drops to less than 1% of the original model. A LoRA adapter checkpoint is typically only a few megabytes, compared to the multi-gigabyte base model checkpoint.
QLoRA combines quantization with LoRA by loading the base model in 4-bit precision and training the LoRA adapters in higher precision. This allows fine-tuning of a 65-billion-parameter model on a single 48 GB GPU.
When using PEFT methods, the adapter weights are saved as a separate checkpoint. At inference time, the adapter can be merged into the base model weights or loaded on top of the base model dynamically. Hugging Face's PEFT library supports saving and loading adapter checkpoints in the safetensors format.
The widespread sharing of pretrained checkpoints has transformed how machine learning research and development work. Rather than training every model from scratch, practitioners routinely download and build on publicly available checkpoints.
The Hugging Face Hub is the largest repository of open model checkpoints, hosting well over 1.5 million models as of early 2026. Each model lives in a Git-based repository containing the weight files (typically in safetensors format), a configuration file (config.json), tokenizer files, and a model card with documentation. The Hub supports model versioning through Git commits, community contributions through pull requests, and automated security scanning for pickle-based files.
The Transformers library integrates directly with the Hub, allowing users to download and load a pretrained checkpoint with a single line of code like AutoModel.from_pretrained("model-name").
PyTorch Hub provides a simpler mechanism for publishing and loading pretrained models. Model authors define a hubconf.py file in their GitHub repository that specifies entry points for loading models. Users can then call torch.hub.load() with the repository name and entry point to download and instantiate a model.
Some research groups release not just the final trained checkpoint but also intermediate checkpoints saved at various points during pretraining. Projects such as Pythia (EleutherAI), OLMo (AI2), TinyLlama, and RedPajama-INCITE have published checkpoints at regular intervals throughout training. These intermediate checkpoints are valuable for studying training dynamics, understanding how model capabilities emerge over the course of training, and supporting research in continual learning and model interpretability.
Because different frameworks use different checkpoint formats, converting a model from one format to another is a common task.
| Source | Target | Tool or method |
|---|---|---|
| PyTorch | ONNX | torch.onnx.export() |
| TensorFlow | ONNX | tf2onnx (open-source converter) |
| PyTorch | TensorFlow | ONNX as an intermediate step, or Hugging Face Transformers' from_pretrained(from_tf=True) |
| TensorFlow | PyTorch | tf_checkpoint2pytorch, or Hugging Face Transformers' from_pretrained(from_pt=True) |
| Any framework | Safetensors | Hugging Face safetensors library conversion scripts |
| Any framework | GGUF | llama.cpp's convert.py and quantize tool |
The Hugging Face Transformers library has simplified cross-framework conversion for many popular architectures. Its AutoModel class can automatically detect whether a checkpoint was saved in PyTorch or TensorFlow format and convert on the fly. For models outside the Transformers ecosystem, Microsoft's MMdnn toolkit supports conversion between Caffe, Keras, MXNet, TensorFlow, CNTK, and PyTorch, though the project is no longer actively maintained.
When converting checkpoints, a common pitfall is numerical drift: slight differences in how frameworks implement operations (for example, different default epsilon values in layer normalization) can cause the converted model to produce slightly different outputs. Practitioners should always validate converted checkpoints against the original by comparing outputs on a reference input.
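The validation step usually amounts to running both models on the same fixed reference input and comparing outputs within a tolerance; a sketch using NumPy, with stand-in arrays in place of real model outputs:

```python
import numpy as np

def outputs_match(y_original: np.ndarray, y_converted: np.ndarray,
                  atol: float = 1e-5) -> bool:
    """Check a converted model's outputs against the original on a reference input."""
    return bool(np.allclose(y_original, y_converted, atol=atol))

# Stand-in outputs; in practice these come from running the original and
# converted models on the same reference input.
y_ref = np.array([0.1234567, 0.7654321])
ok_small_drift = outputs_match(y_ref, y_ref + 1e-7)  # tiny drift: acceptable
ok_large_drift = outputs_match(y_ref, y_ref + 1e-2)  # large drift: conversion bug
```

The tolerance should be chosen with the model's numerics in mind; float16 models warrant a looser tolerance than float32 ones.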
The security of model checkpoints has become a growing concern as the machine learning community increasingly relies on downloading and loading models from public repositories.
The most serious known vulnerability in checkpoint security involves Python's pickle module, which PyTorch uses internally for torch.save() and torch.load(). Pickle is a general-purpose serialization protocol that can reconstruct arbitrary Python objects, including objects whose constructors execute system commands. An attacker can craft a .pth file that, when loaded, runs hidden shell commands to download malware, steal credentials, or open a reverse shell.
In documented incidents, malicious PyTorch model files uploaded to public repositories have been found to contain embedded shell commands that execute upon loading. Hugging Face scans uploaded models with PickleScan, but researchers have repeatedly demonstrated bypasses. In December 2025, JFrog disclosed zero-day vulnerabilities in PickleScan that allowed attackers to evade detection by using subclasses of dangerous imports or embedding secondary pickle files with non-standard extensions inside model archives.
Several practices help reduce the risk of checkpoint-related attacks: loading PyTorch files with torch.load(weights_only=True), an option that restricts deserialization to tensor data and blocks most malicious payloads; preferring the safetensors format for sharing and deployment; and downloading checkpoints only from trusted, actively scanned sources.

A number of practical guidelines have emerged from the community's collective experience with checkpointing.
Save optimizer state for training checkpoints. If you plan to resume training, always include the optimizer state. Without it, adaptive optimizers like Adam lose their accumulated gradient statistics, which can cause sudden loss spikes and slower convergence when training resumes.
Separate training checkpoints from deployment artifacts. Training checkpoints are large because they include optimizer state and other metadata. For deployment, export a stripped-down version containing only the model weights, ideally in a safe format like safetensors.
Validate checkpoints after saving. Corrupted checkpoint files waste hours of compute. After writing a checkpoint, verify that it can be loaded successfully. Some teams compute and store checksums alongside their checkpoint files.
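A simple checksum scheme can be sketched with the standard library (the `.sha256` sidecar-file convention here is an illustrative choice, not a standard):

```python
import hashlib
import tempfile
from pathlib import Path

def write_with_checksum(path: Path, data: bytes) -> None:
    """Write a checkpoint file and a SHA-256 checksum file next to it."""
    path.write_bytes(data)
    digest = hashlib.sha256(data).hexdigest()
    path.with_suffix(path.suffix + ".sha256").write_text(digest)

def verify_checksum(path: Path) -> bool:
    """Re-hash the file and compare against the stored checksum."""
    expected = path.with_suffix(path.suffix + ".sha256").read_text()
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected

p = Path(tempfile.mkdtemp()) / "checkpoint.pt"
write_with_checksum(p, b"fake checkpoint bytes")
intact = verify_checksum(p)       # True for an unmodified file
p.write_bytes(b"corrupted bytes")
corrupted = verify_checksum(p)    # False after the file changes
```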
Use descriptive naming conventions. Name checkpoint files with the step number, epoch, and the value of the tracked metric (for example, checkpoint-step-50000-valloss-0.0312.pt). This makes it easy to identify and compare checkpoints later.
Automate cleanup. Large training runs can generate hundreds of checkpoint files totaling terabytes. Set up automated policies to delete old checkpoints while retaining the best and most recent ones.
Log checkpoint metadata. Record which checkpoint was used for evaluation, deployment, or fine-tuning. Experiment tracking tools like MLflow, Weights & Biases, and TensorBoard can store this information alongside training metrics.
Keep multiple backup levels. Retain recent checkpoints for quick rollback, but also preserve a few older checkpoints in case recent ones are corrupted or the model has silently degraded.
Use multi-host broadcast for loading. When training on many machines, having every host independently read a large checkpoint from shared storage creates massive I/O pressure. Instead, a single host can read the checkpoint and broadcast it to others via high-speed interconnects like NCCL over InfiniBand. When 32 hosts each read a 100 GB checkpoint simultaneously, the result is 3.2 TB of network transfer that can saturate storage bandwidth.