Saver

Saver in machine learning

In machine learning, a Saver is a utility or class that lets users persist and restore the state of models, variables, or other components during training and evaluation. Saving model state matters for several practical reasons: it preserves intermediate progress, supports transfer learning and fine-tuning, allows training to resume after interruptions, and produces artifacts that can be shipped to production for inference. Most machine learning frameworks ship their own saver. TensorFlow has historically offered tf.train.Saver, then tf.train.Checkpoint, and the higher level SavedModel format. PyTorch relies on torch.save together with the model's state_dict.

The word "saver" in the machine learning world almost always refers to the TensorFlow API named tf.train.Saver, which was the standard saving API throughout TensorFlow 1.x and is now legacy, superseded by tf.train.Checkpoint. The broader concept, persisting a model's learnable parameters and optimizer state to disk, is universal across frameworks.

Why saving matters

Training a modern neural network can take hours, days, or weeks of compute. Without periodic checkpoints, a single crash, a preemption on a cloud GPU, or a node failure on a training cluster would force the user to start over. Savers solve a handful of practical problems at once:

They allow a training job to resume from the most recent step, with the optimizer's internal state intact (momentum buffers, Adam's first and second moment estimates, learning rate schedules, etc.).
They produce artifacts that can be loaded for evaluation, served for inference, or used as the starting point for further fine-tuning on a new dataset.
They make experiments reproducible. A saved checkpoint, together with the training code, the data, and the random seeds, is in principle enough to recreate a model's behavior exactly.
They support early stopping, where the saver keeps the best validation checkpoint while discarding worse ones.

A related concept is the checkpoint, the actual on-disk artifact produced by a saver. Savers write checkpoints; users restore checkpoints. The two terms are often used interchangeably in practice.
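The early stopping case from the list above can be sketched without any framework. The following is a minimal illustration only: `BestCheckpointKeeper`, the JSON file layout, and the validation losses are all hypothetical stand-ins, not part of TensorFlow or PyTorch.

```python
import json
import os
import tempfile


class BestCheckpointKeeper:
    """Keep a single 'best' checkpoint file, overwriting it on improvement.

    Hypothetical helper for illustration; `state` is any JSON-serializable
    stand-in for real model weights.
    """

    def __init__(self, path):
        self.path = path
        self.best_loss = float("inf")

    def update(self, state, val_loss):
        """Write `state` to disk only if `val_loss` beats the best so far."""
        if val_loss >= self.best_loss:
            return False
        self.best_loss = val_loss
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"state": state, "val_loss": val_loss}, f)
        # Rename into place so a crash mid-write cannot corrupt the best file.
        os.replace(tmp_path, self.path)
        return True


best_path = os.path.join(tempfile.mkdtemp(), "best.json")
keeper = BestCheckpointKeeper(best_path)

# Made-up validation losses for five epochs.
for epoch, val_loss in enumerate([0.9, 0.7, 0.8, 0.6, 0.65]):
    keeper.update({"epoch": epoch}, val_loss)

with open(best_path) as f:
    best = json.load(f)
print(best)
```

The file ends up holding the state from the epoch with the lowest validation loss rather than the last epoch, which is the behavior early stopping relies on.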

tf.train.Saver (TensorFlow 1.x)

In TensorFlow 1.x, the tf.train.Saver class was the canonical way to save and restore the values of tf.Variable objects bound to a tf.Session. It is now available only through the compatibility module as tf.compat.v1.train.Saver and is considered legacy code for users on TensorFlow 2.x. The recommended replacement is tf.train.Checkpoint.

A Saver is created with an optional list of variables to track. If no list is supplied, the saver tracks all variables in the default graph. The two core methods are save() and restore():

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define variables
W = tf.Variable(tf.random.normal([10, 10]), name='weights')
b = tf.Variable(tf.zeros([10]), name='biases')

# Create a Saver object that keeps the last 5 checkpoints
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training loop
    for step in range(1000):
        # ... training code ...
        if step % 100 == 0:
            saver.save(sess, 'model.ckpt', global_step=step)

To restore later:

# Assumes the same variables and saver have been defined in the current graph
with tf.Session() as sess:
    saver.restore(sess, 'model.ckpt-900')
    # variables now hold the values from step 900

Checkpoint files written by Saver

A single save() call writes a handful of files. The exact layout depends on whether the V1 or V2 binary format is in use. The V2 format has been the default since TensorFlow 0.12 and is recommended because it is faster to restore and needs less peak memory while doing so.

File	Contents
`model.ckpt-<step>.data-00000-of-00001`	A TensorBundle file containing the actual tensor values (weights and biases). Large checkpoints may be sharded across several `.data-N-of-M` files.
`model.ckpt-<step>.index`	A string to string immutable table mapping each tensor name to a serialized `BundleEntryProto` describing which data file holds the tensor and at what offset.
`model.ckpt-<step>.meta`	A `MetaGraphDef` protocol buffer containing the computational graph definition. Optional on restore if the user rebuilds the graph in code.
`checkpoint`	A small text file that records the latest checkpoint and the recent checkpoint history. Used by helpers like `tf.train.latest_checkpoint()`.

The max_to_keep argument controls automatic cleanup. A value of 5 keeps the five most recent checkpoints and silently deletes older ones, which is useful for long training runs that would otherwise fill up the disk.
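The rolling window behind max_to_keep can be illustrated with a framework-free sketch. `RollingSaver` below is a hypothetical toy class that writes plain text files; it mimics the cleanup policy only and is not how TensorFlow implements it.

```python
import collections
import os
import tempfile


class RollingSaver:
    """Toy saver that keeps only the `max_to_keep` newest checkpoint files."""

    def __init__(self, directory, max_to_keep=5):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self.kept = collections.deque()  # oldest path on the left

    def save(self, payload, step):
        path = os.path.join(self.directory, f"model.ckpt-{step}")
        with open(path, "w") as f:
            f.write(payload)
        self.kept.append(path)
        # Delete the oldest files once the window is full.
        while len(self.kept) > self.max_to_keep:
            os.remove(self.kept.popleft())
        return path


ckpt_dir = tempfile.mkdtemp()
saver = RollingSaver(ckpt_dir, max_to_keep=5)
for step in range(0, 1000, 100):
    saver.save(f"weights at step {step}", step)

# Sort by the numeric step suffix so the listing is in training order.
remaining = sorted(os.listdir(ckpt_dir), key=lambda n: int(n.rsplit("-", 1)[1]))
print(remaining)
```

With ten saves and a window of five, only the checkpoints for steps 500 through 900 remain on disk.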

Limitations of the old Saver

tf.train.Saver was designed for the graph and session model of TensorFlow 1.x. It saves variables by name, so renaming a variable in code can break restoration. It does not work naturally with eager execution or with Keras models that use object based attribute tracking. These limitations are the main reason it was replaced.

tf.train.Checkpoint (TensorFlow 2.x)

tf.train.Checkpoint is the modern saver in TensorFlow 2.x. The key conceptual change is object based checkpointing. Instead of recording variables by their global name in a graph, a Checkpoint records the Python object graph rooted at the objects passed into its constructor. When you save a Checkpoint that contains a model and an optimizer, it walks the attributes of both, finds every tf.Variable they own, and writes them out. On restore, the same walk is performed on the new objects, and values are matched by their path through the object graph (the chain of attribute names leading from the root to each variable) rather than by the variable's own string name.

import tensorflow as tf

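model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10)
])
# Build the model so its variables exist before saving
# (the input width of 32 is an arbitrary choice for illustration)
model.build(input_shape=(None, 32))
optimizer = tf.keras.optimizers.Adam(1e-3)

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
# save() appends a counter to the prefix, e.g. './ckpt/ckpt-1'
save_path = checkpoint.save('./ckpt/ckpt')
# ... later, possibly in a different process ...
checkpoint.restore(save_path)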

Because restoration is structural, the user can refactor variable names freely without invalidating an existing checkpoint, as long as the attribute structure of the tracked objects is preserved. Renaming a tf.Variable is fine. Moving a layer from one Sequential to another is not.
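A small experiment makes the renaming claim concrete. The sketch below assumes TensorFlow 2.x with eager execution; the variable names and the attribute name `kernel` are arbitrary choices for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Save a variable that is reachable through the attribute name "kernel".
old_var = tf.Variable([1.0, 2.0, 3.0], name="old_name")
old_root = tf.train.Checkpoint(kernel=old_var)
save_path = old_root.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Restore into a variable with a different name but the same attribute path.
new_var = tf.Variable(tf.zeros([3]), name="completely_different_name")
new_root = tf.train.Checkpoint(kernel=new_var)
status = new_root.restore(save_path)

restored = new_var.numpy().tolist()
print(restored)
```

The restore succeeds because both roots expose the variable under the same attribute, `kernel`; the name argument passed to tf.Variable plays no part in the match.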

CheckpointManager

In production training loops, a tf.train.Checkpoint is usually paired with tf.train.CheckpointManager, which handles rotation and the checkpoint index file. A typical training loop looks like this:

manager = tf.train.CheckpointManager(
    checkpoint, directory='./ckpt', max_to_keep=3
)

for epoch in range(num_epochs):
    # ... training ...
    manager.save()

# Later, restore the latest
status = checkpoint.restore(manager.latest_checkpoint)
status.expect_partial()  # suppresses warnings about checkpoint values that are never used

Partial restoration and deferred matching

A powerful property of object based checkpointing is deferred restoration. Calling checkpoint.restore(path) immediately assigns values to variables that already exist and are reachable from the Checkpoint root. For variables that have not been created yet, it queues the restoration and applies the values as soon as those variables become trackable from the root. This is convenient for transfer learning, where the user might restore the backbone of a network from a pretrained checkpoint and then add new heads that are initialized fresh.
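Deferred restoration can be observed directly with nested tf.train.Checkpoint objects standing in for layers. This is a minimal sketch assuming TensorFlow 2.x in eager mode; the attribute names `layer`, `kernel`, and `bias` are illustrative.

```python
import os
import tempfile

import tensorflow as tf

# Save a small object graph: root -> layer -> {kernel, bias}.
saved_layer = tf.train.Checkpoint(
    kernel=tf.Variable([1.0, 2.0]), bias=tf.Variable([3.0])
)
saved_root = tf.train.Checkpoint(layer=saved_layer)
save_path = saved_root.save(os.path.join(tempfile.mkdtemp(), "ckpt"))

# Rebuild the structure, but leave the kernel out for now.
new_bias = tf.Variable([0.0])
new_layer = tf.train.Checkpoint(bias=new_bias)
new_root = tf.train.Checkpoint(layer=new_layer)
status = new_root.restore(save_path)

# The bias already existed, so it was restored right away.
bias_after_restore = new_bias.numpy().tolist()

# The kernel is created later; attaching it triggers the queued restore.
late_kernel = tf.Variable([0.0, 0.0])
kernel_before_attach = late_kernel.numpy().tolist()
new_layer.kernel = late_kernel
kernel_after_attach = late_kernel.numpy().tolist()

print(bias_after_restore, kernel_before_attach, kernel_after_attach)
```

The kernel keeps its initial zeros until it is attached under the attribute name it had at save time, at which point the saved values are assigned.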

SavedModel vs checkpoint

A frequent source of confusion is the difference between a TensorFlow checkpoint and a TensorFlow SavedModel.

A checkpoint, whether written by tf.train.Saver or tf.train.Checkpoint, contains only the values of the model's variables. It does not contain a description of the computation. To use a checkpoint, the original Python code that defined the model has to be available. Checkpoints are lightweight and well suited for resuming training.

A SavedModel is a directory that bundles a serialized description of the computation (as a protobuf MetaGraphDef) together with a TensorFlow checkpoint of the variables, plus assets and signatures. SavedModels are self contained. They can be loaded without the original Python source, deployed via TensorFlow Serving, converted for TensorFlow Lite or TensorFlow.js, or called from other languages such as C++ or Java. In TensorFlow 2.x releases that bundle Keras 2, model.save('path') produces a SavedModel by default; Keras 3 saves to its native .keras format by default and exports a SavedModel through model.export('path').

Aspect	Checkpoint	SavedModel
Contains weights	Yes	Yes (in a `variables/` subdirectory that is itself a checkpoint)
Contains graph definition	No	Yes
Self contained for serving	No	Yes
Requires original Python code to load	Yes	No
Typical use	Training resumption, experiment tracking	Deployment, inference, cross language use
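The self-contained nature of a SavedModel can be shown with a tiny tf.Module. The sketch below assumes TensorFlow 2.x; the `Scale` class and its single weight are hypothetical and exist only for the demonstration.

```python
import os
import tempfile

import tensorflow as tf


class Scale(tf.Module):
    """Tiny module: multiplies its input by a stored scalar."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(2.0)

    # The input signature lets TensorFlow trace and serialize the computation.
    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def __call__(self, x):
        return self.w * x


export_dir = tempfile.mkdtemp()
tf.saved_model.save(Scale(), export_dir)

# Loading goes through the serialized graph, not through the Scale class.
loaded = tf.saved_model.load(export_dir)
output = loaded(tf.constant([1.0, 2.0])).numpy().tolist()
contents = sorted(os.listdir(export_dir))

print(output)
print(contents)
```

The export directory holds a saved_model.pb file with the computation and a variables/ subdirectory with the weights, matching the table above.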

Keras savers

Keras, the high level API now bundled with TensorFlow, exposes two main saving methods on every Model instance:

model.save_weights(path) writes only the model's weights to disk. The user is responsible for reconstructing the architecture in code before calling model.load_weights(path). In Keras 2 (bundled with TensorFlow 2.x), the output format is either a TensorFlow checkpoint or an HDF5 .h5 file, controlled by the save_format argument; Keras 3 writes a .weights.h5 file.
model.save(path) writes the architecture, the weights, the optimizer state, and the training configuration as a single artifact. In Keras 2 this is a complete SavedModel (or an HDF5 file if the path ends in .h5); in Keras 3 it is a .keras archive. The resulting artifact can be loaded with tf.keras.models.load_model(path) without any architecture code.

For most Keras users, model.save is the right default. For users who want full control of the file format or who are loading weights into a slightly different architecture, save_weights is more flexible.

PyTorch and torch.save

PyTorch does not have a class named Saver. Saving is handled by two top level functions, torch.save and torch.load, plus the state_dict mechanism on every nn.Module and optimizer.

A state_dict is a Python dictionary that maps each parameter or buffer name in a module to its tensor value. Optimizers also expose a state_dict containing momentum buffers and hyperparameters. The recommended approach is to save the state_dict rather than the entire module, because pickling a whole module ties the checkpoint to the exact class definition at the time of saving. If the class is later refactored, the saved object may fail to load.

import torch
import torch.nn as nn

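class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = Net()

# Save only the weights
torch.save(model.state_dict(), 'model.pth')

# Restore: rebuild the architecture first, then load the weights into it
model = Net()
model.load_state_dict(torch.load('model.pth'))
model.eval()  # switch dropout and batch norm layers to inference behavior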
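Inspecting a state_dict shows what actually gets written. The following sketch uses a single nn.Linear layer and an SGD optimizer; the layer sizes and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1, momentum=0.9)

# A module's state_dict maps parameter names to tensors.
param_shapes = {name: tuple(t.shape) for name, t in layer.state_dict().items()}
print(param_shapes)  # {'weight': (2, 4), 'bias': (2,)}

# An optimizer's state_dict has per-parameter state plus hyperparameters.
opt_state = optimizer.state_dict()
opt_keys = sorted(opt_state.keys())
print(opt_keys)  # ['param_groups', 'state']

group = opt_state["param_groups"][0]
print(group["lr"], group["momentum"])  # 0.1 0.9
```

The optimizer's `state` entry is empty until the first call to optimizer.step(), which is when momentum buffers are created.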

General checkpoints in PyTorch

For resuming training, PyTorch users wrap multiple pieces of state in a dictionary and pickle the dictionary with torch.save. The conventional file extension for this richer artifact is .tar, although nothing in PyTorch enforces it.

# Assumes model, optimizer, epoch, and loss already exist in the training script
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.tar')

# Restore: model and optimizer must be constructed before loading their state
checkpoint = torch.load('checkpoint.tar')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.train()  # set back to training mode

Two small but important details trip up new users. First, model.eval() must be called before inference so that dropout and batch normalization layers behave correctly. Second, when moving a checkpoint between CPU and GPU, the map_location argument to torch.load controls where the tensors are placed.
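Both details can be seen in a short sketch. The model below (a linear layer followed by dropout) and the file name are illustrative assumptions; the device is chosen at run time so the example also works on a CPU-only machine.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Dropout is included so that train and eval modes behave differently.
    return nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))


path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(make_model().state_dict(), path)

# map_location places the loaded tensors on the chosen device,
# regardless of the device they were saved from.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = torch.load(path, map_location=device)

restored = make_model()
restored.load_state_dict(state)
restored.to(device)
restored.eval()  # dropout becomes a no-op in eval mode

x = torch.ones(1, 8, device=device)
with torch.no_grad():
    first = restored(x)
    second = restored(x)

# In eval mode, repeated forward passes give identical outputs.
deterministic = torch.equal(first, second)
print(deterministic)
```

Without the call to eval(), dropout would zero a random subset of activations on each pass and the two outputs would generally differ.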

Higher level wrappers

Frameworks built on PyTorch, such as PyTorch Lightning and Hugging Face Transformers, add their own checkpointing helpers that wrap torch.save with conventions for distributed training, sharding for large models, and integration with experiment trackers. The underlying mechanism is the same.

Common use cases

Savers, by whatever name, support a handful of recurring workflows in deep learning:

Training resumption. A periodic checkpoint plus a CheckpointManager or equivalent allows a job that was killed at step 7,420 to restart from step 7,000 instead of step zero.
Best checkpoint selection. Track validation loss during training and overwrite a best.ckpt file whenever the metric improves. After training ends, this file holds the best model, not the final one.
Transfer learning and fine tuning. Load a checkpoint of a pretrained backbone (for example a ResNet or a BERT encoder), freeze part of it, and attach a new task specific head. Object based saving in tf.train.Checkpoint and state_dict filtering in PyTorch both support partial loads.
Deployment. Convert a final checkpoint into a serving format (SavedModel for TensorFlow Serving, TorchScript or ONNX for PyTorch) and ship that artifact rather than the checkpoint itself.
Reproducibility. Pin the checkpoint, the code commit, and the data version together so that a colleague can reproduce a result a year later.
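The transfer learning workflow above can be sketched in PyTorch with state_dict filtering. The `Classifier` class, its layer sizes, and the `backbone.` key prefix are assumptions made for this example.

```python
import torch
import torch.nn as nn


class Classifier(nn.Module):
    """Hypothetical model with a reusable backbone and a task-specific head."""

    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Linear(16, 8)
        self.head = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))


# Stand-in for a pretrained model; in practice its state would come from disk.
pretrained = Classifier(num_classes=10)

# Keep only the backbone entries; the 10-class head does not fit the new task.
backbone_state = {
    k: v for k, v in pretrained.state_dict().items() if k.startswith("backbone.")
}

finetune = Classifier(num_classes=3)
# strict=False tolerates the head keys that are absent from backbone_state.
result = finetune.load_state_dict(backbone_state, strict=False)
print(result.missing_keys)     # ['head.weight', 'head.bias']
print(result.unexpected_keys)  # []

# Freeze the backbone so only the new head is trained.
for p in finetune.backbone.parameters():
    p.requires_grad = False

trainable = [n for n, p in finetune.named_parameters() if p.requires_grad]
print(trainable)  # ['head.weight', 'head.bias']
```

Checking missing_keys and unexpected_keys after a non-strict load is a cheap way to confirm that only the intended parameters were skipped.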

Practical tips

Save more than the weights. A checkpoint that also contains the optimizer state, the global step, and the learning rate schedule allows clean resumption. A checkpoint that contains only the weights silently resets the optimizer.
Use a max_to_keep style cap. Long runs at scale generate gigabytes of checkpoints. A rolling window of three to five most recent checkpoints, plus a separately tracked best checkpoint, is a sensible default.
Pin the framework version. Across major versions of TensorFlow or PyTorch, the on disk format is generally stable, but edge cases exist. Recording the framework version next to the checkpoint avoids surprises.
Validate restoration. After restore() or load_state_dict(), run one batch through the model and confirm the loss or a sample prediction looks reasonable. A silent shape mismatch can otherwise survive into deployment.
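The first and last tips can be combined into a quick round-trip check. This sketch assumes PyTorch and uses a toy linear model with random data; the dictionary keys follow the convention used earlier in this article.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one training step so the optimizer holds momentum buffers.
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.tar")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    path,
)

# Restore into freshly constructed objects.
restored_model = nn.Linear(4, 2)
restored_opt = torch.optim.SGD(restored_model.parameters(), lr=0.1, momentum=0.9)
ckpt = torch.load(path)
restored_model.load_state_dict(ckpt["model_state_dict"])
restored_opt.load_state_dict(ckpt["optimizer_state_dict"])

# Validation: same batch, same outputs, and optimizer state is not empty.
with torch.no_grad():
    outputs_match = torch.allclose(model(x), restored_model(x))
momentum_restored = len(restored_opt.state_dict()["state"]) > 0

print(outputs_match, momentum_restored)
```

If the optimizer entry were left out of the saved dictionary, the outputs would still match but the momentum check would fail, which is the silent reset described above.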

Explain like I'm 5

Imagine working on a giant Lego model that takes weeks to finish. Every evening, before going to bed, you take a careful photo of every piece and where it goes. If you knock the whole thing over, or if you have to leave home for a week, you can rebuild it exactly from the photos. A saver in machine learning is that photo. It records the position of every knob inside a learning model so that the model can pause, sleep, get moved to a new computer, or wake up months later and keep going from the same spot.
