Saver
Last reviewed
May 11, 2026
Sources
9 citations
Review status
Source-backed
Revision
v2 · 2,204 words
See also: Machine learning terms
In machine learning, a Saver is a utility or class that lets users persist and restore the state of models, variables, or other components during training and evaluation. Saving model state matters for several practical reasons: it preserves intermediate progress, supports transfer learning and fine-tuning, allows training to resume after interruptions, and produces artifacts that can be shipped to production for inference. Most machine learning frameworks ship their own saver. TensorFlow has historically offered tf.train.Saver, then tf.train.Checkpoint, and the higher level SavedModel format. PyTorch relies on torch.save together with the model's state_dict.
The word "saver" in the machine learning world almost always refers to the TensorFlow API named tf.train.Saver, which dates back to the earliest TensorFlow releases, was the standard saving mechanism throughout TensorFlow 1.x, and is now deprecated in favor of tf.train.Checkpoint. The broader concept, persisting a model's learnable parameters and optimizer state to disk, is universal across frameworks.
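Training a modern neural network can take hours, days, or weeks of compute. Without periodic checkpoints, a single crash, a preemption on a cloud GPU, or a node failure on a training cluster would force the user to start over. Savers solve a handful of practical problems at once: they preserve intermediate progress, let training resume after an interruption, support transfer learning and fine-tuning, and produce artifacts that can be deployed for inference.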
A related concept is the checkpoint, the actual on-disk artifact produced by a saver. Savers write checkpoints; users restore checkpoints. The two terms are often used interchangeably in practice.
In TensorFlow 1.x, the tf.train.Saver class was the canonical way to save and restore the values of tf.Variable objects bound to a tf.Session. It is now available only through the compatibility module as tf.compat.v1.train.Saver and is considered legacy code for users on TensorFlow 2.x. The recommended replacement is tf.train.Checkpoint.
A Saver is created with an optional list of variables to track. If no list is supplied, the saver tracks all variables in the default graph. The two core methods are save() and restore():
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define variables
W = tf.Variable(tf.random.normal([10, 10]), name='weights')
b = tf.Variable(tf.zeros([10]), name='biases')

# Create a Saver object that keeps the last 5 checkpoints
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training loop
    for step in range(1000):
        # ... training code ...
        if step % 100 == 0:
            saver.save(sess, 'model.ckpt', global_step=step)
To restore later:
with tf.Session() as sess:
    saver.restore(sess, 'model.ckpt-900')
    # variables now hold the values from step 900
A single save() call writes a handful of files. The exact layout depends on whether the V1 or V2 binary format is in use. The V2 format has been the default since TensorFlow 0.12 and is recommended because it is faster to restore and needs less peak memory while restoring.
| File | Contents |
|---|---|
| model.ckpt-<step>.data-00000-of-00001 | A TensorBundle file containing the actual tensor values (weights and biases). Large checkpoints may be sharded across several .data-N-of-M files. |
| model.ckpt-<step>.index | An immutable string-to-string table mapping each tensor name to a serialized BundleEntryProto describing which data file holds the tensor and at what offset. |
| model.ckpt-<step>.meta | A MetaGraphDef protocol buffer containing the computational graph definition. Optional on restore if the user rebuilds the graph in code. |
| checkpoint | A small text file that records the latest checkpoint and the recent checkpoint history. Used by helpers like tf.train.latest_checkpoint(). |
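As a rough illustration of what the checkpoint state file holds, the sketch below parses its text into the latest checkpoint prefix and the recorded history. It assumes the file consists of `model_checkpoint_path: "..."` and `all_model_checkpoint_paths: "..."` lines; the function name parse_checkpoint_state is hypothetical, and real code should rely on tf.train.latest_checkpoint() instead.

```python
import re

def parse_checkpoint_state(text):
    """Return (latest_prefix, history) from checkpoint-state text.

    Assumes one `key: "value"` pair per line; other lines are ignored.
    """
    latest = None
    history = []
    for line in text.splitlines():
        match = re.match(r'\s*(\w+):\s*"(.*)"\s*$', line)
        if not match:
            continue
        key, value = match.groups()
        if key == "model_checkpoint_path":
            latest = value
        elif key == "all_model_checkpoint_paths":
            history.append(value)
    return latest, history

# Illustrative file content, not copied from a real training run.
example = '''model_checkpoint_path: "model.ckpt-900"
all_model_checkpoint_paths: "model.ckpt-700"
all_model_checkpoint_paths: "model.ckpt-800"
all_model_checkpoint_paths: "model.ckpt-900"
'''

latest, history = parse_checkpoint_state(example)
```

Here `latest` is the prefix that would be passed to restore(), and `history` lists the checkpoints still retained on disk.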
The max_to_keep argument controls automatic cleanup. A value of 5 keeps the five most recent checkpoints and silently deletes older ones, which is useful for long training runs that would otherwise fill up the disk.
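The rotation behavior can be sketched without TensorFlow. The example below is a minimal, framework-free illustration of a rolling window; the RollingSaver class and its payload argument are hypothetical stand-ins, not part of any library, and a real checkpoint consists of several files rather than one.

```python
import os
import tempfile
from collections import deque

class RollingSaver:
    """Hypothetical helper: writes one file per step, keeps the newest N."""

    def __init__(self, directory, max_to_keep=5):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self.kept = deque()  # paths in the order they were written

    def save(self, payload, step):
        path = os.path.join(self.directory, f"model.ckpt-{step}")
        with open(path, "wb") as f:
            f.write(payload)
        self.kept.append(path)
        # Delete the oldest files once the window is exceeded.
        while len(self.kept) > self.max_to_keep:
            os.remove(self.kept.popleft())
        return path

with tempfile.TemporaryDirectory() as workdir:
    saver = RollingSaver(workdir, max_to_keep=5)
    for step in range(0, 800, 100):  # eight saves: steps 0, 100, ..., 700
        saver.save(b"weights", step)
    remaining = sorted(os.listdir(workdir),
                       key=lambda name: int(name.rsplit("-", 1)[1]))
```

After eight saves with a window of five, only the files for steps 300 through 700 remain in the directory.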
tf.train.Saver was designed for the graph and session model of TensorFlow 1.x. It saves variables by name, so renaming a variable in code can break restoration. It does not work naturally with eager execution or with Keras models that use object based attribute tracking. These limitations are the main reason it was replaced.
tf.train.Checkpoint is the modern saver in TensorFlow 2.x. The key conceptual change is object based checkpointing. Instead of recording variables by their global name in a graph, a Checkpoint records the Python object graph rooted at the objects passed into its constructor. When you save a Checkpoint that contains a model and an optimizer, it walks the attributes of both, finds every tf.Variable they own, and writes them out. On restore, the same walk is performed on the new objects and values are matched up by structural position rather than by string name.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10)
])
optimizer = tf.keras.optimizers.Adam(1e-3)

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
save_path = checkpoint.save('./ckpt/ckpt')

# ... later, possibly in a different process ...
checkpoint.restore(save_path)
Because restoration is structural, the user can refactor variable names freely without invalidating an existing checkpoint, as long as the attribute structure of the tracked objects is preserved. Renaming a tf.Variable is fine. Moving a layer from one Sequential to another is not.
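The idea of matching by structure rather than by name can be illustrated with a toy object graph. The sketch below is not how TensorFlow implements it; Var, Layer, Model, collect, save, and restore are all hypothetical stand-ins that key each value by its attribute path and ignore the variable's own name.

```python
class Var:
    """Stand-in for a variable: a display name plus a value."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

class Layer:
    def __init__(self, kernel_name, bias_name):
        self.kernel = Var(kernel_name, [0.0])
        self.bias = Var(bias_name, [0.0])

class Model:
    def __init__(self, prefix):
        self.hidden = Layer(prefix + "_k1", prefix + "_b1")
        self.out = Layer(prefix + "_k2", prefix + "_b2")

def collect(obj, path=""):
    """Walk attributes recursively and return {attribute_path: Var}."""
    found = {}
    for attr, child in sorted(vars(obj).items()):
        child_path = f"{path}/{attr}" if path else attr
        if isinstance(child, Var):
            found[child_path] = child
        elif hasattr(child, "__dict__"):
            found.update(collect(child, child_path))
    return found

def save(root):
    """Snapshot values keyed by attribute path, not by variable name."""
    return {p: list(v.value) for p, v in collect(root).items()}

def restore(root, saved):
    """Assign saved values to whatever sits at the same attribute path."""
    for p, v in collect(root).items():
        v.value = list(saved[p])

old = Model("old")
old.hidden.kernel.value = [1.5]
snapshot = save(old)

new = Model("renamed")  # different variable names, same attribute structure
restore(new, snapshot)
```

Because the snapshot is keyed by paths such as hidden/kernel, the renamed variables still receive their values. Moving the hidden layer under a different attribute would change the path and break the match.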
In production training loops, a tf.train.Checkpoint is usually paired with tf.train.CheckpointManager, which handles rotation and maintains the checkpoint state file. A typical training loop looks like this:
manager = tf.train.CheckpointManager(
    checkpoint, directory='./ckpt', max_to_keep=3
)

for epoch in range(num_epochs):
    # ... training ...
    manager.save()

# Later, restore the latest
status = checkpoint.restore(manager.latest_checkpoint)
status.expect_partial()  # silences warnings if only part of the graph is restored
A powerful property of object based checkpointing is deferred restoration. Calling checkpoint.restore(path) assigns values immediately to variables that already exist and queues the rest, applying each queued value as soon as the corresponding variable is created and becomes trackable from the Checkpoint root. This is convenient for transfer learning, where the user might restore the backbone of a network from a pretrained checkpoint and then add new heads that are initialized fresh.
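A toy model of deferred restoration may make the behavior easier to picture. The sketch below is a conceptual illustration only, not TensorFlow's implementation; DeferredCheckpoint, Var, and the track method are hypothetical names, and variables are identified by plain strings for brevity.

```python
class Var:
    """Stand-in for a variable holding a single value."""
    def __init__(self, value):
        self.value = value

class DeferredCheckpoint:
    """Toy checkpoint: restores tracked variables now, queues the rest."""

    def __init__(self):
        self.tracked = {}  # name -> Var currently reachable
        self.pending = {}  # name -> saved value waiting for its variable

    def restore(self, saved):
        for name, value in saved.items():
            if name in self.tracked:
                self.tracked[name].value = value  # variable exists: assign now
            else:
                self.pending[name] = value        # variable missing: defer

    def track(self, name, var):
        self.tracked[name] = var
        if name in self.pending:
            var.value = self.pending.pop(name)    # apply the deferred value

ckpt = DeferredCheckpoint()

backbone = Var(0.0)
ckpt.track("backbone", backbone)
ckpt.restore({"backbone": 1.0, "late_layer": 2.0})

late = Var(0.0)
ckpt.track("late_layer", late)  # created after restore(), filled on attach

head = Var(0.5)
ckpt.track("new_head", head)    # absent from the checkpoint, stays fresh
```

The backbone is restored at once, the late layer receives its value when it appears, and the new head keeps its fresh initialization, which mirrors the transfer learning case described above.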
A frequent source of confusion is the difference between a TensorFlow checkpoint and a TensorFlow SavedModel.
A checkpoint, whether written by tf.train.Saver or tf.train.Checkpoint, contains only the values of the model's variables. It does not contain a description of the computation. To use a checkpoint, the original Python code that defined the model has to be available. Checkpoints are lightweight and well suited for resuming training.
A SavedModel is a directory that bundles a serialized description of the computation (as a protobuf MetaGraphDef) together with a TensorFlow checkpoint of the variables, plus assets and signatures. SavedModels are self contained. They can be loaded without the original Python source, deployed via TensorFlow Serving, converted for TensorFlow Lite or TensorFlow.js, or called from other languages such as C++ or Java. Under Keras 2, the version bundled with TensorFlow 2.x through 2.15, model.save('path') produces a SavedModel by default; Keras 3 defaults to its own .keras format and exports SavedModels with model.export('path').
| Aspect | Checkpoint | SavedModel |
|---|---|---|
| Contains weights | Yes | Yes (in a variables/ subdirectory that is itself a checkpoint) |
| Contains graph definition | No | Yes |
| Self contained for serving | No | Yes |
| Requires original Python code to load | Yes | No |
| Typical use | Training resumption, experiment tracking | Deployment, inference, cross language use |
Keras, the high level API now bundled with TensorFlow, exposes two main saving methods on every Model instance:
- model.save_weights(path) writes only the model's weights to disk. The user is responsible for reconstructing the architecture in code before calling model.load_weights(path). Output format is either a TensorFlow checkpoint or an HDF5 .h5 file, controlled by the save_format argument.
- model.save(path) writes the architecture, the weights, the optimizer state, and the training configuration as a complete SavedModel (or an HDF5 file if the path ends in .h5). The resulting artifact can be loaded with tf.keras.models.load_model(path) without any architecture code.

For most Keras users, model.save is the right default. For users who want full control of the file format or who are loading weights into a slightly different architecture, save_weights is more flexible.
PyTorch does not have a class named Saver. Saving is handled by two top level functions, torch.save and torch.load, plus the state_dict mechanism on every nn.Module and optimizer.
A state_dict is a Python dictionary that maps each parameter or buffer name in a module to its tensor value. Optimizers also expose a state_dict containing momentum buffers and hyperparameters. The recommended approach is to save the state_dict rather than the entire module, because pickling a whole module ties the checkpoint to the exact class definition at the time of saving. If the class is later refactored, the saved object may fail to load.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)

model = Net()

# Save only the weights
torch.save(model.state_dict(), 'model.pth')

# Restore
model = Net()
model.load_state_dict(torch.load('model.pth'))
model.eval()
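Because a state_dict is an ordinary dictionary keyed by names such as fc1.weight, it can be filtered before loading, for example to restore only part of a network. The sketch below shows the filtering step on a plain dictionary; select_prefix is a hypothetical helper, and short lists stand in for tensors so the example runs without PyTorch.

```python
def select_prefix(state_dict, prefix):
    """Return only the entries whose key starts with the given prefix."""
    return {key: value for key, value in state_dict.items()
            if key.startswith(prefix)}

# Keys follow the 'layer.parameter' naming of the Net example above.
# The values are placeholders, not real tensors.
saved = {
    "fc1.weight": [0.1, 0.2],
    "fc1.bias": [0.0],
    "fc2.weight": [0.3, 0.4],
    "fc2.bias": [0.0],
}

first_layer_only = select_prefix(saved, "fc1.")
```

With a real model, the filtered dictionary would typically be passed to model.load_state_dict(first_layer_only, strict=False), where strict=False tells PyTorch to tolerate the keys that are missing from the dictionary.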
For resuming training, PyTorch users wrap multiple pieces of state in a dictionary and pickle the dictionary with torch.save. The conventional file extension for this richer artifact is .tar, although nothing in PyTorch enforces it.
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.tar')

# Restore
checkpoint = torch.load('checkpoint.tar')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.train()  # set back to training mode
Two small but important details trip up new users. First, model.eval() must be called before inference so that dropout and batch normalization layers behave correctly. Second, when moving a checkpoint between CPU and GPU, the map_location argument to torch.load controls where the tensors are placed.
Frameworks built on PyTorch, such as PyTorch Lightning and Hugging Face Transformers, add their own checkpointing helpers that wrap torch.save with conventions for distributed training, sharding for large models, and integration with experiment trackers. The underlying mechanism is the same.
Savers, by whatever name, support a handful of recurring workflows in deep learning:
- Resuming interrupted training. A CheckpointManager or equivalent allows a job that was killed at step 7,420 to restart from step 7,000 instead of step zero.
- Keeping the best model. Write a best.ckpt file whenever the monitored metric improves. After training ends, this file holds the best model, not the final one.
- Transfer learning and fine-tuning. tf.train.Checkpoint and state_dict filtering in PyTorch both support partial loads.

A few habits make checkpointing more reliable:

- Limit disk usage with a max_to_keep style cap. Long runs at scale generate gigabytes of checkpoints. A rolling window of the three to five most recent checkpoints, plus a separately tracked best checkpoint, is a sensible default.
- Verify every load. After restore() or load_state_dict(), run one batch through the model and confirm the loss or a sample prediction looks reasonable. A silent shape mismatch can otherwise survive into deployment.

Imagine working on a giant Lego model that takes weeks to finish. Every evening, before going to bed, you take a careful photo of every piece and where it goes. If you knock the whole thing over, or if you have to leave home for a week, you can rebuild it exactly from the photos. A saver in machine learning is that photo. It records the position of every knob inside a learning model so that the model can pause, sleep, get moved to a new computer, or wake up months later and keep going from the same spot.