# Saver

> Source: https://aiwiki.ai/wiki/saver
> Updated: 2026-05-11
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Saver in machine learning

In machine learning, a **Saver** is a utility or class that lets users persist and restore the state of models, variables, or other components during training and evaluation. Saving model state matters for several practical reasons: it preserves intermediate progress, supports [transfer learning](/wiki/transfer_learning) and [fine-tuning](/wiki/fine_tuning), allows training to resume after interruptions, and produces artifacts that can be shipped to production for [inference](/wiki/inference). Most [machine learning](/wiki/machine_learning) frameworks ship their own saver. [TensorFlow](/wiki/tensorflow) has historically offered `tf.train.Saver`, then `tf.train.Checkpoint`, and the higher level SavedModel format. [PyTorch](/wiki/pytorch) relies on `torch.save` together with the model's `state_dict`.

In machine learning usage, the word "saver" most often refers to the [TensorFlow](/wiki/tensorflow) API named `tf.train.Saver`, which was the standard saving API in TensorFlow 1.x and is now deprecated in favor of `tf.train.Checkpoint`. The broader concept, persisting a model's learnable parameters and optimizer state to disk, is universal across frameworks.

## Why saving matters

Training a modern neural network can take hours, days, or weeks of compute. Without periodic checkpoints, a single crash, a preemption on a cloud GPU, or a node failure on a training cluster would force the user to start over. Savers solve a handful of practical problems at once:

* They allow a training job to resume from the most recent step, with the optimizer's internal state intact (momentum buffers, Adam's first and second moment estimates, learning rate schedules, etc.).
* They produce artifacts that can be loaded for evaluation, served for inference, or used as the starting point for further [fine-tuning](/wiki/fine_tuning) on a new dataset.
* They make experiment tracking reproducible. A saved checkpoint plus the training code is, in principle, enough to recreate a model's behavior exactly.
* They support early stopping, where the saver keeps the best validation checkpoint while discarding worse ones.

A related concept is the **checkpoint**, the actual on-disk artifact produced by a saver. Savers write checkpoints; users restore checkpoints. The two terms are often used interchangeably in practice.

## tf.train.Saver (TensorFlow 1.x)

In TensorFlow 1.x, the `tf.train.Saver` class was the canonical way to save and restore the values of `tf.Variable` objects bound to a `tf.Session`. It is now available only through the compatibility module as `tf.compat.v1.train.Saver` and is considered legacy code for users on TensorFlow 2.x. The recommended replacement is `tf.train.Checkpoint`.

A `Saver` is created with an optional list of variables to track. If no list is supplied, the saver tracks all variables in the default graph. The two core methods are `save()` and `restore()`:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define variables
W = tf.Variable(tf.random.normal([10, 10]), name='weights')
b = tf.Variable(tf.zeros([10]), name='biases')

# Create a Saver object that keeps the last 5 checkpoints
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training loop
    for step in range(1000):
        # ... training code ...
        if step % 100 == 0:
            saver.save(sess, 'model.ckpt', global_step=step)
```

To restore later:

```python
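# Assumes the same variables and `saver` have been defined as above.
with tf.Session() as sess:
    # No initializer is needed: restore() assigns every tracked variable.
    saver.restore(sess, 'model.ckpt-900')
    # variables now hold the values from step 900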
```

### Checkpoint files written by Saver

A single `save()` call writes a handful of files. The exact layout depends on whether the V1 or V2 binary format is in use. The V2 format has been the default since TensorFlow 0.12 and is recommended because it is faster to restore and uses less memory.

| File | Contents |
| --- | --- |
| `model.ckpt-<step>.data-00000-of-00001` | A TensorBundle file containing the actual tensor values (weights and biases). Large checkpoints may be sharded across several `.data-N-of-M` files. |
| `model.ckpt-<step>.index` | A string to string immutable table mapping each tensor name to a serialized `BundleEntryProto` describing which data file holds the tensor and at what offset. |
| `model.ckpt-<step>.meta` | A `MetaGraphDef` protocol buffer containing the computational graph definition. Optional on restore if the user rebuilds the graph in code. |
| `checkpoint` | A small text file that records the latest checkpoint and the recent checkpoint history. Used by helpers like `tf.train.latest_checkpoint()`. |

The `max_to_keep` argument controls automatic cleanup. A value of 5 keeps the five most recent checkpoints and silently deletes older ones, which is useful for long training runs that would otherwise fill up the disk.
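The rotation behaviour of `max_to_keep` can be sketched without TensorFlow. The class below is a toy stand-in, not the `tf.train.Saver` implementation: `RollingCheckpoints` and its `save` method are hypothetical names, and the payload is a dummy byte string rather than real tensor data.

```python
import os
import tempfile
from collections import deque


class RollingCheckpoints:
    """Toy stand-in for max_to_keep: keep only the N newest checkpoint files."""

    def __init__(self, directory, max_to_keep=5):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self._kept = deque()  # paths of surviving checkpoints, oldest first

    def save(self, payload, step):
        path = os.path.join(self.directory, f"model.ckpt-{step}")
        with open(path, "wb") as f:
            f.write(payload)
        self._kept.append(path)
        # Delete the oldest files once the window is exceeded.
        while len(self._kept) > self.max_to_keep:
            os.remove(self._kept.popleft())
        return path


with tempfile.TemporaryDirectory() as tmp:
    ckpts = RollingCheckpoints(tmp, max_to_keep=5)
    for step in range(0, 1000, 100):  # same schedule as the training loop above
        ckpts.save(b"dummy weights", step)
    remaining = sorted(os.listdir(tmp), key=lambda n: int(n.rsplit("-", 1)[1]))

print(remaining)
```

Ten checkpoints are written, but only the files for steps 500 through 900 remain on disk at the end, which is why the restore example above can still find `model.ckpt-900` but not `model.ckpt-0`.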

### Limitations of the old Saver

`tf.train.Saver` was designed for the graph and session model of TensorFlow 1.x. It saves variables by name, so renaming a variable in code can break restoration. It does not work naturally with eager execution or with [Keras](/wiki/keras) models that use object based attribute tracking. These limitations are the main reason it was replaced.

## tf.train.Checkpoint (TensorFlow 2.x)

`tf.train.Checkpoint` is the modern saver in TensorFlow 2.x. The key conceptual change is **object based checkpointing**. Instead of recording variables by their global name in a graph, a `Checkpoint` records the Python object graph rooted at the objects passed into its constructor. When you save a `Checkpoint` that contains a model and an optimizer, it walks the attributes of both, finds every `tf.Variable` they own, and writes them out. On restore, the same walk is performed on the new objects and values are matched up by structural position rather than by string name.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10)
])
optimizer = tf.keras.optimizers.Adam(1e-3)

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
save_path = checkpoint.save('./ckpt/ckpt')
# ... later, possibly in a different process ...
checkpoint.restore(save_path)
```

Because restoration is structural, the user can refactor variable names freely without invalidating an existing checkpoint, as long as the attribute structure of the tracked objects is preserved. Renaming a `tf.Variable` is fine. Moving a layer from one `Sequential` to another is not.
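The difference between name based and object based matching can be illustrated with a toy model in plain Python. This is a simplified sketch, not TensorFlow's implementation: `Var`, `Dense`, `Model`, `walk`, and `restore` are hypothetical stand-ins, and the attribute path from the root object is used as the structural key.

```python
class Var:
    """Toy variable: a graph-style name plus a value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


class Dense:
    def __init__(self, prefix):
        self.kernel = Var(f"{prefix}/kernel", 0.0)
        self.bias = Var(f"{prefix}/bias", 0.0)


class Model:
    def __init__(self, prefix):
        self.hidden = Dense(f"{prefix}/hidden")
        self.out = Dense(f"{prefix}/out")


def walk(obj, path=()):
    """Yield (attribute_path, Var) for every Var reachable from obj."""
    for attr, child in sorted(vars(obj).items()):
        if isinstance(child, Var):
            yield path + (attr,), child
        elif hasattr(child, "__dict__"):
            yield from walk(child, path + (attr,))


def restore(model, saved, key):
    """Assign saved values by key; return the keys that found no match."""
    missing = []
    for path, var in walk(model):
        k = key(path, var)
        if k in saved:
            var.value = saved[k]
        else:
            missing.append(k)
    return missing


old = Model("net")
old.hidden.kernel.value = 1.5
saved_by_name = {var.name: var.value for _, var in walk(old)}
saved_by_path = {"/".join(path): var.value for path, var in walk(old)}

# Same attribute structure, but every variable name has a new prefix.
new = Model("renamed_net")
missing_by_name = restore(new, saved_by_name, key=lambda path, var: var.name)
missing_by_path = restore(new, saved_by_path, key=lambda path, var: "/".join(path))

print(len(missing_by_name), missing_by_path, new.hidden.kernel.value)
```

Name keyed restoration fails to match any of the four renamed variables, while path keyed restoration matches all of them because the attributes `hidden`, `out`, `kernel`, and `bias` are unchanged.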

### CheckpointManager

In production training loops, a `tf.train.Checkpoint` is usually paired with `tf.train.CheckpointManager`, which handles rotation of old checkpoints and maintains the `checkpoint` index file. A typical training loop looks like this:

```python
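# `checkpoint` is the tf.train.Checkpoint created in the previous example.
manager = tf.train.CheckpointManager(
    checkpoint, directory='./ckpt', max_to_keep=3
)

num_epochs = 3  # placeholder value for illustration
for epoch in range(num_epochs):
    # ... training ...
    manager.save()  # writes a new checkpoint and deletes any beyond max_to_keep

# Later, restore the latest
status = checkpoint.restore(manager.latest_checkpoint)
status.expect_partial()  # silences warnings about incomplete restores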
```

### Partial restoration and deferred matching

A powerful property of object based checkpointing is **deferred restoration**. Calling `checkpoint.restore(path)` immediately assigns values to variables that already exist and queues the rest. Queued values are applied as soon as the matching variables are created and become trackable from the `Checkpoint` root, which matters because Keras layers often create their variables only on first call. This is convenient for transfer learning, where the user might restore the backbone of a network from a pretrained checkpoint and then add new heads that are initialized fresh.
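The queueing idea can be shown with a small plain-Python model. This is a toy illustration only and does not reflect how TensorFlow implements it: `ToyCheckpoint`, `ToyModel`, `add_variable`, and `on_create` are hypothetical names, and variables are matched by a simple string key.

```python
class ToyCheckpoint:
    """Holds saved values and hands them out as variables appear."""

    def __init__(self, saved):
        self.pending = dict(saved)  # values not yet assigned to any variable

    def restore(self, model):
        model.restorer = self
        # Variables that already exist are filled in straight away.
        for name in list(model.variables):
            self.on_create(model, name)

    def on_create(self, model, name):
        if name in self.pending:
            model.variables[name] = self.pending.pop(name)


class ToyModel:
    def __init__(self):
        self.variables = {}
        self.restorer = None

    def add_variable(self, name, initial):
        self.variables[name] = initial
        if self.restorer is not None:
            self.restorer.on_create(self, name)  # deferred match happens here


ckpt = ToyCheckpoint({"backbone/w": 0.7, "backbone/b": -0.1})
model = ToyModel()
ckpt.restore(model)                    # nothing exists yet, so everything is queued
model.add_variable("backbone/w", 0.0)  # filled from the checkpoint on creation
model.add_variable("backbone/b", 0.0)  # filled from the checkpoint on creation
model.add_variable("head/w", 0.5)      # not in the checkpoint, keeps its fresh value
print(model.variables)
```

The backbone variables pick up their saved values the moment they are created, while the new head keeps its fresh initial value, mirroring the transfer learning scenario described above.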

## SavedModel vs checkpoint

A frequent source of confusion is the difference between a TensorFlow **checkpoint** and a TensorFlow **SavedModel**.

A checkpoint, whether written by `tf.train.Saver` or `tf.train.Checkpoint`, stores the values of the model's variables. The variable data itself carries no description of the computation (the optional `.meta` file written by `tf.train.Saver` is a separate graph export), so to use a checkpoint the original Python code that defined the model normally has to be available. Checkpoints are lightweight and well suited for resuming training.

A SavedModel is a directory that bundles a serialized description of the computation (as a protobuf `MetaGraphDef`) together with a TensorFlow checkpoint of the variables, plus assets and signatures. SavedModels are self contained. They can be loaded without the original Python source, deployed via TensorFlow Serving, converted for TensorFlow Lite or TensorFlow.js, or called from other languages such as C++ or Java. In TensorFlow releases that bundle Keras 2, `model.save('path')` writes a SavedModel by default. Keras 3, the default from TensorFlow 2.16, saves to a single `.keras` file instead and writes SavedModels through `model.export('path')`.

| Aspect | Checkpoint | SavedModel |
| --- | --- | --- |
| Contains weights | Yes | Yes (in a `variables/` subdirectory that is itself a checkpoint) |
| Contains graph definition | No | Yes |
| Self contained for serving | No | Yes |
| Requires original Python code to load | Yes | No |
| Typical use | Training resumption, experiment tracking | Deployment, inference, cross language use |

## Keras savers

[Keras](/wiki/keras), the high level API now bundled with TensorFlow, exposes two main saving methods on every `Model` instance:

* `model.save_weights(path)` writes only the model's weights to disk. The user is responsible for reconstructing the architecture in code before calling `model.load_weights(path)`. In Keras 2 (bundled with TensorFlow up to 2.15), the output is either a TensorFlow checkpoint or an HDF5 `.h5` file, controlled by the `save_format` argument. In Keras 3, the filename must end in `.weights.h5`.
* `model.save(path)` writes the architecture, the weights, the optimizer state, and the training configuration as one artifact. In Keras 2 this is a SavedModel directory by default (or an HDF5 file if the path ends in `.h5`). In Keras 3 it is a single `.keras` archive, with legacy `.h5` still supported. The resulting artifact can be loaded with `tf.keras.models.load_model(path)` without any architecture code.

For most Keras users, `model.save` is the right default. For users who want full control of the file format or who are loading weights into a slightly different architecture, `save_weights` is more flexible.

## PyTorch and torch.save

[PyTorch](/wiki/pytorch) does not have a class named `Saver`. Saving is handled by two top level functions, `torch.save` and `torch.load`, plus the `state_dict` mechanism on every `nn.Module` and optimizer.

A **state_dict** is a Python dictionary that maps each parameter or buffer name in a module to its tensor value. Optimizers also expose a `state_dict` containing momentum buffers and hyperparameters. The recommended approach is to save the state_dict rather than the entire module, because pickling a whole module ties the checkpoint to the exact class definition at the time of saving. If the class is later refactored, the saved object may fail to load.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)

model = Net()

# Save only the weights
torch.save(model.state_dict(), 'model.pth')

# Restore
model = Net()
# weights_only=True limits unpickling to tensors and plain containers
model.load_state_dict(torch.load('model.pth', weights_only=True))
model.eval()
```
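Because a state_dict is an ordinary dictionary keyed by parameter name, it can be inspected and filtered with plain Python before loading. The sketch below uses floats in place of tensors so that it runs without PyTorch; the key names and the helpers `filter_by_prefix` and `load_partial` are hypothetical. With a real model, the filtered dictionary would instead be passed to `model.load_state_dict(filtered, strict=False)`, which reports missing and unexpected keys in a similar way.

```python
def filter_by_prefix(state_dict, prefix):
    """Keep only the entries whose key starts with the given prefix."""
    return {k: v for k, v in state_dict.items() if k.startswith(prefix)}


def load_partial(target, source):
    """Copy matching keys into target and report what did not line up."""
    missing = [k for k in target if k not in source]
    unexpected = [k for k in source if k not in target]
    for k, v in source.items():
        if k in target:
            target[k] = v
    return missing, unexpected


# Stand-in for a pretrained state_dict (floats instead of tensors).
pretrained = {
    "backbone.fc1.weight": 0.3,
    "backbone.fc1.bias": 0.1,
    "head.weight": 0.9,
}

# Stand-in for a freshly built model with a new, larger head.
new_model = {
    "backbone.fc1.weight": 0.0,
    "backbone.fc1.bias": 0.0,
    "head.weight": 0.0,
    "head.bias": 0.0,
}

backbone_only = filter_by_prefix(pretrained, "backbone.")
missing, unexpected = load_partial(new_model, backbone_only)
print(missing, unexpected)
```

The backbone entries are overwritten with pretrained values, and the head entries are reported as missing and keep their fresh initialization.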

### General checkpoints in PyTorch

For resuming training, PyTorch users wrap multiple pieces of state in a dictionary and pickle the dictionary with `torch.save`. The conventional file extension for this richer artifact is `.tar`, although nothing in PyTorch enforces it.

```python
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.tar')

# Restore
checkpoint = torch.load('checkpoint.tar')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.train()  # set back to training mode
```

Two small but important details trip up new users. First, `model.eval()` must be called before inference so that dropout and batch normalization layers behave correctly. Second, when moving a checkpoint between CPU and GPU, the `map_location` argument to `torch.load` controls where the tensors are placed; for example, `torch.load(path, map_location='cpu')` loads a checkpoint saved on a GPU onto a machine that has none.

### Higher level wrappers

Frameworks built on PyTorch, such as PyTorch Lightning and Hugging Face Transformers, add their own checkpointing helpers that wrap `torch.save` with conventions for distributed training, sharding for large models, and integration with experiment trackers. The underlying mechanism is the same.

## Common use cases

Savers, by whatever name, support a handful of recurring workflows in deep learning:

* **Training resumption.** A periodic checkpoint plus a `CheckpointManager` or equivalent allows a job that was killed at step 7,420 to restart from step 7,000 instead of step zero.
* **Best checkpoint selection.** Track validation loss during training and overwrite a `best.ckpt` file whenever the metric improves. After training ends, this file holds the best model, not the final one.
* **Transfer learning and fine-tuning.** Load a checkpoint of a pretrained backbone (for example a [ResNet](/wiki/resnet) or a [BERT](/wiki/bert) encoder), freeze part of it, and attach a new task specific head. Object based saving in `tf.train.Checkpoint` and `state_dict` filtering in PyTorch both support partial loads.
* **Deployment.** Convert a final checkpoint into a serving format (SavedModel for TensorFlow Serving, TorchScript or ONNX for PyTorch) and ship that artifact rather than the checkpoint itself.
* **Reproducibility.** Pin the checkpoint, the code commit, and the data version together so that a colleague can reproduce a result a year later.
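The best checkpoint workflow from the list above can be sketched in a framework independent way. This is a minimal illustration under stated assumptions: the validation losses are made-up numbers, `train_with_best_checkpoint` is a hypothetical helper, and a small JSON record stands in for real model weights.

```python
import json
import os
import tempfile


def train_with_best_checkpoint(val_losses, directory):
    """Overwrite best.ckpt whenever the validation loss improves."""
    best_loss = float("inf")
    best_path = os.path.join(directory, "best.ckpt")
    for epoch, val_loss in enumerate(val_losses):
        # ... a real loop would train and evaluate the model here ...
        if val_loss < best_loss:
            best_loss = val_loss
            tmp_path = best_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"epoch": epoch, "val_loss": val_loss}, f)
            # Write to a temporary file first, then swap it in, so an
            # interrupted write does not clobber the previous best file.
            os.replace(tmp_path, best_path)
    return best_path


with tempfile.TemporaryDirectory() as tmp:
    path = train_with_best_checkpoint([0.9, 0.6, 0.7, 0.5, 0.8], tmp)
    with open(path) as f:
        best = json.load(f)

print(best)
```

The final epoch has a loss of 0.8, but the file on disk still describes epoch 3 with a loss of 0.5, which is the point of tracking the best checkpoint separately from the latest one.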

## Practical tips

* Save more than the weights. A checkpoint that also contains the optimizer state, the global step, and the learning rate schedule allows clean resumption. A checkpoint that contains only the weights silently resets the optimizer.
* Use a `max_to_keep` style cap. Long runs at scale generate gigabytes of checkpoints. A rolling window of three to five most recent checkpoints, plus a separately tracked best checkpoint, is a sensible default.
* Pin the framework version. Across major versions of TensorFlow or PyTorch, the on disk format is generally stable, but edge cases exist. Recording the framework version next to the checkpoint avoids surprises.
* Validate restoration. After `restore()` or `load_state_dict()`, run one batch through the model and confirm the loss or a sample prediction looks reasonable. Variables that were silently left unrestored, for example after `expect_partial()` or `load_state_dict(..., strict=False)`, can otherwise survive into deployment.
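The first tip can be made concrete with a toy optimizer in plain Python. The sketch below assumes a one-parameter model, the loss (w - 3)^2, and hand-written SGD with momentum; `sgd_momentum_step` and `run` are hypothetical helpers, and `pickle` stands in for a framework saver.

```python
import pickle


def sgd_momentum_step(w, velocity, lr=0.1, beta=0.9):
    """One step of SGD with momentum on the loss (w - 3)^2."""
    grad = 2.0 * (w - 3.0)
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity


def run(w, velocity, steps):
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, velocity)
    return w, velocity


# Reference: 20 uninterrupted steps.
w_ref, _ = run(0.0, 0.0, 20)

# Interrupted run: checkpoint after 10 steps, in two flavours.
w10, v10 = run(0.0, 0.0, 10)
full_ckpt = pickle.dumps(
    {"step": 10, "weights": w10, "optimizer": {"velocity": v10}}
)
weights_only_ckpt = pickle.dumps({"weights": w10})

# Resume with the full state: weights, optimizer state, and step counter.
state = pickle.loads(full_ckpt)
w_full, _ = run(
    state["weights"], state["optimizer"]["velocity"], 20 - state["step"]
)

# Resume with weights only: the momentum buffer restarts from zero.
state = pickle.loads(weights_only_ckpt)
w_reset, _ = run(state["weights"], 0.0, 10)

print(w_full == w_ref, w_reset == w_ref)
```

Resuming from the full checkpoint reproduces the uninterrupted run exactly, while resuming from weights alone follows a different trajectory because the momentum buffer was lost.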

## Explain like I'm 5

Imagine working on a giant Lego model that takes weeks to finish. Every evening, before going to bed, you take a careful photo of every piece and where it goes. If you knock the whole thing over, or if you have to leave home for a week, you can rebuild it exactly from the photos. A saver in machine learning is that photo. It records the position of every knob inside a learning model so that the model can pause, sleep, get moved to a new computer, or wake up months later and keep going from the same spot.

## References

1. TensorFlow Documentation. "tf.compat.v1.train.Saver." https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/Saver
2. TensorFlow Documentation. "tf.train.Checkpoint." https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint
3. TensorFlow Documentation. "Training checkpoints." https://www.tensorflow.org/guide/checkpoint
4. TensorFlow Documentation. "Using the SavedModel format." https://www.tensorflow.org/guide/saved_model
5. TensorFlow Documentation. "Migrating model checkpoints." https://www.tensorflow.org/guide/migrate/migrating_checkpoints
6. TensorFlow Documentation. "Save and load models." https://www.tensorflow.org/tutorials/keras/save_and_load
7. PyTorch Documentation. "Saving and Loading Models." https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html
8. PyTorch Documentation. "Saving and loading a general checkpoint in PyTorch." https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html
9. Keras Documentation. "Save, serialize, and export models." https://keras.io/guides/serialization_and_saving/
