# Saver

> Source: https://aiwiki.ai/wiki/saver
> Updated: 2026-06-29
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Saver in machine learning

In machine learning, a **Saver** is a utility or class that persists and restores the state of a model, its variables, and its optimizer to disk, so training can be paused, resumed, evaluated, or deployed without recomputing from scratch. In [TensorFlow](/wiki/tensorflow), where the term originated, `tf.train.Saver` saves and restores the values of `tf.Variable` objects to checkpoint files; it was the canonical mechanism in TensorFlow 1.x and is now superseded by the object based `tf.train.Checkpoint` in TensorFlow 2.x [1][2]. The same underlying idea, writing a model's learnable parameters and optimizer state to disk so they can be loaded later, is universal across frameworks: [PyTorch](/wiki/pytorch) uses `torch.save` with a `state_dict`, and Keras exposes `model.save_weights` and `model.save` [7][9].

Saving model state matters for several practical reasons: it preserves intermediate progress, supports [transfer learning](/wiki/transfer_learning) and [fine-tuning](/wiki/fine_tuning), allows training to resume after interruptions, and produces artifacts that can be shipped to production for [inference](/wiki/inference). Most [machine learning](/wiki/machine_learning) frameworks ship their own saver. [TensorFlow](/wiki/tensorflow) has historically offered `tf.train.Saver`, then `tf.train.Checkpoint`, and the higher level SavedModel format. [PyTorch](/wiki/pytorch) relies on `torch.save` together with the model's `state_dict`.

The word "saver" in the machine learning world almost always refers to the [TensorFlow](/wiki/tensorflow) API named `tf.train.Saver`, which predates TensorFlow 1.0, served as the standard checkpointing API throughout TensorFlow 1.x, and is now legacy code superseded by `tf.train.Checkpoint`. The broader concept, persisting a model's learnable parameters and optimizer state to disk, is universal across frameworks.

## Why does saving model state matter?

Training a modern neural network can take hours, days, or weeks of compute. Without periodic checkpoints, a single crash, a preemption on a cloud GPU, or a node failure on a training cluster would force the user to start over. Savers solve a handful of practical problems at once:

* They allow a training job to resume from the most recent step, with the optimizer's internal state intact (momentum buffers, Adam's first and second moment estimates, learning rate schedules, etc.).
* They produce artifacts that can be loaded for evaluation, served for inference, or used as the starting point for further [fine-tuning](/wiki/fine_tuning) on a new dataset.
* They make experiment tracking reproducible. A saved checkpoint plus the training code is, in principle, enough to recreate a model's behavior exactly.
* They support early stopping, where the saver keeps the best validation checkpoint while discarding worse ones.
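
Why the optimizer state matters for resumption can be illustrated with a small framework-free sketch. It is not taken from any library: the toy loss `f(w) = w**2`, the function names, and the hyperparameters are all assumptions chosen for illustration. The sketch compares an uninterrupted run with two resumed runs, one that restores the momentum buffer and one that restores only the weight.

```python
def sgd_momentum_step(w, velocity, lr=0.1, beta=0.9):
    """One step of SGD with momentum on the toy loss f(w) = w**2."""
    grad = 2.0 * w
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity


def train(w, velocity, steps):
    """Run `steps` updates and return the final weight and momentum buffer."""
    for _ in range(steps):
        w, velocity = sgd_momentum_step(w, velocity)
    return w, velocity


# Uninterrupted run: 10 steps starting from w = 1.0.
w_full, _ = train(1.0, 0.0, 10)

# Interrupted run: checkpoint after 5 steps, storing weight AND optimizer state.
w5, v5 = train(1.0, 0.0, 5)
checkpoint = {"w": w5, "velocity": v5, "step": 5}

# Resume A: restore everything, then run the remaining 5 steps.
w_resumed, _ = train(checkpoint["w"], checkpoint["velocity"], 5)

# Resume B: restore only the weight; the momentum buffer restarts at zero.
w_weights_only, _ = train(checkpoint["w"], 0.0, 5)

print(w_full == w_resumed)       # True: identical trajectory
print(w_full == w_weights_only)  # False: the optimizer was silently reset
```

The weights-only resume still converges on this toy problem, but it follows a different trajectory, which is the kind of discrepancy that is hard to notice in a real training run.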

A related concept is the **checkpoint**, the actual on-disk artifact produced by a saver. The TensorFlow guide notes that checkpoint prefixes such as `'./tf_ckpts/ckpt-10'` "are not files on disk. Instead they are prefixes for an `index` file and one or more data files which contain the variable values" [3]. Savers write checkpoints; users restore checkpoints. The two terms are often used interchangeably in practice.
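
The prefix-versus-file distinction can be seen with a short sketch that uses only the Python standard library. The files created here are empty placeholders whose names merely mimic the V2 layout for a hypothetical prefix `ckpt-10`; no real checkpoint data is involved.

```python
import glob
import os
import tempfile

# Placeholder files that imitate the on-disk layout for the prefix "ckpt-10".
ckpt_dir = tempfile.mkdtemp()
for name in ("ckpt-10.index", "ckpt-10.data-00000-of-00001", "checkpoint"):
    with open(os.path.join(ckpt_dir, name), "wb") as f:
        f.write(b"placeholder")

prefix = os.path.join(ckpt_dir, "ckpt-10")

# The prefix itself is not a file on disk.
prefix_is_file = os.path.exists(prefix)

# The files that belong to the checkpoint all start with the prefix.
files_behind_prefix = sorted(
    os.path.basename(p) for p in glob.glob(glob.escape(prefix) + ".*")
)

print(prefix_is_file)       # False
print(files_behind_prefix)  # ['ckpt-10.data-00000-of-00001', 'ckpt-10.index']
```

This is why copying or deleting a checkpoint by hand means handling every file that shares the prefix, not a single path.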

## What was tf.train.Saver in TensorFlow 1.x?

In TensorFlow 1.x, the `tf.train.Saver` class was the canonical way to save and restore the values of `tf.Variable` objects bound to a `tf.Session`. It is now available only through the compatibility module as `tf.compat.v1.train.Saver` and is considered legacy code for users on TensorFlow 2.x. The recommended replacement is `tf.train.Checkpoint` [1][5].

A `Saver` is created with an optional list of variables to track. If no list is supplied, the saver tracks all variables in the default graph. By default the constructor keeps the 5 most recent checkpoints (`max_to_keep=5`) and writes in the V2 binary format (`write_version=tf.train.SaverDef.V2`); the `keep_checkpoint_every_n_hours` argument defaults to 10000.0, which effectively disables the every-N-hours retention rule [1]. The two core methods are `save()` and `restore()`:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Define variables
W = tf.Variable(tf.random.normal([10, 10]), name='weights')
b = tf.Variable(tf.zeros([10]), name='biases')

# Create a Saver object that keeps the last 5 checkpoints
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training loop
    for step in range(1000):
        # ... training code ...
        if step % 100 == 0:
            saver.save(sess, 'model.ckpt', global_step=step)
```

To restore later:

```python
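# Assumes the same graph (W, b) and `saver` have been built as above.
with tf.Session() as sess:
    # restore() also initializes the variables, so no initializer op is run
    saver.restore(sess, 'model.ckpt-900')
    # variables now hold the values from step 900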
```

### What files does a Saver checkpoint contain?

A single `save()` call writes a handful of files. The exact layout depends on whether the V1 or V2 binary format is in use. The V2 format became the default in TensorFlow 0.12 (release branch r0.12) and is recommended because, per the release notes, it "significantly reduces the peak memory required" and the latency incurred during restore; older V1 checkpoints remain readable [10].

| File | Contents |
| --- | --- |
| `model.ckpt-<step>.data-00000-of-00001` | A TensorBundle file containing the actual tensor values (weights and biases). Large checkpoints may be sharded across several `.data-N-of-M` files; each tensor's `BundleEntryProto` carries a `shard_id` pointing to the data shard that holds it. |
| `model.ckpt-<step>.index` | A string to string immutable table (a `tensorflow::table::Table`) mapping each tensor name to a serialized `BundleEntryProto` that records which data file holds the tensor, the byte offset, and a checksum. |
| `model.ckpt-<step>.meta` | A `MetaGraphDef` protocol buffer containing the computational graph definition. Optional on restore if the user rebuilds the graph in code. |
| `checkpoint` | A small text file that records the latest checkpoint and the recent checkpoint history. Used by helpers like `tf.train.latest_checkpoint()`. |

The `max_to_keep` argument controls automatic cleanup. A value of 5 keeps the five most recent checkpoints and silently deletes older ones, which is useful for long training runs that would otherwise fill up the disk.
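
The retention rule can be sketched without TensorFlow. The `RollingSaver` class below is a hypothetical toy, not the TensorFlow implementation: it pickles a state dictionary per step and deletes the oldest file once more than `max_to_keep` checkpoints exist.

```python
import os
import pickle
import tempfile
from collections import deque


class RollingSaver:
    """Toy saver that keeps only the `max_to_keep` most recent checkpoints."""

    def __init__(self, directory, max_to_keep=5):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self._kept = deque()  # checkpoint paths, oldest first

    def save(self, state, step):
        path = os.path.join(self.directory, f"model.ckpt-{step}")
        with open(path, "wb") as f:
            pickle.dump(state, f)
        self._kept.append(path)
        # Delete the oldest checkpoints once the cap is exceeded.
        while len(self._kept) > self.max_to_keep:
            os.remove(self._kept.popleft())
        return path


directory = tempfile.mkdtemp()
saver = RollingSaver(directory, max_to_keep=5)
for step in range(1000):
    if step % 100 == 0:
        saver.save({"step": step}, step)

remaining = sorted(os.listdir(directory), key=lambda n: int(n.rsplit("-", 1)[1]))
print(remaining)
# ['model.ckpt-500', 'model.ckpt-600', 'model.ckpt-700', 'model.ckpt-800', 'model.ckpt-900']
```

Ten checkpoints are written (steps 0 through 900), and only the five most recent survive, mirroring the behavior described for `max_to_keep=5`.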

### What are the limitations of the old Saver?

`tf.train.Saver` was designed for the graph and session model of TensorFlow 1.x. It saves variables by name, so renaming a variable in code can break restoration. It does not work naturally with eager execution or with [Keras](/wiki/keras) models that use object based attribute tracking. These limitations are the main reason it was replaced.

## What is tf.train.Checkpoint in TensorFlow 2.x?

`tf.train.Checkpoint` is the modern saver in TensorFlow 2.x. The key conceptual change is **object based checkpointing**. As the TensorFlow guide puts it, "The easiest way to manage variables is by attaching them to Python objects, then referencing those objects," and "Subclasses of `tf.train.Checkpoint`, `tf.keras.layers.Layer`, and `tf.keras.Model` automatically track variables assigned to their attributes" [3]. Instead of recording variables by their global name in a graph, a `Checkpoint` records the Python object graph rooted at the objects passed into its constructor. When you save a `Checkpoint` that contains a model and an optimizer, it walks the attributes of both, finds every `tf.Variable` they own, and writes them out. On restore, the same walk is performed on the new objects, and each value is matched to a variable by its path through the object graph (the chain of attribute edges from the root) rather than by the variable's string name.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10)
])
optimizer = tf.keras.optimizers.Adam(1e-3)

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
save_path = checkpoint.save('./ckpt/ckpt')
# ... later, possibly in a different process ...
checkpoint.restore(save_path)
```

Because restoration is structural, the user can refactor variable names freely without invalidating an existing checkpoint, as long as the attribute structure of the tracked objects is preserved. Renaming a `tf.Variable` is fine. Moving a layer from one `Sequential` to another is not.
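
The difference between name based and structure based matching can be sketched in plain Python. The `Variable`, `Dense`, and `Model` classes and the `walk` helper below are hypothetical stand-ins, not TensorFlow classes; the sketch only mimics the idea of keying saved values by attribute path instead of by variable name.

```python
class Variable:
    def __init__(self, value, name):
        self.value = value
        self.name = name  # global string name, as used by name based savers


class Dense:
    def __init__(self, kernel_name):
        self.kernel = Variable(0.0, name=kernel_name)


class Model:
    def __init__(self, kernel_name):
        self.dense = Dense(kernel_name)


def walk(obj, path=()):
    """Yield (attribute_path, Variable) pairs reachable from `obj`."""
    for attr, child in sorted(vars(obj).items()):
        if isinstance(child, Variable):
            yield "/".join(path + (attr,)), child
        elif hasattr(child, "__dict__"):
            yield from walk(child, path + (attr,))


# Save the same model in both styles.
old = Model(kernel_name="weights")
old.dense.kernel.value = 3.14
by_name = {v.name: v.value for _, v in walk(old)}   # {'weights': 3.14}
by_path = {p: v.value for p, v in walk(old)}        # {'dense/kernel': 3.14}

# A refactor renames the variable but keeps the attribute structure.
new = Model(kernel_name="dense_kernel")

# Name based restore can no longer find the value.
name_restore_ok = all(v.name in by_name for _, v in walk(new))

# Structure based restore still works, because the path is unchanged.
for p, v in walk(new):
    v.value = by_path[p]

print(name_restore_ok)         # False
print(new.dense.kernel.value)  # 3.14
```

Renaming the attribute `dense` or moving the layer elsewhere in the object tree would change the path and break the structural lookup in the same way that renaming breaks the name based one.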

### What does CheckpointManager do?

In production training loops, a `tf.train.Checkpoint` is usually paired with `tf.train.CheckpointManager`, which handles rotation and the `checkpoint` index file by automatically deleting old checkpoints once their count exceeds `max_to_keep` [3]. A typical training step looks like this:

```python
manager = tf.train.CheckpointManager(
    checkpoint, directory='./ckpt', max_to_keep=3
)

for epoch in range(num_epochs):
    # ... training ...
    manager.save()

# Later, restore the latest
status = checkpoint.restore(manager.latest_checkpoint)
status.expect_partial()  # silences warnings if only part of the graph is restored
```

### How does deferred (partial) restoration work?

A powerful property of object based checkpointing is **deferred restoration**. As the TensorFlow guide explains, layer objects "may defer the creation of variables to their first call, when input shapes are available," so "a restore must happen between the variable's creation and its first use. To support this idiom, `tf.train.Checkpoint` defers restores which don't yet have a matching variable" [3]. Calling `checkpoint.restore(path)` assigns values immediately to variables that already exist; for variables that have not been created yet, it queues the restoration and applies the value as soon as a matching variable becomes trackable from the `Checkpoint` root. This is convenient for transfer learning, where the user might restore the backbone of a network from a pretrained checkpoint and then add new heads that are initialized fresh.
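
The deferral idea can be imitated in a few lines of plain Python. The `LazyDense` class below is a hypothetical toy, not how TensorFlow implements it: the layer creates its kernel on first call, and a restore that arrives before then is queued and applied between creation and first use.

```python
class Variable:
    def __init__(self, value):
        self.value = value


class LazyDense:
    """Toy layer that creates its kernel on first call, once the input size is known."""

    def __init__(self):
        self.kernel = None
        self._pending = None  # value queued by a restore that arrived too early

    def restore(self, saved_kernel):
        if self.kernel is not None:
            self.kernel.value = list(saved_kernel)  # variable exists: assign now
        else:
            self._pending = list(saved_kernel)      # defer until creation

    def __call__(self, inputs):
        if self.kernel is None:
            self.kernel = Variable([0.0] * len(inputs))  # fresh initialization
            if self._pending is not None:
                self.kernel.value = self._pending   # apply the deferred restore
                self._pending = None
        return sum(w * x for w, x in zip(self.kernel.value, inputs))


layer = LazyDense()
layer.restore([1.0, 2.0, 3.0])   # no variable yet, so the restore is queued
created_before_call = layer.kernel is not None
output = layer([1.0, 1.0, 1.0])  # kernel is created, restored, then used

print(created_before_call, output)  # False 6.0
```

Had the restore not been applied, the freshly initialized kernel of zeros would have produced `0.0` instead of `6.0`.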

## How does a checkpoint differ from a SavedModel?

A frequent source of confusion is the difference between a TensorFlow **checkpoint** and a TensorFlow **SavedModel**.

A checkpoint, whether written by `tf.train.Saver` or `tf.train.Checkpoint`, contains only the values of the model's variables. The variable data itself carries no description of the computation (the optional `.meta` graph file that `tf.train.Saver` writes next to it is a separate artifact). To use a checkpoint, the original Python code that defined the model generally has to be available. Checkpoints are lightweight and well suited for resuming training.

A SavedModel is a directory that bundles a serialized description of the computation together with a TensorFlow checkpoint of the variables, plus assets and signatures. The TensorFlow documentation states that "A SavedModel contains a complete TensorFlow program, including trained parameters (i.e, tf.Variables) and computation. It does not require the original model building code to run" [4]. Inside the directory, `saved_model.pb` stores the program and its named signatures, while the `variables/` subdirectory "contains a standard training checkpoint" [4]. SavedModels are therefore self contained: they can be loaded without the original Python source, deployed via TensorFlow Serving, converted for TensorFlow Lite or TensorFlow.js, or called from other languages such as C++ or Java. With the Keras 2 API bundled in TensorFlow 2.x releases before the switch to Keras 3, `model.save('path')` writes a SavedModel directory; Keras 3 instead exports SavedModels through `model.export('path')`.

| Aspect | Checkpoint | SavedModel |
| --- | --- | --- |
| Contains weights | Yes | Yes (in a `variables/` subdirectory that is itself a checkpoint) |
| Contains graph definition | No | Yes |
| Self contained for serving | No | Yes |
| Requires original Python code to load | Yes | No |
| Typical use | Training resumption, experiment tracking | Deployment, inference, cross language use |

## How do Keras savers work?

[Keras](/wiki/keras), the high level API now bundled with TensorFlow, exposes two main saving methods on every `Model` instance [9]:

* `model.save_weights(path)` writes only the model's weights to disk. The user is responsible for reconstructing the architecture in code before calling `model.load_weights(path)`. In Keras 2 the output format is either a TensorFlow checkpoint or an HDF5 `.h5` file, controlled by the `save_format` argument; Keras 3 expects a filename ending in `.weights.h5`.
* `model.save(path)` writes the architecture, the weights, the optimizer state, and the training configuration as a complete model artifact (a SavedModel directory, a `.keras` file in newer Keras, or an HDF5 `.h5` file). The resulting artifact can be loaded with `tf.keras.models.load_model(path)` without any architecture code.

For most Keras users, `model.save` is the right default. For users who want full control of the file format or who are loading weights into a slightly different architecture, `save_weights` is more flexible.

## How does saving work in PyTorch?

[PyTorch](/wiki/pytorch) does not have a class named `Saver`. Saving is handled by two top level functions, `torch.save` and `torch.load`, plus the `state_dict` mechanism on every `nn.Module` and optimizer [7].

A **state_dict** is a Python dictionary that maps each parameter or buffer name in a module to its tensor value. Optimizers also expose a `state_dict` containing momentum buffers and hyperparameters. The PyTorch documentation recommends saving the `state_dict` rather than the entire module, because pickling a whole module ties the checkpoint to the exact class definition at the time of saving [7]. If the class is later refactored, the saved object may fail to load.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.fc2 = nn.Linear(10, 10)

model = Net()

# Save only the weights
torch.save(model.state_dict(), 'model.pth')

# Restore
model = Net()
# weights_only=True restricts unpickling to tensors and plain containers
model.load_state_dict(torch.load('model.pth', weights_only=True))
model.eval()
```

### How do you save a general checkpoint in PyTorch?

For resuming training, PyTorch users wrap multiple pieces of state in a dictionary and pickle the dictionary with `torch.save`. The conventional file extension for this richer artifact is `.tar`, although nothing in PyTorch enforces it [8].

```python
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.tar')

# Restore
checkpoint = torch.load('checkpoint.tar')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
model.train()  # set back to training mode
```

Two small but important details trip up new users. First, `model.eval()` must be called before inference so that dropout and batch normalization layers behave correctly [7]. Second, when moving a checkpoint between CPU and GPU, the `map_location` argument to `torch.load` controls where the tensors are placed; for example, `torch.load('checkpoint.tar', map_location='cpu')` loads tensors that were saved from a GPU onto the CPU.

### Higher level wrappers

Frameworks built on PyTorch, such as PyTorch Lightning and Hugging Face Transformers, add their own checkpointing helpers that wrap `torch.save` with conventions for distributed training, sharding for large models, and integration with experiment trackers. The underlying mechanism is the same.

## What is a saver used for?

Savers, by whatever name, support a handful of recurring workflows in deep learning:

* **Training resumption.** A periodic checkpoint plus a `CheckpointManager` or equivalent allows a job that was killed at step 7,420 to restart from step 7,000 instead of step zero.
* **Best checkpoint selection.** Track validation loss during training and overwrite a `best.ckpt` file whenever the metric improves. After training ends, this file holds the best model, not the final one.
* **Transfer learning and fine tuning.** Load a checkpoint of a pretrained backbone (for example a [ResNet](/wiki/resnet) or a [BERT](/wiki/bert) encoder), freeze part of it, and attach a new task specific head. Object based saving in `tf.train.Checkpoint` and `state_dict` filtering in PyTorch both support partial loads.
* **Deployment.** Convert a final checkpoint into a serving format (SavedModel for TensorFlow Serving, TorchScript or ONNX for PyTorch) and ship that artifact rather than the checkpoint itself.
* **Reproducibility.** Pin the checkpoint, the code commit, and the data version together so that a colleague can reproduce a result a year later.
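
The best checkpoint selection workflow from the list above can be sketched with the standard library alone. The function name, the JSON payload, and the validation losses are made up for illustration; a real loop would write model and optimizer state with the framework's own saver.

```python
import json
import os
import tempfile


def train_with_best_checkpoint(val_losses, directory):
    """Overwrite `best.ckpt` whenever validation loss improves; return the best step."""
    best_loss = float("inf")
    best_step = None
    best_path = os.path.join(directory, "best.ckpt")
    for step, val_loss in enumerate(val_losses):
        # ... one epoch of training would happen here ...
        if val_loss < best_loss:
            best_loss, best_step = val_loss, step
            # Write to a temporary file first, then swap it into place, so an
            # interrupted write does not clobber the previous best checkpoint.
            tmp_path = best_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step, "val_loss": val_loss}, f)
            os.replace(tmp_path, best_path)
    return best_step, best_path


# Made-up validation losses: the model starts to overfit after step 3.
losses = [0.90, 0.61, 0.48, 0.42, 0.45, 0.51]
best_step, best_path = train_with_best_checkpoint(losses, tempfile.mkdtemp())

with open(best_path) as f:
    best_record = json.load(f)

print(best_record)  # {'step': 3, 'val_loss': 0.42}
```

After training ends, `best.ckpt` holds the state from step 3 rather than from the final step 5, which is the point of tracking the best checkpoint separately from the most recent one.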

## Practical tips

* Save more than the weights. A checkpoint that also contains the optimizer state, the global step, and the learning rate schedule allows clean resumption. A checkpoint that contains only the weights silently resets the optimizer.
* Use a `max_to_keep` style cap. Long runs at scale generate gigabytes of checkpoints. A rolling window of three to five most recent checkpoints, plus a separately tracked best checkpoint, is a sensible default.
* Pin the framework version. Across major versions of TensorFlow or PyTorch, the on disk format is generally stable, but edge cases exist. Recording the framework version next to the checkpoint avoids surprises.
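* Validate restoration. After `restore()` or `load_state_dict()`, run one batch through the model and confirm the loss or a sample prediction looks reasonable. A partial restore that silently left some variables at their initial values (for example after `expect_partial()` or `load_state_dict(..., strict=False)`) can otherwise survive into deployment.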
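
Several of these tips can be combined in a small sketch: bundle the weights with the optimizer state and step, record the framework version next to the checkpoint, and verify the file on load. The function names, the `.info.json` sidecar file, and the version string are hypothetical choices for illustration, not a convention of any framework.

```python
import hashlib
import json
import os
import pickle
import tempfile


def save_with_metadata(state, path, framework_version):
    """Write the checkpoint plus a sidecar file with version and checksum."""
    payload = pickle.dumps(state)
    with open(path, "wb") as f:
        f.write(payload)
    info = {
        "framework_version": framework_version,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "step": state["step"],
    }
    with open(path + ".info.json", "w") as f:
        json.dump(info, f)


def load_and_validate(path, framework_version):
    """Load the checkpoint, failing loudly if the file was corrupted."""
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".info.json") as f:
        info = json.load(f)
    if hashlib.sha256(payload).hexdigest() != info["sha256"]:
        raise ValueError("checkpoint does not match its recorded checksum")
    if info["framework_version"] != framework_version:
        print(f"warning: saved with {info['framework_version']}, "
              f"loading with {framework_version}")
    return pickle.loads(payload)


path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
state = {
    "step": 7000,
    "weights": [0.1, 0.2],
    "optimizer": {"momentum": [0.0, 0.0]},  # saved alongside the weights
}
save_with_metadata(state, path, framework_version="2.15.0")  # placeholder version
restored = load_and_validate(path, framework_version="2.15.0")

print(restored == state)  # True
```

A checksum only detects a damaged file; it does not replace running a batch through the restored model to confirm that the values landed in the right variables.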

## Explain like I'm 5

Imagine working on a giant Lego model that takes weeks to finish. Every evening, before going to bed, you take a careful photo of every piece and where it goes. If you knock the whole thing over, or if you have to leave home for a week, you can rebuild it exactly from the photos. A saver in machine learning is that photo. It records the position of every knob inside a learning model so that the model can pause, sleep, get moved to a new computer, or wake up months later and keep going from the same spot.

## References

1. TensorFlow Documentation. "tf.compat.v1.train.Saver." https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/Saver
2. TensorFlow Documentation. "tf.train.Checkpoint." https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint
3. TensorFlow Documentation. "Training checkpoints." https://www.tensorflow.org/guide/checkpoint
4. TensorFlow Documentation. "Using the SavedModel format." https://www.tensorflow.org/guide/saved_model
5. TensorFlow Documentation. "Migrating model checkpoints." https://www.tensorflow.org/guide/migrate/migrating_checkpoints
6. TensorFlow Documentation. "Save and load models." https://www.tensorflow.org/tutorials/keras/save_and_load
7. PyTorch Documentation. "Saving and Loading Models." https://docs.pytorch.org/tutorials/beginner/saving_loading_models.html
8. PyTorch Documentation. "Saving and loading a general checkpoint in PyTorch." https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html
9. Keras Documentation. "Save, serialize, and export models." https://keras.io/guides/serialization_and_saving/
10. TensorFlow. "TensorFlow 0.12 (r0.12) release notes: new V2 checkpoint format in tf.train.Saver." https://github.com/tensorflow/tensorflow/releases