# Root directory

> Source: https://aiwiki.ai/wiki/root_directory
> Updated: 2026-05-11
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Root directory in machine learning

In the context of machine learning, the term "root directory" does not refer to a single algorithmic concept. It refers to a directory on disk (or in object storage) that holds the artifacts a training run produces and a serving run consumes: [checkpoints](/wiki/checkpoint), event files for [TensorBoard](/wiki/tensorboard), exported [SavedModels](/wiki/savedmodel), assets, and bookkeeping metadata. In [TensorFlow](/wiki/tensorflow) specifically, several APIs take a directory argument by name, including the `model_dir` parameter of `tf.estimator.Estimator`, the `directory` parameter of `tf.train.CheckpointManager`, the `filepath` parameter of `tf.keras.callbacks.ModelCheckpoint`, and the `log_dir` parameter of `tf.keras.callbacks.TensorBoard`. The same path is often called the model directory, the checkpoint directory, the experiment directory, or simply the logdir, depending on which tool is writing into it.

The directory itself is just a folder, but how you organize it has real consequences. A clean layout lets you resume an interrupted training run from the last checkpoint, compare two experiments in TensorBoard side by side, ship a SavedModel to [TensorFlow Serving](/wiki/tensorflow_serving) without rewiring code, and avoid the silent bug where two runs scribble over each other's events. A messy layout is one of those problems you do not notice until a week into a project, by which point the cost of cleaning it up is high.

### Definition

In operating systems, the root directory is the top-level directory of a file system hierarchy. It is denoted by a forward slash (`/`) on Unix-like systems such as [Linux](/wiki/linux) and [macOS](/wiki/macos), and by a drive letter followed by a backslash (for example `C:\`) on Windows. In day-to-day machine learning work, however, "root directory" is rarely used in that strict OS sense. It usually means the root of a project or the root of an experiment: the top folder under which everything else for that workload lives.

A project root for a typical TensorFlow repository tends to look like this:

```
my_project/
  data/
  src/
  configs/
  notebooks/
  experiments/
    run_2026_05_10_baseline/
      checkpoints/
      logs/train/
      logs/validation/
      saved_model/
      config.json
```

The top folder is the project root. Each subdirectory under `experiments/` is an experiment root, also called a run directory. TensorFlow APIs do not care about the project root. They care about the run directory, because that is what gets passed to `model_dir`, `log_dir`, or `tf.train.CheckpointManager`.
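One way to keep these paths consistent is to derive all of them from the run directory in a single place. The sketch below is a minimal illustration using only `pathlib`; the `RunPaths` class and its property names are hypothetical and simply mirror the example layout above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunPaths:
    """Paths inside one run directory (names mirror the example layout)."""

    root: Path

    @property
    def checkpoints(self) -> Path:
        # Candidate for the CheckpointManager `directory` argument.
        return self.root / "checkpoints"

    @property
    def logs(self) -> Path:
        # Candidate for `log_dir`; train/ and validation/ live underneath.
        return self.root / "logs"

    @property
    def saved_model(self) -> Path:
        # Candidate export directory for the SavedModel.
        return self.root / "saved_model"

    @property
    def config(self) -> Path:
        # Copy of the configuration used for this run.
        return self.root / "config.json"


paths = RunPaths(Path("my_project/experiments/run_2026_05_10_baseline"))
print(paths.checkpoints)
print(paths.logs / "train")
```

Deriving every location from one `root` means a new experiment only needs a new run directory name; nothing else in the training script has to change.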

## The role of the directory in TensorFlow

### model_dir in tf.estimator

`tf.estimator.Estimator` takes a `model_dir` argument in its constructor. The first time you call `train()`, TensorFlow writes an initial checkpoint into that directory, plus a `graph.pbtxt`, plus an `events.out.tfevents.*` event file for TensorBoard. On subsequent calls to `train()`, `evaluate()`, or `predict()`, the Estimator rebuilds the model from the latest checkpoint in `model_dir`. If you do not pass `model_dir`, the Estimator creates a temporary directory using Python's `tempfile.mkdtemp`, which is fine for a quick test but useless if you want to resume training later.

A fresh Estimator run produces a directory roughly like this:

```
model_dir/
  checkpoint
  graph.pbtxt
  model.ckpt-0.data-00000-of-00001
  model.ckpt-0.index
  model.ckpt-0.meta
  model.ckpt-1000.data-00000-of-00001
  model.ckpt-1000.index
  model.ckpt-1000.meta
  events.out.tfevents.1715200000.hostname
```

Checkpoint frequency and retention are controlled by `tf.estimator.RunConfig`. By default the Estimator saves a checkpoint every 600 seconds and keeps the five most recent. You can change the schedule with either `save_checkpoints_steps` or `save_checkpoints_secs` (the two cannot both be set) and the retention count with `keep_checkpoint_max`.
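The file names in the listing above encode the global step of each checkpoint, so it is possible to see which steps are still on disk without loading TensorFlow. The following is a rough sketch that assumes the `model.ckpt-<step>.index` naming shown above; `checkpoint_steps` is a hypothetical helper, not part of any TensorFlow API.

```python
import re
import tempfile
from pathlib import Path

# Matches index files such as "model.ckpt-1000.index" and captures the step.
_INDEX_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir):
    """Return the sorted global steps that have an .index file in model_dir."""
    steps = []
    for entry in Path(model_dir).iterdir():
        match = _INDEX_PATTERN.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


# Demo on a throwaway directory that mimics part of the listing above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint", "graph.pbtxt",
                 "model.ckpt-0.index", "model.ckpt-1000.index"):
        (Path(tmp) / name).touch()
    demo_steps = checkpoint_steps(tmp)
    print(demo_steps)
```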

### tf.train.Checkpoint and CheckpointManager

Outside of Estimator, the modern object-based API is `tf.train.Checkpoint` paired with `tf.train.CheckpointManager`. You build a `Checkpoint` object that tracks Python objects (a model, an optimizer, a step counter, an iterator), then hand it to a manager along with a directory:

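```python
import tensorflow as tf
# `net` (the model) and `opt` (its optimizer) are assumed to have been built already.
ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)
```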

Each call to `manager.save()` writes a new pair of files under `./tf_ckpts` with a numeric suffix tied to the manager's save counter:

```
tf_ckpts/
  checkpoint
  ckpt-1.data-00000-of-00001
  ckpt-1.index
  ckpt-2.data-00000-of-00001
  ckpt-2.index
  ckpt-3.data-00000-of-00001
  ckpt-3.index
```

The `checkpoint` file is a small text file that records which prefixes exist and which one is the latest. It is the file `tf.train.latest_checkpoint('./tf_ckpts')` reads to answer the question of where to resume from. The `.index` file holds metadata about the variables stored in the checkpoint. The `.data-00000-of-00001` shard holds the actual tensor values, sharded only when training runs in a distributed setup. Strings like `./tf_ckpts/ckpt-2` are checkpoint prefixes, not real files; passing the prefix to `ckpt.restore()` tells TensorFlow to load the index and data files that share that prefix.
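To make the role of the `checkpoint` state file concrete, here is a simplified sketch that reads it by hand. It assumes the file contains a line of the form `model_checkpoint_path: "ckpt-3"`; the helper name `latest_prefix_from_state_file` is hypothetical, and real code should keep using `tf.train.latest_checkpoint` rather than this parser.

```python
import re
import tempfile
from pathlib import Path


def latest_prefix_from_state_file(ckpt_dir):
    """Return the latest checkpoint prefix recorded in `checkpoint`, or None."""
    state_file = Path(ckpt_dir) / "checkpoint"
    if not state_file.is_file():
        return None
    for line in state_file.read_text().splitlines():
        match = re.match(r'^model_checkpoint_path:\s*"(.*)"\s*$', line)
        if match:
            # Relative entries are resolved against the checkpoint directory.
            return str(Path(ckpt_dir) / match.group(1))
    return None


# Demo with a hand-written state file (contents are an assumed example).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "checkpoint").write_text(
        'model_checkpoint_path: "ckpt-3"\n'
        'all_model_checkpoint_paths: "ckpt-2"\n'
        'all_model_checkpoint_paths: "ckpt-3"\n'
    )
    demo_latest = latest_prefix_from_state_file(tmp)
    demo_expected = str(Path(tmp) / "ckpt-3")
    print(demo_latest)
```

Note that the returned string is a prefix: no file with exactly that name exists on disk.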

`max_to_keep` controls how many recent checkpoints survive; older ones are deleted automatically. If you want to preserve a specific checkpoint forever (for example the one tied to a published paper or a deployed model), look up its prefix in the `manager.checkpoints` property and copy the files that share that prefix somewhere else, because the manager will eventually rotate it out otherwise.
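Copying a checkpoint out of the managed directory means copying every file that shares its prefix. A minimal sketch using only the standard library is shown below; `archive_checkpoint` and the `archive` directory name are hypothetical, and the code assumes the `<prefix>.index` / `<prefix>.data-*` naming described earlier.

```python
import shutil
import tempfile
from pathlib import Path


def archive_checkpoint(prefix, archive_dir):
    """Copy every file that belongs to checkpoint `prefix` into archive_dir."""
    prefix = Path(prefix)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # A prefix like "tf_ckpts/ckpt-2" owns "ckpt-2.index" and "ckpt-2.data-*".
    # The trailing "." in the pattern keeps "ckpt-20.index" from matching.
    for src in sorted(prefix.parent.glob(prefix.name + ".*")):
        shutil.copy2(src, archive_dir / src.name)
        copied.append(src.name)
    return copied


# Demo with empty placeholder files standing in for real checkpoint shards.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "tf_ckpts"
    ckpt_dir.mkdir()
    for name in ("checkpoint", "ckpt-2.index",
                 "ckpt-2.data-00000-of-00001", "ckpt-20.index"):
        (ckpt_dir / name).touch()
    demo_copied = archive_checkpoint(ckpt_dir / "ckpt-2", Path(tmp) / "archive")
    print(demo_copied)
```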

### tf.keras.callbacks.ModelCheckpoint

For [Keras](/wiki/keras) models, the equivalent is `tf.keras.callbacks.ModelCheckpoint`. The callback takes a `filepath` argument that can be either a directory or a templated file pattern with Python `str.format` placeholders:

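```python
import tensorflow as tf

# Save weights after every epoch; {epoch:04d} is filled in by the callback.
# The .ckpt suffix selects the TensorFlow checkpoint format in older tf.keras.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/cp-{epoch:04d}.ckpt',
    save_weights_only=True,
    save_freq='epoch',
)
```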

The extension matters. A `.h5` suffix triggers the older HDF5 single-file format. A `.keras` suffix uses the modern Keras zip format. In TensorFlow 2.x releases that ship Keras 2, a bare path with no extension, or `.ckpt`, uses the TensorFlow checkpoint format and writes the same `.index` and `.data-*` shards described above. In Keras 3, `save_weights_only=True` requires the path to end in `.weights.h5`, which can surprise users migrating from older code.
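The suffix rules in the previous paragraph can be restated as a small lookup. This is only an illustration of the text above, not the logic Keras itself runs; `checkpoint_format` is a hypothetical helper, and the actual behavior depends on the installed Keras version.

```python
def checkpoint_format(filepath):
    """Map a ModelCheckpoint filepath to the format the text above describes."""
    # Check the longer suffix first, because ".weights.h5" also ends in ".h5".
    if filepath.endswith(".weights.h5"):
        return "HDF5 weights file"
    if filepath.endswith(".h5"):
        return "HDF5 single-file format"
    if filepath.endswith(".keras"):
        return "Keras zip format"
    # Bare paths and ".ckpt" fall through to the TensorFlow checkpoint format.
    return "TensorFlow checkpoint format"


for path in ("model.h5", "model.keras", "cp-{epoch:04d}.ckpt", "cp.weights.h5"):
    print(path, "->", checkpoint_format(path))
```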

### TensorBoard logdir and event files

TensorBoard reads the run directory through its `--logdir` flag and walks it recursively, looking for files whose names contain `tfevents`. Each subdirectory that contains event files is treated as a single run and gets its own line in the TensorBoard UI. A typical layout for a training run with separate train and validation summaries looks like:

```
logs/
  train/
    events.out.tfevents.1715200000.hostname.0.v2
  validation/
    events.out.tfevents.1715200000.hostname.1.v2
```

The convention behind this layout is the reason `tf.keras.callbacks.TensorBoard` writes train and validation summaries into sibling subdirectories: TensorBoard treats them as two separate runs and draws them as two separate curves on the same chart. If you put both event files in the same directory, TensorBoard tries to stitch them together as one run, which is what you want when a training job crashes and resumes, but not what you want when one stream is training loss and the other is validation loss.

A common multi-experiment layout is:

```
logs/
  baseline/
    train/
    validation/
  larger_lr/
    train/
    validation/
```

Launching `tensorboard --logdir logs/` shows all four runs at once, each named by its path relative to the logdir (`baseline/train`, `larger_lr/validation`, and so on) and drawn in its own color. This is the path of least friction for comparing runs.
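The run discovery rule described in this section (a run is any subdirectory that contains a file with `tfevents` in its name) is easy to approximate by hand, which helps when checking a layout before starting TensorBoard. The sketch below is a simplified stand-in, not TensorBoard's own implementation; `find_runs` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def find_runs(logdir):
    """Return run names: directories under logdir that hold tfevents files."""
    runs = []
    for dirpath, _dirnames, filenames in os.walk(logdir):
        if any("tfevents" in name for name in filenames):
            # Name each run by its path relative to the logdir.
            rel = os.path.relpath(dirpath, logdir)
            runs.append(Path(rel).as_posix())
    return sorted(runs)


# Demo that rebuilds the multi-experiment layout with placeholder event files.
with tempfile.TemporaryDirectory() as tmp:
    for experiment in ("baseline", "larger_lr"):
        for split in ("train", "validation"):
            run_dir = Path(tmp) / experiment / split
            run_dir.mkdir(parents=True)
            (run_dir / "events.out.tfevents.1715200000.hostname").touch()
    demo_runs = find_runs(tmp)
    print(demo_runs)
```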

### SavedModel directory

When you export a model for serving or for [TensorFlow Lite](/wiki/tensorflow_lite) conversion, TensorFlow writes a SavedModel into a directory. The structure is fixed and is what tools like [TensorFlow Serving](/wiki/tensorflow_serving), [TensorFlow.js](/wiki/tensorflow_js), and [TensorFlow Hub](/wiki/tensorflow_hub) expect:

```
saved_model/
  saved_model.pb
  fingerprint.pb
  variables/
    variables.data-00000-of-00001
    variables.index
  assets/
  assets.extra/
```

`saved_model.pb` is the serialized graph and the named signatures (the input and output tensor specs for each callable function). `variables/` holds the trained values in the standard checkpoint format. `assets/` holds files the graph needs at load time, such as the vocabulary text files used by lookup tables. `assets.extra/` is reserved for files that are not used by the graph but might be useful to the consumer (model cards, license notes, conversion hints). `fingerprint.pb` is a small set of 64-bit hashes that uniquely identify the SavedModel's contents and is read with `tf.saved_model.experimental.read_fingerprint`.

TensorFlow Serving expects a version subdirectory above this layout: `models/my_model/1/`, `models/my_model/2/`, and so on. The integer is the model version, and serving automatically picks up the highest numbered subdirectory unless you tell it otherwise. If you write a SavedModel directly into `models/my_model/` without a version folder, serving refuses to load it.
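The "highest numbered subdirectory" rule compares versions as integers, so `10` beats `2` even though it sorts earlier as a string. A short sketch of that selection is shown below; `latest_version_dir` is a hypothetical helper written for illustration and is not how TensorFlow Serving is implemented.

```python
import tempfile
from pathlib import Path


def latest_version_dir(model_base_path):
    """Return the highest-numbered version subdirectory, or None."""
    versions = [
        entry for entry in Path(model_base_path).iterdir()
        if entry.is_dir() and entry.name.isdecimal()
    ]
    if not versions:
        return None
    # Compare numerically so that "10" ranks above "2".
    return max(versions, key=lambda entry: int(entry.name))


# Demo: three version folders plus one folder that is not a version.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1", "2", "10", "notes"):
        (Path(tmp) / name).mkdir()
    demo_latest_version = latest_version_dir(tmp).name
    print(demo_latest_version)
```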

## Remote root directories

None of these TensorFlow APIs require a local path. They all accept URIs that the underlying `tf.io.gfile` layer understands, which means you can point `model_dir`, `log_dir`, `filepath`, or the `directory` argument at object storage. The common schemes are `gs://` for [Google Cloud Storage](/wiki/google_cloud_storage), `s3://` for [Amazon S3](/wiki/amazon_s3), and `hdfs://` for HDFS. Since TensorFlow 2.6, support for `s3://` and `hdfs://` lives in the separate `tensorflow-io` package rather than in core TensorFlow.

```python
import tensorflow as tf

checkpoint_path = 'gs://my-bucket/runs/2026_05_10/save_at_{epoch}'
callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path)
```

The API surface is the same as for a local path. `tf.train.latest_checkpoint('gs://bucket/run/')` works. `tf.saved_model.save(model, 's3://bucket/model/1/')` works. What changes is the failure mode. Object storage is eventually consistent for listings on some providers, which can make `tf.train.latest_checkpoint` return a stale answer right after a save. Latency is higher, so writing a checkpoint every few seconds is much more expensive than writing one every few minutes. Authentication needs to be in place: a service account JSON pointed to by `GOOGLE_APPLICATION_CREDENTIALS` for `gs://`, IAM credentials in the environment or instance profile for `s3://`.
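Because the failure modes differ, training scripts sometimes branch on whether the root directory is local or remote, for example to pick a longer checkpoint interval. A minimal sketch of that check follows; `is_remote` and `REMOTE_SCHEMES` are hypothetical names, and the scheme list covers only the three schemes named above.

```python
from urllib.parse import urlsplit

# Only the schemes mentioned in the text; extend as needed.
REMOTE_SCHEMES = {"gs", "s3", "hdfs"}


def is_remote(path):
    """Return True when path uses one of the object-store or HDFS schemes."""
    return urlsplit(path).scheme.lower() in REMOTE_SCHEMES


for candidate in ("gs://my-bucket/runs/2026_05_10", "./tf_ckpts", "C:\\runs\\a"):
    print(candidate, "->", is_remote(candidate))
```

Matching against an explicit set, rather than treating any scheme as remote, keeps Windows drive letters such as `C:` from being misread as a URI scheme.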

On [Amazon SageMaker](/wiki/amazon_sagemaker) the convention is to write checkpoints to the local path `/opt/ml/checkpoints`, which SageMaker syncs to an S3 location specified in the job configuration. This works around the latency and consistency issues by keeping the training loop local and offloading the upload to a separate process. Managed Spot Training uses the same path to recover after a spot interruption.

## Conventions and pitfalls

A few patterns show up repeatedly in TensorFlow codebases that have been around for more than a few months.

### One run, one directory

The single most useful convention is that every training run gets its own directory, and that directory is never reused. Name it with a timestamp, a short description, or both. Inside that directory, put the checkpoints, the event files, the exported SavedModel, and a copy of the config you used. If you have to come back six months later to figure out what hyperparameters produced a specific result, this is the layout that will tell you.

When runs share a directory, TensorBoard tries to stitch their event files into one run and the resulting curves are a mess. Checkpoints from one run can also be picked up by `tf.train.latest_checkpoint` on a different run, which leads to confusing behavior where the model resumes from somewhere unexpected.
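One way to enforce the "never reused" rule is to let directory creation fail when the name already exists. The sketch below assumes a timestamp-plus-description naming scheme; `create_run_dir`, its `now` parameter, and the subdirectory names are illustrative choices, not a TensorFlow convention.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def create_run_dir(experiments_root, description, now=None):
    """Create and return a fresh run directory; fail if it already exists."""
    now = now or datetime.now()
    stamp = now.strftime("%Y_%m_%d_%H%M%S")
    run_dir = Path(experiments_root) / f"run_{stamp}_{description}"
    # exist_ok=False makes accidental reuse raise FileExistsError.
    run_dir.mkdir(parents=True, exist_ok=False)
    for sub in ("checkpoints", "logs", "saved_model"):
        (run_dir / sub).mkdir()
    return run_dir


# Demo in a throwaway experiments root.
with tempfile.TemporaryDirectory() as tmp:
    demo_run = create_run_dir(tmp, "baseline")
    print(demo_run.name)
```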

### Trailing slashes and prefixes

A `tf.train.CheckpointManager` `directory` argument is a directory. A `tf.train.Checkpoint.save()` `file_prefix` argument is a prefix, not a directory. Passing `'./ckpts'` to `Checkpoint.save` creates files named `ckpts-1`, `ckpts-2`, etc. in the current working directory, which is almost never what you want. A trailing slash does not help: `'./ckpts/'` writes into `ckpts/`, but the file names then begin with a bare `-1`, `-2`, and so on. The fix is either to pass `'./ckpts/ckpt'` (so files are named `./ckpts/ckpt-1`, etc.) or to use `CheckpointManager`, which handles the prefix internally.
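The difference between a directory and a prefix is easiest to see by working out where the files would land. The sketch below assumes checkpoints are named `<file_prefix>-<counter>` as described above and only manipulates strings; `files_for_save` is a hypothetical helper and does not call TensorFlow.

```python
import os


def files_for_save(file_prefix, save_counter):
    """Return (directory, index file name) implied by a checkpoint prefix."""
    checkpoint_path = f"{file_prefix}-{save_counter}"
    directory, base = os.path.split(checkpoint_path)
    return directory or ".", base + ".index"


for prefix in ("./ckpts", "./ckpts/ckpt", "./ckpts/"):
    print(repr(prefix), "->", files_for_save(prefix, 1))
```

Only the second prefix yields a tidy `ckpt-1.index` inside `./ckpts`; the first lands in the working directory and the third produces a file name that starts with a hyphen.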

### Permissions in containers

In [Docker](/wiki/docker) containers the user inside the container often does not have write permission on a mounted host directory. The resulting error is not always obvious, but if your training loop fails on the first checkpoint save and you see `Permission denied`, this is usually the cause. Mount the directory with appropriate ownership or run the container as the host user.
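A cheap safeguard is to test that the directory is writable before training starts, so the failure appears immediately rather than at the first checkpoint. The following is a minimal sketch for local paths only; `ensure_writable` is a hypothetical helper and would not work for `gs://` or `s3://` URIs.

```python
import tempfile
from pathlib import Path


def ensure_writable(directory):
    """Create directory if needed and verify a file can be written into it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    try:
        # Creating and deleting a real file also catches read-only mounts.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as err:
        raise PermissionError(
            f"cannot write checkpoints to {directory}: {err}"
        ) from err
    return directory


# Demo against a throwaway directory, which is expected to be writable.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = ensure_writable(Path(tmp) / "run" / "checkpoints")
    print(demo_dir.name)
```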

### Avoiding stale checkpoints across experiments

When you change your model code and rerun training against an existing `model_dir`, TensorFlow will try to restore variables from the existing checkpoint. If the variable names or shapes have changed, you get cryptic shape mismatch errors. The safe move is to either delete the old directory or write the new run into a new directory. Reusing a `model_dir` only makes sense when you genuinely want to resume the same model from the same point.
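This rule can be turned into an explicit guard at the top of a training script. The sketch below treats the presence of a `checkpoint` state file as the sign that a directory has been used before; `check_model_dir` and its `resume` flag are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def check_model_dir(model_dir, resume=False):
    """Raise if model_dir already holds a checkpoint and resume is False."""
    model_dir = Path(model_dir)
    state_file = model_dir / "checkpoint"
    if state_file.exists() and not resume:
        raise RuntimeError(
            f"{model_dir} already contains checkpoints; "
            "pass resume=True or choose a new directory"
        )
    return model_dir


# Demo: a fresh directory passes, a used one needs resume=True.
with tempfile.TemporaryDirectory() as tmp:
    check_model_dir(tmp)
    (Path(tmp) / "checkpoint").touch()
    demo_resumed = check_model_dir(tmp, resume=True)
    print(demo_resumed == Path(tmp))
```

Making resumption an explicit opt-in means a stale directory produces a clear message up front instead of a shape mismatch deep inside variable restoration.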

## Directory structure reference

The following table summarizes which TensorFlow APIs write to a directory, what argument name they use, and what they put there.

| API | Argument | Typical contents |
| --- | --- | --- |
| [tf.estimator.Estimator](/wiki/estimator) | `model_dir` | `checkpoint`, `model.ckpt-*` files, `graph.pbtxt`, event files |
| `tf.train.CheckpointManager` | `directory` | `checkpoint`, `ckpt-N.index`, `ckpt-N.data-*` |
| `tf.train.Checkpoint.save` | `file_prefix` | Files named `<prefix>-N.index` and `<prefix>-N.data-*` |
| `tf.keras.callbacks.ModelCheckpoint` | `filepath` | `.weights.h5`, `.keras`, or TF checkpoint files |
| `tf.keras.callbacks.TensorBoard` | `log_dir` | `events.out.tfevents.*` and profiler traces |
| `tf.summary.create_file_writer` | `logdir` | `events.out.tfevents.*` |
| `tf.saved_model.save` | `export_dir` | `saved_model.pb`, `variables/`, `assets/`, `fingerprint.pb` |
| `tf.keras.Model.save` | `filepath` | `.keras` zip, or a SavedModel directory |
| `model.save_weights` | `filepath` | `.weights.h5` or sharded `.weights.json` plus shard files |

## Explain like I'm 5 (ELI5)

Imagine you are doing a big school project that takes weeks. You keep all your notes, drawings, and drafts in one folder so nothing gets lost. The root directory for a machine learning project is that folder. Inside it, you have a smaller folder for each time you sat down to work, and inside each of those, you have your saved progress (checkpoints), your sketches that show how the project is going (event files for TensorBoard), and your final clean version ready to hand in (the SavedModel). If you mix everything together in one big pile, you cannot tell which scribbles belong to which day. If you give each work session its own folder, you can always go back to the right one.

## References

- TensorFlow, "Training checkpoints," https://www.tensorflow.org/guide/checkpoint
- TensorFlow, "Using the SavedModel format," https://www.tensorflow.org/guide/saved_model
- TensorFlow, "Estimators," https://www.tensorflow.org/guide/estimator
- TensorFlow API, "tf.train.CheckpointManager," https://www.tensorflow.org/api_docs/python/tf/train/CheckpointManager
- TensorFlow API, "tf.train.latest_checkpoint," https://www.tensorflow.org/api_docs/python/tf/train/latest_checkpoint
- TensorFlow API, "tf.keras.callbacks.ModelCheckpoint," https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
- Keras documentation, "ModelCheckpoint," https://keras.io/api/callbacks/model_checkpoint/
- Keras documentation, "Weights only saving and loading," https://keras.io/api/models/model_saving_apis/weights_saving_and_loading/
- TensorBoard, "README," https://github.com/tensorflow/tensorboard/blob/master/README.md
- TensorFlow Blog, "Train your TensorFlow model on Google Cloud using TensorFlow Cloud," https://blog.tensorflow.org/2020/08/train-your-tensorflow-model-on-google.html
- AWS Machine Learning Blog, "Implement checkpointing with TensorFlow for Amazon SageMaker Managed Spot Training," https://aws.amazon.com/blogs/machine-learning/implement-checkpointing-with-tensorflow-for-amazon-sagemaker-managed-spot-training/

