# Root directory

> Source: https://aiwiki.ai/wiki/root_directory
> Updated: 2026-06-29
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Root directory in machine learning

In machine learning, a root directory is the top-level folder, on a local disk or in object storage, under which a training run writes the artifacts it produces and a serving run later reads: [checkpoints](/wiki/checkpoint), event files for [TensorBoard](/wiki/tensorboard), exported [SavedModels](/wiki/savedmodel), assets, and bookkeeping metadata. It is not a single algorithmic concept; it is the directory you pass to a framework's directory argument so every file from one experiment lands in one place. In [TensorFlow](/wiki/tensorflow) specifically, several APIs take such a path by name, including the `model_dir` parameter of `tf.estimator.Estimator`, the `directory` parameter of `tf.train.CheckpointManager`, the `filepath` parameter of `tf.keras.callbacks.ModelCheckpoint` (a file path or pattern inside the directory), and the `log_dir` parameter of `tf.keras.callbacks.TensorBoard` [1][3][4][6]. The same path is often called the model directory, the checkpoint directory, the experiment directory, or simply the logdir, depending on which tool is writing into it.

The directory itself is just a folder, but how you organize it has real consequences. A clean layout lets you resume an interrupted training run from the last checkpoint, compare two experiments in TensorBoard side by side, ship a SavedModel to [TensorFlow Serving](/wiki/tensorflow_serving) without rewiring code, and avoid the silent bug where two runs scribble over each other's events. A messy layout is one of those problems you do not notice until a week into a project, by which point the cost of cleaning it up is high.

### What is the root directory in an operating system versus in machine learning?

In operating systems, the root directory is the top-level directory of a file system hierarchy. It is denoted by a forward slash (`/`) on Unix-like systems such as [Linux](/wiki/linux) and [macOS](/wiki/macos), and by a drive letter followed by a backslash (for example `C:\`) on Windows. In day-to-day machine learning work, however, "root directory" is rarely used in that strict OS sense. It usually means the root of a project or the root of an experiment: the top folder under which everything else for that workload lives.

A project root for a typical TensorFlow repository tends to look like this:

```
my_project/
  data/
  src/
  configs/
  notebooks/
  experiments/
    run_2026_05_10_baseline/
      checkpoints/
      logs/train/
      logs/validation/
      saved_model/
      config.json
```

The top folder is the project root. Each subdirectory under `experiments/` is an experiment root, also called a run directory. TensorFlow APIs do not care about the project root. They care about the run directory, because that is what gets passed to `model_dir`, `log_dir`, or `tf.train.CheckpointManager`.
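As a rough sketch of that layout, the snippet below creates one run directory using only the Python standard library. The helper name `make_run_dir`, the `learning_rate` config value, and the temporary directory standing in for the project root are illustrative assumptions, not part of any TensorFlow API.

```python
import json
import tempfile
from pathlib import Path

# Subdirectories taken from the example layout above.
RUN_SUBDIRS = ("checkpoints", "logs/train", "logs/validation", "saved_model")


def make_run_dir(experiments_root, run_name, config):
    """Create one run directory with the example layout and return its path."""
    run_dir = Path(experiments_root) / run_name
    for sub in RUN_SUBDIRS:
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    # Keep a copy of the configuration next to the artifacts it produced.
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    return run_dir


# Demo in a throwaway directory so nothing is written into a real project.
project_root = Path(tempfile.mkdtemp())
run_dir = make_run_dir(project_root / "experiments",
                       "run_2026_05_10_baseline",
                       {"learning_rate": 0.001})  # hypothetical config
created = sorted(p.relative_to(run_dir).as_posix() for p in run_dir.rglob("*"))
print(created)
```

The subdirectories of `run_dir` are the paths you would then hand to the framework arguments discussed in the rest of this article.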

## How does the root directory work in TensorFlow?

### model_dir in tf.estimator

`tf.estimator.Estimator` takes a `model_dir` argument in its constructor. The first time you call `train()`, TensorFlow writes an initial checkpoint into that directory, plus a `graph.pbtxt`, plus an `events.out.tfevents.*` event file for TensorBoard [3]. On subsequent calls to `train()`, `evaluate()`, or `predict()`, the Estimator rebuilds the model from the latest checkpoint in `model_dir`. If you do not pass `model_dir`, the Estimator creates a temporary directory using Python's `tempfile.mkdtemp`, which is fine for a quick test but useless if you want to resume training later.

A fresh Estimator run produces a directory roughly like this:

```
model_dir/
  checkpoint
  graph.pbtxt
  model.ckpt-0.data-00000-of-00001
  model.ckpt-0.index
  model.ckpt-0.meta
  model.ckpt-1000.data-00000-of-00001
  model.ckpt-1000.index
  model.ckpt-1000.meta
  events.out.tfevents.1715200000.hostname
```

The schedule is controlled by `tf.estimator.RunConfig`. By default, when neither `save_checkpoints_steps` nor `save_checkpoints_secs` is set, a checkpoint is written every 600 seconds, and `keep_checkpoint_max` defaults to 5, so the five most recent checkpoints are kept and older ones are deleted as new ones appear [3]. You can override the interval with either `save_checkpoints_steps` or `save_checkpoints_secs` (set at most one of the two) and the cap with `keep_checkpoint_max`. One subtle trap: if you lower the save interval (for example to 60 seconds) without raising `keep_checkpoint_max`, the five-checkpoint cap means your oldest surviving checkpoint is never more than about five minutes old.
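The retention arithmetic can be made explicit. The helper below is a hypothetical illustration (it is not part of `RunConfig`) and assumes that saves land exactly on schedule and that the cap has already been reached.

```python
def checkpoint_retention_window(save_every_secs, keep_checkpoint_max):
    """Return (min_age, max_age) in seconds of the oldest checkpoint on disk."""
    # Right after a save, the oldest survivor is (N - 1) intervals old.
    min_age = save_every_secs * (keep_checkpoint_max - 1)
    # Just before the next save, it has aged by one more interval.
    max_age = save_every_secs * keep_checkpoint_max
    return min_age, max_age


default_window = checkpoint_retention_window(600, 5)  # defaults quoted above
fast_window = checkpoint_retention_window(60, 5)      # 60-second interval
print(default_window, fast_window)
```

With the defaults, the oldest checkpoint on disk is between 40 and 50 minutes old; with a 60-second interval it is between 4 and 5 minutes old, which leaves little room to roll back past a training run that started diverging a while ago.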

### tf.train.Checkpoint and CheckpointManager

Outside of Estimator, the modern object-based API is `tf.train.Checkpoint` paired with `tf.train.CheckpointManager`. As the TensorFlow guide puts it, "Checkpoints capture the exact value of all parameters (tf.Variable objects) used by a model" [1]. You build a `Checkpoint` object that tracks Python objects (a model, an optimizer, a step counter, an iterator), then hand it to a manager along with a directory:

```python
ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)
```

Each call to `manager.save()` writes a new pair of files under `./tf_ckpts` with a numeric suffix tied to the manager's save counter:

```
tf_ckpts/
  checkpoint
  ckpt-1.data-00000-of-00001
  ckpt-1.index
  ckpt-2.data-00000-of-00001
  ckpt-2.index
  ckpt-3.data-00000-of-00001
  ckpt-3.index
```

The `checkpoint` file is a small text file that records which prefixes exist and which one is the latest; the TensorFlow guide notes that "these prefixes are grouped together in a single checkpoint file where the CheckpointManager saves its state" [1]. It is the file `tf.train.latest_checkpoint('./tf_ckpts')` reads to answer the question of where to resume from [5]. The `.index` file holds metadata about the variables stored in the checkpoint. The `.data-00000-of-00001` file holds the actual tensor values; the values are split across more than one data shard only when training runs in a distributed setup. Per the guide, paths like `./tf_ckpts/ckpt-2` "are not files on disk. Instead they are prefixes for an index file and one or more data files which contain the variable values" [1]; passing the prefix to `ckpt.restore()` tells TensorFlow to load the index and data files that share that prefix.
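To make the roles of these files concrete, here is a minimal standard-library sketch that reads a `checkpoint` state file and lists the files behind a prefix. It assumes the plain-text `key: "value"` layout with `model_checkpoint_path` and `all_model_checkpoint_paths` entries; the helper names and the empty stub files are made up for the demonstration. In real code, `tf.train.latest_checkpoint` is the supported way to get this answer.

```python
import glob
import re
import tempfile
from pathlib import Path

# Matches lines such as: model_checkpoint_path: "ckpt-3"
_LINE = re.compile(r'^(\w+):\s*"(.*)"$')


def read_checkpoint_state(directory):
    """Return (latest_prefix, all_prefixes) from a directory's checkpoint file."""
    latest, all_prefixes = None, []
    for line in (Path(directory) / "checkpoint").read_text().splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # skip fields without a quoted value, e.g. timestamps
        key, value = match.groups()
        if key == "model_checkpoint_path":
            latest = value
        elif key == "all_model_checkpoint_paths":
            all_prefixes.append(value)
    return latest, all_prefixes


def files_for_prefix(directory, prefix):
    """List the on-disk files that share a checkpoint prefix."""
    return sorted(p.name for p in Path(directory).glob(glob.escape(prefix) + ".*"))


# Stub directory that imitates the listing above (files are empty).
demo = Path(tempfile.mkdtemp())
(demo / "checkpoint").write_text(
    'model_checkpoint_path: "ckpt-3"\n'
    'all_model_checkpoint_paths: "ckpt-2"\n'
    'all_model_checkpoint_paths: "ckpt-3"\n'
)
for name in ("ckpt-2.index", "ckpt-2.data-00000-of-00001",
             "ckpt-3.index", "ckpt-3.data-00000-of-00001"):
    (demo / name).touch()

latest, prefixes = read_checkpoint_state(demo)
latest_files = files_for_prefix(demo, latest)
print(latest, prefixes, latest_files)
```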

`max_to_keep` is the knob that controls how many recent checkpoints survive: the manager "deletes old checkpoints," keeping only the most recent ones [1]. If you want to preserve a specific checkpoint forever (for example the one tied to a published paper or a deployed model), read the `manager.checkpoints` property to get the list of managed prefixes and copy the files for the relevant prefix out, because the manager will eventually rotate it out otherwise.
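Copying a checkpoint out means copying every file that shares its prefix. The following is a minimal sketch using only the standard library; `preserve_checkpoint` and the stub file names are hypothetical, and the prefix would normally come from the manager rather than being typed by hand.

```python
import glob
import shutil
import tempfile
from pathlib import Path


def preserve_checkpoint(prefix, dest_dir):
    """Copy every file that shares `prefix` into `dest_dir`; return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    # The "." after the prefix keeps ckpt-1 from also matching ckpt-10.
    for src in sorted(glob.glob(glob.escape(str(prefix)) + ".*")):
        shutil.copy2(src, dest / Path(src).name)
        copied.append(Path(src).name)
    return copied


# Stub checkpoint directory with empty files, for demonstration only.
ckpt_dir = Path(tempfile.mkdtemp())
for name in ("ckpt-1.index", "ckpt-1.data-00000-of-00001", "ckpt-10.index"):
    (ckpt_dir / name).touch()

archive = ckpt_dir / "archive"
copied = preserve_checkpoint(ckpt_dir / "ckpt-1", archive)
print(copied)
```

The archived copy has no `checkpoint` state file of its own, so to load it later you would pass the prefix (here `archive/ckpt-1`) directly to `ckpt.restore()`.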

### tf.keras.callbacks.ModelCheckpoint

For [Keras](/wiki/keras) models, the equivalent is `tf.keras.callbacks.ModelCheckpoint`. The callback takes a `filepath` argument that can be either a directory or a templated file pattern with Python `str.format` placeholders [6]:

```python
tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/cp-{epoch:04d}.ckpt',
    save_weights_only=True,
    save_freq='epoch',
)
```

The extension matters. A `.h5` suffix triggers the older HDF5 single-file format. A `.keras` suffix uses the modern Keras zip format. A bare path with no extension, or `.ckpt`, uses the TensorFlow checkpoint format and writes the same `.index` and `.data-*` shards described above. In recent versions of Keras, `save_weights_only=True` requires the path to end in `.weights.h5`, which can surprise users migrating from older code such as the `.ckpt` pattern shown above [8].
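The template is ordinary Python string formatting, so you can preview the paths without running a model. The sketch below also restates the suffix rules from the paragraph above as a small helper; `describe_format` is a hypothetical function that mirrors this article's description and is not Keras's own dispatch logic.

```python
from pathlib import PurePosixPath

template = "checkpoints/cp-{epoch:04d}.ckpt"
# The :04d spec zero-pads the epoch so the files sort in training order.
paths = [template.format(epoch=epoch) for epoch in (1, 2, 10)]


def describe_format(filepath):
    """Map a path suffix to the format named in the text above."""
    name = PurePosixPath(filepath).name
    if name.endswith(".weights.h5"):
        return "weights-only HDF5"
    if name.endswith(".h5"):
        return "HDF5 single file"
    if name.endswith(".keras"):
        return "Keras zip"
    return "TensorFlow checkpoint"  # bare path or .ckpt


print(paths)
print(describe_format(paths[0]), describe_format("model.keras"))
```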

### TensorBoard logdir and event files

TensorBoard reads the run directory through its `--logdir` flag and walks it recursively, looking for files whose names contain `tfevents`. As the TensorBoard README states: "When TensorBoard is passed a logdir at startup, it recursively walks the directory tree rooted at logdir looking for subdirectories that contain tfevents data. Every time it encounters such a subdirectory, it loads it as a new run, and the frontend will organize the data accordingly" [9]. A typical layout for a training run with separate train and validation summaries looks like:

```
logs/
  train/
    events.out.tfevents.1715200000.hostname.0.v2
  validation/
    events.out.tfevents.1715200000.hostname.1.v2
```

The convention behind this layout is the reason `tf.keras.callbacks.TensorBoard` writes train and validation summaries into sibling subdirectories: TensorBoard treats them as two separate runs and draws them as two separate curves on the same chart. If you put both event files in the same directory, TensorBoard tries to stitch them together as one run, which is what you want when a training job crashes and resumes, but not what you want when one stream is training loss and the other is validation loss.

A common multi-experiment layout is:

```
logs/
  baseline/
    train/
    validation/
  larger_lr/
    train/
    validation/
```

Launching `tensorboard --logdir logs/` shows both experiments at once, with each run (for example `baseline/train`) listed under its path relative to the logdir and drawn in its own color. This is the path of least friction for comparing runs.
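The recursive discovery described in the README quote can be approximated in a few lines. This is a simplified sketch of the idea, not TensorBoard's implementation; `find_runs`, the `notes` folder, and the stub event file names are assumptions made for the demonstration.

```python
import os
import tempfile
from pathlib import Path


def find_runs(logdir):
    """Return relative paths of directories that directly contain tfevents files."""
    runs = []
    for dirpath, _dirnames, filenames in os.walk(logdir):
        if any("tfevents" in name for name in filenames):
            runs.append(Path(dirpath).relative_to(logdir).as_posix())
    return sorted(runs)


# Rebuild the multi-experiment layout with empty stub event files.
logdir = Path(tempfile.mkdtemp()) / "logs"
for run in ("baseline/train", "baseline/validation",
            "larger_lr/train", "larger_lr/validation"):
    run_path = logdir / run
    run_path.mkdir(parents=True)
    (run_path / "events.out.tfevents.1715200000.hostname.0.v2").touch()
(logdir / "notes").mkdir()  # holds no event files, so it is not a run

runs = find_runs(logdir)
print(runs)
```

Each entry in `runs` corresponds to one run name, which is why sibling `train/` and `validation/` folders show up as separate curves.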

### SavedModel directory

When you export a model for serving or for [TensorFlow Lite](/wiki/tensorflow_lite) conversion, TensorFlow writes a SavedModel into a directory. The TensorFlow documentation defines it directly: "A SavedModel is a directory containing serialized signatures and the state needed to run them, including variable values and vocabularies" [2]. The structure is fixed and is what tools like [TensorFlow Serving](/wiki/tensorflow_serving), [TensorFlow.js](/wiki/tensorflow_js), and [TensorFlow Hub](/wiki/tensorflow_hub) expect:

```
saved_model/
  saved_model.pb
  fingerprint.pb
  variables/
    variables.data-00000-of-00001
    variables.index
  assets/
  assets.extra/
```

`saved_model.pb` is the serialized graph and the named signatures: the documentation says it "stores the actual TensorFlow program, or model, and a set of named signatures, each identifying a function that accepts tensor inputs and produces tensor outputs" [2]. `variables/` holds the trained values in the standard checkpoint format [2]. `assets/` holds files the graph needs at load time, such as the vocabulary text files used by lookup tables [2]. `assets.extra/` is reserved for files that are not used by the graph but might be useful to the consumer (model cards, license notes, conversion hints); TensorFlow itself does not use this directory [2]. `fingerprint.pb` contains the fingerprint of the SavedModel, "composed of several 64-bit hashes that uniquely identify the contents of the SavedModel," and is read with `tf.saved_model.experimental.read_fingerprint` [2].

TensorFlow Serving expects a version subdirectory above this layout: `models/my_model/1/`, `models/my_model/2/`, and so on. The integer is the model version, and serving automatically picks up the highest numbered subdirectory unless you tell it otherwise. If you write a SavedModel directly into `models/my_model/` without a version folder, serving refuses to load it.
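The "highest numbered subdirectory" rule is easy to get wrong with string sorting, where `10` sorts before `2`. The sketch below mimics the rule with an integer comparison; `latest_version_dir` is a hypothetical helper for inspecting a model folder, not TensorFlow Serving code.

```python
import tempfile
from pathlib import Path


def latest_version_dir(model_base):
    """Return the subdirectory with the highest integer name, or None."""
    versions = [p for p in Path(model_base).iterdir()
                if p.is_dir() and p.name.isdigit()]
    if not versions:
        return None  # nothing that looks like a version folder
    # Compare as integers so that 10 ranks above 2.
    return max(versions, key=lambda p: int(p.name))


# Stub model folder: three versions plus an unrelated directory.
model_base = Path(tempfile.mkdtemp()) / "models" / "my_model"
for name in ("1", "2", "10", "tmp"):
    (model_base / name).mkdir(parents=True)

latest = latest_version_dir(model_base)
print(latest.name)
```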

## Can the root directory live in cloud object storage?

None of these TensorFlow APIs require a local path. They all accept URIs that the underlying `tf.io.gfile` layer understands, which means you can point `model_dir`, `log_dir`, `filepath`, or the `directory` argument at object storage. The common schemes are `gs://` for [Google Cloud Storage](/wiki/google_cloud_storage), `s3://` for [Amazon S3](/wiki/amazon_s3), and `hdfs://` for HDFS [10].

```python
checkpoint_path = 'gs://my-bucket/runs/2026_05_10/save_at_{epoch}'
callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path)
```

The API surface is the same as for a local path. `tf.train.latest_checkpoint('gs://bucket/run/')` works. `tf.saved_model.save(model, 's3://bucket/model/1/')` works. What changes is the failure mode. Object storage is eventually consistent for listings on some providers, which can make `tf.train.latest_checkpoint` return a stale answer right after a save. Latency is higher, so writing a checkpoint every few seconds is much more expensive than writing one every few minutes. Authentication needs to be in place: a service account JSON pointed to by `GOOGLE_APPLICATION_CREDENTIALS` for `gs://`, IAM credentials in the environment or instance profile for `s3://`.
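Because remote roots are URIs rather than OS paths, it can help to build them with forward slashes explicitly instead of relying on the host's path separator. The following sketch uses only the standard library; `is_remote` and `join_uri` are hypothetical helpers, and the scheme set simply repeats the three schemes named above.

```python
import posixpath
from urllib.parse import urlsplit

REMOTE_SCHEMES = {"gs", "s3", "hdfs"}  # the schemes named in the text


def is_remote(path):
    """Return True if `path` uses one of the object-storage or HDFS schemes."""
    return urlsplit(path).scheme in REMOTE_SCHEMES


def join_uri(root, *parts):
    """Join with forward slashes regardless of the host operating system."""
    return posixpath.join(root, *parts)


checkpoint_path = join_uri("gs://my-bucket/runs", "2026_05_10", "save_at_{epoch}")
bucket = urlsplit(checkpoint_path).netloc  # the bucket is the URI's host part
print(checkpoint_path, bucket, is_remote(checkpoint_path), is_remote("./tf_ckpts"))
```

A check like `is_remote` is one way to switch to a longer save interval when the run directory lives in object storage.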

On [Amazon SageMaker](/wiki/amazon_sagemaker) the convention is to write checkpoints to the local path `/opt/ml/checkpoints`, which SageMaker syncs to an S3 location specified in the job configuration [11]. SageMaker configures an output channel in continuous upload mode that runs an agent on the host to watch the file system and continuously copy new files to S3 [11]. This works around the latency and consistency issues by keeping the training loop local and offloading the upload to a separate process. Managed Spot Training uses the same path to recover after a spot interruption: when training restarts, SageMaker copies the checkpoints back from S3 to `/opt/ml/checkpoints` so the script can resume [11].

## What are the common conventions and pitfalls?

A few patterns show up repeatedly in TensorFlow codebases that have been around for more than a few months.

### One run, one directory

The single most useful convention is that every training run gets its own directory, and that directory is never reused. Name it with a timestamp, a short description, or both. Inside that directory, put the checkpoints, the event files, the exported SavedModel, and a copy of the config you used. If you have to come back six months later to figure out what hyperparameters produced a specific result, this is the layout that will tell you.

When runs share a directory, TensorBoard tries to stitch their event files into one run and the resulting curves are a mess [9]. Checkpoints from one run can also be picked up by `tf.train.latest_checkpoint` on a different run, which leads to confusing behavior where the model resumes from somewhere unexpected.
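One way to enforce the convention is to let the file system reject reuse. In this sketch the naming scheme (`run_<timestamp>_<description>`) and the helper `new_run_dir` are assumptions chosen for illustration; the key line is `mkdir(..., exist_ok=False)`.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def new_run_dir(experiments_root, description, now=None):
    """Create a fresh, timestamped run directory; fail if it already exists."""
    now = now or datetime.now()
    name = f"run_{now:%Y_%m_%d_%H%M%S}_{description}"
    run_dir = Path(experiments_root) / name
    run_dir.mkdir(parents=True, exist_ok=False)  # FileExistsError on reuse
    return run_dir


root = Path(tempfile.mkdtemp())
fixed = datetime(2026, 5, 10, 9, 30, 0)  # fixed time so the demo is repeatable
first = new_run_dir(root, "baseline", now=fixed)

try:
    new_run_dir(root, "baseline", now=fixed)  # same name: must be rejected
    reuse_blocked = False
except FileExistsError:
    reuse_blocked = True

print(first.name, reuse_blocked)
```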

### Trailing slashes and prefixes

A `tf.train.CheckpointManager` `directory` argument is a directory. A `tf.train.Checkpoint.save()` `file_prefix` argument is a prefix, not a directory. Passing `'./ckpts'` to `Checkpoint.save` creates files named `ckpts-1`, `ckpts-2`, etc. in the current working directory, which is almost never what you want. The fix is either to pass `'./ckpts/ckpt'` (so files are named `./ckpts/ckpt-1`, etc.) or to use `CheckpointManager`, which handles the prefix internally [1].
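The difference is easiest to see by predicting the file names. The helper below is a hypothetical illustration that applies the `<prefix>-N` naming described above with plain string handling; it assumes a single data shard and does not call TensorFlow.

```python
import posixpath


def files_written(file_prefix, save_counter=1):
    """Predict where the index and data files for one save would land."""
    base = f"{file_prefix}-{save_counter}"
    directory = posixpath.normpath(posixpath.dirname(base) or ".")
    stem = posixpath.basename(base)
    return directory, [stem + ".index", stem + ".data-00000-of-00001"]


# Prefix mistaken for a directory: files land in the working directory.
wrong_dir, wrong_names = files_written("./ckpts")
# Prefix with a file stem: files land inside ./ckpts as intended.
right_dir, right_names = files_written("./ckpts/ckpt")
print(wrong_dir, wrong_names)
print(right_dir, right_names)
```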

### Permissions in containers

In [Docker](/wiki/docker) containers the user inside the container often does not have write permissions on a mounted host directory. The error TensorFlow returns in that case is not always obvious. If your training loop fails on the first checkpoint save and you see `Permission denied`, that is usually the cause. Mount the directory with appropriate ownership or run the container as the host user.
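A cheap safeguard is to test writability before training starts, so the failure shows up in the first second rather than at the first checkpoint. This is a minimal sketch; `can_write` is a hypothetical helper, and it probes by actually creating a temporary file rather than inspecting permission bits.

```python
import tempfile
from pathlib import Path


def can_write(directory):
    """Return True if a file can actually be created inside `directory`."""
    try:
        Path(directory).mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # the probe file is removed again on exit
        return True
    except OSError:  # includes PermissionError
        return False


root = Path(tempfile.mkdtemp())
writable = can_write(root / "checkpoints")

# Simulate an unusable target: a path nested under a regular file.
blocker = root / "not_a_directory"
blocker.touch()
blocked = can_write(blocker / "checkpoints")
print(writable, blocked)
```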

### Avoiding stale checkpoints across experiments

When you change your model code and rerun training against an existing `model_dir`, TensorFlow will try to restore variables from the existing checkpoint. If the variable names or shapes have changed, you get cryptic shape mismatch errors. The safe move is to either delete the old directory or write the new run into a new directory. Reusing a `model_dir` only makes sense when you genuinely want to resume the same model from the same point.
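The "delete it or use a new directory" rule can be turned into an explicit guard. In the sketch below, `prepare_model_dir` and its `resume` flag are hypothetical; the check relies only on the presence of the `checkpoint` state file described earlier.

```python
import tempfile
from pathlib import Path


def prepare_model_dir(model_dir, resume=False):
    """Create `model_dir`; refuse to reuse existing checkpoints unless resuming.

    Returns True when existing checkpoints will be picked up.
    """
    model_dir = Path(model_dir)
    has_checkpoint = (model_dir / "checkpoint").exists()
    if has_checkpoint and not resume:
        raise FileExistsError(
            f"{model_dir} already holds checkpoints; pass resume=True "
            "or choose a new directory")
    model_dir.mkdir(parents=True, exist_ok=True)
    return has_checkpoint


root = Path(tempfile.mkdtemp())
fresh = prepare_model_dir(root / "run_a")  # new directory, nothing to restore
(root / "run_a" / "checkpoint").write_text('model_checkpoint_path: "model.ckpt-0"\n')
resumed = prepare_model_dir(root / "run_a", resume=True)

try:
    prepare_model_dir(root / "run_a")  # accidental reuse
    refused = False
except FileExistsError:
    refused = True

print(fresh, resumed, refused)
```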

## Directory structure reference

The following table summarizes which TensorFlow APIs write to a directory, what argument name they use, and what they put there.

| API | Argument | Typical contents |
| --- | --- | --- |
| [tf.estimator.Estimator](/wiki/estimator) | `model_dir` | `checkpoint`, `model.ckpt-*` files, `graph.pbtxt`, event files |
| `tf.train.CheckpointManager` | `directory` | `checkpoint`, `ckpt-N.index`, `ckpt-N.data-*` |
| `tf.train.Checkpoint.save` | `file_prefix` | Files named `<prefix>-N.index` and `<prefix>-N.data-*` |
| `tf.keras.callbacks.ModelCheckpoint` | `filepath` | `.weights.h5`, `.keras`, or TF checkpoint files |
| `tf.keras.callbacks.TensorBoard` | `log_dir` | `events.out.tfevents.*` and profiler traces |
| `tf.summary.create_file_writer` | `logdir` | `events.out.tfevents.*` |
| `tf.saved_model.save` | `export_dir` | `saved_model.pb`, `variables/`, `assets/`, `fingerprint.pb` |
| `tf.keras.Model.save` | `filepath` | `.keras` zip, or a SavedModel directory |
| `tf.keras.Model.save_weights` | `filepath` | `.weights.h5` or sharded `.weights.json` plus shard files |

## Explain like I'm 5 (ELI5)

Imagine you are doing a big school project that takes weeks. You keep all your notes, drawings, and drafts in one folder so nothing gets lost. The root directory for a machine learning project is that folder. Inside it, you have a smaller folder for each time you sat down to work, and inside each of those, you have your saved progress (checkpoints), your sketches that show how the project is going (event files for TensorBoard), and your final clean version ready to hand in (the SavedModel). If you mix everything together in one big pile, you cannot tell which scribbles belong to which day. If you give each work session its own folder, you can always go back to the right one.

## References

1. TensorFlow, "Training checkpoints," https://www.tensorflow.org/guide/checkpoint
2. TensorFlow, "Using the SavedModel format," https://www.tensorflow.org/guide/saved_model
3. TensorFlow, "Estimators," https://www.tensorflow.org/guide/estimator
4. TensorFlow API, "tf.train.CheckpointManager," https://www.tensorflow.org/api_docs/python/tf/train/CheckpointManager
5. TensorFlow API, "tf.train.latest_checkpoint," https://www.tensorflow.org/api_docs/python/tf/train/latest_checkpoint
6. TensorFlow API, "tf.keras.callbacks.ModelCheckpoint," https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
7. Keras documentation, "ModelCheckpoint," https://keras.io/api/callbacks/model_checkpoint/
8. Keras documentation, "Weights only saving and loading," https://keras.io/api/models/model_saving_apis/weights_saving_and_loading/
9. TensorBoard, "README," https://github.com/tensorflow/tensorboard/blob/master/README.md
10. TensorFlow Blog, "Train your TensorFlow model on Google Cloud using TensorFlow Cloud," https://blog.tensorflow.org/2020/08/train-your-tensorflow-model-on-google.html
11. AWS Machine Learning Blog, "Implement checkpointing with TensorFlow for Amazon SageMaker Managed Spot Training," https://aws.amazon.com/blogs/machine-learning/implement-checkpointing-with-tensorflow-for-amazon-sagemaker-managed-spot-training/