Root directory

Root directory in machine learning

In the context of machine learning, the term "root directory" does not refer to a single algorithmic concept. It refers to a directory on disk (or in object storage) that holds the artifacts a training run produces and a serving run consumes: checkpoints, event files for TensorBoard, exported SavedModels, assets, and bookkeeping metadata. In TensorFlow specifically, several APIs take a directory argument by name, including the model_dir parameter of tf.estimator.Estimator, the directory parameter of tf.train.CheckpointManager, the filepath parameter of tf.keras.callbacks.ModelCheckpoint, and the log_dir parameter of tf.keras.callbacks.TensorBoard. The same path is often called the model directory, the checkpoint directory, the experiment directory, or simply the logdir, depending on which tool is writing into it.

The directory itself is just a folder, but how you organize it has real consequences. A clean layout lets you resume an interrupted training run from the last checkpoint, compare two experiments in TensorBoard side by side, ship a SavedModel to TensorFlow Serving without rewiring code, and avoid the silent bug where two runs scribble over each other's events. A messy layout is one of those problems you do not notice until a week into a project, by which point the cost of cleaning it up is high.

Definition

In operating systems, the root directory is the top level directory of a file system hierarchy. It is denoted by a forward slash (/) on Unix-like systems such as Linux and macOS, and by a drive letter followed by a backslash (for example C:\) on Windows. In day to day machine learning work, however, "root directory" is rarely used in that strict OS sense. It usually means the root of a project or the root of an experiment: the top folder under which everything else for that workload lives.

A project root for a typical TensorFlow repository tends to look like this:

my_project/
  data/
  src/
  configs/
  notebooks/
  experiments/
    run_2026_05_10_baseline/
      checkpoints/
      logs/train/
      logs/validation/
      saved_model/
      config.json

The top folder is the project root. Each subdirectory under experiments/ is an experiment root, also called a run directory. TensorFlow APIs do not care about the project root. They care about the run directory, because that is what gets passed to model_dir, log_dir, or tf.train.CheckpointManager.
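As a rough illustration of the layout above, the following sketch builds a run directory with the same subfolders using only the Python standard library. The helper name make_run_dir and the contents of the config dictionary are hypothetical; the folder names are taken from the tree shown earlier.

```python
import json
import tempfile
from pathlib import Path


def make_run_dir(project_root, run_name, config):
    """Create experiments/<run_name>/ with the subfolders from the layout above."""
    run_dir = Path(project_root) / "experiments" / run_name
    # Fail loudly instead of silently reusing an existing run directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    for sub in ("checkpoints", "logs/train", "logs/validation", "saved_model"):
        (run_dir / sub).mkdir(parents=True)
    # Store the configuration next to the artifacts it produced.
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as project_root:
        run_dir = make_run_dir(project_root, "run_2026_05_10_baseline", {"lr": 0.001})
        for path in sorted(run_dir.rglob("*")):
            print(path.relative_to(run_dir).as_posix())
```

The returned path is what would then be handed to model_dir, log_dir, or a CheckpointManager, usually with the relevant subfolder appended.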

The role of the directory in TensorFlow

model_dir in tf.estimator

tf.estimator.Estimator takes a model_dir argument in its constructor. The first time you call train(), TensorFlow writes an initial checkpoint into that directory, plus a graph.pbtxt, plus an events.out.tfevents.* event file for TensorBoard. On subsequent calls to train(), evaluate(), or predict(), the Estimator rebuilds the model from the latest checkpoint in model_dir. If you do not pass model_dir, the Estimator creates a temporary directory using Python's tempfile.mkdtemp, which is fine for a quick test but useless if you want to resume training later.

A fresh Estimator run produces a directory roughly like this:

model_dir/
  checkpoint
  graph.pbtxt
  model.ckpt-0.data-00000-of-00001
  model.ckpt-0.index
  model.ckpt-0.meta
  model.ckpt-1000.data-00000-of-00001
  model.ckpt-1000.index
  model.ckpt-1000.meta
  events.out.tfevents.1715200000.hostname

The checkpoint schedule is controlled by tf.estimator.RunConfig. By default it saves a checkpoint every 600 seconds and keeps the five most recent. Override the interval with save_checkpoints_steps or save_checkpoints_secs (set one of them, not both), and the retention count with keep_checkpoint_max.
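To see which steps survive in a model_dir after rotation, one option is to read the step numbers out of the file names. The sketch below assumes the model.ckpt-N.index naming shown in the listing above; the helper checkpoint_steps is hypothetical and the demo uses empty placeholder files rather than real checkpoints.

```python
import re
import tempfile
from pathlib import Path

# Matches names such as "model.ckpt-1000.index" and captures the step number.
_STEP_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir):
    """Return the sorted global steps that have an .index file in model_dir."""
    steps = []
    for path in Path(model_dir).iterdir():
        match = _STEP_RE.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as model_dir:
        # Empty placeholder files that only imitate the names Estimator writes.
        for name in ("checkpoint", "graph.pbtxt",
                     "model.ckpt-0.index", "model.ckpt-1000.index"):
            (Path(model_dir) / name).touch()
        print(checkpoint_steps(model_dir))
```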

tf.train.Checkpoint and CheckpointManager

Outside of Estimator, the modern object based API is tf.train.Checkpoint paired with tf.train.CheckpointManager. You build a Checkpoint object that tracks Python objects (a model, an optimizer, a step counter, an iterator), then hand it to a manager along with a directory:

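import tensorflow as tf
# opt (an optimizer) and net (a model) are assumed to be defined elsewhere.
ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)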

Each call to manager.save() writes a new pair of files under ./tf_ckpts with a numeric suffix tied to the manager's save counter:

tf_ckpts/
  checkpoint
  ckpt-1.data-00000-of-00001
  ckpt-1.index
  ckpt-2.data-00000-of-00001
  ckpt-2.index
  ckpt-3.data-00000-of-00001
  ckpt-3.index

The checkpoint file is a small text file that records which prefixes exist and which one is the latest. It is the file tf.train.latest_checkpoint('./tf_ckpts') reads to decide where to resume from. The .index file holds metadata about the variables stored in the checkpoint. The .data-00000-of-00001 file holds the actual tensor values; the suffix is a shard number, and a single machine run normally writes one shard, with additional shards appearing mainly in distributed setups. Strings like ./tf_ckpts/ckpt-2 are checkpoint prefixes, not real files; passing the prefix to ckpt.restore() tells TensorFlow to load the index and data files that share that prefix.
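The bookkeeping role of the checkpoint file can be illustrated with a small parser. This is a sketch only: the field names model_checkpoint_path and all_model_checkpoint_paths reflect how the state file typically looks on disk and should be treated as an assumption, and in real code tf.train.latest_checkpoint is the supported way to read it. The function parse_checkpoint_state is hypothetical.

```python
import re

# One quoted key/value pair per line, e.g.  model_checkpoint_path: "ckpt-3"
_LINE_RE = re.compile(r'^(\w+):\s*"(.*)"\s*$')

SAMPLE_STATE = '''\
model_checkpoint_path: "ckpt-3"
all_model_checkpoint_paths: "ckpt-1"
all_model_checkpoint_paths: "ckpt-2"
all_model_checkpoint_paths: "ckpt-3"
'''


def parse_checkpoint_state(text):
    """Return (latest_prefix, all_prefixes) from the text of a checkpoint file."""
    latest = None
    all_prefixes = []
    for line in text.splitlines():
        match = _LINE_RE.match(line.strip())
        if not match:
            continue  # ignore blank lines and unquoted fields
        key, value = match.groups()
        if key == "model_checkpoint_path":
            latest = value
        elif key == "all_model_checkpoint_paths":
            all_prefixes.append(value)
    return latest, all_prefixes


if __name__ == "__main__":
    print(parse_checkpoint_state(SAMPLE_STATE))
```

The values are prefixes relative to the checkpoint directory, which is why moving the whole directory usually keeps a run resumable.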

max_to_keep is the knob that controls how many recent checkpoints survive. Older ones are deleted automatically. If you want to preserve a specific checkpoint forever (for example the one tied to a published paper or a deployed model), read the manager.checkpoints property, which lists the prefixes the manager currently tracks, and copy the files for the relevant prefix out of the directory, because the manager will eventually rotate it out otherwise.
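Copying a checkpoint out of the managed directory means copying every file that shares its prefix. A minimal sketch, assuming the ckpt-N.index and ckpt-N.data-* naming shown above; archive_checkpoint is a hypothetical helper and the demo files are empty placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def archive_checkpoint(prefix, archive_dir):
    """Copy all files belonging to a checkpoint prefix into archive_dir."""
    prefix = Path(prefix)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # "<name>.*" matches ckpt-2.index and ckpt-2.data-*, but not ckpt-20.index.
    for src in sorted(prefix.parent.glob(prefix.name + ".*")):
        shutil.copy2(src, archive_dir / src.name)
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir = Path(tmp) / "tf_ckpts"
        ckpt_dir.mkdir()
        for name in ("ckpt-2.index", "ckpt-2.data-00000-of-00001", "ckpt-20.index"):
            (ckpt_dir / name).touch()
        print(archive_checkpoint(ckpt_dir / "ckpt-2", Path(tmp) / "published"))
```

The archived copy does not include the checkpoint state file, so restoring from it means passing the prefix explicitly rather than relying on tf.train.latest_checkpoint.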

tf.keras.callbacks.ModelCheckpoint

For Keras models, the equivalent is tf.keras.callbacks.ModelCheckpoint. The callback takes a filepath argument that can be either a directory or a templated file pattern with Python str.format placeholders:

tf.keras.callbacks.ModelCheckpoint(
    filepath='checkpoints/cp-{epoch:04d}.ckpt',
    save_weights_only=True,
    save_freq='epoch',
)

The extension matters, and the rules depend on the Keras version. A .h5 suffix triggers the older HDF5 single file format. A .keras suffix uses the modern Keras zip format. In Keras 2 (bundled with TensorFlow up to 2.15), a bare path with no extension, or .ckpt, uses the TensorFlow checkpoint format and writes the same .index and .data-* shards described above. In Keras 3, save_weights_only=True requires the path to end in .weights.h5, which can surprise users migrating from older code such as the .ckpt example above.
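The two ideas in this section, template expansion and format selection by suffix, can be sketched without TensorFlow. The expansion below is plain str.format applied to the template from the example; the function describe_format is a hypothetical summary of the suffix rules described in the text, not the callback's actual logic.

```python
from pathlib import PurePosixPath

TEMPLATE = "checkpoints/cp-{epoch:04d}.ckpt"


def expand(template, epoch):
    """Fill the {epoch} placeholder the same way str.format would."""
    return template.format(epoch=epoch)


def describe_format(filepath):
    """Map a file name suffix to the format named in the text above."""
    name = PurePosixPath(filepath).name
    if name.endswith(".weights.h5"):
        return "Keras weights file"
    if name.endswith(".h5"):
        return "HDF5 single file"
    if name.endswith(".keras"):
        return "Keras zip format"
    return "TensorFlow checkpoint format"


if __name__ == "__main__":
    path = expand(TEMPLATE, 7)
    print(path, "->", describe_format(path))
```

Checking .weights.h5 before .h5 matters, because the longer suffix also ends in .h5.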

TensorBoard logdir and event files

TensorBoard reads the run directory through its --logdir flag and walks it recursively, looking for files whose names contain tfevents. Each subdirectory that contains event files is treated as a single run and gets its own line in the TensorBoard UI. A typical layout for a training run with separate train and validation summaries looks like:

logs/
  train/
    events.out.tfevents.1715200000.hostname.0.v2
  validation/
    events.out.tfevents.1715200000.hostname.1.v2

The convention behind this layout is the reason tf.keras.callbacks.TensorBoard writes train and validation summaries into sibling subdirectories: TensorBoard treats them as two separate runs and draws them as two separate curves on the same chart. If you put both event files in the same directory, TensorBoard tries to stitch them together as one run, which is what you want when a training job crashes and resumes, but not what you want when one stream is training loss and the other is validation loss.

A common multi experiment layout is:

logs/
  baseline/
    train/
    validation/
  larger_lr/
    train/
    validation/

Launching tensorboard --logdir logs/ shows both experiments at once, color coded per top level directory. This is the path of least friction for comparing runs.
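The run discovery rule described above (every directory that contains a file with tfevents in its name counts as one run) can be approximated in a few lines. This is an illustrative sketch, not TensorBoard's own implementation; find_runs is a hypothetical helper and the event files in the demo are empty placeholders.

```python
import os
import tempfile
from pathlib import Path


def find_runs(logdir):
    """Return the directories under logdir (relative paths) that hold event files."""
    runs = []
    for dirpath, _dirnames, filenames in os.walk(logdir):
        if any("tfevents" in name for name in filenames):
            runs.append(Path(dirpath).relative_to(logdir).as_posix())
    return sorted(runs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as logdir:
        for run in ("baseline/train", "baseline/validation",
                    "larger_lr/train", "larger_lr/validation"):
            run_dir = Path(logdir) / run
            run_dir.mkdir(parents=True)
            # Placeholder file; only the name matters for discovery.
            (run_dir / "events.out.tfevents.1715200000.hostname.0.v2").touch()
        print(find_runs(logdir))
```

Running a check like this before launching TensorBoard is a quick way to spot event files that landed in an unexpected directory.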

SavedModel directory

When you export a model for serving or for TensorFlow Lite conversion, TensorFlow writes a SavedModel into a directory. The structure is fixed and is what tools like TensorFlow Serving, TensorFlow.js, and TensorFlow Hub expect:

saved_model/
  saved_model.pb
  fingerprint.pb
  variables/
    variables.data-00000-of-00001
    variables.index
  assets/
  assets.extra/

saved_model.pb is the serialized graph and the named signatures (the input and output tensor specs for each callable function). variables/ holds the trained values in the standard checkpoint format. assets/ holds files the graph needs at load time, such as the vocabulary text files used by lookup tables. assets.extra/ is reserved for files that are not used by the graph but might be useful to the consumer (model cards, license notes, conversion hints). fingerprint.pb is a small set of 64 bit hashes that uniquely identify the SavedModel's contents and is read with tf.saved_model.experimental.read_fingerprint.

TensorFlow Serving expects a version subdirectory above this layout: models/my_model/1/, models/my_model/2/, and so on. The integer is the model version, and serving automatically picks up the highest numbered subdirectory unless you tell it otherwise. If you write a SavedModel directly into models/my_model/ without a version folder, serving refuses to load it.
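The version selection rule can be mirrored in a small helper, which is handy for choosing the next version number before exporting. A sketch under the assumption that versions are plain integer directory names, as described above; latest_version and next_export_dir are hypothetical names.

```python
import tempfile
from pathlib import Path


def latest_version(model_base_path):
    """Return the highest integer-named subdirectory, or None if there is none."""
    versions = [
        int(p.name)
        for p in Path(model_base_path).iterdir()
        if p.is_dir() and p.name.isascii() and p.name.isdigit()
    ]
    return max(versions) if versions else None


def next_export_dir(model_base_path):
    """Return the path a new SavedModel version would be written to."""
    current = latest_version(model_base_path)
    return Path(model_base_path) / str(1 if current is None else current + 1)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        for name in ("1", "2", "10", "tmp"):
            (Path(base) / name).mkdir()
        print(latest_version(base), next_export_dir(base).name)
```

Comparing versions as integers rather than strings is the point of the sketch: as strings, "10" would sort before "2".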

Remote root directories

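None of these TensorFlow APIs require a local path. They all accept URIs that the underlying tf.io.gfile layer understands, which means you can point model_dir, log_dir, filepath, or the directory argument at object storage. The common schemes are gs:// for Google Cloud Storage, s3:// for Amazon S3, and hdfs:// for HDFS. Since TensorFlow 2.6, the s3:// and hdfs:// schemes are provided by the separate tensorflow-io package, which has to be installed and imported before those paths resolve.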

checkpoint_path = 'gs://my-bucket/runs/2026_05_10/save_at_{epoch}'
callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path)

The API surface is the same as for a local path. tf.train.latest_checkpoint('gs://bucket/run/') works. tf.saved_model.save(model, 's3://bucket/model/1/') works. What changes is the failure mode. Object storage is eventually consistent for listings on some providers, which can make tf.train.latest_checkpoint return a stale answer right after a save. Latency is higher, so writing a checkpoint every few seconds is much more expensive than writing one every few minutes. Authentication needs to be in place: a service account JSON pointed to by GOOGLE_APPLICATION_CREDENTIALS for gs://, IAM credentials in the environment or instance profile for s3://.
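One practical detail when building remote paths in Python: pathlib treats the URI as an ordinary POSIX path and collapses the double slash after the scheme, so string based joining is the safer choice. The sketch below shows the difference using only the standard library; the bucket name comes from the example above and the checkpoints subfolder is illustrative.

```python
import posixpath
from pathlib import PurePosixPath

RUN_ROOT = "gs://my-bucket/runs/2026_05_10"

# pathlib normalizes "//" to "/", which silently corrupts the URI.
collapsed = str(PurePosixPath(RUN_ROOT) / "checkpoints")

# posixpath.join only concatenates with "/", so the scheme survives.
joined = posixpath.join(RUN_ROOT, "checkpoints")

if __name__ == "__main__":
    print(collapsed)
    print(joined)
```

The first line printed starts with gs:/ (one slash), which is no longer a valid bucket URI; the second keeps gs:// intact.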

On Amazon SageMaker the convention is to write checkpoints to the local path /opt/ml/checkpoints, which SageMaker syncs to an S3 location specified in the job configuration. This works around the latency and consistency issues by keeping the training loop local and offloading the upload to a separate process. Managed Spot Training uses the same path to recover after a spot interruption.

Conventions and pitfalls

A few patterns show up repeatedly in TensorFlow codebases that have been around for more than a few months.

One run, one directory

The single most useful convention is that every training run gets its own directory, and that directory is never reused. Name it with a timestamp, a short description, or both. Inside that directory, put the checkpoints, the event files, the exported SavedModel, and a copy of the config you used. If you have to come back six months later to figure out what hyperparameters produced a specific result, this is the layout that will tell you.

When runs share a directory, TensorBoard tries to stitch their event files into one run and the resulting curves are a mess. Checkpoints from one run can also be picked up by tf.train.latest_checkpoint on a different run, which leads to confusing behavior where the model resumes from somewhere unexpected.
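A simple way to enforce this convention is to derive the directory name from the start time and refuse to create it twice. The following is a minimal sketch; new_run_dir and its naming scheme (timestamp plus short description) are hypothetical choices, not a TensorFlow convention.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def new_run_dir(experiments_root, description, now=None):
    """Create and return a fresh run directory; raise if it already exists."""
    now = now or datetime.now()
    name = f"run_{now:%Y_%m_%d_%H%M%S}_{description}"
    run_dir = Path(experiments_root) / name
    # exist_ok=False turns accidental reuse into an immediate error.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        started = datetime(2026, 5, 10, 14, 30, 0)
        print(new_run_dir(root, "baseline", now=started).name)
```

Passing the start time explicitly, as in the demo, makes the name reproducible in tests; in a training script the default of the current time is usually enough.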

Trailing slashes and prefixes

A tf.train.CheckpointManager directory argument is a directory. A tf.train.Checkpoint.save() file_prefix argument is a prefix, not a directory. Passing './ckpts' to Checkpoint.save creates files named ckpts-1, ckpts-2, etc. in the current working directory, which is almost never what you want. The fix is either to pass './ckpts/ckpt' (so files are named ./ckpts/ckpt-1, etc.) or to use CheckpointManager, which handles the prefix internally.
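The difference between a directory and a prefix is easiest to see by splitting the string the way a file system would. This sketch assumes the naming rule implied above, namely that the save counter is appended to the prefix as "-N"; files_written is a hypothetical helper that only predicts names and writes nothing.

```python
import posixpath


def files_written(file_prefix, save_count=1):
    """Predict the directory and file names for a prefix (illustration only)."""
    full_prefix = f"{file_prefix}-{save_count}"
    directory = posixpath.dirname(full_prefix) or "."
    base = posixpath.basename(full_prefix)
    return directory, [f"{base}.index", f"{base}.data-00000-of-00001"]


if __name__ == "__main__":
    for prefix in ("./ckpts", "./ckpts/ckpt", "./ckpts/"):
        print(repr(prefix), "->", files_written(prefix))
```

Under the same assumption, a prefix with a trailing slash lands in the right directory but yields files whose names start with "-1", which is why the section title mentions trailing slashes.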

Permissions in containers

In Docker containers the user inside the container often does not have write permissions on a mounted host directory. The error TensorFlow returns in that case is not always obvious. If your training loop fails on the first checkpoint save and you see Permission denied, that is usually the cause. Mount the directory with appropriate ownership or run the container as the host user.
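A cheap safeguard is to test that the directory is writable before training starts, so the failure appears in the first second rather than at the first checkpoint. A minimal sketch using the standard library; can_write is a hypothetical helper.

```python
import tempfile


def can_write(directory):
    """Return True if a temporary file can be created inside directory."""
    try:
        # The file is deleted automatically when the with block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError:
        # Covers PermissionError as well as a missing directory.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        print(can_write(run_dir))
```

Actually creating a file is more reliable than inspecting permission bits, because mounted volumes can report permissions that do not match what the container user is allowed to do.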

Avoiding stale checkpoints across experiments

When you change your model code and rerun training against an existing model_dir, TensorFlow will try to restore variables from the existing checkpoint. If the variable names or shapes have changed, you get cryptic shape mismatch errors. The safe move is to either delete the old directory or write the new run into a new directory. Reusing a model_dir only makes sense when you genuinely want to resume the same model from the same point.
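This rule can be made explicit with a guard that runs before the model is built. The sketch below assumes that the presence of a checkpoint state file marks a directory as already used; check_model_dir and its resume flag are hypothetical.

```python
import tempfile
from pathlib import Path


def check_model_dir(model_dir, resume=False):
    """Raise if model_dir already holds checkpoints and resume was not requested."""
    model_dir = Path(model_dir)
    state_file = model_dir / "checkpoint"
    if state_file.exists() and not resume:
        raise FileExistsError(
            f"{model_dir} already contains checkpoints; "
            "pass resume=True or choose a new directory"
        )
    return model_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as model_dir:
        check_model_dir(model_dir)  # empty directory: accepted
        (Path(model_dir) / "checkpoint").touch()
        try:
            check_model_dir(model_dir)
        except FileExistsError as err:
            print("refused:", err)
        check_model_dir(model_dir, resume=True)  # explicit resume: accepted
```

Making resumption an explicit choice turns a cryptic shape mismatch into a clear error message at startup.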

Directory structure reference

The following table summarizes which TensorFlow APIs write to a directory, what argument name they use, and what they put there.

API	Argument	Typical contents
`tf.estimator.Estimator`	`model_dir`	`checkpoint`, `model.ckpt-*` files, `graph.pbtxt`, event files
`tf.train.CheckpointManager`	`directory`	`checkpoint`, `ckpt-N.index`, `ckpt-N.data-*`
`tf.train.Checkpoint.save`	`file_prefix`	Files named `<prefix>-N.index` and `<prefix>-N.data-*`
`tf.keras.callbacks.ModelCheckpoint`	`filepath`	`.weights.h5`, `.keras`, or TF checkpoint files
`tf.keras.callbacks.TensorBoard`	`log_dir`	`events.out.tfevents.*` and profiler traces
`tf.summary.create_file_writer`	`logdir`	`events.out.tfevents.*`
`tf.saved_model.save`	`export_dir`	`saved_model.pb`, `variables/`, `assets/`, `fingerprint.pb`
`tf.keras.Model.save`	`filepath`	`.keras` zip, or a SavedModel directory
`model.save_weights`	`filepath`	`.weights.h5` or sharded `.weights.json` plus shard files

Explain like I'm 5 (ELI5)

Imagine you are doing a big school project that takes weeks. You keep all your notes, drawings, and drafts in one folder so nothing gets lost. The root directory for a machine learning project is that folder. Inside it, you have a smaller folder for each time you sat down to work, and inside each of those, you have your saved progress (checkpoints), your sketches that show how the project is going (event files for TensorBoard), and your final clean version ready to hand in (the SavedModel). If you mix everything together in one big pile, you cannot tell which scribbles belong to which day. If you give each work session its own folder, you can always go back to the right one.
