The Dataset API (tf.data) is a high-performance input pipeline framework within TensorFlow designed for loading, transforming, and delivering data to machine learning models during training and evaluation. It provides a composable, declarative interface built around the tf.data.Dataset abstraction, which represents a potentially large sequence of elements that can be processed through a chain of functional transformations. The API follows an Extract-Transform-Load (ETL) pattern, separating user-facing application logic from underlying performance optimizations so that practitioners can focus on data semantics while the runtime handles efficient execution.
The tf.data framework was developed at Google and is described in a 2021 paper published in the Proceedings of the VLDB Endowment. An empirical analysis of millions of jobs run in Google's datacenters showed that input data processing consumed a substantial fraction of job resources and exhibited high diversity across workloads, motivating a general-purpose, automatically optimized pipeline system. The framework ships as a core module of TensorFlow and is the recommended approach for building input pipelines across all TensorFlow-based training workflows.
Imagine you are building something out of LEGO bricks, but all your bricks are mixed up in a giant pile. Before you can start building, someone needs to find the right bricks, sort them by color and size, and hand them to you one group at a time so you never have to stop and wait. That is what tf.data does for a computer that is learning. The computer's "brain" (a GPU or TPU) learns really fast, but it needs data prepared and ready. tf.data is the helper that reads the data, cleans it up, organizes it into small batches, and keeps handing new batches to the brain before it finishes the last one. That way the brain never sits around doing nothing.
Training modern deep learning models involves repeatedly feeding data through the model, adjusting weights through backpropagation, and iterating over the full dataset for many epochs. As GPUs and TPUs have grown faster, the CPU-based data loading and preprocessing step has increasingly become the bottleneck. Research has found that nearly 56% of GPU cycles can be wasted waiting for training data even when CPUs operate at 92% utilization. With datasets growing larger and accelerators growing faster, this mismatch only becomes more pronounced.
Before tf.data, TensorFlow 1.x relied on feed dictionaries and queue-based pipelines. Feed dictionaries required users to manually fetch data in Python and pass it into the computation graph through session.run(), which introduced Python-level overhead and prevented pipelining between host and device. Queue runners were more efficient but difficult to use correctly, error-prone, and hard to debug. The tf.data API was introduced to replace both approaches with a single, composable, high-performance abstraction that works naturally with both graph and eager execution modes.
The tf.data pipeline follows a three-stage ETL pattern:
| Stage | Role | Typical operations |
|---|---|---|
| Extract | Read raw data from storage (local disk, cloud storage, databases) | TFRecordDataset, TextLineDataset, CsvDataset, from_tensor_slices, from_generator, list_files |
| Transform | Parse, decode, augment, shuffle, and batch the data on CPU | map, filter, shuffle, batch, padded_batch, window, flat_map, interleave |
| Load | Deliver prepared batches to the accelerator for model execution | prefetch, distribution via tf.distribute.Strategy |
This separation allows each stage to be independently parallelized and optimized. The Extract stage can read from multiple files concurrently, the Transform stage can parallelize map operations across CPU cores, and the Load stage can overlap data delivery with model computation.
The central object in the API is tf.data.Dataset, which represents a sequence of elements. Each element can be a single tensor, a tuple of tensors, a dictionary of tensors, or nested combinations of these structures. Datasets are lazy: transformations are not executed until elements are consumed (for example, by iterating over the dataset in a training loop).
Datasets support several container types for elements:
| Supported type | Description |
|---|---|
| tf.Tensor | Standard dense tensor |
| tf.sparse.SparseTensor | Sparse tensor representation |
| tf.RaggedTensor | Tensor with variable-length dimensions |
| tf.TensorArray | Dynamic-size tensor array |
| tf.data.Dataset | Nested dataset (used by windowing) |
Element structures can use tuple, dict, NamedTuple, or OrderedDict as containers. Python lists are not supported as element containers.
The structure of a dataset's elements can be inspected using the element_spec property:
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices({
"features": tf.random.uniform([100, 10]),
"labels": tf.random.uniform(<sup><a href="#cite_note-100" class="cite-ref">[100]</a></sup>, maxval=2, dtype=tf.int32)
})
print(dataset.element_spec)
# {'features': TensorSpec(shape=(10,), dtype=tf.float32),
# 'labels': TensorSpec(shape=(), dtype=tf.int32)}
The API provides multiple factory methods for constructing datasets from different data sources.
The from_tensor_slices method creates a dataset by slicing tensors along their first dimension. This is the simplest way to create a dataset from NumPy arrays or Python lists:
import numpy as np
images = np.random.rand(1000, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
The from_tensors method wraps the entire input as a single element, useful when you want to treat a batch as one element.
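A minimal illustration of the contrast with from_tensor_slices (the constant below is only illustrative):
t = tf.constant([[1, 2], [3, 4]])
ds_single = tf.data.Dataset.from_tensors(t)        # one element of shape (2, 2)
ds_sliced = tf.data.Dataset.from_tensor_slices(t)  # two elements, each of shape (2,)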
The from_generator method wraps a Python generator function, which is useful for data sources that do not fit neatly into tensor slicing or for lazy data loading:
def data_generator():
for i in range(1000):
yield np.random.rand(28, 28).astype(np.float32), i % 10
dataset = tf.data.Dataset.from_generator(
data_generator,
output_signature=(
tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
tf.TensorSpec(shape=(), dtype=tf.int32)
)
)
Because generators execute in Python, they cannot be serialized into a TensorFlow graph and may introduce performance overhead. They are best used for prototyping or when interfacing with external libraries.
| Method | File format | Use case |
|---|---|---|
| tf.data.TFRecordDataset | TFRecord (protobuf-based binary) | Large-scale training with serialized examples |
| tf.data.TextLineDataset | Plain text files | NLP tasks, log processing |
| tf.data.experimental.CsvDataset | CSV files | Tabular / structured data |
| tf.data.Dataset.list_files | File path patterns | Image directories, audio files |
| tf.data.FixedLengthRecordDataset | Fixed-length binary records | Scientific data, legacy formats |
TFRecord is the recommended format for large datasets because it supports efficient sequential reads, compression, and sharding across multiple files. Sharding data across at least 10 times as many files as there are reading hosts is a common best practice for parallel I/O.
# Reading from sharded TFRecord files
filenames = tf.data.Dataset.list_files("data/train-*.tfrecord")
dataset = filenames.interleave(
tf.data.TFRecordDataset,
cycle_length=4,
num_parallel_calls=tf.data.AUTOTUNE
)
Transformations are functional operations applied to a Dataset that return a new Dataset. They are composable and lazy, meaning no computation occurs until elements are consumed.
map(map_func, num_parallel_calls=None) applies a function to each element. The num_parallel_calls parameter enables parallel execution across multiple CPU cores:
def preprocess(image, label):
image = tf.cast(image, tf.float32) / 255.0
image = tf.image.resize(image, [224, 224])
return image, label
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
filter(predicate) retains only elements for which the predicate returns True:
# Keep only samples with label > 5
dataset = dataset.filter(lambda image, label: label > 5)
batch(batch_size, drop_remainder=False) groups consecutive elements into batches. Setting drop_remainder=True ensures all batches have the same size, which is required for certain static-shape operations:
dataset = dataset.batch(32, drop_remainder=True)
padded_batch(batch_size, padded_shapes, padding_values=None) batches elements that may have different shapes by padding them to a uniform shape. This is common in NLP tasks where sequences have variable lengths:
# Pad sequences to the longest in each batch
dataset = dataset.padded_batch(32, padded_shapes=([None],))
unbatch() reverses a batch, splitting each element back into individual elements.
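A small sketch using a toy range dataset:
ds = tf.data.Dataset.range(6).batch(3)  # elements: [0, 1, 2], [3, 4, 5]
ds = ds.unbatch()                       # elements: 0, 1, 2, 3, 4, 5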
shuffle(buffer_size, seed=None, reshuffle_each_iteration=True) maintains a buffer of buffer_size elements and randomly samples from it. For a perfectly uniform shuffle, the buffer size should equal or exceed the dataset size. Smaller buffers trade shuffle quality for lower memory usage:
dataset = dataset.shuffle(buffer_size=10000)
repeat(count=None) repeats the dataset for a specified number of epochs, or indefinitely if count is None. The ordering of shuffle, batch, and repeat matters:
- shuffle then repeat: shuffles within each epoch (recommended for most cases)
- repeat then shuffle: allows elements to cross epoch boundaries, shuffling across epochs

skip(count) and take(count) select subsets of elements by position.
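Returning to the shuffle/repeat ordering above, a minimal sketch of the two variants (using a toy range dataset):
base = tf.data.Dataset.range(10)
per_epoch = base.shuffle(10).repeat(3)       # reshuffles within each epoch
across_epochs = base.repeat(3).shuffle(10)   # elements may cross epoch boundaries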
flat_map(map_func) maps a function over each element and flattens the resulting datasets into a single dataset.
interleave(map_func, cycle_length, num_parallel_calls=None) applies a function to each element (typically producing a sub-dataset), then interleaves elements from cycle_length sub-datasets at a time. With num_parallel_calls, reading from multiple files happens in parallel:
files = tf.data.Dataset.list_files("data/*.csv")
dataset = files.interleave(
lambda f: tf.data.TextLineDataset(f).skip(1),
cycle_length=4,
num_parallel_calls=tf.data.AUTOTUNE
)
window(size, shift=1, stride=1, drop_remainder=False) creates windows of consecutive elements, returning nested datasets. This is useful for time series and sequence modeling:
range_ds = tf.data.Dataset.range(10)
windows = range_ds.window(5, shift=1, drop_remainder=True)
# Each window is a nested Dataset; flatten with flat_map + batch
windows = windows.flat_map(lambda w: w.batch(5))
zip(datasets) combines elements from multiple datasets into tuples, similar to Python's built-in zip.
concatenate(dataset) appends one dataset to another.
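A brief sketch of both operations on toy range datasets:
a = tf.data.Dataset.range(3)          # 0, 1, 2
b = tf.data.Dataset.range(10, 13)     # 10, 11, 12
zipped = tf.data.Dataset.zip((a, b))  # (0, 10), (1, 11), (2, 12)
combined = a.concatenate(b)           # 0, 1, 2, 10, 11, 12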
| Transformation | Purpose | Key parameters |
|---|---|---|
| map | Apply a function to each element | num_parallel_calls |
| filter | Keep elements matching a predicate | predicate function |
| batch | Group elements into fixed-size batches | batch_size, drop_remainder |
| padded_batch | Batch with padding for variable shapes | padded_shapes, padding_values |
| shuffle | Randomize element order | buffer_size, seed |
| repeat | Repeat dataset for multiple epochs | count |
| prefetch | Overlap data prep with model execution | buffer_size |
| cache | Cache elements in memory or on disk | filename (optional) |
| interleave | Interleave elements from sub-datasets | cycle_length, num_parallel_calls |
| flat_map | Map and flatten nested datasets | map function |
| window | Create sliding windows | size, shift, stride |
| unbatch | Split batched elements | none |
| skip / take | Select elements by position | count |
| zip | Combine multiple datasets | datasets |
| concatenate | Append one dataset to another | dataset |
| reduce | Aggregate all elements | initial_state, reduce_func |
The tf.data API includes several mechanisms for maximizing throughput and minimizing the time accelerators spend idle waiting for data.
Without prefetching, the CPU and accelerator operate sequentially: the CPU prepares a batch, then sits idle while the accelerator trains on it, and vice versa. The prefetch transformation overlaps these two phases so that the CPU begins preparing batch N+1 while the accelerator trains on batch N. This reduces the per-step time from the sum of CPU and accelerator time to the maximum of the two:
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
When reading from multiple files (especially over network storage), the interleave transformation with num_parallel_calls reads from several files simultaneously, hiding I/O latency:
dataset = files.interleave(
tf.data.TFRecordDataset,
cycle_length=8,
num_parallel_calls=tf.data.AUTOTUNE
)
Setting num_parallel_calls in the map transformation distributes preprocessing work across multiple CPU threads. This is especially beneficial for compute-heavy operations like image decoding, augmentation, or text tokenization:
dataset = dataset.map(decode_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
The cache transformation stores elements after they have been processed. On the first pass through the data, all transformations before cache are executed. On subsequent passes (later epochs), cached data is read directly, skipping the earlier transformations:
# Cache in memory (for datasets that fit)
dataset = dataset.map(expensive_preprocess).cache()
# Cache to disk (for larger datasets)
dataset = dataset.map(expensive_preprocess).cache("/tmp/cache_dir")
Placement matters: put cache after expensive, deterministic transformations but before randomized augmentations that should vary each epoch.
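A sketch of this placement, assuming a hypothetical source dataset raw_ds and hypothetical parse and augment functions (parse is deterministic and expensive, augment applies random transformations):
dataset = (
    raw_ds
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)    # expensive, deterministic: cache its output
    .cache()
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # random: must run fresh every epoch
    .shuffle(10000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)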
Applying a function element-by-element with map and then batching incurs per-element overhead. Reversing the order (batching first, then mapping) allows the function to operate on entire batches at once, leveraging vectorized operations:
# Slower: map then batch (function called per element)
dataset.map(increment).batch(256)
# Faster: batch then map (function called per batch)
dataset.batch(256).map(increment)
Benchmarks show that vectorized mapping can yield a 4 to 5x speedup over per-element mapping for lightweight operations.
tf.data.AUTOTUNE is a special constant that delegates parameter selection to the tf.data runtime. When passed as buffer_size to prefetch or as num_parallel_calls to map and interleave, the runtime collects timing information about each pipeline stage, uses an analytical model of pipeline performance, and applies a hill-climbing algorithm to dynamically adjust parallelism, buffer sizes, and other knobs. This removes the need for manual tuning and adapts to different hardware configurations automatically.
The autotuning mechanism periodically measures throughput and latency of each transformation. If the model processes data faster than the pipeline supplies it, the runtime increases parallelism. If the pipeline is ahead of the model, it reduces resource usage. This happens continuously during training without user intervention.
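Autotuning can also be adjusted or disabled through tf.data.Options; the CPU and RAM budget knobs below are assumed to be available in recent TensorFlow releases (roughly 2.6 and later):
options = tf.data.Options()
options.autotune.enabled = True         # on by default
options.autotune.cpu_budget = 4         # limit autotuned parallelism to roughly 4 cores
options.autotune.ram_budget = 2 << 30   # cap autotuned buffer memory at about 2 GiB
dataset = dataset.with_options(options)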
Combining all optimizations, a high-performance pipeline typically looks like:
dataset = (
tf.data.Dataset.list_files("data/train-*.tfrecord")
.interleave(
tf.data.TFRecordDataset,
cycle_length=8,
num_parallel_calls=tf.data.AUTOTUNE
)
.shuffle(buffer_size=10000)
.map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
.batch(64)
.prefetch(tf.data.AUTOTUNE)
)
The following table illustrates the relative impact of each optimization technique, based on benchmarks from the TensorFlow documentation:
| Pipeline configuration | Relative execution time |
|---|---|
| Naive (sequential, no optimization) | 1.00x (baseline) |
| + prefetch(AUTOTUNE) | ~0.97x |
| + parallel interleave | ~0.77x |
| + parallel map | ~0.67x |
| + cache (after first epoch) | ~0.50x |
| All optimizations combined | ~0.48x |
The combined effect of all optimizations can roughly halve pipeline execution time compared to a naive sequential implementation.
In TensorFlow 1.x, datasets could not be iterated directly in Python. Instead, users created an explicit Iterator object and called get_next() to obtain a symbolic tensor, which was then evaluated inside a tf.Session:
# TensorFlow 1.x pattern (deprecated)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
while True:
try:
value = sess.run(next_element)
except tf.errors.OutOfRangeError:
break
TensorFlow 1.x provided four types of iterators: one-shot, initializable, reinitializable, and feedable. Each offered different levels of flexibility at the cost of complexity.
With eager execution enabled by default in TensorFlow 2.x, tf.data.Dataset became a standard Python iterable. Elements can be consumed directly in a for loop:
# TensorFlow 2.x pattern
for images, labels in dataset:
predictions = model(images, training=True)
# ...
Alternatively, you can create an explicit iterator using Python's built-in iter() and next():
it = iter(dataset)
batch = next(it)
For long training runs, tf.data supports saving and restoring iterator state through tf.train.Checkpoint. This allows training to resume from the exact position in the dataset after an interruption, avoiding re-processing already-seen data:
iterator = iter(dataset)
ckpt = tf.train.Checkpoint(iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, "/tmp/ckpt", max_to_keep=3)
# During training
for step, batch in enumerate(iterator):
train_step(batch)
if step % 1000 == 0:
manager.save()
# After restart
ckpt.restore(manager.latest_checkpoint)
# Iterator resumes from saved position
tf.data.Dataset integrates directly with the Keras training API. Datasets can be passed to model.fit(), model.evaluate(), and model.predict() without manual iteration:
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10)
])
model.compile(
optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)
model.fit(train_ds, epochs=10)
Keras preprocessing layers can also be applied as part of the tf.data pipeline. When applied to a Dataset with .map(), preprocessing runs on the CPU asynchronously, which frees GPU cycles for model computation. Alternatively, preprocessing layers can be included inside the model itself, which causes them to run on the GPU and benefit from hardware acceleration:
# Preprocessing as part of tf.data pipeline (runs on CPU)
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(train_ds.map(lambda x, y: x))
train_ds = train_ds.map(lambda x, y: (norm_layer(x), y))
TensorFlow Datasets (TFDS) is a separate library that provides a catalog of ready-to-use datasets. TFDS is a high-level wrapper around tf.data: calling tfds.load() returns a tf.data.Dataset object, so all standard transformations and optimizations apply:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.shuffle(10000).batch(128).prefetch(tf.data.AUTOTUNE)
TFDS handles downloading, extracting, and preparing the data. It stores data in TFRecord format by default and supports deterministic ordering through seed-based shuffling.
The tf.data API integrates with tf.distribute.Strategy to handle data distribution across multiple GPUs or machines. When a dataset is used inside a distribution strategy scope, the strategy automatically shards the data across replicas.
TensorFlow provides three sharding policies through tf.data.experimental.AutoShardPolicy:
| Policy | Behavior |
|---|---|
| FILE | Each worker reads a different subset of input files |
| DATA | Each worker reads all files but skips to its shard of elements |
| AUTO | Attempts FILE sharding first; falls back to DATA sharding |
| OFF | No automatic sharding; each worker sees the full dataset |
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = build_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Dataset is automatically sharded across GPUs
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
model.fit(train_ds, epochs=10)
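The default policy can be overridden through tf.data.Options, for example to force DATA sharding when all workers read from a single file:
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_ds = train_ds.with_options(options)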
For large-scale training jobs, the tf.data service disaggregates input data processing from model training. Instead of preprocessing data on the same machines that run accelerators, the tf.data service offloads preprocessing to dedicated CPU workers. This architecture lets input processing scale horizontally across dedicated workers independently of the number of accelerator hosts, frees the accelerator hosts' CPUs from preprocessing work, and allows a single pool of workers to serve data to multiple training consumers.
The service is activated by adding a distribute transformation to the pipeline:
dataset = dataset.apply(
tf.data.experimental.service.distribute(
processing_mode="distributed_epoch",
service="grpc://tf-data-service:5000"
)
)
The API provides built-in methods for addressing class imbalance, which is common in tasks like fraud detection and medical diagnosis.
sample_from_datasets draws elements from multiple datasets according to specified weights:
neg_ds = full_ds.filter(lambda x, y: y == 0).repeat()
pos_ds = full_ds.filter(lambda x, y: y == 1).repeat()
balanced_ds = tf.data.Dataset.sample_from_datasets(
[neg_ds, pos_ds], weights=[0.5, 0.5]
).batch(32)
rejection_resample resamples elements from a single dataset to match a target class distribution:
resampled = dataset.rejection_resample(
class_func=lambda features, label: label,
target_dist=[0.5, 0.5]
)
When a transformation requires arbitrary Python code (for example, calling a NumPy or SciPy function), tf.py_function wraps the function so it can be used inside map. However, because tf.py_function executes in Python and cannot be traced into a TensorFlow graph, it acts as a performance barrier and prevents certain graph optimizations:
import scipy.ndimage as ndimage
def random_rotate(image):
angle = np.random.uniform(-30, 30)
return ndimage.rotate(image, angle, reshape=False).astype(np.float32)
def augment(image, label):
image = tf.py_function(random_rotate, [image], tf.float32)
image.set_shape([28, 28]) # Restore static shape info
return image, label
dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
For better performance, prefer using TensorFlow's own ops (such as tf.image.random_flip_left_right) instead of tf.py_function whenever possible.
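For example, a similar augmentation step built entirely from native TensorFlow image ops (the specific ops chosen here are illustrative, not a drop-in replacement for the rotation above):
def augment_native(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

dataset = dataset.map(augment_native, num_parallel_calls=tf.data.AUTOTUNE)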
Beyond runtime autotuning, the tf.data framework applies static graph optimizations to the pipeline before execution. These optimizations are analogous to query optimization in relational databases:
| Optimization | Description |
|---|---|
| Map and batch fusion | Combines consecutive map and batch operations into a single fused operator, reducing scheduling overhead |
| Map parallelization | Automatically parallelizes map operations when safe to do so |
| Noop elimination | Removes identity transformations that have no effect |
| Shuffle and repeat fusion | Fuses shuffle followed by repeat into a single operator |
| Map vectorization | Converts element-wise maps into batch-level vectorized operations |
| Filter fusion | Pushes filters earlier in the pipeline to reduce work in downstream stages |
| Injection of prefetch | Adds prefetch at the end of the pipeline when not already present |
These optimizations are enabled by default. Users can inspect and control them through tf.data.Options:
options = tf.data.Options()
options.experimental_optimization.map_and_batch_fusion = True
options.experimental_optimization.map_parallelization = True
dataset = dataset.with_options(options)
The two dominant frameworks for building ML data pipelines are tf.data (TensorFlow) and DataLoader (PyTorch). They differ in architecture, execution model, and API design.
| Feature | tf.data (TensorFlow) | DataLoader (PyTorch) |
|---|---|---|
| Abstraction | tf.data.Dataset (functional, composable transformations) | Dataset + DataLoader (class-based, imperative) |
| Parallelism model | Thread-based; num_parallel_calls on map/interleave | Process-based; num_workers spawns subprocesses |
| Shuffling | Buffer-based (shuffle(buffer_size)); quality depends on buffer size | Index-based; full shuffle of all indices each epoch |
| Graph compilation | Can compile to static graph for optimized execution | Runs in eager Python; relies on Python multiprocessing |
| Autotuning | Built-in AUTOTUNE for parallelism and buffer sizes | No built-in autotuning; manual parameter setting |
| Prefetching | Explicit prefetch transformation | prefetch_factor parameter on DataLoader |
| Custom data format | TFRecord recommended for large-scale I/O | No prescribed format; flexible __getitem__ interface |
| Memory pinning | Handled internally | Explicit pin_memory=True for faster GPU transfer |
In practice, both systems achieve comparable throughput when properly configured. tf.data's functional API and autotuning make it easier to build optimized pipelines without manual experimentation, while PyTorch's DataLoader offers more flexibility through its class-based interface and direct Python integration.
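For reference, a roughly equivalent PyTorch configuration (a sketch assuming an existing map-style train_dataset):
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,      # any map-style Dataset implementing __len__ / __getitem__
    batch_size=64,
    shuffle=True,       # full index shuffle each epoch
    num_workers=4,      # process-based parallelism
    pin_memory=True,    # page-locked host memory for faster GPU transfer
    prefetch_factor=2,  # batches prefetched per worker
)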
By default, tf.data pipelines with parallel operations (num_parallel_calls > 1) may produce elements in non-deterministic order because different threads complete at different times. For reproducible training, determinism can be enforced:
options = tf.data.Options()
options.deterministic = True
dataset = dataset.with_options(options)
Enabling determinism may reduce throughput because faster threads must wait for slower ones to preserve ordering. The trade-off between reproducibility and performance depends on the use case.
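Determinism can also be relaxed per transformation rather than globally: map and interleave accept a deterministic argument, as in the sketch below (parse_fn is a placeholder):
# Allow this map to emit elements out of order for higher throughput
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)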
The snapshot transformation (now deprecated in favor of save and load) persists the output of a pipeline to disk. This is useful when preprocessing is expensive and the same preprocessed data is needed across multiple training runs:
# Save a preprocessed dataset
tf.data.Dataset.save(dataset, "/path/to/saved_dataset")
# Load it back
loaded_ds = tf.data.Dataset.load("/path/to/saved_dataset")
Saved datasets are stored in multiple shards. The save/load API guarantees backward compatibility, so datasets saved with older versions of TensorFlow can be loaded with newer versions.
def parse_image(filename, label):
image = tf.io.read_file(filename)
image = tf.io.decode_jpeg(image, channels=3)
image = tf.image.resize(image, [224, 224])
image = tf.cast(image, tf.float32) / 255.0
return image, label
files_ds = tf.data.Dataset.list_files("images/*/*.jpg")
labels_ds = files_ds.map(extract_label)  # extract_label: user-defined function mapping a file path to an integer label
train_ds = (
tf.data.Dataset.zip((files_ds, labels_ds))
.shuffle(10000)
.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
.batch(32)
.prefetch(tf.data.AUTOTUNE)
)
train_ds = (
    tf.data.TextLineDataset(["train.tsv"])
    .skip(1)  # Skip header
    .map(lambda line: tf.io.decode_csv(line, [[0], [""]], field_delim="\t"))
    # Tokenize by whitespace and hash tokens to integer ids (a simple stand-in for a vocabulary lookup)
    .map(lambda label, text: (tf.strings.to_hash_bucket_fast(tf.strings.split(text), 20000), label))
    .shuffle(10000)
    .padded_batch(64, padded_shapes=([None], []))  # Pad token sequences to the longest in each batch
    .prefetch(tf.data.AUTOTUNE)
)
def make_window_dataset(series, window_size, shift, batch_size):
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(window_size + 1, shift=shift, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(window_size + 1))
ds = ds.map(lambda w: (w[:-1], w[-1]))
ds = ds.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
return ds
TensorFlow provides tools for analyzing tf.data pipeline performance:
- TensorFlow Profiler: the tf.data section of the profiler shows a timeline of pipeline execution, revealing idle gaps and bottlenecks (a minimal profiling sketch follows the troubleshooting table below).
- tf.data.experimental.StatsAggregator: collects statistics about the pipeline (throughput, latency per transformation).

Common symptoms and their causes:
| Symptom | Likely cause | Solution |
|---|---|---|
| GPU utilization low, CPU high | Data preprocessing is too slow | Add num_parallel_calls=AUTOTUNE to map; consider cache |
| GPU utilization low, CPU low | I/O bottleneck or pipeline stall | Use parallel interleave; increase prefetch buffer |
| First epoch slow, subsequent fast | No caching | Add cache() after expensive transformations |
| Memory growing over time | Large shuffle buffer or unbounded cache | Reduce buffer_size; cache to disk instead of memory |
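A minimal sketch of programmatic profiling around a few training steps (the log directory and step count are illustrative, and train_step is a placeholder for the user's training function):
tf.profiler.experimental.start("/tmp/tf_profile")
for step, batch in enumerate(dataset.take(100)):
    train_step(batch)
tf.profiler.experimental.stop()
# Then inspect the tf.data analysis and trace viewer in TensorBoard:
#   tensorboard --logdir /tmp/tf_profile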
- from_generator and tf.py_function run in Python and cannot be compiled into TensorFlow graphs, limiting parallelism and throughput.
- shuffle(buffer_size) with a large buffer consumes memory proportional to the buffer size times the element size. For very large datasets, a full shuffle may not be feasible in memory.
- Opaque user-defined functions (such as those wrapped in tf.py_function) prevent graph optimization passes from being applied.
- Many symbols under tf.data.experimental have been deprecated or moved over time. Users should consult the official API documentation for current recommendations.

The tf.data system has influenced and been the subject of several research projects: