The Dataset API (tf.data) is a high-performance input pipeline framework within TensorFlow designed for loading, transforming, and delivering data to machine learning models during training and evaluation. It provides a composable, declarative interface built around the tf.data.Dataset abstraction, which represents a potentially large sequence of elements that can be processed through a chain of functional transformations. The API follows an Extract-Transform-Load (ETL) pattern, separating user-facing application logic from underlying performance optimizations so that practitioners can focus on data semantics while the runtime handles efficient execution.
The tf.data framework was developed at Google and is described in a 2021 paper published in the Proceedings of the VLDB Endowment. An empirical analysis of millions of jobs run in Google's datacenters showed that input data processing consumed a substantial fraction of job resources and exhibited high diversity across workloads, motivating a general-purpose, automatically optimized pipeline system. The framework ships as a core module of TensorFlow and is the recommended approach for building input pipelines across all TensorFlow-based training workflows.
Imagine you are building something out of LEGO bricks, but all your bricks are mixed up in a giant pile. Before you can start building, someone needs to find the right bricks, sort them by color and size, and hand them to you one group at a time so you never have to stop and wait. That is what tf.data does for a computer that is learning. The computer's "brain" (a GPU or TPU) learns really fast, but it needs data prepared and ready. tf.data is the helper that reads the data, cleans it up, organizes it into small batches, and keeps handing new batches to the brain before it finishes the last one. That way the brain never sits around doing nothing.
Training modern deep learning models involves repeatedly feeding data through the model, adjusting weights through backpropagation, and iterating over the full dataset for many epochs. As GPUs and TPUs have grown faster, the CPU-based data loading and preprocessing step has increasingly become the bottleneck. Research has found that nearly 56% of GPU cycles can be wasted waiting for training data even when CPUs operate at 92% utilization. With datasets growing larger and accelerators growing faster, this mismatch only becomes more pronounced.
Before tf.data, TensorFlow 1.x relied on feed dictionaries and queue-based pipelines. Feed dictionaries required users to manually fetch data in Python and pass it into the computation graph through session.run(), which introduced Python-level overhead and prevented pipelining between host and device. Queue runners were more efficient but difficult to use correctly, error-prone, and hard to debug. The tf.data API was introduced to replace both approaches with a single, composable, high-performance abstraction that works naturally with both graph and eager execution modes.
The tf.data pipeline follows a three-stage ETL pattern:
| Stage | Role | Typical operations |
|---|---|---|
| Extract | Read raw data from storage (local disk, cloud storage, databases) | TFRecordDataset, TextLineDataset, CsvDataset, from_tensor_slices, from_generator, list_files |
| Transform | Parse, decode, augment, shuffle, and batch the data on CPU | map, filter, shuffle, batch, padded_batch, window, flat_map, interleave |
| Load | Deliver prepared batches to the accelerator for model execution | prefetch, distribution via tf.distribute.Strategy |
This separation allows each stage to be independently parallelized and optimized. The Extract stage can read from multiple files concurrently, the Transform stage can parallelize map operations across CPU cores, and the Load stage can overlap data delivery with model computation.
The central object in the API is tf.data.Dataset, which represents a sequence of elements. Each element can be a single tensor, a tuple of tensors, a dictionary of tensors, or nested combinations of these structures. Datasets are lazy: transformations are not executed until elements are consumed (for example, by iterating over the dataset in a training loop).
Datasets support several container types for elements:
| Supported type | Description |
|---|---|
| tf.Tensor | Standard dense tensor |
| tf.sparse.SparseTensor | Sparse tensor representation |
| tf.RaggedTensor | Tensor with variable-length dimensions |
| tf.TensorArray | Dynamic-size tensor array |
| tf.data.Dataset | Nested dataset (used by windowing) |
Element structures can use tuple, dict, NamedTuple, or OrderedDict as containers. Python lists are not supported as element containers.
The structure of a dataset's elements can be inspected using the element_spec property:
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices({
"features": tf.random.uniform([100, 10]),
"labels": tf.random.uniform(<sup><a href="#cite_note-100" class="cite-ref">[100]</a></sup>, maxval=2, dtype=tf.int32)
})
print(dataset.element_spec)
# {'features': TensorSpec(shape=(10,), dtype=tf.float32),
# 'labels': TensorSpec(shape=(), dtype=tf.int32)}
The API provides multiple factory methods for constructing datasets from different data sources.
The from_tensor_slices method creates a dataset by slicing tensors along their first dimension. This is the simplest way to create a dataset from NumPy arrays or Python lists:
import numpy as np
images = np.random.rand(1000, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
The from_tensors method wraps the entire input as a single element, useful when you want to treat a batch as one element.
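A minimal illustration of the contrast with from_tensor_slices (the constant below is only illustrative):
t = tf.constant([[1, 2], [3, 4]])
ds_single = tf.data.Dataset.from_tensors(t)        # one element of shape (2, 2)
ds_sliced = tf.data.Dataset.from_tensor_slices(t)  # two elements, each of shape (2,)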
The from_generator method wraps a Python generator function, which is useful for data sources that do not fit neatly into tensor slicing or for lazy data loading:
def data_generator():
for i in range(1000):
yield np.random.rand(28, 28).astype(np.float32), i % 10
dataset = tf.data.Dataset.from_generator(
data_generator,
output_signature=(
tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
tf.TensorSpec(shape=(), dtype=tf.int32)
)
)
Because generators execute in Python, they cannot be serialized into a TensorFlow graph and may introduce performance overhead. They are best used for prototyping or when interfacing with external libraries.
| Method | File format | Use case |
|---|---|---|
| tf.data.TFRecordDataset | TFRecord (protobuf-based binary) | Large-scale training with serialized examples |
| tf.data.TextLineDataset | Plain text files | NLP tasks, log processing |
| tf.data.experimental.CsvDataset | CSV files | Tabular / structured data |
| tf.data.Dataset.list_files | File path patterns | Image directories, audio files |
| tf.data.FixedLengthRecordDataset | Fixed-length binary records | Scientific data, legacy formats |
TFRecord is the recommended format for large datasets because it supports efficient sequential reads, compression, and sharding across multiple files. Sharding data across at least 10 times as many files as there are reading hosts is a common best practice for parallel I/O.
# Reading from sharded TFRecord files
filenames = tf.data.Dataset.list_files("data/train-*.tfrecord")
dataset = filenames.interleave(
tf.data.TFRecordDataset,
cycle_length=4,
num_parallel_calls=tf.data.AUTOTUNE
)
Transformations are functional operations applied to a Dataset that return a new Dataset. They are composable and lazy, meaning no computation occurs until elements are consumed.
map(map_func, num_parallel_calls=None) applies a function to each element. The num_parallel_calls parameter enables parallel execution across multiple CPU cores:
def preprocess(image, label):
image = tf.cast(image, tf.float32) / 255.0
image = tf.image.resize(image, [224, 224])
return image, label
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
filter(predicate) retains only elements for which the predicate returns True:
# Keep only samples with label > 5
dataset = dataset.filter(lambda image, label: label > 5)
batch(batch_size, drop_remainder=False) groups consecutive elements into batches. Setting drop_remainder=True ensures all batches have the same size, which is required for certain static-shape operations:
dataset = dataset.batch(32, drop_remainder=True)
padded_batch(batch_size, padded_shapes, padding_values=None) batches elements that may have different shapes by padding them to a uniform shape. This is common in NLP tasks where sequences have variable lengths:
# Pad sequences to the longest in each batch
dataset = dataset.padded_batch(32, padded_shapes=([None],))
unbatch() reverses a batch, splitting each element back into individual elements.
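A small sketch using a toy range dataset:
ds = tf.data.Dataset.range(6).batch(3)  # elements: [0, 1, 2], [3, 4, 5]
ds = ds.unbatch()                       # elements: 0, 1, 2, 3, 4, 5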
shuffle(buffer_size, seed=None, reshuffle_each_iteration=True) maintains a buffer of buffer_size elements and randomly samples from it. For a perfectly uniform shuffle, the buffer size should equal or exceed the dataset size. Smaller buffers trade shuffle quality for lower memory usage:
dataset = dataset.shuffle(buffer_size=10000)
repeat(count=None) repeats the dataset for a specified number of epochs, or indefinitely if count is None. The ordering of shuffle, batch, and repeat matters:
- shuffle then repeat: shuffles within each epoch (recommended for most cases)
- repeat then shuffle: allows elements to cross epoch boundaries, shuffling across epochs

skip(count) and take(count) select subsets of elements by position.
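Returning to the shuffle/repeat ordering above, a minimal sketch of the two variants (using a toy range dataset):
base = tf.data.Dataset.range(10)
per_epoch = base.shuffle(10).repeat(3)       # reshuffles within each epoch
across_epochs = base.repeat(3).shuffle(10)   # elements may cross epoch boundaries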
flat_map(map_func) maps a function over each element and flattens the resulting datasets into a single dataset.
interleave(map_func, cycle_length, num_parallel_calls=None) applies a function to each element (typically producing a sub-dataset), then interleaves elements from cycle_length sub-datasets at a time. With num_parallel_calls, reading from multiple files happens in parallel:
files = tf.data.Dataset.list_files("data/*.csv")
dataset = files.interleave(
lambda f: tf.data.TextLineDataset(f).skip(1),
cycle_length=4,
num_parallel_calls=tf.data.AUTOTUNE
)
window(size, shift=1, stride=1, drop_remainder=False) creates windows of consecutive elements, returning nested datasets. This is useful for time series and sequence modeling:
range_ds = tf.data.Dataset.range(10)
windows = range_ds.window(5, shift=1, drop_remainder=True)
# Each window is a nested Dataset; flatten with flat_map + batch
windows = windows.flat_map(lambda w: w.batch(5))
zip(datasets) combines elements from multiple datasets into tuples, similar to Python's built-in zip.
concatenate(dataset) appends one dataset to another.
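A brief sketch of both operations on toy range datasets:
a = tf.data.Dataset.range(3)          # 0, 1, 2
b = tf.data.Dataset.range(10, 13)     # 10, 11, 12
zipped = tf.data.Dataset.zip((a, b))  # (0, 10), (1, 11), (2, 12)
combined = a.concatenate(b)           # 0, 1, 2, 10, 11, 12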
| Transformation | Purpose | Key parameters |
|---|---|---|
| map | Apply a function to each element | num_parallel_calls |
| filter | Keep elements matching a predicate | predicate function |
| batch | Group elements into fixed-size batches | batch_size, drop_remainder |
| padded_batch | Batch with padding for variable shapes | padded_shapes, padding_values |
| shuffle | Randomize element order | buffer_size, seed |
| repeat | Repeat dataset for multiple epochs | count |
| prefetch | Overlap data prep with model execution | buffer_size |
| cache | Cache elements in memory or on disk | filename (optional) |
| interleave | Interleave elements from sub-datasets | cycle_length, num_parallel_calls |
| flat_map | Map and flatten nested datasets | map function |
| window | Create sliding windows | size, shift, stride |
| unbatch | Split batched elements | none |
| skip / take | Select elements by position | count |
| zip | Combine multiple datasets | datasets |
| concatenate | Append one dataset to another | dataset |
| reduce | Aggregate all elements | initial_state, reduce_func |
The tf.data API includes several mechanisms for maximizing throughput and minimizing the time accelerators spend idle waiting for data.
Without prefetching, the CPU and accelerator operate sequentially: the CPU prepares a batch, then sits idle while the accelerator trains on it, and vice versa. The prefetch transformation overlaps these two phases so that the CPU begins preparing batch N+1 while the accelerator trains on batch N. This reduces the per-step time from the sum of CPU and accelerator time to the maximum of the two:
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
When reading from multiple files (especially over network storage), the interleave transformation with num_parallel_calls reads from several files simultaneously, hiding I/O latency:
dataset = files.interleave(
tf.data.TFRecordDataset,
cycle_length=8,
num_parallel_calls=tf.data.AUTOTUNE
)
Setting num_parallel_calls in the map transformation distributes preprocessing work across multiple CPU threads. This is especially beneficial for compute-heavy operations like image decoding, augmentation, or text tokenization:
dataset = dataset.map(decode_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
The cache transformation stores elements after they have been processed. On the first pass through the data, all transformations before cache are executed. On subsequent passes (later epochs), cached data is read directly, skipping the earlier transformations:
# Cache in memory (for datasets that fit)
dataset = dataset.map(expensive_preprocess).cache()
# Cache to disk (for larger datasets)
dataset = dataset.map(expensive_preprocess).cache("/tmp/cache_dir")
Placement matters: put cache after expensive, deterministic transformations but before randomized augmentations that should vary each epoch.
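A sketch of this placement, assuming a hypothetical source dataset raw_ds and hypothetical parse and augment functions (parse is deterministic and expensive, augment applies random transformations):
dataset = (
    raw_ds
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)    # expensive, deterministic: cache its output
    .cache()
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # random: must run fresh every epoch
    .shuffle(10000)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)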
Applying a function element-by-element with map and then batching incurs per-element overhead. Reversing the order (batching first, then mapping) allows the function to operate on entire batches at once, leveraging vectorized operations:
# Slower: map then batch (function called per element)
dataset.map(increment).batch(256)
# Faster: batch then map (function called per batch)
dataset.batch(256).map(increment)
Benchmarks show that vectorized mapping can yield a 4 to 5x speedup over per-element mapping for lightweight operations.
tf.data.AUTOTUNE is a special constant that delegates parameter selection to the tf.data runtime. When passed as buffer_size to prefetch or as num_parallel_calls to map and interleave, the runtime collects timing information about each pipeline stage, uses an analytical model of pipeline performance, and applies a hill-climbing algorithm to dynamically adjust parallelism, buffer sizes, and other knobs. This removes the need for manual tuning and adapts to different hardware configurations automatically.
The autotuning mechanism periodically measures throughput and latency of each transformation. If the model processes data faster than the pipeline supplies it, the runtime increases parallelism. If the pipeline is ahead of the model, it reduces resource usage. This happens continuously during training without user intervention.
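Autotuning can also be adjusted or disabled through tf.data.Options; the CPU and RAM budget knobs below are assumed to be available in recent TensorFlow releases (roughly 2.6 and later):
options = tf.data.Options()
options.autotune.enabled = True         # on by default
options.autotune.cpu_budget = 4         # limit autotuned parallelism to roughly 4 cores
options.autotune.ram_budget = 2 << 30   # cap autotuned buffer memory at about 2 GiB
dataset = dataset.with_options(options)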
Combining all optimizations, a high-performance pipeline typically looks like:
dataset = (
tf.data.Dataset.list_files("data/train-*.tfrecord")
.interleave(
tf.data.TFRecordDataset,
cycle_length=8,
num_parallel_calls=tf.data.AUTOTUNE
)
.shuffle(buffer_size=10000)
.map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
.batch(64)
.prefetch(tf.data.AUTOTUNE)
)
The following table illustrates the relative impact of each optimization technique, based on benchmarks from the TensorFlow documentation:
| Pipeline configuration | Relative execution time |
|---|---|
| Naive (sequential, no optimization) | 1.00x (baseline) |
| + prefetch(AUTOTUNE) | ~0.97x |
| + parallel interleave | ~0.77x |
| + parallel map | ~0.67x |
| + cache (after first epoch) | ~0.50x |
| All optimizations combined | ~0.48x |
The combined effect of all optimizations can roughly halve pipeline execution time compared to a naive sequential implementation.
In TensorFlow 1.x, datasets could not be iterated directly in Python. Instead, users created an explicit Iterator object and called get_next() to obtain a symbolic tensor, which was then evaluated inside a tf.Session:
# TensorFlow 1.x pattern (deprecated)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
while True:
try:
value = sess.run(next_element)
except tf.errors.OutOfRangeError:
break
TensorFlow 1.x provided four types of iterators: one-shot, initializable, reinitializable, and feedable. Each offered different levels of flexibility at the cost of complexity.
With eager execution enabled by default in TensorFlow 2.x, tf.data.Dataset became a standard Python iterable. Elements can be consumed directly in a for loop:
# TensorFlow 2.x pattern
for images, labels in dataset:
predictions = model(images, training=True)
# ...
Alternatively, you can create an explicit iterator using Python's built-in iter() and next():
it = iter(dataset)
batch = next(it)
For long training runs, tf.data supports saving and restoring iterator state through tf.train.Checkpoint. This allows training to resume from the exact position in the dataset after an interruption, avoiding re-processing already-seen data:
iterator = iter(dataset)
ckpt = tf.train.Checkpoint(iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, "/tmp/ckpt", max_to_keep=3)
# During training
for step, batch in enumerate(iterator):
train_step(batch)
if step % 1000 == 0:
manager.save()
# After restart
ckpt.restore(manager.latest_checkpoint)
# Iterator resumes from saved position
tf.data.Dataset integrates directly with the Keras training API. Datasets can be passed to model.fit(), model.evaluate(), and model.predict() without manual iteration:
model = tf.keras.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10)
])
model.compile(
optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy']
)
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)
model.fit(train_ds, epochs=10)
Keras preprocessing layers can also be applied as part of the tf.data pipeline. When applied to a Dataset with .map(), preprocessing runs on the CPU asynchronously, which frees GPU cycles for model computation. Alternatively, preprocessing layers can be included inside the model itself, which causes them to run on the GPU and benefit from hardware acceleration:
# Preprocessing as part of tf.data pipeline (runs on CPU)
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(train_ds.map(lambda x, y: x))
train_ds = train_ds.map(lambda x, y: (norm_layer(x), y))
TensorFlow Datasets (TFDS) is a separate library that provides a catalog of ready-to-use datasets. TFDS is a high-level wrapper around tf.data: calling tfds.load() returns a tf.data.Dataset object, so all standard transformations and optimizations apply:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.shuffle(10000).batch(128).prefetch(tf.data.AUTOTUNE)
TFDS handles downloading, extracting, and preparing the data. It stores data in TFRecord format by default and supports deterministic ordering through seed-based shuffling.
The tf.data API integrates with tf.distribute.Strategy to handle data distribution across multiple GPUs or machines. When a dataset is used inside a distribution strategy scope, the strategy automatically shards the data across replicas.
TensorFlow provides three sharding policies through tf.data.experimental.AutoShardPolicy:
| Policy | Behavior |
|---|---|
| FILE | Each worker reads a different subset of input files |
| DATA | Each worker reads all files but skips to its shard of elements |
| AUTO | Attempts FILE sharding first; falls back to DATA sharding |
| OFF | No automatic sharding; each worker sees the full dataset |
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = build_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Dataset is automatically sharded across GPUs
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
model.fit(train_ds, epochs=10)
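The default policy can be overridden through tf.data.Options, for example to force DATA sharding when all workers read from a single file:
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_ds = train_ds.with_options(options)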
For large-scale training jobs, the tf.data service disaggregates input data processing from model training. Instead of preprocessing data on the same machines that run accelerators, the tf.data service offloads preprocessing to dedicated CPU workers. This architecture lets input processing scale horizontally across dedicated workers independently of the number of accelerator hosts, frees the accelerator hosts' CPUs from preprocessing work, and allows a single pool of workers to serve data to multiple training consumers.
The service is activated by adding a distribute transformation to the pipeline:
dataset = dataset.apply(
tf.data.experimental.service.distribute(
processing_mode="distributed_epoch",
service="grpc://tf-data-service:5000"
)
)
The API provides built-in methods for addressing class imbalance, which is common in tasks like fraud detection and medical diagnosis.
sample_from_datasets draws elements from multiple datasets according to specified weights:
neg_ds = full_ds.filter(lambda x, y: y == 0).repeat()
pos_ds = full_ds.filter(lambda x, y: y == 1).repeat()
balanced_ds = tf.data.Dataset.sample_from_datasets(
[neg_ds, pos_ds], weights=[0.5, 0.5]
).batch(32)
rejection_resample resamples elements from a single dataset to match a target class distribution:
resampled = dataset.rejection_resample(
class_func=lambda features, label: label,
target_dist=[0.5, 0.5]
)
When a transformation requires arbitrary Python code (for example, calling a NumPy or SciPy function), tf.py_function wraps the function so it can be used inside map. However, because tf.py_function executes in Python and cannot be traced into a TensorFlow graph, it acts as a performance barrier and prevents certain graph optimizations:
import scipy.ndimage as ndimage
def random_rotate(image):
angle = np.random.uniform(-30, 30)
return ndimage.rotate(image, angle, reshape=False).astype(np.float32)
def augment(image, label):
image = tf.py_function(random_rotate, [image], tf.float32)
image.set_shape([28, 28]) # Restore static shape info
return image, label
dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
For better performance, prefer using TensorFlow's own ops (such as tf.image.random_flip_left_right) instead of tf.py_function whenever possible.
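For example, a similar augmentation step built entirely from native TensorFlow image ops (the specific ops chosen here are illustrative, not a drop-in replacement for the rotation above):
def augment_native(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, label

dataset = dataset.map(augment_native, num_parallel_calls=tf.data.AUTOTUNE)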
Beyond runtime autotuning, the tf.data framework applies static graph optimizations to the pipeline before execution. These optimizations are analogous to query optimization in relational databases:
| Optimization | Description |
|---|---|
| Map and batch fusion | Combines consecutive map and batch operations into a single fused operator, reducing scheduling overhead |
| Map parallelization | Automatically parallelizes map operations when safe to do so |
| Noop elimination | Removes identity transformations that have no effect |
| Shuffle and repeat fusion | Fuses shuffle followed by repeat into a single operator |
| Map vectorization | Converts element-wise maps into batch-level vectorized operations |
| Filter fusion | Pushes filters earlier in the pipeline to reduce work in downstream stages |
| Injection of prefetch | Adds prefetch at the end of the pipeline when not already present |
These optimizations are enabled by default. Users can inspect and control them through tf.data.Options:
options = tf.data.Options()
options.experimental_optimization.map_and_batch_fusion = True
options.experimental_optimization.map_parallelization = True
dataset = dataset.with_options(options)
The two dominant frameworks for building ML data pipelines are tf.data (TensorFlow) and DataLoader (PyTorch). They differ in architecture, execution model, and API design.
| Feature | tf.data (TensorFlow) | DataLoader (PyTorch) |
|---|---|---|
| Abstraction | tf.data.Dataset (functional, composable transformations) | Dataset + DataLoader (class-based, imperative) |
| Parallelism model | Thread-based; num_parallel_calls on map/interleave | Process-based; num_workers spawns subprocesses |
| Shuffling | Buffer-based (shuffle(buffer_size)); quality depends on buffer size | Index-based; full shuffle of all indices each epoch |
| Graph compilation | Can compile to static graph for optimized execution | Runs in eager Python; relies on Python multiprocessing |
| Autotuning | Built-in AUTOTUNE for parallelism and buffer sizes | No built-in autotuning; manual parameter setting |
| Prefetching | Explicit prefetch transformation | prefetch_factor parameter on DataLoader |
| Custom data format | TFRecord recommended for large-scale I/O | No prescribed format; flexible __getitem__ interface |
| Memory pinning | Handled internally | Explicit pin_memory=True for faster GPU transfer |
In practice, both systems achieve comparable throughput when properly configured. tf.data's functional API and autotuning make it easier to build optimized pipelines without manual experimentation, while PyTorch's DataLoader offers more flexibility through its class-based interface and direct Python integration.
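For reference, a roughly equivalent PyTorch configuration (a sketch assuming an existing map-style train_dataset):
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,      # any map-style Dataset implementing __len__ / __getitem__
    batch_size=64,
    shuffle=True,       # full index shuffle each epoch
    num_workers=4,      # process-based parallelism
    pin_memory=True,    # page-locked host memory for faster GPU transfer
    prefetch_factor=2,  # batches prefetched per worker
)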
By default, tf.data pipelines with parallel operations (num_parallel_calls > 1) may produce elements in non-deterministic order because different threads complete at different times. For reproducible training, determinism can be enforced:
options = tf.data.Options()
options.deterministic = True
dataset = dataset.with_options(options)
Enabling determinism may reduce throughput because faster threads must wait for slower ones to preserve ordering. The trade-off between reproducibility and performance depends on the use case.
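Determinism can also be relaxed per transformation rather than globally: map and interleave accept a deterministic argument, as in the sketch below (parse_fn is a placeholder):
# Allow this map to emit elements out of order for higher throughput
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)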
The snapshot transformation (now deprecated in favor of save and load) persists the output of a pipeline to disk. This is useful when preprocessing is expensive and the same preprocessed data is needed across multiple training runs:
# Save a preprocessed dataset
tf.data.Dataset.save(dataset, "/path/to/saved_dataset")
# Load it back
loaded_ds = tf.data.Dataset.load("/path/to/saved_dataset")
Saved datasets are stored in multiple shards. The save/load API guarantees backward compatibility, so datasets saved with older versions of TensorFlow can be loaded with newer versions.
def parse_image(filename, label):
image = tf.io.read_file(filename)
image = tf.io.decode_jpeg(image, channels=3)
image = tf.image.resize(image, [224, 224])
image = tf.cast(image, tf.float32) / 255.0
return image, label
files_ds = tf.data.Dataset.list_files("images/*/*.jpg")
labels_ds = files_ds.map(extract_label)  # extract_label: user-defined function mapping a file path to an integer label
train_ds = (
tf.data.Dataset.zip((files_ds, labels_ds))
.shuffle(10000)
.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
.batch(32)
.prefetch(tf.data.AUTOTUNE)
)
train_ds = (
    tf.data.TextLineDataset(["train.tsv"])
    .skip(1)  # Skip header
    .map(lambda line: tf.io.decode_csv(line, [[0], [""]], field_delim="\t"))
    # Tokenize by whitespace and hash tokens to integer ids (a simple stand-in for a vocabulary lookup)
    .map(lambda label, text: (tf.strings.to_hash_bucket_fast(tf.strings.split(text), 20000), label))
    .shuffle(10000)
    .padded_batch(64, padded_shapes=([None], []))  # Pad token sequences to the longest in each batch
    .prefetch(tf.data.AUTOTUNE)
)
def make_window_dataset(series, window_size, shift, batch_size):
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(window_size + 1, shift=shift, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(window_size + 1))
ds = ds.map(lambda w: (w[:-1], w[-1]))
ds = ds.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
return ds
TensorFlow provides tools for analyzing tf.data pipeline performance:
- TensorFlow Profiler: the tf.data section of the profiler shows a timeline of pipeline execution, revealing idle gaps and bottlenecks (a minimal profiling sketch follows the troubleshooting table below).
- tf.data.experimental.StatsAggregator: collects statistics about the pipeline (throughput, latency per transformation).

Common symptoms and their causes:
| Symptom | Likely cause | Solution |
|---|---|---|
| GPU utilization low, CPU high | Data preprocessing is too slow | Add num_parallel_calls=AUTOTUNE to map; consider cache |
| GPU utilization low, CPU low | I/O bottleneck or pipeline stall | Use parallel interleave; increase prefetch buffer |
| First epoch slow, subsequent fast | No caching | Add cache() after expensive transformations |
| Memory growing over time | Large shuffle buffer or unbounded cache | Reduce buffer_size; cache to disk instead of memory |
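A minimal sketch of programmatic profiling around a few training steps (the log directory and step count are illustrative, and train_step is a placeholder for the user's training function):
tf.profiler.experimental.start("/tmp/tf_profile")
for step, batch in enumerate(dataset.take(100)):
    train_step(batch)
tf.profiler.experimental.stop()
# Then inspect the tf.data analysis and trace viewer in TensorBoard:
#   tensorboard --logdir /tmp/tf_profile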
- from_generator and tf.py_function run in Python and cannot be compiled into TensorFlow graphs, limiting parallelism and throughput.
- shuffle(buffer_size) with a large buffer consumes memory proportional to the buffer size times the element size. For very large datasets, a full shuffle may not be feasible in memory.
- Opaque user-defined functions (such as those wrapped in tf.py_function) prevent graph optimization passes from being applied.
- Many symbols under tf.data.experimental have been deprecated or moved over time. Users should consult the official API documentation for current recommendations.

The tf.data system has influenced and been the subject of several research projects: