# Dataset API (tf.data)

> Source: https://aiwiki.ai/wiki/dataset_api_tf_data
> Updated: 2026-06-28
> Categories: Deep Learning, Developer Tools, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **Dataset API (tf.data)** is the high-performance input pipeline framework within [TensorFlow](/wiki/tensorflow) for loading, transforming, and delivering data to [machine learning](/wiki/machine_learning) models during training and evaluation. It is built around the `tf.data.Dataset` abstraction, a composable, declarative object that represents a potentially large sequence of elements processed through a chain of functional transformations such as `map`, `batch`, `shuffle`, and `prefetch`. The API follows an Extract-Transform-Load (ETL) pattern, separating user-facing application logic from underlying performance optimizations so that practitioners focus on data semantics while the runtime handles efficient, parallelized execution.[1][6]

The tf.data framework was developed at Google and is described in the 2021 paper "tf.data: A Machine Learning Data Processing Framework" published in the Proceedings of the VLDB Endowment (Vol. 14, No. 12, pp. 2945-2958) by Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk.[1] An empirical analysis of millions of machine learning jobs run in Google's datacenter fleet over one month showed that input data processing is highly diverse and consumes a significant fraction of job resources: 20% of jobs spent more than a third of their compute time ingesting data, and across all analyzed jobs roughly 30% of total compute time went to data ingestion.[1] This motivated a general-purpose, automatically optimized pipeline system. The framework ships as a core module of TensorFlow and is the recommended approach for building input pipelines across all TensorFlow-based training workflows.[6][9]

## ELI5 (Explain like I'm 5)

Imagine you are building something out of LEGO bricks, but all your bricks are mixed up in a giant pile. Before you can start building, someone needs to find the right bricks, sort them by color and size, and hand them to you one group at a time so you never have to stop and wait. That is what tf.data does for a computer that is learning. The computer's "brain" (a [GPU](/wiki/gpu_computing) or [TPU](/wiki/tpu)) learns really fast, but it needs data prepared and ready. tf.data is the helper that reads the data, cleans it up, organizes it into small batches, and keeps handing new batches to the brain before it finishes the last one. That way the brain never sits around doing nothing.

## What is the tf.data Dataset API?

The tf.data Dataset API is TensorFlow's official input pipeline library. Its central object, `tf.data.Dataset`, represents a sequence of elements (where each element is one or more tensors) and exposes a fluent set of transformations that return new datasets. Pipelines are lazy and composable: nothing is read or computed until elements are actually consumed, for example by iterating in a training loop or passing the dataset to [Keras](/wiki/keras) `model.fit()`. This design lets the same pipeline definition run efficiently across CPUs, [GPUs](/wiki/gpu_computing), and [TPUs](/wiki/tpu), and lets the runtime apply parallelism, prefetching, caching, and static graph optimizations on the user's behalf.[1][6]

## Why was tf.data created?

Training modern [deep learning](/wiki/deep_learning) models involves repeatedly feeding data through the model, adjusting weights through [backpropagation](/wiki/backpropagation), and iterating over the full dataset for many [epochs](/wiki/epoch). As GPUs and TPUs have grown faster, the CPU-based data loading and preprocessing step has increasingly become the bottleneck. One analysis found that as much as 56% of GPU cycles can be wasted waiting for training data even when CPUs operate at 92% utilization, a phenomenon known as a data stall or I/O stall.[13] With datasets growing larger and accelerators growing faster, this mismatch only becomes more pronounced.

Before tf.data, TensorFlow 1.x relied on feed dictionaries and queue-based pipelines. Feed dictionaries required users to manually fetch data in Python and pass it into the computation graph through `session.run()`, which introduced Python-level overhead and prevented pipelining between host and device. Queue runners were more efficient but difficult to use correctly, error-prone, and hard to debug. The tf.data API was introduced to replace both approaches with a single, composable, high-performance abstraction that works naturally with both graph and [eager execution](/wiki/eager_execution) modes within the TensorFlow runtime.[1][2]

## Architecture and design

### The ETL model

The tf.data pipeline follows a three-stage ETL pattern:

| Stage | Role | Typical operations |
|-------|------|--------------------|
| **Extract** | Read raw data from storage (local disk, cloud storage, databases) | `TFRecordDataset`, `TextLineDataset`, `CsvDataset`, `from_tensor_slices`, `from_generator`, `list_files` |
| **Transform** | Parse, decode, augment, shuffle, and batch the data on CPU | `map`, `filter`, `shuffle`, `batch`, `padded_batch`, `window`, `flat_map`, `interleave` |
| **Load** | Deliver prepared batches to the accelerator for model execution | `prefetch`, distribution via `tf.distribute.Strategy` |

This separation allows each stage to be independently parallelized and optimized. The Extract stage can read from multiple files concurrently, the Transform stage can parallelize map operations across CPU cores, and the Load stage can overlap data delivery with model computation.[7]

### Core abstraction: `tf.data.Dataset`

The central object in the API is `tf.data.Dataset`, which represents a sequence of elements. Each element can be a single tensor, a tuple of tensors, a dictionary of tensors, or nested combinations of these structures. Datasets are lazy: transformations are not executed until elements are consumed (for example, by iterating over the dataset in a training loop).[9]

Datasets support several container types for elements:

| Supported type | Description |
|---------------|-------------|
| `tf.Tensor` | Standard dense tensor |
| `tf.sparse.SparseTensor` | Sparse tensor representation |
| `tf.RaggedTensor` | Tensor with variable-length dimensions |
| `tf.TensorArray` | Dynamic-size tensor array |
| `tf.data.Dataset` | Nested dataset (used by windowing) |

Element structures can use `tuple`, `dict`, `NamedTuple`, or `OrderedDict` as containers. Python lists are not supported as element containers.[9]

The structure of a dataset's elements can be inspected using the `element_spec` property:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices({
    "features": tf.random.uniform([100, 10]),
    "labels": tf.random.uniform([100], maxval=2, dtype=tf.int32)
})
print(dataset.element_spec)
# {'features': TensorSpec(shape=(10,), dtype=tf.float32),
#  'labels': TensorSpec(shape=(), dtype=tf.int32)}
```

## How do you create a tf.data.Dataset?

The API provides multiple factory methods for constructing datasets from different data sources.

### From in-memory data

The `from_tensor_slices` method creates a dataset by slicing tensors along their first dimension. This is the simplest way to create a dataset from NumPy arrays or Python lists:

```python
import numpy as np

images = np.random.rand(1000, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
```

The `from_tensors` method wraps the entire input as a single element, useful when you want to treat a batch as one element.
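The difference between the two constructors shows up in how many elements each dataset contains. A minimal sketch, using an arbitrary 3x2 array chosen only for illustration:

```python
import numpy as np
import tensorflow as tf

data = np.arange(6, dtype=np.float32).reshape(3, 2)  # illustrative input

sliced = tf.data.Dataset.from_tensor_slices(data)  # one element per row
whole = tf.data.Dataset.from_tensors(data)         # the whole array as one element

num_sliced = int(sliced.cardinality())
num_whole = int(whole.cardinality())
print(num_sliced, sliced.element_spec.shape)  # 3 elements, each of shape (2,)
print(num_whole, whole.element_spec.shape)    # 1 element of shape (3, 2)
```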

### From Python generators

The `from_generator` method wraps a Python generator function, which is useful for data sources that do not fit neatly into tensor slicing or for lazy data loading:

```python
def data_generator():
    for i in range(1000):
        yield np.random.rand(28, 28).astype(np.float32), i % 10

dataset = tf.data.Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)
```

Because generators execute in Python, they cannot be serialized into a TensorFlow graph and may introduce performance overhead. They are best used for prototyping or when interfacing with external libraries.[9]

### From files

| Method | File format | Use case |
|--------|------------|----------|
| `tf.data.TFRecordDataset` | TFRecord (protobuf-based binary) | Large-scale training with serialized examples |
| `tf.data.TextLineDataset` | Plain text files | NLP tasks, log processing |
| `tf.data.experimental.CsvDataset` | CSV files | Tabular / structured data |
| `tf.data.Dataset.list_files` | File path patterns | Image directories, audio files |
| `tf.data.FixedLengthRecordDataset` | Fixed-length binary records | Scientific data, legacy formats |

TFRecord is the recommended format for large datasets because it supports efficient sequential reads, compression, and sharding across multiple files.[12] Sharding data across at least 10 times as many files as there are reading hosts is a common best practice for parallel I/O.[7]

```python
# Reading from sharded TFRecord files
filenames = tf.data.Dataset.list_files("data/train-*.tfrecord")
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)
```

## What is the TFRecord format?

TFRecord is TensorFlow's recommended binary serialization format for storing large datasets as a sequence of records. Each record typically holds a serialized `tf.train.Example` protocol buffer, which encodes a flexible map of named features (each a list of bytes, floats, or 64-bit integers).[12] Because the format is a simple length-prefixed stream of binary records, it supports fast sequential reads, optional GZIP or ZLIB compression, and natural sharding into many files that can be read in parallel with `interleave`. `tf.data.TFRecordDataset` reads these files and feeds the raw bytes into the pipeline, where a `map` step calls `tf.io.parse_single_example` (or `tf.io.parse_example` on a batch) to decode each record back into tensors.[12]
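The write, read, and parse steps described above can be sketched as a small round trip. The two-feature schema (`value`, `label`) and the temporary file path are assumptions made for this example, not part of any standard layout:

```python
import os
import tempfile
import tensorflow as tf

def make_example(value, label):
    # Hypothetical schema: one float feature and one int64 feature.
    return tf.train.Example(features=tf.train.Features(feature={
        "value": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

# Write three serialized tf.train.Example records to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        writer.write(make_example(i * 0.5, i).SerializeToString())

# The parsing spec must mirror the schema used when writing.
feature_spec = {
    "value": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = tf.data.TFRecordDataset(path).map(parse)
labels = [int(row["label"]) for row in dataset.as_numpy_iterator()]
print(labels)  # [0, 1, 2]
```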

## What are the main tf.data transformations?

Transformations are functional operations applied to a `Dataset` that return a new `Dataset`. They are composable and lazy, meaning no computation occurs until elements are consumed.[6]

### Element-wise transformations

**`map(map_func, num_parallel_calls=None)`** applies a function to each element. The `num_parallel_calls` parameter enables parallel execution across multiple CPU cores:

```python
def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.resize(image, [224, 224])
    return image, label

dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
```

**`filter(predicate)`** retains only elements for which the predicate returns `True`:

```python
# Keep only samples with label > 5
dataset = dataset.filter(lambda image, label: label > 5)
```

### Batching transformations

**`batch(batch_size, drop_remainder=False)`** groups consecutive elements into batches. Setting `drop_remainder=True` ensures all batches have the same size, which is required for certain static-shape operations:

```python
dataset = dataset.batch(32, drop_remainder=True)
```

**`padded_batch(batch_size, padded_shapes, padding_values=None)`** batches elements that may have different shapes by padding them to a uniform shape. This is common in [NLP](/wiki/natural_language_understanding) tasks where sequences have variable lengths:

```python
# Pad sequences to the longest in each batch
dataset = dataset.padded_batch(32, padded_shapes=([None],))
```
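To make the padding behavior concrete, the following sketch builds four sequences of lengths 1 to 4 and batches them in pairs. It relies on the default behavior in which leaving `padded_shapes` unset pads each batch to its longest element:

```python
import tensorflow as tf

# Elements: [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]
sequences = tf.data.Dataset.range(1, 5, output_type=tf.int32).map(
    lambda n: tf.fill([n], n)
)

# Each batch is padded (with zeros) to the longest sequence it contains.
batches = [b.tolist() for b in sequences.padded_batch(2).as_numpy_iterator()]
print(batches)
# [[[1, 0], [2, 2]], [[3, 3, 3, 0], [4, 4, 4, 4]]]
```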

**`unbatch()`** reverses a batch, splitting each element back into individual elements.

### Ordering and sampling transformations

**`shuffle(buffer_size, seed=None, reshuffle_each_iteration=True)`** maintains a buffer of `buffer_size` elements and randomly samples from it. For a perfectly uniform shuffle, the buffer size should equal or exceed the dataset size. Smaller buffers trade shuffle quality for lower memory usage:

```python
dataset = dataset.shuffle(buffer_size=10000)
```
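Why buffer size limits shuffle quality is easier to see in a simplified model. The pure-Python sketch below imitates buffer-based sampling; it is an illustration of the idea, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by sampling from a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the new one
    while buffer:                    # drain what is left, in random order
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        yield buffer.pop()

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1, seed=0))
print(shuffled)
print(unshuffled)  # buffer_size=1 leaves the order unchanged
```

In this model the element emitted at output position `j` always comes from the first `j + buffer_size` inputs, which is why a small buffer only mixes elements locally.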

**`repeat(count=None)`** repeats the dataset for a specified number of epochs, or indefinitely if `count` is `None`. The ordering of `shuffle`, `batch`, and `repeat` matters:

- `shuffle` then `repeat`: shuffles within each epoch (recommended for most cases)
- `repeat` then `shuffle`: allows epoch-boundary crossover, shuffling across epoch boundaries
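The two orderings can be compared on a tiny dataset. This is a sketch with an arbitrary seed; only the properties stated in the comments are guaranteed, not any particular permutation:

```python
import tensorflow as tf

base = tf.data.Dataset.range(5)

# shuffle -> repeat: each epoch is a complete permutation of the data.
per_epoch = base.shuffle(5, seed=1).repeat(2)
values = [int(v) for v in per_epoch.as_numpy_iterator()]
first_epoch, second_epoch = values[:5], values[5:]
print(first_epoch, second_epoch)

# repeat -> shuffle: the buffer may hold elements from two epochs at once,
# so the first five outputs are not guaranteed to cover every element.
mixed = base.repeat(2).shuffle(5, seed=1)
mixed_values = [int(v) for v in mixed.as_numpy_iterator()]
print(mixed_values)
```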

**`skip(count)`** and **`take(count)`** select subsets of elements by position.
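A common use is a positional train/validation split. The sketch below assumes the source order is stable between traversals (no reshuffling before the split); the 2/8 split is arbitrary:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

val_ds = dataset.take(2)    # first two elements
train_ds = dataset.skip(2)  # everything after them

val_items = [int(v) for v in val_ds.as_numpy_iterator()]
train_items = [int(v) for v in train_ds.as_numpy_iterator()]
print(val_items, train_items)
```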

### Structural transformations

**`flat_map(map_func)`** maps a function over each element and flattens the resulting datasets into a single dataset.
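As a small illustration, each row of a matrix can be expanded into its own dataset and the results flattened into one stream:

```python
import tensorflow as tf

rows = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# map_func must return a Dataset; flat_map concatenates them in order.
flat = rows.flat_map(lambda row: tf.data.Dataset.from_tensor_slices(row))
flat_values = [int(v) for v in flat.as_numpy_iterator()]
print(flat_values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```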

**`interleave(map_func, cycle_length, num_parallel_calls=None)`** applies a function to each element (typically producing a sub-dataset), then interleaves elements from `cycle_length` sub-datasets at a time. With `num_parallel_calls`, reading from multiple files happens in parallel:

```python
files = tf.data.Dataset.list_files("data/*.csv")
dataset = files.interleave(
    lambda f: tf.data.TextLineDataset(f).skip(1),
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)
```

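**`window(size, shift=None, stride=1, drop_remainder=False)`** creates windows of consecutive elements, returning nested datasets. When `shift` is left as `None` it defaults to `size`, so windows do not overlap; a smaller `shift` produces sliding windows. This is useful for [time series](/wiki/time_series_analysis) and sequence modeling: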

```python
range_ds = tf.data.Dataset.range(10)
windows = range_ds.window(5, shift=1, drop_remainder=True)
# Each window is a nested Dataset; flatten with flat_map + batch
windows = windows.flat_map(lambda w: w.batch(5))
```

**`zip(datasets)`** combines elements from multiple datasets into tuples, similar to Python's built-in `zip`.

**`concatenate(dataset)`** appends one dataset to another.
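A short sketch of both combinators on two small ranges:

```python
import tensorflow as tf

a = tf.data.Dataset.range(1, 4)  # 1, 2, 3
b = tf.data.Dataset.range(4, 7)  # 4, 5, 6

zipped = tf.data.Dataset.zip((a, b))  # pairs elements position by position
joined = a.concatenate(b)             # all of a, then all of b

pairs = [(int(x), int(y)) for x, y in zipped.as_numpy_iterator()]
sequence = [int(v) for v in joined.as_numpy_iterator()]
print(pairs)     # [(1, 4), (2, 5), (3, 6)]
print(sequence)  # [1, 2, 3, 4, 5, 6]
```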

### Summary of key transformations

| Transformation | Purpose | Key parameters |
|---------------|---------|----------------|
| `map` | Apply a function to each element | `num_parallel_calls` |
| `filter` | Keep elements matching a predicate | predicate function |
| `batch` | Group elements into fixed-size batches | `batch_size`, `drop_remainder` |
| `padded_batch` | Batch with padding for variable shapes | `padded_shapes`, `padding_values` |
| `shuffle` | Randomize element order | `buffer_size`, `seed` |
| `repeat` | Repeat dataset for multiple epochs | `count` |
| `prefetch` | Overlap data prep with model execution | `buffer_size` |
| `cache` | Cache elements in memory or on disk | `filename` (optional) |
| `interleave` | Interleave elements from sub-datasets | `cycle_length`, `num_parallel_calls` |
| `flat_map` | Map and flatten nested datasets | map function |
| `window` | Create sliding windows | `size`, `shift`, `stride` |
| `unbatch` | Split batched elements | none |
| `skip` / `take` | Select elements by position | `count` |
| `zip` | Combine multiple datasets | datasets |
| `concatenate` | Append one dataset to another | dataset |
| `reduce` | Aggregate all elements | `initial_state`, `reduce_func` |
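
`reduce` is the only entry in the table that returns a value rather than a new dataset. A minimal sketch that sums and counts the elements of a small range:

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4

# The initial state must match the dtype of the running result (int64 here).
total = dataset.reduce(np.int64(0), lambda state, value: state + value)
count = dataset.reduce(np.int64(0), lambda state, _: state + 1)
print(int(total), int(count))  # 10 5
```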

## How does tf.data improve performance?

The tf.data API includes several mechanisms for maximizing throughput and minimizing the time accelerators spend idle waiting for data. The official guidance is summarized as: use `prefetch` to overlap producer and consumer work, parallelize reading with `interleave`, parallelize `map` with `num_parallel_calls`, `cache` data when it fits, vectorize user-defined functions, and order transformations to reduce memory usage.[7]

### Pipelining with `prefetch`

Without prefetching, the CPU and accelerator operate sequentially: the CPU prepares a batch, then sits idle while the accelerator trains on it, and vice versa. The `prefetch` transformation overlaps these two phases so that the CPU begins preparing batch N+1 while the accelerator trains on batch N. As the TensorFlow documentation puts it, prefetching "reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data":[7]

```python
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
```
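The "maximum rather than sum" effect can be checked with a back-of-the-envelope model. The pure-Python sketch below is not tf.data code; `epoch_time` is a hypothetical helper that assumes fixed per-batch costs and a producer that is never blocked by a full buffer:

```python
def epoch_time(num_steps, prep_time, train_time, prefetch):
    """Estimate wall-clock time for num_steps under a simplified timing model."""
    if not prefetch:
        # Strictly sequential: prepare a batch, then train on it.
        return num_steps * (prep_time + train_time)
    batch_ready = 0.0       # time at which the next batch is prepared
    accelerator_free = 0.0  # time at which the accelerator finishes its step
    for _ in range(num_steps):
        batch_ready += prep_time
        start = max(batch_ready, accelerator_free)
        accelerator_free = start + train_time
    return accelerator_free

sequential = epoch_time(100, prep_time=0.01, train_time=0.03, prefetch=False)
overlapped = epoch_time(100, prep_time=0.01, train_time=0.03, prefetch=True)
print(round(sequential, 2), round(overlapped, 2))  # 4.0 3.01
```

With these illustrative numbers the overlapped run costs one preparation step plus 100 training steps, because preparation is hidden behind the slower training phase.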

### Parallel data extraction with `interleave`

When reading from multiple files (especially over network storage), the `interleave` transformation with `num_parallel_calls` reads from several files simultaneously, hiding I/O latency:[7]

```python
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=8,
    num_parallel_calls=tf.data.AUTOTUNE
)
```

### Parallel data transformation with `map`

Setting `num_parallel_calls` in the `map` transformation distributes preprocessing work across multiple CPU threads. This is especially beneficial for compute-heavy operations like image decoding, augmentation, or text tokenization:[7]

```python
dataset = dataset.map(decode_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
```

### Caching

The `cache` transformation stores elements after they have been processed. On the first pass through the data, all transformations before `cache` are executed. On subsequent passes (later epochs), cached data is read directly, skipping the earlier transformations:[7]

```python
# Cache in memory (for datasets that fit)
dataset = dataset.map(expensive_preprocess).cache()

# Cache to disk (for larger datasets)
dataset = dataset.map(expensive_preprocess).cache("/tmp/cache_dir")
```

Placement matters: put `cache` after expensive, deterministic transformations but before randomized augmentations that should vary each epoch.
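
The effect of placement can be seen with a stand-in "augmentation" that adds random noise. This is a sketch; `add_noise` is a hypothetical function used only to make the randomness visible:

```python
import tensorflow as tf

def add_noise(x):
    # Stand-in for a random augmentation: adds a value in [0, 1).
    return tf.cast(x, tf.float32) + tf.random.uniform([])

base = tf.data.Dataset.range(4)

# Random op BEFORE cache: its output is frozen after the first full pass.
frozen = base.map(add_noise).cache()
epoch1 = [float(v) for v in frozen.as_numpy_iterator()]
epoch2 = [float(v) for v in frozen.as_numpy_iterator()]
frozen_repeats = epoch1 == epoch2
print(frozen_repeats)  # True: the same "random" values are replayed

# Random op AFTER cache: the cached inputs are re-augmented on every pass.
fresh = base.cache().map(add_noise)
fresh_epoch = [float(v) for v in fresh.as_numpy_iterator()]
```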

### Vectorizing mapped functions

Applying a function element-by-element with `map` and then batching incurs per-element overhead. Reversing the order (batching first, then mapping) allows the function to operate on entire batches at once, leveraging vectorized operations:[7]

```python
# Slower: map then batch (function called per element)
dataset.map(increment).batch(256)

# Faster: batch then map (function called per batch)
dataset.batch(256).map(increment)
```

Benchmarks in the TensorFlow performance guide show that vectorized mapping can yield a roughly 4 to 5x speedup over per-element mapping for lightweight operations.[7]
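
Reordering is only safe when the function gives the same result on a batch as on individual elements. The sketch below assumes `increment` is a simple element-wise function (as in the snippet above) and checks that both orderings agree:

```python
import tensorflow as tf

def increment(x):
    # Element-wise, so it works unchanged on scalars and on batches.
    return x + 1

dataset = tf.data.Dataset.range(8)

per_element = dataset.map(increment).batch(4)  # 8 function calls
per_batch = dataset.batch(4).map(increment)    # 2 function calls

elementwise_out = [b.tolist() for b in per_element.as_numpy_iterator()]
batched_out = [b.tolist() for b in per_batch.as_numpy_iterator()]
print(elementwise_out == batched_out)  # True
```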

## What is tf.data.AUTOTUNE?

`tf.data.AUTOTUNE` is a special constant that delegates parameter selection to the tf.data runtime. When passed as `buffer_size` to `prefetch` or as `num_parallel_calls` to `map` and `interleave`, the value is tuned dynamically while the pipeline runs rather than fixed in advance by the user.[7] Under the hood, the tf.data autotuner builds an analytical model of the pipeline. The VLDB paper states that "the auto-tuning implementation uses the processing time and the input pipeline structure to build an analytical model that is used to estimate how input pipeline parameters affect end-to-end latency," and describes modeling asynchronous transformations such as `prefetch` with an M/M/1/k queue and choosing parallelism and buffer sizes to minimize expected output latency subject to CPU and RAM budget constraints.[1] The optimizer adjusts these knobs using a gradient descent procedure over the analytical model, removing the need for manual tuning and adapting to different hardware configurations automatically.[1]

The autotuning mechanism periodically measures throughput and latency of each transformation. If the model processes data faster than the pipeline supplies it, the runtime increases parallelism. If the pipeline is ahead of the model, it reduces resource usage. This happens continuously during training without user intervention.[1][7]
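
To give a feel for the queueing view of a prefetch buffer, the sketch below evaluates the textbook M/M/1/k formula for the probability that the queue is empty. It is not tf.data's autotuning code; `buffer_empty_probability` is a hypothetical helper, `rho` is assumed to be the ratio of producer rate to consumer rate, and `k` is the queue capacity:

```python
def buffer_empty_probability(rho, k):
    """P(queue is empty) for an M/M/1/k queue with load rho and capacity k."""
    if rho == 1.0:
        return 1.0 / (k + 1)
    return (1.0 - rho) / (1.0 - rho ** (k + 1))

# Producer twice as fast as the consumer (rho = 2): a larger buffer makes it
# increasingly unlikely that the consumer finds nothing to read.
probabilities = [buffer_empty_probability(2.0, k) for k in (1, 2, 4, 8)]
print([round(p, 4) for p in probabilities])  # [0.3333, 0.1429, 0.0323, 0.002]
```

The diminishing returns visible in these numbers are one reason a tuner would stop growing a buffer once extra capacity no longer reduces waiting.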

### Recommended pipeline pattern

Combining all optimizations, a high-performance pipeline typically looks like:

```python
dataset = (
    tf.data.Dataset.list_files("data/train-*.tfrecord")
    .interleave(
        tf.data.TFRecordDataset,
        cycle_length=8,
        num_parallel_calls=tf.data.AUTOTUNE
    )
    .shuffle(buffer_size=10000)
    .map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)
```

### Performance comparison

The following table illustrates the relative impact of each optimization technique, based on benchmarks from the TensorFlow documentation:[7]

| Pipeline configuration | Relative execution time |
|----------------------|------------------------|
| Naive (sequential, no optimization) | 1.00x (baseline) |
| + `prefetch(AUTOTUNE)` | ~0.97x |
| + parallel `interleave` | ~0.77x |
| + parallel `map` | ~0.67x |
| + `cache` (after first epoch) | ~0.50x |
| All optimizations combined | ~0.48x |

The combined effect of all optimizations can roughly halve pipeline execution time compared to a naive sequential implementation.[7]

## How do you consume a dataset (iterators)?

### TensorFlow 1.x (graph mode)

In TensorFlow 1.x, datasets could not be iterated directly in Python. Instead, users created an explicit `Iterator` object and called `get_next()` to obtain a symbolic tensor, which was then evaluated inside a `tf.Session`:

```python
# TensorFlow 1.x pattern (deprecated)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    while True:
        try:
            value = sess.run(next_element)
        except tf.errors.OutOfRangeError:
            break
```

TensorFlow 1.x provided four types of iterators: one-shot, initializable, reinitializable, and feedable. Each offered different levels of flexibility at the cost of complexity.

### TensorFlow 2.x (eager mode)

With [eager execution](/wiki/eager_execution) enabled by default in TensorFlow 2.x, `tf.data.Dataset` became a standard Python iterable. Elements can be consumed directly in a `for` loop:[9]

```python
# TensorFlow 2.x pattern
for images, labels in dataset:
    predictions = model(images, training=True)
    # ...
```

Alternatively, you can create an explicit iterator using Python's built-in `iter()` and `next()`:

```python
it = iter(dataset)
batch = next(it)
```

### Iterator checkpointing

For long training runs, tf.data supports saving and restoring iterator state through `tf.train.Checkpoint`. This allows training to resume from the exact position in the dataset after an interruption, avoiding re-processing already-seen data:[9]

```python
iterator = iter(dataset)
ckpt = tf.train.Checkpoint(iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, "/tmp/ckpt", max_to_keep=3)

# During training
for step, batch in enumerate(iterator):
    train_step(batch)
    if step % 1000 == 0:
        manager.save()

# After restart
ckpt.restore(manager.latest_checkpoint)
# Iterator resumes from saved position
```

## How does tf.data integrate with Keras?

`tf.data.Dataset` integrates directly with the [Keras](/wiki/keras) training API. Datasets can be passed to `model.fit()`, `model.evaluate()`, and `model.predict()` without manual iteration:[9]

```python
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(32).prefetch(tf.data.AUTOTUNE)

model.fit(train_ds, epochs=10)
```

Keras [preprocessing layers](/wiki/preprocessing) can also be applied as part of the tf.data pipeline. When applied to a `Dataset` with `.map()`, preprocessing runs on the CPU asynchronously, which frees GPU cycles for model computation. Alternatively, preprocessing layers can be included inside the model itself, which causes them to run on the GPU and benefit from hardware acceleration:

```python
# Preprocessing as part of tf.data pipeline (runs on CPU)
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(train_ds.map(lambda x, y: x))
train_ds = train_ds.map(lambda x, y: (norm_layer(x), y))
```

## Integration with TensorFlow Datasets (TFDS)

[TensorFlow Datasets](https://www.tensorflow.org/datasets) (TFDS) is a separate library that provides a catalog of ready-to-use datasets. TFDS is a high-level wrapper around tf.data: calling `tfds.load()` returns a `tf.data.Dataset` object, so all standard transformations and optimizations apply:

```python
import tensorflow_datasets as tfds

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.shuffle(10000).batch(128).prefetch(tf.data.AUTOTUNE)
```

TFDS handles downloading, extracting, and preparing the data. It stores data in TFRecord format by default and supports deterministic ordering through seed-based shuffling.

## How does tf.data handle distributed training?

The tf.data API integrates with `tf.distribute.Strategy` to handle data distribution across multiple GPUs or machines. When a dataset is passed to `model.fit()` for a model built under a strategy scope, or wrapped with `strategy.experimental_distribute_dataset()`, the strategy automatically distributes the data across replicas.[10]

### Automatic sharding

TensorFlow exposes sharding policies through `tf.data.experimental.AutoShardPolicy`, including the following four:

| Policy | Behavior |
|--------|----------|
| `FILE` | Each worker reads a different subset of input files |
| `DATA` | Each worker reads all files but skips to its shard of elements |
| `AUTO` | Attempts FILE sharding first; falls back to DATA sharding |
| `OFF` | No automatic sharding; each worker sees the full dataset |

```python
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = build_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Dataset is automatically sharded across GPUs
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(64).prefetch(tf.data.AUTOTUNE)
model.fit(train_ds, epochs=10)
```
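
A policy can be requested explicitly through `tf.data.Options`. The sketch below also uses `Dataset.shard` to illustrate what element-level sharding amounts to for a single worker; the worker count and index are arbitrary, and in real jobs the strategy applies the sharding itself:

```python
import tensorflow as tf

# Ask for element-level (DATA) sharding instead of relying on AUTO.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = tf.data.Dataset.range(10).with_options(options)

# Manual illustration: worker 1 of 4 keeps every 4th element from index 1.
worker_view = tf.data.Dataset.range(10).shard(num_shards=4, index=1)
worker_items = [int(v) for v in worker_view.as_numpy_iterator()]
print(worker_items)  # [1, 5, 9]
```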

### What is the tf.data service?

For large-scale training jobs, the **tf.data service** disaggregates input data processing from model training. Instead of preprocessing data on the same machines that run accelerators, the tf.data service offloads preprocessing to dedicated CPU workers that can be scaled independently. This architecture, described in the 2023 SoCC paper "tf.data service: A Case for Disaggregating ML Input Data Processing," provides three advantages:[3]

1. **Horizontal scaling**: CPU and RAM resources for data processing can be scaled independently of accelerator hosts, eliminating the fixed hardware ratio constraint. By right-sizing host resources per job, the service reduced training time by 32x and cost by 26x on average for jobs that were previously bottlenecked on data preprocessing.[3]
2. **Data sharing**: Multiple training jobs can share ephemeral preprocessed data results, reducing redundant computation and optimizing CPU usage.[3]
3. **Coordinated reads**: In distributed training, the service coordinates reading so that all workers receive similarly sized batches, avoiding stragglers that slow down synchronous training. This technique reduced training time by 2.2x on average.[3]

The service is activated by adding a `distribute` transformation to the pipeline:

```python
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",
        service="grpc://tf-data-service:5000"
    )
)
```

## How do you handle imbalanced data with tf.data?

The API provides built-in methods for addressing class imbalance, which is common in tasks like fraud detection and medical diagnosis.

**`sample_from_datasets`** draws elements from multiple datasets according to specified weights:

```python
neg_ds = full_ds.filter(lambda x, y: y == 0).repeat()
pos_ds = full_ds.filter(lambda x, y: y == 1).repeat()

balanced_ds = tf.data.Dataset.sample_from_datasets(
    [neg_ds, pos_ds], weights=[0.5, 0.5]
).batch(32)
```

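**`rejection_resample`** resamples elements from a single dataset to match a target class distribution. The resulting dataset yields `(class, element)` pairs, so a follow-up `map` is usually needed to drop the extra class value: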

```python
resampled = dataset.rejection_resample(
    class_func=lambda features, label: label,
    target_dist=[0.5, 0.5]
)
```

## Using `tf.py_function` for custom Python logic

When a transformation requires arbitrary Python code (for example, calling a NumPy or SciPy function), `tf.py_function` wraps the function so it can be used inside `map`. However, because `tf.py_function` executes in Python and cannot be traced into a TensorFlow graph, it acts as a performance barrier and prevents certain graph optimizations:[9]

```python
import scipy.ndimage as ndimage

def random_rotate(image):
    angle = np.random.uniform(-30, 30)
    return ndimage.rotate(image, angle, reshape=False).astype(np.float32)

def augment(image, label):
    image = tf.py_function(random_rotate, [image], tf.float32)
    image.set_shape([28, 28])  # Restore static shape info
    return image, label

dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```

For better performance, prefer using TensorFlow's own ops (such as `tf.image.random_flip_left_right`) instead of `tf.py_function` whenever possible.

## Static optimizations

Beyond runtime autotuning, the tf.data framework applies static graph optimizations to the pipeline before execution. These optimizations are analogous to query optimization in relational databases:[1]

| Optimization | Description |
|-------------|-------------|
| **Map and batch fusion** | Combines consecutive `map` and `batch` operations into a single fused operator, reducing scheduling overhead |
| **Map parallelization** | Automatically parallelizes map operations when safe to do so |
| **Noop elimination** | Removes identity transformations that have no effect |
| **Shuffle and repeat fusion** | Fuses `shuffle` followed by `repeat` into a single operator |
| **Map vectorization** | Converts element-wise maps into batch-level vectorized operations |
| **Filter fusion** | Fuses consecutive `filter` transformations into a single filter, reducing per-stage overhead |
| **Injection of prefetch** | Adds prefetch at the end of the pipeline when not already present |

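Several of these optimizations (for example map and batch fusion, map parallelization, and noop elimination) are enabled by default, while others such as filter fusion are applied only when requested. Users can inspect and control them through `tf.data.Options`: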

```python
options = tf.data.Options()
options.experimental_optimization.map_and_batch_fusion = True
options.experimental_optimization.map_parallelization = True
dataset = dataset.with_options(options)
```

## How does tf.data compare to PyTorch DataLoader?

The two dominant frameworks for building ML data pipelines are tf.data (TensorFlow) and DataLoader ([PyTorch](/wiki/pytorch)). They differ in architecture, execution model, and API design.

| Feature | tf.data (TensorFlow) | DataLoader (PyTorch) |
|---------|---------------------|---------------------|
| **Abstraction** | `tf.data.Dataset` (functional, composable transformations) | `Dataset` + `DataLoader` (class-based, imperative) |
| **Parallelism model** | Thread-based; `num_parallel_calls` on `map`/`interleave` | Process-based; `num_workers` spawns subprocesses |
| **Shuffling** | Buffer-based (`shuffle(buffer_size)`); quality depends on buffer size | Index-based; full shuffle of all indices each epoch |
| **Graph compilation** | Can compile to static graph for optimized execution | Runs in eager Python; relies on Python multiprocessing |
| **Autotuning** | Built-in `AUTOTUNE` for parallelism and buffer sizes | No built-in autotuning; manual parameter setting |
| **Prefetching** | Explicit `prefetch` transformation | `prefetch_factor` parameter on `DataLoader` |
| **Custom data format** | TFRecord recommended for large-scale I/O | No prescribed format; flexible `__getitem__` interface |
| **Memory pinning** | Handled internally | Explicit `pin_memory=True` for faster GPU transfer |

In practice, both systems can reach comparable throughput when properly configured, although the result depends on the workload and hardware. tf.data's functional API and autotuning make it easier to build optimized pipelines without manual experimentation, while PyTorch's DataLoader offers more flexibility through its class-based interface and direct Python integration.
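
The shuffling row of the table is the difference most likely to affect training behavior. The sketch below is a plain-Python illustration of the two strategies, not either library's implementation; `buffer_shuffle` and `index_shuffle` are hypothetical names chosen for this example, and `buffer_size` is assumed to be at least 1.

```python
import random

def buffer_shuffle(items, buffer_size, seed=0):
    """Streaming shuffle: sample from a fixed-size buffer, refilling from the input."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill phase: nothing is emitted yet
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]           # emit a randomly chosen buffered element
        buffer[slot] = item          # replace it with the incoming element
    rng.shuffle(buffer)              # input exhausted: drain the rest in random order
    yield from buffer

def index_shuffle(items, seed=0):
    """Full shuffle: permute all indices up front (requires random access)."""
    items = list(items)
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    return [items[i] for i in order]

data = list(range(20))
small = list(buffer_shuffle(data, buffer_size=4))
full = index_shuffle(data)
```

With `buffer_size=4`, the element emitted at output position `k` always comes from the first `k + 4` inputs, so data that arrives sorted (for example, by class) stays largely sorted. A buffer at least as large as the dataset behaves like the index-based full shuffle, at the cost of holding every element in memory.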

## Deterministic vs. non-deterministic execution

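By default, tf.data produces elements in a deterministic order even when transformations run in parallel (`num_parallel_calls > 1`): `tf.data.Options().deterministic` defaults to `True`, so results that threads finish out of order are held back and emitted in sequence. When element order does not matter, this guarantee can be relaxed to improve throughput:[9]

```python
options = tf.data.Options()
options.deterministic = False  # emit elements as soon as they are ready
dataset = dataset.with_options(options)
```

Relaxing determinism can raise throughput because faster threads no longer wait for slower ones to preserve ordering. The same choice can be made for a single transformation through the `deterministic` argument of `map` and `interleave`. Reproducible training should keep the default and also fix the seeds of any random transformations such as `shuffle`. The trade-off between reproducibility and performance depends on the use case.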
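
The difference between order-preserving and order-relaxed parallel execution can be illustrated outside TensorFlow. The following is an analogy built on Python's standard `concurrent.futures` module, not a description of tf.data internals; `slow_square` and both `parallel_map_*` helpers are hypothetical names for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(x):
    # Hypothetical per-element work: even inputs are deliberately slow "stragglers".
    time.sleep(0.05 if x % 2 == 0 else 0.0)
    return x * x

def parallel_map_ordered(fn, items, workers=4):
    # Results are returned in input order; finished results wait behind slow ones.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def parallel_map_unordered(fn, items, workers=4):
    # Results are returned in completion order, which can vary from run to run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fn, item) for item in items]
        return [future.result() for future in as_completed(futures)]

items = list(range(8))
ordered = parallel_map_ordered(slow_square, items)
unordered = parallel_map_unordered(slow_square, items)
```

Both calls compute the same set of values; only the ordered variant guarantees that position `i` of the output corresponds to position `i` of the input, which is the property a reproducible training run relies on.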

## Snapshot and save/load

tf.data offers two ways to persist the output of a pipeline to disk: the `snapshot` transformation and the `save`/`load` methods. (The older `tf.data.experimental.snapshot`, `tf.data.experimental.save`, and `tf.data.experimental.load` entry points are deprecated in favor of the corresponding `tf.data.Dataset` methods.) Persisting is useful when preprocessing is expensive and the same preprocessed data is needed across multiple training runs:[9]

```python
# Save a preprocessed dataset
dataset.save("/path/to/saved_dataset")

# Load it back
loaded_ds = tf.data.Dataset.load("/path/to/saved_dataset")
```

Saved datasets are stored in multiple shards. The on-disk layout is an implementation detail, so a saved dataset should be read back only through `load`, which is guaranteed to be backward compatible: datasets saved with older versions of TensorFlow can be loaded with newer versions.

## Common pipeline patterns

### Image classification pipeline

A standard image pipeline reads filenames, decodes and resizes each image, shuffles, batches, and prefetches, a structure also taught in academic course material on building data pipelines.[11]

```python
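def extract_label(filename):
    # Hypothetical helper: assumes a layout of images/<class_name>/<file>.jpg
    # and returns the class directory name as a string label.
    return tf.strings.split(filename, "/")[-2]

def parse_image(filename, label):
    image = tf.io.read_file(filename)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])  # resize returns float32
    image = image / 255.0  # scale pixel values to [0, 1]
    return image, label

# shuffle=False keeps the file order fixed so the zipped labels stay aligned
# with their files; list_files otherwise reshuffles on every iteration.
files_ds = tf.data.Dataset.list_files("images/*/*.jpg", shuffle=False)
labels_ds = files_ds.map(extract_label)

train_ds = (
    tf.data.Dataset.zip((files_ds, labels_ds))
    .shuffle(10000)  # ideally >= number of files, since they are listed in sorted order
    .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)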
```

### Text classification pipeline

```python
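def parse_line(line):
    # Assumes each row has the form "<integer label>\t<text>".
    label, text = tf.io.decode_csv(line, [[0], [""]], field_delim="\t")
    tokens = tf.strings.split(text)  # whitespace tokens, variable length
    return tokens, label

train_ds = (
    tf.data.TextLineDataset(["train.tsv"])
    .skip(1)  # Skip header
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10000)
    # Pad the token dimension to the longest example in each batch.
    .padded_batch(64, padded_shapes=([None], []))
    .prefetch(tf.data.AUTOTUNE)
)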
```

### Time series windowing pipeline

```python
def make_window_dataset(series, window_size, shift, batch_size):
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + 1, shift=shift, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.map(lambda w: (w[:-1], w[-1]))
    ds = ds.shuffle(1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds
```
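
To make the windowing logic concrete, the following plain-Python reference computes the `(inputs, target)` pairs that the `window`, `flat_map`, and `map` steps above are expected to produce before shuffling and batching. `reference_windows` is a hypothetical helper written for this illustration and works on an ordinary list rather than a `tf.data.Dataset`.

```python
def reference_windows(series, window_size, shift):
    """List the (inputs, target) pairs for full windows of window_size + 1 values."""
    span = window_size + 1  # window_size inputs plus one target value
    pairs = []
    # Only start positions that leave room for a full window are kept,
    # mirroring drop_remainder=True.
    for start in range(0, len(series) - span + 1, shift):
        window = series[start:start + span]
        pairs.append((window[:-1], window[-1]))
    return pairs

series = [0, 1, 2, 3, 4, 5]
pairs_shift_1 = reference_windows(series, window_size=3, shift=1)
pairs_shift_2 = reference_windows(series, window_size=3, shift=2)
```

With `shift=1`, consecutive windows overlap in all but one value, which yields the most training examples; a larger `shift` reduces both the overlap and the number of examples.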

## How do you debug and profile a tf.data pipeline?

TensorFlow provides tools for analyzing tf.data pipeline performance:[8]

- **TensorFlow Profiler**: The `tf.data` section of the profiler shows a timeline of pipeline execution, revealing idle gaps and bottlenecks.
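- **`tf.data.experimental.StatsAggregator`**: Collects statistics about the pipeline (throughput, latency per transformation). This is an older API, so its availability should be checked against the installed TensorFlow version.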
- **Plumber**: An external tool developed by researchers at Carnegie Mellon University, ETH Zurich, and Google that uses operational analysis to automatically diagnose and fix pipeline bottlenecks. In experiments, Plumber achieved speedups of up to 47x for misconfigured pipelines.[4]

Common symptoms and their causes:

| Symptom | Likely cause | Solution |
|---------|-------------|----------|
| GPU utilization low, CPU high | Data preprocessing is too slow | Add `num_parallel_calls=AUTOTUNE` to `map`; consider `cache` |
| GPU utilization low, CPU low | I/O bottleneck or pipeline stall | Use parallel `interleave`; increase `prefetch` buffer |
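| Every epoch is equally slow even though preprocessing is identical each time | Expensive work is repeated every epoch (no caching) | Add `cache()` after expensive transformations; only the first epoch then pays the full cost |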
| Memory growing over time | Large shuffle buffer or unbounded cache | Reduce `buffer_size`; cache to disk instead of memory |
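
A quick first check, before reaching for a profiler, is to time the input pipeline on its own with no model attached. The helper below is a minimal sketch that accepts any Python iterable, which includes an eagerly iterated `tf.data.Dataset`; `measure_throughput` and its parameters are hypothetical names for this example.

```python
import time

def measure_throughput(iterable, max_elements=None):
    """Iterate without doing any work per element; return (count, elements per second)."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
        if max_elements is not None and count >= max_elements:
            break
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate

# Stand-in input; in practice the argument would be the dataset under test.
count, rate = measure_throughput(range(1000))
limited_count, _ = measure_throughput(range(1000), max_elements=10)
```

If the measured rate is lower than the rate at which the model consumes batches, the input pipeline is the bottleneck and the remedies in the table apply; if it is much higher, the slowdown lies elsewhere.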

## Limitations and considerations

- **Python generator overhead**: `from_generator` and `tf.py_function` run in Python and cannot be compiled into TensorFlow graphs, limiting parallelism and throughput.
- **Shuffle buffer memory**: A `shuffle(buffer_size)` with a large buffer consumes memory proportional to the buffer size times the element size. For very large datasets, a full shuffle may not be feasible in memory.
- **Static graph restrictions**: Some transformations (such as those using `tf.py_function`) prevent graph optimization passes from being applied.
- **Debugging complexity**: Because pipelines are lazy and potentially parallel, errors may surface at consumption time rather than at construction time, making debugging harder.
- **Deprecations**: Several features in `tf.data.experimental` have been deprecated or moved. Users should consult the official API documentation for current recommendations.[9]
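
The shuffle-buffer point can be checked with simple arithmetic. The sketch below estimates the raw tensor memory held by a shuffle buffer, ignoring per-element overhead; `shuffle_buffer_bytes` is a hypothetical helper, and the 100-byte filename length is an assumption made for illustration.

```python
from math import prod

def shuffle_buffer_bytes(buffer_size, element_shape, bytes_per_value):
    """Approximate buffer memory: elements held x values per element x bytes per value."""
    return buffer_size * prod(element_shape) * bytes_per_value

# 10,000 decoded 224x224 RGB images stored as float32 (4 bytes per value).
decoded_images = shuffle_buffer_bytes(10_000, (224, 224, 3), 4)

# 10,000 filenames, assumed to be about 100 bytes each.
filenames = shuffle_buffer_bytes(10_000, (100,), 1)

print(f"decoded images: {decoded_images / 1e9:.2f} GB")
print(f"filenames:      {filenames / 1e6:.2f} MB")
```

The gap of several orders of magnitude is why pipelines such as the image classification example shuffle filenames before decoding rather than shuffling decoded images.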

## Related research

The tf.data system has influenced and been the subject of several research projects:

- **cedar** (Zhao et al., 2024): A Python-native framework that applies context-aware optimizations (offloading, caching, fusion, reordering) to input pipelines. Across eight benchmark pipelines, the authors report 2.49x, 1.87x, 2.18x, and 2.74x higher performance than tf.data, tf.data service, Ray Data, and PyTorch DataLoader, respectively.[5]
- **Plumber** (Kuchnik et al., 2022): Uses operational analysis modeling to automatically tune parallelism, prefetching, and caching under resource constraints.[4]
- **tf.data service** (Audibert, Chen, Graur, Klimovic, Simsa, and Thekkath, 2023): Extends tf.data with disaggregated, horizontally scalable preprocessing workers.[3]

## Frequently asked questions

### Is tf.data part of TensorFlow or a separate package?

tf.data is a core module of TensorFlow, available as `tf.data` after `import tensorflow as tf`. It does not require a separate installation. The related TensorFlow Datasets (TFDS) catalog is a distinct package (`tensorflow_datasets`) that returns `tf.data.Dataset` objects.[6][9]

### What is the difference between tf.data and TensorFlow Datasets (TFDS)?

tf.data is the low-level input pipeline API for building and optimizing datasets from your own data. TFDS is a higher-level library that downloads and prepares a catalog of ready-to-use public datasets and hands them back as `tf.data.Dataset` objects, so all tf.data transformations and optimizations still apply.

### Where should prefetch go in a tf.data pipeline?

`prefetch` is almost always the last transformation in the pipeline, typically with `buffer_size=tf.data.AUTOTUNE`, so that the CPU can prepare the next batch while the accelerator trains on the current one. The TensorFlow guide describes this as reducing step time to the maximum, rather than the sum, of the data and training times.[7]
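
The "maximum rather than sum" claim can be worked through with an idealized two-stage model in which preparing a batch takes `data_time` and a training step takes `train_time`, both constant. The function names and the timings below are assumptions for illustration, not measurements.

```python
def epoch_time_sequential(num_steps, data_time, train_time):
    # Without prefetch, each step first waits for its batch and then trains on it.
    return num_steps * (data_time + train_time)

def epoch_time_prefetched(num_steps, data_time, train_time):
    # With prefetch, batch k+1 is prepared while step k trains, so only the
    # slower stage sets the pace; the first batch and last step cannot overlap.
    if num_steps == 0:
        return 0.0
    return data_time + (num_steps - 1) * max(data_time, train_time) + train_time

steps, data_time, train_time = 100, 0.02, 0.03  # seconds, assumed values
sequential = epoch_time_sequential(steps, data_time, train_time)
prefetched = epoch_time_prefetched(steps, data_time, train_time)
print(f"sequential: {sequential:.2f}s, prefetched: {prefetched:.2f}s")
```

In this model the per-step cost falls from `data_time + train_time` to `max(data_time, train_time)`, so prefetching hides the input cost entirely when data preparation is faster than training and leaves the pipeline input-bound otherwise.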

### Does tf.data work with PyTorch or JAX?

tf.data is a TensorFlow library, but because eager iteration yields tensors that convert directly to NumPy arrays (for example through `Dataset.as_numpy_iterator()`), the pipeline output can be fed into other frameworks. Projects in the [JAX](/wiki/jax) ecosystem, for example, commonly use tf.data (or TFDS) for input pipelines even when the model itself runs in JAX.[6]

## See also

- [TensorFlow](/wiki/tensorflow)
- [Keras](/wiki/keras)
- [PyTorch](/wiki/pytorch)
- [Data augmentation](/wiki/data_augmentation)
- [Batch size](/wiki/batch_size)
- [GPU computing](/wiki/gpu_computing)
- [TPU](/wiki/tpu)
- [Data parallelism](/wiki/data_parallelism)
- [Preprocessing](/wiki/preprocessing)
- [Eager execution](/wiki/eager_execution)

## References

1. Murray, D.G., Simsa, J., Klimovic, A., and Indyk, I. (2021). "tf.data: A Machine Learning Data Processing Framework." *Proceedings of the VLDB Endowment*, 14(12), 2945-2958. https://arxiv.org/abs/2101.12127

2. Abadi, M., Barham, P., Chen, J., et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*, 265-283. https://arxiv.org/abs/1605.08695

3. Audibert, A., Chen, Y., Graur, D., Klimovic, A., Simsa, J., and Thekkath, C.A. (2023). "tf.data service: A Case for Disaggregating ML Input Data Processing." *Proceedings of the 2023 ACM Symposium on Cloud Computing (SoCC)*. https://arxiv.org/abs/2210.14826

4. Kuchnik, M., Klimovic, A., Simsa, J., Smith, V., and Amvrosiadis, G. (2022). "Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines." *Proceedings of Machine Learning and Systems (MLSys)*. https://arxiv.org/abs/2111.04131

5. Zhao, M., et al. (2024). "cedar: Optimized and Unified Machine Learning Input Data Pipelines." *Proceedings of the VLDB Endowment*, 18. https://arxiv.org/abs/2401.08895

6. TensorFlow Authors. "tf.data: Build TensorFlow input pipelines." TensorFlow Core Guide. https://www.tensorflow.org/guide/data

7. TensorFlow Authors. "Better performance with the tf.data API." TensorFlow Core Guide. https://www.tensorflow.org/guide/data_performance

8. TensorFlow Authors. "Analyze tf.data performance with the TF Profiler." TensorFlow Core Guide. https://www.tensorflow.org/guide/data_performance_analysis

9. TensorFlow Authors. "tf.data.Dataset API Reference." TensorFlow v2.16.1 Documentation. https://www.tensorflow.org/api_docs/python/tf/data/Dataset

10. TensorFlow Authors. "Distributed training with TensorFlow." TensorFlow Core Guide. https://www.tensorflow.org/guide/distributed_training

11. Stanford CS230. "Building a data pipeline." CS230 Deep Learning Course Blog. https://cs230.stanford.edu/blog/datapipeline/

12. TensorFlow Authors. "TFRecord and tf.train.Example." TensorFlow Core Tutorial. https://www.tensorflow.org/tutorials/load_data/tfrecord

13. Pinto, C., et al. (2024). "Efficient Tabular Data Preprocessing of ML Pipelines." arXiv preprint. https://arxiv.org/abs/2409.14912