See also: Machine learning terms
tf.train.Example (commonly written as tf.Example) is a Protocol Buffers message type that TensorFlow uses as its standard on-disk representation for a single training or inference example. The message is a thin wrapper around a flexible map from string feature names to typed value lists, and it is the canonical record body inside a TFRecord file. Because the format is defined by a published .proto schema, every TensorFlow language binding (Python, C++, Java, Go) can read and write the same files without any framework-specific glue.
The schema lives at tensorflow/core/example/example.proto and tensorflow/core/example/feature.proto in the open source TensorFlow repository. The same protos are reused by higher level systems such as TFX, TensorFlow Data Validation, TensorFlow Transform, and Vertex AI training jobs, which is why tf.Example shows up well beyond the core training loop.
tf.Example was introduced alongside TensorFlow itself in 2015 as the standard payload for the TFRecord container. The design goal was a single, compact, language neutral record format that could carry mixed numeric and binary fields, support batching from disk to GPU at high throughput, and survive across versions of TensorFlow without breaking older datasets. Choosing Protobuf for the wire format gave the format a stable schema and forward and backward compatible parsing, both of which matter when datasets sit on disk for years and outlive several model architectures.
The format is intentionally narrow. It does not try to be a relational database, a columnar analytics format, or a transport protocol. Its job is to describe one training example as a bag of typed feature lists, and let the rest of the tf.data pipeline handle shuffling, batching, and prefetching.
The schema is short enough to read end to end. It defines three primitive list types, a oneof Feature wrapper, a Features map, and the top level Example and SequenceExample messages.
```proto
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true, jstype = JS_STRING]; }

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}

message Features {
  map<string, Feature> feature = 1;
}

message Example {
  Features features = 1;
}

message SequenceExample {
  Features context = 1;
  FeatureLists feature_lists = 2;
}

message FeatureList { repeated Feature feature = 1; }
message FeatureLists { map<string, FeatureList> feature_list = 1; }
```
A few details in this schema matter in practice. Float and Int64 lists use packed = true, which means repeated numeric values are written as a single length-prefixed block instead of one tag per value. This is what makes long numeric features (image pixels, embeddings, audio frames) cheap to encode. The jstype = JS_STRING annotation on Int64List is a JavaScript hint: it tells protobuf-js to surface 64-bit ints as strings so they survive the JS Number type, which only handles 53-bit integers safely.
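The size effect of packed encoding is easy to check directly. The sketch below (a minimal example assuming TensorFlow 2.x; the byte counts are approximate) serializes one Example holding a 1,000-element float feature and prints its size, which lands close to the raw 4 bytes per float32 rather than one tag byte per value:

```python
import tensorflow as tf

# 1,000 float32 values stored in a single packed FloatList.
feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=[0.5] * 1000))
example = tf.train.Example(
    features=tf.train.Features(feature={'embedding': feature}))

# Roughly 4 * 1000 payload bytes plus ~20 bytes of tags and lengths.
print(len(example.SerializeToString()))
```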
Every field inside a tf.Example is a Feature, and every Feature is exactly one of three list kinds. The choice of list determines what Python and TensorFlow types are accepted at write time.
| Feature kind | Proto field | Element type | Common uses |
|---|---|---|---|
| BytesList | bytes_list | bytes (raw byte string) | Encoded JPEG or PNG image bytes, UTF-8 strings, serialized tensors via tf.io.serialize_tensor, tokenized sequences as serialized arrays |
| FloatList | float_list | float32 | Continuous features, embedding vectors, audio samples, regression targets, normalized image tensors written as floats |
| Int64List | int64_list | int64 | Class labels, token IDs, timestamps, booleans (encoded as 0 or 1), categorical IDs, image dimensions |
There is no native float64, int32, or boolean type. TensorFlow simply coerces those into the closest list at serialization time: bool becomes int64, int32 widens to int64, and float64 narrows to float32. If precision matters, a value can be stored as a serialized tensor inside a BytesList instead.
A Feature can also be empty (a list of length zero), which is how missing values are represented. Parsers can either fail loudly or fill with a default value depending on how the feature spec is configured.
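Both behaviours are visible from Python. A small sketch (assuming TF 2.x) shows a bool landing in an Int64List as 1, a float64 losing precision inside a FloatList, and an empty list standing in for a missing value:

```python
import numpy as np
import tensorflow as tf

# bool is an int subclass in Python, so it is stored as int64 0 or 1.
f_bool = tf.train.Feature(int64_list=tf.train.Int64List(value=[True]))
print(f_bool.int64_list.value)   # [1]

# float64 narrows to float32 inside a FloatList.
f_f64 = tf.train.Feature(float_list=tf.train.FloatList(value=[np.float64(0.1)]))
print(f_f64.float_list.value)    # [0.10000000149011612]

# A missing value is simply an empty list of the right kind.
f_missing = tf.train.Feature(float_list=tf.train.FloatList(value=[]))
print(len(f_missing.float_list.value))  # 0
```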
The sibling message tf.train.SequenceExample exists because some training data has a clean split between non-time-varying context and a time-ordered sequence. Speech recognition is the canonical case: the speaker ID and sample rate are context, the per-frame audio features are a sequence.
| Aspect | tf.train.Example | tf.train.SequenceExample |
|---|---|---|
| Top-level fields | features only | context plus feature_lists |
| Best for | Independent records with a flat feature schema | Records with both static fields and one or more variable-length sequences |
| Sequence support | Variable-length lists are allowed but every feature is treated as flat | Each FeatureList is a list of Feature values, one per timestep |
| Parsing function | tf.io.parse_single_example, tf.io.parse_example | tf.io.parse_single_sequence_example, tf.io.parse_sequence_example |
| Typical workloads | Image classification, tabular data, single-token labels, ranking features | Audio frames, video frames, time series, token-level labels for sequence tagging |
In practice many teams pick Example even for sequence data, serializing the sequence into a single BytesList field and reshaping after parsing. SequenceExample models the data more faithfully but is heavier to construct, as shown in the sketch below.
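A minimal SequenceExample for a token-labelled record might look like this (a sketch assuming TF 2.x; the speaker_id and token_ids field names and values are illustrative):

```python
import tensorflow as tf

# Context holds static fields; each FeatureList holds one Feature per timestep.
context = tf.train.Features(feature={
    'speaker_id': tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
})
token_ids = tf.train.FeatureList(feature=[
    tf.train.Feature(int64_list=tf.train.Int64List(value=[t]))
    for t in (101, 2023, 2003, 102)
])
serialized = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={'token_ids': token_ids}),
).SerializeToString()

# Parsing returns one dict of context tensors and one of sequence tensors.
ctx, seqs = tf.io.parse_single_sequence_example(
    serialized,
    context_features={'speaker_id': tf.io.FixedLenFeature([], tf.int64)},
    sequence_features={'token_ids': tf.io.FixedLenSequenceFeature([], tf.int64)},
)
print(ctx['speaker_id'].numpy(), seqs['token_ids'].numpy())
```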
tf.Example messages are almost always stored inside TFRecord files. TFRecord is a simple binary container of length-prefixed records with CRC checksums, and a TFRecord file is just one tf.Example after another, optionally compressed with gzip or zlib at the file level.
The on-disk layout of a single record is fixed:
| Field | Size | Purpose |
|---|---|---|
| length | uint64 (little endian) | Number of bytes in the data payload |
| masked_crc32_of_length | uint32 | CRC-32C of the length field, masked |
| data | length bytes | The serialized tf.Example payload |
| masked_crc32_of_data | uint32 | CRC-32C of the data, masked |
The CRC-32C variant uses the Castagnoli polynomial, the same one that ext4, iSCSI, and SCTP use. The masking step is masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8, computed modulo 2^32, which guards against the known weakness of computing a CRC over data that itself embeds CRC values. Records are concatenated end to end with no global header, no index, and no offset table, which is why TFRecord files are streamed sequentially rather than indexed by record number.
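The framing is simple enough to read back with nothing but struct. The sketch below (assuming the third-party crc32c package from PyPI, which is not part of TensorFlow) walks a TFRecord file record by record and verifies each checksum:

```python
import struct
import crc32c  # assumption: pip install crc32c (not part of TensorFlow)

def masked_crc(data: bytes) -> int:
    # masked_crc = rotate_right_by_15(crc) + 0xa282ead8, modulo 2**32.
    crc = crc32c.crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF

def read_records(path):
    with open(path, 'rb') as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 masked length CRC
            if not header:
                return
            length, length_crc = struct.unpack('<QI', header)
            assert length_crc == masked_crc(header[:8]), 'corrupt length'
            data = f.read(length)
            (data_crc,) = struct.unpack('<I', f.read(4))
            assert data_crc == masked_crc(data), 'corrupt record'
            yield data
```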
There is no requirement that the payload be a tf.Example. Any byte string is a legal record body, including raw text lines, JSON blobs, or serialized tensors. tf.Example is just the convention that comes with batteries-included parsing.
The Python API splits the work across three layers: per-value helper functions that wrap a Python value in a Feature, a dictionary that gathers Features under string keys, and the top-level Example wrapper that gets serialized.
```python
import tensorflow as tf

def _bytes_feature(value):
    # Accept an eager tensor as well as plain Python bytes.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_image_example(image_bytes, label, height, width):
    feature = {
        'image/encoded': _bytes_feature(image_bytes),
        'image/format': _bytes_feature(b'jpeg'),
        'image/height': _int64_feature(height),
        'image/width': _int64_feature(width),
        'label': _int64_feature(label),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# dataset yields (path, label) pairs; get_dims stands in for whatever
# returns the image height and width.
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for image_path, label in dataset:
        image_bytes = tf.io.read_file(image_path).numpy()
        h, w = get_dims(image_path)
        writer.write(serialize_image_example(image_bytes, label, h, w))
```
This is the standard pattern for image classification datasets. The image is kept as raw JPEG bytes inside a BytesList, which is far smaller than storing decoded pixels and lets the GPU side of the pipeline call tf.io.decode_jpeg after parsing. Field names use a category/subfield convention borrowed from the TF-Slim and TensorFlow Object Detection codebases.
Reading TFRecord files goes through tf.data.TFRecordDataset, and parsing goes through tf.io.parse_single_example (one record at a time) or tf.io.parse_example (a batched tensor of serialized records). Both take a feature description dictionary that names every field the parser should extract, along with its dtype and shape.
```python
feature_description = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'image/format': tf.io.FixedLenFeature([], tf.string, default_value='jpeg'),
    'image/height': tf.io.FixedLenFeature([], tf.int64),
    'image/width': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def _parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_description)
    image = tf.io.decode_jpeg(parsed['image/encoded'], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed['label']

ds = (tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
      .map(_parse, num_parallel_calls=tf.data.AUTOTUNE)
      .shuffle(10_000)
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))
```
The two main feature spec classes do different jobs. FixedLenFeature(shape, dtype) returns dense tensors and is used when every record has the same shape. VarLenFeature(dtype) returns a tf.SparseTensor and is used when records have ragged or variable-length fields, after which tf.sparse.to_dense produces a dense view. A third helper, FixedLenSequenceFeature, handles variable length sequences inside SequenceExample records.
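The difference is easiest to see with a variable-length field. A short sketch of the sparse path (assuming TF 2.x; the tags field is illustrative):

```python
import tensorflow as tf

serialized = tf.train.Example(features=tf.train.Features(feature={
    'tags': tf.train.Feature(int64_list=tf.train.Int64List(value=[3, 14, 15])),
})).SerializeToString()

# VarLenFeature yields a tf.SparseTensor; densify it explicitly.
parsed = tf.io.parse_single_example(
    serialized, {'tags': tf.io.VarLenFeature(tf.int64)})
dense = tf.sparse.to_dense(parsed['tags'], default_value=0)
print(dense.numpy())  # [ 3 14 15]
```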
For quick inspection without a full parsing pipeline, the legacy tf.io.tf_record_iterator has been superseded by tf.data.TFRecordDataset(...).take(n) combined with tf.train.Example.FromString, which walks a file record by record and exposes each record's features as Python objects.
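A minimal inspection loop along those lines (a sketch, assuming TF 2.x and a local train.tfrecord file):

```python
import tensorflow as tf

# Dump the feature keys of the first five records.
for raw in tf.data.TFRecordDataset('train.tfrecord').take(5):
    example = tf.train.Example.FromString(raw.numpy())
    print(sorted(example.features.feature.keys()))
```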
The combination of Protobuf as the record schema and TFRecord as the container gives the format several properties that matter for large-scale training.
| Property | What it buys |
|---|---|
| Compact binary encoding | Numeric features are packed; bytes are stored verbatim; total file size is typically smaller than the equivalent CSV or JSONL by a wide margin |
| Streaming friendly | Sequential record layout means readers can pull records as fast as the disk or network can serve them, with no need to seek |
| Parallel I/O | TFRecordDataset(filenames, num_parallel_reads=N) splits work across files automatically; tf.data.experimental.parallel_interleave was the older API |
| Schema flexibility | Adding a new feature to a dataset does not break old readers, because they simply ignore unknown keys |
| TPU friendly | The combination of TFRecord on Google Cloud Storage and the tf.data input pipeline is the recommended path for Cloud TPU training |
| Language neutral | The same files can be parsed from C++, Java, Go, JavaScript, or any other language with a Protobuf binding |
Google uses TFRecord and tf.Example as the default training data format inside many of its production pipelines, including the public TensorFlow Datasets catalogue, which ships every dataset as sharded TFRecord files of tf.Example records.
tf.Example is also a product of its era and ecosystem, and the tradeoffs show.
The format is row-oriented. Reading just one column out of a 200-feature record means the parser still touches every byte of every record, because there is no column index. For analytic workloads ("give me the mean of feature X across the dataset") this is much slower than a columnar format like Parquet or Apache Arrow.
It is not human readable. There is no equivalent of head -n 5 train.tfrecord that produces something useful. Quick inspection requires either a small Python snippet or a tool like tfrecord-viewer.
There is no enforced schema. The Features map is map<string, Feature>, which means any record can carry any keys with any types. Two records inside the same file can disagree on which keys exist, and the parser only finds out at read time. Tools like TensorFlow Data Validation exist precisely to add a schema layer on top.
It is TensorFlow-shaped. The format is technically open, but the only mature reader and writer ecosystem lives inside the TensorFlow project. PyTorch has community packages such as tfrecord and webdataset.TFRecord, but they are second-class citizens compared to native PyTorch formats.
Finally, for very small datasets the per-file overhead and shard tuning rules (Google recommends at least 100 MB per shard, with a total shard count of roughly 10 times the number of input hosts) can be more friction than benefit. For datasets that fit in memory, a NumPy file or a Pandas DataFrame is usually simpler.
tf.Example sits in a crowded design space. Each peer format makes different tradeoffs between row vs column orientation, streaming vs random access, and framework neutrality.
| Format | Layout | Schema | Primary ecosystem | Notes |
|---|---|---|---|---|
| TFRecord with tf.Example | Row-oriented binary records, length-prefixed | Implicit, per-record | TensorFlow, TFX | Native to tf.data and TPU pipelines; weak random access |
| Parquet | Columnar with row groups | Explicit, file level | Spark, Pandas, Polars, DuckDB, Hugging Face datasets | Best in class for analytic queries and partial-column reads |
| Apache Arrow and Feather | Columnar, in-memory and on-disk | Explicit | Polars, modern Spark, Hugging Face datasets | Designed for zero-copy IPC between processes |
| JSONL | One JSON object per line, text | Implicit | Hugging Face Hub uploads, LLM fine-tuning datasets | Human readable; slow to parse; no efficient binary fields |
| CSV | Comma separated text | None | Universal | Easy to inspect; no nested data; no native types beyond string |
| HDF5 | Hierarchical binary | Explicit | Scientific computing, older Keras model.save() | Great for multi-dimensional arrays; complex API |
| WebDataset | Tar files of per-sample shards | Implicit, file extensions act as keys | PyTorch, JAX, TensorFlow | Pure Python tooling; bit-identical to source files; popular in CV training |
| safetensors | Flat tensor blob with header | Explicit per file | PyTorch, JAX, Hugging Face | Stores model weights only, not training data |
The practical pattern in 2026 looks roughly like this. TensorFlow training stacks (especially anything that runs on TPUs or originated from Google) keep using tf.Example. Modern PyTorch and JAX training stacks tend to use WebDataset for image and video, JSONL or Parquet for text and LLM fine-tuning, and Arrow as the in-memory exchange format. The Hugging Face datasets library defaults to Arrow on disk and Parquet on the Hub, which is why most public LLM datasets you download are Parquet rather than TFRecord.
A few utilities outside the core TensorFlow API are worth knowing about when working with tf.Example records.
| Tool | Purpose |
|---|---|
| tf.io.TFRecordWriter and tf.data.TFRecordDataset | The canonical write and read APIs inside TensorFlow |
| tf.io.tf_record_iterator | Legacy iterator for one-off inspection (deprecated; use TFRecordDataset plus Example.FromString) |
| Apache Beam and Dataflow | Large scale TFRecord generation; Beam ships a native WriteToTFRecord PTransform |
| tfrecord-viewer | A small Flask app that opens a TFRecord file in the browser and renders image features as thumbnails |
| TensorFlow Data Validation | Generates a schema, statistics, and anomaly reports from a TFRecord dataset of tf.Example |
| TensorFlow Datasets (TFDS) | The official catalogue of public datasets; every dataset is downloaded as sharded TFRecords of tf.Example |
| PyTorch tfrecord package | Third-party library that lets PyTorch DataLoaders consume TFRecord files without TensorFlow installed |
For large pipelines the typical setup is Apache Beam on Dataflow generating sharded TFRecord output to Google Cloud Storage, which is then consumed by a tf.data input pipeline on a TPU pod for training.
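A condensed version of that generation stage might look like the sketch below (assuming Apache Beam's built-in beam.io.WriteToTFRecord transform; the input rows, feature names, and output prefix are illustrative):

```python
import apache_beam as beam
import tensorflow as tf

def make_example(row):
    # row is an illustrative (value, label) pair.
    value, label = row
    example = tf.train.Example(features=tf.train.Features(feature={
        'value': tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create([(0.5, 1), (1.5, 0)])
     | 'ToExample' >> beam.Map(make_example)
     | 'Write' >> beam.io.WriteToTFRecord(
           'output/train', file_name_suffix='.tfrecord'))
```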
As of 2026, tf.Example remains the default record format for TensorFlow training, particularly inside Google and inside teams that target Cloud TPUs. TensorFlow Datasets, TensorFlow Hub example pipelines, and TFX templates all still produce tf.Example records by default, and the schema has not changed in any breaking way since the original release.
Outside of TensorFlow, the format has lost ground. Hugging Face datasets, the dominant catalogue for LLM training data, uses Arrow and Parquet. PyTorch image and video pipelines have largely standardized on WebDataset for very large datasets and on plain image folders for small ones. JSONL is the format of choice for instruction tuning and chat fine-tuning. The result is that most new public datasets aimed at the post-2022 generative AI ecosystem are not distributed as tf.Example.
This is less a story about the format being bad and more a story about the tf.data plus TPU pipeline being a smaller share of new ML work than it was in 2018. tf.Example is still the right call when the workload is TensorFlow on TPU; for almost everything else, a peer format is now the more common choice.
Imagine each training example is a lunchbox with little compartments. Every compartment has a label written on it, like "sandwich" or "apple" or "juice box", and inside the compartment there is one of three kinds of things: a list of whole numbers, a list of decimal numbers, or a list of raw bytes such as a photo. tf.Example is the lunchbox. Features is the set of compartments inside. A Feature is one compartment. The TFRecord file is a long line of identical lunchboxes packed nose to tail in a delivery truck, and the truck drives them straight to the model so it can eat them in order.
Schema sources: tensorflow/core/example/example.proto; tensorflow/core/example/feature.proto.