See also: Machine learning terms
tf.train.Example (commonly written as tf.Example) is a Protocol Buffers message type that TensorFlow uses as its standard on-disk representation for a single training or inference example. The message is a thin wrapper around a flexible map from string feature names to typed value lists, and it is the canonical record body inside a TFRecord file. Because the format is defined by a published .proto schema, every TensorFlow language binding (Python, C++, Java, Go) can read and write the same files without any framework-specific glue.
The schema lives at tensorflow/core/example/example.proto and tensorflow/core/example/feature.proto in the open source TensorFlow repository. The same protos are reused by higher level systems such as TFX, TensorFlow Data Validation, TensorFlow Transform, and Vertex AI training jobs, which is why tf.Example shows up well beyond the core training loop.
tf.Example was introduced alongside TensorFlow itself in 2015 as the standard payload for the TFRecord container. The design goal was a single, compact, language neutral record format that could carry mixed numeric and binary fields, support batching from disk to GPU at high throughput, and survive across versions of TensorFlow without breaking older datasets. Choosing Protobuf for the wire format gave the format a stable schema and forward and backward compatible parsing, both of which matter when datasets sit on disk for years and outlive several model architectures.
The format is intentionally narrow. It does not try to be a relational database, a columnar analytics format, or a transport protocol. Its job is to describe one training example as a bag of typed feature lists, and let the rest of the tf.data pipeline handle shuffling, batching, and prefetching.
The schema is short enough to read end to end. It defines three primitive list types, a oneof Feature wrapper, a Features map, and the top level Example and SequenceExample messages.
```proto
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true, jstype = JS_STRING]; }

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}

message Features {
  map<string, Feature> feature = 1;
}

message Example {
  Features features = 1;
}

message SequenceExample {
  Features context = 1;
  FeatureLists feature_lists = 2;
}

message FeatureList { repeated Feature feature = 1; }
message FeatureLists { map<string, FeatureList> feature_list = 1; }
```
A few details in this schema matter in practice. Float and Int64 lists use packed = true, which means repeated numeric values are written as a single length-prefixed block instead of one tag per value. This is what makes long numeric features (image pixels, embeddings, audio frames) cheap to encode. The jstype = JS_STRING annotation on Int64List is a JavaScript hint: it tells protobuf-js to surface 64-bit ints as strings so they survive the JS Number type, which only handles 53-bit integers safely.
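The size effect of packed encoding is easy to check directly. The sketch below (a minimal example assuming TensorFlow 2.x; the byte counts are approximate) serializes one Example holding a 1,000-element float feature and prints its size, which lands close to the raw 4 bytes per float32 rather than one tag byte per value:

```python
import tensorflow as tf

# 1,000 float32 values stored in a single packed FloatList.
feature = tf.train.Feature(
    float_list=tf.train.FloatList(value=[0.5] * 1000))
example = tf.train.Example(
    features=tf.train.Features(feature={'embedding': feature}))

# Roughly 4 * 1000 payload bytes plus ~20 bytes of tags and lengths.
print(len(example.SerializeToString()))
```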
Every field inside a tf.Example is a Feature, and every Feature is exactly one of three list kinds. The choice of list determines what Python and TensorFlow types are accepted at write time.
| Feature kind | Proto field | Element type | Common uses |
|---|---|---|---|
| BytesList | bytes_list | bytes (raw byte string) | Encoded JPEG or PNG image bytes, UTF-8 strings, serialized tensors via tf.io.serialize_tensor, tokenized sequences as serialized arrays |
| FloatList | float_list | float32 | Continuous features, embedding vectors, audio samples, regression targets, normalized image tensors written as floats |
| Int64List | int64_list | int64 | Class labels, token IDs, timestamps, booleans (encoded as 0 or 1), categorical IDs, image dimensions |
There is no native float64, int32, or boolean type. TensorFlow simply coerces those into the closest list at serialization time: bool becomes int64, int32 widens to int64, and float64 narrows to float32. If precision matters, a value can be stored as a serialized tensor inside a BytesList instead.
A Feature can also be empty (a list of length zero), which is how missing values are represented. Parsers can either fail loudly or fill with a default value depending on how the feature spec is configured.
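Both behaviours are visible from Python. A small sketch (assuming TF 2.x) shows a bool landing in an Int64List as 1, a float64 losing precision inside a FloatList, and an empty list standing in for a missing value:

```python
import numpy as np
import tensorflow as tf

# bool is an int subclass in Python, so it is stored as int64 0 or 1.
f_bool = tf.train.Feature(int64_list=tf.train.Int64List(value=[True]))
print(f_bool.int64_list.value)   # [1]

# float64 narrows to float32 inside a FloatList.
f_f64 = tf.train.Feature(float_list=tf.train.FloatList(value=[np.float64(0.1)]))
print(f_f64.float_list.value)    # [0.10000000149011612]

# A missing value is simply an empty list of the right kind.
f_missing = tf.train.Feature(float_list=tf.train.FloatList(value=[]))
print(len(f_missing.float_list.value))  # 0
```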
The sibling message tf.train.SequenceExample exists because some training data has a clean split between non-time-varying context and a time-ordered sequence. Speech recognition is the canonical case: the speaker ID and sample rate are context, the per-frame audio features are a sequence.
| Aspect | tf.train.Example | tf.train.SequenceExample |
|---|---|---|
| Top-level fields | features only | context plus feature_lists |
| Best for | Independent records with a flat feature schema | Records with both static fields and one or more variable-length sequences |
| Sequence support | Variable-length lists are allowed but every feature is treated as flat | Each FeatureList is a list of Feature values, one per timestep |
| Parsing function | tf.io.parse_single_example, tf.io.parse_example | tf.io.parse_single_sequence_example, tf.io.parse_sequence_example |
| Typical workloads | Image classification, tabular data, single-token labels, ranking features | Audio frames, video frames, time series, token-level labels for sequence tagging |
In practice many teams pick Example even for sequence data, serializing the sequence into a single BytesList field and reshaping after parsing. SequenceExample models the data more faithfully but is heavier to construct, as shown in the sketch below.
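A minimal SequenceExample for a token-labelled record might look like this (a sketch assuming TF 2.x; the speaker_id and token_ids field names and values are illustrative):

```python
import tensorflow as tf

# Context holds static fields; each FeatureList holds one Feature per timestep.
context = tf.train.Features(feature={
    'speaker_id': tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
})
token_ids = tf.train.FeatureList(feature=[
    tf.train.Feature(int64_list=tf.train.Int64List(value=[t]))
    for t in (101, 2023, 2003, 102)
])
serialized = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={'token_ids': token_ids}),
).SerializeToString()

# Parsing returns one dict of context tensors and one of sequence tensors.
ctx, seqs = tf.io.parse_single_sequence_example(
    serialized,
    context_features={'speaker_id': tf.io.FixedLenFeature([], tf.int64)},
    sequence_features={'token_ids': tf.io.FixedLenSequenceFeature([], tf.int64)},
)
print(ctx['speaker_id'].numpy(), seqs['token_ids'].numpy())
```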
tf.Example messages are almost always stored inside TFRecord files. TFRecord is a simple binary container of length-prefixed records with CRC checksums, and a TFRecord file is just one tf.Example after another, optionally compressed with gzip or zlib at the file level.
The on-disk layout of a single record is fixed:
| Field | Size | Purpose |
|---|---|---|
| length | uint64 (little endian) | Number of bytes in the data payload |
| masked_crc32_of_length | uint32 | CRC-32C of the length field, masked |
| data | length bytes | The serialized tf.Example payload |
| masked_crc32_of_data | uint32 | CRC-32C of the data, masked |
The CRC-32C variant uses the Castagnoli polynomial, the same one that ext4, iSCSI, and SCTP use. The masking step is masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8, computed modulo 2^32, which guards against the known weakness of computing a CRC over data that itself embeds CRC values. Records are concatenated end to end with no global header, no index, and no offset table, which is why TFRecord files are streamed sequentially rather than indexed by record number.
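The framing is simple enough to read back with nothing but struct. The sketch below (assuming the third-party crc32c package from PyPI, which is not part of TensorFlow) walks a TFRecord file record by record and verifies each checksum:

```python
import struct
import crc32c  # assumption: pip install crc32c (not part of TensorFlow)

def masked_crc(data: bytes) -> int:
    # masked_crc = rotate_right_by_15(crc) + 0xa282ead8, modulo 2**32.
    crc = crc32c.crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF

def read_records(path):
    with open(path, 'rb') as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 masked length CRC
            if not header:
                return
            length, length_crc = struct.unpack('<QI', header)
            assert length_crc == masked_crc(header[:8]), 'corrupt length'
            data = f.read(length)
            (data_crc,) = struct.unpack('<I', f.read(4))
            assert data_crc == masked_crc(data), 'corrupt record'
            yield data
```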
There is no requirement that the payload be a tf.Example. Any byte string is a legal record body, including raw text lines, JSON blobs, or serialized tensors. tf.Example is just the convention that comes with batteries-included parsing.
The Python API splits the work across three layers: per-value helper functions that wrap a Python value in a Feature, a dictionary that gathers Features under string keys, and the top-level Example wrapper that gets serialized.
```python
import tensorflow as tf

def _bytes_feature(value):
    # Accept an eager tensor as well as plain Python bytes.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_image_example(image_bytes, label, height, width):
    feature = {
        'image/encoded': _bytes_feature(image_bytes),
        'image/format': _bytes_feature(b'jpeg'),
        'image/height': _int64_feature(height),
        'image/width': _int64_feature(width),
        'label': _int64_feature(label),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# dataset yields (path, label) pairs; get_dims stands in for whatever
# returns the image height and width.
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for image_path, label in dataset:
        image_bytes = tf.io.read_file(image_path).numpy()
        h, w = get_dims(image_path)
        writer.write(serialize_image_example(image_bytes, label, h, w))
```
This is the standard pattern for image classification datasets. The image is kept as raw JPEG bytes inside a BytesList, which is far smaller than storing decoded pixels and lets the GPU side of the pipeline call tf.io.decode_jpeg after parsing. Field names use a category/subfield convention borrowed from the TF-Slim and TensorFlow Object Detection codebases.
Reading TFRecord files goes through tf.data.TFRecordDataset, and parsing goes through tf.io.parse_single_example (one record at a time) or tf.io.parse_example (a batched tensor of serialized records). Both take a feature description dictionary that names every field the parser should extract, along with its dtype and shape.
```python
feature_description = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'image/format': tf.io.FixedLenFeature([], tf.string, default_value='jpeg'),
    'image/height': tf.io.FixedLenFeature([], tf.int64),
    'image/width': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def _parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_description)
    image = tf.io.decode_jpeg(parsed['image/encoded'], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed['label']

ds = (tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
      .map(_parse, num_parallel_calls=tf.data.AUTOTUNE)
      .shuffle(10_000)
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))
```
The two main feature spec classes do different jobs. FixedLenFeature(shape, dtype) returns dense tensors and is used when every record has the same shape. VarLenFeature(dtype) returns a tf.SparseTensor and is used when records have ragged or variable-length fields, after which tf.sparse.to_dense produces a dense view. A third helper, FixedLenSequenceFeature, handles variable length sequences inside SequenceExample records.
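The difference is easiest to see with a variable-length field. A short sketch of the sparse path (assuming TF 2.x; the tags field is illustrative):

```python
import tensorflow as tf

serialized = tf.train.Example(features=tf.train.Features(feature={
    'tags': tf.train.Feature(int64_list=tf.train.Int64List(value=[3, 14, 15])),
})).SerializeToString()

# VarLenFeature yields a tf.SparseTensor; densify it explicitly.
parsed = tf.io.parse_single_example(
    serialized, {'tags': tf.io.VarLenFeature(tf.int64)})
dense = tf.sparse.to_dense(parsed['tags'], default_value=0)
print(dense.numpy())  # [ 3 14 15]
```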
For quick inspection without a full parsing pipeline, the legacy tf.io.tf_record_iterator has been superseded by tf.data.TFRecordDataset(...).take(n) combined with tf.train.Example.FromString, which walks a file record by record and exposes each record's features as Python objects.
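A minimal inspection loop along those lines (a sketch, assuming TF 2.x and a local train.tfrecord file):

```python
import tensorflow as tf

# Dump the feature keys of the first five records.
for raw in tf.data.TFRecordDataset('train.tfrecord').take(5):
    example = tf.train.Example.FromString(raw.numpy())
    print(sorted(example.features.feature.keys()))
```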
The combination of Protobuf as the record schema and TFRecord as the container gives the format several properties that matter for large-scale training.
| Property | What it buys |
|---|---|
| Compact binary encoding | Numeric features are packed; bytes are stored verbatim; total file size is typically smaller than the equivalent CSV or JSONL by a wide margin |
| Streaming friendly | Sequential record layout means readers can pull records as fast as the disk or network can serve them, with no need to seek |
| Parallel I/O | TFRecordDataset(filenames, num_parallel_reads=N) splits work across files automatically; tf.data.experimental.parallel_interleave was the older API |
| Schema flexibility | Adding a new feature to a dataset does not break old readers, because they simply ignore unknown keys |
| TPU friendly | The combination of TFRecord on Google Cloud Storage and the tf.data input pipeline is the recommended path for Cloud TPU training |
| Language neutral | The same files can be parsed from C++, Java, Go, JavaScript, or any other language with a Protobuf binding |
Google uses TFRecord and tf.Example as the default training data format inside many of its production pipelines, including the public TensorFlow Datasets catalogue, which ships every dataset as sharded TFRecord files of tf.Example records.
tf.Example is also a product of its era and ecosystem, and the tradeoffs show.
The format is row-oriented. Reading just one column out of a 200-feature record means the parser still touches every byte of every record, because there is no column index. For analytic workloads ("give me the mean of feature X across the dataset") this is much slower than a columnar format like Parquet or Apache Arrow.
It is not human readable. There is no equivalent of head -n 5 train.tfrecord that produces something useful. Quick inspection requires either a small Python snippet or a tool like tfrecord-viewer.
There is no enforced schema. The Features map is map<string, Feature>, which means any record can carry any keys with any types. Two records inside the same file can disagree on which keys exist, and the parser only finds out at read time. Tools like TensorFlow Data Validation exist precisely to add a schema layer on top.
It is TensorFlow-shaped. The format is technically open, but the only mature reader and writer ecosystem lives inside the TensorFlow project. PyTorch has community packages such as tfrecord and webdataset.TFRecord, but they are second-class citizens compared to native PyTorch formats.
Finally, for very small datasets the per-file overhead and shard tuning rules (Google recommends at least 100 MB per shard, with a total shard count of roughly 10 times the number of input hosts) can be more friction than benefit. For datasets that fit in memory, a NumPy file or a Pandas DataFrame is usually simpler.
tf.Example sits in a crowded design space. Each peer format makes different tradeoffs between row vs column orientation, streaming vs random access, and framework neutrality.
| Format | Layout | Schema | Primary ecosystem | Notes |
|---|---|---|---|---|
| TFRecord with tf.Example | Row-oriented binary records, length-prefixed | Implicit, per-record | TensorFlow, TFX | Native to tf.data and TPU pipelines; weak random access |
| Parquet | Columnar with row groups | Explicit, file level | Spark, Pandas, Polars, DuckDB, Hugging Face datasets | Best in class for analytic queries and partial-column reads |
| Apache Arrow and Feather | Columnar, in-memory and on-disk | Explicit | Polars, modern Spark, Hugging Face datasets | Designed for zero-copy IPC between processes |
| JSONL | One JSON object per line, text | Implicit | Hugging Face Hub uploads, LLM fine-tuning datasets | Human readable; slow to parse; no efficient binary fields |
| CSV | Comma separated text | None | Universal | Easy to inspect; no nested data; no native types beyond string |
| HDF5 | Hierarchical binary | Explicit | Scientific computing, older Keras model.save() | Great for multi-dimensional arrays; complex API |
| WebDataset | Tar files of per-sample shards | Implicit, file extensions act as keys | PyTorch, JAX, TensorFlow | Pure Python tooling; bit-identical to source files; popular in CV training |
| safetensors | Flat tensor blob with header | Explicit per file | PyTorch, JAX, Hugging Face | Stores model weights only, not training data |
The practical pattern in 2026 looks roughly like this. TensorFlow training stacks (especially anything that runs on TPUs or originated from Google) keep using tf.Example. Modern PyTorch and JAX training stacks tend to use WebDataset for image and video, JSONL or Parquet for text and LLM fine-tuning, and Arrow as the in-memory exchange format. The Hugging Face datasets library defaults to Arrow on disk and Parquet on the Hub, which is why most public LLM datasets you download are Parquet rather than TFRecord.
A few utilities outside the core TensorFlow API are worth knowing about when working with tf.Example records.
| Tool | Purpose |
|---|---|
| tf.io.TFRecordWriter and tf.data.TFRecordDataset | The canonical write and read APIs inside TensorFlow |
| tf.io.tf_record_iterator | Legacy iterator for one-off inspection (deprecated; use TFRecordDataset plus Example.FromString) |
| Apache Beam and Dataflow | Large scale TFRecord generation; Beam ships a native WriteToTFRecord PTransform |
| tfrecord-viewer | A small Flask app that opens a TFRecord file in the browser and renders image features as thumbnails |
| TensorFlow Data Validation | Generates a schema, statistics, and anomaly reports from a TFRecord dataset of tf.Example |
| TensorFlow Datasets (TFDS) | The official catalogue of public datasets; every dataset is downloaded as sharded TFRecords of tf.Example |
| PyTorch tfrecord package | Third-party library that lets PyTorch DataLoaders consume TFRecord files without TensorFlow installed |
For large pipelines the typical setup is Apache Beam on Dataflow generating sharded TFRecord output to Google Cloud Storage, which is then consumed by a tf.data input pipeline on a TPU pod for training.
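A condensed version of that generation stage might look like the sketch below (assuming Apache Beam's built-in beam.io.WriteToTFRecord transform; the input rows, feature names, and output prefix are illustrative):

```python
import apache_beam as beam
import tensorflow as tf

def make_example(row):
    # row is an illustrative (value, label) pair.
    value, label = row
    example = tf.train.Example(features=tf.train.Features(feature={
        'value': tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create([(0.5, 1), (1.5, 0)])
     | 'ToExample' >> beam.Map(make_example)
     | 'Write' >> beam.io.WriteToTFRecord(
           'output/train', file_name_suffix='.tfrecord'))
```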
As of 2026, tf.Example remains the default record format for TensorFlow training, particularly inside Google and inside teams that target Cloud TPUs. TensorFlow Datasets, TensorFlow Hub example pipelines, and TFX templates all still produce tf.Example records by default, and the schema has not changed in any breaking way since the original release.
Outside of TensorFlow, the format has lost ground. Hugging Face datasets, the dominant catalogue for LLM training data, uses Arrow and Parquet. PyTorch image and video pipelines have largely standardized on WebDataset for very large datasets and on plain image folders for small ones. JSONL is the format of choice for instruction tuning and chat fine-tuning. The result is that most new public datasets aimed at the post-2022 generative AI ecosystem are not distributed as tf.Example.
This is less a story about the format being bad and more a story about the tf.data plus TPU pipeline being a smaller share of new ML work than it was in 2018. tf.Example is still the right call when the workload is TensorFlow on TPU; for almost everything else, a peer format is now the more common choice.
Imagine each training example is a lunchbox with little compartments. Every compartment has a label written on it, like "sandwich" or "apple" or "juice box", and inside the compartment there is one of three kinds of things: a list of whole numbers, a list of decimal numbers, or a list of raw bytes such as a photo. tf.Example is the lunchbox. Features is the set of compartments inside. A Feature is one compartment. The TFRecord file is a long line of identical lunchboxes packed nose to tail in a delivery truck, and the truck drives them straight to the model so it can eat them in order.
Schema sources: tensorflow/core/example/example.proto; tensorflow/core/example/feature.proto.