Feature spec
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 5,295 words
See also: Machine learning terms
A feature spec (short for feature specification) is a declarative description of the input features used by a machine learning model: the name of each feature, its data type, its shape, whether it is required or has a default value, and how it should be encoded before reaching the model. The term has several overlapping meanings in practice. In TensorFlow, a feature spec is a Python dictionary that tells tf.io.parse_example how to decode serialized tf.train.Example records. In TFX (TensorFlow Extended) and TensorFlow Data Validation, a feature spec is derived from a Schema proto and drives parsing, validation, and transformation. In a broader sense, the phrase also covers the schema files used by feature stores such as Feast, Tecton, and Vertex AI Feature Store, and the legacy tf.feature_column API that defined feature schemas for Estimator models.
Across all of these meanings the unifying idea is the same. A feature spec sits between raw bytes on disk and the rectangular tensors a model consumes. It records the contract that data producers and model owners agree on, so that a value written today as a 32 bit float scalar is still a 32 bit float scalar when an inference job in a different process, on a different machine, weeks later, asks for it by name. When that contract holds, the rest of the pipeline can stay simple. When it does not, the failure mode is silent training and serving skew, which is one of the most common and most painful classes of bug in applied machine learning.
The narrowest definition of a feature spec is the dictionary passed to tf.io.parse_example or tf.io.parse_single_example. Each key is a feature name, and each value is a parsing helper that describes the type and shape of that feature in a serialized tf.train.Example record.
A tf.train.Example is itself a small Protocol Buffers message. The top level message contains a Features map, where each key is a string and each value is a Feature message holding one of three list types: BytesList, FloatList, or Int64List. The proto definition is roughly:
message Example { Features features = 1; }
message Features { map<string, Feature> feature = 1; }
message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}
Because every value bottoms out in one of three repeated lists, the Example format is type-erased: the bytes on disk do not carry shape information beyond "a sequence of strings or floats or ints". The feature spec restores that lost information at read time by telling the parser what shape and dtype to materialize from each list. This is why a feature spec is required to read a TFRecord file in any useful way, even though no spec is strictly required to write one.
The four main parsing helpers are:
| Helper | Output tensor | Typical use |
|---|---|---|
| tf.io.FixedLenFeature(shape, dtype, default_value=None) | Dense Tensor | Scalars and fixed size vectors such as labels, ages, or 128 dimensional embeddings |
| tf.io.VarLenFeature(dtype) | SparseTensor | Variable length lists such as tags, token IDs, or click sequences |
| tf.io.SparseFeature(index_key, value_key, dtype, size, already_sorted=False) | SparseTensor | Pre indexed sparse data with explicit position keys |
| tf.io.RaggedFeature(dtype, value_key=None, partitions=(), row_splits_dtype=tf.int32, validate=False) | RaggedTensor | Variable length lists where ragged dimensions are preferred over sparse ones |
A typical parsing feature spec looks like this:
import tensorflow as tf
feature_spec = {
    "image": tf.io.FixedLenFeature((), tf.string),
    "label": tf.io.FixedLenFeature((), tf.int64),
    "embedding": tf.io.FixedLenFeature((128,), tf.float32),
    "tags": tf.io.VarLenFeature(tf.string),
    "clicks": tf.io.RaggedFeature(tf.int64),
    "optional_age": tf.io.FixedLenFeature((), tf.float32, default_value=0.0),
}
parsed = tf.io.parse_example(serialized_examples, feature_spec)
A FixedLenFeature without a default_value is treated as required, and parsing fails if any record is missing that field. VarLenFeature produces a SparseTensor with indices of shape [N, 2] containing [batch_row, position] pairs, values of shape [N], and a dense_shape of [batch_size, max_length_in_batch], which makes it the standard way to encode lists of unknown length. The TensorFlow documentation recommends VarLenFeature over SparseFeature in most cases because its semantics are easier to reason about. SparseFeature is only useful when the indices of the sparse positions are themselves stored as a separate feature in the same Example, which is rare outside of legacy data.
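The SparseTensor layout described above can be sketched in plain Python. This is only an illustration of the three components, not TensorFlow's implementation, and the helper name to_sparse_components is made up for this example.

```python
def to_sparse_components(rows):
    """Return (indices, values, dense_shape) for a batch of variable-length lists."""
    indices, values = [], []
    for batch_row, row in enumerate(rows):
        for position, value in enumerate(row):
            indices.append([batch_row, position])  # one [batch_row, position] pair per value
            values.append(value)
    # dense_shape is [batch_size, max_length_in_batch]
    max_len = max((len(row) for row in rows), default=0)
    return indices, values, [len(rows), max_len]


# Three records: two tags, no tags, one tag.
indices, values, dense_shape = to_sparse_components([["a", "b"], [], ["c"]])
```

The record with no tags contributes nothing to indices or values; it is visible only through the batch dimension of dense_shape.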
The parser maps the three proto list types to TensorFlow dtypes through a fixed correspondence. Understanding this mapping is critical when authoring code that produces Example records.
| Proto list | Compatible writer types | Parser dtype options |
|---|---|---|
| Int64List | bool, enum, int32, uint32, int64, uint64 | tf.int64 only |
| FloatList | float (float32), double (float64) | tf.float32 only |
| BytesList | string, bytes | tf.string only |
A value written as a 32 bit integer is upcast to int64 on write, and the parser always returns tf.int64. A value written as a double is downcast to float32 on write, and the parser always returns tf.float32. There is no way to round trip an int32 or float64 tensor through tf.train.Example without an explicit cast on the consumer side. This is a deliberate simplification of the format, but it has caught countless practitioners by surprise.
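The precision loss for doubles can be reproduced without TensorFlow. The sketch below uses the standard struct module to round a Python float (a float64) to float32, which is the width a FloatList stores; the helper name through_float_list is hypothetical.

```python
import struct


def through_float_list(x):
    """Round a float64 to float32 precision, as storing it in a FloatList would."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


original = 0.1
restored = through_float_list(original)

lost_precision = restored != original   # 0.1 is not exactly representable in float32
error = abs(restored - original)        # small but nonzero rounding error
exact_case = through_float_list(0.5)    # powers of two survive unchanged
```

Casting the parsed tf.float32 tensor back to tf.float64 on the consumer side restores the dtype but not the lost bits.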
Bytes features have a similar wrinkle. Images, serialized tensors, JSON blobs, and encoded protos all travel as BytesList. The feature spec marks them as tf.string, but the consumer is responsible for further decoding, typically with tf.io.decode_jpeg, tf.io.decode_png, tf.io.parse_tensor, or a custom proto parser inside a tf.py_function.
The simplest end to end demonstration of a feature spec is a round trip through a single TFRecord file.
import tensorflow as tf
def _bytes_feature(value):
    # Wraps a single bytes value.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _float_feature(value):
    # Expects a list of floats, such as a fixed size embedding.
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _int64_feature(value):
    # Wraps a single integer value.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
with tf.io.TFRecordWriter("toy.tfrecord") as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "id": _int64_feature(i),
            "label": _int64_feature(i % 2),
            "embedding": _float_feature([0.1 * i] * 8),
            "text": _bytes_feature(f"item {i}".encode("utf-8")),
        }))
        writer.write(example.SerializeToString())
feature_spec = {
    "id": tf.io.FixedLenFeature((), tf.int64),
    "label": tf.io.FixedLenFeature((), tf.int64),
    "embedding": tf.io.FixedLenFeature((8,), tf.float32),
    "text": tf.io.FixedLenFeature((), tf.string),
}
ds = tf.data.TFRecordDataset("toy.tfrecord")
ds = ds.map(lambda x: tf.io.parse_single_example(x, feature_spec))
for row in ds.take(1):
    print({k: v.numpy() for k, v in row.items()})
The writer code knows nothing about feature specs. It just emits valid Example protos. The reader code is where the spec earns its keep: it converts the type erased lists back into typed, shaped tensors that downstream tf.data operations and Keras layers can consume.
For batched parsing, the recommended pattern is to call dataset.batch(batch_size).map(parse_fn) rather than map(parse_single_example).batch(batch_size). The batched form uses tf.io.parse_example, which parses the whole batch in a single vectorized op and is typically much faster than per record parsing followed by batching.
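A minimal sketch of the batch-then-parse pattern follows. It assumes TensorFlow 2.x is installed and builds a few toy records in memory instead of reading a TFRecord file; the feature names and the make_example helper are illustrative only.

```python
import tensorflow as tf


def make_example(i):
    # Hypothetical toy record: one scalar label and one 4-dimensional vector.
    return tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[i % 2])),
        "embedding": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(i)] * 4)),
    })).SerializeToString()


feature_spec = {
    "label": tf.io.FixedLenFeature((), tf.int64),
    "embedding": tf.io.FixedLenFeature((4,), tf.float32),
}

serialized = [make_example(i) for i in range(6)]
ds = tf.data.Dataset.from_tensor_slices(serialized)

# Batch the serialized strings first, then parse each batch in one call.
batched = ds.batch(3).map(lambda x: tf.io.parse_example(x, feature_spec))

first = next(iter(batched))
label_shape = tuple(first["label"].shape)
embedding_shape = tuple(first["embedding"].shape)
```

Each parsed tensor gains a leading batch dimension, so the scalar label becomes a vector of length 3 and the embedding becomes a 3 by 4 matrix.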
With FixedLenFeature, a missing default_value argument means the feature is required. Any input record that lacks the key triggers a parsing error at runtime. Setting a default_value makes the feature optional and supplies a fill value for absent records. The shape of the default must match the declared shape, including the rank.
spec = {
    "age": tf.io.FixedLenFeature((), tf.float32, default_value=0.0),
    "embedding": tf.io.FixedLenFeature((4,), tf.float32, default_value=[0.0]*4),
}
VarLenFeature cannot take a default value because the natural "missing" state of a variable length list is the empty list, which a sparse tensor already represents. RaggedFeature is similar: a record with no values for a ragged feature simply contributes an empty row to the batch.
For data with explicit ordered sequences, such as a stream of click events per user, TensorFlow also provides tf.train.SequenceExample and the matching helpers tf.io.FixedLenSequenceFeature and tf.io.parse_sequence_example. A SequenceExample has a context field (a normal Features map) plus a feature_lists field (a map of FeatureList, each a repeated Feature). The parser takes two dictionaries instead of one. The single record form, tf.io.parse_single_sequence_example, returns a (context, sequence) pair; the batched tf.io.parse_sequence_example returns a third dictionary holding the lengths of the dense sequence features.
context_features = {
    "user_id": tf.io.FixedLenFeature((), tf.int64),
}
sequence_features = {
    "click_ids": tf.io.FixedLenSequenceFeature((), tf.int64),
    "dwell_ms": tf.io.FixedLenSequenceFeature((), tf.float32, allow_missing=True),
}
# The batched parser returns three dictionaries: context, sequences, and lengths.
context, sequences, lengths = tf.io.parse_sequence_example(
    serialized, context_features=context_features,
    sequence_features=sequence_features,
)
FixedLenSequenceFeature declares the shape of each step (not the whole sequence). Setting allow_missing=True lets the parser pad missing steps with default_value and is commonly used when each record can have a different sequence length but each step has the same fixed shape.
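The effect of padding on a batch can be sketched in plain Python. This is an illustration of the idea only, not the parser's code; pad_batch is a hypothetical helper and the default fill value of 0 is an assumption.

```python
def pad_batch(sequences, default_value=0):
    """Pad every sequence to the longest one in the batch and keep the true lengths."""
    max_len = max((len(s) for s in sequences), default=0)
    padded = [list(s) + [default_value] * (max_len - len(s)) for s in sequences]
    lengths = [len(s) for s in sequences]
    return padded, lengths


# Three sessions with 3, 1, and 0 clicks respectively.
padded, lengths = pad_batch([[7, 8, 9], [4], []])
```

Keeping the true lengths next to the padded batch lets a model mask out the padding steps, since a padded 0 is otherwise indistinguishable from a real value of 0.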
tf.io.RaggedFeature returns a RaggedTensor, which is TensorFlow's native representation for jagged sequences of unequal length. With zero partitions, RaggedFeature behaves like a one dimensional flat list. To express higher rank ragged data, callers pass a tuple of partition objects that describe how the flat values map into rows.
| Partition class | Meaning |
|---|---|
| tf.io.RaggedFeature.RowSplits(key) | Cumulative split offsets between rows |
| tf.io.RaggedFeature.RowLengths(key) | Length of each row |
| tf.io.RaggedFeature.RowStarts(key) | Start index of each row inside the flat values array |
| tf.io.RaggedFeature.RowLimits(key) | End index of each row |
| tf.io.RaggedFeature.ValueRowIds(key) | Row index of each individual value |
| tf.io.RaggedFeature.UniformRowLength(length) | A fixed inner row length, identical for every outer row |
This is how a single RaggedFeature can produce a two or three dimensional RaggedTensor directly from a flat Int64List plus a few side channel features that describe the partition. In practice, most teams use zero or one partition and prefer VarLenFeature when more complex shapes are needed, because tooling around SparseTensor is still slightly more mature.
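A small sketch of a single partition, assuming TensorFlow 2.x. The feature keys "values" and "lengths" are invented for this example; the record stores six flat integers plus a side channel feature giving the length of each row.

```python
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    # Flat values for all rows, concatenated.
    "values": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[3, 1, 4, 1, 5, 9])),
    # Side channel feature: the length of each row.
    "lengths": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[2, 1, 3])),
})).SerializeToString()

spec = {
    # Zero partitions: the flat list of values for this record.
    "flat": tf.io.RaggedFeature(tf.int64, value_key="values"),
    # One partition: the same values split into rows by the "lengths" feature.
    "rows": tf.io.RaggedFeature(
        tf.int64,
        value_key="values",
        partitions=[tf.io.RaggedFeature.RowLengths("lengths")],
    ),
}

parsed = tf.io.parse_single_example(example, spec)
flat_values = parsed["flat"].numpy().tolist()
rows = parsed["rows"].to_list()
```

Both outputs read the same stored values; only the spec decides whether they come back as one flat list or as rows of length 2, 1, and 3.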
In TFX pipelines the canonical description of the data is a Schema proto, defined in the tensorflow_metadata library. The schema records the type of each feature, its presence (required, optional), its valence (single value, fixed list, variable length), allowed value ranges, vocabulary domains, and other constraints. Schemas can be authored by hand or generated automatically by the SchemaGen component, which infers a starting schema from statistics produced by StatisticsGen.
The Schema proto is converted into a parsing feature spec through tensorflow_transform.tf_metadata.schema_utils.schema_as_feature_spec. The inverse function schema_from_feature_spec builds a schema from a dictionary of FixedLenFeature, VarLenFeature, and SparseFeature objects. This conversion is the bridge between the high level schema, used for validation and documentation, and the low level parsing dictionary, used at runtime to materialize tensors from tf.train.Example records.
A second proto, TensorRepresentation, was added to the schema in later versions of tensorflow_metadata. It captures more fine grained intent for how a feature should be materialized as a tensor: dense, sparse, ragged, or variable length sparse. Where a plain Feature entry leaves some ambiguity (a one valued feature could be a scalar or a length 1 vector), an explicit TensorRepresentation removes it. Modern TFX components consult TensorRepresentation first and fall back to the legacy mapping when it is absent.
A full TFX flow usually looks like this:
1. ExampleGen ingests raw data and writes tf.train.Example files.
2. StatisticsGen computes column level statistics.
3. SchemaGen produces an initial schema, which engineers then curate and check into source control.
4. ExampleValidator and TensorFlow Data Validation use the schema to flag anomalies, missing values, and out of vocabulary categories.
5. Transform uses the schema to build a parsing feature spec and to define tf.Transform preprocessing logic with preprocessing_fn.
6. Trainer and Pusher consume the transformed features along with the same schema, which makes training and serving see the same view of the data.

TFDV uses the schema for three skew checks: schema skew, where training and serving records do not conform to the same schema; feature skew, where feature values differ between training and serving for the same entity; and distribution skew, where the distribution of feature values in the training data differs significantly from the distribution seen in serving. Environments can be attached to features so that the same schema covers both training (where labels are present) and serving (where they are not).
A minimal anomaly check looks like:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_tfrecord("data/train-*")
schema = tfdv.infer_schema(stats)
anomalies = tfdv.validate_statistics(stats, schema)
tfdv.display_anomalies(anomalies)
Once the schema is locked in, the same call against the next day's statistics surfaces drift: features whose value count, mean, or category distribution has shifted beyond a tolerable threshold. Teams typically wire this into a CI job that blocks the pipeline from advancing to training when anomalies fire.
TFDV's anomaly taxonomy is broader than schema mismatches and worth knowing in detail.
| Anomaly | Trigger |
|---|---|
| Schema mismatch | A feature absent from the schema appears in the data, or vice versa |
| Type mismatch | An integer feature suddenly contains string values |
| Domain violation | A categorical feature contains a token not in the declared vocabulary |
| Out of range | A numeric feature exceeds the declared [min, max] range |
| Missing feature | A required feature is absent from a fraction of examples above the allowed threshold |
| Skew | The distribution of a feature in serving traffic diverges from training |
| Drift | The distribution at time t diverges from the distribution at t minus one window |
Each anomaly has a severity (WARNING or ERROR) and a reason string. The schema can be updated to widen a domain or relax a presence requirement, which is how engineers respond to a planned change in upstream data.
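Several of the anomaly types above can be illustrated with a per-record check in plain Python. This is a simplified sketch: TFDV validates aggregate statistics against a Schema proto rather than individual records, and the dictionary schema format and find_anomalies helper here are hypothetical.

```python
# Hypothetical schema format for illustration; TFDV uses a Schema proto instead.
SCHEMA = {
    "age": {"type": float, "required": True, "min": 0.0, "max": 120.0},
    "city": {"type": str, "required": True, "domain": {"paris", "tokyo"}},
    "referrer": {"type": str, "required": False},
}


def find_anomalies(record, schema):
    """Return a list of (feature_name, anomaly_kind) pairs for one record."""
    anomalies = []
    for name in record:
        if name not in schema:
            anomalies.append((name, "schema mismatch"))
    for name, rule in schema.items():
        if name not in record:
            if rule["required"]:
                anomalies.append((name, "missing feature"))
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            anomalies.append((name, "type mismatch"))
            continue
        if "domain" in rule and value not in rule["domain"]:
            anomalies.append((name, "domain violation"))
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            anomalies.append((name, "out of range"))
    return anomalies


good = find_anomalies({"age": 31.0, "city": "paris"}, SCHEMA)
bad = find_anomalies({"age": 150.0, "city": "oslo", "device": "tv"}, SCHEMA)
```

Skew and drift are not covered here because they compare two distributions rather than one record against a schema.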
TensorFlow Transform is the bridge between a feature spec and the features a model actually consumes. Its core abstraction is the preprocessing_fn, a user defined function that takes a dictionary of input tensors keyed by raw feature name and returns a dictionary of output tensors keyed by transformed feature name. The function is written in pure TensorFlow but can call special analyzers from the tft namespace that compute global statistics with Apache Beam under the hood.
import tensorflow as tf
import tensorflow_transform as tft
def preprocessing_fn(inputs):
    return {
        "age_norm": tft.scale_to_z_score(inputs["age"]),
        "income_log": tf.math.log1p(inputs["income"]),
        "city_idx": tft.compute_and_apply_vocabulary(inputs["city"], top_k=10_000),
        "clicks_bow": tft.bag_of_words(inputs["clicks"], ngram_range=(1, 1), separator=" "),
    }
During the analyze phase, Beam scans every record and computes the mean, variance, vocabulary, and any other analyzer results. During the transform phase, those constants are baked into a TensorFlow graph that maps raw Example records to ready to train tensors. The graph is saved as a SavedModel and can be applied at serving time, which is how tf.Transform eliminates training and serving skew for the preprocessing layer. Without tf.Transform, a normalization that used the training mean would be recomputed at serving time, which produces a slightly different value and silently degrades the model.
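The split between the two phases can be sketched in plain Python. This mimics the idea only and is not how tf.Transform is implemented; analyze and build_transform are hypothetical names, and the population standard deviation is an arbitrary choice for the example.

```python
import statistics


def analyze(values):
    """Analyze phase: one full pass over the training data to compute constants."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def build_transform(constants):
    """Transform phase: freeze the constants into a function reusable at serving time."""
    mean, std = constants["mean"], constants["std"]

    def transform(x):
        return (x - mean) / std

    return transform


train_ages = [20.0, 30.0, 40.0]
transform = build_transform(analyze(train_ages))

serving_ages = [50.0, 60.0]
# Correct: serving reuses the constants computed on the training data.
frozen = [transform(x) for x in serving_ages]
# Skewed: statistics recomputed on serving traffic give different outputs.
recomputed = [build_transform(analyze(serving_ages))(x) for x in serving_ages]
```

The same serving values map to different model inputs depending on where the statistics came from, which is the skew that freezing the constants into the saved graph avoids.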
The input feature spec to preprocessing_fn comes from the schema. The output feature spec is implied by the keys and dtypes of the returned dictionary. Both are written to a metadata directory next to the SavedModel so that downstream consumers (training, serving, or further pipelines) know exactly what tensors to expect.
For several years the dominant way to describe input features for a Keras or Estimator model in TensorFlow was the tf.feature_column API. A feature_column object combined two ideas: it described the schema of a column (numeric, categorical with a vocabulary, hashed, bucketized) and it produced a dense tensor that could be fed into an Estimator head or a DenseFeatures layer.
Common column constructors included:
| Constructor | Purpose |
|---|---|
| tf.feature_column.numeric_column | Treat a column as a real valued feature, with an optional normalizer function |
| tf.feature_column.categorical_column_with_vocabulary_list | Map strings to integer IDs using a fixed vocabulary |
| tf.feature_column.categorical_column_with_hash_bucket | Hash large categorical spaces into a fixed number of buckets |
| tf.feature_column.bucketized_column | Discretize a numeric column into bins |
| tf.feature_column.embedding_column | Look up a learned embedding for a categorical column |
| tf.feature_column.indicator_column | One hot or multi hot encode a categorical column |
| tf.feature_column.crossed_column | Cross two or more categorical features via hashing |
Starting in TensorFlow 2.13 the API was marked as not recommended for new code, and TensorFlow 2.16 made the deprecation more visible by emitting warnings on every column constructor and pointing users toward Keras preprocessing layers. The recommended path is to replace each column with the equivalent layer.
| Old tf.feature_column.* | New Keras layer |
|---|---|
| numeric_column | tf.keras.layers.Normalization |
| categorical_column_with_identity | tf.keras.layers.CategoryEncoding |
| categorical_column_with_vocabulary_list | tf.keras.layers.StringLookup or IntegerLookup |
| categorical_column_with_hash_bucket | tf.keras.layers.Hashing |
| bucketized_column | tf.keras.layers.Discretization |
| embedding_column | tf.keras.layers.Embedding after a lookup layer |
| indicator_column | output_mode='one_hot' or 'multi_hot' on a lookup or encoding layer |
| crossed_column | tf.keras.layers.HashedCrossing (formerly under tf.keras.layers.experimental.preprocessing) |
The migration is not just cosmetic. Keras preprocessing layers can run inside a tf.data pipeline for asynchronous CPU preprocessing, can be saved with the model so the same logic runs at serving time, and produce sparse outputs natively when sparse=True is set. Feature columns, by contrast, were tightly coupled to Estimator and tended to materialize dense tensors even for very wide categorical inputs.
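A short sketch of three of the replacement layers, assuming TensorFlow 2.x with its bundled Keras. The vocabulary, bin boundaries, and input values are made up for illustration.

```python
import tensorflow as tf

# Replaces categorical_column_with_vocabulary_list: strings -> integer IDs.
# By default index 0 is reserved for out-of-vocabulary tokens.
lookup = tf.keras.layers.StringLookup(vocabulary=["paris", "tokyo"])
city_ids = lookup(tf.constant(["tokyo", "oslo", "paris"]))

# Replaces categorical_column_with_hash_bucket: strings -> one of num_bins buckets.
hashing = tf.keras.layers.Hashing(num_bins=16)
bucket_ids = hashing(tf.constant(["user_1", "user_2"]))

# Replaces bucketized_column: numeric values -> bin index.
discretize = tf.keras.layers.Discretization(bin_boundaries=[18.0, 65.0])
age_bins = discretize(tf.constant([10.0, 30.0, 70.0]))
```

Because each layer is an ordinary Keras layer, it can be applied inside a tf.data map function or placed at the front of the model itself.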
Keras 3 ships a higher level utility called tf.keras.utils.FeatureSpace (also exposed as keras.utils.FeatureSpace). It plays the same role that tf.feature_column used to play, but on top of Keras preprocessing layers. Each feature is declared with a short string that names its type, and FeatureSpace builds the corresponding preprocessing pipeline.
import keras
feature_space = keras.utils.FeatureSpace(
    features={
        "age": "float_normalized",
        "thal": "string_categorical",
        "sex": "integer_categorical",
    },
    crosses=[("sex", "thal")],
    output_mode="concat",
)
feature_space.adapt(train_dataset)
encoded = feature_space(raw_inputs)
Supported feature types include float, float_normalized, float_rescaled, float_discretized, integer_categorical, string_categorical, integer_hashed, and string_hashed. Cross features are declared by passing tuples of feature names to the crosses argument, and output_mode controls whether FeatureSpace returns a single concatenated vector or a dictionary of encoded tensors. Calling .adapt() on a representative dataset fits the underlying Normalization, StringLookup, IntegerLookup, and Discretization layers, after which the FeatureSpace can be saved alongside the model and replayed during inference.
FeatureSpace also exposes the underlying layers through the preprocessors and crossers properties for inspection or partial reuse, and the get_inputs() and get_encoded_features() methods make it trivial to wire the preprocessing layer into a Keras Functional model. The hashing dimension (hashing_dim), crossing dimension (crossing_dim), and number of discretization bins (num_discretization_bins) can all be tuned through constructor arguments.
Feature stores extend the idea of a feature spec from a single training job to a shared catalog of features that many teams and models can reuse. Each store has its own way to declare feature schemas, but the building blocks are similar: entities, features, types, and views. The schema sitting at the core of a feature store is precisely a feature spec for the organization, not just one job.
Feast defines features with Field objects inside a FeatureView. Each Field carries a name and a Feast type, drawn from feast.types, which includes primitives such as Int64, Float32, String, and Bytes, along with complex types such as Array, Map, Json, and Struct. A typical declaration looks like this:
from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64
driver = Entity(name="driver", join_keys=["driver_id"])
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=hourly_stats_source,
)
If the schema is omitted, Feast infers it from the data source when feast apply runs. The resulting feature views are versioned, registered in the Feast registry, and used to assemble offline training data and online serving requests with the same field names and types.
Tecton takes a similar declarative approach but combines schema definition with the transformation that produces the feature. Tecton features are written in Python, SQL, or Spark, with explicit input and output types, and the platform materializes them to both an offline store (for training) and an online store (for serving) under the same name. Tecton further distinguishes between Batch, Stream, and Realtime feature views, each of which has a different latency contract but shares the same schema surface.
Vertex AI Feature Store on Google Cloud uses three registry resources: FeatureGroup, which corresponds to a BigQuery table or view; Feature, which points to a column inside that table; and FeatureView, which is a logical collection materialized to an online store instance. The schema is anchored in the underlying BigQuery columns, with the registry providing metadata and access control on top. Other major feature stores, including Amazon SageMaker Feature Store and Databricks Feature Store, follow comparable patterns.
The different incarnations of a feature spec optimize for different things. The table below highlights the common axes of variation.
| Surface | Type system | Required vs optional | Default value | Built in domains | Multi tenant catalog |
|---|---|---|---|---|---|
| tf.io.* parsing dict | tf.int64, tf.float32, tf.string | Implicit from FixedLenFeature | default_value argument | No | No |
| TFMD Schema proto | Primitive types plus domains | Explicit via presence | No | Yes | No |
| Feast Field | Feast primitives, Array, Struct | Always required at type level | No | No | Yes, via registry |
| Tecton feature view | Python typed, validated at apply | Always required | No | No | Yes, via repo |
| Vertex AI registry | Inherited from BigQuery types | Inherited from column nullability | No | No | Yes, via Vertex |
| Keras FeatureSpace | Type aliases such as float_normalized | Optional via inputs | Implicit via default_value on lookup | Vocabulary from adapt | No |
No single surface dominates. Most production stacks combine several. A typical setup uses a TFMD schema in TFX, a Feast registry for shared features, and a Keras FeatureSpace inside the model for in graph preprocessing.
Teams that adopt other data formats face the same need to declare feature schemas, but the mechanics differ.
| Format | Schema mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Parquet plus Polars or DuckDB | Column dtypes are stored inside the Parquet file footer | No external spec required, fast random access, language agnostic | Less natural for ragged or sparse tensors |
| Pandas with explicit dtypes | A Python dtype dict applied on read | Quick iteration on small data | Brittle when data scale grows, no built in validation |
| Apache Arrow with Schema | Arrow IPC carries a typed schema in metadata | Excellent interop and zero copy reads | Sparse and ragged still require side channels |
| JSON or JSON Lines | Implicit, every consumer guesses | Easy to write by hand | Type drift is almost guaranteed, slow to parse |
| Petastorm Unischema | Python class describing fields and shapes | Works directly with PyTorch and Spark | Smaller ecosystem than TFRecord |
The choice often comes down to the rest of the stack. Pipelines built around TensorFlow and TFX lean on TFRecord and the parsing feature spec because the tooling is already there. Pipelines built around PyTorch and Spark tend to prefer Parquet plus a separate schema file, often written in YAML, JSON, or Avro. The conceptual role of the feature spec is identical in both worlds; only the syntax changes.
Feature specs become non negotiable as pipelines move from a single notebook to a production system serving live traffic.
Large scale recommendation systems for video, news, ads, and e commerce read billions of tf.train.Example records per training run. A typical record has tens to hundreds of features: scalar user metadata, embeddings of user history, multi valent categorical IDs for content tags, sparse interaction signals, and dense numerical signals such as recency or popularity. The feature spec encodes all of this in one Python dictionary. When a data engineer adds a new candidate feature, the spec gains an entry; when a feature is retired, the spec loses one; and tf.io.parse_example continues to produce the right tensors for the model.
Two tower retrieval models, common in recommender systems at YouTube, Pinterest, Twitter, Facebook, and Instagram, depend on this contract. The query tower and the candidate tower each parse a different subset of fields from the same record. A change to the spec on one side without a matching change on the other breaks training silently because parsed dictionaries simply lose a key. Teams therefore version the spec alongside the model code and treat schema changes as code reviews.
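One cheap guard against this failure is a check, run in CI or at job start, that every input a tower expects is still defined in the spec. The sketch below is plain Python with hypothetical feature names, and it reduces the spec to a set of keys for brevity.

```python
# Hypothetical feature names; the spec is reduced to its keys for brevity.
SPEC_KEYS = {"user_id", "user_history", "item_id", "item_tags", "label"}

TOWER_INPUTS = {
    "query": ["user_id", "user_history"],
    "candidate": ["item_id", "item_tags", "item_age_days"],
}


def missing_inputs(spec_keys, tower_inputs):
    """Map each tower to the inputs it expects that the spec does not define."""
    problems = {}
    for tower, names in tower_inputs.items():
        absent = sorted(set(names) - set(spec_keys))
        if absent:
            problems[tower] = absent
    return problems


problems = missing_inputs(SPEC_KEYS, TOWER_INPUTS)
```

Failing the job when the result is non-empty turns a silent missing key into an explicit error that names the affected tower.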
Search ranking systems and ad auction models use TFRecord and feature specs in much the same way. A single Example can carry the query, the document, the user context, the auction state, and a label. Sparse and ragged features dominate, because user histories and document features rarely have fixed shapes. The combination of VarLenFeature and RaggedFeature is what makes parsing tractable at scale, and tf.Transform is what keeps preprocessing identical between offline training and online serving.
Vision and audio pipelines lean on TFRecord because the underlying files (encoded JPEG, PNG, WAV, FLAC, or serialized arrays) compress well, batch easily, and stream over network filesystems with minimal overhead. The feature spec is small in this domain: typically a BytesList for the encoded image, an Int64List for the label, and maybe a FloatList for bounding boxes. The interesting code is in tf.io.decode_jpeg and the augmentation pipeline that follows. Still, the spec is what allows the same dataset to be consumed by a classification model, a detection model, and a segmentation model without rewriting the loader.
When the unit of work is a sequence rather than a snapshot, SequenceExample and FixedLenSequenceFeature carry the load. A user session might be encoded as a SequenceExample whose context features describe the user and whose feature lists hold the per step click IDs, dwell times, and item embeddings. Models such as transformer based session recommenders read these records directly and reshape them into batches of sequences. The feature spec records the per step shape and the optional / required status of every channel.
Most large scale TFRecord production happens inside Apache Beam pipelines, often running on Google Cloud Dataflow, Apache Flink, or Spark. Beam's tfrecordio module exposes ReadFromTFRecord and WriteToTFRecord transforms that work natively against gzipped or uncompressed shards.
import apache_beam as beam
from apache_beam.io.tfrecordio import ReadFromTFRecord, WriteToTFRecord
with beam.Pipeline() as p:
    (p
     | "Read" >> ReadFromTFRecord("gs://bucket/raw-*.tfrecord.gz")
     | "Parse" >> beam.Map(parse_record)
     | "Transform" >> beam.Map(add_features)
     | "Serialize" >> beam.Map(serialize_example)
     | "Write" >> WriteToTFRecord(
         "gs://bucket/derived/out",
         file_name_suffix=".tfrecord.gz",
         num_shards=50,
     ))
Where tf.Transform is involved, the Beam pipeline is built by tft_beam.AnalyzeAndTransformDataset, which takes a preprocessing_fn and a metadata schema and returns both transformed data and a transform_fn SavedModel. The schema travels with the data through the pipeline as a TFMD Schema proto and is converted into a parsing feature spec at each stage that needs to materialize tensors.
Beam can also write directly to BigQuery, Pub/Sub, or Avro, and many teams mix these formats with TFRecord. In that case the feature spec is one of several schemas the pipeline juggles: a BigQuery schema for offline analytics, an Avro schema for streaming, and a TF parsing spec for training. Tools such as the tensorflow_io library and the tfx_bsl records helpers translate between these representations so that a single source of truth (often the TFMD schema) drives them all.
A stable feature spec is one of the simplest defenses against training and serving skew. Once the spec is checked into source control, every component of the pipeline (data ingestion, validation, feature engineering, training, and serving) reads features through the same names, types, and shapes. If a producer changes a column from int64 to string, the validator catches it before the model retrains on the broken data. If a serving job sends an integer where a float is expected, parsing fails loudly instead of silently coercing the value.
Feature specs also document the contract between data producers and model owners. A new engineer joining a project can read the schema file and learn which features the model expects, whether they are required, what their valid ranges are, and which vocabulary lists drive the encoders. In larger organizations the feature spec stored in a feature store becomes the discovery surface for the whole MLOps platform, listing which teams own which features and which models depend on them.
Versioning matters. A common pattern is to keep the schema in a git repository, tag each release, and bind training jobs to the tag. Backfilling features then becomes a question of running the same preprocessing_fn against historical raw data and verifying that the output schema matches the expected version. Without that discipline, a year old model that retrains weekly slowly drifts from its original spec, and the team eventually loses the ability to reproduce a result from the previous quarter.
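One way to bind a training job to a spec version is to store a fingerprint of the spec next to the model and compare it at load time. The sketch below is plain Python; the dictionary representation of the spec and the spec_fingerprint helper are assumptions made for illustration, not part of any library named here.

```python
import hashlib
import json


def spec_fingerprint(spec):
    """Hash a canonical JSON form of the spec so that key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


v1 = {
    "age": {"dtype": "float32", "shape": []},
    "label": {"dtype": "int64", "shape": []},
}
# Same content, different key order: the fingerprint should not change.
v1_reordered = {
    "label": {"dtype": "int64", "shape": []},
    "age": {"dtype": "float32", "shape": []},
}
# A dtype change is a real schema change and should produce a new fingerprint.
v2 = {
    "age": {"dtype": "int64", "shape": []},
    "label": {"dtype": "int64", "shape": []},
}

same = spec_fingerprint(v1) == spec_fingerprint(v1_reordered)
changed = spec_fingerprint(v1) != spec_fingerprint(v2)
```

A git tag identifies which spec was intended; a fingerprint stored with the model verifies which spec was actually used.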
A few traps catch teams new to feature specs.
- Forgetting that Example collapses dtypes. Writing an int32 and parsing as tf.int64 is fine; writing a float64 and expecting float64 back is not.
- parse_single_example followed by batch is much slower than batch followed by parse_example.
- Confusing VarLenFeature with SparseFeature. The first takes the dtype only and produces a SparseTensor whose indices the parser fills in; the second requires the producer to write separate index features and is rarely worth the complexity.
- Changes to value_count or presence in the schema have no effect on a downstream pipeline that hard codes its parsing dict.
- Letting SchemaGen infer the schema and never reviewing it. Inferred schemas are conservative and tend to mark every feature as optional, which silences real anomalies.

Feature engineering is the broader practice of constructing features from raw data; feature specification is the narrower act of declaring what those features look like once they are constructed. Most production pipelines interleave the two. A tf.Transform preprocessing_fn reads raw fields described by an input feature spec, computes derived features, and emits a transformed feature spec that downstream training jobs consume. A Feast feature view declares both the source of a feature and the schema fields it produces. A Keras FeatureSpace describes how each raw column should be normalized, encoded, or hashed before the model sees it.
The deeper point is that a feature is not just a Python variable. It is a typed, shaped, named slot in a contract that spans multiple processes, languages, and time horizons. The feature spec is the written form of that contract. Treating it as a first class artifact, with versioning, code review, and automated validation, is the boundary between a research prototype and a production machine learning system.