# Feature spec

> Source: https://aiwiki.ai/wiki/feature_spec
> Updated: 2026-07-16
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **feature spec** (short for feature specification) is a declarative description of the input features used by a [machine learning](/wiki/machine_learning) model: the name of each feature, its data type, its shape, whether it is required or has a default value, and how it should be encoded before reaching the model. The term has several overlapping meanings in practice. In [TensorFlow](/wiki/tensorflow), a feature spec is a Python dictionary that tells `tf.io.parse_example` how to decode serialized `tf.train.Example` records.[3] In [TFX](/wiki/tfx) (TensorFlow Extended) and [TensorFlow Data Validation](/wiki/tfdv), a feature spec is derived from a `Schema` proto and drives parsing, validation, and transformation.[19][21] In a broader sense, the phrase also covers the schema files used by [feature stores](/wiki/feature_store) such as [Feast](/wiki/feast), [Tecton](/wiki/tecton), and [Vertex AI Feature Store](/wiki/vertex_ai_feature_store), and the legacy `tf.feature_column` API that defined feature schemas for Estimator models.[25][26][27]

Across all of these meanings the unifying idea is the same. A feature spec sits between raw bytes on disk and the rectangular tensors a model consumes. It records the contract that data producers and model owners agree on, so that a value written today as a 32 bit float scalar is still a 32 bit float scalar when an inference job in a different process, on a different machine, weeks later, asks for it by name. When that contract holds, the rest of the pipeline can stay simple. When it does not, the failure mode is silent training and serving skew, which is one of the most common and most painful classes of bug in applied [machine learning](/wiki/machine_learning).

## Feature spec as a parsing dictionary

The narrowest definition of a feature spec is the dictionary passed to `tf.io.parse_example` or `tf.io.parse_single_example`.[3][4] Each key is a feature name, and each value is a parsing helper that describes the type and shape of that feature in a serialized `tf.train.Example` record.

A `tf.train.Example` is itself a small [Protocol Buffers](/wiki/protocol_buffers) message.[11] The top level message contains a `Features` map, where each key is a string and each value is a `Feature` message holding one of three list types: `BytesList`, `FloatList`, or `Int64List`.[11][13] The proto definition is roughly:

```proto
message Example { Features features = 1; }
message Features { map<string, Feature> feature = 1; }
message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}
```

Because every value bottoms out in one of three repeated lists, the `Example` format is type-erased: the bytes on disk do not carry shape information beyond "a sequence of strings or floats or ints". The feature spec restores that lost information at read time by telling the parser what shape and dtype to materialize from each list. This is why a feature spec is required to read a [TFRecord](/wiki/tfrecord) file in any useful way, even though no spec is strictly required to write one.[13]
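The type erasure can be illustrated without TensorFlow. The plain-Python sketch below shows one stored run of numbers being read back as two different logical shapes, which is the decision a feature spec makes for the parser. The `restore_shape` helper is hypothetical and is not part of any library.

```python
from math import prod

def restore_shape(flat, shape):
    """Reshape a flat list into nested lists of the declared shape (row-major)."""
    if prod(shape) != len(flat):
        raise ValueError(f"expected {prod(shape)} values, got {len(flat)}")
    if not shape:
        return flat[0]              # shape () holds exactly one value
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])          # number of values per outer row
    return [restore_shape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

# On disk the feature is just a run of six floats with no shape attached.
stored = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

as_vector = restore_shape(stored, (6,))     # read as a length 6 vector
as_matrix = restore_shape(stored, (2, 3))   # read as a 2 by 3 matrix
```

Both readings are valid for the same bytes; only the declared shape says which one the model expects.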

The four main parsing helpers are:

| Helper | Output tensor | Typical use |
|---|---|---|
| `tf.io.FixedLenFeature(shape, dtype, default_value=None)` | Dense `Tensor` | Scalars and fixed size vectors such as labels, ages, or 128 dimensional embeddings[6] |
| `tf.io.VarLenFeature(dtype)` | `SparseTensor` | Variable length lists such as tags, token IDs, or click sequences[7] |
| `tf.io.SparseFeature(index_key, value_key, dtype, size, already_sorted=False)` | `SparseTensor` | Pre indexed sparse data with explicit position keys[9] |
| `tf.io.RaggedFeature(dtype, value_key=None, partitions=(), row_splits_dtype=tf.int32, validate=False)` | `RaggedTensor` | Variable length lists where ragged dimensions are preferred over sparse ones[8] |

A typical parsing feature spec looks like this:

```python
import tensorflow as tf

feature_spec = {
    "image": tf.io.FixedLenFeature((), tf.string),
    "label": tf.io.FixedLenFeature((), tf.int64),
    "embedding": tf.io.FixedLenFeature((128,), tf.float32),
    "tags": tf.io.VarLenFeature(tf.string),
    "clicks": tf.io.RaggedFeature(tf.int64),
    "optional_age": tf.io.FixedLenFeature((), tf.float32, default_value=0.0),
}
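# serialized_examples: a 1-D string tensor of serialized tf.train.Example protos,
# for example one batch read from a TFRecordDataset (defined elsewhere).
parsed = tf.io.parse_example(serialized_examples, feature_spec)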
```

A `FixedLenFeature` without a `default_value` is treated as required, and parsing fails if any record is missing that field.[6] `VarLenFeature` produces a `SparseTensor` with indices of shape `[N, 2]` containing `[batch_row, position]` pairs, values of shape `[N]`, and a `dense_shape` of `[batch_size, max_length_in_batch]`, which makes it the standard way to encode lists of unknown length.[7] The TensorFlow documentation recommends `VarLenFeature` over `SparseFeature` in most cases because its semantics are easier to reason about.[9] `SparseFeature` is only useful when the indices of the sparse positions are themselves stored as a separate feature in the same `Example`, which is rare outside of legacy data.[9]
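The sparse layout described above can be reproduced by hand. The following is a minimal plain-Python sketch, not TensorFlow code; `to_sparse_components` is a hypothetical helper that builds the three components for a small batch of variable length tag lists.

```python
def to_sparse_components(batch):
    """Return (indices, values, dense_shape) for a batch of variable length lists."""
    indices, values = [], []
    for row, items in enumerate(batch):
        for position, item in enumerate(items):
            indices.append([row, position])   # one [batch_row, position] pair per value
            values.append(item)
    max_len = max((len(items) for items in batch), default=0)
    return indices, values, [len(batch), max_len]

tags_batch = [["red", "sale"], [], ["new"]]
indices, values, dense_shape = to_sparse_components(tags_batch)
```

Here `indices` is `[[0, 0], [0, 1], [2, 0]]`, `values` is `["red", "sale", "new"]`, and `dense_shape` is `[3, 2]`. The empty second record contributes no entries at all, which is why no padding value is needed.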

### How list values are mapped to tensors

The parser maps the three proto list types to TensorFlow dtypes through a fixed correspondence.[13] Understanding this mapping is critical when authoring code that produces `Example` records.

| Proto list | Compatible writer types | Parser dtype options |
|---|---|---|
| `Int64List` | `bool`, `enum`, `int32`, `uint32`, `int64`, `uint64` | `tf.int64` only |
| `FloatList` | `float` (float32), `double` (float64) | `tf.float32` only |
| `BytesList` | `string`, `bytes` | `tf.string` only |

A value written as a 32 bit integer is upcast to `int64` on write, and the parser always returns `tf.int64`.[13] A value written as a double is downcast to `float32` on write, and the parser always returns `tf.float32`. There is no way to round trip an `int32` or `float64` tensor through `tf.train.Example` without an explicit cast on the consumer side. This is a deliberate simplification of the format, but it has caught countless practitioners by surprise.
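The precision loss from the double to `float32` downcast is easy to observe with the standard library alone. This sketch uses `struct` to round a Python float (64 bit) to the nearest 32 bit float, which mirrors the width of the values a `FloatList` stores; `through_float_list` is a hypothetical name chosen for illustration.

```python
import struct

def through_float_list(x):
    """Round a 64 bit Python float to the nearest 32 bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

original = 0.1
stored = through_float_list(original)
error = abs(stored - original)

# Values that are exactly representable in 32 bits survive unchanged.
exact = through_float_list(0.5)
```

The value `0.1` comes back slightly different from what was written, while `0.5` is unchanged. A consumer that casts the parsed `tf.float32` tensor back to `float64` recovers the dtype but not the lost digits.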

Bytes features have a similar wrinkle. Images, serialized tensors, JSON blobs, and encoded protos all travel as `BytesList`. The feature spec marks them as `tf.string`, but the consumer is responsible for further decoding, typically with `tf.io.decode_jpeg`, `tf.io.decode_png`, `tf.io.parse_tensor`, or a custom proto parser inside a `tf.py_function`.[13]

### Worked example: writing and reading a TFRecord file

The simplest end to end demonstration of a feature spec is a round trip through a single TFRecord file.

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter("toy.tfrecord") as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "id": _int64_feature(i),
            "label": _int64_feature(i % 2),
            "embedding": _float_feature([0.1 * i] * 8),
            "text": _bytes_feature(f"item {i}".encode("utf-8")),
        }))
        writer.write(example.SerializeToString())

feature_spec = {
    "id": tf.io.FixedLenFeature((), tf.int64),
    "label": tf.io.FixedLenFeature((), tf.int64),
    "embedding": tf.io.FixedLenFeature((8,), tf.float32),
    "text": tf.io.FixedLenFeature((), tf.string),
}

ds = tf.data.TFRecordDataset("toy.tfrecord")
ds = ds.map(lambda x: tf.io.parse_single_example(x, feature_spec))
for row in ds.take(1):
    print({k: v.numpy() for k, v in row.items()})
```

The writer code knows nothing about feature specs. It just emits valid `Example` protos. The reader code is where the spec earns its keep: it converts the type erased lists back into typed, shaped tensors that downstream `tf.data` operations and Keras layers can consume.

For batched parsing, the recommended pattern is to call `dataset.batch(batch_size).map(parse_fn)` rather than `map(parse_single_example).batch(batch_size)`.[4] The batched form uses `tf.io.parse_example`, which parses a whole batch of records in one vectorized op instead of invoking the parser once per record, and can be up to an order of magnitude faster as a result.

### Required and optional features

With `FixedLenFeature`, a missing `default_value` argument means the feature is required.[6] Any input record that lacks the key triggers a parsing error at runtime. Setting a `default_value` makes the feature optional and supplies a fill value for absent records. The shape of the default must match the declared shape, including the rank.[6]

```python
spec = {
    "age": tf.io.FixedLenFeature((), tf.float32, default_value=0.0),
    "embedding": tf.io.FixedLenFeature((4,), tf.float32, default_value=[0.0]*4),
}
```

`VarLenFeature` cannot take a default value because the natural "missing" state of a variable length list is the empty list, which a sparse tensor already represents.[7] `RaggedFeature` is similar: a record with no values for a ragged feature simply contributes an empty row to the batch.[8]
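The required, optional, and variable length rules can be summarized in a few lines of plain Python. This is a sketch of the semantics only, under the assumption that records are dicts of flat lists; `parse_fixed`, `parse_var_len`, and the `(length, default)` spec format are hypothetical and do not mirror TensorFlow's implementation.

```python
_REQUIRED = object()  # sentinel meaning "no default was declared"

def parse_fixed(record, spec):
    """Apply required/default rules; spec maps name -> (length, default)."""
    out = {}
    for name, (length, default) in spec.items():
        if name in record:
            value = record[name]
        elif default is _REQUIRED:
            raise KeyError(f"required feature {name!r} is missing")
        else:
            value = default                     # optional: fill with the default
        if len(value) != length:
            raise ValueError(f"{name!r}: expected {length} values, got {len(value)}")
        out[name] = list(value)
    return out

def parse_var_len(record, name):
    """A missing variable length feature is simply the empty list."""
    return list(record.get(name, []))

spec = {
    "label": (1, _REQUIRED),
    "embedding": (4, [0.0] * 4),
}
parsed = parse_fixed({"label": [1]}, spec)
tags = parse_var_len({"label": [1]}, "tags")
```

The record above omits `embedding`, so the default fills it in; omitting `label` instead would raise, because nothing was declared to stand in for it.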

### Sequence features and FixedLenSequenceFeature

For data with explicit ordered sequences, such as a stream of click events per user, TensorFlow also provides `tf.train.SequenceExample` and the matching helpers `tf.io.FixedLenSequenceFeature` and `tf.io.parse_sequence_example`.[5][10][12] A `SequenceExample` has a `context` field (a normal `Features` map) plus a `feature_lists` field (a map of `FeatureList`, each a repeated `Feature`).[12] The parser takes two dictionaries instead of one and returns the context and sequence outputs as separate dictionaries; the batched `tf.io.parse_sequence_example` also returns a third dictionary holding the lengths of the dense sequence features, while `tf.io.parse_single_sequence_example` returns only the first two.[5]

```python
context_features = {
    "user_id": tf.io.FixedLenFeature((), tf.int64),
}
sequence_features = {
    "click_ids": tf.io.FixedLenSequenceFeature((), tf.int64),
    "dwell_ms": tf.io.FixedLenSequenceFeature((), tf.float32, allow_missing=True),
}
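# `serialized` is a batch (1-D string tensor) of serialized SequenceExample protos.
# The batched parser returns three dicts; the third holds per record sequence lengths.
context, sequences, lengths = tf.io.parse_sequence_example(
    serialized, context_features=context_features,
    sequence_features=sequence_features,
)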
```

`FixedLenSequenceFeature` declares the shape of each step (not the whole sequence).[10] Setting `allow_missing=True` lets a record omit the feature list entirely, in which case it is parsed as an empty sequence. When records of different lengths are batched, shorter sequences are padded to the longest one with `default_value`, so each step keeps the same fixed shape while the sequence length varies per record.[10]
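Padding a batch of variable length sequences to a common length, while keeping the true lengths alongside, can be sketched in plain Python. The `pad_sequences` helper below is hypothetical and only illustrates the resulting layout; it is not how TensorFlow implements the parser.

```python
def pad_sequences(sequences, default_value=0):
    """Pad per record step lists to the longest sequence in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [default_value] * (max_len - len(seq))
              for seq in sequences]
    return padded, lengths

# Three records: a full session, a short one, and one with no clicks at all.
click_ids = [[11, 12, 13], [21], []]
padded, lengths = pad_sequences(click_ids)
```

The padded result is rectangular (`[[11, 12, 13], [21, 0, 0], [0, 0, 0]]`), and `lengths` (`[3, 1, 0]`) is what lets a model mask out the padding, since a padded `0` is otherwise indistinguishable from a real ID of zero.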

### RaggedFeature partitions

`tf.io.RaggedFeature` returns a `RaggedTensor`, which is TensorFlow's native representation for jagged sequences of unequal length.[8][14] With zero partitions, `RaggedFeature` behaves like a one dimensional flat list. To express higher rank ragged data, callers pass a tuple of partition objects that describe how the flat values map into rows.[8]

| Partition class | Meaning |
|---|---|
| `tf.io.RaggedFeature.RowSplits(key)` | Cumulative split offsets between rows |
| `tf.io.RaggedFeature.RowLengths(key)` | Length of each row |
| `tf.io.RaggedFeature.RowStarts(key)` | Start index of each row inside the flat values array |
| `tf.io.RaggedFeature.RowLimits(key)` | End index of each row |
| `tf.io.RaggedFeature.ValueRowIds(key)` | Row index of each individual value |
| `tf.io.RaggedFeature.UniformRowLength(length)` | A fixed inner row length, identical for every outer row |

This is how a single `RaggedFeature` can produce a two or three dimensional `RaggedTensor` directly from a flat `Int64List` plus a few side channel features that describe the partition.[8] In practice, most teams use zero or one partition and fall back to `VarLenFeature` for more complex shapes, because the tooling around `SparseTensor` is still slightly more mature.
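The partition encodings in the table are different descriptions of the same row boundaries. The plain-Python sketch below derives each of them from one small ragged batch and then rebuilds the rows from the split offsets; the helper names are hypothetical and the code does not use TensorFlow.

```python
from itertools import accumulate

def partitions_from_rows(rows):
    """Describe the same ragged rows with each partition encoding."""
    values = [v for row in rows for v in row]
    row_lengths = [len(row) for row in rows]
    row_splits = [0] + list(accumulate(row_lengths))
    return {
        "values": values,
        "row_lengths": row_lengths,
        "row_splits": row_splits,              # cumulative offsets, one more than rows
        "row_starts": row_splits[:-1],
        "row_limits": row_splits[1:],
        "value_rowids": [i for i, n in enumerate(row_lengths) for _ in range(n)],
    }

def rows_from_splits(values, row_splits):
    """Rebuild the ragged rows from flat values plus cumulative split offsets."""
    return [values[a:b] for a, b in zip(row_splits, row_splits[1:])]

rows = [[3, 1, 4], [], [1, 5]]
parts = partitions_from_rows(rows)
rebuilt = rows_from_splits(parts["values"], parts["row_splits"])
```

For these rows the flat values are `[3, 1, 4, 1, 5]`, the row lengths are `[3, 0, 2]`, and the row splits are `[0, 3, 3, 5]`. Note that `value_rowids` (`[0, 0, 0, 2, 2]`) on its own cannot express trailing empty rows, which is one reason a writer might pick splits or lengths instead.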

## Feature spec in TFX and TensorFlow Data Validation

In TFX pipelines the canonical description of the data is a `Schema` proto, defined in the `tensorflow_metadata` library.[19] The schema records the type of each feature, its presence (required, optional), its valence (single value, fixed list, variable length), allowed value ranges, vocabulary domains, and other constraints.[19] Schemas can be authored by hand or generated automatically by the `SchemaGen` component, which infers a starting schema from statistics produced by `StatisticsGen`.[18]

The `Schema` proto is converted into a parsing feature spec through `tensorflow_transform.tf_metadata.schema_utils.schema_as_feature_spec`.[21] The inverse function `schema_from_feature_spec` builds a schema from a dictionary of `FixedLenFeature`, `VarLenFeature`, and `SparseFeature` objects.[21] This conversion is the bridge between the high level schema, used for validation and documentation, and the low level parsing dictionary, used at runtime to materialize tensors from `tf.train.Example` records.

A second proto, `TensorRepresentation`, was added to the schema in later versions of `tensorflow_metadata`.[19] It captures more fine grained intent for how a feature should be materialized as a tensor: dense, sparse, ragged, or variable length sparse.[19][20] Where a plain `Feature` entry leaves some ambiguity (a one valued feature could be a scalar or a length 1 vector), an explicit `TensorRepresentation` removes it. Modern TFX components consult `TensorRepresentation` first and fall back to the legacy mapping when it is absent.[20]

### Standard TFX flow

A full TFX flow usually looks like this:

1. `ExampleGen` ingests raw data and writes `tf.train.Example` files.
2. `StatisticsGen` computes column level statistics.
3. `SchemaGen` produces an initial schema, which engineers then curate and check into source control.[18]
4. `ExampleValidator`, which is built on TensorFlow Data Validation (TFDV), uses the schema to flag anomalies, missing values, and out of vocabulary categories.[17]
5. `Transform` uses the schema to build a parsing feature spec and to define `tf.Transform` preprocessing logic with `preprocessing_fn`.[16]
6. `Trainer` and `Pusher` consume the transformed features along with the same schema, which makes training and serving see the same view of the data.[16]

TFDV uses the schema for three skew checks: schema skew, where training and serving records do not conform to the same schema; feature skew, where feature values differ between training and serving for the same entity; and distribution skew, where the distribution of feature values in the training data differs significantly from the distribution seen in serving.[17] Shifts between consecutive spans of the same data over time are tracked separately as drift. Environments can be attached to features so that the same schema covers both training (where labels are present) and serving (where they are not).[17]

A minimal anomaly check looks like:

```python
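import tensorflow_data_validation as tfdv

# Infer a starting schema from statistics over the training data.
train_stats = tfdv.generate_statistics_from_tfrecord("data/train-*")
schema = tfdv.infer_schema(train_stats)

# Validate a different split against that schema ("data/eval-*" is a
# hypothetical path). Validating the statistics the schema was inferred
# from would normally report no anomalies.
eval_stats = tfdv.generate_statistics_from_tfrecord("data/eval-*")
anomalies = tfdv.validate_statistics(eval_stats, schema)
tfdv.display_anomalies(anomalies)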
```

Once the schema is locked in, the same call against the next day's statistics surfaces drift: features whose value count, mean, or category distribution has shifted beyond a tolerable threshold.[17] Teams typically wire this into a CI job that blocks the pipeline from advancing to training when anomalies fire.
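For categorical features, TFDV's guide describes drift and skew comparisons in terms of the L-infinity distance between two normalized value distributions. The sketch below computes that distance in plain Python to show what the threshold is compared against. It is an illustration under simplifying assumptions, not TFDV's implementation, and the `THRESHOLD` value and the example data are hypothetical.

```python
from collections import Counter

def normalized_counts(values):
    """Turn raw categorical values into a probability per category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def l_infinity_distance(baseline, current):
    """Largest absolute change in probability for any single category."""
    p, q = normalized_counts(baseline), normalized_counts(current)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())

yesterday = ["card"] * 8 + ["cash"] * 2
today = ["card"] * 5 + ["cash"] * 4 + ["crypto"] * 1

distance = l_infinity_distance(yesterday, today)
THRESHOLD = 0.1  # hypothetical tolerance; real thresholds are tuned per feature
drifted = distance > THRESHOLD
```

The share of `card` falls from 0.8 to 0.5, so the distance is about 0.3 and the feature is flagged. A category that appears only on one side, such as `crypto`, still counts, because its missing probability is treated as zero.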

### Anomaly categories

TFDV's anomaly taxonomy is broader than schema mismatches and worth knowing in detail.[17]

| Anomaly | Trigger |
|---|---|
| Schema mismatch | A feature absent from the schema appears in the data, or vice versa |
| Type mismatch | An integer feature suddenly contains string values |
| Domain violation | A categorical feature contains a token not in the declared vocabulary |
| Out of range | A numeric feature exceeds the declared `[min, max]` range |
| Missing feature | A required feature is absent from a fraction of examples above the allowed threshold |
| Skew | The distribution of a feature in serving traffic diverges from training |
| Drift | The distribution at time t diverges from the distribution at t minus one window |

Each anomaly has a severity (`WARNING` or `ERROR`) and a reason string.[17] The schema can be updated to widen a domain or relax a presence requirement, which is how engineers respond to a planned change in upstream data.
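Several of the triggers in the table reduce to simple per feature checks. The following toy validator is a minimal sketch of that idea in plain Python. The schema format (`type`, `range`, `domain`, `max_missing`) and the `find_anomalies` function are hypothetical and far simpler than TFDV, which works on aggregate statistics rather than individual rows.

```python
def find_anomalies(rows, schema):
    """Return (feature, anomaly) pairs for rows checked against a toy schema."""
    anomalies = []
    for name, rule in schema.items():
        present = [r[name] for r in rows if name in r]
        missing_fraction = 1 - len(present) / len(rows)
        if missing_fraction > rule.get("max_missing", 0.0):
            anomalies.append((name, "missing feature"))
        for value in present:
            if not isinstance(value, rule["type"]):
                anomalies.append((name, "type mismatch"))
            elif "domain" in rule and value not in rule["domain"]:
                anomalies.append((name, "domain violation"))
            elif "range" in rule and not rule["range"][0] <= value <= rule["range"][1]:
                anomalies.append((name, "out of range"))
    # Features that show up in the data but were never declared.
    for extra in sorted({k for r in rows for k in r} - set(schema)):
        anomalies.append((extra, "schema mismatch"))
    return anomalies

schema = {
    "age": {"type": int, "range": (0, 120)},
    "country": {"type": str, "domain": {"US", "DE", "JP"}},
}
rows = [
    {"age": 34, "country": "US"},
    {"age": 150, "country": "FR"},
    {"age": "n/a", "referrer": "ad"},
]
anomalies = find_anomalies(rows, schema)
```

Widening the `country` domain to include `"FR"`, or setting `max_missing` on `country`, makes the corresponding entries disappear, which mirrors how a schema is relaxed in response to a planned upstream change.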

## tf.Transform and preprocessing_fn

TensorFlow Transform is the bridge between a feature spec and the features a model actually consumes. Its core abstraction is the `preprocessing_fn`, a user defined function that takes a dictionary of input tensors keyed by raw feature name and returns a dictionary of output tensors keyed by transformed feature name.[15] The function is written in pure TensorFlow but can call special analyzers from the `tft` namespace that compute global statistics with [Apache Beam](/wiki/apache_beam) under the hood.[15]

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    return {
        "age_norm": tft.scale_to_z_score(inputs["age"]),
        "income_log": tf.math.log1p(inputs["income"]),
        "city_idx": tft.compute_and_apply_vocabulary(inputs["city"], top_k=10_000),
        "clicks_bow": tft.bag_of_words(inputs["clicks"], ngram_range=(1, 1), separator=" "),
    }
```

During the analyze phase, Beam scans every record and computes the mean, variance, vocabulary, and any other analyzer results.[15] During the transform phase, those constants are baked into a TensorFlow graph that maps raw `Example` records to ready to train tensors. The graph is saved as a `SavedModel` and can be applied at serving time, which is how `tf.Transform` eliminates training and serving skew for the preprocessing layer.[16] Without `tf.Transform`, the normalization would have to be reimplemented in the serving path, and any mismatch (a mean recomputed over different data, or a subtly different formula) shifts the model's inputs and silently degrades its predictions.
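The split between the two phases can be sketched in plain Python: one full pass computes constants, and a function that closes over those constants is then reused unchanged for training and serving. This is an illustration of the pattern only. The helper names are hypothetical, and the use of the population standard deviation is an assumption of the sketch rather than a statement about `tft.scale_to_z_score`.

```python
from statistics import fmean, pstdev

def analyze(training_values):
    """Analyze phase: one full pass over the training data."""
    return {"mean": fmean(training_values), "std": pstdev(training_values)}

def build_transform(constants):
    """Transform phase: freeze the analyzer results into a reusable function."""
    mean, std = constants["mean"], constants["std"]
    def transform(value):
        return (value - mean) / std
    return transform

train_ages = [20.0, 30.0, 40.0, 50.0]
scale_age = build_transform(analyze(train_ages))

# The same frozen function is applied offline and online.
train_scaled = [scale_age(a) for a in train_ages]
serving_scaled = scale_age(35.0)
```

Because `scale_age` carries the training mean (35.0) with it, a serving request for age 35 maps to exactly 0.0 no matter what the serving traffic looks like. Recomputing the mean over serving data would give a different answer for the same input.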

The input feature spec to `preprocessing_fn` comes from the schema. The output feature spec is implied by the keys and dtypes of the returned dictionary. Both are written to a metadata directory next to the `SavedModel` so that downstream consumers (training, serving, or further pipelines) know exactly what tensors to expect.[15]

## tf.feature_column and the legacy schema layer

For several years the dominant way to describe input features for a Keras or Estimator model in TensorFlow was the `tf.feature_column` API.[1] A `feature_column` object combined two ideas: it described the schema of a column (numeric, categorical with a vocabulary, hashed, bucketized) and it produced a dense tensor that could be fed into an `Estimator` head or a `DenseFeatures` layer.[1]

Common column constructors included:

| Constructor | Purpose |
|---|---|
| `tf.feature_column.numeric_column` | Treat a column as a real valued feature, with an optional normalizer function |
| `tf.feature_column.categorical_column_with_vocabulary_list` | Map strings to integer IDs using a fixed vocabulary |
| `tf.feature_column.categorical_column_with_hash_bucket` | Hash large categorical spaces into a fixed number of buckets |
| `tf.feature_column.bucketized_column` | Discretize a numeric column into bins |
| `tf.feature_column.embedding_column` | Look up a learned embedding for a categorical column |
| `tf.feature_column.indicator_column` | One hot or multi hot encode a categorical column |
| `tf.feature_column.crossed_column` | Cross two or more categorical features via hashing |

Starting in TensorFlow 2.13 the API was marked as not recommended for new code, and TensorFlow 2.16 made the deprecation more visible by emitting warnings on every column constructor and pointing users toward Keras preprocessing layers.[1] The recommended path is to replace each column with the equivalent layer.[1]

| Old `tf.feature_column.*` | New Keras layer |
|---|---|
| `numeric_column` | `tf.keras.layers.Normalization` |
| `categorical_column_with_identity` | `tf.keras.layers.CategoryEncoding` |
| `categorical_column_with_vocabulary_list` | `tf.keras.layers.StringLookup` or `IntegerLookup` |
| `categorical_column_with_hash_bucket` | `tf.keras.layers.Hashing` |
| `bucketized_column` | `tf.keras.layers.Discretization` |
| `embedding_column` | `tf.keras.layers.Embedding` after a lookup layer |
| `indicator_column` | `output_mode='one_hot'` or `'multi_hot'` on a lookup or encoding layer |
| `crossed_column` | `tf.keras.layers.HashedCrossing` (exposed under `tf.keras.layers.experimental.preprocessing` in older releases) |

The migration is not just cosmetic. Keras preprocessing layers can run inside a `tf.data` pipeline for asynchronous CPU preprocessing, can be saved with the model so the same logic runs at serving time, and produce sparse outputs natively when `sparse=True` is set.[1] Feature columns, by contrast, were tightly coupled to `Estimator` and tended to materialize dense tensors even for very wide categorical inputs.
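To make the hashing, crossing, and one hot rows of the table concrete, here is a plain-Python sketch of what those encodings compute. SHA-256 is used purely as a stand-in for a stable hash; it is not the hash function Keras uses, so bucket numbers will differ from those of the real layers. All helper names and the `"_X_"` separator are hypothetical.

```python
import hashlib

def hash_bucket(value, num_bins):
    """Map any string to a stable bucket in [0, num_bins)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def hashed_cross(values, num_bins):
    """Cross several categorical values by hashing their joined representation."""
    return hash_bucket("_X_".join(values), num_bins)

def one_hot(index, depth):
    """Encode an integer index as a one hot vector of the given depth."""
    return [1 if i == index else 0 for i in range(depth)]

city_bucket = hash_bucket("Lisbon", 16)
cross_bucket = hashed_cross(["Lisbon", "mobile"], 16)
encoded = one_hot(city_bucket, 16)
```

Hashing trades a vocabulary file for a fixed output width: no `adapt` step is needed, but distinct values can collide in the same bucket, and the number of bins controls how often that happens.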

## tf.keras.utils.FeatureSpace

Keras 3 ships a higher level utility called `tf.keras.utils.FeatureSpace` (also exposed as `keras.utils.FeatureSpace`).[2][23] It plays the same role that `tf.feature_column` used to play, but on top of Keras preprocessing layers. Each feature is declared with a short string that names its type, and `FeatureSpace` builds the corresponding preprocessing pipeline.[2]

```python
import keras

feature_space = keras.utils.FeatureSpace(
    features={
        "age": "float_normalized",
        "thal": "string_categorical",
        "sex": "integer_categorical",
    },
    crosses=[("sex", "thal")],
    output_mode="concat",
)
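# train_dataset: a tf.data.Dataset yielding dicts of raw features (labels removed).
feature_space.adapt(train_dataset)
# raw_inputs: a dict of raw feature tensors with the same keys as `features`.
encoded = feature_space(raw_inputs)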
```

Supported feature types include `float`, `float_normalized`, `float_rescaled`, `float_discretized`, `integer_categorical`, `string_categorical`, `integer_hashed`, and `string_hashed`.[2][23] Cross features are declared by passing tuples of feature names to the `crosses` argument, and `output_mode` controls whether `FeatureSpace` returns a single concatenated vector or a dictionary of encoded tensors.[2] Calling `.adapt()` on a representative dataset fits the underlying `Normalization`, `StringLookup`, `IntegerLookup`, and `Discretization` layers, after which the `FeatureSpace` can be saved alongside the model and replayed during inference.[22]

`FeatureSpace` also exposes the underlying layers through the `preprocessors` and `crossers` properties for inspection or partial reuse, and the `get_inputs()` and `get_encoded_features()` methods make it trivial to wire the preprocessing layer into a Keras Functional model.[2][22] The hashing dimension (`hashing_dim`), crossing dimension (`crossing_dim`), and number of discretization bins (`num_discretization_bins`) can all be tuned through constructor arguments.[2]

## Feature specs in feature stores

Feature stores extend the idea of a feature spec from a single training job to a shared catalog of features that many teams and models can reuse. Each store has its own way to declare feature schemas, but the building blocks are similar: entities, features, types, and views.[28] The schema sitting at the core of a [feature store](/wiki/feature_store) is precisely a feature spec for the organization, not just one job.

[Feast](/wiki/feast) defines features with `Field` objects inside a `FeatureView`.[25] Each `Field` carries a name and a Feast type, drawn from `feast.types`, which includes primitives such as `Int64`, `Float32`, `String`, and `Bytes`, along with complex types such as `Array`, `Map`, `Json`, and `Struct`.[25] A typical declaration looks like this:

```python
from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=hourly_stats_source,  # a Feast data source (e.g. a FileSource) defined elsewhere
)
```

If the schema is omitted, Feast infers it from the data source when `feast apply` runs.[25] The resulting feature views are versioned, registered in the Feast registry, and used to assemble offline training data and online serving requests with the same field names and types.[25]

[Tecton](/wiki/tecton) takes a similar declarative approach but combines schema definition with the transformation that produces the feature.[26] Tecton features are written in Python, SQL, or Spark, with explicit input and output types, and the platform materializes them to both an offline store (for training) and an online store (for serving) under the same name.[26] Tecton further distinguishes between Batch, Stream, and Realtime feature views, each of which has a different latency contract but shares the same schema surface.[26]

[Vertex AI Feature Store](/wiki/vertex_ai_feature_store) on Google Cloud uses three registry resources: `FeatureGroup`, which corresponds to a BigQuery table or view; `Feature`, which points to a column inside that table; and `FeatureView`, which is a logical collection materialized to an online store instance.[27] The schema is anchored in the underlying BigQuery columns, with the registry providing metadata and access control on top.[27] Other major feature stores, including [Amazon SageMaker Feature Store](/wiki/sagemaker_feature_store) and [Databricks](/wiki/databricks) Feature Store, follow comparable patterns.

### Comparison of feature spec styles

The different incarnations of a feature spec optimize for different things. The table below highlights the common axes of variation.

| Surface | Type system | Required vs optional | Default value | Built in domains | Multi tenant catalog |
|---|---|---|---|---|---|
| `tf.io.*` parsing dict[3] | `tf.int64`, `tf.float32`, `tf.string` | Implicit from `FixedLenFeature` | `default_value` argument | No | No |
| TFMD `Schema` proto[19] | Primitive types plus domains | Explicit via `presence` | No | Yes | No |
| Feast `Field`[25] | Feast primitives, `Array`, `Struct` | Always required at type level | No | No | Yes, via registry |
| Tecton feature view[26] | Python typed, validated at apply | Always required | No | No | Yes, via repo |
| Vertex AI registry[27] | Inherited from BigQuery types | Inherited from column nullability | No | No | Yes, via Vertex |
| Keras `FeatureSpace`[2] | Type aliases such as `float_normalized` | Optional via inputs | Implicit via `default_value` on lookup | Vocabulary from `adapt` | No |

No single surface dominates. Most production stacks combine several. A typical setup uses a TFMD schema in TFX, a Feast registry for shared features, and a Keras `FeatureSpace` inside the model for in graph preprocessing.

## Alternatives to TFRecord plus feature spec

Teams that adopt other data formats face the same need to declare feature schemas, but the mechanics differ.

| Format | Schema mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Parquet plus Polars or DuckDB | Column dtypes are stored inside the Parquet file footer | No external spec required, fast random access, language agnostic | Less natural for ragged or sparse tensors |
| Pandas with explicit dtypes | A Python `dtype` dict applied on read | Quick iteration on small data | Brittle when data scale grows, no built in validation |
| Apache Arrow with `Schema` | Arrow IPC carries a typed schema in metadata | Excellent interop and zero copy reads | Sparse and ragged still require side channels |
| JSON or JSON Lines | Implicit, every consumer guesses | Easy to write by hand | Type drift is almost guaranteed, slow to parse |
| Petastorm Unischema | Python class describing fields and shapes | Works directly with PyTorch and Spark | Smaller ecosystem than TFRecord |

The choice often comes down to the rest of the stack. Pipelines built around [TensorFlow](/wiki/tensorflow) and [TFX](/wiki/tfx) lean on TFRecord and the parsing feature spec because the tooling is already there. Pipelines built around [PyTorch](/wiki/pytorch) and Spark tend to prefer Parquet plus a separate schema file, often written in YAML, JSON, or Avro. The conceptual role of the feature spec is identical in both worlds; only the syntax changes.

## Production use cases

Feature specs become non negotiable as pipelines move from a single notebook to a production system serving live traffic.

### Recommendation and ranking

Large scale recommendation systems for video, news, ads, and e commerce read billions of `tf.train.Example` records per training run. A typical record has tens to hundreds of features: scalar user metadata, embeddings of user history, multi valent categorical IDs for content tags, sparse interaction signals, and dense numerical signals such as recency or popularity. The feature spec encodes all of this in one Python dictionary. When a data engineer adds a new candidate feature, the spec gains an entry; when a feature is retired, the spec loses one; and `tf.io.parse_example` continues to produce the right tensors for the model.

Two tower retrieval models, common in [recommender systems](/wiki/recommender_systems) at YouTube, Pinterest, Twitter, Facebook, and Instagram, depend on this contract. The query tower and the candidate tower each parse a different subset of fields from the same record. A change to the spec on one side without a matching change on the other breaks training silently because parsed dictionaries simply lose a key. Teams therefore version the spec alongside the model code and put schema changes through code review.
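One lightweight way to make such a contract checkable is to fingerprint the spec and to verify that every tower's inputs are still present. The sketch below assumes the spec is mirrored as plain JSON-serializable data. The spec format, feature names, and helper functions are all hypothetical and are not part of TensorFlow.

```python
import hashlib
import json

def spec_fingerprint(spec):
    """Stable short hash of a spec described as JSON-serializable data."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def missing_keys(tower_inputs, spec):
    """Features a tower expects that the spec no longer provides."""
    return sorted(set(tower_inputs) - set(spec))

spec_v1 = {
    "user_id": {"dtype": "int64", "shape": []},
    "history": {"dtype": "int64", "shape": [None]},
    "item_id": {"dtype": "int64", "shape": []},
}
# A later revision drops "history" without updating the query tower.
spec_v2 = {k: v for k, v in spec_v1.items() if k != "history"}

query_tower_inputs = ["user_id", "history"]
candidate_tower_inputs = ["item_id"]

changed = spec_fingerprint(spec_v1) != spec_fingerprint(spec_v2)
broken = missing_keys(query_tower_inputs, spec_v2)
```

A check like `missing_keys` can run in a unit test or a pre-training step, so that the dropped `history` feature fails a build instead of surfacing later as a degraded model.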

### Search and ads

Search ranking systems and ad auction models use TFRecord and feature specs in much the same way. A single `Example` can carry the query, the document, the user context, the auction state, and a label. Sparse and ragged features dominate, because user histories and document features rarely have fixed shapes. The combination of `VarLenFeature` and `RaggedFeature` is what makes parsing tractable at scale, and `tf.Transform` is what keeps preprocessing identical between offline training and online serving.

### Computer vision and audio

Vision and audio pipelines lean on TFRecord because the underlying files (encoded JPEG, PNG, WAV, FLAC, or serialized arrays) compress well, batch easily, and stream over network filesystems with minimal overhead. The feature spec is small in this domain: typically a `BytesList` for the encoded image, an `Int64List` for the label, and maybe a `FloatList` for bounding boxes. The interesting code is in `tf.io.decode_jpeg` and the augmentation pipeline that follows. Still, the spec is what allows the same dataset to be consumed by a classification model, a detection model, and a segmentation model without rewriting the loader.

### Time series and clickstreams

When the unit of work is a sequence rather than a snapshot, `SequenceExample` and `FixedLenSequenceFeature` carry the load.[12] A user session might be encoded as a `SequenceExample` whose context features describe the user and whose feature lists hold the per step click IDs, dwell times, and item embeddings. Models such as transformer based session recommenders read these records directly and reshape them into batches of sequences. The feature spec records the per step shape and the optional / required status of every channel.

## Apache Beam integration

Most large scale TFRecord production happens inside [Apache Beam](/wiki/apache_beam) pipelines, often running on Google Cloud Dataflow, Apache Flink, or Spark. Beam's `tfrecordio` module exposes `ReadFromTFRecord` and `WriteToTFRecord` transforms that work natively against gzipped or uncompressed shards.[24]

```python
import apache_beam as beam
from apache_beam.io.tfrecordio import ReadFromTFRecord, WriteToTFRecord

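# parse_record, add_features, and serialize_example are user-defined callables
# (not shown); serialize_example must return serialized Example bytes.
with beam.Pipeline() as p: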
    (p
        | "Read" >> ReadFromTFRecord("gs://bucket/raw-*.tfrecord.gz")
        | "Parse" >> beam.Map(parse_record)
        | "Transform" >> beam.Map(add_features)
        | "Serialize" >> beam.Map(serialize_example)
        | "Write" >> WriteToTFRecord(
            "gs://bucket/derived/out",
            file_name_suffix=".tfrecord.gz",
            num_shards=50,
        ))
```
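The pipeline above relies on three helpers that it does not define. One possible shape for them is sketched here, assuming a hypothetical raw schema with a scalar `user_id` and a variable length `clicks` list; real pipelines would substitute their own spec and derived features.

```python
import tensorflow as tf

# Hypothetical raw schema, used only for this sketch.
RAW_SPEC = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "clicks": tf.io.VarLenFeature(tf.int64),
}


def _int64(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


def parse_record(serialized):
    """Decode one serialized Example into plain Python values."""
    parsed = tf.io.parse_single_example(serialized, RAW_SPEC)
    return {
        "user_id": int(parsed["user_id"].numpy()),
        "clicks": tf.sparse.to_dense(parsed["clicks"]).numpy().tolist(),
    }


def add_features(row):
    """Return a new dict with a derived feature; the input is not mutated."""
    return {**row, "num_clicks": len(row["clicks"])}


def serialize_example(row):
    """Encode the enriched row back into a serialized Example."""
    features = tf.train.Features(feature={
        "user_id": _int64([row["user_id"]]),
        "clicks": _int64(row["clicks"]),
        "num_clicks": _int64([row["num_clicks"]]),
    })
    return tf.train.Example(features=features).SerializeToString()


# Round trip outside Beam to check the helpers in isolation.
raw = tf.train.Example(features=tf.train.Features(feature={
    "user_id": _int64([7]),
    "clicks": _int64([3, 5]),
})).SerializeToString()
derived = add_features(parse_record(raw))
reserialized = serialize_example(derived)
```

Keeping the helpers as pure functions makes them easy to unit test without starting a Beam runner.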

Where `tf.Transform` is involved, the Beam pipeline applies `tft_beam.AnalyzeAndTransformDataset`, which is constructed with a `preprocessing_fn`, is applied to the raw data paired with its metadata schema, and returns both the transformed data and a `transform_fn` that can be written out as a `SavedModel`.[15] The schema travels with the data through the pipeline as a TFMD `Schema` proto and is converted into a parsing feature spec at each stage that needs to materialize tensors.[21]

Beam can also write directly to BigQuery, Pub/Sub, or Avro, and many teams mix these formats with TFRecord. In that case the feature spec is one of several schemas the pipeline juggles: a BigQuery schema for offline analytics, an Avro schema for streaming, and a TF parsing spec for training. Tools such as the `tensorflow_io` library and the `tfx_bsl` records helpers translate between these representations so that a single source of truth (often the TFMD schema) drives them all.[20]
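The single source of truth idea can be illustrated without any of those libraries. The sketch below is plain Python and does not use the `tensorflow_io` or `tfx_bsl` APIs: the `FIELDS` dictionary, the type tables, and the three converter functions are all hypothetical, and the TF output is a description of the parsing spec rather than real `tf.io` objects.

```python
# Hypothetical neutral schema: name -> (logical type, required?).
FIELDS = {
    "user_id": ("int", True),
    "price": ("float", True),
    "country": ("string", False),
}

BIGQUERY_TYPES = {"int": "INT64", "float": "FLOAT64", "string": "STRING"}
AVRO_TYPES = {"int": "long", "float": "double", "string": "string"}
TF_DTYPES = {"int": "int64", "float": "float32", "string": "string"}


def to_bigquery(fields):
    """BigQuery style column list for offline analytics."""
    return [
        {"name": name,
         "type": BIGQUERY_TYPES[kind],
         "mode": "REQUIRED" if required else "NULLABLE"}
        for name, (kind, required) in fields.items()
    ]


def to_avro(fields, record_name="Row"):
    """Avro record schema; optional fields become a union with null."""
    return {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": name,
             "type": AVRO_TYPES[kind] if required else ["null", AVRO_TYPES[kind]]}
            for name, (kind, required) in fields.items()
        ],
    }


def to_tf_parsing_plan(fields):
    """Describe a TF parsing spec as data; a real pipeline would build
    tf.io.FixedLenFeature or tf.io.VarLenFeature objects from this."""
    return {
        name: {"kind": "FixedLenFeature" if required else "VarLenFeature",
               "dtype": TF_DTYPES[kind]}
        for name, (kind, required) in fields.items()
    }


bigquery_schema = to_bigquery(FIELDS)
avro_schema = to_avro(FIELDS)
tf_plan = to_tf_parsing_plan(FIELDS)
```

Because all three outputs are generated from `FIELDS`, a type change in one place propagates to every consumer instead of being edited by hand three times.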

## Why feature specs matter in production

A stable feature spec is one of the simplest defenses against training and serving skew. Once the spec is checked into source control, every component of the pipeline ([data ingestion](/wiki/data_ingestion), validation, [feature engineering](/wiki/feature_engineering), training, and serving) reads features through the same names, types, and shapes. If a producer changes a column from `int64` to `string`, the validator catches it before the model retrains on the broken data. If a serving job sends an integer where a float is expected, parsing fails loudly instead of silently coercing the value.
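The loud failure described above can be reproduced in a few lines. This sketch assumes a single hypothetical feature named `price` that the spec declares as a float, and then parses one record that honors the contract and one that does not.

```python
import tensorflow as tf

spec = {"price": tf.io.FixedLenFeature([], tf.float32)}


def make_record(feature):
    example = tf.train.Example(
        features=tf.train.Features(feature={"price": feature}))
    return example.SerializeToString()


# The producer honors the contract: price is written as a float.
good = make_record(
    tf.train.Feature(float_list=tf.train.FloatList(value=[9.5])))
# The producer breaks the contract: price is written as an integer.
bad = make_record(
    tf.train.Feature(int64_list=tf.train.Int64List(value=[9])))

parsed_price = float(tf.io.parse_single_example(good, spec)["price"])

try:
    tf.io.parse_single_example(bad, spec)
    mismatch_detected = False
except tf.errors.InvalidArgumentError:
    # The parser rejects the record instead of coercing 9 to 9.0.
    mismatch_detected = True
```

A serving job that hits this error can reject the request or page the owning team, which is preferable to scoring a silently coerced value.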

Feature specs also document the contract between data producers and model owners. A new engineer joining a project can read the schema file and learn which features the model expects, whether they are required, what their valid ranges are, and which vocabulary lists drive the encoders. In larger organizations the feature spec stored in a feature store becomes the discovery surface for the whole [MLOps](/wiki/mlops) platform, listing which teams own which features and which models depend on them.

Versioning matters. A common pattern is to keep the schema in a git repository, tag each release, and bind training jobs to the tag. Backfilling features then becomes a question of running the same `preprocessing_fn` against historical raw data and verifying that the output schema matches the expected version. Without that discipline, a year old model that retrains weekly slowly drifts from its original spec, and the team eventually loses the ability to reproduce a result from the previous quarter.
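One lightweight way to enforce that binding is to fingerprint the spec and compare it against the value recorded at release time. The sketch below is a hypothetical convention, not a TFX feature: the spec is represented as a plain dictionary, and `spec_fingerprint` and `check_spec` are illustrative names.

```python
import hashlib
import json


def spec_fingerprint(spec):
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_spec(spec, expected_fingerprint):
    """Raise if the spec differs from the version a job was bound to."""
    actual = spec_fingerprint(spec)
    if actual != expected_fingerprint:
        raise ValueError(
            f"feature spec drifted: expected {expected_fingerprint[:12]}, "
            f"got {actual[:12]}")


# Spec as recorded when the release was tagged (illustrative content).
released_spec = {
    "user_id": {"dtype": "int64", "shape": []},
    "price": {"dtype": "float32", "shape": []},
}
EXPECTED_FINGERPRINT = spec_fingerprint(released_spec)

# A backfill job recomputes the output spec and verifies it before writing.
backfill_spec = {
    "price": {"dtype": "float32", "shape": []},
    "user_id": {"dtype": "int64", "shape": []},
}
check_spec(backfill_spec, EXPECTED_FINGERPRINT)  # passes: same content
```

Storing the fingerprint next to the git tag gives training jobs a cheap assertion to run before they read any data.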

## Common pitfalls

A few traps catch teams new to feature specs.

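- **Forgetting that `Example` collapses dtypes.** An `Example` stores only `BytesList`, `FloatList` (32 bit floats), and `Int64List` values. Writing an `int32` and parsing as `tf.int64` is fine; writing a `float64` and expecting `float64` back is not, because the value is stored as a 32 bit float.[13]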
- **Mixing per record and batched parsing.** `parse_single_example` followed by `batch` is much slower than `batch` followed by `parse_example`.[4]
- **Confusing `VarLenFeature` with `SparseFeature`.** The first takes the dtype only and produces a `SparseTensor` whose indices the parser fills in; the second requires the producer to write separate index features and is rarely worth the complexity.[7][9]
- **Hand editing the schema proto without regenerating the parsing spec.** Changes to `value_count` or `presence` in the schema have no effect on a downstream pipeline that hard codes its parsing dict.
- **Letting `SchemaGen` infer the schema and never reviewing it.** Inferred schemas are conservative and tend to mark every feature as optional, which silences real anomalies.[18]
- **Embedding the schema in the model code.** When several services consume the same data, the schema belongs in a shared package or a generated artifact, not in any one model's repository.
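The first two pitfalls can be seen in a short sketch. It writes a Python float (a 64 bit double) into a `FloatList`, then reads it back through a `tf.data` pipeline that batches before parsing. The feature names `x` and `n` are placeholders.

```python
import tensorflow as tf

value = 0.1  # a Python float is a 64 bit double

record = tf.train.Example(features=tf.train.Features(feature={
    "x": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
    "n": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
})).SerializeToString()

spec = {
    "x": tf.io.FixedLenFeature([], tf.float32),
    "n": tf.io.FixedLenFeature([], tf.int64),
}

# Batch first, then parse the whole batch with one parse_example call.
dataset = (
    tf.data.Dataset.from_tensor_slices([record] * 4)
    .batch(2)
    .map(lambda batch: tf.io.parse_example(batch, spec))
)

first = next(iter(dataset))
x_back = float(first["x"][0])

# The stored value is the nearest 32 bit float, not the original double.
precision_lost = x_back != value
```

When full double precision matters, one common workaround is to store the raw bytes of the value in a `BytesList` and decode them after parsing, at the cost of a less readable spec.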

## Relationship to feature engineering

Feature engineering is the broader practice of constructing features from raw data; feature specification is the narrower act of declaring what those features look like once they are constructed. Most production pipelines interleave the two. A `tf.Transform` `preprocessing_fn` reads raw fields described by an input feature spec, computes derived features, and emits a transformed feature spec that downstream training jobs consume.[15] A Feast feature view declares both the source of a feature and the schema fields it produces.[25] A Keras `FeatureSpace` describes how each raw column should be normalized, encoded, or hashed before the model sees it.[2]

The deeper point is that a feature is not just a Python variable. It is a typed, shaped, named slot in a contract that spans multiple processes, languages, and time horizons. The feature spec is the written form of that contract. Treating it as a first class artifact, with versioning, code review, and automated validation, is the boundary between a research prototype and a production [machine learning](/wiki/machine_learning) system.

## See also

- [TFRecord](/wiki/tfrecord)
- [TensorFlow](/wiki/tensorflow)
- [TFX](/wiki/tfx)
- [TensorFlow Data Validation](/wiki/tfdv)
- [Protocol Buffers](/wiki/protocol_buffers)
- [Feature store](/wiki/feature_store)
- [Apache Beam](/wiki/apache_beam)
- [Feature engineering](/wiki/feature_engineering)
- [MLOps](/wiki/mlops)
- [Recommender systems](/wiki/recommender_systems)

## References

[1] TensorFlow, "Migrate tf.feature_columns to Keras preprocessing layers", https://www.tensorflow.org/guide/migrate/migrating_feature_columns
[2] TensorFlow, "tf.keras.utils.FeatureSpace", https://www.tensorflow.org/api_docs/python/tf/keras/utils/FeatureSpace
[3] TensorFlow, "tf.io.parse_example", https://www.tensorflow.org/api_docs/python/tf/io/parse_example
[4] TensorFlow, "tf.io.parse_single_example", https://www.tensorflow.org/api_docs/python/tf/io/parse_single_example
[5] TensorFlow, "tf.io.parse_sequence_example", https://www.tensorflow.org/api_docs/python/tf/io/parse_sequence_example
[6] TensorFlow, "tf.io.FixedLenFeature", https://www.tensorflow.org/api_docs/python/tf/io/FixedLenFeature
[7] TensorFlow, "tf.io.VarLenFeature", https://www.tensorflow.org/api_docs/python/tf/io/VarLenFeature
[8] TensorFlow, "tf.io.RaggedFeature", https://www.tensorflow.org/api_docs/python/tf/io/RaggedFeature
[9] TensorFlow, "tf.io.SparseFeature", https://www.tensorflow.org/api_docs/python/tf/io/SparseFeature
[10] TensorFlow, "tf.io.FixedLenSequenceFeature", https://www.tensorflow.org/api_docs/python/tf/io/FixedLenSequenceFeature
[11] TensorFlow, "tf.train.Example", https://www.tensorflow.org/api_docs/python/tf/train/Example
[12] TensorFlow, "tf.train.SequenceExample", https://www.tensorflow.org/api_docs/python/tf/train/SequenceExample
[13] TensorFlow, "TFRecord and tf.train.Example", https://www.tensorflow.org/tutorials/load_data/tfrecord
[14] TensorFlow, "Ragged tensors", https://www.tensorflow.org/guide/ragged_tensor
[15] TensorFlow, "Get started with TensorFlow Transform", https://www.tensorflow.org/tfx/transform/get_started
[16] TensorFlow, "The Transform TFX Pipeline Component", https://www.tensorflow.org/tfx/guide/transform
[17] TensorFlow, "Get started with TensorFlow Data Validation", https://www.tensorflow.org/tfx/data_validation/get_started
[18] TensorFlow, "The SchemaGen TFX Pipeline Component", https://www.tensorflow.org/tfx/guide/schemagen
[19] TensorFlow Metadata, "schema.proto", https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto
[20] TFX BSL, "Schema interpretation", https://github.com/tensorflow/tfx-bsl/blob/master/tfx_bsl/docs/schema_interpretation.md
[21] TensorFlow Transform, "schema_utils.py source", https://github.com/tensorflow/transform/blob/master/tensorflow_transform/tf_metadata/schema_utils.py
[22] Keras, "Structured data classification with FeatureSpace", https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/
[23] Keras, "Structured data preprocessing utilities", https://keras.io/api/utils/feature_space/
[24] Apache Beam, "apache_beam.io.tfrecordio module", https://beam.apache.org/releases/pydoc/current/apache_beam.io.tfrecordio.html
[25] Feast, "Feature view concepts", https://docs.feast.dev/getting-started/concepts/feature-view
[26] Tecton, "Defining features", https://docs.tecton.ai/docs/defining-features
[27] Google Cloud, "About Vertex AI Feature Store", https://docs.cloud.google.com/vertex-ai/docs/featurestore/latest/overview
[28] Tecton, "Choosing the Right Feature Store: Feast or Tecton?", https://resources.tecton.ai/hubfs/Choosing-Feature-Solution-Feast-or-Tecton.pdf