Feature store
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,629 words
A feature store is a centralised data system for storing, serving, discovering, sharing, monitoring and reusing machine-learning features. It separates feature computation from model training and inference so that the same feature values can be reliably reused across models and across the training-versus-serving boundary. The term entered public usage in September 2017, when Uber published its Michelangelo blog post and described the feature store as the platform's most important component for scaling machine learning across the company.[^uber2017]
A feature, in this context, is any signal derived from raw data that a model consumes as input. Examples include a user's seven-day rolling order count, the average tip percentage in a city last hour, or the embedding of a product description. A feature store does not invent these signals. It standardises how they are defined, computed, materialised, served and audited so that the feature engineering pipeline becomes a piece of shared infrastructure rather than a per-model script.
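As a concrete illustration of one of the examples above, the following is a minimal sketch of a seven-day rolling order count for a single user. The function name, the half-open window convention and the sample timestamps are assumptions made for illustration, not any vendor's definition.

```python
from datetime import datetime, timedelta

def rolling_order_count(order_times, as_of, window=timedelta(days=7)):
    """Count orders in the half-open window (as_of - window, as_of]."""
    start = as_of - window
    return sum(1 for t in order_times if start < t <= as_of)

# Hypothetical order timestamps for one user.
orders = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 5, 9, 30),
    datetime(2024, 1, 8, 18, 45),
    datetime(2024, 1, 9, 8, 0),
]

# Orders after `as_of` are excluded, so the value only reflects what was
# known at that moment.
count = rolling_order_count(orders, as_of=datetime(2024, 1, 9, 0, 0))
print(count)
```

Passing the evaluation time explicitly, instead of reading the clock inside the function, is what lets the same definition be replayed over historical timestamps later.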
Most large machine-learning teams arrived at the same set of complaints between 2016 and 2020. Features computed in batch for training behaved differently in production. Two teams shipped slightly different versions of the same "days since last purchase" feature. Backfilling a year of historical features for a new label took weeks. Nobody could tell whether a feature had drifted because nobody could tell who owned it. A feature store is the consolidation of those complaints into a single piece of infrastructure.
The table below lists the recurring industry pain points that feature stores were built to solve.
| Problem | What goes wrong without a feature store | How a feature store helps |
|---|---|---|
| Training-serving skew | Features computed in a batch SQL job for training behave differently when reimplemented in a Java or Python service for online inference, which causes silent prediction failures. | The same feature definition produces both the offline training set and the online lookup, so the values seen by the model are identical in both environments.[^uber2017][^databricks2021] |
| Feature reuse | The same feature is recomputed by N teams for N models, wasting engineering time and producing slightly inconsistent values. | A registry lets practitioners search for an existing feature and consume it as is, rather than rebuilding it from raw tables.[^uber2017][^feast] |
| Backfilling and point-in-time correctness | Historical training labels need feature values as they existed at the label timestamp, not the latest values, otherwise the training set leaks future information. | The offline store supports time-travel joins (often called AS OF joins) so that each label row is paired with feature values that were valid at that label's time.[^huyen2023][^chronon2024] |
| Discovery | Data scientists rebuild features that already exist somewhere because nobody knows the inventory. | A feature catalog lists definitions, owners, freshness and downstream consumers.[^databricks2021][^aws2020] |
| Monitoring and drift detection | Distribution drift in input features is hard to spot when each model logs features in its own way. | A central store can attach drift, freshness and quality metrics to every feature group.[^aerospike2024][^applyingml] |
| Governance and compliance | Auditors cannot tell what data trained which model, and personally identifiable inputs are scattered. | Lineage from raw source through feature view to model registry gives a single audit trail.[^databricks2021] |
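The training-serving skew row can be made concrete with a small sketch. It assumes a toy in-memory dataset and hypothetical function names (`days_since_last_purchase`, `build_training_rows`, `serve_feature`). The point is only that the batch path and the online path call one shared definition rather than two reimplementations.

```python
from datetime import date

def days_since_last_purchase(purchase_dates, as_of):
    """Single feature definition shared by the batch and online paths.

    Returns None when the entity has no purchase on or before `as_of`.
    """
    past = [d for d in purchase_dates if d <= as_of]
    return (as_of - max(past)).days if past else None

# Hypothetical raw data: purchase dates keyed by user_id.
PURCHASES = {
    "u1": [date(2024, 3, 1), date(2024, 3, 20)],
    "u2": [date(2024, 2, 10)],
}

def build_training_rows(labels):
    """Offline path: one row per (user_id, label_date, label) tuple."""
    return [
        (uid, days_since_last_purchase(PURCHASES[uid], label_date), label)
        for uid, label_date, label in labels
    ]

def serve_feature(user_id, today):
    """Online path: the same function, evaluated at request time."""
    return days_since_last_purchase(PURCHASES[user_id], today)

training_rows = build_training_rows(
    [("u1", date(2024, 3, 10), 1), ("u2", date(2024, 3, 10), 0)]
)
online_value = serve_feature("u1", date(2024, 3, 25))
```

In a real platform the two paths usually run on different engines, so the shared artefact is a declarative definition rather than a Python function, but the contract is the same: one source of truth for the feature logic.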
The idea of pre-computing reusable model inputs and serving them from a fast store predates the term "feature store." Internal versions existed at search and ads companies long before 2017. The label, and the broader notion of a shared platform component for features, came out of Uber.
2017: Uber Michelangelo. On 5 September 2017, Jeremy Hermann (engineering manager) and Mike Del Balso (product manager) published "Meet Michelangelo: Uber's Machine Learning Platform." The post described an internal feature store with roughly 10,000 features in production at the time, supporting both daily batch features in HDFS and near-real-time features written to Cassandra by Samza streaming jobs. The team explicitly named feature reuse and consistent training-versus-serving values as the two central goals.[^uber2017]
2017 to 2018: Other tech companies built their own. Airbnb began work on a feature engineering framework originally called Zipline in 2017, which was later renamed Chronon and open-sourced in April 2024.[^chronon2024] Twitter, LinkedIn, Spotify and Netflix built internal feature platforms in roughly the same window. LinkedIn later open-sourced Feathr; DoorDash described an internal store called Fabricator; Stripe became an early Chronon adopter and co-maintainer.[^huyen2023][^chronon2024]
2018: Hopsworks. Logical Clocks released the first version of the Hopsworks Feature Store at the end of 2018, with an API based on FeatureGroup data frames.[^logicalclocks2020] Hopsworks remains one of the few feature stores explicitly designed around an open-format lakehouse, today using Apache Hudi and Iceberg.[^hopsworks]
2019: Feast goes public. Gojek, the Indonesian ride-hailing and payments company, partnered with Google Cloud to build Feast and open-sourced it in early 2019. Gojek had more than ten teams independently shipping models for pricing, matching, fraud and recommendations, and they kept rebuilding the same data plumbing. Feast was the consolidation of that work and became the first widely adopted open-source feature store.[^feastgcp2019][^gojek] It is licensed under Apache 2.0 and is now hosted by the Linux Foundation AI & Data project.[^feast]
2020: Tecton launches commercially. Mike Del Balso, Jeremy Hermann and Kevin Stumpf, the team behind Uber's Michelangelo feature store, founded Tecton in 2019 and emerged from stealth on 28 April 2020 with $25 million from Andreessen Horowitz and Sequoia.[^tectonlaunch] Tecton released its cloud-native enterprise feature store on 7 December 2020 alongside a $35 million Series B.[^tectonseriesb] In July 2022 it raised a $100 million Series C led by Kleiner Perkins, with strategic participation from Databricks and Snowflake Ventures, bringing total funding to $160 million.[^tectonseriesc] In August 2025 Databricks announced it was acquiring Tecton.[^databrickstecton]
December 2020: AWS SageMaker Feature Store. Amazon announced SageMaker Feature Store at re:Invent on 1 December 2020, with general availability on 8 December 2020. It is a managed dual-store service: an online store for low-latency lookups and an offline store backed by S3 for training and exploration with Athena, Spark or EMR.[^aws2020][^awsblog2020]
May 2021: Databricks Feature Store. Databricks announced its feature store on 27 May 2021, positioning it as the first feature store co-designed with Delta Lake and MLflow. Lineage and discovery flow through the same metadata that already tracks tables and notebooks; deployed MLflow models can call the store directly at inference time. The feature store has since migrated under Unity Catalog as Databricks Feature Engineering, with the legacy databricks-feature-store package deprecated in favour of databricks-feature-engineering 0.2 and later.[^databricks2021][^databricksrelease]
May 2021: Vertex AI Feature Store. Google launched Vertex AI in May 2021, with a feature store among its core components.[^huyen2023] In October 2023 Google announced a redesigned BigQuery-powered Vertex AI Feature Store that treats BigQuery itself as the offline store and adds a managed online serving layer with sub-2 millisecond latency at the 99th percentile, plus native vector embedding support.[^vertex2023]
2023 to 2024: Azure ML managed feature store. Microsoft announced the Azure Machine Learning managed feature store at Build in May 2023; the SDK reached version 1.0 in November 2023, marking general availability. Subsequent 2024 releases added custom feature sources, sovereign cloud regions and improved offline backfill materialisation.[^azure2023]
April 2024: Chronon (formerly Zipline) open source. Airbnb open-sourced Chronon, its production feature platform, in April 2024 alongside Stripe as an early adopter and co-maintainer. Chronon focuses on feature definitions that target both batch warehouses and streaming pipelines from a single declaration.[^chronon2024]
A standard feature store has five layers. They can be drawn as a diagram, but in prose the picture is:
get_historical_features) and from production services (get_online_features).[^feast][^aws2020]
Materialisation is the glue. Features computed in the offline pipeline get pushed ("materialised") into the online store on a schedule so that production lookups return the same values that training saw.[^azure2023]
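Materialisation can be sketched in a few lines. The example below is an assumption-laden toy, not the API of Feast or any other product: a plain dict stands in for the online key-value store, and `materialise` is a hypothetical helper that keeps only the newest value per entity.

```python
from datetime import datetime

def materialise(offline_rows, online_store):
    """Copy the newest value per entity from offline rows into the online store.

    offline_rows: iterable of (entity_id, event_time, value) tuples.
    online_store: dict mapping entity_id -> (event_time, value).
    """
    for entity_id, event_time, value in offline_rows:
        current = online_store.get(entity_id)
        # Only overwrite when the incoming row is newer than what is stored,
        # so re-running the job or replaying old rows is harmless.
        if current is None or event_time > current[0]:
            online_store[entity_id] = (event_time, value)
    return online_store

# Hypothetical offline feature rows.
offline = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u2", datetime(2024, 1, 1), 7),
    ("u1", datetime(2024, 1, 2), 4),
]
online = materialise(offline, {})

# A late-arriving older row must not clobber the fresher value.
materialise([("u1", datetime(2023, 12, 31), 1)], online)
```

Comparing event times before writing makes the job idempotent, which matters because scheduled materialisation runs are routinely retried or overlapped.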
The vocabulary is mostly stable across vendors, with small differences in naming.
| Term | Meaning |
|---|---|
| Entity | The unit of analysis a feature is keyed by, such as user_id, restaurant_id, session_id or a composite key. Entities are how features are joined to training labels and to inference requests.[^feast][^aws2020] |
| Feature view (Feast, Tecton) / feature group (Hopsworks, Databricks) | A logical grouping of features that share an entity, a source and a refresh schedule. The unit of definition that practitioners write and version in Git.[^feast][^hopsworks] |
| Feature service | A bundle of feature views consumed by a particular model. Lets the model declare its full input contract in one place.[^feast] |
| Point-in-time correctness | The guarantee that, for every training row, feature values are taken as they were at the row's label timestamp. Implemented via AS OF joins or time-travel joins. Without this, training sets leak future information into the past and inflate offline metrics.[^huyen2023][^chronon2024] |
| Materialisation | The process of computing features from the offline pipeline and writing them into the online store, on a schedule or on demand.[^azure2023] |
| Freshness and time-to-live (TTL) | How recently a feature was updated and how long it remains valid for online lookup. Strict freshness budgets matter for fraud and ads; relaxed TTLs are fine for slow-moving demographics.[^chronon2024][^applyingml] |
| On-demand transformation (also called request-time feature) | A feature computed at inference using inputs that only exist at request time, such as the user's query string or current geolocation. Tecton, Feast and Chronon all support these.[^feast][^chronon2024] |
| Backfill | Computing historical feature values for a new feature so it can be used to train a model on past labels. This is the operation that makes point-in-time correctness expensive.[^huyen2023] |
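The point-in-time and TTL rows in the table above can be illustrated together. The following is a minimal sketch of an AS OF join using only the standard library. The function name, the data layout and the decision to return `None` for expired values are assumptions for illustration; production systems perform this join inside a warehouse or Spark job.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def point_in_time_join(labels, feature_history, ttl=None):
    """Attach to each label the latest feature value at or before its timestamp.

    labels: list of (entity_id, label_time, label).
    feature_history: dict entity_id -> list of (event_time, value),
        sorted by event_time.
    ttl: optional timedelta; a value older than this at label time is
        treated as expired and replaced by None.
    """
    rows = []
    for entity_id, label_time, label in labels:
        history = feature_history.get(entity_id, [])
        times = [t for t, _ in history]
        # Index of the first event strictly after label_time; everything
        # before that index was already known when the label was observed.
        idx = bisect_right(times, label_time)
        value = None
        if idx > 0:
            event_time, candidate = history[idx - 1]
            if ttl is None or label_time - event_time <= ttl:
                value = candidate
        rows.append((entity_id, label_time, value, label))
    return rows

# Hypothetical feature history and labels.
history = {
    "u1": [
        (datetime(2024, 1, 1), 2),
        (datetime(2024, 1, 10), 5),
        (datetime(2024, 1, 20), 9),
    ]
}
labels = [
    ("u1", datetime(2024, 1, 15), 1),  # sees the Jan 10 value
    ("u1", datetime(2024, 1, 5), 0),   # sees the Jan 1 value
    ("u1", datetime(2024, 1, 19), 1),  # Jan 10 value is older than the TTL
    ("u2", datetime(2024, 1, 15), 0),  # no history for this entity
]
rows = point_in_time_join(labels, history, ttl=timedelta(days=7))
values = [value for _, _, value, _ in rows]
```

The Jan 20 value is never attached to any of these labels, because every label timestamp precedes it. A naive join on the latest value would have used it for all three `u1` rows, which is the leakage the table describes.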
The table below compares the systems most teams actually evaluate.
| System | Open source / commercial | Deployment model | Online store options | Streaming support | First released | Notable users |
|---|---|---|---|---|---|---|
| Feast | Open source (Apache 2.0) | Self-hosted; runs on any cloud, Kubernetes or local | Redis, DynamoDB, Bigtable, Cassandra, Postgres, Snowflake, SQLite, Dragonfly, ScyllaDB, Milvus, Qdrant and others | Push API for Kafka and Kinesis sources | 2019 (Gojek and Google) | Robinhood, NVIDIA, Discord, Cloudflare, Walmart[^feast][^feastgcp2019] |
| Tecton | Commercial (now part of Databricks) | Managed SaaS on AWS, GCP, Azure | Tecton-managed, with options for DynamoDB or Redis on the customer side | Native Spark Structured Streaming and Kafka pipelines, sub-100 ms freshness | Launched April 2020 | Atlassian, Plaid, Cash App, HelloFresh, Coinbase[^tectonlaunch][^tecton2025] |
| Hopsworks | Open core (AGPL with commercial editions) | Self-hosted, Kubernetes, or Hopsworks managed | RonDB (default), plus Redis and others | Native Flink and Spark Structured Streaming | Late 2018 | Logical Clocks customers in finance, telco and gaming[^logicalclocks2020][^hopsworks] |
| Databricks Feature Engineering / Feature Store | Commercial (part of Databricks) | Managed inside Databricks workspaces | Cosmos DB, DynamoDB, MySQL, plus Databricks Online Tables | Streaming via Spark Structured Streaming on the lakehouse | May 2021 | Databricks customers including ABN AMRO, Block, Comcast, Walgreens[^databricks2021][^databricksrelease] |
| AWS SageMaker Feature Store | Commercial (part of AWS) | Managed in AWS | Online store backed by managed key-value layer; offline in S3 | Ingestion via Kinesis, Kafka MSK and PutRecord API | 8 December 2020 | Customers including 3M, JPMorgan Chase, Vanguard[^aws2020][^awsblog2020] |
| Vertex AI Feature Store | Commercial (part of Google Cloud) | Managed in GCP | Optimised online serving and Bigtable serving; BigQuery as the offline source | Streaming through Dataflow into BigQuery sources | May 2021; BigQuery-powered redesign October 2023 | Wayfair, Cash App, Spotify, Shopify[^vertex2023] |
| Azure ML managed feature store | Commercial (part of Azure ML) | Managed in Azure ML workspaces | Online store via Azure Cache for Redis | Streaming sources via Spark Structured Streaming on Azure | Preview May 2023; SDK 1.0 in November 2023 | Microsoft enterprise customers[^azure2023] |
| Featureform | Open source (MPL 2.0); now part of Redis | Virtual store layered on existing data infrastructure | Whatever the customer brings: Redis, DynamoDB, Postgres, vector DBs | Yes, on the underlying engine | 2021 | Open-source community; acquired by Redis[^featureform] |
| Chronon (Airbnb) | Open source (Apache 2.0) | Self-hosted on Spark and Flink, with a serving service | Customer-provided key-value store | First-class batch and streaming with consistency-measurement pipelines | Open-sourced April 2024 (internal since 2017 as Zipline) | Airbnb, Stripe[^chronon2024] |
Feature stores are one of those technologies where the second-order failures show up only after a year of use. The recurring traps:
The shape of feature platforms is still moving. Five trends are visible across vendor blogs, conference talks and the MLOps Community feature-store summits.
Real-time features become standard. Fraud, ads bidding, recommendations and dynamic pricing all need features that reflect what happened seconds ago. Tecton, Chronon and Hopsworks all advertise sub-second feature freshness as a first-class capability rather than an advanced add-on.[^tecton2025][^chronon2024]
On-demand features for LLM and agent applications. When a feature depends on the user's prompt, their current cart contents, or the output of an upstream model, it cannot be precomputed. Feast added on-demand transformations as a beta in 2024; Tecton positions on-demand pipelines as a core part of its "AI feature platform."[^feast][^tecton2025]
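A request-time feature of this kind might look like the sketch below, which combines a precomputed value with the current cart contents. Everything here is hypothetical: the `ONLINE_FEATURES` dict stands in for an online store lookup, and the feature name and request shape are invented for illustration.

```python
# Hypothetical precomputed features, as an online store might return them.
ONLINE_FEATURES = {"u1": {"avg_order_value": 30.0}}

def cart_to_average_ratio(request, stored):
    """On-demand feature: current cart total relative to the user's average."""
    cart_total = sum(item["price"] * item["qty"] for item in request["cart"])
    avg = stored.get("avg_order_value")
    # Guard against missing or zero history; None signals "unavailable".
    if not avg:
        return None
    return cart_total / avg

def build_feature_vector(user_id, request):
    """Merge stored features with features computed from the request."""
    stored = ONLINE_FEATURES.get(user_id, {})
    return {
        "avg_order_value": stored.get("avg_order_value"),
        "cart_to_average_ratio": cart_to_average_ratio(request, stored),
    }

request = {"cart": [{"price": 20.0, "qty": 2}, {"price": 5.0, "qty": 4}]}
vector = build_feature_vector("u1", request)
unknown = build_feature_vector("u404", request)
```

For training, the same transformation would need to be replayed over logged requests so that the offline value matches what the model saw online.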
Tighter integration with the lakehouse. Databricks Feature Engineering lives inside Unity Catalog and reuses Delta Lake for storage and lineage. Vertex AI Feature Store stores offline features directly in BigQuery rather than copying them out. Hopsworks builds its offline store on Hudi and Iceberg.[^databricksrelease][^vertex2023][^hopsworks] The boundary between "the data platform" and "the feature store" is thinning.
Feature stores and vector databases converge. Feast added Milvus and Qdrant as supported online stores; Vertex AI Feature Store added native vector embedding storage and similarity search; Featureform was "built from the ground up with embeddings in mind" and is now part of Redis. As machine learning systems mix structured features with embeddings for retrieval-augmented generation, the two categories of infrastructure are merging.[^feast][^vertex2023][^featureform]
"Feature platform" terminology. Tecton, Chip Huyen and the Chronon team prefer "feature platform" over "feature store" because the storage layer is only one of four components (the others being feature definitions, computation engines and serving APIs). The term "feature store" is sticky enough that both labels are now used interchangeably in practice.[^huyen2023][^tecton2025]
Not every team needs a feature store, and the category has its skeptics. Three lines of critique recur:
Three publicly documented designs are useful starting points for teams building or evaluating their own platform.
The principal feature stores in production today are:
Feature stores are part of the broader MLOps tool stack and overlap with experiment tracking (MLflow, Weights & Biases), model registries, online serving infrastructure and observability platforms. Many teams adopt a feature store at the same time they consolidate on a model registry and a serving framework, treating all three as components of a single ML platform.