TensorFlow Decision Forests (TF-DF)

Updated Jun 28, 2026

TensorFlow Decision Forests (often abbreviated TF-DF) is an open-source Google library for training, serving, and interpreting decision-forest models such as Random Forest and Gradient Boosted Decision Trees using the TensorFlow and Keras API. Google's TensorFlow team open-sourced it on May 27, 2021, presenting it at Google I/O 2021, and it is built on top of the Yggdrasil Decision Forests (YDF) C++ library that has run in production at Google since 2018 ^[1]^[3]^[4]. The official documentation describes TF-DF as "a library to train, run and interpret decision forest models (e.g., Random Forests, Gradient Boosted Trees) in TensorFlow," supporting "classification, regression, ranking and uplifting" ^[2]^[6].

TF-DF exposes Random Forest, Gradient Boosted Decision Trees, CART, Distributed Gradient Boosted Trees, and Isolation Forest models through a Keras-compatible API, so that classification, regression, ranking, and uplifting tasks can be trained, evaluated, and served using the same workflow as a deep neural network ^[1]^[2]. The launch post defines it as "a collection of production-ready state-of-the-art algorithms for training, serving and interpreting decision forest models (including random forests and gradient boosted trees)" ^[1].

TF-DF was open-sourced by the TensorFlow team on May 27, 2021, in a launch post written by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer, and shipped for TensorFlow 2.5.0 ^[1]. It was the first time Google made the internal Yggdrasil engine available outside the company. Before TF-DF, TensorFlow users who wanted competitive tree models on tabular data typically had to leave the framework and use XGBoost, LightGBM, CatBoost, or scikit-learn. TF-DF closed that gap and also made it straightforward to combine tree models and neural networks inside a single TensorFlow pipeline ^[1]^[5].

What is TensorFlow Decision Forests?

TensorFlow Decision Forests is Google's open-source library for building decision-forest models with the Keras API. It gives TensorFlow users first-class, production-ready implementations of tree ensembles (Random Forests and Gradient Boosted Decision Trees, plus CART and Isolation Forest) that train, evaluate, save, and serve exactly like a Keras neural network ^[1]^[2]. The library is most useful on tabular data, the dominant format in industry, where tree ensembles remain competitive with or better than deep neural networks. As the documentation puts it, "Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often favorable to) neural networks, especially when you're working with tabular data" ^[2]. TF-DF is released under the Apache License 2.0 and distributed on PyPI as tensorflow-decision-forests ^[6]^[9].

When was TF-DF released, and by whom?

Google's TensorFlow team open-sourced TF-DF on May 27, 2021, releasing it for TensorFlow 2.5.0 and presenting it at Google I/O 2021 ^[1]. The launch post was written by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer ^[1]. It was the first public release of Google's internal Yggdrasil engine.

The Yggdrasil project began inside Google in 2017 as an experimental research library. By 2018 it had moved into production and was being used to serve predictions tens of millions of times per second across various Google products ^[4]^[6]. The team chose the name Yggdrasil after the world tree of Norse mythology, which is fitting for a library whose central data structure is the decision tree.

For several years Yggdrasil remained an internal C++ library with command-line and Google Sheets bindings. Around 2019 to 2020, the team began wrapping the C++ core with a Python interface that integrated with TensorFlow and Keras. That wrapper became TensorFlow Decision Forests, which was presented at Google I/O 2021 and open-sourced with the launch post of May 27, 2021 ^[1]^[6]. The launch was accompanied by Colab tutorials, Keras-style example code, and a Google I/O talk by Mathieu Guillame-Bert.

Key milestones in the library's history are listed below.

Year	Event
2017	Yggdrasil started as an experimental C++ research library at Google
2018	Yggdrasil moved into production at Google; serves tens of millions of predictions per second
2021 (May 27)	TF-DF 0.1.x released as open source by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer ^[1]
2022 (December)	arXiv preprint of the Yggdrasil paper posted ^[3]
2022 (December)	Simple ML for Sheets launched, putting Yggdrasil inside Google Sheets ^[7]
2023 (February)	TF-DF 1.0 declared production ready in a TensorFlow blog post ^[2]
2023 (August)	Yggdrasil paper published at KDD 2023 in Long Beach ^[3]
2023	Standalone YDF Python API released as a leaner alternative to TF-DF ^[4]
2024	Releases 1.9 through 1.11 add LambdaMART NDCG with configurable truncation, faster model loading, NA support in the fast inference engine, and TensorFlow 2.16 to 2.18 compatibility ^[8]
2025 (March 13)	TF-DF 1.12 ships with Python 3.12 support, sparse oblique split tuning, and TensorFlow 2.19 compatibility ^[8]^[9]

The library is maintained at Google by a core team that includes Mathieu Guillame-Bert (lead), Jan Pfeifer, Richard Stotz, Sebastian Bruch, and Arvind Srinivasan ^[4]^[6].

Why does TF-DF matter?

Tabular data is still the dominant format in industry. Click prediction, fraud detection, churn modeling, credit scoring, ranking, and many recommendation problems all run over tables of numbers and categorical labels rather than over images, audio, or natural language. On those problems tree ensembles continue to outperform deep neural networks in most cases, which is why XGBoost, LightGBM, and CatBoost are the default tools on Kaggle and in many production stacks ^[10]^[11].

Before TF-DF, TensorFlow had no first-class support for those models. A team that wanted to combine a decision tree ensemble with a TensorFlow neural network was forced to maintain two separate pipelines, two model formats, and two serving stacks. TF-DF removed that friction in three concrete ways.

It put a Random Forest, Gradient Boosted Trees, or CART model behind the same model.fit(), model.predict(), model.evaluate(), and model.save() calls that Keras users already knew.
It produced ordinary SavedModel artifacts that work with TensorFlow Serving, TensorFlow Lite, TensorFlow.js, and Vertex AI without any extra conversion step.
It allowed hybrid pipelines where a neural network produces dense feature vectors that feed a tree, or where tree predictions become inputs to a downstream neural model.

The production lineage of the underlying engine matters too. Yggdrasil already powered ranking, classification, and regression tasks across multiple Google products before the open-source release, so the TF-DF launch did not ship experimental code ^[1]^[3].

What models does TF-DF support?

TF-DF exposes five training algorithms, four of them through the Keras Model subclasses named in the table below.

Algorithm	Keras class	Original reference	Typical use
Random Forest	`tfdf.keras.RandomForestModel`	Breiman, 2001	Strong baseline, low tuning effort, robust to noise
Gradient Boosted Trees	`tfdf.keras.GradientBoostedTreesModel`	Friedman, 2001	Highest accuracy on most tabular benchmarks, the workhorse
CART	`tfdf.keras.CartModel`	Breiman, Friedman, Olshen, and Stone, 1984	Single decision tree baseline, interpretable models
Distributed Gradient Boosted Trees	`tfdf.keras.DistributedGradientBoostedTreesModel`	Yggdrasil paper, 2023	GBT training over hundreds of millions to a few billion rows across many machines ^[3]^[12]
Isolation Forest	exposed via Yggdrasil; reachable from TF-DF as of recent releases	Liu, Ting, and Zhou, 2008	Unsupervised anomaly detection ^[4]

In addition to these top-level model types, the library implements LambdaMART (under both LAMBDA_MART_NDCG and the older LAMBDA_MART_NDCG5 names), DART (Dropout Additive Regression Trees), Extra Trees, oblique splits, sparse oblique splits, greedy global growth, one-side sampling, categorical-set learning, random categorical learning, out-of-bag evaluation, and several variable importance measures ^[1]^[8]. Across all of these models TF-DF supports four task types: classification, regression, ranking, and uplifting ^[2].
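
To make the idea of NDCG truncation concrete, the following is a minimal pure-Python sketch of NDCG computed over the top k ranked items. It illustrates the metric only and is not TF-DF's implementation; the function names `dcg_at_k` and `ndcg_at_k` and the exponential gain `2^rel - 1` are assumptions chosen for this example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so the discount is log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of documents, listed in the order a model ranked them.
ranked = [3, 2, 0, 1, 2]
ndcg5 = ndcg_at_k(ranked, 5)  # penalized: a relevant document sits at rank 5
ndcg2 = ndcg_at_k(ranked, 2)  # the top two positions are already ideal
```

Changing the truncation changes what the metric rewards: a small k only cares about the very top of the list, while a larger k also penalizes mistakes further down.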

What are TF-DF's key features?

TF-DF inherits the full Yggdrasil feature set and adds Keras conveniences on top.

Feature	Notes
Keras model API	`tfdf.keras.RandomForestModel`, `GradientBoostedTreesModel`, `CartModel`, `DistributedGradientBoostedTreesModel` all behave like Keras models, with `fit`, `evaluate`, `predict`, and `save`
Native categorical support	Both `CATEGORICAL_INTEGER` and `CATEGORICAL_STRING` features train without one-hot encoding, hashing, or embeddings ^[5]
Native missing-value handling	NaNs are split on directly without imputation ^[5]
Numerical feature handling	Splits are threshold-based, so no scaling, normalization, or whitening is required ^[5]
Multi-task learning	A single GBT can predict more than one label, including mixed regression and classification targets
Sparse features	Sparse representations are supported, including sparse oblique splits added in 1.12 ^[9]
Ranking	LambdaMART-style ranking with configurable NDCG truncation as of 1.11 ^[8]
Uplifting	Uplift modeling is supported through dedicated tasks and an uplift inspector added in 1.8 ^[8]
Hyperparameter tuning and templates	A tuner created with `use_predefined_hps=True` searches a predefined hyperparameter space, and curated hyperparameter templates are also available ^[2]
Distributed training	Up to a few billion examples across tens to hundreds of machines via TensorFlow Distributed ^[12]
Deterministic training	The same dataset and hyperparameters reproduce the exact same model ^[5]
Single-epoch training	The full dataset is loaded once; no batch-size or epoch tuning ^[5]
SavedModel export	Models save as standard TensorFlow SavedModel, ready for TensorFlow Serving, Vertex AI, TF Lite, and TF.js ^[2]
YDF interoperability	TF-DF and YDF models are mutually loadable, so a model trained in Python can be served from C++, Go, JavaScript, or the command line ^[4]
License	Apache License 2.0 ^[6]
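
The claim that threshold-based splits need no scaling can be checked with a small framework-free sketch. The helper `best_split_partition` below is hypothetical and far simpler than a real splitter: it tries every threshold on one numerical feature and returns the set of examples that land on the left side of the best cut. Because only the ordering of values matters, rescaling or log-transforming the feature should leave that partition unchanged.

```python
import math

def best_split_partition(values, labels):
    """Return the left-side index set of the threshold split with the lowest squared error."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_err, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # Skip cuts between equal values: no threshold can separate them.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        err = 0.0
        for side in (left, right):
            mean = sum(labels[i] for i in side) / len(side)
            err += sum((labels[i] - mean) ** 2 for i in side)
        if err < best_err:
            best_err, best_left = err, frozenset(left)
    return best_left

income = [12.0, 18.0, 25.0, 40.0, 95.0, 300.0]
label = [0, 0, 0, 1, 1, 1]

raw = best_split_partition(income, label)
scaled = best_split_partition([v * 1000 for v in income], label)
logged = best_split_partition([math.log(v) for v in income], label)
```

All three calls select the same three examples for the left branch, which is why monotonic transformations such as normalization add nothing for this kind of split.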

A bare-bones training example shows how minimal the API is.

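import pandas as pd
import tensorflow_decision_forests as tfdf

# Load the training table; "income" is the label column.
train_df = pd.read_csv("train.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="income")

# Train with default hyperparameters; no feature columns or preprocessing.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Metrics only matter for a later evaluate() call, so compile() can follow fit().
model.compile(metrics=["accuracy"])
model.save("my_gbt_model")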

No feature columns, no normalization, no batch size, no epoch count. The model figures out feature semantics from the dataset spec and trains in a single pass. As the announcement notes, with TF-DF beginners can "more easily develop and explain decision forest models," with "no need to explicitly list or pre-process input features" and "no need to specify an architecture" ^[1].

How does TF-DF relate to Yggdrasil Decision Forests?

TF-DF is a thin Python and TensorFlow wrapper. The training loops, splitter logic, statistics tables, model serialization format, and inference engines all live in Yggdrasil's C++ core ^[3]^[4]. In the words of the project, TF-DF is "powered by Yggdrasil Decision Forest (YDF)," and "TF-DF models are compatible with YDF models, and vice versa" ^[4]^[6]. The C++ core is portable across Linux, macOS, and Windows, and it does not depend on TensorFlow at all, which is why Yggdrasil can also be called from a CLI, from Go, from JavaScript via WebAssembly, and from Google Sheets ^[3]^[6]^[7].

The Yggdrasil paper, published at KDD 2023, describes four design principles: simplicity of use, safety of use, modularity with high-level abstractions, and integration with other ML libraries ^[3]. The same paper benchmarks Yggdrasil against XGBoost and LightGBM and reports competitive accuracy with strong inference performance, often in the sub-microsecond range per example per CPU core ^[3].

Key internal components include:

A unified dataset specification (DataSpec) that records feature names, types, vocabulary, and statistics so that the same model can be trained, served, and inspected without re-deriving the schema.
A tree-builder library that supports both axis-aligned and oblique splits, multiple impurity criteria, and several growth strategies (depth-wise, best-first, and global).
A serving library with multiple inference engines, including a generic engine, optimized engines for axis-aligned conditions, and quantized engines.
A hyperparameter tuner that wraps the training loops and exposes both random and exhaustive search.
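
The difference between axis-aligned and oblique splits can be sketched in a few lines of plain Python. The helper names `axis_aligned` and `oblique` are hypothetical and do not reflect how Yggdrasil represents conditions internally; the sketch only shows that an oblique condition tests a weighted sum of features, while an axis-aligned condition tests one feature at a time.

```python
def axis_aligned(feature_index, threshold):
    """Condition that tests a single feature: x[feature_index] >= threshold."""
    return lambda x: x[feature_index] >= threshold

def oblique(weights, threshold):
    """Condition that tests a weighted sum of features: sum(w_i * x_i) >= threshold."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) >= threshold

# Points labeled by whether x0 + x1 >= 1, which is a diagonal boundary.
points = [(0.2, 0.3), (0.9, 0.3), (0.4, 0.7), (0.1, 0.8), (0.6, 0.6), (0.3, 0.3)]
labels = [x0 + x1 >= 1 for x0, x1 in points]

def accuracy(condition):
    """Fraction of points for which the condition agrees with the label."""
    return sum(condition(p) == y for p, y in zip(points, labels)) / len(points)

acc_axis = accuracy(axis_aligned(0, 0.5))         # one cut on x0 only
acc_oblique = accuracy(oblique((1.0, 1.0), 1.0))  # one cut along the diagonal
```

On this toy data a single oblique condition separates the classes, while the axis-aligned cut on x0 misclassifies one point and a tree would need further splits to recover the same boundary.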

Because the C++ core is open source under Apache 2.0, the same engine can be embedded in non-Python and non-TensorFlow stacks, which is one reason Google built the standalone YDF Python package in 2023 ^[4].

How does TF-DF compare with XGBoost, LightGBM, and CatBoost?

TF-DF is one of several mature tree libraries. The table below compares the headline differences.

Library	Maintainer	First release	Core language	Framework integration	Distinctive feature
TensorFlow Decision Forests	Google	May 2021	C++ (via Yggdrasil), Python wrapper	Native Keras and TensorFlow	Tree models that share a SavedModel format with neural networks
XGBoost	DMLC community (Tianqi Chen et al.)	2014	C++	Python, R, Java, Scala, Spark	Industry default for gradient boosting; mature scikit-learn API
LightGBM	Microsoft	2016 (paper 2017)	C++	Python, R, C, command line	Histogram-based GBT with leaf-wise growth, fast on large data
CatBoost	Yandex	2017	C++	Python, R, command line	Symmetric trees with native ordered target statistics for categoricals
scikit-learn ensembles	Community	2007 onward	Cython, Python	Native to scikit-learn	The textbook reference implementations of Random Forest and GBM
H2O.ai	H2O.ai	2014	Java	Python, R, REST	Distributed JVM-based GBM and Random Forest with AutoML

A few recurring observations from independent benchmarks and from the Yggdrasil paper itself:

On accuracy, TF-DF and Yggdrasil are competitive with XGBoost on most tabular tasks, and the gap to LightGBM and CatBoost is small in either direction depending on dataset and tuning ^[3]^[13].
On training speed, LightGBM is often the fastest of the four for medium-sized datasets, with CatBoost close behind. Yggdrasil and XGBoost trade places depending on hyperparameters ^[13].
On inference speed, Yggdrasil is strong because the C++ engines were designed to serve Google production traffic ^[3].
On categorical handling, TF-DF, CatBoost, and LightGBM (since 2.0) all handle categoricals natively, while XGBoost still typically requires preprocessing such as one-hot or ordinal encoding ^[10]^[11].
On framework integration, TF-DF is the only library in the group that produces a TensorFlow SavedModel out of the box.

There is one important caveat. TF-DF training does not run on GPUs or TPUs. All training and inference happen on CPU, sometimes with SIMD acceleration. For very large datasets, GPU-accelerated XGBoost or NVIDIA's Forest Inference Library can be faster on appropriate hardware, although the Yggdrasil team argues that for typical inference workloads the GPU does not help because batches are small ^[13]^[14].

What is TF-DF used for?

The practical applications of TF-DF cluster around problems where tabular data dominates and where the rest of the system already runs on TensorFlow.

Tabular classification and regression on structured data such as customer records, sensor logs, financial transactions, and survey responses.
Ranking systems for search and recommendations, where the LambdaMART loss function provides directly optimized NDCG.
Anomaly detection through Isolation Forest for fraud, intrusion detection, or quality monitoring.
Production ML pipelines that already use TensorFlow Serving, Vertex AI, or TFX and need a tree model that fits the same deployment story.
Hybrid models that mix neural networks and trees, for example using image embeddings or text embeddings as inputs to a Gradient Boosted Trees model.
Tabular AutoML systems that wrap TF-DF inside a hyperparameter search loop.
Edge and on-device inference, with caveats: TF-DF can export to TF Lite, but for tight embedded targets the standalone Yggdrasil C++ runtime is usually a better fit because it has no TensorFlow dependency ^[4].
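
For the anomaly-detection use case, the intuition behind Isolation Forest is that unusual points are separated from the rest of the data by fewer random splits. The sketch below is a simplified, framework-free illustration of that idea and not the TF-DF or Yggdrasil implementation: it skips the subsampling used by the published algorithm, and the helpers `path_length` and `mean_path_length` are hypothetical names.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    cut = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the cut as `point`.
    same_side = [row for row in data if (row[dim] < cut) == (point[dim] < cut)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def mean_path_length(point, data, num_trees=200, seed=0):
    """Average isolation depth over many independent random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(num_trees)) / num_trees

gen = random.Random(42)
normal = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = normal + [outlier]

# Compare the outlier with the most central normal point.
inlier = min(normal, key=lambda p: p[0] ** 2 + p[1] ** 2)
outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length(inlier, data)
```

A shorter average depth marks a point as more anomalous, so the outlier scores well below the central point here.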

How do hybrid neural network plus tree pipelines work?

One of the points the launch announcement emphasized was that TF-DF makes neural network plus tree pipelines straightforward ^[1]. A few common patterns are listed below.

Train a neural network to produce dense embeddings, freeze it, and then train a Gradient Boosted Trees model on the concatenation of those embeddings and the original tabular features.
Train a TF-DF model first, treat its prediction as a single dense feature, and feed it into a small neural network that combines it with embeddings of categorical features such as user IDs.
In multi-task learning, route tabular features through a tree model and unstructured modalities such as text or images through a neural network, then combine the two prediction streams.
For ranking, use a neural-network-based feature extractor on raw text or images and pass its outputs to a TF-DF LambdaMART model that scores documents.

Because every TF-DF model is a Keras model, these compositions can be expressed inside a tf.keras.Model and serialized as a single SavedModel ^[1]^[5].
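
The first pattern above, frozen embeddings concatenated with tabular columns, can be outlined without any framework. This is a structural sketch only: `frozen_embedding` stands in for a trained and frozen neural encoder, `fit_stump` stands in for a tree learner such as a Gradient Boosted Trees model, and both names, like the toy data, are hypothetical.

```python
def frozen_embedding(text):
    """Stand-in for a frozen neural encoder: maps text to a small dense vector."""
    words = text.lower().split()
    return [len(words) / 10.0, sum(w in {"refund", "broken"} for w in words) / 2.0]

def fit_stump(rows, labels):
    """Stand-in for a tree learner: best single-feature threshold by accuracy."""
    best = (0.0, 0, 0.0)  # (accuracy, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            acc = sum((r[j] >= t) == y for r, y in zip(rows, labels)) / len(rows)
            if acc > best[0]:
                best = (acc, j, t)
    _, j, t = best
    return lambda row: row[j] >= t

# Each example has tabular features plus a free-text field.
tabular = [[1.0, 30.0], [4.0, 12.0], [2.0, 45.0], [5.0, 8.0]]
texts = ["great product", "broken on arrival want refund", "works fine", "refund please broken"]
labels = [False, True, False, True]

# Concatenate the frozen embeddings with the original tabular columns,
# then train the tree-style learner on the combined feature vector.
features = [tab + frozen_embedding(txt) for tab, txt in zip(tabular, texts)]
model = fit_stump(features, labels)
predictions = [model(row) for row in features]
```

The point of the pattern is that the tree learner sees both kinds of features side by side and is free to split on whichever is most informative.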

What are TF-DF's strengths?

The Yggdrasil C++ core is battle-tested at Google scale, where it serves many millions of inferences per second ^[3]^[6].
Native handling of categorical and missing values means much less preprocessing than XGBoost typically requires ^[5].
The Keras API is familiar to anyone who already uses TensorFlow, so adoption inside an existing TensorFlow shop is low-friction ^[1]^[5].
Trained models export to standard SavedModel and run on TensorFlow Serving, Vertex AI, TensorFlow Lite, and TensorFlow.js without conversion ^[2].
Training does not require a GPU, which is convenient when the rest of the cluster is busy training neural networks ^[5].
Distributed training scales to billions of examples through TensorFlow Distributed when needed ^[12].

What are TF-DF's limitations?

The community is much smaller than the XGBoost or LightGBM communities. There are fewer Stack Overflow answers, blog posts, and Kaggle notebooks.
A few XGBoost-specific knobs and integrations are not available, and some users find the TF-DF documentation thinner than XGBoost's reference manual.
Training does not run on GPU or TPU, although Yggdrasil's CPU performance is competitive on most workloads ^[13].
TF-DF inherits the size of TensorFlow as a runtime dependency. The standalone YDF package was created in part to address this, with a roughly 12 MB footprint compared to the much larger TF-DF plus TensorFlow stack ^[4].
For Windows users, TF-DF runs only via WSL; native Windows wheels are not provided ^[9].
The Google team now recommends new projects start with the standalone YDF Python API rather than TF-DF, while keeping TF-DF as the right choice for projects that need tight TensorFlow integration ^[4].

What changed in the 2024 to 2025 releases?

The 1.8 through 1.12 release cycle, spanning November 2023 to March 2025, added several practical improvements ^[8]^[9].

TF-DF 1.8 (November 2023) added support for inspecting uplift models and fixed a regression bug in MSE/MAE gradients.
TF-DF 1.9 (March 2024) brought TensorFlow 2.16 compatibility, faster loading for models with many features, and NA condition support in the fast inference engine.
TF-DF 1.10 (August 2024) tracked TensorFlow 2.17 and shipped macOS build fixes.
TF-DF 1.11 (October 2024) renamed LAMBDA_MART_NDCG5 to LAMBDA_MART_NDCG with configurable truncation, switched from UnknownError to InvalidArgumentError for clearer debugging, and aligned with TensorFlow 2.18.
TF-DF 1.12 (March 2025) added Python 3.12 support, exposed five new hyperparameters for sparse oblique splits, fixed handling of non-Unicode categorical values, and aligned with TensorFlow 2.19. It also updated compatibility with the latest YDF model format.

The Google team has also shifted some documentation and roadmap energy toward the standalone YDF Python package, which advertises training that is up to five times faster and inference that is up to a thousand times faster than TF-DF for the same models, primarily because YDF avoids TensorFlow's overhead on small datasets ^[4]. TF-DF remains under active maintenance for users who need the TensorFlow integration.

How is TF-DF used in production at Google?

Google has not published a comprehensive list of products that use Yggdrasil and TF-DF, but the launch post and the KDD 2023 paper note that the underlying engine has served Google production traffic for years and is used across multiple internal products ^[1]^[3]. The Simple ML for Sheets add-on, which lets Google Sheets users build classification and regression models with no code, is itself powered by Yggdrasil ^[7].

Is TF-DF open source, and how do I install it?

Both TF-DF and Yggdrasil are released under the Apache License 2.0 ^[4]^[6]. TF-DF is distributed on PyPI as tensorflow-decision-forests and supports Python 3.9 through 3.12, on Linux x86-64 and macOS 12+ on ARM64; Windows is supported via WSL ^[9]. The source repository lives at github.com/tensorflow/decision-forests, with the Yggdrasil C++ core at github.com/google/yggdrasil-decision-forests ^[4]^[6].

References

1. Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, Jan Pfeifer. "Introducing TensorFlow Decision Forests." The TensorFlow Blog, May 27, 2021. https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html
2. "Updates: TensorFlow Decision Forests is production ready." The TensorFlow Blog, February 2023. https://blog.tensorflow.org/2023/02/updates-tensorflow-decision-forests-is-production-ready.html
3. Mathieu Guillame-Bert, Sebastian Bruch, Richard Stotz, Jan Pfeifer. "Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023), pp. 4068-4077. https://dl.acm.org/doi/10.1145/3580305.3599933 (preprint: https://arxiv.org/abs/2212.02934)
4. Yggdrasil Decision Forests, GitHub repository. https://github.com/google/yggdrasil-decision-forests
5. "Migrating from Neural Networks." TensorFlow Decision Forests documentation. https://www.tensorflow.org/decision_forests/migration
6. TensorFlow Decision Forests, GitHub repository. https://github.com/tensorflow/decision-forests
7. Simple ML for Sheets, official site. https://simplemlforsheets.com/
8. TensorFlow Decision Forests CHANGELOG. https://github.com/tensorflow/decision-forests/blob/main/CHANGELOG.md
9. tensorflow-decision-forests on PyPI. https://pypi.org/project/tensorflow-decision-forests/
10. "When to Choose CatBoost Over XGBoost or LightGBM." Neptune.ai blog. https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm
11. "The Kaggle Grandmasters Playbook: 7 Battle-Tested Modeling Techniques for Tabular Data." NVIDIA Technical Blog. https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/
12. "Distributed Training." TensorFlow Decision Forests documentation. https://www.tensorflow.org/decision_forests/distributed_training
13. Hong Yi et al. "A Comparison of Decision Forest Inference Platforms from A Database Perspective." arXiv:2302.04430. https://arxiv.org/abs/2302.04430
14. "Supercharge Tree-Based Model Inference with Forest Inference Library in NVIDIA cuML." NVIDIA Technical Blog. https://developer.nvidia.com/blog/supercharge-tree-based-model-inference-with-forest-inference-library-in-nvidia-cuml/
