TensorFlow Decision Forests (TF-DF)
Last reviewed
May 1, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,114 words
TensorFlow Decision Forests (often abbreviated TF-DF) is an open-source Google library that brings state-of-the-art tree-based machine-learning algorithms into the TensorFlow and Keras ecosystem. It exposes Random Forest, Gradient Boosted Decision Trees, CART, Distributed Gradient Boosted Trees, and Isolation Forest models through a Keras-compatible API, so that classification, regression, ranking, and uplifting tasks can be trained, evaluated, and served using the same workflow as a deep neural network [1][2]. Under the hood, TF-DF wraps Yggdrasil Decision Forests (YDF), a C++ decision forest library that has been in production at Google since 2018 [3][4].
TF-DF was open-sourced by the TensorFlow team on May 27, 2021, in a launch post written by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer [1]. It was the first time Google made the internal Yggdrasil engine available outside the company. Before TF-DF, TensorFlow users who wanted competitive tree models on tabular data typically had to leave the framework and use XGBoost, LightGBM, CatBoost, or scikit-learn. TF-DF closed that gap and also made it straightforward to combine tree models and neural networks inside a single TensorFlow pipeline [1][5].
The Yggdrasil project began inside Google in 2017 as an experimental research library. By 2018 it had moved into production and was being used to serve predictions tens of millions of times per second across various Google products [4][6]. The team chose the name Yggdrasil after the world tree of Norse mythology, which is fitting for a library whose central data structure is the decision tree.
For several years Yggdrasil remained an internal C++ library with command-line and Google Sheets bindings. Around 2019 to 2020, the team began wrapping the C++ core with a Python interface that integrated with TensorFlow and Keras. That wrapper became TensorFlow Decision Forests, and it was open-sourced at Google I/O on May 27, 2021 [1][6]. The launch was accompanied by Colab tutorials, Keras-style example code, and a Google I/O talk by Mathieu Guillame-Bert.
Key milestones in the library's history are listed below.
| Year | Event |
|---|---|
| 2017 | Yggdrasil started as an experimental C++ research library at Google |
| 2018 | Yggdrasil moved into production at Google; serves tens of millions of predictions per second |
| 2021 (May 27) | TF-DF 0.1.x released as open source by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer [1] |
| 2022 (December) | arXiv preprint of the Yggdrasil paper posted [3] |
| 2022 (December) | Simple ML for Sheets launched, putting Yggdrasil inside Google Sheets [7] |
| 2023 (February) | TF-DF 1.0 declared production ready in a TensorFlow blog post [2] |
| 2023 (August) | Yggdrasil paper published at KDD 2023 in Long Beach [3] |
| 2023 | Standalone YDF Python API released as a leaner alternative to TF-DF [4] |
| 2024 | Releases 1.8 through 1.11 add LambdaMART NDCG with configurable truncation, faster model loading, NA support in the fast inference engine, and TensorFlow 2.16 to 2.18 compatibility [8] |
| 2025 (March 13) | TF-DF 1.12 ships with Python 3.12 support, sparse oblique split tuning, and TensorFlow 2.19 compatibility [8][9] |
The library is maintained at Google by a core team that includes Mathieu Guillame-Bert (lead), Jan Pfeifer, Richard Stotz, Sebastian Bruch, and Arvind Srinivasan [4][6].
Tabular data is still the dominant format in industry. Click prediction, fraud detection, churn modeling, credit scoring, ranking, and many recommendation problems all run over tables of numbers and categorical labels rather than over images, audio, or natural language. On those problems tree ensembles continue to outperform deep neural networks in most cases, which is why XGBoost, LightGBM, and CatBoost are the default tools on Kaggle and in many production stacks [10][11].
Before TF-DF, TensorFlow had no first-class support for those models. A team that wanted to combine a decision tree ensemble with a TensorFlow neural network was forced to maintain two separate pipelines, two model formats, and two serving stacks. TF-DF removed that friction in three concrete ways.
- model.fit(), model.predict(), model.evaluate(), and model.save() calls that Keras users already knew.
- SavedModel artifacts that work with TensorFlow Serving, TensorFlow Lite, TensorFlow.js, and Vertex AI without any extra conversion step.

The production lineage of the underlying engine matters too. Yggdrasil already powered ranking, classification, and regression tasks across multiple Google products before the open-source release, so the TF-DF launch did not ship experimental code [1][3].
TF-DF exposes five training algorithms. Four are available as Keras Model subclasses; Isolation Forest is exposed through Yggdrasil.
| Algorithm | Keras class | Original reference | Typical use |
|---|---|---|---|
| Random Forest | tfdf.keras.RandomForestModel | Breiman, 2001 | Strong baseline, low tuning effort, robust to noise |
| Gradient Boosted Trees | tfdf.keras.GradientBoostedTreesModel | Friedman, 2001 | Highest accuracy on most tabular benchmarks, the workhorse |
| CART | tfdf.keras.CartModel | Breiman, Friedman, Olshen, and Stone, 1984 | Single decision tree baseline, interpretable models |
| Distributed Gradient Boosted Trees | tfdf.keras.DistributedGradientBoostedTreesModel | Yggdrasil paper, 2023 | GBT training over hundreds of millions to a few billion rows across many machines [3][12] |
| Isolation Forest | exposed via Yggdrasil; reachable from TF-DF as of recent releases | Liu, Ting, and Zhou, 2008 | Unsupervised anomaly detection [4] |
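To make the Gradient Boosted Trees row more concrete, here is a minimal pure-Python sketch of boosting with one-split trees (stumps) on a single numerical feature and squared error. It is a teaching illustration only, not TF-DF's or Yggdrasil's implementation; the function names, the `shrinkage` default, and the exhaustive threshold search are all assumptions made for this example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error.

    Assumes xs contains at least two distinct values.
    Returns (threshold, left_value, right_value).
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def fit_gbt(xs, ys, num_trees=50, shrinkage=0.3):
    """Fit stumps one after another, each on the current residuals."""
    initial = mean(ys)
    predictions = [initial] * len(ys)
    stumps = []
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + shrinkage * (left_value if x < threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return initial, shrinkage, stumps


def predict_gbt(model, x):
    """Sum the initial prediction and every shrunken stump output."""
    initial, shrinkage, stumps = model
    return initial + sum(
        shrinkage * (left_value if x < threshold else right_value)
        for threshold, left_value, right_value in stumps
    )
```

Each stump corrects a fraction of the error left by the previous ones, which is the core idea the production learners build on with deeper trees, regularization, and other losses.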
In addition to these top-level model types, the library implements LambdaMART (under both LAMBDA_MART_NDCG and the older LAMBDA_MART_NDCG5 names), DART (Dropout Additive Regression Trees), Extra Trees, oblique splits, sparse oblique splits, greedy global growth, one-side sampling, categorical-set learning, random categorical learning, out-of-bag evaluation, and several variable importance measures [1][8].
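The truncation that separates an NDCG@5 objective from a configurable NDCG@k can be illustrated with a small pure-Python metric. This is a sketch under stated assumptions: it uses the common `2**relevance - 1` gain and `log2` discount, which may differ in detail from what the library computes, and the function names are invented for this example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_predicted_order, k):
    """NDCG truncated at k: DCG of the predicted order over the ideal DCG."""
    ideal_order = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0  # No relevant item at all, so the metric is defined as 0 here.
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg
```

With the only relevant document ranked fourth, `ndcg_at_k([0, 0, 0, 3], 2)` is zero while `ndcg_at_k([0, 0, 0, 3], 4)` is positive, which shows why the truncation level changes what a ranking model is rewarded for.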
TF-DF inherits the full Yggdrasil feature set and adds Keras conveniences on top.
| Feature | Notes |
|---|---|
| Keras model API | tfdf.keras.RandomForestModel, GradientBoostedTreesModel, CartModel, DistributedGradientBoostedTreesModel all behave like Keras models, with fit, evaluate, predict, and save |
| Native categorical support | Both CATEGORICAL_INTEGER and CATEGORICAL_STRING features train without one-hot encoding, hashing, or embeddings [5] |
| Native missing-value handling | NaNs are split on directly without imputation [5] |
| Numerical feature handling | Splits are threshold-based, so no scaling, normalization, or whitening is required [5] |
| Multi-task learning | A single GBT can predict more than one label, including mixed regression and classification targets |
| Sparse features | Sparse representations are supported, including sparse oblique splits added in 1.12 [9] |
| Ranking | LambdaMART-style ranking with configurable NDCG truncation as of 1.11 [8] |
| Uplifting | Uplift modeling is supported through dedicated tasks and an uplift inspector added in 1.8 [8] |
| Hyperparameter templates | use_predefined_hps=True enables curated parameter sets discovered across hundreds of datasets [2] |
| Distributed training | Up to a few billion examples across tens to hundreds of machines via TensorFlow Distributed [12] |
| Deterministic training | The same dataset and hyperparameters reproduce the exact same model [5] |
| Single-epoch training | The full dataset is loaded once; no batch-size or epoch tuning [5] |
| SavedModel export | Models save as standard TensorFlow SavedModel, ready for TensorFlow Serving, Vertex AI, TF Lite, and TF.js [2] |
| YDF interoperability | TF-DF and YDF models are mutually loadable, so a model trained in Python can be served from C++, Go, JavaScript, or the command line [4] |
| License | Apache License 2.0 [6] |
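The claim in the table that threshold-based splits need no scaling or normalization can be checked with a few lines of pure Python. The sketch below searches for the best threshold on one numerical feature using Gini impurity and binary labels; the splitter, the impurity choice, and the function names are assumptions for illustration and are not TF-DF's actual splitter code.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    positive_rate = sum(labels) / len(labels)
    return 2 * positive_rate * (1 - positive_rate)


def best_split(values, labels):
    """Return (threshold, left_indices) with the lowest weighted Gini impurity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_threshold, best_left = None, None, None
    for pos in range(1, len(order)):
        lo, hi = values[order[pos - 1]], values[order[pos]]
        if lo == hi:
            continue  # No threshold can separate two equal values.
        left, right = order[:pos], order[pos:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if best_score is None or score < best_score:
            best_score = score
            best_threshold = (lo + hi) / 2
            best_left = sorted(left)
    return best_threshold, best_left
```

Because the search depends only on the ordering of the values, applying any strictly increasing transformation to the feature (for example `value * 1000 + 5`) moves the threshold but sends exactly the same examples to each side of the split.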
A bare-bones training example shows how minimal the API is.
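```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# Load a CSV and convert it to a TensorFlow dataset; "income" is the label column.
train_df = pd.read_csv("train.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="income")

# Train with default hyperparameters; there is no epoch count or batch size to choose.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)

# Metrics are only used by model.evaluate(), so compiling after fit is fine.
model.compile(metrics=["accuracy"])
model.save("my_gbt_model")
```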
No feature columns, no normalization, no batch size, no epoch count. The model figures out feature semantics from the dataset spec and trains in a single pass.
TF-DF is a thin Python and TensorFlow wrapper. The training loops, splitter logic, statistics tables, model serialization format, and inference engines all live in Yggdrasil's C++ core [3][4]. The C++ core is portable across Linux, macOS, and Windows, and it does not depend on TensorFlow at all, which is why Yggdrasil can also be called from a CLI, from Go, from JavaScript via WebAssembly, and from Google Sheets [3][6][7].
The Yggdrasil paper, published at KDD 2023, describes four design principles: simplicity of use, safety of use, modularity with high-level abstractions, and integration with other ML libraries [3]. The same paper benchmarks Yggdrasil against XGBoost and LightGBM and reports competitive accuracy with strong inference performance, often in the sub-microsecond range per example per CPU core [3].
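One way to see why tree inference can be so cheap per example is to store a tree in flat arrays and walk it with a short loop of comparisons. The sketch below is purely illustrative: the `FlatTree` layout, the field names, and the use of `-1` to mark leaves are assumptions for this example and do not describe Yggdrasil's actual inference engines.

```python
from dataclasses import dataclass


@dataclass
class FlatTree:
    """A decision tree stored as parallel arrays indexed by node id."""

    feature: list    # Feature index tested at each node, or -1 for a leaf.
    threshold: list  # Split threshold at each node (unused for leaves).
    left: list       # Child node id when example[feature] < threshold.
    right: list      # Child node id otherwise.
    value: list      # Prediction stored at each leaf (unused for inner nodes).


def predict(tree, example):
    """Walk from the root to a leaf with one comparison per level."""
    node = 0
    while tree.feature[node] != -1:
        if example[tree.feature[node]] < tree.threshold[node]:
            node = tree.left[node]
        else:
            node = tree.right[node]
    return tree.value[node]


# A hypothetical two-level tree over (age, income).
example_tree = FlatTree(
    feature=[0, -1, 1, -1, -1],
    threshold=[30.0, 0.0, 50000.0, 0.0, 0.0],
    left=[1, -1, 3, -1, -1],
    right=[2, -1, 4, -1, -1],
    value=[0.0, 0.1, 0.0, 0.4, 0.9],
)
```

A prediction costs one comparison and one array lookup per tree level, with no floating-point matrix arithmetic, which is consistent with the small per-example latencies the paper reports for CPU inference.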
Key internal components include:
- A dataset specification (DataSpec) that records feature names, types, vocabulary, and statistics so that the same model can be trained, served, and inspected without re-deriving the schema.

Because the C++ core is open source under Apache 2.0, the same engine can be embedded in non-Python and non-TensorFlow stacks, which is one reason Google built the standalone YDF Python package in 2023 [4].
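The dataset specification idea mentioned above can be pictured with a small pure-Python sketch that infers a type, a missing-value count, and basic statistics for each column. The `ColumnSpec` class, the `infer_dataspec` function, and the type-inference rule are hypothetical and much simpler than Yggdrasil's real DataSpec; they only illustrate what kind of information such a structure records.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnSpec:
    """Schema and statistics for one column (hypothetical, simplified)."""

    name: str
    kind: str  # "NUMERICAL" or "CATEGORICAL"
    num_missing: int = 0
    mean: Optional[float] = None
    vocabulary: dict = field(default_factory=dict)


def infer_dataspec(rows):
    """Build a {column name: ColumnSpec} mapping from a list of dict rows.

    Missing values are represented by None. Column names come from the first row.
    """
    spec = {}
    for name in rows[0]:
        present = [row[name] for row in rows if row.get(name) is not None]
        missing = len(rows) - len(present)
        if present and all(isinstance(v, (int, float)) for v in present):
            spec[name] = ColumnSpec(
                name, "NUMERICAL", missing, mean=sum(present) / len(present)
            )
        else:
            # Anything non-numeric is treated as a category and counted.
            counts = dict(Counter(str(v) for v in present))
            spec[name] = ColumnSpec(name, "CATEGORICAL", missing, vocabulary=counts)
    return spec
```

Recording the schema once, at training time, is what lets the same column semantics be reused when the model is later served or inspected.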
TF-DF is one of several mature tree libraries. The table below compares the headline differences.
| Library | Maintainer | First release | Core language | Framework integration | Distinctive feature |
|---|---|---|---|---|---|
| TensorFlow Decision Forests | Google | May 2021 | C++ (via Yggdrasil), Python wrapper | Native Keras and TensorFlow | Tree models that share a SavedModel format with neural networks |
| XGBoost | DMLC community (Tianqi Chen et al.) | 2014 | C++ | Python, R, Java, Scala, Spark | Industry default for gradient boosting; mature scikit-learn API |
| LightGBM | Microsoft | 2016 (paper 2017) | C++ | Python, R, C, command line | Histogram-based GBT with leaf-wise growth, fast on large data |
| CatBoost | Yandex | 2017 | C++ | Python, R, command line | Symmetric trees with native ordered target statistics for categoricals |
| scikit-learn ensembles | Community | 2007 onward | Cython, Python | Native to scikit-learn | The textbook reference implementations of Random Forest and GBM |
| H2O.ai | H2O.ai | 2014 | Java | Python, R, REST | Distributed JVM-based GBM and Random Forest with AutoML |
A few recurring observations from independent benchmarks and from the Yggdrasil paper itself:
There is one important caveat. TF-DF training does not run on GPUs or TPUs. All training and inference happen on CPU, sometimes with SIMD acceleration. For very large datasets, GPU-accelerated XGBoost or NVIDIA's Forest Inference Library can be faster on appropriate hardware, although the Yggdrasil team argues that for typical inference workloads the GPU does not help because batches are small [13][14].
The practical applications of TF-DF cluster around problems where tabular data dominates and where the rest of the system already runs on TensorFlow.
One of the points the launch announcement emphasized was that TF-DF makes neural network plus tree pipelines straightforward [1]. A few common patterns are listed below.
Because every TF-DF model is a Keras model, these compositions can be expressed inside a tf.keras.Model and serialized as a single SavedModel [1][5].
The 1.8 through 1.12 release cycle added several practical improvements [8][9].
- LAMBDA_MART_NDCG5 to LAMBDA_MART_NDCG with configurable truncation, switched from UnknownError to InvalidArgumentError for clearer debugging, and aligned with TensorFlow 2.18.

The Google team has also shifted some documentation and roadmap energy toward the standalone YDF Python package, which advertises training that is up to five times faster and inference that is up to a thousand times faster than TF-DF for the same models, primarily because YDF avoids TensorFlow's overhead on small datasets [4]. TF-DF remains under active maintenance for users who need the TensorFlow integration.
Google has not published a comprehensive list of products that use Yggdrasil and TF-DF, but the launch post and the KDD 2023 paper note that the underlying engine has served Google production traffic for years and is used across multiple internal products [1][3]. The Simple ML for Sheets add-on, which lets Google Sheets users build classification and regression models with no code, is itself powered by Yggdrasil [7].
Both TF-DF and Yggdrasil are released under the Apache License 2.0 [4][6]. TF-DF is distributed on PyPI as tensorflow-decision-forests and supports Python 3.9 through 3.12, on Linux x86-64 and macOS 12+ on ARM64; Windows is supported via WSL [9]. The source repository lives at github.com/tensorflow/decision-forests, with the Yggdrasil C++ core at github.com/google/yggdrasil-decision-forests [4][6].