# TensorFlow Decision Forests (TF-DF)

> Source: https://aiwiki.ai/wiki/tensorflow_decision_forests
> Updated: 2026-05-01
> Categories: Developer Tools, Google, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**TensorFlow Decision Forests** (often abbreviated **TF-DF**) is an open-source Google library that brings state-of-the-art tree-based machine-learning algorithms into the [TensorFlow](tensorflow) and [Keras](keras) ecosystem. It exposes [Random Forest](random_forest), [Gradient Boosted Decision Trees](gradient_boosted_decision_trees_gbt), CART, Distributed Gradient Boosted Trees, and Isolation Forest models through a Keras-compatible API, so that classification, regression, ranking, and uplifting tasks can be trained, evaluated, and served using the same workflow as a deep neural network [1][2]. Under the hood, TF-DF wraps Yggdrasil Decision Forests (YDF), a C++ [decision forest](decision_forest) library that has been in production at Google since 2018 [3][4].

TF-DF was open-sourced by the TensorFlow team on May 27, 2021, in a launch post written by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer [1]. It was the first time Google made the internal Yggdrasil engine available outside the company. Before TF-DF, TensorFlow users who wanted competitive tree models on tabular data typically had to leave the framework and use [XGBoost](xgboost), [LightGBM](lightgbm), CatBoost, or scikit-learn. TF-DF closed that gap and also made it straightforward to combine tree models and neural networks inside a single TensorFlow pipeline [1][5].

## History

The Yggdrasil project began inside Google in 2017 as an experimental research library. By 2018 it had moved into production and was being used to serve predictions tens of millions of times per second across various Google products [4][6]. The team chose the name Yggdrasil after the world tree of Norse mythology, which is fitting for a library whose central data structure is the decision tree.

For several years Yggdrasil remained an internal C++ library with command-line and Google Sheets bindings. Around 2019 to 2020, the team began wrapping the C++ core with a Python interface that integrated with TensorFlow and Keras. That wrapper became TensorFlow Decision Forests, which was presented at Google I/O 2021 and open-sourced with a launch post dated May 27, 2021 [1][6]. The launch was accompanied by Colab tutorials, Keras-style example code, and a Google I/O talk by Mathieu Guillame-Bert.

Key milestones in the library's history are listed below.

| Year | Event |
|------|-------|
| 2017 | Yggdrasil started as an experimental C++ research library at Google |
| 2018 | Yggdrasil moved into production at Google; serves tens of millions of predictions per second |
| 2021 (May 27) | TF-DF 0.1.x released as open source by Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, and Jan Pfeifer [1] |
| 2022 (December) | arXiv preprint of the Yggdrasil paper posted [3] |
| 2022 (December) | Simple ML for Sheets launched, putting Yggdrasil inside Google Sheets [7] |
| 2023 (February) | TF-DF 1.0 declared production ready in a TensorFlow blog post [2] |
| 2023 (August) | Yggdrasil paper published at KDD 2023 in Long Beach [3] |
| 2023 | Standalone YDF Python API released as a leaner alternative to TF-DF [4] |
| 2023 to 2024 | Releases 1.8 through 1.11 add LambdaMART NDCG with configurable truncation, faster model loading, NA support in the fast inference engine, and TensorFlow 2.16 to 2.18 compatibility [8] |
| 2025 (March 13) | TF-DF 1.12 ships with Python 3.12 support, sparse oblique split tuning, and TensorFlow 2.19 compatibility [8][9] |

The library is maintained at Google by a core team that includes Mathieu Guillame-Bert (lead), Jan Pfeifer, Richard Stotz, Sebastian Bruch, and Arvind Srinivasan [4][6].

## Why TF-DF matters

Tabular data is still the dominant format in industry. Click prediction, fraud detection, churn modeling, credit scoring, ranking, and many recommendation problems all run over tables of numbers and categorical labels rather than over images, audio, or natural language. On those problems tree ensembles often match or outperform deep neural networks, which is why XGBoost, LightGBM, and CatBoost are the default tools on Kaggle and in many production stacks [10][11].

Before TF-DF, TensorFlow had no first-class support for those models. A team that wanted to combine a [decision tree](decision_tree) ensemble with a TensorFlow neural network was forced to maintain two separate pipelines, two model formats, and two serving stacks. TF-DF removed that friction in three concrete ways.

1. It put Random Forest, Gradient Boosted Trees, and CART models behind the same `model.fit()`, `model.predict()`, `model.evaluate()`, and `model.save()` calls that Keras users already knew.
2. It produced ordinary `SavedModel` artifacts that work with TensorFlow Serving, TensorFlow Lite, TensorFlow.js, and Vertex AI without any extra conversion step.
3. It allowed hybrid pipelines where a neural network produces dense feature vectors that feed a tree, or where tree predictions become inputs to a downstream neural model.

The production lineage of the underlying engine matters too. Yggdrasil already powered ranking, classification, and regression tasks across multiple Google products before the open-source release, so the TF-DF launch did not ship experimental code [1][3].

## Supported algorithms

TF-DF exposes five training algorithms. Four are Keras `Model` subclasses; Isolation Forest is provided by the underlying Yggdrasil engine.

| Algorithm | Keras class | Original reference | Typical use |
|-----------|-------------|--------------------|-------------|
| Random Forest | `tfdf.keras.RandomForestModel` | Breiman, 2001 | Strong baseline, low tuning effort, robust to noise |
| Gradient Boosted Trees | `tfdf.keras.GradientBoostedTreesModel` | Friedman, 2001 | Highest accuracy on most tabular benchmarks, the workhorse |
| CART | `tfdf.keras.CartModel` | Breiman, Friedman, Olshen, and Stone, 1984 | Single decision tree baseline, interpretable models |
| Distributed Gradient Boosted Trees | `tfdf.keras.DistributedGradientBoostedTreesModel` | Yggdrasil paper, 2023 | GBT training over hundreds of millions to a few billion rows across many machines [3][12] |
| Isolation Forest | exposed via Yggdrasil; reachable from TF-DF as of recent releases | Liu, Ting, and Zhou, 2008 | Unsupervised anomaly detection [4] |

In addition to these top-level model types, the library implements LambdaMART (under both LAMBDA_MART_NDCG and the older LAMBDA_MART_NDCG5 names), DART (Dropout Additive Regression Trees), Extra Trees, oblique splits, sparse oblique splits, greedy global growth, one-side sampling, categorical-set learning, random categorical learning, out-of-bag evaluation, and several variable importance measures [1][8].
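
Several of these variants are selected through constructor hyperparameters rather than through separate classes. The sketch below assumes the hyperparameter names `forest_extraction`, `split_axis`, `growing_strategy`, and `categorical_algorithm` as documented for the TF-DF learners; accepted values and defaults can differ between releases, so check them against the installed version. `build_variants` is a hypothetical helper name, not part of the library.

```python
def build_variants():
    """Return a few TF-DF models configured with non-default learner options."""
    # Imported inside the function so that loading this module does not
    # require TensorFlow to be installed.
    import tensorflow_decision_forests as tfdf

    return {
        # DART: boosted trees trained with dropout over earlier trees.
        "dart": tfdf.keras.GradientBoostedTreesModel(forest_extraction="DART"),
        # Sparse oblique splits: conditions over a weighted sum of a few features.
        "sparse_oblique": tfdf.keras.GradientBoostedTreesModel(
            split_axis="SPARSE_OBLIQUE"),
        # Greedy global (best-first) tree growth.
        "best_first": tfdf.keras.GradientBoostedTreesModel(
            growing_strategy="BEST_FIRST_GLOBAL"),
        # Random search for categorical splits.
        "random_categorical": tfdf.keras.RandomForestModel(
            categorical_algorithm="RANDOM"),
    }
```

Each returned object is an untrained model; calling `fit` on any of them works the same way as for a default-configured model.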

## Key features

TF-DF inherits the full Yggdrasil feature set and adds Keras conveniences on top.

| Feature | Notes |
|---------|-------|
| Keras model API | `tfdf.keras.RandomForestModel`, `GradientBoostedTreesModel`, `CartModel`, `DistributedGradientBoostedTreesModel` all behave like Keras models, with `fit`, `evaluate`, `predict`, and `save` |
| Native categorical support | Both `CATEGORICAL_INTEGER` and `CATEGORICAL_STRING` features train without one-hot encoding, hashing, or embeddings [5] |
| Native missing-value handling | NaNs are split on directly without imputation [5] |
| Numerical feature handling | Splits are threshold-based, so no scaling, normalization, or whitening is required [5] |
| Multi-task learning | A single GBT can predict more than one label, including mixed regression and classification targets |
| Sparse features | Sparse representations are supported, including sparse oblique splits added in 1.12 [9] |
| Ranking | LambdaMART-style ranking with configurable NDCG truncation as of 1.11 [8] |
| Uplifting | Uplift modeling is supported through dedicated tasks and an uplift inspector added in 1.8 [8] |
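| Hyperparameter templates | The `hyperparameter_template` constructor argument (for example `"benchmark_rank1"`) applies curated parameter sets, and the tuner option `use_predefined_hps=True` selects a predefined search space [2] |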
| Distributed training | Up to a few billion examples across tens to hundreds of machines via TensorFlow Distributed [12] |
| Deterministic training | The same dataset and hyperparameters reproduce the exact same model [5] |
| Single-epoch training | The full dataset is loaded once; no batch-size or epoch tuning [5] |
| SavedModel export | Models save as standard TensorFlow SavedModel, ready for TensorFlow Serving, Vertex AI, TF Lite, and TF.js [2] |
| YDF interoperability | TF-DF and YDF models are mutually loadable, so a model trained in Python can be served from C++, Go, JavaScript, or the command line [4] |
| License | Apache License 2.0 [6] |
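
As a rough illustration of the hyperparameter row in the table, the following sketch builds one model from a curated template and one model driven by a random-search tuner. It assumes the template name `benchmark_rank1` and the `tfdf.tuner.RandomSearch` class are available in the installed release; the function name and the default of 20 trials are illustrative choices only.

```python
def build_templated_and_tuned_models(num_trials=20):
    """Return (templated_model, tuned_model), both untrained."""
    # Imported inside the function so that loading this module does not
    # require TensorFlow to be installed.
    import tensorflow_decision_forests as tfdf

    # A template replaces the default hyperparameters with a curated set.
    templated = tfdf.keras.GradientBoostedTreesModel(
        hyperparameter_template="benchmark_rank1")

    # A random-search tuner; use_predefined_hps=True asks the library to
    # pick the search space instead of listing each hyperparameter by hand.
    tuner = tfdf.tuner.RandomSearch(
        num_trials=num_trials, use_predefined_hps=True)
    tuned = tfdf.keras.GradientBoostedTreesModel(tuner=tuner)

    return templated, tuned
```

With the tuner attached, the search runs inside `fit`, so the calling code does not change.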

A bare-bones training example shows how minimal the API is.

```python
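import pandas as pd
import tensorflow_decision_forests as tfdf

# Load the training table and convert it to a tf.data.Dataset.
train_df = pd.read_csv("train.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="income")

# Optionally attach evaluation metrics before training.
model = tfdf.keras.GradientBoostedTreesModel()
model.compile(metrics=["accuracy"])
model.fit(train_ds)

# Export the trained model as a TensorFlow SavedModel directory.
model.save("my_gbt_model")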
```

No feature columns, no normalization, no batch size, no epoch count. The model figures out feature semantics from the dataset spec and trains in a single pass.
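
A natural follow-up is to evaluate the trained model and look at which features it relied on. The helper below is a sketch under the assumption that a held-out pandas DataFrame (here called `test_df`, a hypothetical name) has the same columns as the training data; `evaluate_and_inspect` is likewise an illustrative name.

```python
def evaluate_and_inspect(model, test_df, label="income"):
    """Evaluate a trained TF-DF model and return its variable importances."""
    # Imported inside the function so that loading this module does not
    # require TensorFlow to be installed.
    import tensorflow_decision_forests as tfdf

    test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label=label)

    # Attach the metric to report, then evaluate on the held-out data.
    model.compile(metrics=["accuracy"])
    metrics = model.evaluate(test_ds, return_dict=True)

    # Raw model outputs for each row of the test set.
    predictions = model.predict(test_ds)

    # The inspector exposes the model structure and importance measures.
    importances = model.make_inspector().variable_importances()

    return metrics, predictions, importances
```

The returned `importances` value is a dictionary keyed by importance measure, so the available measures depend on the algorithm and its settings.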

## Yggdrasil internals

TF-DF is a thin Python and TensorFlow wrapper. The training loops, splitter logic, statistics tables, model serialization format, and inference engines all live in Yggdrasil's C++ core [3][4]. The C++ core is portable across Linux, macOS, and Windows, and it does not depend on TensorFlow at all, which is why Yggdrasil can also be called from a CLI, from Go, from JavaScript via WebAssembly, and from Google Sheets [3][6][7].

The Yggdrasil paper, published at KDD 2023, describes four design principles: simplicity of use, safety of use, modularity with high-level abstractions, and integration with other ML libraries [3]. The same paper benchmarks Yggdrasil against XGBoost and LightGBM and reports competitive accuracy with strong inference performance, often in the sub-microsecond range per example per CPU core [3].

Key internal components include:

- A unified dataset specification (`DataSpec`) that records feature names, types, vocabulary, and statistics so that the same model can be trained, served, and inspected without re-deriving the schema.
- A tree-builder library that supports both axis-aligned and oblique splits, multiple impurity criteria, and several growth strategies (depth-wise, best-first, and global).
- A serving library with multiple inference engines, including a generic engine, optimized engines for axis-aligned conditions, and quantized engines.
- A hyperparameter tuner that wraps the training loops and exposes both random and exhaustive search.
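
The difference between axis-aligned and oblique splits mentioned in the list above can be illustrated with a small sketch. The class names are hypothetical and do not mirror Yggdrasil's C++ types; the point is only that an axis-aligned condition tests one feature against a threshold, while an oblique condition tests a weighted sum of several features.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AxisAlignedCondition:
    """Tests a single feature: x[feature] >= threshold."""
    feature: int
    threshold: float

    def evaluate(self, x: Sequence[float]) -> bool:
        return x[self.feature] >= self.threshold


@dataclass(frozen=True)
class ObliqueCondition:
    """Tests a linear projection: sum(w_i * x[f_i]) >= threshold."""
    features: Tuple[int, ...]
    weights: Tuple[float, ...]
    threshold: float

    def evaluate(self, x: Sequence[float]) -> bool:
        projection = sum(w * x[f] for f, w in zip(self.features, self.weights))
        return projection >= self.threshold


example = [2.0, 5.0, 1.0]

# Axis-aligned: compares feature 1 (5.0) against 4.0.
axis = AxisAlignedCondition(feature=1, threshold=4.0)

# Oblique: compares 1.0 * x[0] - 1.0 * x[2] = 1.0 against 1.5.
oblique = ObliqueCondition(features=(0, 2), weights=(1.0, -1.0), threshold=1.5)

print(axis.evaluate(example))     # True
print(oblique.evaluate(example))  # False
```

A sparse oblique split is the case where `features` names only a small subset of the columns, which keeps evaluation cheap.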

Because the C++ core is open source under Apache 2.0, the same engine can be embedded in non-Python and non-TensorFlow stacks, which is one reason Google built the standalone YDF Python package in 2023 [4].

## Comparison with other tree libraries

TF-DF is one of several mature tree libraries. The table below compares the headline differences.

| Library | Maintainer | First release | Core language | Framework integration | Distinctive feature |
|---------|------------|---------------|---------------|-----------------------|---------------------|
| [TensorFlow Decision Forests](tensorflow_decision_forests) | Google | May 2021 | C++ (via Yggdrasil), Python wrapper | Native [Keras](keras) and [TensorFlow](tensorflow) | Tree models that share a SavedModel format with neural networks |
| [XGBoost](xgboost) | DMLC community (Tianqi Chen et al.) | 2014 | C++ | Python, R, Java, Scala, Spark | Industry default for gradient boosting; mature scikit-learn API |
| [LightGBM](lightgbm) | Microsoft | 2016 (paper 2017) | C++ | Python, R, C, command line | Histogram-based GBT with leaf-wise growth, fast on large data |
| CatBoost | Yandex | 2017 | C++ | Python, R, command line | Symmetric trees with native ordered target statistics for categoricals |
| scikit-learn ensembles | Community | 2007 onward | Cython, Python | Native to scikit-learn | The textbook reference implementations of Random Forest and GBM |
| H2O.ai | H2O.ai | 2014 | Java | Python, R, REST | Distributed JVM-based GBM and Random Forest with AutoML |

A few recurring observations from independent benchmarks and from the Yggdrasil paper itself:

- On accuracy, TF-DF and Yggdrasil are competitive with XGBoost on most tabular tasks, and the gap to LightGBM and CatBoost is small in either direction depending on dataset and tuning [3][13].
- On training speed, LightGBM is often the fastest of the four for medium-sized datasets, with CatBoost close behind. Yggdrasil and XGBoost trade places depending on hyperparameters [13].
- On inference speed, Yggdrasil is strong because the C++ engines were designed to serve Google production traffic [3].
- On categorical handling, TF-DF, CatBoost, and LightGBM (since 2.0) all handle categoricals natively, while XGBoost still typically requires preprocessing such as one-hot or ordinal encoding [10][11].
- On framework integration, TF-DF is the only library in the group that produces a TensorFlow SavedModel out of the box.

There is one important caveat. TF-DF training does not run on GPUs or TPUs. All training and inference happen on CPU, sometimes with SIMD acceleration. For very large datasets, GPU-accelerated XGBoost or NVIDIA's Forest Inference Library can be faster on appropriate hardware, although the Yggdrasil team argues that for typical inference workloads the GPU does not help because batches are small [13][14].

## Use cases

The practical applications of TF-DF cluster around problems where tabular data dominates and where the rest of the system already runs on TensorFlow.

- Tabular classification and regression on structured data such as customer records, sensor logs, financial transactions, and survey responses.
- Ranking systems for search and recommendations, where the LambdaMART loss is designed to optimize NDCG directly.
- Anomaly detection through Isolation Forest for fraud, intrusion detection, or quality monitoring.
- Production ML pipelines that already use TensorFlow Serving, Vertex AI, or TFX and need a tree model that fits the same deployment story.
- Hybrid models that mix neural networks and trees, for example using image embeddings or text embeddings as inputs to a Gradient Boosted Trees model.
- Tabular AutoML systems that wrap TF-DF inside a hyperparameter search loop.
- Edge and on-device inference, with caveats: TF-DF can export to TF Lite, but for tight embedded targets the standalone Yggdrasil C++ runtime is usually a better fit because it has no TensorFlow dependency [4].
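
For the ranking use case, a minimal sketch might look like the following. It assumes a pandas DataFrame with one row per query-document pair; the column names `relevance` and `query_id` and the function name `train_ranker` are hypothetical.

```python
def train_ranker(train_df, label="relevance", group="query_id"):
    """Train a Gradient Boosted Trees ranking model on a pandas DataFrame."""
    # Imported inside the function so that loading this module does not
    # require TensorFlow to be installed.
    import tensorflow_decision_forests as tfdf

    # The task must be passed both to the dataset conversion and to the model.
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
        train_df, label=label, task=tfdf.keras.Task.RANKING)

    # ranking_group names the column that identifies which rows belong
    # to the same query.
    model = tfdf.keras.GradientBoostedTreesModel(
        task=tfdf.keras.Task.RANKING, ranking_group=group)
    model.fit(train_ds)
    return model
```

At prediction time the model returns one score per row, and documents are ordered by that score within each query.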

## Hybrid neural network plus tree pipelines

One of the points the launch announcement emphasized was that TF-DF makes neural network plus tree pipelines straightforward [1]. A few common patterns are listed below.

- Train a neural network to produce dense embeddings, freeze it, and then train a Gradient Boosted Trees model on the concatenation of those embeddings and the original tabular features.
- Train a TF-DF model first, treat its prediction as a single dense feature, and feed it into a small neural network that combines it with embeddings of categorical features such as user IDs.
- In multi-task learning, route tabular features through a tree model and unstructured modalities such as text or images through a neural network, then combine the two prediction streams.
- For ranking, use a neural-network-based feature extractor on raw text or images and pass its outputs to a TF-DF LambdaMART model that scores documents.

Because every TF-DF model is a Keras model, these compositions can be expressed inside a `tf.keras.Model` and serialized as a single SavedModel [1][5].

## Strengths

- The Yggdrasil C++ core is battle-tested at Google scale, where it serves many millions of inferences per second [3][6].
- Native handling of categorical and missing values means much less preprocessing than [XGBoost](xgboost) typically requires [5].
- The Keras API is familiar to anyone who already uses TensorFlow, so adoption inside an existing TensorFlow shop is low-friction [1][5].
- Trained models export to standard SavedModel and run on TensorFlow Serving, Vertex AI, TensorFlow Lite, and TensorFlow.js without conversion [2].
- Training does not require a GPU, which is convenient when the rest of the cluster is busy training neural networks [5].
- Distributed training scales to billions of examples through TensorFlow Distributed when needed [12].
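
Distributed training follows a different call pattern from the in-memory case. The sketch below assumes a cluster described by the `TF_CONFIG` environment variable and a dataset stored on a filesystem that every worker can read; the function name and its arguments are illustrative, and the exact strategy class and dataset format strings should be checked against the distributed training documentation [12].

```python
def train_distributed(train_path, label_key, dataset_format="csv"):
    """Train a distributed GBT model with workers reading data from disk."""
    # Imported inside the function so that loading this module does not
    # require TensorFlow to be installed.
    import tensorflow as tf
    import tensorflow_decision_forests as tfdf

    # Read the cluster layout (chief, workers, parameter servers) from TF_CONFIG.
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    strategy = tf.distribute.experimental.ParameterServerStrategy(resolver)

    # The model has to be created inside the strategy scope.
    with strategy.scope():
        model = tfdf.keras.DistributedGradientBoostedTreesModel()

    # Workers load their shards directly instead of receiving a tf.data pipeline.
    model.fit_on_dataset_path(
        train_path=train_path,
        label_key=label_key,
        dataset_format=dataset_format)
    return model
```

This only pays off for datasets that do not fit comfortably on one machine; smaller datasets are usually better served by the non-distributed `GradientBoostedTreesModel`.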

## Limitations

- The community is much smaller than the XGBoost or LightGBM communities. There are fewer Stack Overflow answers, blog posts, and Kaggle notebooks.
- A few XGBoost-specific knobs and integrations are not available, and some users find the TF-DF documentation thinner than XGBoost's reference manual.
- Training does not run on GPU or TPU, although Yggdrasil's CPU performance is competitive on most workloads [13].
- TF-DF inherits the size of TensorFlow as a runtime dependency. The standalone YDF package was created in part to address this, with a roughly 12 MB footprint compared to the much larger TF-DF plus TensorFlow stack [4].
- For Windows users, TF-DF runs only via WSL; native Windows wheels are not provided [9].
- The Google team now recommends new projects start with the standalone YDF Python API rather than TF-DF, while keeping TF-DF as the right choice for projects that need tight TensorFlow integration [4].

## Recent updates (2023 to 2025)

The 1.8 through 1.12 release cycle added several practical improvements [8][9].

- TF-DF 1.8 (November 2023) added support for inspecting uplift models and fixed a regression bug in the MSE/MAE gradients.
- TF-DF 1.9 (March 2024) brought TensorFlow 2.16 compatibility, faster loading for models with many features, and NA condition support in the fast inference engine.
- TF-DF 1.10 (August 2024) tracked TensorFlow 2.17 and shipped macOS build fixes.
- TF-DF 1.11 (October 2024) renamed `LAMBDA_MART_NDCG5` to `LAMBDA_MART_NDCG` with configurable truncation, switched from `UnknownError` to `InvalidArgumentError` for clearer debugging, and aligned with TensorFlow 2.18.
- TF-DF 1.12 (March 2025) added Python 3.12 support, exposed five new hyperparameters for sparse oblique splits, fixed handling of non-Unicode categorical values, and aligned with TensorFlow 2.19. It also updated compatibility with the latest YDF model format.
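
To make the NDCG truncation from release 1.11 concrete, the sketch below computes NDCG at a cutoff `k` using the common exponential-gain definition. It is a plain-Python illustration of the metric, written on the assumption that this textbook form matches what the library optimizes; it is not the library's implementation.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG at k divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of four documents, listed in the order the model ranked them.
ranked = [3, 2, 0, 1]

# With truncation 2 only the top two positions count, and they are ideal.
print(ndcg_at_k(ranked, 2))  # 1.0

# With truncation 4 the swapped last two documents lower the score slightly.
print(ndcg_at_k(ranked, 4))
```

The same ranking therefore scores differently depending on the truncation, which is why a configurable cutoff matters when only the first few results are shown to users.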

The Google team has also shifted some documentation and roadmap energy toward the standalone YDF Python package, which advertises training that is up to five times faster and inference that is up to a thousand times faster than TF-DF for the same models, primarily because YDF avoids TensorFlow's overhead on small datasets [4]. TF-DF remains under active maintenance for users who need the TensorFlow integration.

## TF-DF in production at Google

Google has not published a comprehensive list of products that use Yggdrasil and TF-DF, but the launch post and the KDD 2023 paper note that the underlying engine has served Google production traffic for years and is used across multiple internal products [1][3]. The Simple ML for Sheets add-on, which lets Google Sheets users build classification and regression models with no code, is itself powered by Yggdrasil [7].

## Licensing and availability

Both TF-DF and Yggdrasil are released under the Apache License 2.0 [4][6]. TF-DF is distributed on PyPI as `tensorflow-decision-forests` and supports Python 3.9 through 3.12, on Linux x86-64 and macOS 12+ on ARM64; Windows is supported via WSL [9]. The source repository lives at github.com/tensorflow/decision-forests, with the Yggdrasil C++ core at github.com/google/yggdrasil-decision-forests [4][6].

## References

1. Mathieu Guillame-Bert, Sebastian Bruch, Josh Gordon, Jan Pfeifer. "Introducing TensorFlow Decision Forests." The TensorFlow Blog, May 27, 2021. https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html
2. "Updates: TensorFlow Decision Forests is production ready." The TensorFlow Blog, February 2023. https://blog.tensorflow.org/2023/02/updates-tensorflow-decision-forests-is-production-ready.html
3. Mathieu Guillame-Bert, Sebastian Bruch, Richard Stotz, Jan Pfeifer. "Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023), pp. 4068-4077. https://dl.acm.org/doi/10.1145/3580305.3599933 (preprint: https://arxiv.org/abs/2212.02934)
4. Yggdrasil Decision Forests, GitHub repository. https://github.com/google/yggdrasil-decision-forests
5. "Migrating from Neural Networks." TensorFlow Decision Forests documentation. https://www.tensorflow.org/decision_forests/migration
6. TensorFlow Decision Forests, GitHub repository. https://github.com/tensorflow/decision-forests
7. Simple ML for Sheets, official site. https://simplemlforsheets.com/
8. TensorFlow Decision Forests CHANGELOG. https://github.com/tensorflow/decision-forests/blob/main/CHANGELOG.md
9. tensorflow-decision-forests on PyPI. https://pypi.org/project/tensorflow-decision-forests/
10. "When to Choose CatBoost Over XGBoost or LightGBM." Neptune.ai blog. https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm
11. "The Kaggle Grandmasters Playbook: 7 Battle-Tested Modeling Techniques for Tabular Data." NVIDIA Technical Blog. https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/
12. "Distributed Training." TensorFlow Decision Forests documentation. https://www.tensorflow.org/decision_forests/distributed_training
13. Hong Yi et al. "A Comparison of Decision Forest Inference Platforms from A Database Perspective." arXiv:2302.04430. https://arxiv.org/abs/2302.04430
14. "Supercharge Tree-Based Model Inference with Forest Inference Library in NVIDIA cuML." NVIDIA Technical Blog. https://developer.nvidia.com/blog/supercharge-tree-based-model-inference-with-forest-inference-library-in-nvidia-cuml/

