Tabular models
Last reviewed
May 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 1,578 words
Tabular models are machine learning systems that learn from data arranged in tables, where each row is a sample and each column is a feature. The features are usually heterogeneous, mixing continuous numbers (age, income, a sensor reading), categorical codes (country, product identifier), ordinal levels, and binary flags. Unlike images, text, or audio, the columns of a table have no canonical spatial or temporal ordering, they differ in scale and meaning, and missing values are common. This setting covers a large share of applied machine learning, including credit scoring, fraud detection, churn prediction, demand forecasting, medical risk modeling, and click-through rate prediction.
This article is an overview of the model families used for tabular data and how they relate. The two main supervised tasks have their own detailed articles: tabular classification models predict a discrete class label, and tabular regression models predict a continuous numeric target. Most of the methods below apply to both tasks with only a change of loss function and output head.
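As a rough illustration of the "same method, different loss and output head" point, the sketch below pairs each task with the gradient of its usual loss and the function that maps a raw model score to a prediction. The function names are hypothetical and chosen for this example only; squared error and binary log loss are assumed as the two losses.

```python
import math

def squared_error_gradient(y_true, raw_score):
    """Gradient of 0.5 * (y - f)^2 with respect to the raw score f (regression)."""
    return raw_score - y_true

def logistic_gradient(y_true, raw_score):
    """Gradient of binary log loss with respect to the raw score (logit) f.

    y_true is 0 or 1; the probability is obtained from f with a sigmoid.
    """
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    return prob - y_true

def identity_head(raw_score):
    """Regression output head: the raw score is the prediction."""
    return raw_score

def sigmoid_head(raw_score):
    """Binary classification output head: squash the raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))
```

In both cases the gradient has the form "prediction minus target", which is one reason a single training procedure can serve both tasks.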
For most other data types, deep neural networks have become the default. Tabular data is the main exception. The columns carry no fixed topology, so a model cannot rely on the locality assumptions that make convolutional and sequence architectures effective. Useful interactions between columns tend to be low-order rather than deeply compositional, datasets are often small to medium (hundreds to a few million rows), class or target distributions are frequently skewed, and interpretability and calibration usually matter because the predictions drive consequential decisions. These properties favor models that are sample-efficient, robust to noise and uninformative features, fast to retrain, and easy to inspect. They are also the reason that decision-tree ensembles, rather than neural networks, have dominated the field for most of the last two decades.
The earliest tabular models were linear: ordinary least squares and ridge regression for numeric targets, and logistic regression for classification. These remain the workhorses when interpretability or well-calibrated probabilities are paramount. Penalized variants such as the lasso (L1) and elastic net (L1 plus L2) add variable selection. Other classical baselines include k-nearest neighbors, naive Bayes, support vector machines, and single decision trees (the CART framework of Breiman, Friedman, Olshen, and Stone, 1984). A single tree is high-variance, which is why ensembles of trees, rather than individual trees, became the standard.
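For reference, the penalized linear models above can be written as one objective. The notation here is an assumption made for this sketch (squared-error loss, $n$ samples, coefficient vector $\beta$, penalty weights $\lambda_1, \lambda_2 \ge 0$); libraries differ in how they scale the terms.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - x_i^{\top} \beta \bigr)^2
  \;+\; \lambda_1 \lVert \beta \rVert_1
  \;+\; \frac{\lambda_2}{2} \lVert \beta \rVert_2^2
```

Setting $\lambda_1 = 0$ gives ridge regression, setting $\lambda_2 = 0$ gives the lasso, and keeping both positive gives the elastic net. The L1 term is the one that can drive coefficients exactly to zero, which is why it performs variable selection.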
Bagging (Breiman, 1996) trains many trees on bootstrap samples and averages their predictions. Random forest (Breiman, 2001) adds random feature subsampling at each split, which decorrelates the trees and gives a robust general-purpose model with few hyperparameters and a built-in out-of-bag error estimate. These methods reduce the variance of single trees while keeping their ability to handle mixed feature types and feature interactions; native handling of missing values depends on the implementation.
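A minimal sketch of bagging with an out-of-bag error estimate follows. It is a simplification under stated assumptions: a single numeric feature, depth-1 regression trees (stumps) as base learners, and squared error as the metric. Because there is only one feature, it shows bagging rather than a full random forest, which would also subsample features at each split. All function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # every x is identical: fall back to predicting the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean

def bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample and remember which rows it saw."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        models.append((stump, set(idx)))
    return models

def predict_bagged(models, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s, _ in models) / len(models)

def oob_mse(models, xs, ys):
    """Score each row using only the stumps whose bootstrap sample excluded it."""
    errors = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        preds = [predict_stump(s, x) for s, used in models if i not in used]
        if preds:
            errors.append((sum(preds) / len(preds) - y) ** 2)
    return sum(errors) / len(errors) if errors else float("nan")
```

Each bootstrap sample leaves out roughly a third of the rows, so the out-of-bag score behaves like a validation estimate without a separate holdout set.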
Gradient boosting, formalized by Jerome Friedman (2001), builds an additive ensemble in which each new shallow tree is fit to the negative gradient of a differentiable loss with respect to the current predictions. Three open-source libraries account for most production use and most Kaggle wins on tabular data. XGBoost (Chen and Guestrin, KDD 2016) was the first widely adopted production-grade implementation, adding second-order Newton steps, an L1- and L2-regularized objective, and a sparsity-aware split finder for missing values. LightGBM (Ke et al., NeurIPS 2017) introduced gradient-based one-side sampling and exclusive feature bundling for speed and memory efficiency on large, sparse datasets, and grows trees leaf-wise (best-first) rather than level by level. CatBoost (Prokhorenkova et al., NeurIPS 2018) introduced ordered target statistics to encode categorical features without target leakage, and uses symmetric (oblivious) trees for fast inference. Gradient-boosted trees remain the default baseline for tabular machine learning.
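The core loop of gradient boosting can be sketched in a few lines. The version below is a minimal illustration, not any library's implementation. It assumes squared-error loss (so the negative gradient is simply the residual), a single numeric feature, and depth-1 trees. It omits the regularization, second-order terms, sampling, and categorical handling that XGBoost, LightGBM, and CatBoost add. Function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # every x is identical: fall back to predicting the mean
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean

def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Additive ensemble: start from the mean, then repeatedly fit the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - f)^2, the negative gradient w.r.t. f is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each tree's contribution by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

Swapping in a different differentiable loss only changes how `residuals` is computed, which is what makes the framework general.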
Neural networks for tabular data try to recover the inductive biases that trees obtain almost for free, namely the ability to handle features at different scales and to ignore uninformative columns. A practical line of work in recommendation and online advertising, including Wide and Deep, DeepFM, and Deep and Cross Networks, models low-order feature crosses explicitly. For general tabular tasks, research accelerated after 2017. TabNet (Arik and Pfister, AAAI 2021) uses sequential attention to select features at each decision step. FT-Transformer (Gorishniy et al., NeurIPS 2021) tokenizes both numerical and categorical features and passes them through a standard transformer encoder; the same paper, Revisiting Deep Learning Models for Tabular Data, also showed that a well-tuned residual MLP is a strong baseline and that no neural model dominated XGBoost across all datasets. Other notable architectures include NODE (differentiable oblivious tree ensembles), TabTransformer, and SAINT. A recurring empirical finding is that careful tuning of an MLP recovers most of the gap to more elaborate deep learning designs.
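The tokenization step used by FT-Transformer can be sketched as follows: each numerical feature is scaled by its own weight vector and shifted by a bias, and each categorical feature is replaced by a row of its own embedding table plus a bias, so that every column becomes a token of the same width. This is a pure-Python illustration with randomly initialized, untrained parameters. The class name, argument names, and initialization range are assumptions for this example, and the transformer encoder that would consume the tokens is omitted.

```python
import random

class FeatureTokenizer:
    """Maps one table row to a list of d_token-dimensional tokens, one per feature."""

    def __init__(self, n_numeric, cardinalities, d_token, seed=0):
        rng = random.Random(seed)

        def vec():
            return [rng.uniform(-0.1, 0.1) for _ in range(d_token)]

        # One weight vector and one bias vector per numerical feature.
        self.num_weight = [vec() for _ in range(n_numeric)]
        self.num_bias = [vec() for _ in range(n_numeric)]
        # One embedding table (cardinality x d_token) and one bias per categorical feature.
        self.cat_embed = [[vec() for _ in range(c)] for c in cardinalities]
        self.cat_bias = [vec() for _ in cardinalities]

    def __call__(self, numeric_values, category_codes):
        tokens = []
        # Numerical token: scalar value times the feature's weight vector, plus bias.
        for x, w, b in zip(numeric_values, self.num_weight, self.num_bias):
            tokens.append([x * wi + bi for wi, bi in zip(w, b)])
        # Categorical token: embedding row selected by the integer code, plus bias.
        for code, table, b in zip(category_codes, self.cat_embed, self.cat_bias):
            tokens.append([ei + bi for ei, bi in zip(table[code], b)])
        return tokens

# Usage: a row with two numerical and two categorical columns becomes four tokens.
tokenizer = FeatureTokenizer(n_numeric=2, cardinalities=[3, 5], d_token=4)
row_tokens = tokenizer([0.5, -1.2], [2, 0])
```

Once every column is a token, the columns can be treated as an unordered set, which fits the observation that tables have no canonical column order.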
A newer direction trains one large model on a vast variety of tabular tasks, then deploys it as a frozen in-context predictor that takes a labeled support set as input and outputs predictions without per-dataset training. TabPFN (Hollmann et al., ICLR 2023), a transformer trained on millions of synthetic tasks from a structural causal model prior, was the first widely cited example; it ran in under a second on small numerical classification problems. TabPFN v2 (Hollmann et al., Nature, 2025) added native categorical and missing-value handling, scaled to roughly 10,000 rows and 500 features, and extended the approach to regression, matching or surpassing tuned gradient-boosting ensembles on small to medium datasets. Related efforts include TabDPT, trained on real OpenML datasets rather than synthetic data.
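The in-context usage pattern, where "fitting" only stores the labeled support set and all computation happens at prediction time, can be illustrated with a toy stand-in. The class below is not TabPFN and does not reflect its architecture or API. It replaces the pretrained transformer with a simple attention-like weighting over support rows (a softmax over negative squared distances), purely to show the interface. The class name and the `temperature` parameter are hypothetical.

```python
import math

class InContextClassifier:
    """Toy stand-in for a frozen in-context predictor; nothing is trained per dataset."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def fit(self, support_rows, support_labels):
        # No parameter updates: the support set is simply kept as context.
        self.rows = [list(r) for r in support_rows]
        self.labels = list(support_labels)
        self.classes = sorted(set(self.labels))
        return self

    def predict_proba(self, query):
        # Attention-like weights: softmax over negative squared distances.
        scores = [
            -sum((q - s) ** 2 for q, s in zip(query, row)) / self.temperature
            for row in self.rows
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return {
            c: sum(w for w, y in zip(weights, self.labels) if y == c) / total
            for c in self.classes
        }

    def predict(self, query):
        proba = self.predict_proba(query)
        return max(proba, key=proba.get)
```

In a real tabular foundation model the weighting is produced by a network pretrained on many tasks rather than by a fixed distance rule, but the calling pattern of supplying labeled rows as context and then querying is the same idea.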
The table below compares the main families on the properties that matter for tabular work.
| Family | Representative methods | Native categorical and missing handling | Typical strength | Main limitation |
|---|---|---|---|---|
| Linear and classical | Logistic regression, ridge, lasso, SVM, k-NN | No (needs encoding and imputation) | Interpretable, calibrated, fast | Only linear or low-flexibility boundaries |
| Random forests | Random forest, Extra Trees | Partial (handles mixed types, splits on codes) | Robust default, few hyperparameters | Weaker on smooth or linear signals |
| Gradient-boosted trees | XGBoost, LightGBM, CatBoost | Yes (LightGBM, CatBoost; XGBoost since 1.5) | State of the art on most tabular tasks | Needs tuning; less suited to very wide data |
| Deep learning | TabNet, FT-Transformer, NODE, SAINT, MLP | Yes (learned embeddings) | Flexible, integrates with other modalities | Rarely beats boosting at equal tuning budget |
| Foundation models | TabPFN, TabPFN v2, TabDPT | Yes (v2 onward) | Strong on small data, no per-dataset tuning | Limited row and feature counts so far |
Two influential 2022 studies crystallized the case that gradient-boosted trees still beat deep learning on typical tabular data when the comparison is fair. Grinsztajn, Oyallon, and Varoquaux (NeurIPS 2022) built a benchmark of 45 medium-sized datasets and, with equal tuning budgets, found that random forests and gradient boosting consistently outperformed FT-Transformer, ResNet, and MLPs on both classification and regression. They attributed the gap to three structural properties of trees: robustness to uninformative features, the absence of rotational invariance (a tree splits on one column at a time, so it preserves the meaning of individual features), and the ability to fit irregular, non-smooth target functions. Shwartz-Ziv and Armon (Information Fusion, 2022) reproduced several published deep models and found that they did not generalize well beyond the datasets chosen by their authors, while tuned XGBoost won on most tasks; an ensemble of XGBoost with neural models was often the strongest.
The picture as of 2025 is roughly as follows: tuned gradient boosting remains the default winner on medium to large tabular data, the small-data regime is increasingly contested by TabPFN-style foundation models, and ensembles of boosting with at least one neural component often win competitions.
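Combining a boosted model with a neural model is often done by simply averaging their predicted probabilities. The sketch below shows one plausible way to do that for binary classification, with the blend weight chosen on a validation set. It assumes that each model has already produced positive-class probabilities, and the function names are hypothetical.

```python
import math

def blend(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two models' predicted positive-class probabilities."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]

def log_loss(y_true, probs, eps=1e-12):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def best_blend_weight(y_val, prob_a, prob_b, steps=20):
    """Pick the blend weight that minimizes validation log loss on a coarse grid."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_val, blend(prob_a, prob_b, w)))
```

Averaging helps most when the two models make different kinds of errors, which is one common explanation for why tree and neural models combine well.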