Tabular Classification Models
Last reviewed
May 13, 2026
Sources
53 citations
Review status
Source-backed
Revision
v2 · 6,595 words
See also: Tabular Models and Tasks
Tabular classification models are machine learning systems that predict a discrete class label for each row of a tabular dataset, where rows are samples and columns are heterogeneous features (numeric, ordinal, categorical, or binary). This task setting covers a large share of practical applications of machine learning, including credit scoring, fraud detection, churn prediction, medical diagnosis, click-through rate prediction, and operational risk modeling. Unlike image, text, audio, or sequence data, tabular features have no canonical spatial or temporal ordering, the columns often differ in semantics and scale, missing values are common, and informative interactions between columns are usually low-order rather than deeply compositional.
For most of the last two decades the dominant family of tabular classifiers has been gradient boosted decision trees (GBDT), implemented by XGBoost, LightGBM, and CatBoost. These libraries have won a majority of Kaggle competitions on tabular data and remain the default baseline in industry. Since 2017 there has been an active research program to design neural network architectures that match or surpass GBDT on tabular data, producing models such as TabNet, FT-Transformer, SAINT, NODE, and TabTransformer. From 2022 onward, a parallel direction has emerged: pre-trained foundation models for tabular data, most notably TabPFN, which performs in-context classification without per-dataset gradient training, and the larger TabPFN v2, published in Nature in January 2025.
Tabular classification problems are characterized by a few recurring properties that shape model design. Datasets are usually small to medium in size, from a few hundred to a few million rows, with anywhere from a handful to a few thousand columns. Features are heterogeneous, mixing continuous variables with categorical codes, binary flags, and ordinal levels. Class distributions are often imbalanced, and feature distributions can shift between training and deployment. Interpretability and calibration matter, because tabular models often drive consequential decisions in finance, healthcare, and policy. These constraints favor models that are sample-efficient, robust to noise and missing values, fast to retrain, and easy to inspect.
Classical statistical methods such as logistic regression, naive Bayes, and linear discriminant analysis provided the first baselines and remain widely used when interpretability or calibration is paramount. Tree-based ensembles such as random forest and gradient boosted decision trees offer strong predictive accuracy with minimal tuning. Deep neural networks for tabular data attempt to recover the inductive biases that trees obtain almost for free, especially the ability to handle features at different scales and to ignore uninformative columns. Foundation models for tabular data try to amortize the entire training procedure into a single pre-training run.
In the supervised tabular classification setting, the training set is a collection of pairs $(x_i, y_i)$ where $x_i$ is a feature vector of $d$ mixed-type columns (written $x_i \in \mathbb{R}^d$ once categorical values have been encoded numerically) and $y_i$ is a discrete label drawn from a finite set of classes. The goal is to learn a function that maps a new feature vector to a probability distribution over classes, or to a single predicted class. Binary classification has two classes; multiclass classification has three or more mutually exclusive classes; multilabel classification allows each sample to belong to several classes simultaneously and is typically reduced to a set of binary problems.
Features may be continuous (age, income, sensor reading), categorical with low cardinality (sex, country code), categorical with high cardinality (zip code, product ID), ordinal (education level, satisfaction score), boolean, or text fragments that have been hashed or embedded. The data is usually stored as a single table or as a small set of joined tables, hence the name. The defining feature is that no a priori topology or sequence is imposed on the columns. Permuting the columns leaves the prediction problem unchanged after appropriate renaming.
Training a tabular classifier typically involves splitting the data into training, validation, and test partitions; encoding categorical and missing values; choosing a model class and a loss function (most often cross-entropy, also called log-loss); optimizing hyperparameters by cross-validation or on a held-out validation set; and reporting performance on a test set using metrics such as accuracy, F1 score, precision and recall, area under the ROC curve (AUROC), log-loss, and Brier score.
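The workflow above can be sketched with scikit-learn. This is a minimal illustration, assuming a synthetic dataset from `make_classification` in place of a real table and a logistic regression as the model class; the grid of `C` values and the split sizes are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary task standing in for a real table (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)

# Hold out a test set, then carve a validation set out of the remainder.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

# Choose the regularization strength using the validation set only.
best_C, best_val_loss = None, np.inf
for C in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_loss = log_loss(y_val, model.predict_proba(X_val)[:, 1])
    if val_loss < best_val_loss:
        best_C, best_val_loss = C, val_loss

# Refit on train + validation and report once on the untouched test set.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=1000))
final_model.fit(X_trainval, y_trainval)
p_test = final_model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, (p_test >= 0.5).astype(int)),
    "auroc": roc_auc_score(y_test, p_test),
    "log_loss": log_loss(y_test, p_test),
    "brier": brier_score_loss(y_test, p_test),
}
print(metrics)
```

Keeping the test partition out of the tuning loop is the design choice that matters here; the specific model and metrics can be swapped freely.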
The earliest practical tabular classifiers were linear models. Logistic regression, formalized by David Cox in 1958, fits a sigmoid of a linear combination of features to estimate the probability of a binary outcome and remains the workhorse for credit scoring and medical risk modeling. Multinomial logistic regression extends the same idea to multiclass problems through a softmax. Linear and quadratic discriminant analysis, developed by Ronald Fisher in 1936 and later generalized, assume class-conditional Gaussian distributions and yield closed-form decision boundaries.
Naive Bayes classifiers, popularized by Maron and Kuhns in the 1960s for document classification, assume conditional independence of features given the class. Despite the strong independence assumption these classifiers often perform surprisingly well, especially with high-dimensional sparse inputs.
Decision trees, introduced by Breiman, Friedman, Olshen, and Stone in their 1984 book Classification and Regression Trees (the CART algorithm), recursively partition the feature space using axis-aligned splits chosen to maximize an impurity reduction (Gini impurity or entropy). C4.5, developed by Ross Quinlan in 1993, refined the same idea with a different splitting criterion and handling of continuous features. Trees natively handle mixed feature types, missing values, and feature interactions, but a single tree is high-variance.
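As a rough illustration of how an axis-aligned split is chosen, the sketch below scores every candidate threshold on a single numeric feature by its reduction in Gini impurity. The function names and the toy `ages` / `defaulted` data are hypothetical, and real implementations sort once and update class counts incrementally rather than rescanning the data for each threshold.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, impurity_reduction) for the best split `value <= threshold`."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data (made up): the classes separate cleanly between ages 31 and 38.
ages = [22, 25, 31, 38, 45, 52]
defaulted = [1, 1, 1, 0, 0, 0]
threshold, gain = best_threshold(ages, defaulted)
print(threshold, gain)  # 34.5 0.5
```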
Ensembles of trees were the next leap. Bagging, proposed by Leo Breiman in 1996, trains many trees on bootstrap samples and averages their predictions. Random forest, also by Breiman in 2001, adds a feature subsampling rule at each split, producing a strong all-purpose classifier with only two important hyperparameters and a built-in out-of-bag error estimate. Extra Trees, by Pierre Geurts and colleagues in 2006, randomize the split thresholds as well.
Support vector machines, formalized by Vapnik and colleagues in the 1990s, maximize the margin between classes in a feature space induced by a kernel function. k-nearest neighbors classification, dating back to Cover and Hart in 1967, predicts the majority class among the k closest training points under a chosen metric. Both methods remain useful baselines, particularly on small datasets.
The table below summarizes the most common classical classifiers used on tabular data.
| Algorithm | Year | Key reference | Strengths | Typical weaknesses |
|---|---|---|---|---|
| Logistic regression | 1958 | Cox | Calibrated probabilities, interpretable coefficients | Linear decision boundary |
| Linear discriminant analysis | 1936 | Fisher | Closed-form solution, fast | Assumes Gaussian class conditionals |
| Naive Bayes | 1960s | Maron and Kuhns | Very fast, works with sparse features | Independence assumption rarely holds |
| k-nearest neighbors | 1967 | Cover and Hart | No training phase, simple | Slow at inference, sensitive to scaling |
| CART | 1984 | Breiman et al. | Handles mixed types, interpretable | High variance |
| C4.5 | 1993 | Quinlan | Information gain ratio, prunes | High variance |
| Support vector machine | 1995 | Cortes and Vapnik | Strong margins, kernels | Scales poorly past tens of thousands of rows |
| Bagging | 1996 | Breiman | Lower variance than single tree | Lower interpretability |
| Random forest | 2001 | Breiman | Robust default, OOB error | Larger memory, weaker on linear signals |
| Extra Trees | 2006 | Geurts et al. | Faster fits than random forest | Slightly weaker on noisy data |
Gradient boosting builds an additive ensemble of weak learners (usually shallow regression trees), where each new tree is fit to the negative gradient of a differentiable loss with respect to the current ensemble's predictions. The technique was formalized by Jerome Friedman in his 2001 Annals of Statistics paper Greedy Function Approximation: A Gradient Boosting Machine and a 2002 follow-up that introduced stochastic subsampling. Variants of this idea now dominate tabular classification benchmarks.
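The additive procedure can be sketched in a few lines for binary log-loss. This is a simplified illustration, not Friedman's full algorithm: it takes plain gradient steps with a fixed learning rate and omits the per-leaf line search, subsampling, and regularization that production libraries add. The synthetic data and all function names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit an additive tree ensemble for binary log-loss; returns (initial_score, trees)."""
    # Start from the constant log-odds of the positive class.
    p0 = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    f0 = np.log(p0 / (1 - p0))
    scores = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # The negative gradient of log-loss with respect to the raw score is y - p.
        residual = y - sigmoid(scores)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        scores += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_proba(X, f0, trees, learning_rate=0.1):
    scores = f0 + learning_rate * sum(tree.predict(X) for tree in trees)
    return sigmoid(scores)

# Synthetic task with an interaction between the first two columns (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = ((X[:, 0] * X[:, 1] > 0) | (X[:, 2] > 1)).astype(float)

f0, trees = fit_gradient_boosting(X, y)
baseline_loss = mean_log_loss(y, np.full(len(y), y.mean()))
final_loss = mean_log_loss(y, predict_proba(X, f0, trees))
print(baseline_loss, final_loss)
```

Each tree is a regression tree even though the task is classification, because it is fit to real-valued gradients rather than to class labels.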
XGBoost, first released in 2014 and described by Tianqi Chen and Carlos Guestrin at KDD 2016, was the first widely adopted implementation engineered for speed and scale. It added second-order gradient information (Newton steps), a regularized objective with explicit L1 and L2 penalties on leaf weights, a sparsity-aware split finder for missing values, weighted quantile sketching for approximate split candidates, cache-aware block storage, and parallel split finding across CPU cores. The KDD paper reported that 17 of the 29 winning solutions published on Kaggle's blog during 2015 used XGBoost, and the algorithm remains one of the most cited baselines in tabular machine learning research.
LightGBM, released by Microsoft Research and announced by Guolin Ke and colleagues at NeurIPS 2017, introduced two ideas that made GBDT both faster and more memory-efficient on large datasets. Gradient-based One-Side Sampling (GOSS) keeps all samples with large gradients and randomly subsamples the rest, focusing computation on samples that are still hard to fit. Exclusive Feature Bundling (EFB) merges mutually exclusive sparse features into a single bundle, reducing the effective dimensionality of categorical inputs after one-hot encoding. LightGBM also uses histogram-based binning of continuous features and grows trees leaf-wise (best-first) rather than level-wise, which often yields lower training loss for the same tree count.
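A minimal sketch of the GOSS sampling step follows, assuming the per-row gradients have already been computed. The parameter names `top_rate` and `other_rate` mirror LightGBM's options, but the function itself is a hypothetical illustration and not the library's code. The sampled small-gradient rows are up-weighted by (1 - top_rate) / other_rate, as in the LightGBM paper, so that gradient sums stay approximately unbiased.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (row_indices, row_weights) selected by one-side sampling."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    # Rank rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]
    # Uniformly subsample the remaining small-gradient rows.
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    # Compensate for the subsampling so gradient statistics remain comparable.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights

gradients = np.random.default_rng(0).normal(size=1000)
idx, weights = goss_sample(gradients, rng=np.random.default_rng(1))
print(len(idx), weights[0], weights[-1])
```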
CatBoost, developed by Yandex and presented by Liudmila Prokhorenkova and colleagues at NeurIPS 2018, focused on principled handling of categorical features and on preventing target leakage during the encoding of those features. It introduced ordered target statistics, an encoding scheme that computes the running mean of the target for each category using only earlier rows in a random permutation, removing the bias that plain target encoding can introduce. CatBoost also uses oblivious decision trees (the same split is applied at every node of a given depth), which gives compact models and fast inference, and a technique called ordered boosting that further reduces target leakage at the cost of additional bookkeeping.
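The ordered target statistic can be sketched as a single pass over a random permutation, where each row is encoded from the targets of earlier rows in the same category only. This is an illustrative simplification: CatBoost itself uses several permutations and its own prior handling, and the `prior` and `prior_weight` arguments here are assumptions of the sketch.

```python
import numpy as np

def ordered_target_statistics(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    encoded = np.empty(len(targets))
    sums, counts = {}, {}
    for i in rng.permutation(len(targets)):
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed running mean that excludes the current row's own target.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now does the current row become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded

cats = ["a", "b", "a", "a", "b", "c", "a", "b"]
t = np.array([1, 0, 1, 0, 0, 1, 1, 1], dtype=float)
encoded = ordered_target_statistics(cats, t, prior=0.5)
print(encoded)
```

Because a row never contributes to its own encoding, changing that row's label leaves its encoded value unchanged, which is the leakage property the scheme is designed for.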
HistGradientBoostingClassifier, added to scikit-learn in version 0.21 (2019) as an experimental estimator, is a Python and Cython implementation modeled on LightGBM. It is often slower than the dedicated libraries but ships with scikit-learn and is therefore the most accessible GBDT for many practitioners.
The table below compares the three production GBDT libraries on the features most relevant to tabular classification.
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Initial release | 2014 | 2016 | 2017 |
| Paper venue | KDD 2016 | NeurIPS 2017 | NeurIPS 2018 |
| Lead authors | Chen, Guestrin | Ke et al. (Microsoft) | Prokhorenkova et al. (Yandex) |
| Tree growth | Level-wise (default), also leaf-wise | Leaf-wise | Oblivious symmetric |
| Histogram binning | Yes (since 1.0) | Yes (default) | Yes |
| Native categorical handling | Limited (since 1.5) | Yes | Yes (ordered target statistics) |
| Missing value handling | Learned default direction (sparsity-aware split) | Learned default direction | Treated as minimum value by default |
| GPU support | Yes | Yes | Yes |
| Distributed training | Yes (Dask, Spark, Ray) | Yes (MPI, Dask) | Yes |
| Default loss for classification | Logistic, softmax | Logistic, softmax | Logistic, multiclass |
In industry practice, XGBoost is the most common default, LightGBM is preferred on very large datasets with sparse features, and CatBoost tends to be the most ergonomic on datasets with many high-cardinality categorical columns and minimal manual feature engineering.
Neural network classifiers for tabular data go back at least to the early 1990s, when single hidden layer MLPs were already competing with logistic regression on credit datasets. Modern neural tabular models try to do three things at once: handle mixed input types without manual encoding, learn feature interactions automatically, and remain competitive with GBDT on small to medium datasets.
One practical line of work, often called factorization machines, originated with Steffen Rendle's Factorization Machines paper in 2010. The method models pairwise feature interactions through low-rank embeddings. Field-aware Factorization Machines, by Yuchin Juan and colleagues in 2016, generalized the idea by giving each feature a different embedding for each interaction field, which improved click-through rate prediction noticeably.
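For a degree-2 factorization machine, the score is $w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, and the pairwise term can be computed in time linear in the number of features using the identity from Rendle's paper. The sketch below checks the fast form against the naive double loop; the random parameters are placeholders for illustration, not a trained model.

```python
import numpy as np

def fm_score_naive(x, w0, w, V):
    """Degree-2 factorization machine score via an explicit loop over feature pairs."""
    d = len(x)
    score = w0 + w @ x
    for i in range(d):
        for j in range(i + 1, d):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score

def fm_score_fast(x, w0, w, V):
    """Same score in O(d * k) using 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    linear_in_factors = V.T @ x            # shape (k,)
    squared_terms = (V ** 2).T @ (x ** 2)  # shape (k,)
    return w0 + w @ x + 0.5 * np.sum(linear_in_factors ** 2 - squared_terms)

rng = np.random.default_rng(0)
d, k = 6, 3                      # number of features, embedding rank
x = rng.normal(size=d)
w0, w = 0.1, rng.normal(size=d)
V = rng.normal(size=(d, k))      # one rank-k embedding per feature

naive = fm_score_naive(x, w0, w, V)
fast = fm_score_fast(x, w0, w, V)
print(naive, fast)
```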
Wide and Deep Learning, introduced by Heng-Tze Cheng and colleagues at Google in 2016, combined a wide linear model that memorizes sparse cross-features with a deep MLP that generalizes via embeddings. DeepFM, by Huifeng Guo and colleagues in 2017, replaced the wide part with a factorization machine, giving a more principled model of low-order interactions. Deep and Cross Networks (DCN) by Ruoxi Wang and colleagues in 2017 and DCN-V2 in 2020 used explicit cross layers that compute feature crosses of bounded degree. AutoInt, by Weiping Song and colleagues in 2019, applied self-attention to feature embeddings to model arbitrary higher-order interactions. These models grew out of recommendation and online advertising contexts and are still the dominant neural architecture for click-through rate prediction.
Research directed at general tabular classification, rather than click-through rate prediction, took off after 2017.
TabNet, proposed by Sercan Arik and Tomas Pfister of Google Cloud at AAAI 2021, uses sequential attention to select features at each decision step, mimicking the way a tree splits on one feature at a time while remaining differentiable. TabNet supports both supervised training and an unsupervised pre-training task in which masked features are reconstructed, which can help on small labeled datasets.
TabTransformer, by Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin at Amazon in 2020, applied transformer encoder layers to embeddings of categorical features while leaving continuous features in a separate path. The model was particularly useful when the dataset had many high-cardinality categorical columns.
FT-Transformer, by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko at NeurIPS 2021, simplified the picture by treating both categorical and numerical features uniformly as tokens, passing them through a stack of standard transformer encoder blocks, and reading out a classification head from a special [CLS] token. The same paper, Revisiting Deep Learning Models for Tabular Data, also introduced a strong residual MLP baseline (ResNet for tabular) and reported careful benchmark comparisons with GBDT, finding that no neural model dominated XGBoost across all datasets.
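The feature tokenizer idea can be sketched with NumPy alone: each numerical feature is scaled by its own learned vector, each categorical feature looks up an embedding, and a [CLS] token is added to the sequence. This is a hedged illustration of the tokenization step only; the weights are random rather than trained, the transformer blocks are omitted, and the dimensions and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_token = 8                      # token width (assumed)
n_num = 3                        # number of numerical columns (assumed)
cat_cardinalities = [4, 6]       # levels per categorical column (assumed)

# In a real model these would be trainable parameters; here they are random.
W_num = rng.normal(size=(n_num, d_token))
b_num = np.zeros((n_num, d_token))
cat_embeddings = [rng.normal(size=(card, d_token)) for card in cat_cardinalities]
b_cat = np.zeros((len(cat_cardinalities), d_token))
cls_token = rng.normal(size=(d_token,))

def tokenize(x_num, x_cat):
    """Map (batch, n_num) floats and (batch, n_cat) integer codes to a token sequence."""
    # Numerical feature j becomes x_j * W_num[j] + b_num[j].
    num_tokens = x_num[:, :, None] * W_num[None] + b_num[None]
    # Categorical feature j becomes an embedding lookup plus a per-feature bias.
    cat_tokens = np.stack(
        [emb[x_cat[:, j]] for j, emb in enumerate(cat_embeddings)], axis=1
    ) + b_cat[None]
    cls = np.broadcast_to(cls_token, (x_num.shape[0], 1, d_token))
    return np.concatenate([cls, num_tokens, cat_tokens], axis=1)

x_num = rng.normal(size=(5, n_num))
x_cat = np.column_stack([rng.integers(0, card, size=5) for card in cat_cardinalities])
tokens = tokenize(x_num, x_cat)
print(tokens.shape)  # (5, 6, 8): [CLS] + 3 numerical + 2 categorical tokens
```

The resulting sequence has one token per column, which is what lets a standard transformer encoder attend across features.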
SAINT (Self-Attention and Intersample Attention Transformer), by Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C. Bayan Bruss, and Tom Goldstein in 2021, added a second attention dimension across samples within a minibatch, allowing the model to compare a query row to other rows from the same batch. The method also used contrastive pre-training similar to that of SimCLR.
NODE (Neural Oblivious Decision Ensembles), by Sergei Popov, Stanislav Morozov, and Artem Babenko at Yandex in 2019, designed a differentiable analog of oblivious decision trees that could be trained end-to-end with gradient descent. Mixture-of-experts variants and tree-structured architectures such as DeepGBM and GrowNet have explored related ideas.
The table below summarizes the most cited neural tabular classifiers.
| Model | Year | Authors | Core idea |
|---|---|---|---|
| Wide and Deep | 2016 | Cheng et al. (Google) | Joint wide linear + deep MLP |
| DeepFM | 2017 | Guo et al. | Factorization machine + deep network |
| DCN | 2017 | Wang et al. | Explicit cross layers of bounded degree |
| NODE | 2019 | Popov, Morozov, Babenko | Differentiable oblivious tree ensemble |
| AutoInt | 2019 | Song et al. | Self-attention over feature embeddings |
| TabTransformer | 2020 | Huang et al. (Amazon) | Transformer encoder over categorical embeddings |
| DCN-V2 | 2020 | Wang et al. | Improved cross network for production CTR |
| TabNet | 2021 | Arik and Pfister (Google) | Sequential attention with sparse feature selection |
| FT-Transformer | 2021 | Gorishniy et al. (Yandex) | Feature tokenizer + standard transformer encoder |
| ResNet for tabular | 2021 | Gorishniy et al. | Residual MLP baseline |
| SAINT | 2021 | Somepalli et al. | Attention across columns and across samples |
Despite the proliferation of architectures, empirical studies have consistently found that careful tuning of an MLP or a residual MLP recovers most of the gap to the more elaborate models. Hyperparameter tuning budgets are a confounder: when GBDT and neural models receive the same amount of tuning, GBDT typically wins on small and medium tabular datasets.
A newer line of work treats tabular classification the way GPT treats text: train one large model once on a huge variety of synthetic or real tabular tasks, then deploy it as a frozen in-context predictor that takes a labeled support set as input and outputs predictions for a query set without any per-dataset gradient updates.
TabPFN (Tabular Prior-data Fitted Network), introduced by Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter at ICLR 2023, was the first widely cited model of this type. TabPFN is a transformer that has been trained on millions of synthetic classification tasks generated from a structural causal model prior. At inference time the entire training set of the target task is concatenated with the query points and fed through the network in a single forward pass; the model approximates Bayesian inference, that is, the posterior predictive distribution, in the function space defined by its synthetic prior. The original TabPFN was limited to roughly 1,000 training rows, 100 features, and 10 classes, and to purely numerical inputs, but within those constraints it matched or beat tuned GBDT and AutoML systems while running in under a second on a single GPU. The paper was selected as a notable (top 25%) paper at ICLR 2023.
TabPFN v2, by Noah Hollmann and colleagues at the University of Freiburg and Prior Labs, was published in Nature in January 2025 under the title Accurate predictions on small data with a tabular foundation model. The second version increased the parameter count, extended the model to handle categorical features and missing values natively, and scaled to roughly 10,000 training rows and 500 features. The Nature paper reported that TabPFN v2 matched or surpassed tuned ensembles of XGBoost, CatBoost, and LightGBM on a broad benchmark of small to medium classification datasets, often after only a few seconds of inference and with no per-dataset hyperparameter tuning. Prior Labs released the weights and a Python package under a license based on Apache 2.0 with an added attribution requirement.
TabDPT (Tabular Discriminative Pre-trained Transformer), released by Layer 6 AI in September 2024, follows the same in-context learning recipe but is trained on a curated mix of real OpenML datasets rather than purely synthetic data. It targets the same regime of small to medium tabular tasks.
TabLLM, by Stefan Hegselmann and colleagues at AISTATS 2023, took a different path: it serializes a row of a table into a natural language sentence and asks a pre-trained large language model to classify the row in zero-shot or few-shot mode. TabLLM is most useful when the column names are semantically meaningful, because the language model can transfer prior knowledge from text. Performance saturates quickly compared to GBDT on datasets with thousands of rows but can be competitive in very-low-shot regimes.
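A minimal sketch of the row-to-text serialization, assuming a simple "The <column> is <value>." template. The example row, the question, and the `build_prompt` helper are hypothetical and only illustrate the shape of the prompt; the call to a language model is deliberately left out.

```python
def serialize_row(row):
    """Turn a mapping of column name to value into one natural-language string."""
    return " ".join(f"The {column} is {value}." for column, value in row.items())

def build_prompt(row, question, answers):
    """Combine the serialized row with a task question and the allowed answers."""
    options = " or ".join(answers)
    return f"{serialize_row(row)}\n\n{question} Answer with {options}."

# Hypothetical row from a census-style table.
row = {"age": 42, "occupation": "teacher", "hours per week": 38}
text = serialize_row(row)
prompt = build_prompt(row, "Does this person earn more than 50000 dollars per year?", ["yes", "no"])
print(prompt)
```

Meaningful column names matter for this approach because they are the only place where the language model's prior knowledge can attach to the table.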
GReaT (Generation of Realistic Tabular data), by Vadim Borisov and colleagues in 2023, used pre-trained language models for tabular generation rather than classification, but its serialization scheme has been adapted for downstream classification work. Earlier related efforts include TabBERT and TabularBERT for transactional and event data.
The table below summarizes the main tabular foundation model approaches.
| Model | Year | Authors | Approach |
|---|---|---|---|
| TabPFN | 2023 | Hollmann, Müller, Eggensperger, Hutter | In-context Bayesian classifier, synthetic SCM prior |
| TabPFN v2 | 2025 | Hollmann et al. (Prior Labs) | Larger in-context model, categorical and missing support, Nature 2025 |
| TabDPT | 2024 | Layer 6 AI | In-context transformer, real OpenML pre-training |
| TabLLM | 2023 | Hegselmann et al. | Serialize rows as text, classify with LLM |
| GReaT | 2023 | Borisov et al. | Pre-trained LM for tabular generation |
| TabBERT | 2021 | Padhi et al. (IBM) | BERT-style pre-training on transactional rows |
A recurring theme in the foundation model literature is that pre-trained in-context predictors are well suited to the small-data regime, where GBDT's historical lead has relied heavily on per-dataset hyperparameter tuning. As of 2025, none of these models has displaced GBDT on wide tables with many thousands of features or on very large tables with millions of rows.
Two influential 2022 papers crystallized the case that gradient boosted trees still beat deep learning on tabular data when the comparison is fair.
Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux published Why do tree-based models still outperform deep learning on tabular data? at NeurIPS 2022. The paper built a benchmark of 45 medium-sized tabular datasets curated to control for issues that plagued earlier comparisons (uninformative features, regression versus classification mix, dataset size). With identical hyperparameter tuning budgets, the authors found that random forest, XGBoost, and scikit-learn's gradient boosting trees consistently outperformed FT-Transformer, ResNet, and MLPs. They attributed the gap to three inductive biases of trees: robustness to uninformative features through embedded feature selection; sensitivity to the original orientation of the feature axes (tabular columns carry individual meaning, so rotation-invariant learners such as MLPs discard useful structure that axis-aligned trees exploit); and the ability to fit irregular, non-smooth target functions without a bias toward smooth approximations.
Ravid Shwartz-Ziv and Amitai Armon published Tabular Data: Deep Learning Is Not All You Need in Information Fusion in 2022 (originally an arXiv preprint in 2021). They reproduced the published results of TabNet, NODE, DNF-Net, and 1D-CNN on the datasets from their original papers and on a held-out set of additional benchmarks. The headline finding was that the deep models did not generalize beyond their authors' chosen datasets, while an XGBoost ensemble tuned with the same compute won on most of them. The paper also showed that an ensemble combining XGBoost with one or more neural models was usually the strongest, suggesting complementary inductive biases.
Subsequent work has refined the picture rather than overturning it. Borisov and colleagues' 2022 survey Deep Neural Networks and Tabular Data: A Survey in IEEE TNNLS reached similar conclusions. McElfresh and colleagues' TabZilla benchmark at NeurIPS 2023 (When Do Neural Nets Outperform Boosted Trees on Tabular Data?) examined 176 datasets and found that the answer depends on dataset characteristics, with neural networks closing or reversing the gap on a minority of tasks, especially those with many high-cardinality categorical features and moderate sample sizes. The Hollmann 2025 Nature paper on TabPFN v2 then reopened the question for the small-data regime specifically, where in-context learning offers a different kind of advantage than gradient boosting.
A fair summary of the literature in 2025: tuned gradient boosting remains the default winner on medium to large tabular classification tasks; small-data regimes are increasingly contested by TabPFN v2; ensembles of GBDT with at least one neural component often beat either alone in competitions.
Benchmarks for tabular classification draw on three main sources. The first is the UCI Machine Learning Repository, founded by David Aha at UC Irvine in 1987, which hosts many of the oldest reference datasets including Iris (150 rows, 4 features, 3 classes), Adult (Census income, 48,842 rows, 14 features), Covertype (581,012 rows, 54 features, 7 classes), Bank Marketing, Wine Quality, and Higgs. UCI remains the canonical source even though it has been criticized for the small sample sizes and the cumulative effect of leaderboard chasing.
The second source is OpenML, a community platform founded by Joaquin Vanschoren in 2014 that hosts datasets, task definitions, and machine learning runs in a reproducible format. The OpenML CC-18 benchmark, defined by Bischl, Casalicchio, Feurer, and colleagues in 2017, is a curated suite of 72 classification datasets chosen to be diverse, well-behaved, and free from obvious leakage. The OpenML-CTR23 suite is a regression analog. OpenML's larger AutoML benchmarks include hundreds of datasets.
The third source is Kaggle, which hosts both academic-style competitions on cleaned datasets and industry challenges with messy production data. Kaggle Tabular Playground, run roughly monthly since 2021, has provided a steady stream of tabular classification problems. The cumulative Kaggle leaderboard for tabular competitions remains dominated by XGBoost, LightGBM, and CatBoost.
The table below lists these sources alongside more recent benchmarks.
| Benchmark | Year | Curators | Scope |
|---|---|---|---|
| UCI ML Repository | 1987 | Aha, Bay, others (UC Irvine) | Hundreds of small to medium classical datasets |
| OpenML CC-18 | 2017 | Bischl et al. | 72 curated classification tasks |
| OpenML AutoML Benchmark | 2019 | Gijsbers et al. | 39 classification tasks for AutoML |
| AMLB v2 | 2024 | Gijsbers et al. | Expanded AutoML benchmark, 104 datasets |
| Grinsztajn benchmark | 2022 | Grinsztajn, Oyallon, Varoquaux | 45 medium tabular datasets, tree vs deep comparison |
| TabZilla | 2023 | McElfresh et al. | 176 datasets, model-by-dataset performance map |
| TabBench / TALENT | 2024 | Various | Larger collections for tabular foundation models |
Metrics most often reported are accuracy and log-loss for balanced multiclass problems, AUROC and average precision for imbalanced binary tasks, and macro F1 for multiclass problems with class imbalance. Calibration is reported with Brier score or expected calibration error.
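The two calibration metrics can be computed directly. The sketch below is one common binary variant of expected calibration error, binning on the predicted positive-class probability with equal-width bins; other definitions bin on the confidence of the predicted class, so the number of bins and the binning scheme should be treated as assumptions of this example.

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return float(np.mean((p - y_true) ** 2))

def expected_calibration_error(y_true, p, n_bins=10):
    """Weighted average gap between observed frequency and mean prediction per bin."""
    # Assign each prediction to an equal-width bin; p == 1.0 goes in the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Toy predictions (made up): slightly underconfident on both classes.
y_true = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.1, 0.9, 0.9])
brier = brier_score(y_true, p)
ece = expected_calibration_error(y_true, p)
print(brier, ece)
```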
AutoML systems automate the model selection, hyperparameter tuning, and ensembling steps that drive tabular classification performance in competitions and in industry. Most major systems target tabular problems first because the search space is well understood and the building blocks are mature.
Auto-sklearn, by Matthias Feurer, Aaron Klein, and colleagues at the University of Freiburg, won the first ChaLearn AutoML Challenge (2015-2016). It uses Bayesian optimization (SMAC) over the scikit-learn pipeline space, meta-learning from previous datasets to warm-start the search, and post-hoc ensembling of the best models found.
H2O AutoML, part of the open-source H2O platform from H2O.ai, trains XGBoost, GBM, deep learning, GLM, and random forest models and stacks them with a meta-learner. It is widely used in industry, particularly in finance and insurance.
AutoGluon, released by Amazon Web Services in 2020 (Erickson et al., AutoGluon-Tabular), takes a different design stance: rather than searching the space of pipelines, it trains a hand-picked set of strong base models (LightGBM, CatBoost, XGBoost, MLP, random forest, k-NN) with sensible defaults and stacks them in multiple layers. The library has consistently ranked at or near the top of the OpenML AutoML benchmark since 2020 and has been particularly popular in Kaggle tabular competitions.
FLAML, by Chi Wang and colleagues at Microsoft Research in 2021, uses a cost-aware search strategy (cost-frugal optimization and BlendSearch) to find good configurations quickly under a fixed budget. TPOT, by Randy Olson and Jason Moore in 2016, uses genetic programming to evolve pipelines. MLJAR AutoML is a commercial-friendly open-source package that emphasizes interpretable ensembles and explanatory reports.
| AutoML system | Year | Lead authors | Distinctive idea |
|---|---|---|---|
| Auto-WEKA | 2013 | Thornton et al. | SMAC over WEKA pipelines |
| Auto-sklearn | 2015 | Feurer et al. (Freiburg) | Meta-learning + SMAC + ensemble |
| TPOT | 2016 | Olson and Moore | Genetic programming over pipelines |
| H2O AutoML | 2017 | H2O.ai | Stacked ensembles of GBM, GLM, DL |
| AutoGluon | 2020 | Erickson et al. (AWS) | Hand-picked models, multi-layer stacking |
| FLAML | 2021 | Wang et al. (Microsoft) | Cost-aware blendsearch |
| MLJAR | 2018 | Płoński et al. | Interpretable AutoML, mode-driven |
In most published comparisons, AutoGluon and H2O AutoML are the top performers on the OpenML AutoML Benchmark, with FLAML close behind under tight time budgets. AutoML systems typically outperform a single well-tuned GBDT by one to three percentage points of accuracy and show lower variance across datasets, although the gap shrinks when the user is an experienced practitioner.
Even with modern libraries that accept categorical columns directly, the encoding of categorical features remains the single most important feature engineering choice for tabular classification. The most common schemes are listed below.
One-hot encoding expands a categorical column with k levels into k binary columns, one per level. This is the default in scikit-learn pipelines for linear models, naive Bayes, and neural networks. It is wasteful for high-cardinality columns.
Ordinal encoding assigns each level an integer code. The codes are arbitrary unless the variable is genuinely ordered, but tree-based learners can still find useful splits on ordinal codes regardless of the encoding order.
Target encoding (also called mean encoding) replaces each level with a smoothed average of the target on the training rows in that level. James-Stein and m-estimate smoothing are the most common variants. Target encoding is powerful for high-cardinality columns but leaks information from the target into features unless the encoding is computed with care, typically out-of-fold inside a cross-validation loop. CatBoost's ordered target statistics, introduced in the original paper, are a built-in solution to this leakage.
Embedding layers are learned during neural network training and map each categorical level to a low-dimensional vector. They were popularized by the entity embeddings of Cheng Guo and Felix Berkhahn in 2016, whose neural network with entity embeddings placed third in the Rossmann Store Sales Kaggle competition with comparatively little feature engineering.
Hashing encoders map each level to one of a fixed number of buckets using a hash function, sacrificing some collisions for a bounded representation size.
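Of the schemes above, target encoding is the one most easily implemented incorrectly, so a sketch may help. The function below computes an m-estimate target encoding out-of-fold, so that each row is encoded only from rows in other folds. The smoothing weight `m`, the fold count, and the toy zip-code data are assumptions for illustration; this is not the implementation used by any particular library.

```python
import numpy as np

def out_of_fold_target_encode(categories, y, m=5.0, n_splits=4, seed=0):
    """m-estimate target encoding where each row only sees targets from other folds."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)
    encoded = np.empty(n)
    for held_out in folds:
        fit_mask = np.ones(n, dtype=bool)
        fit_mask[held_out] = False
        prior = y[fit_mask].mean()          # global mean of the fitting folds
        for i in held_out:
            same = fit_mask & (categories == categories[i])
            # Shrink the category mean toward the prior; unseen levels get the prior.
            encoded[i] = (y[same].sum() + m * prior) / (same.sum() + m)
    return encoded

# Toy data (made up): a categorical column and a binary target.
zip_codes = ["10001", "10001", "94103", "94103", "94103", "60601",
             "10001", "60601", "94103", "10001", "60601", "94103"]
target = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=float)
encoded = out_of_fold_target_encode(zip_codes, target)
print(encoded.round(3))
```

At prediction time the encoder would be refit on the full training set and applied to new rows; only the training-time encoding needs the out-of-fold treatment.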
Missing values are handled natively by the major GBDT libraries: XGBoost's sparsity-aware split finder and LightGBM learn a default direction for missing values at each split, while CatBoost by default treats a missing numerical value as smaller than all observed values. Linear models and neural networks typically require explicit imputation, for which scikit-learn's IterativeImputer, KNNImputer, and SimpleImputer are common choices. The MissForest method by Stekhoven and Bühlmann in 2012 is a strong baseline that uses random forests to impute missing values.
Class imbalance is addressed through resampling and through reweighting. The Synthetic Minority Over-sampling Technique (SMOTE), proposed by Nitesh Chawla, Kevin Bowyer, Lawrence Hall, and W. Philip Kegelmeyer in JAIR in 2002, interpolates new minority-class samples between existing ones in feature space. ADASYN, by Haibo He and colleagues in 2008, concentrates oversampling on minority points that are harder to learn, namely those surrounded by many majority-class neighbors. The imbalanced-learn library by Lemaître, Nogueira, and Aridas (2017) implements these and many variants. Class weighting in the loss function is a simpler alternative that is supported natively by XGBoost, LightGBM, CatBoost, and scikit-learn.
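The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch` is hypothetical, neighbors are found by brute-force Euclidean distance, all features are assumed numeric, and at least two minority rows are required.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    row and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other rows
    # Pairwise squared Euclidean distances between minority rows.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a row is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)   # which row to start from
    pick = rng.integers(0, k, size=n_new)   # which neighbor to move toward
    partner = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[partner] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=5)
```

Each synthetic point lies on the line segment between two real minority rows, which is why SMOTE needs care with categorical or integer-coded features: interpolated values may not correspond to any valid level.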
Feature engineering for tabular classification can also include numerical transforms (log, Box-Cox, Yeo-Johnson, quantile binning), pairwise interactions, polynomial features, and domain-specific aggregations such as time-since-last-event for transactional data. Modern libraries such as Featuretools by Max Kanter and Kalyan Veeramachaneni (MIT, 2015) automate the construction of aggregation features over relational tables.
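A few of these transforms are simple enough to sketch directly in NumPy. The example below is illustrative only: the values are made up, the helper `time_since_last_event` is a hypothetical name, and the quantile bin edges are computed on the same array for brevity, whereas in practice they would be estimated on training data and reused at prediction time.

```python
import numpy as np

# Heavy-tailed numeric feature, e.g. transaction amounts.
amounts = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])

# Log transform; log1p handles zero safely.
log_amounts = np.log1p(amounts)

# Quantile binning into four bins using the three inner quartile edges.
inner_edges = np.quantile(amounts, [0.25, 0.5, 0.75])
bin_index = np.digitize(amounts, inner_edges)

def time_since_last_event(timestamps):
    """Gap to the previous event for one entity's sorted timestamps.
    The first event has no predecessor and is marked NaN."""
    t = np.asarray(timestamps, dtype=float)
    return np.concatenate(([np.nan], np.diff(t)))

gaps = time_since_last_event([0, 5, 6, 20])
```

The NaN for the first event can then be handled like any other missing value, for example by a tree learner's native missing-value logic.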
Most tabular classification work today is done in Python. The mature stack includes scikit-learn for classical models and pipelines, XGBoost, LightGBM, and CatBoost for boosting, PyTorch and Keras for neural models, and AutoGluon, H2O, and FLAML for AutoML. The pytorch-tabular library by Manu Joseph (2021) provides a unified PyTorch implementation of TabNet, FT-Transformer, NODE, and other neural tabular models. The pytorch-frame library by Stanford and Kumo.AI in 2024 provides a similar capability with a focus on multi-modal and relational tabular data.
For categorical encoding, the category_encoders library by Will McGinnis offers more than a dozen schemes, including James-Stein, m-estimate, CatBoost-style ordered target statistics, and Helmert, sum, and backward-difference contrasts. For class imbalance, imbalanced-learn provides SMOTE, ADASYN, and several Tomek-link-based cleaning algorithms.
Production deployment increasingly relies on ONNX or Treelite (Hyunsu Cho et al., 2018) to convert tree ensembles to a portable runtime, and on tools such as Hummingbird (Microsoft) to compile tree models into tensor operations that run on GPUs.
In R, the workhorses are the gbm package (Greg Ridgeway), xgboost, lightgbm, and ranger (a fast random forest implementation by Marvin Wright). H2O and tidymodels also support tabular workflows. Julia's MLJ framework offers a similar tabular-first design.
Despite the maturity of the field, several open problems remain.
Tabular foundation models such as TabPFN v2 are limited to roughly 10,000 rows and 500 features as of 2025. They cannot yet handle wide tables (more columns than rows) or very large datasets without falling back to subsampling.
Transfer learning across tabular datasets is largely unsolved. Pre-trained tabular models do not consistently transfer their learned representations to new datasets the way pre-trained language and vision models do. The closest existing analog is in-context learning with TabPFN-style models.
Distribution shift is endemic in tabular deployment but rarely modeled explicitly. Most benchmarks assume IID splits, while production data often suffers from temporal drift, covariate shift, and concept drift. Methods such as adversarial validation and TrAdaBoost provide partial solutions.
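Adversarial validation can be sketched as follows: label each row by whether it came from the training or the deployment sample, train a classifier to tell the two apart, and read its cross-validated AUROC as a rough shift score. The function name `adversarial_auc`, the synthetic data, and the choice of logistic regression as the discriminator are assumptions made for illustration; practitioners often use a boosted tree model instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_auc(X_train, X_test):
    """AUROC of a classifier that predicts which sample a row came from.
    Values near 0.5 suggest little detectable covariate shift."""
    X = np.vstack([X_train, X_test])
    origin = np.concatenate([
        np.zeros(len(X_train), dtype=int),  # 0 = training sample
        np.ones(len(X_test), dtype=int),    # 1 = deployment sample
    ])
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, X, origin, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(origin, proba)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
X_same = rng.normal(size=(500, 3))       # same distribution as training
X_shifted = rng.normal(size=(500, 3))
X_shifted[:, 0] += 1.5                   # mean shift in one feature

auc_same = adversarial_auc(X_train, X_same)
auc_shifted = adversarial_auc(X_train, X_shifted)
```

An AUROC well above 0.5 indicates that the two samples are distinguishable, and the discriminator's most important features point to where the shift is concentrated. The check detects covariate shift only; it says nothing about concept drift, where the relationship between features and label changes.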
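Calibration of tree ensembles is an open issue. Miscalibrated scores are harmless when only ranking metrics such as AUROC matter, because those metrics depend on the ordering of predictions rather than their values, but they degrade applications that consume predicted probabilities directly, such as expected-loss calculations. Isotonic regression and Platt scaling are the common post-hoc corrections. Calibration of deep tabular models is generally worse than that of GBDT, especially under class imbalance.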
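Post-hoc calibration is available in scikit-learn through CalibratedClassifierCV. The sketch below compares the Brier score of a gradient boosting classifier before and after isotonic calibration on a synthetic imbalanced problem; the dataset and settings are arbitrary choices for illustration, and no particular outcome is implied, since calibration can help or hurt depending on the data and the amount available for fitting the calibrator.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Uncalibrated model.
raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Same model class wrapped in 5-fold isotonic calibration.
# method="sigmoid" would apply Platt scaling instead.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X_tr, y_tr)

p_raw = raw.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

# Lower Brier score means better probabilistic predictions.
brier_raw = brier_score_loss(y_te, p_raw)
brier_cal = brier_score_loss(y_te, p_cal)
```

Isotonic regression is more flexible than Platt scaling but needs more data to avoid overfitting, so the sigmoid method is often the safer choice when the calibration set is small.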
Interpretability remains in tension with accuracy. SHAP (Lundberg and Lee, NeurIPS 2017) and TreeSHAP provide a near-canonical attribution method for tree ensembles, but the same techniques applied to neural tabular models give attributions that are noisier and less reproducible across seeds.
Categorical encoding is still a manual choice with large effects on accuracy, even with libraries that handle the encoding internally. Best practices for high-cardinality columns are still evolving.
Fairness and disparate impact are first-order concerns in many tabular applications (credit, hiring, criminal justice). The field has produced pre-processing, in-processing, and post-processing methods such as reweighing, adversarial debiasing, and equalized-odds post-processing, but has not reached consensus on which to apply when.
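Of these methods, reweighing is simple enough to show in full. The sketch below follows the weighting scheme of Kamiran and Calders, in which each row receives the ratio of the expected to the observed frequency of its (group, label) combination; the function name `reweighing_weights` and the toy data are hypothetical.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight for each row: P(group) * P(label) / P(group, label).
    Combinations that are under-represented relative to independence
    get weights above 1, over-represented ones get weights below 1."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

groups = ["a", "a", "a", "a", "b", "b"]
labels = [1, 1, 1, 0, 1, 0]
weights = reweighing_weights(groups, labels)

def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    total = sum(w for w, g in zip(weights, groups) if g == group)
    pos = sum(w for w, g, y in zip(weights, groups, labels) if g == group and y == 1)
    return pos / total
```

After reweighing, the weighted positive rate is the same in every group, and the weights can be passed as sample weights to any learner that supports them, including the boosting libraries discussed in this article.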