Tabular Classification Models

Updated Jul 16, 2026

Part of the Tabular Models hub. See also the sibling survey Tabular Regression Models.

Tabular classification models are machine learning systems that predict a discrete class label for each row of a tabular dataset, where rows are samples and columns are heterogeneous features (numeric, ordinal, categorical, or binary). This task setting covers a large share of practical applications of machine learning, including credit scoring, fraud detection, churn prediction, medical diagnosis, click-through rate prediction, and operational risk modeling. Unlike image, text, audio, or sequence data, tabular features have no canonical spatial or temporal ordering, the columns often differ in semantics and scale, missing values are common, and informative interactions between columns are usually low-order rather than deeply compositional.

For most of the last two decades the dominant family of tabular classifiers has been gradient boosted decision trees (GBDT), implemented by XGBoost, LightGBM, and CatBoost.^[17]^[22]^[27] These libraries have won a majority of Kaggle competitions on tabular data and remain the default baseline in industry. Since 2017 there has been an active research program to design neural network architectures that match or surpass GBDT on tabular data, producing models such as TabNet, FT-Transformer, SAINT, NODE, TabTransformer, and the parameter-efficient MLP ensemble TabM. From 2022 onward, a parallel direction has emerged: pre-trained foundation models for tabular data, most notably TabPFN, which performs in-context classification without per-dataset gradient training, and the larger TabPFN v2, published in Nature in January 2025.^[43]^[48] Newer in-context models such as TabPFN-2.5 and TabICL have extended the approach to tens or hundreds of thousands of rows.^[50]^[51]

Overview

Tabular classification problems are characterized by a few recurring properties that shape model design. Datasets are usually small to medium in size, from a few hundred to a few million rows, with anywhere from a handful to a few thousand columns. Features are heterogeneous, mixing continuous variables with categorical codes, binary flags, and ordinal levels. Class distributions are often imbalanced, and feature distributions can shift between training and deployment. Interpretability and calibration matter, because tabular models often drive consequential decisions in finance, healthcare, and policy. These constraints favor models that are sample-efficient, robust to noise and missing values, fast to retrain, and easy to inspect.

Classical statistical methods such as logistic regression, naive Bayes, and linear discriminant analysis provided the first baselines and remain widely used when interpretability or calibration is paramount. Tree-based ensembles such as random forest and gradient boosted decision trees offer strong predictive accuracy with minimal tuning. Deep neural networks for tabular data attempt to recover the inductive biases that trees obtain almost for free, especially the ability to handle features at different scales and to ignore uninformative columns. Foundation models for tabular data try to amortize the entire training procedure into a single pre-training run.

Definition and problem setup

In the supervised tabular classification setting, the training set is a collection of pairs $(x_i, y_i)$ where $x_i$ is a feature vector of $d$ mixed-type columns (represented in $\mathbb{R}^d$ once categorical values are encoded) and $y_i$ is a discrete label drawn from a finite set of classes. The goal is to learn a function that maps a new feature vector to a probability distribution over classes, or to a single predicted class. Binary classification has two classes; multiclass classification has three or more mutually exclusive classes; multilabel classification allows each sample to belong to several classes simultaneously and is typically reduced to a set of binary problems.

Features may be continuous (age, income, sensor reading), categorical with low cardinality (sex, country code), categorical with high cardinality (zip code, product ID), ordinal (education level, satisfaction score), boolean, or text fragments that have been hashed or embedded. The data is usually stored as a single table or as a small set of joined tables, hence the name. The defining feature is that no a priori topology or sequence is imposed on the columns. Permuting the columns leaves the prediction problem unchanged after appropriate renaming.

Training a tabular classifier typically involves splitting the data into training, validation, and test partitions; encoding categorical and missing values; choosing a model class and a loss function (most often cross-entropy or log-loss for classification); optimizing hyperparameters by cross-validation or a held-out validation set; and reporting performance on a test set using metrics such as accuracy, F1 score, precision and recall, area under the ROC curve (AUROC), log-loss, and Brier score.
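
As a minimal sketch of three of the metrics named above, the following pure-Python functions compute accuracy, log-loss, and the Brier score for a binary task. The labels and probabilities are made-up illustrative values, and the function names are this example's own rather than any library's API.

```python
import math

def accuracy(y_true, p_pos, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pos))
    return hits / len(y_true)

def log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary cross-entropy)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def brier_score(y_true, p_pos):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pos)) / len(y_true)

# Illustrative labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0]
p_pos = [0.9, 0.2, 0.6, 0.4, 0.1]

print("accuracy:", accuracy(y_true, p_pos))
print("log-loss:", round(log_loss(y_true, p_pos), 4))
print("Brier score:", round(brier_score(y_true, p_pos), 4))
```

Accuracy only looks at the thresholded decision, while log-loss and the Brier score also penalize over-confident or under-confident probabilities, which is why they are commonly reported alongside it.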

Classical algorithms

The earliest practical tabular classifiers were linear models. Logistic regression, formalized by David Cox in 1958, fits a sigmoid of a linear combination of features to estimate the probability of a binary outcome and remains the workhorse for credit scoring and medical risk modeling.^[1] Multinomial logistic regression extends the same idea to multiclass problems through a softmax. Linear and quadratic discriminant analysis, developed by Ronald Fisher in 1936 and later generalized, assume class-conditional Gaussian distributions and yield closed-form decision boundaries.^[2]
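
The sigmoid-of-a-linear-combination idea can be sketched in a few lines. This is an illustrative batch gradient descent fit on a made-up one-feature dataset, not an optimized solver; the learning rate, epoch count, and function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss w.r.t. the linear score is (p - y).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability of the positive class for one row."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up data: larger feature values tend to belong to class 1.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
print("P(y=1 | x=1.0):", round(predict_proba(w, b, [1.0]), 3))
print("P(y=1 | x=3.5):", round(predict_proba(w, b, [3.5]), 3))
```

The fitted coefficient has a direct reading as a change in log-odds per unit of the feature, which is the interpretability property that keeps the model in use for risk scoring.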

Naive Bayes classifiers, popularized by Maron and Kuhns in the 1960s for document classification, assume conditional independence of features given the class. Despite the strong independence assumption these classifiers often perform surprisingly well, especially with high-dimensional sparse inputs.

Decision trees, introduced by Breiman, Friedman, Olshen, and Stone in their 1984 book Classification and Regression Trees (the CART algorithm), recursively partition the feature space using axis-aligned splits chosen to maximize an impurity reduction (Gini impurity or entropy).^[4] C4.5, developed by Ross Quinlan in 1993, refined the same idea with a different splitting criterion and handling of continuous features.^[5] Trees natively handle mixed feature types, missing values, and feature interactions, but a single tree is high-variance.
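
The split search at a single tree node can be illustrated as follows. This is a simplified sketch for one numeric feature using Gini impurity; the helper names and the toy data are assumptions, and real implementations use sorted scans rather than re-partitioning the rows for every candidate threshold.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, impurity_decrease) for the best split value <= threshold."""
    n = len(values)
    parent = gini(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy feature (age) with a clean class boundary between 31 and 38.
ages = [22, 25, 31, 38, 45, 52]
labels = [0, 0, 0, 1, 1, 1]

threshold, gain = best_split(ages, labels)
print("best threshold:", threshold, "impurity decrease:", gain)
```

A full tree applies this search to every feature at every node and recurses on the two child partitions until a stopping rule is met.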

Ensembles of trees were the next leap. Bagging, proposed by Leo Breiman in 1996, trains many trees on bootstrap samples and averages their predictions.^[7] Random forest, also by Breiman in 2001, adds a feature subsampling rule at each split, producing a strong all-purpose classifier with only two important hyperparameters and a built-in out-of-bag error estimate.^[8] Extra Trees, by Pierre Geurts and colleagues in 2006, randomize the split thresholds as well.^[11]
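
The out-of-bag estimate comes from the sampling step itself: a bootstrap sample of n rows drawn with replacement leaves out roughly a fraction (1 - 1/n)^n of the rows, about 36.8 percent for large n, and those rows can serve as a validation set for the tree trained on that sample. The sketch below only illustrates the sampling step; the function name and seed are arbitrary choices for the example.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the example is repeatable
n_rows = 10_000
in_bag, out_of_bag = bootstrap_split(n_rows, rng)
oob_fraction = len(out_of_bag) / n_rows

print(f"out-of-bag fraction: {oob_fraction:.3f}")
```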

Support vector machines, formalized by Vapnik and colleagues in the 1990s, maximize the margin between classes in a feature space induced by a kernel function.^[6] k-nearest neighbors classification, dating back to Cover and Hart in 1967, predicts the majority class among the k closest training points under a chosen metric.^[3] Both methods remain useful baselines, particularly on small datasets.

The table below summarizes the most common classical classifiers used on tabular data.

Algorithm	Year	Key reference	Strengths	Typical weaknesses
Logistic regression	1958	Cox^[1]	Calibrated probabilities, interpretable coefficients	Linear decision boundary
Linear discriminant analysis	1936	Fisher^[2]	Closed-form solution, fast	Assumes Gaussian class conditionals
Naive Bayes	1960s	Maron and Kuhns	Very fast, works with sparse features	Independence assumption rarely holds
k-nearest neighbors	1967	Cover and Hart^[3]	No training phase, simple	Slow at inference, sensitive to scaling
CART	1984	Breiman et al.^[4]	Handles mixed types, interpretable	High variance
C4.5	1993	Quinlan^[5]	Information gain ratio, prunes	High variance
Support vector machine	1995	Cortes and Vapnik^[6]	Strong margins, kernels	Scales poorly past tens of thousands of rows
Bagging	1996	Breiman^[7]	Lower variance than single tree	Lower interpretability
Random forest	2001	Breiman^[8]	Robust default, OOB error	Larger memory, weaker on linear signals
Extra Trees	2006	Geurts et al.^[11]	Faster fits than random forest	Slightly weaker on noisy data

Gradient boosting frameworks

Gradient boosting builds an additive ensemble of weak learners (usually shallow regression trees), where each new tree is fit to the negative gradient of a differentiable loss with respect to the current ensemble's predictions. The technique was formalized by Jerome Friedman in his 2001 Annals of Statistics paper Greedy Function Approximation: A Gradient Boosting Machine and a 2002 follow-up that introduced stochastic subsampling.^[9] Variants of this idea now dominate tabular classification benchmarks.
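
A stripped-down version of this loop for binary log-loss is sketched below, using depth-one regression trees (stumps) on a single feature. It is a simplification made for illustration: leaf values are plain means of the residuals rather than the line-search or Newton estimates used in practice, and the data, learning rate, and function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(y, scores):
    """Mean binary cross-entropy for raw (log-odds) scores."""
    return -sum(yi * math.log(sigmoid(s)) + (1 - yi) * math.log(1.0 - sigmoid(s))
                for yi, s in zip(y, scores)) / len(y)

def fit_stump(x, residuals):
    """Least-squares stump on one feature (needs at least two distinct values)."""
    best = None
    distinct = sorted(set(x))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]  # (threshold, left_value, right_value)

def boost(x, y, n_rounds=50, lr=0.3):
    pos = sum(y) / len(y)
    f0 = math.log(pos / (1.0 - pos))  # constant model: log-odds of the base rate
    scores = [f0] * len(x)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss w.r.t. the raw score is (y - p).
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        t, lv, rv = fit_stump(x, residuals)
        stumps.append((t, lv, rv))
        scores = [s + lr * (lv if v <= t else rv) for s, v in zip(scores, x)]
    return f0, lr, stumps

def raw_score(model, v):
    f0, lr, stumps = model
    return f0 + lr * sum(lv if v <= t else rv for t, lv, rv in stumps)

# Toy one-feature dataset with one noisy label.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 1, 1]

model = boost(x, y)
base_loss = mean_log_loss(y, [model[0]] * len(y))
final_loss = mean_log_loss(y, [raw_score(model, v) for v in x])
print("log-loss of constant model:", round(base_loss, 4))
print("log-loss after boosting:   ", round(final_loss, 4))
```

Each round fits the residual pattern that the current ensemble still gets wrong, so the training loss falls steadily; the learning rate and the number of rounds jointly control how far that process is allowed to go.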

XGBoost, introduced by Tianqi Chen and Carlos Guestrin at KDD 2016 after an initial open-source release in 2014, was the first widely adopted implementation engineered for speed and scale.^[17] It added second-order gradient information (Newton steps), a regularized objective with explicit L1 and L2 penalties on leaf weights, a sparsity-aware split finder for missing values, weighted quantile sketching for approximate split candidates, cache-aware block storage, and parallel split finding across CPU cores.^[17] The paper reports that 17 of the 29 winning solutions published on Kaggle's blog during 2015 used XGBoost, and the algorithm remains the most cited baseline in tabular machine learning research.^[17]
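
The role of second-order information can be made concrete with the closed-form leaf weight and split gain from the XGBoost paper. Writing $G$ and $H$ for the sums of first and second derivatives of the loss over the rows in a leaf, the optimal leaf weight under an L2 penalty $\lambda$ is $w^* = -G / (H + \lambda)$, and a split is scored by how much it increases $G^2 / (H + \lambda)$ summed over the children. The sketch below omits the L1 term for brevity and uses made-up rows; the function names are illustrative, not the library's API.

```python
def grad_hess(y, p):
    """First and second derivative of log-loss w.r.t. the raw score at probability p."""
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam=1.0):
    """Optimal leaf value under an L2 penalty lam on leaf weights."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children (gamma = per-leaf cost)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Four rows, all currently predicted at p = 0.5; the candidate split
# sends the two negatives left and the two positives right.
y = [0, 0, 1, 1]
p = [0.5, 0.5, 0.5, 0.5]
stats = [grad_hess(yi, pi) for yi, pi in zip(y, p)]

GL = sum(g for g, _ in stats[:2])
HL = sum(h for _, h in stats[:2])
GR = sum(g for g, _ in stats[2:])
HR = sum(h for _, h in stats[2:])

gain = split_gain(GL, HL, GR, HR)
w_left, w_right = leaf_weight(GL, HL), leaf_weight(GR, HR)
print("gain:", round(gain, 4), "left weight:", round(w_left, 4), "right weight:", round(w_right, 4))
```

Larger values of `lam` shrink the leaf weights toward zero and lower the gain of every candidate split, which is how the regularized objective discourages overly specific leaves.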

LightGBM, released by Microsoft Research and announced by Guolin Ke and colleagues at NeurIPS 2017, introduced two ideas that made GBDT both faster and more memory-efficient on large datasets.^[22] Gradient-based One-Side Sampling (GOSS) keeps all samples with large gradients and randomly subsamples the rest, focusing computation on samples that are still hard to fit.^[22] Exclusive Feature Bundling (EFB) merges mutually exclusive sparse features into a single bundle, reducing the effective dimensionality of categorical inputs after one-hot encoding.^[22] LightGBM also uses histogram-based binning of continuous features and grows trees leaf-wise (best-first) rather than level-wise, which often yields lower training loss for the same tree count.^[53]
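
The GOSS selection step can be sketched as below. Following the description in the LightGBM paper, the rows with the largest absolute gradients (a fraction `a`) are always kept, a random fraction `b` of all rows is drawn from the remainder, and the sampled small-gradient rows are up-weighted by (1 - a) / b so that the gradient statistics stay approximately unbiased. The function name, default fractions, and synthetic gradients are assumptions for the example.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one-side sampling."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n, rand_n = int(a * n), int(b * n)
    top, rest = order[:top_n], order[top_n:]
    sampled = random.Random(seed).sample(rest, rand_n)
    weights = {i: 1.0 for i in top}                      # large gradients: always kept
    weights.update({i: (1.0 - a) / b for i in sampled})  # small gradients: re-weighted
    return weights

# Synthetic gradients whose magnitude grows with the row index.
gradients = [(-1) ** i * i / 100 for i in range(100)]
weights = goss_sample(gradients)

print("rows kept:", len(weights), "of", len(gradients))
```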

CatBoost, developed by Yandex and presented by Liudmila Prokhorenkova and colleagues at NeurIPS 2018, focused on principled handling of categorical features and on preventing target leakage during the encoding of those features.^[27] It introduced ordered target statistics, an encoding scheme that computes the running mean of the target for each category using only earlier rows in a random permutation, removing the bias that plain target encoding can introduce.^[27] CatBoost also uses oblivious decision trees (the same split is applied at every node of a given depth), which gives compact models and fast inference, and a technique called ordered boosting that further reduces target leakage at the cost of additional bookkeeping.^[27]
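
Ordered target statistics can be illustrated with a short sketch. Each row is encoded using only the targets of rows that come earlier in a random permutation and share its category, smoothed toward a prior; the smoothing weight `a`, the use of the global mean as the prior, and the function name are assumptions made for this example rather than CatBoost's exact defaults.

```python
import random
from collections import defaultdict

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each row from the targets of earlier same-category rows only."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random permutation of the rows
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = [0.0] * n
    for i in order:
        c = categories[i]
        # The row's own target is NOT included: statistics are updated afterwards.
        encoded[i] = (sums[c] + a * prior) / (counts[c] + a)
        sums[c] += targets[i]
        counts[c] += 1
    return encoded

categories = ["a", "b", "a", "c", "a", "b"]
targets = [1, 0, 1, 0, 0, 1]
prior = sum(targets) / len(targets)

encoded = ordered_target_stats(categories, targets, prior)
for c, t, e in zip(categories, targets, encoded):
    print(c, t, round(e, 3))
```

Because a row never sees its own label, a category that appears only once is encoded as the prior, whereas plain target encoding would leak that row's label directly into its feature value.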

HistGradientBoostingClassifier, added to scikit-learn in version 0.21 (2019), is a pure-Python and Cython implementation modeled on LightGBM.^[56] It is often slower than the dedicated libraries but ships with scikit-learn and is therefore the most accessible GBDT for many practitioners.

The table below compares the three production GBDT libraries on the features most relevant to tabular classification.

Feature	XGBoost	LightGBM	CatBoost
Initial release	2014	2016	2017
Paper venue	KDD 2016^[17]	NeurIPS 2017^[22]	NeurIPS 2018^[27]
Lead authors	Chen, Guestrin	Ke et al. (Microsoft)	Prokhorenkova et al. (Yandex)
Tree growth	Level-wise (default), also leaf-wise	Leaf-wise	Oblivious symmetric
Histogram binning	Yes (since 1.0)	Yes (default)	Yes
Native categorical handling	Limited (since 1.5)^[52]	Yes^[53]	Yes (ordered target statistics)^[54]
Missing value handling	Sparsity-aware split	Default direction	Default direction
GPU support	Yes	Yes	Yes
Distributed training	Yes (Dask, Spark, Ray)	Yes (MPI, Dask)	Yes
Default loss for classification	Logistic, softmax	Logistic, softmax	Logistic, multiclass

In industry practice, XGBoost is the most common default, LightGBM is preferred on very large datasets with sparse features, and CatBoost tends to be the most ergonomic on datasets with many high-cardinality categorical columns and minimal manual feature engineering.

Neural approaches

Neural network classifiers for tabular data go back at least to the early 1990s, when single hidden layer MLPs were already competing with logistic regression on credit datasets. Modern neural tabular models try to do three things at once: handle mixed input types without manual encoding, learn feature interactions automatically, and remain competitive with GBDT on small to medium datasets.

One practical line of work, factorization machines, originated with Steffen Rendle's 2010 paper Factorization Machines.^[13] The method models pairwise feature interactions through low-rank embeddings. Field-aware Factorization Machines, by Yuchin Juan and colleagues in 2016, generalized the idea by giving each feature a different embedding for each interaction field, which improved click-through rate prediction noticeably.^[19]
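
The pairwise term of a factorization machine, $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, can be computed in time linear in the number of features through the identity $\tfrac{1}{2} \sum_f \big[ (\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2 \big]$ given in Rendle's paper. The sketch below checks the identity numerically on made-up embeddings; the function names and values are illustrative.

```python
def fm_pairwise_naive(x, V):
    """Sum over all feature pairs of <v_i, v_j> * x_i * x_j (quadratic in features)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity via the linear-time reformulation."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total

# Three features with rank-2 embeddings; the second feature is inactive (x = 0).
x = [1.0, 0.0, 2.0]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
print("naive:", round(naive, 6), "fast:", round(fast, 6))
```

With sparse one-hot inputs only the active features contribute, which is what makes the model practical for high-cardinality categorical data.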

Wide and Deep Learning, introduced by Heng-Tze Cheng and colleagues at Google in 2016, combined a wide linear model that memorizes sparse cross-features with a deep MLP that generalizes via embeddings.^[16] DeepFM, by Huifeng Guo and colleagues in 2017, replaced the wide part with a factorization machine, giving a more principled model of low-order interactions.^[28] Deep and Cross Networks (DCN) by Ruoxi Wang and colleagues in 2017 and DCN-V2 in 2020 used explicit cross layers that compute feature crosses of bounded degree.^[25]^[33] AutoInt, by Weiping Song and colleagues in 2019, applied self-attention to feature embeddings to model arbitrary higher-order interactions.^[29] These models grew out of recommendation and online advertising contexts and are still the dominant neural architecture for click-through rate prediction.

Research directed at general tabular classification, rather than click-through rate prediction, took off after 2017.

TabNet, proposed by Sercan Arik and Tomas Pfister of Google Cloud in a 2019 preprint and published at AAAI 2021, uses sequential attention to select features at each decision step, mimicking the way a tree splits on one feature at a time while remaining differentiable.^[34] TabNet supports both supervised training and an unsupervised pre-training task in which masked features are reconstructed, which can help on small labeled datasets.^[34]

TabTransformer, by Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin at Amazon in 2020, applied transformer encoder layers to embeddings of categorical features while leaving continuous features in a separate path.^[32] The model was particularly useful when the dataset had many high-cardinality categorical columns.

FT-Transformer, by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko at NeurIPS 2021, simplified the picture by treating both categorical and numerical features uniformly as tokens, passing them through a stack of standard transformer encoder blocks, and reading out a classification head from a special [CLS] token.^[35] The same paper, Revisiting Deep Learning Models for Tabular Data, also introduced a strong residual MLP baseline (ResNet for tabular) and reported careful benchmark comparisons with GBDT, finding that no neural model dominated XGBoost across all datasets.^[35]

SAINT (Self-Attention and Intersample Attention Transformer), by Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C. Bayan Bruss, and Tom Goldstein in 2021, added a second attention dimension across samples within a minibatch, allowing the model to compare a query row to other rows from the same batch.^[36] The method also used contrastive pre-training similar to that of SimCLR.

NODE (Neural Oblivious Decision Ensembles), by Sergei Popov, Stanislav Morozov, and Artem Babenko at Yandex in 2019, designed a differentiable analog of oblivious decision trees that could be trained end-to-end with gradient descent.^[30] Mixture-of-experts variants and tree-structured architectures such as DeepGBM and GrowNet have explored related ideas.

TabM (Tabular Multiple-prediction model), by Yury Gorishniy, Akim Kotelnikov, and Artem Babenko at Yandex Research and presented at ICLR 2025, returned to the simple multilayer perceptron and made it stronger through parameter-efficient ensembling based on the BatchEnsemble technique.^[49] A single TabM efficiently imitates an ensemble of MLPs by training several implicit submodels at once that share most of their weights, producing multiple predictions per input that are weak individually but accurate when averaged. In a large-scale evaluation on public benchmarks, the authors reported that TabM achieved the best accuracy among tabular deep learning models while keeping a favorable performance-to-efficiency trade-off, and they concluded that MLP-based models form a stronger and more practical line than attention- and retrieval-based architectures.^[49] The largest dataset in the study contained roughly 13 million objects, showing that the approach scales beyond the small-data regime that limits in-context models.^[49]

The table below summarizes the most cited neural tabular classifiers.

Model	Year	Authors	Core idea
Wide and Deep	2016	Cheng et al. (Google)^[16]	Joint wide linear + deep MLP
DeepFM	2017	Guo et al.^[28]	Factorization machine + deep network
DCN	2017	Wang et al.^[25]	Explicit cross layers of bounded degree
NODE	2019	Popov, Morozov, Babenko^[30]	Differentiable oblivious tree ensemble
AutoInt	2019	Song et al.^[29]	Self-attention over feature embeddings
TabTransformer	2020	Huang et al. (Amazon)^[32]	Transformer encoder over categorical embeddings
DCN-V2	2020	Wang et al.^[33]	Improved cross network for production CTR
TabNet	2021	Arik and Pfister (Google)^[34]	Sequential attention with sparse feature selection
FT-Transformer	2021	Gorishniy et al. (Yandex)^[35]	Feature tokenizer + standard transformer encoder
ResNet for tabular	2021	Gorishniy et al.^[35]	Residual MLP baseline
SAINT	2021	Somepalli et al.^[36]	Attention across columns and across samples
TabM	2025	Gorishniy, Kotelnikov, Babenko (Yandex)^[49]	Parameter-efficient MLP ensemble (BatchEnsemble)

Despite the proliferation of architectures, empirical studies have consistently found that careful tuning of an MLP or a residual MLP recovers most of the gap to the more elaborate models.^[35] Hyperparameter tuning budgets are a confounder: when GBDT and neural models receive the same amount of tuning, GBDT typically wins on small and medium tabular datasets.^[42]

Foundation models for tabular

A newer line of work treats tabular classification the way GPT treats text: train one large model once on a huge variety of synthetic or real tabular tasks, then deploy it as a frozen in-context predictor that takes a labeled support set as input and outputs predictions for a query set without any per-dataset gradient updates.

TabPFN (Tabular Prior-data Fitted Network), introduced by Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter at ICLR 2023, was the first widely cited model of this type.^[43] TabPFN is a transformer that has been trained on millions of synthetic classification tasks generated from a structural causal model prior. At inference time the entire training set of the target task is concatenated with the query points and fed through the network in a single forward pass; the model performs Bayesian inference in the function space defined by its synthetic prior. The original TabPFN was limited to roughly 1,000 training rows, 100 features, and 10 classes, and to purely numerical inputs, but within those constraints it matched or beat tuned GBDT and AutoML systems while running in under a second on a single GPU.^[43] The paper received an Outstanding Paper award at ICLR 2023.^[43]

TabPFN v2, by Noah Hollmann and colleagues at the University of Freiburg and Prior Labs, was published in Nature in January 2025 under the title Accurate predictions on small data with a tabular foundation model.^[48] The second version increased the parameter count, extended the model to handle categorical features and missing values natively, and scaled to roughly 10,000 training rows and 500 features.^[48] The Nature paper reported that TabPFN v2 matched or surpassed tuned ensembles of XGBoost, CatBoost, and LightGBM on a broad benchmark of small to medium classification datasets, often after only a few seconds of inference and with no per-dataset hyperparameter tuning.^[48] Prior Labs released the weights and a Python package under a non-commercial license.

TabPFN-2.5, released by Prior Labs in November 2025, extended the same in-context recipe to roughly 50,000 training rows and 2,000 features while keeping the single-forward-pass, no-tuning workflow, about a twentyfold increase in supported data cells over TabPFN v2.^[51] The accompanying technical report stated that default TabPFN-2.5 achieved a 100 percent win rate against default XGBoost on small to medium classification datasets (up to 10,000 rows and 500 features) and an 87 percent win rate on larger datasets up to 100,000 rows and 2,000 features, and that it ranked first on the TabArena benchmark for classification and regression.^[51] A fine-tuned variant trained on real datasets, Real-TabPFN-2.5, was reported to be stronger still.^[51]

TabICL (Tabular In-Context Learning), by Jingang Qu, David Holzmüller, Gael Varoquaux, and Marine Le Morvan and presented at ICML 2025, addressed the scaling limitation of TabPFN v2, whose alternating column-wise and row-wise attention makes large training sets computationally expensive.^[50] TabICL uses a two-stage architecture, a column-then-row attention mechanism that builds fixed-dimensional row embeddings followed by a transformer for in-context learning.^[50] Pre-trained on synthetic datasets of up to 60,000 rows, it can handle up to 500,000 rows at inference on affordable hardware.^[50] Across 200 classification datasets from the TALENT benchmark it matched TabPFN v2 while running up to ten times faster, and on the 53 datasets with more than 10,000 rows it surpassed both TabPFN v2 and CatBoost.^[50]

TabDPT (Tabular Discriminative Pre-trained Transformer), released by Layer 6 AI in September 2024, follows the same in-context learning recipe but is trained on a curated mix of real OpenML datasets rather than purely synthetic data. It targets the same regime of small to medium tabular tasks.

TabLLM, by Stefan Hegselmann and colleagues at AISTATS 2023, took a different path: it serializes a row of a table into a natural language sentence and asks a pre-trained large language model to classify the row in zero-shot or few-shot mode.^[44] TabLLM is most useful when the column names are semantically meaningful, because the language model can transfer prior knowledge from text. Performance saturates quickly compared to GBDT on datasets with thousands of rows but can be competitive in very-low-shot regimes.^[44]
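
The serialization step can be sketched as follows. The sentence template ("The <column> is <value>."), the prompt wording, and the helper names are assumptions chosen to illustrate the idea of turning a row into text; no language model is called, and the example row is made up.

```python
def serialize_row(row):
    """Turn a mapping of column name -> value into one short sentence per column."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())

def build_prompt(row, question, choices):
    """Append a task question and the allowed answers to the serialized row."""
    options = " or ".join(choices)
    return f"{serialize_row(row)}\n{question} Answer with {options}."

# Hypothetical row with human-readable column names.
row = {"age": 42, "occupation": "nurse", "hours per week": 36}
prompt = build_prompt(
    row,
    "Does this person earn more than 50000 dollars per year?",
    ["yes", "no"],
)
print(prompt)
```

The quality of the column names matters here: a column called `occupation` gives the language model something to reason about, whereas an opaque code such as `f17` does not.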

GReaT (Generation of Realistic Tabular data), by Vadim Borisov and colleagues in 2023, used pre-trained language models for tabular generation rather than classification, but its serialization scheme has been adapted for downstream classification work.^[45] Earlier related efforts include TabBERT and TabularBERT for transactional and event data.^[38]

The table below summarizes the main tabular foundation model approaches.

Model	Year	Authors	Approach
TabPFN	2023	Hollmann, Müller, Eggensperger, Hutter^[43]	In-context Bayesian classifier, synthetic SCM prior
TabPFN v2	2025	Hollmann et al. (Prior Labs)^[48]	Larger in-context model, categorical and missing support, Nature 2025
TabPFN-2.5	2025	Prior Labs^[51]	In-context model scaled to ~50,000 rows and 2,000 features
TabICL	2025	Qu, Holzmüller, Varoquaux, Le Morvan^[50]	Column-then-row attention, in-context learning up to 500,000 rows
TabDPT	2024	Layer 6 AI	In-context transformer, real OpenML pre-training
TabLLM	2023	Hegselmann et al.^[44]	Serialize rows as text, classify with LLM
GReaT	2023	Borisov et al.^[45]	Pre-trained LM for tabular generation
TabBERT	2021	Padhi et al. (IBM)^[38]	BERT-style pre-training on transactional rows

A recurring theme in the foundation model literature is that pre-trained in-context predictors are well suited to the small-data regime, where the strength of GBDT has historically depended on per-dataset hyperparameter tuning. The 2025 generation of models pushed the practical ceiling outward: TabPFN-2.5 reaches roughly 50,000 rows and 2,000 features, and TabICL handles up to 500,000 rows.^[50]^[51] As of 2026, GBDT still has the edge on the widest tables (more features than rows) and on very large tables with many millions of rows, where in-context inference cost and context length become limiting.

Do neural networks beat trees on tabular data

Two influential 2022 papers crystallized the case that gradient boosted trees still beat deep learning on tabular data when the comparison is fair.

Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux published Why do tree-based models still outperform deep learning on tabular data? at NeurIPS 2022.^[42] The paper built a benchmark of 45 medium-sized tabular datasets curated to control for issues that plague earlier comparisons (uninformative features, regression versus classification mix, dataset size).^[42] With identical hyperparameter tuning budgets, the authors found that tree-based models (random forest, gradient boosted trees, and XGBoost) consistently outperformed FT-Transformer, ResNet, and MLPs.^[42] They attributed the gap to three inductive biases that favor trees: robustness to uninformative features through embedded feature selection; the absence of rotational invariance, which suits tabular data because each column carries its own meaning (an MLP treats rotated feature axes equivalently, whereas axis-aligned tree splits preserve the natural basis and are additionally invariant to monotone transformations of individual features); and the ability to fit irregular, non-smooth target functions without a bias toward smooth approximations.^[42]

Ravid Shwartz-Ziv and Amitai Armon published Tabular Data: Deep Learning Is Not All You Need in Information Fusion in 2022 (originally an arXiv preprint in 2021).^[41] They reproduced the published results of TabNet, NODE, DNF-Net, and 1D-CNN on the datasets from their original papers and on a held-out set of additional benchmarks. The headline finding was that the deep models did not generalize beyond their authors' chosen datasets, while an XGBoost ensemble tuned with the same compute won on most of them.^[41] The paper also showed that an ensemble combining XGBoost with one or more neural models was usually the strongest, suggesting complementary inductive biases.^[41]

Subsequent work has refined the picture rather than overturning it. Borisov and colleagues' 2022 survey Deep Neural Networks and Tabular Data: A Survey in IEEE TNNLS reached similar conclusions.^[40] McElfresh and colleagues' TabZilla benchmark at NeurIPS 2023 (When Do Neural Nets Outperform Boosted Trees on Tabular Data?) examined 176 datasets and found that the answer depends on dataset characteristics, with neural networks closing or reversing the gap on a minority of tasks, especially those with many high-cardinality categorical features and moderate sample sizes.^[46] The Hollmann 2025 Nature paper on TabPFN v2 then reopened the question for the small-data regime specifically, where in-context learning offers a different kind of advantage than gradient boosting.^[48] On the neural side, TabM (Gorishniy et al., ICLR 2025) reported that a parameter-efficient ensemble of MLPs forms a stronger and more practical line of models than attention- and retrieval-based tabular networks, narrowing the gap to GBDT while remaining simple to train.^[49] On the in-context side, TabICL (Qu et al., ICML 2025) and TabPFN-2.5 (Prior Labs, 2025) pushed pre-trained predictors past the small-data ceiling, with both reporting wins over tuned CatBoost and XGBoost on datasets of tens of thousands of rows.^[50]^[51]

A fair summary of the literature as of 2026: tuned gradient boosting remains the default on medium to large tabular classification tasks, and stays competitive everywhere; the small-data and now medium-data regimes are increasingly contested by pre-trained in-context models such as TabPFN v2, TabPFN-2.5, and TabICL; a well-tuned MLP ensemble such as TabM closes much of the remaining gap among purely neural methods; and ensembles of GBDT with at least one neural or in-context component often beat either alone in competitions.

Benchmarks and datasets

Benchmarks for tabular classification draw on three main sources. The first is the UCI Machine Learning Repository, founded by David Aha at UC Irvine in 1987, which hosts many of the oldest reference datasets including Iris (150 rows, 4 features, 3 classes), Adult (Census income, 48,842 rows, 14 features), Covertype (581,012 rows, 54 features, 7 classes), Bank Marketing, Wine Quality, and Higgs. UCI remains the canonical source even though it has been criticized for the small sample sizes and the cumulative effect of leaderboard chasing.

The second source is OpenML, a community platform founded by Joaquin Vanschoren in 2014 that hosts datasets, task definitions, and machine learning runs in a reproducible format. The OpenML CC-18 benchmark, defined by Bischl, Casalicchio, Feurer, and colleagues in 2017, is a curated suite of 72 classification datasets chosen to be diverse, well-behaved, and free from obvious leakage.^[24] The OpenML-CTR23 suite is a regression analog, covered in the sibling article on tabular regression models. OpenML's larger AutoML benchmarks include hundreds of datasets.

The third source is Kaggle, which hosts both academic-style competitions on cleaned datasets and industry challenges with messy production data. Kaggle Tabular Playground, run roughly monthly since 2021, has provided a steady stream of tabular classification problems. The cumulative Kaggle leaderboard for tabular competitions remains dominated by XGBoost, LightGBM, and CatBoost.

The table below lists these sources alongside more recent benchmarks.

Benchmark	Year	Curators	Scope
UCI ML Repository	1987	Aha, Bay, others (UC Irvine)	Hundreds of small to medium classical datasets
OpenML CC-18	2017	Bischl et al.^[24]	72 curated classification tasks
OpenML AutoML Benchmark	2019	Gijsbers et al.	39 classification tasks for AutoML
AMLB v2	2024	Gijsbers et al.^[47]	Expanded AutoML benchmark, 104 datasets
Grinsztajn benchmark	2022	Grinsztajn, Oyallon, Varoquaux^[42]	45 medium tabular datasets, tree vs deep comparison
TabZilla	2023	McElfresh et al.^[46]	176 datasets, model-by-dataset performance map
TabBench / TALENT	2024	Various	Larger collections for tabular foundation models

Metrics most often reported are accuracy and log-loss for balanced multiclass problems, AUROC and average precision for imbalanced binary tasks, and macro F1 for multiclass problems with class imbalance. Calibration is reported with Brier score or expected calibration error.
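
Two of these metrics are easy to misread without seeing how they are computed. The sketch below computes AUROC as the probability that a randomly chosen positive row is scored above a randomly chosen negative row (ties count as one half), and expected calibration error for a binary task with equal-width confidence bins. The bin count, the toy data, and the function names are assumptions for the example.

```python
def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(y_true, p_pos, n_bins=10):
    """Weighted mean gap between confidence and accuracy over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pos):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(hit for _, hit in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Toy labels and scores: one positive (0.4) is ranked below one negative (0.5).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.5]

auc = auroc(y_true, scores)
ece = expected_calibration_error(y_true, scores)
print("AUROC:", round(auc, 4))
print("ECE:", round(ece, 4))
```

AUROC depends only on the ordering of the scores, so a model can rank well and still be poorly calibrated, which is why the two kinds of metric are reported separately.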

AutoML for tabular

AutoML systems automate the model selection, hyperparameter tuning, and ensembling steps that drive tabular classification performance in competitions and in industry. Most major systems target tabular problems first because the search space is well understood and the building blocks are mature.

Auto-sklearn, by Matthias Feurer, Aaron Klein, and colleagues at the University of Freiburg, was introduced in 2015 and won the first ChaLearn AutoML Challenge (2015 to 2016).^[21] It uses Bayesian optimization (SMAC) over the scikit-learn pipeline space, meta-learning from previous datasets to warm-start the search, and post-hoc ensembling of the best models found.^[21]

H2O AutoML, part of the open-source H2O platform from H2O.ai, trains XGBoost, GBM, deep learning, GLM, and random forest models and stacks them with a meta-learner. It is widely used in industry, particularly in finance and insurance.

AutoGluon, released by Amazon Web Services in 2020 (Erickson et al., AutoGluon-Tabular), takes a different design stance: rather than searching the space of pipelines, it trains a hand-picked set of strong base models (LightGBM, CatBoost, XGBoost, MLP, random forest, k-NN) with sensible defaults and stacks them in multiple layers.^[31]^[55] The library has consistently ranked at or near the top of the OpenML AutoML benchmark since 2020 and has been particularly popular in Kaggle tabular competitions.^[47]

FLAML, by Chi Wang and colleagues at Microsoft Research in 2021, uses cost-aware search strategies (CFO and BlendSearch) to find good configurations quickly under a fixed budget.^[37] TPOT, by Randy Olson and Jason Moore in 2016, uses genetic programming to evolve pipelines.^[20] MLJAR AutoML is an open-source package with a commercial-friendly license that emphasizes interpretable ensembles and explanatory reports.

AutoML system	Year	Lead authors	Distinctive idea
Auto-WEKA	2013	Thornton et al.^[15]	SMAC over WEKA pipelines
Auto-sklearn	2015	Feurer et al. (Freiburg)^[21]	Meta-learning + SMAC + ensemble
TPOT	2016	Olson and Moore^[20]	Genetic programming over pipelines
H2O AutoML	2017	H2O.ai	Stacked ensembles of GBM, GLM, DL
AutoGluon	2020	Erickson et al. (AWS)^[31]	Hand-picked models, multi-layer stacking
FLAML	2021	Wang et al. (Microsoft)^[37]	Cost-aware blendsearch
MLJAR	2018	Płoński et al.	Interpretable AutoML, mode-driven

In most published comparisons, AutoGluon and H2O AutoML are the top performers on the OpenML AutoML Benchmark, with FLAML close behind under tight time budgets.^[47] AutoML systems typically beat a single well-tuned GBDT by one to three percentage points of accuracy and show lower variance across datasets, although the gap shrinks when the GBDT is tuned by an experienced practitioner.

Categorical encoding and feature engineering

Even with modern libraries that accept categorical columns directly, the encoding of categorical features remains one of the most consequential feature engineering choices for tabular classification. The most common schemes are listed below.

One-hot encoding expands a categorical column with k levels into k binary columns, one per level. This is the default in scikit-learn pipelines for linear models, naive Bayes, and neural networks. It is wasteful for high-cardinality columns.

Ordinal encoding assigns each level an integer code. The codes are arbitrary unless the variable is genuinely ordered, but tree-based learners can still find useful splits on ordinal codes regardless of the encoding order.

Target encoding (also called mean encoding) replaces each level with a smoothed average of the target over the training rows in that level. James-Stein and m-estimate smoothing are the most common variants. Target encoding is powerful for high-cardinality columns but leaks information from the target into the features unless the encoding is computed with care, typically by encoding each row from statistics computed on the other cross-validation folds. CatBoost's ordered target statistics, introduced in the original paper, are a built-in solution to this leakage.^[27]
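The leakage-avoiding variant of target encoding can be sketched as follows. This is a minimal illustration, assuming m-estimate smoothing toward the global prior and simple interleaved folds rather than shuffled ones. The function names, the smoothing weight `m`, and the toy data are hypothetical and not taken from any particular library.

```python
from collections import defaultdict


def m_estimate_encode(train_cats, train_y, apply_cats, m=2.0):
    """Encode apply_cats with smoothed target means learned from the training rows."""
    prior = sum(train_y) / len(train_y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(train_cats, train_y):
        sums[c] += y
        counts[c] += 1
    # Unseen levels fall back to the prior because their sum and count are both 0.
    return [(sums[c] + m * prior) / (counts[c] + m) for c in apply_cats]


def out_of_fold_target_encode(cats, y, n_folds=2, m=2.0):
    """Encode each row from statistics computed on the other folds only."""
    encoded = [0.0] * len(cats)
    for fold in range(n_folds):
        hold = [i for i in range(len(cats)) if i % n_folds == fold]
        rest = [i for i in range(len(cats)) if i % n_folds != fold]
        values = m_estimate_encode(
            [cats[i] for i in rest], [y[i] for i in rest], [cats[i] for i in hold], m
        )
        for i, v in zip(hold, values):
            encoded[i] = v
    return encoded


# Toy column and binary target (illustrative only).
cats = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 0, 0, 1, 1]
encoded = out_of_fold_target_encode(cats, y)
```

Because a row's own label never enters its encoding, flipping that label leaves the row's encoded value unchanged, which is the property that prevents the leakage described above.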

Embedding layers map each categorical level to a low-dimensional vector that is learned during neural network training. They were popularized for tabular data by the 2016 entity embeddings paper of Cheng Guo and Felix Berkhahn, who reported that a neural network with entity embeddings outperformed XGBoost on data from the Rossmann store sales Kaggle competition.^[18]

Hashing encoders map each level to one of a fixed number of buckets using a hash function, accepting some collisions in exchange for a bounded representation size.
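A hashing encoder can be sketched with the standard library alone. The example below is a minimal illustration, assuming an MD5 digest as the hash function (chosen only because it is stable across processes, unlike Python's built-in `hash` for strings). The helper names, the `salt` parameter, and the bucket count are hypothetical.

```python
import hashlib


def hash_bucket(level, n_buckets=8, salt="col"):
    """Map a categorical level to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(f"{salt}={level}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(levels, n_buckets=8, salt="col"):
    """Return one indicator vector of length n_buckets per input level."""
    rows = []
    for level in levels:
        row = [0] * n_buckets
        row[hash_bucket(level, n_buckets, salt)] = 1
        rows.append(row)
    return rows


# Levels never seen before still get a valid bucket, so no vocabulary is stored.
rows = hash_encode(["london", "paris", "london", "a-brand-new-city"])
```

The output width stays fixed at `n_buckets` however many distinct levels appear. Two different levels may share a bucket, which is the collision cost mentioned above.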

Tree-based learners handle missing values natively. XGBoost's sparsity-aware split finding learns a default direction for missing values at each split,^[17] and LightGBM likewise sends missing values to whichever side of a split reduces the loss more, while CatBoost by default treats a missing numeric value as smaller than every observed value. Linear models and neural networks typically require explicit imputation, for which scikit-learn's IterativeImputer, KNNImputer, and SimpleImputer are common choices.^[56] The MissForest method by Stekhoven and Bühlmann in 2012 is a strong baseline that uses random forests to impute missing values.^[14]
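The idea of a learned default direction for missing values can be sketched for a single candidate split. This is a minimal illustration, assuming the usual second-order gain with an L2 penalty and representing missing entries as `None`. The function names and the toy gradients are hypothetical, and real libraries fold this into their histogram-based split search.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Second-order split gain (up to a constant factor) with L2 penalty lam."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return score(g_left, h_left) + score(g_right, h_right) - parent


def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Try sending rows with a missing value left, then right; keep the better."""
    gl = hl = gr = hr = gm = hm = 0.0
    for v, g, h in zip(values, grads, hess):
        if v is None:
            gm, hm = gm + g, hm + h
        elif v < threshold:
            gl, hl = gl + g, hl + h
        else:
            gr, hr = gr + g, hr + h
    gain_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)


# Toy data: the rows with missing values have gradients resembling the left side.
values = [1.0, 2.0, None, 8.0, 9.0, None]
grads = [-0.5, -0.5, -0.5, 0.5, 0.5, -0.5]
hess = [0.25] * 6
direction, gain = best_default_direction(values, grads, hess, threshold=5.0)
```

In this toy case the rows with missing values are sent left, because grouping them with the rows they resemble yields the larger gain.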

Class imbalance is addressed through resampling and through reweighting. The Synthetic Minority Over-sampling Technique (SMOTE), proposed by Nitesh Chawla, Kevin Bowyer, Lawrence Hall, and W. Philip Kegelmeyer in JAIR in 2002, interpolates new minority-class samples between existing ones in feature space.^[10] ADASYN, by Haibo He and colleagues in 2008, focuses oversampling on minority points near the decision boundary.^[12] The imbalanced-learn library by Lemaitre, Nogueira, and Aridas (2017) implements these and many variants.^[26] Class weighting in the loss function is a simpler alternative that is supported natively by XGBoost, LightGBM, CatBoost, and scikit-learn.
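The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration rather than the reference algorithm: it assumes numeric features only, picks base points at random, and uses a brute-force neighbour search. The function name and parameters are hypothetical, and production use would normally go through imbalanced-learn.^[26]

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a random near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5)
```

With `k=2` each corner's nearest neighbours are the two adjacent corners, so every synthetic point lies on an edge of the square, between two real minority samples.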

Feature engineering for tabular classification can also include numerical transforms (log, Box-Cox, Yeo-Johnson, quantile binning), pairwise interactions, polynomial features, and domain-specific aggregations such as time-since-last-event for transactional data. Modern libraries such as Featuretools by Max Kanter and Kalyan Veeramachaneni (MIT, 2015) automate the construction of aggregation features over relational tables.

Open-source ecosystem

Most tabular classification work today is done in Python. The mature stack includes scikit-learn for classical models and pipelines, XGBoost, LightGBM, and CatBoost for boosting, PyTorch and Keras for neural models, and AutoGluon, H2O, and FLAML for AutoML. The pytorch-tabular library by Manu Joseph (2021) provides a unified PyTorch implementation of TabNet, FT-Transformer, NODE, and other neural tabular models.^[39] The pytorch-frame library by Stanford and Kumo.AI in 2024 provides a similar capability with a focus on relational tabular data.

For categorical encoding, the category_encoders library by Will McGinnis offers more than a dozen schemes including James-Stein, m-estimator, CatBoost-style ordered target statistics, and helmert/sum/backward-difference contrasts. For class imbalance, imbalanced-learn provides SMOTE, ADASYN, and several Tomek-link-based cleaning algorithms.^[26]

Production deployment increasingly relies on ONNX or Treelite (Hyunsu Cho et al., 2018) to convert tree ensembles to a portable runtime, and on tools such as Hummingbird (Microsoft) to compile tree models into tensor operations that run on GPUs.

In R, the workhorses are the gbm package (Greg Ridgeway), xgboost, lightgbm, and ranger (a fast random forest implementation by Marvin Wright). H2O and tidymodels also support tabular workflows. Julia's MLJ framework offers a similar tabular-first design.

Limitations

Despite the maturity of the field, several open problems remain.

Tabular foundation models still face scale limits, although the ceiling has risen quickly. TabPFN v2 handles roughly 10,000 rows and 500 features, while the 2025 generation extended this to about 50,000 rows for TabPFN-2.5 and up to 500,000 rows for TabICL.^[48]^[50]^[51] As of 2026 these models still struggle with wide tables (more columns than rows) and with very large datasets in the tens of millions of rows, where in-context inference cost grows with the size of the support set and subsampling becomes necessary.

Transfer learning across tabular datasets is largely unsolved. Pre-trained tabular models do not consistently transfer their learned representations to new datasets the way pre-trained language and vision models do. The closest existing analog is in-context learning with TabPFN-style models.

Distribution shift is endemic in tabular deployment but rarely modeled explicitly. Most benchmarks assume IID splits, while production data often suffers from temporal drift, covariate shift, and concept drift. Methods such as adversarial validation and TrAdaBoost provide partial solutions.

Tree ensembles often produce miscalibrated probabilities. This matters little for ranking-driven applications evaluated with AUROC, which depends only on the ordering of scores, but it becomes a problem for probability-sensitive ones. Isotonic regression and Platt scaling are the common post-hoc corrections. Calibration of deep tabular models is generally worse than that of GBDT, especially under class imbalance.
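Isotonic regression, one of the post-hoc corrections named above, can be sketched with the pool-adjacent-violators procedure. This is a minimal illustration, assuming distinct scores and a held-out calibration set. The function name and toy data are hypothetical, and a practical version would also interpolate between fitted points for new scores.

```python
def isotonic_fit(scores, labels):
    """Pool adjacent violators: a non-decreasing calibrated value per sorted score."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [x for x, _ in pairs], fitted


# Toy calibration set: raw model scores and observed labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 1, 0, 1, 1, 1]
sorted_scores, fitted = isotonic_fit(scores, labels)
# fitted is [0.0, 0.5, 0.5, 1.0, 1.0, 1.0]
```

The out-of-order pair at scores 0.2 and 0.3 is pooled into a single value of 0.5, which keeps the mapping monotone while matching the observed positive rate in that region.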

Interpretability remains in tension with accuracy. SHAP (Lundberg and Lee, NeurIPS 2017) and TreeSHAP provide a near-canonical attribution method for tree ensembles, but the same techniques applied to neural tabular models give attributions that are noisier and less reproducible across seeds.^[23]

Categorical encoding is still a manual choice with large effects on accuracy, even with libraries that handle the encoding internally. Best practices for high-cardinality columns are still evolving.

Fairness and disparate impact are first-order concerns in many tabular applications (credit, hiring, criminal justice). The field has produced pre-processing, in-processing, and post-processing methods such as reweighing, adversarial debiasing, and equalized-odds post-processing, but there is no consensus on which to apply when.
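Reweighing, the simplest of the methods named above, can be sketched as a per-row weight computation. This is a minimal illustration in its commonly described form, where each row receives the weight P(group) · P(label) / P(group, label), so that group and label look statistically independent under the weighted data. The function name and toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight w(g, y) = P(g) * P(y) / P(g, y) for every row."""
    n = len(labels)
    g_count = Counter(groups)
    y_count = Counter(labels)
    gy_count = Counter(zip(groups, labels))
    return [
        (g_count[g] / n) * (y_count[y] / n) / (gy_count[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)


# Toy data: group "a" has a higher raw positive rate than group "b".
groups = ["a", "a", "a", "b", "b", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
```

After reweighing, both groups have the same weighted positive rate, equal to the overall positive rate. The weights can then be passed to any learner that accepts per-sample weights.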

References

Cox, D. R. (1958). The regression analysis of binary sequences. *Journal of the Royal Statistical Society Series B*, 20(2), 215-242.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. *Annals of Eugenics*, 7(2), 179-188.
Cover, T. M., and Hart, P. E. (1967). Nearest neighbor pattern classification. *IEEE Transactions on Information Theory*, 13(1), 21-27.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). *Classification and Regression Trees*. Wadsworth.
Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. *Machine Learning*, 20(3), 273-297.
Breiman, L. (1996). Bagging predictors. *Machine Learning*, 24(2), 123-140.
Breiman, L. (2001). Random forests. *Machine Learning*, 45(1), 5-32.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. *Annals of Statistics*, 29(5), 1189-1232.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. *Journal of Artificial Intelligence Research*, 16, 321-357.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. *Machine Learning*, 63(1), 3-42.
He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. *IJCNN 2008*.
Rendle, S. (2010). Factorization machines. *ICDM 2010*.
Stekhoven, D. J., and Bühlmann, P. (2012). MissForest: Non-parametric missing value imputation for mixed-type data. *Bioinformatics*, 28(1), 112-118.
Thornton, C., Hutter, F., Hoos, H., and Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. *KDD 2013*.
Cheng, H.-T., et al. (2016). Wide and deep learning for recommender systems. *DLRS 2016*.
Chen, T., and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. *KDD 2016*.
Guo, C., and Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv:1604.06737.
Juan, Y., Zhuang, Y., Chin, W.-S., and Lin, C.-J. (2016). Field-aware factorization machines for CTR prediction. *RecSys 2016*.
Olson, R. S., and Moore, J. H. (2016). TPOT: A tree-based pipeline optimization tool for automating machine learning. *ICML AutoML Workshop 2016*.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015). Efficient and robust automated machine learning. *NeurIPS 2015*.
Ke, G., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. *NeurIPS 2017*.
Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. *NeurIPS 2017*.
Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N., and Vanschoren, J. (2017). OpenML benchmarking suites. arXiv:1708.03731.
Wang, R., Fu, B., Fu, G., and Wang, M. (2017). Deep and cross network for ad click predictions. *AdKDD 2017*.
Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets. *JMLR*, 18(17), 1-5.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. *NeurIPS 2018*.
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). DeepFM: A factorization machine based neural network for CTR prediction. *IJCAI 2017*.
Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., and Tang, J. (2019). AutoInt: Automatic feature interaction learning via self-attentive neural networks. *CIKM 2019*.
Popov, S., Morozov, S., and Babenko, A. (2019). Neural oblivious decision ensembles for deep learning on tabular data. *ICLR 2020*.
Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. (2020). AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv:2003.06505.
Huang, X., Khetan, A., Cvitkovic, M., and Karnin, Z. (2020). TabTransformer: Tabular data modeling using contextual embeddings. arXiv:2012.06678.
Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., and Chi, E. (2020). DCN-V2: Improved deep and cross network and practical lessons for web-scale learning to rank systems. *WWW 2021*.
Arik, S. O., and Pfister, T. (2021). TabNet: Attentive interpretable tabular learning. *AAAI 2021*.
Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. (2021). Revisiting deep learning models for tabular data. *NeurIPS 2021*.
Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. (2021). SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv:2106.01342.
Wang, C., Wu, Q., Weimer, M., and Zhu, E. (2021). FLAML: A fast and lightweight AutoML library. *MLSys 2021*.
Padhi, I., Schiff, Y., Melnyk, I., Rigotti, M., Mroueh, Y., Dognin, P., Ross, J., Nair, R., and Altman, E. (2021). Tabular transformers for modeling multivariate time series. *ICASSP 2021*.
Joseph, M. (2021). PyTorch Tabular: A framework for deep learning with tabular data. arXiv:2104.13638.
Borisov, V., Leemann, T., Sessler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2022). Deep neural networks and tabular data: A survey. *IEEE TNNLS*.
Shwartz-Ziv, R., and Armon, A. (2022). Tabular data: Deep learning is not all you need. *Information Fusion*, 81, 84-90.
Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? *NeurIPS 2022 Datasets and Benchmarks*.
Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2023). TabPFN: A transformer that solves small tabular classification problems in a second. *ICLR 2023* (Outstanding Paper).
Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. (2023). TabLLM: Few-shot classification of tabular data with large language models. *AISTATS 2023*.
Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., and Kasneci, G. (2023). Language models are realistic tabular data generators. *ICLR 2023*.
McElfresh, D., Khandagale, S., Valverde, J., Prasad C, V., Feuer, B., Hegde, C., Ramakrishnan, G., Goldblum, M., and White, C. (2023). When do neural nets outperform boosted trees on tabular data? *NeurIPS 2023 Datasets and Benchmarks*.
Gijsbers, P., Bueno, M. L. P., Coors, S., LeDell, E., Poirier, S., Thomas, J., Bischl, B., and Vanschoren, J. (2024). AMLB: An AutoML benchmark. *Journal of Machine Learning Research*, 25, 1-65.
Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., Schirrmeister, R. T., and Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. *Nature*, 637, 319-326.^[1]
Gorishniy, Y., Kotelnikov, A., and Babenko, A. (2025). TabM: Advancing tabular deep learning with parameter-efficient ensembling. *ICLR 2025*. arXiv:2410.24210.^[2]
Qu, J., Holzmüller, D., Varoquaux, G., and Le Morvan, M. (2025). TabICL: A tabular foundation model for in-context learning on large data. *ICML 2025*. arXiv:2502.05564.^[3]
Prior Labs (2025). TabPFN-2.5: Advancing the state of the art in tabular foundation models. Technical report. arXiv:2511.08667.^[4]
XGBoost documentation.^[5]
LightGBM documentation.^[6]
CatBoost documentation.^[7]
AutoGluon documentation.^[8]
scikit-learn documentation.^[9]
https://www.nature.com/articles/s41586-024-08328-6 (Accessed 2026-05-31)
https://arxiv.org/abs/2410.24210 (Accessed 2026-05-31)
https://arxiv.org/abs/2502.05564 (Accessed 2026-05-31)
https://arxiv.org/abs/2511.08667 (Accessed 2026-05-31)
https://xgboost.readthedocs.io/ (Accessed 2026-05-31)
https://lightgbm.readthedocs.io/ (Accessed 2026-05-31)
https://catboost.ai/docs/ (Accessed 2026-05-31)
https://auto.gluon.ai/ (Accessed 2026-05-31)
https://scikit-learn.org/stable/ (Accessed 2026-05-31)
