Tabular Regression Models

See also: Tabular Models and Tasks

Tabular regression models are machine learning systems that predict a continuous numeric target from a vector of tabular features, where rows are samples and columns are heterogeneous attributes (numeric, ordinal, categorical, or binary). The task setting covers a large share of practical machine learning applications, including house price prediction, demand forecasting, time-to-event modeling, sensor readings, energy load estimation, drug response curves, and risk scoring. Unlike image, text, or audio data, tabular features have no canonical spatial or temporal ordering, the columns differ in scale and semantics, and missing values are common. Predicting a continuous value also raises questions that do not arise in classification, such as how to handle heavy-tailed target distributions, how to model heteroscedastic noise, and how to produce calibrated prediction intervals rather than point estimates.

For roughly the last decade, gradient boosted decision trees as implemented by XGBoost, LightGBM, and CatBoost have been the dominant family of tabular regressors. These libraries have won most Kaggle regression competitions and remain the default baseline in industry. From 2017 onward, an active research program has tried to design neural network architectures that match gradient boosting on tabular regression, producing models such as TabNet, FT-Transformer, SAINT, and NODE. Since 2022, foundation models for tabular data have emerged, most notably TabPFN, whose second version (TabPFN v2) added regression support and was published in Nature in January 2025.

Overview

A tabular regression dataset consists of pairs $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ is a feature vector with mixed types and $y_i \in \mathbb{R}$ is a continuous target. The goal is to learn a function $f$ that maps a new feature vector to a real-valued prediction. In its simplest form the output is a single point estimate, but it can also be a probability density over the target or a set of conditional quantiles.

Tabular regression problems share a few recurring properties that shape model choice. Datasets are usually small to medium, from a few hundred to a few million rows, with anywhere from a handful to a few thousand columns. Features are heterogeneous, mixing continuous variables with categorical codes, ordinal levels, and binary flags. Target distributions are often skewed, with a long right tail in house prices, energy demand, claims amounts, and many other applied settings. Outliers are common, so loss functions and preprocessing decisions matter as much as the model class. Interpretability and calibration are often as important as predictive accuracy because regression outputs drive consequential decisions in finance, insurance, healthcare, and operations.

The most common loss functions for regression include squared error (mean squared error, MSE), absolute error (mean absolute error, MAE), Huber loss (quadratic near zero and linear in the tails), log-cosh, and quantile (pinball) loss. Squared error is the maximum-likelihood objective under Gaussian noise and gives the conditional mean. Absolute error is the maximum-likelihood objective under Laplace noise and gives the conditional median. Huber loss combines the robustness of MAE in the tails with the smoothness of MSE near zero. Quantile loss with parameter $\tau \in (0, 1)$ gives the conditional $\tau$-quantile and is the basis of quantile regression and prediction intervals.
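The link between loss and estimand can be checked numerically. The sketch below uses made-up toy targets with one large outlier and a brute-force grid search over constant predictions; the Huber threshold `delta=5.0` is an arbitrary choice for illustration.

```python
import numpy as np

def huber(residual, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual**2, delta * (abs_r - 0.5 * delta))

# Toy targets with one large outlier (illustrative numbers only).
y = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 50.0])

# Evaluate each loss for every constant prediction c on a fine grid.
grid = np.linspace(0.0, 60.0, 6001)
residuals = y[None, :] - grid[:, None]  # shape (len(grid), len(y))

best_mse = grid[np.argmin((residuals**2).mean(axis=1))]
best_mae = grid[np.argmin(np.abs(residuals).mean(axis=1))]
best_huber = grid[np.argmin(huber(residuals, delta=5.0).mean(axis=1))]

print(best_mse, best_mae, best_huber)
```

On this sample the squared-error minimizer lands on the mean, which the outlier pulls upward, the absolute-error minimizer lands on the median, and the Huber minimizer sits between the two.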

Reporting metrics typically include root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), the coefficient of determination $R^2$, and for probabilistic predictions the negative log-likelihood, the continuous ranked probability score (CRPS), and the pinball loss at chosen quantiles.
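As a rough illustration, the point-prediction metrics can be computed directly with NumPy. The helper name `regression_metrics` and the toy numbers are assumptions made for this sketch.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (as a fraction), and R^2 for point predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err**2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean())**2)    # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(err**2))),
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err / y_true))),  # undefined when a target is 0
        "r2": float(1.0 - ss_res / ss_tot),
    }

metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(metrics)
```

MAPE is reported here as a fraction rather than a percentage, and it is undefined when any target equals zero, which is one reason it is usually reported alongside RMSE or MAE rather than alone.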

Definition and problem setup

Let the training set be $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$. A regression model is a function $f: \mathcal{X} \to \mathbb{R}$ chosen to minimize the empirical risk

$$\hat f = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \Omega(f),$$

where $L$ is a loss function and $\Omega$ is a regularizer that penalizes model complexity. The choice of $\mathcal{F}$ (linear, kernel, tree ensemble, neural network, in-context predictor) and the choice of $L$ together define a regression algorithm.

Features may be continuous (age, income, sensor reading), categorical with low cardinality (sex, country code), categorical with high cardinality (zip code, product identifier), ordinal (education level, satisfaction score), boolean, or text or date fragments that have been preprocessed. The defining feature of a tabular setting is that no a priori topology is imposed on the columns: permuting the columns leaves the prediction problem unchanged after appropriate renaming.

Training a tabular regressor typically involves splitting the data into training, validation, and test partitions; encoding categorical and missing values; choosing a model class and loss function; tuning hyperparameters by cross-validation or a held-out validation set; and reporting performance on the test set using one or more of RMSE, MAE, MAPE, and $R^2$. For applications that require uncertainty, the model also outputs prediction intervals, conditional quantiles, or full predictive distributions, which are evaluated by interval coverage, pinball loss, or CRPS.

Classical algorithms

The earliest tabular regressors were linear models. Ordinary least squares regression, formalized by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809, fits a linear combination of features to minimize the sum of squared residuals and has a closed-form solution through the normal equations. Ridge regression, proposed by Arthur Hoerl and Robert Kennard in 1970 (also known as Tikhonov regularization in the inverse problems literature), adds an L2 penalty on the coefficient vector. Ridge is the standard remedy when the design matrix is poorly conditioned or when there are more features than samples.
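A minimal sketch of the closed-form solutions, assuming centered data so that no intercept is needed. The helper name `ridge_fit`, the synthetic data, and the penalty value are illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam=0.0):
    """Solve (X^T X + lam * I) w = X^T y; lam = 0 gives ordinary least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data from a known linear model (no intercept, for brevity).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_ols = ridge_fit(X, y, lam=0.0)     # normal equations
w_ridge = ridge_fit(X, y, lam=50.0)  # L2 penalty shrinks the coefficients

print(w_ols, w_ridge)
```

With `lam=0` this reproduces the normal equations. A positive `lam` makes the linear system better conditioned and shrinks the coefficient norm.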

Lasso regression, introduced by Robert Tibshirani in 1996 in his Journal of the Royal Statistical Society paper Regression Shrinkage and Selection via the Lasso, adds an L1 penalty. The L1 penalty drives some coefficients to exactly zero, producing a sparse model that performs implicit variable selection. Elastic net, introduced by Hui Zou and Trevor Hastie in 2005, blends the L1 and L2 penalties and tends to be preferable to lasso when groups of predictors are correlated, because lasso would otherwise pick one and drop the others. Elastic net is also better behaved when the number of predictors exceeds the number of observations.
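The sparsity induced by the L1 penalty can be seen in a small coordinate-descent sketch. It assumes the objective $\frac{1}{2n}\lVert y - Xw \rVert^2 + \lambda \lVert w \rVert_1$ with no intercept; the function names and synthetic data are hypothetical, and production solvers add convergence checks that are omitted here.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=300)

w_lasso = lasso_cd(X, y, lam=0.2)
print(w_lasso)
```

The coefficients of the four irrelevant features come out as exact zeros, while the two active coefficients are shrunk toward zero by roughly the penalty size.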

Nonparametric regressors include k-nearest neighbors regression, which predicts a weighted average of the targets of the $k$ closest training points under a chosen metric, and locally weighted regression (LOESS, William Cleveland, 1979), which fits a low-degree polynomial on a kernel-weighted neighborhood around each query point. Support vector regression, introduced by Vladimir Vapnik and colleagues in the mid 1990s and reviewed in Smola and Schölkopf's 2004 tutorial, minimizes an $\epsilon$-insensitive loss that ignores residuals smaller than $\epsilon$ and grows only linearly outside that band, which makes it less sensitive to outliers than squared error. SVR can use polynomial, RBF, or sigmoid kernels and was popular before tree ensembles took over in the 2000s.

Regression trees, the regression branch of the CART framework introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 book Classification and Regression Trees, recursively partition the feature space using axis-aligned splits chosen to maximize the reduction in mean squared error. A single tree is high-variance, but ensembles of trees are very competitive. Bagging (Breiman, 1996) trains many trees on bootstrap samples and averages their predictions. Random forest regression, introduced by Breiman in 2001, adds a feature subsampling rule at each split and averages the predictions of the resulting trees. Extra Trees, by Pierre Geurts and colleagues in 2006, randomize the split thresholds as well.
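The split criterion can be sketched for a single feature as follows. This is a simplified illustration (exhaustive search, sum of squared errors as the impurity); the function name `best_split` and the toy data are assumptions for the sketch.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse_reduction) for the best axis-aligned split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    total_sse = np.sum((y - y.mean())**2)
    best_thr, best_gain = None, 0.0
    for i in range(1, len(x)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # cannot split between identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        gain = total_sse - sse
        if gain > best_gain:
            # Place the threshold midway between the two neighboring values.
            best_thr, best_gain = 0.5 * (x_sorted[i - 1] + x_sorted[i]), gain
    return best_thr, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
threshold, gain = best_split(x, y)
print(threshold, gain)
```

A full tree applies this search to every feature at every node and recurses on the two child partitions until a stopping rule is met.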

Gradient boosting regression, formalized by Jerome Friedman in his 2001 Annals of Statistics paper Greedy Function Approximation: A Gradient Boosting Machine, builds an additive ensemble of weak learners (usually shallow regression trees) where each new tree is fit to the negative gradient of a differentiable loss with respect to the current ensemble's predictions. A 2002 follow-up paper introduced stochastic subsampling. Friedman's framework is the basis of every modern boosting implementation.
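A minimal sketch of this procedure for squared error on a single feature, using depth-one trees (stumps) as weak learners. For $L = \frac{1}{2}(y - f)^2$ the negative gradient is simply the residual $y - f$. All function names, the learning rate, and the number of rounds are illustrative assumptions rather than any library's defaults.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on a 1-D feature: (threshold, left_value, right_value)."""
    best = (np.inf, None, r.mean(), r.mean())
    for thr in np.unique(x)[:-1]:  # excluding the max keeps the right side non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    thr, left_value, right_value = stump
    if thr is None:  # degenerate case: the feature has a single distinct value
        return np.full(x.shape, left_value)
    return np.where(x <= thr, left_value, right_value)

def gradient_boost(x, y, n_rounds=200, learning_rate=0.1):
    """Additive ensemble of stumps, each fit to the current negative gradient."""
    init = y.mean()
    pred = np.full_like(y, init)
    stumps = []
    for _ in range(n_rounds):
        neg_grad = y - pred  # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, neg_grad)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return init, stumps, pred

rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 100)
y = np.sin(x) + 0.1 * rng.normal(size=100)

init, stumps, pred = gradient_boost(x, y)
baseline_mse = np.mean((y - y.mean())**2)
boosted_mse = np.mean((y - pred)**2)
print(baseline_mse, boosted_mse)
```

Swapping in another differentiable loss only changes the `neg_grad` line. Library implementations additionally re-optimize the leaf values for the chosen loss.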

The table below summarizes the most common classical regressors used on tabular data.

Algorithm	Year	Key reference	Loss	Notable property
Ordinary least squares	1805/1809	Legendre, Gauss	MSE	Closed form
Ridge regression	1970	Hoerl and Kennard	MSE + L2	Handles collinearity
Lasso	1996	Tibshirani	MSE + L1	Sparse coefficients
Elastic net	2005	Zou and Hastie	MSE + L1 + L2	Groups correlated features
k-nearest neighbors	1967	Cover and Hart	Any	No training phase
LOESS	1979	Cleveland	MSE (local)	Smooth nonparametric fits
Support vector regression	1995	Vapnik et al.	$\epsilon$-insensitive	Sparse support vectors
CART regression tree	1984	Breiman et al.	MSE	Interpretable
Bagging	1996	Breiman	Any	Variance reduction
Random forest regressor	2001	Breiman	MSE	Robust default
Extra Trees regressor	2006	Geurts et al.	MSE	Faster than RF
Gradient boosting machine	2001	Friedman	Any differentiable	Forward stagewise additive model

Gradient boosting frameworks

Gradient boosting now dominates tabular regression in industry and competitions. Three open-source libraries account for the bulk of usage.

XGBoost, introduced by Tianqi Chen and Carlos Guestrin at KDD 2016, was the first widely adopted production-grade gradient boosting library. For regression it supports several objectives. The default reg:squarederror minimizes mean squared error and is the maximum likelihood estimator under Gaussian noise. The reg:absoluteerror objective minimizes mean absolute error and gives the conditional median. The reg:pseudohubererror objective minimizes a smooth approximation of the Huber loss, which is quadratic near zero and linear in the tails, giving robustness to outliers without sacrificing differentiability. XGBoost also offers reg:gamma and reg:tweedie for strictly positive targets with skewed distributions, and reg:quantileerror (added in 2.0) for quantile regression with a user-supplied quantile parameter. XGBoost adds second-order gradient information through Newton steps, a regularized objective with explicit L1 and L2 penalties on leaf weights, a sparsity-aware split finder that learns a default direction for missing values, weighted quantile sketching for approximate split candidates, cache-aware block storage, and parallel histogram construction.
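The Newton step and regularized objective can be illustrated with the leaf-weight and split-gain formulas from the XGBoost paper, $w^* = -G / (H + \lambda)$ and $\text{gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma$, where $G$ and $H$ are sums of first and second derivatives of the loss. The sketch below is not the library's code: it omits the L1 term, and the helper names are hypothetical.

```python
import numpy as np

def leaf_weight(g, h, reg_lambda):
    """Optimal leaf value from one Newton step with an L2 penalty on the weight."""
    return -g.sum() / (h.sum() + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda, gamma=0.0):
    """Reduction in the regularized objective from splitting a node in two."""
    def score(gs, hs):
        return gs.sum()**2 / (hs.sum() + reg_lambda)
    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    return 0.5 * (left + right - score(g, h)) - gamma

# Squared error 0.5 * (y - pred)^2 has gradient (pred - y) and Hessian 1.
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2])
pred = np.zeros_like(y)
g = pred - y
h = np.ones_like(y)

w_unregularized = leaf_weight(g, h, reg_lambda=0.0)  # equals the mean residual
w_regularized = leaf_weight(g, h, reg_lambda=1.0)    # shrunk toward zero
gain = split_gain(g, h, left_mask=np.array([True, True, True, False, False]), reg_lambda=1.0)
print(w_unregularized, w_regularized, gain)
```

For squared error with no penalty, the Newton step recovers the mean residual in the leaf, which is exactly what a classical regression tree would predict.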

LightGBM, announced by Guolin Ke and colleagues at Microsoft Research at NeurIPS 2017, introduced two techniques that made gradient boosting faster and more memory-efficient on large datasets. Gradient-based One-Side Sampling (GOSS) keeps all samples with large gradients and randomly subsamples the rest, focusing computation on samples that are still hard to fit. Exclusive Feature Bundling (EFB) merges mutually exclusive sparse features into a single bundle, reducing the effective dimensionality of categorical inputs after one-hot encoding. LightGBM also uses histogram-based binning of continuous features and grows trees leaf-wise (best-first) rather than level-wise. For regression LightGBM supports MSE (regression), MAE (regression_l1), Huber (huber), Fair (fair), quantile (quantile), MAPE (mape), Poisson, gamma, and Tweedie objectives, among others.

CatBoost, developed by Yandex and presented by Liudmila Prokhorenkova and colleagues at NeurIPS 2018, focused on principled handling of categorical features and on preventing target leakage during their encoding. It introduced ordered target statistics, an encoding scheme that computes the running mean of the target for each category using only earlier rows in a random permutation. CatBoost also uses oblivious decision trees, where the same split is applied at every node of a given depth, giving compact models and fast inference. For regression it provides RMSE, MAE, MAPE, Quantile, Expectile, Huber, Tweedie, LogCosh, and Poisson objectives, as well as multi-target regression through the MultiRMSE loss.
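A simplified sketch of ordered target statistics, assuming a single random permutation and a smoothing prior. The function name, the `prior_weight` parameter, and the toy data are hypothetical; CatBoost itself uses several permutations and further refinements.

```python
import random

def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row with the smoothed mean target of earlier same-category rows."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "time" order over the rows
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Only rows visited earlier contribute, so a row never sees its own target.
        encoded[idx] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded

categories = ["a", "b", "a", "a", "b", "c"]
targets = [10.0, 20.0, 12.0, 14.0, 22.0, 30.0]
prior = sum(targets) / len(targets)

encoded = ordered_target_stats(categories, targets, prior)
print(encoded)
```

The first row of each category in the permutation receives the prior alone, and changing a row's own target leaves its encoding unchanged, which is the leakage property the scheme is designed to provide.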

HistGradientBoostingRegressor, added to scikit-learn in version 0.21 (2019), is a histogram-based implementation written in Python and Cython and modeled on LightGBM. It is often slower than the dedicated libraries but ships with scikit-learn and is therefore the most accessible gradient boosting regressor for many practitioners. It supports squared error, absolute error, Poisson, gamma, and quantile losses.

The table below compares the three production gradient boosting libraries on regression-specific features.

Feature	XGBoost	LightGBM	CatBoost
Initial release	2014	2016	2017
Paper venue	KDD 2016	NeurIPS 2017	NeurIPS 2018
Lead authors	Chen, Guestrin	Ke et al. (Microsoft)	Prokhorenkova et al. (Yandex)
MSE objective	reg:squarederror	regression	RMSE
MAE objective	reg:absoluteerror	regression_l1	MAE
Huber objective	reg:pseudohubererror	huber	Huber
Quantile objective	reg:quantileerror	quantile	Quantile
Poisson, Gamma, Tweedie	Yes	Yes	Yes (Tweedie, Poisson)
Native categorical handling	Limited (since 1.5)	Yes	Yes (ordered target statistics)
Missing value handling	Sparsity-aware split	Default direction	Default direction
Multi-target regression	Yes (since 2.0)	No (workaround)	Yes (MultiRMSE)
GPU support	Yes	Yes	Yes

In industry practice XGBoost is the most common default for regression, LightGBM is preferred for very large datasets with sparse features, and CatBoost tends to be the most ergonomic on datasets with many high-cardinality categorical columns and minimal manual feature engineering.

Neural approaches

Neural network regressors for tabular data go back to the early 1990s, when single hidden layer multilayer perceptrons were already competing with linear regression on housing and energy datasets. Modern neural tabular models try to do three things at once: handle mixed input types without manual encoding, learn feature interactions automatically, and remain competitive with gradient boosting on small and medium datasets.

TabNet, proposed by Sercan Arik and Tomas Pfister of Google Cloud at AAAI 2021, uses sequential attention to select features at each decision step, mimicking the way a tree splits on one feature at a time while remaining differentiable. The regressor variant replaces the classification head with a linear output and is trained with squared error or any other differentiable loss. TabNet supports an unsupervised pre-training task in which masked features are reconstructed, which can help on small labeled datasets.

FT-Transformer, by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko at NeurIPS 2021, converts each categorical and numerical feature into a token embedding, passes the tokens through a stack of standard transformer encoder blocks, and computes the prediction from the final representation of a special [CLS] token. The same paper, Revisiting Deep Learning Models for Tabular Data, also introduced a residual MLP baseline (ResNet for tabular) and reported careful benchmark comparisons with gradient boosting on both classification and regression datasets, finding that no neural model dominated XGBoost across all datasets.
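The tokenization step can be sketched in NumPy as follows. Each numerical feature is scaled by its own learned vector and shifted by a bias, and each categorical feature indexes its own embedding table. The random parameters, dimensions, and the position of the [CLS] token are illustrative assumptions; a real model would learn these parameters and feed the tokens to a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_token = 8
n_num, cat_cardinalities = 3, [4, 10]

# Stand-ins for learned parameters (random here, for illustration only).
W_num = rng.normal(size=(n_num, d_token))   # one weight vector per numeric feature
b_num = rng.normal(size=(n_num, d_token))   # one bias vector per numeric feature
cat_tables = [rng.normal(size=(card, d_token)) for card in cat_cardinalities]
b_cat = rng.normal(size=(len(cat_cardinalities), d_token))
cls_token = rng.normal(size=(1, d_token))

def tokenize(x_num, x_cat):
    """Map one row (numeric values, categorical indices) to a (n_tokens, d_token) array."""
    num_tokens = x_num[:, None] * W_num + b_num
    cat_tokens = np.stack([tbl[i] for tbl, i in zip(cat_tables, x_cat)]) + b_cat
    return np.concatenate([cls_token, num_tokens, cat_tokens], axis=0)

tokens = tokenize(np.array([0.5, -1.2, 3.0]), np.array([2, 7]))
print(tokens.shape)
```

After this step every column, regardless of type, is represented by a vector of the same width, so attention can mix numeric and categorical features on equal footing.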

SAINT (Self-Attention and Intersample Attention Transformer), by Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C. Bayan Bruss, and Tom Goldstein in 2021, added a second attention dimension across samples within a minibatch, allowing the model to compare a query row to other rows from the same batch. The method also used contrastive pre-training similar to that of SimCLR. SAINT was reported to outperform XGBoost, CatBoost, and LightGBM on several benchmark regression tasks in the original paper.

NODE (Neural Oblivious Decision Ensembles), by Sergei Popov, Stanislav Morozov, and Artem Babenko at Yandex in 2019, designed a differentiable analog of oblivious decision trees that can be trained end-to-end with gradient descent. A NODE layer is a concatenation of independent trees with their own learnable branching decisions and leaf values, and stacked layers allow each set of ensembles to take input from the previous layer. NODE was reported to outperform gradient boosting on several tabular regression tasks at the time of publication.

Deep and Cross Network (DCN) by Ruoxi Wang and colleagues in 2017 and DCN-V2 by Wang, Shivanna, Cheng, Jain, Lin, Hong, and Chi at WWW 2021 introduced an explicit cross layer that computes feature crosses of bounded degree. DCN-V2 replaced the cross weight vector with a cross weight matrix, which significantly increased the expressiveness of the cross network, and introduced a low-rank decomposition for efficiency. DCN-V2 has been used for both classification and regression in Google production systems and is open-source through TensorFlow Recommenders.
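A sketch of one DCN-V2 style cross layer, $x_{l+1} = x_0 \odot (W x_l + b) + x_l$, together with the low-rank variant in which $W$ is replaced by $U V^\top$. The function names and random inputs are illustrative.

```python
import numpy as np

def cross_layer_v2(x0, xl, W, b):
    """Full-rank cross layer: elementwise product of x0 with an affine map of xl, plus a skip."""
    return x0 * (W @ xl + b) + xl

def cross_layer_v2_low_rank(x0, xl, U, V, b):
    """Same layer with W factored as U @ V.T, costing O(d * r) instead of O(d^2)."""
    return x0 * (U @ (V.T @ xl) + b) + xl

rng = np.random.default_rng(0)
d, r = 4, 2
x0 = rng.normal(size=d)
U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
b = rng.normal(size=d)

out_full = cross_layer_v2(x0, x0, U @ V.T, b)
out_low_rank = cross_layer_v2_low_rank(x0, x0, U, V, b)
print(out_full, out_low_rank)
```

Each additional cross layer raises the maximum polynomial degree of the feature interactions by one, which is what bounds the degree of the learned crosses.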

DeepGBM, by Guolin Ke and colleagues at KDD 2019, integrated gradient boosting decision trees and neural networks through two components: CatNN focuses on sparse categorical features, and GBDT2NN distills knowledge from a trained gradient boosting model into a neural network operating on dense numerical features. The framework was designed for online prediction with continuous data updates and supports both regression and classification.

The table below summarizes the most cited neural tabular regressors.

Model	Year	Authors	Core idea
DCN	2017	Wang et al.	Explicit cross layers of bounded degree
NODE	2019	Popov, Morozov, Babenko	Differentiable oblivious tree ensemble
DeepGBM	2019	Ke et al. (Microsoft)	GBDT distillation + neural categorical model
TabNet	2021	Arik and Pfister (Google)	Sequential attention with sparse feature selection
FT-Transformer	2021	Gorishniy et al. (Yandex)	Feature tokenizer + standard transformer encoder
ResNet for tabular	2021	Gorishniy et al.	Residual MLP baseline
SAINT	2021	Somepalli et al.	Attention across columns and across samples
DCN-V2	2021	Wang et al. (Google)	Improved cross network with weight matrix

Empirical studies have consistently found that careful tuning of an MLP or a residual MLP recovers most of the gap to the more elaborate models. When gradient boosting and neural models receive the same tuning budget, gradient boosting typically wins on small to medium tabular regression datasets.

Foundation models for tabular regression

A newer line of work treats tabular regression the way GPT treats text: train one large model once on a huge variety of synthetic or real tabular tasks, then deploy it as a frozen in-context predictor that takes a labeled support set as input and outputs predictions for a query set without per-dataset gradient updates.

TabPFN (Tabular Prior-data Fitted Network), introduced by Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter at ICLR 2023, was the first widely cited model of this type. The original TabPFN was a transformer trained on millions of synthetic classification tasks generated from a structural causal model prior, with no support for regression. At inference time the entire training set of the target task is concatenated with the query points and fed through the network in a single forward pass; the model performs approximate Bayesian inference in the function space defined by its synthetic prior. The original release was limited to roughly 1,000 training rows, 100 features, and 10 classes, with purely numerical inputs. Within those constraints it matched or beat tuned gradient boosting and AutoML systems while running in under a second on a single GPU. The paper received an Outstanding Paper award at ICLR 2023.

TabPFN v2, by Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Schirrmeister, and Frank Hutter at the University of Freiburg and Prior Labs, was published in Nature in January 2025 under the title Accurate predictions on small data with a tabular foundation model. The second version increased the parameter count, extended the model to handle categorical features and missing values natively, scaled to roughly 10,000 training rows and 500 features, and crucially added regression as a first-class task. For regression TabPFN v2 outputs a piecewise predictive distribution over the target, from which point estimates (mean or median), quantiles, and prediction intervals can be derived. The Nature paper reported that TabPFN v2 matched or surpassed tuned ensembles of XGBoost, CatBoost, and LightGBM on a broad benchmark of small to medium regression datasets, often after only a few seconds of inference and with no per-dataset hyperparameter tuning. Prior Labs released the model weights and a Python package under a non-commercial license through the Prior-Labs/TabPFN-v2-reg repository on Hugging Face.
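To illustrate how point estimates and intervals follow from a binned predictive distribution, the sketch below assumes a piecewise-constant density over fixed bin edges. It is a generic illustration with made-up bins and probabilities, not the TabPFN API or its exact output format.

```python
import numpy as np

def piecewise_mean(edges, probs):
    """Mean of a density that is constant within each bin."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(probs * centers))

def piecewise_quantile(edges, probs, tau):
    """Invert the piecewise-linear CDF implied by the bin probabilities."""
    cdf = np.concatenate([[0.0], np.cumsum(probs)])
    return float(np.interp(tau, cdf, edges))

# Hypothetical predictive distribution for one query row.
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
probs = np.array([0.1, 0.4, 0.4, 0.1])

mean = piecewise_mean(edges, probs)
median = piecewise_quantile(edges, probs, 0.5)
interval_90 = (piecewise_quantile(edges, probs, 0.05), piecewise_quantile(edges, probs, 0.95))
print(mean, median, interval_90)
```

Because the distribution here is right-skewed, the mean exceeds the median, and the 90 percent interval is asymmetric around both.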

TabPFN-2.5, released in late 2025, scaled the same approach to 50,000 training samples and 2,000 features and reported an 85 percent win rate against tuned gradient boosting on regression tasks. TabPFN-3, released as a technical report in 2026, scaled further.

TabDPT (Tabular Discriminative Pre-trained Transformer), released by Layer 6 AI in 2024, follows the same in-context learning recipe but is trained on a curated mix of real OpenML datasets rather than purely synthetic data. It targets the same regime of small to medium tabular tasks for both classification and regression.

GReaT (Generation of Realistic Tabular Data), by Vadim Borisov and colleagues at ICLR 2023, used pre-trained language models to generate tabular data rather than to predict targets, but the serialization scheme has been adapted for downstream regression in subsequent work. Related efforts include TabBERT (Padhi et al., IBM, ICASSP 2021) for transactional and event data.

Model	Year	Authors	Regression support
TabPFN	2023	Hollmann et al. (Freiburg)	No (classification only)
TabPFN v2	2025	Hollmann et al. (Prior Labs)	Yes (Nature 2025)
TabPFN-2.5	2025	Prior Labs	Yes
TabDPT	2024	Layer 6 AI	Yes
GReaT	2023	Borisov et al.	Generative, adapted
TabBERT	2021	Padhi et al. (IBM)	Yes (sequence regression)

Quantile and probabilistic regression

Many applications require more than a point estimate. The output of a regression model can be a full probability distribution, a set of conditional quantiles, or a calibrated prediction interval.

Quantile regression, introduced by Roger Koenker and Gilbert Bassett in their 1978 Econometrica paper Regression Quantiles, estimates the conditional $\tau$-quantile $q_\tau(x)$ of the target distribution rather than the conditional mean. The corresponding loss function is the pinball loss

$$L_\tau(y, \hat y) = \max(\tau (y - \hat y), (\tau - 1)(y - \hat y)),$$

which for $\tau = 0.5$ reduces to half the absolute error, so its minimizer is the conditional median. Quantile regression handles heteroscedastic noise naturally, because the spread between, say, the 10th and 90th percentile predictions can vary with the input. Linear quantile regression is implemented in the statsmodels package in Python and in the quantreg package in R, the latter maintained by Koenker.
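A quick numerical check of the pinball loss on a skewed synthetic sample: the constant that minimizes the loss should coincide, up to grid resolution, with the empirical $\tau$-quantile. The sample, grid, and names are illustrative.

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Mean pinball loss max(tau * (y - y_hat), (tau - 1) * (y - y_hat))."""
    diff = y - y_hat
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=2001)  # right-skewed toy target

tau = 0.9
grid = np.linspace(0.0, 15.0, 3001)
best_constant = grid[np.argmin([pinball_loss(y, c, tau) for c in grid])]
empirical_q = np.quantile(y, tau)
print(best_constant, empirical_q)
```

The asymmetric weights are what move the minimizer away from the median: under-predictions cost $\tau$ per unit and over-predictions cost $1 - \tau$.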

Quantile gradient boosting fits the pinball loss with gradient boosted trees. XGBoost's reg:quantileerror, LightGBM's quantile objective, and CatBoost's Quantile loss all support this. A common practice is to train three models for the 10th, 50th, and 90th percentiles, report the median as the point prediction, and use the 10th and 90th percentile predictions as the endpoints of an 80 percent prediction interval. The intervals are usually heteroscedastic and align with the local spread of the data, but they are not guaranteed to attain their nominal coverage on unseen data without further calibration.

Quantile random forests, introduced by Nicolai Meinshausen in 2006 in his Journal of Machine Learning Research paper, store the full distribution of training targets in each leaf of a random forest and predict quantiles by aggregating the empirical distributions across trees weighted by leaf membership. The method is implemented in the quantregForest R package and the quantile-forest Python package.

NGBoost (Natural Gradient Boosting), introduced by Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Ng, and Alejandro Schuler at ICML 2020, generalizes gradient boosting to probabilistic regression by treating the parameters of a conditional distribution (for example the mean and standard deviation of a Gaussian, or the location and scale of a Laplace) as targets of a multiparameter boosting algorithm. The natural gradient corrects the training dynamics of this multiparameter setup and gives well-calibrated predictive distributions. NGBoost can be paired with any base learner (typically a regression tree), any parametric family with continuous parameters, and any proper scoring rule. The reference implementation is open-source at github.com/stanfordmlgroup/ngboost and the method has been adopted in healthcare, finance, and weather forecasting applications.

DeepAR (Salinas et al., 2020) and DeepFactor models from Amazon Web Services Forecast use recurrent or attention-based neural networks to output the parameters of a parametric distribution for each time step in a forecast, which is structurally similar to NGBoost for time-indexed tabular data.

Gaussian process regression, formalized for machine learning by Carl Rasmussen and Christopher Williams in their 2006 book Gaussian Processes for Machine Learning, models the target as a draw from a Gaussian process indexed by the input. It provides a closed-form posterior predictive distribution and is the gold standard for uncertainty quantification on small tabular datasets, though it scales poorly past tens of thousands of points without approximations.
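The closed-form posterior can be sketched with a squared-exponential (RBF) kernel and a Cholesky solve. The kernel hyperparameters and noise variance are fixed by hand here, whereas in practice they are usually fit by maximizing the marginal likelihood; function names and data are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dist = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and standard deviation of the latent function at X_test."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)  # the O(n^3) step that limits scalability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, np.sqrt(np.maximum(np.diag(cov), 0.0))

X_train = np.linspace(0.0, 5.0, 20)[:, None]
y_train = np.sin(X_train[:, 0])
X_test = np.array([[2.5], [12.0]])  # one point inside the data, one far outside

mean, std = gp_posterior(X_train, y_train, X_test)
print(mean, std)
```

Inside the training range the posterior standard deviation is small, while far from the data the prediction reverts to the prior mean of zero with the prior standard deviation.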

Conformal prediction

Conformal prediction, introduced by Vladimir Vovk, Alex Gammerman, and Glenn Shafer in the late 1990s and developed in their 2005 book Algorithmic Learning in a Random World, is a framework for producing prediction sets (for regression, prediction intervals) with finite-sample, distribution-free coverage guarantees. Given a target miscoverage rate $\alpha$, the method outputs an interval $C(x)$ that satisfies

$$P(y \in C(x)) \geq 1 - \alpha$$

over the joint distribution of training and test data, assuming only exchangeability. Conformal prediction can wrap any point predictor, including linear regression, random forest, gradient boosting, and neural networks.

Split conformal regression, the most widely used variant, splits the training set into a proper training set used to fit a base regressor and a calibration set used to compute residual quantiles. The interval at a new point is the prediction plus or minus the $\lceil (n+1)(1-\alpha) \rceil$-th smallest absolute residual on a calibration set of size $n$, a slightly inflated empirical $(1-\alpha)$-quantile. The interval has identical width everywhere, which is a limitation when noise is heteroscedastic.
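A minimal sketch of split conformal regression on synthetic data, using a least-squares line as the base regressor. The data-generating process, sample sizes, and helper names are assumptions made for illustration.

```python
import numpy as np

def split_conformal_halfwidth(abs_residuals, alpha):
    """k-th smallest calibration residual with k = ceil((n + 1) * (1 - alpha))."""
    n = len(abs_residuals)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(abs_residuals)[k - 1]

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=n)
    return x, y

x_fit, y_fit = make_data(500)     # proper training set
x_cal, y_cal = make_data(500)     # calibration set
x_test, y_test = make_data(5000)  # held-out test set

slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

def predict(x):
    return slope * x + intercept

q = split_conformal_halfwidth(np.abs(y_cal - predict(x_cal)), alpha=0.1)
coverage = np.mean(np.abs(y_test - predict(x_test)) <= q)
print(q, coverage)
```

The empirical test coverage should come out close to the 90 percent target. The guarantee is marginal over calibration and test draws, so individual runs fluctuate around that level.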

Conformalized quantile regression, by Yaniv Romano, Evan Patterson, and Emmanuel Candès at NeurIPS 2019, combines quantile regression and conformal calibration. The base predictor outputs the $\alpha/2$ and $1 - \alpha/2$ conditional quantiles, and the calibration step adjusts the interval to obtain valid marginal coverage. The resulting intervals adapt to local noise levels and tend to be tighter than split conformal intervals on heteroscedastic data while retaining the coverage guarantee. The reference implementation is open-source at github.com/yromano/cqr.
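The calibration step of conformalized quantile regression can be sketched as follows. To keep the example self-contained, the base quantile model is replaced by two hand-written functions that deliberately produce a band that is too narrow; in practice these would be fitted quantile regressors. All names and the data-generating process are hypothetical.

```python
import numpy as np

def cqr_correction(y_cal, lo_cal, hi_cal, alpha):
    """Conformity score max(lo - y, y - hi) and its ceil((n + 1)(1 - alpha))-th smallest value."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    # Capping k at n is a simplification; the guarantee needs k <= n to hold naturally.
    k = min(int(np.ceil((n + 1) * (1.0 - alpha))), n)
    return np.sort(scores)[k - 1]

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(1.0, 3.0, size=n)
    y = x + 0.5 * x * rng.normal(size=n)  # noise scale grows with x
    return x, y

# Stand-ins for a fitted quantile model whose band is too narrow.
def base_lower(x):
    return 0.5 * x

def base_upper(x):
    return 1.5 * x

x_cal, y_cal = make_data(1000)
x_test, y_test = make_data(5000)

Q = cqr_correction(y_cal, base_lower(x_cal), base_upper(x_cal), alpha=0.1)
lower, upper = base_lower(x_test) - Q, base_upper(x_test) + Q

raw_coverage = np.mean((y_test >= base_lower(x_test)) & (y_test <= base_upper(x_test)))
cqr_coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(raw_coverage, cqr_coverage)
```

The correction widens the under-covering band by a constant on each side, so the interval width still varies with the input through the base quantiles. A band that over-covers would yield a negative correction and be tightened instead.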

Other variants include jackknife+ and CV+ (Barber, Candès, Ramdas, Tibshirani, 2021), which exchange computational cost for better small-sample behavior, and online conformal prediction for streaming and distribution-shift settings (Gibbs and Candès, 2021). The MAPIE Python package (Cordier and Renaudie, 2022) implements many of these methods on top of scikit-learn.

AutoML for tabular regression

AutoML systems automate model selection, hyperparameter tuning, and ensembling. Most major systems support regression as a first-class task.

Auto-sklearn, by Matthias Feurer, Aaron Klein, and colleagues at the University of Freiburg, was introduced in 2015, won the first ChaLearn AutoML Challenge (2015 to 2016), and supports regression through its AutoSklearnRegressor class. It uses Bayesian optimization (SMAC) over the scikit-learn pipeline space, meta-learning from previous datasets to warm-start the search, and post-hoc ensembling of the best models found.

H2O AutoML, part of the open-source H2O platform from H2O.ai, trains gradient boosting machines, XGBoost, random forests, deep learning, and generalized linear models on regression tasks and stacks them with a meta-learner. It is widely used in finance and insurance for risk scoring and pricing.

AutoGluon, released by Amazon Web Services in 2020 (Erickson, Mueller, Shirkov, Zhang, Larroy, Li, and Smola, AutoGluon-Tabular), trains a hand-picked set of strong base models (LightGBM, CatBoost, XGBoost, MLP, random forest, k-NN) with sensible defaults and stacks them in multiple layers. The library has consistently ranked at or near the top of the OpenML AutoML Benchmark on both regression and classification tasks since 2020.

FLAML, by Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu at Microsoft Research in 2021, uses cost-aware search strategies such as BlendSearch to find good configurations quickly under a fixed budget and supports regression with several base learners.

TPOT, by Randy Olson and Jason Moore at the University of Pennsylvania in 2016, uses genetic programming to evolve full scikit-learn pipelines and supports regression through TPOTRegressor.

AutoML system	Year	Lead authors	Approach
Auto-WEKA	2013	Thornton et al.	SMAC over WEKA pipelines
Auto-sklearn	2015	Feurer et al. (Freiburg)	Meta-learning + SMAC + ensemble
TPOT	2016	Olson and Moore	Genetic programming over pipelines
H2O AutoML	2017	H2O.ai	Stacked ensembles of GBM, GLM, DL
MLJAR	2018	Płoński et al.	Interpretable AutoML reports
AutoGluon	2020	Erickson et al. (AWS)	Hand-picked models, multi-layer stacking
FLAML	2021	Wang et al. (Microsoft)	Cost-aware BlendSearch

In most published comparisons AutoGluon and H2O AutoML lead the OpenML AutoML Benchmark on regression tasks, with FLAML close behind under tight time budgets.

Benchmarks and datasets

Benchmarks for tabular regression draw on three main sources.

The UCI Machine Learning Repository, founded by David Aha at UC Irvine in 1987, hosts many of the oldest reference datasets. The Boston Housing dataset, a 506-row regression dataset on median house prices in Boston suburbs, was once the canonical reference but is now considered ethically problematic because it includes a feature called B, derived from the proportion of Black residents in each town, that builds an assumption about racial segregation and house prices into the data. The dataset was deprecated in scikit-learn 1.0 (2021) and removed in version 1.2 (2022). The maintainers strongly discourage its use except for educational purposes about ethical issues in data science. Common modern reference regression datasets, many of them hosted at UCI, include California Housing (20,640 rows, 8 features; derived from the 1990 U.S. census by Pace and Barry in 1997 and now a common replacement for Boston Housing), Concrete Compressive Strength (1,030 rows, 8 features; Yeh, 1998), Wine Quality (4,898 rows, 11 features; Cortez et al., 2009), Energy Efficiency (768 rows, 8 features), Airfoil Self-Noise (1,503 rows, 5 features), and Bike Sharing Demand.

OpenML, a community platform founded by Joaquin Vanschoren in 2014, hosts datasets, task definitions, and machine learning runs in a reproducible format. The OpenML-CTR23 (Curated Tabular Regression) benchmark, defined by Sebastian Fischer, Liana Harutyunyan, Matthias Feurer, and Bernd Bischl in 2023 and presented at the AutoML 2023 workshop, contains 35 regression datasets selected by strict criteria: between 500 and 100,000 observations, fewer than 5,000 features after one-hot encoding, no time dependencies, a documented source, a numeric target with at least five distinct values, and a baseline that beats a linear model. The suite evaluates models that span the spectrum from interpretable (ridge, decision tree, GAM) to complex black box (XGBoost, random forest).

The OpenML AutoML Benchmark (Gijsbers et al., 2019 and 2024) includes both classification and regression tasks for AutoML system comparison.

The TabZilla benchmark, by Duncan McElfresh and colleagues at NeurIPS 2023, compared 19 algorithms on 176 classification datasets to map model-by-dataset performance. It contains no regression tasks, so its conclusions apply to regression only by analogy. The accompanying paper, "When Do Neural Nets Outperform Boosted Trees on Tabular Data?", found that the answer depends on dataset characteristics, with neural networks closing or reversing the gap on a minority of tasks.

Kaggle hosts both academic-style competitions on cleaned regression datasets and industry challenges with messy production data. Notable regression competitions include House Prices: Advanced Regression Techniques (kaggle.com/c/house-prices-advanced-regression-techniques), Mercedes-Benz Greener Manufacturing (predicting test-bench time for new vehicle configurations), Allstate Claims Severity (insurance claim amounts), Zillow Prize (home value estimates), and Santander Value Prediction. Winning solutions in Kaggle tabular regression competitions remain dominated by ensembles built around XGBoost, LightGBM, and CatBoost.

Benchmark	Year	Curators	Scope
UCI ML Repository	1987	Aha, Bay, others (UC Irvine)	Hundreds of small to medium datasets
California Housing	1997	Pace and Barry	20,640 rows, common replacement for Boston Housing
OpenML AutoML Benchmark	2019	Gijsbers et al.	Mixed classification and regression
Grinsztajn benchmark	2022	Grinsztajn, Oyallon, Varoquaux	45 medium tabular datasets, tree vs deep
OpenML-CTR23	2023	Fischer et al.	35 curated regression tasks
TabZilla	2023	McElfresh et al.	176 classification datasets, model-by-dataset map
AMLB v2	2024	Gijsbers et al.	Expanded AutoML benchmark, 104 datasets

Do neural networks beat trees on tabular regression

Two influential 2022 papers crystallized the case that gradient boosted trees still beat deep learning on tabular data when the comparison is fair, on both classification and regression.

Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux published "Why do tree-based models still outperform deep learning on typical tabular data?" at NeurIPS 2022. The paper built a benchmark of 45 medium-sized tabular datasets including 17 regression tasks, controlling for issues that plague earlier comparisons (uninformative features, mixing of classification and regression tasks, dataset size). With identical hyperparameter tuning budgets, the authors found that random forest, XGBoost, and scikit-learn's gradient boosted trees consistently outperformed FT-Transformer, ResNet, and MLPs on regression as well as on classification. They attributed the gap to three structural advantages of trees: robustness to uninformative features through embedded feature selection; the absence of rotation invariance, which suits tabular data because each column carries its own meaning (MLP-style networks are rotationally invariant and so cannot exploit the natural feature axes, whereas trees split on one feature at a time); and the ability to learn irregular, non-smooth target functions without the smoothness bias of neural networks.

Ravid Shwartz-Ziv and Amitai Armon published "Tabular Data: Deep Learning Is Not All You Need" in Information Fusion in 2022. They re-evaluated TabNet, NODE, DNF-Net, and 1D-CNN on the datasets from those models' original papers and on additional datasets that none of the papers had used, including regression tasks. The headline finding was that the deep models did not generalize beyond their authors' chosen datasets, while tuned XGBoost won on most of them. The paper also showed that an ensemble combining XGBoost with one or more neural models was usually the strongest, suggesting complementary inductive biases.

Subsequent work has refined the picture rather than overturning it. The Nature paper on TabPFN v2 (Hollmann et al., 2025) reopened the question for the small-data regime, where in-context learning offers a different kind of advantage than gradient boosting. The picture in 2025 is roughly the following: tuned gradient boosting remains the default winner on medium to large tabular regression tasks; small-data regimes are increasingly contested by TabPFN v2; and ensembles of gradient boosting with at least one neural component often beat either alone in competitions.

Categorical encoding and feature engineering

The encoding of categorical features is often among the most consequential feature engineering choices for tabular regression, particularly when columns have many levels. The most common schemes follow.

One-hot encoding expands a categorical column with $k$ levels into $k$ binary columns. It is the usual choice in scikit-learn pipelines for linear models and neural networks, but it is wasteful for high-cardinality columns, where it produces many sparse, rarely active columns.

Ordinal encoding assigns each level an integer code. The codes are arbitrary unless the variable is genuinely ordered, which can mislead linear models and neural networks, but tree-based learners can still isolate groups of levels through repeated threshold splits, whatever the encoding order.

Target encoding (also called mean encoding) replaces each level with a smoothed average of the target over the training rows in that level. For regression the encoded value is the per-category mean target, usually shrunk toward the global mean with James-Stein or m-estimate smoothing so that rare levels do not receive extreme values. Target encoding is powerful for high-cardinality columns but leaks the target into the features whenever a row's own target contributes to its encoding. The standard fix is cross-fitted target encoding: split the training data into K folds and encode each fold using means computed from the other K-1 folds only. Scikit-learn's TargetEncoder (added in version 1.3) does this automatically during fit_transform.
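The cross-fitting procedure can be sketched in plain NumPy. This is an illustration of the idea rather than scikit-learn's TargetEncoder; the function name, the `smoothing` parameter, and the random fold assignment are assumptions made for the example.

```python
import numpy as np

def cross_fitted_target_encode(categories, y, n_folds=5, smoothing=10.0, seed=0):
    """Encode each row with a smoothed target mean computed from the other folds."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(n) % n_folds  # random folds of near-equal size
    encoded = np.empty(n, dtype=float)

    for fold in range(n_folds):
        held_out = fold_of_row == fold
        fit_cats, fit_y = categories[~held_out], y[~held_out]
        global_mean = fit_y.mean()
        # Per-level target sum and count from the K-1 fitting folds only.
        levels, inverse = np.unique(fit_cats, return_inverse=True)
        sums = np.bincount(inverse, weights=fit_y)
        counts = np.bincount(inverse)
        # Shrink each level mean toward the global mean.
        smoothed = (sums + smoothing * global_mean) / (counts + smoothing)
        lookup = dict(zip(levels.tolist(), smoothed.tolist()))
        # Levels unseen in the fitting folds fall back to the global mean.
        encoded[held_out] = [
            lookup.get(c, global_mean) for c in categories[held_out].tolist()
        ]
    return encoded
```

Because a row's own target never enters the statistics used to encode it, changing that target leaves the row's encoding unchanged. At prediction time the encoder would instead use level means computed from the full training set.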

CatBoost's ordered target statistics, introduced in the original CatBoost paper at NeurIPS 2018, are a more refined solution to target leakage. The library generates several random permutations of the training data; for each row in a permutation, the encoding uses only the rows that precede it. This removes the bias that ordinary target encoding introduces and is especially valuable for regression on high-cardinality categorical columns.
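A rough single-permutation sketch of an ordered target statistic follows. It is not CatBoost's implementation, which averages over several permutations and couples the encoding with its boosting scheme; the function name and the `prior` and `prior_weight` parameters are assumptions for illustration.

```python
import numpy as np

def ordered_target_statistic(categories, y, prior, prior_weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random order."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    running_sum, running_count = {}, {}
    encoded = np.empty(len(y), dtype=float)

    for row in order:
        level = categories[row]
        s = running_sum.get(level, 0.0)
        c = running_count.get(level, 0)
        # Smoothed mean of the targets seen so far for this level.
        encoded[row] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does the row's own target join the running statistics.
        running_sum[level] = s + y[row]
        running_count[level] = c + 1
    return encoded
```

The first row of each level in the permutation receives the prior alone, and later rows receive progressively better-informed estimates, which is why averaging over several permutations helps.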

Embedding layers are learned during neural network training and map each categorical level to a low-dimensional vector. They were popularized by Cheng Guo and Felix Berkhahn's 2016 paper "Entity Embeddings of Categorical Variables", which showed that a neural network with entity embeddings beat XGBoost on the Rossmann store sales Kaggle regression competition.

Hashing encoders map each level to one of a fixed number of buckets using a hash function, accepting some collisions between levels in exchange for a bounded representation size.
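A minimal sketch of a hashing encoder, assuming one indicator column per bucket. The helper names are hypothetical, and MD5 is used only because it is stable across processes, unlike Python's built-in `hash`, which is salted for strings.

```python
import hashlib
import numpy as np

def hash_bucket(level, n_buckets):
    """Map a category level to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(str(level).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hashing_encode(levels, n_buckets=8):
    """Return an (n_rows, n_buckets) indicator matrix; distinct levels may collide."""
    encoded = np.zeros((len(levels), n_buckets))
    for i, level in enumerate(levels):
        encoded[i, hash_bucket(level, n_buckets)] = 1.0
    return encoded
```

The output width stays at `n_buckets` no matter how many distinct levels appear, including levels first seen at prediction time.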

Gradient boosting libraries handle missing values natively. XGBoost's sparsity-aware split finding and LightGBM learn a default direction at each split and send rows with a missing value that way, while CatBoost by default treats a missing numeric value as smaller than every observed value (configurable through nan_mode). Linear models and neural networks typically require explicit imputation, for which scikit-learn's SimpleImputer, KNNImputer, and IterativeImputer are common choices. The MissForest method by Daniel Stekhoven and Peter Bühlmann at ETH Zürich in 2012 uses random forests to impute missing values and remains a strong baseline.
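For models that need explicit imputation, the simplest strategy can be sketched in NumPy as median imputation with missing-value indicator columns. This stands in for the library imputers named above and does not reproduce any of them; the function names are hypothetical.

```python
import numpy as np

def fit_median_imputer(X_train):
    """Learn per-column medians from the training data, ignoring NaNs."""
    return np.nanmedian(X_train, axis=0)

def impute_with_indicator(X, medians):
    """Fill NaNs with the training medians and append one 0/1 missingness column per feature."""
    missing = np.isnan(X)
    filled = np.where(missing, medians, X)
    return np.hstack([filled, missing.astype(float)])
```

Learning the medians on the training split only, then reusing them on validation and test data, keeps the imputation from leaking information across splits.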

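Skewed numerical features and skewed targets often benefit from transformation. Common choices include log, square-root, Box-Cox (Box and Cox, 1964), and Yeo-Johnson (Yeo and Johnson, 2000) transforms of the target, and the same transforms or quantile binning for the features. Predicting the log of a strictly positive target and then exponentiating the prediction is a standard trick on house prices, energy demand, and claim amounts. Because a squared-error model fitted on the log scale estimates the mean of the log target, the exponentiated prediction targets the conditional geometric mean, which by Jensen's inequality is no larger than the conditional mean.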
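The log-target trick can be illustrated on synthetic data with an ordinary least-squares fit. The data-generating process, the coefficient values, and the `predict` helper are all assumptions made for the example; any regressor could replace the linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, size=(500, 1))
# Strictly positive, right-skewed target: multiplicative noise on an exponential trend.
y = np.exp(1.0 + 1.5 * X[:, 0] + rng.normal(0.0, 0.3, size=500))

# Fit on the log scale, where the relationship is linear and the noise is symmetric.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)

def predict(X_new):
    """Predict on the log scale, then map back to the original units."""
    design_new = np.column_stack([np.ones(len(X_new)), X_new])
    return np.exp(design_new @ coef)

print(coef)                              # intercept and slope on the log scale
print(predict(np.array([[0.0], [1.0]])))
```

Back-transformed predictions are always positive, which is one practical reason the trick is popular for prices and claim amounts.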

Calibration and reliability

Calibration for regression is the property that the predicted distribution or interval matches the empirical distribution of the target. For point predictions it is closely related to the bias of the model. For predictive distributions it is evaluated by reliability diagrams and by the calibration component of the CRPS.
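One common way to build a reliability diagram is to compute each observation's probability integral transform (PIT) value, meaning the predicted CDF evaluated at the observed target, and compare nominal levels with observed frequencies. The sketch below assumes Gaussian predictive distributions; the function names and the synthetic data are illustrative only.

```python
import numpy as np
from math import erf, sqrt

def pit_values(y, mu, sigma):
    """Predicted Gaussian CDF evaluated at each observed target."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

def reliability_curve(pit, levels):
    """Observed fraction of targets at or below each nominal quantile level."""
    return np.array([(pit <= p).mean() for p in levels])

rng = np.random.default_rng(0)
y_obs = rng.normal(0.0, 1.0, size=5000)
levels = np.array([0.1, 0.5, 0.9])

calibrated = reliability_curve(pit_values(y_obs, 0.0, 1.0), levels)
overconfident = reliability_curve(pit_values(y_obs, 0.0, 0.5), levels)
print(calibrated, overconfident)
```

For a calibrated model the curve tracks the diagonal. A predictive distribution that is too narrow puts too many observations in the tails, so the observed frequency exceeds the nominal level at low quantiles and falls short at high ones.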

Post-hoc recalibration can improve predicted quantiles: isotonic regression, or a parametric Platt-style sigmoid map, is fitted on held-out data to map nominal quantile levels to their observed frequencies. Conformal prediction provides a stronger guarantee: provided the calibration and test data are exchangeable, the resulting intervals attain a prescribed marginal coverage rate in finite samples by construction. Probabilistic gradient boosting frameworks such as NGBoost tend to provide reasonably well-calibrated predictive distributions out of the box because they optimize a proper scoring rule.
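The coverage guarantee of conformal prediction can be illustrated with split conformal prediction, its simplest variant. In this sketch a least-squares line stands in for an arbitrary point regressor, and the synthetic data, split sizes, and function names are assumptions.

```python
import math
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    return math.inf if rank > n else float(scores[rank - 1])

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=3000)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=3000)
train, cal, test = np.split(rng.permutation(3000), [1000, 1500])

# Fit the point model on the training split only.
slope, intercept = np.polyfit(x[train], y[train], deg=1)

def predict(v):
    return intercept + slope * v

# Absolute residuals on the held-out calibration split are the conformity scores.
half_width = conformal_quantile(np.abs(y[cal] - predict(x[cal])), alpha=0.1)
lower = predict(x[test]) - half_width
upper = predict(x[test]) + half_width
coverage = np.mean((y[test] >= lower) & (y[test] <= upper))
print(half_width, coverage)
```

The intervals have the same width everywhere, so the guarantee is marginal: coverage is close to 90% on average over test points but not necessarily within every region of the feature space.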

For heteroscedastic regression problems, NGBoost, conformalized quantile regression, and quantile gradient boosting paired with conformal calibration are the most reliable approaches. Plain gradient boosting with squared error gives a point estimate without uncertainty and should be combined with one of these methods when intervals are needed.
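Conformalized quantile regression (Romano et al., 2019) widens or shrinks a pair of predicted quantiles by a single calibrated offset, so interval width can still vary with the features. In the sketch below a hand-written `raw_quantiles` function stands in for fitted quantile gradient boosting models and is deliberately too narrow; all names and the synthetic data are assumptions.

```python
import math
import numpy as np

def cqr_offset(lo_cal, hi_cal, y_cal, alpha):
    """Calibrated offset to add to both ends of the predicted quantile band."""
    # Positive score: the target fell outside the band by that distance.
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else float(np.sort(scores)[rank - 1])

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 3.0, size=4000)
y = 2.0 * x + x * rng.normal(0.0, 1.0, size=4000)  # noise scale grows with x

def raw_quantiles(v):
    # Stand-in for fitted 5% and 95% quantile models; too narrow on purpose.
    return 2.0 * v - 1.0 * v, 2.0 * v + 1.0 * v

cal, test = np.arange(0, 1000), np.arange(1000, 4000)
offset = cqr_offset(*raw_quantiles(x[cal]), y[cal], alpha=0.1)

lo_raw, hi_raw = raw_quantiles(x[test])
raw_coverage = np.mean((y[test] >= lo_raw) & (y[test] <= hi_raw))
coverage = np.mean((y[test] >= lo_raw - offset) & (y[test] <= hi_raw + offset))
print(offset, raw_coverage, coverage)
```

The uncorrected band covers well under 90% of the test targets, and the calibrated offset restores coverage near the nominal level while the band remains wider where the noise is larger.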

Open-source ecosystem

Most tabular regression work today is done in Python. The stack includes scikit-learn for classical models and pipelines, XGBoost, LightGBM, and CatBoost for gradient boosting, PyTorch and Keras for neural models, and AutoGluon, H2O, and FLAML for AutoML. The pytorch-tabular library by Manu Joseph in 2021 provides a unified PyTorch implementation of TabNet, FT-Transformer, NODE, and other neural tabular models with both regression and classification heads.

For probabilistic regression the main packages are ngboost, pyro and numpyro (for Bayesian regression), gpytorch (for Gaussian processes), and the quantile objectives in XGBoost, LightGBM, and CatBoost. For conformal prediction the MAPIE and nonconformist packages are the most common, with crepes providing additional algorithms.

For categorical encoding, the category_encoders library by Will McGinnis offers more than a dozen schemes including James-Stein, m-estimator, CatBoost-style ordered target statistics, and Helmert, sum, and backward-difference contrasts.

Production deployment increasingly relies on ONNX or Treelite (Hyunsu Cho et al., 2018) to convert tree ensembles to a portable runtime, and on tools such as Hummingbird (Microsoft) to compile tree models into tensor operations that run on GPUs.

In R, the main packages are glmnet (Friedman, Hastie, and Tibshirani) for penalized linear regression, randomForest and ranger (Marvin Wright) for random forests, xgboost and lightgbm for boosting, quantreg (Koenker) for quantile regression, quantregForest (Meinshausen) for quantile random forests, and mlr3 for unified workflows. H2O and tidymodels also support tabular regression pipelines.

Limitations

Several open problems remain in tabular regression.

Tabular foundation models such as TabPFN v2 were limited to roughly 10,000 rows and 500 features at the time of the Nature 2025 publication. Subsequent releases (TabPFN-2.5 and TabPFN-3) extended the range but the models still cannot match gradient boosting on very large datasets or on wide tables with more columns than rows.

Transfer learning across tabular regression datasets is largely unsolved. Pre-trained tabular models do not consistently transfer their learned representations to new datasets the way pre-trained language and vision models do. The closest existing analog is in-context learning with TabPFN-style models.

Distribution shift is endemic in tabular regression deployments (pricing, demand, claims) but rarely modeled explicitly. Most benchmarks assume IID train and test splits, while production data often suffers from temporal drift and covariate shift. Time-aware cross-validation, adversarial validation, and online conformal prediction provide partial solutions.
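Time-aware cross-validation can be sketched as expanding-window splits in which every training index precedes every test index. The generator below is a hypothetical helper that assumes the rows are already sorted by time; it is not a particular library's splitter.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits, min_train):
    """Yield (train_idx, test_idx) pairs with training data strictly before test data."""
    fold_size = (n_samples - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * fold_size
        test_end = train_end + fold_size
        yield np.arange(train_end), np.arange(train_end, test_end)

for train_idx, test_idx in expanding_window_splits(100, n_splits=4, min_train=20):
    print(len(train_idx), test_idx[0], test_idx[-1])
```

Unlike a shuffled K-fold split, this never lets the model train on observations from the future of its test fold, so the validation error is a more honest estimate under temporal drift.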

Heavy-tailed target distributions, common in financial and insurance applications, are not well served by squared error loss. Tweedie, gamma, and quantile losses, log transforms of the target, and robust losses such as Huber are the usual remedies, but each has its own limitations.
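Two of the losses mentioned above have short closed forms, sketched here in NumPy. The default `delta` and the function names are illustrative choices.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear beyond delta, so outliers weigh less."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def pinball_loss(y, pred, q):
    """Quantile loss: under-prediction costs q per unit, over-prediction costs 1 - q."""
    diff = y - pred
    return np.maximum(q * diff, (q - 1.0) * diff)

print(huber_loss(np.array([0.5, 3.0])))
print(pinball_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]), q=0.9))
```

A residual of 3 costs 4.5 under half squared error but only 2.5 under the Huber loss with `delta=1`, which is what limits the pull of extreme targets on the fit.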

Calibration of tree ensembles for point predictions is usually acceptable, but the predictive distributions implied by quantile gradient boosting are not always well-calibrated without further conformal correction. Calibration of deep tabular regressors is generally worse than that of gradient boosting.

Interpretability remains in tension with accuracy. SHAP (Scott Lundberg and Su-In Lee, NeurIPS 2017) and TreeSHAP provide a near-canonical attribution method for tree ensembles, but the same techniques applied to neural tabular regressors give attributions that are noisier and less reproducible across seeds.

Fairness and disparate impact are first-order concerns in many regression applications (housing, lending, insurance). The deprecation of the Boston Housing dataset by scikit-learn in 2021, followed by its removal in 2022, was an explicit acknowledgment of the field's history with biased benchmarks. Methods for fair regression include constrained optimization with demographic parity and equalized accuracy constraints, but there is no consensus on which to apply when.

References

Legendre, A.-M. (1805). *Nouvelles méthodes pour la détermination des orbites des comètes*. Paris.
Gauss, C. F. (1809). *Theoria motus corporum coelestium in sectionibus conicis solem ambientium*. Hamburg.
Cover, T. M., and Hart, P. E. (1967). Nearest neighbor pattern classification. *IEEE Transactions on Information Theory*, 13(1), 21-27.
Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. *Technometrics*, 12(1), 55-67.
Koenker, R., and Bassett, G. (1978). Regression quantiles. *Econometrica*, 46(1), 33-50.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. *Journal of the American Statistical Association*, 74(368), 829-836.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). *Classification and Regression Trees*. Wadsworth.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. *Machine Learning*, 20(3), 273-297.
Breiman, L. (1996). Bagging predictors. *Machine Learning*, 24(2), 123-140.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society Series B*, 58(1), 267-288.
Pace, R. K., and Barry, R. (1997). Sparse spatial autoregressions. *Statistics and Probability Letters*, 33(3), 291-297.
Breiman, L. (2001). Random forests. *Machine Learning*, 45(1), 5-32.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. *Annals of Statistics*, 29(5), 1189-1232.
Smola, A. J., and Schölkopf, B. (2004). A tutorial on support vector regression. *Statistics and Computing*, 14(3), 199-222.
Zou, H., and Hastie, T. (2005). Regularization and variable selection via the elastic net. *Journal of the Royal Statistical Society Series B*, 67(2), 301-320.
Vovk, V., Gammerman, A., and Shafer, G. (2005). *Algorithmic Learning in a Random World*. Springer.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. *Machine Learning*, 63(1), 3-42.
Meinshausen, N. (2006). Quantile regression forests. *Journal of Machine Learning Research*, 7, 983-999.
Rasmussen, C. E., and Williams, C. K. I. (2006). *Gaussian Processes for Machine Learning*. MIT Press.
Stekhoven, D. J., and Bühlmann, P. (2012). MissForest: Non-parametric missing value imputation for mixed-type data. *Bioinformatics*, 28(1), 112-118.
Thornton, C., Hutter, F., Hoos, H., and Leyton-Brown, K. (2013). Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. *KDD 2013*.
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015). Efficient and robust automated machine learning. *NeurIPS 2015*.
Chen, T., and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. *KDD 2016*.
Guo, C., and Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv:1604.06737.
Olson, R. S., and Moore, J. H. (2016). TPOT: A tree-based pipeline optimization tool for automating machine learning. *ICML AutoML Workshop 2016*.
Ke, G., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. *NeurIPS 2017*.
Lundberg, S. M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. *NeurIPS 2017*.
Wang, R., Fu, B., Fu, G., and Wang, M. (2017). Deep and cross network for ad click predictions. *AdKDD 2017*.
Cho, H., et al. (2018). Treelite: Toolbox for decision tree deployment. *Conference on Machine Learning and Systems*.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. *NeurIPS 2018*.
Ke, G., Xu, Z., Zhang, J., Bian, J., and Liu, T.-Y. (2019). DeepGBM: A deep learning framework distilled by GBDT for online prediction tasks. *KDD 2019*.
Popov, S., Morozov, S., and Babenko, A. (2019). Neural oblivious decision ensembles for deep learning on tabular data. *ICLR 2020*.
Romano, Y., Patterson, E., and Candès, E. J. (2019). Conformalized quantile regression. *NeurIPS 2019*.
Duan, T., Avati, A., Ding, D. Y., Thai, K. K., Basu, S., Ng, A. Y., and Schuler, A. (2020). NGBoost: Natural gradient boosting for probabilistic prediction. *ICML 2020*.
Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., and Smola, A. (2020). AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv:2003.06505.
Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. *International Journal of Forecasting*, 36(3), 1181-1191.
Arik, S. O., and Pfister, T. (2021). TabNet: Attentive interpretable tabular learning. *AAAI 2021*.
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2021). Predictive inference with the jackknife+. *Annals of Statistics*, 49(1), 486-507.
Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. (2021). Revisiting deep learning models for tabular data. *NeurIPS 2021*.
Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., and Goldstein, T. (2021). SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv:2106.01342.
Wang, C., Wu, Q., Weimer, M., and Zhu, E. (2021). FLAML: A fast and lightweight AutoML library. *MLSys 2021*.
Wang, R., Shivanna, R., Cheng, D., Jain, S., Lin, D., Hong, L., and Chi, E. (2021). DCN-V2: Improved deep and cross network and practical lessons for web-scale learning to rank systems. *WWW 2021*.
Gibbs, I., and Candès, E. J. (2021). Adaptive conformal inference under distribution shift. *NeurIPS 2021*.
Borisov, V., Leemann, T., Sessler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2022). Deep neural networks and tabular data: A survey. *IEEE Transactions on Neural Networks and Learning Systems*.
Cordier, T., and Renaudie, V. (2022). MAPIE: An open-source library for distribution-free uncertainty quantification. *Quantitative Risk Modeling Workshop*.
Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? *NeurIPS 2022 Datasets and Benchmarks*.
Shwartz-Ziv, R., and Armon, A. (2022). Tabular data: Deep learning is not all you need. *Information Fusion*, 81, 84-90.
Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., and Kasneci, G. (2023). Language models are realistic tabular data generators (GReaT). *ICLR 2023*.
Fischer, S., Harutyunyan, L., Feurer, M., and Bischl, B. (2023). OpenML-CTR23: A curated tabular regression benchmarking suite. *AutoML 2023 Workshop*.
Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2023). TabPFN: A transformer that solves small tabular classification problems in a second. *ICLR 2023* (Outstanding Paper).
McElfresh, D., et al. (2023). When do neural nets outperform boosted trees on tabular data? *NeurIPS 2023 Datasets and Benchmarks*.
Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., Schirrmeister, R. T., and Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. *Nature*, 637, 319-326.
XGBoost documentation. *https://xgboost.readthedocs.io/*
LightGBM documentation. *https://lightgbm.readthedocs.io/*
CatBoost documentation. *https://catboost.ai/docs/*
AutoGluon documentation. *https://auto.gluon.ai/*
scikit-learn documentation. *https://scikit-learn.org/stable/*
NGBoost documentation. *https://stanfordmlgroup.github.io/projects/ngboost/*
Prior Labs TabPFN. *https://priorlabs.ai/tabpfn*

Related Articles

ARC-AGI 2

Tabular Classification Models

Automatic Speech Recognition Models

Text-to-Image Models

Visual Question Answering Models

Voice Activity Detection Models
