Tabular Regression Models
Last reviewed
May 13, 2026
Sources
59 citations
Review status
Source-backed
Revision
v2 · 7,892 words
See also: Tabular Models and Tasks
Tabular regression models are machine learning systems that predict a continuous numeric target from a vector of tabular features, where rows are samples and columns are heterogeneous attributes (numeric, ordinal, categorical, or binary). The task setting covers a large share of practical machine learning applications, including house price prediction, demand forecasting, time-to-event modeling, sensor readings, energy load estimation, drug response curves, and risk scoring. Unlike image, text, or audio data, tabular features have no canonical spatial or temporal ordering, the columns differ in scale and semantics, and missing values are common. Predicting a continuous value also raises questions that do not arise in classification, such as how to handle heavy-tailed target distributions, how to model heteroscedastic noise, and how to produce calibrated prediction intervals rather than point estimates.
For roughly the last decade, gradient boosted decision trees implemented by XGBoost, LightGBM, and CatBoost have been the dominant family of tabular regressors. These libraries have won most Kaggle regression competitions and remain the default baseline in industry. From 2017 onward, an active research program has tried to design neural network architectures that match gradient boosting on tabular regression, producing models such as TabNet, FT-Transformer, SAINT, and NODE. Since 2022, foundation models for tabular data have emerged, most notably TabPFN, whose second version (TabPFN v2) added regression support and was published in Nature in January 2025.
A tabular regression dataset consists of pairs $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ is a feature vector with mixed types and $y_i \in \mathbb{R}$ is a continuous target. The goal is to learn a function $f$ that maps a new feature vector to a real-valued prediction. The output of $f$ can be a single point estimate, a probability density over the target, or a set of conditional quantiles.
Tabular regression problems share a few recurring properties that shape model choice. Datasets are usually small to medium, from a few hundred to a few million rows, with anywhere from a handful to a few thousand columns. Features are heterogeneous, mixing continuous variables with categorical codes, ordinal levels, and binary flags. Target distributions are often skewed, with a long right tail in house prices, energy demand, claims amounts, and many other applied settings. Outliers are common, so loss functions and preprocessing decisions matter as much as the model class. Interpretability and calibration are often as important as predictive accuracy because regression outputs drive consequential decisions in finance, insurance, healthcare, and operations.
The most common loss functions for regression include squared error (mean squared error, MSE), absolute error (mean absolute error, MAE), Huber loss (quadratic near zero and linear in the tails), log-cosh, and quantile (pinball) loss. Squared error is the maximum-likelihood objective under Gaussian noise and gives the conditional mean. Absolute error is the maximum-likelihood objective under Laplace noise and gives the conditional median. Huber loss combines the robustness of MAE in the tails with the smoothness of MSE near zero. Quantile loss with parameter $\tau \in (0, 1)$ gives the conditional $\tau$-quantile and is the basis of quantile regression and prediction intervals.
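The link between a loss and the statistic it recovers can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the toy targets, and the grid search over constant predictions are illustrative assumptions, not part of any library discussed here.

```python
import math

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def log_cosh(y, y_hat):
    """log(cosh(residual)), written to avoid overflow for large residuals."""
    r = abs(y - y_hat)
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)

def best_constant(targets, loss, candidates):
    """Constant prediction with the lowest total loss over a candidate grid."""
    return min(candidates, key=lambda c: sum(loss(y, c) for y in targets))

targets = [1.0, 2.0, 3.0, 4.0, 100.0]      # toy data with one large outlier
grid = [i / 10 for i in range(0, 1001)]    # 0.0, 0.1, ..., 100.0

mean_fit = best_constant(targets, squared_error, grid)     # lands on the sample mean
median_fit = best_constant(targets, absolute_error, grid)  # lands on the sample median
print(mean_fit, median_fit)
```

On this toy sample the squared-error fit is pulled toward the outlier while the absolute-error fit stays at the median, which is the robustness trade-off the paragraph describes.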
Reporting metrics typically include root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), the coefficient of determination $R^2$, and for probabilistic predictions the negative log-likelihood, the continuous ranked probability score (CRPS), and the pinball loss at chosen quantiles.
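The point-prediction metrics can be written in a few lines each. The sketch below uses the standard textbook definitions with plain Python lists; the function names and the four-row example are assumptions made for illustration.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if any target is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions
y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
scores = {
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "mape": mape(y_true, y_pred),
    "r2": r2(y_true, y_pred),
}
print(scores)
```

MAPE divides by the target, so it is unusable when targets can be zero and it weights errors on small targets heavily; that is one reason it is usually reported alongside RMSE or MAE rather than alone.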
Let the training set be $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$. A regression model is a function $f: \mathcal{X} \to \mathbb{R}$ chosen to minimize the empirical risk
$$\hat f = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \Omega(f),$$
where $L$ is a loss function and $\Omega$ is a regularizer that penalizes model complexity. The choice of $\mathcal{F}$ (linear, kernel, tree ensemble, neural network, in-context predictor) and the choice of $L$ together define a regression algorithm.
Features may be continuous (age, income, sensor reading), categorical with low cardinality (sex, country code), categorical with high cardinality (zip code, product identifier), ordinal (education level, satisfaction score), boolean, or text or date fragments that have been preprocessed. The defining feature of a tabular setting is that no a priori topology is imposed on the columns: permuting the columns leaves the prediction problem unchanged after appropriate renaming.
Training a tabular regressor typically involves splitting the data into training, validation, and test partitions; encoding categorical and missing values; choosing a model class and loss function; tuning hyperparameters by cross-validation or a held-out validation set; and reporting performance on the test set using one or more of RMSE, MAE, MAPE, and $R^2$. For applications that require uncertainty, the model also outputs prediction intervals, conditional quantiles, or full predictive distributions, which are evaluated by interval coverage, pinball loss, or CRPS.
The earliest tabular regressors were linear models. Ordinary least squares regression, formalized by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809, fits a linear combination of features to minimize the sum of squared residuals and has a closed-form solution through the normal equations. Ridge regression, proposed by Arthur Hoerl and Robert Kennard in 1970 (also known as Tikhonov regularization in the inverse problems literature), adds an L2 penalty on the coefficient vector. Ridge is the standard remedy when the design matrix is poorly conditioned or when there are more features than samples.
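The closed-form solution through the normal equations, and the effect of the L2 penalty, can be sketched in pure Python. This is a minimal illustration under stated assumptions: no intercept, a penalty `lam` applied to the unnormalized sum of squared residuals, and a small hand-written Gaussian elimination routine standing in for a linear algebra library. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge_fit(rows, targets, lam=0.0):
    """Solve (X^T X + lam * I) beta = X^T y; lam = 0 gives ordinary least squares."""
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d):
        gram[i][i] += lam
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(gram, xty)

# Toy data generated exactly by beta = [1, 2]
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 2.0, 3.0]

beta_ols = ridge_fit(rows, targets, lam=0.0)
beta_ridge = ridge_fit(rows, targets, lam=1.0)
print(beta_ols, beta_ridge)
```

With `lam > 0` the matrix being inverted is always well conditioned enough to solve, and the coefficients shrink toward zero, which is why ridge is the usual remedy for collinear or wide design matrices.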
Lasso regression, introduced by Robert Tibshirani in 1996 in his Journal of the Royal Statistical Society paper Regression Shrinkage and Selection via the Lasso, adds an L1 penalty. The L1 penalty drives some coefficients to exactly zero, producing a sparse model that performs implicit variable selection. Elastic net, introduced by Hui Zou and Trevor Hastie in 2005, blends the L1 and L2 penalties and tends to be preferable to lasso when groups of predictors are correlated, because lasso would otherwise pick one and drop the others. Elastic net is also better behaved when the number of predictors exceeds the number of observations.
Nonparametric regressors include k-nearest neighbors regression, which predicts a weighted average of the targets of the $k$ closest training points under a chosen metric, and locally weighted regression (LOESS, William Cleveland, 1979), which fits a low-degree polynomial on a kernel-weighted neighborhood around each query point. Support vector regression, introduced by Vladimir Vapnik and colleagues in the mid 1990s and reviewed in Smola and Schölkopf's 2004 tutorial, minimizes an $\epsilon$-insensitive loss that ignores residuals smaller than $\epsilon$ and grows only linearly outside that band, which makes it less sensitive to outliers than squared error. SVR can use polynomial, RBF, or sigmoid kernels and was popular before tree ensembles took over in the 2000s.
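The k-nearest neighbors rule is short enough to write out directly. The sketch below assumes Euclidean distance and, for the weighted variant, inverse-distance weights; both choices and all names are illustrative assumptions rather than a fixed definition.

```python
import math

def knn_predict(train_x, train_y, query, k=3, weighted=False):
    """Average the targets of the k training points closest to the query."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    neighbors = by_distance[:k]
    if not weighted:
        return sum(y for _, y in neighbors) / len(neighbors)
    # Inverse-distance weights; the small constant avoids division by zero
    weights = [1.0 / (d + 1e-12) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Toy one-feature data
train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]

plain = knn_predict(train_x, train_y, [1.0], k=3)
weighted = knn_predict(train_x, train_y, [1.0], k=3, weighted=True)
print(plain, weighted)
```

There is no fitting step: all the work happens at prediction time, which is why the summary table lists "no training phase" as the notable property.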
Regression trees, the regression branch of the CART framework introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 book Classification and Regression Trees, recursively partition the feature space using axis-aligned splits chosen to maximize the reduction in mean squared error. A single tree is high-variance, but ensembles of trees are very competitive. Bagging (Breiman, 1996) trains many trees on bootstrap samples and averages their predictions. Random forest regression, introduced by Breiman in 2001, adds a feature subsampling rule at each split and averages the predictions of the resulting trees. Extra Trees, by Pierre Geurts and colleagues in 2006, randomize the split thresholds as well.
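The split criterion can be illustrated on a single numeric feature. The sketch below is a simplified assumption of how one CART-style split is chosen (exhaustive search over midpoints, reduction in the sum of squared errors); it is not the implementation of any particular library, and the names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (zero for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, sse_reduction) for one numeric feature."""
    pairs = sorted(zip(xs, ys))
    parent = sse([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - sse(left) - sse(right)
        if gain > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, gain)
    return best

# Toy data with an obvious jump between x = 3 and x = 10
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, gain = best_split(xs, ys)
print(threshold, gain)
```

A full tree applies this search to every feature at every node and recurses on the two children; random forests and Extra Trees change which features and thresholds are considered, not the criterion itself.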
Gradient boosting regression, formalized by Jerome Friedman in his 2001 Annals of Statistics paper Greedy Function Approximation: A Gradient Boosting Machine, builds an additive ensemble of weak learners (usually shallow regression trees) where each new tree is fit to the negative gradient of a differentiable loss with respect to the current ensemble's predictions. A 2002 follow-up paper introduced stochastic subsampling. Friedman's framework is the basis of every modern boosting implementation.
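For squared error the negative gradient is simply the residual, so the boosting loop reduces to repeatedly fitting a small tree to the current residuals. The following is a minimal sketch under simplifying assumptions: one numeric feature, depth-one trees (stumps), a fixed learning rate, and at least two distinct feature values. Function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one numeric feature."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for pos in range(1, len(order)):
        lo, hi = xs[order[pos - 1]], xs[order[pos]]
        if lo == hi:
            continue
        left = [residuals[i] for i in order[:pos]]
        right = [residuals[i] for i in order[pos:]]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, (lo + hi) / 2, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive ensemble of stumps fit to the negative gradient of squared error."""
    base = sum(ys) / len(ys)          # initial constant prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is y - f
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict

# Toy data: a quadratic trend
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1)

baseline = sum(ys) / len(ys)
mse_baseline = sum((y - baseline) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_baseline, mse_boosted)
```

Swapping the residual line for the negative gradient of another differentiable loss (Huber, for example) changes the objective without changing the loop, which is the generality Friedman's framework provides.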
The table below summarizes the most common classical regressors used on tabular data.
| Algorithm | Year | Key reference | Loss | Notable property |
|---|---|---|---|---|
| Ordinary least squares | 1805/1809 | Legendre, Gauss | MSE | Closed form |
| Ridge regression | 1970 | Hoerl and Kennard | MSE + L2 | Handles collinearity |
| Lasso | 1996 | Tibshirani | MSE + L1 | Sparse coefficients |
| Elastic net | 2005 | Zou and Hastie | MSE + L1 + L2 | Groups correlated features |
| k-nearest neighbors | 1967 | Cover and Hart | Any | No training phase |
| LOESS | 1979 | Cleveland | MSE (local) | Smooth nonparametric fits |
| Support vector regression | 1995 | Vapnik et al. | $\epsilon$-insensitive | Sparse support vectors |
| CART regression tree | 1984 | Breiman et al. | MSE | Interpretable |
| Bagging | 1996 | Breiman | Any | Variance reduction |
| Random forest regressor | 2001 | Breiman | MSE | Robust default |
| Extra Trees regressor | 2006 | Geurts et al. | MSE | Faster than RF |
| Gradient boosting machine | 2001 | Friedman | Any differentiable | Forward stagewise additive model |
Gradient boosting now dominates tabular regression in industry and competitions. Three open-source libraries account for the bulk of usage.
XGBoost, introduced by Tianqi Chen and Carlos Guestrin at KDD 2016, was the first widely adopted production-grade gradient boosting library. For regression it supports several objectives. The default reg:squarederror minimizes mean squared error and is the maximum likelihood estimator under Gaussian noise. The reg:absoluteerror objective minimizes mean absolute error and gives the conditional median. The reg:pseudohubererror objective minimizes a smooth approximation of the Huber loss, which is quadratic near zero and linear in the tails, giving robustness to outliers without sacrificing differentiability. XGBoost also offers reg:gamma and reg:tweedie for strictly positive targets with skewed distributions, and reg:quantileerror (added in 2.0) for quantile regression with a user-supplied quantile parameter. XGBoost adds second-order gradient information through Newton steps, a regularized objective with explicit L1 and L2 penalties on leaf weights, a sparsity-aware split finder that learns a default direction for missing values, weighted quantile sketching for approximate split candidates, cache-aware block storage, and parallel histogram construction.
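The Newton step and the regularized objective can be summarized per leaf. The following sketch follows the notation of the XGBoost paper, keeps only the L2 penalty $\lambda$ and the per-leaf cost $\gamma$, and leaves out the L1 term. The symbols $g_i$, $h_i$, $G_j$, $H_j$, and $I_j$ are introduced here for illustration: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction for row $i$, and $I_j$ is the set of rows that fall in leaf $j$.

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \qquad w_j^{*} = -\frac{G_j}{H_j + \lambda},$$

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma.$$

For squared error written as $\tfrac{1}{2}(y_i - \hat y_i)^2$ we have $g_i = \hat y_i - y_i$ and $h_i = 1$, so the optimal leaf weight is the sum of residuals in the leaf divided by the leaf size plus $\lambda$, that is, a shrunken mean residual. A candidate split with left and right children $L$ and $R$ is kept only if its gain is positive.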
LightGBM, announced by Guolin Ke and colleagues at Microsoft Research at NeurIPS 2017, introduced two techniques that made gradient boosting faster and more memory-efficient on large datasets. Gradient-based One-Side Sampling (GOSS) keeps all samples with large gradients and randomly subsamples the rest, focusing computation on samples that are still hard to fit. Exclusive Feature Bundling (EFB) merges mutually exclusive sparse features into a single bundle, reducing the effective dimensionality of categorical inputs after one-hot encoding. LightGBM also uses histogram-based binning of continuous features and grows trees leaf-wise (best-first) rather than level-wise. For regression LightGBM supports MSE (regression), MAE (regression_l1), Huber (huber), Fair (fair), Poisson (poisson), quantile (quantile), MAPE (mape), gamma (gamma), and Tweedie (tweedie) objectives, among others.
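The sampling step of GOSS can be sketched as follows. This is an illustrative assumption of the procedure described in the LightGBM paper, not LightGBM's own code: the top fraction of rows by absolute gradient is kept, a random fraction of the remainder is sampled, and the sampled rows are up-weighted by $(1 - a) / b$ so that gradient sums stay approximately unbiased. The function name and the fixed seed are hypothetical.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]                 # always keep the largest gradients
    rest = order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient rows to compensate for subsampling
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights

# Toy gradients: row i has gradient 0.1 * i
gradients = [0.1 * i for i in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), weights[0], weights[-1])
```

Only 30 of the 100 rows are used to build the next tree in this example, yet the 20 hardest rows are guaranteed to be among them.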
CatBoost, developed by Yandex and presented by Liudmila Prokhorenkova and colleagues at NeurIPS 2018, focused on principled handling of categorical features and on preventing target leakage during their encoding. It introduced ordered target statistics, an encoding scheme that computes the running mean of the target for each category using only earlier rows in a random permutation. CatBoost also uses oblivious decision trees, where the same split is applied at every node of a given depth, giving compact models and fast inference. For regression it provides RMSE, MAE, MAPE, Quantile, Expectile, Huber, Tweedie, LogCosh, and Poisson objectives, as well as multi-target regression through the MultiRMSE loss.
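Ordered target statistics can be illustrated with a short routine. This is a simplified sketch of the idea in the CatBoost paper, assuming a single permutation, a smoothing prior with weight `prior_weight`, and one categorical column; CatBoost itself uses several permutations and further refinements. All names are hypothetical.

```python
import random

def ordered_target_stats(categories, targets, prior, prior_weight=1.0,
                         permutation=None, seed=0):
    """Encode each row using only the targets of earlier rows in a permutation."""
    n = len(categories)
    if permutation is None:
        permutation = list(range(n))
        random.Random(seed).shuffle(permutation)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in permutation:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        # The row's own target is not used: statistics are updated afterwards
        encoded[idx] = (s + prior_weight * prior) / (k + prior_weight)
        sums[cat] = s + targets[idx]
        counts[cat] = k + 1
    return encoded

# Toy column, encoded with the identity permutation for reproducibility
categories = ["a", "b", "a", "a"]
targets = [10.0, 20.0, 30.0, 50.0]
prior = sum(targets) / len(targets)   # global target mean as the prior
encoded = ordered_target_stats(categories, targets, prior,
                               permutation=[0, 1, 2, 3])
print(encoded)
```

Because each row is encoded before its own target enters the running sums, the encoding cannot leak the label of the row it describes, which is the target-leakage problem the method was designed to prevent.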
HistGradientBoostingRegressor, added to scikit-learn in version 0.21 (2019), is a Python and Cython implementation modeled on LightGBM. It is generally slower than the dedicated libraries but ships with scikit-learn and is therefore the most accessible gradient boosting regressor for many practitioners. It supports MSE, MAE, Poisson, gamma, and quantile losses.
The table below compares the three production gradient boosting libraries on regression-specific features.
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Initial release | 2014 | 2016 | 2017 |
| Paper venue | KDD 2016 | NeurIPS 2017 | NeurIPS 2018 |
| Lead authors | Chen, Guestrin | Ke et al. (Microsoft) | Prokhorenkova et al. (Yandex) |
| MSE objective | reg:squarederror | regression | RMSE |
| MAE objective | reg:absoluteerror | regression_l1 | MAE |
| Huber objective | reg:pseudohubererror | huber | Huber |
| Quantile objective | reg:quantileerror | quantile | Quantile |
| Poisson, Gamma, Tweedie | Yes | Yes | Yes (Tweedie, Poisson) |
| Native categorical handling | Limited (since 1.5) | Yes | Yes (ordered target statistics) |
| Missing value handling | Sparsity-aware split | Default direction | Default direction |
| Multi-target regression | Yes (since 2.0) | No (workaround) | Yes (MultiRMSE) |
| GPU support | Yes | Yes | Yes |
In industry practice XGBoost is the most common default for regression, LightGBM is preferred for very large datasets with sparse features, and CatBoost tends to be the most ergonomic on datasets with many high-cardinality categorical columns and minimal manual feature engineering.
Neural network regressors for tabular data go back to the early 1990s, when single hidden layer multilayer perceptrons were already competing with linear regression on housing and energy datasets. Modern neural tabular models try to do three things at once: handle mixed input types without manual encoding, learn feature interactions automatically, and remain competitive with gradient boosting on small and medium datasets.
TabNet, proposed by Sercan Arik and Tomas Pfister of Google Cloud at AAAI 2021, uses sequential attention to select features at each decision step, mimicking the way a tree splits on one feature at a time while remaining differentiable. The regressor variant replaces the classification head with a linear output and is trained with squared error or any other differentiable loss. TabNet supports an unsupervised pre-training task in which masked features are reconstructed, which can help on small labeled datasets.
FT-Transformer, by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko at NeurIPS 2021, treats both categorical and numerical features uniformly as tokens, passes them through a stack of standard transformer encoder blocks, and reads out a prediction head from a special token. The same paper, Revisiting Deep Learning Models for Tabular Data, also introduced a residual MLP baseline (ResNet for tabular) and reported careful benchmark comparisons with gradient boosting on both classification and regression datasets, finding that no neural model dominated XGBoost across all datasets.
SAINT (Self-Attention and Intersample Attention Transformer), by Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C. Bayan Bruss, and Tom Goldstein in 2021, added a second attention dimension across samples within a minibatch, allowing the model to compare a query row to other rows from the same batch. The method also used contrastive pre-training similar to that of SimCLR. SAINT was reported to outperform XGBoost, CatBoost, and LightGBM on several benchmark regression tasks in the original paper.
NODE (Neural Oblivious Decision Ensembles), by Sergei Popov, Stanislav Morozov, and Artem Babenko at Yandex in 2019, designed a differentiable analog of oblivious decision trees that can be trained end-to-end with gradient descent. A NODE layer is a concatenation of independent trees with their own learnable branching decisions and leaf values, and stacked layers allow each set of ensembles to take input from the previous layer. NODE was reported to outperform gradient boosting on several tabular regression tasks at the time of publication.
Deep and Cross Network (DCN) by Ruoxi Wang and colleagues in 2017 and DCN-V2 by Wang, Shivanna, Cheng, Jain, Lin, Hong, and Chi at WWW 2021 introduced an explicit cross layer that computes feature crosses of bounded degree. DCN-V2 replaced the cross weight vector with a cross weight matrix, which significantly increased the expressiveness of the cross network, and introduced a low-rank decomposition for efficiency. DCN-V2 has been used for both classification and regression in Google production systems and is open-source through TensorFlow Recommenders.
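The difference between the two cross layers is easiest to see in formulas. The following is a sketch in the notation of the DCN papers, where $x_0 \in \mathbb{R}^d$ is the embedded input, $x_l$ is the output of cross layer $l$, and $\odot$ is the elementwise product; the symbols $w_l$, $W_l$, $b_l$, $U_l$, $V_l$, and the rank $r$ are introduced here for illustration.

$$\text{DCN:} \quad x_{l+1} = x_0 \, x_l^{\top} w_l + b_l + x_l, \qquad w_l, b_l \in \mathbb{R}^d,$$

$$\text{DCN-V2:} \quad x_{l+1} = x_0 \odot (W_l x_l + b_l) + x_l, \qquad W_l \in \mathbb{R}^{d \times d},$$

$$\text{low-rank DCN-V2:} \quad W_l \approx U_l V_l^{\top}, \qquad U_l, V_l \in \mathbb{R}^{d \times r}, \; r \ll d.$$

Each layer multiplies by $x_0$ once, so a stack of $l$ cross layers represents feature interactions of degree at most $l + 1$, which is the sense in which the crosses are of bounded degree.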
DeepGBM, by Guolin Ke and colleagues at KDD 2019, integrated gradient boosting decision trees and neural networks through two components: CatNN focuses on sparse categorical features, and GBDT2NN distills knowledge from a trained gradient boosting model into a neural network operating on dense numerical features. The framework was designed for online prediction with continuous data updates and supports both regression and classification.
The table below summarizes the most cited neural tabular regressors.
| Model | Year | Authors | Core idea |
|---|---|---|---|
| DCN | 2017 | Wang et al. | Explicit cross layers of bounded degree |
| NODE | 2019 | Popov, Morozov, Babenko | Differentiable oblivious tree ensemble |
| DeepGBM | 2019 | Ke et al. (Microsoft) | GBDT distillation + neural categorical model |
| TabNet | 2021 | Arik and Pfister (Google) | Sequential attention with sparse feature selection |
| FT-Transformer | 2021 | Gorishniy et al. (Yandex) | Feature tokenizer + standard transformer encoder |
| ResNet for tabular | 2021 | Gorishniy et al. | Residual MLP baseline |
| SAINT | 2021 | Somepalli et al. | Attention across columns and across samples |
| DCN-V2 | 2021 | Wang et al. (Google) | Improved cross network with weight matrix |
Empirical studies have consistently found that careful tuning of an MLP or a residual MLP recovers most of the gap to the more elaborate models. When gradient boosting and neural models receive the same tuning budget, gradient boosting typically wins on small to medium tabular regression datasets.
A newer line of work treats tabular regression the way GPT treats text: train one large model once on a huge variety of synthetic or real tabular tasks, then deploy it as a frozen in-context predictor that takes a labeled support set as input and outputs predictions for a query set without per-dataset gradient updates.
TabPFN (Tabular Prior-data Fitted Network), introduced by Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter at ICLR 2023, was the first widely cited model of this type. The original TabPFN was a transformer trained on millions of synthetic classification tasks generated from a structural causal model prior, with no support for regression. At inference time the entire training set of the target task is concatenated with the query points and fed through the network in a single forward pass; the model performs approximate Bayesian inference in the function space defined by its synthetic prior. The original release was limited to roughly 1,000 training rows, 100 features, and 10 classes, with purely numerical inputs. Within those constraints it matched or beat tuned gradient boosting and AutoML systems while running in under a second on a single GPU. The paper received an Outstanding Paper award at ICLR 2023.
TabPFN v2, by Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Schirrmeister, and Frank Hutter at the University of Freiburg and Prior Labs, was published in Nature in January 2025 under the title Accurate predictions on small data with a tabular foundation model. The second version increased the parameter count, extended the model to handle categorical features and missing values natively, scaled to roughly 10,000 training rows and 500 features, and crucially added regression as a first-class task. For regression TabPFN v2 outputs a piecewise-constant predictive distribution over the target, from which point estimates (mean or median), quantiles, and prediction intervals can be derived. The Nature paper reported that TabPFN v2 matched or surpassed tuned ensembles of XGBoost, CatBoost, and LightGBM on a broad benchmark of small to medium regression datasets, often after only a few seconds of inference and with no per-dataset hyperparameter tuning. Prior Labs released the model weights and a Python package under a non-commercial license through the Prior-Labs/TabPFN-v2-reg repository on Hugging Face.
TabPFN-2.5, released in late 2025, scaled the same approach to 50,000 training samples and 2,000 features and reported an 85 percent win rate against tuned gradient boosting on regression tasks. TabPFN-3, released as a technical report in 2026, scaled further.
TabDPT (Tabular Discriminative Pre-trained Transformer), released by Layer 6 AI in 2024, follows the same in-context learning recipe but is trained on a curated mix of real OpenML datasets rather than purely synthetic data. It targets the same regime of small to medium tabular tasks for both classification and regression.
GReaT (Generation of Realistic Tabular Data), by Vadim Borisov and colleagues at ICLR 2023, used pre-trained language models to generate tabular data rather than to predict targets, but the serialization scheme has been adapted for downstream regression in subsequent work. Related efforts include TabBERT (Padhi et al., IBM, ICASSP 2021) for transactional and event data.
| Model | Year | Authors | Regression support |
|---|---|---|---|
| TabPFN | 2023 | Hollmann et al. (Freiburg) | No (classification only) |
| TabPFN v2 | 2025 | Hollmann et al. (Prior Labs) | Yes (Nature 2025) |
| TabPFN-2.5 | 2025 | Prior Labs | Yes |
| TabDPT | 2024 | Layer 6 AI | Yes |
| GReaT | 2023 | Borisov et al. | Generative, adapted |
| TabBERT | 2021 | Padhi et al. (IBM) | Yes (sequence regression) |
Many applications require more than a point estimate. The output of a regression model can be a full probability distribution, a set of conditional quantiles, or a calibrated prediction interval.
Quantile regression, introduced by Roger Koenker and Gilbert Bassett in their 1978 Econometrica paper Regression Quantiles, estimates the conditional $\tau$-quantile $q_\tau(x)$ of the target distribution rather than the conditional mean. The corresponding loss function is the pinball loss
$$L_\tau(y, \hat y) = \max(\tau (y - \hat y), (\tau - 1)(y - \hat y)),$$
which for $\tau = 0.5$ reduces to half the absolute error, so its minimizer is the conditional median. Quantile regression handles heteroscedastic noise naturally, because the spread between, say, the 10th and 90th percentile predictions can vary with the input. Linear quantile regression is implemented in the statsmodels package in Python and in the quantreg package in R, the latter maintained by Koenker.
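The pinball loss and its connection to quantiles can be checked on a small sample. The sketch below uses only the Python standard library; the helper that searches over candidate constants is an illustrative assumption, not how quantile regression is fit in practice.

```python
def pinball_loss(y, y_hat, tau):
    """max(tau * (y - y_hat), (tau - 1) * (y - y_hat))."""
    diff = y - y_hat
    return max(tau * diff, (tau - 1.0) * diff)

def quantile_by_pinball(targets, tau, candidates):
    """Candidate constant with the lowest total pinball loss."""
    return min(candidates, key=lambda q: sum(pinball_loss(y, q, tau) for y in targets))

targets = [float(i) for i in range(1, 11)]   # 1.0, 2.0, ..., 10.0

# Under-prediction is penalized more than over-prediction when tau > 0.5
under = pinball_loss(10.0, 8.0, tau=0.85)    # prediction too low by 2
over = pinball_loss(8.0, 10.0, tau=0.85)     # prediction too high by 2

q85 = quantile_by_pinball(targets, 0.85, targets)
print(under, over, q85)
```

The asymmetry is what pushes the minimizer away from the center of the distribution: with $\tau = 0.85$ the best constant on this sample is the ninth of ten ordered values.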
Quantile gradient boosting fits the pinball loss with gradient boosted trees. XGBoost's reg:quantileerror, LightGBM's quantile objective, and CatBoost's Quantile loss all support this. A common practice is to train three models for the 10th, 50th, and 90th percentiles and use the span between the 10th and 90th percentile predictions as a nominal 80 percent prediction interval, with the 50th percentile as the point forecast. The resulting intervals adapt to the local spread of the data, so their width varies with the input, but they are not guaranteed to attain their nominal coverage on unseen data without further calibration.
Quantile random forests, introduced by Nicolai Meinshausen in 2006 in his Journal of Machine Learning Research paper, store the full distribution of training targets in each leaf of a random forest and predict quantiles by aggregating the empirical distributions across trees weighted by leaf membership. The method is implemented in the quantregForest R package and the quantile-forest Python package.
NGBoost (Natural Gradient Boosting), introduced by Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Ng, and Alejandro Schuler at ICML 2020, generalizes gradient boosting to probabilistic regression by treating the parameters of a conditional distribution (for example the mean and standard deviation of a Gaussian, or the location and scale of a Laplace) as targets of a multiparameter boosting algorithm. The natural gradient corrects the training dynamics of this multiparameter setup and gives well-calibrated predictive distributions. NGBoost can be paired with any base learner (typically a regression tree), any parametric family with continuous parameters, and any proper scoring rule. The reference implementation is open-source at github.com/stanfordmlgroup/ngboost and the method has been adopted in healthcare, finance, and weather forecasting applications.
DeepAR (Salinas et al., 2020) and DeepFactor models from Amazon Web Services Forecast use recurrent or attention-based neural networks to output the parameters of a parametric distribution for each time step in a forecast, which is structurally similar to NGBoost for time-indexed tabular data.
Gaussian process regression, formalized for machine learning by Carl Rasmussen and Christopher Williams in their 2006 book Gaussian Processes for Machine Learning, models the target as a draw from a Gaussian process indexed by the input. It provides a closed-form posterior predictive distribution and is the gold standard for uncertainty quantification on small tabular datasets, though it scales poorly past tens of thousands of points without approximations.
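The closed-form posterior can be stated compactly. The following sketch assumes a zero-mean prior with kernel $k$ and Gaussian observation noise of variance $\sigma_n^2$; the symbols $K$, $k_*$, and $x_*$ are introduced here for illustration, with $K_{ij} = k(x_i, x_j)$ over the $n$ training inputs and $(k_*)_i = k(x_i, x_*)$ for a query point $x_*$.

$$\mu(x_*) = k_*^{\top} (K + \sigma_n^2 I)^{-1} y,$$

$$\sigma^2(x_*) = k(x_*, x_*) - k_*^{\top} (K + \sigma_n^2 I)^{-1} k_*.$$

Here $\sigma^2(x_*)$ is the variance of the latent function value; the predictive variance of a noisy observation adds $\sigma_n^2$. Both expressions require factorizing the $n \times n$ matrix $K + \sigma_n^2 I$, which costs $O(n^3)$ time and $O(n^2)$ memory, and this is the source of the scaling limit noted above.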
Conformal prediction, introduced by Vladimir Vovk, Alex Gammerman, and Glenn Shafer in the late 1990s and developed in their 2005 book Algorithmic Learning in a Random World, is a framework for producing prediction sets (for regression, prediction intervals) with finite-sample, distribution-free coverage guarantees. Given a target miscoverage rate $\alpha$, the method outputs an interval $C(x)$ that satisfies
$$P(y \in C(x)) \geq 1 - \alpha$$
over the joint distribution of training and test data, assuming only exchangeability. Conformal prediction can wrap any point predictor, including linear regression, random forest, gradient boosting, and neural networks.
Split conformal regression, the most widely used variant, splits the training set into a proper training set used to fit a base regressor and a calibration set used to compute residual quantiles. The interval at a new point is the prediction plus or minus an empirical quantile of the absolute residuals on the calibration set, taken at level $\lceil (n + 1)(1 - \alpha) \rceil / n$ for a calibration set of size $n$ rather than at exactly $1 - \alpha$; this small correction is what yields the finite-sample guarantee. The interval has identical width everywhere, which is a limitation when noise is heteroscedastic.
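The calibration step of split conformal regression fits in a few lines. The sketch below assumes the base regressor has already been fit on the proper training set and is passed in as a plain function `predict`; that function, the toy calibration data, and all names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest score; infinite if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Prediction interval of constant half-width around predict(x_new)."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    half_width = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - half_width, center + half_width

# Hypothetical pre-fitted model and a calibration set of 19 rows
predict = lambda x: 2 * x
calib_x = list(range(1, 20))
calib_y = [2 * x + (x if x % 2 else -x) for x in calib_x]  # residuals of size 1..19

interval = split_conformal_interval(predict, calib_x, calib_y, x_new=5, alpha=0.1)
print(interval)
```

The half-width depends only on the calibration residuals, not on `x_new`, which makes the constant-width limitation visible in the code itself.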
Conformalized quantile regression, by Yaniv Romano, Evan Patterson, and Emmanuel Candès at NeurIPS 2019, combines quantile regression and conformal calibration. The base predictor outputs the $\alpha/2$ and $1 - \alpha/2$ conditional quantiles, and the calibration step adjusts the interval to obtain valid marginal coverage. The resulting intervals adapt to local noise levels and tend to be tighter than split conformal intervals on heteroscedastic data while retaining the coverage guarantee. The reference implementation is open-source at github.com/yromano/cqr.
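The calibration step of conformalized quantile regression differs from the split conformal one only in its conformity score. The sketch below assumes two already-fitted quantile predictors passed in as plain functions `q_lo` and `q_hi`; those functions, the toy calibration data, and all names are hypothetical, and the score follows the definition in the Romano, Patterson, and Candès paper.

```python
import math

def cqr_interval(q_lo, q_hi, calib_x, calib_y, x_new, alpha=0.1):
    """Widen or shrink the quantile band by a calibrated constant."""
    # Score is positive when y falls outside [q_lo(x), q_hi(x)], negative inside
    scores = [max(q_lo(x) - y, y - q_hi(x)) for x, y in zip(calib_x, calib_y)]
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    correction = math.inf if rank > n else sorted(scores)[rank - 1]
    return q_lo(x_new) - correction, q_hi(x_new) + correction

# Hypothetical quantile models whose band widens with x
q_lo = lambda x: 0.5 * x
q_hi = lambda x: 1.5 * x

calib_x = [2.0] * 9
calib_y = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.5, 4.0]

narrow = cqr_interval(q_lo, q_hi, calib_x, calib_y, x_new=2.0, alpha=0.2)
wide = cqr_interval(q_lo, q_hi, calib_x, calib_y, x_new=4.0, alpha=0.2)
print(narrow, wide)
```

The correction is a single constant, but it is added to a band whose width already varies with the input, so the final intervals stay adaptive. The correction can also be negative, in which case calibration tightens a band that was too conservative.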
Other variants include jackknife+ and CV+ (Barber, Candès, Ramdas, Tibshirani, 2021), which exchange computational cost for better small-sample behavior, and online conformal prediction for streaming and distribution-shift settings (Gibbs and Candès, 2021). The MAPIE Python package (Cordier and Renaudie, 2022) implements many of these methods on top of scikit-learn.
AutoML systems automate model selection, hyperparameter tuning, and ensembling. Most major systems support regression as a first-class task.
Auto-sklearn, by Matthias Feurer, Aaron Klein, and colleagues at the University of Freiburg, won the first ChaLearn AutoML Challenge (2015 to 2016) and supports regression through its AutoSklearnRegressor class. It uses Bayesian optimization (SMAC) over the scikit-learn pipeline space, meta-learning from previous datasets to warm-start the search, and post-hoc ensembling of the best models found.
H2O AutoML, part of the open-source H2O platform from H2O.ai, trains gradient boosting machines, XGBoost, random forests, deep learning, and generalized linear models on regression tasks and stacks them with a meta-learner. It is widely used in finance and insurance for risk scoring and pricing.
AutoGluon, released by Amazon Web Services in 2020 (Erickson, Mueller, Shirkov, Zhang, Larroy, Li, and Smola, AutoGluon-Tabular), trains a hand-picked set of strong base models (LightGBM, CatBoost, XGBoost, MLP, random forest, k-NN) with sensible defaults and stacks them in multiple layers. The library has consistently ranked at or near the top of the OpenML AutoML Benchmark on both regression and classification tasks since 2020.
FLAML, by Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu at Microsoft Research in 2021, uses cost-aware hyperparameter search, including the BlendSearch strategy, to find good configurations quickly under a fixed budget and supports regression with several base learners.
TPOT, by Randy Olson and Jason Moore at the University of Pennsylvania in 2016, uses genetic programming to evolve full scikit-learn pipelines and supports regression through TPOTRegressor.
| AutoML system | Year | Lead authors | Approach |
|---|---|---|---|
| Auto-WEKA | 2013 | Thornton et al. | SMAC over WEKA pipelines |
| Auto-sklearn | 2015 | Feurer et al. (Freiburg) | Meta-learning + SMAC + ensemble |
| TPOT | 2016 | Olson and Moore | Genetic programming over pipelines |
| H2O AutoML | 2017 | H2O.ai | Stacked ensembles of GBM, GLM, DL |
| MLJAR | 2018 | Płońska and Płoński | Interpretable AutoML reports |
| AutoGluon | 2020 | Erickson et al. (AWS) | Hand-picked models, multi-layer stacking |
| FLAML | 2021 | Wang et al. (Microsoft) | Cost-aware BlendSearch |
In most published comparisons AutoGluon and H2O AutoML lead the OpenML AutoML Benchmark on regression tasks, with FLAML close behind under tight time budgets.
Benchmarks for tabular regression draw on three main sources.
The UCI Machine Learning Repository, founded by David Aha at UC Irvine in 1987, hosts many of the oldest reference datasets. The Boston Housing dataset, a 506-row regression dataset on median house prices in Boston suburbs, was once the canonical reference but is now considered ethically problematic because it includes a feature called B that is derived from the proportion of Black residents in each town and rests on the assumption that racial composition affects house prices. The dataset was deprecated in scikit-learn 1.0 (2021) and removed in version 1.2 (2022). The maintainers strongly discourage its use except for educational purposes about ethical issues in data science. Common modern reference datasets, most of them hosted at UCI, include California Housing (20,640 rows, 8 features; derived from the 1990 U.S. census by Pace and Barry in 1997 and now widely used in place of Boston Housing), Concrete Compressive Strength (1,030 rows, 8 features; Yeh, 1998), Wine Quality (4,898 rows in the white wine subset, 11 features; Cortez et al., 2009), Energy Efficiency (768 rows, 8 features), Airfoil Self-Noise (1,503 rows, 5 features), and Bike Sharing Demand.
OpenML, a community platform founded by Joaquin Vanschoren in 2014, hosts datasets, task definitions, and machine learning runs in a reproducible format. The OpenML-CTR23 (Curated Tabular Regression) benchmark, defined by Sebastian Fischer, Liana Harutyunyan, Matthias Feurer, and Bernd Bischl in 2023 and presented at the AutoML 2023 workshop, contains 35 regression datasets selected by strict criteria: between 500 and 100,000 observations, fewer than 5,000 features after one-hot encoding, no time dependencies, a documented source, a numeric target with at least five distinct values, and a task that is not trivially solvable by a linear model. The suite evaluates models that span the spectrum from interpretable (ridge, decision tree, GAM) to complex black box (XGBoost, random forest).
The OpenML AutoML Benchmark (Gijsbers et al., 2019 and 2024) includes both classification and regression tasks for AutoML system comparison.
The TabZilla benchmark, presented by Duncan McElfresh and colleagues at NeurIPS 2023, examined 176 classification datasets to map model-by-dataset performance. The accompanying paper, When Do Neural Nets Outperform Boosted Trees on Tabular Data?, found that the answer depends on dataset characteristics, with neural networks closing or reversing the gap on a minority of tasks.
Kaggle hosts both academic-style competitions on cleaned regression datasets and industry challenges with messy production data. Notable regression competitions include House Prices: Advanced Regression Techniques (kaggle.com/c/house-prices-advanced-regression-techniques), Mercedes-Benz Greener Manufacturing (predicting test-bench time for new vehicle configurations), Allstate Claims Severity (insurance claim amounts), Zillow Prize (home value estimates), and Santander Value Prediction. Winning solutions in Kaggle tabular regression competitions remain dominated by ensembles built around XGBoost, LightGBM, and CatBoost.
| Benchmark | Year | Curators | Scope |
|---|---|---|---|
| UCI ML Repository | 1987 | Aha, Bay, others (UC Irvine) | Hundreds of small to medium datasets |
| California Housing | 1997 | Pace and Barry | 20,640 rows, replacement for Boston Housing |
| OpenML AutoML Benchmark | 2019 | Gijsbers et al. | Mixed classification and regression |
| Grinsztajn benchmark | 2022 | Grinsztajn, Oyallon, Varoquaux | 45 medium tabular datasets, tree vs deep |
| OpenML-CTR23 | 2023 | Fischer et al. | 35 curated regression tasks |
| TabZilla | 2023 | McElfresh et al. | 176 datasets, model-by-dataset map |
| AMLB v2 | 2024 | Gijsbers et al. | Expanded AutoML benchmark, 104 datasets |
Two influential 2022 papers crystallized the case that gradient boosted trees still beat deep learning on tabular data when the comparison is fair, on both classification and regression.
Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux published Why do tree-based models still outperform deep learning on typical tabular data? at NeurIPS 2022. The paper built a benchmark of 45 medium-sized tabular datasets including 17 regression tasks, controlling for issues that plague earlier comparisons (uninformative features, class versus regression mix, dataset size). With identical hyperparameter tuning budgets, the authors found that random forest, XGBoost, and LightGBM consistently outperformed FT-Transformer, ResNet, and MLPs on regression as well as on classification. They attributed the gap to three differences in inductive bias: trees are robust to uninformative features, which degrade MLP-like networks; trees are not rotationally invariant, so they exploit the natural axis-aligned basis of tabular data in which each column carries its own meaning, whereas MLPs treat any rotation of the feature space alike; and trees can learn irregular, non-smooth target functions, whereas neural networks are biased toward overly smooth solutions.
Ravid Shwartz-Ziv and Amitai Armon published Tabular Data: Deep Learning Is Not All You Need in Information Fusion in 2022. They reproduced the published results of TabNet, NODE, DNF-Net, and 1D-CNN on the datasets from their original papers and on a held-out set of additional benchmarks, including regression tasks. The headline finding was that the deep models did not generalize beyond their authors' chosen datasets, while a tuned XGBoost ensemble won on most of them. The paper also showed that an ensemble combining XGBoost with one or more neural models was usually the strongest, suggesting complementary inductive biases.
Subsequent work has refined the picture rather than overturning it. Hollmann et al.'s 2025 Nature paper on TabPFN v2 reopened the question for the small-data regime, where in-context learning offers a different kind of advantage than gradient boosting. The picture in 2025 is roughly the following: tuned gradient boosting remains the default winner on medium to large tabular regression tasks; small-data regimes are increasingly contested by TabPFN v2; and ensembles of gradient boosting with at least one neural component often beat either alone in competitions.
The encoding of categorical features is often one of the most consequential feature engineering choices for tabular regression. The most common schemes follow.
One-hot encoding expands a categorical column with $k$ levels into $k$ binary columns. This is the usual choice in scikit-learn pipelines for linear models and neural networks, but it is wasteful for high-cardinality columns.
Ordinal encoding assigns each level an integer code. The codes are arbitrary unless the variable is genuinely ordered, but tree-based learners can still find useful splits on ordinal codes regardless of the encoding order.
Target encoding (also called mean encoding) replaces each level with a smoothed average of the target on the training rows in that level. For regression the encoded value is the mean target per category, often shrunk toward the global mean using James-Stein or m-probability smoothing. Target encoding is powerful for high-cardinality columns but leaks information from the target into features unless the encoding is computed with care. The standard fix is cross-fitted target encoding: split the training data into K folds, compute means from the other K-1 folds, and encode each fold using only out-of-fold statistics. Scikit-learn's TargetEncoder (added in version 1.3) does this automatically during fit_transform.
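The cross-fitting procedure described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function names, the additive smoothing formula, and the interleaved fold assignment (row `i` goes to fold `i % n_folds`) are assumptions made for brevity, and real code would shuffle rows before assigning folds.

```python
from collections import defaultdict


def smoothed_means(categories, targets, smoothing=10.0):
    """Per-category target mean, shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    table = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return table, global_mean


def cross_fit_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each training row using statistics from the other folds only.

    Requires n_folds >= 2 so that every fold has out-of-fold rows.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        table, global_mean = smoothed_means(
            [categories[i] for i in rest], [targets[i] for i in rest], smoothing
        )
        for i in held_out:
            # Levels unseen in the other folds fall back to their global mean.
            encoded[i] = table.get(categories[i], global_mean)
    return encoded


# Hypothetical toy data: a city column and a price target.
cities = ["a", "a", "b", "b", "a", "c", "b", "a", "c", "b"]
prices = [10.0, 12.0, 30.0, 28.0, 11.0, 50.0, 31.0, 9.0, 52.0, 29.0]
encoded = cross_fit_target_encode(cities, prices, n_folds=5, smoothing=1.0)
```

Because each row's encoding is computed without its own fold, changing a row's target leaves that row's encoded value unchanged, which is the property that prevents leakage.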
CatBoost's ordered target statistics, introduced in the original CatBoost paper at NeurIPS 2018, are a more refined solution to target leakage. The library generates several random permutations of the training data; for each row in a permutation, the encoding uses only the rows that precede it. This avoids the leakage of a row's own target into its encoding that ordinary target encoding introduces, and it is especially valuable for regression on high-cardinality categorical columns.
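The idea of encoding each row from its predecessors can be illustrated with a simplified sketch. It uses a single permutation and a generic prior-weighted average; the function name, the `prior` and `prior_weight` parameters, and the seed handling are illustrative assumptions and do not reproduce CatBoost's internal implementation, which combines several permutations.

```python
import random


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row from the rows that precede it in one random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    running_sum, running_count = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s = running_sum.get(c, 0.0)
        k = running_count.get(c, 0)
        # Only earlier rows contribute; the row's own target is added afterwards.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        running_sum[c] = s + targets[i]
        running_count[c] = k + 1
    return encoded


# Hypothetical toy data; the prior is taken to be the global target mean.
levels = ["x", "y", "x", "x", "y"]
ys = [1.0, 5.0, 3.0, 2.0, 7.0]
prior = sum(ys) / len(ys)
enc = ordered_target_statistics(levels, ys, prior=prior)
```

The first row of each level in the permutation receives the prior alone, and later rows receive progressively better-informed estimates, so the encoding of a row never depends on that row's own target.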
Embedding layers are learned during neural network training and map each categorical level to a low-dimensional vector. They were popularized by Cheng Guo and Felix Berkhahn's Entity Embeddings of Categorical Variables in 2016, which showed that a neural network with entity embeddings beat XGBoost on the Rossmann store sales Kaggle regression competition.
Hashing encoders map each level to one of a fixed number of buckets using a hash function, sacrificing some collisions for a bounded representation size.
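A minimal sketch of a hashing encoder follows. The function names and the choice of SHA-256 are assumptions for illustration; a cryptographic digest is used here only because Python's built-in `hash` for strings is salted per process and therefore not reproducible across runs.

```python
import hashlib


def hash_bucket(level, n_buckets=8):
    """Map a categorical level to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(level.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(level, n_buckets=8):
    """One-hot vector over buckets; distinct levels may share a bucket."""
    vec = [0] * n_buckets
    vec[hash_bucket(level, n_buckets)] = 1
    return vec


# Levels never seen during training still map to a valid bucket.
row = hash_encode("store_17", n_buckets=8)
```

The representation size is fixed by `n_buckets` regardless of how many levels appear, which is the trade-off against collisions described above.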
Gradient boosting libraries handle missing values natively. XGBoost's sparsity-aware split finding and LightGBM learn, at each split, a default direction for rows with a missing value, while CatBoost by default treats a missing numeric value as smaller than every observed value (its Min mode). Linear models and neural networks typically require explicit imputation, for which scikit-learn's IterativeImputer, KNNImputer, and SimpleImputer are common choices. The MissForest method by Daniel Stekhoven and Peter Bühlmann at ETH Zürich in 2012 uses random forests to impute missing values and remains a strong baseline.
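For models that need explicit imputation, a simple baseline is median imputation combined with a missing-value indicator, so the model can still use the fact that a value was absent. The sketch below is a plain-Python illustration with hypothetical function names, not the scikit-learn API.

```python
import math
import statistics


def fit_median_imputer(column):
    """Median of the observed (non-NaN) values in a training column."""
    observed = [v for v in column if not math.isnan(v)]
    return statistics.median(observed)


def impute_with_indicator(column, fill_value):
    """Return (filled values, 0/1 missing flags) for one numeric column."""
    filled = [fill_value if math.isnan(v) else v for v in column]
    flags = [1 if math.isnan(v) else 0 for v in column]
    return filled, flags


# Hypothetical training column with two missing entries.
train_col = [3.0, float("nan"), 5.0, 4.0, float("nan")]
median = fit_median_imputer(train_col)
filled, flags = impute_with_indicator(train_col, median)
```

The fill value is estimated on training data only and then reused unchanged on validation and test data, which keeps the imputation step from leaking information across splits.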
Skewed numeric targets and features often benefit from transformation. Common choices include log, square-root, Box-Cox (Box and Cox, 1964), and Yeo-Johnson (Yeo and Johnson, 2000) transforms on the target, and quantile binning of the features. Predicting the log of a strictly positive target and then exponentiating the prediction is a standard trick on house prices, energy demand, and claim amounts, although the back-transformed prediction then sits closer to the conditional median than to the conditional mean.
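The effect of the log trick can be seen with a constant predictor on a small, hypothetical set of right-skewed prices. The numbers below are invented for illustration. Minimizing squared error in log space and exponentiating yields the geometric mean, which is pulled far less by the outlier than the arithmetic mean.

```python
import math

# Hypothetical strictly positive, right-skewed targets (e.g. sale prices).
prices = [120_000.0, 150_000.0, 180_000.0, 210_000.0, 1_500_000.0]

log_prices = [math.log(p) for p in prices]

# The best constant predictor under squared error in log space is the mean log.
log_prediction = sum(log_prices) / len(log_prices)

# Exponentiating maps the prediction back to the original units.
back_transformed = math.exp(log_prediction)  # geometric mean of the prices
arithmetic_mean = sum(prices) / len(prices)  # what squared error on raw prices targets
```

Here `back_transformed` is always at most `arithmetic_mean`, so when the application needs an unbiased estimate of the mean (for example, total claim cost), the back-transformed prediction needs a correction or a different loss.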
Calibration for regression is the property that the predicted distribution or interval matches the empirical distribution of the target. For point predictions it is closely related to the bias of the model. For predictive distributions it is evaluated by reliability diagrams and by the calibration component of the CRPS.
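A first check of interval calibration is to compare empirical coverage on held-out data with the nominal level. The helper below is a minimal sketch with a hypothetical name and invented data.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside their predicted interval."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


# Hypothetical held-out targets and nominal 90% prediction intervals.
y_holdout = [1.0, 2.0, 3.0, 4.0, 5.0]
lower = [0.0, 1.5, 3.5, 3.0, 4.0]
upper = [2.0, 2.5, 4.5, 5.0, 6.0]
coverage = empirical_coverage(y_holdout, lower, upper)
```

In this toy example four of five targets are covered, so intervals advertised at 90% would be somewhat overconfident; with so few points the estimate is of course very noisy.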
Post-hoc recalibration, most commonly isotonic regression fitted on held-out data, can be applied to predicted quantiles to improve their calibration; Platt-style sigmoid scaling, which was designed for classifier scores, is a less natural fit. Conformal prediction provides a stronger guarantee: provided the calibration and test data are exchangeable, the resulting intervals attain at least the prescribed marginal coverage rate in finite samples. Probabilistic gradient boosting frameworks such as NGBoost provide reasonably well-calibrated predictive distributions out of the box because they optimize a proper scoring rule.
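Split conformal prediction, the simplest conformal variant, can be sketched in a few lines. The version below assumes absolute residuals as the nonconformity score and a held-out calibration set that the point model was not trained on; the function name and the data are illustrative.

```python
import math


def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] covers with probability >= 1 - alpha."""
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical calibration targets and point predictions from any regressor.
cal_true = [3.1, 2.4, 5.0, 4.2, 3.3, 6.1, 2.9, 4.8, 5.5]
cal_pred = [3.0, 2.9, 4.6, 4.0, 3.9, 5.2, 3.0, 5.0, 5.4]
q = split_conformal_halfwidth(cal_true, cal_pred, alpha=0.2)

new_prediction = 4.0
interval = (new_prediction - q, new_prediction + q)
```

The interval has the same width for every input, which is why heteroscedastic problems usually call for variants such as conformalized quantile regression that adapt the width to the features.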
For heteroscedastic regression problems, NGBoost, conformalized quantile regression, and quantile gradient boosting paired with conformal calibration are the most reliable approaches. Plain gradient boosting with squared error gives a point estimate without uncertainty and should be combined with one of these methods when intervals are needed.
Most tabular regression work today is done in Python. The stack includes scikit-learn for classical models and pipelines, XGBoost, LightGBM, and CatBoost for gradient boosting, PyTorch and Keras for neural models, and AutoGluon, H2O, and FLAML for AutoML. The pytorch-tabular library by Manu Joseph in 2021 provides a unified PyTorch implementation of TabNet, FT-Transformer, NODE, and other neural tabular models with both regression and classification heads.
For probabilistic regression the main packages are ngboost, pyro and numpyro (for Bayesian regression), gpytorch (for Gaussian processes), and the quantile objectives in XGBoost, LightGBM, and CatBoost. For conformal prediction the MAPIE and nonconformist packages are the most common, with crepes providing additional algorithms.
For categorical encoding, the category_encoders library by Will McGinnis offers more than a dozen schemes including James-Stein, m-estimator, CatBoost-style ordered target statistics, and Helmert, sum, and backward-difference contrasts.
Production deployment increasingly relies on ONNX or Treelite (Hyunsu Cho et al., 2018) to convert tree ensembles to a portable runtime, and on tools such as Hummingbird (Microsoft) to compile tree models into tensor operations that run on GPUs.
In R, the main packages are glmnet (Friedman, Hastie, and Tibshirani) for penalized linear regression, randomForest and ranger (Marvin Wright) for random forests, xgboost and lightgbm for boosting, quantreg (Koenker) for quantile regression, quantregForest (Meinshausen) for quantile random forests, and mlr3 for unified workflows. H2O and tidymodels also support tabular regression pipelines.
Several open problems remain in tabular regression.
Tabular foundation models such as TabPFN v2 were limited to roughly 10,000 rows and 500 features at the time of the Nature 2025 publication. Subsequent releases (TabPFN-2.5 and TabPFN-3) extended the range, but the models still cannot match gradient boosting on very large datasets or on wide tables with more columns than rows.
Transfer learning across tabular regression datasets is largely unsolved. Pre-trained tabular models do not consistently transfer their learned representations to new datasets the way pre-trained language and vision models do. The closest existing analog is in-context learning with TabPFN-style models.
Distribution shift is endemic in tabular regression deployments (pricing, demand, claims) but rarely modeled explicitly. Most benchmarks assume IID train and test splits, while production data often suffers from temporal drift and covariate shift. Time-aware cross-validation, adversarial validation, and online conformal prediction provide partial solutions.
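Time-aware cross-validation can be sketched as an expanding-window splitter, in which every training index precedes every test index. The sketch assumes the rows are already sorted by time; the function name and default fold sizing are illustrative, and scikit-learn's `TimeSeriesSplit` provides a comparable, more complete implementation.

```python
def expanding_window_splits(n_samples, n_splits=3, test_size=None):
    """Yield (train_idx, test_idx) with every training row preceding its test rows."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Twelve time-ordered rows split into three folds of three test rows each.
splits = list(expanding_window_splits(n_samples=12, n_splits=3))
```

Unlike a shuffled K-fold split, this never trains on rows that come after the evaluation window, so the validation error reflects how the model behaves when predicting forward in time.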
Heavy-tailed target distributions, common in financial and insurance applications, are not well served by squared error loss. Tweedie, gamma, and quantile losses, log transforms of the target, and robust losses such as Huber are the usual remedies, but each has its own limitations.
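Two of the losses mentioned above are short enough to write out. The definitions below are the standard per-sample Huber and pinball (quantile) losses; the function names and default parameters are illustrative.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small residuals, linear beyond delta, so outliers weigh less."""
    r = abs(y_true - y_pred)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def pinball_loss(y_true, y_pred, quantile=0.5):
    """Asymmetric absolute loss whose minimizer is the given conditional quantile."""
    diff = y_true - y_pred
    return quantile * diff if diff >= 0 else (quantile - 1.0) * diff


# A residual of 10 costs 50 under half squared error but only 9.5 under Huber.
outlier_cost = huber_loss(0.0, 10.0, delta=1.0)

# For the 0.9 quantile, under-prediction is penalized nine times more than over-prediction.
under = pinball_loss(10.0, 8.0, quantile=0.9)
over = pinball_loss(8.0, 10.0, quantile=0.9)
```

The Huber loss still targets a central tendency, while the pinball loss targets a chosen quantile, so the two address heavy tails in different ways and are not interchangeable.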
Calibration of tree ensembles for point predictions is usually acceptable, but the predictive distributions implied by quantile gradient boosting are not always well-calibrated without further conformal correction. Calibration of deep tabular regressors is generally worse than that of gradient boosting.
Interpretability remains in tension with accuracy. SHAP (Scott Lundberg and Su-In Lee, NeurIPS 2017) and TreeSHAP provide a near-canonical attribution method for tree ensembles, but the same techniques applied to neural tabular regressors give attributions that are noisier and less reproducible across seeds.
Fairness and disparate impact are first-order concerns in many regression applications (housing, lending, insurance). The deprecation of the Boston Housing dataset in scikit-learn 1.0 (2021) and its removal in version 1.2 (2022) were an explicit acknowledgment of the field's history with biased benchmarks. Methods for fair regression include constrained optimization with demographic parity and equalized accuracy constraints, but there is no consensus on which to apply when.