AutoML (Automated Machine Learning)
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,997 words
AutoML (Automated Machine Learning) is the process of automating the end-to-end pipeline of applying machine learning to real-world problems. A typical AutoML system handles data preprocessing, feature engineering, model selection, hyperparameter optimisation, model evaluation, and often ensembling, with little or no manual intervention from a human practitioner.
The term has been in use since at least the early 2010s, when researchers began publishing systems such as Auto-WEKA (2013) and tools for Bayesian optimisation of machine learning algorithms (Snoek, Larochelle and Adams 2012). The field crystallised with the publication of the auto-sklearn paper by Feurer et al. at NeurIPS 2015 and the open-access edited volume by Hutter, Kotthoff and Vanschoren in 2019, the first comprehensive book devoted to the subject. Specialist vendors such as DataRobot had offered commercial AutoML since the mid-2010s; the major cloud providers followed in 2018, when Google launched Cloud AutoML Vision, and AutoML has since become a standard component of major cloud machine-learning platforms.
Applying machine learning to a new dataset is more craft than science. The practitioner must clean and encode data, choose a model family that suits the problem, set dozens of hyperparameters, decide on a validation scheme, and iterate. Expertise of this kind is scarce, expensive, and slow to acquire. Hutter, Kotthoff and Vanschoren framed the motivation in their 2019 book as a desire for "effective machine learning out of the box," so that domain scientists, business analysts and software engineers can apply ML without a dedicated specialist.
There are also methodological reasons. Manual hyperparameter tuning is error-prone and biased toward configurations the practitioner has used before. Bergstra and Bengio showed in 2012 that grid search wastes much of its budget on hyperparameters that barely affect the result, and that random search is more efficient whenever only a few hyperparameters really matter, which is the common case in practice. Even random search is not informed by prior runs, which leaves performance on the table when many similar datasets have been studied before. AutoML treats model selection and tuning as a formal optimisation problem.
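The grid-versus-random argument can be illustrated with a small sketch. It assumes a two-hyperparameter search space on the unit square and a budget of nine trials; the function names are hypothetical, and the point is only to count how many distinct values of the one hyperparameter that matters each strategy gets to try.

```python
import itertools
import random


def distinct_values_on_axis(points, axis):
    """Count how many different values of one hyperparameter were tried."""
    return len({p[axis] for p in points})


def grid_points(values_per_axis):
    """Full Cartesian grid over the given per-axis value lists."""
    return list(itertools.product(*values_per_axis))


def random_points(n_points, n_axes, rng):
    """Sample each hyperparameter independently and uniformly from [0, 1)."""
    return [tuple(rng.random() for _ in range(n_axes)) for _ in range(n_points)]


rng = random.Random(0)
budget = 9  # same number of trials for both strategies

grid = grid_points([[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]])  # 3 x 3 = 9 trials
rand = random_points(budget, n_axes=2, rng=rng)

# Suppose only axis 0 (say, the learning rate) affects the score.
grid_coverage = distinct_values_on_axis(grid, axis=0)    # only 3 distinct values
random_coverage = distinct_values_on_axis(rand, axis=0)  # one new value per trial
```

With the same nine evaluations, the grid probes the important axis at three settings while random sampling probes it at nine, which is the intuition behind the low-effective-dimensionality result.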
A further motivation is reproducibility. A pipeline produced by an AutoML system is specified by the search space, the optimiser, the random seed and the dataset, so two researchers running the same configuration with the same software versions and compute budget should, in principle, obtain the same model.
A modern AutoML system typically combines several stages. The table below summarises the main components, drawing on the taxonomy in He, Zhao and Chu's 2021 survey and Chapter 1 of the Hutter, Kotthoff and Vanschoren book.
| Stage | Typical operations | Example systems |
|---|---|---|
| Data preprocessing | Missing-value imputation, type inference, categorical encoding, scaling, outlier handling | auto-sklearn, AutoGluon |
| Feature engineering | Automated feature construction, polynomial and interaction terms, target encoding, feature selection | TPOT, FeatureTools (Deep Feature Synthesis) |
| Model selection | Choose family from trees, linear models, kernel methods, neural networks, gradient boosting | All AutoML systems |
| Hyperparameter optimisation | Tune per-model hyperparameters such as learning rate, depth, regularisation strength | SMAC, BOHB, Optuna |
| Architecture search | For deep networks, search over layer types, connectivity, width and depth | NASNet, DARTS, Auto-Keras |
| Ensembling and stacking | Combine top-k candidate models with weighted averaging, stacking or bagging | auto-sklearn, AutoGluon, H2O AutoML |
| Calibration and post-processing | Probability calibration, threshold tuning, fairness post-processing | H2O AutoML, FLAML |
Not every system covers every stage. Auto-sklearn and TPOT focus on the tabular pipeline; Auto-Keras and DARTS focus on neural architecture search; AutoGluon spans tabular, text, image and multimodal data with a unified API.
The core algorithmic problem inside AutoML is hyperparameter tuning. Many algorithms have been proposed, and most production systems use a combination of them. The table below lists the most influential.
| Method | Reference | Idea |
|---|---|---|
| Grid search | Long-standing baseline | Exhaustive enumeration of a discretised hyperparameter grid |
| Random search | Bergstra and Bengio 2012 (JMLR) | Sample configurations independently from a prior; provably better than grid for low-effective-dimensionality problems |
| Bayesian optimisation with Gaussian processes | Snoek, Larochelle and Adams 2012 (NeurIPS), Spearmint package | Fit a GP surrogate to past evaluations, choose next point via acquisition function such as Expected Improvement |
| Tree-structured Parzen Estimator (TPE) | Bergstra, Bardenet, Bengio and Kegl 2011 (NeurIPS) | Use density estimators on "good" and "bad" configurations rather than a single regression surrogate |
| SMAC | Hutter, Hoos and Leyton-Brown 2011 (LION) | Random-forest surrogate suited to mixed continuous, discrete and conditional spaces; used by auto-sklearn |
| Successive halving | Karnin, Koren and Somekh 2013 | Allocate small budgets to many configurations, prune the worst, double the budget for survivors |
| Hyperband | Li, Jamieson, DeSalvo, Rostamizadeh and Talwalkar 2017 | Bandit-based wrapper around successive halving with multiple bracket sizes |
| BOHB | Falkner, Klein and Hutter 2018 (ICML) | Combines TPE-style Bayesian optimisation with Hyperband multi-fidelity scheduling |
| Population-based training | Jaderberg et al. 2017 (DeepMind) | Train a population of models in parallel, periodically copy weights of the best workers and perturb their hyperparameters |
| CFO and BlendSearch | Wang, Wu, Weimer and Zhu 2021 (FLAML, MLSys) | Cost-aware search that trades off evaluation cost against improvement |
The shift from grid search through random search to Bayesian and bandit-based approaches reflects a steady increase in sample efficiency. Multi-fidelity methods such as Hyperband and BOHB are the dominant choice when each evaluation is expensive, for example when fitting a deep network on a large dataset.
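Population-based training, listed in the table above, is easiest to follow as a loop of train, exploit and explore. The following is a minimal sketch on a toy problem, not the DeepMind implementation: the loss `theta**2`, the quartile-based selection rule and the 0.8/1.2 perturbation factors are all assumptions chosen for illustration.

```python
import random


def train_step(theta, lr):
    """One gradient-descent step on the toy loss f(theta) = theta**2."""
    return theta - lr * 2.0 * theta


def loss(theta):
    return theta ** 2


def pbt(n_workers=8, n_rounds=10, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    # Each worker holds its own weights (theta) and hyperparameter (lr).
    workers = [
        {"theta": 5.0, "lr": 10 ** rng.uniform(-4, -1)} for _ in range(n_workers)
    ]
    for _ in range(n_rounds):
        # Train every worker for a few steps with its current learning rate.
        for w in workers:
            for _ in range(steps_per_round):
                w["theta"] = train_step(w["theta"], w["lr"])
        # Rank the population, best (lowest loss) first.
        workers.sort(key=lambda w: loss(w["theta"]))
        quarter = max(1, n_workers // 4)
        top, bottom = workers[:quarter], workers[-quarter:]
        for w in bottom:
            parent = rng.choice(top)
            w["theta"] = parent["theta"]                     # exploit: copy weights
            w["lr"] = parent["lr"] * rng.choice([0.8, 1.2])  # explore: perturb
    return min(workers, key=lambda w: loss(w["theta"]))


best = pbt()
```

Because weights are copied along with hyperparameters, the population adapts the learning rate during a single training run instead of restarting training for each new configuration.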
Neural architecture search (NAS) is the sub-area of AutoML concerned with discovering the topology of deep networks. The modern wave began with Zoph and Le's 2017 ICLR paper "Neural Architecture Search with Reinforcement Learning," in which a recurrent controller proposes architectures encoded as variable-length strings, the proposed network is trained on CIFAR-10, and the validation accuracy is fed back to the controller as a reinforcement-learning reward. The original system used about 800 GPUs for several weeks but produced architectures competitive with the best human designs of the time.
Follow-up work attacked the cost. NASNet (Zoph, Vasudevan, Shlens and Le 2018) restricted the search to a small "cell" that is then stacked, allowing search on CIFAR-10 to transfer to ImageNet. ENAS (Pham, Guan, Zoph, Le and Dean 2018, ICML) introduced parameter sharing between candidate architectures, which the authors reported as roughly 1000x cheaper than the original NAS. DARTS (Liu, Simonyan and Yang 2019, ICLR) replaced the discrete search with a continuous relaxation, allowing gradient descent to optimise architecture weights directly. The DARTS paper reported competitive results on CIFAR-10 and Penn Treebank at a search cost on the order of a few GPU-days rather than thousands.
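The continuous relaxation used by DARTS can be sketched in a few lines. This is a simplified illustration rather than the paper's code: the candidate operations are hypothetical scalar functions standing in for convolutions and pooling, and the architecture weights `alphas` are set by hand instead of being learned by gradient descent.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on one edge of the cell (scalars for brevity).
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate operations."""
    weights = softmax([alphas[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretise(alphas):
    """After search, keep only the operation with the largest architecture weight."""
    return max(alphas, key=alphas.get)


alphas = {"identity": 0.0, "double": 0.0, "zero": 0.0}
uniform_output = mixed_op(3.0, alphas)  # equal weights: (3 + 6 + 0) / 3

alphas["double"] = 5.0  # pretend gradient descent raised this weight
chosen = discretise(alphas)
```

Since `mixed_op` is differentiable in `alphas`, the choice of operation becomes a quantity that ordinary gradient-based training can adjust, and a discrete architecture is read off at the end.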
Latency-aware search came next. MnasNet (Tan, Chen, Pang, Vasudevan, Sandler, Howard and Le, CVPR 2019) added measured on-device latency to the reward, so that the search produced architectures suitable for mobile inference. EfficientNet (Tan and Le, ICML 2019) used a NAS-derived baseline (EfficientNet-B0) and a principled compound-scaling rule for width, depth and resolution; the resulting family of models reached state-of-the-art ImageNet accuracy with a fraction of the parameters of earlier networks.
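The compound-scaling rule is simple arithmetic and can be written out directly. The sketch below assumes the base coefficients usually quoted from the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution); these values are not stated in this article and should be checked against the paper before being relied on. The function name is hypothetical.

```python
# Assumed base coefficients (depth, width, resolution), as commonly quoted
# from the EfficientNet paper; not taken from this article.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Scale depth, width and resolution together using one coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}


baseline = compound_scale(0)  # all multipliers equal 1.0
one_step = compound_scale(1)  # FLOPs multiplier is about 1.92
```

Under these coefficients each unit increase in `phi` roughly doubles the compute cost, which is what lets a single knob generate a family of progressively larger models from one searched baseline.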
The table below summarises the main NAS approaches.
| Method | Reference | Search strategy | Notes |
|---|---|---|---|
| NAS with RL | Zoph and Le 2017 | RNN controller trained by REINFORCE | First modern NAS; thousands of GPU-days |
| NASNet | Zoph et al. 2018 | RL on transferable cells | CIFAR-10 search transfers to ImageNet |
| ENAS | Pham et al. 2018 | RL with weight sharing | About 1000x cheaper than NAS |
| DARTS | Liu, Simonyan and Yang 2019 | Differentiable, gradient-based | Continuous relaxation of architecture |
| MnasNet | Tan et al. 2019 | RL with latency objective | Targets mobile inference |
| EfficientNet | Tan and Le 2019 | NAS baseline plus compound scaling | State-of-the-art accuracy / parameter trade-off |
| Auto-Keras | Jin, Song and Hu 2019 | Bayesian optimisation over network morphisms | Integrates with Keras API |
Research systems and commercial products have proliferated since around 2015. The table below compares the most widely used.
| System | First release | Maintainer | Approach | Modalities |
|---|---|---|---|---|
| Auto-WEKA | 2013 | University of British Columbia | Bayesian optimisation (SMAC) over WEKA classifiers | Tabular |
| auto-sklearn | 2015 | University of Freiburg | Meta-learning warm-start, SMAC, ensemble selection | Tabular |
| TPOT | 2016 | University of Pennsylvania | Genetic programming over scikit-learn pipelines | Tabular |
| H2O AutoML | 2017 | H2O.ai | Random search and grid search plus stacked ensembles | Tabular |
| Google Cloud AutoML | 2018 | Google Cloud | Transfer learning and NAS, later folded into Vertex AI | Vision, language, tables, video |
| Auto-Keras | 2018 (paper 2019) | Texas A&M | Bayesian optimisation over network morphisms | Vision, text |
| Azure Automated ML | 2018 | Microsoft | Probabilistic matrix factorisation for warm-start, plus SMAC and ensembles | Tabular, vision, NLP |
| Amazon SageMaker Autopilot | 2019 | AWS | White-box AutoML producing notebooks | Tabular |
| AutoGluon | 2020 | Amazon | Multi-layer stacking, no model search per se | Tabular, text, image, multimodal, time series |
| FLAML | 2021 | Microsoft | Cost-aware search (CFO, BlendSearch) | Tabular |
| LightAutoML | 2021 | Sber AI Lab | Modular pipeline tuned for financial data | Tabular |
| DataRobot | 2014 | DataRobot Inc. | Commercial closed-source platform | Tabular, text, time series |
| MLJAR, PyCaret | 2019 | Independent | Open-source wrappers around scikit-learn and gradient boosting | Tabular |
A few of these warrant a closer look.
Auto-sklearn (Feurer, Klein, Eggensperger, Springenberg, Blum and Hutter, NeurIPS 2015) defines a structured search space of 15 classifiers, 14 feature preprocessing methods and 4 data preprocessing methods, giving 110 hyperparameters in total. The system uses two ideas not present in earlier work. First, it warm-starts SMAC by retrieving promising configurations from datasets that are similar in meta-features to the current one, drawing on a database of past runs. Second, it builds an ensemble out of all configurations evaluated during search, rather than returning only the single best, which markedly reduces variance. Auto-sklearn won the first phase of the ChaLearn AutoML challenge.
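The ensembling step can be illustrated with greedy forward selection with replacement, the general technique usually called ensemble selection. This is a minimal sketch under assumptions, not auto-sklearn's code: the three models, their validation probabilities and the function names are invented for illustration.

```python
def accuracy(probs, labels):
    """Binary accuracy at a 0.5 threshold."""
    preds = [1 if p >= 0.5 else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def average(prob_lists):
    """Element-wise mean of several prediction vectors."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


def ensemble_selection(model_probs, labels, n_rounds=3):
    """Greedily add the model (with replacement) that most improves the ensemble."""
    chosen = []
    best_score = -1.0
    for _ in range(n_rounds):
        best_name, best_score = None, -1.0
        for name, probs in model_probs.items():
            candidate = average([model_probs[c] for c in chosen] + [probs])
            score = accuracy(candidate, labels)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
    # A model picked several times receives a proportionally larger weight.
    weights = {name: chosen.count(name) / len(chosen) for name in set(chosen)}
    return weights, best_score


# Hypothetical validation-set probabilities from three evaluated configurations.
labels = [1, 0, 1, 0, 1, 0]
model_probs = {
    "A": [0.9, 0.1, 0.4, 0.2, 0.8, 0.6],
    "B": [0.6, 0.4, 0.9, 0.7, 0.3, 0.1],
    "C": [0.3, 0.7, 0.7, 0.1, 0.9, 0.2],
}
weights, score = ensemble_selection(model_probs, labels)
```

In this toy case each model alone gets four of six validation examples right, while the selected combination gets all six, which shows why returning an ensemble rather than the single best configuration can reduce variance.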
TPOT (Olson and Moore, ICML 2016 AutoML workshop) treats a pipeline as an expression tree and uses genetic programming to evolve it. Operators include preprocessing transformations and scikit-learn estimators. TPOT exposes the discovered pipeline as Python code, which makes it easy to inspect and edit. On 150 supervised classification tasks, the original benchmark reported significant improvements over a basic scikit-learn baseline on 21 of them.
H2O AutoML (LeDell and Poirier, ICML 2020 AutoML workshop) takes a more pragmatic line. It trains a fixed grid of GLMs, random forests, gradient-boosting machines (including XGBoost), and deep neural networks within a user-specified time budget, then builds two stacked ensembles, one over all models and one restricted to the best of each family. The H2O AutoML algorithm was first released in H2O 3.12.0.1 in June 2017. It has APIs in R, Python, Java and Scala.
AutoGluon (Erickson, Mueller, Shirkov, Zhang, Larroy, Li and Smola 2020) made an explicit choice not to search hyperparameters or architectures. Instead, it trains a fixed set of strong models with sensible defaults and combines them via multi-layer stacking. The 2020 paper reported that AutoGluon-Tabular beat 99% of human teams on two Kaggle competitions after about four hours on raw data, and outperformed auto-sklearn, H2O and TPOT on the OpenML AutoML Benchmark.
FLAML (Wang, Wu, Weimer and Zhu, MLSys 2021) emphasises cost. Its CFO and BlendSearch algorithms model the cost of an evaluation as a function of hyperparameters and choose configurations that maximise expected improvement per unit cost. The library is intentionally lightweight: it ships with a few hundred lines of core search code rather than a heavyweight framework.
Google Cloud AutoML launched on 17 January 2018 with Cloud AutoML Vision, a no-code service that fine-tunes a pretrained image classifier on user-uploaded data using transfer learning and architecture search. Google later added Natural Language, Translation, Tables and Video, and folded all of them into Vertex AI in 2021.
A handful of ideas appear repeatedly across AutoML systems.
Meta-learning and warm-starting. When a new dataset arrives, an AutoML system can use prior knowledge from previous runs on similar datasets to bias the initial configurations of the optimiser. Auto-sklearn does this by computing meta-features (number of samples, number of classes, skewness, kurtosis and so on) and retrieving the 25 most similar OpenML datasets, then seeding SMAC with their best-known configurations. Vanschoren's chapter in the 2019 book gives a thorough treatment.
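The warm-start idea amounts to a nearest-neighbour lookup in meta-feature space. The sketch below is illustrative only: the meta-database, the three meta-features, the log scaling and the L1 distance are assumptions, and a real system would use many more meta-features and k = 25 rather than k = 2.

```python
import math

# Hypothetical meta-database: meta-features and best-known configuration
# for datasets seen in earlier runs.
META_DB = [
    {"name": "small_binary",
     "meta": {"n_samples": 1_000, "n_classes": 2, "skewness": 0.1},
     "best_config": {"model": "logistic_regression", "C": 1.0}},
    {"name": "medium_multiclass",
     "meta": {"n_samples": 50_000, "n_classes": 10, "skewness": 1.5},
     "best_config": {"model": "gradient_boosting", "max_depth": 6}},
    {"name": "large_binary",
     "meta": {"n_samples": 1_000_000, "n_classes": 2, "skewness": 0.3},
     "best_config": {"model": "sgd_linear", "alpha": 1e-5}},
]


def meta_vector(meta):
    """Turn meta-features into a numeric vector; sample count is log-scaled."""
    return [math.log10(meta["n_samples"]), float(meta["n_classes"]), meta["skewness"]]


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def warm_start_configs(new_meta, db, k=2):
    """Return the best-known configurations of the k most similar past datasets."""
    target = meta_vector(new_meta)
    ranked = sorted(db, key=lambda e: l1_distance(target, meta_vector(e["meta"])))
    return [entry["best_config"] for entry in ranked[:k]]


new_dataset = {"n_samples": 2_000, "n_classes": 2, "skewness": 0.2}
initial_configs = warm_start_configs(new_dataset, META_DB)
```

The returned configurations would be evaluated first, before the optimiser starts proposing its own, so the search begins from settings that worked on comparable data.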
Multi-fidelity optimisation. Evaluating a configuration on the full dataset for the full number of epochs is expensive. Multi-fidelity methods cheat by evaluating on a subset of the data, for fewer epochs, or with a smaller model, and use these cheap proxies to filter unpromising configurations before committing real budget. Hyperband and BOHB are the canonical examples.
Surrogate models. A surrogate is a cheap-to-evaluate proxy for the expensive objective (validation accuracy after full training). Gaussian processes (Spearmint), random forests (SMAC), TPE density estimators, and neural-network surrogates have all been used. The acquisition function on top of the surrogate decides where to evaluate next.
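As a concrete instance of an acquisition function, Expected Improvement can be computed in closed form when the surrogate returns a Gaussian prediction. The sketch below assumes maximisation of validation accuracy; the three candidate predictions are made-up numbers standing in for the output of a fitted surrogate.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a candidate beats the incumbent (maximisation)."""
    if std <= 0.0:
        return max(0.0, mean - best_so_far)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical surrogate predictions: (predicted accuracy, predictive std).
candidates = {
    "A": (0.90, 0.001),  # as good as the incumbent, almost no uncertainty
    "B": (0.89, 0.05),   # slightly worse on average, but very uncertain
    "C": (0.80, 0.01),   # clearly worse and fairly certain
}
best_so_far = 0.90

scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_config = max(scores, key=scores.get)
```

Candidate B is selected even though its predicted mean is below the incumbent, because its large uncertainty leaves real room for improvement; this is how the acquisition function balances exploitation against exploration.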
Bandits and successive halving. Successive halving treats configurations as arms of a multi-armed bandit. Run all of them for a small budget, keep the top half, double the budget, repeat. Hyperband runs successive halving with several bracket sizes to hedge against picking a wrong starting budget.
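The halving loop described above can be sketched as follows. The evaluation function is a hypothetical stand-in for training a configuration for `budget` epochs and returning validation accuracy; it is deterministic here, whereas real low-budget evaluations are noisy, which is exactly why the choice of starting budget matters.

```python
import random


def successive_halving(configs, evaluate, min_budget=1):
    """Keep the top half of the configurations each round and double the budget."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (number of configurations evaluated, budget per configuration)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((len(survivors), budget))
        survivors = ranked[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0], history


def toy_evaluate(config, budget):
    """Hypothetical learning curve: accuracy approaches config['quality']."""
    return config["quality"] * (1.0 - 0.5 ** budget)


rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(8)]
best, history = successive_halving(configs, toy_evaluate)
```

Starting from eight configurations, the schedule evaluates 8, 4 and 2 configurations at budgets 1, 2 and 4, so most of the compute is spent on the few candidates that survive the early rounds.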
Time and compute budgeting. Real users care about wall-clock time. AutoML systems usually expose a budget parameter and try to make the best use of it. FLAML's headline feature is that it can do useful work in seconds rather than hours.
For tabular data with reasonably clean inputs, modern AutoML systems frequently match or beat hand-tuned pipelines built by experienced data scientists, especially within fixed time budgets. The OpenML AutoML Benchmark by Gijsbers, LeDell, Thomas, Poirier, Bischl and Vanschoren (2019, updated as a JMLR paper in 2024) shows tight competition between auto-sklearn, H2O AutoML, TPOT, AutoGluon and FLAML on tabular tasks, with AutoGluon and H2O often near the top.
The key advantages are a lower barrier to entry, since a non-expert can produce a reasonable model with a few lines of code; strong baselines that set a high bar for any hand-crafted alternative; reproducibility, because pipelines and seeds are recorded; and broader coverage of the search space, since optimisers explore configurations a human would not think to try.
AutoML is not a silver bullet, and several limitations are well documented.
It is compute-intensive. A long auto-sklearn run can consume tens of CPU-hours; a NAS run can consume thousands of GPU-hours. The original Zoph and Le NAS paper used the equivalent of around 22,400 GPU-days (about 800 GPUs for four weeks). Even after the cost reductions delivered by ENAS and DARTS, neural architecture search remains more expensive than training a single off-the-shelf network.
The resulting pipeline is mostly opaque. A stacked ensemble of 30 gradient-boosting machines and random forests is not easy to explain to a regulator or domain expert. AutoGluon and similar systems trade interpretability for accuracy, which is acceptable in many applications and unacceptable in others.
Results can be brittle on novel domains. AutoML systems are tuned on benchmark distributions and may underperform on data whose structure was not anticipated, such as sparse high-cardinality categorical data, time series with irregular sampling, or scientific data with strong physical constraints.
Long searches risk overfitting to the validation set. If 10,000 configurations are evaluated against the same validation split, the best one is partly selected for noise. Cross-validation and ensemble averaging mitigate this but do not eliminate the multiple-comparisons problem. On high-stakes problems where human expertise is plentiful, manual tuning by a skilled team often still wins. AutoML is most valuable where expertise is the bottleneck.
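The selection effect can be made concrete with a small simulation. It assumes every configuration has exactly the same true accuracy, so any difference in validation score is pure noise; the sizes (2,000 configurations, 500 validation examples) are arbitrary choices to keep the run short.

```python
import random


def best_of_n_validation_accuracy(n_configs, n_val, true_acc, rng):
    """Best observed validation accuracy among equally good configurations."""
    best = 0.0
    for _ in range(n_configs):
        # Each validation example is classified correctly with probability true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_val))
        best = max(best, correct / n_val)
    return best


rng = random.Random(0)
TRUE_ACC = 0.80

single_run = best_of_n_validation_accuracy(1, 500, TRUE_ACC, rng)
best_of_many = best_of_n_validation_accuracy(2000, 500, TRUE_ACC, rng)

# How much the winning validation score overstates the true accuracy.
optimism = best_of_many - TRUE_ACC
```

The winning configuration's validation score sits several points above 0.80 even though no configuration is actually better than any other, so the reported score of a long search should be confirmed on a held-out test set that played no part in selection.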
The OpenML AutoML Benchmark, introduced by Gijsbers et al. at the ICML 2019 AutoML workshop, is the standard evaluation suite for tabular AutoML. It defines a curated list of binary and multiclass classification tasks (and later regression tasks) drawn from OpenML, fixes time budgets (typically one hour and four hours), and runs each system in a Docker container with controlled resources. The 2024 JMLR update by Gijsbers et al. extended the benchmark to 71 classification and 33 regression tasks and compared nine frameworks. Results are publicly available and updated regularly.
For neural architecture search the canonical benchmarks are CIFAR-10 and ImageNet image classification, with NAS-discovered architectures (NASNet, MnasNet, EfficientNet) achieving state-of-the-art accuracy in their respective compute classes. NAS-Bench-101 (Ying et al. 2019) and NAS-Bench-201 (Dong and Yang 2020) provide tabulated training results for many architectures, allowing fast and reproducible comparisons of NAS algorithms without re-running expensive training.
AutoML has been deployed in many industries. Documented use cases include drug discovery and molecular property prediction, genomics and clinical risk prediction, time-series forecasting in retail and supply chain (AutoGluon-TimeSeries and similar tools produce probabilistic forecasts at scale), recommendation systems with cold-start models on new product categories, manufacturing quality control using vision AutoML services to detect defects, and standard business analytics such as sales forecasting, customer churn modelling and marketing-mix attribution. At Sber, LightAutoML was used in production for credit-risk and customer-analytics tasks.
A practitioner today can build an AutoML stack entirely from open-source components. Common building blocks include auto-sklearn for tabular classification and regression with meta-learning warm-start; TPOT for genetic programming over pipelines; AutoGluon for tabular, multimodal and time-series tasks; FLAML for fast and lightweight HPO; Optuna (Akiba, Sano, Yanase, Ohta and Koyama, KDD 2019), a define-by-run hyperparameter optimisation framework used as a backend by many other systems; Ray Tune for distributed hyperparameter search with Hyperband, BOHB and PBT; Hyperopt, the original TPE implementation by James Bergstra; SMAC3 (Lindauer et al. 2022), the modern Python SMAC implementation; and NNI (Microsoft Neural Network Intelligence), a toolkit covering HPO, NAS, model compression and feature engineering.
Alongside the open-source projects, every major cloud has its own AutoML product. The table below summarises them.
| Vendor | Product | Modality | Launched | Notes |
|---|---|---|---|---|
| Google Cloud | Cloud AutoML, then Vertex AI AutoML | Vision, NLP, Tables, Video | 2018 | First major no-code cloud AutoML |
| Microsoft Azure | Azure Automated ML | Tabular, vision, NLP | 2018 | Integrated with Azure ML Studio |
| AWS | SageMaker Autopilot | Tabular | 2019 | White-box approach: returns generated notebooks |
| AWS | Amazon SageMaker Canvas | Tabular | 2021 | Business-analyst-oriented no-code interface |
| DataRobot | DataRobot AI Cloud | Tabular, text, time series | 2014 | Pioneering enterprise AutoML vendor |
| H2O.ai | Driverless AI | Tabular, time series, NLP | 2017 | Commercial counterpart to H2O AutoML |
| dotData | dotData Enterprise | Tabular with auto-feature-engineering | 2018 | Spin-off from NEC research |
AutoML overlaps with MLOps. MLOps is concerned with the full lifecycle of model deployment, monitoring and retraining; AutoML focuses on the model-building stage. Many MLOps platforms include an AutoML component to bootstrap models that are then handed to deployment pipelines.
The rise of foundation models has shifted the centre of gravity. Where AutoML once asked "which small model is best for this dataset?", the question for foundation-model users is closer to "which adapter, learning rate and data mixture should I use to fine-tune this pretrained model?". Research on AutoML for foundation models, including AutoLoRA (Zhang et al. 2023) and automated parameter-efficient fine-tuning, is active, and tools such as FLAML and Optuna are increasingly used to tune LoRA ranks, prompt templates and retrieval parameters. NAS has continued to evolve in parallel, with active work on transformer-based language models, graph neural networks (Graph NAS, Gao et al. 2020) and multimodal architectures.
Several open problems are likely to shape AutoML in the coming years. Automated data engineering and labelling is one: many real-world failures are caused by data issues rather than modelling issues, and tools for automated data quality checks, labelling and weak supervision are an active research area. AutoML for foundation models is another, because the combinatorial space of base model, fine-tuning regime, retrieval component and prompt is much larger than in classical model selection. AutoML for graph neural networks and other non-tabular structures is less mature than its tabular and image counterparts. Multimodal AutoML, building on systems like AutoGluon-Multimodal, has room for better fusion architectures and automated cross-modal preprocessing. Compute-efficient NAS, using once-for-all networks, supernet training and zero-cost proxies, aims to bring architecture search closer to the cost of a single training run. Finally, better evaluation methodology, going beyond the OpenML AutoML Benchmark and the NAS-Bench series, is needed because real-world deployment performance is still hard to predict from benchmark scores.