# Bayesian Optimization

> Source: https://aiwiki.ai/wiki/bayesian_optimization
> Updated: 2026-06-21
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

See also: [Machine learning](/wiki/machine_learning) terms, [Hyperparameter tuning](/wiki/hyperparameter_tuning), [AutoML](/wiki/automl)

Bayesian optimization is a sequential, model-based strategy for finding the global optimum of expensive black-box functions in as few evaluations as possible. It builds a probabilistic surrogate model (most often a Gaussian process) of an unknown objective, then uses an acquisition function to decide where to evaluate next, balancing exploration of uncertain regions against exploitation of promising ones. Peter Frazier, in the widely cited tutorial that the field treats as a standard reference, defines it directly: "Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate," and notes it is "best-suited for optimization over continuous domains of less than 20 dimensions."[10] Because each evaluation can cost hours of compute or a real-world experiment, Bayesian optimization is the dominant approach for hyperparameter tuning, neural architecture search, drug discovery, and materials design, typically reaching good solutions in tens to low hundreds of evaluations where [grid search](/wiki/hyperparameter_tuning) or random search would need thousands.

The method's signature result in machine learning came from Jasper Snoek, Hugo Larochelle, and Ryan P. Adams at NeurIPS 2012, who showed that automating this tuning could "reach or surpass human expert-level optimization on a diverse set of contemporary algorithms."[4] In empirical comparisons it routinely matches or beats grid search while requiring 10x to 100x fewer evaluations.[4]

## What is Bayesian optimization?

Bayesian optimization is a sequential strategy for the global optimization of expensive black-box functions. Unlike gradient-based methods such as [gradient descent](/wiki/gradient_descent), it does not require access to derivatives of the objective function and makes no assumptions about its convexity. Instead, Bayesian optimization builds a probabilistic surrogate model of the objective, uses that model to decide where to evaluate next, and updates the model with each new observation. This makes it especially well suited to problems where each function evaluation is costly in time, money, or computational resources.

The core idea draws on Bayesian inference: a prior belief about the objective function is combined with observed data to form a posterior distribution, which in turn guides the search. Because the method reasons about uncertainty, it can balance exploration (sampling in regions of high uncertainty) with exploitation (sampling in regions predicted to perform well), converging on good solutions with far fewer evaluations than brute-force alternatives.

Bayesian optimization has been widely adopted across machine learning, engineering design, drug discovery, materials science, and many other fields where evaluations are expensive and the search space may be high-dimensional, mixed-type, or constrained.

## Who invented Bayesian optimization?

The intellectual roots of Bayesian optimization extend to the early 1960s. In 1964, Harold J. Kushner published "A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise," which used a Wiener process (Brownian motion) as a stochastic model for an unknown one-dimensional function and proposed maximizing the probability of improvement as a decision criterion.[1] Kushner's formulation already contained the essential insight of using a probabilistic model to guide sequential decisions, including a parameter controlling the trade-off between more global and more local search.[1]

The foundations of Bayesian optimization as a general methodology, however, trace back to the work of Jonas Mockus and colleagues in the 1970s and 1980s. Mockus developed methods for multidimensional optimization using Gaussian random fields and, in a landmark 1978 paper, introduced the concept of expected improvement as an acquisition criterion.[2] Because of the influence of this principle, Mockus is widely regarded as the founder of Bayesian optimization, and the term "Bayesian optimization" is generally attributed to his series of publications on global optimization during this period.[2]

The technique gained renewed attention in the engineering community through the 1998 paper by Donald R. Jones, Matthias Schonlau, and William J. Welch, which introduced the Efficient Global Optimization (EGO) algorithm.[3] EGO combined Gaussian process regression with the expected improvement acquisition function and provided a principled stopping rule. The paper, published in the Journal of Global Optimization, demonstrated that the method could optimize expensive black-box functions with remarkably few evaluations compared to traditional design-of-experiments approaches.[3]

Bayesian optimization became widely known in the machine learning community after Jasper Snoek, Hugo Larochelle, and Ryan P. Adams published "Practical Bayesian Optimization of Machine Learning Algorithms" at NeurIPS 2012.[4] The authors framed the problem starkly, observing that hyperparameter tuning "is often a 'black art' that requires expert experience, unwritten rules of thumb, or sometimes brute-force search," and showed that Bayesian optimization could instead "reach or surpass human expert-level optimization on a diverse set of contemporary algorithms."[4] Their work automatically tuned [hyperparameters](/wiki/hyperparameter) of machine learning models (including [convolutional neural networks](/wiki/convolutional_neural_network), latent Dirichlet allocation, and structured SVMs) to match or surpass configurations found by human experts.[4] The accompanying software package, Spearmint, became one of the first widely used Bayesian optimization tools in the ML community. Snoek et al. also advocated for a fully Bayesian treatment of Gaussian process kernel hyperparameters (marginalizing rather than optimizing them) and introduced practical algorithms for parallel evaluation and variable-cost experiments.[4]

Since then, the field has grown rapidly. New surrogate models, acquisition functions, and software libraries have been developed, and the technique has been applied to problems ranging from [neural architecture search](/wiki/neural_architecture_search) to chemical synthesis optimization.

## How does Bayesian optimization work?

The Bayesian optimization loop consists of the following steps:

1. **Initialize:** Evaluate the objective function at a small number of initial points, often chosen by Latin hypercube sampling or random sampling.
2. **Fit the surrogate model:** Build or update a probabilistic model (for example, a Gaussian process) using all observed data points.
3. **Optimize the acquisition function:** Use the surrogate model's predictions and uncertainty estimates to compute an acquisition function, then find the point in the search space that maximizes this acquisition function.
4. **Evaluate:** Run the true (expensive) objective function at the selected point.
5. **Update:** Add the new observation to the dataset and return to step 2.
6. **Terminate:** Stop when a budget of evaluations is exhausted, a convergence criterion is met, or a satisfactory solution has been found.

Because each iteration uses all previously collected data to inform the next evaluation, Bayesian optimization is highly sample-efficient.[10] The method typically finds good solutions in tens to hundreds of evaluations, even for objective functions that would require thousands of evaluations under grid or random search.

## The exploration-exploitation tradeoff

A central concept in Bayesian optimization is the tension between exploration and exploitation. Exploitation means sampling where the surrogate model predicts a high objective value, focusing the search on regions that appear promising based on current knowledge. Exploration means sampling in regions where the surrogate model is uncertain, gathering information that could reveal previously unknown good regions.

Pure exploitation risks converging prematurely to a local optimum because the surrogate model may be inaccurate in unexplored regions. Pure exploration wastes evaluations by sampling in regions that are unlikely to contain the optimum. Effective Bayesian optimization requires balancing these two goals, and the acquisition function is the mechanism that encodes this balance.[10]

Different acquisition functions handle the tradeoff differently. Expected Improvement (EI) implicitly balances the two by being large both where the predicted mean is favorable and where uncertainty is high. The Upper Confidence Bound (UCB) provides explicit control through a tunable parameter, kappa, that weights the contribution of the uncertainty term. Probability of Improvement (PI) tends to favor exploitation, which is why it often converges prematurely on its own. Information-theoretic methods like Entropy Search approach the tradeoff by prioritizing evaluations that reduce global uncertainty about the location of the optimum.

The optimal balance depends on the evaluation budget. With a small budget, moderate exploration is important because the surrogate model is highly uncertain. As the budget grows and the model becomes more accurate, the optimizer should shift toward exploitation to refine the best-known solution.

## Surrogate models

The surrogate model (also called the response surface model) is the probabilistic approximation of the true objective function. It must provide both a prediction (mean estimate) and a measure of uncertainty (variance or confidence interval) at any point in the search space. Several types of surrogate models are used in practice.

### Gaussian processes

Gaussian processes (GPs) are the most common surrogate model in Bayesian optimization. A GP is a non-parametric model that defines a distribution over functions. It is fully specified by a mean function and a covariance (kernel) function. Given a set of observations, the GP posterior provides a closed-form predictive distribution at any new input point, yielding both a mean prediction and a variance.

GPs are attractive because they naturally quantify uncertainty, which is essential for acquisition functions. They also provide exact inference in the case of Gaussian noise. However, standard GP inference has O(n^3) computational complexity in the number of observations, which can become a bottleneck when the number of evaluations grows large (typically beyond a few thousand). Various sparse GP approximations and scalable GP methods have been developed to address this limitation.

The choice of kernel function (such as the squared exponential, Matern, or rational quadratic kernel) encodes assumptions about the smoothness and structure of the objective function. Kernel hyperparameters are typically learned from data via maximum likelihood estimation or, as Snoek et al. (2012) recommended, by marginalizing over them with Markov chain Monte Carlo (MCMC) for a fully Bayesian treatment.[4]

### Random forests (SMAC)

The SMAC (Sequential Model-based Algorithm Configuration) framework, introduced by Hutter, Hoos, and Leyton-Brown in 2011, uses [random forests](/wiki/random_forest) as the surrogate model instead of Gaussian processes.[7] In this approach, the mean prediction is the average of the individual trees' predictions, and uncertainty is estimated from the variance across trees.

Random forest surrogates offer several advantages over GPs for certain problems. They handle categorical and conditional hyperparameters naturally without special encoding. They also scale better to high-dimensional spaces and to large numbers of observations, since random forest training is O(n log n) per tree rather than the O(n^3) required by exact GP inference. On the other hand, random forests do not provide the same smooth, well-calibrated uncertainty estimates as GPs, and their predictions can be less accurate in low-data regimes.

SMAC has been particularly successful in the automated algorithm configuration community, where it is used to tune parameters of combinatorial solvers, machine learning pipelines, and other complex algorithms.

### Tree-structured Parzen Estimator (TPE)

The Tree-structured Parzen Estimator (TPE), introduced by James Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl at NeurIPS 2011, takes a fundamentally different approach from GP-based methods.[5] Rather than modeling the objective function directly (p(y | x)), TPE models the inverse: it builds two density estimators over the input space, one for observations that produced "good" results (below a threshold y*) and one for "bad" results (above y*).[5]

Formally, TPE models:

- l(x) = p(x | y < y*), the density of hyperparameter configurations among the best-performing evaluations
- g(x) = p(x | y >= y*), the density of configurations among the worse-performing evaluations

The acquisition strategy then selects points that maximize the ratio l(x) / g(x), which can be shown to be proportional to the expected improvement criterion.[5]

TPE naturally handles tree-structured (conditional) search spaces, where the value of one hyperparameter determines whether another hyperparameter is relevant. This makes it especially well suited for complex machine learning pipelines. TPE is the default optimizer in the Hyperopt library and is also available in Optuna.[12] It scales well to high-dimensional spaces where GPs struggle and can handle thousands of observations efficiently.

The name "Tree-structured" refers to two aspects: the use of Parzen (kernel density) estimation to model the distributions, and the tree-like posterior-inference graph structure used to handle conditional parameter spaces efficiently.

## Acquisition functions

Acquisition functions use the surrogate model's predictions and uncertainty estimates to assign a utility value to each candidate point in the search space. By maximizing the acquisition function, the optimizer selects the next point to evaluate. Different acquisition functions encode different trade-offs between exploration and exploitation.

### Expected Improvement (EI)

Expected Improvement is the most widely used acquisition function in Bayesian optimization. It measures the expected amount by which a candidate point will improve over the current best observed value. Given a surrogate model with predictive mean mu(x) and predictive standard deviation sigma(x), and the current best observation f*, the expected improvement is:

EI(x) = E[max(f* - f(x), 0)]

For a Gaussian predictive distribution, EI has a convenient closed-form expression involving the standard normal probability density and cumulative distribution functions. EI naturally balances exploration and exploitation: it is large both where the predicted value is better than the current best (exploitation) and where uncertainty is high (exploration).

EI was first introduced by Mockus (1978) and later formalized in the context of Bayesian optimization by Jones et al. (1998); it remains the default choice in many implementations.[2][3]

### Probability of Improvement (PI)

Probability of Improvement (also called Maximum Probability of Improvement, or MPI) selects the point that has the highest probability of producing a value better than the current best:

PI(x) = P(f(x) < f*)

PI is simple and computationally cheap but tends to be overly exploitative. It favors points that are very likely to produce a small improvement over the current best, even if the expected magnitude of improvement is tiny. For this reason, PI often converges prematurely to local optima and is less commonly used as a standalone acquisition function in modern implementations. Kushner (1964) was the first to propose this criterion, making it the oldest acquisition function in the Bayesian optimization literature.[1]

### Upper Confidence Bound (UCB)

The Upper Confidence Bound (also known as GP-UCB in the Gaussian process context) selects the point that maximizes:

UCB(x) = mu(x) + kappa * sigma(x)

where kappa is a tunable parameter that controls the exploration-exploitation trade-off. A larger kappa encourages more exploration by placing greater weight on uncertainty, while a smaller kappa encourages exploitation by focusing on the predictive mean.

UCB has strong theoretical guarantees: Srinivas et al. (2010) proved sublinear cumulative regret bounds for GP-UCB under certain assumptions about the kernel and the function class, establishing a novel connection between GP optimization and experimental design through the concept of maximum information gain.[8] UCB is intuitive and easy to implement. Its main drawback is that the performance depends on the choice of kappa, which may require problem-specific tuning.

### Knowledge Gradient (KG)

The Knowledge Gradient, developed by Peter Frazier, Warren Powell, and Savas Dayanik (2009), selects the point whose evaluation would produce the largest expected improvement in the quality of the overall decision.[9] Unlike EI, which considers only the improvement at the sampled point, KG considers how a new observation would change the surrogate model and thereby improve the recommendation across the entire search space.

Formally, the knowledge gradient at a point x is the expected reduction in the posterior minimum after observing f(x). KG is a one-step Bayes-optimal policy, meaning it makes the best possible decision if only one more evaluation remains. In practice, KG often outperforms EI when the goal is to make the best final recommendation rather than to find the global optimum as quickly as possible. However, KG is more computationally expensive to evaluate than EI or UCB because it requires solving an inner optimization problem.

The parallel knowledge gradient extends this idea to batch settings, where multiple points are evaluated simultaneously. BoTorch provides an efficient "one-shot" implementation of KG that leverages auto-differentiation for optimization.[11]

### Entropy Search and Predictive Entropy Search

Entropy Search (Hennig and Schuler, 2012) and Predictive Entropy Search (Hernandez-Lobato et al., 2014) take an information-theoretic approach.[16] They select the point whose evaluation would most reduce the entropy (uncertainty) of the posterior distribution over the location of the global optimum. These methods can be highly effective but are computationally demanding, as they require approximating intractable integrals over the posterior.

### Comparison of acquisition functions

| Acquisition Function | Principle | Strengths | Weaknesses |
|---|---|---|---|
| Expected Improvement (EI) | Expected amount of improvement over current best | Closed-form for GPs; good exploration-exploitation balance | Can over-exploit in high noise settings |
| Probability of Improvement (PI) | Probability of beating the current best | Simple, computationally cheap | Tends to be overly exploitative; premature convergence |
| Upper Confidence Bound (UCB) | Optimistic estimate using mean plus scaled uncertainty | Theoretical regret bounds; intuitive tuning via kappa | Performance depends on kappa parameter choice |
| Knowledge Gradient (KG) | Expected improvement in final recommendation quality | One-step Bayes-optimal; strong in noisy settings | Computationally expensive; harder to implement |
| Entropy Search | Reduction in entropy of the optimum location | Information-theoretically principled | Computationally very expensive; approximation required |

## Theoretical foundations and convergence

Bayesian optimization has a growing body of theoretical results that provide guarantees on its convergence behavior. Two primary notions of performance are used in this literature.

**Simple regret** measures the quality of the best solution found so far. After T evaluations, the simple regret is r_T = f(x*) - f(x_best), where x* is the true global optimum and x_best is the best point evaluated. An algorithm with vanishing simple regret will eventually identify the global optimum.

**Cumulative regret** measures the total suboptimality accumulated across all evaluations: R_T = sum of (f(x*) - f(x_t)) for t = 1 to T. An algorithm with sublinear cumulative regret (R_T / T approaches 0) is called a no-regret algorithm, meaning the average per-step suboptimality vanishes over time.

The foundational theoretical work for GP-based Bayesian optimization is due to Srinivas, Krause, Kakade, and Seeger (2010), who analyzed the GP-UCB algorithm.[8] They proved that GP-UCB achieves sublinear cumulative regret bounds that depend on the maximum information gain, a quantity from experimental design that measures how much information can be extracted from function evaluations.[8] For common kernel functions, the information gain grows slowly with the number of evaluations, yielding the following regret rates:

| Kernel | Cumulative Regret Bound |
|---|---|
| Squared exponential (RBF) | O(sqrt(T) * (log T)^(d+1)) |
| Matern (nu > 1) | O(T^((2nu + d(d+1))/(2nu + 2d(d+1))) * log T) |
| Linear | O(d * sqrt(T * log T)) |

Here d is the input dimension and T is the number of evaluations. A notable finding is that for smooth kernels like the squared exponential, the regret has surprisingly weak dependence on the dimensionality.

More recent work has extended convergence guarantees to other acquisition functions. Bull (2011) proved convergence rates for Expected Improvement under GP surrogates.[18] Recent results from 2025 have shown that GP-EI with certain incumbent choices is also a no-regret algorithm for both squared exponential and Matern kernels. Thompson sampling for GPs has also been shown to achieve sublinear Bayesian regret bounds.

These theoretical results provide reassurance that Bayesian optimization is not merely a heuristic but has principled convergence guarantees under appropriate assumptions about the objective function.

## What is Bayesian optimization used for?

Bayesian optimization has found success in a wide range of domains where function evaluations are expensive and the search space is complex.

### Hyperparameter tuning

Hyperparameter tuning is the most prominent application of Bayesian optimization in machine learning. Training a [deep learning](/wiki/deep_learning) model with a given set of hyperparameters (learning rate, batch size, weight decay, dropout rate, optimizer settings, and so on) can take hours or days on modern hardware, particularly when performance is measured through [cross-validation](/wiki/cross-validation) rather than a single train-test split. Bayesian optimization can find high-performing configurations in a fraction of the evaluations required by grid or random search.

The Snoek et al. (2012) paper demonstrated that Bayesian optimization could match or exceed hand-tuned hyperparameter settings on convolutional neural networks for image classification, topic models, and structured prediction models.[4] Since then, Bayesian optimization has become a standard component of AutoML systems including Auto-sklearn, Auto-WEKA, and Google Vizier.

### Neural architecture search

Neural architecture search (NAS) aims to automatically discover optimal [neural network](/wiki/neural_network) architectures. Because evaluating a candidate architecture requires training it to convergence (or at least for many epochs), NAS is an extremely expensive optimization problem. Bayesian optimization provides a principled way to navigate the vast architecture space efficiently.

Methods like BANANAS (White et al., 2019) combine Bayesian optimization with neural network predictors to achieve results that are over 100 times more sample-efficient than random search on standard NAS benchmarks.[17] Auto-Keras uses Bayesian optimization for architecture search, and multi-objective Bayesian optimization methods have been applied to find architectures that trade off accuracy against model size or inference latency. More recent work, such as the RoBoT algorithm (2024), employs Bayesian optimization to find optimal combinations of training-free NAS metrics, further reducing the computational cost of architecture discovery.

### Drug discovery

In drug discovery, Bayesian optimization is used to guide the search for molecules with desired properties such as binding affinity, selectivity, solubility, and low toxicity. Evaluating a candidate molecule may involve computationally expensive simulations, wet-lab synthesis, or biological assays, making sample efficiency critical.

Researchers have applied multi-fidelity Bayesian optimization to drug discovery workflows, where candidate molecules are first screened with cheap, approximate assays and then evaluated with more expensive, accurate methods. A 2025 study published in ACS Central Science demonstrated that multi-fidelity Bayesian optimization (MF-BO) could accelerate automated drug molecule discovery by taking advantage of different experimental fidelities and their associated costs. Batched Bayesian optimization has also been used to design experimental campaigns where multiple molecules are synthesized and tested in parallel. Studies have demonstrated that Bayesian optimization can reduce the number of experiments needed to identify promising drug candidates by an order of magnitude compared to traditional screening approaches.

### Materials science and chemistry

Bayesian optimization has accelerated discovery in materials science by guiding experiments toward promising compositions, processing conditions, and synthesis parameters. Applications include:

- **Catalyst optimization:** Closed-loop Bayesian optimization has been used to optimize catalyst compositions for CO2-to-methanol conversion, achieving 5.7x improvement in CO2 conversion rate and 12.6x improvement in methanol formation rate across a small number of experimental generations.
- **Polymer and nanoparticle design:** Bayesian optimization has been applied to optimize polymeric formulations, nanoparticle compositions, and thin-film manufacturing processes.
- **Organic photovoltaics:** Autonomous experimental platforms using Bayesian optimization with neural network surrogates have accelerated process optimization for organic solar cells.
- **Alloy design:** Bayesian optimization guides the exploration of composition spaces for high-entropy alloys and other advanced materials, reducing the need for exhaustive experimentation.
- **Targeted materials discovery:** A 2024 study in npj Computational Materials used Bayesian algorithm execution (BAX) to efficiently map phase boundaries and identify optimal material compositions, outperforming standard Bayesian optimization approaches in certain structured discovery tasks.

In bioprocess engineering, studies have shown that Bayesian optimization outperforms the industry-standard design of experiments methodology. For monoclonal antibody process optimization, Bayesian optimization achieved comparable product yield with only 14 experiments versus 45 experiments required by traditional response surface methodology, representing roughly a 70% reduction in experimental effort.

### Robotics and engineering

In robotics, Bayesian optimization is used to tune controller parameters, optimize gait patterns for walking robots, and calibrate sensor systems. In aerospace and mechanical engineering, it is applied to optimize shapes, structural designs, and manufacturing processes where each simulation or physical experiment is expensive.

### A/B testing and experimental design

Bayesian optimization is increasingly used in industry for adaptive experimental design, including optimizing website layouts, advertising strategies, and product configurations. The Ax platform from Meta uses Bayesian optimization to run adaptive experiments at scale, balancing multiple objectives such as user engagement, revenue, and latency.

## How does Bayesian optimization compare to grid search and random search?

Bayesian optimization is often compared with grid search and random search, the two most straightforward alternatives for hyperparameter tuning and black-box optimization.

| Feature | Grid Search | Random Search | Bayesian Optimization |
|---|---|---|---|
| Search strategy | Exhaustive evaluation of all points on a predefined grid | Random sampling from the search space | Sequential, model-guided sampling |
| Sample efficiency | Low; evaluates many redundant configurations | Moderate; better coverage of important dimensions | High; uses information from previous evaluations |
| Scalability with dimensions | Exponential cost growth (curse of dimensionality) | Scales better than grid search but still unguided | Handles moderate dimensionality well (typically up to 20 dimensions for GP-based methods) |
| Parallelism | Trivially parallel | Trivially parallel | Parallel via batch acquisition functions (qEI, qKG) |
| Handles conditional parameters | No | No | Yes (with TPE or SMAC surrogates) |
| Uses prior evaluations | No | No | Yes; each evaluation informs the next |
| Computational overhead | Negligible per iteration | Negligible per iteration | Moderate (surrogate fitting and acquisition optimization) |
| Typical number of evaluations | Hundreds to thousands | Tens to hundreds | Tens to low hundreds |
| Best suited for | Small, discrete search spaces with cheap evaluations | Quick baseline; moderate budgets | Expensive evaluations; limited budget |

Bergstra and Bengio (2012) showed that random search is significantly more efficient than grid search for hyperparameter optimization because, in many problems, only a small subset of hyperparameters strongly affects performance.[6] As they put it, "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid," finding that random search over the same domain "is able to find models that are as good or better within a small fraction of the computation time."[6] Random search samples these important dimensions more effectively than a grid. Bayesian optimization goes further by learning which regions of the search space are promising and focusing evaluations there, achieving the best results in the fewest trials.

Hyperband (Li et al., 2017) takes a different approach by combining random search with early stopping: it allocates a small budget to many configurations and progressively increases the budget for the most promising ones.[19] Hyperband can outperform vanilla random search and black-box Bayesian optimization for iterative algorithms such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) for [deep neural networks](/wiki/deep_neural_network).[19] BOHB (Bayesian Optimization and Hyperband) combines the best of both worlds by using TPE-guided sampling within the Hyperband framework.

In empirical comparisons, Bayesian optimization routinely finds configurations matching or exceeding those found by grid search while requiring 10x to 100x fewer evaluations. For example, Snoek et al. (2012) reported that Bayesian optimization achieved competitive results in roughly 30 to 60 evaluations on problems where grid search would require hundreds or thousands of configurations.[4]

## Multi-objective Bayesian optimization

Many real-world optimization problems involve multiple conflicting objectives. For example, a neural architecture search might seek to maximize accuracy while minimizing model size and inference time. Multi-objective Bayesian optimization (MOBO) extends the standard framework to handle these scenarios.

In multi-objective optimization, there is generally no single solution that is best on all objectives simultaneously. Instead, the goal is to find the Pareto frontier: the set of solutions where improving one objective necessarily worsens another. MOBO methods aim to identify this frontier sample-efficiently.

Several acquisition functions have been developed for MOBO:

- **ParEGO** (Knowles, 2006) randomly scalarizes the multiple objectives into a single objective using different weight vectors across iterations, then applies standard Expected Improvement.[13] It is simple and effective but may miss parts of the Pareto frontier.
- **Expected Hypervolume Improvement (EHVI)** measures the expected increase in the hypervolume dominated by the current Pareto frontier. EHVI is widely regarded as the gold standard for MOBO, though its computation becomes expensive as the number of objectives grows.
- **qNEHVI** (Noisy Expected Hypervolume Improvement) extends EHVI to handle noisy observations and batch evaluations, integrating over uncertainty in the Pareto frontier.[14] It is implemented in BoTorch and has shown strong performance in practical applications.[14]
- **qNParEGO** combines the scalarization approach of ParEGO with noisy expected improvement and batch evaluation support.

BoTorch provides first-class support for multi-objective Bayesian optimization, including efficient implementations of EHVI, qNEHVI, and qNParEGO with gradient-based optimization of acquisition functions via auto-differentiation.[11]

## Software tools and libraries

A rich ecosystem of open-source tools supports Bayesian optimization across different use cases.

### Optuna

Optuna is an automatic hyperparameter optimization framework introduced by Akiba et al. at KDD 2019.[12] It features a define-by-run API that allows users to construct the search space dynamically within the objective function, making it easy to define conditional and nested parameter spaces.[12] Optuna supports TPE, CMA-ES, and GP-based samplers, as well as pruning strategies that terminate unpromising trials early. It integrates with PyTorch, TensorFlow, and most major ML frameworks. Optuna is one of the most widely used hyperparameter optimization tools in industry and research.

### Hyperopt

Hyperopt, developed by James Bergstra and colleagues, is a Python library for serial and parallel optimization over search spaces that may include real-valued, discrete, and conditional dimensions. Its primary optimization algorithm is the Tree-structured Parzen Estimator (TPE).[5] Hyperopt was one of the first libraries to make Bayesian optimization accessible to the broader ML community and integrates with distributed computing frameworks through Hyperopt-Spark.

### BoTorch

BoTorch is a Bayesian optimization library built on PyTorch and GPyTorch, developed at Meta (Balandat et al., NeurIPS 2020).[11] It provides a modular, composable framework for research and production use, with support for Monte Carlo acquisition functions, multi-objective optimization, multi-fidelity optimization, constrained optimization, and custom surrogate models.[11] BoTorch leverages auto-differentiation for efficient gradient-based optimization of acquisition functions and supports GPU acceleration.

### Ax

Ax (Adaptive Experimentation Platform), also developed at Meta, is a higher-level platform built on top of BoTorch. Ax provides a user-friendly interface for running adaptive experiments and supports both single-objective and multi-objective optimization. It is designed for both research and production use, handling experiment management, trial scheduling, and result analysis. Ax is used internally at Meta for a wide variety of optimization tasks, from A/B testing to hardware configuration.

### Spearmint

Spearmint, created by Jasper Snoek, Hugo Larochelle, and Ryan P. Adams, was the software companion to the influential Snoek et al. (2012) paper.[4] It uses Gaussian process surrogates with a fully Bayesian treatment of kernel hyperparameters and supports parallel evaluations. While Spearmint was highly influential in popularizing Bayesian optimization for ML, it has been largely superseded by more actively maintained tools like BoTorch and Optuna.

### SMAC

SMAC (Sequential Model-based Algorithm Configuration) uses random forests as surrogate models and is particularly well suited for algorithm configuration problems with categorical, conditional, and high-dimensional parameter spaces.[7] SMAC3, the current Python implementation, combines Bayesian optimization with an aggressive racing mechanism to efficiently compare configurations. Its random forest surrogate is written in C++ for performance. SMAC has been a core component of the Auto-WEKA and Auto-sklearn AutoML systems.

### Comparison of tools

| Tool | Surrogate Model | Key Feature | Primary Use Case | Language |
|---|---|---|---|---|
| [Optuna](/wiki/optuna) | TPE, CMA-ES, GP | Define-by-run API; pruning | General hyperparameter tuning | Python |
| Hyperopt | TPE | Distributed optimization via Spark | Hyperparameter tuning | Python |
| BoTorch | GP (GPyTorch) | Monte Carlo acquisition; multi-objective | Research; advanced BO applications | Python (PyTorch) |
| Ax | GP (via BoTorch) | Experiment management platform | Production adaptive experimentation | Python |
| Spearmint | GP (fully Bayesian) | Marginalization over GP hyperparameters | Academic research (historical) | Python |
| SMAC | [Random Forest](/wiki/random_forest) | Handles conditional/categorical parameters | Algorithm configuration; AutoML | Python (C++ backend) |

## Advanced topics

### High-dimensional Bayesian optimization

Standard GP-based Bayesian optimization works well in low to moderate dimensions (up to roughly 10 to 20 parameters) but struggles in higher dimensions due to the difficulty of learning a good GP model from limited data. Several strategies address this challenge, including random embeddings (REMBO), additive kernel decompositions, and trust-region methods such as TuRBO (Eriksson et al., 2019), which maintains local GP models within trust regions and has demonstrated strong performance on problems with hundreds of dimensions.[15]

### Multi-fidelity Bayesian optimization

Many expensive optimization problems offer cheaper, lower-fidelity approximations of the objective (for example, training a neural network for fewer epochs, or using a smaller dataset). Multi-fidelity Bayesian optimization methods, such as Fabolas (Klein et al., 2017) and multi-task GP models, exploit these cheap approximations to learn about the objective function faster. The key idea is to evaluate cheap approximations frequently and expensive, high-fidelity evaluations sparingly, guided by the surrogate model.

### Constrained Bayesian optimization

In many practical problems, candidate solutions must satisfy constraints (for example, a model must fit within a memory budget, or a material must meet safety requirements). Constrained Bayesian optimization extends the standard framework by modeling both the objective and the constraint functions with surrogate models, and modifying the acquisition function to account for the probability of constraint satisfaction. Recent work such as c-TPE (2023) has extended the TPE algorithm to handle inequality constraints natively within the Parzen estimator framework.

### Batch (parallel) Bayesian optimization

When multiple evaluations can be run simultaneously (for example, on a compute cluster), batch Bayesian optimization selects a set of points to evaluate in parallel rather than one at a time. Batch acquisition functions such as q-Expected Improvement (qEI), q-Knowledge Gradient (qKG), and q-Upper Confidence Bound (qUCB) extend their sequential counterparts to the parallel setting. Efficient batch methods are critical for modern hyperparameter tuning workflows that leverage distributed computing.

## Advantages and limitations

### Advantages

- **Sample efficiency:** Bayesian optimization typically finds good solutions with far fewer evaluations than grid search, random search, or evolutionary methods, making it ideal for expensive objective functions.
- **Uncertainty-aware:** The probabilistic surrogate model quantifies uncertainty, enabling principled exploration of the search space.
- **Global optimization:** The method searches globally rather than getting trapped in local optima, thanks to the exploration component of acquisition functions.
- **Flexible:** Bayesian optimization can handle continuous, discrete, categorical, and conditional parameters, especially with TPE or random forest surrogates.
- **Principled stopping:** The surrogate model can indicate when further evaluations are unlikely to yield significant improvement.

### Limitations

- **Scalability in dimensions:** GP-based Bayesian optimization becomes less effective in very high-dimensional spaces (beyond roughly 20 dimensions for standard GPs), although methods like TuRBO and REMBO mitigate this.
- **Computational overhead:** Fitting the surrogate model and optimizing the acquisition function adds overhead to each iteration. For very cheap objective functions, this overhead may exceed the cost of additional evaluations, making simpler methods preferable.
- **Smoothness assumption:** GP surrogates assume the objective function has some degree of smoothness. Highly discontinuous or adversarial functions may violate this assumption.
- **Noise handling:** While GP models can accommodate noise, very high noise levels can degrade the quality of the surrogate fit and slow convergence.
- **Parallelism complexity:** Although batch methods exist, they add complexity compared to the trivially parallel nature of grid and random search.
- **[Overfitting](/wiki/overfitting) the surrogate:** With very few initial observations, the surrogate model may overfit to noise or initial samples, leading to poor acquisition function recommendations in early iterations.

## Explain Like I'm 5 (ELI5)

Imagine you are trying to find the highest hill in a big, foggy field, but every step you take is very slow and tiring. Instead of walking everywhere, you bring a smart helper who keeps a map. Each time you visit a spot and check its height, the helper updates the map with a best guess of what the whole field looks like, including which areas might be tall hills and which areas are still unknown.

The helper then picks the next spot for you to visit: sometimes a place that looks promising based on the map, sometimes an unexplored area that might be hiding a tall hill. After just a few visits, the helper's map is good enough to point you to the highest hill without needing to visit every spot in the field.

That is how Bayesian optimization works: it builds a map of the problem, uses that map to make smart choices about where to look next, and finds the best answer with as few tries as possible.

## References

1. Kushner, H. J. (1964). "A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise." Journal of Basic Engineering, 86(1), 97-106.
2. Mockus, J., Tiesis, V., and Zilinskas, A. (1978). "The application of Bayesian methods for seeking the extremum." Towards Global Optimization, 2, 117-129.
3. Jones, D. R., Schonlau, M., and Welch, W. J. (1998). "Efficient global optimization of expensive black-box functions." Journal of Global Optimization, 13(4), 455-492.
4. Snoek, J., Larochelle, H., and Adams, R. P. (2012). "Practical Bayesian optimization of machine learning algorithms." Advances in Neural Information Processing Systems, 25. arXiv:1206.2944.
5. Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. (2011). "Algorithms for hyper-parameter optimization." Advances in Neural Information Processing Systems, 24.
6. Bergstra, J. and Bengio, Y. (2012). "Random search for hyper-parameter optimization." Journal of Machine Learning Research, 13, 281-305.
7. Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). "Sequential model-based optimization for general algorithm configuration." Proceedings of LION-5, 507-523.
8. Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). "Gaussian process optimization in the bandit setting: No regret and experimental design." Proceedings of the 27th International Conference on Machine Learning (ICML).
9. Frazier, P. I., Powell, W. B., and Dayanik, S. (2009). "The knowledge-gradient policy for correlated normal beliefs." INFORMS Journal on Computing, 21(4), 599-613.
10. Frazier, P. I. (2018). "A tutorial on Bayesian optimization." arXiv preprint arXiv:1807.02811.
11. Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). "BoTorch: A framework for efficient Monte-Carlo Bayesian optimization." Advances in Neural Information Processing Systems, 33.
12. Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). "Optuna: A next-generation hyperparameter optimization framework." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2623-2631.
13. Knowles, J. (2006). "ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems." IEEE Transactions on Evolutionary Computation, 10(1), 50-66.
14. Daulton, S., Balandat, M., and Bakshy, E. (2020). "Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization." Advances in Neural Information Processing Systems, 33.
15. Eriksson, D., Pearce, M., Gardner, J., Turner, R., and Poloczek, M. (2019). "Scalable global optimization via local Bayesian optimization." Advances in Neural Information Processing Systems, 32.
16. Hennig, P. and Schuler, C. J. (2012). "Entropy search for information-efficient global optimization." Journal of Machine Learning Research, 13, 1809-1837.
17. White, C., Neiswanger, W., and Savani, Y. (2019). "BANANAS: Bayesian optimization with neural architectures for neural architecture search." arXiv preprint arXiv:1910.11858.
18. Bull, A. D. (2011). "Convergence rates of efficient global optimization algorithms." Journal of Machine Learning Research, 12, 2879-2904.
19. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2017). "Hyperband: A novel bandit-based approach to hyperparameter optimization." Journal of Machine Learning Research, 18(185), 1-52.

