Bayesian optimization is a sequential strategy for the global optimization of expensive black-box functions. Unlike gradient-based methods such as gradient descent, it does not require access to derivatives of the objective function and makes no assumptions about its convexity. Instead, Bayesian optimization builds a probabilistic surrogate model of the objective, uses that model to decide where to evaluate next, and updates the model with each new observation. This makes it especially well suited to problems where each function evaluation is costly in time, money, or computational resources.
The core idea draws on Bayesian inference: a prior belief about the objective function is combined with observed data to form a posterior distribution, which in turn guides the search. Because the method reasons about uncertainty, it can balance exploration (sampling in regions of high uncertainty) with exploitation (sampling in regions predicted to perform well), converging on good solutions with far fewer evaluations than brute-force alternatives.
Bayesian optimization has been widely adopted across machine learning, engineering design, drug discovery, materials science, and many other fields where evaluations are expensive and the search space may be high-dimensional, mixed-type, or constrained.
The intellectual roots of Bayesian optimization extend to the early 1960s. In 1964, Harold J. Kushner published "A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise," which used a Wiener process (Brownian motion) as a stochastic model for an unknown one-dimensional function and proposed maximizing the probability of improvement as a decision criterion. Kushner's formulation already contained the essential insight of using a probabilistic model to guide sequential decisions, including a parameter controlling the trade-off between more global and more local search.
The foundations of Bayesian optimization as a general methodology, however, trace back to the work of Jonas Mockus and colleagues in the Soviet Union during the 1970s and 1980s. Mockus developed methods for multidimensional optimization using Gaussian random fields and, in a landmark 1978 paper, introduced the concept of expected improvement as an acquisition criterion. The term "Bayesian optimization" is generally attributed to Mockus, who coined it in his series of publications on global optimization during this period.
The technique gained renewed attention in the engineering community through the 1998 paper by Donald R. Jones, Matthias Schonlau, and William J. Welch, which introduced the Efficient Global Optimization (EGO) algorithm. EGO combined Gaussian process regression with the expected improvement acquisition function and provided a principled stopping rule. The paper, published in the Journal of Global Optimization, demonstrated that the method could optimize expensive black-box functions with remarkably few evaluations compared to traditional design-of-experiments approaches.
Bayesian optimization became widely known in the machine learning community after Jasper Snoek, Hugo Larochelle, and Ryan P. Adams published "Practical Bayesian Optimization of Machine Learning Algorithms" at NeurIPS 2012. Their work showed that Bayesian optimization could automatically tune hyperparameters of machine learning models (including convolutional neural networks, latent Dirichlet allocation, and structured SVMs) to match or surpass configurations found by human experts. The accompanying software package, Spearmint, became one of the first widely used Bayesian optimization tools in the ML community. Snoek et al. also advocated for a fully Bayesian treatment of Gaussian process kernel hyperparameters (marginalizing rather than optimizing them) and introduced practical algorithms for parallel evaluation and variable-cost experiments.
Since then, the field has grown rapidly. New surrogate models, acquisition functions, and software libraries have been developed, and the technique has been applied to problems ranging from neural architecture search to chemical synthesis optimization.
The Bayesian optimization loop consists of the following steps:
1. Fit a probabilistic surrogate model to all observations of the objective collected so far.
2. Maximize an acquisition function over the search space to select the most promising point to evaluate next.
3. Evaluate the true objective function at the selected point.
4. Add the new observation to the dataset and repeat until the evaluation budget is exhausted.
Because each iteration uses all previously collected data to inform the next evaluation, Bayesian optimization is highly sample-efficient. The method typically finds good solutions in tens to hundreds of evaluations, even for objective functions that would require thousands of evaluations under grid or random search.
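To make the loop concrete, here is a minimal sketch, assuming a one-dimensional toy objective (to be minimized), a Gaussian process surrogate from scikit-learn, and expected improvement maximized over a dense candidate grid; all names and settings are illustrative rather than recommended.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive black-box function (to be minimized).
    return np.sin(3 * x) + 0.5 * x ** 2

def expected_improvement(X_cand, gp, f_best):
    # EI(x) = E[max(f_best - f(x), 0)] under the GP posterior.
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # guard against division by zero
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))  # small initial design
y = objective(X).ravel()

for _ in range(20):
    # Step 1: fit the surrogate to all data collected so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    # Step 2: maximize the acquisition function over a candidate grid.
    X_cand = np.linspace(-2, 2, 1000).reshape(-1, 1)
    x_next = X_cand[np.argmax(expected_improvement(X_cand, gp, y.min()))]
    # Steps 3-4: evaluate the objective and augment the dataset.
    X = np.vstack([X, x_next.reshape(1, -1)])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmin(y)], "best f(x):", y.min())
```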
A central concept in Bayesian optimization is the tension between exploration and exploitation. Exploitation means sampling where the surrogate model predicts a high objective value, focusing the search on regions that appear promising based on current knowledge. Exploration means sampling in regions where the surrogate model is uncertain, gathering information that could reveal previously unknown good regions.
Pure exploitation risks converging prematurely to a local optimum because the surrogate model may be inaccurate in unexplored regions. Pure exploration wastes evaluations by sampling in regions that are unlikely to contain the optimum. Effective Bayesian optimization requires balancing these two goals, and the acquisition function is the mechanism that encodes this balance.
Different acquisition functions handle the tradeoff differently. Expected Improvement (EI) implicitly balances the two by being large both where the predicted mean is favorable and where uncertainty is high. The Upper Confidence Bound (UCB) provides explicit control through a tunable parameter, kappa, that weights the contribution of the uncertainty term. Probability of Improvement (PI) tends to favor exploitation, which is why it often converges prematurely on its own. Information-theoretic methods like Entropy Search approach the tradeoff by prioritizing evaluations that reduce global uncertainty about the location of the optimum.
The optimal balance depends on the evaluation budget. With a small budget, moderate exploration is important because the surrogate model is highly uncertain. As the budget grows and the model becomes more accurate, the optimizer should shift toward exploitation to refine the best-known solution.
The surrogate model (also called the response surface model) is the probabilistic approximation of the true objective function. It must provide both a prediction (mean estimate) and a measure of uncertainty (variance or confidence interval) at any point in the search space. Several types of surrogate models are used in practice.
Gaussian processes (GPs) are the most common surrogate model in Bayesian optimization. A GP is a non-parametric model that defines a distribution over functions. It is fully specified by a mean function and a covariance (kernel) function. Given a set of observations, the GP posterior provides a closed-form predictive distribution at any new input point, yielding both a mean prediction and a variance.
GPs are attractive because they naturally quantify uncertainty, which is essential for acquisition functions. They also provide exact inference in the case of Gaussian noise. However, standard GP inference has O(n^3) computational complexity in the number of observations, which can become a bottleneck when the number of evaluations grows large (typically beyond a few thousand). Various sparse GP approximations and scalable GP methods have been developed to address this limitation.
The choice of kernel function (such as the squared exponential, Matern, or rational quadratic kernel) encodes assumptions about the smoothness and structure of the objective function. Kernel hyperparameters are typically learned from data via maximum likelihood estimation or, as Snoek et al. (2012) recommended, by marginalizing over them with Markov chain Monte Carlo (MCMC) for a fully Bayesian treatment.
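As a concrete illustration of the maximum-likelihood route (the fully Bayesian treatment requires MCMC machinery not shown here), the following scikit-learn sketch fits a Matern kernel's hyperparameters by restarting the marginal-likelihood optimizer from random initializations; the kernel settings and toy data are assumptions for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

# Matern kernel with a learnable output scale and length scale; nu = 2.5
# encodes an assumption that the objective is fairly smooth.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,  # restart the MLE search to avoid local optima
    normalize_y=True,
)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(12, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

gp.fit(X, y)       # maximizes the log marginal likelihood
print(gp.kernel_)  # inspect the fitted kernel hyperparameters
mu, sigma = gp.predict(np.array([[0.5, 0.5]]), return_std=True)
```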
The SMAC (Sequential Model-based Algorithm Configuration) framework, introduced by Hutter, Hoos, and Leyton-Brown in 2011, uses random forests as the surrogate model instead of Gaussian processes. In this approach, the mean prediction is the average of the individual trees' predictions, and uncertainty is estimated from the variance across trees.
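The following toy snippet (an illustration of this idea, not SMAC itself) derives a mean and an uncertainty estimate from a scikit-learn random forest by aggregating the per-tree predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=60)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = rng.uniform(0.0, 1.0, size=(5, 3))
# Mean prediction: average over trees; uncertainty: spread across trees.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
```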
Random forest surrogates offer several advantages over GPs for certain problems. They handle categorical and conditional hyperparameters naturally without special encoding. They also scale better to high-dimensional spaces and to large numbers of observations, since random forest training is O(n log n) per tree rather than the O(n^3) required by exact GP inference. On the other hand, random forests do not provide the same smooth, well-calibrated uncertainty estimates as GPs, and their predictions can be less accurate in low-data regimes.
SMAC has been particularly successful in the automated algorithm configuration community, where it is used to tune parameters of combinatorial solvers, machine learning pipelines, and other complex algorithms.
The Tree-structured Parzen Estimator (TPE), introduced by James Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl at NeurIPS 2011, takes a fundamentally different approach from GP-based methods. Rather than modeling the objective function directly (p(y | x)), TPE models the inverse: it builds two density estimators over the input space, one for observations that produced "good" results (below a threshold y*) and one for "bad" results (above y*).
Formally, TPE models two conditional densities over the input space:
- l(x), the density of inputs whose observed objective values fall below the threshold y* (the "good" observations), and
- g(x), the density of inputs whose observed objective values fall at or above y* (the "bad" observations).
The acquisition strategy then selects points that maximize the ratio l(x) / g(x), which can be shown to be proportional to the expected improvement criterion.
TPE naturally handles tree-structured (conditional) search spaces, where the value of one hyperparameter determines whether another hyperparameter is relevant. This makes it especially well suited for complex machine learning pipelines. TPE is the default optimizer in the Hyperopt library and is also available in Optuna. It scales well to high-dimensional spaces where GPs struggle and can handle thousands of observations efficiently.
The name "Tree-structured" refers to two aspects: the use of Parzen (kernel density) estimation to model the distributions, and the tree-like posterior-inference graph structure used to handle conditional parameter spaces efficiently.
Acquisition functions use the surrogate model's predictions and uncertainty estimates to assign a utility value to each candidate point in the search space. By maximizing the acquisition function, the optimizer selects the next point to evaluate. Different acquisition functions encode different trade-offs between exploration and exploitation.
Expected Improvement is the most widely used acquisition function in Bayesian optimization. It measures the expected amount by which a candidate point will improve over the current best observed value. Given a surrogate model with predictive mean mu(x) and predictive standard deviation sigma(x), and the current best observation f*, the expected improvement is:
EI(x) = E[max(f* - f(x), 0)]
For a Gaussian predictive distribution, EI has a convenient closed-form expression involving the standard normal probability density and cumulative distribution functions. EI naturally balances exploration and exploitation: it is large both where the predicted value is better than the current best (exploitation) and where uncertainty is high (exploration).
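Concretely, in the minimization convention used above, EI(x) = (f* - mu(x)) * Phi(z) + sigma(x) * phi(z), where z = (f* - mu(x)) / sigma(x) (with EI(x) = 0 wherever sigma(x) = 0), and Phi and phi denote the standard normal cumulative distribution and density functions.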
EI was formalized in the context of Bayesian optimization by Jones et al. (1998) and remains the default choice in many implementations.
Probability of Improvement (also called Maximum Probability of Improvement, or MPI) selects the point that has the highest probability of producing a value better than the current best:
PI(x) = P(f(x) < f*)
PI is simple and computationally cheap but tends to be overly exploitative. It favors points that are very likely to produce a small improvement over the current best, even if the expected magnitude of improvement is tiny. For this reason, PI often converges prematurely to local optima and is less commonly used as a standalone acquisition function in modern implementations. Kushner (1964) was the first to propose this criterion, making it the oldest acquisition function in the Bayesian optimization literature.
The Upper Confidence Bound (also known as GP-UCB in the Gaussian process context) selects the point that maximizes:
UCB(x) = mu(x) + kappa * sigma(x)
where kappa is a tunable parameter that controls the exploration-exploitation trade-off. A larger kappa encourages more exploration by placing greater weight on uncertainty, while a smaller kappa encourages exploitation by focusing on the predictive mean.
UCB has strong theoretical guarantees: Srinivas et al. (2010) proved sublinear cumulative regret bounds for GP-UCB under certain assumptions about the kernel and the function class, establishing a novel connection between GP optimization and experimental design through the concept of maximum information gain. UCB is intuitive and easy to implement. Its main drawback is that the performance depends on the choice of kappa, which may require problem-specific tuning.
The Knowledge Gradient, developed by Peter Frazier, Warren Powell, and Savas Dayanik (2009), selects the point whose evaluation would produce the largest expected improvement in the quality of the overall decision. Unlike EI, which considers only the improvement at the sampled point, KG considers how a new observation would change the surrogate model and thereby improve the recommendation across the entire search space.
Formally, the knowledge gradient at a point x is the expected reduction in the minimum of the posterior mean after observing f(x). KG is a one-step Bayes-optimal policy, meaning it makes the best possible decision if only one more evaluation remains. In practice, KG often outperforms EI when the goal is to make the best final recommendation rather than to find the global optimum as quickly as possible. However, KG is more computationally expensive to evaluate than EI or UCB because it requires solving an inner optimization problem.
The parallel knowledge gradient extends this idea to batch settings, where multiple points are evaluated simultaneously. BoTorch provides an efficient "one-shot" implementation of KG that leverages auto-differentiation for optimization.
Entropy Search (Hennig and Schuler, 2012) and Predictive Entropy Search (Hernandez-Lobato et al., 2014) take an information-theoretic approach. They select the point whose evaluation would most reduce the entropy (uncertainty) of the posterior distribution over the location of the global optimum. These methods can be highly effective but are computationally demanding, as they require approximating intractable integrals over the posterior.
| Acquisition Function | Principle | Strengths | Weaknesses |
|---|---|---|---|
| Expected Improvement (EI) | Expected amount of improvement over current best | Closed-form for GPs; good exploration-exploitation balance | Can over-exploit in high noise settings |
| Probability of Improvement (PI) | Probability of beating the current best | Simple, computationally cheap | Tends to be overly exploitative; premature convergence |
| Upper Confidence Bound (UCB) | Optimistic estimate using mean plus scaled uncertainty | Theoretical regret bounds; intuitive tuning via kappa | Performance depends on kappa parameter choice |
| Knowledge Gradient (KG) | Expected improvement in final recommendation quality | One-step Bayes-optimal; strong in noisy settings | Computationally expensive; harder to implement |
| Entropy Search | Reduction in entropy of the optimum location | Information-theoretically principled | Computationally very expensive; approximation required |
Bayesian optimization has a growing body of theoretical results that provide guarantees on its convergence behavior. Two primary notions of performance are used in this literature.
Simple regret measures the quality of the best solution found so far. After T evaluations, the simple regret is r_T = f(x*) - f(x_best), where x* is the true global optimum and x_best is the best point evaluated. An algorithm with vanishing simple regret will eventually identify the global optimum.
Cumulative regret measures the total suboptimality accumulated across all evaluations: R_T = sum of (f(x*) - f(x_t)) for t = 1 to T. An algorithm with sublinear cumulative regret (R_T / T approaches 0) is called a no-regret algorithm, meaning the average per-step suboptimality vanishes over time.
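As a small illustration (assuming maximization and a known optimum f_opt, which is available only in benchmark settings), both notions can be computed from an evaluation trace as follows:

```python
import numpy as np

def regrets(f_values, f_opt):
    """Simple and cumulative regret at every step of an evaluation trace."""
    f_values = np.asarray(f_values, dtype=float)
    instantaneous = f_opt - f_values                    # f(x*) - f(x_t)
    cumulative = np.cumsum(instantaneous)               # R_T for each T
    simple = f_opt - np.maximum.accumulate(f_values)    # r_T for each T
    return simple, cumulative
```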
The foundational theoretical work for GP-based Bayesian optimization is due to Srinivas, Krause, Kakade, and Seeger (2010), who analyzed the GP-UCB algorithm. They proved that GP-UCB achieves sublinear cumulative regret bounds that depend on the maximum information gain, a quantity from experimental design that measures how much information can be extracted from function evaluations. For common kernel functions, the information gain grows slowly with the number of evaluations, yielding the following regret rates:
| Kernel | Cumulative Regret Bound |
|---|---|
| Squared exponential (RBF) | O(sqrt(T) * (log T)^(d+1)) |
| Matern (nu > 1) | O(T^((2nu + d(d+1))/(2nu + 2d(d+1))) * log T) |
| Linear | O(d * sqrt(T * log T)) |
Here d is the input dimension and T is the number of evaluations. A notable finding is that for smooth kernels like the squared exponential, the regret has surprisingly weak dependence on the dimensionality.
More recent work has extended convergence guarantees to other acquisition functions. Bull (2011) proved convergence rates for Expected Improvement under GP surrogates. Recent results from 2025 have shown that GP-EI with certain incumbent choices is also a no-regret algorithm for both squared exponential and Matern kernels. Thompson sampling for GPs has also been shown to achieve sublinear Bayesian regret bounds.
These theoretical results provide reassurance that Bayesian optimization is not merely a heuristic but has principled convergence guarantees under appropriate assumptions about the objective function.
Bayesian optimization has found success in a wide range of domains where function evaluations are expensive and the search space is complex.
Hyperparameter tuning is the most prominent application of Bayesian optimization in machine learning. Training a deep learning model with a given set of hyperparameters (learning rate, batch size, weight decay, dropout rate, optimizer settings, and so on) can take hours or days on modern hardware, particularly when performance is measured through cross-validation rather than a single train-test split. Bayesian optimization can find high-performing configurations in a fraction of the evaluations required by grid or random search.
The Snoek et al. (2012) paper demonstrated that Bayesian optimization could match or exceed hand-tuned hyperparameter settings on convolutional neural networks for image classification, topic models, and structured prediction models. Since then, Bayesian optimization has become a standard component of AutoML systems including Auto-sklearn, Auto-WEKA, and Google Vizier.
Neural architecture search (NAS) aims to automatically discover optimal neural network architectures. Because evaluating a candidate architecture requires training it to convergence (or at least for many epochs), NAS is an extremely expensive optimization problem. Bayesian optimization provides a principled way to navigate the vast architecture space efficiently.
Methods like BANANAS (White et al., 2019) combine Bayesian optimization with neural network predictors to achieve results that are over 100 times more sample-efficient than random search on standard NAS benchmarks. Auto-Keras uses Bayesian optimization for architecture search, and multi-objective Bayesian optimization methods have been applied to find architectures that trade off accuracy against model size or inference latency. More recent work, such as the RoBoT algorithm (2024), employs Bayesian optimization to find optimal combinations of training-free NAS metrics, further reducing the computational cost of architecture discovery.
In drug discovery, Bayesian optimization is used to guide the search for molecules with desired properties such as binding affinity, selectivity, solubility, and low toxicity. Evaluating a candidate molecule may involve computationally expensive simulations, wet-lab synthesis, or biological assays, making sample efficiency critical.
Researchers have applied multi-fidelity Bayesian optimization to drug discovery workflows, where candidate molecules are first screened with cheap, approximate assays and then evaluated with more expensive, accurate methods. A 2025 study published in ACS Central Science demonstrated that multi-fidelity Bayesian optimization (MF-BO) could accelerate automated drug molecule discovery by taking advantage of different experimental fidelities and their associated costs. Batched Bayesian optimization has also been used to design experimental campaigns where multiple molecules are synthesized and tested in parallel. Studies have demonstrated that Bayesian optimization can reduce the number of experiments needed to identify promising drug candidates by an order of magnitude compared to traditional screening approaches.
Bayesian optimization has accelerated discovery in materials science by guiding experiments toward promising compositions, processing conditions, and synthesis parameters.
In bioprocess engineering, studies have shown that Bayesian optimization outperforms the industry-standard design of experiments methodology. For monoclonal antibody process optimization, Bayesian optimization achieved comparable product yield with only 14 experiments versus 45 experiments required by traditional response surface methodology, representing roughly a 70% reduction in experimental effort.
In robotics, Bayesian optimization is used to tune controller parameters, optimize gait patterns for walking robots, and calibrate sensor systems. In aerospace and mechanical engineering, it is applied to optimize shapes, structural designs, and manufacturing processes where each simulation or physical experiment is expensive.
Bayesian optimization is increasingly used in industry for adaptive experimental design, including optimizing website layouts, advertising strategies, and product configurations. The Ax platform from Meta uses Bayesian optimization to run adaptive experiments at scale, balancing multiple objectives such as user engagement, revenue, and latency.
Bayesian optimization is often compared with grid search and random search, the two most straightforward alternatives for hyperparameter tuning and black-box optimization.
| Feature | Grid Search | Random Search | Bayesian Optimization |
|---|---|---|---|
| Search strategy | Exhaustive evaluation of all points on a predefined grid | Random sampling from the search space | Sequential, model-guided sampling |
| Sample efficiency | Low; evaluates many redundant configurations | Moderate; better coverage of important dimensions | High; uses information from previous evaluations |
| Scalability with dimensions | Exponential cost growth (curse of dimensionality) | Scales better than grid search but still unguided | Handles moderate dimensionality well (typically up to 20 dimensions for GP-based methods) |
| Parallelism | Trivially parallel | Trivially parallel | Parallel via batch acquisition functions (qEI, qKG) |
| Handles conditional parameters | No | No | Yes (with TPE or SMAC surrogates) |
| Uses prior evaluations | No | No | Yes; each evaluation informs the next |
| Computational overhead | Negligible per iteration | Negligible per iteration | Moderate (surrogate fitting and acquisition optimization) |
| Typical number of evaluations | Hundreds to thousands | Tens to hundreds | Tens to low hundreds |
| Best suited for | Small, discrete search spaces with cheap evaluations | Quick baseline; moderate budgets | Expensive evaluations; limited budget |
Bergstra and Bengio (2012) showed that random search is significantly more efficient than grid search for hyperparameter optimization because, in many problems, only a small subset of hyperparameters strongly affects performance. Random search samples these important dimensions more effectively than a grid. Bayesian optimization goes further by learning which regions of the search space are promising and focusing evaluations there, achieving the best results in the fewest trials.
Hyperband (Li et al., 2017) takes a different approach by combining random search with early stopping: it allocates a small budget to many configurations and progressively increases the budget for the most promising ones. Hyperband can outperform vanilla random search and black-box Bayesian optimization for iterative algorithms such as stochastic gradient descent for deep neural networks. BOHB (Bayesian Optimization and Hyperband) combines the best of both worlds by using TPE-guided sampling within the Hyperband framework.
In empirical comparisons, Bayesian optimization routinely finds configurations matching or exceeding those found by grid search while requiring 10x to 100x fewer evaluations. For example, Snoek et al. (2012) reported that Bayesian optimization achieved competitive results in roughly 30 to 60 evaluations on problems where grid search would require hundreds or thousands of configurations.
Many real-world optimization problems involve multiple conflicting objectives. For example, a neural architecture search might seek to maximize accuracy while minimizing model size and inference time. Multi-objective Bayesian optimization (MOBO) extends the standard framework to handle these scenarios.
In multi-objective optimization, there is generally no single solution that is best on all objectives simultaneously. Instead, the goal is to find the Pareto frontier: the set of solutions where improving one objective necessarily worsens another. MOBO methods aim to identify this frontier sample-efficiently.
Several acquisition functions have been developed for MOBO, including:
- Expected Hypervolume Improvement (EHVI), which measures the expected increase in the volume of objective space dominated by the current Pareto front;
- qNEHVI, a noisy, batch-capable variant of EHVI; and
- ParEGO and its batch variant qNParEGO, which reduce the problem to single-objective optimization through random scalarizations of the objectives.
BoTorch provides first-class support for multi-objective Bayesian optimization, including efficient implementations of EHVI, qNEHVI, and qNParEGO with gradient-based optimization of acquisition functions via auto-differentiation.
A rich ecosystem of open-source tools supports Bayesian optimization across different use cases.
Optuna is an automatic hyperparameter optimization framework introduced by Akiba et al. at KDD 2019. It features a define-by-run API that allows users to construct the search space dynamically within the objective function, making it easy to define conditional and nested parameter spaces. Optuna supports TPE, CMA-ES, and GP-based samplers, as well as pruning strategies that terminate unpromising trials early. It integrates with PyTorch, TensorFlow, and most major ML frameworks. Optuna is one of the most widely used hyperparameter optimization tools in industry and research.
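A minimal define-by-run sketch: the search space below is constructed dynamically inside the objective, so the parameter C exists only when the svm branch is taken. The parameter names and toy objective are illustrative.

```python
import optuna

def objective(trial):
    # The space is defined as the function runs: "C" is sampled only on
    # the "svm" branch, yielding a conditional (tree-structured) space.
    classifier = trial.suggest_categorical("classifier", ["svm", "tree"])
    if classifier == "svm":
        c = trial.suggest_float("C", 1e-3, 1e3, log=True)
        return -abs(c - 10.0)            # toy score to maximize
    max_depth = trial.suggest_int("max_depth", 2, 16)
    return -abs(max_depth - 8)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),  # TPE is the default sampler
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```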
Hyperopt, developed by James Bergstra and colleagues, is a Python library for serial and parallel optimization over search spaces that may include real-valued, discrete, and conditional dimensions. Its primary optimization algorithm is the Tree-structured Parzen Estimator (TPE). Hyperopt was one of the first libraries to make Bayesian optimization accessible to the broader ML community and supports distributed optimization through backends such as Apache Spark (via its SparkTrials interface) and MongoDB.
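In contrast to Optuna's define-by-run style, Hyperopt describes the search space declaratively up front; a minimal sketch with a conditional space (names and toy loss are illustrative):

```python
from hyperopt import Trials, fmin, hp, tpe

# Declarative conditional space: "C" is sampled only on the svm branch.
space = hp.choice("classifier", [
    {"type": "svm", "C": hp.loguniform("C", -3, 3)},
    {"type": "tree", "max_depth": hp.quniform("max_depth", 2, 16, 1)},
])

def loss(params):
    # Toy loss standing in for a real cross-validation error.
    if params["type"] == "svm":
        return (params["C"] - 1.0) ** 2
    return abs(params["max_depth"] - 8) / 10.0

trials = Trials()
best = fmin(fn=loss, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)
```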
BoTorch is a Bayesian optimization library built on PyTorch and GPyTorch, developed at Meta (Balandat et al., NeurIPS 2020). It provides a modular, composable framework for research and production use, with support for Monte Carlo acquisition functions, multi-objective optimization, multi-fidelity optimization, constrained optimization, and custom surrogate models. BoTorch leverages auto-differentiation for efficient gradient-based optimization of acquisition functions and supports GPU acceleration.
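A minimal single-objective sketch (BoTorch's API has shifted across versions; this follows the pattern of recent releases, and the toy objective is an assumption for the example):

```python
import torch
from botorch.acquisition import ExpectedImprovement
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(8, 2, dtype=torch.double)
train_Y = -((train_X - 0.5) ** 2).sum(dim=-1, keepdim=True)  # toy objective

gp = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)  # fit kernel hyperparameters by marginal likelihood

acq = ExpectedImprovement(gp, best_f=train_Y.max())
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
candidate, acq_value = optimize_acqf(
    acq, bounds=bounds, q=1, num_restarts=5, raw_samples=64,
)
```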
Ax (Adaptive Experimentation Platform), also developed at Meta, is a higher-level platform built on top of BoTorch. Ax provides a user-friendly interface for running adaptive experiments and supports both single-objective and multi-objective optimization. It is designed for both research and production use, handling experiment management, trial scheduling, and result analysis. Ax is used internally at Meta for a wide variety of optimization tasks, from A/B testing to hardware configuration.
Spearmint, created by Jasper Snoek, Hugo Larochelle, and Ryan P. Adams, was the software companion to the influential Snoek et al. (2012) paper. It uses Gaussian process surrogates with a fully Bayesian treatment of kernel hyperparameters and supports parallel evaluations. While Spearmint was highly influential in popularizing Bayesian optimization for ML, it has been largely superseded by more actively maintained tools like BoTorch and Optuna.
SMAC (Sequential Model-based Algorithm Configuration) uses random forests as surrogate models and is particularly well suited for algorithm configuration problems with categorical, conditional, and high-dimensional parameter spaces. SMAC3, the current Python implementation, combines Bayesian optimization with an aggressive racing mechanism to efficiently compare configurations. Its random forest surrogate is written in C++ for performance. SMAC has been a core component of the Auto-WEKA and Auto-sklearn AutoML systems.
| Tool | Surrogate Model | Key Feature | Primary Use Case | Language |
|---|---|---|---|---|
| Optuna | TPE, CMA-ES, GP | Define-by-run API; pruning | General hyperparameter tuning | Python |
| Hyperopt | TPE | Distributed optimization via Spark | Hyperparameter tuning | Python |
| BoTorch | GP (GPyTorch) | Monte Carlo acquisition; multi-objective | Research; advanced BO applications | Python (PyTorch) |
| Ax | GP (via BoTorch) | Experiment management platform | Production adaptive experimentation | Python |
| Spearmint | GP (fully Bayesian) | Marginalization over GP hyperparameters | Academic research (historical) | Python |
| SMAC | Random Forest | Handles conditional/categorical parameters | Algorithm configuration; AutoML | Python (C++ backend) |
Standard GP-based Bayesian optimization works well in low to moderate dimensions (up to roughly 10 to 20 parameters) but struggles in higher dimensions due to the difficulty of learning a good GP model from limited data. Several strategies address this challenge, including random embeddings (REMBO), additive kernel decompositions, and trust-region methods such as TuRBO (Eriksson et al., 2019), which maintains local GP models within trust regions and has demonstrated strong performance on problems with hundreds of dimensions.
Many expensive optimization problems offer cheaper, lower-fidelity approximations of the objective (for example, training a neural network for fewer epochs, or using a smaller dataset). Multi-fidelity Bayesian optimization methods, such as Fabolas (Klein et al., 2017) and multi-task GP models, exploit these cheap approximations to learn about the objective function faster. The key idea is to evaluate cheap approximations frequently and expensive, high-fidelity evaluations sparingly, guided by the surrogate model.
In many practical problems, candidate solutions must satisfy constraints (for example, a model must fit within a memory budget, or a material must meet safety requirements). Constrained Bayesian optimization extends the standard framework by modeling both the objective and the constraint functions with surrogate models, and modifying the acquisition function to account for the probability of constraint satisfaction. Recent work such as c-TPE (2023) has extended the TPE algorithm to handle inequality constraints natively within the Parzen estimator framework.
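One common construction (a sketch, assuming independent GP surrogates for the objective and a single constraint of the form c(x) <= 0) weights expected improvement by the predicted probability of feasibility:

```python
from scipy.stats import norm

def constrained_ei(mu_f, sigma_f, f_best, mu_c, sigma_c):
    # Expected improvement (minimization convention) from the objective
    # surrogate's predictive mean and standard deviation at a point...
    z = (f_best - mu_f) / sigma_f
    ei = (f_best - mu_f) * norm.cdf(z) + sigma_f * norm.pdf(z)
    # ...down-weighted by P(c(x) <= 0) under the constraint surrogate.
    prob_feasible = norm.cdf(-mu_c / sigma_c)
    return ei * prob_feasible
```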
When multiple evaluations can be run simultaneously (for example, on a compute cluster), batch Bayesian optimization selects a set of points to evaluate in parallel rather than one at a time. Batch acquisition functions such as q-Expected Improvement (qEI), q-Knowledge Gradient (qKG), and q-Upper Confidence Bound (qUCB) extend their sequential counterparts to the parallel setting. Efficient batch methods are critical for modern hyperparameter tuning workflows that leverage distributed computing.
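As a sketch of how a batch acquisition value can be estimated (assuming a scikit-learn GP surrogate and the minimization convention), qEI can be approximated by Monte Carlo: draw joint posterior samples at the batch points and average the best improvement achieved by any point in the batch.

```python
import numpy as np

def q_expected_improvement(gp, X_batch, f_best, n_samples=512, seed=0):
    # Joint posterior samples at the q batch points: shape (q, n_samples).
    samples = gp.sample_y(X_batch, n_samples=n_samples, random_state=seed)
    improvement = np.maximum(f_best - samples, 0.0)
    # Per-sample best improvement over the batch, averaged over samples.
    return improvement.max(axis=0).mean()
```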
Imagine you are trying to find the highest hill in a big, foggy field, but every step you take is very slow and tiring. Instead of walking everywhere, you bring a smart helper who keeps a map. Each time you visit a spot and check its height, the helper updates the map with a best guess of what the whole field looks like, including which areas might be tall hills and which areas are still unknown.
The helper then picks the next spot for you to visit: sometimes a place that looks promising based on the map, sometimes an unexplored area that might be hiding a tall hill. After just a few visits, the helper's map is good enough to point you to the highest hill without needing to visit every spot in the field.
That is how Bayesian optimization works: it builds a map of the problem, uses that map to make smart choices about where to look next, and finds the best answer with as few tries as possible.