See also: Machine learning terms
A Bayesian neural network (BNN) is a neural network in which the weights and biases are represented as probability distributions rather than single fixed values (point estimates). By placing prior distributions over parameters and using Bayes' theorem to compute posterior distributions given observed data, BNNs provide a principled framework for quantifying uncertainty in predictions. This makes them especially valuable in safety-critical applications such as medical diagnosis, autonomous driving, and scientific discovery, where knowing how confident a model is matters as much as the prediction itself.
BNNs combine the flexibility and learning capabilities of artificial neural networks with the principles of Bayesian inference to perform decision-making under uncertainty. In a standard neural network, each weight is a scalar value optimized through backpropagation. In a Bayesian neural network, each weight is instead represented by a probability distribution (for example, a Gaussian with a learned mean and variance). Predictions are then made by integrating over these weight distributions rather than relying on a single point estimate. The result is not just a prediction but a distribution over possible predictions, giving the model a principled way to express "how sure" it is about any given output.
Imagine you are guessing how many candies are in a jar. A regular neural network gives you one number, like "42 candies." A Bayesian neural network instead says, "I think it is somewhere between 38 and 46 candies, and I am most confident it is around 42." If someone shows you a jar you have never seen before (maybe it is shaped very differently), a Bayesian neural network would say, "I really am not sure about this one," which is much more honest and helpful than just guessing a single number.
Another way to picture it: imagine a smart robot that learns from its experiences. Usually the robot learns by changing some values in its brain (called weights) to make better decisions. A Bayesian neural network is like giving the robot a way to say, "I think this decision might be good, but I'm not sure. There's also a chance that another decision could be better." Instead of picking just one answer, the robot keeps track of many possible answers and how likely each one is. If the robot has seen lots of examples like your question, it will be very confident. If your question is different from anything it has seen before, it will tell you it is not so sure.
In technical terms, instead of learning one set of weights, a BNN learns a whole range of possible weights along with how likely each one is. When it makes a prediction, it considers all those possibilities, giving you not just an answer but also a measure of confidence.
The application of Bayesian methods to neural networks has roots stretching back to the late 1980s and early 1990s, when researchers began applying Bayesian probability theory to neural network models. Several foundational contributions shaped the field.
Wray Buntine and Andreas Weigend published "Bayesian Back-Propagation" in 1991, one of the earliest works to apply approximate Bayesian methods to neural network training. Their paper introduced the idea of interpreting weight decay as a form of prior probability distribution over weights and showed how Bayesian reasoning could be used for pruning insignificant weights, estimating the uncertainty of predictions, and comparing different network architectures. They formulated the conventional Bayesian view of backpropagation, starting with a likelihood distribution P(data | weights) and a prior distribution P(weights), and combining them via Bayes' theorem to obtain a posterior distribution over weights.[1]
David MacKay published two landmark papers in 1992: "Bayesian Interpolation" and "A Practical Bayesian Framework for Backpropagation Networks," both in the journal Neural Computation. MacKay developed a complete Bayesian framework for feedforward neural networks based on the Laplace approximation, fitting a Gaussian to the posterior distribution around the maximum a posteriori (MAP) estimate using the Hessian of the loss function. By approximating the posterior distribution over weights with Gaussians and adopting smoothing priors, his framework enabled practitioners to estimate weight uncertainties, compute output variances, and automatically set regularization coefficients through the evidence framework.
MacKay also introduced the concept of Bayesian "evidence" for model comparison, which automatically embodies Occam's razor by penalizing overly complex models. His evidence framework became one of the primary methods for Bayesian treatment of neural networks through the 1990s.[2]
Radford Neal's 1995 PhD thesis, "Bayesian Learning for Neural Networks" at the University of Toronto (later published as a Springer monograph in 1996), represented a major advance. Neal argued that the Laplace approximation used by MacKay could be too restrictive for the multimodal, complex posterior distributions that arise in neural networks. Instead, Neal proposed using Hamiltonian Monte Carlo (HMC), a Markov chain Monte Carlo (MCMC) sampling method that uses gradient information to explore the posterior distribution more efficiently than random-walk Metropolis-Hastings sampling.
Neal's HMC approach for neural networks uses the analogy of a physical system: the weight parameters are treated as the "position" of a particle, and auxiliary "momentum" variables are introduced. The Hamiltonian dynamics of this system are simulated using a leapfrog integrator, which preserves the volume and reversibility properties required for valid MCMC sampling. This allows the sampler to take large, directed steps through weight space while maintaining a high acceptance rate.
Neal also made the important theoretical observation that Bayesian neural networks with infinitely many hidden units converge to Gaussian processes, establishing a deep connection between neural networks and kernel methods in the infinite-width limit. This connection between BNNs and Gaussian processes has remained a central theme in the field.[3]
Despite this early work, BNNs saw limited practical adoption for many years due to computational costs. Interest in Bayesian neural networks surged again in the 2010s with the rise of deep learning. The need for uncertainty quantification in safety-critical applications, combined with new scalable approximate inference methods, brought BNNs back into active research. Key milestones include Graves (2011) on practical variational inference for neural networks[4], Blundell et al. (2015) on Bayes by Backprop[8], Gal and Ghahramani (2016) on MC Dropout[9], and Maddox et al. (2019) on SWAG[13], all of which made Bayesian deep learning more accessible to practitioners.
The Bayesian approach to neural networks rests on three core components: the prior, the likelihood, and the posterior. Understanding these elements is essential for grasping how BNNs differ from their deterministic counterparts.
Before observing any data, we specify a prior distribution p(w) over the network's weight parameters w. This encodes the modeler's initial beliefs about what reasonable weight values might look like. Common choices include:
- Independent zero-mean Gaussian priors, N(0, sigma^2 I), the usual default, which correspond roughly to L2 regularization (weight decay)
- Scale mixtures of Gaussians, as used in Bayes by Backprop
- Sparsity-inducing priors such as spike-and-slab
- Hierarchical priors, in which the prior's own parameters (such as its variance) are themselves given distributions and inferred from data
The choice of prior has a significant impact on the behavior of the BNN, particularly when training data is limited. Well-chosen priors can encode domain knowledge, control model complexity, and improve generalization.
Given a dataset D = {(x_i, y_i)} and a set of weights w, the likelihood p(D|w) describes how probable the observed data is under the model parameterized by those weights. For a regression task with Gaussian noise, the likelihood takes the form:
p(D | w) = product over all data points of N(y_i; f_w(x_i), sigma^2)
where f_w(x_i) is the network output for input x_i with weights w, y_i is the observed target, and sigma^2 is the observation noise variance. For classification tasks, the likelihood is typically a categorical distribution parameterized by softmax outputs of the network.
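To make this concrete, here is a minimal NumPy sketch of the Gaussian log-likelihood (log-probabilities are used in practice to avoid numerical underflow); `y_pred` stands in for the network outputs f_w(x_i):

```python
import numpy as np

def gaussian_log_likelihood(y, y_pred, sigma2):
    """log p(D | w) for regression with Gaussian noise:
    sum_i log N(y_i; f_w(x_i), sigma^2)."""
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - np.sum((y - y_pred) ** 2) / (2 * sigma2))
```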
Bayes' theorem combines the prior and likelihood to yield the posterior distribution:
p(w|D) = p(D|w) * p(w) / p(D)
The posterior p(w|D) captures our updated beliefs about the weights after observing the data. The denominator p(D), called the marginal likelihood or model evidence, is an integral over all possible weight configurations:
p(D) = integral of p(D|w) * p(w) dw
For neural networks with thousands or millions of parameters, this integral is computationally intractable. This intractability is the central challenge in Bayesian deep learning and motivates the development of approximate inference methods.
To make predictions for a new input x*, BNNs marginalize (average) over the posterior distribution of weights:
p(y*|x*, D) = integral of p(y*|x*, w) * p(w|D) dw
This integral accounts for all plausible weight configurations weighted by their posterior probability, naturally incorporating model uncertainty into predictions. Rather than producing a single output, the network produces a distribution over possible outputs, reflecting both the uncertainty in the weights and the inherent noise in the data. In practice, this integral is also approximated, typically using samples from the (approximate) posterior.
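A minimal sketch of this Monte Carlo approximation, assuming a PyTorch model, a list of weight samples (state dicts) drawn from some approximate posterior, and a homoscedastic noise variance `sigma2`:

```python
import torch

def predictive_mean_and_variance(model, x_star, weight_samples, sigma2=0.1):
    """Approximate p(y*|x*, D) by averaging over S posterior weight samples."""
    preds = []
    for w in weight_samples:
        model.load_state_dict(w)        # install the s-th posterior sample
        with torch.no_grad():
            preds.append(model(x_star))
    preds = torch.stack(preds)          # shape (S, ...)
    # Predictive variance = spread across samples (epistemic) + noise (aleatoric)
    return preds.mean(dim=0), preds.var(dim=0) + sigma2
```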
In practice, computing the posterior p(w | D) exactly is intractable for all but the simplest neural network architectures: the marginal likelihood p(D) involves an integral over a high-dimensional weight space with a highly nonlinear integrand, and no analytical solution exists. The approximate inference methods developed in response form the practical backbone of Bayesian deep learning.
A distinguishing feature of BNNs is their ability to decompose predictive uncertainty into two complementary types, a decomposition formalized by Kendall and Gal (2017) in "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" (NeurIPS 2017).[10]
| Uncertainty type | Also known as | Source | Reducible? | How BNNs capture it |
|---|---|---|---|---|
| Epistemic | Model uncertainty | Limited training data, model ignorance | Yes, with more data | Variance across posterior weight samples |
| Aleatoric | Data uncertainty | Inherent noise in observations (sensor noise, label ambiguity) | No | Learned output variance (heteroscedastic models) |
Epistemic uncertainty reflects what the model does not know due to insufficient data. It arises from the fact that multiple different weight configurations could explain the observed data equally well. Key characteristics include:
- It is highest in regions of input space far from the training data
- It can, in principle, be reduced by collecting more data
- In a BNN, it manifests as disagreement (variance) among predictions made with different posterior weight samples
Epistemic uncertainty is particularly important for detecting out-of-distribution inputs and for active learning, where the model should identify which new data points would be most informative to label.
Aleatoric uncertainty captures noise that is intrinsic to the data generation process. It reflects variability that cannot be reduced by collecting more data. For example, in medical imaging, some images are inherently ambiguous regardless of model quality. There are two subtypes:
- Homoscedastic uncertainty, which is constant across inputs (for example, a fixed level of sensor noise)
- Heteroscedastic uncertainty, which varies with the input (for example, some images being inherently harder to label than others)
Aleatoric uncertainty is captured by the likelihood function in the Bayesian framework. A BNN can be designed to output both a prediction and an estimate of the aleatoric uncertainty for each input by predicting the parameters of the output distribution (such as both the mean and variance of a Gaussian).
Kendall and Gal (2017) demonstrated a practical framework for combining both types of uncertainty in deep learning models for computer vision tasks. They showed that modeling aleatoric uncertainty can improve model performance even when data is limited, and that the combination of both uncertainty types leads to better-calibrated and more robust predictions in semantic segmentation and depth regression.
In practice, total predictive uncertainty is the sum of epistemic and aleatoric components:
Total uncertainty = Epistemic uncertainty + Aleatoric uncertainty
This decomposition enables practitioners to understand whether prediction errors are due to insufficient training data (epistemic) or inherent data noise (aleatoric), guiding appropriate actions such as collecting more data or improving sensor quality.
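For regression models that output both a mean and an aleatoric variance per input, this decomposition follows the law of total variance and can be computed directly from S stochastic forward passes, as in this sketch:

```python
import torch

def decompose_uncertainty(means, variances):
    """Law-of-total-variance decomposition.
    means, variances: (S, N) tensors of per-sample predicted means and
    predicted aleatoric variances for N inputs over S posterior samples."""
    aleatoric = variances.mean(dim=0)   # E_w[ Var(y | x, w) ]
    epistemic = means.var(dim=0)        # Var_w[ E(y | x, w) ]
    return epistemic + aleatoric, epistemic, aleatoric
```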
Since exact Bayesian inference is intractable for neural networks, a variety of approximation techniques have been developed. Each offers different trade-offs between accuracy, computational cost, and ease of implementation.
Variational inference (VI) reframes posterior inference as an optimization problem. Instead of computing p(w|D) directly, VI introduces a parameterized approximate posterior q_theta(w) (often a factorized Gaussian) from a tractable family and minimizes the Kullback-Leibler (KL) divergence between q_theta(w) and the true posterior:
KL(q_theta(w) || p(w | D))
Minimizing this KL divergence is equivalent to maximizing the evidence lower bound (ELBO):
ELBO = E_q[log p(D|w)] - KL(q_theta(w) || p(w))
The first term encourages the approximate posterior to explain the data well, while the KL term acts as a regularizer pulling q_theta(w) toward the prior.
Blundell, Cornebise, Kavukcuoglu, and Wierstra introduced "Bayes by Backprop" in their 2015 paper "Weight Uncertainty in Neural Networks," published at the International Conference on Machine Learning (ICML). The method parameterizes each weight as a Gaussian with learnable mean and variance (a fully factorized mean-field Gaussian), and uses the reparameterization trick to obtain unbiased gradient estimates: weights are sampled as w = mu + sigma * epsilon, where epsilon is drawn from a standard normal distribution, making the sampling differentiable and compatible with standard backpropagation. This allows the variational parameters to be optimized using standard gradient descent methods. The algorithm minimizes the variational free energy (also called the compression cost), which naturally balances data fit against model complexity.
Bayes by Backprop uses a scale mixture of two Gaussians as the prior (a "spike-and-slab" style prior), which encourages the network to learn both sparse and dense weight configurations. The method was shown to achieve performance comparable to dropout on MNIST classification while also providing meaningful uncertainty estimates. It also demonstrated how learned weight uncertainty could drive exploration in reinforcement learning tasks.[8]
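The following PyTorch sketch shows the core of such a layer. For brevity it uses a single Gaussian prior rather than the paper's scale mixture, and the variational parameter names (`w_mu`, `w_rho`) are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian linear layer in the spirit of Bayes by Backprop."""
    def __init__(self, in_features, out_features, prior_sigma=1.0):
        super().__init__()
        self.prior_sigma = prior_sigma
        # Variational posterior parameters; sigma = softplus(rho) keeps it positive
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -5.0))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        # Reparameterization trick: w = mu + sigma * epsilon keeps sampling differentiable
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        self.kl = self._kl(self.w_mu, w_sigma) + self._kl(self.b_mu, b_sigma)
        return F.linear(x, w, b)

    def _kl(self, mu, sigma):
        # Closed-form KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over weights
        p = self.prior_sigma
        return (torch.log(p / sigma) + (sigma ** 2 + mu ** 2) / (2 * p ** 2) - 0.5).sum()
```

Training then minimizes the negative ELBO, for example `loss = nll + layer.kl / num_minibatches`, with the KL term rescaled so it is counted once per epoch across minibatches.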
The mean-field approximation assumes that the approximate posterior fully factorizes across all weights: q(w) = product of q(w_i). While computationally efficient because each weight's distribution can be updated independently, this assumption ignores correlations between weights, which can lead to underestimation of posterior uncertainty. Despite this limitation, mean-field VI remains popular due to its scalability to large networks.
Gal and Ghahramani (2016) made a surprising theoretical contribution by showing that applying dropout at test time is mathematically equivalent to approximate variational inference in a deep Gaussian process. Their paper "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" was published at ICML 2016.
In standard practice, dropout is turned off at test time. Gal and Ghahramani showed that by keeping dropout active during test time (hence "Monte Carlo Dropout" or MC Dropout), one can obtain an approximate posterior predictive distribution by running the same input through the network multiple times, each time with a different random dropout mask. The mean of these stochastic forward passes approximates the predictive mean, while the variance captures model uncertainty.
This approach is appealing because it requires no architectural changes, no additional parameters, and no modification to the training procedure. The cost is only the additional forward passes at test time. However, the quality of the uncertainty estimates depends on the dropout rate and network architecture, and critics have noted that MC Dropout can sometimes produce poorly calibrated uncertainty in certain settings.[9]
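A minimal sketch of MC Dropout at test time (note that calling `model.train()` also re-enables batch-normalization updates, so production code would toggle only the dropout modules):

```python
import torch

def mc_dropout_predict(model, x, num_samples=50):
    """Average multiple stochastic forward passes with dropout left on."""
    model.train()  # keeps dropout masks active at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(num_samples)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean, model uncertainty
```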
The Laplace approximation is one of the oldest and simplest approaches to approximate Bayesian inference, dating back to Pierre-Simon Laplace in the 18th century and first applied to neural networks by MacKay (1992). It fits a Gaussian distribution to the posterior centered at the maximum a posteriori (MAP) estimate. The covariance of this Gaussian is determined by the inverse of the Hessian of the negative log-posterior evaluated at the MAP point:
q(w) = N(w_MAP, H^{-1})
where H is the Hessian matrix of the loss function evaluated at w_MAP.
For modern deep neural networks with millions of parameters, computing and storing the full Hessian is prohibitively expensive (its size is quadratic in the number of parameters). Several scalable variants have been developed:
- Diagonal approximations, which keep only per-parameter curvature
- Kronecker-factored approximate curvature (KFAC), which factorizes the Hessian blocks layer by layer (Ritter et al., 2018)
- Last-layer Laplace, which applies the Gaussian approximation only to the final layer
- Low-rank and subnetwork approximations
A key advantage of Laplace methods is that they can be applied post-hoc to any pre-trained network without retraining.
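A sketch of a post-hoc diagonal Laplace approximation, using the empirical Fisher (mean squared gradients) as a common surrogate for the Hessian diagonal; libraries such as laplace-torch implement this properly with better curvature estimates:

```python
import torch

def diagonal_laplace_variances(model, loss_fn, data_loader, prior_precision=1.0):
    """Approximate per-weight posterior variances as 1 / (curvature + prior)."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2      # empirical-Fisher accumulation
        n_batches += 1
    return [1.0 / (f / n_batches + prior_precision) for f in fisher]
```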
Maddox, Garipov, Izmailov, Vetrov, and Wilson introduced SWAG in their 2019 paper "A Simple Baseline for Bayesian Uncertainty in Deep Learning," published at NeurIPS. SWAG builds on stochastic weight averaging (SWA) by fitting a Gaussian distribution to the trajectory of stochastic gradient descent (SGD) iterates during training.
The core idea is that the SGD iterates, collected with a cyclical or high constant learning rate, act like approximate samples from the posterior distribution. SWAG captures:
- The mean of the collected SGD iterates (the stochastic weight averaging, or SWA, solution)
- A diagonal covariance estimated from the running average of the squared iterates
- A low-rank covariance term built from the deviations of recent iterates from the running mean
At test time, multiple weight samples are drawn from this Gaussian distribution, and predictions are averaged to perform approximate Bayesian model averaging. SWAG was shown to produce well-calibrated uncertainty estimates and strong performance on out-of-distribution detection, calibration, and transfer learning benchmarks, often outperforming MC Dropout and other methods.[13]
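A sketch of the SWAG-Diagonal variant (full SWAG additionally maintains the low-rank deviation term); the training helpers here are assumptions for illustration, not the published method's API:

```python
import torch

def swag_diagonal(model, optimizer, loader, loss_fn, num_snapshots=20):
    """Fit N(mean, diag(var)) to SGD iterates, one snapshot per epoch."""
    n, mean, sq_mean = 0, None, None
    for _ in range(num_snapshots):
        for x, y in loader:                     # one epoch of SGD
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        w = torch.cat([p.detach().flatten() for p in model.parameters()])
        mean = w.clone() if mean is None else (n * mean + w) / (n + 1)
        sq_mean = w ** 2 if sq_mean is None else (n * sq_mean + w ** 2) / (n + 1)
        n += 1
    var = (sq_mean - mean ** 2).clamp(min=1e-30)
    # Sample weights at test time as: mean + var.sqrt() * torch.randn_like(mean)
    return mean, var
```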
Markov Chain Monte Carlo (MCMC) methods generate samples from the posterior distribution by constructing a Markov chain whose stationary distribution is the target posterior. Unlike variational methods, MCMC methods are asymptotically exact: given enough samples and proper convergence, they can approximate the true posterior to arbitrary accuracy.
| Method | Key idea | Scalability | Key reference |
|---|---|---|---|
| Hamiltonian Monte Carlo (HMC) | Uses gradient information to propose moves along Hamiltonian dynamics, reducing random walk behavior | Limited to small/medium networks | Neal, 1995/1996 |
| No-U-Turn Sampler (NUTS) | Automatically tunes HMC trajectory length, eliminating a sensitive hyperparameter | Limited to small/medium networks | Hoffman and Gelman, 2014 |
| Stochastic Gradient Langevin Dynamics (SGLD) | Adds calibrated noise to stochastic gradient updates; transitions from optimization to sampling | Scales to large datasets via minibatches | Welling and Teh, 2011 |
| Stochastic Gradient HMC (SGHMC) | Combines HMC with minibatch gradients and a friction term to correct for gradient noise | Scales to large datasets via minibatches | Chen et al., 2014 |
HMC exploits the gradient of the log-posterior to propose high-probability weight configurations, making it far more efficient than random-walk Metropolis-Hastings in high dimensions. However, standard HMC requires full-batch gradient computation and careful tuning of step size and trajectory length, which limits its applicability to large datasets.
SGLD and SGHMC address this scalability limitation by using minibatch gradient estimates with injected noise. As training progresses and the learning rate is annealed, these methods transition smoothly from stochastic optimization to posterior sampling. This makes them practical for large-scale Bayesian deep learning, though convergence diagnostics can be challenging.[5][7]
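A sketch of a single SGLD update (Welling and Teh, 2011): a half-step of gradient descent on the negative log-posterior plus Gaussian noise whose variance equals the learning rate. It assumes `p.grad` already holds the minibatch gradient of the negative log-posterior, rescaled to the full dataset size:

```python
import torch

def sgld_step(params, lr):
    """theta <- theta - (lr/2) * grad + N(0, lr) per parameter."""
    with torch.no_grad():
        for p in params:
            noise = torch.randn_like(p) * (lr ** 0.5)
            p.add_(-0.5 * lr * p.grad + noise)
```

As the learning rate is annealed toward zero, the injected Gaussian noise dominates the minibatch gradient noise and the iterates approach samples from the posterior.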
While not strictly Bayesian in the formal sense, deep ensembles (Lakshminarayanan, Pritzel, and Blundell, 2017) have emerged as a strong practical baseline for uncertainty estimation and are frequently compared to BNNs. A deep ensemble trains M independently initialized neural networks (typically 5 to 10) on the same data and aggregates their predictions. The mean of the ensemble provides the point prediction, while the variance (disagreement among members) captures uncertainty.[11]
Deep ensembles are simple to implement, readily parallelizable, and consistently produce well-calibrated uncertainty estimates that often match or exceed those of approximate BNN methods. However, they require training and storing M separate networks, which increases computational and memory costs by a factor of M. The theoretical relationship between deep ensembles and Bayesian inference remains an active area of research, with some interpretations framing ensembles as approximating a multimodal posterior (Wilson and Izmailov, 2020).[14]
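A sketch of the ensemble recipe, where `make_model` and `train_fn` are hypothetical helpers that build and train one member:

```python
import torch

def train_deep_ensemble(make_model, train_fn, num_members=5):
    """Train M independently initialized networks on the same data."""
    return [train_fn(make_model()) for _ in range(num_members)]

def ensemble_predict(members, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in members])
    return preds.mean(dim=0), preds.var(dim=0)  # prediction, disagreement
```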
| Method | Type | Key idea | Computational cost | Scalability | Uncertainty quality | Post-hoc applicable? | Key limitation | Key reference |
|---|---|---|---|---|---|---|---|---|
| Variational Inference (Bayes by Backprop) | Optimization-based | Minimize KL divergence between approximate and true posterior | High (doubles parameters) | Good | Moderate (mean-field can underestimate) | No (requires retraining) | Mean-field assumption limits expressiveness | Blundell et al., 2015 |
| MC Dropout | Approximate VI | Dropout at test time as approximate Bayesian inference | Low (multiple forward passes) | Excellent | Moderate (depends on dropout rate) | Yes (any dropout-trained network) | Calibration can be poor | Gal and Ghahramani, 2016 |
| Laplace approximation (KFAC) | Curvature-based | Gaussian fit at MAP using Hessian | Low to moderate | Good (Kronecker factorization) | Moderate to good | Yes (applied post-training) | Gaussian assumption may be poor for multimodal posteriors | MacKay, 1992; Ritter et al., 2018 |
| SWAG | SGD trajectory-based | Gaussian fit to SGD iterates | Low (extends standard training) | Good | Good | Partially (needs SWA collection phase) | Assumes Gaussian around SWA solution | Maddox et al., 2019 |
| HMC / NUTS | Sampling (MCMC) | Hamiltonian dynamics for posterior sampling | Very high (full-batch gradients) | Poor for large models | Excellent (gold standard, asymptotically exact) | No (requires full resampling) | Does not scale to large networks or datasets | Neal, 1995 |
| SG-MCMC (SGLD, SGHMC) | Sampling (MCMC) | Mini-batch MCMC with noise injection | Moderate to high | Moderate | Good to high | No (requires modified training) | Convergence diagnostics challenging | Welling and Teh, 2011 |
| Deep ensembles | Ensemble | Train multiple independent networks | High (M times single model) | Good (parallelizable) | Good to very good | No (requires training M models) | M-fold cost; not formally Bayesian | Lakshminarayanan et al., 2017 |
Choosing appropriate prior distributions is a critical but challenging aspect of BNNs. The prior encodes assumptions about the function the network should compute before seeing any data.
Uninformative priors (such as broad Gaussians) impose minimal assumptions and are the most common default. They correspond roughly to L2 regularization in standard networks. While convenient, uninformative priors can lead to overconfident or poorly calibrated predictions, particularly in data-sparse regions.
Informative priors incorporate domain knowledge into the model. For example, one might use priors that encourage sparsity (spike-and-slab priors), smoothness, or particular functional behaviors. Fortuin (2022) provides a comprehensive review of prior choices in Bayesian deep learning.[16]
Function-space priors represent a more recent direction. Rather than specifying priors in weight space, where the relationship between weights and the resulting function is highly nonlinear, Sun et al. (2019) proposed specifying priors directly over the functions the network computes. Their functional variational BNN (fBNN) framework defines the ELBO directly on stochastic processes, allowing the use of Gaussian process priors or other structured function-space priors that encode properties like smoothness and periodicity.[17]
One of the most compelling applications of BNNs is detecting inputs that fall outside the training distribution. When a BNN encounters an input that is very different from its training data, the posterior predictive distribution will exhibit high epistemic uncertainty. This provides a natural mechanism for flagging out-of-distribution (OOD) inputs.
In safety-critical systems, this capability is essential. For autonomous vehicles, a perception model must recognize when it encounters a scenario it was not trained for (unusual weather, unexpected obstacles) and defer to a safer fallback strategy. In medical diagnosis, a BNN-based classifier can flag cases where its prediction is unreliable, prompting further review by a clinician.
Research has shown that BNNs and ensembles generally provide better OOD detection than standard deterministic networks, though the quality of OOD detection depends significantly on the inference method and the nature of the distribution shift.
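One common recipe, sketched here for a classifier, scores inputs by the entropy of the posterior-averaged softmax output and defers those above a threshold chosen on validation data:

```python
import torch

def predictive_entropy(prob_samples):
    """prob_samples: (S, N, C) class probabilities from S posterior samples."""
    mean_probs = prob_samples.mean(dim=0)                        # (N, C)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
```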
BNNs have found adoption across a range of domains where uncertainty quantification is critical for reliable decision-making.
Healthcare is one of the most natural application areas for BNNs. Medical data is often scarce, noisy, and expensive to obtain, and incorrect predictions can have severe consequences. BNNs offer several benefits in this domain:
- Calibrated confidence estimates that indicate when a prediction should be reviewed by a clinician
- Better behavior in small-data regimes, where the prior acts as a regularizer
- The ability to flag ambiguous or out-of-distribution cases rather than silently guessing
In safety-critical applications, a model that "knows what it doesn't know" is far more valuable than one that is merely accurate on average.
Active learning is a machine learning paradigm where the model selects which data points to label next, aiming to maximize learning efficiency. BNNs are naturally suited for this because their uncertainty estimates directly indicate which unlabeled data points would be most informative.
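One widely used acquisition rule of this kind is BALD (Bayesian Active Learning by Disagreement), the mutual information between the prediction and the weights, sketched here from softmax samples:

```python
import torch

def bald_scores(prob_samples):
    """H[ E_w p(y|x,w) ] - E_w H[ p(y|x,w) ] for (S, N, C) probability samples.
    High scores mark inputs where posterior samples disagree most."""
    eps = 1e-12
    mean_probs = prob_samples.mean(dim=0)
    entropy_of_mean = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    mean_entropy = -(prob_samples * (prob_samples + eps).log()).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_entropy
```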
In fields such as drug design, materials science, and climate modeling, BNNs are used within Bayesian optimization loops to efficiently explore high-dimensional design spaces, balancing exploration of uncertain regions with exploitation of known promising areas. In molecular property prediction and related tasks, quantifying the confidence of predictions guides experimental design.
Continual learning benefits from Bayesian approaches that help mitigate catastrophic forgetting by using the posterior from previous tasks as the prior for new tasks, as in approaches like Elastic Weight Consolidation. This provides a principled mechanism for retaining knowledge while adapting to new data.
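A sketch of the EWC-style quadratic penalty: a diagonal Fisher estimate from the previous task weights how strongly each parameter is anchored to its old value, effectively using the old posterior as the new prior:

```python
import torch

def ewc_penalty(model, fisher_diag, old_params, lam=1.0):
    """0.5 * lam * sum_i F_i * (theta_i - theta_old_i)^2."""
    penalty = 0.0
    for p, f, p_old in zip(model.parameters(), fisher_diag, old_params):
        penalty = penalty + (f * (p - p_old) ** 2).sum()
    return 0.5 * lam * penalty
```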
Despite their theoretical appeal, BNNs face significant practical challenges when applied to modern large-scale architectures.
Computational overhead. Most BNN inference methods at least double the number of parameters (storing mean and variance for each weight) or require multiple forward passes, increasing both training time and inference latency.
Memory requirements. Storing full covariance matrices for the posterior is infeasible for networks with millions of parameters. Even factored approximations (such as Kronecker-factored methods) require substantially more memory than standard point-estimate networks.
Scaling to large architectures. Applying BNN methods to architectures on the scale of modern large language models (with billions of parameters) remains an open challenge. Current research explores subnetwork inference (applying Bayesian treatment only to a subset of layers), last-layer BNNs, and efficient low-rank posterior approximations as practical compromises.
Approximation quality. The gap between the approximate and true posterior can be substantial, particularly for mean-field variational inference, which assumes independent weights. This can lead to overconfident uncertainty estimates that partially undermine the purpose of using a Bayesian approach.
PAC-Bayes theory provides a formal connection between Bayesian methods and statistical learning theory. PAC-Bayes bounds give generalization guarantees for stochastic predictors (such as BNNs that sample weights from a posterior) and take the form:
Generalization error is bounded by (training error) + (complexity term involving KL(posterior || prior))
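One classical instance of such a bound (a McAllester-style bound, stated here as a sketch for a bounded loss, n training points, and confidence 1 - delta) is:

```latex
\mathbb{E}_{w \sim q}\!\left[L(w)\right] \;\le\;
\mathbb{E}_{w \sim q}\!\left[\hat{L}(w)\right]
+ \sqrt{\frac{\mathrm{KL}(q \,\|\, p) + \ln\!\left(2\sqrt{n}/\delta\right)}{2n}}
```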
This is structurally similar to the ELBO objective used in variational BNNs, establishing a theoretical link between Bayesian training and generalization. Recent work has produced nonvacuous PAC-Bayes bounds for deep networks, suggesting that this framework can meaningfully explain why overparameterized networks generalize well. Compression-based PAC-Bayes bounds (Zhou et al., 2022) have achieved particularly tight generalization guarantees by quantizing neural network parameters in learned subspaces.[18]
Several mature libraries support BNN implementation across different deep learning frameworks.
| Library | Backend | Key features | Reference |
|---|---|---|---|
| Pyro | PyTorch | Full probabilistic programming language; supports VI, MCMC, and normalizing flows; developed by Uber AI Labs | Bingham et al., 2019 |
| TensorFlow Probability | TensorFlow | Probabilistic layers, distributions API, MCMC kernels, VI; maintained by Google | Dillon et al., 2017 |
| Edward / Edward2 | TensorFlow | Lightweight probabilistic programming; black-box VI; now integrated into TensorFlow Probability | Tran et al., 2017 |
| TyXe | PyTorch + Pyro | Clean separation of architecture, prior, inference, and likelihood specification for BNNs | Ritter et al., 2021 |
| Laplace (laplace-torch) | PyTorch | Post-hoc Laplace approximation with various Hessian factorizations | Daxberger et al., 2021 |
| NumPyro | JAX | Lightweight, hardware-accelerated MCMC (NUTS, HMC) and VI | Phan et al., 2019 |
Bayesian neural networks provide several advantages over traditional deterministic neural networks:
- Principled uncertainty quantification, with a decomposition into epistemic and aleatoric components
- Natural detection of out-of-distribution inputs through high epistemic uncertainty
- Built-in regularization through marginalization over weights, reducing overfitting
- Suitability for active learning, Bayesian optimization, and continual learning
- Principled model comparison via the marginal likelihood (evidence)
Despite their theoretical appeal, BNNs face several practical challenges:
- Computational and memory overhead relative to deterministic networks
- Sensitivity of results to the choice of prior and approximate inference method
- Approximate posteriors that can be poorly calibrated, partially undermining the motivation for going Bayesian
- Tooling maturity: the laplace-torch library and Pyro have helped, but BNNs are still not as plug-and-play as standard deep learning frameworks

Bayesian deep learning remains an active and growing area of research. Several trends and open challenges define the current landscape.
One of the most pressing challenges is applying Bayesian methods to the very large neural networks that dominate modern deep learning, including transformer-based language models with billions of parameters. Researchers are exploring efficient ways to apply Bayesian methods via last-layer Bayesian approaches (applying Bayesian inference only to the final layer while keeping earlier layers deterministic), linearized Laplace methods, subspace inference methods, and parameter-efficient Bayesian fine-tuning.
A 2024 position paper, "Bayesian Deep Learning is Needed in the Age of Large-Scale AI" (Papamarkou et al.), argued that Bayesian methods are increasingly important as AI systems are deployed in high-stakes settings. The authors highlighted that Bayesian approaches can provide the calibrated uncertainty estimates and principled model selection criteria needed for trustworthy large-scale AI.[19]
Deploying uncertainty-aware models on edge devices requires efficient implementations that fit within tight computational and memory budgets. Researchers have explored implementing BNNs on specialized hardware, including memristor-based and ferroelectric NAND devices, to bring uncertainty quantification to edge computing platforms where computational resources are severely limited.
The theoretical connections between BNNs and other approaches continue to be explored. The relationship between infinite-width BNNs and Gaussian processes (established by Neal, 1995, and extended by Lee et al., 2018, and Matthews et al., 2018) provides valuable theoretical insights. The connection between deep ensembles and approximate Bayesian inference (Wilson and Izmailov, 2020) has blurred the line between Bayesian and non-Bayesian approaches to uncertainty.
Developing more informative and structured priors for deep networks is an active research direction. Moving beyond weight-space inference to directly reason about function-space posteriors promises more interpretable and effective priors. Function-space priors (specifying beliefs about the input-output function rather than individual weights), learned priors from related tasks, and priors based on symmetry and invariance properties of the data are all being explored.
Empirical observations that "cold" posteriors (sharpened versions of the standard posterior) often outperform the theoretically correct posterior have prompted investigation into model misspecification and data curation effects.
Using posterior distributions from previous tasks as priors for new tasks provides a natural framework for lifelong learning.