# Bayesian Neural Network

> Source: https://aiwiki.ai/wiki/bayesian_neural_network
> Updated: 2026-06-22
> Categories: Deep Learning, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **Bayesian neural network** (BNN) is a [neural network](/wiki/neural_network) in which the weights and biases are represented as probability distributions rather than fixed point estimates. By placing prior distributions over parameters and using Bayes' theorem to compute posterior distributions given observed data, BNNs provide a principled framework for quantifying uncertainty in predictions. This makes them especially valuable in safety-critical applications such as medical diagnosis, autonomous driving, and scientific discovery, where knowing how confident a model is matters as much as the prediction itself. The core ideas were established in the early 1990s by David MacKay's evidence framework (1992) and Radford Neal's PhD thesis (1995), and the field saw a modern resurgence after scalable methods such as Bayes by Backprop (2015) and Monte Carlo dropout (2016) made approximate Bayesian inference practical for deep networks.

BNNs combine the flexibility and learning capabilities of artificial [neural networks](/wiki/neural_network) with the principles of [Bayesian inference](/wiki/bayesian_inference) to perform decision-making under uncertainty. In a standard neural network, each weight is a scalar value optimized through [backpropagation](/wiki/backpropagation). In a Bayesian neural network, each weight is instead represented by a probability distribution (for example, a Gaussian with a learned mean and variance). Predictions are then made by integrating over these weight distributions rather than relying on a single point estimate. The result is not just a prediction but a distribution over possible predictions, giving the model a principled way to express "how sure" it is about any given output.

## Explain like I'm 5 (ELI5)

Imagine you are guessing how many candies are in a jar. A regular neural network gives you one number, like "42 candies." A Bayesian neural network instead says, "I think it is somewhere between 38 and 46 candies, and I am most confident it is around 42." If someone shows you a jar you have never seen before (maybe it is shaped very differently), a Bayesian neural network would say, "I really am not sure about this one," which is much more honest and helpful than just guessing a single number.

Another way to picture it: imagine a smart robot that learns from its experiences. Usually the robot learns by changing some values in its brain (called weights) to make better decisions. A Bayesian neural network is like giving the robot a way to say, "I think this decision might be good, but I'm not sure. There's also a chance that another decision could be better." Instead of picking just one answer, the robot keeps track of many possible answers and how likely each one is. If the robot has seen lots of examples like your question, it will be very confident. If your question is different from anything it has seen before, it will tell you it is not so sure.

In technical terms, instead of learning one set of [weights](/wiki/weight), a BNN learns a whole range of possible weights along with how likely each one is. When it makes a prediction, it considers all those possibilities, giving you not just an answer but also a measure of confidence.

## When were Bayesian neural networks invented?

The application of Bayesian probability theory to neural networks dates back to the late 1980s and early 1990s. Several foundational contributions shaped the field.

### Early work: Buntine and Weigend (1991)

Wray Buntine and Andreas Weigend published "Bayesian Back-Propagation" in 1991, one of the earliest works to apply approximate Bayesian methods to neural network training. Their paper introduced the idea of interpreting weight decay as a form of prior probability distribution over weights and showed how Bayesian reasoning could be used for pruning insignificant weights, estimating the uncertainty of predictions, and comparing different network architectures. They formulated the conventional Bayesian view of backpropagation, starting with a likelihood distribution P(data | weights) and a prior distribution P(weights), and combining them via Bayes' theorem to obtain a posterior distribution over weights.[1]

### MacKay's evidence framework (1992)

David MacKay published two landmark papers in 1992: "Bayesian Interpolation" and "A Practical Bayesian Framework for Backpropagation Networks," the latter in the journal *Neural Computation* (volume 4, issue 3, pages 448-472). MacKay developed a complete Bayesian framework for feedforward neural networks based on the [Laplace approximation](/wiki/laplace_approximation), fitting a Gaussian to the posterior distribution around the maximum a posteriori (MAP) estimate using the Hessian of the loss function. By approximating the posterior distribution over weights with Gaussians and adopting smoothing priors, one could estimate weight uncertainties, compute output variances, and automatically set [regularization](/wiki/regularization) coefficients through the evidence framework. As MacKay put it, the Bayesian "evidence" automatically embodies "Occam's razor," penalizing overflexible and overcomplex models.[2] His framework enabled:

- Objective comparison of solutions using alternative network architectures
- Automatic control of regularization through learned hyperparameters
- Estimation of error bars on network outputs
- A measure of the effective number of well-determined parameters in a model

The evidence framework became one of the primary methods for the Bayesian treatment of neural networks through the 1990s.[2]

### Neal's Hamiltonian Monte Carlo approach (1995/1996)

Radford Neal's 1995 PhD thesis, "Bayesian Learning for Neural Networks" at the University of Toronto (later published as a Springer monograph in 1996), represented a major advance. Neal argued that the Laplace approximation used by MacKay could be too restrictive for the multimodal, complex posterior distributions that arise in neural networks. Instead, Neal proposed using Hamiltonian Monte Carlo (HMC), a Markov chain Monte Carlo ([MCMC](/wiki/markov_chain_monte_carlo)) sampling method that uses gradient information to explore the posterior distribution more efficiently than random-walk Metropolis-Hastings sampling.

Neal's HMC approach for neural networks uses the analogy of a physical system: the weight parameters are treated as the "position" of a particle, and auxiliary "momentum" variables are introduced. The Hamiltonian dynamics of this system are simulated using a leapfrog integrator, which preserves the volume and reversibility properties required for valid MCMC sampling. This allows the sampler to take large, directed steps through weight space while maintaining a high acceptance rate.
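The position/momentum picture above can be sketched for a single scalar weight. This is a minimal illustration, not Neal's implementation: the standard normal target and the names `neg_log_post`, `grad_neg_log_post`, `leapfrog`, and `hmc_step` are assumptions made for the example.

```python
import math
import random


def neg_log_post(w):
    # Stand-in potential energy U(w) = w^2 / 2, i.e. a standard normal "posterior".
    return 0.5 * w * w


def grad_neg_log_post(w):
    # Gradient of the potential energy above.
    return w


def leapfrog(w, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics for n_steps leapfrog steps."""
    p = p - 0.5 * step_size * grad_neg_log_post(w)    # initial half step (momentum)
    for i in range(n_steps):
        w = w + step_size * p                         # full step (position)
        if i < n_steps - 1:
            p = p - step_size * grad_neg_log_post(w)  # full step (momentum)
    p = p - 0.5 * step_size * grad_neg_log_post(w)    # final half step (momentum)
    return w, p


def hmc_step(w, rng, step_size=0.1, n_steps=20):
    """One HMC transition: resample momentum, integrate, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)
    w_new, p_new = leapfrog(w, p0, step_size, n_steps)
    h_old = neg_log_post(w) + 0.5 * p0 * p0
    h_new = neg_log_post(w_new) + 0.5 * p_new * p_new
    # Metropolis correction for the integrator's discretization error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return w_new
    return w
```

Because the leapfrog scheme is time-reversible and nearly conserves the Hamiltonian, the acceptance test rejects few proposals even for long trajectories. In a real BNN, `w` would be the full weight vector and the gradient would come from backpropagation.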

Neal also made the important theoretical observation that Bayesian neural networks with infinitely many hidden units converge to [Gaussian processes](/wiki/gaussian_process), establishing a deep connection between neural networks and kernel methods in the infinite-width limit. This connection between BNNs and Gaussian processes has remained a central theme in the field, and was later extended to deep networks by Lee et al. (2018) and Matthews et al. (2018).[3]

### Modern resurgence (2010s onward)

Despite this early work, BNNs saw limited practical adoption for many years due to computational costs. Interest in Bayesian neural networks surged again in the 2010s with the rise of [deep learning](/wiki/deep_model). The need for uncertainty quantification in safety-critical applications, combined with new scalable approximate inference methods, brought BNNs back into active research. Key milestones include Graves (2011) on practical variational inference for neural networks[4], Blundell et al. (2015) on Bayes by Backprop[8], Gal and Ghahramani (2016) on MC [Dropout](/wiki/dropout_regularization)[9], and Maddox et al. (2019) on SWAG[13], all of which made Bayesian deep learning more accessible to practitioners.

## Mathematical foundations

The Bayesian approach to neural networks rests on three core components: the prior, the likelihood, and the posterior. Understanding these elements is essential for grasping how BNNs differ from their deterministic counterparts.

### Prior distribution

Before observing any data, we specify a prior distribution p(w) over the network's weight parameters w. This encodes the modeler's initial beliefs about what reasonable weight values might look like. Common choices include:

- **Isotropic Gaussian prior:** p(w) = N(0, sigma^2 I), where each weight is independently drawn from a zero-mean Gaussian. Maximum a posteriori (MAP) estimation under this prior is equivalent to L2 weight decay regularization in deterministic networks, so the prior imposes a preference for smaller weights.
- **Spike-and-slab prior:** A mixture of a point mass at zero (spike) and a broad distribution (slab), which encourages sparsity in the network weights.
- **Hierarchical priors:** Priors where the hyperparameters (such as the variance of the Gaussian) are themselves given prior distributions, allowing the model to learn the appropriate level of regularization from the data.
- **Scale mixture priors:** Mixtures of Gaussians with different variances, as used in the Bayes by Backprop paper (Blundell et al., 2015), to allow both sparse and dense weight configurations.

The choice of prior has a significant impact on the behavior of the BNN, particularly when training data is limited. Well-chosen priors can encode domain knowledge, control model complexity, and improve generalization.
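The link between the isotropic Gaussian prior and weight decay can be made explicit with a short derivation. Here $\sigma^2$ is the prior variance from the first bullet above, and $\lambda$ is a symbol introduced only for this illustration:

$$
\begin{aligned}
w_{\text{MAP}} &= \arg\max_w \left[ \log p(D \mid w) + \log p(w) \right] \\
&= \arg\min_w \left[ -\log p(D \mid w) + \frac{1}{2\sigma^2} \lVert w \rVert_2^2 \right],
\end{aligned}
$$

because $\log \mathcal{N}(w; 0, \sigma^2 I) = -\lVert w \rVert_2^2 / (2\sigma^2) + \text{const}$. The penalty term is L2 weight decay with coefficient $\lambda = 1/(2\sigma^2)$: a narrower prior corresponds to stronger regularization.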

### Likelihood function

Given a dataset D = {(x_i, y_i)} and a set of weights w, the likelihood p(D|w) describes how probable the observed data is under the model parameterized by those weights. For a regression task with Gaussian noise, the likelihood takes the form:

$$p(D \mid w) = \prod_{i} \mathcal{N}\left(y_i;\ f_w(x_i),\ \sigma^2\right)$$

where f_w(x_i) is the network output for input x_i with weights w, y_i is the observed target, and sigma^2 is the observation noise variance. For classification tasks, the likelihood is typically a categorical distribution parameterized by [softmax](/wiki/softmax) outputs of the network.
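As a small sketch, the Gaussian log-likelihood for regression can be computed from network outputs as follows. The function name and argument names are illustrative; `predictions` stands for the values f_w(x_i) produced by some network.

```python
import math


def gaussian_log_likelihood(predictions, targets, noise_var):
    """Sum over data points of log N(y_i; f_w(x_i), sigma^2)."""
    n = len(targets)
    sq_err = sum((y - f) ** 2 for f, y in zip(predictions, targets))
    return -0.5 * n * math.log(2.0 * math.pi * noise_var) - sq_err / (2.0 * noise_var)
```

For a fixed noise variance, maximizing this quantity over the weights is the same as minimizing the sum of squared errors.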

### Posterior distribution

Bayes' theorem combines the prior and likelihood to yield the posterior distribution:

$$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$$

The posterior p(w|D) captures our updated beliefs about the weights after observing the data. The denominator p(D), called the marginal likelihood or model evidence, is an integral over all possible weight configurations:

$$p(D) = \int p(D \mid w)\, p(w)\, dw$$

For neural networks with thousands or millions of parameters, this integral is computationally intractable, which is the central challenge in Bayesian deep learning.
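For a model with a single weight, the posterior and the evidence can still be computed by brute force on a grid, which makes the role of p(D) as a normalizing constant concrete. The toy model y = w * x + noise, the default variances, and the name `grid_posterior` are assumptions for this sketch.

```python
import math


def grid_posterior(xs, ys, noise_var=0.25, prior_var=1.0, lo=-5.0, hi=5.0, n=2001):
    """Grid approximation of p(w | D) for the toy model y = w * x + noise."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]

    def log_joint(w):
        # log p(w) + log p(D | w) with a Gaussian prior and Gaussian noise.
        log_prior = -0.5 * math.log(2.0 * math.pi * prior_var) - w * w / (2.0 * prior_var)
        log_lik = sum(
            -0.5 * math.log(2.0 * math.pi * noise_var) - (y - w * x) ** 2 / (2.0 * noise_var)
            for x, y in zip(xs, ys)
        )
        return log_prior + log_lik

    joint = [math.exp(log_joint(w)) for w in grid]
    evidence = sum(joint) * step               # Riemann-sum estimate of p(D)
    posterior = [j / evidence for j in joint]  # normalized density on the grid
    return grid, posterior, evidence
```

A grid with n points per weight needs n^d evaluations for d weights, so this approach stops being feasible after a handful of parameters.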

### Predictive distribution

To make predictions for a new input x*, BNNs marginalize (average) over the posterior distribution of weights:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw$$

This integral accounts for all plausible weight configurations weighted by their posterior probability, naturally incorporating model uncertainty into predictions. Rather than producing a single output, the network produces a distribution over possible outputs, reflecting both the uncertainty in the weights and the inherent noise in the data. In practice, this integral is also approximated, typically using samples from the (approximate) posterior.
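The sample-based approximation mentioned above is usually written as a Monte Carlo average. The number of samples $S$ and the approximate posterior $q(w)$ are notation introduced here for illustration:

$$
p(y^* \mid x^*, D) \approx \frac{1}{S} \sum_{s=1}^{S} p\left(y^* \mid x^*, w^{(s)}\right), \qquad w^{(s)} \sim q(w) \approx p(w \mid D)
$$

Each term corresponds to one forward pass through the network with a different sampled weight vector $w^{(s)}$.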

### Intractability of exact inference

In practice, computing the posterior p(w | D) exactly is intractable for all but the simplest neural network architectures. The marginal likelihood p(D) involves an integral over a high-dimensional weight space with a highly nonlinear integrand, making analytical solutions impossible. This intractability motivates the development of approximate inference methods, which form the practical backbone of Bayesian deep learning.

## What types of uncertainty do BNNs capture?

A distinguishing feature of BNNs is their ability to decompose predictive uncertainty into two complementary types: epistemic and aleatoric. This decomposition was formalized for deep learning models by Kendall and Gal (2017) in their paper "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" (NeurIPS 2017).[10]

| Uncertainty type | Also known as | Source | Reducible? | How BNNs capture it |
|---|---|---|---|---|
| Epistemic | Model uncertainty | Limited training data, model ignorance | Yes, with more data | Variance across posterior weight samples |
| Aleatoric | Data uncertainty | Inherent noise in observations (sensor noise, label ambiguity) | No | Learned output variance (heteroscedastic models) |

### Epistemic uncertainty (model uncertainty)

Epistemic uncertainty reflects what the model does not know due to insufficient data. It arises from the fact that multiple different weight configurations could explain the observed data equally well. Key characteristics include:

- **Reducible:** Epistemic uncertainty decreases as more training data becomes available, because additional data constrains the posterior distribution over weights.
- **High in data-sparse regions:** In regions of input space far from training examples, the posterior over weights will be broad, leading to diverse predictions and high epistemic uncertainty.
- **Captured by the posterior:** In a BNN, epistemic uncertainty is directly represented by the spread of the posterior distribution p(w | D). A wide posterior indicates high epistemic uncertainty.

Epistemic uncertainty is particularly important for detecting out-of-distribution inputs and for [active learning](/wiki/active_learning), where the model should identify which new data points would be most informative to label.

### Aleatoric uncertainty (data uncertainty)

Aleatoric uncertainty captures noise that is intrinsic to the data generation process. It reflects variability that cannot be reduced by collecting more data. For example, in medical imaging, some images are inherently ambiguous regardless of model quality. There are two subtypes:

- **Homoscedastic aleatoric uncertainty:** Constant noise across all inputs. For example, a fixed sensor measurement noise.
- **Heteroscedastic aleatoric uncertainty:** Input-dependent noise, where some inputs are inherently noisier than others. For example, in depth estimation from images, regions with poor texture have higher aleatoric uncertainty.

Aleatoric uncertainty is captured by the likelihood function in the Bayesian framework. A BNN can be designed to output both a prediction and an estimate of the aleatoric uncertainty for each input by predicting the parameters of the output distribution (such as both the mean and variance of a Gaussian).

### Practical decomposition

Kendall and Gal (2017) demonstrated a practical framework for combining both types of uncertainty in deep learning models for [computer vision](/wiki/computer_vision) tasks. They showed that modeling aleatoric uncertainty can improve model performance even when data is limited, and that the combination of both uncertainty types leads to better-calibrated and more robust predictions in semantic segmentation and depth regression.

When uncertainty is measured as predictive variance, the total decomposes additively into epistemic and aleatoric components (by the law of total variance):

Total uncertainty = Epistemic uncertainty + Aleatoric uncertainty

This decomposition enables practitioners to understand whether prediction errors are due to insufficient training data (epistemic) or inherent data noise (aleatoric), guiding appropriate actions such as collecting more data or improving sensor quality.
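For regression, this split can be sketched with the law of total variance. The sketch assumes that each sampled set of weights produces both a predicted mean and a predicted noise variance for the same input; the function name and arguments are illustrative.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split predictive variance into epistemic and aleatoric parts.

    means[s] and variances[s] are the predicted mean and noise variance
    obtained from the s-th posterior weight sample for one input.
    """
    epistemic = pvariance(means)   # spread of the means across weight samples
    aleatoric = fmean(variances)   # average predicted observation noise
    return epistemic, aleatoric, epistemic + aleatoric
```

For example, `decompose_uncertainty([1.0, 3.0], [0.5, 1.5])` gives an epistemic part of 1.0, an aleatoric part of 1.0, and a total of 2.0. Classification models typically use an entropy-based decomposition instead, which is not shown here.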

## How do you train a Bayesian neural network?

Since exact Bayesian inference is intractable for neural networks, a variety of approximation techniques have been developed. Each offers different trade-offs between accuracy, computational cost, and ease of implementation.

### Variational inference and Bayes by Backprop

[Variational inference](/wiki/variational_inference) (VI) reframes posterior inference as an optimization problem. Instead of computing p(w|D) directly, VI introduces a parameterized approximate posterior q_theta(w) (often a factorized Gaussian) from a tractable family and minimizes the Kullback-Leibler (KL) divergence between q_theta(w) and the true posterior:

$$\mathrm{KL}\left(q_\theta(w) \,\|\, p(w \mid D)\right)$$

Minimizing this KL divergence is equivalent to maximizing the Evidence Lower BOund (ELBO):

$$\mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(w)}\left[\log p(D \mid w)\right] - \mathrm{KL}\left(q_\theta(w) \,\|\, p(w)\right)$$

The first term encourages the approximate posterior to explain the data well, while the KL term acts as a regularizer pulling q_theta(w) toward the prior.
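When both q_theta(w) and the prior are factorized Gaussians, the KL term has a closed form and only the expected log-likelihood needs sampling. The sketch below assumes a single Gaussian prior per weight (simpler than a scale mixture); the function names are illustrative.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for scalar Gaussians."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


def elbo_estimate(log_lik_samples, mus, sigmas):
    """Monte Carlo ELBO: mean sampled log-likelihood minus summed per-weight KL.

    log_lik_samples[s] is log p(D | w) evaluated at the s-th draw from q.
    """
    expected_log_lik = sum(log_lik_samples) / len(log_lik_samples)
    kl = sum(kl_gaussian(m, s) for m, s in zip(mus, sigmas))
    return expected_log_lik - kl
```

The KL term is zero when q equals the prior and grows as the means move away from zero or the variances shrink, which is the regularizing effect described above.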

#### Bayes by Backprop (Blundell et al., 2015)

Blundell, Cornebise, Kavukcuoglu, and Wierstra introduced "Bayes by Backprop" in their 2015 paper "Weight Uncertainty in Neural Networks," published at the International Conference on Machine Learning (ICML). The authors describe it as "a new, efficient, principled and backpropagation-compatible algorithm for learning a probability distribution on the weights of a neural network."[8] The method parameterizes each weight as a Gaussian with learnable mean and variance (a fully factorized mean-field Gaussian), and uses the reparameterization trick to obtain unbiased gradient estimates: weights are sampled as w = mu + sigma * epsilon, where epsilon is drawn from a standard normal distribution, making the sampling differentiable and compatible with standard [backpropagation](/wiki/backpropagation). This allows the variational parameters to be optimized using standard [gradient descent](/wiki/gradient_descent) methods. The algorithm minimizes the variational free energy (also called the compression cost), which naturally balances data fit against model complexity.

Bayes by Backprop uses a scale mixture of two Gaussians as the prior (a "spike-and-slab" style prior), which encourages the network to learn both sparse and dense weight configurations. The method was shown to achieve performance comparable to [dropout](/wiki/dropout_regularization) on MNIST classification while also providing meaningful uncertainty estimates. It also demonstrated how learned weight uncertainty could drive exploration in [reinforcement learning](/wiki/reinforcement_learning) tasks.[8]
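The reparameterized sampling step can be sketched for a single weight. The softplus mapping from an unconstrained parameter `rho` to a positive standard deviation is a common choice and is assumed here; the function names and the explicit local gradients are illustrative, since a real implementation would rely on automatic differentiation.

```python
import math
import random


def softplus(rho):
    # Maps an unconstrained parameter to a positive standard deviation.
    return math.log1p(math.exp(rho))


def sample_weight(mu, rho, rng):
    """Reparameterized draw w = mu + sigma * eps, plus its local gradients."""
    eps = rng.gauss(0.0, 1.0)
    sigma = softplus(rho)
    w = mu + sigma * eps
    dw_dmu = 1.0
    dw_drho = eps / (1.0 + math.exp(-rho))  # d sigma / d rho = sigmoid(rho)
    return w, dw_dmu, dw_drho
```

Because the randomness is isolated in `eps`, the sampled weight is a deterministic, differentiable function of `mu` and `rho`, so the loss gradient can flow back to both variational parameters.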

#### Mean-field variational inference

The mean-field approximation assumes that the approximate posterior fully factorizes across all weights: $q(w) = \prod_i q(w_i)$. While computationally efficient because each weight's distribution can be updated independently, this assumption ignores correlations between weights, which can lead to underestimation of posterior uncertainty. Despite this limitation, mean-field VI remains popular due to its scalability to large networks.

### Monte Carlo dropout

Gal and Ghahramani (2016) showed that training a neural network with [dropout](/wiki/dropout_regularization) can be interpreted as approximate variational inference in a deep Gaussian process, so that keeping dropout active at test time yields samples from an approximate posterior. Their paper "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" was published at ICML 2016. The authors frame the result as "casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes," which lets practitioners extract uncertainty estimates from existing models "without sacrificing either computational complexity or test accuracy."[9]

In standard practice, dropout is turned off at test time. Gal and Ghahramani showed that by keeping dropout active during test time (hence "Monte Carlo Dropout" or MC Dropout), one can obtain an approximate posterior predictive distribution by running the same input through the network multiple times, each time with a different random dropout mask. The mean of these stochastic forward passes approximates the predictive mean, while the variance captures model uncertainty.

This approach is appealing because it requires no architectural changes, no additional parameters, and no modification to the training procedure. The only cost is the additional forward passes at test time. However, the quality of the uncertainty estimates depends on the dropout rate and network architecture, and MC Dropout has been criticized for producing poorly calibrated uncertainty in some settings.[9]
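The test-time procedure can be sketched with a tiny hand-written network. The weights `W1` and `W2` are hypothetical stand-ins for a trained model, and the dropout rate of 0.5 and the function names are assumptions for the example.

```python
import random
from statistics import fmean, pvariance

# Hypothetical fixed weights of a tiny 1-4-1 ReLU network (stand-in for a trained model).
W1 = [0.5, -1.0, 1.5, 0.8]
W2 = [1.0, 0.5, -0.7, 1.2]


def forward(x, rng, drop_prob=0.5):
    """One stochastic forward pass with dropout left ON."""
    keep = 1.0 - drop_prob
    out = 0.0
    for w1, w2 in zip(W1, W2):
        h = max(0.0, w1 * x)                        # ReLU hidden unit
        mask = 1.0 if rng.random() < keep else 0.0  # Bernoulli dropout mask
        out += w2 * h * mask / keep                 # inverted-dropout scaling
    return out


def mc_dropout_predict(x, n_passes=100, seed=0):
    """Predictive mean and variance from repeated stochastic forward passes."""
    rng = random.Random(seed)
    outs = [forward(x, rng, 0.5) for _ in range(n_passes)]
    return fmean(outs), pvariance(outs)
```

The variance returned here reflects only the disagreement between dropout masks; observation noise would have to be added separately.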

### Laplace approximation

The Laplace approximation is one of the oldest and simplest approaches to approximate Bayesian inference, dating back to Pierre-Simon Laplace in the 18th century and first applied to neural networks by MacKay (1992). It fits a Gaussian distribution to the posterior centered at the maximum a posteriori (MAP) estimate. The covariance of this Gaussian is determined by the inverse of the Hessian of the negative log-posterior evaluated at the MAP point:

$$q(w) = \mathcal{N}\left(w_{\text{MAP}},\ H^{-1}\right)$$

where H is the Hessian of the negative log-posterior (the regularized training loss) evaluated at w_MAP.
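A one-dimensional sketch shows both steps: find the MAP estimate, then invert the curvature at that point. The example assumes a logistic regression with a single weight and a zero-mean Gaussian prior, with Newton's method as the optimizer; the names `sigmoid` and `laplace_1d` are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def laplace_1d(xs, ys, prior_var=1.0, n_iter=50):
    """Laplace approximation q(w) = N(w_map, 1/H) for one-weight logistic regression."""

    def grad_and_hess(w):
        # Gradient and Hessian of the negative log-posterior at w.
        ps = [sigmoid(w * x) for x in xs]
        g = w / prior_var + sum((p - y) * x for p, x, y in zip(ps, xs, ys))
        h = 1.0 / prior_var + sum(p * (1.0 - p) * x * x for p, x in zip(ps, xs))
        return g, h

    w = 0.0
    for _ in range(n_iter):
        g, h = grad_and_hess(w)
        w -= g / h                 # Newton step toward the MAP estimate
    _, h = grad_and_hess(w)        # curvature at the MAP point
    return w, 1.0 / h
```

With a single weight the Hessian is a scalar, so the inverse is trivial; with many weights H becomes a matrix whose size grows quadratically with the parameter count.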

For modern deep neural networks with millions of parameters, computing and storing the full Hessian is prohibitively large (quadratic in the number of parameters). Several scalable variants have been developed:

- **Diagonal Laplace:** Keeps only the diagonal of the Hessian, ignoring weight correlations. Fast but often too crude.
- **Kronecker-Factored Laplace (KFAC / KFLA):** Ritter, Botev, and Barber (2018) proposed a scalable Laplace approximation using Kronecker-factored approximate curvature (K-FAC) to efficiently approximate the Hessian. Their method computes two smaller curvature factor matrices per layer rather than the full Hessian, making it efficient in both computation and memory. Published at ICLR 2018, this approach requires no modification to the training procedure, allowing practitioners to add uncertainty estimates to pre-trained models post hoc.[12]
- **Laplace Redux:** Daxberger et al. (2021) presented a comprehensive framework ("Laplace Redux: Effortless Bayesian Deep Learning") at [NeurIPS](/wiki/neurips) 2021, providing a modular library for applying various Laplace approximation variants to deep learning models. They demonstrated that Laplace approximations applied to subnetworks (e.g., only the last few layers) can yield competitive uncertainty estimates with minimal overhead.[15]

A key advantage of Laplace methods is that they can be applied post-hoc to any pre-trained network without retraining.

### Stochastic Weight Averaging-Gaussian (SWAG)

Maddox, Garipov, Izmailov, Vetrov, and Wilson introduced SWAG in their 2019 paper "A Simple Baseline for Bayesian Uncertainty in Deep Learning," published at NeurIPS. SWAG builds on stochastic weight averaging (SWA) by fitting a Gaussian distribution to the trajectory of [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) iterates during training.

The core idea is that the SGD iterates, collected with a cyclical or high constant learning rate, act like approximate samples from the posterior distribution. SWAG captures:

1. The SWA solution (running average of SGD iterates) as the mean of the Gaussian.
2. A diagonal plus low-rank approximation to the covariance, derived from the deviation of SGD iterates from the mean.

At test time, multiple weight samples are drawn from this Gaussian distribution, and predictions are averaged to perform approximate Bayesian model averaging. SWAG was shown to produce well-calibrated uncertainty estimates and strong performance on out-of-distribution detection, calibration, and [transfer learning](/wiki/transfer_learning) benchmarks, often outperforming MC Dropout and other methods.[13]
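The diagonal part of this procedure can be sketched with running moments over collected iterates. The low-rank covariance term is deliberately omitted, and the function names and the list-of-lists weight representation are assumptions for the example.

```python
import math
import random


def swag_diag_moments(iterates):
    """Running first and second moments of SGD iterates (diagonal part only)."""
    dim = len(iterates[0])
    mean = [0.0] * dim
    sq_mean = [0.0] * dim
    for n, w in enumerate(iterates, start=1):
        for j in range(dim):
            mean[j] += (w[j] - mean[j]) / n              # SWA running average
            sq_mean[j] += (w[j] ** 2 - sq_mean[j]) / n   # running average of squares
    # Per-weight variance, clipped at zero to guard against rounding error.
    var = [max(s - m * m, 0.0) for s, m in zip(sq_mean, mean)]
    return mean, var


def swag_sample(mean, var, rng):
    """Draw one weight vector from the diagonal Gaussian N(mean, diag(var))."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]
```

In use, several vectors drawn with `swag_sample` would each be loaded into the network, and the resulting predictions averaged.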

### Markov Chain Monte Carlo methods

Markov Chain Monte Carlo ([MCMC](/wiki/markov_chain_monte_carlo)) methods generate samples from the posterior distribution by constructing a Markov chain whose stationary distribution is the target posterior. Unlike variational methods, MCMC methods are asymptotically exact: given enough samples and proper convergence, they can approximate the true posterior to arbitrary accuracy.

| Method | Key idea | Scalability | Key reference |
|---|---|---|---|
| Hamiltonian Monte Carlo (HMC) | Uses gradient information to propose moves along Hamiltonian dynamics, reducing random walk behavior | Limited to small/medium networks | Neal, 1995/1996 |
| No-U-Turn Sampler (NUTS) | Automatically tunes HMC trajectory length, eliminating a sensitive hyperparameter | Limited to small/medium networks | Hoffman and Gelman, 2014 |
| Stochastic Gradient Langevin Dynamics (SGLD) | Adds calibrated noise to stochastic gradient updates; transitions from optimization to sampling | Scales to large datasets via minibatches | Welling and Teh, 2011 |
| Stochastic Gradient HMC (SGHMC) | Combines HMC with minibatch gradients and a friction term to correct for gradient noise | Scales to large datasets via minibatches | Chen et al., 2014 |

**HMC** exploits the gradient of the log-posterior to propose high-probability weight configurations, making it far more efficient than random-walk Metropolis-Hastings in high dimensions. However, standard HMC requires full-batch gradient computation and careful tuning of step size and trajectory length, which limits its applicability to large datasets.

**SGLD** and **SGHMC** address this scalability limitation by using minibatch gradient estimates with injected noise. As training progresses and the learning rate is annealed, these methods transition smoothly from stochastic optimization to posterior sampling. This makes them practical for large-scale Bayesian deep learning, though convergence diagnostics can be challenging.[5][7]
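The SGLD update can be sketched in a few lines. For simplicity this version uses a constant step size and takes an exact gradient function; in actual SGLD the gradient would be a rescaled minibatch estimate and the step size would be annealed. The function and parameter names are illustrative.

```python
import math
import random


def sgld_sample(grad_log_post, w0, step_size, n_steps, rng):
    """Langevin updates: half a gradient step plus Gaussian noise of variance step_size."""
    w = w0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(step_size))
        w = w + 0.5 * step_size * grad_log_post(w) + noise
        samples.append(w)
    return samples
```

With a toy posterior N(1, 0.25), for which `grad_log_post = lambda w: -(w - 1.0) / 0.25`, the samples after a burn-in period should scatter around 1 with variance close to 0.25, up to a small bias from the finite step size.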

### Deep ensembles

While not strictly Bayesian in the formal sense, **deep ensembles** (Lakshminarayanan, Pritzel, and Blundell, 2017) have emerged as a strong practical baseline for uncertainty estimation and are frequently compared to BNNs. A deep ensemble trains M independently initialized neural networks (typically 5 to 10) on the same data and aggregates their predictions. The mean of the ensemble provides the point prediction, while the variance (disagreement among members) captures uncertainty.[11]

Deep ensembles are simple to implement, readily parallelizable, and consistently produce well-calibrated uncertainty estimates that often match or exceed those of approximate BNN methods. However, they require training and storing M separate networks, which increases computational and memory costs by a factor of M. The theoretical relationship between deep ensembles and Bayesian inference remains an active area of research, with some interpretations framing ensembles as approximating a multimodal posterior (Wilson and Izmailov, 2020).[14]

## Comparison of approximate inference methods

| Method | Type | Key idea | Computational cost | Scalability | Uncertainty quality | Post-hoc applicable? | Key limitation | Key reference |
|---|---|---|---|---|---|---|---|---|
| [Variational Inference](/wiki/variational_inference) (Bayes by Backprop) | Optimization-based | Minimize KL divergence between approximate and true posterior | High (doubles parameters) | Good | Moderate (mean-field can underestimate) | No (requires retraining) | Mean-field assumption limits expressiveness | Blundell et al., 2015 |
| MC Dropout | Approximate VI | [Dropout](/wiki/dropout_regularization) at test time as approximate Bayesian inference | Low (multiple forward passes) | Excellent | Moderate (depends on dropout rate) | Yes (any dropout-trained network) | Calibration can be poor | Gal and Ghahramani, 2016 |
| Laplace approximation (KFAC) | Curvature-based | Gaussian fit at MAP using Hessian | Low to moderate | Good (Kronecker factorization) | Moderate to good | Yes (applied post-training) | Gaussian assumption may be poor for multimodal posteriors | MacKay, 1992; Ritter et al., 2018 |
| SWAG | SGD trajectory-based | Gaussian fit to SGD iterates | Low (extends standard training) | Good | Good | Partially (needs SWA collection phase) | Assumes Gaussian around SWA solution | Maddox et al., 2019 |
| HMC / NUTS | Sampling (MCMC) | Hamiltonian dynamics for posterior sampling | Very high (full-batch gradients) | Poor for large models | Excellent (gold standard, asymptotically exact) | No (requires full resampling) | Does not scale to large networks or datasets | Neal, 1995 |
| SG-MCMC (SGLD, SGHMC) | Sampling (MCMC) | Mini-batch MCMC with noise injection | Moderate to high | Moderate | Good to high | No (requires modified training) | Convergence diagnostics challenging | Welling and Teh, 2011 |
| [Deep ensembles](/wiki/deep_ensemble) | Ensemble | Train multiple independent networks | High (M times single model) | Good (parallelizable) | Good to very good | No (requires training M models) | M-fold cost; not formally Bayesian | Lakshminarayanan et al., 2017 |

## Prior selection

Choosing appropriate prior distributions is a critical but challenging aspect of BNNs. The prior encodes assumptions about the function the network should compute before seeing any data.

**Uninformative priors** (such as broad Gaussians) impose minimal assumptions and are the most common default. They correspond roughly to L2 regularization in standard networks. While convenient, uninformative priors can lead to overconfident or poorly calibrated predictions, particularly in data-sparse regions.

**Informative priors** incorporate domain knowledge into the model. For example, one might use priors that encourage sparsity (spike-and-slab priors), smoothness, or particular functional behaviors. Fortuin (2022) provides a comprehensive review of prior choices in Bayesian deep learning.[16]

**Function-space priors** represent a more recent direction. Rather than specifying priors in weight space, where the relationship between weights and the resulting function is highly nonlinear, Sun et al. (2019) proposed specifying priors directly over the functions the network computes. Their functional variational BNN (fBNN) framework defines the ELBO directly on stochastic processes, allowing the use of Gaussian process priors or other structured function-space priors that encode properties like smoothness and periodicity.[17]

## Out-of-distribution detection

One of the most compelling applications of BNNs is detecting inputs that fall outside the training distribution. When a BNN encounters an input that is very different from its training data, the posterior predictive distribution will exhibit high epistemic uncertainty. This provides a natural mechanism for flagging out-of-distribution (OOD) inputs.

In safety-critical systems, this capability is essential. For autonomous vehicles, a perception model must recognize when it encounters a scenario it was not trained for (unusual weather, unexpected obstacles) and defer to a safer fallback strategy. In medical diagnosis, a BNN-based classifier can flag cases where its prediction is unreliable, prompting further review by a clinician.

Research has shown that BNNs and ensembles generally provide better OOD detection than standard deterministic networks, though the quality of OOD detection depends significantly on the inference method and the nature of the distribution shift.

## What are Bayesian neural networks used for?

BNNs have found adoption across a range of domains where uncertainty quantification is critical for reliable decision-making.

### Medical diagnosis and healthcare

Healthcare is one of the most natural application areas for BNNs. Medical data is often scarce, noisy, and expensive to obtain, and incorrect predictions can have severe consequences. BNNs offer several benefits in this domain:

- **Diagnostic confidence:** BNNs can flag cases where the model is uncertain, directing them to human experts for review. This is especially valuable in radiology, pathology, and other image-based diagnostics.
- **Small dataset reliability:** Conventional neural networks tend to overfit and provide overconfident predictions when trained on small medical datasets. BNNs regularize naturally through the prior and provide calibrated uncertainty estimates.
- **Clinical decision support:** Researchers have applied Bayesian convolutional neural networks to cardiac amyloidosis classification, early Alzheimer's disease detection, and predictive modeling for HbA1c levels in diabetes management.
- **Applications to screening:** BNNs have been applied to diabetic retinopathy screening and pathology image analysis, flagging uncertain cases for specialist review and improving decision-making under ambiguity.
- **Drug discovery:** BNNs help quantify the confidence of molecular property predictions, guiding experimental prioritization in pharmaceutical research.

### Safety-critical systems

In safety-critical applications, a model that "knows what it doesn't know" is far more valuable than one that is merely accurate on average.

- **[Autonomous driving](/wiki/autonomous_driving):** Self-driving systems must operate reliably in novel environments. BNN-based perception modules can quantify uncertainty in object detection and scene understanding, so that unusual road conditions or ambiguous objects trigger cautious behavior or human intervention.
- **Robotics:** Uncertainty-aware control policies allow robots to behave conservatively when facing unfamiliar situations, improving safety in human-robot interaction.
- **Aerospace and defense:** BNNs provide uncertainty estimates for anomaly detection in sensor data, structural health monitoring, and mission-critical prediction tasks.

### Active learning

[Active learning](/wiki/active_learning) is a machine learning paradigm where the model selects which data points to label next, aiming to maximize learning efficiency. BNNs are naturally suited for this because their uncertainty estimates directly indicate which unlabeled data points would be most informative.

- **Query-by-uncertainty:** The model selects the input for which its predictive uncertainty (particularly epistemic uncertainty) is highest.
- **Bayesian Active Learning by Disagreement (BALD):** Houlsby et al. (2011) proposed an information-theoretic acquisition function that maximizes the mutual information between predictions and model parameters, leveraging the BNN's posterior distribution.[6]
- **Cost reduction:** In domains like medical imaging, active learning with BNNs can significantly reduce the number of expensive expert annotations needed to train a reliable model.
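The BALD criterion listed above can be estimated from posterior samples as the entropy of the averaged prediction minus the average entropy of the per-sample predictions. The following is a minimal sketch under the assumption that class probabilities from a handful of weight samples are already available; the pool contents and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_prediction(sampled_probs):
    """Average the class probabilities over posterior weight samples."""
    n_samples = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    return [sum(p[c] for p in sampled_probs) / n_samples for c in range(n_classes)]

def bald_score(sampled_probs):
    """Monte Carlo estimate of the mutual information between label and weights."""
    expected_entropy = sum(entropy(p) for p in sampled_probs) / len(sampled_probs)
    return entropy(mean_prediction(sampled_probs)) - expected_entropy

def select_query(pool):
    """Return the id of the unlabeled candidate with the highest BALD score."""
    return max(pool, key=lambda name: bald_score(pool[name]))

# Hypothetical pool: class probabilities from three weight samples per candidate.
pool = {
    "confident": [[0.99, 0.01], [0.99, 0.01], [0.99, 0.01]],
    "noisy":     [[0.50, 0.50], [0.50, 0.50], [0.50, 0.50]],  # samples agree it is ambiguous
    "disputed":  [[0.95, 0.05], [0.05, 0.95], [0.90, 0.10]],  # samples disagree
}
print(select_query(pool))
```

In this toy pool the "noisy" candidate has the highest total predictive entropy, yet its BALD score is essentially zero because every weight sample makes the same prediction. BALD instead favors the candidate on which the samples disagree, which is the sense in which it targets epistemic rather than total uncertainty.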

### Scientific discovery

In fields such as drug design, materials science, and climate modeling, BNNs serve as surrogate models within Bayesian optimization loops, which explore high-dimensional design spaces efficiently by balancing exploration of uncertain regions against exploitation of known promising areas. In molecular property prediction and similar tasks, the confidence attached to each prediction also guides which experiments to run next.
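To make the exploration/exploitation trade-off concrete, here is a small sketch of an upper-confidence-bound (UCB) acquisition rule computed from posterior samples of a BNN surrogate. UCB is only one of several possible acquisition functions and is used here as an assumption; the candidate names, the sampled predictions, and the `kappa` parameter are illustrative.

```python
from statistics import fmean, pstdev

def ucb(samples, kappa=2.0):
    """Upper confidence bound: predictive mean plus kappa predictive standard deviations."""
    return fmean(samples) + kappa * pstdev(samples)

def next_candidate(predictions, kappa=2.0):
    """Pick the candidate design with the highest UCB score."""
    return max(predictions, key=lambda name: ucb(predictions[name], kappa))

# Hypothetical surrogate outputs: predicted objective under four weight samples.
predictions = {
    "known_good": [0.80, 0.81, 0.79, 0.80],
    "known_bad":  [0.10, 0.12, 0.09, 0.11],
    "uncertain":  [0.20, 1.10, 0.60, 0.90],
}
print(next_candidate(predictions, kappa=2.0))  # rewards uncertainty
print(next_candidate(predictions, kappa=0.0))  # pure exploitation of the mean
```

Larger values of `kappa` push the search toward regions where the surrogate is unsure, while `kappa=0` reduces to greedy selection on the predictive mean.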

### Continual learning

[Continual learning](/wiki/continual_learning) benefits from Bayesian approaches that mitigate catastrophic forgetting by using the posterior from previous tasks as the prior for new tasks. Elastic Weight Consolidation, for example, approximates this idea with a quadratic penalty weighted by the diagonal Fisher information. This provides a principled mechanism for retaining knowledge while adapting to new data.
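The posterior-as-prior idea can be written out explicitly. Assuming the data sets $D_1$ and $D_2$ of two tasks are conditionally independent given the weights $\theta$ (the symbols here are chosen for illustration), Bayes' theorem gives

$$
p(\theta \mid D_1, D_2) \;\propto\; p(D_2 \mid \theta)\, p(\theta \mid D_1),
$$

so the posterior after the first task plays the role of the prior for the second. Elastic Weight Consolidation can be read as a Gaussian (Laplace-style) approximation of $p(\theta \mid D_1)$ with a diagonal precision, which leads to a penalized loss of the form

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_2(\theta) \;+\; \frac{\lambda}{2} \sum_i F_i \,\bigl(\theta_i - \theta^{*}_{1,i}\bigr)^2,
$$

where $\mathcal{L}_2$ is the loss on the new task, $\theta^{*}_{1}$ are the weights learned on the first task, $F_i$ is the $i$-th diagonal entry of the Fisher information, and $\lambda$ sets how strongly old knowledge is protected.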

### Other applications

- **[Natural language processing](/wiki/natural_language_processing):** Uncertainty-aware text classification, machine translation confidence estimation, and selective prediction in [language models](/wiki/large_language_model).
- **Financial modeling:** BNNs quantify uncertainty in stock price predictions, credit risk assessment, and fraud detection, where overconfident models can lead to costly errors.

## Scalability challenges

Despite their theoretical appeal, BNNs face significant practical challenges when applied to modern large-scale architectures.

**Computational overhead.** Most BNN inference methods at least double the number of parameters (storing mean and variance for each weight) or require multiple forward passes, increasing both training time and inference latency.

**Memory requirements.** Storing full covariance matrices for the posterior is infeasible for networks with millions of parameters. Even factored approximations (such as Kronecker-factored methods) require substantially more memory than standard point-estimate networks.
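A back-of-the-envelope calculation shows how quickly the memory cost grows. The sketch below counts 32-bit floats for different posterior parameterizations of a small fully connected network; the layer sizes are an arbitrary example, biases are ignored, and the Kronecker-factored count assumes one input-side and one output-side factor matrix per layer.

```python
def posterior_memory_bytes(layer_shapes, bytes_per_float=4):
    """Approximate storage for several posterior families.

    layer_shapes is a list of (n_in, n_out) pairs for dense layers; biases are ignored.
    """
    n_params = sum(n_in * n_out for n_in, n_out in layer_shapes)
    # Two Kronecker factors per layer: (n_in x n_in) and (n_out x n_out).
    kron_floats = sum(n_in ** 2 + n_out ** 2 for n_in, n_out in layer_shapes)
    return {
        "point_estimate": n_params * bytes_per_float,
        "mean_field": 2 * n_params * bytes_per_float,             # mean + variance per weight
        "kronecker_factored": (n_params + kron_floats) * bytes_per_float,
        "full_covariance": (n_params + n_params ** 2) * bytes_per_float,
    }

# Example: a 784-512-10 multilayer perceptron (sizes chosen for illustration).
memory = posterior_memory_bytes([(784, 512), (512, 10)])
for name, n_bytes in memory.items():
    print(f"{name:>20}: {n_bytes / 1e6:,.1f} MB")
```

Even for this network of roughly 400,000 weights, the full covariance runs to hundreds of gigabytes, while the mean-field and Kronecker-factored variants stay within a few megabytes.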

**Scaling to large architectures.** Applying BNN methods to architectures on the scale of modern large language models (with billions of parameters) remains an open challenge. Current research explores subnetwork inference (applying Bayesian treatment only to a subset of layers), last-layer BNNs, and efficient low-rank posterior approximations as practical compromises.

**Approximation quality.** The gap between the approximate and true posterior can be substantial, particularly for mean-field variational inference, which assumes independent weights. This can lead to overconfident uncertainty estimates that partially undermine the purpose of using a Bayesian approach.
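A standard textbook example illustrates the overconfidence of mean-field approximations. When a fully factorized Gaussian is fitted to a correlated Gaussian by minimizing the reverse KL divergence (the direction used in variational inference), each factor's variance is the reciprocal of the corresponding diagonal entry of the true precision matrix. For a two-dimensional target with unit variances and correlation `rho` (a setup assumed here purely for illustration), that works out to `1 - rho**2`, which is never larger than the true marginal variance of 1.

```python
def mean_field_variance(rho):
    """Per-factor variance of the reverse-KL mean-field fit to a bivariate
    Gaussian with unit marginal variances and correlation rho."""
    # Diagonal of the precision matrix of [[1, rho], [rho, 1]].
    precision_diag = 1.0 / (1.0 - rho * rho)
    return 1.0 / precision_diag

for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho={rho:4.2f}  true marginal variance=1.00  "
          f"mean-field variance={mean_field_variance(rho):.4f}")
```

The stronger the correlation between the two weights, the more the factorized approximation shrinks the variance.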

## Connection to PAC-Bayes bounds

PAC-Bayes theory provides a formal connection between Bayesian methods and statistical learning theory. PAC-Bayes bounds give generalization guarantees for stochastic predictors (such as BNNs that sample weights from a posterior) and take the form:

Generalization error is bounded by (training error) + (complexity term involving KL(posterior || prior))

This is structurally similar to the ELBO objective used in variational BNNs, establishing a theoretical link between Bayesian training and generalization. Recent work has produced nonvacuous PAC-Bayes bounds for deep networks, suggesting that this framework can meaningfully explain why overparameterized networks generalize well. Compression-based PAC-Bayes bounds (Lotfi et al., 2022) have achieved particularly tight generalization guarantees by quantizing neural network parameters in learned subspaces.[18]
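For concreteness, one commonly quoted instance of the schematic bound is the McAllester-style bound in the form tightened by Maurer. The notation below is introduced for illustration: $P$ is a prior fixed before seeing the data, $Q$ is any posterior over weights, $n$ is the number of i.i.d. training examples, and the loss is assumed to take values in $[0, 1]$. With probability at least $1 - \delta$ over the draw of the training set, simultaneously for all $Q$,

$$
L(Q) \;\le\; \hat{L}_S(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
$$

where $L(Q)$ is the expected loss of a predictor drawn from $Q$ and $\hat{L}_S(Q)$ is its average loss on the training set. The negative ELBO has the same two ingredients, an expected data-fit term under $Q$ and a $\mathrm{KL}(Q \,\|\, P)$ penalty:

$$
-\mathrm{ELBO}(Q) \;=\; \mathbb{E}_{\theta \sim Q}\bigl[-\log p(D \mid \theta)\bigr] \;+\; \mathrm{KL}(Q \,\|\, P).
$$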

## Software and tools

Several mature libraries support BNN implementation across different deep learning frameworks.

| Library | Backend | Key features | Reference |
|---|---|---|---|
| [Pyro](https://pyro.ai/) | [PyTorch](/wiki/pytorch) | Full probabilistic programming language; supports VI, MCMC, and normalizing flows; developed by Uber AI Labs | Bingham et al., 2019 |
| [TensorFlow Probability](https://www.tensorflow.org/probability) | [TensorFlow](/wiki/tensorflow) | Probabilistic layers, distributions API, MCMC kernels, VI; maintained by Google | Dillon et al., 2017 |
| Edward / Edward2 | TensorFlow | Lightweight probabilistic programming; black-box VI; now integrated into TensorFlow Probability | Tran et al., 2017 |
| TyXe | PyTorch + Pyro | Clean separation of architecture, prior, inference, and likelihood specification for BNNs | Ritter et al., 2021 |
| Laplace (laplace-torch) | PyTorch | Post-hoc Laplace approximation with various Hessian factorizations | Daxberger et al., 2021 |
| NumPyro | JAX | Lightweight, hardware-accelerated MCMC (NUTS, HMC) and VI | Phan et al., 2019 |

## How do BNNs compare to standard neural networks?

### Advantages

Bayesian neural networks provide several advantages over traditional deterministic neural networks:

- **Uncertainty quantification.** BNNs provide uncertainty estimates that are typically better calibrated than those of deterministic networks, enabling better decision-making in high-stakes applications and risk-sensitive environments.
- **Automatic regularization.** The prior distribution acts as a principled regularizer, automatically controlling model complexity and reducing the risk of [overfitting](/wiki/overfitting), particularly on small datasets. Weight decay is a special case: maximum a posteriori (MAP) estimation under a zero-mean Gaussian prior is equivalent to L2 regularization, and the full Bayesian treatment generalizes it.
- **Model comparison.** The marginal likelihood (model evidence) provides a principled criterion for comparing different model architectures without requiring a separate validation set.
- **Data efficiency.** By encoding prior knowledge and accounting for uncertainty, BNNs can learn effectively from fewer training examples.
- **Robustness.** BNNs tend to be more robust to adversarial examples and noisy or out-of-distribution inputs compared to deterministic networks, because averaging over the posterior dampens sensitivity to any single weight configuration and assigns higher uncertainty to unusual inputs.
- **Transfer learning.** The ability to incorporate prior knowledge through informative prior distributions makes BNNs suitable for [transfer learning](/wiki/transfer_learning) and multitask learning scenarios.
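The link between the prior and weight decay noted under automatic regularization can be derived in a few lines. Assuming an isotropic Gaussian prior $p(\theta) = \mathcal{N}(0, \sigma^2 I)$ (the symbols are chosen here for illustration), the MAP estimate is

$$
\theta_{\mathrm{MAP}}
= \arg\max_{\theta}\,\bigl[\log p(D \mid \theta) + \log p(\theta)\bigr]
= \arg\min_{\theta}\,\Bigl[-\log p(D \mid \theta) + \frac{1}{2\sigma^2}\,\lVert\theta\rVert_2^2\Bigr],
$$

since $\log p(\theta) = -\lVert\theta\rVert_2^2 / (2\sigma^2) + \text{const}$. This is the usual training loss with an L2 penalty $\lambda \lVert\theta\rVert_2^2$ and $\lambda = 1/(2\sigma^2)$: a narrower prior corresponds to stronger weight decay. A BNN goes further by keeping the whole posterior rather than only its mode.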

### Limitations

These benefits come at a practical cost:

- **Computational cost.** All approximate inference methods add overhead compared to standard training and inference. Variational methods roughly double the number of parameters, MCMC methods require many forward and backward passes, and ensembles require training multiple models.
- **Scalability.** While progress has been made, scaling BNNs to very large architectures (hundreds of millions or billions of parameters, as in modern [large language models](/wiki/large_language_model)) remains challenging. Most BNN methods have been demonstrated on relatively small to medium-sized networks.
- **Approximation gaps.** All practical inference methods introduce approximation errors. Mean-field VI ignores weight correlations and can underestimate uncertainty. The Laplace approximation assumes a unimodal Gaussian posterior, which may miss important multimodal structure. MC Dropout's theoretical justification relies on assumptions that may not hold in practice.
- **Prior specification.** Choosing meaningful priors for deep networks remains difficult. Simple priors like isotropic Gaussians may not reflect meaningful beliefs about network behavior, and the predictions of BNNs can be sensitive to prior choices, especially with limited data.
- **Evaluation difficulty.** There is no single agreed-upon metric for evaluating the quality of uncertainty estimates. Metrics like calibration, negative log-likelihood, and Brier score capture different aspects of uncertainty quality, making comparison across methods challenging.
- **Implementation complexity.** BNN methods require specialized training procedures and libraries, increasing the engineering burden compared to standard deterministic networks. Tools like the `laplace-torch` library and Pyro have helped, but BNNs are still not as plug-and-play as standard deep learning frameworks.
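The three metrics named under evaluation difficulty can each be computed in a few lines, which makes it easier to see that they measure different things. The sketch below is a plain-Python illustration on made-up predictions; the function names and the equal-width, ten-bin scheme for expected calibration error (ECE) are assumptions rather than a fixed standard.

```python
import math

def negative_log_likelihood(probs, labels):
    """Average negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p[c] - (1.0 if c == y else 0.0)) ** 2 for c in range(len(p)))
    return total / len(labels)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p)
        correct = p.index(confidence) == y
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(hit for _, hit in members) / len(members)
            ece += len(members) / len(labels) * abs(avg_conf - accuracy)
    return ece

# Made-up predictive probabilities for four binary examples.
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 1, 1]
print(negative_log_likelihood(probs, labels))
print(brier_score(probs, labels))
print(expected_calibration_error(probs, labels))
```

Negative log-likelihood punishes confident mistakes most heavily, the Brier score is bounded and penalizes them less severely, and ECE ignores everything except the gap between stated confidence and observed accuracy, which is one reason rankings of methods can differ across metrics.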

## Current research directions

Bayesian deep learning remains an active and growing area of research. Several trends and open challenges define the current landscape.

### Scaling to foundation models

One of the most pressing challenges is applying Bayesian methods to the very large neural networks that dominate modern deep learning, including [transformer](/wiki/transformer)-based language models with billions of parameters. Approaches under investigation include last-layer Bayesian inference (treating only the final layer as Bayesian while keeping earlier layers deterministic), linearized Laplace methods, subspace inference, and parameter-efficient Bayesian fine-tuning.
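The last-layer idea is cheap because, for a Gaussian likelihood and a Gaussian prior, the posterior over the final linear layer has a closed form once the earlier layers are frozen. The sketch below is deliberately reduced to a single last-layer weight on a one-dimensional feature; the `features` function stands in for a frozen network body, and the prior and noise precisions are hypothetical values.

```python
import math

def features(x):
    """Stand-in for the frozen, deterministic network body (hypothetical)."""
    return math.tanh(0.5 * x)

def fit_last_layer(xs, ys, prior_precision=1.0, noise_precision=25.0):
    """Closed-form Gaussian posterior over one last-layer weight w
    for the model y = w * features(x) + Gaussian noise."""
    phis = [features(x) for x in xs]
    post_precision = prior_precision + noise_precision * sum(p * p for p in phis)
    post_mean = noise_precision * sum(p * y for p, y in zip(phis, ys)) / post_precision
    return post_mean, 1.0 / post_precision

def predict(x, post_mean, post_var, noise_precision=25.0):
    """Predictive mean and variance: noise variance plus weight uncertainty."""
    phi = features(x)
    return post_mean * phi, 1.0 / noise_precision + phi * phi * post_var

# Toy data generated with a true last-layer weight of 2.0 and no noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0 * features(x) for x in xs]
w_mean, w_var = fit_last_layer(xs, ys)
print(w_mean, w_var)
print(predict(0.0, w_mean, w_var))
print(predict(3.0, w_mean, w_var))
```

With more than one feature the same update becomes a matrix equation, and the cost scales with the width of the last layer rather than with the size of the whole network.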

### Bayesian deep learning for large language models

A 2024 position paper, "Bayesian Deep Learning is Needed in the Age of Large-Scale AI" (Papamarkou et al.), argued that Bayesian methods are increasingly important as AI systems are deployed in high-stakes settings. The authors highlighted that Bayesian approaches can provide the calibrated uncertainty estimates and principled model selection criteria needed for trustworthy large-scale AI.[19]

### Hardware acceleration

Deploying uncertainty-aware models on edge devices requires efficient implementations that fit within tight computational and memory budgets. Researchers have explored implementing BNNs on specialized hardware, including memristor-based and ferroelectric NAND devices, to bring uncertainty quantification to edge computing platforms where computational resources are severely limited.

### Connections to other methods

The theoretical connections between BNNs and other approaches continue to be explored. The relationship between infinite-width BNNs and [Gaussian processes](/wiki/gaussian_process) (established by Neal, 1995, and extended by Lee et al., 2018, and Matthews et al., 2018) provides valuable theoretical insights. The connection between deep ensembles and approximate Bayesian inference (Wilson and Izmailov, 2020) has blurred the line between Bayesian and non-Bayesian approaches to uncertainty.

### Improved priors and function-space methods

Developing more informative and structured priors for deep networks is an active research direction. Reasoning directly about function-space posteriors, rather than weight-space ones, promises priors that are more interpretable and effective. Directions being explored include function-space priors (specifying beliefs about the input-output function rather than individual weights), priors learned from related tasks, and priors based on symmetry and invariance properties of the data.

### Cold posteriors and tempering

Empirical observations that "cold" posteriors (posteriors sharpened by a temperature below 1) often outperform the untempered Bayes posterior have prompted investigation into possible causes, including model misspecification and the effects of data curation.
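Tempering is usually expressed through a temperature parameter $T$ applied to the posterior energy. In the notation assumed here,

$$
p_T(\theta \mid D) \;\propto\; \exp\!\bigl(-U(\theta)/T\bigr),
\qquad
U(\theta) = -\log p(D \mid \theta) - \log p(\theta),
$$

so $T = 1$ recovers the standard Bayes posterior, $T < 1$ gives a "cold" posterior that concentrates more mass around high-probability weights, and the limit $T \to 0$ collapses onto the MAP estimate.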

### Bayesian methods for continual and meta-learning

Using posterior distributions from previous tasks as priors for new tasks provides a natural framework for lifelong learning.

## See also

- [Neural network](/wiki/neural_network)
- [Deep learning](/wiki/deep_model)
- [Regularization](/wiki/regularization)
- [Overfitting](/wiki/overfitting)
- [Dropout](/wiki/dropout_regularization)
- [Backpropagation](/wiki/backpropagation)
- [Gradient descent](/wiki/gradient_descent)
- [Bayesian inference](/wiki/bayesian_inference)
- [Bayesian optimization](/wiki/bayesian_optimization)
- [Gaussian process](/wiki/gaussian_process)
- [Ensemble learning](/wiki/ensemble_learning)
- [Deep ensemble](/wiki/deep_ensemble)
- [Variational inference](/wiki/variational_inference)
- [Markov chain Monte Carlo](/wiki/markov_chain_monte_carlo)
- [Active learning](/wiki/active_learning)
- [Continual learning](/wiki/continual_learning)

## References

1. Buntine, W. L. and Weigend, A. S. (1991). "Bayesian Back-Propagation." *Complex Systems*, 5(6), 603-643.
2. MacKay, D. J. C. (1992). "A Practical Bayesian Framework for Backpropagation Networks." *Neural Computation*, 4(3), 448-472. https://direct.mit.edu/neco/article/4/3/448/5654/A-Practical-Bayesian-Framework-for-Backpropagation
3. Neal, R. M. (1995/1996). *Bayesian Learning for Neural Networks*. PhD thesis, University of Toronto. Published as Springer Lecture Notes in Statistics, Vol. 118.
4. Graves, A. (2011). "Practical Variational Inference for Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 24.
5. Welling, M. and Teh, Y. W. (2011). "Bayesian Learning via Stochastic Gradient Langevin Dynamics." *Proceedings of the 28th International Conference on Machine Learning (ICML)*, 681-688.
6. Houlsby, N., Huszar, F., Ghahramani, Z., and Lengyel, M. (2011). "Bayesian Active Learning for Classification and Preference Learning." *arXiv preprint arXiv:1112.5745*.
7. Chen, T., Fox, E. B., and Guestrin, C. (2014). "Stochastic Gradient Hamiltonian Monte Carlo." *Proceedings of the 31st International Conference on Machine Learning (ICML)*.
8. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). "Weight Uncertainty in Neural Networks." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, PMLR 37:1613-1622. https://arxiv.org/abs/1505.05424
9. Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." *Proceedings of the 33rd International Conference on Machine Learning (ICML)*, PMLR 48:1050-1059. https://arxiv.org/abs/1506.02142
10. Kendall, A. and Gal, Y. (2017). "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
11. Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
12. Ritter, H., Botev, A., and Barber, D. (2018). "A Scalable Laplace Approximation for Neural Networks." *Proceedings of the 6th International Conference on Learning Representations (ICLR)*.
13. Maddox, W. J., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 32.
14. Wilson, A. G. and Izmailov, P. (2020). "Bayesian Deep Learning and a Probabilistic Perspective of Generalization." *Advances in Neural Information Processing Systems (NeurIPS)*, 33.
15. Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. (2021). "Laplace Redux: Effortless Bayesian Deep Learning." *Advances in Neural Information Processing Systems (NeurIPS)*, 34.
16. Fortuin, V. (2022). "Priors in Bayesian Deep Learning: A Review." *International Statistical Review*, 90(3), 563-591.
17. Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). "Functional Variational Bayesian Neural Networks." *Proceedings of the 7th International Conference on Learning Representations (ICLR)*.
18. Lotfi, S., Finzi, M., Kapoor, S., Potapczynski, A., Goldblum, M., and Wilson, A. G. (2022). "PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization." *Advances in Neural Information Processing Systems (NeurIPS)*, 35.
19. Papamarkou, T. et al. (2024). "Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI." *Proceedings of the 41st International Conference on Machine Learning (ICML)*.

