# Prior belief

> Source: https://aiwiki.ai/wiki/prior_belief
> Updated: 2026-07-12
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Bayes' theorem](/wiki/bayes_theorem), [Bayesian inference](/wiki/bayesian_inference), [Posterior](/wiki/posterior), [Likelihood](/wiki/likelihood)*

## What is a prior belief?

A prior belief, also called the prior distribution or simply the prior, is the probability distribution that expresses what is known or assumed about an unknown parameter before any data is observed. It is the starting point of [Bayesian inference](/wiki/bayesian_inference): combined with the [likelihood](/wiki/likelihood) of the observed data, the prior is one of the two inputs that [Bayes' theorem](/wiki/bayes_theorem) needs to produce a [posterior](/wiki/posterior) distribution, which represents updated beliefs after the data is taken into account.

Formally, if the parameter of interest is denoted theta and the data is D, Bayes' theorem states:

```
P(theta | D) = P(D | theta) * P(theta) / P(D)
```

where P(theta) is the prior, P(D | theta) is the likelihood, P(D) is the marginal likelihood (also called the evidence), and P(theta | D) is the posterior. Because P(D) does not depend on theta, the relationship is often written more compactly as:

```
posterior is proportional to likelihood times prior
```

The prior is one of the most distinctive and most debated parts of Bayesian methods. Frequentist statistics treats unknown parameters as fixed quantities to be estimated from data, while the Bayesian framework treats them as random variables that have probability distributions even before any data is collected [1][2]. Prior beliefs in [machine learning](/wiki/machine_learning) appear in many forms, from the explicit Gaussian priors over weights in a [Bayesian neural network](/wiki/bayesian_neural_network) to the implicit priors that hide inside common forms of [regularization](/wiki/regularization).

## ELI5: Explain like I'm 5

Imagine you are about to flip a coin that you have never seen before. Before flipping it, you probably already think it is roughly fair, with about a 50/50 chance of heads. That guess, made before any flips happen, is your prior belief.

Then you flip the coin ten times and get nine heads. Your belief should change. Maybe the coin is biased. After updating your belief based on what you saw, you have a posterior belief, which is your prior belief plus what the data taught you.

The key idea is that you almost never start from total ignorance. You usually have some background expectation, even if it is vague. The prior is just a way of writing that expectation down as math so a model can use it.

## Historical background

### Bayes' essay (1763)

The modern theory of priors traces back to the Reverend Thomas Bayes, an English Presbyterian minister and amateur mathematician. After Bayes died in 1761, his friend Richard Price found a manuscript among his papers titled "An Essay Towards Solving a Problem in the Doctrine of Chances." Price edited the manuscript and communicated it to the Royal Society through a letter to the mathematician John Canton; the essay was read on 23 December 1763 and published in the Philosophical Transactions that year, two years after Bayes' death [3].

The essay considered the following problem: given the observed number of successes and failures in a sequence of trials, what can be said about the underlying probability of success on each trial? Bayes' answer required positing an initial distribution over the unknown probability before any trials were observed. He used a uniform distribution on the unit interval, which is one of the earliest explicit uses of a prior distribution [3].

### Laplace's independent rediscovery (1774)

Working independently in France, Pierre-Simon Laplace developed essentially the same ideas in his 1774 "Mémoire sur la probabilité des causes par les événements" and used them on real problems in astronomy, demography, and the analysis of court verdicts. Laplace popularized what became known as the principle of insufficient reason: in the absence of information distinguishing the possible outcomes, equal probability should be assigned to each. This is the philosophical foundation of the uniform prior [1][2]. The economist John Maynard Keynes later renamed this idea the principle of indifference in his 1921 book A Treatise on Probability, while devoting a chapter to its weaknesses, since assigning equal probability to outcomes that can be subdivided in more than one way can lead to contradictions [16].

### The Bayesian revival (1950s onwards)

For much of the first half of the twentieth century, the dominant approach to statistics was the frequentist framework developed by Ronald Fisher, Jerzy Neyman, and Egon Pearson. Priors were viewed with suspicion because they introduce a subjective element into the analysis. Bayesian methods saw a revival starting in the 1950s, driven by the work of Leonard Savage, Bruno de Finetti, Harold Jeffreys, and Dennis Lindley, who developed both the philosophical foundations and the practical machinery of subjective and objective Bayes [1].

The explosion of computational power from the 1990s onwards made [Markov chain Monte Carlo](/wiki/mcmc) (MCMC) practical for routine use, removing many of the obstacles that had previously limited Bayesian analysis to toy problems and analytically tractable conjugate models [4].

## Mathematical formulation

### Prior, likelihood, and posterior

Let theta denote a parameter (or vector of parameters) and D denote observed data. The Bayesian recipe consists of three pieces:

- The prior P(theta), which encodes belief about theta before data is seen.
- The likelihood P(D | theta), which describes how probable the data is for each candidate value of theta.
- The posterior P(theta | D), which combines the two and represents updated belief about theta.

The relationship between these is given by Bayes' theorem:

```
P(theta | D) = P(D | theta) * P(theta) / P(D)
```

where P(D) = integral P(D | theta) * P(theta) d theta is the marginal likelihood. In practice the marginal likelihood is often hard to evaluate, but for many tasks (including sampling and finding posterior modes) only the unnormalized posterior P(D | theta) * P(theta) is needed [2].
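
As a minimal sketch of why the unnormalized posterior is often enough, the snippet below evaluates likelihood times prior on a grid for a coin that shows 9 heads in 10 flips, then normalizes numerically. The grid size, the flat prior, and the function names are illustrative assumptions, not part of any particular library.

```python
def grid_posterior(heads, flips, prior, n_grid=1001):
    """Approximate P(theta | D) on an evenly spaced grid over [0, 1]."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    # Unnormalized posterior: Bernoulli likelihood times prior at each grid point.
    unnorm = [
        (t ** heads) * ((1 - t) ** (flips - heads)) * prior(t)
        for t in thetas
    ]
    total = sum(unnorm)  # plays the role of the evidence P(D), up to grid spacing
    weights = [u / total for u in unnorm]
    return thetas, weights


def flat_prior(theta):
    """Uniform prior on [0, 1] (an assumed choice for this sketch)."""
    return 1.0


thetas, weights = grid_posterior(heads=9, flips=10, prior=flat_prior)
posterior_mean = sum(t * w for t, w in zip(thetas, weights))
posterior_mode = thetas[weights.index(max(weights))]
print(f"posterior mean ~ {posterior_mean:.3f}, mode ~ {posterior_mode:.3f}")
```

With a flat prior the exact posterior is Beta(10, 2), whose mean is 10/12 and whose mode is 0.9, so the grid values can be checked against those numbers.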

### Sequential updating

A convenient property of Bayes' rule is that yesterday's posterior can serve as today's prior. If data arrives in two batches D_1 and D_2 that are conditionally independent given theta, then:

```
P(theta | D_1, D_2) is proportional to P(D_2 | theta) * P(theta | D_1)
```

This makes the prior a flexible accounting tool. It can hold the conclusions of earlier studies, the output of an analytical approximation, or pure subjective belief, and it can be refined as new evidence arrives [1].
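
The sequential property can be checked numerically. The sketch below assumes a discrete set of three candidate coin biases and made-up head and tail counts; it updates on two batches in turn and compares the result with a single update on the pooled data.

```python
def bayes_update(prior, likelihoods):
    """Return the normalized posterior over a discrete set of hypotheses."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(unnorm)
    return [u / evidence for u in unnorm]


def bernoulli_likelihood(theta, heads, tails):
    """Probability of a specific sequence with the given head and tail counts."""
    return theta ** heads * (1 - theta) ** tails


thetas = [0.25, 0.5, 0.75]        # hypothetical candidate coin biases
prior = [1 / 3, 1 / 3, 1 / 3]

# Batch D_1: 3 heads, 1 tail. Batch D_2: 4 heads, 2 tails.
post_1 = bayes_update(prior, [bernoulli_likelihood(t, 3, 1) for t in thetas])
post_12 = bayes_update(post_1, [bernoulli_likelihood(t, 4, 2) for t in thetas])

# The same data processed in a single step: 7 heads, 3 tails.
post_all = bayes_update(prior, [bernoulli_likelihood(t, 7, 3) for t in thetas])

print(post_12)
print(post_all)
```

The two printed posteriors agree up to floating-point rounding, because the batch likelihoods multiply under conditional independence.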

## What are the main types of priors?

The choice of prior is a modeling decision, and different priors encode different assumptions. The literature classifies priors along several axes: how much information they carry (informative, weakly informative, uninformative), whether they follow an objective construction rule (Jeffreys, reference, maximum entropy), whether they are mathematically convenient (conjugate), whether they are proper or improper, and whether their hyperparameters are themselves given hyperpriors (hierarchical) or estimated from the data (empirical).

| Type | Description | When to use | Examples |
|---|---|---|---|
| Informative | Reflects substantive knowledge from prior studies, expert elicitation, or theory | Good external information available; small sample size | Normal(70, 5) for adult human height in inches |
| Weakly informative | Vague but proper, used to gently regularize and rule out absurd values | Default for many applied problems; helps stabilize estimation | Normal(0, 10) on a logistic regression coefficient |
| Uninformative (flat) | Tries to express ignorance, often a uniform distribution over the parameter space | When no external information is available and approximate frequentist behavior is desired | Uniform(0, 1) on a probability |
| [Jeffreys prior](/wiki/jeffreys_prior) | Proportional to the square root of the determinant of the Fisher information; invariant under reparameterization | When a parameterization-independent objective prior is wanted | Beta(1/2, 1/2) for a binomial proportion |
| Reference prior | Bernardo's construction that maximizes the expected Kullback-Leibler divergence between prior and posterior | When several parameters need to be ranked by importance for objective inference | Reduces to Jeffreys in one dimension |
| Maximum entropy | Among all distributions meeting stated constraints, the one with the largest Shannon entropy, adding no information beyond those constraints | When prior knowledge takes the form of constraints such as a known mean, variance, or range | Uniform with no constraint; Gaussian with fixed mean and variance |
| [Conjugate prior](/wiki/conjugate_prior) | Belongs to the same family as the resulting posterior under a specific likelihood | When closed-form posterior updates are wanted, e.g., for streaming data | Beta prior with binomial likelihood |
| Hierarchical | Has its own hyperprior on the parameters of the prior, allowing the data to inform the prior structure | Multilevel data, partial pooling across groups | Eight-schools model in Gelman et al. |
| Empirical | Hyperparameters of the prior are estimated from the same data, an approach called empirical Bayes | When no fully Bayesian solution is feasible and a quick fix is needed | James-Stein style shrinkage |
| Improper | Does not integrate to a finite value but can still yield a proper posterior in many models | When mimicking the absence of prior information; must be checked for posterior propriety | Uniform on the real line for a location parameter |

### Informative priors

Informative priors carry real information about the parameter. They might come from a previous study, an expert's knowledge of the problem domain, or a theoretical model. For example, a clinical trial of a new drug might use a prior on the treatment effect built from a meta-analysis of previous trials of similar drugs. When the prior information is reliable, an informative prior can substantially improve estimates, especially with small sample sizes [5].

### Weakly informative priors

A weakly informative prior is meant to be vague enough to let the data dominate, while still ruling out absurd values. Andrew Gelman has been a strong advocate of weakly informative defaults. For logistic regression, Gelman et al. (2008) recommended a Cauchy distribution with center 0 and scale 2.5 on the coefficients after scaling the inputs, a choice made partly because it always returns a finite estimate, even under the complete separation that defeats ordinary maximum likelihood [17]. Stan-based tools such as rstanarm later adopted a normal(0, 2.5) prior autoscaled to the data in the same spirit [18]. The idea is to provide enough regularization to keep estimation well-posed in such tricky cases without injecting strong opinions about the parameter [5].

### Uninformative and Jeffreys priors

Non-informative priors aim to express ignorance, but defining ignorance turns out to be subtle. A uniform prior on theta is not uniform on a transformation of theta, so what looks ignorant in one parameterization can look highly informative in another. The [Jeffreys prior](/wiki/jeffreys_prior) addresses this issue. Harold Jeffreys proposed in his 1946 work that the prior should be proportional to the square root of the determinant of the Fisher information matrix:

```
p(theta) is proportional to sqrt(det(I(theta)))
```

where I(theta) is the Fisher information. The Jeffreys prior is invariant under reparameterization, which means that if you transform theta to a new variable phi, applying the Jeffreys rule in the phi parameterization gives the same prior as transforming the Jeffreys prior from the theta parameterization. For a binomial proportion p, the Jeffreys prior is Beta(1/2, 1/2), which puts more mass near zero and one than the uniform prior [6].
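
For the binomial case the claim can be verified directly. The sketch below computes the Fisher information of a single Bernoulli trial as the expected squared score and compares its square root with the Beta(1/2, 1/2) density; the function names are illustrative.

```python
import math


def fisher_information_bernoulli(p):
    """Expected squared score for one Bernoulli(p) trial."""
    score_if_1 = 1 / p           # d/dp log p, when the outcome is 1
    score_if_0 = -1 / (1 - p)    # d/dp log(1 - p), when the outcome is 0
    return p * score_if_1 ** 2 + (1 - p) * score_if_0 ** 2


def jeffreys_unnormalized(p):
    """Jeffreys rule: square root of the Fisher information."""
    return math.sqrt(fisher_information_bernoulli(p))


def beta_half_half_pdf(p):
    """Beta(1/2, 1/2) density; its normalizing constant B(1/2, 1/2) equals pi."""
    return 1 / (math.pi * math.sqrt(p * (1 - p)))


ratios = []
for p in (0.05, 0.5, 0.95):
    ratio = jeffreys_unnormalized(p) / beta_half_half_pdf(p)
    ratios.append(ratio)
    print(f"p={p:.2f}  density={beta_half_half_pdf(p):.3f}  ratio={ratio:.6f}")
```

The ratio is the same constant (pi) at every p, which is what "proportional to" means here, and the density is larger near 0.05 and 0.95 than at 0.5.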

### Reference priors

José-Miguel Bernardo introduced reference priors in 1979 by proposing that the prior should maximize the expected Kullback-Leibler divergence between the posterior and the prior. Intuitively, this picks the prior that lets the data have the greatest influence on the posterior, that is, the prior that is least informative in an information-theoretic sense. In one-parameter problems the reference prior coincides with the Jeffreys prior, but the two can differ in multi-parameter problems, where Bernardo and Jim Berger developed a stepwise procedure that orders parameters by their importance [7].

### Maximum entropy priors

The maximum entropy principle, developed by the physicist E. T. Jaynes in 1957, selects the prior with the greatest Shannon entropy among all distributions consistent with what is known [19]. Jaynes argued that this is the least biased choice because it stays maximally noncommittal about everything the stated constraints do not pin down. With no constraint other than a bounded range, the maximum entropy prior is the uniform distribution; with a fixed mean and variance on the real line it is the Gaussian; with a fixed mean on the positive half-line it is the exponential. Maximum entropy, Jeffreys, and reference priors are three of the main routes to an objective prior, and they do not always agree [19].
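
The Gaussian case can be illustrated with closed-form differential entropies. The sketch below compares three distributions that share the same variance; the choice of comparison distributions (Laplace and uniform) is an assumption made for illustration.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (nats) of a Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def laplace_entropy(variance):
    """Differential entropy of a Laplace distribution with the given variance."""
    scale = math.sqrt(variance / 2)    # Laplace variance is 2 * scale^2
    return 1 + math.log(2 * scale)


def uniform_entropy(variance):
    """Differential entropy of a uniform distribution with the given variance."""
    width = math.sqrt(12 * variance)   # uniform variance is width^2 / 12
    return math.log(width)


variance = 1.0
entropies = {
    "gaussian": gaussian_entropy(variance),
    "laplace": laplace_entropy(variance),
    "uniform": uniform_entropy(variance),
}
for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {h:.4f} nats")
```

Among these three, the Gaussian has the largest entropy at a fixed variance, consistent with its role as the maximum entropy distribution under a mean and variance constraint.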

### Conjugate priors

A prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior. Conjugate priors give analytic posterior updates, which is convenient computationally and pedagogically. The classical conjugate pairs come from the exponential family of distributions [8].

| Likelihood | Parameter | Conjugate prior | Posterior |
|---|---|---|---|
| Bernoulli / Binomial(n, p) | Probability p | Beta(alpha, beta) | Beta(alpha + successes, beta + failures) |
| Multinomial(n, p_1, ..., p_k) | Probability vector | Dirichlet(alpha_1, ..., alpha_k) | Dirichlet(alpha_i + counts_i) |
| Poisson(lambda) | Rate lambda | Gamma(alpha, beta) | Gamma(alpha + sum of counts, beta + n) |
| Normal(mu, sigma^2) with known variance | Mean mu | Normal(mu_0, sigma_0^2) | Normal with precision-weighted mean |
| Normal(mu, sigma^2) with known mean | Variance sigma^2 | Inverse-Gamma(alpha, beta) | Inverse-Gamma(alpha + n/2, beta + sum of squared deviations / 2) |
| Normal(mu, sigma^2) with both unknown | (mu, sigma^2) | Normal-Inverse-Gamma | Normal-Inverse-Gamma with updated parameters |
| Exponential(lambda) | Rate lambda | Gamma(alpha, beta) | Gamma(alpha + n, beta + sum of observations) |
| Geometric(p) | Probability p | Beta(alpha, beta) | Beta(alpha + n, beta + sum of failures) |
| Uniform(0, theta) | Upper bound theta | Pareto(x_m, k) | Pareto(max(x_m, max observation), k + n) |

The beta-binomial pair is the textbook starter example. If the prior on the success probability p is Beta(alpha, beta) and you observe x successes in n trials, the posterior is Beta(alpha + x, beta + n - x). The hyperparameters alpha and beta have an intuitive reading as pseudo-counts of prior successes and failures [8].
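
A minimal sketch of the beta-binomial update, using an assumed Beta(5, 5) prior and 9 successes in 10 trials, shows the pseudo-count reading in action:

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


alpha0, beta0 = 5.0, 5.0   # assumed prior: 5 pseudo-successes, 5 pseudo-failures
alpha_n, beta_n = beta_binomial_update(alpha0, beta0, successes=9, trials=10)

print("posterior parameters:", (alpha_n, beta_n))
print("prior mean:", beta_mean(alpha0, beta0))
print("posterior mean:", beta_mean(alpha_n, beta_n))
```

The posterior is Beta(14, 6). Its mean, 14/20 = 0.7, sits halfway between the prior mean 0.5 and the observed frequency 0.9 because the prior carries 10 pseudo-observations and the data carries 10 real ones.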

### Hierarchical and empirical priors

In a hierarchical model, the prior over the parameter of interest itself depends on hyperparameters that are also given a prior. A canonical example is the eight-schools model in Gelman et al.'s *Bayesian Data Analysis*, where the treatment effects in eight schools are given a Normal(mu, tau^2) prior, and mu and tau are themselves given priors. This structure allows partial pooling: information from one school's observations influences estimates for the other schools through the shared hyperparameters [5].

Empirical Bayes shortcuts the hierarchical approach by estimating hyperparameters from the data using maximum marginal likelihood and then plugging those point estimates into the prior. This sidesteps a fully Bayesian analysis but can give a useful approximation. The James-Stein estimator is a famous early example whose shrinkage behavior can be derived from an empirical Bayes argument [9].
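
The shrinkage effect can be seen in a small simulation. The sketch below assumes the plain (not positive-part) James-Stein estimator that shrinks toward the origin, unit noise variance, and a made-up ten-dimensional true mean; it compares average squared error against the unshrunk estimate.

```python
import random


def james_stein(x):
    """Shrink a d-dimensional observation toward the origin (unit noise variance assumed)."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1 - (d - 2) / norm_sq
    return [shrink * v for v in x]


def squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


rng = random.Random(0)         # fixed seed so the run is repeatable
true_means = [1.0] * 10        # hypothetical true parameter vector, d = 10
n_reps = 2000

mle_risk = 0.0
js_risk = 0.0
for _ in range(n_reps):
    # One noisy observation per component; the MLE is the observation itself.
    x = [rng.gauss(m, 1.0) for m in true_means]
    mle_risk += squared_error(x, true_means) / n_reps
    js_risk += squared_error(james_stein(x), true_means) / n_reps

print(f"average squared error, MLE:         {mle_risk:.2f}")
print(f"average squared error, James-Stein: {js_risk:.2f}")
```

The shrinkage factor depends on the observed norm of the data, which is the empirical Bayes step: the amount of pooling is estimated from the same observations that are being shrunk.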

## How do you choose a prior?

In practice priors are chosen based on a combination of substantive knowledge, mathematical convenience, computational tractability, and robustness considerations. As Gelman, Simpson, and Betancourt argued in a 2017 paper whose title states the point, "the prior can often only be understood in the context of the likelihood": the same distribution can act as informative or as vague depending on the model and data it is paired with [5].

- **Substantive knowledge.** When information from previous studies, physical constraints, or expert judgment is available, that information should be encoded in the prior. Elicitation procedures have been developed to translate qualitative expert knowledge into quantitative priors [5].
- **Mathematical convenience.** Conjugate priors give closed-form posteriors. With modern MCMC and variational methods this is less critical than it used to be, but conjugate components are still useful inside larger models and for analytical insight [8].
- **Computational tractability.** Some priors play more nicely with samplers than others. Priors with hard boundaries or heavy tails can cause Hamiltonian Monte Carlo to produce divergent transitions. Reparameterization (such as the non-centered parameterization for hierarchical models) is often used to make sampling efficient [4].
- **Robustness.** When the prior cannot be pinned down precisely, sensitivity analysis is used to check how much the conclusions change under different reasonable priors. If the conclusions are stable across a range of priors, the analysis is said to be robust to the choice of prior. If they change dramatically, the prior is doing more work than the data, which is a signal that more information is needed before strong conclusions can be drawn [5].

## How do priors connect to machine learning?

Priors appear throughout machine learning, sometimes stated explicitly and sometimes hidden inside familiar techniques. Regularization penalties, maximum a posteriori estimation, Bayesian neural networks, Bayesian optimization, and probabilistic programming can all be read as placing a prior on the model.

### Regularization as a prior

Many of the standard regularizers used in machine learning have a Bayesian interpretation: they are equivalent to maximum a posteriori ([MAP estimation](/wiki/maximum_a_posteriori)) under a particular prior. Two well known examples are [ridge regression](/wiki/ridge_regression) and the lasso.

Ridge regression solves:

```
minimize ||y - X beta||^2 + lambda ||beta||^2
```

which is the MAP estimate for beta when the likelihood is Gaussian with noise variance sigma^2 and the prior on each coefficient is independent N(0, sigma^2 / lambda). The L2 penalty corresponds to the negative log of a zero-mean Gaussian prior (up to an additive constant and a positive scale factor), so larger values of lambda correspond to a tighter prior centered at zero [10].
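
The equivalence can be checked on a toy problem. The sketch below assumes a single coefficient with no intercept, made-up data, and an assumed noise variance; it compares the closed-form ridge solution with a numerical maximizer of the log posterior under the matching Gaussian prior.

```python
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.7, 1.1, 1.4, 2.3, 2.4]   # made-up data for illustration
lam = 2.0                       # ridge penalty
sigma2 = 0.25                   # assumed noise variance
tau2 = sigma2 / lam             # implied prior variance on the coefficient


def ridge_estimate(x, y, lam):
    """Closed-form minimizer of sum (y - x*b)^2 + lam * b^2 for one coefficient."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def log_posterior(b):
    """Gaussian log-likelihood plus N(0, tau2) log-prior, up to additive constants."""
    log_lik = -sum((yi - xi * b) ** 2 for xi, yi in zip(x, y)) / (2 * sigma2)
    log_prior = -b * b / (2 * tau2)
    return log_lik + log_prior


def argmax_ternary(f, lo, hi, iters=200):
    """Maximize a concave function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


b_ridge = ridge_estimate(x, y, lam)
b_map = argmax_ternary(log_posterior, -10.0, 10.0)
print(f"ridge: {b_ridge:.6f}  MAP: {b_map:.6f}")
```

The two estimates agree to numerical precision, and changing `sigma2` while keeping `lam` fixed leaves both unchanged, since only the ratio sigma^2 / tau^2 enters the objective.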

The lasso replaces the L2 penalty with an L1 penalty:

```
minimize ||y - X beta||^2 + lambda ||beta||_1
```

This is the MAP estimate when the prior on each coefficient is a zero-mean Laplace (double exponential) distribution. Compared with the Gaussian, the Laplace prior has heavier tails and a sharp peak at zero, and that non-differentiable peak is what lets the MAP estimate set many coefficients exactly to zero [10].
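
The difference between the two penalties is easiest to see in the scalar case. The sketch below assumes a single observation y and a single coefficient b, for which both penalized problems have closed-form minimizers; the function names are illustrative.

```python
def soft_threshold(y, lam):
    """Minimizer of (y - b)^2 + lam * |b| over b (scalar lasso)."""
    magnitude = max(abs(y) - lam / 2, 0.0)
    return magnitude if y >= 0 else -magnitude


def ridge_shrink(y, lam):
    """Minimizer of (y - b)^2 + lam * b^2 over b (scalar ridge)."""
    return y / (1 + lam)


lam = 1.0
for y in (-2.0, -0.3, 0.0, 0.4, 2.0):
    print(f"y={y:+.1f}  lasso={soft_threshold(y, lam):+.3f}  ridge={ridge_shrink(y, lam):+.3f}")
```

Under the L1 penalty every observation with |y| at most lam / 2 maps to exactly zero, while the L2 penalty only rescales y and never produces an exact zero for nonzero input.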

| Regularizer | Equivalent prior | Behavior |
|---|---|---|
| L2 (ridge) | Zero-mean Gaussian | Shrinks coefficients smoothly toward zero |
| L1 (lasso) | Zero-mean Laplace | Produces sparse solutions, exactly zero coefficients |
| Elastic net | Product of Gaussian and Laplace densities | Combines smoothing and sparsity |
| Group lasso | Multivariate Laplace on groups | Sparsity at the group level |
| Dropout (in expectation) | Bernoulli scale mixture on weights | Approximates a deep Gaussian process [11] |

### MAP estimation and posterior modes

Maximum a posteriori (MAP) estimation finds the value of theta that maximizes the posterior density. With a uniform prior, MAP coincides with maximum likelihood estimation. With a non-uniform prior, MAP shifts the estimate toward regions of high prior density. MAP is sometimes used as a quick Bayesian-flavored point estimate, but it gives no measure of uncertainty and can behave oddly in high dimensions because it concentrates on a single point of a potentially complicated posterior [2].
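
For a coin with a Beta prior, both estimates have closed forms. The sketch below assumes 9 successes in 10 trials and two illustrative priors, a flat Beta(1, 1) and a Beta(5, 5) centered on 0.5.

```python
def mle(successes, trials):
    """Maximum likelihood estimate of a Bernoulli success probability."""
    return successes / trials


def map_estimate(alpha, beta, successes, trials):
    """Mode of Beta(alpha + x, beta + n - x); assumes both updated parameters exceed 1."""
    a = alpha + successes
    b = beta + trials - successes
    return (a - 1) / (a + b - 2)


successes, trials = 9, 10
print("MLE:                 ", mle(successes, trials))
print("MAP, flat Beta(1, 1):", map_estimate(1.0, 1.0, successes, trials))
print("MAP, Beta(5, 5):     ", map_estimate(5.0, 5.0, successes, trials))
```

The flat prior reproduces the MLE of 0.9, while the Beta(5, 5) prior pulls the estimate to 13/18, closer to the prior mode at 0.5.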

### Bayesian neural networks

A [Bayesian neural network](/wiki/bayesian_neural_network) places a prior over the weights of a neural network and uses Bayesian inference to compute a posterior distribution over those weights given training data. The most common choice is an independent Gaussian prior on each weight, often with mean zero. Predictions are made by integrating over the weight posterior, which gives both a predicted output and an estimate of model uncertainty. Exact inference is intractable, so Bayesian neural networks rely on approximate methods such as variational inference, Hamiltonian Monte Carlo, and Monte Carlo dropout [11][12].

In 2016 Yarin Gal and Zoubin Ghahramani showed that training a deep network with dropout can be interpreted as a variational approximation in a deep Gaussian process, a framework they described as "casting dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes." Under this view, the dropout pattern at test time produces samples from an approximate posterior, providing a cheap way to obtain uncertainty estimates from standard architectures [11].

### Bayesian optimization

Bayesian optimization is a strategy for optimizing expensive black-box functions. It builds a surrogate model of the objective, usually a Gaussian process, that encodes prior beliefs about smoothness and length scales of the function. New points are chosen by an acquisition function that balances exploration (regions of high posterior uncertainty) and exploitation (regions where the surrogate predicts a good objective value, such as a low value when minimizing). The Gaussian process prior is at the heart of the method [13].
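
One step of this loop can be sketched in plain Python. The example below assumes a zero-mean Gaussian process with an RBF kernel, a lower confidence bound acquisition for a minimization problem, and a toy objective; the length scale, noise level, and `kappa` are arbitrary illustrative values, not recommended defaults.

```python
import math


def rbf(a, b, length_scale=0.5):
    """RBF kernel: the prior belief that nearby inputs give similar outputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_posterior(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_new) for a in x_train]
    mean = sum(k * a for k, a in zip(k_star, solve(K, y_train)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


def objective(x):
    """Hypothetical expensive function to be minimized."""
    return (x - 0.3) ** 2


x_train = [0.0, 0.5, 1.0]
y_train = [objective(x) for x in x_train]


def lower_confidence_bound(x, kappa=2.0):
    """Low predicted mean (exploitation) minus a bonus for uncertainty (exploration)."""
    mean, var = gp_posterior(x_train, y_train, x)
    return mean - kappa * math.sqrt(var)


candidates = [i / 50 for i in range(51)]
next_x = min(candidates, key=lower_confidence_bound)
print("next point to evaluate:", next_x)
```

A full optimizer would evaluate the objective at `next_x`, append the result to the training set, and repeat; practical implementations also fit the kernel hyperparameters rather than fixing them.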

### Probabilistic programming

[Probabilistic programming](/wiki/probabilistic_programming) languages let users describe Bayesian models, including their priors, in code, and then perform inference automatically. Three of the most widely used systems are:

- [Stan](/wiki/stan), a C++ based system first released in 2012 and named for the Monte Carlo pioneer Stanislaw Ulam, with interfaces in R, Python, Julia, and other languages, using the No-U-Turn Sampler (NUTS) as its default inference engine [14].
- [PyMC](/wiki/pymc), a Python library that compiles models through PyTensor and supports NUTS, variational inference, and integration with JAX-based samplers [15].
- [NumPyro](/wiki/numpyro), a JAX-based library built by the Pyro team, with hardware acceleration through GPUs and TPUs and a NUTS implementation [15].

In all three systems the user writes priors as part of the model. For example, a logistic regression in PyMC might declare `pm.Normal("beta", mu=0, sigma=2.5, shape=p)` to give the regression coefficients a weakly informative Gaussian prior. The library then handles posterior sampling without requiring the user to derive update equations by hand [15].

## Sensitivity analysis

A prior is a modeling assumption, and like any modeling assumption it should be probed. Sensitivity analysis varies the prior across a range of plausible alternatives and checks whether the posterior conclusions hold up. The simplest version reruns the analysis with a few different priors (perhaps an informative prior, a weakly informative one, and an uninformative one) and compares the resulting posterior summaries.

More sophisticated approaches include local sensitivity measures based on derivatives of posterior quantities with respect to prior hyperparameters, and global methods that consider entire classes of priors. When sensitivity is high, the natural response is to gather more data, refine the prior, or report the range of conclusions rather than a single answer [5].
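
The simplest version of this check is easy to sketch. The example below assumes a binomial model with three illustrative Beta priors and reports how far apart the posterior means are for a small and a large data set.

```python
# Three assumed priors spanning informative to flat.
priors = {
    "informative Beta(20, 20)": (20.0, 20.0),
    "weakly informative Beta(2, 2)": (2.0, 2.0),
    "flat Beta(1, 1)": (1.0, 1.0),
}


def posterior_mean(alpha, beta, successes, trials):
    """Mean of the Beta posterior after binomial data."""
    return (alpha + successes) / (alpha + beta + trials)


def sensitivity(successes, trials):
    """Posterior mean under each prior, plus the spread between them."""
    means = {
        name: posterior_mean(a, b, successes, trials)
        for name, (a, b) in priors.items()
    }
    spread = max(means.values()) - min(means.values())
    return means, spread


small_means, small_spread = sensitivity(successes=9, trials=10)
large_means, large_spread = sensitivity(successes=900, trials=1000)

print(f"spread with n=10:   {small_spread:.3f}")
print(f"spread with n=1000: {large_spread:.3f}")
```

With 10 trials the posterior means differ by roughly 0.25 across priors, so the prior is doing much of the work; with 1000 trials they differ by under 0.02.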

## When does the prior stop mattering?

The influence of the prior fades as data accumulates. The Bernstein-von Mises theorem makes this precise: under regularity conditions, and for any prior that assigns positive probability to a neighborhood of the true parameter, the posterior distribution converges as the sample size grows to a normal distribution centered on the maximum likelihood estimate, with covariance equal to the inverse of the Fisher information divided by the sample size, and the prior drops out of that limit [20]. This large-sample agreement is one reason Bayesian and frequentist estimates often coincide when data is plentiful.

The result is asymptotic, and its conditions matter. The prior influences conclusions most when the sample is small, when the parameter space is high dimensional relative to the data, or when the model is misspecified. The positive-probability requirement is also a formal version of Cromwell's rule: a prior that rules out the true value entirely can never recover it, no matter how much data arrives. This is why sensitivity analysis is run on finite data rather than assumed away, and why the prior deserves as much scrutiny as the likelihood.
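
The convergence can be illustrated with a Beta posterior. The sketch below assumes a deliberately opinionated Beta(8, 2) prior and data whose observed frequency is fixed at 0.3; it compares the posterior mean and standard deviation with the normal limit centered on the MLE.

```python
import math


def beta_posterior_summary(alpha, beta, successes, trials):
    """Mean and standard deviation of the Beta posterior."""
    a = alpha + successes
    b = beta + trials - successes
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


def normal_limit(successes, trials):
    """MLE and the large-sample standard deviation sqrt(p(1 - p) / n)."""
    p_hat = successes / trials
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / trials)


alpha, beta = 8.0, 2.0   # assumed prior with mean 0.8, far from the data
mean_gaps = []
sd_ratios = []
for n in (10, 100, 10_000):
    x = 3 * n // 10      # observed frequency held at 0.3
    post_mean, post_sd = beta_posterior_summary(alpha, beta, x, n)
    mle, limit_sd = normal_limit(x, n)
    mean_gaps.append(abs(post_mean - mle))
    sd_ratios.append(post_sd / limit_sd)
    print(f"n={n:6d}  |posterior mean - MLE|={mean_gaps[-1]:.4f}  sd ratio={sd_ratios[-1]:.3f}")
```

At n = 10 the posterior mean is 0.55 against an MLE of 0.3, while at n = 10,000 the gap is below 0.001 and the two standard deviations nearly coincide.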

## Why is the prior controversial?

The role of the prior has been the subject of long-running debate.

- **Subjectivity.** Frequentists have historically objected that the prior introduces subjectivity into a supposedly scientific procedure. Two analysts with different priors can reach different conclusions from the same data. The most radical subjectivists embraced this: Bruno de Finetti opened his 1974 treatise with the declaration "PROBABILITY DOES NOT EXIST", meaning that probability is not a feature of the world but a description of an individual's uncertainty [21]. Bayesians respond that the assumptions are at least made explicit and that frequentist methods make their own implicit modeling choices [1].
- **Improper priors.** A prior that does not integrate to a finite value is called improper. Improper priors can still give proper posteriors in many models, and they are often used as defaults (for example, a uniform prior on the real line for a location parameter). They must be checked carefully because in some models an improper prior leads to an improper posterior, which is meaningless [2].
- **Stein's paradox.** Charles Stein's 1956 result showed that for estimating the mean of a multivariate normal in three or more dimensions, the maximum likelihood estimator is inadmissible under squared error loss. The James-Stein shrinkage estimator has strictly lower total squared error risk than the MLE for every value of the true mean, even though it does not improve the estimate of each individual component. The phenomenon can be derived from a Bayesian perspective: the James-Stein estimator arises as an empirical Bayes estimator under a Gaussian prior centered at the shrinkage target, with the prior variance estimated from the data. Stein's paradox is often cited as evidence that even a small amount of prior structure can pay off in high dimensions [9].
- **Cromwell's rule.** Dennis Lindley named this rule after Oliver Cromwell, who wrote to the General Assembly of the Church of Scotland in 1650, "I beseech you, in the bowels of Christ, think it possible that you may be mistaken." If a prior assigns probability zero to some value of theta, no amount of data can ever raise the posterior probability above zero. Lindley advised that an analyst should "leave a little probability for the moon being made of green cheese", so that surprising evidence can still move the conclusion; the recommendation is to avoid priors that assign exactly zero probability to anything that is not logically impossible [1][22].
- **Frequentist statistics and implicit priors.** A common claim is that frequentist methods avoid priors entirely. In practice every frequentist analysis makes modeling choices (choice of likelihood, sampling distribution, regularization) that have implicit Bayesian counterparts. The contrast is less between priors and no priors than between explicit and implicit assumptions [2].
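
Cromwell's rule in particular is easy to demonstrate. The sketch below assumes three candidate coin biases and made-up data that strongly favor the third; one prior excludes that value outright and the other leaves it a 1% chance.

```python
def bayes_update(prior, likelihoods):
    """Return the normalized posterior over a discrete set of hypotheses."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    evidence = sum(unnorm)
    return [u / evidence for u in unnorm]


thetas = [0.2, 0.5, 0.8]               # hypothetical candidate coin biases
heads, tails = 80, 20                  # made-up data favoring theta = 0.8
likelihoods = [t ** heads * (1 - t) ** tails for t in thetas]

dogmatic_prior = [0.5, 0.5, 0.0]       # rules out theta = 0.8 entirely
open_prior = [0.495, 0.495, 0.01]      # leaves it a small probability

dogmatic_post = bayes_update(dogmatic_prior, likelihoods)
open_post = bayes_update(open_prior, likelihoods)

print("dogmatic posterior:", dogmatic_post)
print("open posterior:    ", open_post)
```

The dogmatic posterior keeps exactly zero mass on 0.8 regardless of the data, while the open prior ends with almost all of its mass there.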

## Where are priors used in practice?

Priors show up across the sciences and in industry wherever prior knowledge or regularization helps, from clinical trials and epidemiology to recommender systems and large language model evaluation.

| Domain | Use of priors |
|---|---|
| Clinical trials | Hierarchical priors on treatment effects across trial sites; informative priors from meta-analyses of prior trials |
| Genome-wide association studies | Sparsity-inducing priors (spike-and-slab, horseshoe) on the effects of millions of genetic variants |
| Recommender systems | Bayesian Personalized Ranking and matrix factorization with Gaussian priors on latent user and item factors |
| A/B testing | Beta priors on conversion rates as an alternative to null hypothesis significance testing; allows continuous monitoring |
| Bayesian deep learning | Gaussian priors over weights, often combined with variational inference or Monte Carlo dropout |
| Topic modeling | Dirichlet priors on document-topic and topic-word distributions in latent Dirichlet allocation |
| Astronomy and cosmology | Priors on cosmological parameters informed by previous surveys; uniform priors on positions |
| Epidemiology | Priors on transmission rates, reproduction numbers, and serial intervals in compartmental models |
| Reinforcement learning | Bayesian bandits use Beta or Dirichlet priors on arm rewards for Thompson sampling |
| LLM evaluation | Bayesian aggregation of human ratings, with priors over rater bias and item difficulty |

## Modern context

Bayesian methods, and therefore the prior, have become more practical at scale through advances in computation. NUTS, automatic differentiation variational inference, normalizing flows, and stochastic gradient MCMC have made approximate Bayesian inference feasible for models with millions of parameters [4][14]. Probabilistic programming languages have lowered the barrier to writing custom models with bespoke priors [15].

In deep learning, Monte Carlo dropout and ensembling are commonly used as cheap approximate Bayesian methods that yield uncertainty estimates from architectures that were not originally designed with priors in mind [11][12]. Bayesian transformers and Bayesian fine-tuning of large language models are active research areas, with priors over weights or low-rank adaptation parameters used to quantify uncertainty in LLM outputs.

In evaluation of large language models, Bayesian aggregation methods place priors over rater bias, item difficulty, and model ability to combine noisy human judgments into more reliable scores. The same machinery used in psychometrics for decades, with explicit priors at every level, is being applied to questions like "is GPT-class A better than GPT-class B?" where individual ratings are sparse and noisy.

## Common misconceptions

- **"Frequentist statistics has no priors."** Frequentist analyses still make modeling choices that have Bayesian equivalents. Ridge regression with a fixed penalty has a Gaussian prior implicit in its objective; choosing the form of the likelihood is itself a strong assumption [10].
- **"Uninformative prior means no prior."** An uninformative prior is still a prior. It carries assumptions, and a uniform prior in one parameterization is not uniform after transformation. There is no truly assumption-free starting point [6].
- **"With enough data, the prior does not matter."** Asymptotically the prior is washed out by the data under regularity conditions, but in finite samples (especially small samples or high dimensional problems) the prior can still have meaningful influence. Sensitivity analysis is needed to check whether this is the case [5].
- **"Priors must be subjective."** Objective Bayes attempts to construct priors using formal rules (Jeffreys, reference priors, maximum entropy) that do not depend on subjective judgment. Whether these are truly objective is itself debated, but the framework is well developed [6][7].
- **"MAP is the Bayesian answer."** MAP returns a single point and ignores the rest of the posterior. The full Bayesian answer is the posterior distribution, which contains information about uncertainty that MAP discards [2].

## References

1. Bernardo, J. M. and Smith, A. F. M. (2000). *Bayesian Theory*. Wiley. ISBN 978-0471494645.
2. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). *Bayesian Data Analysis* (3rd ed.). Chapman and Hall/CRC. ISBN 978-1439840955.
3. Bayes, T. (1763). "An Essay Towards Solving a Problem in the Doctrine of Chances." *Philosophical Transactions of the Royal Society of London*, 53, 370-418.
4. Robert, C. P. and Casella, G. (2004). *Monte Carlo Statistical Methods*. Springer. ISBN 978-0387212395.
5. Gelman, A., Simpson, D., and Betancourt, M. (2017). "The Prior Can Often Only Be Understood in the Context of the Likelihood." *Entropy*, 19(10), 555.
6. Jeffreys, H. (1946). "An Invariant Form for the Prior Probability in Estimation Problems." *Proceedings of the Royal Society A*, 186(1007), 453-461.
7. Bernardo, J. M. (1979). "Reference Posterior Distributions for Bayesian Inference." *Journal of the Royal Statistical Society, Series B*, 41(2), 113-147.
8. Diaconis, P. and Ylvisaker, D. (1979). "Conjugate Priors for Exponential Families." *The Annals of Statistics*, 7(2), 269-281.
9. Efron, B. and Morris, C. (1977). "Stein's Paradox in Statistics." *Scientific American*, 236(5), 119-127.
10. Park, T. and Casella, G. (2008). "The Bayesian Lasso." *Journal of the American Statistical Association*, 103(482), 681-686.
11. Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." *Proceedings of the 33rd International Conference on Machine Learning*, 48, 1050-1059.
12. MacKay, D. J. C. (1992). "A Practical Bayesian Framework for Backpropagation Networks." *Neural Computation*, 4(3), 448-472.
13. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). "Taking the Human Out of the Loop: A Review of Bayesian Optimization." *Proceedings of the IEEE*, 104(1), 148-175.
14. Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). "Stan: A Probabilistic Programming Language." *Journal of Statistical Software*, 76(1), 1-32.
15. Abril-Pla, O., Andreani, V., Carroll, C., Dong, L., Fonnesbeck, C. J., Kochurov, M., Kumar, R., Lao, J., Luhmann, C. C., Martin, O. A., Osthege, M., Vieira, R., Wiecki, T., and Zinkov, R. (2023). "PyMC: A Modern, and Comprehensive Probabilistic Programming Framework in Python." *PeerJ Computer Science*, 9, e1516.
16. Keynes, J. M. (1921). *A Treatise on Probability*. Macmillan.
17. Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). "A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models." *The Annals of Applied Statistics*, 2(4), 1360-1383.
18. Stan Development Team. "Prior Distributions for rstanarm Models." rstanarm package vignette. https://mc-stan.org/rstanarm/articles/priors.html
19. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics." *Physical Review*, 106(4), 620-630.
20. van der Vaart, A. W. (1998). *Asymptotic Statistics*. Cambridge University Press. ISBN 978-0521784504.
21. de Finetti, B. (1974). *Theory of Probability: A Critical Introductory Treatment*. Wiley.
22. Lindley, D. V. (1985). *Making Decisions* (2nd ed.). Wiley. ISBN 978-0471908081.