See also: Bayes' theorem, Bayesian inference, Posterior, Likelihood
In Bayesian inference, the prior belief, also called the prior distribution or simply the prior, is the probability distribution that expresses one's beliefs about a parameter before any data is observed. Combined with the likelihood of the data through Bayes' theorem, the prior yields a posterior distribution, which represents updated beliefs after the data is taken into account.
Formally, if the parameter of interest is denoted theta and the data is D, Bayes' theorem states:
P(theta | D) = P(D | theta) * P(theta) / P(D)
where P(theta) is the prior, P(D | theta) is the likelihood, P(D) is the marginal likelihood (also called the evidence), and P(theta | D) is the posterior. Because P(D) does not depend on theta, the relationship is often written more compactly as:
posterior is proportional to likelihood times prior
The prior is one of the most distinctive and most debated parts of Bayesian methods. Frequentist statistics treats unknown parameters as fixed quantities to be estimated from data, while the Bayesian framework treats them as random variables that have probability distributions even before any data is collected [1][2]. Prior beliefs in machine learning appear in many forms, from the explicit Gaussian priors over weights in a Bayesian neural network to the implicit priors that hide inside common forms of regularization.
Imagine you are about to flip a coin that you have never seen before. Before flipping it, you probably already think it is roughly fair, with about a 50/50 chance of heads. That guess, made before any flips happen, is your prior belief.
Then you flip the coin ten times and get nine heads. Your belief should change. Maybe the coin is biased. After updating your belief based on what you saw, you have a posterior belief, which is your prior belief plus what the data taught you.
The key idea is that you almost never start from total ignorance. You usually have some background expectation, even if it is vague. The prior is just a way of writing that expectation down as math so a model can use it.
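Written as a conjugate update, the coin story looks like this. The Beta(10, 10) prior is an assumed way of encoding "probably roughly fair"; its two parameters act like ten pseudo-heads and ten pseudo-tails seen before the first real flip:

```python
# Sketch of the coin example, assuming a Beta(10, 10) prior to encode
# "probably roughly fair" (the 10s act as pseudo-counts of heads and tails).

alpha, beta = 10.0, 10.0      # prior pseudo-counts: Beta(10, 10), mean 0.5
heads, tails = 9, 1           # observed: nine heads in ten flips

# Beta prior + binomial likelihood -> Beta posterior (conjugate update)
alpha_post = alpha + heads
beta_post = beta + tails

prior_mean = alpha / (alpha + beta)                      # 0.5
posterior_mean = alpha_post / (alpha_post + beta_post)   # 19/30 ~ 0.633

print(f"prior mean P(heads)     = {prior_mean:.3f}")
print(f"posterior mean P(heads) = {posterior_mean:.3f}")
```

The posterior mean of about 0.63 sits between the prior guess (0.5) and the raw data frequency (0.9): the prior pulls the estimate back toward fairness, and more flips would let the data dominate.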
The modern theory of priors traces back to the Reverend Thomas Bayes, an English Presbyterian minister and amateur mathematician. After Bayes died in 1761, his friend Richard Price found a manuscript among his papers titled "An Essay Towards Solving a Problem in the Doctrine of Chances." Price edited the manuscript and presented it to the Royal Society, which published it in the Philosophical Transactions in 1763, two years after Bayes' death [3].
The essay considered the following problem: given the observed number of successes and failures in a sequence of trials, what can be said about the underlying probability of success on each trial? Bayes' answer required positing an initial distribution over the unknown probability before any trials were observed. He used a uniform distribution on the unit interval, which is one of the earliest explicit uses of a prior distribution [3].
Working independently in France, Pierre-Simon Laplace developed essentially the same ideas in his 1774 "Mémoire sur la probabilité des causes par les événements" and used them on real problems in astronomy, demography, and the analysis of court verdicts. Laplace popularized what became known as the principle of insufficient reason: in the absence of information distinguishing the possible outcomes, equal probability should be assigned to each. This is the philosophical foundation of the uniform prior [1][2].
For much of the late nineteenth and early twentieth centuries, the dominant approach to statistics was the frequentist framework developed by Ronald Fisher, Jerzy Neyman, and Egon Pearson. Priors were viewed with suspicion because they introduce a subjective element into the analysis. Bayesian methods saw a revival starting in the 1950s, driven by the work of Leonard Savage, Bruno de Finetti, Harold Jeffreys, and Dennis Lindley, who developed both the philosophical foundations and the practical machinery of subjective and objective Bayes [1].
The explosion of computational power from the 1990s onwards made Markov chain Monte Carlo (MCMC) practical for routine use, removing many of the obstacles that had previously limited Bayesian analysis to toy problems and analytically tractable conjugate models [4].
Let theta denote a parameter (or vector of parameters) and D denote observed data. The Bayesian recipe consists of three pieces: the prior P(theta), which expresses beliefs about theta before the data is seen; the likelihood P(D | theta), which gives the probability of the data under each value of theta; and the posterior P(theta | D), which expresses updated beliefs after the data is seen.
The relationship between these is given by Bayes' theorem:
P(theta | D) = P(D | theta) * P(theta) / P(D)
where P(D) = integral P(D | theta) * P(theta) d theta is the marginal likelihood. In practice the marginal likelihood is often hard to evaluate, but for many tasks (including sampling and finding posterior modes) only the unnormalized posterior P(D | theta) * P(theta) is needed [2].
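A grid approximation makes this concrete: tabulate the unnormalized product of likelihood and prior over a grid of parameter values, then normalize numerically instead of evaluating P(D) in closed form. The data (7 successes in 10 trials) and the uniform prior are illustrative assumptions:

```python
import numpy as np

# Grid approximation of a posterior using only the unnormalized product
# likelihood(theta) * prior(theta); the marginal likelihood P(D) is never
# computed analytically.  Toy setup: 7 successes in 10 Bernoulli trials
# with a Uniform(0, 1) prior on the success probability theta.

theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter
prior = np.ones_like(theta)               # uniform prior density
likelihood = theta**7 * (1 - theta)**3    # binomial kernel for 7/10 successes

unnorm = likelihood * prior               # unnormalized posterior
posterior = unnorm / unnorm.sum()         # normalize numerically over the grid

posterior_mean = float((theta * posterior).sum())
print(f"grid posterior mean ~ {posterior_mean:.3f}")  # analytic: 8/12 ~ 0.667
```

Because the prior here is Beta(1, 1), the exact posterior is Beta(8, 4) with mean 8/12, so the grid answer can be checked against the conjugate result.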
A convenient property of Bayes' rule is that yesterday's posterior can serve as today's prior. If data arrives in two batches D_1 and D_2 that are conditionally independent given theta:
P(theta | D_1, D_2) is proportional to P(D_2 | theta) * P(theta | D_1)
This makes the prior a flexible accounting tool. It can hold the conclusions of earlier studies, the output of an analytical approximation, or pure subjective belief, and it can be refined as new evidence arrives [1].
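The batch/sequential equivalence is easy to check with a conjugate model; the batch sizes and counts below are made up for illustration:

```python
# Yesterday's posterior as today's prior: a Beta-binomial sketch showing
# that updating on two batches sequentially gives the same posterior as
# updating on all the data at once.

def beta_update(alpha, beta, successes, failures):
    """One conjugate update of a Beta(alpha, beta) prior on binomial data."""
    return alpha + successes, beta + failures

prior = (1.0, 1.0)                                    # Beta(1, 1): uniform

# Batch 1: 3 successes, 2 failures; batch 2: 5 successes, 0 failures.
seq = beta_update(*beta_update(*prior, 3, 2), 5, 0)   # sequential updates
batch = beta_update(*prior, 3 + 5, 2 + 0)             # single batch update

print(seq, batch)   # both are (9.0, 3.0)
assert seq == batch
```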
The choice of prior is a modeling decision. Different priors encode different assumptions, and the literature classifies them along several axes: how much information they carry, whether they are mathematically convenient, whether they are proper or improper, and whether they have hyperpriors of their own.
| Type | Description | When to use | Examples |
|---|---|---|---|
| Informative | Reflects substantive knowledge from prior studies, expert elicitation, or theory | Good external information available; small sample size | Normal(70, 5) for adult human height in inches |
| Weakly informative | Vague but proper, used to gently regularize and rule out absurd values | Default for many applied problems; helps stabilize estimation | Normal(0, 10) on a logistic regression coefficient |
| Uninformative (flat) | Tries to express ignorance, often a uniform distribution over the parameter space | When no external information is available and approximate frequentist behavior is desired | Uniform(0, 1) on a probability |
| Jeffreys prior | Proportional to the square root of the determinant of the Fisher information; invariant under reparameterization | When a parameterization-independent objective prior is wanted | Beta(1/2, 1/2) for a binomial proportion |
| Reference prior | Bernardo's construction that maximizes the expected Kullback-Leibler divergence between prior and posterior | When several parameters need to be ranked by importance for objective inference | Reduces to Jeffreys in one dimension |
| Conjugate prior | Belongs to the same family as the resulting posterior under a specific likelihood | When closed-form posterior updates are wanted, e.g., for streaming data | Beta prior with binomial likelihood |
| Hierarchical | Has its own hyperprior on the parameters of the prior, allowing the data to inform the prior structure | Multilevel data, partial pooling across groups | Eight-schools model in Gelman et al. |
| Empirical | Hyperparameters of the prior are estimated from the same data, an approach called empirical Bayes | When no fully Bayesian solution is feasible and a quick fix is needed | James-Stein style shrinkage |
| Improper | Does not integrate to a finite value but can still yield a proper posterior in many models | When mimicking the absence of prior information; must be checked for posterior propriety | Uniform on the real line for a location parameter |
Informative priors carry real information about the parameter. They might come from a previous study, an expert's knowledge of the problem domain, or a theoretical model. For example, a clinical trial of a new drug might use a prior on the treatment effect built from a meta-analysis of previous trials of similar drugs. When the prior information is reliable, an informative prior can substantially improve estimates, especially with small sample sizes [5].
A weakly informative prior is meant to be vague enough to let the data dominate, while still ruling out absurd values. Andrew Gelman has been a strong advocate of weakly informative defaults such as a Normal(0, 2.5) prior on standardized logistic regression coefficients. The idea is to provide enough regularization to prevent ill-posed estimation in tricky cases (such as separation in logistic regression) without injecting strong opinions about the parameter [5].
Non-informative priors aim to express ignorance, but defining ignorance turns out to be subtle. A uniform prior on theta is not uniform on a transformation of theta, so what looks ignorant in one parameterization can look highly informative in another. The Jeffreys prior addresses this issue. Harold Jeffreys proposed in his 1946 work that the prior should be proportional to the square root of the determinant of the Fisher information matrix:
p(theta) is proportional to sqrt(det(I(theta)))
where I(theta) is the Fisher information. The Jeffreys prior is invariant under reparameterization, which means that if you transform theta to a new variable phi, applying the Jeffreys rule in the phi parameterization gives the same prior as transforming the Jeffreys prior from the theta parameterization. For a binomial proportion p, the Jeffreys prior is Beta(1/2, 1/2), which puts more mass near zero and one than the uniform prior [6].
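This can be checked numerically for the Bernoulli case, where the Fisher information is I(p) = 1 / (p * (1 - p)); the sketch below confirms that sqrt(I(p)) is proportional to the Beta(1/2, 1/2) density at every p:

```python
import math

# Numerical check that the Jeffreys prior for a Bernoulli proportion p is
# Beta(1/2, 1/2).  The Fisher information of one Bernoulli trial is
# I(p) = 1 / (p * (1 - p)), so sqrt(I(p)) = p**-0.5 * (1 - p)**-0.5,
# which is the Beta(1/2, 1/2) kernel (normalizing constant B(1/2, 1/2) = pi).

def jeffreys_kernel(p):
    fisher_info = 1.0 / (p * (1.0 - p))
    return math.sqrt(fisher_info)

def beta_half_half_pdf(p):
    return p**-0.5 * (1.0 - p)**-0.5 / math.pi  # B(1/2, 1/2) = pi

# The ratio should be the constant pi at every p: same shape, fixed scale.
ratios = [jeffreys_kernel(p) / beta_half_half_pdf(p) for p in (0.1, 0.3, 0.5, 0.9)]
print(ratios)  # all ~ 3.14159...
```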
José-Miguel Bernardo introduced reference priors in 1979 by proposing that the prior should maximize the expected Kullback-Leibler divergence between the posterior and the prior. Intuitively, this picks the prior that lets the data have the greatest influence on the posterior, that is, the prior that is least informative in an information-theoretic sense. In one-parameter problems the reference prior coincides with the Jeffreys prior, but the two can differ in multi-parameter problems, where Bernardo and Jim Berger developed a stepwise procedure that orders parameters by their importance [7].
A prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior. Conjugate priors give analytic posterior updates, which is convenient computationally and pedagogically. The classical conjugate pairs come from the exponential family of distributions [8].
| Likelihood | Parameter | Conjugate prior | Posterior |
|---|---|---|---|
| Bernoulli / Binomial(n, p) | Probability p | Beta(alpha, beta) | Beta(alpha + successes, beta + failures) |
| Multinomial(n, p_1, ..., p_k) | Probability vector | Dirichlet(alpha_1, ..., alpha_k) | Dirichlet(alpha_i + counts_i) |
| Poisson(lambda) | Rate lambda | Gamma(alpha, beta) | Gamma(alpha + sum of counts, beta + n) |
| Normal(mu, sigma^2) with known variance | Mean mu | Normal(mu_0, sigma_0^2) | Normal with precision-weighted mean |
| Normal(mu, sigma^2) with known mean | Variance sigma^2 | Inverse-Gamma(alpha, beta) | Inverse-Gamma(alpha + n/2, beta + sum of squared deviations / 2) |
| Normal(mu, sigma^2) with both unknown | (mu, sigma^2) | Normal-Inverse-Gamma | Normal-Inverse-Gamma with updated parameters |
| Exponential(lambda) | Rate lambda | Gamma(alpha, beta) | Gamma(alpha + n, beta + sum of observations) |
| Geometric(p) | Probability p | Beta(alpha, beta) | Beta(alpha + n, beta + sum of failures) |
| Uniform(0, theta) | Upper bound theta | Pareto(x_m, k) | Pareto(max(x_m, max observation), k + n) |
The beta-binomial pair is the textbook starter example. If the prior on the success probability p is Beta(alpha, beta) and you observe x successes in n trials, the posterior is Beta(alpha + x, beta + n - x). The hyperparameters alpha and beta have an intuitive reading as pseudo-counts of prior successes and failures [8].
In a hierarchical model, the prior over the parameter of interest itself depends on hyperparameters that are also given a prior. A canonical example is the eight-schools model in Gelman et al.'s Bayesian Data Analysis, where the treatment effects in eight schools are given a Normal(mu, tau^2) prior, and mu and tau are themselves given priors. This structure allows partial pooling: information from one school's observations influences estimates for the other schools through the shared hyperparameters [5].
Empirical Bayes shortcuts the hierarchical approach by estimating hyperparameters from the data using maximum marginal likelihood and then plugging those point estimates into the prior. This sidesteps a fully Bayesian analysis but can give a useful approximation. The James-Stein estimator is a famous early example whose shrinkage behavior can be derived from an empirical Bayes argument [9].
In practice priors are chosen based on a combination of substantive knowledge, mathematical convenience, computational tractability, and robustness considerations.
Many of the standard regularizers used in machine learning have a Bayesian interpretation: they are equivalent to maximum a posteriori (MAP) estimation under a particular prior. Two well-known examples are ridge regression and the lasso.
Ridge regression solves:
minimize ||y - X beta||^2 + lambda ||beta||^2
which is the MAP estimate for beta when the likelihood is Gaussian and the prior on each coefficient is independent N(0, sigma^2 / lambda). The L2 penalty corresponds to the log of a zero-mean Gaussian prior, so larger values of lambda correspond to a tighter prior centered at zero [10].
The lasso replaces the L2 penalty with an L1 penalty:
minimize ||y - X beta||^2 + lambda ||beta||_1
This is the MAP estimate when the prior on each coefficient is a zero-mean Laplace (double exponential) distribution. The Laplace prior has heavier tails and a sharp peak at zero, which is what produces the lasso's tendency to drive many coefficients exactly to zero [10].
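The ridge/MAP correspondence can be verified directly: the closed-form ridge solution is a stationary point of the penalized least-squares objective, which is (up to constants) the negative log posterior. The synthetic data below are an illustrative assumption:

```python
import numpy as np

# Ridge regression as MAP estimation: the closed-form ridge solution
# beta = (X^T X + lam*I)^{-1} X^T y is the stationary point of
# ||y - X beta||^2 + lam * ||beta||^2, the negative log posterior under a
# Gaussian likelihood and independent zero-mean Gaussian priors on the
# coefficients.  The data here are random, for illustration only.

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Closed-form ridge / MAP estimate
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Gradient of the penalized objective; it vanishes at the MAP estimate
grad = -2 * X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.round(beta_ridge, 3), np.max(np.abs(grad)))
```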
| Regularizer | Equivalent prior | Behavior |
|---|---|---|
| L2 (ridge) | Zero-mean Gaussian | Shrinks coefficients smoothly toward zero |
| L1 (lasso) | Zero-mean Laplace | Produces sparse solutions, exactly zero coefficients |
| Elastic net | Mixture of Gaussian and Laplace | Combines smoothing and sparsity |
| Group lasso | Multivariate Laplace on groups | Sparsity at the group level |
| Dropout (in expectation) | Bernoulli scale mixture on weights | Approximates a deep Gaussian process [11] |
Maximum a posteriori (MAP) estimation finds the value of theta that maximizes the posterior density. With a uniform prior, MAP coincides with maximum likelihood estimation. With a non-uniform prior, MAP shifts the estimate toward regions of high prior density. MAP is sometimes used as a quick Bayesian-flavored point estimate, but it gives no measure of uncertainty and can behave oddly in high dimensions because it concentrates on a single point of a potentially complicated posterior [2].
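For the beta-binomial model the MAP estimate has a closed form, which makes the relationship to maximum likelihood easy to see; the counts below are illustrative:

```python
# MAP vs maximum likelihood for a binomial proportion with a conjugate
# Beta prior.  The posterior is Beta(alpha + x, beta + n - x), whose mode is
# (alpha + x - 1) / (alpha + beta + n - 2) when both shape parameters > 1.

def map_estimate(x, n, alpha, beta):
    return (alpha + x - 1) / (alpha + beta + n - 2)

x, n = 9, 10                           # e.g. nine successes in ten trials

mle = x / n                            # maximum likelihood: 0.9
map_flat = map_estimate(x, n, 1, 1)    # uniform Beta(1, 1) prior: MAP == MLE
map_beta22 = map_estimate(x, n, 2, 2)  # mild Beta(2, 2) prior pulls toward 0.5

print(mle, map_flat, map_beta22)       # 0.9, 0.9, 0.8333...
```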
A Bayesian neural network places a prior over the weights of a neural network and uses Bayesian inference to compute a posterior distribution over those weights given training data. The most common choice is an independent Gaussian prior on each weight, often with mean zero. Predictions are made by integrating over the weight posterior, which gives both a predicted output and an estimate of model uncertainty. Exact inference is intractable, so Bayesian neural networks rely on approximate methods such as variational inference, Hamiltonian Monte Carlo, and Monte Carlo dropout [11][12].
In 2016 Yarin Gal and Zoubin Ghahramani showed that training a deep network with dropout can be cast as approximate variational inference in a deep Gaussian process. Under this view, the dropout pattern at test time produces samples from an approximate posterior, providing a cheap way to obtain uncertainty estimates from standard architectures [11].
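A minimal sketch of the test-time mechanics, written in plain NumPy rather than any particular deep learning framework; the tiny untrained network and the dropout rate are illustrative assumptions:

```python
import numpy as np

# Monte Carlo dropout sketch: keep dropout active at test time and treat
# repeated stochastic forward passes as samples from an approximate
# posterior predictive.  The tiny random network is purely illustrative.

rng = np.random.default_rng(1)
W1 = rng.normal(size=(1, 32))    # input -> hidden weights (untrained)
W2 = rng.normal(size=(32, 1))    # hidden -> output weights (untrained)

def forward(x, p_drop=0.5):
    """One stochastic forward pass with dropout left ON."""
    h = np.maximum(0.0, x @ W1)               # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop      # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)             # inverted dropout scaling
    return (h @ W2).item()

x = np.array([[0.7]])
samples = np.array([forward(x) for _ in range(200)])  # 200 MC samples

mean, std = samples.mean(), samples.std()
print(f"predictive mean ~ {mean:.2f}, predictive std ~ {std:.2f}")
```

The spread of the samples, not just their mean, is the payoff: the standard deviation serves as a rough uncertainty estimate for the prediction.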
Bayesian optimization is a strategy for optimizing expensive black-box functions. It builds a surrogate model of the objective, usually a Gaussian process, that encodes prior beliefs about smoothness and length scales of the function. New points are chosen by an acquisition function that balances exploration (regions of high posterior uncertainty) and exploitation (regions where the surrogate predicts favorable objective values). The Gaussian process prior is at the heart of the method [13].
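One step of this loop can be sketched in a few lines; the toy objective, the RBF kernel length scale, and the upper-confidence-bound acquisition below are illustrative choices, not settings from any particular library:

```python
import numpy as np

# Minimal Bayesian-optimization step: a Gaussian-process surrogate with an
# RBF kernel, and an upper-confidence-bound acquisition over a grid.

def rbf(a, b, length=0.3):
    """RBF (squared-exponential) kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """GP posterior mean and std at x_test (zero prior mean, RBF kernel)."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    Kss = rbf(x_test, x_test)
    weights = np.linalg.solve(K, y_train)
    mean = Ks @ weights
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

f = lambda x: np.sin(3 * x)             # the "expensive" black box (pretend)
x_train = np.array([0.1, 0.5, 0.9])     # points evaluated so far
y_train = f(x_train)

x_grid = np.linspace(0.0, 1.0, 101)
mu, sigma = gp_posterior(x_train, y_train, x_grid)

ucb = mu + 2.0 * sigma                  # acquisition: explore + exploit
x_next = x_grid[np.argmax(ucb)]
print(f"next evaluation point: {x_next:.2f}")
```

The acquisition maximum lands either near a promising observed value or in a poorly explored region where the posterior standard deviation is large, which is exactly the exploration/exploitation trade-off described above.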
Probabilistic programming languages let users describe Bayesian models, including their priors, in code, and then perform inference automatically. Widely used systems include Stan and PyMC.
In these systems the user writes priors as part of the model. For example, a logistic regression in PyMC might declare pm.Normal("beta", mu=0, sigma=2.5, shape=p) to give the regression coefficients a weakly informative Gaussian prior. The library then handles posterior sampling without requiring the user to derive update equations by hand [15].
A prior is a modeling assumption, and like any modeling assumption it should be probed. Sensitivity analysis varies the prior across a range of plausible alternatives and checks whether the posterior conclusions hold up. The simplest version reruns the analysis with a few different priors (perhaps an informative prior, a weakly informative one, and an uninformative one) and compares the resulting posterior summaries.
More sophisticated approaches include local sensitivity measures based on derivatives of posterior quantities with respect to prior hyperparameters, and global methods that consider entire classes of priors. When sensitivity is high, the natural response is to gather more data, refine the prior, or report the range of conclusions rather than a single answer [5].
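The simple rerun-and-compare version is cheap for conjugate models; the data and the three candidate priors below are illustrative:

```python
# Simplest prior sensitivity check: rerun a conjugate Beta-binomial analysis
# under several priors and compare the posterior summaries.

x, n = 14, 20  # e.g. 14 successes in 20 trials

priors = {
    "informative Beta(20, 20)":      (20.0, 20.0),
    "weakly informative Beta(2, 2)": (2.0, 2.0),
    "uninformative Beta(1, 1)":      (1.0, 1.0),
}

means = {}
for name, (a, b) in priors.items():
    means[name] = (a + x) / (a + b + n)  # posterior mean of Beta(a + x, b + n - x)
    print(f"{name:32s} posterior mean = {means[name]:.3f}")
```

Here the weakly informative and uninformative priors give similar answers (about 0.67 and 0.68) while the strongly informative prior pulls the estimate toward 0.5 (about 0.57); a gap of that size is the signal that the conclusion depends on the prior and needs either more data or a defensible choice.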
The role of the prior has been the subject of long-running debate, yet priors are used routinely across a wide range of domains:
| Domain | Use of priors |
|---|---|
| Clinical trials | Hierarchical priors on treatment effects across trial sites; informative priors from meta-analyses of prior trials |
| Genome-wide association studies | Sparsity-inducing priors (spike-and-slab, horseshoe) on the effects of millions of genetic variants |
| Recommender systems | Bayesian Personalized Ranking and matrix factorization with Gaussian priors on latent user and item factors |
| A/B testing | Beta priors on conversion rates as an alternative to null hypothesis significance testing; allows continuous monitoring |
| Bayesian deep learning | Gaussian priors over weights, often combined with variational inference or Monte Carlo dropout |
| Topic modeling | Dirichlet priors on document-topic and topic-word distributions in latent Dirichlet allocation |
| Astronomy and cosmology | Priors on cosmological parameters informed by previous surveys; uniform priors on positions |
| Epidemiology | Priors on transmission rates, reproduction numbers, and serial intervals in compartmental models |
| Reinforcement learning | Bayesian bandits use Beta or Dirichlet priors on arm rewards for Thompson sampling |
| LLM evaluation | Bayesian aggregation of human ratings, with priors over rater bias and item difficulty |
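As a concrete instance of the reinforcement-learning row above, Thompson sampling for a Bernoulli bandit keeps a Beta posterior per arm, samples one draw from each posterior, and pulls the arm with the largest draw; the true reward probabilities below are made up for the simulation:

```python
import numpy as np

# Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors per arm.
# Sampling from the posterior (rather than maximizing it) makes the agent
# explore arms it is still uncertain about.

rng = np.random.default_rng(42)
true_p = np.array([0.3, 0.5, 0.7])  # unknown to the agent
alpha = np.ones(3)                  # Beta(1, 1) prior on every arm
beta = np.ones(3)

for _ in range(2000):
    theta = rng.beta(alpha, beta)   # one posterior sample per arm
    arm = int(np.argmax(theta))     # act greedily w.r.t. the sample
    reward = rng.random() < true_p[arm]
    alpha[arm] += reward            # conjugate posterior update
    beta[arm] += 1 - reward

pulls = alpha + beta - 2.0
print("pulls per arm:", pulls.astype(int))  # in a typical run, the best arm dominates
```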
Bayesian methods, and therefore the prior, have become more practical at scale through advances in computation. The No-U-Turn Sampler (NUTS), automatic differentiation variational inference, normalizing flows, and stochastic gradient MCMC have made approximate Bayesian inference feasible for models with millions of parameters [4][14]. Probabilistic programming languages have lowered the barrier to writing custom models with bespoke priors [15].
In deep learning, Monte Carlo dropout and ensembling are commonly used as cheap approximate Bayesian methods that yield uncertainty estimates from architectures that were not originally designed with priors in mind [11][12]. Bayesian transformers and Bayesian fine-tuning of large language models are active research areas, with priors over weights or low-rank adaptation parameters used to quantify uncertainty in LLM outputs.
In evaluation of large language models, Bayesian aggregation methods place priors over rater bias, item difficulty, and model ability to combine noisy human judgments into more reliable scores. The same machinery used in psychometrics for decades, with explicit priors at every level, is being applied to questions like "is GPT-class A better than GPT-class B?" where individual ratings are sparse and noisy.