Naive Bayes is a family of probabilistic classification algorithms based on Bayes' theorem with a strong ("naive") assumption that all features are conditionally independent given the class label. Despite this simplifying assumption, which rarely holds in practice, Naive Bayes classifiers perform remarkably well on a wide range of tasks, particularly in text classification and spam filtering. They are among the simplest and most efficient supervised learning algorithms in machine learning.
Bayes' theorem describes how to update the probability of a hypothesis in light of new evidence. For a classification problem, it provides the posterior probability of a class y given observed features x = (x_1, x_2, ..., x_n):
P(y | x_1, x_2, ..., x_n) = P(x_1, x_2, ..., x_n | y) * P(y) / P(x_1, x_2, ..., x_n)
Where:
| Term | Name | Meaning |
|---|---|---|
| P(y \| x_1, ..., x_n) | Posterior probability | The probability of class y given the observed features |
| P(x_1, ..., x_n \| y) | Likelihood | The probability of observing the features given class y |
| P(y) | Prior probability | The probability of class y before observing any features |
| P(x_1, ..., x_n) | Evidence (marginal likelihood) | The total probability of observing the features across all classes |
The prior P(y) can be estimated from the frequency of each class in the training data. The evidence P(x_1, ..., x_n) is a normalizing constant that does not depend on the class, so it can be ignored when comparing classes. The main challenge is estimating the likelihood P(x_1, ..., x_n | y), which requires modeling the joint distribution of all features conditioned on the class.
Estimating the full joint probability P(x_1, x_2, ..., x_n | y) directly is intractable for high-dimensional data. With d binary features, there are 2^d possible feature combinations per class, requiring an exponential number of parameters.
The "naive" assumption simplifies this by assuming that all features are conditionally independent given the class:
P(x_1, x_2, ..., x_n | y) = P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y) = product from i=1 to n of P(x_i | y)
This transforms the problem from estimating one joint distribution into estimating n individual feature distributions, each of which requires far fewer parameters.
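The parameter savings can be made concrete with a quick calculation. The sketch below (plain Python, illustrative function names) counts the parameters needed for the full joint model versus the naive model with d binary features and k classes:

```python
# Parameter counts for modeling P(x_1, ..., x_d | y) with d binary features
# and k classes. Function names are illustrative, not from any library.

def full_joint_params(d, k):
    # One probability per feature combination per class
    # (minus 1 per class for the normalization constraint).
    return k * (2 ** d - 1)

def naive_bayes_params(d, k):
    # One Bernoulli parameter P(x_i = 1 | y) per feature per class.
    return k * d

print(full_joint_params(20, 2))   # 2,097,150 parameters
print(naive_bayes_params(20, 2))  # 40 parameters
```

Even at a modest d = 20, the full joint model needs over two million parameters while the naive model needs forty.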
The full Naive Bayes classification rule combines Bayes' theorem with the independence assumption:
y_hat = argmax_y P(y) * product from i=1 to n of P(x_i | y)
Since the evidence term P(x_1, ..., x_n) is the same for all classes, it cancels out during comparison. In practice, the products of probabilities are computed as sums of log-probabilities to avoid numerical underflow:
y_hat = argmax_y [log P(y) + sum from i=1 to n of log P(x_i | y)]
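The log-space decision rule above can be sketched in a few lines of Python. The priors and word likelihoods here are hypothetical hand-set values, not learned from data; a real filter would estimate them from counts:

```python
import math

# Hypothetical, hand-set parameters for a two-class toy problem.
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"free": math.log(0.05), "meeting": math.log(0.001)},
    "ham":  {"free": math.log(0.002), "meeting": math.log(0.03)},
}

def classify(words):
    # argmax over classes of log P(y) + sum_i log P(x_i | y)
    scores = {
        y: log_prior[y] + sum(log_likelihood[y][w] for w in words)
        for y in log_prior
    }
    return max(scores, key=scores.get)

print(classify(["free", "free"]))  # spam
print(classify(["meeting"]))       # ham
```

Summing logs rather than multiplying raw probabilities keeps the scores in a numerically safe range even with thousands of features.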
The independence assumption is almost always violated in real data. For instance, in text classification, the presence of the word "machine" makes the word "learning" more likely to appear. Nevertheless, Naive Bayes often produces accurate classifications because the predicted class depends on which class has the highest posterior probability, not on the exact probability values. Even when the probability estimates are poorly calibrated, the ranking of classes can still be correct.
Different variants of Naive Bayes differ in the assumptions they make about the distribution of P(x_i | y). The choice of variant depends on the nature of the feature data.
Gaussian Naive Bayes assumes that the continuous features associated with each class follow a normal distribution (Gaussian distribution). For each class y and feature x_i, the likelihood is modeled as:
P(x_i | y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp(-(x_i - mu_y)^2 / (2 * sigma_y^2))
Where mu_y is the mean and sigma_y^2 is the variance of feature x_i for class y, both estimated from the training data.
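The density formula above translates directly into code. This is a minimal sketch of the per-feature likelihood, taking the class mean and variance as inputs:

```python
import math

def gaussian_likelihood(x_i, mu_y, sigma2_y):
    """P(x_i | y) under a per-class Gaussian with mean mu_y, variance sigma2_y."""
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma2_y)
    return coef * math.exp(-((x_i - mu_y) ** 2) / (2.0 * sigma2_y))

# At the class mean with unit variance, the density is 1 / sqrt(2*pi):
print(gaussian_likelihood(0.0, 0.0, 1.0))  # ~0.3989
```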
When to use: Gaussian Naive Bayes is appropriate for datasets with continuous, real-valued features. It is commonly applied in medical diagnosis, iris species classification, and other tasks where features are measurements or sensor readings.
Advantages: Simple to implement; no discretization of features required; works well when the Gaussian assumption roughly holds.
Limitations: Performs poorly when features have heavy-tailed, multimodal, or highly skewed distributions that deviate significantly from the Gaussian assumption.
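As a usage sketch, scikit-learn's GaussianNB can be fit on toy continuous data. The two well-separated 1-D clusters below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: two 1-D clusters centered at 0 and 5 (illustrative only).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 50),
                    rng.normal(5.0, 1.0, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB()
clf.fit(X, y)  # estimates a mean and variance per class
print(clf.predict([[0.2], [4.8]]))  # points near each cluster center
```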
Multinomial Naive Bayes models the likelihood of features using a multinomial distribution. It is designed for discrete count data, where features represent frequencies or occurrence counts.
For a document represented as a vector of word counts x = (x_1, x_2, ..., x_n), the likelihood for class y is:
P(x | y) proportional to product from i=1 to n of P(w_i | y)^(x_i)
Where P(w_i | y) is the probability of word w_i appearing in a document of class y, estimated as:
P(w_i | y) = (count of w_i in documents of class y) / (total count of all words in documents of class y)
When to use: Multinomial Naive Bayes is the standard choice for text classification tasks, including document classification, sentiment analysis, and topic categorization. It works with bag-of-words and TF-IDF feature representations.
Advantages: Handles high-dimensional sparse data efficiently; naturally accounts for word frequency information; widely used and well-tested for natural language processing tasks.
Limitations: Assumes features are counts (non-negative integers); does not capture word order or semantic relationships.
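A typical multinomial setup pairs a bag-of-words vectorizer with the classifier. The four-document corpus below is a hypothetical mini-example (1 = sports, 0 = technology), not a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus; labels: 1 = sports, 0 = technology.
docs = [
    "the team won the match",
    "great goal in the final match",
    "new laptop with fast processor",
    "software update improves processor speed",
]
labels = [1, 1, 0, 0]

# CountVectorizer produces word-count features; MultinomialNB models them.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(docs, labels)
print(model.predict(["the team scored a goal"]))        # sports-like words
print(model.predict(["processor benchmark software"]))  # technology-like words
```

Words unseen during training (such as "scored" or "benchmark") are simply dropped by the vectorizer; smoothing inside MultinomialNB handles words seen in only one class.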
Bernoulli Naive Bayes models features as binary (Boolean) variables, indicating the presence or absence of a feature. For text, this means each feature represents whether a particular word appears in a document, regardless of how many times.
The likelihood for class y is:
P(x | y) = product from i=1 to n of [P(x_i = 1 | y)^(x_i) * (1 - P(x_i = 1 | y))^(1 - x_i)]
A distinctive property of Bernoulli Naive Bayes is that it explicitly models the absence of features (when x_i = 0), penalizing the non-occurrence of features that are expected for a given class. Multinomial Naive Bayes, by contrast, simply ignores features with zero counts.
When to use: Bernoulli Naive Bayes is appropriate when features are binary. In text classification, it is best suited for short documents or small vocabularies where word presence/absence is more informative than word frequency.
Advantages: Explicitly penalizes absent features, which can improve classification when feature absence is informative; works well with binary feature vectors.
Limitations: Discards frequency information; the multivariate Bernoulli model tends to perform worse than the multinomial model when the vocabulary is large.
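The presence/absence behavior can be seen on a tiny binary matrix. The three feature columns here stand for hypothetical words chosen for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Binary presence/absence features; columns = ["free", "winner", "meeting"]
# (hypothetical words for illustration).
X = np.array([
    [1, 1, 0],  # spam
    [1, 0, 0],  # spam
    [0, 0, 1],  # ham
    [0, 1, 1],  # ham
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

clf = BernoulliNB(alpha=1.0)
clf.fit(X, y)
# Unlike MultinomialNB, the absence of "meeting" here counts as evidence too.
print(clf.predict([[1, 0, 0]]))
```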
Complement Naive Bayes (CNB), introduced by Rennie et al. in 2003, addresses some of the issues with standard Multinomial Naive Bayes on imbalanced datasets. Instead of estimating the parameters for each class using documents belonging to that class, CNB uses the complement of each class (all documents not in the class) to estimate parameters. This approach often outperforms standard Multinomial Naive Bayes, especially on datasets with uneven class distributions.
| Variant | Feature Type | Distribution | Best For | Key Property |
|---|---|---|---|---|
| Gaussian | Continuous | Normal (Gaussian) | Numeric data, sensor data, measurements | Estimates mean and variance per class |
| Multinomial | Discrete counts | Multinomial | Text classification with word counts | Uses word frequency |
| Bernoulli | Binary (0/1) | Bernoulli | Short texts, binary feature presence | Penalizes absent features |
| Complement | Discrete counts | Complement of multinomial | Imbalanced text datasets | Uses complement class statistics |
Naive Bayes classifiers are among the most widely used algorithms for text classification. The standard pipeline for text classification with Naive Bayes involves:
1. Preprocessing the text (lowercasing, tokenization, and optionally stop-word removal and stemming).
2. Converting each document into a feature vector, typically word counts (bag-of-words) or TF-IDF weights.
3. Estimating the class priors P(y) and the per-word likelihoods P(x_i | y) from the training data.
4. Classifying new documents by selecting the class with the highest posterior score.
Naive Bayes handles high-dimensional text data efficiently because training only requires a single pass through the data to collect word counts per class. The time complexity is O(n * d) for training and O(c * d) per document for classification, where n is the number of training examples, d is the vocabulary size, and c is the number of classes.
Consider classifying news articles into categories such as "Sports," "Politics," and "Technology." The Multinomial Naive Bayes classifier would:
1. Estimate the prior probability of each category from the proportion of training articles in that category.
2. Estimate the probability of each word given each category from word counts in the training articles.
3. For a new article, combine the category prior with the likelihood of each word it contains (in log space).
4. Assign the article to the category with the highest resulting score.
One of the most famous applications of Naive Bayes is email spam filtering. The history of Bayesian spam filtering illustrates both the power and the practical appeal of the algorithm.
Bayesian approaches to spam filtering date back to at least 1998, when Sahami, Dumais, Heckerman, and Horvitz published one of the first scholarly papers on the topic, "A Bayesian approach to filtering junk e-mail." However, Bayesian spam filtering gained widespread popularity in 2002 when Paul Graham published his influential essay "A Plan for Spam." Graham demonstrated that a simple Bayesian classifier trained on a user's own email could achieve remarkably low false positive rates, outperforming the handcrafted rule-based filters that were standard at the time.
Graham's approach worked by computing the probability that a word appeared in spam versus legitimate email ("ham") and combining these individual probabilities using Bayes' theorem. The system learned from the user's own data, adapting to the specific types of spam they received. Following Graham's essay, Bayesian spam filters were incorporated into email clients such as Mozilla Thunderbird, SpamAssassin, SpamBayes, and others.
A Naive Bayes spam filter classifies an email as spam or ham based on the words it contains:
1. During training, count how often each word appears in spam messages and in ham messages.
2. Estimate P(word | spam) and P(word | ham) from these counts, with smoothing.
3. For an incoming email, combine the word probabilities with the class priors to obtain posterior scores for spam and ham.
4. Flag the email as spam if the spam posterior exceeds a chosen threshold.
Words like "free," "winner," "click," and "unsubscribe" would have high P(word | spam), while words like a recipient's name or workplace would have high P(word | ham).
A practical problem arises when a word appears in test data but was never observed in one of the classes during training. In this case, P(x_i | y) = 0, and because Naive Bayes multiplies probabilities, a single zero probability makes the entire posterior zero, regardless of all other evidence. This is known as the zero-frequency problem.
Laplace smoothing (also called additive smoothing or Lidstone smoothing) addresses this by adding a small constant alpha to every count:
P(x_i | y) = (count(x_i, y) + alpha) / (count(y) + alpha * |V|)
Where:
| Symbol | Meaning |
|---|---|
| count(x_i, y) | Number of times feature x_i appears in class y |
| count(y) | Total count of all features in class y |
| alpha | Smoothing parameter |
| |V| | Size of the vocabulary (total number of distinct features) |
Common choices for alpha:
| Alpha Value | Name | Effect |
|---|---|---|
| 0 | No smoothing | Zero probabilities possible; can fail on unseen features |
| 0.5 | Jeffreys prior | Moderate smoothing |
| 1 | Laplace smoothing (add-one) | Each feature is "seen" at least once per class |
| Tuned value | Lidstone smoothing | Optimized via cross-validation |
When alpha = 1, the formula acts as if every word was observed one additional time in every class, ensuring that no probability estimate is zero. In practice, treating alpha as a hyperparameter and tuning it via cross-validation often yields better results than using a fixed value.
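The smoothing formula above is a one-liner in code. This sketch shows the zero-frequency problem at alpha = 0 and how alpha = 1 removes it, for a word never observed in a class:

```python
def smoothed_prob(count_xi_y, count_y, alpha, vocab_size):
    """Laplace/Lidstone-smoothed estimate of P(x_i | y)."""
    return (count_xi_y + alpha) / (count_y + alpha * vocab_size)

# A word never seen in class y (count 0), with 100 total words and |V| = 50:
print(smoothed_prob(0, 100, 0.0, 50))  # 0.0 -> zero-frequency problem
print(smoothed_prob(0, 100, 1.0, 50))  # ~0.00667 -> small but nonzero
```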
| Advantage | Explanation |
|---|---|
| Fast training and prediction | Training requires only a single pass through the data; prediction involves computing products of stored probabilities |
| Scales well to high dimensions | Handles thousands or millions of features (e.g., large vocabularies in text) without difficulty |
| Works well with small datasets | Requires relatively few training examples to estimate parameters reliably |
| Robust to irrelevant features | Irrelevant features contribute roughly equally to all classes, so they do not strongly affect classification |
| Easy to implement | The algorithm is straightforward and has few hyperparameters |
| Naturally handles multi-class problems | Extends directly to multiple classes without modification |
| Good baseline | Often provides a surprisingly strong baseline that more complex models struggle to beat |
| Interpretable | The learned probabilities are easy to inspect and understand |
| Disadvantage | Explanation |
|---|---|
| Independence assumption is unrealistic | Features in real data are almost always correlated; this can lead to poorly calibrated probability estimates |
| Poor probability estimates | While class rankings may be correct, the raw probability values are often unreliable |
| Sensitive to feature engineering | Performance depends heavily on how features are represented and selected |
| Cannot learn feature interactions | The independence assumption prevents the model from capturing relationships between features |
| Zero-frequency problem | Without smoothing, unseen features cause zero probabilities |
| Bias toward classes with more features | In text, longer documents tend to be favored unless normalization is applied |
Naive Bayes and logistic regression are both widely used for classification, but they represent fundamentally different approaches. Naive Bayes is a generative model that learns the joint probability P(X, y) and derives the decision boundary from it, while logistic regression is a discriminative model that directly learns the conditional probability P(y | X).
This distinction was formally analyzed by Andrew Ng and Michael Jordan in their 2001 NIPS (now NeurIPS) paper, "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes."
| Aspect | Naive Bayes | Logistic Regression |
|---|---|---|
| Model type | Generative | Discriminative |
| What it models | P(X, y) = P(X \| y) * P(y) | P(y \| X) directly |
| Feature independence | Assumes conditional independence | Does not require independence |
| Training speed | Very fast (closed-form parameter estimation) | Requires iterative optimization |
| Data efficiency | Reaches asymptotic performance with O(log n) examples | Reaches asymptotic performance with O(n) examples |
| Asymptotic accuracy | Lower (bounded by independence assumption) | Higher (no independence constraint) |
| Small dataset performance | Often better | May overfit |
| Large dataset performance | Often worse | Often better |
| Handling correlated features | Cannot model correlations | Handles correlations naturally |
| Probability calibration | Often poorly calibrated | Generally better calibrated |
| Regularization | Laplace smoothing | L1 (Lasso), L2 (Ridge) penalties |
Ng and Jordan's analysis revealed two important findings:
Asymptotic performance: With infinite training data, logistic regression achieves lower classification error than Naive Bayes when the independence assumption is violated (which is almost always the case in practice).
Convergence rate: Naive Bayes reaches its (higher) asymptotic error rate much faster, requiring only O(log n) training examples, whereas logistic regression requires O(n) examples to reach its (lower) asymptotic error. This means that Naive Bayes can outperform logistic regression when training data is scarce.
In practical terms: for small datasets where features are reasonably independent, Naive Bayes may be the better choice. For large datasets with correlated features, logistic regression (or other discriminative models) will typically outperform Naive Bayes.
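The small-vs-large data behavior can be explored experimentally. The sketch below fits both models on a synthetic dataset with correlated (redundant) features at two training-set sizes; it is an illustration under these synthetic assumptions, not a benchmark:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with deliberately correlated features (n_redundant > 0).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 1500):  # small vs. larger training set
    nb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(n,
          round(nb.score(X_test, y_test), 3),
          round(lr.score(X_test, y_test), 3))
```

Following Ng and Jordan's analysis, one would expect the gap to favor Naive Bayes less (or reverse) as the training set grows, though results vary with the data-generating process.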
Naive Bayes is a strong choice in the following scenarios:
- Text classification with bag-of-words or TF-IDF features, such as spam filtering, sentiment analysis, and topic categorization.
- Small training sets, where its fast O(log n) convergence to asymptotic performance pays off.
- High-dimensional, sparse feature spaces with thousands or millions of features.
- Applications that require very fast training and prediction, such as real-time filtering or quick baselines.
The scikit-learn library provides implementations of all major Naive Bayes variants:
| Class | Variant | Typical Use Case |
|---|---|---|
| GaussianNB | Gaussian Naive Bayes | Continuous features |
| MultinomialNB | Multinomial Naive Bayes | Word counts, TF-IDF |
| BernoulliNB | Bernoulli Naive Bayes | Binary features |
| ComplementNB | Complement Naive Bayes | Imbalanced text classification |
| CategoricalNB | Categorical Naive Bayes | Categorical features |
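As a brief usage sketch for the less commonly shown CategoricalNB, the toy data below uses integer-encoded categorical features (the encoding and values are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data: each column is an integer-encoded categorical feature.
X = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = CategoricalNB(alpha=1.0)  # Laplace smoothing on category counts
clf.fit(X, y)
print(clf.predict(np.array([[0, 1]])))  # matches the pattern of class 0
```

In practice, raw string categories would first be mapped to integers, for example with scikit-learn's OrdinalEncoder.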
Several extensions have been proposed to address the limitations of standard Naive Bayes:
- Tree-Augmented Naive Bayes (TAN): allows each feature to depend on one other feature in addition to the class, relaxing the independence assumption while keeping learning tractable.
- Averaged One-Dependence Estimators (AODE): averages over an ensemble of one-dependence models rather than committing to a single dependency structure.
- Semi-naive Bayes: merges or removes strongly correlated features before applying the standard model.
- Feature weighting: assigns weights to features according to their relevance instead of treating all features equally.