# Naive Bayes

> Source: https://aiwiki.ai/wiki/naive_bayes
> Updated: 2026-06-21
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Naive Bayes** is a family of probabilistic [classification](/wiki/classification) algorithms that apply [Bayes' theorem](/wiki/bayes_theorem) under the strong ("naive") assumption that every feature is conditionally independent of every other feature given the class label.[1] To classify an example, a Naive Bayes model picks the class with the highest posterior probability, which is proportional to the class prior multiplied by the product of the per-feature likelihoods. Despite the independence assumption, which rarely holds in practice, Naive Bayes classifiers perform remarkably well on many tasks, especially [text classification](/wiki/text_classification_models) and [spam filtering](/wiki/spam_filtering), and they require only a small amount of training data to estimate their parameters.[8][9][10] They are among the simplest and fastest [supervised learning](/wiki/supervised_learning) algorithms in [machine learning](/wiki/machine_learning).

The approach traces to the work of the Reverend Thomas Bayes, whose "An Essay towards Solving a Problem in the Doctrine of Chances" was communicated to the Royal Society by Richard Price after Bayes' death and published in the Philosophical Transactions in 1763, volume 53, pages 370-418.[1] The scikit-learn documentation summarizes the modern appeal succinctly: "In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters."[10]

## What is the Bayes' theorem foundation?

[Bayes' theorem](/wiki/bayes_theorem) describes how to update the probability of a hypothesis in light of new evidence.[1] For a classification problem, it provides the posterior probability of a class y given observed features **x** = (x_1, x_2, ..., x_n):

P(y | x_1, x_2, ..., x_n) = P(x_1, x_2, ..., x_n | y) * P(y) / P(x_1, x_2, ..., x_n)

Where:

| Term | Name | Meaning |
|---|---|---|
| P(y \| x_1, ..., x_n) | Posterior probability | The probability of class y given the observed features |
| P(x_1, ..., x_n \| y) | Likelihood | The probability of observing the features given class y |
| P(y) | Prior probability | The probability of class y before observing any features |
| P(x_1, ..., x_n) | Evidence (marginal likelihood) | The total probability of observing the features across all classes |

The prior P(y) can be estimated from the frequency of each class in the training data. The evidence P(x_1, ..., x_n) is a normalizing constant that does not depend on the class, so it can be ignored when comparing classes. The main challenge is estimating the likelihood P(x_1, ..., x_n | y), which requires modeling the joint distribution of all features conditioned on the class.[7]
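
As a small worked illustration, the sketch below applies Bayes' theorem to a two-class problem with a single observed feature. The prior and likelihood values are made-up numbers chosen for illustration, not estimates from any dataset.

```python
# Hypothetical numbers for illustration only.
priors = {"spam": 0.3, "ham": 0.7}        # P(y)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(x | y) for one observed feature x

# Numerator of Bayes' theorem for each class: P(x | y) * P(y)
joint = {y: likelihoods[y] * priors[y] for y in priors}

# Evidence P(x): the same normalizing constant for every class.
evidence = sum(joint.values())

# Posterior P(y | x)
posterior = {y: joint[y] / evidence for y in joint}
print(posterior)
```

Dividing by the evidence rescales the scores so they sum to one, but it does not change which class scores highest.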

## What does the "naive" conditional independence assumption mean?

Estimating the full joint probability P(x_1, x_2, ..., x_n | y) directly is intractable for high-dimensional data. With n binary features, there are 2^n possible feature combinations per class, so the number of parameters grows exponentially with n.[7]

The "naive" assumption simplifies this by assuming that all features are conditionally independent given the class:

P(x_1, x_2, ..., x_n | y) = P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y) = product from i=1 to n of P(x_i | y)

This transforms the problem from estimating one joint distribution into estimating n individual feature distributions, each of which requires far fewer parameters.
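
To make the saving concrete, the following sketch counts free parameters for binary features under both modeling choices. The function names are hypothetical, and class priors are left out of both counts.

```python
def joint_params(n_features, n_classes):
    # A full joint table over n binary features has 2^n cells per class,
    # one of which is fixed because the probabilities sum to 1.
    return n_classes * (2 ** n_features - 1)

def naive_params(n_features, n_classes):
    # Under conditional independence: one Bernoulli parameter
    # P(x_i = 1 | y) per feature per class.
    return n_classes * n_features

for n in (10, 20, 30):
    print(n, joint_params(n, 2), naive_params(n, 2))
```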

The full Naive Bayes classification rule combines Bayes' theorem with the independence assumption:

y_hat = argmax_y P(y) * product from i=1 to n of P(x_i | y)

Since the evidence term P(x_1, ..., x_n) is the same for all classes, it cancels out during comparison. In practice, the products of probabilities are computed as sums of log-probabilities to avoid numerical underflow:

y_hat = argmax_y [log P(y) + sum from i=1 to n of log P(x_i | y)]
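
The underflow problem is easy to reproduce. The sketch below uses a made-up setup with 2000 features, each with a small likelihood, and compares the direct product with the log-space score.

```python
import math

# Hypothetical setup: two classes, 2000 features with small likelihoods.
priors = {"a": 0.5, "b": 0.5}
likelihoods = {"a": [1e-3] * 2000, "b": [2e-3] * 2000}

def product_score(y):
    """P(y) * prod_i P(x_i | y), computed directly in floating point."""
    score = priors[y]
    for p in likelihoods[y]:
        score *= p
    return score

def log_score(y):
    """log P(y) + sum_i log P(x_i | y)."""
    return math.log(priors[y]) + sum(math.log(p) for p in likelihoods[y])

print({y: product_score(y) for y in priors})  # both products underflow to 0.0
print({y: log_score(y) for y in priors})      # log scores remain comparable
print(max(priors, key=log_score))
```

Because the logarithm is monotonic, the argmax over log scores selects the same class as the argmax over the original products would, had they been representable.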

The independence assumption is almost always violated in real data. For instance, in text classification, the presence of the word "machine" makes the word "learning" more likely to appear. Nevertheless, Naive Bayes often produces accurate classifications because the predicted class depends on which class has the highest posterior probability, not on the exact probability values.[8] Even when the probability estimates are poorly calibrated, the ranking of classes can still be correct.[9]

Harry Zhang's 2004 paper "The Optimality of Naive Bayes" gives a formal explanation for this resilience. Zhang showed that no matter how strong the dependences among attributes are, Naive Bayes can still be optimal if those dependences distribute evenly across classes, or if they cancel each other out, which helps account for the classifier's surprisingly strong accuracy despite its unrealistic assumption.[8]

## What are the types of Naive Bayes classifiers?

Different variants of Naive Bayes differ in the assumptions they make about the distribution of P(x_i | y). The choice of variant depends on the nature of the feature data.

### Gaussian Naive Bayes

Gaussian Naive Bayes assumes that the continuous features associated with each class follow a [normal distribution](/wiki/normal_distribution) (Gaussian distribution). For each class y and feature x_i, the likelihood is modeled as:

P(x_i | y) = (1 / sqrt(2 * pi * sigma_{y,i}^2)) * exp(-(x_i - mu_{y,i})^2 / (2 * sigma_{y,i}^2))

Where mu_{y,i} is the mean and sigma_{y,i}^2 is the variance of feature x_i within class y, both estimated from the training data.

**When to use:** Gaussian Naive Bayes is appropriate for datasets with continuous, real-valued features. It is commonly applied in medical diagnosis, iris species classification, and other tasks where features are measurements or sensor readings.

**Advantages:** Simple to implement; no discretization of features required; works well when the Gaussian assumption roughly holds.

**Limitations:** Performs poorly when features have heavy-tailed, multimodal, or highly skewed distributions that deviate significantly from the Gaussian assumption.
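
A minimal from-scratch sketch of Gaussian Naive Bayes is shown below, using only the Python standard library. The function names, the toy measurements, and the small `var_floor` added to each variance (to avoid division by zero) are assumptions made for illustration; this is not a reference implementation.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate class priors and a mean and variance per feature and class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [fmean(col) for col in columns],
            # var_floor is an assumed safeguard against zero variance
            "var": [pvariance(col) + var_floor for col in columns],
        }
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    def score(label):
        params = model[label]
        return math.log(params["prior"]) + sum(
            log_gaussian(x, m, v)
            for x, m, v in zip(row, params["mean"], params["var"])
        )
    return max(model, key=score)

# Toy measurements (hypothetical): [length_cm, width_cm]
X = [[1.0, 0.2], [1.2, 0.3], [0.9, 0.25], [4.5, 1.4], [4.8, 1.5], [5.1, 1.6]]
y = ["small", "small", "small", "large", "large", "large"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 0.2]))
print(predict_gaussian_nb(model, [4.9, 1.5]))
```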

### Multinomial Naive Bayes

Multinomial Naive Bayes models the likelihood of features using a [multinomial distribution](/wiki/multinomial_distribution). It is designed for discrete count data, where features represent frequencies or occurrence counts.[3]

For a document represented as a vector of word counts **x** = (x_1, x_2, ..., x_n), the likelihood for class y is:

P(**x** | y) proportional to product from i=1 to n of P(w_i | y)^(x_i)

Where P(w_i | y) is the probability of word w_i appearing in a document of class y, estimated as:

P(w_i | y) = (count of w_i in documents of class y) / (total count of all words in documents of class y)

**When to use:** Multinomial Naive Bayes is the standard choice for text classification tasks, including [document classification](/wiki/document_classification), [sentiment analysis](/wiki/sentiment_analysis), and topic categorization. It works with bag-of-words and [TF-IDF](/wiki/tf_idf) feature representations.[7]

**Advantages:** Handles high-dimensional sparse data efficiently; naturally accounts for word frequency information; widely used and well-tested for [natural language processing](/wiki/natural_language_processing) tasks.

**Limitations:** Formally assumes features are counts (non-negative integers), although fractional values such as TF-IDF weights often work in practice;[10] does not capture word order or semantic relationships.
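
The count-based estimate and the resulting document likelihood can be sketched as follows. The per-class word counts are invented for illustration, and no smoothing is applied, so the example only works because every word in the test document occurs in both classes.

```python
import math
from collections import Counter

# Hypothetical per-class word counts gathered from training documents.
class_word_counts = {
    "sports": Counter({"goal": 8, "team": 6, "vote": 1}),
    "politics": Counter({"goal": 1, "team": 2, "vote": 9}),
}

def word_probs(counts):
    """Unsmoothed P(w | y): count of w in class y / total word count in class y."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_likelihood(doc_counts, probs):
    """log of prod_i P(w_i | y)^(x_i); the multinomial coefficient is
    dropped because it is the same for every class."""
    return sum(x * math.log(probs[w]) for w, x in doc_counts.items())

doc = Counter("goal team goal".split())  # word counts: goal=2, team=1
scores = {y: log_likelihood(doc, word_probs(c)) for y, c in class_word_counts.items()}
print(scores)
print(max(scores, key=scores.get))
```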

### Bernoulli Naive Bayes

Bernoulli Naive Bayes models features as binary (Boolean) variables, indicating the presence or absence of a feature. For text, this means each feature represents whether a particular word appears in a document, regardless of how many times.

The likelihood for class y is:

P(**x** | y) = product from i=1 to n of [P(x_i = 1 | y)^(x_i) * (1 - P(x_i = 1 | y))^(1 - x_i)]

A distinctive property of Bernoulli Naive Bayes is that it explicitly models the absence of features (when x_i = 0), penalizing the non-occurrence of features that are expected for a given class. Multinomial Naive Bayes, by contrast, simply ignores features with zero counts.[3]

**When to use:** Bernoulli Naive Bayes is appropriate when features are binary. In text classification, it is best suited for short documents or small vocabularies where word presence/absence is more informative than word frequency.

**Advantages:** Explicitly penalizes absent features, which can improve classification when feature absence is informative; works well with binary feature vectors.

**Limitations:** Discards frequency information; the multivariate Bernoulli model tends to perform worse than the multinomial model when the vocabulary is large.[3] In the McCallum and Nigam (1998) study that named these two event models, the multinomial model gave on average a 27% reduction in classification error over the multivariate Bernoulli model at larger vocabulary sizes, while the Bernoulli model remained competitive only when the vocabulary was small.[3]
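
The effect of modeling absent words can be seen in a small sketch. The probabilities below are made up, equal class priors are assumed, and `presence_only_score` is a hypothetical helper that scores only the words that occur, included purely as a contrast to the full Bernoulli likelihood.

```python
import math

# Hypothetical P(x_i = 1 | y) for a three-word vocabulary.
vocab = ["goal", "team", "vote"]
p_present = {
    "sports": {"goal": 0.95, "team": 0.6, "vote": 0.1},
    "politics": {"goal": 0.1, "team": 0.4, "vote": 0.3},
}

def presence_only_score(present, probs):
    """Scores only the words that occur; absent words contribute nothing."""
    return sum(math.log(probs[w]) for w in present)

def bernoulli_score(present, probs):
    """Full Bernoulli log-likelihood: every vocabulary word contributes,
    with log(1 - p) for words that are absent."""
    return sum(
        math.log(probs[w]) if w in present else math.log(1.0 - probs[w])
        for w in vocab
    )

doc = {"team"}  # "goal" and "vote" are absent
presence_pick = max(p_present, key=lambda y: presence_only_score(doc, p_present[y]))
bernoulli_pick = max(p_present, key=lambda y: bernoulli_score(doc, p_present[y]))
print(presence_pick, bernoulli_pick)
```

In this toy setting the missing word "goal", which is almost always present in the sports class, is enough to change the decision once absences are scored.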

### Complement Naive Bayes

Complement Naive Bayes (CNB), introduced by Rennie et al. in 2003, addresses some of the issues with standard Multinomial Naive Bayes on imbalanced datasets.[4] Instead of estimating the parameters for each class using documents belonging to that class, CNB uses the complement of each class (all documents not in the class) to estimate parameters. This approach often outperforms standard Multinomial Naive Bayes, especially on datasets with uneven class distributions.[4] Rennie and colleagues reported that their corrected Naive Bayes is a fast algorithm that is competitive with state-of-the-art text classifiers such as the [support vector machine](/wiki/support_vector_machine).[4]
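
The core idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the full method of Rennie et al.: it omits class priors and the weight normalization and other transforms described in the paper, uses add-one smoothing, and runs on invented counts in which one class has far more training text than the others.

```python
import math
from collections import Counter

# Hypothetical per-class word counts; "sports" dominates the training data.
class_word_counts = {
    "sports": Counter({"goal": 40, "team": 30, "vote": 2}),
    "politics": Counter({"goal": 1, "team": 2, "vote": 6}),
    "tech": Counter({"goal": 1, "team": 3, "chip": 7}),
}
vocab = sorted({w for c in class_word_counts.values() for w in c})

def complement_log_probs(target, alpha=1.0):
    """Smoothed log P(w | not target), pooled over every other class."""
    comp = Counter()
    for label, counts in class_word_counts.items():
        if label != target:
            comp.update(counts)
    total = sum(comp.values()) + alpha * len(vocab)
    return {w: math.log((comp[w] + alpha) / total) for w in vocab}

def predict_cnb(doc_counts):
    """Pick the class whose complement explains the document worst."""
    def complement_score(label):
        logp = complement_log_probs(label)
        return sum(x * logp[w] for w, x in doc_counts.items() if w in logp)
    return min(class_word_counts, key=complement_score)

print(predict_cnb(Counter({"vote": 2, "team": 1})))
```

Because each class's parameters are estimated from all the other classes combined, a class with little training text still gets estimates based on a comparatively large pool of data.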

### Comparison of Naive Bayes Variants

| Variant | Feature Type | Distribution | Best For | Key Property |
|---|---|---|---|---|
| Gaussian | Continuous | Normal (Gaussian) | Numeric data, sensor data, measurements | Estimates a mean and variance per feature and class |
| Multinomial | Discrete counts | Multinomial | Text classification with word counts | Uses word frequency |
| Bernoulli | Binary (0/1) | Bernoulli | Short texts, binary feature presence | Penalizes absent features |
| Complement | Discrete counts | Complement of multinomial | Imbalanced text datasets | Uses complement class statistics |

## How is Naive Bayes used for text classification?

Naive Bayes classifiers are among the most widely used algorithms for text classification. The standard pipeline for text classification with Naive Bayes involves:[7]

1. **Preprocessing:** Tokenize the text, remove stop words, apply stemming or lemmatization.
2. **Feature extraction:** Represent each document as a [bag of words](/wiki/bag_of_words) (word counts) or TF-IDF vector.
3. **Training:** Estimate prior probabilities P(y) and likelihoods P(x_i | y) from the training data.
4. **Classification:** For a new document, compute the posterior for each class and assign the class with the highest posterior.

Naive Bayes handles high-dimensional text data efficiently because training only requires a single pass through the data to collect word counts per class. The time complexity is O(m * d) for training and O(c * d) for classification, where m is the number of training examples, d is the vocabulary size, and c is the number of classes. The scikit-learn documentation attributes this speed to the way the independence assumption decouples the per-feature distributions: "Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution."[10]

### Example: Topic Classification

Consider classifying news articles into categories such as "Sports," "Politics," and "Technology." The Multinomial Naive Bayes classifier would:

- Estimate P(Sports), P(Politics), and P(Technology) from the proportion of articles in each category.
- Estimate P("goal" | Sports), P("goal" | Politics), etc. for every word in the vocabulary.
- For a new article containing words like "goal," "match," "team," and "scored," compute the posterior for each class and assign the most probable one.
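
The topic-classification example can be sketched with scikit-learn's `CountVectorizer` and `MultinomialNB`. The six training sentences are invented for illustration and are far too few for a real classifier; the built-in English stop-word list stands in for the preprocessing step, and stemming is omitted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; real use needs far more data.
train_texts = [
    "the team scored a late goal in the match",
    "the striker scored twice and the team won the match",
    "parliament passed the budget after a long vote",
    "the senator won the election vote",
    "the new phone has a faster chip and better software",
    "the startup released software for the new laptop chip",
]
train_labels = ["Sports", "Sports", "Politics", "Politics", "Technology", "Technology"]

# CountVectorizer builds bag-of-words counts; MultinomialNB estimates
# class priors and smoothed per-class word probabilities (alpha=1.0 by default).
clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["goal match team scored"]))
print(clf.predict(["software for a faster chip"]))
```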

## How does Naive Bayes power spam filtering?

One of the most famous applications of Naive Bayes is email [spam filtering](/wiki/spam_filtering). The history of Bayesian spam filtering illustrates both the power and the practical appeal of the algorithm.

### When did Bayesian spam filtering start?

Bayesian approaches to spam filtering date back to at least 1998, when Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz published one of the first scholarly papers on the topic, "A Bayesian Approach to Filtering Junk E-Mail," presented at the AAAI Workshop on Learning for Text Categorization.[5] Bayesian spam filtering then gained widespread popularity in August 2002 when Paul Graham published his influential essay "A Plan for Spam."[6] Graham demonstrated that a simple Bayesian classifier trained on a user's own email could achieve remarkably low false positive rates, outperforming the handcrafted rule-based filters that were standard at the time.[6]

Graham reported strikingly low error in the essay: "Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives."[6] He also argued that probabilistic scoring was the key advantage over earlier rule-based scoring systems: "The Bayesian approach assigns an actual probability. The problem with a 'score' is that no one knows what it means."[6] His approach worked by computing the probability that a word appeared in spam versus legitimate email ("ham") and combining these individual probabilities using Bayes' theorem. The system learned from the user's own data, adapting to the specific types of spam they received.[6] Following Graham's essay, Bayesian spam filters were incorporated into email clients and tools such as Mozilla Thunderbird, SpamAssassin, SpamBayes, and others.

### How does a Naive Bayes spam filter work?

A Naive Bayes spam filter classifies an email as spam or ham based on the words it contains:

1. **Training phase:** The classifier counts how often each word appears in known spam and known ham emails.
2. **Classification phase:** For a new email with words w_1, w_2, ..., w_n:
   - Compute P(spam) * P(w_1 | spam) * P(w_2 | spam) * ... * P(w_n | spam)
   - Compute P(ham) * P(w_1 | ham) * P(w_2 | ham) * ... * P(w_n | ham)
   - Classify as spam if the spam posterior exceeds the ham posterior.

Words like "free," "winner," "click," and "unsubscribe" would have high P(word | spam), while words like a recipient's name or workplace would have high P(word | ham).
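
The two phases above can be sketched from scratch as follows. The five training messages are invented, whitespace splitting stands in for a real tokenizer, and two choices are assumptions of this sketch rather than part of the description above: add-one smoothing is applied so that a word seen in only one class does not zero out the other, and words never seen in training are skipped.

```python
import math
from collections import Counter

# Hypothetical labeled messages.
train = [
    ("win a free prize click now", "spam"),
    ("free winner click to claim", "spam"),
    ("meeting notes attached for review", "ham"),
    ("lunch tomorrow after the meeting", "ham"),
    ("please review the attached report", "ham"),
]

# Training phase: count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def log_posterior(text, label, alpha=1.0):
    """log P(label) + sum of log P(w | label), up to a shared constant."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for w in text.split():
        if w in vocab:  # words never seen in training are skipped
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
    return score

def classify(text):
    # Classification phase: compare the two (log) posteriors.
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

print(classify("click to win a free prize"))
print(classify("notes from the meeting"))
```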

## What is Laplace smoothing in Naive Bayes?

A practical problem arises when a word appears in test data but was never observed in one of the classes during training. In this case, P(x_i | y) = 0, and because Naive Bayes multiplies probabilities, a single zero probability makes the entire posterior zero, regardless of all other evidence. This is known as the **zero-frequency problem**.[7]

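**Laplace smoothing** (a form of additive smoothing) addresses this by adding a small constant alpha to every count. Strictly speaking, "Laplace smoothing" refers to alpha = 1, while the general case with other positive values of alpha is called Lidstone smoothing:[7]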

P(x_i | y) = (count(x_i, y) + alpha) / (count(y) + alpha * |V|)

Where:

| Symbol | Meaning |
|---|---|
| count(x_i, y) | Number of times feature x_i appears in class y |
| count(y) | Total count of all features in class y |
| alpha | Smoothing parameter |
| \|V\| | Size of the vocabulary (total number of distinct features) |

Common choices for alpha:

| Alpha Value | Name | Effect |
|---|---|---|
| 0 | No smoothing | Zero probabilities possible; can fail on unseen features |
| 0.5 | Jeffreys prior | Moderate smoothing |
| 1 | Laplace smoothing (add-one) | Each feature is "seen" at least once per class |
| Tuned value | Lidstone smoothing | Optimized via cross-validation |

When alpha = 1, the formula acts as if every word was observed one additional time in every class, ensuring that no probability estimate is zero. In practice, treating alpha as a [hyperparameter](/wiki/hyperparameter) and tuning it via [cross-validation](/wiki/cross-validation) often yields better results than using a fixed value.
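
The formula translates directly into code. In the sketch below, the function name and the counts for a single class with a three-word vocabulary are hypothetical.

```python
def smoothed_prob(count_xy, count_y, vocab_size, alpha=1.0):
    """(count(x_i, y) + alpha) / (count(y) + alpha * |V|)"""
    return (count_xy + alpha) / (count_y + alpha * vocab_size)

# Hypothetical class with a 3-word vocabulary and 10 tokens in total.
counts = {"free": 7, "click": 3, "meeting": 0}
total = sum(counts.values())

for alpha in (0.0, 0.5, 1.0):
    probs = {w: smoothed_prob(c, total, len(counts), alpha) for w, c in counts.items()}
    print(alpha, probs)
```

With alpha = 0 the unseen word "meeting" gets probability zero; any positive alpha gives it a small non-zero share while the estimates still sum to one.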

## What are the advantages and disadvantages of Naive Bayes?

### Advantages

| Advantage | Explanation |
|---|---|
| Fast training and prediction | Training requires only a single pass through the data; prediction involves computing products of stored probabilities |
| Scales well to high dimensions | Handles thousands or millions of features (e.g., large vocabularies in text) without difficulty |
| Works well with small datasets | Requires relatively few training examples to estimate parameters reliably |
| Robust to irrelevant features | Irrelevant features contribute roughly equally to all classes, so they do not strongly affect classification |
| Easy to implement | The algorithm is straightforward and has few hyperparameters |
| Naturally handles multi-class problems | Extends directly to multiple classes without modification |
| Good baseline | Often provides a surprisingly strong baseline that more complex models struggle to beat |
| Interpretable | The learned probabilities are easy to inspect and understand |

### Disadvantages

| Disadvantage | Explanation |
|---|---|
| Independence assumption is unrealistic | Features in real data are almost always correlated; this can lead to poorly calibrated probability estimates |
| Poor probability estimates | While class rankings may be correct, the raw probability values are often unreliable |
| Sensitive to feature engineering | Performance depends heavily on how features are represented and selected |
| Cannot learn feature interactions | The independence assumption prevents the model from capturing relationships between features |
| Zero-frequency problem | Without smoothing, unseen features cause zero probabilities |
| Bias toward classes with more features | In text, longer documents tend to be favored unless normalization is applied |

The scikit-learn documentation states this calibration caveat plainly: "although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously."[10]
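
One way to see why the probabilities drift is to feed the model the same evidence several times. The sketch below is a hypothetical illustration: a single feature with a 3:1 likelihood ratio is duplicated, which Naive Bayes treats as independent evidence each time.

```python
def posterior_positive(likelihood_ratio, copies, prior=0.5):
    """Posterior P(pos | x) when one feature with ratio
    P(x | pos) / P(x | neg) is counted `copies` times as if independent."""
    odds = (prior / (1.0 - prior)) * likelihood_ratio ** copies
    return odds / (1.0 + odds)

# One piece of evidence with a 3:1 likelihood ratio (hypothetical).
for copies in (1, 2, 5, 10):
    print(copies, round(posterior_positive(3.0, copies), 4))
```

The predicted class never changes, but the reported probability moves toward 1 as perfectly correlated copies are added, which is the pattern of correct rankings with overconfident estimates described above.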

## How does Naive Bayes differ from logistic regression?

Naive Bayes and [logistic regression](/wiki/logistic_regression) are both widely used for classification, but they represent fundamentally different approaches. Naive Bayes is a **generative model** that learns the joint probability P(X, y) and derives the decision boundary from it, while logistic regression is a **discriminative model** that directly learns the conditional probability P(y | X).[2]

This distinction was formally analyzed by [Andrew Ng](/wiki/andrew_ng) and Michael Jordan in their paper "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes," published in Advances in Neural Information Processing Systems 14 ([NeurIPS](/wiki/neurips)) in 2001.[2]

| Aspect | Naive Bayes | Logistic Regression |
|---|---|---|
| Model type | Generative | Discriminative |
| What it models | P(X, y) = P(X \| y) * P(y) | P(y \| X) directly |
| Feature independence | Assumes conditional independence | Does not require independence |
| Training speed | Very fast (closed-form parameter estimation) | Requires iterative optimization |
| Data efficiency | Approaches its asymptotic error with O(log n) examples (n = number of features) | Approaches its asymptotic error with O(n) examples |
| Asymptotic accuracy | Lower (bounded by independence assumption) | Higher (no independence constraint) |
| Small dataset performance | Often better | May overfit |
| Large dataset performance | Often worse | Often better |
| Handling correlated features | Cannot model correlations | Handles correlations naturally |
| Probability calibration | Often poorly calibrated | Generally better calibrated |
| Regularization | Laplace smoothing | L1 (Lasso), L2 (Ridge) penalties |

### Key Insights from Ng and Jordan (2001)

Ng and Jordan's analysis revealed two important findings:

1. **Asymptotic performance:** With infinite training data, logistic regression achieves lower classification error than Naive Bayes when the independence assumption is violated (which is almost always the case in practice).[2]

2. **Convergence rate:** Naive Bayes reaches its (higher) asymptotic error rate much faster, requiring only O(log n) training examples, where n is the number of features, whereas logistic regression requires O(n) examples to reach its (lower) asymptotic error. This means that Naive Bayes can outperform logistic regression when training data is scarce.[2]

The paper challenged the prevailing view that discriminative classifiers should almost always be preferred, arguing instead that there are two distinct regimes of performance as the training set grows: while discriminative learning has lower asymptotic error, a generative classifier such as Naive Bayes may approach its higher asymptotic error much faster.[2]

In practical terms: for small datasets where features are reasonably independent, Naive Bayes may be the better choice. For large datasets with correlated features, logistic regression (or other discriminative models) will typically outperform Naive Bayes.

## What are the practical considerations for using Naive Bayes?

### When should you use Naive Bayes?

Naive Bayes is a strong choice in the following scenarios:

- **Text classification and NLP:** Document categorization, spam detection, [sentiment analysis](/wiki/sentiment_analysis), language identification.
- **Real-time prediction:** When low latency is required because of its fast prediction time.
- **Baseline model:** As a first-pass classifier before investing in more complex models.
- **High-dimensional data:** When the number of features is very large relative to the number of training examples.
- **Multi-class problems:** When there are many classes and computational efficiency matters.

### When should you avoid Naive Bayes?

- When features are highly correlated and the independence assumption causes significant errors.
- When well-calibrated probability estimates (not just class rankings) are needed.
- When feature interactions are important for classification.
- When the dataset is large enough for discriminative models to substantially outperform generative ones.

### Implementation in scikit-learn

The [scikit-learn](/wiki/scikit_learn) library provides implementations of all major Naive Bayes variants in its sklearn.naive_bayes module:[10]

| Class | Variant | Typical Use Case |
|---|---|---|
| GaussianNB | Gaussian Naive Bayes | Continuous features |
| MultinomialNB | Multinomial Naive Bayes | Word counts, TF-IDF |
| BernoulliNB | Bernoulli Naive Bayes | Binary features |
| ComplementNB | Complement Naive Bayes | Imbalanced text classification |
| CategoricalNB | Categorical Naive Bayes | Categorical features |
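
A minimal usage sketch with `GaussianNB` on the Iris dataset bundled with scikit-learn is shown below. The 70/30 split and the fixed `random_state` are arbitrary choices made for reproducibility; the other classes in the table expose the same `fit`, `predict`, and `predict_proba` methods.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = GaussianNB()
model.fit(X_train, y_train)  # estimates per-class priors, means, and variances

accuracy = model.score(X_test, y_test)           # mean accuracy on held-out data
probabilities = model.predict_proba(X_test[:3])  # one row per sample, one column per class
print(f"held-out accuracy: {accuracy:.2f}")
print(probabilities.shape)
```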

## What are the extensions and variants of Naive Bayes?

Several extensions have been proposed to address the limitations of standard Naive Bayes:

- **Semi-naive Bayes:** Relaxes the independence assumption by allowing limited dependencies between features, for example in tree-augmented naive Bayes (TAN), where each feature may depend on at most one other feature in addition to the class.
- **Bayesian networks:** Generalize Naive Bayes by allowing arbitrary conditional dependency structures among features.
- **Averaged one-dependence estimators (AODE):** Average over all possible one-feature-dependent models to reduce the impact of the independence assumption.
- **Selective Naive Bayes:** Uses [feature selection](/wiki/feature_selection) to remove highly correlated features before applying Naive Bayes, partially mitigating the independence assumption.

## References

1. Bayes, T. (1763). "An Essay towards Solving a Problem in the Doctrine of Chances." Communicated by Richard Price. Philosophical Transactions of the Royal Society of London, 53, 370-418.
2. Ng, A. Y., & Jordan, M. I. (2001). "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes." Advances in Neural Information Processing Systems 14 (NeurIPS). https://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes
3. McCallum, A., & Nigam, K. (1998). "A Comparison of Event Models for Naive Bayes Text Classification." AAAI-98 Workshop on Learning for Text Categorization, 41-48. https://aaai.org/papers/041-ws98-05-007/
4. Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). "Tackling the Poor Assumptions of Naive Bayes Text Classifiers." Proceedings of the 20th International Conference on Machine Learning (ICML), 616-623. https://aaai.org/papers/icml03-081-tackling-the-poor-assumptions-of-naive-bayes-text-classifiers/
5. Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). "A Bayesian Approach to Filtering Junk E-Mail." AAAI Workshop on Learning for Text Categorization. https://aaai.org/papers/055-ws98-05-009/
6. Graham, P. (2002). "A Plan for Spam." Available at: https://paulgraham.com/spam.html
7. Manning, C. D., Raghavan, P., & Schütze, H. (2008). "Introduction to Information Retrieval." Cambridge University Press. Chapter 13: Text classification and Naive Bayes.
8. Zhang, H. (2004). "The Optimality of Naive Bayes." Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference (FLAIRS). https://aaai.org/papers/flairs-2004-097/
9. Hand, D. J., & Yu, K. (2001). "Idiot's Bayes: Not So Stupid After All?" International Statistical Review, 69(3), 385-398.
10. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830. Naive Bayes user guide: https://scikit-learn.org/stable/modules/naive_bayes.html