# Bigram

> Source: https://aiwiki.ai/wiki/bigram
> Updated: 2026-06-24
> Categories: Machine Learning, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **bigram** (also written 2-gram) is a contiguous sequence of two adjacent elements, typically two words or two characters, taken from a body of text or speech. In statistical [natural language processing](/wiki/natural_language_processing) (NLP) a **bigram language model** predicts each word from only the single word immediately before it, estimating the conditional probability P(w_n | w_n-1), and it is the classic building block of statistical [language models](/wiki/language_model), collocation detection, text classification, [speech recognition](/wiki/speech_recognition), and spell checking. The bigram is the n equals 2 case of the more general [n-gram](/wiki/n-gram): a unigram conditions on no prior word, a bigram on one, and a [trigram](/wiki/trigram) on two.[2]

Bigram models rest on a simplifying first-order [Markov](/wiki/markov_chain) assumption, the idea that a word's probability depends only on the word that immediately precedes it. The concept traces to Claude Shannon's 1948 paper "A Mathematical Theory of Communication," which used n-gram statistics (including bigrams) to model and generate English text.[1] Despite their simplicity, bigram models were the dominant language-modeling technology in [speech recognition](/wiki/speech_recognition) and [machine translation](/wiki/machine_translation) through the 1980s and 1990s, and they remain in use today as baselines, lightweight components, and teaching tools.

## Explain Like I'm 5 (ELI5)

Imagine you are reading a book, and you cover up all the words except the one you just read. Now you try to guess the next word based only on that one word. If the word you see is "ice," you might guess "cream" because you have heard "ice cream" many times before. That is exactly what a bigram does. It looks at pairs of words that sit next to each other and remembers how often each pair shows up. Then, when it sees one word, it guesses the next word by picking the one that appeared most often after it. A bigram is simply two words, side by side, treated as a pair.

## What is a bigram? Definition and notation

Formally, a bigram is an ordered pair of two consecutive [tokens](/wiki/token) (w_1, w_2) drawn from a sequence of text. Given a sentence like "The cat sat on the mat," the word-level bigrams are:

| Position | Bigram |
|---|---|
| 1 | (The, cat) |
| 2 | (cat, sat) |
| 3 | (sat, on) |
| 4 | (on, the) |
| 5 | (the, mat) |

A bigram can also be defined at the character level. For the word "hello," the character bigrams are: (h, e), (e, l), (l, l), (l, o).

The total number of possible word-level bigrams for a vocabulary of size V is V<sup>2</sup>. In practice, only a small fraction of these pairs actually occur in any given corpus, which creates the well-known problem of data sparsity.[8]

## How is bigram probability estimated?

The central quantity of interest in a bigram model is the conditional probability of a word w<sub>n</sub> given the preceding word w<sub>n-1</sub>, written as:

**P(w<sub>n</sub> | w<sub>n-1</sub>)**

This probability is typically estimated using Maximum Likelihood Estimation (MLE). As Jurafsky and Martin state, "we get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1."[2] The MLE formula for a bigram probability is:[2]

**P(w<sub>n</sub> | w<sub>n-1</sub>) = C(w<sub>n-1</sub>, w<sub>n</sub>) / C(w<sub>n-1</sub>)**

where C(w<sub>n-1</sub>, w<sub>n</sub>) is the count of the bigram (w<sub>n-1</sub>, w<sub>n</sub>) in the training corpus and C(w<sub>n-1</sub>) is the count of the word w<sub>n-1</sub>.

For example, suppose a corpus contains 1,000 occurrences of the word "artificial" and 800 of those are followed by "intelligence." The bigram probability P("intelligence" | "artificial") would be 800/1,000 = 0.8.

To compute the probability of an entire sentence, the chain rule of probability is applied.[2] For a sentence of words w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>, the bigram model approximates the joint probability as:

**P(w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) ≈ P(w<sub>1</sub>) × P(w<sub>2</sub> | w<sub>1</sub>) × P(w<sub>3</sub> | w<sub>2</sub>) × ... × P(w<sub>n</sub> | w<sub>n-1</sub>)**

In practice, sentence boundary tokens such as `<s>` (start) and `</s>` (end) are often added so that the model can capture which words tend to begin and end sentences.

## What is the Markov assumption?

The bigram model relies on the **first-order Markov assumption**: the probability of each word depends only on the immediately preceding word, not on any earlier words.[2] Formally:

**P(w<sub>n</sub> | w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n-1</sub>) ≈ P(w<sub>n</sub> | w<sub>n-1</sub>)**

This simplification dramatically reduces the number of parameters the model must estimate. Instead of conditioning on the full history of all preceding words (which would require an impossibly large number of parameters), the model only needs to store probabilities for each pair of words.

The term "Markov" refers to the Russian mathematician Andrey Markov, who first studied stochastic processes with this memoryless property in the early 1900s. Jurafsky and Martin note that "Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past."[2] A bigram model is specifically a first-order [Markov chain](/wiki/markov_chain) because it looks exactly one step into the past. By contrast, a [trigram](/wiki/trigram) model is a second-order Markov model (conditioning on the two preceding words), and a unigram model is a zero-order model (no conditioning at all).

## What is a bigram language model?

A **bigram language model** assigns a probability to sequences of words using bigram probabilities. In the words of Jurafsky and Martin, "models that assign probabilities to sequences of words are called language models or LMs."[2] Language models are central to many NLP tasks because they capture the statistical regularities of a language.

### When were bigram language models first used?

The idea of using statistical models of language dates back to Claude Shannon's landmark 1948 paper, "A Mathematical Theory of Communication."[1] Shannon demonstrated that language has measurable statistical properties and used n-gram models (including bigrams, which he called "digram" approximations) to generate text by sampling words one at a time based on conditional probabilities. His experiments showed that bigram-generated text was noticeably more coherent than unigram-generated text, though still far from natural language.[1]

In the 1980s and 1990s, bigram and trigram language models became standard tools in [speech recognition](/wiki/speech_recognition) and [machine translation](/wiki/machine_translation) systems. Frederick Jelinek and his team at IBM pioneered the use of n-gram models for speech recognition. Jelinek is famously associated with the quip "Every time I fire a linguist, the performance of the speech recognizer goes up," a remark reflecting the success of purely statistical approaches (its exact wording and date are disputed, and Jelinek later distanced himself from it).[3]

### How is a bigram language model trained?

Training a bigram model involves the following steps:

1. **Tokenize** the training corpus into words or subword units.
2. **Count** every bigram (w<sub>i</sub>, w<sub>j</sub>) in the corpus.
3. **Count** every unigram w<sub>i</sub> in the corpus.
4. **Compute** conditional probabilities P(w<sub>j</sub> | w<sub>i</sub>) = C(w<sub>i</sub>, w<sub>j</sub>) / C(w<sub>i</sub>) for all observed bigrams.
5. **Apply smoothing** to handle unseen bigrams (discussed below).

The result is a probability table (or matrix) that maps each word pair to a probability value. This table can be stored efficiently using sparse data structures since most word pairs never occur.

### How is a bigram model evaluated? Perplexity

Bigram models are commonly evaluated using **[perplexity](/wiki/perplexity)**, a metric rooted in information theory. Jurafsky and Martin define it as "the inverse probability of the test set, normalized by the number of words."[2] Intuitively, perplexity measures how "surprised" the model is by new text, and lower perplexity indicates a better model.[2] For a bigram model, the perplexity on a test set of N words is:

**PP = P(w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>N</sub>)<sup>-1/N</sup>**

Perplexity was introduced as a language model evaluation metric in 1977 by Frederick Jelinek, Robert Mercer, Lalit Bahl, and James Baker at IBM, in the context of speech recognition.[3] A bigram model typically achieves lower perplexity than a unigram model but higher perplexity than a trigram model on the same test data, reflecting the trade-off between model complexity and predictive accuracy.

## How does a bigram differ from a unigram and a trigram?

The bigram sits between the unigram and the trigram in terms of complexity and context window. The following table compares these three n-gram orders:

| Feature | Unigram (n=1) | Bigram (n=2) | [Trigram](/wiki/trigram) (n=3) |
|---|---|---|---|
| Context used | None | 1 preceding word | 2 preceding words |
| Markov order | 0th order | 1st order | 2nd order |
| Parameters (vocab V) | V | V<sup>2</sup> | V<sup>3</sup> |
| Data sparsity | Low | Moderate | High |
| Generated text quality | Incoherent word salad | Some local coherence | Noticeably more fluent |
| Training data needed | Little | Moderate | Large |
| Typical perplexity | Highest | Medium | Lowest |

As the n-gram order increases, the model captures more context and produces better probability estimates, but the number of possible n-grams grows exponentially. This exponential growth means that higher-order models require vastly more training data to obtain reliable estimates and are more susceptible to the zero-frequency problem for unseen n-grams.

## How do you handle unseen bigrams? Smoothing techniques

A fundamental problem with bigram models estimated by MLE is that any bigram not observed in the training data receives a probability of zero. Since multiplying by zero makes the probability of an entire sentence zero, this is a serious practical issue. Several smoothing techniques have been developed to redistribute probability mass to unseen bigrams.[5]

### Laplace (add-one) smoothing

The simplest smoothing method adds 1 to every bigram count before computing probabilities:

**P<sub>Laplace</sub>(w<sub>n</sub> | w<sub>n-1</sub>) = (C(w<sub>n-1</sub>, w<sub>n</sub>) + 1) / (C(w<sub>n-1</sub>) + V)**

where V is the vocabulary size. This ensures no bigram has zero probability. However, Laplace smoothing tends to over-allocate probability to unseen events, especially when the vocabulary is large, making it a poor choice for real-world applications.[2]

### Add-k smoothing

A generalization of Laplace smoothing where a fractional count k (typically between 0.01 and 1) is added instead of 1:

**P<sub>add-k</sub>(w<sub>n</sub> | w<sub>n-1</sub>) = (C(w<sub>n-1</sub>, w<sub>n</sub>) + k) / (C(w<sub>n-1</sub>) + kV)**

The value of k can be tuned on held-out data to find the best balance between smoothing and fidelity to the observed counts.

### Good-Turing smoothing

Good-Turing smoothing, developed by Alan Turing and I.J. Good during World War II for cryptographic applications, reassigns probability mass by using the frequency of frequencies.[9] The basic idea is that the count of n-grams seen r times is adjusted based on the count of n-grams seen r+1 times. This method works well when there are many n-grams with low frequency counts.

### Backoff and interpolation

Two complementary strategies combine estimates across n-gram orders. In **backoff**, the model uses the bigram estimate when the bigram has been seen, and otherwise "backs off" to the lower-order unigram estimate; Katz backoff (1987) is the classic example. In **interpolation**, the bigram and unigram estimates are always mixed together using weights, as in the deleted-interpolation method of Jelinek and Mercer.[3] Both approaches let a bigram model fall back gracefully on less specific evidence when a particular word pair is rare or unseen.

### Kneser-Ney smoothing

Kneser-Ney smoothing is widely regarded as the most effective smoothing method for n-gram models.[5] Introduced by Reinhard Kneser and Hermann Ney in 1995, it uses absolute discounting (subtracting a fixed discount value d from each nonzero count) combined with a novel backoff distribution.[4] In their large 1999 empirical study, Chen and Goodman concluded that a modified version of Kneser-Ney smoothing "consistently outperforms all other algorithms" they evaluated across training sizes and n-gram orders.[5]

The key insight behind Kneser-Ney is the concept of **continuation probability**. Rather than using the raw unigram count of a word as the backoff, it asks: in how many different bigram contexts does this word appear? A word like "Francisco" has a high unigram count but appears in very few contexts (almost always after "San"). Kneser-Ney assigns it a low backoff probability, reflecting the fact that "Francisco" is unlikely to follow an arbitrary word.

The formula for interpolated Kneser-Ney smoothing of bigrams is:

**P<sub>KN</sub>(w<sub>n</sub> | w<sub>n-1</sub>) = max(C(w<sub>n-1</sub>, w<sub>n</sub>) - d, 0) / C(w<sub>n-1</sub>) + λ(w<sub>n-1</sub>) × P<sub>continuation</sub>(w<sub>n</sub>)**

where λ(w<sub>n-1</sub>) is a normalizing constant and P<sub>continuation</sub>(w<sub>n</sub>) is proportional to the number of distinct words that precede w<sub>n</sub> in the training data.

### Summary of smoothing methods

| Method | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Laplace (add-one) | Add 1 to all counts | Simple to implement | Over-smooths; poor for large vocabularies |
| Add-k | Add k < 1 to all counts | Tunable; better than Laplace | Still somewhat ad hoc |
| Good-Turing | Use frequency of frequencies | Principled statistical basis | Complex; unreliable at high counts |
| Backoff / interpolation | Combine bigram with lower orders | Graceful fallback for rare pairs | Requires tuning weights or discounts |
| Kneser-Ney | Absolute discount + continuation probability | Best empirical performance | More complex to implement |

## What are bigram collocations?

Bigram frequency analysis is a core technique for identifying **collocations**: pairs of words that occur together more often than chance would predict.[6] Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and fixed expressions ("strong coffee").

To determine whether a bigram is a collocation rather than a coincidental pairing, several statistical measures are used:[6]

| Measure | Description |
|---|---|
| Pointwise Mutual Information (PMI) | Compares the observed bigram probability to the probability expected if the two words were independent |
| Chi-squared test | Tests whether the co-occurrence of two words is statistically significant |
| Log-likelihood ratio | Compares the likelihood of the data under a model where the words are associated vs. independent |
| t-test | Tests whether the mean frequency of a bigram differs significantly from what is expected by chance |

Collocation extraction using bigrams is valuable in lexicography (identifying multi-word expressions for dictionaries), [machine translation](/wiki/machine_translation) (treating collocations as translation units), and [information retrieval](/wiki/information_retrieval) (indexing multi-word phrases for search).

## Character-level bigrams

While word-level bigrams are the most commonly discussed type, character-level bigrams (also called character bigrams or letter bigrams) are widely used in several domains.

### Language identification

Different languages have distinctive character bigram frequency distributions. For example, "th" is extremely common in English but rare in French, while "qu" is frequent in French and Spanish but less so in German. By comparing the character bigram distribution of an unknown text against reference distributions for known languages, classifiers can identify the language with high accuracy.

### Spelling correction

Character bigrams help spell checkers evaluate whether a word "looks right" for a given language. A misspelled word like "teh" contains the unusual bigram (e, h) in a position where (h, e) would be expected. Character bigram models assign low probability to sequences that violate typical letter patterns, helping flag potential errors.

### Information retrieval

Character n-grams, including bigrams, are used in approximate string matching algorithms. They allow search engines to find documents matching a query even when there are minor spelling variations or typos, because similar strings share many character bigrams.

### Authorship attribution

Character bigram frequencies can serve as a stylometric feature for identifying the author of a text. Different writers tend to use different distributions of letter pairs, making character bigrams a lightweight but effective feature for authorship analysis.

## What is a bigram used for? Applications

Bigrams are used across a broad range of NLP and [machine learning](/wiki/machine_learning) tasks:

### Text prediction and autocomplete

Bigram models power simple next-word prediction systems. When a user types a word on a mobile keyboard, a bigram model can suggest the most likely next word based on the conditional probability P(next word | current word). Modern smartphone keyboards use bigrams as one component in a larger prediction pipeline that also incorporates trigrams, user history, and neural models.

### Speech recognition

In automatic [speech recognition](/wiki/speech_recognition) (ASR) systems, bigram language models help disambiguate acoustically similar utterances. When the acoustic model produces multiple candidate transcriptions, the language model scores each candidate based on how likely its word sequences are. Bigram models were the standard language model in ASR systems throughout the 1980s and 1990s before being supplemented by trigram and neural models.

### Machine translation

Bigram statistics contribute to translation quality in statistical [machine translation](/wiki/machine_translation) systems. The language model component evaluates how natural a candidate translation sounds in the target language by scoring its bigram probabilities.

### Sentiment analysis

Bigrams can capture sentiment-relevant phrases that individual words cannot. For example, the word "good" is positive, but the bigram "not good" reverses the sentiment. Including bigram features in [sentiment analysis](/wiki/sentiment_analysis) classifiers often improves accuracy over unigram-only models.

### Text classification

Bigram features are commonly used in text classification tasks, where the goal is to categorize documents into predefined classes. By incorporating bigrams, classification algorithms can leverage contextual information that unigrams miss, such as distinguishing "New York" (a location) from the separate words "new" and "York."

### Information retrieval

In [information retrieval](/wiki/information_retrieval) systems, bigrams improve search relevance by considering word co-occurrence. A query for "machine learning" should prioritize documents where these two words appear adjacently, not just documents that happen to contain both words independently.

## Implementation

Bigrams can be computed efficiently using standard programming libraries. Below are common approaches in Python.

### Using NLTK

The Natural Language Toolkit (NLTK) provides a built-in `bigrams()` function:[7]

```python
import nltk
from nltk import bigrams
from collections import Counter

text = "the cat sat on the mat"
tokens = nltk.word_tokenize(text)
bigram_list = list(bigrams(tokens))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]

bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(3))
```

### Using Python's collections.Counter

Bigrams can be computed without any external library by using the built-in `zip()` function:

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigram_list = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigram_list)
```

### Computing bigram probabilities

To build a simple bigram language model:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        model[w1][w2] = count / unigram_counts[w1]
    return model

tokens = "the cat sat on the mat the cat slept".split()
model = train_bigram_model(tokens)
print(model["the"])  # {'cat': 0.667, 'mat': 0.333}
```

## What are the limitations of bigram models?

Despite their usefulness, bigram models have significant limitations:

- **Limited context window.** By conditioning on only one preceding word, bigram models cannot capture long-range dependencies. In a sentence like "The dog that chased the cat ran away," a bigram model cannot connect "dog" to "ran" because they are separated by several words.
- **Data sparsity.** For a vocabulary of 50,000 words, there are 2.5 billion possible bigrams. Most of these will never appear in any training corpus, making reliable probability estimation difficult.[8]
- **No semantic understanding.** Bigram models treat words as arbitrary symbols with no understanding of meaning. They cannot distinguish between "bank" (financial institution) and "bank" (river bank) based on context beyond the immediately preceding word.
- **Sensitivity to domain.** A bigram model trained on news articles will perform poorly on medical texts or social media posts because the word co-occurrence patterns differ substantially across domains.
- **Inability to generalize.** Bigram models cannot generalize to novel word combinations that are semantically plausible but not observed in the training data. They have no mechanism for understanding that "ate breakfast" and "ate lunch" should have similar probabilities if "ate dinner" was observed frequently.

These limitations motivated the development of more powerful language models, including [neural language models](/wiki/neural_network) and ultimately [transformer](/wiki/transformer)-based models such as [BERT](/wiki/bert) and [GPT](/wiki/gpt), which can capture long-range dependencies and semantic relationships. Nevertheless, bigram models remain valuable as baselines, components in larger systems, and educational tools for understanding the foundations of [natural language understanding](/wiki/natural_language_understanding).

## How are bigrams related to modern NLP?

Although modern [large language models](/wiki/large_language_model) have largely superseded n-gram models for tasks requiring deep language understanding, bigrams continue to play roles in contemporary NLP:

- **Feature engineering.** Bigram features are still used in traditional [machine learning](/wiki/machine_learning) classifiers (such as [logistic regression](/wiki/logistic_regression) or [support vector machines](/wiki/support_vector_machine_svm)) for text classification tasks.
- **Tokenization.** Subword [tokenization](/wiki/tokenization) algorithms like [Byte Pair Encoding](/wiki/byte_pair_encoding) (BPE) use a process closely related to bigram frequency analysis, iteratively merging the most frequent adjacent pairs.
- **Evaluation.** BLEU (Bilingual Evaluation Understudy), widely used to evaluate machine translation quality, relies on modified n-gram precision for n from 1 to 4, including bigram precision, combined as a geometric mean with a brevity penalty.[10]
- **Efficiency.** For applications where computational resources are limited or latency requirements are strict, bigram models offer a lightweight alternative to neural models.

## References

1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27(3), 379-423.
2. Jurafsky, D. & Martin, J. H. (2024). *Speech and Language Processing* (3rd ed.), Chapter 3: N-gram Language Models. Stanford University. https://web.stanford.edu/~jurafsky/slp3/3.pdf
3. Jelinek, F. & Mercer, R. L. (1980). "Interpolated estimation of Markov source parameters from sparse data." *Proceedings of the Workshop on Pattern Recognition in Practice*.
4. Kneser, R. & Ney, H. (1995). "Improved backing-off for m-gram language modeling." *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*.
5. Chen, S. F. & Goodman, J. (1999). "An empirical study of smoothing techniques for language modeling." *Computer Speech and Language*, 13(4), 359-394.
6. Manning, C. D. & Schutze, H. (1999). *Foundations of Statistical Natural Language Processing*. MIT Press.
7. Bird, S., Klein, E., & Loper, E. (2009). *Natural Language Processing with Python*. O'Reilly Media.
8. Keller, F. & Lapata, M. (2003). "Using the Web to Obtain Frequencies for Unseen Bigrams." *Computational Linguistics*, 29(3), 459-484.
9. Good, I. J. (1953). "The population frequencies of species and the estimation of population parameters." *Biometrika*, 40(3-4), 237-264.
10. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)*.

