A bigram is a sequence of two consecutive elements (typically words or characters) extracted from a body of text or speech. In natural language processing (NLP) and computational linguistics, bigrams serve as the foundational building block for statistical language models, text analysis, and a wide range of applications including text prediction, speech recognition, and spell checking. The bigram is a specific case of the more general n-gram, where n equals 2.
Bigram models estimate the probability of a word based solely on the word that immediately precedes it, an assumption rooted in Markov chain theory. Despite their simplicity, bigram models have been instrumental in the development of modern NLP and continue to be used in practical systems today.
Imagine you are reading a book, and you cover up all the words except the one you just read. Now you try to guess the next word based only on that one word. If the word you see is "ice," you might guess "cream" because you have heard "ice cream" many times before. That is exactly what a bigram does. It looks at pairs of words that sit next to each other and remembers how often each pair shows up. Then, when it sees one word, it guesses the next word by picking the one that appeared most often after it. A bigram is simply two words, side by side, treated as a pair.
Formally, a bigram is an ordered pair of consecutive tokens $(w_1, w_2)$ drawn from a sequence of text. Given a sentence like "The cat sat on the mat," the word-level bigrams are:
| Position | Bigram |
|---|---|
| 1 | (The, cat) |
| 2 | (cat, sat) |
| 3 | (sat, on) |
| 4 | (on, the) |
| 5 | (the, mat) |
A bigram can also be defined at the character level. For the word "hello," the character bigrams are: (h, e), (e, l), (l, l), (l, o).
The total number of possible word-level bigrams for a vocabulary of size $V$ is $V^2$. In practice, only a small fraction of these pairs actually occur in any given corpus, which creates the well-known problem of data sparsity.
The central quantity of interest in a bigram model is the conditional probability of a word $w_n$ given the preceding word $w_{n-1}$, written as:

$$P(w_n \mid w_{n-1})$$
This probability is typically estimated using Maximum Likelihood Estimation (MLE). The MLE formula for a bigram probability is:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$

where $C(w_{n-1}, w_n)$ is the count of the bigram $(w_{n-1}, w_n)$ in the training corpus and $C(w_{n-1})$ is the count of the word $w_{n-1}$.
For example, suppose a corpus contains 1,000 occurrences of the word "artificial" and 800 of those are followed by "intelligence." The bigram probability P("intelligence" | "artificial") would be 800/1,000 = 0.8.
To compute the probability of an entire sentence, the chain rule of probability is applied. For a sentence of words $w_1, w_2, \ldots, w_n$, the bigram model approximates the joint probability as:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots \times P(w_n \mid w_{n-1})$$
In practice, sentence boundary tokens such as <s> (start) and </s> (end) are often added so that the model can capture which words tend to begin and end sentences.
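As a minimal sketch of how this decomposition can be computed, assuming a hypothetical lookup function bigram_prob(prev, word) that returns P(word | prev) from an already trained (and smoothed) model:

import math

def sentence_log_prob(words, bigram_prob):
    # Add boundary tokens so the first and last words are also conditioned,
    # then sum log probabilities over consecutive pairs.
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob(prev, word))
               for prev, word in zip(padded, padded[1:]))

Working in log space is standard practice here: multiplying many probabilities below 1 quickly underflows floating-point arithmetic, whereas summing their logarithms does not.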
The bigram model relies on the first-order Markov assumption: the probability of each word depends only on the immediately preceding word, not on any earlier words. Formally:

$$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-1})$$
This simplification dramatically reduces the number of parameters the model must estimate. Instead of conditioning on the full history of all preceding words (which would require an impossibly large number of parameters), the model only needs to store probabilities for each pair of words.
The term "Markov" refers to the Russian mathematician Andrey Markov, who first studied stochastic processes with this memoryless property in the early 1900s. A bigram model is specifically a first-order Markov model because it looks exactly one step into the past. By contrast, a trigram model is a second-order Markov model (conditioning on the two preceding words), and a unigram model is a zero-order model (no conditioning at all).
A bigram language model assigns a probability to sequences of words using bigram probabilities. Language models are central to many NLP tasks because they capture the statistical regularities of a language.
The idea of using statistical models of language dates back to Claude Shannon's landmark 1948 paper, "A Mathematical Theory of Communication." Shannon demonstrated that language has measurable statistical properties and used n-gram models (including bigrams) to generate text by sampling words one at a time based on conditional probabilities. His experiments showed that bigram-generated text was noticeably more coherent than unigram-generated text, though still far from natural language.
In the 1980s and 1990s, bigram and trigram language models became standard tools in speech recognition and machine translation systems. Frederick Jelinek and his team at IBM pioneered the use of n-gram models for speech recognition, famously remarking that "every time I fire a linguist, the performance of the speech recognizer goes up," reflecting the power of purely statistical approaches.
Training a bigram model involves the following steps:

1. Collect and tokenize a training corpus, optionally adding sentence boundary tokens such as <s> and </s>.
2. Count every bigram $C(w_{n-1}, w_n)$ and every unigram $C(w_{n-1})$ that occurs in the corpus.
3. Estimate each conditional probability $P(w_n \mid w_{n-1})$ using the MLE formula above.
4. Optionally apply a smoothing technique (discussed below) so that unseen bigrams receive nonzero probability.
The result is a probability table (or matrix) that maps each word pair to a probability value. This table can be stored efficiently using sparse data structures since most word pairs never occur.
Bigram models are commonly evaluated using perplexity, a metric rooted in information theory. Perplexity measures how "surprised" the model is by new text. Lower perplexity indicates a better model. For a bigram model, the perplexity on a test set of N words is:
$$PP = P(w_1, w_2, \ldots, w_N)^{-1/N}$$
Perplexity was introduced as a language model evaluation metric by Frederick Jelinek, Robert Mercer, and their colleagues at IBM in 1977. A bigram model typically achieves lower perplexity than a unigram model but higher perplexity than a trigram model on the same test data, reflecting the trade-off between model complexity and predictive accuracy.
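A small sketch of the computation, again assuming a hypothetical bigram_prob(prev, word) function that already handles unseen pairs through smoothing:

import math

def perplexity(test_tokens, bigram_prob):
    # PP = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the test pairs.
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_p = sum(math.log(bigram_prob(prev, word)) for prev, word in pairs)
    return math.exp(-log_p / len(pairs))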
The bigram sits between the unigram and the trigram in terms of complexity and context window. The following table compares these three n-gram orders:
| Feature | Unigram (n=1) | Bigram (n=2) | Trigram (n=3) |
|---|---|---|---|
| Context used | None | 1 preceding word | 2 preceding words |
| Markov order | 0th order | 1st order | 2nd order |
| Parameters (vocabulary size V) | V | $V^2$ | $V^3$ |
| Data sparsity | Low | Moderate | High |
| Generated text quality | Incoherent word salad | Some local coherence | Noticeably more fluent |
| Training data needed | Little | Moderate | Large |
| Typical perplexity | Highest | Medium | Lowest |
As the n-gram order increases, the model captures more context and produces better probability estimates, but the number of possible n-grams grows exponentially. This exponential growth means that higher-order models require vastly more training data to obtain reliable estimates and are more susceptible to the zero-frequency problem for unseen n-grams.
A fundamental problem with bigram models estimated by MLE is that any bigram not observed in the training data receives a probability of zero. Since multiplying by zero makes the probability of an entire sentence zero, this is a serious practical issue. Several smoothing techniques have been developed to redistribute probability mass to unseen bigrams.
The simplest smoothing method, Laplace (add-one) smoothing, adds 1 to every bigram count before computing probabilities:
$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}$$
where V is the vocabulary size. This ensures no bigram has zero probability. However, Laplace smoothing tends to over-allocate probability to unseen events, especially when the vocabulary is large, making it a poor choice for real-world applications.
Add-k smoothing is a generalization of Laplace smoothing in which a fractional count k (typically between 0.01 and 1) is added to each count instead of 1:

$$P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + kV}$$
The value of k can be tuned on held-out data to find the best balance between smoothing and fidelity to the observed counts.
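Both methods can be sketched in a single function (Laplace smoothing is the special case k = 1); the bigram_counts and unigram_counts arguments are assumed to be Counter objects like those built in the implementation section later in this article:

def add_k_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # Add k to the bigram count and k * V to the denominator so that the
    # distribution over all V possible next words still sums to one.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)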
Good-Turing smoothing, developed by Alan Turing and I.J. Good during World War II for cryptographic applications, reassigns probability mass by using the frequency of frequencies. The basic idea is that the count of n-grams seen r times is adjusted based on the count of n-grams seen r+1 times. This method works well when there are many n-grams with low frequency counts.
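A rough sketch of the core count adjustment follows; real implementations additionally smooth the frequency-of-frequency values, which this simple version omits:

from collections import Counter

def good_turing_counts(bigram_counts):
    # N_r = how many distinct bigrams were seen exactly r times.
    freq_of_freq = Counter(bigram_counts.values())
    adjusted = {}
    for bigram, r in bigram_counts.items():
        # Adjusted count c* = (r + 1) * N_{r+1} / N_r; fall back to the raw
        # count when no bigram was seen exactly r + 1 times.
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        adjusted[bigram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted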
Kneser-Ney smoothing is widely regarded as the most effective smoothing method for n-gram models. Introduced by Reinhard Kneser and Hermann Ney in 1995, it uses absolute discounting (subtracting a fixed discount value d from each nonzero count) combined with a novel backoff distribution.
The key insight behind Kneser-Ney is the concept of continuation probability. Rather than using the raw unigram count of a word as the backoff, it asks: in how many different bigram contexts does this word appear? A word like "Francisco" has a high unigram count but appears in very few contexts (almost always after "San"). Kneser-Ney assigns it a low backoff probability, reflecting the fact that "Francisco" is unlikely to follow an arbitrary word.
The formula for interpolated Kneser-Ney smoothing of bigrams is:
$$P_{\text{KN}}(w_n \mid w_{n-1}) = \frac{\max(C(w_{n-1}, w_n) - d,\, 0)}{C(w_{n-1})} + \lambda(w_{n-1}) \, P_{\text{continuation}}(w_n)$$

where $\lambda(w_{n-1})$ is an interpolation weight that redistributes the discounted probability mass, and $P_{\text{continuation}}(w_n)$ is proportional to the number of distinct words that precede $w_n$ in the training data.
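A sketch of this interpolated formula, assuming bigram_counts and unigram_counts are Counters over a corpus that contains the context word; the discount d = 0.75 is a commonly used default, and the linear scans over bigram types are written for clarity rather than speed:

def kneser_ney_prob(prev, word, bigram_counts, unigram_counts, d=0.75):
    # Continuation probability: in how many distinct contexts does `word`
    # appear, normalized by the total number of distinct bigram types?
    p_cont = sum(1 for (_, w2) in bigram_counts if w2 == word) / len(bigram_counts)
    # Interpolation weight lambda(prev): the probability mass freed by
    # discounting each of prev's observed continuations by d.
    followers = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = d * followers / unigram_counts[prev]
    discounted = max(bigram_counts[(prev, word)] - d, 0) / unigram_counts[prev]
    return discounted + lam * p_cont

Using the number of distinct predecessors rather than raw counts is exactly what keeps a context-bound word like "Francisco" from receiving too much backoff probability.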
| Method | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Laplace (add-one) | Add 1 to all counts | Simple to implement | Over-smooths; poor for large vocabularies |
| Add-k | Add k < 1 to all counts | Tunable; better than Laplace | Still somewhat ad hoc |
| Good-Turing | Use frequency of frequencies | Principled statistical basis | Complex; unreliable at high counts |
| Kneser-Ney | Absolute discount + continuation probability | Best empirical performance | More complex to implement |
Bigram frequency analysis is a core technique for identifying collocations: pairs of words that occur together more often than chance would predict. Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and habitual adjective-noun pairings ("strong coffee").
To determine whether a bigram is a collocation rather than a coincidental pairing, several statistical measures are used:
| Measure | Description |
|---|---|
| Pointwise Mutual Information (PMI) | Compares the observed bigram probability to the probability expected if the two words were independent |
| Chi-squared test | Tests whether the co-occurrence of two words is statistically significant |
| Log-likelihood ratio | Compares the likelihood of the data under a model where the words are associated vs. independent |
| t-test | Tests whether the mean frequency of a bigram differs significantly from what is expected by chance |
Collocation extraction using bigrams is valuable in lexicography (identifying multi-word expressions for dictionaries), machine translation (treating collocations as translation units), and information retrieval (indexing multi-word phrases for search).
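For example, PMI for an adjacent word pair can be computed directly from corpus counts (a sketch; it assumes the pair occurs at least once in tokens):

import math
from collections import Counter

def pmi(w1, w2, tokens):
    # Compare the observed bigram probability with the probability expected
    # if the two words occurred independently of each other.
    unigrams = Counter(tokens)
    bigrams_ = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams_[(w1, w2)] / (len(tokens) - 1)
    p_independent = (unigrams[w1] / len(tokens)) * (unigrams[w2] / len(tokens))
    return math.log2(p_pair / p_independent)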
While word-level bigrams are the most commonly discussed type, character-level bigrams (also called character bigrams or letter bigrams) are widely used in several domains.
Different languages have distinctive character bigram frequency distributions. For example, "th" is extremely common in English but rare in French, while "qu" is frequent in French and Spanish but less so in German. By comparing the character bigram distribution of an unknown text against reference distributions for known languages, classifiers can identify the language with high accuracy.
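A minimal sketch of this approach; the overlap score and the reference_profiles mapping (language name to bigram frequency profile) are illustrative choices rather than a standard algorithm:

from collections import Counter

def char_bigram_profile(text):
    # Relative frequency of every adjacent character pair in the text.
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def identify_language(text, reference_profiles):
    # Choose the language whose reference profile overlaps most with the text,
    # where overlap sums the smaller of the two frequencies for shared bigrams.
    profile = char_bigram_profile(text)
    def overlap(ref):
        return sum(min(freq, ref.get(pair, 0.0)) for pair, freq in profile.items())
    return max(reference_profiles, key=lambda lang: overlap(reference_profiles[lang]))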
Character bigrams help spell checkers evaluate whether a word "looks right" for a given language. A misspelling like "teh" contains the rare English bigram (e, h), whereas the intended word "the" contains the very common bigram (h, e). Character bigram models assign low probability to sequences that violate typical letter patterns, helping flag potential errors.
Character n-grams, including bigrams, are used in approximate string matching algorithms. They allow search engines to find documents matching a query even when there are minor spelling variations or typos, because similar strings share many character bigrams.
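One widely used similarity measure built on character bigrams is the Sørensen-Dice coefficient; a short sketch:

def bigram_dice_similarity(a, b):
    # 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|), in [0, 1].
    bigrams_a, bigrams_b = set(zip(a, a[1:])), set(zip(b, b[1:]))
    if not bigrams_a and not bigrams_b:
        return 1.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))

print(bigram_dice_similarity("hello", "helo"))   # high: most bigrams are shared
print(bigram_dice_similarity("hello", "world"))  # 0.0: no shared bigrams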
Character bigram frequencies can serve as a stylometric feature for identifying the author of a text. Different writers tend to use different distributions of letter pairs, making character bigrams a lightweight but effective feature for authorship analysis.
Bigrams are used across a broad range of NLP and machine learning tasks:
Bigram models power simple next-word prediction systems. When a user types a word on a mobile keyboard, a bigram model can suggest the most likely next word based on the conditional probability P(next word | current word). Modern smartphone keyboards use bigrams as one component in a larger prediction pipeline that also incorporates trigrams, user history, and neural models.
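The suggestion step itself reduces to a ranked lookup; a sketch, assuming a nested-dictionary model of the kind trained in the implementation section later in this article:

def suggest_next(word, model, top_k=3):
    # Rank candidate continuations of `word` by P(candidate | word).
    candidates = model.get(word, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:top_k]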
In automatic speech recognition (ASR) systems, bigram language models help disambiguate acoustically similar utterances. When the acoustic model produces multiple candidate transcriptions, the language model scores each candidate based on how likely its word sequences are. Bigram models were the standard language model in ASR systems throughout the 1980s and 1990s before being supplemented by trigram and neural models.
Bigram statistics contribute to translation quality in statistical machine translation systems. The language model component evaluates how natural a candidate translation sounds in the target language by scoring its bigram probabilities.
Bigrams can capture sentiment-relevant phrases that individual words cannot. For example, the word "good" is positive, but the bigram "not good" reverses the sentiment. Including bigram features in sentiment analysis classifiers often improves accuracy over unigram-only models.
Bigram features are commonly used in text classification tasks, where the goal is to categorize documents into predefined classes. By incorporating bigrams, classification algorithms can leverage contextual information that unigrams miss, such as distinguishing "New York" (a location) from the separate words "new" and "York."
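In practice this is often done by adding bigram features to a bag-of-words representation; a sketch using scikit-learn (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I visited New York last week",
        "New regulations were announced in York"]
# ngram_range=(1, 2) extracts unigram and bigram features, so "new york"
# becomes its own feature alongside "new" and "york".
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())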
In information retrieval systems, bigrams improve search relevance by considering word co-occurrence. A query for "machine learning" should prioritize documents where these two words appear adjacently, not just documents that happen to contain both words independently.
Bigrams can be computed efficiently using standard programming libraries. Below are common approaches in Python.
The Natural Language Toolkit (NLTK) provides a built-in bigrams() function:
import nltk
from nltk import bigrams
from collections import Counter
text = "the cat sat on the mat"
tokens = nltk.word_tokenize(text)
bigram_list = list(bigrams(tokens))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(3))
Bigrams can be computed without any external library by using the built-in zip() function:
from collections import Counter
tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigram_list = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigram_list)
To build a simple bigram language model:
from collections import Counter, defaultdict
def train_bigram_model(tokens):
    # Count adjacent word pairs and single words, then convert each pair
    # count into a conditional probability P(w2 | w1).
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        model[w1][w2] = count / unigram_counts[w1]
    return model
tokens = "the cat sat on the mat the cat slept".split()
model = train_bigram_model(tokens)
print(model["the"]) # {'cat': 0.667, 'mat': 0.333}
Despite their usefulness, bigram models have significant limitations:

- Limited context: only the single preceding word is considered, so longer-range dependencies within a sentence are invisible to the model.
- Data sparsity: many perfectly valid word pairs never occur in the training corpus and receive zero probability without smoothing.
- No generalization: probabilities are estimated independently for every word pair, so the model cannot infer that semantically similar words should behave similarly.
- Storage: the parameter space grows as $V^2$, which becomes unwieldy for large vocabularies.
These limitations motivated the development of more powerful language models, including neural language models and ultimately transformer-based models such as BERT and GPT, which can capture long-range dependencies and semantic relationships. Nevertheless, bigram models remain valuable as baselines, as components in larger systems, and as educational tools for learning the foundations of statistical language modeling.
Although modern large language models have largely superseded n-gram models for tasks requiring deep language understanding, bigrams continue to play roles in contemporary NLP: