A bigram is a sequence of two consecutive elements (typically words or characters) extracted from a body of text or speech. In natural language processing (NLP) and computational linguistics, bigrams serve as the foundational building block for statistical language models, text analysis, and a wide range of applications including text prediction, speech recognition, and spell checking. The bigram is a specific case of the more general n-gram, where n equals 2.
Bigram models estimate the probability of a word based solely on the word that immediately precedes it, an assumption rooted in Markov chain theory. Despite their simplicity, bigram models have been instrumental in the development of modern NLP and continue to be used in practical systems today.
Imagine you are reading a book, and you cover up all the words except the one you just read. Now you try to guess the next word based only on that one word. If the word you see is "ice," you might guess "cream" because you have heard "ice cream" many times before. That is exactly what a bigram does. It looks at pairs of words that sit next to each other and remembers how often each pair shows up. Then, when it sees one word, it guesses the next word by picking the one that appeared most often after it. A bigram is simply two words, side by side, treated as a pair.
Formally, a bigram is an ordered pair of consecutive tokens $(w_1, w_2)$ drawn from a sequence of text. Given a sentence like "The cat sat on the mat," the word-level bigrams are:
| Position | Bigram |
|---|---|
| 1 | (The, cat) |
| 2 | (cat, sat) |
| 3 | (sat, on) |
| 4 | (on, the) |
| 5 | (the, mat) |
A bigram can also be defined at the character level. For the word "hello," the character bigrams are: (h, e), (e, l), (l, l), (l, o).
The total number of possible word-level bigrams for a vocabulary of size $V$ is $V^2$. In practice, only a small fraction of these pairs actually occur in any given corpus, which creates the well-known problem of data sparsity.
The central quantity of interest in a bigram model is the conditional probability of a word $w_n$ given the preceding word $w_{n-1}$, written as:

$$P(w_n \mid w_{n-1})$$
This probability is typically estimated using Maximum Likelihood Estimation (MLE). The MLE formula for a bigram probability is:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}$$

where $C(w_{n-1}, w_n)$ is the count of the bigram $(w_{n-1}, w_n)$ in the training corpus and $C(w_{n-1})$ is the count of the word $w_{n-1}$.
For example, suppose a corpus contains 1,000 occurrences of the word "artificial" and 800 of those are followed by "intelligence." The bigram probability P("intelligence" | "artificial") would be 800/1,000 = 0.8.
To compute the probability of an entire sentence, the chain rule of probability is applied. For a sentence of words $w_1, w_2, \ldots, w_n$, the bigram model approximates the joint probability as:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \cdots \times P(w_n \mid w_{n-1})$$
In practice, sentence boundary tokens such as <s> (start) and </s> (end) are often added so that the model can capture which words tend to begin and end sentences.
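As a minimal sketch of how this decomposition can be computed, assuming a hypothetical lookup function bigram_prob(prev, word) that returns P(word | prev) from an already trained (and smoothed) model:

import math

def sentence_log_prob(words, bigram_prob):
    # Add boundary tokens so the first and last words are also conditioned,
    # then sum log probabilities over consecutive pairs.
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob(prev, word))
               for prev, word in zip(padded, padded[1:]))

Working in log space is standard practice here: multiplying many probabilities below 1 quickly underflows floating-point arithmetic, whereas summing their logarithms does not.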
The bigram model relies on the first-order Markov assumption: the probability of each word depends only on the immediately preceding word, not on any earlier words. Formally:

$$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-1})$$
This simplification dramatically reduces the number of parameters the model must estimate. Instead of conditioning on the full history of all preceding words (which would require an impossibly large number of parameters), the model only needs to store probabilities for each pair of words.
The term "Markov" refers to the Russian mathematician Andrey Markov, who first studied stochastic processes with this memoryless property in the early 1900s. A bigram model is specifically a first-order Markov model because it looks exactly one step into the past. By contrast, a trigram model is a second-order Markov model (conditioning on the two preceding words), and a unigram model is a zero-order model (no conditioning at all).
A bigram language model assigns a probability to sequences of words using bigram probabilities. Language models are central to many NLP tasks because they capture the statistical regularities of a language.
The idea of using statistical models of language dates back to Claude Shannon's landmark 1948 paper, "A Mathematical Theory of Communication." Shannon demonstrated that language has measurable statistical properties and used n-gram models (including bigrams) to generate text by sampling words one at a time based on conditional probabilities. His experiments showed that bigram-generated text was noticeably more coherent than unigram-generated text, though still far from natural language.
In the 1980s and 1990s, bigram and trigram language models became standard tools in speech recognition and machine translation systems. Frederick Jelinek and his team at IBM pioneered the use of n-gram models for speech recognition, famously remarking that "every time I fire a linguist, the performance of the speech recognizer goes up," reflecting the power of purely statistical approaches.
Training a bigram model involves the following steps:

1. Collect and tokenize a training corpus, optionally adding sentence boundary tokens such as <s> and </s>.
2. Count every bigram $C(w_{n-1}, w_n)$ and every unigram $C(w_{n-1})$ that occurs in the corpus.
3. Estimate each conditional probability $P(w_n \mid w_{n-1})$ using the MLE formula above.
4. Optionally apply a smoothing technique (discussed below) so that unseen bigrams receive nonzero probability.
The result is a probability table (or matrix) that maps each word pair to a probability value. This table can be stored efficiently using sparse data structures since most word pairs never occur.
Bigram models are commonly evaluated using perplexity, a metric rooted in information theory. Perplexity measures how "surprised" the model is by new text. Lower perplexity indicates a better model. For a bigram model, the perplexity on a test set of N words is:
$$PP = P(w_1, w_2, \ldots, w_N)^{-1/N}$$
Perplexity was introduced as a language model evaluation metric by Frederick Jelinek, Robert Mercer, and their colleagues at IBM in 1977. A bigram model typically achieves lower perplexity than a unigram model but higher perplexity than a trigram model on the same test data, reflecting the trade-off between model complexity and predictive accuracy.
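A small sketch of the computation, again assuming a hypothetical bigram_prob(prev, word) function that already handles unseen pairs through smoothing:

import math

def perplexity(test_tokens, bigram_prob):
    # PP = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the test pairs.
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_p = sum(math.log(bigram_prob(prev, word)) for prev, word in pairs)
    return math.exp(-log_p / len(pairs))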
The bigram sits between the unigram and the trigram in terms of complexity and context window. The following table compares these three n-gram orders:
| Feature | Unigram (n=1) | Bigram (n=2) | Trigram (n=3) |
|---|---|---|---|
| Context used | None | 1 preceding word | 2 preceding words |
| Markov order | 0th order | 1st order | 2nd order |
| Parameters (vocabulary size V) | V | $V^2$ | $V^3$ |
| Data sparsity | Low | Moderate | High |
| Generated text quality | Incoherent word salad | Some local coherence | Noticeably more fluent |
| Training data needed | Little | Moderate | Large |
| Typical perplexity | Highest | Medium | Lowest |
As the n-gram order increases, the model captures more context and produces better probability estimates, but the number of possible n-grams grows exponentially. This exponential growth means that higher-order models require vastly more training data to obtain reliable estimates and are more susceptible to the zero-frequency problem for unseen n-grams.
A fundamental problem with bigram models estimated by MLE is that any bigram not observed in the training data receives a probability of zero. Since multiplying by zero makes the probability of an entire sentence zero, this is a serious practical issue. Several smoothing techniques have been developed to redistribute probability mass to unseen bigrams.
The simplest smoothing method, Laplace (add-one) smoothing, adds 1 to every bigram count before computing probabilities:
$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}$$
where V is the vocabulary size. This ensures no bigram has zero probability. However, Laplace smoothing tends to over-allocate probability to unseen events, especially when the vocabulary is large, making it a poor choice for real-world applications.
Add-k smoothing is a generalization of Laplace smoothing in which a fractional count k (typically between 0.01 and 1) is added to each count instead of 1:

$$P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + kV}$$
The value of k can be tuned on held-out data to find the best balance between smoothing and fidelity to the observed counts.
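Both methods can be sketched in a single function (Laplace smoothing is the special case k = 1); the bigram_counts and unigram_counts arguments are assumed to be Counter objects like those built in the implementation section later in this article:

def add_k_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # Add k to the bigram count and k * V to the denominator so that the
    # distribution over all V possible next words still sums to one.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)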
Good-Turing smoothing, developed by Alan Turing and I.J. Good during World War II for cryptographic applications, reassigns probability mass by using the frequency of frequencies. The basic idea is that the count of n-grams seen r times is adjusted based on the count of n-grams seen r+1 times. This method works well when there are many n-grams with low frequency counts.
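A rough sketch of the core count adjustment follows; real implementations additionally smooth the frequency-of-frequency values, which this simple version omits:

from collections import Counter

def good_turing_counts(bigram_counts):
    # N_r = how many distinct bigrams were seen exactly r times.
    freq_of_freq = Counter(bigram_counts.values())
    adjusted = {}
    for bigram, r in bigram_counts.items():
        # Adjusted count c* = (r + 1) * N_{r+1} / N_r; fall back to the raw
        # count when no bigram was seen exactly r + 1 times.
        n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
        adjusted[bigram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted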
Kneser-Ney smoothing is widely regarded as the most effective smoothing method for n-gram models. Introduced by Reinhard Kneser and Hermann Ney in 1995, it uses absolute discounting (subtracting a fixed discount value d from each nonzero count) combined with a novel backoff distribution.
The key insight behind Kneser-Ney is the concept of continuation probability. Rather than using the raw unigram count of a word as the backoff, it asks: in how many different bigram contexts does this word appear? A word like "Francisco" has a high unigram count but appears in very few contexts (almost always after "San"). Kneser-Ney assigns it a low backoff probability, reflecting the fact that "Francisco" is unlikely to follow an arbitrary word.
The formula for interpolated Kneser-Ney smoothing of bigrams is:
$$P_{\text{KN}}(w_n \mid w_{n-1}) = \frac{\max(C(w_{n-1}, w_n) - d,\, 0)}{C(w_{n-1})} + \lambda(w_{n-1}) \, P_{\text{continuation}}(w_n)$$

where $\lambda(w_{n-1})$ is an interpolation weight that redistributes the discounted probability mass, and $P_{\text{continuation}}(w_n)$ is proportional to the number of distinct words that precede $w_n$ in the training data.
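A sketch of this interpolated formula, assuming bigram_counts and unigram_counts are Counters over a corpus that contains the context word; the discount d = 0.75 is a commonly used default, and the linear scans over bigram types are written for clarity rather than speed:

def kneser_ney_prob(prev, word, bigram_counts, unigram_counts, d=0.75):
    # Continuation probability: in how many distinct contexts does `word`
    # appear, normalized by the total number of distinct bigram types?
    p_cont = sum(1 for (_, w2) in bigram_counts if w2 == word) / len(bigram_counts)
    # Interpolation weight lambda(prev): the probability mass freed by
    # discounting each of prev's observed continuations by d.
    followers = sum(1 for (w1, _) in bigram_counts if w1 == prev)
    lam = d * followers / unigram_counts[prev]
    discounted = max(bigram_counts[(prev, word)] - d, 0) / unigram_counts[prev]
    return discounted + lam * p_cont

Using the number of distinct predecessors rather than raw counts is exactly what keeps a context-bound word like "Francisco" from receiving too much backoff probability.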
| Method | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Laplace (add-one) | Add 1 to all counts | Simple to implement | Over-smooths; poor for large vocabularies |
| Add-k | Add k < 1 to all counts | Tunable; better than Laplace | Still somewhat ad hoc |
| Good-Turing | Use frequency of frequencies | Principled statistical basis | Complex; unreliable at high counts |
| Kneser-Ney | Absolute discount + continuation probability | Best empirical performance | More complex to implement |
Bigram frequency analysis is a core technique for identifying collocations: pairs of words that occur together more often than chance would predict. Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and habitual adjective-noun pairings ("strong coffee").
To determine whether a bigram is a collocation rather than a coincidental pairing, several statistical measures are used:
| Measure | Description |
|---|---|
| Pointwise Mutual Information (PMI) | Compares the observed bigram probability to the probability expected if the two words were independent |
| Chi-squared test | Tests whether the co-occurrence of two words is statistically significant |
| Log-likelihood ratio | Compares the likelihood of the data under a model where the words are associated vs. independent |
| t-test | Tests whether the mean frequency of a bigram differs significantly from what is expected by chance |
Collocation extraction using bigrams is valuable in lexicography (identifying multi-word expressions for dictionaries), machine translation (treating collocations as translation units), and information retrieval (indexing multi-word phrases for search).
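For example, PMI for an adjacent word pair can be computed directly from corpus counts (a sketch; it assumes the pair occurs at least once in tokens):

import math
from collections import Counter

def pmi(w1, w2, tokens):
    # Compare the observed bigram probability with the probability expected
    # if the two words occurred independently of each other.
    unigrams = Counter(tokens)
    bigrams_ = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams_[(w1, w2)] / (len(tokens) - 1)
    p_independent = (unigrams[w1] / len(tokens)) * (unigrams[w2] / len(tokens))
    return math.log2(p_pair / p_independent)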
While word-level bigrams are the most commonly discussed type, character-level bigrams (also called character bigrams or letter bigrams) are widely used in several domains.
Different languages have distinctive character bigram frequency distributions. For example, "th" is extremely common in English but rare in French, while "qu" is frequent in French and Spanish but less so in German. By comparing the character bigram distribution of an unknown text against reference distributions for known languages, classifiers can identify the language with high accuracy.
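A minimal sketch of this approach; the overlap score and the reference_profiles mapping (language name to bigram frequency profile) are illustrative choices rather than a standard algorithm:

from collections import Counter

def char_bigram_profile(text):
    # Relative frequency of every adjacent character pair in the text.
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def identify_language(text, reference_profiles):
    # Choose the language whose reference profile overlaps most with the text,
    # where overlap sums the smaller of the two frequencies for shared bigrams.
    profile = char_bigram_profile(text)
    def overlap(ref):
        return sum(min(freq, ref.get(pair, 0.0)) for pair, freq in profile.items())
    return max(reference_profiles, key=lambda lang: overlap(reference_profiles[lang]))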
Character bigrams help spell checkers evaluate whether a word "looks right" for a given language. A misspelling like "teh" contains the rare English bigram (e, h), whereas the intended word "the" contains the very common bigram (h, e). Character bigram models assign low probability to sequences that violate typical letter patterns, helping flag potential errors.
Character n-grams, including bigrams, are used in approximate string matching algorithms. They allow search engines to find documents matching a query even when there are minor spelling variations or typos, because similar strings share many character bigrams.
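One widely used similarity measure built on character bigrams is the Sørensen-Dice coefficient; a short sketch:

def bigram_dice_similarity(a, b):
    # 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|), in [0, 1].
    bigrams_a, bigrams_b = set(zip(a, a[1:])), set(zip(b, b[1:]))
    if not bigrams_a and not bigrams_b:
        return 1.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))

print(bigram_dice_similarity("hello", "helo"))   # high: most bigrams are shared
print(bigram_dice_similarity("hello", "world"))  # 0.0: no shared bigrams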
Character bigram frequencies can serve as a stylometric feature for identifying the author of a text. Different writers tend to use different distributions of letter pairs, making character bigrams a lightweight but effective feature for authorship analysis.
Bigrams are used across a broad range of NLP and machine learning tasks:
Bigram models power simple next-word prediction systems. When a user types a word on a mobile keyboard, a bigram model can suggest the most likely next word based on the conditional probability P(next word | current word). Modern smartphone keyboards use bigrams as one component in a larger prediction pipeline that also incorporates trigrams, user history, and neural models.
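The suggestion step itself reduces to a ranked lookup; a sketch, assuming a nested-dictionary model of the kind trained in the implementation section later in this article:

def suggest_next(word, model, top_k=3):
    # Rank candidate continuations of `word` by P(candidate | word).
    candidates = model.get(word, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:top_k]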
In automatic speech recognition (ASR) systems, bigram language models help disambiguate acoustically similar utterances. When the acoustic model produces multiple candidate transcriptions, the language model scores each candidate based on how likely its word sequences are. Bigram models were the standard language model in ASR systems throughout the 1980s and 1990s before being supplemented by trigram and neural models.
Bigram statistics contribute to translation quality in statistical machine translation systems. The language model component evaluates how natural a candidate translation sounds in the target language by scoring its bigram probabilities.
Bigrams can capture sentiment-relevant phrases that individual words cannot. For example, the word "good" is positive, but the bigram "not good" reverses the sentiment. Including bigram features in sentiment analysis classifiers often improves accuracy over unigram-only models.
Bigram features are commonly used in text classification tasks, where the goal is to categorize documents into predefined classes. By incorporating bigrams, classification algorithms can leverage contextual information that unigrams miss, such as distinguishing "New York" (a location) from the separate words "new" and "York."
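In practice this is often done by adding bigram features to a bag-of-words representation; a sketch using scikit-learn (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I visited New York last week",
        "New regulations were announced in York"]
# ngram_range=(1, 2) extracts unigram and bigram features, so "new york"
# becomes its own feature alongside "new" and "york".
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())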
In information retrieval systems, bigrams improve search relevance by considering word co-occurrence. A query for "machine learning" should prioritize documents where these two words appear adjacently, not just documents that happen to contain both words independently.
Bigrams can be computed efficiently using standard programming libraries. Below are common approaches in Python.
The Natural Language Toolkit (NLTK) provides a built-in bigrams() function:
import nltk
from nltk import bigrams
from collections import Counter
text = "the cat sat on the mat"
tokens = nltk.word_tokenize(text)
bigram_list = list(bigrams(tokens))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
bigram_counts = Counter(bigram_list)
print(bigram_counts.most_common(3))
Bigrams can be computed without any external library by using the built-in zip() function:
from collections import Counter
tokens = ["the", "cat", "sat", "on", "the", "mat"]
bigram_list = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigram_list)
To build a simple bigram language model:
from collections import Counter, defaultdict
def train_bigram_model(tokens):
    # Count adjacent word pairs and single words, then convert each pair
    # count into a conditional probability P(w2 | w1).
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        model[w1][w2] = count / unigram_counts[w1]
    return model
tokens = "the cat sat on the mat the cat slept".split()
model = train_bigram_model(tokens)
print(model["the"]) # {'cat': 0.667, 'mat': 0.333}
Despite their usefulness, bigram models have significant limitations:

- Limited context: only the single preceding word is considered, so longer-range dependencies within a sentence are invisible to the model.
- Data sparsity: many perfectly valid word pairs never occur in the training corpus and receive zero probability without smoothing.
- No generalization: probabilities are estimated independently for every word pair, so the model cannot infer that semantically similar words should behave similarly.
- Storage: the parameter space grows as $V^2$, which becomes unwieldy for large vocabularies.
These limitations motivated the development of more powerful language models, including neural language models and ultimately transformer-based models such as BERT and GPT, which can capture long-range dependencies and semantic relationships. Nevertheless, bigram models remain valuable as baselines, as components in larger systems, and as educational tools for learning the foundations of statistical language modeling.
Although modern large language models have largely superseded n-gram models for tasks requiring deep language understanding, bigrams continue to play roles in contemporary NLP: