See also: N-gram, Bigram, Language model
In natural language processing (NLP) and machine learning, a trigram is a contiguous sequence of three items drawn from a sample of text or speech. The items can be words, characters, syllables, or phonemes depending on the application. Trigrams are a specific case of the more general n-gram model where n equals 3. They represent one of the most widely used n-gram sizes in computational linguistics because they strike a practical balance between capturing enough context for meaningful predictions and keeping data requirements manageable.
Trigram models played a central role in the development of statistical language models from the 1970s through the early 2010s, powering systems for speech recognition, machine translation, spelling correction, and text generation. Although neural network-based models have largely superseded classical trigram approaches, understanding trigrams remains essential for grasping the foundations of modern NLP.
A trigram is formed by sliding a window of size three across a sequence of elements. Depending on whether the elements are words or characters, the resulting trigrams serve different purposes.
Given the sentence "the quick brown fox jumps," the word-level trigrams are:
| Position | Trigram |
|---|---|
| 1 | the quick brown |
| 2 | quick brown fox |
| 3 | brown fox jumps |
Word-level trigrams capture short phrases and local syntactic patterns. They are the basis for trigram language models used in speech recognition and text prediction.
Given the word "language," the character-level trigrams are:
| Position | Trigram |
|---|---|
| 1 | lan |
| 2 | ang |
| 3 | ngu |
| 4 | gua |
| 5 | uag |
| 6 | age |
Character trigrams are especially useful for language identification, authorship attribution, and spelling correction because different languages produce distinctive distributions of character sequences.
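Both examples can be reproduced by sliding a window of width three over a sequence. The following minimal Python sketch works for either a list of words or a string of characters:

```python
def trigrams(sequence):
    """Return all contiguous trigrams of a sequence (a list of words or a string)."""
    return [tuple(sequence[i:i + 3]) for i in range(len(sequence) - 2)]

# Word-level trigrams from the example sentence
print(trigrams("the quick brown fox jumps".split()))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]

# Character-level trigrams from the example word
print(["".join(t) for t in trigrams("language")])
# ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```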
A trigram language model estimates the probability of a word given the two preceding words. This conditional probability is written as:
P(w_3 | w_1, w_2)
The probability is estimated from a training corpus using maximum likelihood estimation (MLE):
P(w_3 | w_1, w_2) = Count(w_1, w_2, w_3) / Count(w_1, w_2)
For example, to estimate the probability of the word "of" following "prime minister," we divide the number of times "prime minister of" appears in the corpus by the number of times "prime minister" appears. If "prime minister" appears 5,000 times and "prime minister of" appears 3,200 times, then P("of" | "prime", "minister") = 3,200 / 5,000 = 0.64.
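As an illustration, the MLE estimate can be computed directly from n-gram counts. This sketch assumes the corpus is already tokenized into a flat list of words:

```python
from collections import Counter

def mle_trigram_prob(tokens, w1, w2, w3):
    """Maximum likelihood estimate of P(w3 | w1, w2) from a token list."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history = bigram_counts[(w1, w2)]
    if history == 0:
        return 0.0  # the history never occurs, so the MLE is undefined; return 0
    return trigram_counts[(w1, w2, w3)] / history
```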
The probability of an entire sentence can be decomposed using the chain rule of probability:
P(w_1, w_2, ..., w_n) = P(w_1) x P(w_2 | w_1) x P(w_3 | w_1, w_2) x ... x P(w_n | w_1, ..., w_{n-1})
Computing the full conditional history for each word is impractical for long sentences. The trigram model simplifies this by applying the second-order Markov assumption: the probability of each word depends only on the two immediately preceding words, not the entire history. This reduces the chain rule to:
P(w_1, w_2, ..., w_n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
where w_{-1} and w_0 are conventionally taken to be sentence-start padding symbols, so that the first two words also have a two-word history.
This assumption makes computation tractable while still capturing more context than a bigram (first-order Markov) model.
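In practice the product is computed in log space to avoid numerical underflow on long sentences. The sketch below assumes a `trigram_prob(w1, w2, w3)` function (any smoothed estimator that never returns zero, such as those discussed later in this article) and uses "<s>" as a hypothetical sentence-start padding symbol:

```python
import math

def sentence_log_prob(words, trigram_prob):
    """Log-probability of a sentence under the second-order Markov assumption.

    Assumes trigram_prob(w1, w2, w3) returns a smoothed, strictly positive
    estimate of P(w3 | w1, w2).
    """
    padded = ["<s>", "<s>"] + words
    return sum(
        math.log(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
        for i in range(2, len(padded))
    )
```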
Perplexity is the standard metric for evaluating language models. It measures how well a model predicts a held-out test set. Lower perplexity indicates a better model. For a trigram model evaluated on a test set of N words, perplexity is defined as:
PP = P(w_1, w_2, ..., w_N) ^ (-1/N)
The lowest perplexity reported on the Brown Corpus (one million words of American English) using a trigram model was approximately 247 per word, corresponding to a cross-entropy of about 7.95 bits per word. This result, achieved by researchers at IBM in the early 1990s, set an important benchmark for statistical language modeling.
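Given the per-word conditional probabilities a model assigns to a test set, perplexity follows directly from the definition above. The check at the end illustrates the connection to the Brown Corpus figure: a model that assigns every word probability 1/247 has perplexity 247.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test set from its per-word conditional probabilities."""
    n = len(word_probs)
    total_log2 = sum(math.log2(p) for p in word_probs)
    return 2 ** (-total_log2 / n)

print(perplexity([1 / 247] * 1000))  # ≈ 247 (cross-entropy log2(247) ≈ 7.95 bits per word)
```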
The choice between bigram and trigram models involves a trade-off between context and data requirements.
| Feature | Bigram | Trigram |
|---|---|---|
| Context window | 1 preceding word | 2 preceding words |
| Markov order | First-order | Second-order |
| Number of possible n-grams | V^2 | V^3 |
| Data sparsity | Moderate | Higher |
| Prediction accuracy | Good for simple tasks | Better disambiguation |
| Storage requirements | Lower | Higher |
| Example | P(fox \| brown) | P(fox \| quick, brown) |
In this table, V represents the vocabulary size. A vocabulary of 50,000 words produces 2.5 billion possible bigrams but 125 trillion possible trigrams, illustrating why data sparsity becomes significantly worse for trigrams.
Trigrams provide better disambiguation. For instance, when predicting the word that follows "new," a bigram model sees only that single preceding word and cannot distinguish the context of "New York" from that of "new car." A trigram model that conditions on "in New" versus "a new" captures enough context to assign appropriate probabilities to the following word.
The most significant challenge for trigram language models is data sparsity. Even with large training corpora, many valid three-word combinations never appear. If a trigram has zero count in the training data, the MLE assigns it a probability of zero. This is problematic because a single zero probability in the chain rule makes the probability of the entire sentence zero, even if the sentence is perfectly grammatical.
Several smoothing techniques have been developed to address this problem.
The simplest approach, Laplace (add-one) smoothing, adds one to every trigram count and adds the vocabulary size V to each denominator. While easy to implement, Laplace smoothing distributes too much probability mass to unseen events and performs poorly in practice for language modeling.
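A sketch of the add-one estimate, assuming trigram and bigram counts stored in dictionaries and a known vocabulary size:

```python
def laplace_trigram_prob(trigram_counts, bigram_counts, vocab_size, w1, w2, w3):
    """Add-one (Laplace) estimate of P(w3 | w1, w2)."""
    count_tri = trigram_counts.get((w1, w2, w3), 0)
    count_bi = bigram_counts.get((w1, w2), 0)
    return (count_tri + 1) / (count_bi + vocab_size)
```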
Developed by Alan Turing and I.J. Good during World War II for cryptanalysis, Good-Turing discounting re-estimates the probability of n-grams that occur a small number of times by using the frequency of frequencies. N-grams that appear r times are re-estimated using the count of n-grams that appear r+1 times. The total probability mass freed from observed n-grams is redistributed to unseen ones.
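The core re-estimation, r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct trigrams occurring exactly r times, can be sketched as follows. Practical implementations (such as Simple Good-Turing) first smooth the N_r values, because N_{r+1} is often zero for large r; the fallback below is only a placeholder for that step:

```python
from collections import Counter

def good_turing_counts(trigram_counts):
    """Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r."""
    n_r = Counter(trigram_counts.values())  # frequency of frequencies
    adjusted = {}
    for trigram, r in trigram_counts.items():
        if n_r.get(r + 1, 0) > 0:
            adjusted[trigram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[trigram] = r  # placeholder: keep the raw count
    return adjusted
```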
Katz backoff (1987) uses the trigram probability when there is sufficient evidence (the trigram was seen in training), but "backs off" to the bigram probability when the trigram count is too low, and further backs off to the unigram probability if necessary. The key idea is to trust higher-order n-grams when data is available and rely on lower-order estimates otherwise.
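The fall-through logic can be sketched as below. Note that this is only a skeleton: true Katz backoff discounts the higher-order counts (typically with Good-Turing) and weights each backed-off distribution with a normalization factor alpha, both of which are omitted here for brevity:

```python
def backoff_prob(w1, w2, w3, tri_counts, bi_counts, uni_counts, total_words):
    """Simplified backoff: use the highest-order n-gram with evidence."""
    if tri_counts.get((w1, w2, w3), 0) > 0:
        return tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]
    if bi_counts.get((w2, w3), 0) > 0:
        return bi_counts[(w2, w3)] / uni_counts[w2]
    return uni_counts.get(w3, 0) / total_words
```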
Rather than choosing between trigram, bigram, and unigram estimates, interpolation combines all three using learned weights:
P_interp(w_3 | w_1, w_2) = λ_3 P_tri(w_3 | w_1, w_2) + λ_2 P_bi(w_3 | w_2) + λ_1 P_uni(w_3)
where λ_1 + λ_2 + λ_3 = 1. The weights are typically estimated on a held-out dataset. This technique was formalized by Frederick Jelinek and Robert Mercer in 1980, and the approach became known as Jelinek-Mercer smoothing.
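A direct transcription of the formula; the weights below are illustrative placeholders rather than tuned values:

```python
def interpolated_prob(w1, w2, w3, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Jelinek-Mercer interpolation of trigram, bigram, and unigram estimates.

    lambdas must sum to 1; in practice they are tuned on held-out data.
    """
    l3, l2, l1 = lambdas
    return l3 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l1 * p_uni(w3)
```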
Kneser-Ney smoothing, proposed by Reinhard Kneser and Hermann Ney in 1995 and building on earlier absolute-discounting work with Ute Essen, is widely considered the most effective smoothing technique for n-gram language models. It combines two key ideas:
Absolute discounting: A fixed discount value d (typically between 0 and 1) is subtracted from each observed n-gram count, freeing probability mass for redistribution.
Continuation probability: Instead of using standard unigram frequencies for the lower-order distribution, Kneser-Ney uses the number of distinct contexts in which a word appears. The word "Francisco" may have a high unigram count, but it almost always follows "San." Kneser-Ney recognizes this by assigning "Francisco" low continuation probability, since it continues very few distinct histories.
Modified Kneser-Ney smoothing, which uses multiple discount values for different count levels, was shown by Stanley Chen and Joshua Goodman (1999) to consistently outperform other smoothing methods across multiple corpora and n-gram orders.
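For illustration, here is a sketch of interpolated Kneser-Ney at the bigram level (the trigram version adds one more level of the same recursion). It shows both ingredients: absolute discounting of observed counts, and a continuation-count lower-order distribution:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram model; d is the absolute discount
    (0.75 is a common default, not a tuned value). Returns p(w given v)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    followers = defaultdict(set)   # distinct words that follow each history
    histories = defaultdict(set)   # distinct histories that precede each word
    for v, w in bigram_counts:
        followers[v].add(w)
        histories[w].add(v)
    bigram_types = len(bigram_counts)

    def p(w, v):
        p_cont = len(histories[w]) / bigram_types        # continuation probability
        if history_counts[v] == 0:
            return p_cont                                # unseen history: pure continuation
        discounted = max(bigram_counts[(v, w)] - d, 0) / history_counts[v]
        lam = d * len(followers[v]) / history_counts[v]  # mass freed by discounting
        return discounted + lam * p_cont

    return p
```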
Trigram language models were foundational to automatic speech recognition (ASR) systems from the late 1970s onward. At IBM, Frederick Jelinek and his team used trigram models in the TANGORA speech recognition system. The system used hidden Markov models (HMMs) for acoustic modeling and trigram language models to constrain word sequence predictions.
Researchers observed an interesting symbiosis between trigram models and acoustic models. When acoustic evidence was weak (as for short function words like "the," "of," "a"), the trigram model tended to be strong because these words occur in highly predictable contexts. When the trigram model was uncertain (as for content words with diverse contexts), the acoustic model was typically more reliable because content words tend to be longer and more acoustically distinct.
Estimates from the speech recognition community suggest that roughly 80% of ASR technology through the 2000s was built on refined versions of this 1970s trigram paradigm.
Statistical machine translation systems used trigram language models to evaluate the fluency of candidate translations. In the noisy channel framework, the translation model proposes candidate word sequences, and the language model scores how likely each candidate is as a sentence in the target language. Trigram models were the default choice for this fluency scoring component in systems like those developed by IBM and later by Google (before the switch to neural machine translation in 2016).
Character-level trigrams provide a remarkably effective method for identifying the language of a text sample. Each language has a distinctive distribution of three-character sequences. For example, English frequently contains trigrams like "the," "ing," and "tion," while German favors "sch," "ein," and "und."
The approach works by building a reference profile of trigram frequencies for each candidate language from training text. Given an unknown text sample, the system computes its trigram profile and measures the distance to each reference profile. The language with the smallest distance is selected. A 1994 paper by Cavnar and Trenkle demonstrated that this method achieves high accuracy even on very short text samples (fewer than 15 words), and that the top 300 most frequent trigrams are nearly always sufficient to identify a language correctly.
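A minimal sketch of the profile-based approach, using an out-of-place rank distance in the spirit of Cavnar and Trenkle. The `references` argument is assumed to be a dictionary mapping each candidate language to a profile built from training text with the same function:

```python
from collections import Counter

def trigram_profile(text, top_n=300):
    """The most frequent character trigrams of a text, in rank order."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [t for t, _ in counts.most_common(top_n)]

def out_of_place_distance(profile, reference):
    """Sum of rank differences; trigrams absent from the reference get a maximum penalty."""
    ref_rank = {t: i for i, t in enumerate(reference)}
    max_penalty = len(reference)
    return sum(abs(i - ref_rank.get(t, max_penalty)) for i, t in enumerate(profile))

def identify_language(text, references):
    """Choose the language whose reference profile is closest to the text's profile."""
    profile = trigram_profile(text)
    return min(references, key=lambda lang: out_of_place_distance(profile, references[lang]))
```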
Trigrams serve as features for text classification tasks including sentiment analysis, spam detection, and topic categorization. Because trigrams capture short phrases and word combinations, they can represent meaning that individual words (unigrams) cannot. For example, the trigram "not very good" carries negative sentiment, while the unigram "good" alone suggests positive sentiment.
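With a library such as scikit-learn, word-level trigram features can be extracted in a few lines. This is for illustration only; in practice unigrams, bigrams, and trigrams are often combined via `ngram_range=(1, 3)`:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(3, 3), analyzer="word")
X = vectorizer.fit_transform([
    "the movie was not very good",
    "the movie was very good",
])
print(vectorizer.get_feature_names_out())
# 'not very good' and 'was very good' appear as distinct features
```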
In information retrieval, search engines have used trigram matching to improve query understanding. Character trigrams also support fuzzy matching, where approximate string comparisons help account for misspellings and morphological variations.
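One common fuzzy-matching scheme scores word similarity by the Jaccard overlap of character-trigram sets, as in this sketch (the `$` padding is an illustrative convention so that short words still produce trigrams):

```python
def trigram_set(word, pad="$"):
    """Character trigrams of a padded, lowercased word."""
    padded = pad * 2 + word.lower() + pad * 2
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of two words' character-trigram sets."""
    sa, sb = trigram_set(a), trigram_set(b)
    return len(sa & sb) / len(sa | sb)

print(trigram_similarity("government", "goverment"))  # high: likely a misspelling
print(trigram_similarity("government", "banana"))     # low: unrelated
```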
The Google Books Ngram Viewer, launched in December 2010, allows users to chart the frequency of n-grams (including trigrams) across a corpus of over 5.2 million digitized books containing approximately 500 billion words. Users can search for three-word phrases and see how their usage has changed over time in publications from 1500 to 2022. The tool supports searches across eight languages and has been widely used in digital humanities, historical linguistics, and cultural analysis. The underlying trigram dataset alone is approximately 200 GB when uncompressed.
The use of trigrams in computational linguistics has a history spanning several decades.
| Year | Milestone |
|---|---|
| 1948 | Claude Shannon applies n-gram analysis to English text in "A Mathematical Theory of Communication" |
| 1975-1976 | Jim Baker and Frederick Jelinek independently introduce n-gram language models for speech recognition at CMU and IBM |
| 1977 | Jelinek, Mercer, Bahl, and Baker introduce perplexity as a language model evaluation metric |
| 1980 | Jelinek and Mercer formalize interpolated (linear combination) smoothing for n-gram models |
| 1987 | Slava Katz publishes the backoff smoothing method |
| 1995 | Kneser and Ney propose their influential smoothing technique based on continuation counts |
| 1998 | Chen and Goodman publish their comprehensive comparison of smoothing methods (journal version 1999), establishing Modified Kneser-Ney as the best performer |
| 2003 | Yoshua Bengio and colleagues publish "A Neural Probabilistic Language Model," introducing neural network language models that would eventually replace n-gram approaches |
| 2010 | Google launches the Ngram Viewer, making n-gram analysis accessible to the public |
| 2013-present | Word embeddings, RNNs, and Transformer models progressively replace trigram models in most production NLP systems |
Despite decades of success, trigram language models have been largely superseded by neural network-based approaches. Several fundamental limitations drove this transition.
Fixed context window. A trigram model can only consider the previous two words. It has no mechanism for capturing dependencies that span longer distances. For instance, in the sentence "The author who won the prize last year wrote a new book," a trigram model cannot connect "author" with "wrote" because they are separated by several words. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and especially Transformer architectures can model dependencies across entire sentences or even documents.
No generalization across similar words. Trigram models treat each word as a discrete symbol. If the model has seen "ate Chinese food" many times but never "ate Japanese food," it cannot infer that "Japanese food" is plausible in the same context. Neural models use word embeddings that represent words as dense vectors in a continuous space, allowing them to generalize from "Chinese" to semantically similar words like "Japanese," "Korean," or "Italian."
Massive storage requirements. A trigram model over a vocabulary of 100,000 words must potentially store probabilities for 10^15 trigrams. Even with pruning, the resulting probability tables are enormous. Neural language models store parameters in compact weight matrices that implicitly encode all n-gram patterns, requiring far less memory.
Inability to capture compositionality. Trigrams cannot compose meaning from subword parts. Neural models with subword tokenization (such as byte pair encoding) can handle rare words, compound words, and morphological variations by breaking them into meaningful pieces.
Despite these limitations, trigram models retain some advantages: they are simple to implement, fast to train, interpretable, and deterministic. They remain useful in resource-constrained environments and as baseline models for NLP research.
Imagine you are reading a storybook and you want to guess the next word. If you only look at the word right before the blank, you might guess wrong a lot. But if you look at the two words right before the blank, you can make a much better guess.
That is what a trigram does. It looks at groups of three words in a row. For example, in "I like ice ____," the trigram "like ice ____" helps a computer guess the next word is probably "cream." If it only saw "ice ____" (two words), it might also guess "skating" or "cold." The extra word gives it a better clue.
Computers read millions of sentences and count how often groups of three words appear together. Then, when they see two words, they can look up what word usually comes next. This helps computers understand language, recognize speech, and even translate from one language to another.