Trigram
Last reviewed
Sources
11 citations
Review status
Source-backed
Revision
v5 · 3,482 words
See also: N-gram, Bigram, Language model
A trigram is a contiguous sequence of three items (most often three words) drawn from a sample of text or speech, and the basis of a classic statistical language model that predicts each word from the two words immediately before it. In the standard NLP textbook Speech and Language Processing, Daniel Jurafsky and James H. Martin define it directly: "a 3-gram (a trigram) is a three-word sequence of words like 'The water of' or 'water of Walden'."[9] A trigram is the case of the more general n-gram model where n equals 3, and it became the dominant language model in speech recognition and machine translation from the 1970s until neural networks displaced it around 2013. On a Wall Street Journal test set, Jurafsky and Martin report that a trigram model trained on 38 million words reaches a perplexity of 109, compared with 170 for a bigram and 962 for a unigram, illustrating why the trigram became the default n-gram order.[9]
Trigrams strike a practical balance: two words of context capture enough local structure for useful predictions while keeping data and storage requirements far below those of higher-order n-grams. They are also used as features beyond language modeling, including language identification, authorship attribution, spelling correction, and text classification. Although neural network-based models have largely superseded classical trigram approaches, understanding trigrams remains essential for grasping the foundations of modern NLP.[2]
In natural language processing (NLP) and machine learning, a trigram is a contiguous sequence of three items drawn from a sample of text or speech. The items can be words, characters, syllables, or phonemes depending on the application. Trigrams are a specific case of the more general n-gram model where n equals 3. They represent one of the most widely used n-gram sizes in computational linguistics because they strike a practical balance between capturing enough context for meaningful predictions and keeping data requirements manageable.[9]
Trigram models played a central role in the development of statistical language models from the 1970s through the early 2010s, powering systems for speech recognition, machine translation, spelling correction, and text generation.[2]
A trigram is formed by sliding a window of size three across a sequence of elements. Depending on whether the elements are words or characters, the resulting trigrams serve different purposes.
Given the sentence "the quick brown fox jumps," the word-level trigrams are:
| Position | Trigram |
|---|---|
| 1 | the quick brown |
| 2 | quick brown fox |
| 3 | brown fox jumps |
Word-level trigrams capture short phrases and local syntactic patterns. They are the basis for trigram language models used in speech recognition and text prediction.[9]
Given the word "language," the character-level trigrams are:
| Position | Trigram |
|---|---|
| 1 | lan |
| 2 | ang |
| 3 | ngu |
| 4 | gua |
| 5 | uag |
| 6 | age |
Character trigrams are especially useful for language identification, authorship attribution, and spelling correction because different languages produce distinctive distributions of character sequences.[8]
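The sliding-window construction described above can be sketched in a few lines of Python. The helper name `trigrams` is hypothetical and chosen only for illustration; it works on any sequence, so the same function covers both the word-level and the character-level examples.

```python
def trigrams(items):
    """Return every contiguous run of three items as a tuple."""
    # A sequence of length n yields n - 2 trigrams (none if n < 3).
    return [tuple(items[i:i + 3]) for i in range(len(items) - 2)]


# Word-level: split the sentence on whitespace first.
word_trigrams = trigrams("the quick brown fox jumps".split())

# Character-level: a string is already a sequence of characters.
char_trigrams = ["".join(t) for t in trigrams("language")]

print(word_trigrams)
print(char_trigrams)  # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```

Real systems usually add tokenization, case folding, and boundary padding before this step; those choices are left out of the sketch.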
A trigram language model estimates the probability of a word given the two preceding words. This conditional probability is written as:
P(w_3 | w_1, w_2)
The probability is estimated from a training corpus using maximum likelihood estimation (MLE):
P(w_3 | w_1, w_2) = Count(w_1, w_2, w_3) / Count(w_1, w_2)[9]
For example, to estimate the probability of the word "of" following "prime minister," we divide the number of times "prime minister of" appears in the corpus by the number of times "prime minister" appears. If "prime minister" appears 5,000 times and "prime minister of" appears 3,200 times, then P("of" | "prime", "minister") = 3,200 / 5,000 = 0.64.
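As a minimal sketch of this count-and-divide estimate, the following Python snippet computes the MLE on a toy corpus. The function names `train_counts` and `mle` and the toy sentence are assumptions made for illustration, not part of any particular library.

```python
from collections import Counter


def train_counts(tokens):
    """Count trigrams and the two-word contexts that precede a third word."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    ctx = Counter()
    for (w1, w2, _), count in tri.items():
        ctx[(w1, w2)] += count
    return tri, ctx


def mle(tri, ctx, w1, w2, w3):
    """P(w3 | w1, w2) = Count(w1, w2, w3) / Count(w1, w2)."""
    if ctx[(w1, w2)] == 0:
        # The MLE is undefined for an unseen context; returning 0.0 is a
        # simplifying assumption of this sketch.
        return 0.0
    return tri[(w1, w2, w3)] / ctx[(w1, w2)]


tokens = ("the prime minister of spain met the prime minister of italy "
          "and the prime minister spoke").split()
tri, ctx = train_counts(tokens)

# "prime minister" is followed by "of" twice and by "spoke" once.
p_of = mle(tri, ctx, "prime", "minister", "of")
print(p_of)
```

Here the context count is taken over occurrences that are followed by another word, so the probabilities for a given context sum to one.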
The probability of an entire sentence can be decomposed using the chain rule of probability:
P(w_1, w_2, ..., w_n) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × ... × P(w_n | w_1, ..., w_{n-1})
Computing the full conditional history for each word is impractical for long sentences. The trigram model simplifies this by applying the second-order Markov assumption: the probability of each word depends only on the two immediately preceding words, not the entire history. This reduces the chain rule to:
P(w_1, w_2, ..., w_n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})
Jurafsky and Martin describe this generalization plainly: "We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n - 1 words into the past)."[9] A bigram model is therefore a first-order Markov model and a trigram model is a second-order Markov model. The assumption keeps computation tractable while still capturing more context than a bigram model.[9]
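The product of trigram probabilities can be sketched as follows. This is an illustrative example under stated assumptions: the sentence is padded with two start symbols `<s>` and one end symbol `</s>` (a common convention, not something specified above), the corpus is a toy one, and the names `pad`, `train`, and `sentence_logprob` are hypothetical. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed padding symbols


def pad(sentence):
    """Add two start symbols so the first word also has a two-word history."""
    return [BOS, BOS] + sentence.split() + [EOS]


def train(sentences):
    tri, ctx = Counter(), Counter()
    for sentence in sentences:
        toks = pad(sentence)
        for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
            tri[(w1, w2, w3)] += 1
            ctx[(w1, w2)] += 1
    return tri, ctx


def sentence_logprob(sentence, tri, ctx):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) using unsmoothed MLE estimates."""
    toks = pad(sentence)
    total = 0.0
    for w1, w2, w3 in zip(toks, toks[1:], toks[2:]):
        count = tri[(w1, w2, w3)]
        if count == 0:
            return float("-inf")  # one unseen trigram zeroes the whole product
        total += math.log(count / ctx[(w1, w2)])
    return total


corpus = ["i like ice cream", "i like ice skating", "you like ice cream"]
tri, ctx = train(corpus)

# (2/3) for "<s> <s> i" times (2/3) for "like ice cream"; other factors are 1.
print(math.exp(sentence_logprob("i like ice cream", tri, ctx)))
```

Each factor depends only on the two preceding tokens, which is exactly the second-order Markov assumption.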
Perplexity is the standard intrinsic metric for evaluating language models. It measures how well a model predicts a held-out test set, and Jurafsky and Martin state the rule of thumb directly: "the lower the perplexity of a model on the data, the better the model."[9] For a trigram model evaluated on a test set of N words, perplexity is the inverse probability of the test set, normalized by the number of words:
PP = P(w_1, w_2, ..., w_N) ^ (-1/N)
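A minimal sketch of this formula in Python, assuming the model has already assigned a conditional probability to each of the N test words (the function name `perplexity` and the input format are illustrative choices). Working in log space gives the same value as raising the product to the power -1/N.

```python
import math


def perplexity(word_probs):
    """Perplexity from the conditional probability assigned to each test word."""
    if any(p == 0 for p in word_probs):
        return float("inf")  # a zero probability makes the inverse undefined
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that always spreads probability evenly over 4 choices has a
# perplexity of about 4, matching the "branching factor" intuition.
uniform = perplexity([0.25] * 8)
print(uniform)
```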
A concrete comparison from Speech and Language Processing shows the value of the extra context. Unigram, bigram, and trigram models trained on 38 million words of the Wall Street Journal and tested on a 1.5 million word test set produced the following perplexities:[9]
| Model | Perplexity on WSJ test set |
|---|---|
| Unigram | 962 |
| Bigram | 170 |
| Trigram | 109 |
Historically, the lowest perplexity reported on the Brown Corpus (one million words of American English) using a trigram word model was about 247 per word, corresponding to a cross-entropy of about 7.95 bits per word, or 1.75 bits per character. This result, published by Peter Brown and colleagues at IBM in 1992 using an interpolated trigram model trained on more than 600 million words, set an influential benchmark for statistical language modeling and entropy estimation of English.[11]
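The relationship between these figures can be checked with a short calculation: cross-entropy in bits per word is the base-2 logarithm of the per-word perplexity. The implied average word length at the end is derived only from the two figures quoted above and is an illustration, not a number taken from the cited paper.

```python
import math

perplexity_per_word = 247
bits_per_word = math.log2(perplexity_per_word)

# Dividing by the quoted 1.75 bits per character gives the average number
# of characters per word implied by the two figures.
implied_chars_per_word = bits_per_word / 1.75

print(round(bits_per_word, 2))          # 7.95
print(round(implied_chars_per_word, 2))
```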
The choice between bigram and trigram models involves a trade-off between context and data requirements.
| Feature | Bigram | Trigram |
|---|---|---|
| Context window | 1 preceding word | 2 preceding words |
| Markov order | First-order | Second-order |
| Number of possible n-grams | V^2 | V^3 |
| Data sparsity | Moderate | Higher |
| Prediction accuracy | Good for simple tasks | Better disambiguation |
| Storage requirements | Lower | Higher |
| Example | P(fox | brown) | P(fox | quick, brown) |
In this table, V represents the vocabulary size. A vocabulary of 50,000 words produces 2.5 billion possible bigrams but 125 trillion possible trigrams, illustrating why data sparsity becomes significantly worse for trigrams.[9] The same effect appears in real corpora: Jurafsky and Martin note that a model trained on the complete works of Shakespeare has a vocabulary of V = 29,066 word types, which yields V^2 = 844 million possible bigrams, far more than the roughly 884,647 word tokens actually present in the corpus, so the overwhelming majority of possible n-grams are never observed.[9]
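The arithmetic behind these figures is easy to reproduce. The sketch below uses a hypothetical helper `possible_ngrams`; the final ratio is only an upper bound, since a corpus of N tokens contains fewer than N bigram occurrences.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams over a vocabulary: V ** n."""
    return vocab_size ** n


print(f"{possible_ngrams(50_000, 2):,}")  # 2,500,000,000
print(f"{possible_ngrams(50_000, 3):,}")  # 125,000,000,000,000

# Shakespeare figures quoted above: V = 29,066 types, 884,647 tokens.
shakespeare_bigrams = possible_ngrams(29_066, 2)
upper_bound_seen = 884_647 / shakespeare_bigrams

print(f"{shakespeare_bigrams:,}")  # 844,832,356
print(f"{upper_bound_seen:.4%}")
```

Even under this generous bound, only about a tenth of one percent of the possible bigrams could have been observed.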
Trigrams provide better disambiguation. For instance, a bigram model predicting the word after "new" sees only that one preceding word, so it cannot tell whether "York" or "car" is the better continuation. A trigram model that conditions on "in new" versus "a new" has enough context to favor "York" in the first case and "car" in the second.
The most significant challenge for trigram language models is data sparsity. Even with large training corpora, many valid three-word combinations never appear. If a trigram has zero count in the training data, the MLE assigns it a probability of zero. This is problematic because a single zero probability in the chain rule makes the probability of the entire sentence zero, even if the sentence is perfectly grammatical. As Jurafsky and Martin put it, these unseen sequences "are a problem for two reasons": the model underestimates the probability of real word sequences, and perplexity cannot be computed at all, because a test set with probability zero would require dividing by zero.[9]
Several smoothing techniques have been developed to address this problem.
The simplest approach adds one to every trigram count. While easy to implement, Laplace smoothing distributes too much probability mass to unseen events and performs poorly in practice for language modeling. Jurafsky and Martin note that it "does not perform well enough to be used in modern n-gram models" but is useful as a baseline and for other tasks such as text classification.[9]
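For a trigram model, add-one smoothing amounts to (Count(w_1, w_2, w_3) + 1) / (Count(w_1, w_2) + V), where V is the vocabulary size. The following sketch illustrates this on a toy corpus; the function name and the data are assumptions for illustration.

```python
from collections import Counter


def laplace_trigram_prob(tri, ctx, vocab_size, w1, w2, w3):
    """Add-one estimate: (Count(w1,w2,w3) + 1) / (Count(w1,w2) + V)."""
    return (tri[(w1, w2, w3)] + 1) / (ctx[(w1, w2)] + vocab_size)


tokens = "the cat sat on the mat the cat ate".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
ctx = Counter((a, b) for a, b, _ in zip(tokens, tokens[1:], tokens[2:]))
vocab = set(tokens)  # 6 word types

seen = laplace_trigram_prob(tri, ctx, len(vocab), "the", "cat", "sat")
unseen = laplace_trigram_prob(tri, ctx, len(vocab), "the", "cat", "mat")
print(seen, unseen)  # 0.25 0.125
```

The context "the cat" occurs only twice, yet the four unseen continuations together receive half of the probability mass, which shows how aggressively add-one smoothing shifts mass toward unseen events.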
Developed by Alan Turing and I.J. Good during World War II for cryptanalysis, Good-Turing discounting re-estimates the probability of n-grams that occur a small number of times by using the frequency of frequencies. N-grams that appear r times are re-estimated using the count of n-grams that appear r+1 times. The total probability mass freed from observed n-grams is redistributed to unseen ones.
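In the usual notation, where N_r is the number of distinct n-grams seen exactly r times, the adjusted count is r* = (r + 1) N_{r+1} / N_r and the total mass reserved for unseen n-grams is N_1 / N. The sketch below applies these formulas to hypothetical counts. It uses the raw frequencies of frequencies, and the fallback to the unadjusted count when N_{r+1} is zero is a simplifying assumption; practical implementations typically smooth the N_r values first.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * N_{r+1} / N_r."""
    freq_of_freq = Counter(ngram_counts.values())  # r -> N_r
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # Assumption of this sketch: keep the raw count when N_{r+1} is 0.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else float(r)
    return adjusted


def unseen_mass(ngram_counts):
    """Total probability reserved for unseen n-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for c in ngram_counts.values() if c == 1)
    return singletons / total


# Hypothetical counts: three n-grams seen once, two seen twice, one seen 3 times.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted = good_turing_adjusted_counts(counts)
print(adjusted[1], adjusted[2], unseen_mass(counts))
```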
Katz backoff (1987) uses the trigram probability when there is sufficient evidence (the trigram was seen in training), but "backs off" to the bigram probability when the trigram count is too low, and further backs off to the unigram probability if necessary.[4] Slava Katz's method combines Good-Turing discounting with backoff and was published in a paper titled "Estimation of probabilities from sparse data for the language model component of a speech recognizer."[4] The key idea is to trust higher-order n-grams when data is available and rely on lower-order estimates otherwise.
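One common textbook way to write the trigram case of this scheme is shown below. The symbols are notational assumptions: d is a Good-Turing-style discount factor that depends on the observed count, and alpha(w_1, w_2) is the backoff weight chosen so that the probabilities for each context sum to one. The same rule is applied again to back off from bigrams to unigrams.

```latex
P_{\text{katz}}(w_3 \mid w_1, w_2) =
\begin{cases}
d \cdot \dfrac{\mathrm{Count}(w_1, w_2, w_3)}{\mathrm{Count}(w_1, w_2)}
  & \text{if } \mathrm{Count}(w_1, w_2, w_3) > 0 \\[1.5ex]
\alpha(w_1, w_2) \, P_{\text{katz}}(w_3 \mid w_2)
  & \text{otherwise}
\end{cases}
```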
Rather than choosing between trigram, bigram, and unigram estimates, interpolation combines all three using learned weights:
P_interp(w_3 | w_1, w_2) = lambda_3 P_tri(w_3 | w_1, w_2) + lambda_2 P_bi(w_3 | w_2) + lambda_1 P_uni(w_3)
where lambda_1 + lambda_2 + lambda_3 = 1. The weights are typically estimated on a held-out dataset. This technique was formalized by Frederick Jelinek and Robert Mercer in 1980, and the approach became known as Jelinek-Mercer smoothing.[3]
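The interpolation formula translates directly into code. In this sketch the three component probabilities are passed in as plain numbers, and the default weights are hypothetical placeholders; in practice they would be tuned on held-out data as described above.

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation; lambdas = (lambda_1, lambda_2, lambda_3)."""
    lam1, lam2, lam3 = lambdas
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return lam3 * p_tri + lam2 * p_bi + lam1 * p_uni


# An unseen trigram (p_tri = 0) still receives a nonzero probability
# because the bigram and unigram estimates contribute.
p = interpolated_prob(p_tri=0.0, p_bi=0.2, p_uni=0.01)
print(p)
```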
Kneser-Ney smoothing, proposed by Reinhard Kneser and Hermann Ney in 1995 (building on work with Ute Essen in 1994), is widely considered the most effective smoothing technique for n-gram language models.[5] It was introduced in a paper titled "Improved backing-off for M-gram language modeling."[5] It combines two key ideas:
Absolute discounting: A fixed discount value d (typically between 0 and 1) is subtracted from each observed n-gram count, freeing probability mass for redistribution.
Continuation probability: Instead of using standard unigram frequencies for the lower-order distribution, Kneser-Ney uses the number of distinct contexts in which a word appears. The word "Francisco" may have a high unigram count, but it almost always follows "San." Kneser-Ney recognizes this by assigning "Francisco" low continuation probability, since it continues very few distinct histories.[5]
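The continuation idea can be illustrated with a small sketch that covers only the lowest-order continuation distribution, not the full Kneser-Ney estimator. The continuation probability of a word is taken as the number of distinct words that precede it, divided by the number of distinct bigram types. The toy corpus and the function name are assumptions for illustration.

```python
from collections import Counter


def continuation_probs(tokens):
    """P_cont(w) = (distinct words preceding w) / (distinct bigram types)."""
    bigram_types = set(zip(tokens, tokens[1:]))
    preceding = Counter(w2 for _, w2 in bigram_types)
    return {w: n / len(bigram_types) for w, n in preceding.items()}


tokens = ("san francisco is foggy . san francisco is hilly . "
          "san francisco is big . the food is good . my food was hot . "
          "your food is here .").split()

unigram = Counter(tokens)
cont = continuation_probs(tokens)

# Both words occur three times, but "francisco" only ever follows "san",
# while "food" follows three different words.
print(unigram["francisco"], unigram["food"])
print(cont["francisco"], cont["food"])
```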
Modified Kneser-Ney smoothing, which uses three separate discount values for n-grams with counts of one, two, and three or more, was shown by Stanley Chen and Joshua Goodman (1999) to consistently outperform other smoothing methods across multiple corpora and n-gram orders.[6]
Trigram language models were foundational to automatic speech recognition (ASR) systems from the late 1970s onward. At IBM, Frederick Jelinek and his team used trigram models in the TANGORA speech recognition system. The system used hidden Markov models (HMMs) for acoustic modeling and trigram language models to constrain word sequence predictions.[2]
Researchers observed an interesting symbiosis between trigram models and acoustic models. When acoustic evidence was weak (as for short function words like "the," "of," "a"), the trigram model tended to be strong because these words occur in highly predictable contexts. When the trigram model was uncertain (as for content words with diverse contexts), the acoustic model was typically more reliable because content words tend to be longer and more acoustically distinct.
Estimates from the speech recognition community suggest that roughly 80% of ASR technology through the 2000s was built on refined versions of this 1970s trigram paradigm.
Statistical machine translation systems used trigram language models to evaluate the fluency of candidate translations. In the noisy channel framework, the translation model proposes candidate word sequences, and the language model scores how likely each candidate is as a sentence in the target language. Trigram models were the default choice for this fluency scoring component in systems like those developed by IBM and later by Google (before the switch to neural machine translation in 2016).[9]
Character-level trigrams provide a remarkably effective method for identifying the language of a text sample. Each language has a distinctive distribution of three-character sequences. For example, English frequently contains trigrams like "the," "ing," and "tion," while German favors "sch," "ein," and "und."
The approach works by building a reference profile of trigram frequencies for each candidate language from training text. Given an unknown text sample, the system computes its trigram profile and measures the distance to each reference profile. The language with the smallest distance is selected. A 1994 study by William Cavnar and John Trenkle showed that this method achieves an average accuracy of 99.8% on documents longer than 300 characters, and that profiles of roughly the top 300 most frequent n-grams are nearly always sufficient to identify a language correctly, even on short and error-laden text.[8]
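The profile-and-distance procedure can be sketched as follows. This is a simplified illustration, not a reproduction of the published system: it uses character trigrams only, ranks them by frequency, and compares profiles with a rank-difference ("out-of-place") distance in which a trigram missing from the reference receives a fixed maximum penalty. The training sentences, function names, and the penalty value are all assumptions.

```python
from collections import Counter


def profile(text, top_k=300):
    """Map each of the top_k most frequent character trigrams to its rank."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(sample, reference, max_penalty):
    """Sum of rank differences; missing trigrams cost max_penalty each."""
    return sum(
        abs(rank - reference[gram]) if gram in reference else max_penalty
        for gram, rank in sample.items()
    )


def identify(text, references, top_k=300):
    sample = profile(text, top_k)
    return min(references,
               key=lambda lang: out_of_place(sample, references[lang], top_k))


# Tiny hypothetical training texts; real profiles need far more data.
references = {
    "english": profile("the quick brown fox jumps over the lazy dog "
                       "and the cat is sitting in the garden"),
    "german": profile("der schnelle braune fuchs springt auf den faulen hund "
                      "und die katze sitzt in dem garten"),
}

print(identify("the dog is in the garden", references))
print(identify("der hund und die katze", references))
```

With training text this small the result is only a demonstration of the mechanics; the accuracy figures quoted above depend on much larger reference profiles.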
Trigrams serve as features for text classification tasks including sentiment analysis, spam detection, and topic categorization. Because trigrams capture short phrases and word combinations, they can represent meaning that individual words (unigrams) cannot. For example, the trigram "not very good" carries negative sentiment, while the unigram "good" alone suggests positive sentiment.
In information retrieval, search engines have used trigram matching to improve query understanding. Character trigrams also support fuzzy matching, where approximate string comparisons help account for misspellings and morphological variations.
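A minimal sketch of trigram-based fuzzy matching, assuming a Jaccard similarity over sets of character trigrams. The padding convention (two leading spaces and one trailing space, so that word boundaries produce their own trigrams) and the function names are illustrative assumptions.

```python
def char_trigrams(word):
    """Set of character trigrams of the lowercased, space-padded word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity: shared trigrams divided by all distinct trigrams."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    return len(ta & tb) / len(ta | tb)


# A transposition typo still shares several trigrams with the intended word,
# while an unrelated word shares none.
typo_score = trigram_similarity("language", "langauge")
unrelated_score = trigram_similarity("language", "trigram")
print(typo_score, unrelated_score)
```

A search system could rank candidate corrections by this score and keep those above a chosen threshold.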
The Google Books Ngram Viewer, launched on December 16, 2010, allows users to chart the frequency of n-grams (including trigrams) across a corpus that initially contained about 500 billion words from 5.2 million digitized books.[10] Users can search for three-word phrases and see how their usage has changed over time in publications from 1500 to 2022. The viewer was built by Google engineers Will Brockman and Jon Orwant together with Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden, and it displays a phrase only if it appears in at least 40 books. Later releases expanded the underlying dataset to roughly 800 billion tokens across eight languages, and the tool has been widely used in digital humanities, historical linguistics, and cultural analysis.[9]
The use of trigrams in computational linguistics has a history spanning several decades.
| Year | Milestone |
|---|---|
| 1948 | Claude Shannon applies n-gram analysis to English text in "A Mathematical Theory of Communication"[1] |
| 1975-1976 | Jim Baker and Frederick Jelinek independently introduce n-gram language models for speech recognition at CMU and IBM[2] |
| 1977 | Jelinek, Mercer, Bahl, and Baker introduce perplexity as a language model evaluation metric[2] |
| 1980 | Jelinek and Mercer formalize interpolated (linear combination) smoothing for n-gram models[3] |
| 1987 | Slava Katz publishes the backoff smoothing method[4] |
| 1992 | Brown and colleagues at IBM report a trigram cross-entropy of 1.75 bits per character on the Brown Corpus[11] |
| 1995 | Kneser and Ney propose their influential smoothing technique based on continuation counts[5] |
| 1999 | Chen and Goodman publish a comprehensive comparison of smoothing methods, establishing Modified Kneser-Ney as the best performer[6] |
| 2003 | Yoshua Bengio and colleagues publish "A Neural Probabilistic Language Model," introducing neural network language models that would eventually replace n-gram approaches[7] |
| 2010 | Google launches the Ngram Viewer, making n-gram analysis accessible to the public[10] |
| 2013-present | Word embeddings, RNNs, and Transformer models progressively replace trigram models in most production NLP systems |
Despite decades of success, trigram language models have been largely superseded by neural network-based approaches. Several fundamental limitations drove this transition.
Fixed context window. A trigram model can only consider the previous two words. It has no mechanism for capturing dependencies that span longer distances. For instance, in the sentence "The author who won the prize last year wrote a new book," a trigram model cannot connect "author" with "wrote" because they are separated by several words. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and especially Transformer architectures can model dependencies across entire sentences or even documents.
No generalization across similar words. Trigram models treat each word as a discrete symbol. If the model has seen "ate Chinese food" many times but never "ate Japanese food," it cannot infer that "Japanese food" is plausible in the same context. Bengio and colleagues framed this as the "curse of dimensionality" and proposed fighting it by learning a distributed representation for words, so that "a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence."[7] Neural models use word embeddings that represent words as dense vectors in a continuous space, allowing them to generalize from "Chinese" to semantically similar words like "Japanese," "Korean," or "Italian."[7]
Massive storage requirements. A trigram model over a vocabulary of 100,000 words must potentially store probabilities for 10^15 trigrams. Even with pruning, the resulting probability tables are enormous. Neural language models instead store their parameters in weight matrices that are shared across all contexts, which typically requires far less memory than an explicit table of n-gram probabilities.
Inability to capture compositionality. Trigrams cannot compose meaning from subword parts. Neural models with subword tokenization (such as byte pair encoding) can handle rare words, compound words, and morphological variations by breaking them into meaningful pieces.
Despite these limitations, trigram models retain some advantages: they are simple to implement, fast to train, interpretable, and deterministic. They remain useful in resource-constrained environments and as baseline models for NLP research.
Imagine you are reading a storybook and you want to guess the next word. If you only look at the word right before the blank, you might guess wrong a lot. But if you look at the two words right before the blank, you can make a much better guess.
That is what a trigram does. It looks at groups of three words in a row. For example, in "I like ice ____," the trigram "like ice ____" helps a computer guess the next word is probably "cream." If it only saw "ice ____" (two words), it might also guess "skating" or "cold." The extra word gives it a better clue.
Computers read millions of sentences and count how often groups of three words appear together. Then, when they see two words, they can look up what word usually comes next. This helps computers understand language, recognize speech, and even translate from one language to another.