See also: Machine learning terms
An n-gram is a contiguous sequence of n items extracted from a given sample of text or speech. The items can be characters, syllables, words, or other linguistic units depending on the application. N-grams are one of the foundational concepts in natural language processing (NLP) and computational linguistics, serving as the basis for statistical language models, text classification systems, spell checkers, and many other tools that process human language.
The term "n-gram" combines the variable n with "gram," which derives from the Greek word gramma, meaning "letter" or "written character." Claude Shannon introduced the concept of character-level and word-level n-gram models in his 1948 and 1951 papers on information theory, where he used them to demonstrate the statistical structure of the English language.
Imagine you have a bunch of Lego blocks with words written on them. An n-gram is a way to connect these blocks to make different combinations. If you pick up just one block, that is a unigram. If you snap two blocks together, that is a bigram. And if you connect three blocks in a row, that is a trigram.
These word combos help computers figure out which words usually go together. For example, if a computer sees "peanut butter and," it can guess the next word is probably "jelly" because it has seen those words together many times before. That is basically how n-grams work: they help computers learn patterns in language by looking at small groups of words that appear next to each other.
N-grams are classified by their length, which is determined by the value of n.
| Type | n | Description | Example (from "the cat sat on the mat") |
|---|---|---|---|
| Unigram | 1 | Single item | "the", "cat", "sat", "on", "the", "mat" |
| Bigram | 2 | Pair of consecutive items | "the cat", "cat sat", "sat on", "on the", "the mat" |
| Trigram | 3 | Triple of consecutive items | "the cat sat", "cat sat on", "sat on the", "on the mat" |
| 4-gram | 4 | Four consecutive items | "the cat sat on", "cat sat on the", "sat on the mat" |
| 5-gram | 5 | Five consecutive items | "the cat sat on the", "cat sat on the mat" |
In practice, bigrams and trigrams are the most commonly used in NLP applications. Higher-order n-grams (4-grams and above) capture more context but suffer from data sparsity, since the number of possible n-grams grows exponentially with n.
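Extracting n-grams amounts to sliding a window of length n over the token sequence. A minimal Python sketch, assuming naive whitespace tokenization (real systems use proper tokenizers and often add sentence-boundary markers):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()  # naive whitespace tokenization
print(ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), ...
```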
N-grams can operate at different levels of granularity. The two most common types are word n-grams and character n-grams, each with distinct strengths.
Word n-grams treat each word as an atomic unit. For the sentence "I love cats," the word bigrams are "I love" and "love cats." Word n-grams capture syntactic patterns and phrase-level meaning, making them useful for language modeling and text classification.
Character n-grams treat each character (including spaces and punctuation) as a unit. For the word "cats," the character trigrams are "cat" and "ats." Character n-grams are particularly valuable for handling misspellings, identifying languages, and capturing subword morphology, as the following comparison shows:
| Feature | Word n-grams | Character n-grams |
|---|---|---|
| Unit of analysis | Whole words | Individual characters |
| Vocabulary size | Large (all unique words) | Small (alphabet size) |
| Robustness to typos | Low | High |
| Captures syntax | Yes | Limited |
| Language identification | Moderate | Excellent |
| Subword morphology | No | Yes |
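The robustness-to-typos contrast in the table can be made concrete: the character trigrams of a misspelled word still overlap heavily with those of the intended word, even though the word-level strings do not match at all. A small sketch, where the use of Jaccard similarity (and the absence of boundary padding) is an illustrative choice rather than a standard:

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

print(jaccard(char_ngrams("mispelled", 3), char_ngrams("misspelled", 3)))  # ~0.67: high overlap despite the typo
print(jaccard(char_ngrams("mispelled", 3), char_ngrams("calendar", 3)))    # 0.0: unrelated word
```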
An n-gram language model estimates the probability of a sequence of words by decomposing it into a product of conditional probabilities. This relies on two key ideas: the chain rule of probability and the Markov assumption.
The probability of a sentence W = w1, w2, ..., wm can be expressed using the chain rule of probability:
P(w1, w2, ..., wm) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * ... * P(wm | w1, ..., wm-1)
Computing the full conditional probability P(wm | w1, ..., wm-1) for every word is impractical because it requires observing every possible word history, which becomes astronomically large for long sentences.
To make the problem tractable, n-gram models apply the Markov assumption: the probability of a word depends only on the preceding n - 1 words rather than the entire history. For a bigram model (n = 2):
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)
For a trigram model (n = 3):
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)
This corresponds directly to a Markov chain, where a unigram model is a zeroth-order Markov model, a bigram model is first-order, and a trigram model is second-order.
The simplest way to estimate n-gram probabilities is maximum likelihood estimation (MLE) from a training corpus. The bigram probability is:
P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)
For example, if "the cat" appears 50 times in a corpus and "the" appears 10,000 times, then P(cat | the) = 50 / 10,000 = 0.005.
The trigram probability generalizes this:
P(wi | wi-2, wi-1) = Count(wi-2, wi-1, wi) / Count(wi-2, wi-1)
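Under MLE these probabilities are just normalized counts. A minimal sketch over a toy in-memory corpus (an illustrative assumption; practical models are trained on millions of sentences and add sentence-boundary markers):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """MLE estimate: Count(w_prev, w) / Count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("the", "cat"))  # 2 / 6 ≈ 0.33 in this toy corpus
```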
Perplexity is the standard metric for evaluating language models. It measures how well a probability model predicts a test set. For a test set W of N words, the perplexity (PP) of a language model is:
PP(W) = P(w1, w2, ..., wN) ^ (-1/N)
A lower perplexity indicates a better model, meaning the model assigns higher probability to the test data. Intuitively, perplexity can be interpreted as the weighted average number of choices the model faces when predicting the next word. A perplexity of 100 means the model is, on average, as uncertain as if it had to choose uniformly among 100 words at each step.
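A sketch of the computation under a bigram model, working in log space to avoid numerical underflow on long test sets (the `prob` argument stands for any conditional estimator, such as the hypothetical `bigram_prob` above; without smoothing, a single unseen bigram makes the perplexity infinite):

```python
import math

def perplexity(test_tokens, prob):
    """Perplexity of a token sequence under a bigram model prob(w_prev, w)."""
    log_prob = 0.0
    for w_prev, w in zip(test_tokens, test_tokens[1:]):
        p = prob(w_prev, w)
        if p == 0.0:
            return float("inf")  # unsmoothed models assign zero to unseen bigrams
        log_prob += math.log(p)
    n = len(test_tokens) - 1  # number of predicted words
    return math.exp(-log_prob / n)
```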
A major challenge for n-gram models is data sparsity: many valid n-grams never appear in the training corpus and receive a probability of zero under MLE. Since a single zero probability in a product drives the entire sentence probability to zero, smoothing techniques are essential. Smoothing redistributes probability mass from observed n-grams to unobserved ones.
| Technique | Key idea | Strengths | Weaknesses |
|---|---|---|---|
| Laplace (add-one) | Add 1 to every n-gram count | Simple to implement | Distorts probabilities heavily for large vocabularies |
| Add-k | Add a fractional count k < 1 | Less aggressive than Laplace | Choosing optimal k is difficult |
| Good-Turing | Re-estimate counts using frequency of frequencies | Principled statistical foundation | Complex implementation; unreliable for high counts |
| Kneser-Ney | Absolute discounting plus continuation probability | State-of-the-art for n-gram models | More complex to implement |
| Witten-Bell | Uses count of novel events to estimate unseen probability | Good for sparse data | Less effective than Kneser-Ney in most benchmarks |
Laplace smoothing adds 1 to every n-gram count before normalization:
P_Laplace(wi | wi-1) = (Count(wi-1, wi) + 1) / (Count(wi-1) + V)
where V is the vocabulary size. Although simple, Laplace smoothing assigns too much probability mass to unseen events when the vocabulary is large, making it impractical for word-level language models. It remains useful for character-level models and text classification tasks where vocabulary sizes are small.
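A sketch of the smoothed estimate, reusing count tables of the kind built in the MLE sketch above (the function signature is an illustrative assumption):

```python
def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one estimate: (Count(w_prev, w) + 1) / (Count(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# With V = len(unigram_counts), an unseen bigram such as ("dog", "ate")
# now receives a small nonzero probability instead of zero.
```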
Add-k smoothing generalizes Laplace smoothing by adding a fractional count k (typically 0.01 to 0.5) instead of 1. This reduces the amount of probability mass shifted to unseen events but still requires tuning k on held-out data.
Good-Turing discounting, proposed by Alan Turing and I. J. Good during World War II for cryptanalysis, adjusts counts based on the frequency of frequencies. The central idea is that the adjusted count for n-grams occurring c times is:
c* = (c + 1) * N(c+1) / N(c)
where N(c) is the number of n-grams that occur exactly c times. This effectively redistributes probability mass from common n-grams to rare and unseen ones, guided by observed statistics rather than arbitrary constants.
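A small worked sketch of the adjusted-count formula with a toy frequency-of-frequencies table (real implementations also smooth N(c) itself, which becomes noisy for large c):

```python
# N(c): number of distinct n-grams observed exactly c times (toy values).
freq_of_freqs = {1: 120, 2: 40, 3: 24, 4: 13}

def good_turing_count(c):
    """Adjusted count c* = (c + 1) * N(c + 1) / N(c)."""
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

print(good_turing_count(1))  # 2 * 40 / 120 ≈ 0.67: singleton counts are discounted
print(good_turing_count(2))  # 3 * 24 / 40 = 1.8
```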
Kneser-Ney smoothing, introduced by Reinhard Kneser and Hermann Ney in 1995, is widely considered the most effective smoothing method for n-gram language models. It combines two innovations: absolute discounting, which subtracts a fixed discount from every observed count, and a continuation probability for the lower-order distribution, which estimates how likely a word is to appear in a novel context (based on the number of distinct contexts in which it has been observed) rather than on its raw frequency.
Modified Kneser-Ney, which uses three different discount values based on whether a count is 1, 2, or 3 or more, further improves performance and is the default in most modern n-gram toolkits such as SRILM and KenLM.
Backoff and interpolation are strategies for combining n-gram models of different orders to handle unseen n-grams.
Backoff (Katz backoff) uses the highest-order n-gram model that has sufficient data. If a trigram has been observed, its probability is used directly (with discounting). If the trigram has not been observed, the model "backs off" to the bigram estimate, and if that is also missing, it falls back to the unigram.
Interpolation (Jelinek-Mercer interpolation) always mixes the probabilities from all n-gram orders using learned weights:
P_interp(wi | wi-2, wi-1) = lambda3 * P(wi | wi-2, wi-1) + lambda2 * P(wi | wi-1) + lambda1 * P(wi)
where lambda1 + lambda2 + lambda3 = 1. The weights are typically optimized on a held-out development set using the expectation-maximization algorithm. Interpolation tends to perform better than simple backoff because it always incorporates information from all available n-gram orders.
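A minimal sketch of the interpolated estimate, assuming hypothetical unigram, bigram, and trigram estimators and fixed weights (in practice the lambdas are tuned on held-out data, for example with EM):

```python
def interpolated_prob(w2, w1, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Jelinek-Mercer mixture of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas  # must sum to 1
    return l3 * p_tri(w2, w1, w) + l2 * p_bi(w1, w) + l1 * p_uni(w)
```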
N-grams have been applied across a wide range of tasks in NLP and beyond.
Before the rise of neural networks, n-gram models were the dominant approach to language modeling. They powered predictive text and autocomplete features and served as core components of speech recognition and machine translation systems. Even today, n-gram models are sometimes used as baselines or combined with neural models in hybrid systems.
N-grams serve as features in many text classification systems. In the bag of words model, each unique word is a feature; extending this to bigrams or trigrams (sometimes called a "bag of n-grams") captures word order information that unigrams alone miss. For example, "not good" as a bigram carries very different sentiment from the individual unigrams "not" and "good." N-gram features have been widely used in spam detection, sentiment analysis, and topic categorization.
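A sketch of bag-of-n-grams feature extraction with scikit-learn's CountVectorizer, whose ngram_range parameter controls which n-gram orders are included (the toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was not good", "this movie was good"]

# Unigrams plus bigrams: the feature "not good" now separates the two texts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```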
Search engines use n-grams in several ways. N-gram indexes help match queries to documents even when exact word matches fail. Character n-grams enable fuzzy matching, handling spelling variations and typos in search queries. N-gram overlap also helps measure document similarity for deduplication and clustering.
Spell checkers use character n-gram similarity to suggest corrections for misspelled words. When a user types an unrecognized word, the system computes the character n-gram overlap between the input and dictionary words, ranking candidates by similarity. Word-level n-gram context then helps select the most likely correction, such as choosing between "their" and "there" based on surrounding words.
Historically, n-gram language models were a critical component of automatic speech recognition (ASR) systems. The acoustic model generates candidate word sequences from audio, and the n-gram language model rescores these candidates based on how likely each word sequence is in the language. Although neural language models have largely replaced n-grams in modern ASR, n-gram models remain useful for first-pass decoding due to their computational efficiency.
N-grams play a central role in automatic evaluation metrics for natural language generation tasks.
BLEU (Bilingual Evaluation Understudy), introduced by Kishore Papineni and colleagues at IBM in 2002, evaluates machine translation quality by computing the modified precision of n-grams (unigrams through 4-grams) in the machine output against one or more reference translations. A brevity penalty is applied to discourage overly short translations. BLEU remains the most widely used automatic metric for machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), developed by Chin-Yew Lin in 2004, is primarily used for evaluating text summarization. ROUGE-N measures n-gram recall between the system summary and reference summaries. While BLEU focuses on precision (what fraction of generated n-grams appear in references), ROUGE emphasizes recall (what fraction of reference n-grams appear in the generated text).
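The core computation behind both metrics is clipped n-gram overlap. A simplified sketch for a single n-gram order and a single reference (full BLEU combines orders 1 through 4 and applies the brevity penalty; ROUGE implementations handle multiple references):

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_overlap(candidate, reference, n):
    """Each candidate n-gram counts at most as often as it appears in the reference."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

def modified_precision(candidate, reference, n):  # BLEU-style precision
    return clipped_overlap(candidate, reference, n) / sum(ngram_counts(candidate, n).values())

def rouge_n(candidate, reference, n):  # ROUGE-style recall
    return clipped_overlap(candidate, reference, n) / sum(ngram_counts(reference, n).values())
```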
Analyzing n-gram frequencies across large text corpora reveals patterns about language use, cultural trends, and historical change.
N-gram frequency distributions typically follow Zipf's law: a small number of n-grams are extremely common, while the vast majority are rare. In English, the most frequent bigrams include "of the," "in the," and "to the." This heavy-tailed distribution is the root cause of the data sparsity problem in n-gram models.
The Google Books Ngram Viewer, launched in December 2010, charts the frequencies of n-grams (up to 5-grams) across a corpus of over 8 million digitized books spanning from 1500 to 2022. The tool was created by Google engineers Jon Orwant and Will Brockman in collaboration with Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden, who coined the term "culturomics" to describe the quantitative analysis of cultural trends through digitized text.
The Ngram Viewer has been used to study diverse phenomena, including the evolution of grammar (such as the shift from irregular to regular verb forms), the rise and fall of cultural figures in public discourse, the adoption of technologies, and patterns of censorship across different countries. The underlying dataset covers roughly 4% of all books ever published and is available for download in multiple languages.
Despite their simplicity and historical importance, n-gram models have several fundamental limitations.
Data sparsity. As n increases, the number of possible n-grams grows exponentially (V^n for a vocabulary of size V). Even very large corpora cannot provide reliable frequency estimates for all possible trigrams or 4-grams, let alone higher orders. This makes it difficult to increase the context window beyond 4 or 5 words.
No generalization across similar words. N-gram models treat each word as an atomic symbol with no notion of similarity. The model cannot transfer knowledge from "the dog is running" to "the cat is running" because "dog" and "cat" are entirely different symbols. Word embeddings and neural models address this by representing words as dense vectors in a continuous space where similar words have similar representations.
Fixed context window. N-gram models can only capture dependencies within a window of n - 1 preceding words. They cannot model long-range dependencies, such as subject-verb agreement across a relative clause ("The keys that were on the table are missing"), where the verb "are" depends on "keys" several words back.
Storage requirements. Storing all observed n-grams and their counts requires substantial memory, especially for large corpora and higher-order models. Google's Web 1T 5-gram dataset, released in 2006, contains approximately 1 trillion tokens worth of n-gram counts and requires roughly 24 GB of compressed storage.
No semantic understanding. N-gram models capture surface-level co-occurrence patterns but have no understanding of meaning. They cannot distinguish between "bank" as a financial institution and "bank" as a river bank without sufficient contextual n-grams to disambiguate.
The limitations of n-gram models motivated the development of neural language models, which have largely replaced n-grams as the state of the art.
Yoshua Bengio and colleagues introduced the neural probabilistic language model in 2003, which addressed two key weaknesses of n-gram models. First, it represented each word as a continuous word embedding vector, enabling the model to generalize across similar words. Second, it used a neural network to estimate conditional probabilities, allowing it to share parameters across different contexts.
Recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, extended neural language models by processing sequences of arbitrary length, removing the fixed context window limitation. Transformer-based models, introduced by Vaswani et al. in 2017, further advanced the field through self-attention mechanisms that can capture dependencies across very long sequences.
| Feature | N-gram models | Neural language models |
|---|---|---|
| Context window | Fixed (n - 1 words) | Variable (entire sequence for Transformers) |
| Word representation | Discrete symbols | Continuous vectors (embeddings) |
| Generalization | No similarity-based generalization | Shares knowledge across similar words |
| Training data efficiency | Requires enormous corpora for higher n | More data-efficient through parameter sharing |
| Computational cost (inference) | Very fast (table lookup) | Higher (matrix operations) |
| Interpretability | High (explicit counts) | Lower (learned parameters) |
| Storage | Large for high-order models | Model parameters (fixed size) |
Despite the dominance of neural models, n-grams retain practical value. They are fast to train and query, require no GPU hardware, and are effective for applications where computational resources are limited. Many production speech recognition systems still use n-gram models for first-pass decoding before rescoring with neural models.
Byte-pair encoding (BPE), originally a data compression algorithm, was adapted by Rico Sennrich and colleagues in 2016 for subword tokenization in neural machine translation. BPE can be understood as an evolution of the character n-gram concept.
BPE starts with individual characters and iteratively merges the most frequent adjacent pair of symbols into a new symbol. After many iterations, the resulting vocabulary contains a mix of individual characters, common character sequences (essentially learned character n-grams), and full words. This approach handles rare and unseen words gracefully by decomposing them into known subword units.
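A minimal sketch of the training loop on a toy corpus (the Sennrich et al. reference implementation differs in details such as end-of-word markers and tie-breaking):

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab:
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Toy corpus: each word as a list of single-character symbols plus its frequency.
vocab = [(list("lower"), 2), (list("newest"), 6), (list("widest"), 3)]
for _ in range(4):
    best = count_pairs(vocab).most_common(1)[0][0]
    vocab = [(merge_pair(symbols, best), freq) for symbols, freq in vocab]
    print(best, [" ".join(symbols) for symbols, _ in vocab])
```

After a few merges the vocabulary contains learned multi-character units such as "est", which behave like frequency-selected character n-grams.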
Variants of BPE, including WordPiece (used in BERT) and SentencePiece (used in many multilingual models), have become the standard tokenization method for modern large language models. These methods preserve the core n-gram insight (that contiguous subsequences carry useful information) while learning the optimal set of subsequences from data.