See also: Machine learning terms
An n-gram is a contiguous sequence of n items extracted from a given sample of text or speech. The items can be characters, syllables, words, or other linguistic units depending on the application. N-grams are one of the foundational concepts in natural language processing (NLP) and computational linguistics, serving as the basis for statistical language models, text classification systems, spell checkers, and many other tools that process human language.
The term "n-gram" combines the variable n with "gram," which derives from the Greek word gramma, meaning "letter" or "written character." Claude Shannon introduced the concept of character-level and word-level n-gram models in his 1948 and 1951 papers on information theory, where he used them to demonstrate the statistical structure of the English language.
Imagine you have a bunch of Lego blocks with words written on them. An n-gram is a way to connect these blocks to make different combinations. If you pick up just one block, that is a unigram. If you snap two blocks together, that is a bigram. And if you connect three blocks in a row, that is a trigram.
These word combos help computers figure out which words usually go together. For example, if a computer sees "peanut butter and," it can guess the next word is probably "jelly" because it has seen those words together many times before. That is basically how n-grams work: they help computers learn patterns in language by looking at small groups of words that appear next to each other.
N-grams are classified by their length, which is determined by the value of n.
| Type | n | Description | Example (from "the cat sat on the mat") |
|---|---|---|---|
| Unigram | 1 | Single item | "the", "cat", "sat", "on", "the", "mat" |
| Bigram | 2 | Pair of consecutive items | "the cat", "cat sat", "sat on", "on the", "the mat" |
| Trigram | 3 | Triple of consecutive items | "the cat sat", "cat sat on", "sat on the", "on the mat" |
| 4-gram | 4 | Four consecutive items | "the cat sat on", "cat sat on the", "sat on the mat" |
| 5-gram | 5 | Five consecutive items | "the cat sat on the", "cat sat on the mat" |
In practice, bigrams and trigrams are the most commonly used in NLP applications. Higher-order n-grams (4-grams and above) capture more context but suffer from data sparsity, since the number of possible n-grams grows exponentially with n.
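Extracting n-grams amounts to sliding a window of length n over the token sequence. A minimal Python sketch, assuming naive whitespace tokenization (real systems use proper tokenizers and often add sentence-boundary markers):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()  # naive whitespace tokenization
print(ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), ...
```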
N-grams can operate at different levels of granularity. The two most common types are word n-grams and character n-grams, each with distinct strengths.
Word n-grams treat each word as an atomic unit. For the sentence "I love cats," the word bigrams are "I love" and "love cats." Word n-grams capture syntactic patterns and phrase-level meaning, making them useful for language modeling and text classification.
Character n-grams treat each character (including spaces and punctuation) as a unit. For the word "cats," the character trigrams are "cat" and "ats." Character n-grams are particularly valuable for handling misspellings, identifying languages, and capturing subword morphology, as the following comparison shows:
| Feature | Word n-grams | Character n-grams |
|---|---|---|
| Unit of analysis | Whole words | Individual characters |
| Vocabulary size | Large (all unique words) | Small (alphabet size) |
| Robustness to typos | Low | High |
| Captures syntax | Yes | Limited |
| Language identification | Moderate | Excellent |
| Subword morphology | No | Yes |
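The robustness-to-typos contrast in the table can be made concrete: the character trigrams of a misspelled word still overlap heavily with those of the intended word, even though the word-level strings do not match at all. A small sketch, where the use of Jaccard similarity (and the absence of boundary padding) is an illustrative choice rather than a standard:

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

print(jaccard(char_ngrams("mispelled", 3), char_ngrams("misspelled", 3)))  # ~0.67: high overlap despite the typo
print(jaccard(char_ngrams("mispelled", 3), char_ngrams("calendar", 3)))    # 0.0: unrelated word
```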
An n-gram language model estimates the probability of a sequence of words by decomposing it into a product of conditional probabilities. This relies on two key ideas: the chain rule of probability and the Markov assumption.
The probability of a sentence W = w1, w2, ..., wm can be expressed using the chain rule of probability:
P(w1, w2, ..., wm) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * ... * P(wm | w1, ..., wm-1)
Computing the full conditional probability P(wm | w1, ..., wm-1) for every word is impractical because it requires observing every possible word history, which becomes astronomically large for long sentences.
To make the problem tractable, n-gram models apply the Markov assumption: the probability of a word depends only on the preceding n - 1 words rather than the entire history. For a bigram model (n = 2):
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)
For a trigram model (n = 3):
P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)
This corresponds directly to a Markov chain, where a unigram model is a zeroth-order Markov model, a bigram model is first-order, and a trigram model is second-order.
The simplest way to estimate n-gram probabilities is maximum likelihood estimation (MLE) from a training corpus. The bigram probability is:
P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)
For example, if "the cat" appears 50 times in a corpus and "the" appears 10,000 times, then P(cat | the) = 50 / 10,000 = 0.005.
The trigram probability generalizes this:
P(wi | wi-2, wi-1) = Count(wi-2, wi-1, wi) / Count(wi-2, wi-1)
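Under MLE these probabilities are just normalized counts. A minimal sketch over a toy in-memory corpus (an illustrative assumption; practical models are trained on millions of sentences and add sentence-boundary markers):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """MLE estimate: Count(w_prev, w) / Count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("the", "cat"))  # 2 / 6 ≈ 0.33 in this toy corpus
```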
Perplexity is the standard metric for evaluating language models. It measures how well a probability model predicts a test set. For a test set W of N words, the perplexity (PP) of a language model is:
PP(W) = P(w1, w2, ..., wN) ^ (-1/N)
A lower perplexity indicates a better model, meaning the model assigns higher probability to the test data. Intuitively, perplexity can be interpreted as the weighted average number of choices the model faces when predicting the next word. A perplexity of 100 means the model is, on average, as uncertain as if it had to choose uniformly among 100 words at each step.
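A sketch of the computation under a bigram model, working in log space to avoid numerical underflow on long test sets (the `prob` argument stands for any conditional estimator, such as the hypothetical `bigram_prob` above; without smoothing, a single unseen bigram makes the perplexity infinite):

```python
import math

def perplexity(test_tokens, prob):
    """Perplexity of a token sequence under a bigram model prob(w_prev, w)."""
    log_prob = 0.0
    for w_prev, w in zip(test_tokens, test_tokens[1:]):
        p = prob(w_prev, w)
        if p == 0.0:
            return float("inf")  # unsmoothed models assign zero to unseen bigrams
        log_prob += math.log(p)
    n = len(test_tokens) - 1  # number of predicted words
    return math.exp(-log_prob / n)
```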
A major challenge for n-gram models is data sparsity: many valid n-grams never appear in the training corpus and receive a probability of zero under MLE. Since a single zero probability in a product drives the entire sentence probability to zero, smoothing techniques are essential. Smoothing redistributes probability mass from observed n-grams to unobserved ones.
| Technique | Key idea | Strengths | Weaknesses |
|---|---|---|---|
| Laplace (add-one) | Add 1 to every n-gram count | Simple to implement | Distorts probabilities heavily for large vocabularies |
| Add-k | Add a fractional count k < 1 | Less aggressive than Laplace | Choosing optimal k is difficult |
| Good-Turing | Re-estimate counts using frequency of frequencies | Principled statistical foundation | Complex implementation; unreliable for high counts |
| Kneser-Ney | Absolute discounting plus continuation probability | State-of-the-art for n-gram models | More complex to implement |
| Witten-Bell | Uses count of novel events to estimate unseen probability | Good for sparse data | Less effective than Kneser-Ney in most benchmarks |
Laplace smoothing adds 1 to every n-gram count before normalization:
P_Laplace(wi | wi-1) = (Count(wi-1, wi) + 1) / (Count(wi-1) + V)
where V is the vocabulary size. Although simple, Laplace smoothing assigns too much probability mass to unseen events when the vocabulary is large, making it impractical for word-level language models. It remains useful for character-level models and text classification tasks where vocabulary sizes are small.
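A sketch of the smoothed estimate, reusing count tables of the kind built in the MLE sketch above (the function signature is an illustrative assumption):

```python
def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one estimate: (Count(w_prev, w) + 1) / (Count(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# With V = len(unigram_counts), an unseen bigram such as ("dog", "ate")
# now receives a small nonzero probability instead of zero.
```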
Add-k smoothing generalizes Laplace smoothing by adding a fractional count k (typically 0.01 to 0.5) instead of 1. This reduces the amount of probability mass shifted to unseen events but still requires tuning k on held-out data.
Good-Turing discounting, proposed by Alan Turing and I. J. Good during World War II for cryptanalysis, adjusts counts based on the frequency of frequencies. The central idea is that the adjusted count for n-grams occurring c times is:
c* = (c + 1) * N(c+1) / N(c)
where N(c) is the number of n-grams that occur exactly c times. This effectively redistributes probability mass from common n-grams to rare and unseen ones, guided by observed statistics rather than arbitrary constants.
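A small worked sketch of the adjusted-count formula with a toy frequency-of-frequencies table (real implementations also smooth N(c) itself, which becomes noisy for large c):

```python
# N(c): number of distinct n-grams observed exactly c times (toy values).
freq_of_freqs = {1: 120, 2: 40, 3: 24, 4: 13}

def good_turing_count(c):
    """Adjusted count c* = (c + 1) * N(c + 1) / N(c)."""
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

print(good_turing_count(1))  # 2 * 40 / 120 ≈ 0.67: singleton counts are discounted
print(good_turing_count(2))  # 3 * 24 / 40 = 1.8
```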
Kneser-Ney smoothing, introduced by Reinhard Kneser and Hermann Ney in 1995, is widely considered the most effective smoothing method for n-gram language models. It combines two innovations: absolute discounting, which subtracts a fixed discount from every observed count, and a continuation probability for the lower-order distribution, which estimates how likely a word is to appear in a novel context (based on the number of distinct contexts in which it has been observed) rather than on its raw frequency.
Modified Kneser-Ney, which uses three different discount values based on whether a count is 1, 2, or 3 or more, further improves performance and is the default in most modern n-gram toolkits such as SRILM and KenLM.
Backoff and interpolation are strategies for combining n-gram models of different orders to handle unseen n-grams.
Backoff (Katz backoff) uses the highest-order n-gram model that has sufficient data. If a trigram has been observed, its probability is used directly (with discounting). If the trigram has not been observed, the model "backs off" to the bigram estimate, and if that is also missing, it falls back to the unigram.
Interpolation (Jelinek-Mercer interpolation) always mixes the probabilities from all n-gram orders using learned weights:
P_interp(wi | wi-2, wi-1) = lambda3 * P(wi | wi-2, wi-1) + lambda2 * P(wi | wi-1) + lambda1 * P(wi)
where lambda1 + lambda2 + lambda3 = 1. The weights are typically optimized on a held-out development set using the expectation-maximization algorithm. Interpolation tends to perform better than simple backoff because it always incorporates information from all available n-gram orders.
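A minimal sketch of the interpolated estimate, assuming hypothetical unigram, bigram, and trigram estimators and fixed weights (in practice the lambdas are tuned on held-out data, for example with EM):

```python
def interpolated_prob(w2, w1, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Jelinek-Mercer mixture of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas  # must sum to 1
    return l3 * p_tri(w2, w1, w) + l2 * p_bi(w1, w) + l1 * p_uni(w)
```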
N-grams have been applied across a wide range of tasks in NLP and beyond.
Before the rise of neural networks, n-gram models were the dominant approach to language modeling. They powered predictive text and autocomplete features and served as core components of speech recognition and machine translation systems. Even today, n-gram models are sometimes used as baselines or combined with neural models in hybrid systems.
N-grams serve as features in many text classification systems. In the bag of words model, each unique word is a feature; extending this to bigrams or trigrams (sometimes called a "bag of n-grams") captures word order information that unigrams alone miss. For example, "not good" as a bigram carries very different sentiment from the individual unigrams "not" and "good." N-gram features have been widely used in spam detection, sentiment analysis, and topic categorization.
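A sketch of bag-of-n-grams feature extraction with scikit-learn's CountVectorizer, whose ngram_range parameter controls which n-gram orders are included (the toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie was not good", "this movie was good"]

# Unigrams plus bigrams: the feature "not good" now separates the two texts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```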
Search engines use n-grams in several ways. N-gram indexes help match queries to documents even when exact word matches fail. Character n-grams enable fuzzy matching, handling spelling variations and typos in search queries. N-gram overlap also helps measure document similarity for deduplication and clustering.
Spell checkers use character n-gram similarity to suggest corrections for misspelled words. When a user types an unrecognized word, the system computes the character n-gram overlap between the input and dictionary words, ranking candidates by similarity. Word-level n-gram context then helps select the most likely correction, such as choosing between "their" and "there" based on surrounding words.
Historically, n-gram language models were a critical component of automatic speech recognition (ASR) systems. The acoustic model generates candidate word sequences from audio, and the n-gram language model rescores these candidates based on how likely each word sequence is in the language. Although neural language models have largely replaced n-grams in modern ASR, n-gram models remain useful for first-pass decoding due to their computational efficiency.
N-grams play a central role in automatic evaluation metrics for natural language generation tasks.
BLEU (Bilingual Evaluation Understudy), introduced by Kishore Papineni and colleagues at IBM in 2002, evaluates machine translation quality by computing the modified precision of n-grams (unigrams through 4-grams) in the machine output against one or more reference translations. A brevity penalty is applied to discourage overly short translations. BLEU remains the most widely used automatic metric for machine translation.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), developed by Chin-Yew Lin in 2004, is primarily used for evaluating text summarization. ROUGE-N measures n-gram recall between the system summary and reference summaries. While BLEU focuses on precision (what fraction of generated n-grams appear in references), ROUGE emphasizes recall (what fraction of reference n-grams appear in the generated text).
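The core computation behind both metrics is clipped n-gram overlap. A simplified sketch for a single n-gram order and a single reference (full BLEU combines orders 1 through 4 and applies the brevity penalty; ROUGE implementations handle multiple references):

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_overlap(candidate, reference, n):
    """Each candidate n-gram counts at most as often as it appears in the reference."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())

def modified_precision(candidate, reference, n):  # BLEU-style precision
    return clipped_overlap(candidate, reference, n) / sum(ngram_counts(candidate, n).values())

def rouge_n(candidate, reference, n):  # ROUGE-style recall
    return clipped_overlap(candidate, reference, n) / sum(ngram_counts(reference, n).values())
```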
Analyzing n-gram frequencies across large text corpora reveals patterns about language use, cultural trends, and historical change.
N-gram frequency distributions typically follow Zipf's law: a small number of n-grams are extremely common, while the vast majority are rare. In English, the most frequent bigrams include "of the," "in the," and "to the." This heavy-tailed distribution is the root cause of the data sparsity problem in n-gram models.
The Google Books Ngram Viewer, launched in December 2010, charts the frequencies of n-grams (up to 5-grams) across a corpus of over 8 million digitized books spanning from 1500 to 2022. The tool was created by Google engineers Jon Orwant and Will Brockman in collaboration with Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden, who coined the term "culturomics" to describe the quantitative analysis of cultural trends through digitized text.
The Ngram Viewer has been used to study diverse phenomena, including the evolution of grammar (such as the shift from irregular to regular verb forms), the rise and fall of cultural figures in public discourse, the adoption of technologies, and patterns of censorship across different countries. The underlying dataset covers roughly 4% of all books ever published and is available for download in multiple languages.
Despite their simplicity and historical importance, n-gram models have several fundamental limitations.
Data sparsity. As n increases, the number of possible n-grams grows exponentially (V^n for a vocabulary of size V). Even very large corpora cannot provide reliable frequency estimates for all possible trigrams or 4-grams, let alone higher orders. This makes it difficult to increase the context window beyond 4 or 5 words.
No generalization across similar words. N-gram models treat each word as an atomic symbol with no notion of similarity. The model cannot transfer knowledge from "the dog is running" to "the cat is running" because "dog" and "cat" are entirely different symbols. Word embeddings and neural models address this by representing words as dense vectors in a continuous space where similar words have similar representations.
Fixed context window. N-gram models can only capture dependencies within a window of n - 1 preceding words. They cannot model long-range dependencies, such as subject-verb agreement across a relative clause ("The keys that were on the table are missing"), where the verb "are" depends on "keys" several words back.
Storage requirements. Storing all observed n-grams and their counts requires substantial memory, especially for large corpora and higher-order models. Google's Web 1T 5-gram dataset, released in 2006, contains approximately 1 trillion tokens worth of n-gram counts and requires roughly 24 GB of compressed storage.
No semantic understanding. N-gram models capture surface-level co-occurrence patterns but have no understanding of meaning. They cannot distinguish between "bank" as a financial institution and "bank" as a river bank without sufficient contextual n-grams to disambiguate.
The limitations of n-gram models motivated the development of neural language models, which have largely replaced n-grams as the state of the art.
Yoshua Bengio and colleagues introduced the neural probabilistic language model in 2003, which addressed two key weaknesses of n-gram models. First, it represented each word as a continuous word embedding vector, enabling the model to generalize across similar words. Second, it used a neural network to estimate conditional probabilities, allowing it to share parameters across different contexts.
Recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, extended neural language models by processing sequences of arbitrary length, removing the fixed context window limitation. Transformer-based models, introduced by Vaswani et al. in 2017, further advanced the field through self-attention mechanisms that can capture dependencies across very long sequences.
| Feature | N-gram models | Neural language models |
|---|---|---|
| Context window | Fixed (n - 1 words) | Variable (entire sequence for Transformers) |
| Word representation | Discrete symbols | Continuous vectors (embeddings) |
| Generalization | No similarity-based generalization | Shares knowledge across similar words |
| Training data efficiency | Requires enormous corpora for higher n | More data-efficient through parameter sharing |
| Computational cost (inference) | Very fast (table lookup) | Higher (matrix operations) |
| Interpretability | High (explicit counts) | Lower (learned parameters) |
| Storage | Large for high-order models | Model parameters (fixed size) |
Despite the dominance of neural models, n-grams retain practical value. They are fast to train and query, require no GPU hardware, and are effective for applications where computational resources are limited. Many production speech recognition systems still use n-gram models for first-pass decoding before rescoring with neural models.
Byte-pair encoding (BPE), originally a data compression algorithm, was adapted by Rico Sennrich and colleagues in 2016 for subword tokenization in neural machine translation. BPE can be understood as an evolution of the character n-gram concept.
BPE starts with individual characters and iteratively merges the most frequent adjacent pair of symbols into a new symbol. After many iterations, the resulting vocabulary contains a mix of individual characters, common character sequences (essentially learned character n-grams), and full words. This approach handles rare and unseen words gracefully by decomposing them into known subword units.
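A minimal sketch of the training loop on a toy corpus (the Sennrich et al. reference implementation differs in details such as end-of-word markers and tie-breaking):

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab:
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# Toy corpus: each word as a list of single-character symbols plus its frequency.
vocab = [(list("lower"), 2), (list("newest"), 6), (list("widest"), 3)]
for _ in range(4):
    best = count_pairs(vocab).most_common(1)[0][0]
    vocab = [(merge_pair(symbols, best), freq) for symbols, freq in vocab]
    print(best, [" ".join(symbols) for symbols, _ in vocab])
```

After a few merges the vocabulary contains learned multi-character units such as "est", which behave like frequency-selected character n-grams.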
Variants of BPE, including WordPiece (used in BERT) and SentencePiece (used in many multilingual models), have become the standard tokenization method for modern large language models. These methods preserve the core n-gram insight (that contiguous subsequences carry useful information) while learning the optimal set of subsequences from data.