# N-gram

> Source: https://aiwiki.ai/wiki/n-gram
> Updated: 2026-06-21
> Categories: Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

An **n-gram** is a contiguous sequence of *n* items extracted from a given sample of text or speech, where the items can be characters, syllables, words, or other linguistic units. N-grams are one of the foundational concepts in [natural language processing](/wiki/natural_language_processing) (NLP) and computational linguistics, serving as the basis for statistical [language model](/wiki/language_model)s, text classification systems, spell checkers, and many other tools that process human language. An n-gram [language model](/wiki/language_model) predicts the next word from only the preceding *n* minus 1 words, a simplification (the Markov assumption) that made statistical language modeling tractable and dominant from the 1980s until neural networks replaced it in the 2010s.[2]

The term "n-gram" combines the variable *n* with "gram," which derives from the Greek word *gramma*, meaning "letter" or "written character." Claude Shannon introduced character-level and word-level n-gram approximations of English in his 1948 paper "A Mathematical Theory of Communication" and his 1951 paper "Prediction and Entropy of Printed English," where he used them to demonstrate the statistical structure of the language and to estimate its entropy at roughly 1 bit per letter.[1]

## Explain like I'm 5 (ELI5)

Imagine you have a bunch of Lego blocks with words written on them. An n-gram is a way to connect these blocks to make different combinations. If you pick up just one block, that is a unigram. If you snap two blocks together, that is a bigram. And if you connect three blocks in a row, that is a trigram.

These word combos help computers figure out which words usually go together. For example, if a computer sees "peanut butter and," it can guess the next word is probably "jelly" because it has seen those words together many times before. That is basically how n-grams work: they help computers learn patterns in language by looking at small groups of words that appear next to each other.

## What are the types of n-grams?

N-grams are classified by their length, which is determined by the value of *n*.

| Type | *n* | Description | Example (from "the cat sat on the mat") |
|------|-----|-------------|------------------------------------------|
| Unigram | 1 | Single item | "the", "cat", "sat", "on", "the", "mat" |
| [Bigram](/wiki/bigram) | 2 | Pair of consecutive items | "the cat", "cat sat", "sat on", "on the", "the mat" |
| [Trigram](/wiki/trigram) | 3 | Triple of consecutive items | "the cat sat", "cat sat on", "sat on the", "on the mat" |
| 4-gram | 4 | Four consecutive items | "the cat sat on", "cat sat on the", "sat on the mat" |
| 5-gram | 5 | Five consecutive items | "the cat sat on the", "cat sat on the mat" |

In practice, bigrams and trigrams are the most commonly used in NLP applications. Higher-order n-grams (4-grams and above) capture more context but suffer from data sparsity, since the number of possible n-grams grows exponentially with *n*.

## How do character n-grams differ from word n-grams?

N-grams can operate at different levels of granularity. The two most common types are word n-grams and character n-grams, each with distinct strengths.

**Word n-grams** treat each word as an atomic unit. For the sentence "I love cats," the word bigrams are "I love" and "love cats." Word n-grams capture syntactic patterns and phrase-level meaning, making them useful for language modeling and text classification.

**Character n-grams** treat each character (including spaces and punctuation) as a unit. For the word "cats," the character trigrams are "cat", "ats." Character n-grams are particularly valuable for:

- **Language identification.** Character frequency patterns differ across languages, so character n-grams can reliably distinguish between languages even with very short text samples.[11]
- **Spell checking and correction.** Because character n-grams are tolerant of misspellings and typographical errors, they form the basis of many spelling correction algorithms.
- **Handling morphologically rich languages.** In languages with complex inflectional systems (such as Turkish, Finnish, or Hungarian), character n-grams capture subword patterns without requiring explicit morphological analysis.
- **Authorship attribution.** Character-level patterns often reflect an author's writing habits, including punctuation and spacing preferences.

| Feature | Word n-grams | Character n-grams |
|---------|-------------|--------------------|
| Unit of analysis | Whole words | Individual characters |
| Vocabulary size | Large (all unique words) | Small (alphabet size) |
| Robustness to typos | Low | High |
| Captures syntax | Yes | Limited |
| Language identification | Moderate | Excellent |
| Subword morphology | No | Yes |

## How does an n-gram language model work?

An n-gram language model estimates the probability of a sequence of words by decomposing it into a product of conditional probabilities. This relies on two key ideas: the chain rule of probability and the Markov assumption.[2]

### The chain rule

The probability of a sentence W = w1, w2, ..., wm can be expressed using the chain rule of probability:

P(w1, w2, ..., wm) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * ... * P(wm | w1, ..., wm-1)

Computing the full conditional probability P(wm | w1, ..., wm-1) for every word is impractical because it requires observing every possible word history, which becomes astronomically large for long sentences.

### The Markov assumption

To make the problem tractable, n-gram models apply the Markov assumption: the probability of a word depends only on the preceding *n* - 1 words rather than the entire history.[2] For a bigram model (n = 2):

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)

For a trigram model (n = 3):

P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-2, wi-1)

This corresponds directly to a Markov chain, where a unigram model is a zeroth-order Markov model, a bigram model is first-order, and a trigram model is second-order.

### Maximum likelihood estimation

The simplest way to estimate n-gram probabilities is maximum likelihood estimation (MLE) from a training corpus.[2] The bigram probability is:

P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)

For example, if "the cat" appears 50 times in a corpus and "the" appears 10,000 times, then P(cat | the) = 50 / 10,000 = 0.005.

The trigram probability generalizes this:

P(wi | wi-2, wi-1) = Count(wi-2, wi-1, wi) / Count(wi-2, wi-1)

### Perplexity

[Perplexity](/wiki/perplexity) is the standard metric for evaluating language models. It measures how well a probability model predicts a test set. For a test set W of *N* words, the perplexity (PP) of a language model is:

PP(W) = P(w1, w2, ..., wN) ^ (-1/N)

A lower perplexity indicates a better model, meaning the model assigns higher probability to the test data. Intuitively, perplexity can be interpreted as the weighted average number of choices the model faces when predicting the next word.[2] A perplexity of 100 means the model is, on average, as uncertain as if it had to choose uniformly among 100 words at each step.

## What is smoothing in n-gram models?

A major challenge for n-gram models is **data sparsity**: many valid n-grams never appear in the training corpus and receive a probability of zero under MLE. Since a single zero probability in a product drives the entire sentence probability to zero, smoothing techniques are essential. Smoothing redistributes probability mass from observed n-grams to unobserved ones.[10]

| Technique | Key idea | Strengths | Weaknesses |
|-----------|----------|-----------|------------|
| Laplace (add-one) | Add 1 to every n-gram count | Simple to implement | Distorts probabilities heavily for large vocabularies |
| Add-k | Add a fractional count k < 1 | Less aggressive than Laplace | Choosing optimal k is difficult |
| Good-Turing | Re-estimate counts using frequency of frequencies | Principled statistical foundation | Complex implementation; unreliable for high counts |
| Kneser-Ney | Absolute discounting plus continuation probability | State-of-the-art for n-gram models | More complex to implement |
| Witten-Bell | Uses count of novel events to estimate unseen probability | Good for sparse data | Less effective than Kneser-Ney in most benchmarks |

### Laplace (add-one) smoothing

Laplace smoothing adds 1 to every n-gram count before normalization:

P_Laplace(wi | wi-1) = (Count(wi-1, wi) + 1) / (Count(wi-1) + V)

where V is the vocabulary size. Although simple, Laplace smoothing assigns too much probability mass to unseen events when the vocabulary is large, making it impractical for word-level language models. It remains useful for character-level models and text classification tasks where vocabulary sizes are small.

### Add-k smoothing

Add-k smoothing generalizes Laplace smoothing by adding a fractional count k (typically 0.01 to 0.5) instead of 1. This reduces the amount of probability mass shifted to unseen events but still requires tuning k on held-out data.

### Good-Turing discounting

Good-Turing discounting, proposed by Alan Turing and I. J. Good during World War II for cryptanalysis, adjusts counts based on the frequency of frequencies.[6] The central idea is that the adjusted count for n-grams occurring *c* times is:

c* = (c + 1) * N(c+1) / N(c)

where N(c) is the number of n-grams that occur exactly *c* times. This effectively redistributes probability mass from common n-grams to rare and unseen ones, guided by observed statistics rather than arbitrary constants.

### Kneser-Ney smoothing

Kneser-Ney smoothing, introduced by Reinhard Kneser and Hermann Ney in 1995, is widely considered the most effective smoothing method for n-gram language models.[5] In their large empirical comparison, Chen and Goodman concluded that "the modified Kneser-Ney smoothing method we introduce consistently outperforms all other methods," a finding that made it the default choice for n-gram toolkits.[10] It combines two innovations:

1. **Absolute discounting.** A fixed discount *d* (typically around 0.75) is subtracted from each observed count, and the freed probability mass is redistributed to lower-order models.
2. **Continuation probability.** Instead of using the raw frequency of a word for the lower-order distribution, Kneser-Ney uses the number of distinct contexts in which the word appears. This captures the intuition that a word like "Kong" may be very frequent overall but appears almost exclusively after "Hong," while a word like "glasses" appears in many different contexts and should receive higher backoff probability.

Modified Kneser-Ney, which uses three different discount values based on whether a count is 1, 2, or 3 or more, further improves performance and is the default in most modern n-gram toolkits such as SRILM and KenLM.[10]

## Backoff and interpolation

Backoff and interpolation are strategies for combining n-gram models of different orders to handle unseen n-grams.

**Backoff** (Katz backoff) uses the highest-order n-gram model that has sufficient data. If a trigram has been observed, its probability is used directly (with discounting). If the trigram has not been observed, the model "backs off" to the bigram estimate, and if that is also missing, it falls back to the unigram.

**Interpolation** (Jelinek-Mercer interpolation) always mixes the probabilities from all n-gram orders using learned weights:

P_interp(wi | wi-2, wi-1) = lambda3 * P(wi | wi-2, wi-1) + lambda2 * P(wi | wi-1) + lambda1 * P(wi)

where lambda1 + lambda2 + lambda3 = 1. The weights are typically optimized on a held-out development set using the expectation-maximization algorithm. Interpolation tends to perform better than simple backoff because it always incorporates information from all available n-gram orders.[2]

## What are n-grams used for?

N-grams have been applied across a wide range of tasks in NLP and beyond.

### Language modeling

Before the rise of neural networks, n-gram models were the dominant approach to language modeling. They powered predictive text systems, autocomplete features, and were core components of speech recognition and [machine translation](/wiki/machine_translation) systems. Even today, n-gram models are sometimes used as baselines or combined with neural models in hybrid systems.

### Text classification and feature extraction

N-grams serve as features in many text classification systems. In the [bag of words](/wiki/bag_of_words) model, each unique word is a feature; extending this to bigrams or trigrams (sometimes called a "bag of n-grams") captures word order information that unigrams alone miss. For example, "not good" as a bigram carries very different sentiment from the individual unigrams "not" and "good." N-gram features have been widely used in spam detection, sentiment analysis, and topic categorization.[11]

### Search engines and information retrieval

Search engines use n-grams in several ways. N-gram indexes help match queries to documents even when exact word matches fail. Character n-grams enable fuzzy matching, handling spelling variations and typos in search queries. N-gram overlap also helps measure document similarity for deduplication and clustering.

### Spell checking and autocorrect

Spell checkers use character n-gram similarity to suggest corrections for misspelled words. When a user types an unrecognized word, the system computes the character n-gram overlap between the input and dictionary words, ranking candidates by similarity. Word-level n-gram context then helps select the most likely correction, such as choosing between "their" and "there" based on surrounding words.

### Speech recognition

Historically, n-gram language models were a critical component of automatic speech recognition (ASR) systems. The acoustic model generates candidate word sequences from audio, and the n-gram language model rescores these candidates based on how likely each word sequence is in the language. Although neural language models have largely replaced n-grams in modern ASR, n-gram models remain useful for first-pass decoding due to their computational efficiency.

### Machine translation evaluation (BLEU and ROUGE)

N-grams play a central role in automatic evaluation metrics for [natural language generation](/wiki/natural_language_generation) tasks.

**BLEU** (Bilingual Evaluation Understudy), introduced by Kishore Papineni and colleagues at IBM in 2002, evaluates machine translation quality by computing the modified precision of n-grams (unigrams through 4-grams) in the machine output against one or more reference translations. A brevity penalty is applied to discourage overly short translations.[3] In the original study, BLEU achieved a correlation coefficient of 0.99 with monolingual human judgments of translation quality, and it remains the most widely used automatic metric for machine translation.[3]

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation), developed by Chin-Yew Lin in 2004, is primarily used for evaluating text summarization. ROUGE-N measures n-gram recall between the system summary and reference summaries.[4] While BLEU focuses on precision (what fraction of generated n-grams appear in references), ROUGE emphasizes recall (what fraction of reference n-grams appear in the generated text).

## N-gram frequency analysis

Analyzing n-gram frequencies across large text corpora reveals patterns about language use, cultural trends, and historical change.

### Zipf's law

N-gram frequency distributions typically follow Zipf's law: a small number of n-grams are extremely common, while the vast majority are rare. In English, the most frequent bigrams include "of the," "in the," and "to the." This heavy-tailed distribution is the root cause of the data sparsity problem in n-gram models.

### What is the Google Books Ngram Viewer?

The Google Books Ngram Viewer, released on December 16, 2010, charts the frequencies of n-grams (up to 5-grams) across a corpus of digitized books, with the current version spanning sources published from 1500 to 2022. The tool was created by Google software engineers Will Brockman and Jon Orwant in collaboration with Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden, who coined the term "culturomics" to describe the quantitative analysis of cultural trends through digitized text.[8] The accompanying 2011 study in *Science* analyzed a corpus of about 500 billion words from more than 5 million books, which the authors estimated at roughly 4% of all books ever printed.[8]

The authors framed the project's ambition directly, writing that "culturomics is the application of high-throughput data collection and analysis to the study of human culture."[8] The Ngram Viewer has been used to study diverse phenomena, including the evolution of grammar (such as the shift from irregular to regular verb forms), the rise and fall of cultural figures in public discourse, the adoption of technologies, and patterns of censorship across different countries.[8]

## What are the limitations of n-gram models?

Despite their simplicity and historical importance, n-gram models have several fundamental limitations.

**Data sparsity.** As *n* increases, the number of possible n-grams grows exponentially (V^n for a vocabulary of size V). Even very large corpora cannot provide reliable frequency estimates for all possible trigrams or 4-grams, let alone higher orders. This makes it difficult to increase the context window beyond 4 or 5 words.

**No generalization across similar words.** N-gram models treat each word as an atomic symbol with no notion of similarity. The model cannot transfer knowledge from "the dog is running" to "the cat is running" because "dog" and "cat" are entirely different symbols. [Word embedding](/wiki/word_embedding)s and neural models address this by representing words as dense vectors in a continuous space where similar words have similar representations.

**Fixed context window.** N-gram models can only capture dependencies within a window of *n* - 1 preceding words. They cannot model long-range dependencies, such as subject-verb agreement across a relative clause ("The keys that were on the table are missing"), where the verb "are" depends on "keys" several words back.

**Storage requirements.** Storing all observed n-grams and their counts requires substantial memory, especially for large corpora and higher-order models. Google's Web 1T 5-gram dataset, contributed by Google and released through the Linguistic Data Consortium in 2006, contains n-gram counts generated from approximately 1 trillion [token](/wiki/token)s of web text and ships on six DVDs of compressed data.[12]

**No semantic understanding.** N-gram models capture surface-level co-occurrence patterns but have no understanding of meaning. They cannot distinguish between "bank" as a financial institution and "bank" as a river bank without sufficient contextual n-grams to disambiguate.

## How do n-grams compare to neural language models?

The limitations of n-gram models motivated the development of neural language models, which have largely replaced n-grams as the state of the art.

Yoshua Bengio and colleagues introduced the neural probabilistic language model in 2003, which addressed two key weaknesses of n-gram models. First, it represented each word as a continuous [word embedding](/wiki/word_embedding) vector, enabling the model to generalize across similar words. Second, it used a neural network to estimate conditional probabilities, allowing it to share parameters across different contexts.[7] Bengio and colleagues argued that the approach was "fighting the curse of dimensionality with its own weapons," using distributed word representations to let the model generalize to word sequences never seen during training.[7]

Recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, extended neural language models by processing sequences of arbitrary length, removing the fixed context window limitation. Transformer-based models, introduced by Vaswani et al. in 2017, further advanced the field through self-attention mechanisms that can capture dependencies across very long sequences.

| Feature | N-gram models | Neural language models |
|---------|--------------|------------------------|
| Context window | Fixed (*n* - 1 words) | Variable (entire sequence for Transformers) |
| Word representation | Discrete symbols | Continuous vectors ([embeddings](/wiki/word_embedding)) |
| Generalization | No similarity-based generalization | Shares knowledge across similar words |
| Training data efficiency | Requires enormous corpora for higher *n* | More data-efficient through parameter sharing |
| Computational cost (inference) | Very fast (table lookup) | Higher (matrix operations) |
| Interpretability | High (explicit counts) | Lower (learned parameters) |
| Storage | Large for high-order models | Model parameters (fixed size) |

Despite the dominance of neural models, n-grams retain practical value. They are fast to train and query, require no GPU hardware, and are effective for applications where computational resources are limited. Many production speech recognition systems still use n-gram models for first-pass decoding before rescoring with neural models.

## Byte-pair encoding and subword tokenization

Byte-pair encoding (BPE), originally a data compression algorithm, was adapted by Rico Sennrich and colleagues in 2016 for subword [token](/wiki/token)ization in neural machine translation.[9] BPE can be understood as an evolution of the character n-gram concept.

BPE starts with individual characters and iteratively merges the most frequent adjacent pair of symbols into a new symbol. After many iterations, the resulting vocabulary contains a mix of individual characters, common character sequences (essentially learned character n-grams), and full words. This approach handles rare and unseen words gracefully by decomposing them into known subword units.

Variants of BPE, including WordPiece (used in BERT) and SentencePiece (used in many multilingual models), have become the standard tokenization method for modern large language models. These methods preserve the core n-gram insight (that contiguous subsequences carry useful information) while learning the optimal set of subsequences from data.

## See also

- [Bag of words](/wiki/bag_of_words)
- [Language model](/wiki/language_model)
- [Word embedding](/wiki/word_embedding)
- [Natural language processing](/wiki/natural_language_processing)
- [Perplexity](/wiki/perplexity)
- [TF-IDF](/wiki/tf_idf)
- [Tokenization](/wiki/tokenization)

## References

1. Shannon, C. E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27(3), 379-423. See also Shannon, C. E. (1951). "Prediction and Entropy of Printed English." *Bell System Technical Journal*, 30(1), 50-64.
2. Jurafsky, D. and Martin, J. H. (2024). *Speech and Language Processing* (3rd ed.), Chapter 3: N-gram Language Models. Stanford University.
3. Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)*, 311-318.
4. Lin, C. Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." *Text Summarization Branches Out: Proceedings of the ACL-04 Workshop*, 74-81.
5. Kneser, R. and Ney, H. (1995). "Improved Backing-off for M-gram Language Modeling." *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*, 181-184.
6. Good, I. J. (1953). "The Population Frequencies of Species and the Estimation of Population Parameters." *Biometrika*, 40(3-4), 237-264.
7. Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research*, 3, 1137-1155.
8. Michel, J. B., Shen, Y. K., Aiden, A. P., et al. (2011). "Quantitative Analysis of Culture Using Millions of Digitized Books." *Science*, 331(6014), 176-182.
9. Sennrich, R., Haddow, B., and Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)*, 1715-1725.
10. Chen, S. F. and Goodman, J. (1999). "An Empirical Study of Smoothing Techniques for Language Modeling." *Computer Speech and Language*, 13(4), 359-394.
11. Cavnar, W. B. and Trenkle, J. M. (1994). "N-gram-based Text Categorization." *Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR-94)*, 161-175.
12. Brants, T. and Franz, A. (2006). "Web 1T 5-gram Version 1." Linguistic Data Consortium, Philadelphia (LDC2006T13).