See also: Machine learning terms, Large language model
A language model is a probabilistic model that assigns probabilities to sequences of words or tokens in a natural language. Given a sequence of words w1, w2, ..., wm, a language model estimates the joint probability P(w1, w2, ..., wm) or, equivalently via the chain rule, the conditional probability of each word given the preceding context: P(wm | w1, w2, ..., wm-1). Language models form a cornerstone of natural language processing (NLP) and are used in tasks ranging from speech recognition and machine translation to text generation and information retrieval.
The central idea behind language modeling is that not all word sequences are equally likely. For example, the sequence "the cat sat on the mat" is far more probable than "mat the on sat cat the" in English. By capturing these statistical regularities, language models enable computers to predict, generate, and evaluate natural language text.
The field of language modeling has evolved dramatically over seven decades, progressing from simple statistical counting methods to neural network-based architectures with hundreds of billions of parameters. Modern language models, particularly those based on the transformer architecture, have demonstrated remarkable capabilities across a wide range of language tasks and have become central to contemporary artificial intelligence research.
The history of language modeling spans from information theory in the 1940s to the massive pretrained models of the 2020s.
The foundations of language modeling trace back to Claude Shannon's 1948 paper "A Mathematical Theory of Communication." While Shannon was primarily concerned with the engineering problem of transmitting messages efficiently through noisy communication channels, his work laid the mathematical groundwork for statistical language modeling. Shannon proposed that language could be modeled as a stochastic process and introduced the concept of entropy as a measure of the information content (or unpredictability) of a language source. He conducted experiments using n-gram approximations of English text, demonstrating that statistical patterns in language could guide prediction. Shannon estimated the entropy of English to be between 0.6 and 1.3 bits per character, a measurement that remains a reference point in the field.
In the 1950s, Noam Chomsky developed the theory of formal grammars, proposing that language could be described by a set of generative rules. While Chomsky's rule-based approach dominated theoretical linguistics for decades, practical applications increasingly turned to statistical methods. By the 1980s, researchers at IBM and elsewhere demonstrated that statistical approaches to language modeling were more effective for tasks like speech recognition. The IBM speech recognition group, led by Frederick Jelinek, made pioneering contributions to n-gram modeling and smoothing techniques during this period. Jelinek is often quoted as saying, "Every time I fire a linguist, the performance of the speech recognizer goes up," reflecting the pragmatic success of statistical methods.
The shift from purely statistical models to neural approaches began in earnest with Yoshua Bengio and colleagues' 2003 paper "A Neural Probabilistic Language Model." This work introduced the idea of using a feedforward neural network to learn distributed representations of words (later called word embeddings) alongside the language model itself. The key innovation was representing each word as a continuous real-valued vector rather than a discrete symbol, which allowed the model to generalize across similar words and alleviate the curse of dimensionality that plagued n-gram models.
Tomas Mikolov and colleagues applied recurrent neural networks (RNNs) to language modeling in 2010, achieving substantial improvements over n-gram baselines. Unlike feedforward models, RNNs could process sequences of arbitrary length by maintaining a hidden state that accumulated information from all previous time steps. However, vanilla RNNs suffered from the vanishing gradient problem, which made it difficult to capture long-range dependencies. Long short-term memory (LSTM) networks, originally proposed by Hochreiter and Schmidhuber in 1997, addressed this limitation with gating mechanisms that controlled information flow. LSTM-based language models, refined by researchers such as Alex Graves, became the dominant approach from roughly 2013 to 2017.
The introduction of the transformer architecture by Vaswani et al. in the 2017 paper "Attention Is All You Need" revolutionized language modeling. Transformers replaced recurrence with self-attention mechanisms, allowing the model to attend to all positions in the input sequence simultaneously. This parallel processing capability made transformers significantly more efficient to train on modern hardware. The transformer gave rise to three major families of language models: autoregressive models like GPT (2018), masked language models like BERT (2018), and encoder-decoder models like T5 (2019). Since 2019, the field has been dominated by increasingly large transformer-based models, culminating in systems with hundreds of billions or even trillions of parameters.
| Year | Milestone | Key contribution |
|---|---|---|
| 1948 | Shannon's "A Mathematical Theory of Communication" | Founded information theory; introduced n-gram approximations of language |
| 1980s | IBM statistical language models | N-gram models with smoothing for speech recognition |
| 2003 | Bengio et al. neural probabilistic LM | First neural language model with learned word embeddings |
| 2010 | Mikolov et al. RNN language model | Applied recurrent neural networks to language modeling |
| 2013 | Word2Vec (Mikolov et al.) | Efficient word embedding training at scale |
| 2017 | Vaswani et al. "Attention Is All You Need" | Introduced the transformer architecture |
| 2018 | GPT (Radford et al.) | Demonstrated effectiveness of autoregressive transformer pretraining |
| 2018 | BERT (Devlin et al.) | Introduced masked language model pretraining with bidirectional context |
| 2019 | T5 (Raffel et al.) | Unified text-to-text framework for NLP tasks |
| 2020 | GPT-3 (Brown et al.) | 175 billion parameter model demonstrating few-shot learning |
| 2022 | Chinchilla (Hoffmann et al.) | Established compute-optimal scaling laws |
N-gram models are the simplest and historically most important class of language models. An n-gram model estimates the probability of a word based on the preceding (n-1) words, applying the Markov assumption to approximate the full joint probability of a sequence.
The probability of a sequence of words can be decomposed using the chain rule of probability:
P(w1, w2, ..., wm) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ... × P(wm | w1, ..., wm-1)
An n-gram model simplifies each conditional by assuming that the probability of a word depends only on the previous (n-1) words:
P(wm | w1, ..., wm-1) ≈ P(wm | wm-n+1, ..., wm-1)
These conditional probabilities are typically estimated using maximum likelihood estimation (MLE) from counts in a training corpus:
P(wm | wm-n+1, ..., wm-1) = count(wm-n+1, ..., wm) / count(wm-n+1, ..., wm-1)
Common n-gram orders include unigrams (n=1, no context), bigrams (n=2, one word of context), and trigrams (n=3, two words of context).
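As an illustration, the MLE estimates above can be computed from raw corpus counts in a few lines. This is a toy sketch of a bigram model; the `<s>` and `</s>` boundary markers are a common convention for sentence starts and ends, not part of the formulas above:

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Estimate bigram probabilities P(w2 | w1) by maximum likelihood:
    P(w2 | w1) = count(w1, w2) / count(w1).
    `corpus` is a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])          # every token that can start a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
probs = train_bigram_mle(corpus)
# P(cat | the) = 1/2, because "the" is followed by "cat" in one of its two occurrences
```

Note that any bigram absent from the training corpus, such as ("the", "mat"), simply has no entry here, which is exactly the zero-probability problem that smoothing addresses.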
Raw MLE estimates assign zero probability to any n-gram not seen in the training data, which is problematic for real-world use. Several smoothing techniques address this issue:
| Technique | Description |
|---|---|
| Add-one (Laplace) smoothing | Adds 1 to every n-gram count; simple but shifts too much probability mass to unseen events |
| Add-k smoothing | Adds a fractional count k < 1 to every n-gram |
| Good-Turing discounting | Redistributes probability mass from seen to unseen events based on frequency of frequencies |
| Kneser-Ney smoothing | Uses absolute discounting combined with a lower-order distribution based on continuation probability; widely considered the best n-gram smoothing method |
| Backoff and interpolation | Combines estimates from different n-gram orders; falls back to shorter contexts when longer ones have insufficient data |
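A minimal sketch of add-one (Laplace) smoothing for bigrams, following the formula P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. The toy counts and vocabulary size below are made up for illustration:

```python
from collections import Counter

def laplace_bigram_prob(bigram, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram probability. Unseen bigrams receive a
    small nonzero probability instead of zero."""
    w1, _ = bigram
    return (bigram_counts[bigram] + 1) / (unigram_counts[w1] + vocab_size)

unigram_counts = Counter({"the": 2, "cat": 1})
bigram_counts = Counter({("the", "cat"): 1})
V = 4  # illustrative vocabulary size for the toy corpus

seen = laplace_bigram_prob(("the", "cat"), bigram_counts, unigram_counts, V)
unseen = laplace_bigram_prob(("the", "dog"), bigram_counts, unigram_counts, V)
# seen = (1 + 1) / (2 + 4) = 1/3; unseen = (0 + 1) / (2 + 4) = 1/6, no longer zero
```

The same structure, with a fractional constant in place of 1, gives add-k smoothing.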
Despite their simplicity and efficiency, n-gram models have significant limitations. They cannot capture dependencies beyond the fixed context window of (n-1) words. Increasing n leads to exponential growth in the number of possible n-grams, causing severe data sparsity. Even with sophisticated smoothing, n-gram models struggle to generalize to word sequences that do not appear in the training corpus, because they treat each word as a discrete symbol with no notion of semantic similarity.
Neural language models use neural networks to estimate the probability distribution over word sequences. By representing words as continuous vectors rather than discrete symbols, these models can generalize across semantically similar words and capture more complex patterns in language.
Bengio et al.'s 2003 model used a feedforward neural network that took as input the vector representations (embeddings) of the n previous words, concatenated them, passed the result through a hidden layer with a tanh activation function, and produced a probability distribution over the vocabulary using a softmax output layer. The model simultaneously learned the word embeddings and the language model parameters during training. This approach demonstrated that neural models could outperform n-gram models, particularly on tasks involving rare or unseen word combinations, because similar words received similar embeddings and thus shared statistical strength.
Recurrent neural networks (RNNs) extended neural language modeling by removing the fixed-context-window limitation. At each time step, an RNN processes the current input word and the hidden state from the previous time step, producing a new hidden state and a prediction for the next word. This architecture allows the model to, in principle, condition on the entire preceding sequence.
Mikolov et al. (2010) demonstrated that RNN language models could significantly outperform both n-gram and feedforward neural models. However, vanilla RNNs struggled with long-range dependencies due to vanishing gradients. LSTM networks addressed this with memory cells and three gating mechanisms (input, forget, and output gates) that regulated the flow of information. Gated recurrent units (GRUs), proposed by Cho et al. in 2014, offered a simplified alternative with two gates (reset and update) and comparable performance.
Transformer-based language models have become the dominant paradigm since 2018. The transformer architecture relies entirely on self-attention mechanisms to compute representations of the input sequence, dispensing with recurrence and convolutions. Self-attention allows each position in the sequence to attend to every other position, enabling efficient capture of both local and long-range dependencies.
Transformer language models fall into three main architectural categories:
| Architecture | Context direction | Representative models | Typical use cases |
|---|---|---|---|
| Decoder-only (autoregressive) | Left-to-right (causal) | GPT series, LLaMA, PaLM | Text generation, code generation, dialogue |
| Encoder-only (autoencoding) | Bidirectional | BERT, RoBERTa, ALBERT | Classification, named entity recognition, question answering |
| Encoder-decoder | Bidirectional encoder, autoregressive decoder | T5, BART, mBART | Translation, summarization, question answering |
The training objective defines what task the language model learns during pre-training. Different objectives lead to models with different strengths.
Autoregressive or causal language models are trained to predict the next token given all preceding tokens. The training objective minimizes the negative log-likelihood:
L = - ∑t log P(wt | w1, ..., wt-1)
During training, a causal attention mask ensures that each position can only attend to positions to its left. This unidirectional approach is natural for text generation, since text is produced left to right. GPT (Radford et al., 2018) and its successors use this objective.
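The causal mask itself is simple to construct. A framework-agnostic sketch using plain Python lists rather than any particular library's tensors:

```python
def causal_mask(seq_len):
    """Boolean mask where mask[i][j] is True iff position i may attend
    to position j, i.e. j <= i (attention flows only leftward)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
# Row 0 attends only to itself; row 3 attends to all four positions.
```

In practice the mask is applied by setting disallowed attention scores to negative infinity before the softmax, so that they contribute zero weight.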
Masked language modeling (MLM), introduced by BERT (Devlin et al., 2018), randomly masks a fraction of input tokens (typically 15%) and trains the model to predict the original tokens from the surrounding bidirectional context. Because the model sees both left and right context for each masked position, it builds richer representations for understanding tasks. However, MLM models are not naturally suited for text generation, since they do not learn a left-to-right factorization of the sequence probability.
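The masking step can be sketched as follows. The 80/10/10 replacement split (mask, random token, unchanged) is the scheme reported in the BERT paper; the function name and interface here are illustrative, not any library's API:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: select ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% become a
    random vocabulary token, and 10% are left unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token as-is
    return corrupted, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = mask_tokens(tokens, vocab=["the", "cat", "sat", "on", "mat"])
```

The model is trained to recover the original tokens at the target positions only; loss is not computed on unmasked positions.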
Models like T5 and BART use more general denoising objectives. T5 uses a "span corruption" objective that masks contiguous spans of tokens and trains the model to reconstruct them. BART combines several noise functions, including token masking, token deletion, sentence permutation, and document rotation. These objectives allow encoder-decoder models to learn both comprehension and generation capabilities.
The distinction between causal (autoregressive) and masked (autoencoding) language models reflects a fundamental tradeoff in language modeling.
Causal language models process text in one direction (left to right) and predict each token based solely on the preceding tokens. This makes them well suited for generative tasks such as text completion, dialogue, and creative writing. Because they learn a proper probability distribution over sequences, they can be used directly for sampling and text generation.
Masked language models access bidirectional context, allowing each token prediction to be informed by both preceding and following tokens. This bidirectional understanding makes them stronger for discriminative tasks like classification, named entity recognition, and extractive question answering. However, they do not define a straightforward generative model over sequences.
In practice, autoregressive models have dominated the landscape of large language models since 2020, as scaling has proven especially effective for causal language modeling, and the ability to generate text is central to many applications.
Before a language model can process text, the text must be converted into a sequence of discrete units called tokens. The choice of tokenization strategy significantly affects model performance, vocabulary size, and the ability to handle multiple languages or rare words.
Early language models operated at the word level, assigning each word in the vocabulary its own index. This approach creates very large vocabularies and cannot handle out-of-vocabulary (OOV) words. Character-level models avoid OOV issues but require the model to learn longer-range dependencies.
Modern language models use subword tokenization, which splits text into units between words and characters. The three most common subword algorithms are:
| Algorithm | Method | Used by |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent pair of adjacent tokens | GPT series, LLaMA, Gemma |
| WordPiece | Merges pairs that maximize the likelihood of the training corpus | BERT, DistilBERT |
| SentencePiece (Unigram) | Starts with a large vocabulary and iteratively removes tokens that least reduce the training likelihood | T5, ALBERT, mBART |
Subword tokenization provides a good balance: common words are kept as single tokens, while rare words are split into meaningful subword units. For instance, the word "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happi", "ness"], depending on the algorithm and training data. Typical vocabulary sizes for modern language models range from 30,000 to 128,000 tokens.
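The iterative merge loop at the heart of BPE can be sketched on a toy corpus. The word frequencies below are made up for illustration; real tokenizers train on far larger corpora and handle word boundaries and byte-level fallback more carefully:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. `words` maps a word (a tuple of symbols) to its
    corpus frequency; returns the learned merge rules in order."""
    words = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        words = {tuple(merge_pair(s, best, merged)): f for s, f in words.items()}
    return merges

def merge_pair(symbols, pair, merged):
    """Replace every occurrence of `pair` in `symbols` with `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
merges = bpe_merges(corpus, num_merges=2)
# First merge: ("l", "o"), which occurs 7 times; then ("lo", "w")
```

Encoding a new word applies the learned merges in the same order, so common substrings collapse into single tokens while rare words remain split into smaller pieces.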
Evaluating language models requires both intrinsic metrics that measure how well the model fits the data and extrinsic metrics that assess performance on downstream tasks.
Perplexity (PPL) is the standard intrinsic evaluation metric for language models. It measures how "surprised" the model is when predicting the next token, averaged across a test set. Formally, perplexity is defined as:
PPL = exp(- (1/N) ∑t log P(wt | w1, ..., wt-1)), where N is the number of tokens in the test set
A lower perplexity indicates a better model. Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options at each step. State-of-the-art language models achieve perplexities below 20 on standard benchmarks like Penn Treebank, compared to over 100 for simple n-gram models.
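Given per-token log probabilities from any model, perplexity is a one-liner; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum of log P(w_t | context))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that is always uniformly uncertain over 8 choices assigns each
# token log(1/8); its perplexity is 8, matching the intuition above.
uniform = [math.log(1 / 8)] * 10
```

Using natural logs here is a convention; any base works as long as the exponential matches, since the bases cancel.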
Bits per character (BPC) and bits per byte (BPB) normalize the cross-entropy loss by the number of characters or bytes rather than tokens. This makes them particularly useful for comparing models that use different tokenizers, since perplexity is sensitive to the tokenization scheme. A model using a larger vocabulary makes fewer but harder predictions per text segment, which can give a misleadingly lower per-token perplexity.
Intrinsic metrics like perplexity do not always correlate with practical usefulness. Language models are also evaluated on downstream tasks through benchmarks such as:
| Benchmark | What it measures |
|---|---|
| GLUE / SuperGLUE | General language understanding (classification, entailment, similarity) |
| SQuAD | Reading comprehension and extractive question answering |
| MMLU | Multitask accuracy across 57 academic subjects |
| HumanEval | Code generation correctness |
| HellaSwag | Commonsense reasoning and sentence completion |
| WinoGrande | Coreference resolution requiring world knowledge |
| TruthfulQA | Factual accuracy and resistance to generating false claims |
When using a language model to generate text, the model produces a probability distribution over the vocabulary at each step. The decoding strategy determines how the next token is selected from this distribution.
Greedy decoding selects the token with the highest probability at each step. It is fast and deterministic but tends to produce repetitive and generic text, because it always takes the locally optimal choice without considering the global quality of the sequence.
Beam search maintains a set of the K most probable partial sequences (where K is called the beam width) and expands each at every step, keeping only the top K candidates. It produces more coherent output than greedy decoding and is widely used in machine translation and summarization. However, beam search can still produce repetitive text in open-ended generation tasks.
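A toy version of beam search over a hand-written next-token table. The `step_fn` stand-in replaces a real model's predictions; its probabilities are invented for illustration:

```python
import math

def beam_search(step_fn, start, beam_width=2, max_len=5):
    """Keep the `beam_width` highest-scoring partial sequences at each
    step, scored by cumulative log-probability. `step_fn(seq)` returns
    (token, log_prob) continuations for a partial sequence."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":              # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

def step_fn(seq):
    """Illustrative next-token distribution keyed on the last token."""
    table = {
        "<s>": [("the", math.log(0.6)), ("a", math.log(0.4))],
        "the": [("cat", math.log(0.9)), ("</s>", math.log(0.1))],
        "a":   [("cat", math.log(0.5)), ("</s>", math.log(0.5))],
        "cat": [("</s>", math.log(1.0))],
    }
    return table[seq[-1]]

best_seq, best_score = beam_search(step_fn, "<s>", beam_width=2)
# best_seq is ["<s>", "the", "cat", "</s>"] with probability 0.6 * 0.9 = 0.54
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sequences; production systems also add length normalization, which this sketch omits.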
Top-k sampling restricts the candidate pool to the K most probable tokens, redistributes the probability mass among them, and samples from this truncated distribution. Introduced by Fan et al. (2018) and popularized by the GPT-2 paper (Radford et al., 2019), top-k sampling improves diversity while maintaining reasonable coherence.
Nucleus sampling, proposed by Holtzman et al. (2019), dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p (for example, p = 0.9). Unlike top-k, which uses a fixed number of candidates regardless of the distribution shape, top-p adapts to the model's confidence at each step. When the model is confident, fewer tokens are considered; when it is uncertain, more tokens are included.
Temperature is a parameter that adjusts the sharpness of the probability distribution before sampling. Given logits z, the probability of token i is computed as:
P(i) = exp(zi / T) / ∑j exp(zj / T)
A temperature T < 1 makes the distribution sharper (more deterministic), while T > 1 makes it flatter (more random). At T approaching 0, sampling becomes equivalent to greedy decoding. In practice, temperature is often combined with top-k or top-p sampling.
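The strategies above compose naturally. A minimal sketch combining temperature scaling with nucleus (top-p) truncation; the function and its interface are illustrative, not a library API:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, seed=None):
    """Sample a token from `logits` (a dict mapping token -> logit),
    applying temperature scaling then nucleus truncation.
    `temperature` must be > 0; top_p=1.0 disables truncation."""
    rng = random.Random(seed)
    # Temperature-scaled softmax (subtract the max logit for stability)
    scaled = {t: z / temperature for t, z in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(z - m) for t, z in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # Nucleus truncation: smallest set whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and sample
    norm = sum(p for _, p in kept)
    r, acc = rng.random() * norm, 0.0
    for tok, p in kept:
        acc += p
        if acc >= r:
            return tok
    return kept[-1][0]

logits = {"mat": 5.0, "floor": 2.0, "guitar": -3.0}
token = sample_next(logits, temperature=0.8, top_p=0.9, seed=0)
# "mat": its probability alone exceeds 0.9, so it fills the entire nucleus
```

Lowering `temperature` concentrates probability on the top tokens before truncation, which is why the two knobs are usually tuned together.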
| Strategy | Deterministic? | Diversity | Best suited for |
|---|---|---|---|
| Greedy | Yes | Low | Short, factual outputs |
| Beam search | Yes (for a fixed beam width) | Low to moderate | Translation, summarization |
| Top-k sampling | No | Moderate to high | Creative text generation |
| Top-p (nucleus) sampling | No | Moderate to high | Open-ended generation |
| Temperature scaling | Depends on T | Adjustable | Combined with other strategies |
Research has revealed that language model performance follows predictable power-law relationships with respect to model size, dataset size, and training compute.
Kaplan et al. (2020) at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that cross-entropy loss decreases as a smooth power law when any of three factors (parameters, data, or compute) increases, provided the other factors are not bottlenecked. Their analysis, spanning seven orders of magnitude in compute, suggested that model size was roughly three times more important than dataset size for reducing loss, leading to the practice of training very large models on relatively modest datasets.
Hoffmann et al. (2022) at DeepMind challenged the Kaplan findings with their paper "Training Compute-Optimal Large Language Models." Their analysis showed that for a given compute budget, model size and training data should be scaled equally. The resulting "Chinchilla rule" recommends approximately 20 training tokens per parameter for compute-optimal training. This finding implied that many existing large models, including the 175-billion-parameter GPT-3, were significantly undertrained relative to their size. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed the much larger 280-billion-parameter Gopher on most benchmarks.
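The Chinchilla rule lends itself to a back-of-the-envelope calculation. The sketch below combines the ~20 tokens-per-parameter heuristic with the commonly used approximation of roughly 6 training FLOPs per parameter per token; both figures are approximations, not exact laws:

```python
def chinchilla_optimal(params):
    """Rough compute-optimal training budget under the Chinchilla
    heuristic (~20 tokens per parameter), using the common ~6 FLOPs
    per parameter per token approximation for training compute."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)  # a 70-billion-parameter model
# ~1.4 trillion tokens, matching the Chinchilla training run described above
```

By the same arithmetic, a 175-billion-parameter model would call for about 3.5 trillion training tokens, far more than GPT-3's training set, which is the sense in which GPT-3 was undertrained for its size.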
More recent research has explored additional dimensions of scaling, including the quality and diversity of training data, the effects of repeated data, and inference-time compute scaling. The general trend from millions of parameters (early neural LMs) to hundreds of billions (GPT-3, PaLM) and beyond has been accompanied by the emergence of capabilities such as in-context learning, chain-of-thought reasoning, and instruction following that do not appear at smaller scales.
While standard language models learn an unconditional (or context-only) distribution over text, many practical applications require generating text conditioned on some input. Conditional language generation extends the basic language modeling framework to produce output text given a specific input signal.
Examples include machine translation (output conditioned on a source-language sentence), summarization (output conditioned on a document), dialogue (output conditioned on the conversation history), and code generation (output conditioned on a natural language specification).
Encoder-decoder architectures like T5 are explicitly designed for conditional generation. Decoder-only models can also perform conditional generation by prepending the conditioning input to the generation context (prompt-based conditioning), a technique that has proven highly effective in large-scale autoregressive models.
Language models underpin a wide range of NLP and AI applications:
| Application | Description | Example systems |
|---|---|---|
| Text generation | Producing coherent, contextually relevant text | GPT-4, Claude, Gemini |
| Machine translation | Converting text between natural languages | Google Translate, DeepL |
| Text summarization | Condensing long documents into shorter summaries | Pegasus, BART |
| Speech recognition | Converting spoken language to text; LMs rescore hypotheses | Whisper, Google ASR |
| Sentiment analysis | Classifying the emotional tone of text | Fine-tuned BERT models |
| Question answering | Providing answers to natural language questions | GPT-4, Perplexity AI |
| Code generation | Writing source code from natural language specifications | Codex, GitHub Copilot, Claude |
| Information retrieval | Ranking documents by relevance to a query | ColBERT, MonoT5 |
| Chatbots and dialogue | Sustaining multi-turn conversations | ChatGPT, Claude, Gemini |
Imagine you are playing a guessing game where someone reads you the beginning of a sentence, and you have to guess the next word. If you have read lots and lots of books, you get pretty good at guessing. A language model is like a computer playing this guessing game. It reads billions of sentences and learns which words usually follow other words. When someone says "the cat sat on the," the model knows that "mat" or "floor" are much better guesses than "elephant" or "guitar." The better a language model is at guessing, the better it can help with things like translating languages, answering questions, or writing stories. Early language models just counted how often words appeared together. Newer ones use special math (neural networks) to understand patterns in language much more deeply, which is why they can now write whole essays, have conversations, and even write computer code.