See also: Machine learning terms, Large language model
A language model is a probabilistic model that assigns probabilities to sequences of words or tokens in a natural language. Given a sequence of words w1, w2, ..., wm, a language model estimates the joint probability P(w1, w2, ..., wm) or, equivalently via the chain rule, the conditional probability of each word given the preceding context: P(wm | w1, w2, ..., wm-1). Language models form a cornerstone of natural language processing (NLP) and are used in tasks ranging from speech recognition and machine translation to text generation and information retrieval.
The central idea behind language modeling is that not all word sequences are equally likely. For example, the sequence "the cat sat on the mat" is far more probable than "mat the on sat cat the" in English. By capturing these statistical regularities, language models enable computers to predict, generate, and evaluate natural language text.
The field of language modeling has evolved dramatically over seven decades, progressing from simple statistical counting methods to neural network-based architectures with hundreds of billions of parameters. Modern language models, particularly those based on the transformer architecture, have demonstrated remarkable capabilities across a wide range of language tasks and have become central to contemporary artificial intelligence research.
The history of language modeling spans from information theory in the 1940s to the massive pretrained models of the 2020s.
The foundations of language modeling trace back to Claude Shannon's 1948 paper "A Mathematical Theory of Communication." While Shannon was primarily concerned with the engineering problem of transmitting messages efficiently through noisy communication channels, his work laid the mathematical groundwork for statistical language modeling. Shannon proposed that language could be modeled as a stochastic process and introduced the concept of entropy as a measure of the information content (or unpredictability) of a language source. He conducted experiments using n-gram approximations of English text, demonstrating that statistical patterns in language could guide prediction. Shannon estimated the entropy of English to be between 0.6 and 1.3 bits per character, a measurement that remains a reference point in the field.
In the 1950s, Noam Chomsky developed the theory of formal grammars, proposing that language could be described by a set of generative rules. While Chomsky's rule-based approach dominated theoretical linguistics for decades, practical applications increasingly turned to statistical methods. By the 1980s, researchers at IBM and elsewhere demonstrated that statistical approaches to language modeling were more effective for tasks like speech recognition. The IBM speech recognition group, led by Frederick Jelinek, made pioneering contributions to n-gram modeling and smoothing techniques during this period. Jelinek is often quoted as saying, "Every time I fire a linguist, the performance of the speech recognizer goes up," reflecting the pragmatic success of statistical methods.
The shift from purely statistical models to neural approaches began in earnest with Yoshua Bengio and colleagues' 2003 paper "A Neural Probabilistic Language Model." This work introduced the idea of using a feedforward neural network to learn distributed representations of words (later called word embeddings) alongside the language model itself. The key innovation was representing each word as a continuous real-valued vector rather than a discrete symbol, which allowed the model to generalize across similar words and alleviate the curse of dimensionality that plagued n-gram models.
Tomas Mikolov and colleagues applied recurrent neural networks (RNNs) to language modeling in 2010, achieving substantial improvements over n-gram baselines. Unlike feedforward models, RNNs could process sequences of arbitrary length by maintaining a hidden state that accumulated information from all previous time steps. However, vanilla RNNs suffered from the vanishing gradient problem, which made it difficult to capture long-range dependencies. Long short-term memory (LSTM) networks, originally proposed by Hochreiter and Schmidhuber in 1997, addressed this limitation with gating mechanisms that controlled information flow. LSTM-based language models, refined by researchers such as Alex Graves, became the dominant approach from roughly 2013 to 2017.
The introduction of the transformer architecture by Vaswani et al. in the 2017 paper "Attention Is All You Need" revolutionized language modeling. Transformers replaced recurrence with self-attention mechanisms, allowing the model to attend to all positions in the input sequence simultaneously. This parallel processing capability made transformers significantly more efficient to train on modern hardware. The transformer gave rise to three major families of language models: autoregressive models like GPT (2018), masked language models like BERT (2018), and encoder-decoder models like T5 (2019). Since 2019, the field has been dominated by increasingly large transformer-based models, culminating in systems with hundreds of billions or even trillions of parameters.
| Year | Milestone | Key contribution |
|---|---|---|
| 1948 | Shannon's "A Mathematical Theory of Communication" | Founded information theory; introduced n-gram approximations of language |
| 1980s | IBM statistical language models | N-gram models with smoothing for speech recognition |
| 2003 | Bengio et al. neural probabilistic LM | First neural language model with learned word embeddings |
| 2010 | Mikolov et al. RNN language model | Applied recurrent neural networks to language modeling |
| 2013 | Word2Vec (Mikolov et al.) | Efficient word embedding training at scale |
| 2017 | Vaswani et al. "Attention Is All You Need" | Introduced the transformer architecture |
| 2018 | GPT (Radford et al.) | Demonstrated effectiveness of autoregressive transformer pretraining |
| 2018 | BERT (Devlin et al.) | Introduced masked language model pretraining with bidirectional context |
| 2019 | T5 (Raffel et al.) | Unified text-to-text framework for NLP tasks |
| 2020 | GPT-3 (Brown et al.) | 175 billion parameter model demonstrating few-shot learning |
| 2022 | Chinchilla (Hoffmann et al.) | Established compute-optimal scaling laws |
N-gram models are the simplest and historically most important class of language models. An n-gram model estimates the probability of a word based on the preceding (n-1) words, applying the Markov assumption to approximate the full joint probability of a sequence.
The probability of a sequence of words can be decomposed using the chain rule of probability:
P(w1, w2, ..., wm) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ... × P(wm | w1, ..., wm-1)
An n-gram model simplifies each conditional by assuming that the probability of a word depends only on the previous (n-1) words:
P(wm | w1, ..., wm-1) ≈ P(wm | wm-n+1, ..., wm-1)
These conditional probabilities are typically estimated using maximum likelihood estimation (MLE) from counts in a training corpus:
P(wm | wm-n+1, ..., wm-1) = count(wm-n+1, ..., wm) / count(wm-n+1, ..., wm-1)
Common n-gram orders include unigrams (n=1, no context), bigrams (n=2, one word of context), and trigrams (n=3, two words of context).
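As an illustration, the MLE estimates above can be computed from raw corpus counts in a few lines. This is a toy sketch of a bigram model; the `<s>` and `</s>` boundary markers are a common convention for sentence starts and ends, not part of the formulas above:

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Estimate bigram probabilities P(w2 | w1) by maximum likelihood:
    P(w2 | w1) = count(w1, w2) / count(w1).
    `corpus` is a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])          # every token that can start a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
probs = train_bigram_mle(corpus)
# P(cat | the) = 1/2, because "the" is followed by "cat" in one of its two occurrences
```

Note that any bigram absent from the training corpus, such as ("the", "mat"), simply has no entry here, which is exactly the zero-probability problem that smoothing addresses.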
Raw MLE estimates assign zero probability to any n-gram not seen in the training data, which is problematic for real-world use. Several smoothing techniques address this issue:
| Technique | Description |
|---|---|
| Add-one (Laplace) smoothing | Adds 1 to every n-gram count; simple but shifts too much probability mass to unseen events |
| Add-k smoothing | Adds a fractional count k < 1 to every n-gram |
| Good-Turing discounting | Redistributes probability mass from seen to unseen events based on frequency of frequencies |
| Kneser-Ney smoothing | Uses absolute discounting combined with a lower-order distribution based on continuation probability; widely considered the best n-gram smoothing method |
| Backoff and interpolation | Combines estimates from different n-gram orders; falls back to shorter contexts when longer ones have insufficient data |
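A minimal sketch of add-one (Laplace) smoothing for bigrams, following the formula P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. The toy counts and vocabulary size below are made up for illustration:

```python
from collections import Counter

def laplace_bigram_prob(bigram, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed bigram probability. Unseen bigrams receive a
    small nonzero probability instead of zero."""
    w1, _ = bigram
    return (bigram_counts[bigram] + 1) / (unigram_counts[w1] + vocab_size)

unigram_counts = Counter({"the": 2, "cat": 1})
bigram_counts = Counter({("the", "cat"): 1})
V = 4  # illustrative vocabulary size for the toy corpus

seen = laplace_bigram_prob(("the", "cat"), bigram_counts, unigram_counts, V)
unseen = laplace_bigram_prob(("the", "dog"), bigram_counts, unigram_counts, V)
# seen = (1 + 1) / (2 + 4) = 1/3; unseen = (0 + 1) / (2 + 4) = 1/6, no longer zero
```

The same structure, with a fractional constant in place of 1, gives add-k smoothing.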
Despite their simplicity and efficiency, n-gram models have significant limitations. They cannot capture dependencies beyond the fixed context window of (n-1) words. Increasing n leads to exponential growth in the number of possible n-grams, causing severe data sparsity. Even with sophisticated smoothing, n-gram models struggle to generalize to word sequences that do not appear in the training corpus, because they treat each word as a discrete symbol with no notion of semantic similarity.
Neural language models use neural networks to estimate the probability distribution over word sequences. By representing words as continuous vectors rather than discrete symbols, these models can generalize across semantically similar words and capture more complex patterns in language.
Bengio et al.'s 2003 model used a feedforward neural network that took as input the vector representations (embeddings) of the n previous words, concatenated them, passed the result through a hidden layer with a tanh activation function, and produced a probability distribution over the vocabulary using a softmax output layer. The model simultaneously learned the word embeddings and the language model parameters during training. This approach demonstrated that neural models could outperform n-gram models, particularly on tasks involving rare or unseen word combinations, because similar words received similar embeddings and thus shared statistical strength.
Recurrent neural networks (RNNs) extended neural language modeling by removing the fixed-context-window limitation. At each time step, an RNN processes the current input word and the hidden state from the previous time step, producing a new hidden state and a prediction for the next word. This architecture allows the model to, in principle, condition on the entire preceding sequence.
Mikolov et al. (2010) demonstrated that RNN language models could significantly outperform both n-gram and feedforward neural models. However, vanilla RNNs struggled with long-range dependencies due to vanishing gradients. LSTM networks addressed this with memory cells and three gating mechanisms (input, forget, and output gates) that regulated the flow of information. Gated recurrent units (GRUs), proposed by Cho et al. in 2014, offered a simplified alternative with two gates (reset and update) and comparable performance.
Transformer-based language models have become the dominant paradigm since 2018. The transformer architecture relies entirely on self-attention mechanisms to compute representations of the input sequence, dispensing with recurrence and convolutions. Self-attention allows each position in the sequence to attend to every other position, enabling efficient capture of both local and long-range dependencies.
Transformer language models fall into three main architectural categories:
| Architecture | Context direction | Representative models | Typical use cases |
|---|---|---|---|
| Decoder-only (autoregressive) | Left-to-right (causal) | GPT series, LLaMA, PaLM | Text generation, code generation, dialogue |
| Encoder-only (autoencoding) | Bidirectional | BERT, RoBERTa, ALBERT | Classification, named entity recognition, question answering |
| Encoder-decoder | Bidirectional encoder, autoregressive decoder | T5, BART, mBART | Translation, summarization, question answering |
The training objective defines what task the language model learns during pre-training. Different objectives lead to models with different strengths.
Autoregressive or causal language models are trained to predict the next token given all preceding tokens. The training objective minimizes the negative log-likelihood:
L = - ∑t log P(wt | w1, ..., wt-1)
During training, a causal attention mask ensures that each position can only attend to positions to its left. This unidirectional approach is natural for text generation, since text is produced left to right. GPT (Radford et al., 2018) and its successors use this objective.
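The causal mask itself is simple to construct. A framework-agnostic sketch using plain Python lists rather than any particular library's tensors:

```python
def causal_mask(seq_len):
    """Boolean mask where mask[i][j] is True iff position i may attend
    to position j, i.e. j <= i (attention flows only leftward)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
# Row 0 attends only to itself; row 3 attends to all four positions.
```

In practice the mask is applied by setting disallowed attention scores to negative infinity before the softmax, so that they contribute zero weight.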
Masked language modeling (MLM), introduced by BERT (Devlin et al., 2018), randomly masks a fraction of input tokens (typically 15%) and trains the model to predict the original tokens from the surrounding bidirectional context. Because the model sees both left and right context for each masked position, it builds richer representations for understanding tasks. However, MLM models are not naturally suited for text generation, since they do not learn a left-to-right factorization of the sequence probability.
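The masking step can be sketched as follows. The 80/10/10 replacement split (mask, random token, unchanged) is the scheme reported in the BERT paper; the function name and interface here are illustrative, not any library's API:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Sketch of BERT-style masking: select ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% become a
    random vocabulary token, and 10% are left unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token as-is
    return corrupted, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = mask_tokens(tokens, vocab=["the", "cat", "sat", "on", "mat"])
```

The model is trained to recover the original tokens at the target positions only; loss is not computed on unmasked positions.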
Models like T5 and BART use more general denoising objectives. T5 uses a "span corruption" objective that masks contiguous spans of tokens and trains the model to reconstruct them. BART combines several noise functions, including token masking, token deletion, sentence permutation, and document rotation. These objectives allow encoder-decoder models to learn both comprehension and generation capabilities.
The distinction between causal (autoregressive) and masked (autoencoding) language models reflects a fundamental tradeoff in language modeling.
Causal language models process text in one direction (left to right) and predict each token based solely on the preceding tokens. This makes them well suited for generative tasks such as text completion, dialogue, and creative writing. Because they learn a proper probability distribution over sequences, they can be used directly for sampling and text generation.
Masked language models access bidirectional context, allowing each token prediction to be informed by both preceding and following tokens. This bidirectional understanding makes them stronger for discriminative tasks like classification, named entity recognition, and extractive question answering. However, they do not define a straightforward generative model over sequences.
In practice, autoregressive models have dominated the landscape of large language models since 2020, as scaling has proven especially effective for causal language modeling, and the ability to generate text is central to many applications.
Before a language model can process text, the text must be converted into a sequence of discrete units called tokens. The choice of tokenization strategy significantly affects model performance, vocabulary size, and the ability to handle multiple languages or rare words.
Early language models operated at the word level, assigning each word in the vocabulary its own index. This approach creates very large vocabularies and cannot handle out-of-vocabulary (OOV) words. Character-level models avoid OOV issues but require the model to learn longer-range dependencies.
Modern language models use subword tokenization, which splits text into units between words and characters. The three most common subword algorithms are:
| Algorithm | Method | Used by |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent pair of adjacent tokens | GPT series, LLaMA, Gemma |
| WordPiece | Merges pairs that maximize the likelihood of the training corpus | BERT, DistilBERT |
| SentencePiece (Unigram) | Starts with a large vocabulary and iteratively removes tokens that least reduce the training likelihood | T5, ALBERT, mBART |
Subword tokenization provides a good balance: common words are kept as single tokens, while rare words are split into meaningful subword units. For instance, the word "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happi", "ness"], depending on the algorithm and training data. Typical vocabulary sizes for modern language models range from 30,000 to 128,000 tokens.
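The iterative merge loop at the heart of BPE can be sketched on a toy corpus. The word frequencies below are made up for illustration; real tokenizers train on far larger corpora and handle word boundaries and byte-level fallback more carefully:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. `words` maps a word (a tuple of symbols) to its
    corpus frequency; returns the learned merge rules in order."""
    words = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        words = {tuple(merge_pair(s, best, merged)): f for s, f in words.items()}
    return merges

def merge_pair(symbols, pair, merged):
    """Replace every occurrence of `pair` in `symbols` with `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
merges = bpe_merges(corpus, num_merges=2)
# First merge: ("l", "o"), which occurs 7 times; then ("lo", "w")
```

Encoding a new word applies the learned merges in the same order, so common substrings collapse into single tokens while rare words remain split into smaller pieces.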
Evaluating language models requires both intrinsic metrics that measure how well the model fits the data and extrinsic metrics that assess performance on downstream tasks.
Perplexity (PPL) is the standard intrinsic evaluation metric for language models. It measures how "surprised" the model is when predicting the next token, averaged across a test set. Formally, perplexity is defined as:
PPL = exp(- (1/N) ∑t log P(wt | w1, ..., wt-1)), where N is the number of tokens in the test set
A lower perplexity indicates a better model. Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options at each step. State-of-the-art language models achieve perplexities below 20 on standard benchmarks like Penn Treebank, compared to over 100 for simple n-gram models.
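Given per-token log probabilities from any model, perplexity is a one-liner; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum of log P(w_t | context))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that is always uniformly uncertain over 8 choices assigns each
# token log(1/8); its perplexity is 8, matching the intuition above.
uniform = [math.log(1 / 8)] * 10
```

Using natural logs here is a convention; any base works as long as the exponential matches, since the bases cancel.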
Bits per character (BPC) and bits per byte (BPB) normalize the cross-entropy loss by the number of characters or bytes rather than tokens. This makes them particularly useful for comparing models that use different tokenizers, since perplexity is sensitive to the tokenization scheme. A model using a larger vocabulary makes fewer but harder predictions per text segment, which can give a misleadingly lower per-token perplexity.
Intrinsic metrics like perplexity do not always correlate with practical usefulness. Language models are also evaluated on downstream tasks through benchmarks such as:
| Benchmark | What it measures |
|---|---|
| GLUE / SuperGLUE | General language understanding (classification, entailment, similarity) |
| SQuAD | Reading comprehension and extractive question answering |
| MMLU | Multitask accuracy across 57 academic subjects |
| HumanEval | Code generation correctness |
| HellaSwag | Commonsense reasoning and sentence completion |
| WinoGrande | Coreference resolution requiring world knowledge |
| TruthfulQA | Factual accuracy and resistance to generating false claims |
When using a language model to generate text, the model produces a probability distribution over the vocabulary at each step. The decoding strategy determines how the next token is selected from this distribution.
Greedy decoding selects the token with the highest probability at each step. It is fast and deterministic but tends to produce repetitive and generic text, because it always takes the locally optimal choice without considering the global quality of the sequence.
Beam search maintains a set of the K most probable partial sequences (where K is called the beam width) and expands each at every step, keeping only the top K candidates. It produces more coherent output than greedy decoding and is widely used in machine translation and summarization. However, beam search can still produce repetitive text in open-ended generation tasks.
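A toy version of beam search over a hand-written next-token table. The `step_fn` stand-in replaces a real model's predictions; its probabilities are invented for illustration:

```python
import math

def beam_search(step_fn, start, beam_width=2, max_len=5):
    """Keep the `beam_width` highest-scoring partial sequences at each
    step, scored by cumulative log-probability. `step_fn(seq)` returns
    (token, log_prob) continuations for a partial sequence."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":              # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

def step_fn(seq):
    """Illustrative next-token distribution keyed on the last token."""
    table = {
        "<s>": [("the", math.log(0.6)), ("a", math.log(0.4))],
        "the": [("cat", math.log(0.9)), ("</s>", math.log(0.1))],
        "a":   [("cat", math.log(0.5)), ("</s>", math.log(0.5))],
        "cat": [("</s>", math.log(1.0))],
    }
    return table[seq[-1]]

best_seq, best_score = beam_search(step_fn, "<s>", beam_width=2)
# best_seq is ["<s>", "the", "cat", "</s>"] with probability 0.6 * 0.9 = 0.54
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sequences; production systems also add length normalization, which this sketch omits.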
Top-k sampling restricts the candidate pool to the K most probable tokens, redistributes the probability mass among them, and samples from this truncated distribution. Introduced by Fan et al. (2018) and popularized by the GPT-2 paper (Radford et al., 2019), top-k sampling improves diversity while maintaining reasonable coherence.
Nucleus sampling, proposed by Holtzman et al. (2019), dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p (for example, p = 0.9). Unlike top-k, which uses a fixed number of candidates regardless of the distribution shape, top-p adapts to the model's confidence at each step. When the model is confident, fewer tokens are considered; when it is uncertain, more tokens are included.
Temperature is a parameter that adjusts the sharpness of the probability distribution before sampling. Given logits z, the probability of token i is computed as:
P(i) = exp(zi / T) / ∑j exp(zj / T)
A temperature T < 1 makes the distribution sharper (more deterministic), while T > 1 makes it flatter (more random). At T approaching 0, sampling becomes equivalent to greedy decoding. In practice, temperature is often combined with top-k or top-p sampling.
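The strategies above compose naturally. A minimal sketch combining temperature scaling with nucleus (top-p) truncation; the function and its interface are illustrative, not a library API:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0, seed=None):
    """Sample a token from `logits` (a dict mapping token -> logit),
    applying temperature scaling then nucleus truncation.
    `temperature` must be > 0; top_p=1.0 disables truncation."""
    rng = random.Random(seed)
    # Temperature-scaled softmax (subtract the max logit for stability)
    scaled = {t: z / temperature for t, z in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(z - m) for t, z in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # Nucleus truncation: smallest set whose cumulative probability >= top_p
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and sample
    norm = sum(p for _, p in kept)
    r, acc = rng.random() * norm, 0.0
    for tok, p in kept:
        acc += p
        if acc >= r:
            return tok
    return kept[-1][0]

logits = {"mat": 5.0, "floor": 2.0, "guitar": -3.0}
token = sample_next(logits, temperature=0.8, top_p=0.9, seed=0)
# "mat": its probability alone exceeds 0.9, so it fills the entire nucleus
```

Lowering `temperature` concentrates probability on the top tokens before truncation, which is why the two knobs are usually tuned together.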
| Strategy | Deterministic? | Diversity | Best suited for |
|---|---|---|---|
| Greedy | Yes | Low | Short, factual outputs |
| Beam search | Yes (for a fixed beam width) | Low to moderate | Translation, summarization |
| Top-k sampling | No | Moderate to high | Creative text generation |
| Top-p (nucleus) sampling | No | Moderate to high | Open-ended generation |
| Temperature scaling | Depends on T | Adjustable | Combined with other strategies |
Research has revealed that language model performance follows predictable power-law relationships with respect to model size, dataset size, and training compute.
Kaplan et al. (2020) at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that cross-entropy loss decreases as a smooth power law when any of three factors (parameters, data, or compute) increases, provided the other factors are not bottlenecked. Their analysis, spanning seven orders of magnitude in compute, suggested that model size was roughly three times more important than dataset size for reducing loss, leading to the practice of training very large models on relatively modest datasets.
Hoffmann et al. (2022) at DeepMind challenged the Kaplan findings with their paper "Training Compute-Optimal Large Language Models." Their analysis showed that for a given compute budget, model size and training data should be scaled equally. The resulting "Chinchilla rule" recommends approximately 20 training tokens per parameter for compute-optimal training. This finding implied that many existing large models, including the 175-billion-parameter GPT-3, were significantly undertrained relative to their size. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed the much larger 280-billion-parameter Gopher on most benchmarks.
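The Chinchilla rule lends itself to a back-of-the-envelope calculation. The sketch below combines the ~20 tokens-per-parameter heuristic with the commonly used approximation of roughly 6 training FLOPs per parameter per token; both figures are approximations, not exact laws:

```python
def chinchilla_optimal(params):
    """Rough compute-optimal training budget under the Chinchilla
    heuristic (~20 tokens per parameter), using the common ~6 FLOPs
    per parameter per token approximation for training compute."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_optimal(70e9)  # a 70-billion-parameter model
# ~1.4 trillion tokens, matching the Chinchilla training run described above
```

By the same arithmetic, a 175-billion-parameter model would call for about 3.5 trillion training tokens, far more than GPT-3's training set, which is the sense in which GPT-3 was undertrained for its size.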
More recent research has explored additional dimensions of scaling, including the quality and diversity of training data, the effects of repeated data, and inference-time compute scaling. The general trend from millions of parameters (early neural LMs) to hundreds of billions (GPT-3, PaLM) and beyond has been accompanied by the emergence of capabilities such as in-context learning, chain-of-thought reasoning, and instruction following that do not appear at smaller scales.
While standard language models learn an unconditional (or context-only) distribution over text, many practical applications require generating text conditioned on some input. Conditional language generation extends the basic language modeling framework to produce output text given a specific input signal.
Examples include machine translation (output conditioned on a source-language sentence), summarization (output conditioned on a document), dialogue (output conditioned on the conversation history), and code generation (output conditioned on a natural language specification).
Encoder-decoder architectures like T5 are explicitly designed for conditional generation. Decoder-only models can also perform conditional generation by prepending the conditioning input to the generation context (prompt-based conditioning), a technique that has proven highly effective in large-scale autoregressive models.
Language models underpin a wide range of NLP and AI applications:
| Application | Description | Example systems |
|---|---|---|
| Text generation | Producing coherent, contextually relevant text | GPT-4, Claude, Gemini |
| Machine translation | Converting text between natural languages | Google Translate, DeepL |
| Text summarization | Condensing long documents into shorter summaries | Pegasus, BART |
| Speech recognition | Converting spoken language to text; LMs rescore hypotheses | Whisper, Google ASR |
| Sentiment analysis | Classifying the emotional tone of text | Fine-tuned BERT models |
| Question answering | Providing answers to natural language questions | GPT-4, Perplexity AI |
| Code generation | Writing source code from natural language specifications | Codex, GitHub Copilot, Claude |
| Information retrieval | Ranking documents by relevance to a query | ColBERT, MonoT5 |
| Chatbots and dialogue | Sustaining multi-turn conversations | ChatGPT, Claude, Gemini |
Imagine you are playing a guessing game where someone reads you the beginning of a sentence, and you have to guess the next word. If you have read lots and lots of books, you get pretty good at guessing. A language model is like a computer playing this guessing game. It reads billions of sentences and learns which words usually follow other words. When someone says "the cat sat on the," the model knows that "mat" or "floor" are much better guesses than "elephant" or "guitar." The better a language model is at guessing, the better it can help with things like translating languages, answering questions, or writing stories. Early language models just counted how often words appeared together. Newer ones use special math (neural networks) to understand patterns in language much more deeply, which is why they can now write whole essays, have conversations, and even write computer code.