Language Model
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v4 · 4,920 words
See also: Machine learning terms, Large language model
A language model is a probabilistic model that assigns a probability to a sequence of words or tokens in a natural language, most often by estimating the probability of the next token given the preceding context. Formally, for a sequence w1, w2, ..., wm, a language model estimates the joint probability P(w1, w2, ..., wm) or, equivalently by the chain rule, the conditional probability P(wt | w1, ..., wt-1) of each token given the tokens before it. This next-token prediction objective is the foundation of nearly all modern natural language processing, including large language models such as GPT-4, Claude, and Gemini. The concept traces to Claude Shannon, who in 1948 first modeled English as a stochastic process and measured its predictability in bits per character [1].
Language models form a cornerstone of natural language processing (NLP) and are used in tasks ranging from speech recognition and machine translation to text generation and information retrieval.
The central idea behind language modeling is that not all word sequences are equally likely. For example, the sequence "the cat sat on the mat" is far more probable than "mat the on sat cat the" in English. By capturing these statistical regularities, language models enable computers to predict, generate, and evaluate natural language text.
The field of language modeling has evolved dramatically over seven decades, progressing from simple statistical counting methods to neural network-based architectures with hundreds of billions of parameters. Modern language models, particularly those based on the transformer architecture, have demonstrated remarkable capabilities across a wide range of language tasks and have become central to contemporary artificial intelligence research.
Language models are the statistical engine behind most language technology. Historically they were used to rescore hypotheses in speech recognition and machine translation systems, where a separate model proposed candidate transcriptions or translations and the language model scored which candidate read most naturally. Today the same next-token objective, scaled up on the transformer architecture, directly powers text generation, dialogue, summarization, question answering, and code generation. Because a language model defines a probability distribution over all possible continuations of a prompt, it can be used both to evaluate how likely a given piece of text is and to generate new text by sampling from that distribution one token at a time.
The history of language modeling spans from information theory in the 1940s to the massive pretrained models of the 2020s.
The foundations of language modeling trace back to Claude Shannon's 1948 paper "A Mathematical Theory of Communication." While Shannon was primarily concerned with the engineering problem of transmitting messages efficiently through noisy communication channels, his work laid the mathematical groundwork for statistical language modeling. Shannon proposed that language could be modeled as a stochastic process and introduced the concept of entropy as a measure of the information content (or unpredictability) of a language source. He conducted experiments using n-gram approximations of English text, demonstrating that statistical patterns in language could guide prediction. Treating the 27-symbol English alphabet (26 letters plus the space) as a stationary stochastic process, Shannon reported a first-order entropy of 4.14 bits per character, falling to 3.56 bits at second order and 3.30 bits at third order as more preceding context was used [1]. In a follow-up study, "Prediction and Entropy of Printed English" (1951), Shannon used a human letter-guessing experiment to estimate the entropy of English at between 0.6 and 1.3 bits per character, a measurement that remains a reference point in the field [13].
In the 1950s, Noam Chomsky developed the theory of formal grammars, proposing that language could be described by a set of generative rules. While Chomsky's rule-based approach dominated theoretical linguistics for decades, practical applications increasingly turned to statistical methods. By the 1980s, researchers at IBM and elsewhere demonstrated that statistical approaches to language modeling were more effective for tasks like speech recognition. The IBM speech recognition group, led by Frederick Jelinek, made pioneering contributions to n-gram modeling and smoothing techniques during this period. Jelinek is widely quoted as having quipped, "Every time I fire a linguist, the performance of the speech recognizer goes up," a remark that Jurafsky and Martin date to roughly December 1988 and that captured the pragmatic ascendancy of statistical methods, though Jelinek himself later suggested the exact incident never happened [14].
The shift from purely statistical models to neural approaches began in earnest with Yoshua Bengio and colleagues' 2003 paper "A Neural Probabilistic Language Model" [2]. This work introduced the idea of using a feedforward neural network to learn distributed representations of words (later called word embeddings) alongside the language model itself. The key innovation was representing each word as a continuous real-valued vector rather than a discrete symbol, which allowed the model to generalize across similar words and alleviate the curse of dimensionality that plagued n-gram models. As the authors put it, they proposed "to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences" [2].
Tomas Mikolov and colleagues applied recurrent neural networks (RNNs) to language modeling in 2010, achieving substantial improvements over n-gram baselines [3]. Unlike feedforward models, RNNs could process sequences of arbitrary length by maintaining a hidden state that accumulated information from all previous time steps. However, vanilla RNNs suffered from the vanishing gradient problem, which made it difficult to capture long-range dependencies. Long short-term memory (LSTM) networks, originally proposed by Hochreiter and Schmidhuber in 1997, addressed this limitation with gating mechanisms that controlled information flow [4]. LSTM-based language models, refined by researchers such as Alex Graves, became the dominant approach from roughly 2013 to 2017.
The introduction of the transformer architecture by Vaswani et al. in the 2017 paper "Attention Is All You Need" revolutionized language modeling [5]. Transformers replaced recurrence with self-attention mechanisms, allowing the model to attend to all positions in the input sequence simultaneously. The authors described the Transformer as a network architecture "based solely on attention mechanisms, dispensing with recurrence and convolutions entirely," and reported that it reached 28.4 BLEU on the WMT 2014 English-to-German translation task, improving on the previous best results, including ensembles, by over 2 BLEU [5]. This parallel processing capability made transformers significantly more efficient to train on modern hardware. The transformer gave rise to three major families of language models: autoregressive models like GPT (2018), masked language models like BERT (2018), and encoder-decoder models like T5 (2019). Since 2019, the field has been dominated by increasingly large transformer-based models, culminating in systems with hundreds of billions or even trillions of parameters.
| Year | Milestone | Key contribution |
|---|---|---|
| 1948 | Shannon's "A Mathematical Theory of Communication" | Founded information theory; introduced n-gram approximations of language |
| 1951 | Shannon's "Prediction and Entropy of Printed English" | Estimated the entropy of English at 0.6 to 1.3 bits per character |
| 1980s | IBM statistical language models | N-gram models with smoothing for speech recognition |
| 2003 | Bengio et al. neural probabilistic LM | First neural language model with learned word embeddings |
| 2010 | Mikolov et al. RNN language model | Applied recurrent neural networks to language modeling |
| 2013 | Word2Vec (Mikolov et al.) | Efficient word embedding training at scale |
| 2017 | Vaswani et al. "Attention Is All You Need" | Introduced the transformer architecture |
| 2018 | GPT (Radford et al.) | Demonstrated effectiveness of autoregressive transformer pretraining |
| 2018 | BERT (Devlin et al.) | Introduced masked language model pretraining with bidirectional context |
| 2019 | T5 (Raffel et al.) | Unified text-to-text framework for NLP tasks |
| 2020 | GPT-3 (Brown et al.) | 175 billion parameter model demonstrating few-shot learning |
| 2022 | Chinchilla (Hoffmann et al.) | Established compute-optimal scaling laws |
N-gram models are the simplest and historically most important class of language models. An n-gram model estimates the probability of a word based on the preceding (n-1) words, applying the Markov assumption to approximate the full joint probability of a sequence.
The probability of a sequence of words can be decomposed using the chain rule of probability:
$$P(w_1, w_2, \ldots, w_m) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
An n-gram model simplifies each conditional by assuming that the probability of a word depends only on the previous (n-1) words:
$$P(w_m \mid w_1, \ldots, w_{m-1}) \approx P(w_m \mid w_{m-n+1}, \ldots, w_{m-1})$$
These conditional probabilities are typically estimated using maximum likelihood estimation (MLE) from counts in a training corpus:
$$P(w_m \mid w_{m-n+1}, \ldots, w_{m-1}) = \frac{\mathrm{count}(w_{m-n+1}, \ldots, w_m)}{\mathrm{count}(w_{m-n+1}, \ldots, w_{m-1})}$$
Common n-gram orders include unigrams (n=1, no context), bigrams (n=2, one word of context), and trigrams (n=3, two words of context).
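As a minimal sketch of the count-based estimate above, the following Python builds a bigram (n=2) model from a toy two-sentence corpus and scores two word orders with the chain rule. The function names, the `<s>` and `</s>` boundary markers, and the corpus are illustrative assumptions rather than part of any particular toolkit.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Return prob(word, prev) estimated by maximum likelihood from counts."""
    bigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        # Hypothetical boundary markers so the first and last words have context.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


def sequence_probability(prob, tokens):
    """Chain rule with the Markov assumption: multiply bigram conditionals."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= prob(word, prev)
    return p


corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
p = train_bigram_mle(corpus)

print(p("cat", "the"))  # 0.25: "the cat" occurs once, "the" is a context 4 times
print(sequence_probability(p, "the cat sat on the mat".split()))  # 0.0625
print(sequence_probability(p, "mat the on sat cat the".split()))  # 0.0
```

The scrambled sentence receives probability zero because its first bigram never occurs in the corpus.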
Raw MLE estimates assign zero probability to any n-gram not seen in the training data, which is problematic for real-world use. Several smoothing techniques address this issue:
| Technique | Description |
|---|---|
| Add-one (Laplace) smoothing | Adds 1 to every n-gram count; simple but imprecise |
| Add-k smoothing | Adds a fractional count k < 1 to every n-gram |
| Good-Turing discounting | Redistributes probability mass from seen to unseen events based on frequency of frequencies |
| Kneser-Ney smoothing | Uses absolute discounting combined with a lower-order distribution based on continuation probability; widely considered the best n-gram smoothing method |
| Backoff and interpolation | Combines estimates from different n-gram orders; falls back to shorter contexts when longer ones have insufficient data |
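The simplest rows of the table can be sketched in a few lines. The following is an illustrative add-k bigram estimator (add-one when k = 1), assuming k > 0 and a vocabulary made of the training words plus a hypothetical `</s>` end marker. It is not a reference implementation of any toolkit, and it does not cover Kneser-Ney or Good-Turing.

```python
from collections import Counter


def train_add_k_bigram(sentences, k=1.0):
    """Return (prob, vocab) where prob(word, prev) uses add-k smoothing."""
    bigram_counts = Counter()
    context_counts = Counter()
    vocab = {"</s>"}  # words that can be predicted; "<s>" is only ever a context
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    vocab_size = len(vocab)

    def prob(word, prev):
        # Every bigram gets k pseudo-counts, so the denominator grows by k * |V|.
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

    return prob, vocab


corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
smoothed, vocab = train_add_k_bigram(corpus, k=1.0)

print(smoothed("cat", "the"))  # seen bigram: (1 + 1) / (4 + 8)
print(smoothed("rug", "cat"))  # unseen bigram, now non-zero: (0 + 1) / (1 + 8)
```

Because the same k is added to every count, the smoothed probabilities for a given context still sum to one over the vocabulary.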
Despite their simplicity and efficiency, n-gram models have significant limitations. They cannot capture dependencies beyond the fixed context window of (n-1) words. Increasing n leads to exponential growth in the number of possible n-grams, causing severe data sparsity. Even with sophisticated smoothing, n-gram models struggle to generalize to word sequences that do not appear in the training corpus, because they treat each word as a discrete symbol with no notion of semantic similarity.
Neural language models use neural networks to estimate the probability distribution over word sequences. By representing words as continuous vectors rather than discrete symbols, these models can generalize across semantically similar words and capture more complex patterns in language.
Bengio et al.'s 2003 model used a feedforward neural network that took as input the vector representations (embeddings) of the previous n-1 words, concatenated them, passed the result through a hidden layer with a tanh activation function, and produced a probability distribution over the vocabulary using a softmax output layer. The model simultaneously learned the word embeddings and the language model parameters during training. This approach demonstrated that neural models could outperform n-gram models, particularly on tasks involving rare or unseen word combinations, because similar words received similar embeddings and thus shared statistical strength [2].
Recurrent neural networks (RNNs) extended neural language modeling by removing the fixed-context-window limitation. At each time step, an RNN processes the current input word and the hidden state from the previous time step, producing a new hidden state and a prediction for the next word. This architecture allows the model to, in principle, condition on the entire preceding sequence.
Mikolov et al. (2010) demonstrated that RNN language models could significantly outperform both n-gram and feedforward neural models [3]. However, vanilla RNNs struggled with long-range dependencies due to vanishing gradients. LSTM networks addressed this with memory cells and three gating mechanisms (input, forget, and output gates) that regulated the flow of information. Gated recurrent units (GRUs), proposed by Cho et al. in 2014, offered a simplified alternative with two gates (reset and update) and comparable performance.
Transformer-based language models have become the dominant paradigm since 2018. The transformer architecture relies entirely on self-attention mechanisms to compute representations of the input sequence, dispensing with recurrence and convolutions. Self-attention allows each position in the sequence to attend to every other position, enabling efficient capture of both local and long-range dependencies.
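To make the attention computation concrete, here is a dependency-free sketch of scaled dot-product self-attention over a short sequence of vectors. It is a simplification under stated assumptions: a real transformer first applies learned query, key, and value projections and uses multiple heads, whereas this sketch reuses the raw input vectors for all three roles. The `causal` flag is an illustrative option that hides later positions, as decoder-only models do.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention; returns (outputs, attention_weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # position j is in the future of i
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two dimensions
_, full = self_attention(x, x, x)
_, masked = self_attention(x, x, x, causal=True)

print(full[0])    # position 0 attends to all three positions
print(masked[0])  # [1.0, 0.0, 0.0]: position 0 can only see itself
```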
Transformer language models fall into three main architectural categories:
| Architecture | Context direction | Representative models | Typical use cases |
|---|---|---|---|
| Decoder-only (autoregressive) | Left-to-right (causal) | GPT series, LLaMA, PaLM | Text generation, code generation, dialogue |
| Encoder-only (autoencoding) | Bidirectional | BERT, RoBERTa, ALBERT | Classification, named entity recognition, question answering |
| Encoder-decoder | Bidirectional encoder, autoregressive decoder | T5, BART, mBART | Translation, summarization, question answering |
The training objective defines what task the language model learns during pre-training. Different objectives lead to models with different strengths.
Autoregressive or causal language models are trained to predict the next token given all preceding tokens. The training objective minimizes the negative log-likelihood:
$$L = -\sum_{t} \log P(w_t \mid w_1, \ldots, w_{t-1})$$
During training, a causal attention mask ensures that each position can only attend to positions to its left. This unidirectional approach is natural for text generation, since text is produced left to right. GPT (Radford et al., 2018) and its successors use this objective [6].
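The loss can be sketched directly from its definition: at each position the model sees only the preceding tokens and is penalized by the negative log-probability it gave to the token that actually came next. The `uniform_model` below is a hypothetical stand-in for a trained network, and the four-token vocabulary is invented for illustration.

```python
import math


def causal_lm_loss(tokens, next_token_probs):
    """Sum of -log P(w_t | w_1..w_{t-1}) over a token sequence.

    next_token_probs(context) must return a dict mapping each vocabulary
    token to its probability given the list of preceding tokens.
    """
    loss = 0.0
    for t, target in enumerate(tokens):
        context = tokens[:t]  # only tokens to the left are visible
        distribution = next_token_probs(context)
        loss -= math.log(distribution[target])
    return loss


def uniform_model(context):
    """Hypothetical model that ignores context and guesses uniformly."""
    vocab = ["a", "b", "c", "d"]
    return {token: 1.0 / len(vocab) for token in vocab}


loss = causal_lm_loss(["a", "b", "a"], uniform_model)
print(loss)  # three predictions, each costing log(4) nats
```

A model that concentrates probability on the correct next token drives each term, and therefore the total, toward zero.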
Masked language modeling (MLM), introduced by BERT (Devlin et al., 2018), randomly masks a fraction of input tokens (typically 15%) and trains the model to predict the original tokens from the surrounding bidirectional context [7]. Because the model sees both left and right context for each masked position, it builds richer representations for understanding tasks. However, MLM models are not naturally suited for text generation, since they do not learn a left-to-right factorization of the sequence probability.
Models like T5 and BART use more general denoising objectives. T5 uses a "span corruption" objective that masks contiguous spans of tokens and trains the model to reconstruct them [8]. BART combines several noise functions, including token masking, token deletion, sentence permutation, and document rotation. These objectives allow encoder-decoder models to learn both comprehension and generation capabilities.
The distinction between causal (autoregressive) and masked (autoencoding) language models reflects a fundamental tradeoff in language modeling.
Causal language models process text in one direction (left to right) and predict each token based solely on the preceding tokens. This makes them well suited for generative tasks such as text completion, dialogue, and creative writing. Because they learn a proper probability distribution over sequences, they can be used directly for sampling and text generation.
Masked language models access bidirectional context, allowing each token prediction to be informed by both preceding and following tokens. This bidirectional understanding makes them stronger for discriminative tasks like classification, named entity recognition, and extractive question answering. However, they do not define a straightforward generative model over sequences.
In practice, autoregressive models have dominated the landscape of large language models since 2020, as scaling has proven especially effective for causal language modeling, and the ability to generate text is central to many applications.
Before a language model can process text, the text must be converted into a sequence of discrete units called tokens. The choice of tokenization strategy significantly affects model performance, vocabulary size, and the ability to handle multiple languages or rare words.
Early language models operated at the word level, assigning each word in the vocabulary its own index. This approach creates very large vocabularies and cannot handle out-of-vocabulary (OOV) words. Character-level models avoid OOV issues but require the model to learn longer-range dependencies.
Modern language models use subword tokenization, which splits text into units between words and characters. The three most common subword algorithms are:
| Algorithm | Method | Used by |
|---|---|---|
| Byte Pair Encoding (BPE) | Iteratively merges the most frequent pair of adjacent tokens | GPT series, LLaMA, Gemma |
| WordPiece | Merges pairs that maximize the likelihood of the training corpus | BERT, DistilBERT |
| SentencePiece (Unigram) | Starts with a large vocabulary and iteratively removes tokens that least reduce the training likelihood | T5, ALBERT, mBART |
Subword tokenization provides a good balance: common words are kept as single tokens, while rare words are split into meaningful subword units. For instance, the word "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happi", "ness"], depending on the algorithm and training data. Typical vocabulary sizes for modern language models range from 30,000 to 128,000 tokens.
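The BPE merge loop from the table can be sketched as follows. This is a toy learner over a hand-made word-frequency dictionary, assuming character-level starting symbols and no end-of-word marker. Production tokenizers add byte-level handling, pre-tokenization, and special tokens that are not shown here.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Return (merges, segmented_vocab) after num_merges greedy merges."""
    word_freqs = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Hypothetical word counts chosen so that the suffix "est" is frequent.
merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)
print(list(segmented))
```

After two merges the frequent suffix "est" has become a single symbol, while the rarer stems remain split into characters.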
Evaluating language models requires both intrinsic metrics that measure how well the model fits the data and extrinsic metrics that assess performance on downstream tasks.
Perplexity (PPL) is the standard intrinsic evaluation metric for language models. It measures how "surprised" the model is when predicting the next token, averaged across a test set. Formally, perplexity is defined as:
$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_1, \ldots, w_{t-1})\right)$$
A lower perplexity indicates a better model. Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options at each step. State-of-the-art language models achieve perplexities below 20 on standard benchmarks like Penn Treebank, compared to over 100 for simple n-gram models.
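The "choosing uniformly among k options" intuition can be checked numerically. The sketch below computes perplexity from a list of probabilities that a model assigned to the correct token at each step. The function name and the example probabilities are illustrative.

```python
import math


def perplexity(correct_token_probs):
    """Exponentiated average negative log-likelihood per token."""
    n = len(correct_token_probs)
    avg_nll = -sum(math.log(p) for p in correct_token_probs) / n
    return math.exp(avg_nll)


# A model that always spreads probability evenly over 10 options.
uniform_over_ten = [0.1] * 5
# A model that is usually confident about the correct token.
confident = [0.9, 0.8, 0.95, 0.7, 0.85]

print(perplexity(uniform_over_ten))  # approximately 10
print(perplexity(confident))         # close to 1
```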
Bits per character (BPC) and bits per byte (BPB) normalize the cross-entropy loss by the number of characters or bytes rather than tokens. This makes them particularly useful for comparing models that use different tokenizers, since perplexity is sensitive to the tokenization scheme. A model with a larger vocabulary makes fewer but harder predictions per text segment, so its per-token perplexity can be higher even when it assigns the same total probability to the text, which makes per-token comparisons across tokenizers misleading. These character-level measures connect directly back to Shannon's original framing of language as a source with a measurable entropy in bits per character [1].
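A small worked example illustrates the tokenizer sensitivity. Assume, hypothetically, that two models assign the same total log-likelihood to a 100-character text but segment it into 25 and 20 tokens respectively. The numbers are invented for illustration.

```python
import math


def per_token_perplexity(total_nll_nats, num_tokens):
    """Perplexity normalized by the number of tokens."""
    return math.exp(total_nll_nats / num_tokens)


def bits_per_character(total_nll_nats, num_chars):
    """Cross-entropy in bits, normalized by the number of characters."""
    return total_nll_nats / (num_chars * math.log(2))


num_chars = 100
total_nll = 100 * math.log(2)  # same total loss for both models, in nats

ppl_small_vocab = per_token_perplexity(total_nll, num_tokens=25)
ppl_large_vocab = per_token_perplexity(total_nll, num_tokens=20)
bpc = bits_per_character(total_nll, num_chars)

print(ppl_small_vocab)  # about 16
print(ppl_large_vocab)  # about 32
print(bpc)              # about 1.0 for both models
```

The two per-token perplexities differ by a factor of two even though both models fit the text equally well, while the bits-per-character figure is identical.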
Intrinsic metrics like perplexity do not always correlate with practical usefulness. Language models are also evaluated on downstream tasks through benchmarks such as:
| Benchmark | What it measures |
|---|---|
| GLUE / SuperGLUE | General language understanding (classification, entailment, similarity) |
| SQuAD | Reading comprehension and extractive question answering |
| MMLU | Multitask accuracy across 57 academic subjects |
| HumanEval | Code generation correctness |
| HellaSwag | Commonsense reasoning and sentence completion |
| WinoGrande | Coreference resolution requiring world knowledge |
| TruthfulQA | Factual accuracy and resistance to generating false claims |
When using a language model to generate text, the model produces a probability distribution over the vocabulary at each step. The decoding strategy determines how the next token is selected from this distribution.
Greedy decoding selects the token with the highest probability at each step. It is fast and deterministic but tends to produce repetitive and generic text, because it always takes the locally optimal choice without considering the global quality of the sequence.
Beam search maintains a set of the K most probable partial sequences (where K is called the beam width) and expands each at every step, keeping only the top K candidates. It produces more coherent output than greedy decoding and is widely used in machine translation and summarization. However, beam search can still produce repetitive text in open-ended generation tasks. Holtzman et al. (2019) found that maximization-based methods such as beam search lead to degeneration, producing text that is bland, incoherent, or stuck in repetitive loops [12].
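The difference between greedy decoding and beam search can be seen on a tiny hand-built model. In the sketch below, `TOY_MODEL` is a hypothetical lookup table of next-token probabilities, not a trained model, and greedy decoding is obtained as the special case of beam width 1.

```python
import math

# Hypothetical next-token distributions keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_logprobs(tokens):
    return {tok: math.log(p) for tok, p in TOY_MODEL[tuple(tokens)].items()}


def beam_search(next_token_logprobs, beam_width, max_len, eos="</s>"):
    """Return the top `beam_width` hypotheses as (tokens, log_probability)."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypothesis carries over
                continue
            for tok, logprob in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + logprob))
        # Keep only the highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens and tokens[-1] == eos for tokens, _ in beams):
            break
    return beams


greedy_tokens, greedy_score = beam_search(toy_logprobs, beam_width=1, max_len=2)[0]
beam_tokens, beam_score = beam_search(toy_logprobs, beam_width=2, max_len=2)[0]

print(greedy_tokens, math.exp(greedy_score))  # starts with "a", probability 0.3
print(beam_tokens, math.exp(beam_score))      # ['b', 'z'], probability 0.36
```

Greedy decoding commits to "a" because it is the best first token, whereas a beam of width 2 keeps "b" alive long enough to discover the higher-probability sequence.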
Top-k sampling restricts the candidate pool to the k most probable tokens, renormalizes the probability mass among them, and samples from this truncated distribution. Introduced by Fan et al. (2018) and popularized by the GPT-2 paper (Radford et al., 2019), top-k sampling improves diversity while maintaining reasonable coherence.
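A minimal sketch of the truncate-and-renormalize step, using an invented next-token distribution for the prompt "the cat sat on the". The function name and the probabilities are illustrative assumptions.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


next_token_probs = {"mat": 0.5, "floor": 0.3, "sofa": 0.15, "guitar": 0.05}
truncated = top_k_filter(next_token_probs, k=2)
print(truncated)  # only "mat" and "floor" remain, rescaled to sum to 1

# Sample one token from the truncated distribution (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(list(truncated), weights=list(truncated.values()), k=1)[0]
print(sampled)
```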
Nucleus sampling (also called top-p sampling), proposed by Holtzman et al. (2019), dynamically selects the smallest set of tokens whose cumulative probability meets or exceeds a threshold p (for example, p = 0.9) [12]. Unlike top-k, which uses a fixed number of candidates regardless of the distribution shape, top-p adapts to the model's confidence at each step. When the model is confident, fewer tokens are considered; when it is uncertain, more tokens are included. The authors concluded that nucleus sampling is "the best available decoding strategy for generating long-form text that is both high-quality and as diverse as human-written text" [12].
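The adaptive behavior is easy to see in code. The sketch below keeps the smallest set of top-ranked tokens whose cumulative probability reaches the threshold and then renormalizes. The two example distributions are invented to contrast a confident step with an uncertain one.

```python
def nucleus_filter(probs, p):
    """Keep the smallest top-ranked set with cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


confident_step = {"mat": 0.92, "floor": 0.05, "sofa": 0.02, "guitar": 0.01}
uncertain_step = {"mat": 0.3, "floor": 0.3, "sofa": 0.2, "rug": 0.2}

print(nucleus_filter(confident_step, p=0.9))  # a single candidate survives
print(nucleus_filter(uncertain_step, p=0.9))  # all four candidates survive
```

With the same threshold, the candidate pool shrinks to one token when the model is confident and widens to four when it is not, which a fixed k cannot do.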
Temperature is a parameter that adjusts the sharpness of the probability distribution before sampling. Given logits z, the probability of token i is computed as:
$$P(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
A temperature T < 1 makes the distribution sharper (more deterministic), while T > 1 makes it flatter (more random). At T approaching 0, sampling becomes equivalent to greedy decoding. In practice, temperature is often combined with top-k or top-p sampling.
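The effect of T can be checked with a direct implementation of the formula above. The logits are arbitrary illustrative values, and the sketch assumes T > 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

print(sharp)    # most mass on the first token
print(neutral)  # the unmodified softmax
print(flat)     # closer to uniform
```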
| Strategy | Deterministic? | Diversity | Best suited for |
|---|---|---|---|
| Greedy | Yes | Low | Short, factual outputs |
| Beam search | Yes | Low to moderate | Translation, summarization |
| Top-k sampling | No | Moderate to high | Creative text generation |
| Top-p (nucleus) sampling | No | Moderate to high | Open-ended generation |
| Temperature scaling | Depends on T | Adjustable | Combined with other strategies |
Research has revealed that language model performance follows predictable power-law relationships with respect to model size, dataset size, and training compute.
Kaplan et al. (2020) at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that cross-entropy loss decreases as a smooth power law when any of three factors (parameters, data, or compute) increases, provided the other factors are not bottlenecked. Their analysis, spanning seven orders of magnitude in compute, suggested that model size was roughly three times more important than dataset size for reducing loss, leading to the practice of training very large models on relatively modest datasets [10]. GPT-3, released the same year, embodied this philosophy: an autoregressive model with 175 billion parameters, which Brown et al. described as "10x more than any previous non-sparse language model" [9].
Hoffmann et al. (2022) at DeepMind challenged the Kaplan findings with their paper "Training Compute-Optimal Large Language Models." Their analysis showed that for a given compute budget, model size and training data should be scaled equally. The resulting "Chinchilla rule" recommends approximately 20 training tokens per parameter for compute-optimal training. This finding implied that many existing large models, including the 175-billion-parameter GPT-3, were significantly undertrained relative to their size. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens with the same compute budget as the 280-billion-parameter Gopher, uniformly outperformed Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across a wide range of evaluation tasks. On the MMLU benchmark, Chinchilla reached a state-of-the-art average accuracy of 67.5%, an improvement of more than 7 percentage points over Gopher [11].
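The rule of thumb is simple enough to check by arithmetic. The sketch below applies the approximate 20-tokens-per-parameter ratio quoted above; the constant is a rounded heuristic, not an exact law, and the function name is illustrative.

```python
TOKENS_PER_PARAMETER = 20  # approximate ratio from the compute-optimal analysis


def compute_optimal_tokens(num_parameters):
    """Training tokens suggested by the ~20 tokens-per-parameter heuristic."""
    return TOKENS_PER_PARAMETER * num_parameters


chinchilla_tokens = compute_optimal_tokens(70_000_000_000)
gpt3_sized_tokens = compute_optimal_tokens(175_000_000_000)

print(f"{chinchilla_tokens:,}")  # 1,400,000,000,000 (1.4 trillion)
print(f"{gpt3_sized_tokens:,}")  # 3,500,000,000,000 (3.5 trillion)
```

The first figure matches the 1.4 trillion tokens on which the 70-billion-parameter Chinchilla was trained. The second shows how much data the same heuristic would suggest for a 175-billion-parameter model.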
More recent research has explored additional dimensions of scaling, including the quality and diversity of training data, the effects of repeated data, and inference-time compute scaling. The general trend from millions of parameters (early neural LMs) to hundreds of billions (GPT-3, PaLM) and beyond has been accompanied by the emergence of capabilities such as in-context learning, chain-of-thought reasoning, and instruction following that do not appear at smaller scales [9].
While standard language models learn an unconditional (or context-only) distribution over text, many practical applications require generating text conditioned on some input. Conditional language generation extends the basic language modeling framework to produce output text given a specific input signal.
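Examples include machine translation (conditioned on the source-language text), summarization (conditioned on the document to be condensed), and question answering (conditioned on the question).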
Encoder-decoder architectures like T5 are explicitly designed for conditional generation. Decoder-only models can also perform conditional generation by prepending the conditioning input to the generation context (prompt-based conditioning), a technique that has proven highly effective in large-scale autoregressive models.
Language models underpin a wide range of NLP and AI applications:
| Application | Description | Example systems |
|---|---|---|
| Text generation | Producing coherent, contextually relevant text | GPT-4, Claude, Gemini |
| Machine translation | Converting text between natural languages | Google Translate, DeepL |
| Text summarization | Condensing long documents into shorter summaries | Pegasus, BART |
| Speech recognition | Converting spoken language to text; LMs rescore hypotheses | Whisper, Google ASR |
| Sentiment analysis | Classifying the emotional tone of text | Fine-tuned BERT models |
| Question answering | Providing answers to natural language questions | GPT-4, Perplexity AI |
| Code generation | Writing source code from natural language specifications | Codex, GitHub Copilot, Claude |
| Information retrieval | Ranking documents by relevance to a query | ColBERT, MonoT5 |
| Chatbots and dialogue | Sustaining multi-turn conversations | ChatGPT, Claude, Gemini |
A language model is the general concept: any model that assigns probabilities to sequences of tokens, from a bigram counting model to a trillion-parameter transformer. A large language model (LLM) is a specific, modern instance of that concept: a transformer-based language model with typically billions to hundreds of billions of parameters, pre-trained on internet-scale text corpora and usually fine-tuned to follow instructions. The arithmetic and the training objective are the same next-token prediction that Shannon described; what changed is the scale of the model, the size of the training data, and the emergent capabilities (in-context learning, reasoning, instruction following) that appear only at large scale. In other words, every LLM is a language model, but not every language model is an LLM.
Imagine you are playing a guessing game where someone reads you the beginning of a sentence, and you have to guess the next word. If you have read lots and lots of books, you get pretty good at guessing. A language model is like a computer playing this guessing game. It reads billions of sentences and learns which words usually follow other words. When someone says "the cat sat on the," the model knows that "mat" or "floor" are much better guesses than "elephant" or "guitar." The better a language model is at guessing, the better it can help with things like translating languages, answering questions, or writing stories. Early language models just counted how often words appeared together. Newer ones use special math (neural networks) to understand patterns in language much more deeply, which is why they can now write whole essays, have conversations, and even write computer code.