See also: Machine learning terms, Language model, BERT
A masked language model (MLM) is a type of language model trained to predict missing or hidden tokens within a sequence of text. Unlike autoregressive models such as GPT, which generate text from left to right by predicting the next token, masked language models learn from both the left and right context simultaneously. This bidirectional training approach allows them to build rich contextual representations of language, making them especially effective for natural language understanding tasks.
The masked language modeling objective is a form of self-supervised learning, since the training signal comes from the text itself rather than from human-provided labels. It rose to prominence with the release of BERT (Bidirectional Encoder Representations from Transformers) in 2018 by researchers at Google. Since then, the technique has become one of the foundational pre-training strategies in natural language processing (NLP), powering a wide range of models and applications. MLM-based models are routinely used in text classification, named entity recognition, sentiment analysis, question answering, and many other tasks where understanding the meaning of text is more important than generating it.
The conceptual roots of masked language modeling extend back to 1953, when psycholinguist Wilson L. Taylor introduced the "cloze procedure." Inspired by the Gestalt principle of closure, Taylor designed a test in which certain words were removed from a passage of text and readers were asked to fill in the blanks. The procedure was originally developed as a tool for measuring the readability of written materials, but it quickly found broader applications in education and cognitive science. The word "cloze" itself is derived from "closure," reflecting the idea that readers mentally close the gaps in a text by drawing on surrounding context.
Masked language modeling is, in effect, a computational version of the cloze task. Instead of human readers filling in blanks, a neural network learns to predict the missing tokens. Researchers frequently cite Taylor's 1953 work as the intellectual precursor to modern MLM objectives.
Before masked language models, the NLP community relied on static word embeddings like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These methods assigned a single fixed vector to each word, regardless of the context in which it appeared. The word "bank" would receive the same representation whether it referred to a financial institution or the edge of a river.
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, took a step forward by generating context-dependent representations using bidirectional LSTMs. However, ELMo's bidirectionality was shallow: it concatenated the outputs of a forward and a backward language model rather than truly jointly conditioning on both directions at every layer.
BERT solved this limitation by introducing the masked language modeling objective, which allowed a Transformer encoder to attend to context on both sides of every token at every layer. This represented a fundamental shift in how language models learned representations and set the stage for a new generation of NLP systems.
The seminal paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" was published in October 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI Language. It was presented at NAACL 2019 and rapidly became one of the most cited papers in the history of NLP. BERT achieved state-of-the-art results on eleven NLP benchmarks at the time of publication, including pushing the GLUE benchmark score to 80.5% (a 7.7 percentage point improvement over the previous best).
The masked language modeling procedure can be broken down into several steps:
1. The input text is tokenized into subword units (for example, WordPiece tokens).
2. A fraction of the tokens (15% in BERT) is randomly selected for prediction.
3. The selected tokens are corrupted, most often by replacing them with a special [MASK] token (see the 80-10-10 strategy below).
4. The corrupted sequence is fed through the model, which outputs a probability distribution over the vocabulary at each selected position.
5. The cross-entropy loss between the predictions and the original tokens is computed only at the selected positions.

This procedure is repeated over millions or billions of training examples drawn from large text corpora, allowing the model to learn general-purpose language representations.
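The trained objective can be queried directly as a fill-in-the-blank task. The sketch below assumes the Hugging Face `transformers` library and the publicly available `bert-base-uncased` checkpoint; it is an illustration of the idea rather than part of any training pipeline.

```python
# Query a pre-trained masked language model to fill in a blank.
# Assumes the Hugging Face `transformers` library and the bert-base-uncased checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model returns a ranked distribution over the vocabulary at the [MASK] position.
for prediction in unmasker("The [MASK] sat on the mat."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```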
One of the key design decisions in BERT's masked language modeling is what to do with the 15% of tokens selected for prediction. Simply replacing all of them with [MASK] would create a mismatch between pre-training and fine-tuning, since the [MASK] token never appears in real downstream tasks. To mitigate this discrepancy, Devlin et al. introduced the 80-10-10 strategy:
| Percentage | Action | Example (original word: "cat") | Purpose |
|---|---|---|---|
| 80% | Replace with [MASK] | "The [MASK] sat on the mat" | Teaches the model to predict from context |
| 10% | Replace with a random token | "The dog sat on the mat" | Forces the model to maintain robust representations for all tokens, not just [MASK] |
| 10% | Keep the original token unchanged | "The cat sat on the mat" | Biases the model's representation toward the actual observed token |
The 10% random replacement prevents the model from learning a shortcut where it only needs to produce good representations at [MASK] positions. Because any token might have been randomly swapped, the model must maintain accurate contextual representations everywhere. The 10% unchanged case further encourages the model to keep its representation faithful to the observed input, which is useful during fine-tuning when no tokens are masked.
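A minimal sketch of how this selection could be implemented is shown below. The function and its arguments (`mask_token_id`, `vocab_size`) are illustrative rather than drawn from any particular library's implementation.

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Apply BERT-style 80-10-10 corruption to a list of token ids.

    Returns the corrupted ids and a list of labels in which unselected
    positions are marked with -100 so they are ignored by the loss.
    """
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)             # -100 = ignored by the loss

    for i, token_id in enumerate(token_ids):
        if random.random() >= mlm_probability:   # ~85%: not selected, no loss
            continue
        labels[i] = token_id                     # the original token is the target here
        roll = random.random()
        if roll < 0.8:                           # 80% of selected: replace with [MASK]
            corrupted[i] = mask_token_id
        elif roll < 0.9:                         # 10% of selected: replace with a random token
            corrupted[i] = random.randrange(vocab_size)
        # remaining 10% of selected: keep the original token unchanged
    return corrupted, labels
```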
Research has shown that this combined strategy helps bridge the gap between pre-training (where [MASK] tokens exist) and fine-tuning (where they do not), and performance is not especially sensitive to the exact percentages. The overall masking rate itself has also been revisited, as discussed below.
The choice of masking 15% of tokens was established in the original BERT paper and has become the default for most subsequent MLM-based models. The rationale is a tradeoff: masking too few tokens makes training inefficient because the model receives a learning signal from only a small fraction of each sequence, while masking too many tokens removes so much context that prediction becomes unreliable.
Some later work has explored different masking rates. The paper "Should You Mask 15% in Masked Language Modeling?" by Wettig et al. (2022) systematically varied the masking rate and found that the optimal rate depends on factors such as model size and training duration. For models trained for fewer steps, higher masking rates (around 40%) can yield better performance because each step provides more learning signal. For models trained until convergence, the differences narrow.
The loss function for masked language modeling is the cross-entropy loss computed exclusively over the masked positions. Formally, given an input sequence of tokens x = (x_1, x_2, ..., x_n) and a set of masked positions M, the input is modified to produce a corrupted sequence x̃, where tokens at positions in M are replaced according to the 80-10-10 strategy. The model f_θ (parameterized by θ) produces a probability distribution over the vocabulary for each masked position.
The MLM training objective minimizes the negative log-likelihood of the true tokens at the masked positions:
L_MLM(θ) = - (1/|M|) Σ_{i ∈ M} log P_θ(x_i | x̃)
where P_θ(x_i | x̃) is the probability assigned by the model to the true token x_i given the corrupted input x̃. The sum is taken only over the masked positions M, not over all positions in the sequence.
At each masked position, the transformer's hidden state is passed through a linear layer followed by a softmax over the vocabulary. The resulting probability distribution is compared against the one-hot ground-truth label for the original token. Tokens that were not masked receive a special ignore label (typically -100 in implementations like Hugging Face Transformers), so they do not contribute to the gradient.
This selective loss computation is important because the model sees the full (partially corrupted) input but only receives a learning signal from the masked positions, encouraging it to build representations that encode information about the entire sequence.
During training, gradients of this loss are computed with respect to the model parameters θ and used to update the model via gradient descent (typically using the Adam optimizer with a learning rate warm-up schedule).
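In a PyTorch-style implementation, this selective loss reduces to a cross-entropy call with an ignore index. The sketch below assumes model logits of shape (batch, sequence, vocabulary) and labels in which every non-masked position has been set to -100, as described above.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy computed over masked positions only.

    logits: (batch, seq_len, vocab_size) scores from the MLM prediction head
    labels: (batch, seq_len) original token ids at masked positions, -100 elsewhere
    """
    vocab_size = logits.size(-1)
    # ignore_index=-100 drops all non-masked positions from the average,
    # so only the masked positions contribute to the gradient.
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
```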
Masked language models are almost universally built on the Transformer encoder architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The encoder processes the entire input sequence simultaneously, with each token attending to every other token through multi-head self-attention. This is in contrast to the Transformer decoder, which uses causal masking to prevent tokens from attending to future positions.
BERT was released in two sizes:
| Configuration | Layers | Hidden Size | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
Both configurations use the standard Transformer encoder stack. Each layer consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network, with layer normalization and residual connections. The feed-forward intermediate size is 3,072 for BERT-Base and 4,096 for BERT-Large.
BERT's input representation sums three types of embeddings: token embeddings (from a WordPiece vocabulary of 30,522 tokens), segment embeddings (to distinguish between sentence A and sentence B in sentence-pair tasks), and positional embeddings (to encode token positions within the sequence).
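A simplified sketch of this input representation is shown below, using BERT-Base sizes; it is illustrative and omits details such as dropout.

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sums token, segment, and position embeddings (BERT-Base sizes)."""

    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)
        self.position = nn.Embedding(max_position, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids):
        # Position ids simply count 0, 1, 2, ... along the sequence.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        summed = (self.token(token_ids)
                  + self.segment(segment_ids)
                  + self.position(positions))
        return self.norm(summed)
```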
BERT used two pre-training objectives simultaneously: masked language modeling, as described above, and Next Sentence Prediction (NSP), a binary classification task in which the model predicts whether two text segments appeared consecutively in the original corpus or were paired at random.
Later research by Liu et al. (2019) in the RoBERTa paper demonstrated that NSP was not beneficial and could even hurt performance. Subsequent models largely dropped or replaced it with alternative inter-sentence objectives.
BERT was pre-trained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words). Training BERT-Base on 4 Cloud TPUs (16 TPU chips) required 4 days; training BERT-Large on 16 Cloud TPUs (64 TPU chips) also took 4 days.
Masked language modeling and causal language modeling represent two fundamentally different approaches to language model pre-training. Understanding their differences is important for choosing the right model for a given task.
| Feature | Masked Language Model (e.g., BERT) | Causal Language Model (e.g., GPT) |
|---|---|---|
| Context direction | Bidirectional (attends to both left and right) | Unidirectional (attends only to the left) |
| Training objective | Predict masked tokens | Predict the next token |
| Architecture | Transformer encoder | Transformer decoder |
| Attention pattern | Full self-attention (no causal mask) | Causal (triangular) attention mask |
| Generation capability | Not designed for sequential generation | Naturally generates text token by token |
| Strengths | Natural language understanding, classification, extraction | Text generation, dialogue, creative writing |
| Pre-train/fine-tune mismatch | [MASK] token absent at fine-tuning time | No mismatch; same left-to-right prediction |
| Token independence assumption | Masked tokens predicted independently of each other | Each token conditioned on all previous tokens |
| Representative models | BERT, RoBERTa, ALBERT, ELECTRA | GPT series, LLaMA, PaLM |
The core advantage of MLM is bidirectional context. When predicting a masked token, the model can use information from both the preceding and following tokens. This is particularly valuable for understanding tasks where the meaning of a word depends on its full sentence context. In an autoregressive model, each token can only attend to tokens that came before it, which means the model has an inherently incomplete view of the context at any given position.
The main weakness of MLM is that it does not naturally support text generation. Because masked language models are trained to fill in blanks within existing text rather than produce text sequentially, they cannot easily generate coherent multi-sentence outputs. Autoregressive models, by contrast, are directly trained to generate text one token at a time and can produce fluent, long-form outputs. This is the primary reason why the most successful generative models (GPT-2, GPT-3, GPT-4, and other large language models) have used autoregressive objectives.
Autoregressive models have demonstrated more consistent scaling behavior. As model size and training data increase, autoregressive models show predictable improvements in performance (as documented by the scaling laws of Kaplan et al., 2020, and Hoffmann et al., 2022). MLM-based models have also benefited from scaling, but the largest and most capable language models (with hundreds of billions or trillions of parameters) have overwhelmingly been autoregressive. This is partly because text generation is a more commercially valuable capability and partly because the autoregressive objective is simpler and more computationally straightforward to scale.
A well-known limitation of standard MLM is the conditional independence assumption: masked tokens are predicted independently of each other. If tokens at positions 3 and 7 are both masked, the model predicts each one without considering what it predicts for the other. This can be problematic when the masked tokens are semantically related. XLNet's permutation language modeling was specifically designed to address this limitation, and later work on non-autoregressive generation has explored similar ideas.
Since the original BERT paper, researchers have proposed several alternative masking strategies that improve upon the basic random token masking approach. These variants generally aim to create harder or more linguistically meaningful prediction tasks.
BERT used static masking: the training corpus was preprocessed once, with mask patterns applied and saved before training began. Each training example always had the same set of tokens masked, even across different epochs. BERT did create up to 10 copies of the data with different masks, but within each copy, the pattern was fixed.
RoBERTa (Liu et al., 2019) introduced dynamic masking, where the masking pattern is generated on-the-fly each time a sequence is fed to the model during training. This means the model sees a different masking pattern for the same input in every epoch. Dynamic masking slightly improved performance across benchmarks and eliminated the need to preprocess and store multiple masked copies of the training data.
In standard BERT, tokenization often splits words into subword units. For example, the word "playing" might be tokenized as ["play", "##ing"]. With standard random masking, only one of these subword tokens might be selected, making the prediction trivially easy since the other subword provides a strong hint.
Whole Word Masking (WWM) addresses this by ensuring that when any subword token of a word is selected for masking, all subword tokens belonging to that word are masked together. Google released whole-word-masked versions of BERT that showed improved performance, particularly on reading comprehension tasks like SQuAD. This strategy was also adopted for Chinese BERT (Cui et al., 2019), where character-level masking of multi-character words posed similar issues.
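One way to implement the grouping step is sketched below, assuming WordPiece continuation tokens marked with a "##" prefix; the helper is illustrative rather than the exact logic of any released implementation.

```python
def whole_word_spans(tokens):
    """Group WordPiece tokens into whole-word index spans.

    Continuation pieces (prefixed with '##') are attached to the preceding
    word, so masking decisions can be made per word rather than per piece.
    """
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    return spans

# Example: ["play", "##ing", "in", "the", "park"] -> [[0, 1], [2], [3], [4]]
# A whole-word masking routine then selects entire spans instead of single indices.
```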
SpanBERT (Joshi et al., 2020) extended the masking strategy further by masking contiguous spans of tokens rather than individual tokens or whole words. Span lengths are sampled from a geometric distribution (biased toward shorter spans, with a mean of 3.8 tokens and a maximum length of 10 tokens), and the starting point is always aligned to a word boundary.
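A sketch of the span-length sampling described above, using a truncated geometric distribution (lengths above the cap are resampled); with p = 0.2 and a cap of 10, the mean comes out close to the reported 3.8 tokens.

```python
import numpy as np

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length from a truncated geometric distribution.

    Lengths above max_len are rejected and resampled; with p = 0.2 and
    max_len = 10 the mean span length is roughly 3.8 tokens.
    """
    while True:
        length = np.random.geometric(p)   # P(length = k) = (1 - p)^(k - 1) * p
        if length <= max_len:
            return length
```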
In addition to the standard MLM objective, SpanBERT introduced the Span Boundary Objective (SBO), which trains the model to predict the tokens within a masked span using only the representations at the span's boundaries (the tokens immediately before and after the span). This encourages the model to encode span-level information in its boundary representations.
SpanBERT also dropped the NSP objective entirely and trained on single contiguous segments of text rather than sentence pairs. These changes, combined with span masking, led to consistent improvements on tasks requiring span-level reasoning, such as extractive question answering and coreference resolution.
ERNIE (Sun et al., 2019) proposed masking named entities and phrases as whole units rather than random tokens. The motivation is that masking only part of an entity (e.g., masking "Potter" in "Harry Potter") allows the model to predict the missing token through word collocation patterns within the entity, without needing to understand the broader semantic context. For instance, predicting "Potter" after "Harry" is trivial, but predicting the entire entity "Harry Potter" in a sentence about J. K. Rowling requires the model to reason about real-world knowledge. By masking entire entities and phrases, ERNIE forces the model to learn deeper semantic relationships from the surrounding context. This approach proved especially effective for knowledge-intensive tasks and Chinese NLP benchmarks.
MacBERT (Cui et al., 2020) took a different approach to the replacement strategy. Instead of replacing masked tokens with [MASK] or random tokens, MacBERT replaces them with similar words found via a synonym dictionary or word embedding similarity. This transforms the MLM task into a correction task: the model must identify and correct tokens that have been replaced with plausible but incorrect alternatives. This strategy further reduces the pre-train/fine-tune discrepancy, since neither [MASK] tokens nor obviously random words appear in the input.
| Variant | Masking Unit | Key Innovation | Introduced By |
|---|---|---|---|
| Standard MLM | Random subword tokens | 80/10/10 replacement rule | BERT (Devlin et al., 2019) |
| Whole Word Masking | Whole words | Masks all subword tokens of a word together | Google (2019) |
| Dynamic Masking | Random subword tokens | New mask pattern each training step | RoBERTa (Liu et al., 2019) |
| Span Masking | Contiguous spans | Geometric span length + boundary objective | SpanBERT (Joshi et al., 2020) |
| Entity/Phrase Masking | Named entities and phrases | Knowledge-aware masking units | ERNIE (Sun et al., 2019) |
| MLM as Correction | Similar words | Replaces masks with similar (not random) tokens | MacBERT (Cui et al., 2020) |
Masked language modeling can be viewed as a specific instance of a broader class of denoising autoencoders applied to text. In a denoising autoencoder, the input is corrupted in some way and the model learns to reconstruct the original input. MLM corrupts text by replacing tokens with [MASK] (or random tokens) and trains the model to recover the originals at those positions.
This denoising perspective has been extended in several influential models that generalize MLM to encoder-decoder architectures:
T5 (Raffel et al., 2020) uses a span corruption objective where contiguous spans of tokens are replaced with single sentinel tokens (e.g., <extra_id_0>), and an encoder-decoder transformer must generate the missing spans as output. Like BERT, T5 uses a 15% corruption rate, but the key difference is architectural: rather than predicting masked tokens "in place" through a classification head on top of the encoder, T5 generates the missing content autoregressively through its decoder. This design allows T5 to handle both understanding and generation tasks within a single unified text-to-text framework. The span corruption objective proved particularly effective because predicting multiple consecutive tokens at once requires richer contextual understanding than single-token prediction.
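The example below illustrates what span corruption looks like on a single sentence; the sentinel names follow T5's `<extra_id_N>` convention, and the particular spans chosen here are illustrative (in practice they are sampled randomly).

```python
# Illustrative T5-style span corruption on a single sentence.
original = "The quick brown fox jumps over the lazy dog"

# Encoder input: each corrupted span is collapsed into a single sentinel token.
encoder_input = "The quick <extra_id_0> fox jumps <extra_id_1> the lazy dog"

# Decoder target: each sentinel followed by the text it replaced,
# terminated by a final sentinel.
decoder_target = "<extra_id_0> brown <extra_id_1> over <extra_id_2>"
```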
BART (Lewis et al., 2020) applies multiple types of noise to the input, including token masking, token deletion, text infilling (replacing spans with a single [MASK] token), sentence permutation, and document rotation. The model is trained as a denoising autoencoder with a bidirectional encoder and an autoregressive decoder that reconstructs the original text. Text infilling is especially notable because the model must determine how many tokens are missing at each [MASK] position, making it a harder task than standard MLM. BART's best-performing configuration masks approximately 30% of tokens via text infilling and permutes all sentences. This combination of noise functions makes BART effective for both understanding and generation tasks, particularly summarization.
UL2 (Tay et al., 2022) proposed a unified framework that mixes different denoising objectives during pre-training. It combines three types of denoising tasks: the R-denoiser (regular span corruption similar to T5), the S-denoiser (sequential denoising, in which the model sees a prefix and must generate the continuation, as in prefix language modeling), and the X-denoiser (extreme span corruption with high masking rates and long spans). By mixing these objectives, UL2 bridges the gap between MLM-style objectives (good for understanding) and autoregressive objectives (good for generation). PaLM 2, one of the largest publicly disclosed models, reportedly adopted a similar mixture of pre-training objectives.
These denoising approaches generalize the core insight of MLM: that learning to reconstruct corrupted text produces powerful language representations.
The masked language modeling objective has been adopted and adapted by a wide range of models since BERT's introduction. The following table summarizes the most influential MLM-based models and their key innovations.
| Model | Year | Authors / Organization | Key Innovation | Pre-training Objective |
|---|---|---|---|---|
| BERT | 2018 | Devlin et al. / Google | Introduced MLM + NSP for bidirectional pre-training | MLM + NSP |
| RoBERTa | 2019 | Liu et al. / Facebook AI | Dynamic masking, removed NSP, longer training with more data | MLM only |
| ALBERT | 2019 | Lan et al. / Google | Cross-layer parameter sharing, factorized embeddings, SOP replacing NSP | MLM + SOP |
| SpanBERT | 2020 | Joshi et al. / Facebook AI | Span masking + Span Boundary Objective | Span MLM + SBO |
| ELECTRA | 2020 | Clark et al. / Google / Stanford | Replaced token detection instead of masked token prediction | RTD (generator + discriminator) |
| DeBERTa | 2020 | He et al. / Microsoft | Disentangled attention + Enhanced Mask Decoder | MLM |
| XLNet | 2019 | Yang et al. / Google / CMU | Permutation language modeling to capture bidirectional context without masking | PLM |
RoBERTa (A Robustly Optimized BERT Pretraining Approach) demonstrated that BERT was significantly undertrained. By making several changes to the training recipe, Liu et al. achieved substantial improvements without altering the model architecture. Key modifications included dynamic masking, removing the NSP objective, training with larger mini-batches (8,000 sequences vs. BERT's 256), training on more data (160GB of text vs. BERT's 16GB), and training for more steps. RoBERTa matched or exceeded the performance of all models published after BERT at the time, including XLNet.
ALBERT (A Lite BERT) addressed the growing parameter count of language models by introducing two parameter-reduction techniques. Factorized embedding parameterization decomposes the large vocabulary embedding matrix into two smaller matrices, decoupling the vocabulary embedding size from the hidden layer size. Cross-layer parameter sharing reuses the same set of parameters across all Transformer layers, preventing parameter growth with network depth. An ALBERT configuration comparable to BERT-Large had 18 times fewer parameters and could be trained roughly 1.7 times faster. ALBERT also replaced NSP with Sentence Order Prediction (SOP), a more challenging task that requires the model to distinguish the correct order of two consecutive text segments from a swapped version.
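The saving from factorized embeddings is easy to see with rough numbers; the sketch below uses a BERT-like vocabulary of 30,000, a hidden size of 1,024, and a hypothetical embedding size of 128 chosen for illustration.

```python
V, H, E = 30_000, 1_024, 128   # vocabulary size, hidden size, embedding size

untied = V * H                 # direct V x H embedding matrix: 30,720,000 parameters
factorized = V * E + E * H     # V x E matrix followed by E x H projection: 3,971,072 parameters

print(untied, factorized)      # roughly an 8x reduction in embedding parameters
```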
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) took a fundamentally different approach to pre-training. Instead of masking tokens and predicting them, ELECTRA uses a small generator network (trained with MLM) to produce plausible replacement tokens, and then trains a larger discriminator network to identify which tokens in the sequence have been replaced. This approach, called Replaced Token Detection (RTD), is inspired by the structure of generative adversarial networks (though the training procedure differs).
The key advantage of ELECTRA is sample efficiency. In standard MLM, the model only receives a training signal from the 15% of tokens that were masked. In ELECTRA's RTD, the discriminator makes a binary prediction (original vs. replaced) for every token in the sequence, meaning 100% of tokens contribute to the loss. This results in roughly 4 to 7 times more efficient use of compute. A small ELECTRA model trained on a single GPU for four days outperformed GPT (which used 30 times more compute) on the GLUE benchmark.
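The discriminator side of replaced token detection can be sketched as a per-token binary classification loss; the snippet below assumes per-token logits from a discriminator head and a 0/1 label marking whether each token was swapped by the generator.

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, replaced_labels):
    """Binary replaced-vs-original loss over every token in the sequence.

    disc_logits:     (batch, seq_len) raw scores from the discriminator head
    replaced_labels: (batch, seq_len) 1.0 where the generator swapped the token,
                     0.0 where the original token was kept
    """
    # Unlike MLM, every position contributes to the loss.
    return F.binary_cross_entropy_with_logits(disc_logits, replaced_labels)
```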
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) introduced two architectural innovations. The disentangled attention mechanism represents each token using two separate vectors: one for content and one for position. Attention weights are then computed using three components: content-to-content, content-to-position, and position-to-content. This separation allows the model to more flexibly capture how the meaning of a word interacts with its position in the sequence.
The Enhanced Mask Decoder adds absolute position information in the final decoding layer before the MLM prediction head. While disentangled attention captures relative positions throughout the network, absolute position can still be important for prediction. DeBERTa (1.5 billion parameters) surpassed the human baseline on the SuperGLUE benchmark for the first time, outperforming the 11-billion-parameter T5 model while being substantially smaller.
Although XLNet does not use the standard [MASK] token, it is closely related to masked language modeling. XLNet introduced Permutation Language Modeling (PLM), which maximizes the expected log-likelihood of a sequence across all possible permutations of the factorization order. This allows the model to capture bidirectional context (since any position may condition on any other position) while maintaining an autoregressive formulation (avoiding the independence assumptions inherent in standard MLM, where masked tokens are predicted independently of each other).
XLNet outperformed BERT on 20 tasks at the time of its release, demonstrating that the pretrain-finetune discrepancy caused by the [MASK] token was a meaningful limitation of standard MLM.
While large-scale decoder-only models (such as the GPT series and LLaMA) have become dominant for generative AI applications, MLM-trained encoder models remain widely used in practice across many settings.
Models pre-trained with MLM objectives continue to excel at tasks where understanding is more important than generation, such as text classification, named entity recognition, sentiment analysis, and extractive question answering.
As of the mid-2020s, models like BERT, RoBERTa, DeBERTa, and their multilingual variants (mBERT, XLM-RoBERTa) remain the workhorses for many production NLP systems where understanding, rather than generation, is the primary goal. These models are often preferred over larger generative models for latency-sensitive or resource-constrained deployments because they are typically smaller and faster at inference time.
The MLM objective has also influenced the design of multimodal models beyond text. Masked Autoencoders (MAE; He et al., 2022) apply the same masking-and-prediction paradigm to image patches in vision transformers, randomly masking 75% of image patches and training the model to reconstruct the missing pixels. Similar approaches have been applied to audio spectrograms, video frames, and protein sequences, demonstrating that the denoising-based self-supervised learning paradigm pioneered by MLM generalizes effectively across modalities.
Researchers have found that continuing MLM pre-training on domain-specific text before fine-tuning improves performance on domain-specific downstream tasks. For example, BioBERT (Lee et al., 2019) continued pre-training BERT on biomedical literature, and SciBERT (Beltagy et al., 2019) did the same for scientific text. This technique, sometimes called domain-adaptive pre-training or further pre-training, has become a standard practice for applying MLM-based models to specialized domains such as law, medicine, and finance.
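Continued MLM pre-training on domain text can be sketched with the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`; the dataset name and training arguments below are simplified placeholders, not a prescribed recipe.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic 15% masking applied on the fly to each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-bert", num_train_epochs=1),
    train_dataset=tokenized_domain_corpus,   # placeholder: a pre-tokenized domain dataset
    data_collator=collator,
)
trainer.train()
```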
Masked language models are a central component of the pre-train and fine-tune paradigm that has dominated NLP since 2018.
Pre-training on a large corpus allows the model to learn general linguistic knowledge: syntax, semantics, factual associations, and discourse structure. This knowledge is encoded in the model's weights and can be transferred to downstream tasks through fine-tuning. Because the pre-training corpus is typically orders of magnitude larger than any task-specific dataset, the model acquires knowledge that would be impossible to learn from the downstream data alone.
Fine-tuning a pre-trained MLM typically involves adding a task-specific output layer on top of the pre-trained encoder and training the entire model end-to-end on the downstream task. Common fine-tuning configurations include:
| Task Type | Output Layer | Example Tasks |
|---|---|---|
| Sequence classification | Linear layer on [CLS] token | Sentiment analysis, topic classification, natural language inference |
| Token classification | Linear layer on each token | Named entity recognition, part-of-speech tagging |
| Span extraction | Start/end position prediction | Extractive question answering (SQuAD) |
| Sentence pair classification | Linear layer on [CLS] token | Paraphrase detection, textual entailment |
Fine-tuning is typically fast and requires relatively little labeled data. A pre-trained BERT model can be fine-tuned on a downstream task in a few hours on a single GPU, achieving performance that rivals or exceeds models trained from scratch on much larger task-specific datasets.
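For example, sequence classification fine-tuning with the Hugging Face `transformers` library typically looks like the sketch below; the dataset variables and hyperparameters are placeholders.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A fresh classification head is added on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=3),
    train_dataset=tokenized_train_dataset,   # placeholder: a tokenized, labeled dataset
    eval_dataset=tokenized_eval_dataset,     # placeholder
)
trainer.train()
```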
Despite their effectiveness for understanding tasks, masked language models have several important limitations.
During pre-training, the model sees input sequences containing [MASK] tokens, but during fine-tuning and inference, no [MASK] tokens are present. This mismatch can cause the model's representations to be slightly different in the two settings. The 80-10-10 strategy mitigates but does not fully eliminate this issue.
Because MLM only computes loss on 15% of tokens per sequence, each training step provides a relatively sparse learning signal. ELECTRA's replaced token detection addressed this by providing a signal for all tokens, achieving comparable performance with significantly less compute. Standard MLM remains less sample-efficient than some alternative objectives.
MLM-based models are fundamentally designed for understanding, not generation. While they can be used for fill-in-the-blank generation or incorporated into encoder-decoder architectures, they cannot match pure autoregressive models for open-ended text generation tasks.
The standard 15% masking rate is a compromise that works reasonably well across settings but may not be optimal for any particular model size, training duration, or downstream task. Adaptive masking strategies, where the masking rate changes during training or is tuned per task, remain an active area of research.
MLM operates at the token level, which means it may not capture higher-level structures such as discourse coherence or document-level themes as effectively as objectives that operate on larger units of text. Span masking (SpanBERT) and sentence-level objectives (SOP in ALBERT) partially address this limitation.
Imagine you are reading a book, and some of the words are covered with stickers. You have to guess what those words are based on the words around them. If the sentence says "The ___ chased the mouse," you can probably guess the missing word is "cat" because cats chase mice.
A masked language model does exactly this, but with a computer. It reads millions of sentences where some words have been hidden, and it tries to guess the hidden words. The more sentences it reads and guesses correctly, the better it gets at understanding how language works. After enough practice, it becomes really good at understanding what words mean and how sentences fit together. Once it has learned enough, people can use it to help with things like figuring out whether a movie review is positive or negative, finding the names of people and places in a news article, or answering questions about a passage of text.