See also: Machine learning terms, Language model, BERT
A masked language model (MLM) is a type of language model trained to predict missing or hidden tokens within a sequence of text. Unlike autoregressive models such as GPT, which generate text from left to right by predicting the next token, masked language models learn from both the left and right context simultaneously. This bidirectional training approach allows them to build rich contextual representations of language, making them especially effective for natural language understanding tasks.
The masked language modeling objective is a form of self-supervised learning, since the training signal comes from the text itself rather than from human-provided labels. It rose to prominence with the release of BERT (Bidirectional Encoder Representations from Transformers) in 2018 by researchers at Google. Since then, the technique has become one of the foundational pre-training strategies in natural language processing (NLP), powering a wide range of models and applications. MLM-based models are routinely used in text classification, named entity recognition, sentiment analysis, question answering, and many other tasks where understanding the meaning of text is more important than generating it.
The conceptual roots of masked language modeling extend back to 1953, when psycholinguist Wilson L. Taylor introduced the "cloze procedure." Inspired by the Gestalt principle of closure, Taylor designed a test in which certain words were removed from a passage of text and readers were asked to fill in the blanks. The procedure was originally developed as a tool for measuring the readability of written materials, but it quickly found broader applications in education and cognitive science. The word "cloze" itself is derived from "closure," reflecting the idea that readers mentally close the gaps in a text by drawing on surrounding context.
Masked language modeling is, in effect, a computational version of the cloze task. Instead of human readers filling in blanks, a neural network learns to predict the missing tokens. Researchers frequently cite Taylor's 1953 work as the intellectual precursor to modern MLM objectives.
Before masked language models, the NLP community relied on static word embeddings like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These methods assigned a single fixed vector to each word, regardless of the context in which it appeared. The word "bank" would receive the same representation whether it referred to a financial institution or the edge of a river.
ELMo (Embeddings from Language Models), introduced by Peters et al. in 2018, took a step forward by generating context-dependent representations using bidirectional LSTMs. However, ELMo's bidirectionality was shallow: it concatenated the outputs of a forward and a backward language model rather than truly jointly conditioning on both directions at every layer.
BERT solved this limitation by introducing the masked language modeling objective, which allowed a Transformer encoder to attend to context on both sides of every token at every layer. This represented a fundamental shift in how language models learned representations and set the stage for a new generation of NLP systems.
The seminal paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" was published in October 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI Language. It was presented at NAACL 2019 and rapidly became one of the most cited papers in the history of NLP. BERT achieved state-of-the-art results on eleven NLP benchmarks at the time of publication, including pushing the GLUE benchmark score to 80.5% (a 7.7 percentage point improvement over the previous best).
The masked language modeling procedure can be broken down into several steps:
1. The input text is tokenized into subword units (for example, WordPiece tokens).
2. A fraction of the tokens (15% in BERT) is randomly selected for prediction.
3. The selected tokens are corrupted, most often by replacing them with a special [MASK] token (see the 80-10-10 strategy below).
4. The corrupted sequence is fed through the model, which outputs a probability distribution over the vocabulary at each selected position.
5. The cross-entropy loss between the predictions and the original tokens is computed only at the selected positions.

This procedure is repeated over millions or billions of training examples drawn from large text corpora, allowing the model to learn general-purpose language representations.
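The trained objective can be queried directly as a fill-in-the-blank task. The sketch below assumes the Hugging Face `transformers` library and the publicly available `bert-base-uncased` checkpoint; it is an illustration of the idea rather than part of any training pipeline.

```python
# Query a pre-trained masked language model to fill in a blank.
# Assumes the Hugging Face `transformers` library and the bert-base-uncased checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model returns a ranked distribution over the vocabulary at the [MASK] position.
for prediction in unmasker("The [MASK] sat on the mat."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```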
One of the key design decisions in BERT's masked language modeling is what to do with the 15% of tokens selected for prediction. Simply replacing all of them with [MASK] would create a mismatch between pre-training and fine-tuning, since the [MASK] token never appears in real downstream tasks. To mitigate this discrepancy, Devlin et al. introduced the 80-10-10 strategy:
| Percentage | Action | Example (original word: "cat") | Purpose |
|---|---|---|---|
| 80% | Replace with [MASK] | "The [MASK] sat on the mat" | Teaches the model to predict from context |
| 10% | Replace with a random token | "The dog sat on the mat" | Forces the model to maintain robust representations for all tokens, not just [MASK] |
| 10% | Keep the original token unchanged | "The cat sat on the mat" | Biases the model's representation toward the actual observed token |
The 10% random replacement prevents the model from learning a shortcut where it only needs to produce good representations at [MASK] positions. Because any token might have been randomly swapped, the model must maintain accurate contextual representations everywhere. The 10% unchanged case further encourages the model to keep its representation faithful to the observed input, which is useful during fine-tuning when no tokens are masked.
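A minimal sketch of how this selection could be implemented is shown below. The function and its arguments (`mask_token_id`, `vocab_size`) are illustrative rather than drawn from any particular library's implementation.

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Apply BERT-style 80-10-10 corruption to a list of token ids.

    Returns the corrupted ids and a list of labels in which unselected
    positions are marked with -100 so they are ignored by the loss.
    """
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)             # -100 = ignored by the loss

    for i, token_id in enumerate(token_ids):
        if random.random() >= mlm_probability:   # ~85%: not selected, no loss
            continue
        labels[i] = token_id                     # the original token is the target here
        roll = random.random()
        if roll < 0.8:                           # 80% of selected: replace with [MASK]
            corrupted[i] = mask_token_id
        elif roll < 0.9:                         # 10% of selected: replace with a random token
            corrupted[i] = random.randrange(vocab_size)
        # remaining 10% of selected: keep the original token unchanged
    return corrupted, labels
```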
Research has shown that this combined strategy helps bridge the gap between pre-training (where [MASK] tokens exist) and fine-tuning (where they do not), and performance is not especially sensitive to the exact percentages. The overall masking rate itself has also been revisited, as discussed below.
The choice of masking 15% of tokens was established in the original BERT paper and has become the default for most subsequent MLM-based models. The rationale is a tradeoff: masking too few tokens makes training inefficient because the model receives a learning signal from only a small fraction of each sequence, while masking too many tokens removes so much context that prediction becomes unreliable.
Some later work has explored different masking rates. The paper "Should You Mask 15% in Masked Language Modeling?" by Wettig et al. (2022) systematically varied the masking rate and found that the optimal rate depends on factors such as model size and training duration. For models trained for fewer steps, higher masking rates (around 40%) can yield better performance because each step provides more learning signal. For models trained until convergence, the differences narrow.
The loss function for masked language modeling is the cross-entropy loss computed exclusively over the masked positions. Formally, given an input sequence of tokens x = (x_1, x_2, ..., x_n) and a set of masked positions M, the input is modified to produce a corrupted sequence x̃, where tokens at positions in M are replaced according to the 80-10-10 strategy. The model f_θ (parameterized by θ) produces a probability distribution over the vocabulary for each masked position.
The MLM training objective minimizes the negative log-likelihood of the true tokens at the masked positions:
L_MLM(θ) = - (1/|M|) Σ_{i ∈ M} log P_θ(x_i | x̃)
where P_θ(x_i | x̃) is the probability assigned by the model to the true token x_i given the corrupted input x̃. The sum is taken only over the masked positions M, not over all positions in the sequence.
At each masked position, the transformer's hidden state is passed through a linear layer followed by a softmax over the vocabulary. The resulting probability distribution is compared against the one-hot ground-truth label for the original token. Tokens that were not masked receive a special ignore label (typically -100 in implementations like Hugging Face Transformers), so they do not contribute to the gradient.
This selective loss computation is important because the model sees the full (partially corrupted) input but only receives a learning signal from the masked positions, encouraging it to build representations that encode information about the entire sequence.
During training, gradients of this loss are computed with respect to the model parameters θ and used to update the model via gradient descent (typically using the Adam optimizer with a learning rate warm-up schedule).
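In a PyTorch-style implementation, this selective loss reduces to a cross-entropy call with an ignore index. The sketch below assumes model logits of shape (batch, sequence, vocabulary) and labels in which every non-masked position has been set to -100, as described above.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy computed over masked positions only.

    logits: (batch, seq_len, vocab_size) scores from the MLM prediction head
    labels: (batch, seq_len) original token ids at masked positions, -100 elsewhere
    """
    vocab_size = logits.size(-1)
    # ignore_index=-100 drops all non-masked positions from the average,
    # so only the masked positions contribute to the gradient.
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
```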
Masked language models are almost universally built on the Transformer encoder architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The encoder processes the entire input sequence simultaneously, with each token attending to every other token through multi-head self-attention. This is in contrast to the Transformer decoder, which uses causal masking to prevent tokens from attending to future positions.
BERT was released in two sizes:
| Configuration | Layers | Hidden Size | Attention Heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
Both configurations use the standard Transformer encoder stack. Each layer consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network, with layer normalization and residual connections. The feed-forward intermediate size is 3,072 for BERT-Base and 4,096 for BERT-Large.
BERT's input representation sums three types of embeddings: token embeddings (from a WordPiece vocabulary of 30,522 tokens), segment embeddings (to distinguish between sentence A and sentence B in sentence-pair tasks), and positional embeddings (to encode token positions within the sequence).
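A simplified sketch of this input representation is shown below, using BERT-Base sizes; it is illustrative and omits details such as dropout.

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sums token, segment, and position embeddings (BERT-Base sizes)."""

    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)
        self.position = nn.Embedding(max_position, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids):
        # Position ids simply count 0, 1, 2, ... along the sequence.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        summed = (self.token(token_ids)
                  + self.segment(segment_ids)
                  + self.position(positions))
        return self.norm(summed)
```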
BERT used two pre-training objectives simultaneously: masked language modeling, as described above, and Next Sentence Prediction (NSP), a binary classification task in which the model predicts whether two text segments appeared consecutively in the original corpus or were paired at random.
Later research by Liu et al. (2019) in the RoBERTa paper demonstrated that NSP was not beneficial and could even hurt performance. Subsequent models largely dropped or replaced it with alternative inter-sentence objectives.
BERT was pre-trained on the BooksCorpus (800 million words) and English Wikipedia (2,500 million words). Training BERT-Base on 4 Cloud TPUs (16 TPU chips) required 4 days; training BERT-Large on 16 Cloud TPUs (64 TPU chips) also took 4 days.
Masked language modeling and causal language modeling represent two fundamentally different approaches to language model pre-training. Understanding their differences is important for choosing the right model for a given task.
| Feature | Masked Language Model (e.g., BERT) | Causal Language Model (e.g., GPT) |
|---|---|---|
| Context direction | Bidirectional (attends to both left and right) | Unidirectional (attends only to the left) |
| Training objective | Predict masked tokens | Predict the next token |
| Architecture | Transformer encoder | Transformer decoder |
| Attention pattern | Full self-attention (no causal mask) | Causal (triangular) attention mask |
| Generation capability | Not designed for sequential generation | Naturally generates text token by token |
| Strengths | Natural language understanding, classification, extraction | Text generation, dialogue, creative writing |
| Pre-train/fine-tune mismatch | [MASK] token absent at fine-tuning time | No mismatch; same left-to-right prediction |
| Token independence assumption | Masked tokens predicted independently of each other | Each token conditioned on all previous tokens |
| Representative models | BERT, RoBERTa, ALBERT, ELECTRA | GPT series, LLaMA, PaLM |
The core advantage of MLM is bidirectional context. When predicting a masked token, the model can use information from both the preceding and following tokens. This is particularly valuable for understanding tasks where the meaning of a word depends on its full sentence context. In an autoregressive model, each token can only attend to tokens that came before it, which means the model has an inherently incomplete view of the context at any given position.
The main weakness of MLM is that it does not naturally support text generation. Because masked language models are trained to fill in blanks within existing text rather than produce text sequentially, they cannot easily generate coherent multi-sentence outputs. Autoregressive models, by contrast, are directly trained to generate text one token at a time and can produce fluent, long-form outputs. This is the primary reason why the most successful generative models (GPT-2, GPT-3, GPT-4, and other large language models) have used autoregressive objectives.
Autoregressive models have demonstrated more consistent scaling behavior. As model size and training data increase, autoregressive models show predictable improvements in performance (as documented by the scaling laws of Kaplan et al., 2020, and Hoffmann et al., 2022). MLM-based models have also benefited from scaling, but the largest and most capable language models (with hundreds of billions or trillions of parameters) have overwhelmingly been autoregressive. This is partly because text generation is a more commercially valuable capability and partly because the autoregressive objective is simpler and more computationally straightforward to scale.
A well-known limitation of standard MLM is the conditional independence assumption: masked tokens are predicted independently of each other. If tokens at positions 3 and 7 are both masked, the model predicts each one without considering what it predicts for the other. This can be problematic when the masked tokens are semantically related. XLNet's permutation language modeling was specifically designed to address this limitation, and later work on non-autoregressive generation has explored similar ideas.
Since the original BERT paper, researchers have proposed several alternative masking strategies that improve upon the basic random token masking approach. These variants generally aim to create harder or more linguistically meaningful prediction tasks.
BERT used static masking: the training corpus was preprocessed once, with mask patterns applied and saved before training began. Each training example always had the same set of tokens masked, even across different epochs. BERT did create up to 10 copies of the data with different masks, but within each copy, the pattern was fixed.
RoBERTa (Liu et al., 2019) introduced dynamic masking, where the masking pattern is generated on-the-fly each time a sequence is fed to the model during training. This means the model sees a different masking pattern for the same input in every epoch. Dynamic masking slightly improved performance across benchmarks and eliminated the need to preprocess and store multiple masked copies of the training data.
In standard BERT, tokenization often splits words into subword units. For example, the word "playing" might be tokenized as ["play", "##ing"]. With standard random masking, only one of these subword tokens might be selected, making the prediction trivially easy since the other subword provides a strong hint.
Whole Word Masking (WWM) addresses this by ensuring that when any subword token of a word is selected for masking, all subword tokens belonging to that word are masked together. Google released whole-word-masked versions of BERT that showed improved performance, particularly on reading comprehension tasks like SQuAD. This strategy was also adopted for Chinese BERT (Cui et al., 2019), where character-level masking of multi-character words posed similar issues.
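One way to implement the grouping step is sketched below, assuming WordPiece continuation tokens marked with a "##" prefix; the helper is illustrative rather than the exact logic of any released implementation.

```python
def whole_word_spans(tokens):
    """Group WordPiece tokens into whole-word index spans.

    Continuation pieces (prefixed with '##') are attached to the preceding
    word, so masking decisions can be made per word rather than per piece.
    """
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])
    return spans

# Example: ["play", "##ing", "in", "the", "park"] -> [[0, 1], [2], [3], [4]]
# A whole-word masking routine then selects entire spans instead of single indices.
```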
SpanBERT (Joshi et al., 2020) extended the masking strategy further by masking contiguous spans of tokens rather than individual tokens or whole words. Span lengths are sampled from a geometric distribution (biased toward shorter spans, with a mean of 3.8 tokens and a maximum length of 10 tokens), and the starting point is always aligned to a word boundary.
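A sketch of the span-length sampling described above, using a truncated geometric distribution (lengths above the cap are resampled); with p = 0.2 and a cap of 10, the mean comes out close to the reported 3.8 tokens.

```python
import numpy as np

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length from a truncated geometric distribution.

    Lengths above max_len are rejected and resampled; with p = 0.2 and
    max_len = 10 the mean span length is roughly 3.8 tokens.
    """
    while True:
        length = np.random.geometric(p)   # P(length = k) = (1 - p)^(k - 1) * p
        if length <= max_len:
            return length
```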
In addition to the standard MLM objective, SpanBERT introduced the Span Boundary Objective (SBO), which trains the model to predict the tokens within a masked span using only the representations at the span's boundaries (the tokens immediately before and after the span). This encourages the model to encode span-level information in its boundary representations.
SpanBERT also dropped the NSP objective entirely and trained on single contiguous segments of text rather than sentence pairs. These changes, combined with span masking, led to consistent improvements on tasks requiring span-level reasoning, such as extractive question answering and coreference resolution.
ERNIE (Sun et al., 2019) proposed masking named entities and phrases as whole units rather than random tokens. The motivation is that masking only part of an entity (e.g., masking "Potter" in "Harry Potter") allows the model to predict the missing token through word collocation patterns within the entity, without needing to understand the broader semantic context. For instance, predicting "Potter" after "Harry" is trivial, but predicting the entire entity "Harry Potter" in a sentence about J. K. Rowling requires the model to reason about real-world knowledge. By masking entire entities and phrases, ERNIE forces the model to learn deeper semantic relationships from the surrounding context. This approach proved especially effective for knowledge-intensive tasks and Chinese NLP benchmarks.
MacBERT (Cui et al., 2020) took a different approach to the replacement strategy. Instead of replacing masked tokens with [MASK] or random tokens, MacBERT replaces them with similar words found via a synonym dictionary or word embedding similarity. This transforms the MLM task into a correction task: the model must identify and correct tokens that have been replaced with plausible but incorrect alternatives. This strategy further reduces the pre-train/fine-tune discrepancy, since neither [MASK] tokens nor obviously random words appear in the input.
| Variant | Masking Unit | Key Innovation | Introduced By |
|---|---|---|---|
| Standard MLM | Random subword tokens | 80/10/10 replacement rule | BERT (Devlin et al., 2019) |
| Whole Word Masking | Whole words | Masks all subword tokens of a word together | Google (2019) |
| Dynamic Masking | Random subword tokens | New mask pattern each training step | RoBERTa (Liu et al., 2019) |
| Span Masking | Contiguous spans | Geometric span length + boundary objective | SpanBERT (Joshi et al., 2020) |
| Entity/Phrase Masking | Named entities and phrases | Knowledge-aware masking units | ERNIE (Sun et al., 2019) |
| MLM as Correction | Similar words | Replaces masks with similar (not random) tokens | MacBERT (Cui et al., 2020) |
Masked language modeling can be viewed as a specific instance of a broader class of denoising autoencoders applied to text. In a denoising autoencoder, the input is corrupted in some way and the model learns to reconstruct the original input. MLM corrupts text by replacing tokens with [MASK] (or random tokens) and trains the model to recover the originals at those positions.
This denoising perspective has been extended in several influential models that generalize MLM to encoder-decoder architectures:
T5 (Raffel et al., 2020) uses a span corruption objective where contiguous spans of tokens are replaced with single sentinel tokens (e.g., <extra_id_0>), and an encoder-decoder transformer must generate the missing spans as output. Like BERT, T5 uses a 15% corruption rate, but the key difference is architectural: rather than predicting masked tokens "in place" through a classification head on top of the encoder, T5 generates the missing content autoregressively through its decoder. This design allows T5 to handle both understanding and generation tasks within a single unified text-to-text framework. The span corruption objective proved particularly effective because predicting multiple consecutive tokens at once requires richer contextual understanding than single-token prediction.
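The example below illustrates what span corruption looks like on a single sentence; the sentinel names follow T5's `<extra_id_N>` convention, and the particular spans chosen here are illustrative (in practice they are sampled randomly).

```python
# Illustrative T5-style span corruption on a single sentence.
original = "The quick brown fox jumps over the lazy dog"

# Encoder input: each corrupted span is collapsed into a single sentinel token.
encoder_input = "The quick <extra_id_0> fox jumps <extra_id_1> the lazy dog"

# Decoder target: each sentinel followed by the text it replaced,
# terminated by a final sentinel.
decoder_target = "<extra_id_0> brown <extra_id_1> over <extra_id_2>"
```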
BART (Lewis et al., 2020) applies multiple types of noise to the input, including token masking, token deletion, text infilling (replacing spans with a single [MASK] token), sentence permutation, and document rotation. The model is trained as a denoising autoencoder with a bidirectional encoder and an autoregressive decoder that reconstructs the original text. Text infilling is especially notable because the model must determine how many tokens are missing at each [MASK] position, making it a harder task than standard MLM. BART's best-performing configuration masks approximately 30% of tokens via text infilling and permutes all sentences. This combination of noise functions makes BART effective for both understanding and generation tasks, particularly summarization.
UL2 (Tay et al., 2022) proposed a unified framework that mixes different denoising objectives during pre-training. It combines three types of denoising tasks: the R-denoiser (regular span corruption similar to T5), the S-denoiser (sequential denoising, in which the model sees a prefix and must generate the continuation, as in prefix language modeling), and the X-denoiser (extreme span corruption with high masking rates and long spans). By mixing these objectives, UL2 bridges the gap between MLM-style objectives (good for understanding) and autoregressive objectives (good for generation). PaLM 2, one of the largest publicly disclosed models, reportedly adopted a similar mixture of pre-training objectives.
These denoising approaches generalize the core insight of MLM: that learning to reconstruct corrupted text produces powerful language representations.
The masked language modeling objective has been adopted and adapted by a wide range of models since BERT's introduction. The following table summarizes the most influential MLM-based models and their key innovations.
| Model | Year | Authors / Organization | Key Innovation | Pre-training Objective |
|---|---|---|---|---|
| BERT | 2018 | Devlin et al. / Google | Introduced MLM + NSP for bidirectional pre-training | MLM + NSP |
| RoBERTa | 2019 | Liu et al. / Facebook AI | Dynamic masking, removed NSP, longer training with more data | MLM only |
| ALBERT | 2019 | Lan et al. / Google | Cross-layer parameter sharing, factorized embeddings, SOP replacing NSP | MLM + SOP |
| SpanBERT | 2020 | Joshi et al. / Facebook AI | Span masking + Span Boundary Objective | Span MLM + SBO |
| ELECTRA | 2020 | Clark et al. / Google / Stanford | Replaced token detection instead of masked token prediction | RTD (generator + discriminator) |
| DeBERTa | 2020 | He et al. / Microsoft | Disentangled attention + Enhanced Mask Decoder | MLM |
| XLNet | 2019 | Yang et al. / Google / CMU | Permutation language modeling to capture bidirectional context without masking | PLM |
RoBERTa (A Robustly Optimized BERT Pretraining Approach) demonstrated that BERT was significantly undertrained. By making several changes to the training recipe, Liu et al. achieved substantial improvements without altering the model architecture. Key modifications included dynamic masking, removing the NSP objective, training with larger mini-batches (8,000 sequences vs. BERT's 256), training on more data (160GB of text vs. BERT's 16GB), and training for more steps. RoBERTa matched or exceeded the performance of all models published after BERT at the time, including XLNet.
ALBERT (A Lite BERT) addressed the growing parameter count of language models by introducing two parameter-reduction techniques. Factorized embedding parameterization decomposes the large vocabulary embedding matrix into two smaller matrices, decoupling the vocabulary embedding size from the hidden layer size. Cross-layer parameter sharing reuses the same set of parameters across all Transformer layers, preventing parameter growth with network depth. An ALBERT configuration comparable to BERT-Large had 18 times fewer parameters and could be trained roughly 1.7 times faster. ALBERT also replaced NSP with Sentence Order Prediction (SOP), a more challenging task that requires the model to distinguish the correct order of two consecutive text segments from a swapped version.
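The saving from factorized embeddings is easy to see with rough numbers; the sketch below uses a BERT-like vocabulary of 30,000, a hidden size of 1,024, and a hypothetical embedding size of 128 chosen for illustration.

```python
V, H, E = 30_000, 1_024, 128   # vocabulary size, hidden size, embedding size

untied = V * H                 # direct V x H embedding matrix: 30,720,000 parameters
factorized = V * E + E * H     # V x E matrix followed by E x H projection: 3,971,072 parameters

print(untied, factorized)      # roughly an 8x reduction in embedding parameters
```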
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) took a fundamentally different approach to pre-training. Instead of masking tokens and predicting them, ELECTRA uses a small generator network (trained with MLM) to produce plausible replacement tokens, and then trains a larger discriminator network to identify which tokens in the sequence have been replaced. This approach, called Replaced Token Detection (RTD), is inspired by the structure of generative adversarial networks (though the training procedure differs).
The key advantage of ELECTRA is sample efficiency. In standard MLM, the model only receives a training signal from the 15% of tokens that were masked. In ELECTRA's RTD, the discriminator makes a binary prediction (original vs. replaced) for every token in the sequence, meaning 100% of tokens contribute to the loss. This results in roughly 4 to 7 times more efficient use of compute. A small ELECTRA model trained on a single GPU for four days outperformed GPT (which used 30 times more compute) on the GLUE benchmark.
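The discriminator side of replaced token detection can be sketched as a per-token binary classification loss; the snippet below assumes per-token logits from a discriminator head and a 0/1 label marking whether each token was swapped by the generator.

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, replaced_labels):
    """Binary replaced-vs-original loss over every token in the sequence.

    disc_logits:     (batch, seq_len) raw scores from the discriminator head
    replaced_labels: (batch, seq_len) 1.0 where the generator swapped the token,
                     0.0 where the original token was kept
    """
    # Unlike MLM, every position contributes to the loss.
    return F.binary_cross_entropy_with_logits(disc_logits, replaced_labels)
```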
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) introduced two architectural innovations. The disentangled attention mechanism represents each token using two separate vectors: one for content and one for position. Attention weights are then computed using three components: content-to-content, content-to-position, and position-to-content. This separation allows the model to more flexibly capture how the meaning of a word interacts with its position in the sequence.
The Enhanced Mask Decoder adds absolute position information in the final decoding layer before the MLM prediction head. While disentangled attention captures relative positions throughout the network, absolute position can still be important for prediction. DeBERTa (1.5 billion parameters) surpassed the human baseline on the SuperGLUE benchmark for the first time, outperforming the 11-billion-parameter T5 model while being substantially smaller.
Although XLNet does not use the standard [MASK] token, it is closely related to masked language modeling. XLNet introduced Permutation Language Modeling (PLM), which maximizes the expected log-likelihood of a sequence across all possible permutations of the factorization order. This allows the model to capture bidirectional context (since any position may condition on any other position) while maintaining an autoregressive formulation (avoiding the independence assumptions inherent in standard MLM, where masked tokens are predicted independently of each other).
XLNet outperformed BERT on 20 tasks at the time of its release, demonstrating that the pretrain-finetune discrepancy caused by the [MASK] token was a meaningful limitation of standard MLM.
While large-scale decoder-only models (such as the GPT series and LLaMA) have become dominant for generative AI applications, MLM-trained encoder models remain widely used in practice across many settings.
Models pre-trained with MLM objectives continue to excel at tasks where understanding is more important than generation, such as text classification, named entity recognition, sentiment analysis, and extractive question answering.
As of the mid-2020s, models like BERT, RoBERTa, DeBERTa, and their multilingual variants (mBERT, XLM-RoBERTa) remain the workhorses for many production NLP systems where understanding, rather than generation, is the primary goal. These models are often preferred over larger generative models for latency-sensitive or resource-constrained deployments because they are typically smaller and faster at inference time.
The MLM objective has also influenced the design of multimodal models beyond text. Masked Autoencoders (MAE; He et al., 2022) apply the same masking-and-prediction paradigm to image patches in vision transformers, randomly masking 75% of image patches and training the model to reconstruct the missing pixels. Similar approaches have been applied to audio spectrograms, video frames, and protein sequences, demonstrating that the denoising-based self-supervised learning paradigm pioneered by MLM generalizes effectively across modalities.
Researchers have found that continuing MLM pre-training on domain-specific text before fine-tuning improves performance on domain-specific downstream tasks. For example, BioBERT (Lee et al., 2019) continued pre-training BERT on biomedical literature, and SciBERT (Beltagy et al., 2019) did the same for scientific text. This technique, sometimes called domain-adaptive pre-training or further pre-training, has become a standard practice for applying MLM-based models to specialized domains such as law, medicine, and finance.
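Continued MLM pre-training on domain text can be sketched with the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`; the dataset name and training arguments below are simplified placeholders, not a prescribed recipe.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic 15% masking applied on the fly to each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-bert", num_train_epochs=1),
    train_dataset=tokenized_domain_corpus,   # placeholder: a pre-tokenized domain dataset
    data_collator=collator,
)
trainer.train()
```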
Masked language models are a central component of the pre-train and fine-tune paradigm that has dominated NLP since 2018.
Pre-training on a large corpus allows the model to learn general linguistic knowledge: syntax, semantics, factual associations, and discourse structure. This knowledge is encoded in the model's weights and can be transferred to downstream tasks through fine-tuning. Because the pre-training corpus is typically orders of magnitude larger than any task-specific dataset, the model acquires knowledge that would be impossible to learn from the downstream data alone.
Fine-tuning a pre-trained MLM typically involves adding a task-specific output layer on top of the pre-trained encoder and training the entire model end-to-end on the downstream task. Common fine-tuning configurations include:
| Task Type | Output Layer | Example Tasks |
|---|---|---|
| Sequence classification | Linear layer on [CLS] token | Sentiment analysis, topic classification, natural language inference |
| Token classification | Linear layer on each token | Named entity recognition, part-of-speech tagging |
| Span extraction | Start/end position prediction | Extractive question answering (SQuAD) |
| Sentence pair classification | Linear layer on [CLS] token | Paraphrase detection, textual entailment |
Fine-tuning is typically fast and requires relatively little labeled data. A pre-trained BERT model can be fine-tuned on a downstream task in a few hours on a single GPU, achieving performance that rivals or exceeds models trained from scratch on much larger task-specific datasets.
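For example, sequence classification fine-tuning with the Hugging Face `transformers` library typically looks like the sketch below; the dataset variables and hyperparameters are placeholders.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A fresh classification head is added on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=3),
    train_dataset=tokenized_train_dataset,   # placeholder: a tokenized, labeled dataset
    eval_dataset=tokenized_eval_dataset,     # placeholder
)
trainer.train()
```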
Despite their effectiveness for understanding tasks, masked language models have several important limitations.
During pre-training, the model sees input sequences containing [MASK] tokens, but during fine-tuning and inference, no [MASK] tokens are present. This mismatch can cause the model's representations to be slightly different in the two settings. The 80-10-10 strategy mitigates but does not fully eliminate this issue.
Because MLM only computes loss on 15% of tokens per sequence, each training step provides a relatively sparse learning signal. ELECTRA's replaced token detection addressed this by providing a signal for all tokens, achieving comparable performance with significantly less compute. Standard MLM remains less sample-efficient than some alternative objectives.
MLM-based models are fundamentally designed for understanding, not generation. While they can be used for fill-in-the-blank generation or incorporated into encoder-decoder architectures, they cannot match pure autoregressive models for open-ended text generation tasks.
The standard 15% masking rate is a compromise that works reasonably well across settings but may not be optimal for any particular model size, training duration, or downstream task. Adaptive masking strategies, where the masking rate changes during training or is tuned per task, remain an active area of research.
MLM operates at the token level, which means it may not capture higher-level structures such as discourse coherence or document-level themes as effectively as objectives that operate on larger units of text. Span masking (SpanBERT) and sentence-level objectives (SOP in ALBERT) partially address this limitation.
Imagine you are reading a book, and some of the words are covered with stickers. You have to guess what those words are based on the words around them. If the sentence says "The ___ chased the mouse," you can probably guess the missing word is "cat" because cats chase mice.
A masked language model does exactly this, but with a computer. It reads millions of sentences where some words have been hidden, and it tries to guess the hidden words. The more sentences it reads and guesses correctly, the better it gets at understanding how language works. After enough practice, it becomes really good at understanding what words mean and how sentences fit together. Once it has learned enough, people can use it to help with things like figuring out whether a movie review is positive or negative, finding the names of people and places in a news article, or answering questions about a passage of text.