Fill-Mask Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,497 words
See also: Natural Language Processing Models and Tasks
Fill-mask models are language models trained with a masked language modeling (MLM) objective, in which a fraction of the tokens in an input sequence are hidden behind a special [MASK] symbol and the model predicts the original tokens from their surrounding context. The task that gave the model family its name is also a direct application: given "The capital of France is [MASK].", a fill-mask model returns a ranked distribution over vocabulary tokens that could fit the slot. Because predictions depend on both left and right context, fill-mask models produce bidirectional contextual embeddings widely used for text classification, named entity recognition, question answering, retrieval, and sentence similarity.
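The ranked output can be illustrated with a minimal sketch. It assumes a toy four-word vocabulary and made-up scores for the masked slot; the function name `top_k_fill` and the `logits` values are hypothetical and do not come from any library or trained model.

```python
import math

def top_k_fill(logits, k=3):
    """Turn raw scores for one [MASK] position into the k most probable tokens.

    `logits` maps candidate tokens to unnormalized scores (illustrative values).
    """
    # Softmax with max-subtraction for numerical stability.
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Rank candidates from most to least probable.
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Made-up scores for "The capital of France is [MASK]."
logits = {"paris": 9.1, "lyon": 5.3, "london": 4.0, "banana": -2.0}
ranking = top_k_fill(logits, k=2)  # two (token, probability) pairs, best first
```

A real model produces one such score vector over its full vocabulary for every masked position; the ranking step is the same.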
Fill-mask differs from the other dominant transformer pretraining objectives. Autoregressive (left-to-right) language modeling, used by GPT-family decoders, predicts the next token given a prefix and suits open-ended generation. Span corruption, used by T5, replaces each contiguous span with a single sentinel token and trains the decoder to emit the missing spans. Fill-mask operates on individual masked positions inside a fixed window and uses an encoder-only architecture whose final hidden states serve as token representations.
The idea that word meaning can be recovered from context predates deep learning. Static word embedding methods such as word2vec (Mikolov et al., 2013) and GloVe (Pennington, Socher, and Manning, 2014) produced a single vector per word type without conditioning on the surrounding sentence at inference time.
Contextualized representations entered the mainstream with ELMo (Peters et al., 2018), which trained a two-layer bidirectional LSTM language model and combined the token-embedding layer with the hidden states of both LSTM layers as downstream features. ELMo's bidirectionality came from concatenating independently trained left-to-right and right-to-left models rather than a jointly trained network.
The modern fill-mask paradigm was introduced by BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018. Devlin, Chang, Lee, and Toutanova framed pretraining as a cloze task: randomly mask 15 percent of input tokens and train a transformer encoder to predict the missing tokens from left and right context simultaneously. BERT also used a next-sentence prediction (NSP) objective on segment pairs. Pretrained on BooksCorpus and English Wikipedia, BERT-base (110 million parameters) and BERT-large (340 million parameters) set new state-of-the-art results on eleven NLP tasks, including the GLUE tasks and SQuAD.
A wave of refinements followed within twelve months. RoBERTa (Liu et al., July 2019) showed BERT was undertrained: longer training on more data, larger batches, dropping NSP, and dynamic masking (regenerating the mask each epoch) improved every downstream score. ALBERT (Lan et al., September 2019) cut parameter counts by factorizing the embedding matrix and sharing transformer weights across layers. DistilBERT (Sanh et al., October 2019) used knowledge distillation to compress BERT to 66 million parameters, retaining roughly 97 percent of GLUE performance while running 60 percent faster.
SpanBERT (Joshi et al., 2020) masks contiguous spans and adds a span-boundary objective that predicts the masked content from only the boundary representations. Whole-word masking, which keeps subword pieces of the same word together, became standard in subsequent releases.
ELECTRA (Clark, Luong, Le, and Manning, 2020) introduced replaced token detection as a more compute-efficient alternative. A small generator samples plausible replacements for masked positions, and the main discriminator classifies every token as original or replaced. Because the loss is defined over all tokens, ELECTRA reached BERT-equivalent quality with roughly a quarter of the compute.
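The discriminator's training target can be sketched as follows. This is an illustration only: the generator's output is hard-coded rather than sampled, and `rtd_labels` is a hypothetical helper name.

```python
def rtd_labels(original, corrupted):
    """Label each position 1 if the token was replaced, 0 if it is original.

    A position where the generator happened to sample the original token
    counts as original, since only the resulting token is compared.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
# Hypothetical generator output: positions 0 and 2 were masked and resampled.
# Position 0 came back as the original token; position 2 did not.
corrupted = ["the", "chef", "ate", "the", "meal"]
labels = rtd_labels(original, corrupted)  # one binary label per token
```

Every position receives a label, which is why the loss covers the whole sequence rather than only the masked subset.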
DeBERTa (He, Liu, Gao, and Chen, 2020) introduced disentangled attention, computing attention weights from separate content and relative-position vectors, plus an enhanced mask decoder that reinjects absolute positions. DeBERTa-v3 (He, Gao, and Chen, November 2021) combined disentangled attention with ELECTRA-style pretraining and gradient-disentangled embedding sharing.
Multilingual fill-mask was scaled up with XLM-R (Conneau et al., November 2019), trained on more than two terabytes of filtered CommonCrawl across 100 languages. XLM-R beat multilingual BERT by 14.6 points on XNLI and was particularly strong on low-resource languages such as Swahili and Urdu.
Research on encoder-only architectures slowed as decoder LLMs absorbed most attention. Two recent releases revived the area. MosaicBERT (Portes et al., December 2023) combined FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units, padding-token removal, a 30 percent mask rate, and bfloat16 precision into a recipe that pretrains a BERT-base-quality model in about an hour on 8 A100 GPUs for roughly 20 dollars. ModernBERT (Warner et al., December 2024) trained on two trillion tokens of web text, code, and scientific literature using a hybrid local-global attention schedule (every third layer global; intervening layers a 128-token sliding window) with an 8,192-token context. ModernBERT-base has 149 million parameters; ModernBERT-large has 395 million.
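The hybrid local-global schedule described for ModernBERT can be sketched as a predicate over layer and token positions. Several details here are assumptions made for illustration: layers are 0-indexed, layers 0, 3, 6, and so on are treated as the global ones, and the 128-token window is centered on the query token. The function name `can_attend` is hypothetical.

```python
def can_attend(layer, query_pos, key_pos, global_every=3, window=128):
    """Return True if `query_pos` may attend to `key_pos` in the given layer.

    Assumptions for illustration: 0-indexed layers, every `global_every`-th
    layer starting at 0 is global, and the local window is centered on the
    query (window // 2 tokens on each side).
    """
    if layer % global_every == 0:
        return True  # global layer: full bidirectional attention
    return abs(query_pos - key_pos) <= window // 2

# In a local layer, distant tokens are invisible to each other.
far_apart_local = can_attend(layer=1, query_pos=0, key_pos=8000)
# In a global layer, the same pair can interact.
far_apart_global = can_attend(layer=3, query_pos=0, key_pos=8000)
```

Under this scheme most layers cost time proportional to the window size rather than the full sequence length, while the periodic global layers let information travel across the whole 8,192-token context.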
In the canonical BERT recipe, 15 percent of input tokens are selected for prediction. Of the selected tokens, 80 percent are replaced with [MASK], 10 percent are replaced with a uniformly sampled random token, and the remaining 10 percent are kept unchanged. The mixed strategy serves two purposes. First, the [MASK] token never appears during fine-tuning, so always masking would create a distribution shift between pretraining and inference. Second, occasionally substituting random tokens forces the encoder to maintain a contextual distribution over every position rather than copying observed tokens through the residual stream.
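The 80/10/10 procedure can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15 rather than drawing a fixed number of positions, it does not exclude special tokens, and the name `bert_mask` and the toy vocabulary are hypothetical.

```python
import random

def bert_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """Apply BERT-style 80/10/10 corruption.

    Returns (corrupted_tokens, labels), where labels[i] is the original token
    at selected positions and None elsewhere (no loss is computed there).
    """
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: uniformly sampled token
        # remaining 10%: leave the original token in place
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 500
corrupted, labels = bert_mask(tokens, vocab, rng=random.Random(0))
```

Positions with a `None` label pass through untouched and contribute nothing to the loss; positions with a label are predicted whether they show `[MASK]`, a random token, or the original token.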
The loss is the cross-entropy between the predicted distribution at each masked position and the original token. Because each masked position is predicted independently given the corrupted input, MLM optimizes a pseudo-likelihood rather than the true sequence likelihood. This makes MLM weaker than autoregressive training as a direct generator but stronger as a representation learner, since attention flows in both directions.
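A small worked example of the loss follows. The probabilities are made up, and averaging over the selected positions is one common convention assumed here; `mlm_loss` is a hypothetical name.

```python
import math

def mlm_loss(predicted, labels):
    """Mean cross-entropy over the positions selected for prediction.

    predicted[i] maps tokens to probabilities at position i;
    labels[i] is the original token, or None if position i was not selected.
    """
    losses = [-math.log(predicted[i][tok])
              for i, tok in enumerate(labels) if tok is not None]
    if not losses:
        raise ValueError("no masked positions to score")
    return sum(losses) / len(losses)

predicted = [
    {"the": 0.9, "a": 0.1},      # not selected, so ignored by the loss
    {"cat": 0.5, "dog": 0.5},    # selected, true token "cat"
    {"sat": 0.25, "ran": 0.75},  # selected, true token "sat"
]
labels = [None, "cat", "sat"]
loss = mlm_loss(predicted, labels)  # (ln 2 + ln 4) / 2 = 1.5 * ln 2
```

Each selected position contributes its own term, with no term coupling one masked position to another, which is the independence that makes this a pseudo-likelihood.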
Several variants address specific weaknesses. Whole-word masking treats the subword pieces of a word as a single unit, so a word is masked entirely or not at all. Span masking (SpanBERT) masks contiguous spans whose lengths are sampled from a geometric distribution, with a mean of about 3.8 tokens. Dynamic masking (RoBERTa) regenerates the mask pattern each time a sentence is fed to the model. Replaced token detection (ELECTRA, DeBERTa-v3) swaps the vocabulary-wide prediction at masked positions for a binary original-or-replaced classification at every position, so every token contributes training signal. Higher mask rates are a further option: MosaicBERT found that 30 to 40 percent masking can match or exceed the original 15 percent given better optimization.
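Whole-word masking depends on knowing which subword pieces belong together. The sketch below assumes BERT's WordPiece convention, in which continuation pieces begin with "##"; the helper name `whole_word_groups` and the example split are illustrative.

```python
def whole_word_groups(pieces):
    """Group WordPiece indices so that each group covers one whole word.

    Assumes continuation pieces are marked with a leading "##".
    """
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)  # continuation: extend the current word
        else:
            groups.append([i])    # start of a new word
    return groups

pieces = ["the", "phil", "##am", "##mon", "sang"]
groups = whole_word_groups(pieces)  # three groups for three words
```

Masking then selects whole groups rather than individual indices, so the three pieces of the middle word are always hidden together.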
| Model | Released | Organization | Parameters | Pretraining |
|---|---|---|---|---|
| BERT-base | Oct 2018 | Google | 110M | MLM + NSP, 3.3B tokens |
| BERT-large | Oct 2018 | Google | 340M | MLM + NSP, 3.3B tokens |
| RoBERTa-base | Jul 2019 | Facebook AI | 125M | Dynamic MLM, 160GB text |
| RoBERTa-large | Jul 2019 | Facebook AI | 355M | Dynamic MLM, 160GB text |
| ALBERT | Sep 2019 | Google | 12M to 235M | MLM + SOP, shared layers |
| DistilBERT | Oct 2019 | Hugging Face | 66M | Distilled from BERT |
| SciBERT | 2019 | Allen Institute for AI | 110M | MLM, 1.14M scientific papers |
| BioBERT | 2019 | DMIS Lab, Korea Univ. | 110M | MLM, PubMed + PMC |
| XLM-R-base | Nov 2019 | Facebook AI | 270M | MLM, 100 languages, 2.5TB |
| XLM-R-large | Nov 2019 | Facebook AI | 550M | MLM, 100 languages, 2.5TB |
| ELECTRA-base | Mar 2020 | Google | 110M | Replaced token detection |
| SpanBERT | 2020 | Facebook AI | 110M to 340M | Span masking + SBO |
| DeBERTa | Jun 2020 | Microsoft | 134M to 1.5B | MLM + disentangled attention |
| DeBERTa-v3-base | Nov 2021 | Microsoft | 86M | RTD + gradient-disentangled |
| MosaicBERT | Dec 2023 | MosaicML | 137M | MLM, FlashAttention, ALiBi |
| ModernBERT-base | Dec 2024 | Answer.AI, LightOn | 149M | MLM, 2T tokens, 8K context |
| ModernBERT-large | Dec 2024 | Answer.AI, LightOn | 395M | MLM, 2T tokens, 8K context |
The most common use of an MLM-pretrained encoder is fine-tuning for a discriminative task. A task-specific head (linear classifier, span predictor, or token-level tagger) is added on top of the encoder's final hidden states and the stack is trained on labeled data. This recipe powered the state of the art on GLUE, SuperGLUE, SQuAD 1.1, SQuAD 2.0, and the CoNLL-2003 NER benchmark.
Contextual embeddings extracted from MLM models, typically by mean pooling or by taking the [CLS] token's vector, are the workhorse representation for retrieval and similarity. Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT in Siamese and triplet network structures to produce semantically meaningful pooled vectors. Such embeddings underpin most modern dense retrievers, including those used for retrieval-augmented generation, and remain the dominant choice for semantic search and clustering at scale.
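The two pooling strategies can be sketched with plain lists standing in for the encoder's final hidden states. The vectors are toy values, the function names are hypothetical, and the sketch assumes the [CLS] token sits at position 0 and that an attention mask of 0 marks padding.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cls_pool(hidden_states):
    """Use the first position's vector, where BERT places [CLS]."""
    return hidden_states[0]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional hidden states: [CLS], two real tokens, one padding slot.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vec = mean_pool(hidden, mask)  # padding vector is excluded
```

Excluding padded positions matters in batched inference, where short inputs would otherwise be diluted by padding vectors.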
Reranking uses an MLM encoder as a cross-encoder: a query and candidate document are concatenated and fed through the model, which outputs a relevance score. Cross-encoders deliver higher accuracy than dense retrievers at the cost of running once per candidate, so they are typically applied as a second-stage rescorer over a bi-encoder or BM25 index.
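The two-stage arrangement can be sketched with stand-in scoring functions. Both scorers below are toy placeholders, not real models: `overlap` stands in for BM25 or a bi-encoder, and `toy_cross` stands in for a cross-encoder forward pass. All names are hypothetical.

```python
def retrieve_then_rerank(query, docs, cheap_score, cross_score,
                         first_stage_k=100, final_k=10):
    """Two-stage ranking: a cheap scorer shortlists, a cross-encoder rescores.

    `cheap_score(query, doc)` and `cross_score(query, doc)` are callables
    supplied by the caller; higher scores mean more relevant.
    """
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d),
                       reverse=True)[:first_stage_k]
    # The expensive model runs only on the shortlist, once per candidate.
    return sorted(shortlist, key=lambda d: cross_score(query, d),
                  reverse=True)[:final_k]

docs = [
    "paris is the capital of france",
    "france borders spain",
    "bananas are yellow",
]

def overlap(query, doc):
    """Toy first-stage scorer: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

cross_calls = []

def toy_cross(query, doc):
    """Toy second-stage scorer that records which documents it saw."""
    cross_calls.append(doc)
    return 1.0 if "capital" in doc else 0.0

top = retrieve_then_rerank("capital of france", docs, overlap, toy_cross,
                           first_stage_k=2, final_k=1)
```

The expensive scorer never sees the third document, which is the point of the design: its per-candidate cost is paid only on the shortlist.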
Direct fill-mask inference is used for cloze-style probing of model knowledge (the LAMA probe of Petroni et al., 2019), word-sense induction, text infilling for editing assistants, and grammatical error correction. The pseudo-likelihood from MLM can also score sentence acceptability.
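Pseudo-likelihood scoring can be sketched as a loop that masks one position at a time. The `predict_masked` callable is a hypothetical stand-in for a fill-mask model's forward pass, and the uniform predictor exists only to make the example self-contained.

```python
import math

def pseudo_log_likelihood(tokens, predict_masked):
    """Sum log P(token_i | all other tokens), masking one position at a time.

    `predict_masked(masked_tokens, i)` must return a dict of token
    probabilities for position i.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(predict_masked(masked, i)[tok])
    return total

def uniform_predictor(masked_tokens, i):
    """Toy model: equal probability for each word in a 4-word vocabulary."""
    vocab = ["the", "cat", "sat", "down"]
    return {tok: 1.0 / len(vocab) for tok in vocab}

score = pseudo_log_likelihood(["the", "cat", "sat"], uniform_predictor)
```

Scoring a sentence this way takes one forward pass per token, and with a real model a higher score indicates a sentence the model finds more plausible.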
Fill-mask models are less effective as direct generators. Generating left to right requires masking successive positions and decoding iteratively, which is slower than an autoregressive decoder and prone to bland output. Encoder-decoder models such as BART and T5 are usually preferred when the goal is generation from corrupted input.
| Benchmark | Year | Tasks | Typical use |
|---|---|---|---|
| GLUE | 2018 | 9 NLU tasks | General language understanding |
| SuperGLUE | 2019 | 8 harder NLU tasks | Successor to GLUE after saturation |
| SQuAD 1.1 | 2016 | Extractive QA | Span prediction on Wikipedia |
| SQuAD 2.0 | 2018 | QA with unanswerables | Adds abstention |
| CoNLL-2003 | 2003 | NER (English, German) | Token classification |
| MS MARCO | 2016 | Passage ranking | Retrieval and reranking |
| XNLI | 2018 | NLI in 15 languages | Cross-lingual transfer |
| BEIR | 2021 | 18 retrieval tasks | Zero-shot retrieval evaluation |
The rise of decoder-only LLMs after GPT-3 shifted research away from encoder pretraining, and for several years the dominant pretrained encoders remained RoBERTa and DeBERTa variants dating from 2019 to 2021. Despite this, encoder-only fill-mask models still account for the majority of production NLP outside open-ended generation. Embedding APIs that power vector database retrieval, content moderation classifiers, and on-device NER typically run BERT-class encoders because of latency, memory, and cost constraints. ModernBERT (December 2024) showed the encoder branch still has room to grow: with an 8,192-token context, code-and-text training, and architectural improvements, it delivers retrieval and classification quality competitive with much larger decoder models while running several times faster on the same hardware.
Encoder-decoder models such as T5 use a related but distinct objective called span corruption, in which contiguous spans are replaced by single sentinel tokens and the decoder is trained to emit the missing spans separated by the same sentinels. Span corruption shares MLM's bidirectional encoder context but produces a generative model; it is not strictly a fill-mask objective.
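The input and target format of span corruption can be sketched as follows. The span positions are chosen by hand rather than sampled, the sentinel strings follow T5's `<extra_id_N>` naming, and `span_corrupt` is a hypothetical helper.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    `spans` is a sorted list of non-overlapping (start, end) index pairs,
    with end exclusive.
    """
    inputs, targets, cursor = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs += tokens[cursor:start] + [sentinel]  # span collapses to one sentinel
        targets += [sentinel] + tokens[start:end]    # decoder emits the hidden span
        cursor = end
    inputs += tokens[cursor:]
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
```

Unlike fill-mask, a two-token span costs only one input position here, and the output is a generated sequence rather than a per-position classification.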
The [MASK] symbol used during pretraining never appears during fine-tuning or inference, creating a train-test mismatch that the 80/10/10 strategy only partially mitigates. ELECTRA and DeBERTa-v3 address this through replaced token detection, so the encoder always sees a coherent input sequence.
Fill-mask models are poor open-ended generators. Their pseudo-likelihood does not factor in dependencies between masked positions, so iterative decoding produces repetitive or generic text. Long-form generation is better served by autoregressive decoders or encoder-decoder models with span corruption.
The original BERT context length of 512 tokens, inherited by most descendants, limits applicability to long documents. Longformer, BigBird, and ModernBERT extend the window to 4,096 or 8,192 tokens with sparse or hybrid attention, but long-context encoding remains harder for MLM models than for modern decoder LLMs.
On reasoning-heavy benchmarks, encoder-only MLM models lag decoder LLMs trained at comparable scale. The gap reflects both the smaller scale at which encoders have been trained and the asymmetry between bidirectional representation learning and the chain-of-thought outputs decoder LLMs produce by default. For multi-step reasoning, generation, or instruction following, decoder LLMs remain more capable; fill-mask encoders dominate where compact, high-quality representations are the actual deliverable.