Fill-Mask Models
Last reviewed
May 31, 2026
Sources
26 citations
Review status
Source-backed
Revision
v3 · 5,298 words
See also: Natural Language Processing Models and Tasks
Fill-mask models are language models trained with a masked language modeling (MLM) objective, in which a fraction of the tokens in an input sequence are hidden behind a special [MASK] symbol and the model predicts the original tokens from their surrounding context. The task that gave the model family its name is also a direct application: given "The capital of France is [MASK].", a fill-mask model returns a ranked distribution over vocabulary tokens that could fit the slot. Because predictions depend on both left and right context, fill-mask models produce bidirectional contextual embeddings widely used for text classification, named entity recognition, question answering, retrieval, and sentence similarity.
Fill-mask differs from the other dominant transformer pretraining objectives. Autoregressive (left-to-right) language modeling, used by GPT-family decoders, predicts the next token given a prefix and suits open-ended generation. Span corruption, used by T5, replaces contiguous spans with a single sentinel token and trains the decoder to emit the missing span. Fill-mask operates on individual masked positions inside a fixed window and uses an encoder-only architecture whose final hidden states serve as token representations.
| Fill-mask models | |
|---|---|
| Type | Pre-training objective and inference task |
| Core mechanism | Predict masked tokens from bidirectional context |
| Architecture | Encoder-only transformer |
| Introduced | October 2018 (BERT, Devlin et al.) |
| Key objective | Masked language modeling (MLM) |
| Alternative objectives | Replaced token detection (ELECTRA), span corruption (T5) |
| Common downstream uses | Classification, NER, retrieval, reranking, sentence similarity |
| Key models | BERT, RoBERTa, ALBERT, DeBERTa, ModernBERT |
| Dominant benchmarks | GLUE, SuperGLUE, SQuAD, CoNLL-2003 |
The idea that word meaning can be recovered from context predates deep learning. The fill-in-the-blank format traces back to the cloze test, introduced by Wilson Taylor in 1953 as a reading-comprehension measure in which words are periodically deleted from a passage and the reader must supply the missing terms. That framing anticipated MLM by more than six decades.
Static word embedding methods such as word2vec (Mikolov et al., 2013) and GloVe (Pennington, Socher, and Manning, 2014) produced a single vector per word type without conditioning on the surrounding sentence at inference time. Word2vec's continuous bag-of-words variant predicts a center word from its neighbors, which is structurally similar to fill-mask but operates at the level of a small fixed window rather than a full sequence, and does not use a transformer.
Contextualized representations entered the mainstream with ELMo (Peters et al., 2018), which trained a three-layer bidirectional LSTM language model and combined hidden states from each layer as downstream features. ELMo's bidirectionality came from concatenating independently trained left-to-right and right-to-left models rather than a jointly trained network. The ELMo representations were used as frozen features added on top of task-specific architectures rather than fine-tuned end-to-end.
The modern fill-mask paradigm was introduced by BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018. Devlin, Chang, Lee, and Toutanova framed pretraining as a cloze task: randomly mask 15 percent of input tokens and train a transformer encoder to predict them from the left and right context jointly. BERT also used a next-sentence prediction (NSP) objective on segment pairs, where the model classified whether a second sentence genuinely followed the first in the original corpus. Pretrained on BooksCorpus and English Wikipedia (about 3.3 billion words), BERT-base (110 million parameters) and BERT-large (340 million parameters) raised the state of the art on eleven NLP tasks, including the GLUE and SQuAD benchmarks.
A key design decision in BERT was the addition of two special tokens: [CLS] at the start of every input and [SEP] between segments. The [CLS] token's final hidden state, aggregated through self-attention over the whole sequence, was used as the pooled sequence-level representation for classification tasks. The bidirectional attention stack, in which every token attends to every other token in the same forward pass, is what distinguishes BERT from GPT-style causal models and gives the representations their contextual richness.
A wave of refinements followed within twelve months. RoBERTa (Liu et al., July 2019) showed BERT was undertrained: longer training on more data (160 GB of text versus BERT's 16 GB), larger batches, dropping NSP, and dynamic masking (regenerating the mask pattern each epoch rather than generating it once during preprocessing) improved every downstream score without any architecture change. RoBERTa made the case that the MLM objective itself was sound and BERT's limitations were largely about data and compute.
ALBERT (Lan et al., September 2019) cut parameter counts by factorizing the embedding matrix into two smaller matrices and sharing transformer layer weights across all layers. A sentence-order prediction (SOP) objective replaced NSP, which ablation studies had found to add little signal. An ALBERT configuration comparable to BERT-large has roughly 18 times fewer parameters, and the largest ALBERT variant (235 million parameters) exceeded BERT-large on GLUE while still using fewer parameters.
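The saving from factorizing the embedding matrix can be checked with simple arithmetic. The sketch below assumes a vocabulary of 30,000 tokens, a hidden size of 768, and an intermediate embedding size of 128; these values and the function name `embedding_params` are illustrative assumptions rather than figures taken from this article.

```python
def embedding_params(vocab_size, hidden_size, embed_size=None):
    """Count embedding parameters, with or without a factorized projection."""
    if embed_size is None:
        # Single matrix: vocabulary x hidden size.
        return vocab_size * hidden_size
    # Two matrices: vocabulary x embed size, then embed size x hidden size.
    return vocab_size * embed_size + embed_size * hidden_size


full = embedding_params(30_000, 768)           # 23,040,000
factored = embedding_params(30_000, 768, 128)  # 3,938,304
print(full, factored)
```

Under these assumed sizes the factorized embedding needs about one sixth of the parameters of the single matrix; the larger share of ALBERT's overall reduction comes from sharing layer weights.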
DistilBERT (Sanh et al., October 2019) used knowledge distillation to compress BERT to 66 million parameters, retaining roughly 97 percent of GLUE performance while running 60 percent faster. The distillation loss combined the original MLM cross-entropy with a soft-target distribution over the teacher's vocabulary logits and a cosine embedding loss on the hidden representations.
SpanBERT (Joshi et al., 2020) masks contiguous spans sampled from a geometric distribution (mean length approximately 3.8 tokens) rather than independent positions. It also adds a span-boundary objective (SBO): the model must predict each masked token using only the boundary token representations on either side of the span, without attending to any of the masked positions. SpanBERT dropped NSP and trained on single-sentence segments. The resulting model was substantially better on span-selection tasks such as coreference resolution and extractive QA because the pretraining objective directly required recovering contiguous spans from context. Whole-word masking, which extends masking to all subword pieces of the same word rather than selecting pieces independently, became standard in subsequent releases by Google and others.
ELECTRA (Clark, Luong, Le, and Manning, 2020) introduced replaced token detection as a more compute-efficient alternative to MLM. Rather than predicting the original tokens at masked positions, ELECTRA trains two networks jointly. A small generator (trained with MLM) samples plausible replacements for masked positions. The main discriminator then classifies every token in the resulting sequence as original or replaced by the generator. Because the loss is defined over all tokens rather than only the masked 15 percent, ELECTRA extracts more signal per input example and reached BERT-equivalent quality on GLUE with roughly one quarter of the training compute. ELECTRA-large performed comparably to RoBERTa-large while using less than a quarter of its pretraining compute, and outperformed it when given the same compute. The tradeoff is that replaced token detection is a binary classification problem per token rather than a full vocabulary softmax, which makes ELECTRA weaker as a direct fill-mask model but stronger as a downstream representation.
DeBERTa (He, Liu, Gao, and Chen, 2020) introduced disentangled attention, computing attention scores from separate content and relative-position vectors rather than a single combined representation. An enhanced mask decoder reinjects absolute position information just before the final softmax. This combination allowed DeBERTa-xxlarge (1.5 billion parameters) to surpass human performance on SuperGLUE at the time of its release. DeBERTa-v3 (He, Gao, and Chen, November 2021) combined disentangled attention with ELECTRA-style pretraining and gradient-disentangled embedding sharing, yielding the strongest encoder family on the SuperGLUE leaderboard for most of 2022 and 2023.
Multilingual fill-mask was scaled up with XLM-R (Conneau et al., November 2019), trained on more than two terabytes of filtered CommonCrawl across 100 languages. XLM-R beat multilingual BERT by 14.6 points on XNLI and was particularly strong on low-resource languages such as Swahili and Urdu. The multilingual MLM objective is identical to the monolingual version; the model implicitly aligns representations across languages by training on interleaved multilingual text.
Domain-specific variants followed quickly. SciBERT (Beltagy, Lo, and Cohan, 2019) pretrained on 1.14 million scientific papers (biomedical and computer science) using BERT's original recipe with a domain vocabulary. BioBERT (Lee et al., 2019) continued pretraining from BERT-base on PubMed abstracts and PubMed Central full texts. Both substantially outperformed general-domain BERT on biomedical NER, relation extraction, and QA. FinBERT, LegalBERT, ClinicalBERT, CodeBERT, and dozens of other domain encoders followed the same template.
Research on encoder-only architectures slowed as decoder large language models absorbed most attention after GPT-3 (2020). For several years the dominant pretrained encoders remained 2019-vintage RoBERTa and DeBERTa variants.
Two recent releases revived the encoder branch. MosaicBERT (Portes et al., December 2023) combined FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units, padding-token removal, a 30 percent mask rate, and bfloat16 precision into a recipe that pretrains a BERT-base-quality model in about an hour on 8 A100 GPUs for roughly 20 dollars. ModernBERT (Warner et al., December 2024) trained on two trillion tokens of web text, code, and scientific literature using a hybrid local-global attention schedule (every third layer attends globally; intervening layers use a 128-token sliding window) with an 8,192-token context. ModernBERT-base has 149 million parameters; ModernBERT-large has 395 million. ModernBERT demonstrated that a fresh encoder trained on modern data and with modern architectural techniques could compete with or surpass much larger decoder models on retrieval and classification tasks while maintaining the latency and memory advantages of the encoder-only design.
Masked language modeling was introduced as the core pretraining objective for BERT and has since been used, with variations, by every major encoder-only transformer. Given an input sequence of tokens $x = (x_1, x_2, \ldots, x_n)$, a random subset $M \subset \{1, \ldots, n\}$ of positions is selected as the "masked" set. The model receives the corrupted sequence $\tilde{x}$, in which positions in $M$ have been replaced by the special [MASK] token (or by another token, depending on the masking strategy), and is trained to predict the original tokens at those positions. The objective is to minimize the sum of cross-entropy losses over the masked positions:
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P(x_i \mid \tilde{x})$$
Because the model predicts each masked position from the full surrounding context, including tokens to the left and right, it learns bidirectional representations. The key difference from an autoregressive language model (which predicts $x_i$ from $x_{<i}$ only) is that MLM conditions on both past and future context, making the representations richer for tasks that need global understanding of a fixed input sequence.
The loss sums over masked positions only, ignoring the contribution of unmasked tokens. This means MLM optimizes a pseudo-likelihood rather than the true sequence likelihood: each masked token is scored given the corrupted context, and dependencies among the masked positions themselves are ignored. The independence assumption (positions in $M$ are predicted as if they were independent given context) is what makes bidirectional attention possible during training without needing to autoregress. It is also what makes MLM-pretrained models poor direct generators: they cannot naturally condition one generated token on previously generated tokens in the same masked set.
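As a minimal sketch of the masked-position loss, the snippet below sums the negative log-probability of the original token at each masked index and skips every other position. The probability table, token ids, and the function name `mlm_loss` are made-up illustrations; in a real model the probabilities would come from a softmax over the encoder's output logits.

```python
import math


def mlm_loss(probs_per_position, original_ids, masked_positions):
    """Sum of -log P(x_i | corrupted input) over the masked positions only."""
    loss = 0.0
    for i in masked_positions:
        p = probs_per_position[i][original_ids[i]]
        loss -= math.log(p)
    return loss


# Toy example: 4 positions, vocabulary of 3 token ids.
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
    [0.60, 0.30, 0.10],
]
original = [0, 1, 2, 0]
masked = [1, 2]  # only these positions contribute to the loss

loss = mlm_loss(probs, original, masked)  # -log(0.8) - log(0.5)
print(round(loss, 4))
```

Positions 0 and 3 have predicted distributions too, but they never enter the sum, which is the sense in which the loss ignores unmasked tokens.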
In BERT's implementation, 15 percent of subword tokens are selected for prediction. Of the selected tokens:
- 80 percent are replaced with the [MASK] special token.
- 10 percent are replaced with a random token from the vocabulary.
- 10 percent are left unchanged.
The mixed strategy serves two purposes. First, the [MASK] token never appears during fine-tuning or inference, so always masking every selected position would create a train-test mismatch. Substituting random tokens and keeping some positions unchanged during pretraining prevents the model from simply ignoring unmasked positions; it must maintain a calibrated contextual representation at every position because any token could in principle be the one requiring correction. Second, the occasional random-token substitution forces the model to recognize that the observed token at a given position may not be the true token, discouraging trivial copying of observed tokens through the residual stream.
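The corruption rule can be sketched in a few lines: select a position with probability 0.15, then replace it with [MASK] 80 percent of the time, with a random vocabulary token 10 percent of the time, and leave it unchanged otherwise. This is a simplified illustration, not BERT's actual preprocessing code: it selects each position independently instead of fixing the number of predictions per sequence, and the function name `bert_mask` is hypothetical.

```python
import random

MASK = "[MASK]"


def bert_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where no prediction is made."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = tok  # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


tokens = "the capital of france is paris".split()
corrupted, labels = bert_mask(tokens, vocab=tokens, rng=random.Random(0))
print(corrupted, labels)
```

Because the label is recorded before the replacement is chosen, the model is asked to recover the original token in all three cases, including the one where the input was left untouched.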
Research after BERT explored several modifications to the basic masking recipe:
Whole-word masking (WWM) treats all subword pieces of a word as a unit. If "unbelievable" is tokenized as un, ##believe, ##able and the word is selected for masking, all three pieces are masked together. Masking at the word level rather than the subword level creates harder prediction targets and generally improves downstream performance on tasks that require word-level understanding. Whole-word masking became standard in Google's later BERT releases and in most subsequent encoder models.
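A minimal sketch of whole-word masking over WordPiece output, assuming the usual convention that a piece beginning with `##` continues the previous word. The helper names `group_wordpieces` and `whole_word_mask` are illustrative, not part of any library.

```python
def group_wordpieces(tokens):
    """Group token indices into words; a '##' piece joins the preceding word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, selected_words, mask_token="[MASK]"):
    """Mask every piece of each selected word (selected_words holds word indices)."""
    groups = group_wordpieces(tokens)
    out = list(tokens)
    for w in selected_words:
        for i in groups[w]:
            out[i] = mask_token
    return out


tokens = ["an", "un", "##believe", "##able", "story"]
masked = whole_word_mask(tokens, selected_words=[1])
print(masked)  # ['an', '[MASK]', '[MASK]', '[MASK]', 'story']
```

Selection happens over word indices rather than token indices, so a word is either fully visible or fully hidden.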
Span masking (used in SpanBERT and T5's encoder) extends whole-word masking to contiguous multi-word spans. Span lengths are sampled from a geometric distribution with a mean of around 3.8 tokens, giving a mixture of short and longer contiguous gaps. Span masking targets the same structures that reading comprehension and coreference tasks require: recovering a named entity, a verb phrase, or a prepositional object from surrounding context. SpanBERT additionally trains a span-boundary objective, described above.
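The span-length distribution can be made concrete with a short sketch. It assumes a geometric distribution with success probability 0.2, truncated at a maximum length of 10 and renormalized; these two parameters are assumptions drawn from the SpanBERT paper's setup rather than from this article, and the sampling loop is a simplified illustration with hypothetical function names.

```python
import random


def span_length_distribution(p=0.2, max_len=10):
    """Geometric probabilities over lengths 1..max_len, renormalized after truncation."""
    weights = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_spans(n_tokens, budget, rng, p=0.2, max_len=10):
    """Pick non-overlapping (start, end) spans, end exclusive, covering `budget` tokens."""
    probs = span_length_distribution(p, max_len)
    lengths = list(range(1, max_len + 1))
    covered, spans, attempts = set(), [], 0
    while len(covered) < budget and attempts < 1000:
        attempts += 1
        length = rng.choices(lengths, weights=probs)[0]
        length = min(length, budget - len(covered))  # do not exceed the budget
        start = rng.randrange(0, n_tokens - length + 1)
        span = range(start, start + length)
        if covered.isdisjoint(span):
            covered.update(span)
            spans.append((start, start + length))
    return sorted(spans)


mean_length = sum(k * q for k, q in zip(range(1, 11), span_length_distribution()))
print(round(mean_length, 2))  # 3.8
print(sample_spans(n_tokens=100, budget=15, rng=random.Random(0)))
```

With these assumed parameters the truncated distribution has a mean of about 3.8, matching the figure quoted in the text.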
Dynamic masking (used in RoBERTa) regenerates the mask pattern each time a training example is visited rather than assigning a fixed mask at data preprocessing time. In BERT's original setup, a fixed mask was generated during preprocessing and the same mask was used every epoch. Dynamic masking means the model sees the same text with different masked positions across epochs, effectively multiplying the number of distinct training signals and reducing overfitting.
Entity and phrase masking (used in ERNIE by Sun et al., 2019, from Baidu) extends whole-word masking to named entities and syntactic phrases such as noun phrases and named locations. The model must recover "Barack Obama" as a unit rather than recovering "Obama" independently of "Barack," forcing it to capture entity-level semantics.
Higher mask rates: The original BERT rate of 15 percent was chosen heuristically. MosaicBERT (Portes et al., 2023) found that 30 percent masking, combined with better optimization and architecture choices, matches or exceeds the quality of 15 percent masking while reducing the number of training steps needed for convergence.
Replaced token detection (used in ELECTRA and DeBERTa-v3) replaces the vocabulary softmax cross-entropy with a per-token binary classification: is this token original or was it replaced by a generator model? Because the loss is computed over every token rather than the 15 percent masked subset, the training signal is about 6.7 times denser per sequence. ELECTRA-base trained on the same data and compute budget as BERT-base substantially outperforms it on the GLUE benchmark.
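A small sketch of the discriminator's targets and of the density arithmetic. The example sentence and the function name `rtd_labels` are illustrative; the generator's replacement is hard-coded here rather than sampled from a model.

```python
def rtd_labels(original, corrupted):
    """1 where the token differs from the original (replaced), else 0 (original)."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator replaced "cooked"
labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]

# Every position carries a label, versus 15% of positions under MLM.
mask_rate = 0.15
density_ratio = 1 / mask_rate
print(round(density_ratio, 1))  # 6.7
```

If the generator happens to sample the original token, the comparison yields 0, so that position counts as original rather than replaced.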
The table below summarizes how MLM compares to the main alternative pretraining objectives used in large language model research:
| Objective | Model family | Context direction | Loss defined over | Primary use |
|---|---|---|---|---|
| Masked language modeling | BERT, RoBERTa, ALBERT, DeBERTa | Bidirectional (full sequence) | Masked positions only | Encoder representations |
| Replaced token detection | ELECTRA, DeBERTa-v3 | Bidirectional (full sequence) | All positions (binary) | Efficient encoder training |
| Autoregressive LM | GPT, LLaMA, Mistral | Left-to-right (causal) | All positions | Generation, instruction following |
| Span corruption / text infilling | T5, BART | Bidirectional encoder, causal decoder | Decoder positions | Conditional generation |
| Next sentence prediction | BERT (auxiliary) | Segment-level | Binary label | Removed in RoBERTa and ALBERT |
| Sentence order prediction | ALBERT | Segment-level | Binary label | Coherence and discourse |
| Model | Released | Organization | Parameters | Pretraining |
|---|---|---|---|---|
| BERT-base | Oct 2018 | Google | 110M | MLM + NSP, 3.3B words |
| BERT-large | Oct 2018 | Google | 340M | MLM + NSP, 3.3B words |
| RoBERTa-base | Jul 2019 | Facebook AI | 125M | Dynamic MLM, 160GB text |
| RoBERTa-large | Jul 2019 | Facebook AI | 355M | Dynamic MLM, 160GB text |
| ALBERT | Sep 2019 | Google | 12M to 235M | MLM + SOP, shared layers |
| DistilBERT | Oct 2019 | Hugging Face | 66M | Distilled from BERT |
| SciBERT | 2019 | Allen Institute for AI | 110M | MLM, 1.14M scientific papers |
| BioBERT | 2019 | DMIS Lab, Korea Univ. | 110M | MLM, PubMed + PMC |
| XLM-R-base | Nov 2019 | Facebook AI | 270M | MLM, 100 languages, 2.5TB |
| XLM-R-large | Nov 2019 | Facebook AI | 550M | MLM, 100 languages, 2.5TB |
| ELECTRA-base | Mar 2020 | Google | 110M | Replaced token detection |
| SpanBERT | 2020 | Facebook AI | 110M to 340M | Span masking + SBO |
| DeBERTa | Jun 2020 | Microsoft | 134M to 1.5B | MLM + disentangled attention |
| DeBERTa-v3-base | Nov 2021 | Microsoft | 86M | RTD + gradient-disentangled |
| MosaicBERT | Dec 2023 | MosaicML | 137M | MLM, FlashAttention, ALiBi |
| ModernBERT-base | Dec 2024 | Answer.AI, LightOn | 149M | MLM, 2T tokens, 8K context |
| ModernBERT-large | Dec 2024 | Answer.AI, LightOn | 395M | MLM, 2T tokens, 8K context |
"Fill-mask" in the sense of a runnable inference task refers to passing a sentence with one or more [MASK] tokens to a pretrained encoder and retrieving the model's probability distribution over vocabulary tokens for each masked position. The output is a ranked list of candidate completions with associated probabilities, for example:
- "The capital of France is [MASK]." returns: Paris (0.98), Berlin (0.004), Rome (0.002), ...
- "... [MASK] for hypertension." returns: medication (0.31), drug (0.24), treatment (0.18), ...
This inference mode is exposed as a named pipeline in the Hugging Face Transformers library (pipeline("fill-mask")) and is often the first hands-on demonstration of what an encoder model does.
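What a fill-mask call does at a single masked position can be sketched without a model: apply a softmax to the vocabulary logits and keep the highest-probability tokens. The four-word vocabulary, the logit values, and the function name `top_k_fill` are made up for illustration; in practice the logits come from the encoder's MLM head over the full vocabulary.

```python
import math


def top_k_fill(logits, vocab, k=3):
    """Softmax over the logits at one masked position, then the k most likely tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


vocab = ["Paris", "Berlin", "Rome", "banana"]
logits = [9.0, 3.5, 2.8, -4.0]  # made-up scores for illustration
candidates = top_k_fill(logits, vocab)
for token, prob in candidates:
    print(f"{token}: {prob:.4f}")
```

Each masked position gets its own independent ranking, which is why inputs with several [MASK] tokens return one candidate list per position.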
Direct fill-mask inference is useful in its own right beyond probing. Text infilling tools for writing assistants can suggest contextually appropriate completions for partially written sentences. Grammar correction systems can score how well candidate replacements for a potentially wrong word fit the surrounding text. Word-sense induction uses the fill-mask distribution to identify which reading of an ambiguous context fits the slot. Masked language model scoring, where a sentence is scored by masking each token in turn and summing the log-probabilities of the original tokens, gives a sentence-acceptability measure that correlates reasonably well with human grammaticality judgments.
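Masked language model scoring can be sketched as a loop that masks one token at a time. The scorer below takes a `masked_prob` callback standing in for the model; `toy_masked_prob` and its small table of plausible word pairs are hypothetical and look only at the left neighbor, whereas a real MLM would condition on both sides.

```python
import math


def pseudo_log_likelihood(tokens, masked_prob, mask_token="[MASK]"):
    """Mask each token in turn and sum log P(original token | rest of the sentence)."""
    total = 0.0
    for i, original in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        total += math.log(masked_prob(masked, i, original))
    return total


# Hypothetical stand-in for a model: high probability for known word pairs.
PLAUSIBLE = {("<s>", "the"), ("the", "cat"), ("cat", "sleeps")}


def toy_masked_prob(masked_tokens, position, candidate):
    left = masked_tokens[position - 1] if position > 0 else "<s>"
    return 0.9 if (left, candidate) in PLAUSIBLE else 0.1


good = pseudo_log_likelihood(["the", "cat", "sleeps"], toy_masked_prob)
bad = pseudo_log_likelihood(["cat", "the", "sleeps"], toy_masked_prob)
print(good > bad)  # True
```

Scoring a sentence of n tokens this way costs n forward passes, which is the main practical drawback compared with a single pass through an autoregressive model.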
Fill-mask inference became a tool for probing factual knowledge stored in language model parameters. The LAMA (LAnguage Model Analysis) probe, introduced by Petroni et al. (2019), converts relational facts from knowledge graphs into fill-mask cloze queries. For example, the triple (Dante Alighieri, born-in, Florence) becomes the query: Dante Alighieri was born in [MASK]. The model's top prediction is compared to the ground-truth answer to assess how much world knowledge is implicitly encoded in the MLM weights.
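The conversion from a knowledge-graph triple to a cloze query, and the top-1 comparison, can be sketched as follows. The template string, the `[X]` and `[Y]` placeholders, the second fact, and the stub predictor are illustrative assumptions, not the LAMA codebase.

```python
TEMPLATES = {"born-in": "[X] was born in [Y]."}


def to_cloze(subject, relation, mask_token="[MASK]"):
    """Fill the subject into the relation template and mask the object slot."""
    return TEMPLATES[relation].replace("[X]", subject).replace("[Y]", mask_token)


def precision_at_1(predict_top1, facts):
    """Fraction of facts whose top prediction equals the ground-truth object."""
    hits = sum(predict_top1(to_cloze(s, r)) == o for s, r, o in facts)
    return hits / len(facts)


query = to_cloze("Dante Alighieri", "born-in")
print(query)  # Dante Alighieri was born in [MASK].

facts = [
    ("Dante Alighieri", "born-in", "Florence"),
    ("Marie Curie", "born-in", "Warsaw"),
]
# Stub predictor that always answers "Florence", standing in for a real model.
score = precision_at_1(lambda q: "Florence", facts)
print(score)  # 0.5
```

Swapping the entry in `TEMPLATES` for a paraphrase is all it takes to probe how much the score depends on wording.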
LAMA revealed several interesting patterns: BERT-large stored a surprising amount of factual knowledge about birth places, occupations, and capital cities, often outperforming smaller retrieval-based systems on certain relation types; but the results were highly sensitive to the exact wording of the cloze template (a phenomenon called template sensitivity or surface form competition). Later work showed that ensembling over multiple paraphrased templates improved LAMA scores substantially, suggesting the knowledge is present in the model but can be difficult to surface reliably with a single prompt.
The LAMA probe is one of the intellectual ancestors of modern retrieval-augmented generation techniques. It raised the question of whether LLMs could serve as "soft knowledge bases" and sparked a research line on how to elicit, update, and verify factual knowledge in pretrained models. For encoder-only fill-mask models the answer proved partial: MLM-pretrained encoders hold knowledge well for frequently attested facts but are unreliable for long-tail entities and temporal facts that changed after the training cutoff.
The most common use of an MLM-pretrained encoder is fine-tuning for a discriminative task. A task-specific head (linear classifier, span predictor, or token-level tagger) is added on top of the encoder's final hidden states and the stack is trained on labeled data. This recipe, sometimes called the "pretrain then fine-tune" paradigm, powered the state of the art on GLUE, SuperGLUE, SQuAD 1.1, SQuAD 2.0, and the CoNLL-2003 NER benchmark throughout 2019 and 2020. The paradigm had a profound effect on NLP research: practitioners could fine-tune BERT on a few thousand labeled examples and match or exceed systems that required hundreds of thousands of examples when trained from scratch.
Contextual embeddings extracted from MLM models, typically by mean pooling or using the [CLS] token, are the workhorse representation for retrieval and similarity. Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT with a Siamese architecture and triplet loss to produce semantically meaningful pooled vectors suitable for cosine similarity comparison. Such embeddings underpin most modern dense retrievers, including those used for retrieval-augmented generation, and remain the dominant choice for semantic search and clustering at scale. A bi-encoder encodes queries and documents independently and retrieves candidates by approximate nearest neighbor search over precomputed document embeddings, making it fast enough for internet-scale retrieval.
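Mean pooling and cosine-similarity retrieval can be sketched with plain lists. The two-dimensional vectors and document names are made up; real encoders emit hundreds of dimensions per token, and large collections use approximate nearest neighbor search instead of the exhaustive comparison shown here.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is padding
query = mean_pool(query_tokens, [1, 1, 0])            # [0.5, 0.5]

# Document embeddings are computed once, independently of any query.
documents = {"doc_a": [1.0, 1.0], "doc_b": [1.0, -1.0]}
best = max(documents, key=lambda name: cosine(query, documents[name]))
print(best)  # doc_a
```

The attention mask matters: without it, the padding row would dominate the average and change the ranking.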
Reranking uses an MLM encoder as a cross-encoder: a query and candidate document are concatenated, separated by [SEP], and fed through the model as a single sequence. The [CLS] token's final hidden state is passed through a linear layer to produce a relevance score. Cross-encoders deliver higher accuracy than bi-encoder dense retrievers because they can model query-document interactions directly, but at the cost of running the full encoder once per candidate. They are typically applied as a second-stage rescorer over a bi-encoder or BM25 first-stage retrieval index. The MonoBERT and MonoT5 rerankers are standard cross-encoder implementations that dominated the MS MARCO passage ranking leaderboard for several years.
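The second-stage rescoring step can be sketched as building one joint input per candidate and sorting by score. `toy_score` is a hypothetical word-overlap stand-in for the encoder plus linear layer, and the literal `[CLS]` and `[SEP]` strings mimic what a tokenizer would normally insert.

```python
def cross_encoder_input(query, document):
    """Join query and document into a single sequence for the encoder."""
    return f"[CLS] {query} [SEP] {document} [SEP]"


def toy_score(pair_text):
    """Hypothetical relevance score: number of query words found in the document."""
    query_part, doc_part = pair_text.split(" [SEP]")[:2]
    query_words = set(query_part.split()[1:])  # drop the leading [CLS]
    return len(query_words & set(doc_part.split()))


def rerank(query, candidates, score_pair, top_n=2):
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(cross_encoder_input(query, d)), d) for d in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [d for _, d in scored[:top_n]]


first_stage = [
    "bananas are yellow",
    "berlin is the capital of germany",
    "paris is the capital of france",
]
reranked = rerank("capital of france", first_stage, toy_score)
print(reranked)
```

The loop runs the scorer once per candidate, which is why cross-encoders are applied to a short first-stage list and not to the whole collection.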
Direct fill-mask inference is used for cloze-style probing of model knowledge (the LAMA probe described above), word-sense induction, text infilling for editing assistants, and grammatical error correction. The pseudo-likelihood from MLM can also score sentence acceptability: mask each token in turn, sum the log-probabilities of the original tokens, and interpret the total as a log-likelihood proxy for grammaticality or fluency.
Fill-mask models are less effective as direct open-ended generators. The core problem is that MLM treats masked positions as independent given context, so generating a sentence token by token requires iterative masking and decoding. Masked diffusion language models (e.g., MDLM, and earlier work such as BERT-based text generation) have explored iterative refinement: start with a fully masked sequence, predict all positions in parallel, re-mask uncertain positions, and iterate. These approaches narrow the generation quality gap but remain slower and less natural than autoregressive decoding. Encoder-decoder models such as BART and T5 are usually preferred when the goal is generation from corrupted or partially specified input.
| Benchmark | Year | Tasks | Typical use |
|---|---|---|---|
| GLUE | 2018 | 9 NLU tasks | General language understanding |
| SuperGLUE | 2019 | 8 harder NLU tasks | Successor to GLUE after saturation |
| SQuAD 1.1 | 2016 | Extractive QA | Span prediction on Wikipedia |
| SQuAD 2.0 | 2018 | QA with unanswerables | Adds abstention |
| CoNLL-2003 | 2003 | NER (English, German) | Token classification |
| MS MARCO | 2016 | Passage ranking | Retrieval and reranking |
| XNLI | 2018 | NLI in 15 languages | Cross-lingual transfer |
| BEIR | 2021 | 18 retrieval tasks | Zero-shot retrieval evaluation |
The rise of decoder-only LLMs after GPT-3 shifted research away from encoder pretraining, and for several years the dominant pretrained encoders remained 2019-vintage RoBERTa and DeBERTa variants. Despite this, encoder-only fill-mask models still account for the majority of production NLP outside open-ended generation. Embedding APIs that power vector database retrieval, content moderation classifiers, and on-device NER typically run BERT-class encoders because of latency, memory, and cost constraints. ModernBERT (December 2024) showed the encoder branch still has room to grow: with an 8,192-token context, code-and-text training, and architectural improvements, it delivers retrieval and classification quality competitive with much larger decoder models while running several times faster on the same hardware.
Encoder-decoder models such as T5 use a related but distinct objective called span corruption, in which contiguous spans are replaced by single sentinel tokens and the decoder is trained to emit the missing spans separated by the same sentinels. Span corruption shares MLM's bidirectional encoder context but produces a generative model; it is not strictly a fill-mask objective.
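A minimal sketch of how span corruption builds the encoder input and decoder target. The sentinel names follow the `<extra_id_N>` convention of the Hugging Face T5 tokenizer, which is an assumption not stated in this article; the spans are chosen by hand rather than sampled, and the function name `span_corrupt` is illustrative.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) pairs with exclusive end."""
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel replaces the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the decoder emits the dropped span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Unlike MLM, the target is a single sequence produced left to right, so the tokens within and across spans are generated conditionally on one another.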
The [MASK] symbol used during pretraining never appears during fine-tuning or inference, creating a train-test mismatch. A fine-tuned NER model never sees [MASK] tokens; it sees real tokens throughout. The 80/10/10 strategy only partially mitigates this: 80 percent of selected positions still use the artificial [MASK] token during pretraining, so the model is exposed to a token that will never appear in production. ELECTRA and DeBERTa-v3 address this more cleanly through replaced token detection, where the discriminator always sees a coherent sequence of real tokens (some replaced by the generator, some original), eliminating the [MASK] artifact entirely.
Fill-mask models are poor open-ended generators. The pseudo-likelihood objective treats masked positions as independent given the observed context, so the model has no mechanism to ensure that multiple simultaneously generated tokens are consistent with each other. Iterative mask-fill generation approaches (mask-predict, CMLM) improve coherence but remain slower and less fluid than autoregressive generation. Long-form generation, dialogue, and instruction following are better served by reasoning models, autoregressive decoders, or encoder-decoder models.
The original BERT context window of 512 tokens (approximately 380 words), inherited by most 2019-vintage descendants, limits applicability to long inputs such as lengthy legal contracts, full-length scientific papers, and multi-turn conversations. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) extend the window to 4,096 tokens using sparse attention patterns (local sliding window plus global tokens). ModernBERT extends to 8,192 tokens with a hybrid local-global attention schedule. Despite these advances, long-context encoding remains more difficult for MLM models than for modern decoder LLMs that have been specifically engineered for 128,000-token or longer contexts, partly because the quadratic attention cost of full bidirectional attention over very long sequences is prohibitive without architectural changes.
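The sparse pattern of a local sliding window plus global tokens can be sketched as a boolean mask. The sequence length, window size, and choice of position 0 as the global token are illustrative assumptions, as is the convention that `window` counts tokens on both sides combined; real implementations compute this pattern inside fused attention kernels and never materialize the full matrix.

```python
def sparse_attention_mask(n, window, global_positions=()):
    """mask[i][j] is True when token i may attend to token j."""
    g = set(global_positions)
    half = window // 2  # tokens visible on each side of position i
    return [
        [abs(i - j) <= half or i in g or j in g for j in range(n)]
        for i in range(n)
    ]


mask = sparse_attention_mask(n=8, window=2, global_positions=[0])
allowed = sum(sum(row) for row in mask)
print(allowed, "of", 8 * 8)  # 34 of 64

for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

With a fixed window the number of allowed pairs grows roughly linearly in the sequence length, in contrast to the quadratic growth of full bidirectional attention.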
On reasoning-heavy benchmarks, encoder-only MLM models lag decoder LLMs trained at comparable scale. The largest decoder LLMs reportedly reach trillions of parameters and are trained on tens of trillions of tokens; the largest deployed fill-mask encoders remain in the hundreds of millions of parameters, trained on single-digit trillions of tokens. The gap reflects both the difference in training scale and the asymmetry between bidirectional representation learning (suited to understanding) and the chain-of-thought traces that decoder LLMs generate (suited to multi-step reasoning). For tasks requiring multi-step reasoning, instruction following, code generation, or long-form writing, decoder LLMs are clearly superior. Fill-mask encoders dominate where compact, low-latency, high-quality representations are the actual deliverable: embedding APIs, retrieval, classification, NER, and cross-encoder reranking.
Like all neural models, fill-mask encoders freeze a snapshot of world knowledge at the training cutoff. Unlike decoder LLMs, which can be prompted to reason about a question and hedge appropriately, MLM encoders do not have a natural mechanism for expressing uncertainty about factual claims. LAMA probing showed that encoders often confidently predict a wrong answer rather than indicating that they do not know. This makes fill-mask models unsuitable for tasks requiring current events, newly updated facts, or careful knowledge attribution without additional retrieval augmentation.
Fill-mask models are sensitive to the exact phrasing of cloze queries. Changing "Dante Alighieri was born in [MASK]" to "The birthplace of Dante Alighieri is [MASK]" can produce substantially different top predictions, even though the two queries encode the same relational fact. This template sensitivity undermines the reliability of LAMA-style factual probing and limits the utility of fill-mask inference for knowledge extraction without ensemble-style template augmentation.