Fill-Mask Models

Updated Jul 16, 2026

See also: Natural Language Processing Models and Tasks

Fill-mask models are language models trained with a masked language modeling (MLM) objective, in which a fraction of the tokens in an input sequence are hidden behind a special [MASK] symbol and the model predicts the original tokens from their surrounding context.^[1] The task that gave the model family its name is also a direct application: given "The capital of France is [MASK].", a fill-mask model returns a ranked distribution over vocabulary tokens that could fit the slot. Because predictions depend on both left and right context, fill-mask models produce bidirectional contextual embeddings that are widely used for text classification, named entity recognition, question answering, retrieval, and sentence similarity.

Fill-mask differs from the other dominant transformer pretraining objectives. Autoregressive (left-to-right) language modeling, used by GPT-family decoders, predicts the next token given a prefix and suits open-ended generation. Span corruption, used by T5, replaces contiguous spans with a single sentinel token and trains the decoder to emit the missing span.^[24] Fill-mask operates on individual masked positions inside a fixed window and uses an encoder-only architecture whose final hidden states serve as token representations.

Infobox

Fill-mask models
Type	Pre-training objective and inference task
Core mechanism	Predict masked tokens from bidirectional context
Architecture	Encoder-only transformer
Introduced	October 2018 (BERT, Devlin et al.)^[1]
Key objective	Masked language modeling (MLM)
Alternative objectives	Replaced token detection (ELECTRA)^[5], span corruption (T5)^[24]
Common downstream uses	Classification, NER, retrieval, reranking, sentence similarity
Key models	BERT, RoBERTa, ALBERT, DeBERTa, ModernBERT
Dominant benchmarks	GLUE, SuperGLUE, SQuAD, CoNLL-2003

History

Pre-transformer roots

The idea that word meaning can be recovered from context predates deep learning. The fill-in-the-blank format traces back to the cloze test, introduced by Wilson Taylor in 1953 as a reading-comprehension measure in which words are periodically deleted from a passage and the reader must supply the missing terms.^[25] That framing anticipated MLM by more than six decades.

Static word embedding methods such as word2vec (Mikolov et al., 2013)^[26] and GloVe (Pennington, Socher, and Manning, 2014) produced a single vector per word type without conditioning on the surrounding sentence at inference time. Word2vec's continuous bag-of-words variant predicts a center word from its neighbors, which is structurally similar to fill-mask but operates at the level of a small fixed window rather than a full sequence, and does not use a transformer.

Contextualized representations entered the mainstream with ELMo (Peters et al., 2018), which trained a two-layer bidirectional LSTM language model and combined the hidden states from each layer, together with the input token representation, as downstream features.^[10] ELMo's bidirectionality came from concatenating separate left-to-right and right-to-left language models rather than from a single network that attends in both directions at once.^[10] The ELMo representations were used as frozen features added on top of task-specific architectures rather than fine-tuned end-to-end.

BERT and the fill-mask paradigm

The modern fill-mask paradigm was introduced by BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018.^[1] Devlin, Chang, Lee, and Toutanova framed pretraining as a cloze task: randomly mask 15 percent of input tokens and train a transformer encoder to predict the missing tokens from their left and right context simultaneously.^[1] BERT also used a next-sentence prediction (NSP) objective on segment pairs, where the model classified whether a second sentence genuinely followed the first in the original corpus. Pretrained on BooksCorpus and English Wikipedia (roughly 3.3 billion words), BERT-base (110 million parameters) and BERT-large (340 million parameters) raised the state of the art on eleven NLP benchmarks including GLUE and SQuAD.^[1]

A key design decision in BERT was the addition of two special tokens: [CLS] at the start of every input and [SEP] between segments.^[1] The [CLS] token's final hidden state, aggregated through self-attention over the whole sequence, was used as the pooled sequence-level representation for classification tasks. The bidirectional attention stack, in which every token attends to every other token in the same forward pass, is what distinguishes BERT from GPT-style causal models and gives the representations their contextual richness.

A wave of refinements followed within twelve months. RoBERTa (Liu et al., July 2019) showed BERT was undertrained: longer training on more data (160 GB of text versus BERT's 16 GB), larger batches, dropping NSP, and dynamic masking (regenerating the mask pattern each epoch rather than generating it once during preprocessing) improved every downstream score without any architecture change.^[2] RoBERTa made the case that the MLM objective itself was sound and BERT's limitations were largely about data and compute.^[2]

ALBERT (Lan et al., September 2019) cut parameter counts by factorizing the embedding matrix into two smaller matrices and sharing transformer layer weights across all layers.^[3] A sentence-order prediction (SOP) objective replaced NSP, which ablation studies had found to add little signal. ALBERT-base reached GLUE scores comparable to BERT-large with six times fewer parameters.^[3]

DistilBERT (Sanh et al., October 2019) used knowledge distillation to compress BERT to 66 million parameters, retaining roughly 97 percent of GLUE performance while running 60 percent faster.^[4] The distillation loss combined the original MLM cross-entropy with a soft-target distribution over the teacher's vocabulary logits and a cosine embedding loss on the hidden representations.^[4]

Span masking and boundary prediction

SpanBERT (Joshi et al., 2020) masks contiguous spans sampled from a geometric distribution (mean length approximately 3.8 tokens) rather than independent positions.^[9] It also adds a span-boundary objective (SBO): the model must predict each masked token using only the boundary token representations on either side of the span, without attending to any of the masked positions.^[9] SpanBERT dropped NSP and trained on single-sentence segments. The resulting model was substantially better on span-selection tasks such as coreference resolution and extractive QA because the pretraining objective directly required recovering contiguous spans from context.^[9] Whole-word masking, which extends masking to all subword pieces of the same word rather than selecting pieces independently, became standard in subsequent releases by Google and others.

ELECTRA: replaced token detection

ELECTRA (Clark, Luong, Le, and Manning, 2020) introduced replaced token detection as a more compute-efficient alternative to MLM.^[5] Rather than predicting the original tokens at masked positions, ELECTRA trains two networks jointly. A small generator (trained with MLM) samples plausible replacements for masked positions. The main discriminator then classifies every token in the resulting sequence as original or replaced by the generator. Because the loss is defined over all tokens rather than only the masked 15 percent, ELECTRA extracts more signal per input example and reached BERT-equivalent quality on GLUE with roughly one quarter of the training compute.^[5] ELECTRA-large outperformed BERT-large and RoBERTa-large despite using fewer parameters and less training.^[5] The tradeoff is that replaced token detection is a binary classification problem per token rather than a full vocabulary softmax, which makes ELECTRA weaker as a direct fill-mask model but stronger as a downstream representation.

DeBERTa: disentangled attention

DeBERTa (He, Liu, Gao, and Chen, 2020) introduced disentangled attention, computing attention scores from separate content and relative-position vectors rather than a single combined representation.^[6] An enhanced mask decoder reinjects absolute position information just before the final softmax. This combination allowed DeBERTa-xxlarge (1.5 billion parameters) to surpass human performance on SuperGLUE at the time of its release.^[6] DeBERTa-v3 (He, Gao, and Chen, November 2021) combined disentangled attention with ELECTRA-style pretraining and gradient-disentangled embedding sharing, yielding the strongest encoder family on the SuperGLUE leaderboard for most of 2022 and 2023.^[7]

Multilingual and domain-specific variants

Multilingual fill-mask was scaled up with XLM-R (Conneau et al., November 2019), trained on more than two terabytes of filtered CommonCrawl across 100 languages.^[8] XLM-R beat multilingual BERT by 14.6 points on XNLI and was particularly strong on low-resource languages such as Swahili and Urdu.^[8] The multilingual MLM objective is identical to the monolingual version; the model implicitly aligns representations across languages by training on interleaved multilingual text.

Domain-specific variants followed quickly. SciBERT (Beltagy, Lo, and Cohan, 2019) pretrained on 1.14 million scientific papers (biomedical and computer science) using BERT's original recipe with a domain vocabulary.^[11] BioBERT (Lee et al., 2019) fine-tuned BERT-base on PubMed abstracts and PubMed Central full texts.^[12] Both substantially outperformed general-domain BERT on biomedical NER, relation extraction, and QA.^[11]^[12] FinBERT, LegalBERT, ClinicalBERT, CodeBERT, and dozens of other domain encoders followed the same template.

The post-LLM revival

Research on encoder-only architectures slowed as decoder large language models absorbed most attention after GPT-3 (2020). For several years the dominant pretrained encoders remained RoBERTa and DeBERTa variants released between 2019 and 2021.

Two recent releases revived the encoder branch. MosaicBERT (Portes et al., December 2023) combined FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units, padding-token removal, a 30 percent mask rate, and bfloat16 precision into a recipe that pretrains a BERT-base-quality model in about an hour on 8 A100 GPUs for roughly 20 dollars.^[17] ModernBERT (Warner et al., December 2024) trained on two trillion tokens of web text, code, and scientific literature using a hybrid local-global attention schedule (every third layer attends globally; intervening layers use a 128-token sliding window) with an 8,192-token context.^[18] ModernBERT-base has 149 million parameters; ModernBERT-large has 395 million.^[18] ModernBERT demonstrated that a fresh encoder trained on modern data and with modern architectural techniques could compete with or surpass much larger decoder models on retrieval and classification tasks while maintaining the latency and memory advantages of the encoder-only design.^[18]

Masked language modeling objective

Formal definition

Masked language modeling was introduced as the core pretraining objective for BERT and has since been used, with variations, by every major encoder-only transformer.^[1] Given an input sequence of tokens $x = (x_1, x_2, \ldots, x_n)$, a random subset $M \subset \{1, \ldots, n\}$ of positions is selected as the "masked" set. The model receives the corrupted sequence $\tilde{x}$, in which positions in $M$ have been replaced by the special [MASK] token (or by another token, depending on the masking strategy), and is trained to predict the original tokens at those positions. The objective is to minimize the sum of cross-entropy losses over the masked positions:

$\mathcal{L}_\text{MLM} = -\sum_{i \in M} \log P(x_i \mid \tilde{x})$

Because the model predicts each masked position from the full surrounding context, including tokens to the left and right, it learns bidirectional representations. The key difference from an autoregressive language model (which predicts $x_i$ from $x_{<i}$ only) is that MLM conditions on both past and future context, making the representations richer for tasks that need global understanding of a fixed input sequence.

The loss sums over masked positions only; unmasked tokens contribute nothing. MLM therefore optimizes a pseudo-likelihood-style objective rather than the exact sequence likelihood: each masked token is scored given the corrupted context, and dependencies among the masked positions themselves are ignored. This independence assumption (positions in $M$ are predicted as if they were independent given the context) is what lets the model use full bidirectional attention during training without autoregressing. It is also what makes MLM-pretrained models poor direct generators: they cannot naturally condition one generated token on previously generated tokens in the same masked set.
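The objective above can be sketched in a few lines. This is a toy illustration rather than a training implementation: the function name `mlm_loss`, the dictionary-based probability tables, and the example sentence are all assumptions made for the sketch.

```python
import math

def mlm_loss(probs, targets, masked_positions):
    """Sum of -log P(x_i | corrupted x) over the masked positions only.

    probs[i] maps candidate tokens to predicted probabilities at position i;
    targets[i] is the original token at position i.
    """
    return -sum(math.log(probs[i][targets[i]]) for i in masked_positions)

# Hypothetical model outputs for a 4-token input where positions 1 and 3 were masked.
targets = ["the", "cat", "sat", "down"]
probs = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.5, "dog": 0.5},
    {"sat": 0.8, "ran": 0.2},
    {"down": 0.25, "up": 0.75},
]
loss = mlm_loss(probs, targets, masked_positions=[1, 3])
# Positions 0 and 2 are ignored: loss = -(ln 0.5 + ln 0.25) = ln 8
```

Note that the predictions at positions 0 and 2 never enter the sum, which is the sense in which the loss is defined over masked positions only.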

The 80/10/10 strategy

In BERT's implementation, 15 percent of subword tokens are selected for prediction.^[1] Of the selected tokens:

80 percent are replaced with the [MASK] special token.
10 percent are replaced with a uniformly sampled random token from the vocabulary.
10 percent are kept as the original token, unchanged.

The mixed strategy serves two purposes. First, the [MASK] token never appears during fine-tuning or inference, so always masking every selected position would create a train-test mismatch.^[1] Substituting random tokens and keeping some positions unchanged during pretraining prevents the model from simply ignoring unmasked positions; it must maintain a calibrated contextual representation at every position because any token could in principle be the one requiring correction. Second, the occasional random-token substitution forces the model to recognize that the observed token at a given position may not be the true token, discouraging trivial copying of observed tokens through the residual stream.
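A minimal sketch of the 80/10/10 corruption step follows. It is not BERT's actual preprocessing code: the function name `bert_style_mask`, the toy vocabulary, and the choice to select a fixed count of positions per sequence are assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def bert_style_mask(tokens, vocab, rng, select_rate=0.15):
    """Return (corrupted tokens, {position: original token}) for one sequence."""
    n_select = max(1, round(len(tokens) * select_rate))
    selected = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    labels = {}
    for i in selected:
        labels[i] = tokens[i]          # the prediction target is always the original token
        roll = rng.random()
        if roll < 0.8:                 # 80 percent: replace with [MASK]
            corrupted[i] = MASK
        elif roll < 0.9:               # 10 percent: replace with a random vocabulary token
            corrupted[i] = rng.choice(vocab)
        # remaining 10 percent: leave the original token in place
    return corrupted, labels

tokens = ("the quick brown fox jumps over the lazy dog while "
          "the small cat sleeps on a warm mat all day").split()
vocab = sorted(set(tokens))  # toy vocabulary, assumed for the example
corrupted, labels = bert_style_mask(tokens, vocab, random.Random(0))
```

Only the positions recorded in `labels` contribute to the loss, regardless of which of the three corruption branches was taken.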

Masking strategy variants

Research after BERT explored several modifications to the basic masking recipe:

Whole-word masking (WWM) treats all subword pieces of a word as a unit. If "unbelievable" is tokenized as un, ##believe, ##able and the word is selected for masking, all three pieces are masked together. Masking at the word level rather than the subword level creates harder prediction targets and generally improves downstream performance on tasks that require word-level understanding. Whole-word masking became standard in Google's later BERT releases and in most subsequent encoder models.
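As a rough sketch of whole-word masking, the snippet below groups WordPiece tokens into words using the "##" continuation prefix and masks every piece of a chosen word. The helper names `group_wordpieces` and `mask_whole_words` are hypothetical, and the tokenization mirrors the illustrative one in the paragraph above rather than the output of a real tokenizer.

```python
def group_wordpieces(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words

def mask_whole_words(tokens, word_ids_to_mask):
    """Mask every subword piece belonging to each selected word."""
    words = group_wordpieces(tokens)
    out = list(tokens)
    for w in word_ids_to_mask:
        for i in words[w]:
            out[i] = "[MASK]"
    return out

tokens = ["that", "was", "un", "##believe", "##able"]
groups = group_wordpieces(tokens)        # [[0], [1], [2, 3, 4]]
masked = mask_whole_words(tokens, [2])   # all three pieces of the third word are masked
```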

Span masking (used in SpanBERT, with a related scheme in T5's span corruption) masks contiguous multi-token spans rather than independent positions. In SpanBERT, span lengths are drawn from a geometric distribution with mean around 3.8 tokens, giving a mixture of short and longer contiguous gaps.^[9] Span masking targets the same structures that reading comprehension and coreference tasks require: recovering a named entity, a verb phrase, or a prepositional object from surrounding context. SpanBERT additionally trains a span-boundary objective, described above.
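The span-length distribution can be sketched as follows. The article gives only the mean; the parameter `p = 0.2` and the maximum length of 10 are assumptions (they are the values usually attributed to SpanBERT), as is the choice to renormalize after truncation. Under those assumptions the mean works out to about 3.8.

```python
import random

def span_length_probs(p=0.2, max_len=10):
    """Geometric(p) over lengths 1..max_len, renormalized after truncation."""
    weights = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_span_length(rng, p=0.2, max_len=10):
    """Draw one span length from the truncated geometric distribution."""
    probs = span_length_probs(p, max_len)
    return rng.choices(range(1, max_len + 1), weights=probs, k=1)[0]

probs = span_length_probs()
mean_length = sum(k * q for k, q in enumerate(probs, start=1))  # about 3.8
length = sample_span_length(random.Random(0))
```

Short spans are the most likely outcome, but the long tail means a meaningful share of masked spans cover several tokens.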

Dynamic masking (used in RoBERTa) regenerates the mask pattern each time a training example is visited rather than assigning a fixed mask at data preprocessing time.^[2] In BERT's original setup, a fixed mask was generated during preprocessing and the same mask was used every epoch. Dynamic masking means the model sees the same text with different masked positions across epochs, effectively multiplying the number of distinct training signals and reducing overfitting.^[2]
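The difference between the two schemes can be sketched with a toy position sampler. The helper `choose_mask_positions` and the sequence length, rate, and epoch count are assumptions for the example, not values from either paper's code.

```python
import random

def choose_mask_positions(n_tokens, rate, rng):
    """Pick which positions to mask for one pass over one example."""
    k = max(1, round(n_tokens * rate))
    return sorted(rng.sample(range(n_tokens), k))

n_tokens, rate, epochs = 40, 0.15, 4

# Static masking: one pattern is drawn at preprocessing time and reused every epoch.
static_pattern = choose_mask_positions(n_tokens, rate, random.Random(0))
static_epochs = [static_pattern for _ in range(epochs)]

# Dynamic masking: a fresh pattern is drawn each time the example is visited.
rng = random.Random(0)
dynamic_epochs = [choose_mask_positions(n_tokens, rate, rng) for _ in range(epochs)]
```

With static masking the model is asked the same questions about this example in every epoch; with dynamic masking each visit yields a different set of prediction targets from the same text.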

Entity and phrase masking (used in ERNIE by Sun et al., 2019, from Baidu) extends whole-word masking to named entities and syntactic phrases such as noun phrases and named locations.^[20] The model must recover "Barack Obama" as a unit rather than recovering "Obama" independently of "Barack," forcing it to capture entity-level semantics.

Higher mask rates: The original BERT rate of 15 percent was chosen heuristically. MosaicBERT (Portes et al., 2023) found that 30 percent masking, combined with better optimization and architecture choices, matches or exceeds the quality of 15 percent masking while reducing the number of training steps needed for convergence.^[17]

Replaced token detection (used in ELECTRA and DeBERTa-v3) replaces the vocabulary softmax cross-entropy with a per-token binary classification: is this token original or was it replaced by a generator model?^[5]^[7] Because the loss is computed over every token rather than the 15 percent masked subset, the training signal is about 6.7 times denser per sequence. ELECTRA-base trained on the same data and compute budget as BERT-base substantially outperforms it on the GLUE benchmark.^[5]
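The density figure follows directly from the mask rate. As a worked check, assuming a sequence of $n$ tokens of which 15 percent are selected for MLM prediction, the ratio of positions that receive a loss term under replaced token detection to those that receive one under MLM is:

$\frac{n}{0.15\,n} = \frac{1}{0.15} \approx 6.67$

This counts loss terms only. Each replaced-token-detection term is a binary decision rather than a prediction over the whole vocabulary, so the ratio should not be read as 6.7 times more information per sequence.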

Comparison with other pretraining objectives

The table below summarizes how MLM compares to the main alternative pretraining objectives used in large language model research:

Objective	Model family	Context direction	Loss defined over	Primary use
Masked language modeling	BERT, RoBERTa, ALBERT, DeBERTa	Bidirectional (full sequence)	Masked positions only	Encoder representations
Replaced token detection	ELECTRA, DeBERTa-v3	Bidirectional (full sequence)	All positions (binary)	Efficient encoder training
Autoregressive LM	GPT, LLaMA, Mistral	Left-to-right (causal)	All positions	Generation, instruction following
Span corruption and related denoising	T5, BART	Bidirectional encoder, causal decoder	Decoder positions	Conditional generation
Next sentence prediction	BERT (auxiliary)	Segment-level	Binary label	Removed in RoBERTa and ALBERT
Sentence order prediction	ALBERT	Segment-level	Binary label	Coherence and discourse

Notable models

Model	Released	Organization	Parameters	Pretraining
BERT-base	Oct 2018	Google	110M	MLM + NSP, 3.3B words^[1]
BERT-large	Oct 2018	Google	340M	MLM + NSP, 3.3B words^[1]
RoBERTa-base	Jul 2019	Facebook AI	125M	Dynamic MLM, 160GB text^[2]
RoBERTa-large	Jul 2019	Facebook AI	355M	Dynamic MLM, 160GB text^[2]
ALBERT	Sep 2019	Google	12M to 235M	MLM + SOP, shared layers^[3]
DistilBERT	Oct 2019	Hugging Face	66M	Distilled from BERT^[4]
SciBERT	2019	Allen Institute for AI	110M	MLM, 1.14M scientific papers^[11]
BioBERT	2019	DMIS Lab, Korea Univ.	110M	MLM, PubMed + PMC^[12]
XLM-R-base	Nov 2019	Facebook AI	270M	MLM, 100 languages, 2.5TB^[8]
XLM-R-large	Nov 2019	Facebook AI	550M	MLM, 100 languages, 2.5TB^[8]
ELECTRA-base	Mar 2020	Google	110M	Replaced token detection^[5]
SpanBERT	2020	Facebook AI	110M to 340M	Span masking + SBO^[9]
DeBERTa	Jun 2020	Microsoft	134M to 1.5B	MLM + disentangled attention^[6]
DeBERTa-v3-base	Nov 2021	Microsoft	86M	RTD + gradient-disentangled^[7]
MosaicBERT	Dec 2023	MosaicML	137M	MLM, FlashAttention, ALiBi^[17]
ModernBERT-base	Dec 2024	Answer.AI, LightOn	149M	MLM, 2T tokens, 8K context^[18]
ModernBERT-large	Dec 2024	Answer.AI, LightOn	395M	MLM, 2T tokens, 8K context^[18]

The fill-mask inference task

"Fill-mask" in the sense of a runnable inference task refers to passing a sentence with one or more [MASK] tokens to a pretrained encoder and retrieving the model's probability distribution over vocabulary tokens for each masked position. The output is a ranked list of candidate completions with associated probabilities, for example:

The capital of France is [MASK]. returns: Paris (0.98), Berlin (0.004), Rome (0.002), ...
The patient was prescribed a new [MASK] for hypertension. returns: medication (0.31), drug (0.24), treatment (0.18), ...

This inference mode is exposed as a named pipeline in the Hugging Face Transformers library (pipeline("fill-mask")) and is often the first hands-on demonstration of what an encoder model does.
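What the ranked output amounts to can be sketched without any model: take the logits at the masked position, apply a softmax, and keep the top candidates. The snippet below does not call the Transformers library; the function `top_k_fill` and the logit values are hypothetical and chosen only to make the ranking visible.

```python
import math

def top_k_fill(logits, k=3):
    """Softmax over a token -> logit mapping, returning the k most probable tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    ranked = sorted(((tok, e / z) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for the [MASK] slot in "The capital of France is [MASK]."
logits = {"Paris": 9.0, "Berlin": 3.5, "Rome": 2.8, "banana": -4.0}
candidates = top_k_fill(logits, k=3)  # (token, probability) pairs, most probable first
```

A real encoder produces one logit per vocabulary entry (tens of thousands of values), but the ranking step is the same.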

Practical uses of direct fill-mask inference

Direct fill-mask inference is useful in its own right beyond probing. Text infilling tools for writing assistants can suggest contextually appropriate completions for partially written sentences. Grammar correction systems can score how well candidate replacements for a potentially wrong word fit the surrounding text. Word-sense induction uses the fill-mask distribution to identify which reading of an ambiguous context fits the slot. Masked language model scoring, where a sentence is scored by masking each token in turn and summing the log-probabilities of the original tokens, gives a sentence-acceptability measure that correlates reasonably well with human grammaticality judgments.

LAMA probing

Fill-mask inference became a tool for probing factual knowledge stored in language model parameters. The LAMA (LAnguage Model Analysis) probe, introduced by Petroni et al. (2019), converts relational facts from knowledge graphs into fill-mask cloze queries.^[19] For example, the triple (Dante Alighieri, born-in, Florence) becomes the query: Dante Alighieri was born in [MASK]. The model's top prediction is compared to the ground-truth answer to assess how much world knowledge is implicitly encoded in the MLM weights.

LAMA revealed several patterns. BERT-large stored a surprising amount of factual knowledge about birth places, occupations, and capital cities, often outperforming smaller retrieval-based systems on certain relation types. The results were, however, highly sensitive to the exact wording of the cloze template (a phenomenon called template sensitivity).^[19] Later work showed that ensembling over multiple paraphrased templates improved LAMA scores substantially, suggesting the knowledge is present in the model but can be difficult to surface reliably with a single prompt.

The LAMA probe is one of the intellectual ancestors of modern retrieval-augmented generation techniques. It raised the question of whether LLMs could serve as "soft knowledge bases" and sparked a research line on how to elicit, update, and verify factual knowledge in pretrained models.^[19] For encoder-only fill-mask models the answer proved partial: MLM-pretrained encoders hold knowledge well for frequently attested facts but are unreliable for long-tail entities and temporal facts that changed after the training cutoff.

Uses of fill-mask models

The most common use of an MLM-pretrained encoder is fine-tuning for a discriminative task. A task-specific head (linear classifier, span predictor, or token-level tagger) is added on top of the encoder's final hidden states and the stack is trained on labeled data. This recipe, sometimes called the "pretrain then fine-tune" paradigm, powered the state of the art on GLUE, SuperGLUE, SQuAD 1.1, SQuAD 2.0, and the CoNLL-2003 NER benchmark throughout 2019 and 2020. The paradigm had a profound effect on NLP research: practitioners could fine-tune BERT on a few thousand labeled examples and match or exceed systems that required hundreds of thousands of examples when trained from scratch.^[1]

Bi-encoder retrieval and semantic similarity

Contextual embeddings extracted from MLM models, typically by mean pooling or using the [CLS] token, are the workhorse representation for retrieval and similarity. Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT with a Siamese architecture and triplet loss to produce semantically meaningful pooled vectors suitable for cosine similarity comparison.^[16] Such embeddings underpin most modern dense retrievers, including those used for retrieval-augmented generation, and remain the dominant choice for semantic search and clustering at scale. A bi-encoder encodes queries and documents independently and retrieves candidates by approximate nearest neighbor search over precomputed document embeddings, making it fast enough for internet-scale retrieval.
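The pooling and comparison steps can be sketched in plain Python. The two-dimensional vectors below are made up for the example, and `mean_pool` and `cosine` are hypothetical helper names; a real bi-encoder would produce vectors with hundreds of dimensions and use an approximate nearest neighbor index rather than a full sort.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask value 0)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical token embeddings for a query; the last position is padding.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_vec = mean_pool(query_tokens, attention_mask=[1, 1, 0])

# Document vectors are computed once, ahead of time, and reused for every query.
doc_vecs = {"doc_a": [2.0, 2.0], "doc_b": [1.0, -1.0]}
ranking = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```

Because the query and documents are encoded independently, only the query needs a forward pass at search time.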

Cross-encoder reranking

Reranking uses an MLM encoder as a cross-encoder: a query and candidate document are concatenated, separated by [SEP], and fed through the model as a single sequence.^[23] The [CLS] token's final hidden state is passed through a linear layer to produce a relevance score.^[23] Cross-encoders deliver higher accuracy than bi-encoder dense retrievers because they can model query-document interactions directly, but at the cost of running the full encoder once per candidate. They are typically applied as a second-stage rescorer over a bi-encoder or BM25 first-stage retrieval index. The MonoBERT and MonoT5 rerankers are standard cross-encoder implementations that dominated the MS MARCO passage ranking leaderboard for several years.^[23]
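The two-stage pattern can be sketched as follows. The scoring function here is a deliberately crude word-overlap stand-in for the encoder and linear head, and all names (`build_cross_encoder_input`, `toy_relevance`, `rerank`) and example passages are assumptions for illustration.

```python
def build_cross_encoder_input(query, document):
    """Concatenate query and document into a single BERT-style sequence."""
    return f"[CLS] {query} [SEP] {document} [SEP]"

def toy_relevance(pair_text):
    """Stand-in for encoder + linear head: counts query words that recur in the document."""
    body = pair_text[len("[CLS] "):-len(" [SEP]")]
    query, document = body.split(" [SEP] ")
    doc_words = set(document.lower().split())
    return sum(word in doc_words for word in query.lower().split())

def rerank(query, candidates, score_fn, top_n=2):
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_fn(build_cross_encoder_input(query, d)), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_n]]

# Candidates as they might come back from a first-stage retriever.
first_stage = ["paris is the capital of france",
               "berlin is in germany",
               "france borders spain"]
reranked = rerank("capital of france", first_stage, toy_relevance)
```

The cost structure is visible in `rerank`: the scorer runs once per candidate, which is why cross-encoders are applied to a short first-stage list rather than to the whole collection.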

Direct fill-mask inference

Direct fill-mask inference is used for cloze-style probing of model knowledge (the LAMA probe described above)^[19], word-sense induction, text infilling for editing assistants, and grammatical error correction. The pseudo-likelihood from MLM can also score sentence acceptability: mask each token in turn, sum the log-probabilities of the original tokens, and interpret the total as a log-likelihood proxy for grammaticality or fluency.
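The scoring loop can be sketched as below. The `toy_masked_prob` function is a hypothetical stand-in for an encoder and, unlike a real MLM, looks only at the left neighbor; it exists just to make the loop runnable.

```python
import math

def pseudo_log_likelihood(tokens, masked_prob):
    """Mask each position in turn and sum log P(original token | rest of sentence)."""
    total = 0.0
    for i, original in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += math.log(masked_prob(context, i, original))
    return total

# Toy model (assumption): a candidate is likely only if it forms a known pair with its left neighbor.
KNOWN_BIGRAMS = {("<s>", "the"), ("the", "cat"), ("cat", "sat")}

def toy_masked_prob(context, position, candidate):
    left = context[position - 1] if position > 0 else "<s>"
    return 0.6 if (left, candidate) in KNOWN_BIGRAMS else 0.05

fluent = pseudo_log_likelihood(["the", "cat", "sat"], toy_masked_prob)
scrambled = pseudo_log_likelihood(["sat", "the", "cat"], toy_masked_prob)
# The fluent ordering receives the higher (less negative) score.
```

Scoring a sentence of n tokens this way needs n forward passes, one per masked position, which is the main practical cost of the method.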

Generative limitations

Fill-mask models are less effective as direct open-ended generators. The core problem is that MLM treats masked positions as independent given context, so producing a coherent sequence requires repeated rounds of masking and decoding rather than a single pass. Masked diffusion language models such as MDLM, along with earlier BERT-based text generation work, have explored iterative refinement: start with a fully masked sequence, predict all positions in parallel, re-mask uncertain positions, and iterate. These approaches narrow the generation quality gap but remain slower and less natural than autoregressive decoding. Encoder-decoder models such as BART and T5 are usually preferred when the goal is generation from corrupted or partially specified input.
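The iterative refinement loop can be sketched with a toy predictor. This is a simplified variant in which low-confidence positions are simply left masked for the next round rather than filled and then re-masked; `iterative_fill`, `toy_predict`, the confidence formula, and the commit-half schedule are all assumptions for the example and do not reproduce any published algorithm exactly.

```python
MASK = "[MASK]"

def iterative_fill(length, predict, steps=3):
    """Start fully masked; each round, commit the most confident predictions."""
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Predict every masked position in parallel from the current sequence.
        guesses = {i: predict(seq, i) for i in masked}  # i -> (token, confidence)
        # Final round commits everything; earlier rounds commit the top half.
        n_commit = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
        by_confidence = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            seq[i] = guesses[i][0]
    return seq

TARGET = ["the", "cat", "sat", "down"]

def toy_predict(seq, i):
    """Hypothetical model: confidence rises as neighboring positions get filled in."""
    filled = sum(1 for j in (i - 1, i + 1) if 0 <= j < len(seq) and seq[j] != MASK)
    return TARGET[i], 0.4 + 0.3 * filled - 0.01 * i

result = iterative_fill(len(TARGET), toy_predict)
```

Each round is a full forward pass over the sequence, so the number of rounds trades generation speed against how much each prediction can condition on earlier commitments.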

Benchmarks

Benchmark	Year	Tasks	Typical use
GLUE	2018	9 NLU tasks	General language understanding^[13]
SuperGLUE	2019	8 harder NLU tasks	Successor to GLUE after saturation^[14]
SQuAD 1.1	2016	Extractive QA	Span prediction on Wikipedia^[15]
SQuAD 2.0	2018	QA with unanswerables	Adds abstention
CoNLL-2003	2003	NER (English, German)	Token classification
MS MARCO	2016	Passage ranking	Retrieval and reranking
XNLI	2018	NLI in 15 languages	Cross-lingual transfer
BEIR	2021	18 retrieval tasks	Zero-shot retrieval evaluation

Modern relevance

The rise of decoder-only LLMs after GPT-3 shifted research away from encoder pretraining, and for several years the dominant pretrained encoders remained RoBERTa and DeBERTa variants released between 2019 and 2021. Despite this, encoder-only fill-mask models still account for the majority of production NLP outside open-ended generation. Embedding APIs that power vector database retrieval, content moderation classifiers, and on-device NER typically run BERT-class encoders because of latency, memory, and cost constraints. ModernBERT (December 2024) showed the encoder branch still has room to grow: with an 8,192-token context, code-and-text training, and architectural improvements, it delivers retrieval and classification quality competitive with much larger decoder models while running several times faster on the same hardware.^[18]

Encoder-decoder models such as T5 use a related but distinct objective called span corruption, in which contiguous spans are replaced by single sentinel tokens and the decoder is trained to emit the missing spans separated by the same sentinels.^[24] Span corruption shares MLM's bidirectional encoder context but produces a generative model; it is not strictly a fill-mask objective.

Applications

Search and retrieval: dense passage retrieval, hybrid pipelines, semantic search, RAG retrieval stages.
Classification: sentiment analysis, topic labeling, spam and toxicity detection, intent classification.
Named entity recognition: extracting people, organizations, locations, and domain-specific entities.
Sentence similarity: paraphrase mining, duplicate question detection, semantic textual similarity.
Reranking: improving retrieval precision with cross-encoder scoring.
Biomedical and scientific text mining: relation extraction, gene mention detection, and knowledge graph construction with SciBERT, BioBERT, and clinical variants such as ClinicalBERT.
Grammatical error correction and text infilling: filling missing words and suggesting alternatives for writing assistants.
Cloze-style probing: evaluating factual or commonsense knowledge in a pretrained model.

Limitations

Train-test mismatch

The [MASK] symbol used during pretraining never appears during fine-tuning or inference, creating a train-test mismatch.^[1] A fine-tuned NER model never sees [MASK] tokens; it sees real tokens throughout. The 80/10/10 strategy only partially mitigates this: 80 percent of selected positions still use the artificial [MASK] token during pretraining, so the model is exposed to a token that will never appear in production. ELECTRA and DeBERTa-v3 address this more cleanly through replaced token detection, where the discriminator always sees a coherent sequence of real tokens (some replaced by the generator, some original), eliminating the [MASK] artifact entirely.^[5]^[7]

Generative weakness

Fill-mask models are poor open-ended generators. The pseudo-likelihood objective treats masked positions as independent given the observed context, so the model has no mechanism to ensure that multiple simultaneously generated tokens are consistent with each other. Iterative mask-fill generation approaches (mask-predict, CMLM) improve coherence but remain slower and less fluid than autoregressive generation. Long-form generation, dialogue, and instruction following are better served by reasoning models, autoregressive decoders, or encoder-decoder models.

Context length limits

The original BERT context window of 512 tokens (approximately 380 words), inherited by most 2019-vintage descendants, limits applicability to long inputs such as lengthy legal contracts, full-length scientific papers, and multi-turn conversations.^[1] Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) extend the window to 4,096 tokens using sparse attention patterns (local sliding window plus global tokens).^[21]^[22] ModernBERT extends to 8,192 tokens with a hybrid local-global attention schedule.^[18] Despite these advances, long-context encoding remains more difficult for MLM models than for modern decoder LLMs that have been specifically engineered for 128,000-token or longer contexts, partly because the quadratic cost of full bidirectional attention over very long sequences is prohibitive without architectural changes.
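The quadratic cost is easy to see by counting the query-key pairs a single attention layer has to score. The sketch below compares full attention with a sliding window at an 8,192-token context; reading the 128-token window as 64 tokens on each side of the current token is an assumption, and the function names are hypothetical.

```python
def full_attention_pairs(n):
    """Query-key pairs scored by full bidirectional attention in one layer."""
    return n * n

def sliding_window_pairs(n, window):
    """Pairs scored when each token attends only to neighbors inside a centered window."""
    half = window // 2
    return sum(min(n - 1, i + half) - max(0, i - half) + 1 for i in range(n))

n = 8192
full = full_attention_pairs(n)                # 67,108,864 pairs
local = sliding_window_pairs(n, window=128)   # 1,052,608 pairs
ratio = full / local                          # a bit under 64x fewer pairs per local layer
```

Going from 512 to 8,192 tokens multiplies the full-attention count by 256, while the sliding-window count grows only linearly with sequence length, which is why hybrid schedules reserve full attention for a subset of layers.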

Scale gap versus decoder LLMs

On reasoning-heavy benchmarks, encoder-only MLM models lag decoder LLMs, in part because they have not been trained at comparable scale. The largest decoder LLMs have hundreds of billions of parameters or more and are trained on tens of trillions of tokens; widely deployed fill-mask encoders mostly remain in the hundreds of millions of parameters (DeBERTa-xxlarge, at 1.5 billion, is an outlier) and are trained on at most a few trillion tokens. The gap reflects both the difference in training scale and the asymmetry between bidirectional representation learning (suited to understanding) and the chain-of-thought traces that decoder LLMs generate (suited to multi-step reasoning). For tasks requiring multi-step reasoning, instruction following, code generation, or long-form writing, decoder LLMs are clearly superior. Fill-mask encoders dominate where compact, low-latency, high-quality representations are the actual deliverable: embedding APIs, retrieval, classification, NER, and cross-encoder reranking.

Knowledge staleness

Like all neural models, fill-mask encoders freeze a snapshot of world knowledge at the training cutoff. Unlike decoder LLMs, which can be prompted to reason about a question and hedge appropriately, MLM encoders do not have a natural mechanism for expressing uncertainty about factual claims. LAMA probing showed that encoders often confidently predict a wrong answer rather than indicating that they do not know.^[19] This makes fill-mask models unsuitable for tasks requiring current events, newly updated facts, or careful knowledge attribution without additional retrieval augmentation.

Template sensitivity

Fill-mask models are sensitive to the exact phrasing of cloze queries.^[19] Changing "Dante Alighieri was born in [MASK]" to "The birthplace of Dante Alighieri is [MASK]" can produce substantially different top predictions, even though the two queries encode the same relational fact. This template sensitivity undermines the reliability of LAMA-style factual probing and limits the utility of fill-mask inference for knowledge extraction without ensemble-style template augmentation.
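The template augmentation mentioned above can be sketched as averaging the predicted token scores over several paraphrased cloze queries. This is a minimal illustration, not the procedure used in the LAMA paper: `ensemble_fill` and its `fill_fn` parameter are hypothetical names, and `fill_fn` is assumed to return a list of dictionaries with `token_str` and `score` keys, which is the shape produced by the Hugging Face `fill-mask` pipeline.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed prediction shape: [{"token_str": "...", "score": 0.0}, ...]
FillFn = Callable[[str], List[Dict[str, object]]]


def ensemble_fill(
    fill_fn: FillFn, templates: Sequence[str], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Average per-token scores across paraphrased cloze templates.

    A token missing from one template's predictions contributes a score
    of zero for that template (a simplifying assumption).
    """
    if not templates:
        raise ValueError("at least one template is required")
    totals: Dict[str, float] = defaultdict(float)
    for template in templates:
        for pred in fill_fn(template):
            token = str(pred["token_str"]).strip()
            totals[token] += float(pred["score"])
    averaged = {tok: s / len(templates) for tok, s in totals.items()}
    # Highest average score first.
    ranked = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

With the `transformers` library installed, `fill_fn` could be a wrapper such as `lambda text: fill(text, top_k=50)`, where `fill = pipeline("fill-mask", model="bert-base-uncased")`, and `templates` could hold the two Dante Alighieri phrasings from the paragraph above. Averaging favors tokens that rank well under every phrasing, which reduces, but does not remove, the dependence on any single template.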

References

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.* arXiv:1810.04805. https://arxiv.org/abs/1810.04805
Liu, Y. et al. (2019). *RoBERTa: A Robustly Optimized BERT Pretraining Approach.* arXiv:1907.11692. https://arxiv.org/abs/1907.11692
Lan, Z. et al. (2019). *ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.* arXiv:1909.11942. https://arxiv.org/abs/1909.11942
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). *DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.* arXiv:1910.01108. https://arxiv.org/abs/1910.01108
Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. (2020). *ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.* arXiv:2003.10555. https://arxiv.org/abs/2003.10555
He, P., Liu, X., Gao, J., and Chen, W. (2020). *DeBERTa: Decoding-enhanced BERT with Disentangled Attention.* arXiv:2006.03654. https://arxiv.org/abs/2006.03654
He, P., Gao, J., and Chen, W. (2021). *DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.* arXiv:2111.09543. https://arxiv.org/abs/2111.09543
Conneau, A. et al. (2019). *Unsupervised Cross-lingual Representation Learning at Scale.* arXiv:1911.02116. https://arxiv.org/abs/1911.02116
Joshi, M. et al. (2020). *SpanBERT: Improving Pre-training by Representing and Predicting Spans.* arXiv:1907.10529. https://arxiv.org/abs/1907.10529
Peters, M. E. et al. (2018). *Deep contextualized word representations.* arXiv:1802.05365. https://arxiv.org/abs/1802.05365
Beltagy, I., Lo, K., and Cohan, A. (2019). *SciBERT: A Pretrained Language Model for Scientific Text.* arXiv:1903.10676. https://arxiv.org/abs/1903.10676
Lee, J. et al. (2019). *BioBERT: a pre-trained biomedical language representation model for biomedical text mining.* arXiv:1901.08746. https://arxiv.org/abs/1901.08746
Wang, A. et al. (2018). *GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.* arXiv:1804.07461. https://arxiv.org/abs/1804.07461
Wang, A. et al. (2019). *SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.* arXiv:1905.00537. https://arxiv.org/abs/1905.00537
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). *SQuAD: 100,000+ Questions for Machine Comprehension of Text.* arXiv:1606.05250. https://arxiv.org/abs/1606.05250
Reimers, N. and Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.* arXiv:1908.10084. https://arxiv.org/abs/1908.10084
Portes, J. et al. (2023). *MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining.* arXiv:2312.17482. https://arxiv.org/abs/2312.17482
Warner, B. et al. (2024). *Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.* arXiv:2412.13663. https://arxiv.org/abs/2412.13663
Petroni, F. et al. (2019). *Language Models as Knowledge Bases?* arXiv:1909.01066. https://arxiv.org/abs/1909.01066
Sun, Y. et al. (2019). *ERNIE: Enhanced Representation through Knowledge Integration.* arXiv:1904.09223. https://arxiv.org/abs/1904.09223
Beltagy, I., Peters, M. E., and Cohan, A. (2020). *Longformer: The Long-Document Transformer.* arXiv:2004.05150. https://arxiv.org/abs/2004.05150
Zaheer, M. et al. (2020). *Big Bird: Transformers for Longer Sequences.* arXiv:2007.14062. https://arxiv.org/abs/2007.14062
Nogueira, R. and Cho, K. (2019). *Passage Re-ranking with BERT.* arXiv:1901.04085. https://arxiv.org/abs/1901.04085
Raffel, C. et al. (2019). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.* arXiv:1910.10683. https://arxiv.org/abs/1910.10683
Taylor, W. L. (1953). *"Cloze Procedure": A New Tool for Measuring Readability.* Journalism Quarterly, 30(4), 415-433.
Mikolov, T. et al. (2013). *Distributed Representations of Words and Phrases and their Compositionality.* arXiv:1310.4546. https://arxiv.org/abs/1310.4546
