BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based encoder-only language model developed by researchers at Google AI Language. Introduced in October 2018, BERT changed the way natural language processing (NLP) systems are built by demonstrating that pre-training a deep bidirectional model on unlabeled text, then fine-tuning it on specific tasks, could beat purpose-built architectures across a wide range of benchmarks. The original paper, authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, has accumulated over 100,000 citations, making it one of the most referenced works in the history of artificial intelligence research.
BERT was open-sourced on November 2, 2018, with both pre-trained model weights and TensorFlow source code released on GitHub. Its release marked the beginning of a new era in NLP where transfer learning from large pre-trained models became the default approach for nearly every language understanding task. The paper, formally titled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (arXiv:1810.04805), went on to win the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), held in Minneapolis. The award was later cited as one of the field's clearest acknowledgments that the pre-train-then-fine-tune paradigm had displaced the older feature-engineering tradition.
More than seven years after its release, BERT and its descendants are still the default choice for production embedding, classification, named entity recognition, and retrieval pipelines. Even as decoder-only large language models like GPT-4 dominate the headlines, encoder-only models in the BERT family power most of the search engines, recommendation systems, content-moderation pipelines, and vector database backends that quietly run modern web services.
Before BERT, language models generally processed text in one direction. GPT (Generative Pre-trained Transformer), released by OpenAI in June 2018, used a left-to-right transformer decoder to predict the next token in a sequence. ELMo (Embeddings from Language Models), published earlier in 2018 by researchers at the Allen Institute for AI, concatenated the outputs of separate forward and backward LSTM networks to produce context-sensitive word representations. While ELMo captured some bidirectional context, its forward and backward components were trained independently and only combined in a shallow manner.
The BERT authors argued that existing approaches were suboptimal because they restricted the power of pre-trained representations. A truly bidirectional model, one that could attend to both left and right context simultaneously at every layer, would produce richer representations for downstream tasks. The challenge was that standard language modeling objectives (predicting the next word) inherently require unidirectional processing; allowing the model to "see" the target word during training would make the task trivial. Without a clever objective, deep bidirectional pre-training would amount to letting the model cheat.
BERT solved this with a new pre-training objective called Masked Language Modeling (MLM), which randomly hides a fraction of input tokens and trains the model to recover them from the surrounding context in both directions. This simple but effective approach enabled deep bidirectional pre-training for the first time. The idea was inspired by the older Cloze task from psycholinguistics, in which subjects fill in deleted words from a passage. By framing pre-training as a Cloze problem, the BERT team turned a methodological obstacle into a clean self-supervised objective.
BERT uses the encoder portion of the transformer architecture introduced by Vaswani et al. in 2017. Unlike the original transformer, which has both an encoder and a decoder, BERT uses only the encoder stack. Each encoder layer consists of a multi-head self-attention mechanism followed by a position-wise feed-forward network, with layer normalization and residual connections applied to each sub-layer.
The original paper described two model sizes:
| Configuration | Layers | Hidden size | Attention heads | Parameters | Max sequence length |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M | 512 |
| BERT-Large | 24 | 1024 | 16 | 340M | 512 |
BERT-Base was designed to have roughly the same model size as GPT (which had 12 layers and 117M parameters) to allow direct comparison. BERT-Large was the larger configuration used to push state-of-the-art results.
In March 2020, Google released 24 additional smaller BERT models ranging from BERT-Tiny (2 layers, 128 hidden size, 4.4M parameters) to BERT-Base, giving practitioners more options for resource-constrained settings. The smaller models were intended for mobile, edge, and low-budget research use cases where the full BERT-Base was overkill or simply too slow.
| Variant | Layers | Hidden size | Attention heads | Parameters |
|---|---|---|---|---|
| BERT-Tiny | 2 | 128 | 2 | 4.4M |
| BERT-Mini | 4 | 256 | 4 | 11.3M |
| BERT-Small | 4 | 512 | 8 | 28.8M |
| BERT-Medium | 8 | 512 | 8 | 41.4M |
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
BERT's input representation is built by summing three types of embeddings:
Token embeddings: The model uses WordPiece tokenization with a vocabulary of 30,522 tokens. WordPiece is a subword tokenization algorithm that splits rare words into smaller pieces (prefixed with "##" for continuation tokens) while keeping common words as single tokens. Of the 30,522 entries, roughly 5,800 are continuation subwords (about 19% of the vocabulary); the first 999 slots hold special and reserved tokens, indices 999 through 1,995 are individual characters and symbols, and the first whole word, "the," appears at index 1,996.
Segment embeddings: Because some tasks require understanding the relationship between two sentences, BERT adds a learned segment embedding to distinguish between Sentence A and Sentence B.
Position embeddings: Learned positional embeddings encode the position of each token in the sequence, up to a maximum of 512 positions. Unlike the sinusoidal positional encodings used in the original Transformer paper, BERT learns its position vectors from scratch as ordinary parameters. This choice made the model simpler but capped the input length at the maximum number of position embeddings learned during pre-training.
Every input sequence begins with a special [CLS] (classification) token. For tasks involving sentence pairs, a [SEP] (separator) token is inserted between the two sentences. Another [SEP] token marks the end of the input. The final hidden state of the [CLS] token serves as the aggregate sequence representation for classification tasks. Additional special tokens include [MASK] (used during pre-training), [PAD] (for padding shorter sequences), and [UNK] (for unknown tokens not in the vocabulary).
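A short example, assuming the Hugging Face transformers package (which postdates BERT's original release), shows how these special tokens and segment ids appear in an encoded sentence pair:

```python
# Illustrative sketch: encoding a sentence pair with bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The cat sat.", "It was tired.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 for Sentence A, 1 for Sentence B
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```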
The WordPiece algorithm itself is closely related to byte-pair encoding (BPE) but uses a different scoring function. Rather than counting raw co-occurrence frequencies, WordPiece merges the symbol pair that most increases the likelihood of the training corpus under a unigram language model. In practice, the algorithm starts with a base alphabet, then iteratively adds the most useful merge until the vocabulary reaches the target size. At inference time, BERT uses a greedy longest-match-first lookup: each input word is matched against the vocabulary from left to right, breaking the word into the longest possible subword fragments.
This design lets BERT handle rare words, neologisms, and morphologically complex inputs without an explosion of out-of-vocabulary tokens. A word like "unbelievability" might tokenize as un, ##believ, ##ability, while "playing" decomposes into play and ##ing. The vocabulary is small enough to fit comfortably in the embedding matrix yet large enough to keep most common words intact, which preserves a sense of word identity that pure character or byte models lack.
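The greedy inference step is simple enough to sketch directly. The following minimal Python sketch uses a tiny hypothetical vocabulary for illustration; the real lookup runs over the full 30,522-entry vocabulary:

```python
# A minimal sketch of WordPiece's greedy longest-match-first inference step.
# The toy vocabulary below is hypothetical, not the real BERT vocabulary.
VOCAB = {"un", "##believ", "##ability", "play", "##ing", "the", "[UNK]"}

def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split one word into the longest vocabulary fragments, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no fragment matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("unbelievability", VOCAB))  # ['un', '##believ', '##ability']
print(wordpiece_tokenize("playing", VOCAB))          # ['play', '##ing']
```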
BERT's pre-training uses two self-supervised objectives applied simultaneously to large corpora of unlabeled text.
The core innovation behind BERT is the Masked Language Modeling objective. During pre-training, 15% of the input tokens are randomly selected for prediction. To avoid a mismatch between pre-training (where [MASK] tokens appear) and fine-tuning (where they do not), the selected tokens are handled as follows:
- 80% of the time, the selected token is replaced with the [MASK] token.
- 10% of the time, it is replaced with a random token from the vocabulary.
- 10% of the time, it is left unchanged.

The model must predict the original token for each selected position. This approach forces the model to maintain a distributional representation for every input token, since it cannot know which tokens will be masked, and produces deep bidirectional representations because each masked token can attend to context on both sides.
The loss for MLM is computed only over the masked positions. A cross-entropy loss is taken between the model's softmax output (over the entire 30,522-token vocabulary) and the true token. Because only 15% of positions contribute to the loss, MLM is less sample-efficient than left-to-right language modeling, where every token contributes a prediction. Later models such as ELECTRA were designed in part to fix this inefficiency.
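The sketch below condenses this logic in PyTorch, in the spirit of (but not identical to) library implementations such as transformers' DataCollatorForLanguageModeling; the -100 label convention for ignored positions follows PyTorch's cross-entropy default:

```python
# A condensed sketch of the 80/10/10 masking recipe. A real implementation
# also excludes special tokens like [CLS], [SEP], and [PAD] from selection.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """Apply BERT-style masking to a batch of token ids."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # cross-entropy ignores -100: loss covers masked positions only

    # 80% of the selected positions are replaced with [MASK].
    masked = selected & torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool()
    input_ids[masked] = mask_id

    # Half of the remainder (10% overall) become a random vocabulary token...
    randomized = selected & ~masked & torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # ...and the final 10% keep their original token.
    return input_ids, labels
```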
Many downstream tasks, such as question answering and natural language inference, require understanding the relationship between two sentences. To capture this, BERT was pre-trained on a binary Next Sentence Prediction task. Given two sentences A and B, the model must predict whether B actually follows A in the original corpus (labeled "IsNext") or whether B is a random sentence (labeled "NotNext"). Training examples are constructed with a 50/50 split between real consecutive pairs and random pairs. The output of the [CLS] token feeds a small binary classifier to make the prediction.
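A toy sketch of the pair-construction scheme (names are illustrative, and a real implementation also packs sentences to fill the 512-token budget):

```python
# Hypothetical sketch of NSP example construction under the 50/50 scheme.
import random

def make_nsp_example(doc_sentences: list[str], corpus_sentences: list[str]):
    """Return (sentence_a, sentence_b, label) for one NSP training example."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], "IsNext"        # real consecutive pair
    return sentence_a, random.choice(corpus_sentences), "NotNext" # random negative pair
```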
Later research by the RoBERTa team at Facebook AI and others found that NSP does not consistently help downstream performance and may even hurt it. The likely problem is that the random negative pairs are easy to detect from topic alone, so the model learns to discriminate topics rather than the harder coherence relation NSP was meant to teach. As a result, many BERT successors dropped or replaced this objective. ALBERT replaced NSP with Sentence Order Prediction (SOP), which forces the model to distinguish two consecutive sentences in the original order from the same two sentences swapped, removing the topic shortcut.
BERT was pre-trained on two large English-language corpora:
| Corpus | Size | Description |
|---|---|---|
| BooksCorpus | 800M words | Collection of 11,038 unpublished books from various genres |
| English Wikipedia | 2,500M words | Text content extracted from English Wikipedia (lists, tables, and headers excluded) |
The combined training corpus contains roughly 3.3 billion words. Pre-training used a batch size of 256 sequences of 512 tokens each (131,072 tokens per batch) for 1,000,000 steps, which amounts to approximately 40 epochs over the combined dataset. Training used the Adam optimizer with a learning rate of 1e-4 and linear warmup over the first 10,000 steps followed by linear decay. BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, and BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days.
The pre-training procedure actually used two phases. In the first phase, the model was trained for 900,000 steps with a maximum sequence length of 128 tokens. In the second phase, training continued for 100,000 steps with a maximum sequence length of 512 tokens, allowing the model to learn longer-range positional embeddings. This staged schedule reduced wall-clock time, since the dominant cost in self-attention scales quadratically with sequence length.
Google reported the dollar cost of training BERT-Large in 2018 at roughly $7,000 on Cloud TPU pricing, although third-party estimates ranged as high as $12,500 depending on assumptions about hardware utilization. By 2023, MosaicML demonstrated that a BERT-Base model could be pre-trained from scratch to competitive accuracy for approximately $20 using modern hardware, optimized data loading, and improved training recipes. The drop in cost over five years made it cheap for individual labs and small companies to pre-train custom encoders rather than relying on public checkpoints.
One of BERT's main contributions is that a single pre-trained model can be adapted to many different tasks by adding a simple task-specific output layer and fine-tuning all parameters end-to-end. This is in contrast to feature-based approaches like ELMo, where the pre-trained model's weights are frozen and its outputs are used as fixed features.
Fine-tuning is straightforward. For classification tasks (sentiment analysis, textual entailment), a linear classifier is added on top of the [CLS] token's final hidden state. For token-level tasks (named entity recognition, part-of-speech tagging), a linear classifier is added on top of each token's final hidden state. For span-extraction tasks like question answering, start and end pointers are learned over the input sequence.
Fine-tuning is fast and inexpensive compared to pre-training. Most tasks can be fine-tuned in 1 to 4 epochs with a learning rate between 2e-5 and 5e-5, and training on a single GPU typically takes between 30 minutes and a few hours. The original BERT paper used a batch size of 16 or 32 for fine-tuning and tried only three learning rates (2e-5, 3e-5, 5e-5) per task, picking the best configuration on the development set. The recipe was deliberately minimalist to show that BERT's success came from the pre-trained representation rather than careful per-task tuning.
| Task type | Output layer | Input format | Examples |
|---|---|---|---|
| Sentence classification | Linear layer on [CLS] | Single sentence | Sentiment analysis, spam detection |
| Sentence pair classification | Linear layer on [CLS] | Sentence A [SEP] Sentence B | Natural language inference, paraphrase detection |
| Token classification | Linear layer per token | Single sentence | Named entity recognition, POS tagging |
| Extractive question answering | Start/end pointers | Question [SEP] Passage | SQuAD, reading comprehension |
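A minimal fine-tuning sketch for the first row of the table, assuming the Hugging Face transformers and datasets libraries; the SST-2 dataset and hyperparameters are illustrative choices within the paper's reported ranges, not the original setup:

```python
# Sketch: fine-tune BERT for sentence classification on SST-2.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # linear head on the [CLS] hidden state

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(
    "bert-sst2",
    learning_rate=2e-5,               # within the paper's 2e-5 to 5e-5 range
    per_device_train_batch_size=32,   # the paper used batch sizes of 16 or 32
    num_train_epochs=3,               # the paper used 2 to 4 epochs
)
trainer = Trainer(model=model, args=args, train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"], tokenizer=tokenizer)
trainer.train()
```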
Fine-tuned BERT models are still the dominant approach for production text classification and token tagging in 2026, even when the underlying base model has been swapped for a more recent encoder such as DeBERTa-v3 or ModernBERT. The procedural simplicity of "add one linear head, train for a few epochs" is hard to beat in operational settings where engineers value predictable results.
Fine-tuning is not always smooth. BERT-Large in particular is known to be unstable on small datasets like RTE or MRPC, where some random seeds can produce models that fail to learn beyond the majority-class baseline. A 2020 paper by Mosbach, Andriushchenko, and Klakow attributed the instability to a combination of vanishing gradients early in fine-tuning, missing bias-correction terms in BERT's original Adam variant, and the small number of training steps. The proposed fix, training for more epochs with a longer warmup and full Adam bias correction, became standard practice. Hugging Face's Trainer class incorporates these defaults and largely papers over the issue for downstream users.
BERT set new state-of-the-art results on 11 NLP benchmarks upon its release. The improvements were substantial across the board.
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence-level and sentence-pair classification tasks. BERT-Large achieved an average score of 80.5, a 7.7-point improvement over the previous best system. Individual task results for BERT-Large:
| Task | Metric | BERT-Large score | Previous SOTA |
|---|---|---|---|
| MNLI (matched) | Accuracy | 86.7% | 82.1% |
| MNLI (mismatched) | Accuracy | 85.9% | 81.4% |
| QQP | F1 | 72.1% | 66.1% |
| QNLI | Accuracy | 92.7% | 87.4% |
| SST-2 | Accuracy | 94.9% | 93.2% |
| CoLA | Matthews Corr. | 60.5% | 35.0% |
| STS-B | Spearman Corr. | 86.5% | 81.0% |
| MRPC | F1 | 89.3% | 85.4% |
| RTE | Accuracy | 70.1% | 56.0% |
BERT's strongest individual improvement was on CoLA (Corpus of Linguistic Acceptability), where it beat the previous best by 25.5 percentage points. The RTE (Recognizing Textual Entailment) task also saw a 14.1-point gain.
On the Stanford Question Answering Dataset (SQuAD) version 1.1, BERT achieved an F1 score of 93.2 and an Exact Match (EM) score of 87.4, surpassing the previous best system by 1.5 F1 points and exceeding the estimated human performance F1 of 91.2. On SQuAD 2.0, which includes unanswerable questions, BERT achieved an F1 of 83.1 and EM of 80.0, a 5.1-point F1 improvement.
On the Situations With Adversarial Generations (SWAG) dataset for grounded commonsense reasoning, BERT-Large achieved 86.3% accuracy, beating the previous best by 27.1 percentage points and approaching human performance of 88.0%.
BERT's performance on GLUE was so strong that it, along with subsequent models like XLNet and RoBERTa, quickly surpassed human-level performance on the benchmark by 2019. This prompted the creation of SuperGLUE, a harder benchmark designed to provide more room for improvement. SuperGLUE itself was later saturated by DeBERTa-v3 and T5 derivatives, and by 2024 most NLP researchers had moved on to LLM-era benchmarks like MMLU, BIG-Bench Hard, and HELM.
BERT was developed alongside several other approaches to pre-trained language representations. Understanding how these models differ helps explain why BERT was so effective.
| Feature | ELMo (2018) | GPT (2018) | BERT (2018) |
|---|---|---|---|
| Architecture | Bidirectional LSTM | Transformer decoder | Transformer encoder |
| Directionality | Shallow bidirectional (separate L-to-R and R-to-L) | Unidirectional (left-to-right) | Deep bidirectional |
| Pre-training objective | Language modeling (both directions, separately) | Autoregressive language modeling | Masked LM + Next Sentence Prediction |
| Downstream adaptation | Feature-based (frozen weights) | Fine-tuning (all parameters) | Fine-tuning (all parameters) |
| Parameters | 94M | 117M | 110M (Base), 340M (Large) |
| Context window | N/A (recurrent) | 512 tokens | 512 tokens |
ELMo's main limitation was its shallow bidirectionality: the forward and backward LSTMs were trained independently and only concatenated, so there was no cross-directional attention within individual layers. GPT used the transformer architecture (which is more parallelizable than LSTMs) but was restricted to left-to-right attention, limiting its ability to capture context to the right of any given token.
BERT combined the strengths of both approaches. It used the transformer architecture (like GPT) and processed text bidirectionally (like ELMo), but with true deep bidirectionality where each layer attends to both left and right context simultaneously.
BERT has been applied to a wide range of tasks in both research and industry.
In October 2019, Google announced that it was using BERT to improve search results for English-language queries, calling it "the biggest change in search in the past five years." By December 2019, BERT had been deployed for search queries in over 70 languages. By 2020, nearly every English search query processed by Google involved a BERT model. BERT helped Google better understand the intent behind search queries, especially longer and more conversational ones where prepositions and context words carry meaning. For example, a query like "2019 brazil traveler to usa need a visa" required understanding that the traveler is from Brazil (not going to Brazil), something earlier systems struggled with.
The rollout was expensive enough that Google publicly noted it had to deploy a fresh generation of Cloud TPUs to serve BERT at search-query latencies. Search Engine Land later reported that DeepRank, an internal Google ranking system launched in 2019, was a productionized BERT variant tuned for ranking rather than open-domain question answering. As of 2024, Google still uses BERT-class models in its featured-snippet generation, query rewriting, and natural-language understanding pipelines, though they now sit alongside larger generative models such as PaLM and Gemini for specific subtasks.
Fine-tuned BERT models achieve strong results on named entity recognition (NER) tasks, where the goal is to identify and classify entities such as person names, organizations, and locations in text. On the CoNLL-2003 NER benchmark, BERT-Large achieved an F1 score of 92.8.
NER is one of the workloads where BERT-style encoders continue to dominate. Decoder-only LLMs can perform NER zero-shot by prompting, but a fine-tuned BERT runs at a fraction of the cost, returns structured spans without parsing tricks, and matches or beats the prompted LLM on standard datasets like OntoNotes and Few-NERD. In production, encoder-based NER is the default in document processing, intelligent inboxes, healthcare data extraction, and compliance review systems.
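As an illustration, a fine-tuned encoder can be queried in a few lines via the transformers pipeline API; the checkpoint dslim/bert-base-NER is one popular community model, named here as an example rather than a canonical reference:

```python
# Sketch: structured span extraction with a fine-tuned BERT NER model.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge subword pieces into spans
for ent in ner("Angela Merkel visited the Google office in Berlin."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
# Typical output: PER Angela Merkel / ORG Google / LOC Berlin
```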
BERT is widely used for sentiment analysis in customer service, product reviews, and social media monitoring. Its ability to capture context-dependent meaning makes it better than earlier bag-of-words or simple embedding methods at handling negation, sarcasm, and nuanced opinions.
BERT's performance on SQuAD made it a natural fit for building question answering systems. Developers can fine-tune BERT on domain-specific QA datasets to build customer support bots, medical information retrieval systems, and educational tools. Google itself used BERT-based models in its featured snippets and Google Assistant.
The biggest commercial use of BERT today may be retrieval rather than classification. Sentence-BERT and its successors turned BERT into a tool for producing fixed-size sentence embeddings that can be compared with cosine similarity, which is the foundation for modern semantic search and retrieval-augmented generation. The sentence-transformers library, maintained by Nils Reimers and the UKP Lab at TU Darmstadt, is one of the most-downloaded packages on Hugging Face, and the models it ships are nearly all BERT or BERT-family encoders.
Leading 2024 and 2025 embedding models such as BGE-M3 from BAAI, NV-Embed from NVIDIA, GTE-large from Alibaba's DAMO Academy, and Cohere's English v3 embeddings all use BERT-family encoders or close descendants as their backbone. Even where the public-facing brand emphasizes proprietary improvements, the underlying network is almost always a BERT, RoBERTa, XLM-R, or DeBERTa derivative trained with contrastive objectives on top of an encoder pre-train.
Specialized versions of BERT, such as BioBERT and ClinicalBERT, have been pre-trained on biomedical literature and clinical notes. These domain-adapted models outperform general-purpose BERT on tasks like biomedical NER, relation extraction, and clinical text classification. BioBERT, introduced by Lee et al. in 2019, was trained on PubMed abstracts and PubMed Central full-text articles. ClinicalBERT, introduced by Alsentzer et al. in 2019, fine-tuned BioBERT on the MIMIC-III clinical notes corpus.
Companies including Facebook (now Meta) have used BERT-based models for detecting hate speech, misinformation, and other harmful content on social media platforms. Meta's WPIE (Whole-Post Integrity Embeddings) system, deployed across Facebook and Instagram for content classification, was originally built on a multilingual BERT-style encoder.
Multilingual BERT (mBERT) was pre-trained on Wikipedia text from 104 languages using a shared WordPiece vocabulary. Despite having no explicit cross-lingual training signal, mBERT shows surprisingly strong zero-shot cross-lingual transfer, where a model fine-tuned on English data performs well on the same task in other languages. Pires, Schlinger, and Garrette documented this behavior in their 2019 paper "How Multilingual is Multilingual BERT?", which helped establish a now-standard zero-shot evaluation protocol for cross-lingual models.
BERT's release sparked a wave of follow-up work that improved on its design in various ways. These models are sometimes collectively referred to as "BERTology."
RoBERTa (Robustly Optimized BERT Pre-training Approach) was developed by Facebook AI Research. The authors found that BERT was significantly undertrained and that better results could be achieved through more careful tuning of training hyperparameters and procedures. Key changes included:

- Training longer, with much larger batches, on roughly ten times more data (160 GB of text versus BERT's roughly 16 GB)
- Removing the Next Sentence Prediction objective
- Dynamic masking, where the masking pattern is re-sampled each time a sequence is fed to the model rather than fixed once during preprocessing
- A larger byte-level BPE vocabulary of about 50,000 entries
The RoBERTa training run used 1,024 NVIDIA V100 GPUs for 500,000 steps with sequence length 512. It matched or exceeded the performance of all models published after BERT at the time, scoring 88.5 on GLUE (compared to BERT-Large's 80.5). RoBERTa effectively reset community expectations for what a BERT-style model could achieve once you stopped underspending on pre-training.
ALBERT (A Lite BERT) was developed by Google Research and the Toyota Technological Institute at Chicago. It introduced two parameter-reduction techniques:

- Factorized embedding parameterization, which decomposes the vocabulary embedding matrix into two smaller matrices so the embedding dimension can be decoupled from, and much smaller than, the hidden size
- Cross-layer parameter sharing, in which all transformer layers reuse the same weights
ALBERT also replaced NSP with Sentence Order Prediction (SOP), where the model must determine whether two consecutive sentences are in the correct order or have been swapped. An ALBERT-xxlarge configuration with 235M parameters outperformed BERT-Large (340M parameters) on several benchmarks. ALBERT became influential for showing that parameter efficiency was a useful axis to optimize on, even when it slightly increased compute per training step.
DistilBERT was created by Hugging Face using knowledge distillation. The student model has 6 transformer layers (half of BERT-Base's 12) and 66M parameters, making it 40% smaller and 60% faster at inference. Despite its reduced size, DistilBERT retains about 97% of BERT-Base's language understanding capabilities as measured on GLUE. The distillation process uses a combination of three loss functions: language modeling loss, distillation loss (matching the teacher model's probability distributions), and cosine embedding loss.
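A simplified sketch of the three-part objective in PyTorch; the temperature, equal weighting, and exact hidden-state alignment are illustrative assumptions, not the published recipe:

```python
# Sketch of a DistilBERT-style combined loss: MLM + distillation + cosine.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden, temperature=2.0):
    """Combine the three losses; equal weights here for simplicity."""
    # 1. Standard masked-language-modeling loss against the true tokens.
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # 2. Distillation loss: match the teacher's softened output distribution.
    distill = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       F.softmax(teacher_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    # 3. Cosine loss aligning student and teacher hidden-state directions.
    cosine = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + distill + cosine
```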
DistilBERT remained one of the most-downloaded models on Hugging Face for years and is still common in production deployments where latency budgets matter, such as real-time chat moderation or on-device inference. Its success also helped popularize knowledge distillation as a standard tool in NLP engineering.
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) was developed by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning at Stanford University and Google. Instead of masked language modeling, ELECTRA uses a "replaced token detection" objective. A small generator network (a small MLM) produces plausible replacement tokens, and a larger discriminator network must identify which tokens in the input have been replaced. Because this task is defined over all input tokens (not just the 15% that are masked), ELECTRA is much more sample-efficient. An ELECTRA-Small model trained on one GPU for 4 days outperformed GPT on GLUE, and ELECTRA-Large matched RoBERTa and XLNet while using less than a quarter of their compute.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) was developed by Microsoft Research. It introduced two main improvements:

- Disentangled attention, which represents each token with separate vectors for content and relative position and computes attention from content-to-content, content-to-position, and position-to-content interactions
- An enhanced mask decoder that injects absolute position information just before the model predicts the masked tokens
In January 2021, a scaled-up DeBERTa model with 1.5 billion parameters became the first model to surpass human-level performance on the SuperGLUE benchmark, achieving a macro-average score of 90.3 compared to the human baseline of 89.8.
DeBERTa-v3, released by Pengcheng He, Jianfeng Gao, and Weizhu Chen at Microsoft in November 2021 and accepted at ICLR 2023, replaces masked language modeling with the ELECTRA-style replaced token detection objective and adds a new gradient-disentangled embedding sharing scheme (GDES). In ELECTRA, the generator and discriminator share input embeddings, but their training losses pull the embeddings in different directions, creating a "tug of war." GDES lets the generator share its embeddings with the discriminator while stopping the discriminator's gradients from flowing back into the generator's embedding parameters, which improves training efficiency and downstream accuracy.
DeBERTa-v3 became, by most measures, the strongest publicly released encoder model of its era. The base, large, and xlarge variants topped many fine-tuning leaderboards through 2024, and the multilingual mDeBERTa-v3 model is still a popular choice for cross-lingual classification work in 2026.
SpanBERT, developed by Facebook AI and other institutions, modified the masking strategy to mask contiguous random spans rather than individual random tokens. It also removed the NSP objective. SpanBERT outperformed BERT on span-selection tasks such as question answering and coreference resolution.
XLNet, developed by researchers at Carnegie Mellon University and Google, addressed BERT's limitation that masked tokens are predicted independently of each other. XLNet uses a permutation-based language modeling objective that captures bidirectional context while maintaining autoregressive properties. It also incorporated the recurrence mechanism from Transformer-XL to handle longer sequences.
Sentence-BERT (SBERT), introduced by Nils Reimers and Iryna Gurevych at EMNLP 2019, modifies BERT with siamese and triplet network structures so that semantically meaningful sentence embeddings can be derived and compared with cosine similarity. The motivation was practical: comparing two sentences with vanilla BERT requires running both through the network jointly, so finding the most similar pair in a collection of 10,000 sentences would require roughly 50 million inference calls and around 65 hours of compute. With SBERT, the same task takes about five seconds because each sentence can be encoded once and stored as a fixed vector.
SBERT pools BERT's output (typically with mean pooling) to produce a fixed-size embedding, then trains the model with a siamese architecture on natural language inference and semantic similarity datasets. Sentence-BERT became the foundation for the sentence-transformers library and is the direct ancestor of nearly every modern embedding model used in semantic search and RAG pipelines.
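A typical usage sketch with the sentence-transformers library; all-MiniLM-L6-v2 is one widely used BERT-family checkpoint, chosen here purely for illustration:

```python
# Sketch: fixed-size sentence embeddings compared with cosine similarity.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I reset my password?",
             "What are the steps to change my login credentials?",
             "The weather in Lisbon is sunny today."]
embeddings = model.encode(sentences)          # one fixed-size vector per sentence

print(cos_sim(embeddings[0], embeddings[1]))  # high similarity (paraphrases)
print(cos_sim(embeddings[0], embeddings[2]))  # low similarity (unrelated)
```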
A family of domain-specific BERT models extends the encoder to specialized text. The most widely cited variants:
| Model | Domain | Training corpus | Released |
|---|---|---|---|
| BioBERT | Biomedical | PubMed abstracts and PMC full-text | 2019, Lee et al. |
| SciBERT | Scientific | Semantic Scholar papers (CS + biomedical) | 2019, Beltagy et al. |
| ClinicalBERT | Clinical | MIMIC-III ICU clinical notes | 2019, Alsentzer et al. |
| FinBERT | Financial | Reuters TRC2 + financial corpora | 2019, Araci |
| LegalBERT | Legal | EU legislation, court cases, contracts | 2020, Chalkidis et al. |
| CodeBERT | Source code | GitHub repositories (Python, Java, etc.) | 2020, Feng et al. |
| PatentBERT | Patents | USPTO patent text | 2019, Lee and Hsiang |
These models follow the same pattern: take a pre-trained BERT, continue pre-training on domain text, and optionally swap in a domain-specific WordPiece vocabulary. SciBERT was the first to argue that an in-domain vocabulary, not just an in-domain corpus, gives meaningful additional gains on scientific NLP tasks.
The multilingual BERT family includes:
| Model | Languages | Notes |
|---|---|---|
| mBERT (bert-base-multilingual-cased) | 104 languages | Trained on Wikipedia with shared WordPiece vocabulary |
| XLM | 15 languages | Adds cross-lingual translation language modeling (Conneau and Lample, 2019) |
| XLM-R | 100 languages | Trained on 2.5 TB of CommonCrawl, replaces XLM (Conneau et al., 2020) |
| mDeBERTa-v3 | 100 languages | Multilingual DeBERTa-v3, current strong baseline for cross-lingual transfer |
XLM-R in particular dominated cross-lingual fine-tuning for several years and remains the default for low-resource language work in 2026.
A parallel line of work focused on shrinking BERT for edge and mobile deployment:
| Model | Year | Parameters | Key idea |
|---|---|---|---|
| MobileBERT | 2020 | 25M | Bottleneck design with inverted-bottleneck transformer block |
| TinyBERT | 2020 | 14.5M | Two-stage knowledge distillation (general + task-specific) |
| MiniLM | 2020 | 33M | Self-attention distillation; popular base for sentence-transformers |
| FastBERT | 2020 | 110M | Adaptive early exit with self-distilled student classifiers |
MiniLM, introduced by Wang et al. at Microsoft in 2020, deserves special mention. It became the most common backbone for production embedding models because it offers a strong accuracy-to-size ratio and trains well with contrastive objectives.
Released on December 19, 2024 by Answer.AI, LightOn, and collaborators, ModernBERT applies six years of architectural progress from large language model research to the encoder-only paradigm. The headline changes:

- Rotary positional embeddings (RoPE) in place of BERT's learned absolute position embeddings
- A native context length of 8,192 tokens, sixteen times BERT's 512
- FlashAttention and sequence unpadding for much faster training and inference
- Alternating global and local attention layers to keep long-context costs manageable
- GeGLU feed-forward layers and the removal of bias terms, following recent LLM practice
ModernBERT-Base has 139M parameters and ModernBERT-Large has 395M parameters. ModernBERT-Large goes deeper (28 layers) than RoBERTa-Large while matching it in total parameter count. The training corpus was 2 trillion tokens drawn from web text, code, scientific literature, and other diverse sources, dwarfing BERT's original 3.3 billion words by nearly three orders of magnitude.
On standard benchmarks, ModernBERT matches or beats DeBERTa-v3 while running roughly twice as fast on GPU and supporting much longer documents. Its release was widely treated as the long-overdue "return of the encoder" after several years of LLM-only headlines, and it has become the default starting point for new retrieval and classification work in 2025 and 2026.
| Model | Year | Developer | Key innovation | Parameters |
|---|---|---|---|---|
| BERT-Base | 2018 | Google | Masked LM + bidirectional encoder | 110M |
| BERT-Large | 2018 | Google | Larger version of BERT-Base | 340M |
| RoBERTa | 2019 | Facebook AI | Dynamic masking, no NSP, more data | 355M |
| ALBERT | 2019 | Google / TTIC | Parameter sharing, factorized embeddings | 12M to 235M |
| DistilBERT | 2019 | Hugging Face | Knowledge distillation, 6 layers | 66M |
| XLNet | 2019 | CMU / Google | Permutation language modeling | 340M |
| Sentence-BERT | 2019 | UKP Lab, TU Darmstadt | Siamese network for sentence embeddings | 110M |
| BioBERT | 2019 | Korea University | Continued pretrain on PubMed | 110M |
| SpanBERT | 2020 | Facebook AI | Span masking | 340M |
| ELECTRA | 2020 | Stanford / Google | Replaced token detection | 14M to 335M |
| DeBERTa | 2020 | Microsoft | Disentangled attention, enhanced mask decoder | 134M to 1.5B |
| MobileBERT | 2020 | CMU / Google | Bottleneck design for mobile | 25M |
| MiniLM | 2020 | Microsoft | Self-attention distillation | 33M |
| XLM-R | 2020 | Facebook AI | Cross-lingual training on CommonCrawl | 270M to 10.7B |
| DeBERTa-v3 | 2021 | Microsoft | ELECTRA pretraining + GDES | 22M to 304M |
| ModernBERT | 2024 | Answer.AI / LightOn | RoPE, FlashAttention, 8K context | 139M to 395M |
Each transformer layer in BERT uses multi-head self-attention. For BERT-Base with 12 attention heads and a hidden size of 768, each head operates on a 64-dimensional subspace (768 / 12 = 64). The attention computation follows the standard scaled dot-product formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. Multi-head attention runs this computation in parallel across all heads and concatenates the results.
Unlike GPT, which uses a causal attention mask to prevent each position from attending to subsequent positions, BERT uses no such mask. Every token can attend to every other token in the sequence, enabling full bidirectional context.
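A NumPy sketch of this computation for a single head makes the absence of a causal mask explicit; the shapes are illustrative:

```python
# Sketch: scaled dot-product attention for one head, with no causal mask,
# so every position attends to every other position.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

seq_len, d_k = 8, 64                                 # one BERT-Base head: 768 / 12 = 64
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 64)
```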
After the attention sub-layer, each transformer layer applies a two-layer feed-forward network with a GELU (Gaussian Error Linear Unit) activation function. The inner dimension of the feed-forward network is 4 times the hidden size (3,072 for BERT-Base, 4,096 for BERT-Large).
BERT applies dropout with a rate of 0.1 to all layers during training, including the attention weights and the outputs of the feed-forward sub-layers. The pre-training runs also used an L2 weight decay of 0.01 alongside the Adam optimizer.
Pre-training BERT-Large on 64 TPU chips for 4 days was a significant computational investment in 2018. Estimates place the cost at roughly $6,900 to $12,500 using cloud TPU pricing at the time. Subsequent work has reduced this cost substantially. In 2023, MosaicML demonstrated that a BERT-Base model could be pre-trained from scratch to competitive accuracy for approximately $20 using modern hardware and training optimizations.
A whole research subfield called BERTology grew up around the question of what BERT actually learns from its pre-training. Probing studies inspect BERT's hidden states with auxiliary classifiers and find that lower layers tend to encode surface features (capitalization, token identity), middle layers capture syntactic information (part-of-speech, dependency relations, constituency), and upper layers represent more semantic and task-specific information. Tenney, Das, and Pavlick (2019) coined the phrase "BERT rediscovers the classical NLP pipeline" to describe this layered organization.
Attention-head analyses by Clark et al. (2019) showed that individual heads specialize for tasks such as tracking direct objects of verbs, attending to coreferent mentions, or marking sentence boundaries. Other work probed BERT's knowledge of factual relationships using cloze-style queries (Petroni et al., 2019, the LAMA benchmark) and found that BERT stores a surprising amount of relational knowledge in its parameters even without an explicit knowledge base. These insights influenced later work on knowledge editing and retrieval-augmented language models.
Despite its success, BERT has several known limitations.
Maximum sequence length. BERT's input is limited to 512 tokens due to its learned positional embeddings. Documents longer than this must be truncated or split into chunks, which can lose context across boundaries. Later models like Longformer, BigBird, and ModernBERT addressed this with sparse attention or rotary positional encodings.
Independence assumption in MLM. During pre-training, BERT predicts each masked token independently given the unmasked tokens. This means it does not model the joint probability of masked tokens, which can be a disadvantage for generative tasks. XLNet's permutation language modeling was designed to address this.
Pre-training/fine-tuning mismatch. The [MASK] token appears during pre-training but never during fine-tuning or inference. The 80/10/10 masking strategy (described above) partially mitigates this, but the mismatch remains.
Not generative. Because BERT is an encoder-only model, it is not designed for text generation tasks such as summarization, translation, or dialogue. For these tasks, encoder-decoder models like T5 or decoder-only models like GPT are better suited.
Computational cost for fine-tuning. Although fine-tuning is cheaper than pre-training, running BERT-Large (340M parameters) for inference in production environments can be expensive. This motivated the development of smaller, distilled models like DistilBERT and TinyBERT.
NSP weakness. The Next Sentence Prediction task was later shown to be an ineffective pre-training objective. RoBERTa, ALBERT (with SOP), and SpanBERT all dropped NSP and achieved better results.
English bias. The original BERT was trained almost entirely on English Wikipedia and BooksCorpus. Multilingual BERT helps but underperforms on low-resource languages compared to dedicated multilingual encoders like XLM-R and mDeBERTa-v3.
Fine-tuning instability. As noted earlier, BERT-Large fine-tuning on small datasets can fail to converge for a meaningful fraction of random seeds. This makes BERT less reliable than later models for low-data regimes without careful hyperparameter tuning.
Social bias. Studies of BERT and its successors have repeatedly found that the model encodes social and demographic biases present in the training corpus. Kurita et al. (2019) used template-based probes to show measurable gender and racial associations in BERT's representations, and similar follow-up work has documented religious, age, and disability-related biases. These findings motivated the broader field of bias evaluation and mitigation in pre-trained language models, and continue to inform deployment decisions for production systems.
Despite the rise of decoder-only LLMs, the BERT family remains the workhorse of production NLP. The current 2026 landscape looks roughly like this:
For sentence and document embeddings, BERT family encoders dominate. Sentence-transformers built on MiniLM, MPNet, and BERT-Base power most semantic search systems. The leading 2024-2025 retrieval models such as BGE-M3 (BAAI), GTE-large (Alibaba), E5-Mistral (Microsoft), and NV-Embed (NVIDIA) use either BERT-family encoders directly or LLM backbones fine-tuned with encoder-style contrastive objectives.
For classification, NER, and span extraction, fine-tuned encoders are still the dominant approach for production. DeBERTa-v3 was the strongest pre-2024 baseline. ModernBERT has displaced it in many new projects since its December 2024 release. For domain-specific tasks, BioBERT, ClinicalBERT, FinBERT, and similar continue to ship in healthcare, finance, and legal pipelines.
For reranking in retrieval-augmented generation (RAG), cross-encoder rerankers built on BERT or DeBERTa remain standard. The BAAI bge-reranker family, Cohere Rerank 3, and Jina's reranker-v2 all use BERT-style cross-encoders that score query-passage pairs in a single forward pass.
For text generation, summarization, and chat, decoder-only LLMs like GPT-4, Claude, and Gemini have replaced BERT entirely. BERT was never designed to generate text autoregressively, and even encoder-decoder models like T5 have given way to decoder-only architectures for most generative work.
In aggregate, BERT and its descendants account for the overwhelming majority of NLP inference calls made in 2026, even though they receive far less media attention than the largest LLMs. Hugging Face's 2024 download statistics show BERT-base-uncased among the top three most-downloaded models, and the cumulative downloads of MiniLM-based sentence-transformers easily run into the billions.
BERT's impact on the NLP field has been wide-reaching. Before BERT, most NLP systems were trained from scratch on task-specific labeled data, often with hand-engineered features. BERT popularized the "pre-train, then fine-tune" paradigm, which has since become the standard approach in NLP and has spread to other domains including computer vision (with models like ViT) and speech recognition (with models like wav2vec).
Several trends that BERT helped set in motion include:
Large-scale pre-training. BERT showed that training on large amounts of unlabeled text produces general-purpose representations that transfer well to downstream tasks. This insight was extended by GPT-2, GPT-3, and later large language models (LLMs) that scaled up both model size and training data.
Open model release. Google's decision to open-source BERT's weights and code set an expectation in the research community. It enabled thousands of researchers and companies to build on BERT's foundation without needing Google-scale compute to pre-train their own models.
Hugging Face ecosystem. BERT was one of the first models to gain widespread adoption through the Hugging Face Transformers library, which provided a PyTorch implementation within weeks of BERT's release. As of 2024, BERT remains the second most downloaded model on Hugging Face, with over 68 million monthly downloads.
Domain-specific pre-training. BERT inspired a wave of domain-adapted language models, including SciBERT (scientific text), BioBERT (biomedical literature), ClinicalBERT (clinical notes), FinBERT (financial text), and LegalBERT (legal documents). These models are pre-trained on domain-specific corpora and consistently outperform general-purpose BERT on in-domain tasks.
Encoder models for production. Even as decoder-only LLMs like GPT-4 dominate headlines, BERT-style encoder models remain the workhorses for many production NLP systems. Their lower computational cost and strong performance on classification, retrieval, and extraction tasks make them practical choices for applications that run at scale.
BERT also contributed to the broader conversation about what language models actually learn. The "BERTology" research subfield has produced hundreds of papers analyzing BERT's internal representations, probing what linguistic information its layers capture, and studying how attention heads specialize. This work has provided insights into syntax, semantics, and the nature of contextual representations in neural networks.
The NAACL-HLT 2019 Best Long Paper Award was the first formal community recognition of BERT's significance. In retrospect, the citation count alone (more than 100,000 by 2024) places the paper among the most influential in the history of computer science, and arguably the most influential single paper in the history of natural language processing.
BERT's original TensorFlow implementation and pre-trained weights are available on GitHub at google-research/bert. The model is also available through the Hugging Face Transformers library (in PyTorch, TensorFlow, and JAX/Flax) under model identifiers such as bert-base-uncased, bert-base-cased, bert-large-uncased, and bert-large-cased. Uncased models have all text lowercased before tokenization, while cased models preserve original casing.
| Model identifier | Description |
|---|---|
| bert-base-uncased | Base model, 110M parameters, lowercased input |
| bert-base-cased | Base model, 110M parameters, case-preserving input |
| bert-large-uncased | Large model, 340M parameters, lowercased input |
| bert-large-cased | Large model, 340M parameters, case-preserving input |
| bert-base-multilingual-cased | mBERT, 104 languages, case-preserving |
| bert-base-chinese | Chinese-only BERT, character-level tokenization |
| bert-large-uncased-whole-word-masking | Variant with whole-word masking instead of WordPiece masking |
Multilingual BERT (bert-base-multilingual-cased) covers 104 languages, and Chinese BERT (bert-base-chinese) is trained specifically on Chinese text using character-level tokenization rather than WordPiece, since Chinese characters already function as natural subword units.
For most new projects in 2026, Hugging Face's transformers library is the canonical entry point. A typical fine-tuning workflow takes fewer than 50 lines of Python and runs comfortably on a single consumer GPU. The same library exposes ModernBERT, DeBERTa-v3, and the full BERT family behind a uniform API, which means developers can swap between them with minimal code changes.
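A minimal end-to-end example with the pipeline API gives a feel for the model; the exact predictions and scores depend on the checkpoint version:

```python
# Sketch: querying pre-trained BERT's masked-language-modeling head.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
# "paris" is typically the top prediction by a wide margin.
```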