BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based encoder-only language model developed by researchers at Google AI Language. Introduced in October 2018, BERT changed the way natural language processing (NLP) systems are built by demonstrating that pre-training a deep bidirectional model on unlabeled text, then fine-tuning it on specific tasks, could beat purpose-built architectures across a wide range of benchmarks. The original paper, authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, has accumulated over 100,000 citations, making it one of the most referenced works in the history of artificial intelligence research.
BERT was open-sourced on November 2, 2018, with both pre-trained model weights and TensorFlow source code released on GitHub. Its release marked the beginning of a new era in NLP where transfer learning from large pre-trained models became the default approach for nearly every language understanding task. The paper, formally titled "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (arXiv:1810.04805), went on to win the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), held in Minneapolis. The award was later cited as one of the field's clearest acknowledgments that the pre-train-then-fine-tune paradigm had displaced the older feature-engineering tradition.
More than seven years after its release, BERT and its descendants are still the default choice for production embedding, classification, named entity recognition, and retrieval pipelines. Even as decoder-only large language models like GPT-4 dominate the headlines, encoder-only models in the BERT family power most of the search engines, recommendation systems, content-moderation pipelines, and vector database backends that quietly run modern web services.
Before BERT, language models generally processed text in one direction. GPT (Generative Pre-trained Transformer), released by OpenAI in June 2018, used a left-to-right transformer decoder to predict the next token in a sequence. ELMo (Embeddings from Language Models), published earlier in 2018 by researchers at the Allen Institute for AI, concatenated the outputs of separate forward and backward LSTM networks to produce context-sensitive word representations. While ELMo captured some bidirectional context, its forward and backward components were trained independently and only combined in a shallow manner.
The BERT authors argued that existing approaches were suboptimal because they restricted the power of pre-trained representations. A truly bidirectional model, one that could attend to both left and right context simultaneously at every layer, would produce richer representations for downstream tasks. The challenge was that standard language modeling objectives (predicting the next word) inherently require unidirectional processing; allowing the model to "see" the target word during training would make the task trivial. Without a clever objective, deep bidirectional pre-training would amount to letting the model cheat.
BERT solved this with a new pre-training objective called Masked Language Modeling (MLM), which randomly hides a fraction of input tokens and trains the model to recover them from the surrounding context in both directions. This simple but effective approach enabled deep bidirectional pre-training for the first time. The idea was inspired by the older Cloze task from psycholinguistics, in which subjects fill in deleted words from a passage. By framing pre-training as a Cloze problem, the BERT team turned a methodological obstacle into a clean self-supervised objective.
BERT uses the encoder portion of the transformer architecture introduced by Vaswani et al. in 2017. Unlike the original transformer, which has both an encoder and a decoder, BERT uses only the encoder stack. Each encoder layer consists of a multi-head self-attention mechanism followed by a position-wise feed-forward network, with layer normalization and residual connections applied to each sub-layer.
The original paper described two model sizes:
| Configuration | Layers | Hidden size | Attention heads | Parameters | Max sequence length |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M | 512 |
| BERT-Large | 24 | 1024 | 16 | 340M | 512 |
BERT-Base was designed to have roughly the same model size as GPT (which had 12 layers and 117M parameters) to allow direct comparison. BERT-Large was the larger configuration used to push state-of-the-art results.
In March 2020, Google released 24 additional smaller BERT models ranging from BERT-Tiny (2 layers, 128 hidden size, 4.4M parameters) to BERT-Base, giving practitioners more options for resource-constrained settings. The smaller models were intended for mobile, edge, and low-budget research use cases where the full BERT-Base was overkill or simply too slow.
| Variant | Layers | Hidden size | Attention heads | Parameters |
|---|---|---|---|---|
| BERT-Tiny | 2 | 128 | 2 | 4.4M |
| BERT-Mini | 4 | 256 | 4 | 11.3M |
| BERT-Small | 4 | 512 | 8 | 28.8M |
| BERT-Medium | 8 | 512 | 8 | 41.4M |
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
BERT's input representation is built by summing three types of embeddings:
Token embeddings: The model uses WordPiece tokenization with a vocabulary of 30,522 tokens. WordPiece is a subword tokenization algorithm that splits rare words into smaller pieces (prefixed with "##" for continuation tokens) while keeping common words as single tokens. Of the 30,522 entries, roughly 5,800 are continuation subwords (about 19% of the vocabulary); the first 999 slots hold special and reserved tokens, indices 999 through 1,995 are individual characters and symbols, and the first whole word, "the," appears at index 1,996.
Segment embeddings: Because some tasks require understanding the relationship between two sentences, BERT adds a learned segment embedding to distinguish between Sentence A and Sentence B.
Position embeddings: Learned positional embeddings encode the position of each token in the sequence, up to a maximum of 512 positions. Unlike the sinusoidal positional encodings used in the original Transformer paper, BERT learns its position vectors from scratch as ordinary parameters. This choice made the model simpler but capped the input length at the maximum number of position embeddings learned during pre-training.
Every input sequence begins with a special [CLS] (classification) token. For tasks involving sentence pairs, a [SEP] (separator) token is inserted between the two sentences. Another [SEP] token marks the end of the input. The final hidden state of the [CLS] token serves as the aggregate sequence representation for classification tasks. Additional special tokens include [MASK] (used during pre-training), [PAD] (for padding shorter sequences), and [UNK] (for unknown tokens not in the vocabulary).
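A short example, assuming the Hugging Face transformers package (which postdates BERT's original release), shows how these special tokens and segment ids appear in an encoded sentence pair:

```python
# Illustrative sketch: encoding a sentence pair with bert-base-uncased.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The cat sat.", "It was tired.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 for Sentence A, 1 for Sentence B
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```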
The WordPiece algorithm itself is closely related to byte-pair encoding (BPE) but uses a different scoring function. Rather than counting raw co-occurrence frequencies, WordPiece merges the symbol pair that most increases the likelihood of the training corpus under a unigram language model. In practice, the algorithm starts with a base alphabet, then iteratively adds the most useful merge until the vocabulary reaches the target size. At inference time, BERT uses a greedy longest-match-first lookup: each input word is matched against the vocabulary from left to right, breaking the word into the longest possible subword fragments.
This design lets BERT handle rare words, neologisms, and morphologically complex inputs without an explosion of out-of-vocabulary tokens. A word like "unbelievability" might tokenize as un, ##believ, ##ability, while "playing" decomposes into play and ##ing. The vocabulary is small enough to fit comfortably in the embedding matrix yet large enough to keep most common words intact, which preserves a sense of word identity that pure character or byte models lack.
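The greedy inference step is simple enough to sketch directly. The following minimal Python sketch uses a tiny hypothetical vocabulary for illustration; the real lookup runs over the full 30,522-entry vocabulary:

```python
# A minimal sketch of WordPiece's greedy longest-match-first inference step.
# The toy vocabulary below is hypothetical, not the real BERT vocabulary.
VOCAB = {"un", "##believ", "##ability", "play", "##ing", "the", "[UNK]"}

def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split one word into the longest vocabulary fragments, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no fragment matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_tokenize("unbelievability", VOCAB))  # ['un', '##believ', '##ability']
print(wordpiece_tokenize("playing", VOCAB))          # ['play', '##ing']
```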
BERT's pre-training uses two self-supervised objectives applied simultaneously to large corpora of unlabeled text.
The core innovation behind BERT is the Masked Language Modeling objective. During pre-training, 15% of the input tokens are randomly selected for prediction. To avoid a mismatch between pre-training (where [MASK] tokens appear) and fine-tuning (where they do not), the selected tokens are handled as follows:
- 80% of the time, the selected token is replaced with the [MASK] token.
- 10% of the time, it is replaced with a random token from the vocabulary.
- 10% of the time, it is left unchanged.

The model must predict the original token for each selected position. This approach forces the model to maintain a distributional representation for every input token, since it cannot know which tokens will be masked, and produces deep bidirectional representations because each masked token can attend to context on both sides.
The loss for MLM is computed only over the masked positions. A cross-entropy loss is taken between the model's softmax output (over the entire 30,522-token vocabulary) and the true token. Because only 15% of positions contribute to the loss, MLM is less sample-efficient than left-to-right language modeling, where every token contributes a prediction. Later models such as ELECTRA were designed in part to fix this inefficiency.
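The sketch below condenses this logic in PyTorch, in the spirit of (but not identical to) library implementations such as transformers' DataCollatorForLanguageModeling; the -100 label convention for ignored positions follows PyTorch's cross-entropy default:

```python
# A condensed sketch of the 80/10/10 masking recipe. A real implementation
# also excludes special tokens like [CLS], [SEP], and [PAD] from selection.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """Apply BERT-style masking to a batch of token ids."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # cross-entropy ignores -100: loss covers masked positions only

    # 80% of the selected positions are replaced with [MASK].
    masked = selected & torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool()
    input_ids[masked] = mask_id

    # Half of the remainder (10% overall) become a random vocabulary token...
    randomized = selected & ~masked & torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # ...and the final 10% keep their original token.
    return input_ids, labels
```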
Many downstream tasks, such as question answering and natural language inference, require understanding the relationship between two sentences. To capture this, BERT was pre-trained on a binary Next Sentence Prediction task. Given two sentences A and B, the model must predict whether B actually follows A in the original corpus (labeled "IsNext") or whether B is a random sentence (labeled "NotNext"). Training examples are constructed with a 50/50 split between real consecutive pairs and random pairs. The output of the [CLS] token feeds a small binary classifier to make the prediction.
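A toy sketch of the pair-construction scheme (names are illustrative, and a real implementation also packs sentences to fill the 512-token budget):

```python
# Hypothetical sketch of NSP example construction under the 50/50 scheme.
import random

def make_nsp_example(doc_sentences: list[str], corpus_sentences: list[str]):
    """Return (sentence_a, sentence_b, label) for one NSP training example."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], "IsNext"        # real consecutive pair
    return sentence_a, random.choice(corpus_sentences), "NotNext" # random negative pair
```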
Later research by the RoBERTa team at Facebook AI and others found that NSP does not consistently help downstream performance and may even hurt it. The likely problem is that the random negative pairs are easy to detect from topic alone, so the model learns to discriminate topics rather than the harder coherence relation NSP was meant to teach. As a result, many BERT successors dropped or replaced this objective. ALBERT replaced NSP with Sentence Order Prediction (SOP), which forces the model to distinguish two consecutive sentences in the original order from the same two sentences swapped, removing the topic shortcut.
BERT was pre-trained on two large English-language corpora:
| Corpus | Size | Description |
|---|---|---|
| BooksCorpus | 800M words | Collection of 11,038 unpublished books from various genres |
| English Wikipedia | 2,500M words | Text content extracted from English Wikipedia (lists, tables, and headers excluded) |
The combined training corpus contains roughly 3.3 billion words. Pre-training used a batch size of 256 sequences of 512 tokens each (131,072 tokens per batch) for 1,000,000 steps, which amounts to approximately 40 epochs over the combined dataset. Training used the Adam optimizer with a learning rate of 1e-4 and linear warmup over the first 10,000 steps followed by linear decay. BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, and BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days.
The pre-training procedure actually used two phases. In the first phase, the model was trained for 900,000 steps with a maximum sequence length of 128 tokens. In the second phase, training continued for 100,000 steps with a maximum sequence length of 512 tokens, allowing the model to learn longer-range positional embeddings. This staged schedule reduced wall-clock time, since the dominant cost in self-attention scales quadratically with sequence length.
Google reported the dollar cost of training BERT-Large in 2018 at roughly $7,000 on Cloud TPU pricing, although third-party estimates ranged as high as $12,500 depending on assumptions about hardware utilization. By 2023, MosaicML demonstrated that a BERT-Base model could be pre-trained from scratch to competitive accuracy for approximately $20 using modern hardware, optimized data loading, and improved training recipes. The drop in cost over five years made it cheap for individual labs and small companies to pre-train custom encoders rather than relying on public checkpoints.
One of BERT's main contributions is that a single pre-trained model can be adapted to many different tasks by adding a simple task-specific output layer and fine-tuning all parameters end-to-end. This is in contrast to feature-based approaches like ELMo, where the pre-trained model's weights are frozen and its outputs are used as fixed features.
Fine-tuning is straightforward. For classification tasks (sentiment analysis, textual entailment), a linear classifier is added on top of the [CLS] token's final hidden state. For token-level tasks (named entity recognition, part-of-speech tagging), a linear classifier is added on top of each token's final hidden state. For span-extraction tasks like question answering, start and end pointers are learned over the input sequence.
Fine-tuning is fast and inexpensive compared to pre-training. Most tasks can be fine-tuned in 1 to 4 epochs with a learning rate between 2e-5 and 5e-5, and training on a single GPU typically takes between 30 minutes and a few hours. The original BERT paper used a batch size of 16 or 32 for fine-tuning and tried only three learning rates (2e-5, 3e-5, 5e-5) per task, picking the best configuration on the development set. The recipe was deliberately minimalist to show that BERT's success came from the pre-trained representation rather than careful per-task tuning.
| Task type | Output layer | Input format | Examples |
|---|---|---|---|
| Sentence classification | Linear layer on [CLS] | Single sentence | Sentiment analysis, spam detection |
| Sentence pair classification | Linear layer on [CLS] | Sentence A [SEP] Sentence B | Natural language inference, paraphrase detection |
| Token classification | Linear layer per token | Single sentence | Named entity recognition, POS tagging |
| Extractive question answering | Start/end pointers | Question [SEP] Passage | SQuAD, reading comprehension |
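A minimal fine-tuning sketch for the first row of the table, assuming the Hugging Face transformers and datasets libraries; the SST-2 dataset and hyperparameters are illustrative choices within the paper's reported ranges, not the original setup:

```python
# Sketch: fine-tune BERT for sentence classification on SST-2.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # linear head on the [CLS] hidden state

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(
    "bert-sst2",
    learning_rate=2e-5,               # within the paper's 2e-5 to 5e-5 range
    per_device_train_batch_size=32,   # the paper used batch sizes of 16 or 32
    num_train_epochs=3,               # the paper used 2 to 4 epochs
)
trainer = Trainer(model=model, args=args, train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"], tokenizer=tokenizer)
trainer.train()
```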
Fine-tuned BERT models are still the dominant approach for production text classification and token tagging in 2026, even when the underlying base model has been swapped for a more recent encoder such as DeBERTa-v3 or ModernBERT. The procedural simplicity of "add one linear head, train for a few epochs" is hard to beat in operational settings where engineers value predictable results.
Fine-tuning is not always smooth. BERT-Large in particular is known to be unstable on small datasets like RTE or MRPC, where some random seeds can produce models that fail to learn beyond the majority-class baseline. A 2020 paper by Mosbach, Andriushchenko, and Klakow attributed the instability to a combination of vanishing gradients early in fine-tuning, missing bias-correction terms in BERT's original Adam variant, and the small number of training steps. The proposed fix, training for more epochs with a longer warmup and full Adam bias correction, became standard practice. Hugging Face's Trainer class incorporates these defaults and largely papers over the issue for downstream users.
BERT set new state-of-the-art results on 11 NLP benchmarks upon its release. The improvements were substantial across the board.
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence-level and sentence-pair classification tasks. BERT-Large achieved an average score of 80.5, a 7.7-point improvement over the previous best system. Individual task results for BERT-Large:
| Task | Metric | BERT-Large score | Previous SOTA |
|---|---|---|---|
| MNLI (matched) | Accuracy | 86.7% | 82.1% |
| MNLI (mismatched) | Accuracy | 85.9% | 81.4% |
| QQP | F1 | 72.1% | 66.1% |
| QNLI | Accuracy | 92.7% | 87.4% |
| SST-2 | Accuracy | 94.9% | 93.2% |
| CoLA | Matthews Corr. | 60.5% | 35.0% |
| STS-B | Spearman Corr. | 86.5% | 81.0% |
| MRPC | F1 | 89.3% | 85.4% |
| RTE | Accuracy | 70.1% | 56.0% |
BERT's strongest individual improvement was on CoLA (Corpus of Linguistic Acceptability), where it beat the previous best by 25.5 percentage points. The RTE (Recognizing Textual Entailment) task also saw a 14.1-point gain.
On the Stanford Question Answering Dataset (SQuAD) version 1.1, BERT achieved an F1 score of 93.2 and an Exact Match (EM) score of 87.4, surpassing the previous best system by 1.5 F1 points and exceeding the estimated human performance F1 of 91.2. On SQuAD 2.0, which includes unanswerable questions, BERT achieved an F1 of 83.1 and EM of 80.0, a 5.1-point F1 improvement.
On the Situations With Adversarial Generations (SWAG) dataset for grounded commonsense reasoning, BERT-Large achieved 86.3% accuracy, beating the previous best by 27.1 percentage points and approaching human performance of 88.0%.
BERT's performance on GLUE was so strong that it, along with subsequent models like XLNet and RoBERTa, quickly surpassed human-level performance on the benchmark by 2019. This prompted the creation of SuperGLUE, a harder benchmark designed to provide more room for improvement. SuperGLUE itself was later saturated by DeBERTa-v3 and T5 derivatives, and by 2024 most NLP researchers had moved on to LLM-era benchmarks like MMLU, BIG-Bench Hard, and HELM.
BERT was developed alongside several other approaches to pre-trained language representations. Understanding how these models differ helps explain why BERT was so effective.
| Feature | ELMo (2018) | GPT (2018) | BERT (2018) |
|---|---|---|---|
| Architecture | Bidirectional LSTM | Transformer decoder | Transformer encoder |
| Directionality | Shallow bidirectional (separate L-to-R and R-to-L) | Unidirectional (left-to-right) | Deep bidirectional |
| Pre-training objective | Language modeling (both directions, separately) | Autoregressive language modeling | Masked LM + Next Sentence Prediction |
| Downstream adaptation | Feature-based (frozen weights) | Fine-tuning (all parameters) | Fine-tuning (all parameters) |
| Parameters | 94M | 117M | 110M (Base), 340M (Large) |
| Context window | N/A (recurrent) | 512 tokens | 512 tokens |
ELMo's main limitation was its shallow bidirectionality: the forward and backward LSTMs were trained independently and only concatenated, so there was no cross-directional attention within individual layers. GPT used the transformer architecture (which is more parallelizable than LSTMs) but was restricted to left-to-right attention, limiting its ability to capture context to the right of any given token.
BERT combined the strengths of both approaches. It used the transformer architecture (like GPT) and processed text bidirectionally (like ELMo), but with true deep bidirectionality where each layer attends to both left and right context simultaneously.
BERT has been applied to a wide range of tasks in both research and industry.
In October 2019, Google announced that it was using BERT to improve search results for English-language queries, calling it "the biggest change in search in the past five years." By December 2019, BERT had been deployed for search queries in over 70 languages. By 2020, nearly every English search query processed by Google involved a BERT model. BERT helped Google better understand the intent behind search queries, especially longer and more conversational ones where prepositions and context words carry meaning. For example, a query like "2019 brazil traveler to usa need a visa" required understanding that the traveler is from Brazil (not going to Brazil), something earlier systems struggled with.
The rollout was expensive enough that Google publicly noted it had to deploy a fresh generation of Cloud TPUs to serve BERT at search-query latencies. Search Engine Land later reported that DeepRank, an internal Google ranking system launched in 2019, was a productionized BERT variant tuned for ranking rather than open-domain question answering. As of 2024, Google still uses BERT-class models in its featured-snippet generation, query rewriting, and natural-language understanding pipelines, though they now sit alongside larger generative models such as PaLM and Gemini for specific subtasks.
Fine-tuned BERT models achieve strong results on named entity recognition (NER) tasks, where the goal is to identify and classify entities such as person names, organizations, and locations in text. On the CoNLL-2003 NER benchmark, BERT-Large achieved an F1 score of 92.8.
NER is one of the workloads where BERT-style encoders continue to dominate. Decoder-only LLMs can perform NER zero-shot by prompting, but a fine-tuned BERT runs at a fraction of the cost, returns structured spans without parsing tricks, and matches or beats the prompted LLM on standard datasets like OntoNotes and Few-NERD. In production, encoder-based NER is the default in document processing, intelligent inboxes, healthcare data extraction, and compliance review systems.
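As an illustration, a fine-tuned encoder can be queried in a few lines via the transformers pipeline API; the checkpoint dslim/bert-base-NER is one popular community model, named here as an example rather than a canonical reference:

```python
# Sketch: structured span extraction with a fine-tuned BERT NER model.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge subword pieces into spans
for ent in ner("Angela Merkel visited the Google office in Berlin."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
# Typical output: PER Angela Merkel / ORG Google / LOC Berlin
```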
BERT is widely used for sentiment analysis in customer service, product reviews, and social media monitoring. Its ability to capture context-dependent meaning makes it better than earlier bag-of-words or simple embedding methods at handling negation, sarcasm, and nuanced opinions.
BERT's performance on SQuAD made it a natural fit for building question answering systems. Developers can fine-tune BERT on domain-specific QA datasets to build customer support bots, medical information retrieval systems, and educational tools. Google itself used BERT-based models in its featured snippets and Google Assistant.
The biggest commercial use of BERT today may be retrieval rather than classification. Sentence-BERT and its successors turned BERT into a tool for producing fixed-size sentence embeddings that can be compared with cosine similarity, which is the foundation for modern semantic search and retrieval-augmented generation. The sentence-transformers library, maintained by Nils Reimers and the UKP Lab at TU Darmstadt, is one of the most-downloaded packages on Hugging Face, and the models it ships are nearly all BERT or BERT-family encoders.
Leading 2024 and 2025 embedding models such as BGE-M3 from BAAI, NV-Embed from NVIDIA, GTE-large from Alibaba's DAMO Academy, and Cohere's English v3 embeddings all use BERT-family encoders or close descendants as their backbone. Even where the public-facing brand emphasizes proprietary improvements, the underlying network is almost always a BERT, RoBERTa, XLM-R, or DeBERTa derivative trained with contrastive objectives on top of an encoder pre-train.
Specialized versions of BERT, such as BioBERT and ClinicalBERT, have been pre-trained on biomedical literature and clinical notes. These domain-adapted models outperform general-purpose BERT on tasks like biomedical NER, relation extraction, and clinical text classification. BioBERT, introduced by Lee et al. in 2019, was trained on PubMed abstracts and PubMed Central full-text articles. ClinicalBERT, introduced by Alsentzer et al. in 2019, fine-tuned BioBERT on the MIMIC-III clinical notes corpus.
Companies including Facebook (now Meta) have used BERT-based models for detecting hate speech, misinformation, and other harmful content on social media platforms. Meta's WPIE (Whole-Post Integrity Embeddings) system, deployed across Facebook and Instagram for content classification, was originally built on a multilingual BERT-style encoder.
Multilingual BERT (mBERT) was pre-trained on Wikipedia text from 104 languages using a shared WordPiece vocabulary. Despite having no explicit cross-lingual training signal, mBERT shows surprisingly strong zero-shot cross-lingual transfer, where a model fine-tuned on English data performs well on the same task in other languages. Pires, Schlinger, and Garrette documented this behavior in their 2019 paper "How Multilingual is Multilingual BERT?", which helped establish a now-standard zero-shot evaluation protocol for cross-lingual models.
BERT's release sparked a wave of follow-up work that improved on its design in various ways. These models are sometimes collectively referred to as "BERTology."
RoBERTa (Robustly Optimized BERT Pre-training Approach) was developed by Facebook AI Research. The authors found that BERT was significantly undertrained and that better results could be achieved through more careful tuning of training hyperparameters and procedures. Key changes included:

- Training longer, with much larger batches, on roughly ten times more data (160 GB of text versus BERT's roughly 16 GB)
- Removing the Next Sentence Prediction objective
- Dynamic masking, where the masking pattern is re-sampled each time a sequence is fed to the model rather than fixed once during preprocessing
- A larger byte-level BPE vocabulary of about 50,000 entries
The RoBERTa training run used 1,024 NVIDIA V100 GPUs for 500,000 steps with sequence length 512. It matched or exceeded the performance of all models published after BERT at the time, scoring 88.5 on GLUE (compared to BERT-Large's 80.5). RoBERTa effectively reset community expectations for what a BERT-style model could achieve once you stopped underspending on pre-training.
ALBERT (A Lite BERT) was developed by Google Research and the Toyota Technological Institute at Chicago. It introduced two parameter-reduction techniques:

- Factorized embedding parameterization, which decomposes the vocabulary embedding matrix into two smaller matrices so the embedding dimension can be decoupled from, and much smaller than, the hidden size
- Cross-layer parameter sharing, in which all transformer layers reuse the same weights
ALBERT also replaced NSP with Sentence Order Prediction (SOP), where the model must determine whether two consecutive sentences are in the correct order or have been swapped. An ALBERT-xxlarge configuration with 235M parameters outperformed BERT-Large (340M parameters) on several benchmarks. ALBERT became influential for showing that parameter efficiency was a useful axis to optimize on, even when it slightly increased compute per training step.
DistilBERT was created by Hugging Face using knowledge distillation. The student model has 6 transformer layers (half of BERT-Base's 12) and 66M parameters, making it 40% smaller and 60% faster at inference. Despite its reduced size, DistilBERT retains about 97% of BERT-Base's language understanding capabilities as measured on GLUE. The distillation process uses a combination of three loss functions: language modeling loss, distillation loss (matching the teacher model's probability distributions), and cosine embedding loss.
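A simplified sketch of the three-part objective in PyTorch; the temperature, equal weighting, and exact hidden-state alignment are illustrative assumptions, not the published recipe:

```python
# Sketch of a DistilBERT-style combined loss: MLM + distillation + cosine.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden, temperature=2.0):
    """Combine the three losses; equal weights here for simplicity."""
    # 1. Standard masked-language-modeling loss against the true tokens.
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # 2. Distillation loss: match the teacher's softened output distribution.
    distill = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       F.softmax(teacher_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    # 3. Cosine loss aligning student and teacher hidden-state directions.
    cosine = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + distill + cosine
```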
DistilBERT remained one of the most-downloaded models on Hugging Face for years and is still common in production deployments where latency budgets matter, such as real-time chat moderation or on-device inference. Its success also helped popularize knowledge distillation as a standard tool in NLP engineering.
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) was developed by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning at Stanford University and Google. Instead of masked language modeling, ELECTRA uses a "replaced token detection" objective. A small generator network (a small MLM) produces plausible replacement tokens, and a larger discriminator network must identify which tokens in the input have been replaced. Because this task is defined over all input tokens (not just the 15% that are masked), ELECTRA is much more sample-efficient. An ELECTRA-Small model trained on one GPU for 4 days outperformed GPT on GLUE, and ELECTRA-Large matched RoBERTa and XLNet while using less than a quarter of their compute.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) was developed by Microsoft Research. It introduced two main improvements:

- Disentangled attention, which represents each token with separate vectors for content and relative position and computes attention from content-to-content, content-to-position, and position-to-content interactions
- An enhanced mask decoder that injects absolute position information just before the model predicts the masked tokens
In January 2021, a scaled-up DeBERTa model with 1.5 billion parameters became the first model to surpass human-level performance on the SuperGLUE benchmark, achieving a macro-average score of 90.3 compared to the human baseline of 89.8.
DeBERTa-v3, released by Pengcheng He, Jianfeng Gao, and Weizhu Chen at Microsoft in November 2021 and accepted at ICLR 2023, replaces masked language modeling with the ELECTRA-style replaced token detection objective and adds a new gradient-disentangled embedding sharing scheme (GDES). In ELECTRA, the generator and discriminator share input embeddings, but their training losses pull the embeddings in different directions, creating a "tug of war." GDES lets the generator share its embeddings with the discriminator while stopping the discriminator's gradients from flowing back into the generator's embedding parameters, which improves training efficiency and downstream accuracy.
DeBERTa-v3 became, by most measures, the strongest publicly released encoder model of its era. The base, large, and xlarge variants topped many fine-tuning leaderboards through 2024, and the multilingual mDeBERTa-v3 model is still a popular choice for cross-lingual classification work in 2026.
SpanBERT, developed by Facebook AI and other institutions, modified the masking strategy to mask contiguous random spans rather than individual random tokens. It also removed the NSP objective. SpanBERT outperformed BERT on span-selection tasks such as question answering and coreference resolution.
XLNet, developed by researchers at Carnegie Mellon University and Google, addressed BERT's limitation that masked tokens are predicted independently of each other. XLNet uses a permutation-based language modeling objective that captures bidirectional context while maintaining autoregressive properties. It also incorporated the recurrence mechanism from Transformer-XL to handle longer sequences.
Sentence-BERT (SBERT), introduced by Nils Reimers and Iryna Gurevych at EMNLP 2019, modifies BERT with siamese and triplet network structures so that semantically meaningful sentence embeddings can be derived and compared with cosine similarity. The motivation was practical: comparing two sentences with vanilla BERT requires running both through the network jointly, so finding the most similar pair in a collection of 10,000 sentences would require roughly 50 million inference calls and around 65 hours of compute. With SBERT, the same task takes about five seconds because each sentence can be encoded once and stored as a fixed vector.
SBERT pools BERT's output (typically with mean pooling) to produce a fixed-size embedding, then trains the model with a siamese architecture on natural language inference and semantic similarity datasets. Sentence-BERT became the foundation for the sentence-transformers library and is the direct ancestor of nearly every modern embedding model used in semantic search and RAG pipelines.
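A typical usage sketch with the sentence-transformers library; all-MiniLM-L6-v2 is one widely used BERT-family checkpoint, chosen here purely for illustration:

```python
# Sketch: fixed-size sentence embeddings compared with cosine similarity.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I reset my password?",
             "What are the steps to change my login credentials?",
             "The weather in Lisbon is sunny today."]
embeddings = model.encode(sentences)          # one fixed-size vector per sentence

print(cos_sim(embeddings[0], embeddings[1]))  # high similarity (paraphrases)
print(cos_sim(embeddings[0], embeddings[2]))  # low similarity (unrelated)
```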
A family of domain-specific BERT models extends the encoder to specialized text. The most widely cited variants:
| Model | Domain | Training corpus | Released |
|---|---|---|---|
| BioBERT | Biomedical | PubMed abstracts and PMC full-text | 2019, Lee et al. |
| SciBERT | Scientific | Semantic Scholar papers (CS + biomedical) | 2019, Beltagy et al. |
| ClinicalBERT | Clinical | MIMIC-III ICU clinical notes | 2019, Alsentzer et al. |
| FinBERT | Financial | Reuters TRC2 + financial corpora | 2019, Araci |
| LegalBERT | Legal | EU legislation, court cases, contracts | 2020, Chalkidis et al. |
| CodeBERT | Source code | GitHub repositories (Python, Java, etc.) | 2020, Feng et al. |
| PatentBERT | Patents | USPTO patent text | 2019, Lee and Hsiang |
These models follow the same pattern: take a pre-trained BERT, continue pre-training on domain text, and optionally swap in a domain-specific WordPiece vocabulary. SciBERT was the first to argue that an in-domain vocabulary, not just an in-domain corpus, gives meaningful additional gains on scientific NLP tasks.
The multilingual BERT family includes:
| Model | Languages | Notes |
|---|---|---|
| mBERT (bert-base-multilingual-cased) | 104 languages | Trained on Wikipedia with shared WordPiece vocabulary |
| XLM | 15 languages | Adds cross-lingual translation language modeling (Conneau and Lample, 2019) |
| XLM-R | 100 languages | Trained on 2.5 TB of CommonCrawl, replaces XLM (Conneau et al., 2020) |
| mDeBERTa-v3 | 100 languages | Multilingual DeBERTa-v3, current strong baseline for cross-lingual transfer |
XLM-R in particular dominated cross-lingual fine-tuning for several years and remains the default for low-resource language work in 2026.
A parallel line of work focused on shrinking BERT for edge and mobile deployment:
| Model | Year | Parameters | Key idea |
|---|---|---|---|
| MobileBERT | 2020 | 25M | Bottleneck design with inverted-bottleneck transformer block |
| TinyBERT | 2020 | 14.5M | Two-stage knowledge distillation (general + task-specific) |
| MiniLM | 2020 | 33M | Self-attention distillation; popular base for sentence-transformers |
| FastBERT | 2020 | 110M | Adaptive early exit with self-distilled student classifiers |
MiniLM, introduced by Wang et al. at Microsoft in 2020, deserves special mention. It became the most common backbone for production embedding models because it offers a strong accuracy-to-size ratio and trains well with contrastive objectives.
Released on December 19, 2024 by Answer.AI, LightOn, and collaborators, ModernBERT applies six years of architectural progress from large language model research to the encoder-only paradigm. The headline changes:

- Rotary positional embeddings (RoPE) in place of BERT's learned absolute position embeddings
- A native context length of 8,192 tokens, sixteen times BERT's 512
- FlashAttention and sequence unpadding for much faster training and inference
- Alternating global and local attention layers to keep long-context costs manageable
- GeGLU feed-forward layers and the removal of bias terms, following recent LLM practice
ModernBERT-Base has 139M parameters and ModernBERT-Large has 395M parameters. ModernBERT-Large goes deeper (28 layers) than RoBERTa-Large while matching it in total parameter count. The training corpus was 2 trillion tokens drawn from web text, code, scientific literature, and other diverse sources, dwarfing BERT's original 3.3 billion words by nearly three orders of magnitude.
On standard benchmarks, ModernBERT matches or beats DeBERTa-v3 while running roughly twice as fast on GPU and supporting much longer documents. Its release was widely treated as the long-overdue "return of the encoder" after several years of LLM-only headlines, and it has become the default starting point for new retrieval and classification work in 2025 and 2026.
| Model | Year | Developer | Key innovation | Parameters |
|---|---|---|---|---|
| BERT-Base | 2018 | Google | Masked LM + bidirectional encoder | 110M |
| BERT-Large | 2018 | Google | Larger version of BERT-Base | 340M |
| RoBERTa | 2019 | Facebook AI | Dynamic masking, no NSP, more data | 355M |
| ALBERT | 2019 | Google / TTIC | Parameter sharing, factorized embeddings | 12M to 235M |
| DistilBERT | 2019 | Hugging Face | Knowledge distillation, 6 layers | 66M |
| XLNet | 2019 | CMU / Google | Permutation language modeling | 340M |
| Sentence-BERT | 2019 | UKP Lab, TU Darmstadt | Siamese network for sentence embeddings | 110M |
| BioBERT | 2019 | Korea University | Continued pretrain on PubMed | 110M |
| SpanBERT | 2020 | Facebook AI | Span masking | 340M |
| ELECTRA | 2020 | Stanford / Google | Replaced token detection | 14M to 335M |
| DeBERTa | 2020 | Microsoft | Disentangled attention, enhanced mask decoder | 134M to 1.5B |
| MobileBERT | 2020 | CMU / Google | Bottleneck design for mobile | 25M |
| MiniLM | 2020 | Microsoft | Self-attention distillation | 33M |
| XLM-R | 2020 | Facebook AI | Cross-lingual training on CommonCrawl | 270M to 10.7B |
| DeBERTa-v3 | 2021 | Microsoft | ELECTRA pretraining + GDES | 22M to 304M |
| ModernBERT | 2024 | Answer.AI / LightOn | RoPE, FlashAttention, 8K context | 139M to 395M |
Each transformer layer in BERT uses multi-head self-attention. For BERT-Base with 12 attention heads and a hidden size of 768, each head operates on a 64-dimensional subspace (768 / 12 = 64). The attention computation follows the standard scaled dot-product formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. Multi-head attention runs this computation in parallel across all heads and concatenates the results.
Unlike GPT, which uses a causal attention mask to prevent each position from attending to subsequent positions, BERT uses no such mask. Every token can attend to every other token in the sequence, enabling full bidirectional context.
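A NumPy sketch of this computation for a single head makes the absence of a causal mask explicit; the shapes are illustrative:

```python
# Sketch: scaled dot-product attention for one head, with no causal mask,
# so every position attends to every other position.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

seq_len, d_k = 8, 64                                 # one BERT-Base head: 768 / 12 = 64
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 64)
```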
After the attention sub-layer, each transformer layer applies a two-layer feed-forward network with a GELU (Gaussian Error Linear Unit) activation function. The inner dimension of the feed-forward network is 4 times the hidden size (3,072 for BERT-Base, 4,096 for BERT-Large).
BERT applies dropout with a rate of 0.1 to all layers during training, including the attention weights and the outputs of the feed-forward sub-layers. The pre-training runs also used an L2 weight decay of 0.01 alongside the Adam optimizer.
Pre-training BERT-Large on 64 TPU chips for 4 days was a significant computational investment in 2018. Estimates place the cost at roughly $6,900 to $12,500 using cloud TPU pricing at the time. Subsequent work has reduced this cost substantially. In 2023, MosaicML demonstrated that a BERT-Base model could be pre-trained from scratch to competitive accuracy for approximately $20 using modern hardware and training optimizations.
A whole research subfield called BERTology grew up around the question of what BERT actually learns from its pre-training. Probing studies inspect BERT's hidden states with auxiliary classifiers and find that lower layers tend to encode surface features (capitalization, token identity), middle layers capture syntactic information (part-of-speech, dependency relations, constituency), and upper layers represent more semantic and task-specific information. Tenney, Das, and Pavlick (2019) coined the phrase "BERT rediscovers the classical NLP pipeline" to describe this layered organization.
Attention-head analyses by Clark et al. (2019) showed that individual heads specialize for tasks such as tracking direct objects of verbs, attending to coreferent mentions, or marking sentence boundaries. Other work probed BERT's knowledge of factual relationships using cloze-style queries (Petroni et al., 2019, the LAMA benchmark) and found that BERT stores a surprising amount of relational knowledge in its parameters even without an explicit knowledge base. These insights influenced later work on knowledge editing and retrieval-augmented language models.
Despite its success, BERT has several known limitations.
Maximum sequence length. BERT's input is limited to 512 tokens due to its learned positional embeddings. Documents longer than this must be truncated or split into chunks, which can lose context across boundaries. Later models like Longformer, BigBird, and ModernBERT addressed this with sparse attention or rotary positional encodings.
Independence assumption in MLM. During pre-training, BERT predicts each masked token independently given the unmasked tokens. This means it does not model the joint probability of masked tokens, which can be a disadvantage for generative tasks. XLNet's permutation language modeling was designed to address this.
Pre-training/fine-tuning mismatch. The [MASK] token appears during pre-training but never during fine-tuning or inference. The 80/10/10 masking strategy (described above) partially mitigates this, but the mismatch remains.
Not generative. Because BERT is an encoder-only model, it is not designed for text generation tasks such as summarization, translation, or dialogue. For these tasks, encoder-decoder models like T5 or decoder-only models like GPT are better suited.
Computational cost for fine-tuning. Although fine-tuning is cheaper than pre-training, running BERT-Large (340M parameters) for inference in production environments can be expensive. This motivated the development of smaller, distilled models like DistilBERT and TinyBERT.
NSP weakness. The Next Sentence Prediction task was later shown to be an ineffective pre-training objective. RoBERTa, ALBERT (with SOP), and SpanBERT all dropped NSP and achieved better results.
English bias. The original BERT was trained almost entirely on English Wikipedia and BooksCorpus. Multilingual BERT helps but underperforms on low-resource languages compared to dedicated multilingual encoders like XLM-R and mDeBERTa-v3.
Fine-tuning instability. As noted earlier, BERT-Large fine-tuning on small datasets can fail to converge for a meaningful fraction of random seeds. This makes BERT less reliable than later models for low-data regimes without careful hyperparameter tuning.
Social bias. Studies of BERT and its successors have repeatedly found that the model encodes social and demographic biases present in the training corpus. Kurita et al. (2019) used template-based probes to show measurable gender and racial associations in BERT's representations, and similar follow-up work has documented religious, age, and disability-related biases. These findings motivated the broader field of bias evaluation and mitigation in pre-trained language models, and continue to inform deployment decisions for production systems.
Despite the rise of decoder-only LLMs, the BERT family remains the workhorse of production NLP. The current 2026 landscape looks roughly like this:
For sentence and document embeddings, BERT family encoders dominate. Sentence-transformers built on MiniLM, MPNet, and BERT-Base power most semantic search systems. The leading 2024-2025 retrieval models such as BGE-M3 (BAAI), GTE-large (Alibaba), E5-Mistral (Microsoft), and NV-Embed (NVIDIA) use either BERT-family encoders directly or LLM backbones fine-tuned with encoder-style contrastive objectives.
For classification, NER, and span extraction, fine-tuned encoders are still the dominant approach for production. DeBERTa-v3 was the strongest pre-2024 baseline. ModernBERT has displaced it in many new projects since its December 2024 release. For domain-specific tasks, BioBERT, ClinicalBERT, FinBERT, and similar continue to ship in healthcare, finance, and legal pipelines.
For reranking in retrieval-augmented generation (RAG), cross-encoder rerankers built on BERT or DeBERTa remain standard. The BAAI bge-reranker family, Cohere Rerank 3, and Jina's reranker-v2 all use BERT-style cross-encoders that score query-passage pairs in a single forward pass.
For text generation, summarization, and chat, decoder-only LLMs like GPT-4, Claude, and Gemini have replaced BERT entirely. BERT was never designed to generate text autoregressively, and even encoder-decoder models like T5 have given way to decoder-only architectures for most generative work.
In aggregate, BERT and its descendants account for the overwhelming majority of NLP inference calls made in 2026, even though they receive far less media attention than the largest LLMs. Hugging Face's 2024 download statistics show BERT-base-uncased among the top three most-downloaded models, and the cumulative downloads of MiniLM-based sentence-transformers easily run into the billions.
BERT's impact on the NLP field has been wide-reaching. Before BERT, most NLP systems were trained from scratch on task-specific labeled data, often with hand-engineered features. BERT popularized the "pre-train, then fine-tune" paradigm, which has since become the standard approach in NLP and has spread to other domains including computer vision (with models like ViT) and speech recognition (with models like wav2vec).
Several trends that BERT helped set in motion include:
Large-scale pre-training. BERT showed that training on large amounts of unlabeled text produces general-purpose representations that transfer well to downstream tasks. This insight was extended by GPT-2, GPT-3, and later large language models (LLMs) that scaled up both model size and training data.
Open model release. Google's decision to open-source BERT's weights and code set an expectation in the research community. It enabled thousands of researchers and companies to build on BERT's foundation without needing Google-scale compute to pre-train their own models.
Hugging Face ecosystem. BERT was one of the first models to gain widespread adoption through the Hugging Face Transformers library, which provided a PyTorch implementation within weeks of BERT's release. As of 2024, BERT remains the second most downloaded model on Hugging Face, with over 68 million monthly downloads.
Domain-specific pre-training. BERT inspired a wave of domain-adapted language models, including SciBERT (scientific text), BioBERT (biomedical literature), ClinicalBERT (clinical notes), FinBERT (financial text), and LegalBERT (legal documents). These models are pre-trained on domain-specific corpora and consistently outperform general-purpose BERT on in-domain tasks.
Encoder models for production. Even as decoder-only LLMs like GPT-4 dominate headlines, BERT-style encoder models remain the workhorses for many production NLP systems. Their lower computational cost and strong performance on classification, retrieval, and extraction tasks make them practical choices for applications that run at scale.
BERT also contributed to the broader conversation about what language models actually learn. The "BERTology" research subfield has produced hundreds of papers analyzing BERT's internal representations, probing what linguistic information its layers capture, and studying how attention heads specialize. This work has provided insights into syntax, semantics, and the nature of contextual representations in neural networks.
The NAACL-HLT 2019 Best Long Paper Award was the first formal community recognition of BERT's significance. In retrospect, the citation count alone (more than 100,000 by 2024) places the paper among the most influential in the history of computer science, and arguably the most influential single paper in the history of natural language processing.
BERT's original TensorFlow implementation and pre-trained weights are available on GitHub at google-research/bert. The model is also available through the Hugging Face Transformers library (in PyTorch, TensorFlow, and JAX/Flax) under model identifiers such as bert-base-uncased, bert-base-cased, bert-large-uncased, and bert-large-cased. Uncased models have all text lowercased before tokenization, while cased models preserve original casing.
| Model identifier | Description |
|---|---|
| bert-base-uncased | Base model, 110M parameters, lowercased input |
| bert-base-cased | Base model, 110M parameters, case-preserving input |
| bert-large-uncased | Large model, 340M parameters, lowercased input |
| bert-large-cased | Large model, 340M parameters, case-preserving input |
| bert-base-multilingual-cased | mBERT, 104 languages, case-preserving |
| bert-base-chinese | Chinese-only BERT, character-level tokenization |
| bert-large-uncased-whole-word-masking | Variant with whole-word masking instead of WordPiece masking |
Multilingual BERT (bert-base-multilingual-cased) covers 104 languages, and Chinese BERT (bert-base-chinese) is trained specifically on Chinese text using character-level tokenization rather than WordPiece, since Chinese characters already function as natural subword units.
For most new projects in 2026, Hugging Face's transformers library is the canonical entry point. A typical fine-tuning workflow takes fewer than 50 lines of Python and runs comfortably on a single consumer GPU. The same library exposes ModernBERT, DeBERTa-v3, and the full BERT family behind a uniform API, which means developers can swap between them with minimal code changes.
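A minimal end-to-end example with the pipeline API gives a feel for the model; the exact predictions and scores depend on the checkpoint version:

```python
# Sketch: querying pre-trained BERT's masked-language-modeling head.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
# "paris" is typically the top prediction by a wide margin.
```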