BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model developed by researchers at Google AI Language. Introduced in October 2018, BERT changed the way natural language processing (NLP) systems are built by demonstrating that pre-training a deep bidirectional model on unlabeled text, then fine-tuning it on specific tasks, could beat purpose-built architectures across a wide range of benchmarks. The original paper, authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, has accumulated over 100,000 citations, making it one of the most referenced works in the history of artificial intelligence research.
BERT was open-sourced on November 2, 2018, with both pre-trained model weights and TensorFlow source code released on GitHub. Its release marked the beginning of a new era in NLP where transfer learning from large pre-trained models became the default approach for nearly every language understanding task.
Before BERT, language models generally processed text in one direction. GPT (Generative Pre-trained Transformer), released by OpenAI in June 2018, used a left-to-right transformer decoder to predict the next token in a sequence. ELMo (Embeddings from Language Models), published earlier in 2018 by researchers at the Allen Institute for AI, concatenated the outputs of separate forward and backward LSTM networks to produce context-sensitive word representations. While ELMo captured some bidirectional context, its forward and backward components were trained independently and only combined in a shallow manner.
The BERT authors argued that existing approaches were suboptimal because they restricted the power of pre-trained representations. A truly bidirectional model, one that could attend to both left and right context simultaneously at every layer, would produce richer representations for downstream tasks. The challenge was that standard language modeling objectives (predicting the next word) inherently require unidirectional processing; allowing the model to "see" the target word during training would make the task trivial.
BERT solved this with a new pre-training objective called Masked Language Modeling (MLM), which randomly hides a fraction of input tokens and trains the model to recover them from the surrounding context in both directions. This simple but effective approach enabled deep bidirectional pre-training for the first time.
BERT uses the encoder portion of the transformer architecture introduced by Vaswani et al. in 2017. Unlike the original transformer, which has both an encoder and a decoder, BERT uses only the encoder stack. Each encoder layer consists of a multi-head self-attention mechanism followed by a position-wise feed-forward network, with layer normalization and residual connections applied to each sub-layer.
The original paper described two model sizes:
| Configuration | Layers | Hidden size | Attention heads | Parameters | Max sequence length |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M | 512 |
| BERT-Large | 24 | 1024 | 16 | 340M | 512 |
BERT-Base was designed to have roughly the same model size as GPT (which had 12 layers and 117M parameters) to allow direct comparison. BERT-Large was the larger configuration used to push state-of-the-art results.
In March 2020, Google released 24 additional smaller BERT models ranging from BERT-Tiny (2 layers, 128 hidden size, 4.4M parameters) to BERT-Base, giving practitioners more options for resource-constrained settings.
BERT's input representation is built by summing three types of embeddings:
Token embeddings: The model uses WordPiece tokenization with a vocabulary of 30,522 tokens. WordPiece is a subword tokenization algorithm that splits rare words into smaller pieces (prefixed with "##" for continuation tokens) while keeping common words as single tokens.
Segment embeddings: Because some tasks require understanding the relationship between two sentences, BERT adds a learned segment embedding to distinguish between Sentence A and Sentence B.
Position embeddings: Learned positional embeddings encode the position of each token in the sequence, up to a maximum of 512 positions.
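The greedy longest-match-first rule that WordPiece applies at tokenization time can be sketched in a few lines. The tiny vocabulary below is purely illustrative; the real model ships a 30,522-entry vocab file:

```python
# Toy sketch of greedy longest-match-first WordPiece tokenization.
# TOY_VOCAB is illustrative, not BERT's actual vocabulary.
TOY_VOCAB = {"[UNK]", "the", "play", "##ing", "##ed", "un", "##aff", "##able"}

def wordpiece_tokenize(word, vocab=TOY_VOCAB, max_chars=100):
    """Split one word into subword pieces, longest match first."""
    if len(word) > max_chars:
        return ["[UNK]"]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece_tokenize("playing"))    # ['play', '##ing']
print(wordpiece_tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz"))        # ['[UNK]']
```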
Every input sequence begins with a special [CLS] (classification) token. For tasks involving sentence pairs, a [SEP] (separator) token is inserted between the two sentences. Another [SEP] token marks the end of the input. The final hidden state of the [CLS] token serves as the aggregate sequence representation for classification tasks. Additional special tokens include [MASK] (used during pre-training), [PAD] (for padding shorter sequences), and [UNK] (for unknown tokens not in the vocabulary).
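Putting the special tokens and the three embedding inputs together, a minimal sketch of how a sentence pair is laid out (token strings stand in for vocabulary ids):

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble a BERT-style input: tokens framed by [CLS]/[SEP], plus
    the segment and position ids that index the corresponding embedding
    tables. The three embeddings are summed element-wise inside the model."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # Sentence A -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # Sentence B -> segment 1
    position_ids = list(range(len(tokens)))       # learned absolute positions, max 512
    return tokens, segment_ids, position_ids

toks, segs, poss = build_bert_input(["the", "dog"], ["it", "barked"])
print(toks)  # ['[CLS]', 'the', 'dog', '[SEP]', 'it', 'barked', '[SEP]']
print(segs)  # [0, 0, 0, 0, 1, 1, 1]
print(poss)  # [0, 1, 2, 3, 4, 5, 6]
```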
BERT's pre-training uses two self-supervised objectives applied simultaneously to large corpora of unlabeled text.
The core innovation behind BERT is the Masked Language Modeling objective. During pre-training, 15% of the input tokens are randomly selected for prediction. To avoid a mismatch between pre-training (where [MASK] tokens appear) and fine-tuning (where they do not), the selected tokens are handled as follows:
- 80% of the time, the selected token is replaced with the special [MASK] token.
- 10% of the time, it is replaced with a random token from the vocabulary.
- 10% of the time, it is left unchanged.

The model must predict the original token at each selected position. This approach forces the model to maintain a distributional representation for every input token, since it cannot know which tokens have been masked or replaced, and it produces deep bidirectional representations because each masked token can attend to context on both sides.
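A minimal sketch of this corruption step, assuming a plain per-token Bernoulli selection at 15% and the 80/10/10 replacement split (later variants such as whole-word masking select spans instead):

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style MLM corruption: select ~15% of positions, then
    80% -> [MASK], 10% -> random vocabulary token, 10% -> unchanged.
    Returns the corrupted tokens and the positions to predict."""
    rng = rng or random.Random()
    out = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue
        targets.append(i)  # the model must recover tokens[i] at this position
        roll = rng.random()
        if roll < 0.8:
            out[i] = "[MASK]"
        elif roll < 0.9:
            out[i] = rng.choice(vocab)
        # else: keep the original token unchanged

    return out, targets

sentence = ["the", "dog", "barked", "at", "the", "mailman"] * 5
corrupted, positions = mask_for_mlm(sentence, ["cat", "runs"],
                                    rng=random.Random(7))
print(len(positions))  # roughly 15% of the 30 input tokens
```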
Many downstream tasks, such as question answering and natural language inference, require understanding the relationship between two sentences. To capture this, BERT was pre-trained on a binary Next Sentence Prediction task. Given two sentences A and B, the model must predict whether B actually follows A in the original corpus (labeled "IsNext") or whether B is a random sentence (labeled "NotNext"). Training examples are constructed with a 50/50 split between real consecutive pairs and random pairs.
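Constructing NSP training pairs can be sketched as follows; in the real pipeline the "NotNext" sentence is drawn from a different document, which the separate `corpus_sentences` list stands in for here:

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng=None):
    """Build one Next Sentence Prediction example: with probability 0.5,
    sentence B actually follows A ("IsNext"); otherwise B is sampled at
    random from elsewhere in the corpus ("NotNext")."""
    rng = rng or random.Random()
    i = rng.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if rng.random() < 0.5:
        return a, doc_sentences[i + 1], "IsNext"
    return a, rng.choice(corpus_sentences), "NotNext"
```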
Later research by the RoBERTa team at Facebook AI and others found that NSP does not consistently help downstream performance and may even hurt it. As a result, many BERT successors dropped or replaced this objective.
BERT was pre-trained on two large English-language corpora:
| Corpus | Size | Description |
|---|---|---|
| BooksCorpus | 800M words | Collection of 11,038 unpublished books from various genres |
| English Wikipedia | 2,500M words | Text content extracted from English Wikipedia (lists, tables, and headers excluded) |
The combined training corpus contains roughly 3.3 billion words. Pre-training used a batch size of 256 sequences of 512 tokens each (131,072 tokens per batch) for 1,000,000 steps, which amounts to approximately 40 epochs over the combined dataset. Training used the Adam optimizer with a learning rate of 1e-4 and linear warmup over the first 10,000 steps followed by linear decay. BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, and BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days.
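The roughly-40-epochs figure follows directly from these numbers, treating words and tokens as interchangeable for a back-of-the-envelope check:

```python
# Sanity check of the pre-training budget quoted above.
batch_sequences = 256
seq_len = 512
steps = 1_000_000
corpus_words = 3.3e9  # BooksCorpus + English Wikipedia

tokens_per_batch = batch_sequences * seq_len          # 131,072 tokens
total_tokens = tokens_per_batch * steps               # ~131 billion tokens
epochs = total_tokens / corpus_words                  # words ~ tokens as a rough proxy

print(tokens_per_batch)  # 131072
print(round(epochs, 1))  # ~39.7, i.e. roughly 40 passes over the corpus
```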
The pre-training procedure actually used two phases. In the first phase, the model was trained for 900,000 steps with a maximum sequence length of 128 tokens. In the second phase, training continued for 100,000 steps with a maximum sequence length of 512 tokens, allowing the model to learn longer-range positional embeddings.
One of BERT's main contributions is that a single pre-trained model can be adapted to many different tasks by adding a simple task-specific output layer and fine-tuning all parameters end-to-end. This is in contrast to feature-based approaches like ELMo, where the pre-trained model's weights are frozen and its outputs are used as fixed features.
Fine-tuning is straightforward. For classification tasks (sentiment analysis, textual entailment), a linear classifier is added on top of the [CLS] token's final hidden state. For token-level tasks (named entity recognition, part-of-speech tagging), a linear classifier is added on top of each token's final hidden state. For span-extraction tasks like question answering, start and end pointers are learned over the input sequence.
Fine-tuning is fast and inexpensive compared to pre-training. Most tasks can be fine-tuned in 1 to 4 epochs with a learning rate between 2e-5 and 5e-5, and training on a single GPU typically takes between 30 minutes and a few hours.
| Task type | Output layer | Input format | Examples |
|---|---|---|---|
| Sentence classification | Linear layer on [CLS] | Single sentence | Sentiment analysis, spam detection |
| Sentence pair classification | Linear layer on [CLS] | Sentence A [SEP] Sentence B | Natural language inference, paraphrase detection |
| Token classification | Linear layer per token | Single sentence | Named entity recognition, POS tagging |
| Extractive question answering | Start/end pointers | Question [SEP] Passage | SQuAD, reading comprehension |
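For the span-extraction case, the standard decoding step on top of the start/end pointers is to pick the span maximising the summed logits. A sketch (the `max_len` cap is a common practical choice, not something the paper fixes):

```python
def best_answer_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximising start_logit + end_logit,
    subject to start <= end and a maximum span length. This is the usual
    decoding step over BERT's start/end pointers for SQuAD-style QA."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy per-token logits over a 5-token passage:
print(best_answer_span([0.1, 2.0, 0.3, 0.0, 0.2],
                       [0.0, 0.1, 1.5, 0.4, 0.3]))  # (1, 2)
```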
BERT set new state-of-the-art results on 11 NLP benchmarks upon its release. The improvements were substantial across the board.
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence-level and sentence-pair classification tasks. BERT-Large achieved an average score of 80.5, a 7.7-point improvement over the previous best system. Individual task results for BERT-Large:
| Task | Metric | BERT-Large score | Previous SOTA |
|---|---|---|---|
| MNLI (matched) | Accuracy | 86.7% | 82.1% |
| MNLI (mismatched) | Accuracy | 85.9% | 81.4% |
| QQP | F1 | 72.1% | 66.1% |
| QNLI | Accuracy | 92.7% | 87.4% |
| SST-2 | Accuracy | 94.9% | 93.2% |
| CoLA | Matthews Corr. | 60.5 | 35.0 |
| STS-B | Spearman Corr. | 86.5 | 81.0 |
| MRPC | F1 | 89.3% | 85.4% |
| RTE | Accuracy | 70.1% | 56.0% |
BERT's strongest individual improvement was on CoLA (Corpus of Linguistic Acceptability), where it beat the previous best by 25.5 percentage points. The RTE (Recognizing Textual Entailment) task also saw a 14.1-point gain.
On the Stanford Question Answering Dataset (SQuAD) version 1.1, BERT achieved an F1 score of 93.2 and an Exact Match (EM) score of 87.4, surpassing the previous best system by 1.5 F1 points and exceeding the estimated human performance F1 of 91.2. On SQuAD 2.0, which includes unanswerable questions, BERT achieved an F1 of 83.1 and EM of 80.0, a 5.1-point F1 improvement.
On the Situations With Adversarial Generations (SWAG) dataset for grounded commonsense reasoning, BERT-Large achieved 86.3% accuracy, beating the previous best by 27.1 percentage points and approaching human performance of 88.0%.
BERT's performance on GLUE was so strong that it, along with subsequent models like XLNet and RoBERTa, quickly surpassed human-level performance on the benchmark by 2019. This prompted the creation of SuperGLUE, a harder benchmark designed to provide more room for improvement.
BERT was developed alongside several other approaches to pre-trained language representations. Understanding how these models differ helps explain why BERT was so effective.
| Feature | ELMo (2018) | GPT (2018) | BERT (2018) |
|---|---|---|---|
| Architecture | Bidirectional LSTM | Transformer decoder | Transformer encoder |
| Directionality | Shallow bidirectional (separate L-to-R and R-to-L) | Unidirectional (left-to-right) | Deep bidirectional |
| Pre-training objective | Language modeling (both directions, separately) | Autoregressive language modeling | Masked LM + Next Sentence Prediction |
| Downstream adaptation | Feature-based (frozen weights) | Fine-tuning (all parameters) | Fine-tuning (all parameters) |
| Parameters | 94M | 117M | 110M (Base), 340M (Large) |
| Context window | N/A (recurrent) | 512 tokens | 512 tokens |
ELMo's main limitation was its shallow bidirectionality: the forward and backward LSTMs were trained independently and only concatenated, so there was no cross-directional attention within individual layers. GPT used the transformer architecture (which is more parallelizable than LSTMs) but was restricted to left-to-right attention, limiting its ability to capture context to the right of any given token.
BERT combined the strengths of both approaches. It used the transformer architecture (like GPT) and processed text bidirectionally (like ELMo), but with true deep bidirectionality where each layer attends to both left and right context simultaneously.
BERT has been applied to a wide range of tasks in both research and industry.
In October 2019, Google announced that it was using BERT to improve search results for English-language queries, calling it "the biggest change in search in the past five years." By December 2019, BERT had been deployed for search queries in over 70 languages. By 2020, nearly every English search query processed by Google involved a BERT model. BERT helped Google better understand the intent behind search queries, especially longer and more conversational ones where prepositions and context words carry meaning. For example, a query like "2019 brazil traveler to usa need a visa" required understanding that the traveler is from Brazil (not going to Brazil), something earlier systems struggled with.
Fine-tuned BERT models achieve strong results on named entity recognition (NER) tasks, where the goal is to identify and classify entities such as person names, organizations, and locations in text. On the CoNLL-2003 NER benchmark, BERT-Large achieved an F1 score of 92.8.
BERT is widely used for sentiment analysis in customer service, product reviews, and social media monitoring. Its ability to capture context-dependent meaning makes it better than earlier bag-of-words or simple embedding methods at handling negation, sarcasm, and nuanced opinions.
BERT's performance on SQuAD made it a natural fit for building question answering systems. Developers can fine-tune BERT on domain-specific QA datasets to build customer support bots, medical information retrieval systems, and educational tools. Google itself used BERT-based models in its featured snippets and Google Assistant.
Specialized versions of BERT, such as BioBERT and ClinicalBERT, have been pre-trained on biomedical literature and clinical notes. These domain-adapted models outperform general-purpose BERT on tasks like biomedical NER, relation extraction, and clinical text classification.
Companies including Facebook (now Meta) have used BERT-based models for detecting hate speech, misinformation, and other harmful content on social media platforms.
Multilingual BERT (mBERT) was pre-trained on Wikipedia text from 104 languages using a shared WordPiece vocabulary. Despite having no explicit cross-lingual training signal, mBERT shows surprisingly strong zero-shot cross-lingual transfer, where a model fine-tuned on English data performs well on the same task in other languages.
BERT's release sparked a wave of follow-up work that improved on its design in various ways. These models are sometimes collectively referred to as "BERTology."
RoBERTa (Robustly Optimized BERT Pre-training Approach) was developed by Facebook AI Research. The authors found that BERT was significantly undertrained and that better results could be achieved through more careful tuning of training hyperparameters and procedures. Key changes included:

- Training on roughly ten times more data (about 160 GB of text versus BERT's ~16 GB), for longer and with much larger batches.
- Removing the Next Sentence Prediction objective entirely.
- Dynamic masking, where a fresh masking pattern is generated each time a sequence is seen, rather than a single static pattern fixed during preprocessing.
- A larger byte-level BPE vocabulary in place of character-level WordPiece.
RoBERTa matched or exceeded the performance of all models published after BERT at the time, scoring 88.5 on GLUE (compared to BERT-Large's 80.5).
ALBERT (A Lite BERT) was developed by Google Research and the Toyota Technological Institute at Chicago. It introduced two parameter-reduction techniques:

- Factorized embedding parameterization, which decomposes the large vocabulary embedding matrix into two smaller matrices, decoupling the embedding size from the hidden size.
- Cross-layer parameter sharing, in which all transformer layers share a single set of weights.
ALBERT also replaced NSP with Sentence Order Prediction (SOP), where the model must determine whether two consecutive sentences are in the correct order or have been swapped. An ALBERT-xxlarge configuration with 235M parameters outperformed BERT-Large (340M parameters) on several benchmarks.
DistilBERT was created by Hugging Face using knowledge distillation. The student model has 6 transformer layers (half of BERT-Base's 12) and 66M parameters, making it 40% smaller and 60% faster at inference. Despite its reduced size, DistilBERT retains about 97% of BERT-Base's language understanding capabilities as measured on GLUE. The distillation process uses a combination of three loss functions: language modeling loss, distillation loss (matching the teacher model's probability distributions), and cosine embedding loss.
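The distillation component of that objective, soft-target cross-entropy against the teacher's temperature-softened distribution, can be sketched in isolation (the temperature value here is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target cross-entropy used in knowledge distillation: the
    student is trained to match the teacher's softened distribution."""
    teacher_p = softmax(teacher_logits, temperature)
    student_p = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_p, student_p))

# The loss shrinks as the student's logits approach the teacher's:
teacher = [3.0, 1.0, 0.2]
far = distillation_loss([0.0, 0.0, 0.0], teacher)
near = distillation_loss([2.9, 1.1, 0.3], teacher)
print(near < far)  # True
```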
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) was developed by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning at Stanford University and Google. Instead of masked language modeling, ELECTRA uses a "replaced token detection" objective. A small generator network (a small MLM) produces plausible replacement tokens, and a larger discriminator network must identify which tokens in the input have been replaced. Because this task is defined over all input tokens (not just the 15% that are masked), ELECTRA is much more sample-efficient. An ELECTRA-Small model trained on one GPU for 4 days outperformed GPT on GLUE, and ELECTRA-Large matched RoBERTa and XLNet while using less than a quarter of their compute.
DeBERTa (Decoding-enhanced BERT with Disentangled Attention) was developed by Microsoft Research. It introduced two main improvements:

- Disentangled attention, which represents each token with separate vectors for content and position and computes attention weights from their pairwise interactions.
- An enhanced mask decoder, which injects absolute position information just before the model predicts the masked tokens.
In January 2021, a scaled-up DeBERTa model with 1.5 billion parameters became the first model to surpass human-level performance on the SuperGLUE benchmark, achieving a macro-average score of 90.3 compared to the human baseline of 89.8.
SpanBERT, developed by Facebook AI and other institutions, modified the masking strategy to mask contiguous random spans rather than individual random tokens. It also removed the NSP objective. SpanBERT outperformed BERT on span-selection tasks such as question answering and coreference resolution.
XLNet, developed by researchers at Carnegie Mellon University and Google, addressed BERT's limitation that masked tokens are predicted independently of each other. XLNet uses a permutation-based language modeling objective that captures bidirectional context while maintaining autoregressive properties. It also incorporated the recurrence mechanism from Transformer-XL to handle longer sequences.
Released in December 2024 by Answer.AI, LightOn, and collaborators, ModernBERT applies architectural improvements from recent large language model research to the encoder-only paradigm. It uses rotary positional embeddings (RoPE) instead of learned absolute embeddings, supports sequence lengths up to 8,192 tokens (compared to BERT's 512), and incorporates FlashAttention for faster processing. ModernBERT-Base has 149M parameters and ModernBERT-Large has 395M parameters. The model was positioned as a drop-in replacement for BERT in retrieval, classification, and entity extraction pipelines.
| Model | Year | Developer | Key innovation | Parameters |
|---|---|---|---|---|
| BERT-Base | 2018 | Google | Masked LM + bidirectional encoder | 110M |
| BERT-Large | 2018 | Google | Larger version of BERT-Base | 340M |
| RoBERTa | 2019 | Facebook AI | Dynamic masking, no NSP, more data | 355M |
| ALBERT | 2019 | Google / TTIC | Parameter sharing, factorized embeddings | 12M to 235M |
| DistilBERT | 2019 | Hugging Face | Knowledge distillation, 6 layers | 66M |
| XLNet | 2019 | CMU / Google | Permutation language modeling | 340M |
| SpanBERT | 2020 | Facebook AI | Span masking | 340M |
| ELECTRA | 2020 | Stanford / Google | Replaced token detection | 14M to 335M |
| DeBERTa | 2020 | Microsoft | Disentangled attention, enhanced mask decoder | 134M to 1.5B |
| ModernBERT | 2024 | Answer.AI / LightOn | RoPE, FlashAttention, 8K context | 149M to 395M |
Each transformer layer in BERT uses multi-head self-attention. For BERT-Base with 12 attention heads and a hidden size of 768, each head operates on a 64-dimensional subspace (768 / 12 = 64). The attention computation follows the standard scaled dot-product formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. Multi-head attention runs this computation in parallel across all heads and concatenates the results.
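The formula above is short enough to implement directly. A dependency-free sketch over lists of vectors (single head, no learned projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors:
    softmax(Q K^T / sqrt(d_k)) V, with no causal mask -- every query
    attends to every key, as in BERT's bidirectional encoder."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two positions, d_k = 2; each output row is a convex combination of V's rows.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```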
Unlike GPT, which uses a causal attention mask to prevent each position from attending to subsequent positions, BERT uses no such mask. Every token can attend to every other token in the sequence, enabling full bidirectional context.
After the attention sub-layer, each transformer layer applies a two-layer feed-forward network with a GELU (Gaussian Error Linear Unit) activation function. The inner dimension of the feed-forward network is 4 times the hidden size (3,072 for BERT-Base, 4,096 for BERT-Large).
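GELU is defined as x·Φ(x), where Φ is the standard normal CDF, so it can be computed exactly with the error function:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(0.0))             # 0.0
print(round(gelu(1.0), 4))   # 0.8413
print(round(gelu(-1.0), 4))  # -0.1587 -- unlike ReLU, small negatives pass through
```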
BERT applies dropout with a rate of 0.1 to all layers during training, including the attention weights and the outputs of the feed-forward sub-layers. Pre-training also used an L2 weight decay of 0.01 alongside the Adam optimizer.
Pre-training BERT-Large on 64 TPU chips for 4 days was a significant computational investment in 2018. Estimates place the cost at roughly $6,900 to $12,500 using cloud TPU pricing at the time. Subsequent work has reduced this cost substantially. In 2023, MosaicML demonstrated that a BERT-Base model could be pre-trained from scratch to competitive accuracy for approximately $20 using modern hardware and training optimizations.
Despite its success, BERT has several known limitations.
Maximum sequence length: BERT's input is limited to 512 tokens due to its learned positional embeddings. Documents longer than this must be truncated or split into chunks, which can lose context across boundaries. Later models like Longformer, BigBird, and ModernBERT addressed this with sparse attention or rotary positional encodings.
Independence assumption in MLM: During pre-training, BERT predicts each masked token independently given the unmasked tokens. This means it does not model the joint probability of masked tokens, which can be a disadvantage for generative tasks. XLNet's permutation language modeling was designed to address this.
Pre-training/fine-tuning mismatch: The [MASK] token appears during pre-training but never during fine-tuning or inference. The 80/10/10 masking strategy (described above) partially mitigates this, but the mismatch remains.
Not generative: Because BERT is an encoder-only model, it is not designed for text generation tasks such as summarization, translation, or dialogue. For these tasks, encoder-decoder models like T5 or decoder-only models like GPT are better suited.
Computational cost for fine-tuning: Although fine-tuning is cheaper than pre-training, running BERT-Large (340M parameters) for inference in production environments can be expensive. This motivated the development of smaller, distilled models like DistilBERT and TinyBERT.
NSP weakness: The Next Sentence Prediction task was later shown to be an ineffective pre-training objective. RoBERTa, ALBERT (with SOP), and SpanBERT all dropped NSP and achieved better results.
BERT's impact on the NLP field has been wide-reaching. Before BERT, most NLP systems were trained from scratch on task-specific labeled data, often with hand-engineered features. BERT popularized the "pre-train, then fine-tune" paradigm, which has since become the standard approach in NLP and has spread to other domains including computer vision (with models like ViT) and speech recognition (with models like wav2vec).
Several trends that BERT helped set in motion include:
Large-scale pre-training: BERT showed that training on large amounts of unlabeled text produces general-purpose representations that transfer well to downstream tasks. This insight was extended by GPT-2, GPT-3, and later large language models (LLMs) that scaled up both model size and training data.
Open model release: Google's decision to open-source BERT's weights and code set an expectation in the research community. It enabled thousands of researchers and companies to build on BERT's foundation without needing Google-scale compute to pre-train their own models.
Hugging Face ecosystem: BERT was one of the first models to gain widespread adoption through the Hugging Face Transformers library, which provided a PyTorch implementation within weeks of BERT's release. As of 2024, BERT remains the second most downloaded model on Hugging Face, with over 68 million monthly downloads.
Domain-specific pre-training: BERT inspired a wave of domain-adapted language models, including SciBERT (scientific text), BioBERT (biomedical literature), ClinicalBERT (clinical notes), FinBERT (financial text), and LegalBERT (legal documents). These models are pre-trained on domain-specific corpora and consistently outperform general-purpose BERT on in-domain tasks.
Encoder models for production: Even as decoder-only LLMs like GPT-4 dominate headlines, BERT-style encoder models remain the workhorses for many production NLP systems. Their lower computational cost and strong performance on classification, retrieval, and extraction tasks make them practical choices for applications that run at scale.
BERT also contributed to the broader conversation about what language models actually learn. The "BERTology" research subfield has produced hundreds of papers analyzing BERT's internal representations, probing what linguistic information its layers capture, and studying how attention heads specialize. This work has provided insights into syntax, semantics, and the nature of contextual representations in neural networks.
BERT's original TensorFlow implementation and pre-trained weights are available on GitHub at google-research/bert. The model is also available through the Hugging Face Transformers library (in PyTorch, TensorFlow, and JAX/Flax) under model identifiers such as bert-base-uncased, bert-base-cased, bert-large-uncased, and bert-large-cased. Uncased models have all text lowercased before tokenization, while cased models preserve original casing.
Multilingual BERT (bert-base-multilingual-cased) covers 104 languages, and Chinese BERT (bert-base-chinese) is trained specifically on Chinese text.