BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google AI Language researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. First published as a preprint in October 2018 and formally presented at NAACL-HLT 2019, BERT introduced a pre-training approach for neural network-based language understanding that processes text bidirectionally, reading each word in the context of all surrounding words rather than left-to-right or right-to-left alone. This innovation enabled BERT to achieve state-of-the-art results across 11 natural language processing (NLP) benchmarks at the time of its release, and it has since become one of the most influential models in the history of the field. The original paper has accumulated over 100,000 citations across various academic databases, making it one of the most cited papers in all of computer science.
Explain like I'm 5 (ELI5)
Imagine you are trying to understand a word in a sentence, like the word "bank" in "I went to the bank to catch fish." If you only read the words before "bank," you might think it means a place where you keep money. But if you also read the words after it, you realize it means the side of a river. BERT works like this. Instead of reading a sentence in one direction, it looks at all the words around each word at the same time. This helps it understand what words really mean based on the full picture. First, BERT reads millions of sentences from the internet to learn how language works (this is called pre-training). Then, when you give it a specific job, like answering questions or figuring out if a movie review is positive or negative, it only needs a small amount of extra practice to get really good at it.
Background and motivation
Before BERT, most approaches to NLP relied on training models from scratch on task-specific labeled data, or on pre-trained word embedding representations like Word2Vec and GloVe that assigned a single fixed vector to each word regardless of context. The word "bank" would receive the same representation whether it appeared in a financial context or a geographical one. Context-dependent models like ELMo (Peters et al., 2018) improved on this by generating different representations based on surrounding text, but ELMo used a shallow concatenation of independently trained left-to-right and right-to-left LSTM layers, limiting the depth of bidirectional interaction.
OpenAI's GPT (Radford et al., 2018) demonstrated that pre-training a transformer model on a large corpus followed by fine-tuning on downstream tasks could yield strong results, establishing the pre-train then fine-tune paradigm. However, GPT used a unidirectional (left-to-right) architecture, meaning each token could only attend to tokens before it. This constraint limited the model's ability to capture context from both directions simultaneously.
BERT addressed these limitations by introducing a training objective, masked language modeling, that allowed a transformer encoder to learn deep bidirectional representations. The result was a general-purpose language representation model that could be fine-tuned for a wide range of tasks with minimal architectural modification, marking a significant step forward for transfer learning in NLP.
Architecture
BERT uses the encoder portion of the original transformer architecture described in "Attention Is All You Need" (Vaswani et al., 2017). Unlike the full transformer, which includes both encoder and decoder stacks, BERT uses only the encoder. This means every token in the input can attend to every other token through self-attention, without the causal masking used in decoder models that prevents tokens from attending to future positions.
Model configurations
The original paper introduced two model sizes:
| Configuration | Layers (L) | Hidden size (H) | Attention heads (A) | Feed-forward size | Parameters |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 3,072 | 110M |
| BERT-Large | 24 | 1,024 | 16 | 4,096 | 340M |
BERT-Base was designed to have the same model size as OpenAI's GPT for fair comparison, while BERT-Large demonstrated that scaling up the model yielded further improvements.
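The parameter totals in the table can be approximated from the layer count, hidden size, and feed-forward size alone. The sketch below is an estimate under stated assumptions, not the authors' own accounting: it assumes a 30,522-entry vocabulary (the size of the released uncased English vocabulary, more precise than the "approximately 30,000" given in this article), 512 positions, 2 segment types, and the weight layout used by common open-source implementations (biases on every projection, a layer normalization after the embeddings and after each sub-layer, and a pooler layer on top of [CLS]).

```python
def bert_parameter_count(layers, hidden, ffn, vocab=30522, max_pos=512, segments=2):
    """Estimate encoder parameters (embeddings + layers + pooler), excluding task heads."""
    layer_norm = 2 * hidden  # one scale and one bias vector
    # Token, position, and segment tables plus the embedding layer norm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    # Dense layer applied to the final [CLS] state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_parameter_count(layers=12, hidden=768, ffn=3072)
large = bert_parameter_count(layers=24, hidden=1024, ffn=4096)
print(f"BERT-Base:  {base:,}")
print(f"BERT-Large: {large:,}")
```

Under these assumptions the estimate comes to roughly 109 million and 335 million parameters, close to the rounded 110M and 340M figures in the table. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding to it.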
BERT uses a WordPiece tokenizer with a vocabulary of approximately 30,000 tokens. Each input sequence begins with a special [CLS] token, whose final hidden state serves as the aggregate sequence representation for classification tasks. Sentences within a pair are separated by a special [SEP] token.
The input embedding for each token is constructed by summing three components:
- Token embedding: The WordPiece token's learned vector representation.
- Segment embedding: Indicates whether the token belongs to sentence A or sentence B (used in sentence-pair tasks).
- Position embedding: Encodes the token's absolute position in the sequence, supporting sequences up to 512 tokens.
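The input layout and the three-way sum described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: `build_inputs`, `make_table`, and `embed` are hypothetical names, the embedding tables hold random values in place of learned ones, and the closing [SEP] after the last sentence follows the convention of the released code rather than anything stated in this section. The layer normalization that follows the sum in the released model is omitted.

```python
import random

MAX_POSITIONS = 512  # size of BERT's learned position-embedding table


def build_inputs(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for one sentence or a sentence pair."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # sentence A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    if len(tokens) > MAX_POSITIONS:
        raise ValueError("sequence is longer than 512 positions")
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(keys, dim, rng):
    """Hypothetical stand-in for a learned embedding table (random values)."""
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}


def embed(tokens, segment_ids, position_ids, dim=4, seed=0):
    """Sum token, segment, and position vectors element-wise for each input slot."""
    rng = random.Random(seed)
    token_table = make_table(sorted(set(tokens)), dim, rng)
    segment_table = make_table([0, 1], dim, rng)
    position_table = make_table(range(MAX_POSITIONS), dim, rng)
    return [
        [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_inputs(["my", "dog"], ["he", "barks"])
vectors = embed(tokens, segment_ids, position_ids)
print(tokens)
print(segment_ids)
```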
Self-attention mechanism
Each transformer layer applies multi-head self-attention followed by a position-wise feed-forward network. In self-attention, every token computes query, key, and value vectors. The attention weight between two tokens comes from the dot product of one token's query with the other's key, scaled by the square root of the key dimension and normalized with a softmax over all positions. Because BERT does not use causal masking, every token can attend to every other token in both directions. This all-to-all attention pattern is what makes BERT deeply bidirectional, and it distinguishes BERT from autoregressive models like GPT.
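The difference between all-to-all and causal attention is easiest to see in the weight matrix itself. The following single-head sketch uses scaled dot-product attention as defined by Vaswani et al. (2017); it skips the learned query, key, and value projections and the multi-head split, and the function names and toy vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """weights[i][j] is how much token i attends to token j; each row sums to 1."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # decoder-style mask hides future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / scale)
        weights.append(softmax(scores))
    return weights


def attend(weights, values):
    """Mix value vectors according to the attention weights."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)] for row in weights]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy queries, keys, and values
bidirectional = attention_weights(vectors, vectors)             # BERT-style
left_to_right = attention_weights(vectors, vectors, causal=True)  # GPT-style
outputs = attend(bidirectional, vectors)
```

In the bidirectional matrix every entry is positive, so the first token already draws on the last one. In the causal matrix everything above the diagonal is zero, so the first token can only attend to itself.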
Pre-training
BERT is pre-trained on two self-supervised learning objectives using a large unlabeled text corpus. The pre-training data consisted of the BooksCorpus (800 million words) and English Wikipedia (2,500 million words), totaling approximately 3.3 billion words of text.
Masked language modeling (MLM)
The primary pre-training task is masked language modeling. In each training example, 15% of the input tokens are selected for prediction. Of these selected tokens:
- 80% are replaced with the special [MASK] token.
- 10% are replaced with a random token from the vocabulary.
- 10% are left unchanged.
The model must predict the original identity of each selected token based on the surrounding context. This objective forces the model to build deep bidirectional representations, because any token might need to be predicted, and the prediction depends on both the left and right context.
The 80/10/10 split was a deliberate design choice. The [MASK] token never appears during fine-tuning, so if every selected token were replaced with [MASK], the model would be pre-trained on inputs unlike those it later sees. By sometimes keeping the original token or substituting a random one, the model cannot tell which positions it will be asked to predict, so it must maintain a contextual representation of every input token, not just the masked positions.
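The selection and corruption procedure can be written down directly. The sketch below is one plausible reading of the description, not the original preprocessing code: `mask_tokens` is a hypothetical helper, it picks a fixed number of positions (15% of the non-special tokens, rounded, at least one) rather than flipping an independent coin per token, and the replacement vocabulary is a toy list.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] holds the original token at selected positions, else None."""
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_selected = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    selected = rng.sample(candidates, num_selected)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i in selected:
        labels[i] = tokens[i]  # the prediction target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


toy_vocab = ["river", "money", "fish", "walk", "tree"]  # hypothetical vocabulary
sentence = ["[CLS]"] + [f"w{i}" for i in range(20)] + ["[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, random.Random(0))
print(corrupted)
```

The loss is computed only at positions where `labels` is not `None`; all other positions contribute context but no training signal.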
Next sentence prediction (NSP)
The second pre-training task is next sentence prediction. The model receives two sentences and must classify whether the second sentence is the actual next sentence in the original document (labeled "IsNext") or a randomly sampled sentence from the corpus (labeled "NotNext"). Training pairs are constructed with a 50/50 split between true and random pairs.
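A minimal sketch of how such pairs might be assembled follows. The helper name `make_nsp_pairs` and the toy corpus are hypothetical, and drawing the random sentence from a different document is an assumption made here so that a "NotNext" pair can never be a true continuation.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) examples from documents given as lists of sentences."""
    if sum(1 for doc in documents if doc) < 2:
        raise ValueError("need at least two non-empty documents to sample negatives")
    examples = []
    for doc_index, sentences in enumerate(documents):
        for i in range(len(sentences) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                examples.append((sentences[i], sentences[i + 1], "IsNext"))
            else:
                # Negative example: a random sentence from some other document.
                other_docs = [d for j, d in enumerate(documents) if j != doc_index and d]
                random_sentence = rng.choice(rng.choice(other_docs))
                examples.append((sentences[i], random_sentence, "NotNext"))
    return examples


corpus = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3", "c4"],
]
pairs = make_nsp_pairs(corpus, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(sentence_a, sentence_b, label)
```

Each draw is an independent coin flip, so the split is 50/50 in expectation rather than exactly balanced on a small corpus.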
NSP was designed to help the model understand relationships between sentence pairs, which is important for tasks like natural language inference and question answering. However, later research by Liu et al. (2019) in the RoBERTa paper found that removing NSP could actually improve performance, suggesting that the benefits of this objective were less clear than originally believed.
Training details
BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, while BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days. The estimated training cost for BERT-Base was approximately $500 USD at the time. Both models were trained with a batch size of 256 sequences and a maximum sequence length of 512 tokens, using Adam with learning rate warmup and linear decay.
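The warmup-then-linear-decay schedule can be expressed as a small function of the step count. The defaults below (peak rate 1e-4, 10,000 warmup steps, 1,000,000 total steps) are the values reported in the original paper rather than anything stated in this article, so treat them as assumptions; the function name is illustrative.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
```

The rate climbs to its peak at the end of warmup, passes through half the peak midway through the decay phase, and reaches zero at the final step.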
Fine-tuning for downstream tasks
One of BERT's most significant contributions was demonstrating that a single pre-trained model could be adapted to a wide variety of NLP tasks by adding a simple task-specific output layer and fine-tuning all parameters end-to-end. This approach required far less task-specific architecture engineering than previous methods.
For each downstream task, BERT's input is formatted slightly differently:
- Single sentence tasks (e.g., sentiment classification): The sentence is provided after the [CLS] token, and the final hidden state of [CLS] is fed to a classification layer.
- Sentence pair tasks (e.g., natural language inference): Both sentences are concatenated with a [SEP] token between them.
- Token-level tasks (e.g., named entity recognition): The final hidden state of each token is fed to a classification layer that predicts a label for each token.
- Span extraction tasks (e.g., question answering on SQuAD): The model learns start and end pointers over the input sequence to extract an answer span.
Fine-tuning is typically fast, requiring only 2 to 4 epochs on most tasks, and can be performed on a single GPU in a matter of hours.
Benchmark results
BERT established new state-of-the-art results on multiple benchmarks when it was released:
| Benchmark | Metric | BERT-Large result | Previous best | Human performance |
|---|---|---|---|---|
| GLUE (overall) | Average score | 80.5 | 72.8 | ~87 |
| MultiNLI | Accuracy | 86.7% | 82.1% | 92.0% |
| SQuAD v1.1 | F1 | 93.2 | 91.7 | 91.2 |
| SQuAD v2.0 | F1 | 83.1 | 78.0 | 89.5 |
| SWAG | Accuracy | 86.3% | 59.2% (ESIM+ELMo) | 88.0% |
On SQuAD v1.1, BERT-Large's F1 score of 93.2 actually surpassed the estimated human performance of 91.2, marking one of the first times a machine learning model exceeded human-level accuracy on a major reading comprehension benchmark.
BERT variants and successors
BERT's success inspired a large family of derived models, each addressing specific limitations or optimizing for different goals.
RoBERTa (2019)
RoBERTa (Robustly Optimized BERT Pre-training Approach), developed by Liu et al. at Facebook AI, demonstrated that BERT was significantly undertrained and that simple changes to the training procedure could yield substantial improvements. Key modifications included:
- Removing the NSP objective, which was found to match or slightly improve performance on downstream tasks.
- Dynamic masking, where the masking pattern changes each time a sequence is fed to the model, rather than using a single static mask generated during data preprocessing.
- Larger training batches (8,000 sequences per batch versus BERT's 256).
- More training data, using 160 GB of text including CC-News, OpenWebText, and Stories in addition to the original BooksCorpus and Wikipedia.
- Longer training, with more total training steps.
RoBERTa achieved a GLUE score of 88.5, substantially outperforming BERT-Large's 80.5, and became a widely used baseline for subsequent encoder models.
DistilBERT (2019)
DistilBERT (Sanh et al., 2019) from Hugging Face used knowledge distillation to compress BERT into a smaller model that retained 97% of BERT's language understanding capabilities on the GLUE benchmark while being 40% smaller (66 million parameters versus 110 million) and 60% faster at inference. The student model uses 6 transformer layers (half of BERT-Base's 12) and is trained with a combination of language modeling, distillation, and cosine distance losses. DistilBERT made BERT practical for deployment in latency-sensitive and resource-constrained environments.
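The three loss terms can be sketched for a single prediction. This is an illustration under assumptions, not the DistilBERT training code: the distillation term is written as a KL divergence between temperature-softened teacher and student distributions (one common formulation), and the temperature of 2.0 and the equal loss weights are hypothetical placeholders.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(teacher, student))


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity, pulling student hidden states toward the teacher's direction."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def combined_loss(mlm_loss, distill, cosine, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are placeholders."""
    w_mlm, w_distill, w_cos = weights
    return w_mlm * mlm_loss + w_distill * distill + w_cos * cosine


distill = distillation_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0])
cosine = cosine_loss([0.2, 0.9, -0.1], [0.3, 1.0, 0.0])
total = combined_loss(mlm_loss=2.3, distill=distill, cosine=cosine)
print(total)
```

Both auxiliary terms vanish when the student reproduces the teacher exactly, leaving only the ordinary language modeling loss.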
ALBERT (2020)
ALBERT (A Lite BERT), introduced by Lan et al. at Google Research, tackled BERT's parameter inefficiency through two techniques:
- Factorized embedding parameterization: Instead of projecting one-hot vectors directly into the hidden space of size H, ALBERT first projects them into a lower-dimensional embedding space of size E, then up to H. This reduces embedding parameters from O(V x H) to O(V x E + E x H).
- Cross-layer parameter sharing: All transformer layers share the same parameters, preventing parameter growth with depth.
These techniques allowed ALBERT to achieve 18x fewer parameters than BERT-Large while training 1.7x faster. ALBERT also replaced NSP with a sentence order prediction (SOP) task, which proved more effective.
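The saving from factorized embeddings follows directly from the two expressions above. As a worked example, assume a vocabulary of V = 30,000, a hidden size of H = 768, and an embedding size of E = 128; these values are chosen for illustration and the function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Embedding parameter count: V*H when direct, V*E + E*H when factorized."""
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128
direct = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"direct:     {direct:,}")
print(f"factorized: {factorized:,}")
print(f"reduction:  {direct / factorized:.1f}x")
```

With these numbers the embedding table shrinks from about 23.0 million to about 3.9 million parameters. The saving grows with H because the V x E term no longer depends on the hidden size.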
DeBERTa (2021)
DeBERTa (Decoding-enhanced BERT with Disentangled Attention), developed by He et al. at Microsoft, introduced two key innovations:
- Disentangled attention: Each word is represented using two separate vectors encoding content and relative position respectively. Attention weights are computed from disentangled matrices as the sum of content-to-content, content-to-position, and position-to-content terms, rather than by summing content and position embeddings before computing attention.
- Enhanced mask decoder: Absolute position information is incorporated in the decoding layer for token prediction, complementing the relative position information used in the attention layers.
DeBERTa outperformed RoBERTa on a majority of NLU tasks, and in January 2021, DeBERTa became the first model to surpass human performance on the SuperGLUE benchmark.
Multilingual BERT
Google released a multilingual version of BERT (mBERT) pre-trained on the concatenation of Wikipedia text from 104 languages. Despite having no explicit cross-lingual training signal, mBERT demonstrated surprisingly strong zero-shot cross-lingual transfer capabilities, where fine-tuning on English data alone yielded reasonable performance on other languages.
Comparison of BERT family models
| Model | Year | Parameters | Key innovation | GLUE score (approx.) |
|---|---|---|---|---|
| BERT-Base | 2018 | 110M | Masked LM + bidirectional encoder | 79.6 |
| BERT-Large | 2018 | 340M | Scaled-up BERT | 80.5 |
| DistilBERT | 2019 | 66M | Knowledge distillation from BERT | ~77 |
| RoBERTa | 2019 | 355M | No NSP, dynamic masking, more data | 88.5 |
| ALBERT | 2020 | 12M (base) | Parameter sharing, factorized embeddings | 89.4 (xxlarge) |
| ELECTRA | 2020 | 110M (base) | Replaced-token detection | 89.4 |
| DeBERTa | 2021 | 134M (base) | Disentangled attention | 90.0+ |
ModernBERT (2024)
In December 2024, Answer.AI and LightOn AI released ModernBERT, described as the first major update to the BERT architecture in years. ModernBERT incorporates several architectural modernizations that had emerged in the decoder-only model line (such as GPT-style models) but had not yet been applied to encoder models.
Key improvements in ModernBERT include:
- Rotary positional embeddings (RoPE): Replaces BERT's absolute positional embeddings, enabling better modeling of token relationships and supporting longer sequences.
- Extended context length: Supports up to 8,192 tokens (16 times BERT's 512-token limit).
- GeGLU activation layers: Replace BERT's original GELU activation function with a gated variant.
- Alternating attention: Uses global attention every 3 layers and local sliding-window attention in the other layers. The local layers scale linearly rather than quadratically with sequence length, which substantially lowers the overall cost of processing long sequences.
- Flash Attention 2 integration: Enables 2 to 3 times faster inference on long contexts.
- Larger training corpus: Trained on 2 trillion tokens of diverse English text, including web documents, code, and scientific papers.
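The effect of alternating attention can be estimated by counting how many query-key pairs each kind of layer scores. The sketch below is illustrative only: treating layer 0 as a global layer and using a 128-token local window (64 tokens on each side) are assumptions, and the function names are hypothetical.

```python
def uses_global_attention(layer_index, global_every=3):
    """Assume layers 0, 3, 6, ... attend globally and the rest use a sliding window."""
    return layer_index % global_every == 0


def attended_pairs(seq_len, window=None):
    """Count (query, key) pairs scored by one layer; window=None means global attention."""
    if window is None:
        return seq_len * seq_len
    half = window // 2
    # Each token sees itself plus up to `half` neighbours on each side.
    return sum(min(seq_len - 1, i + half) - max(0, i - half) + 1 for i in range(seq_len))


layer_types = ["global" if uses_global_attention(i) else "local" for i in range(6)]
global_pairs = attended_pairs(8192)
local_pairs = attended_pairs(8192, window=128)
print(layer_types)
print(f"global layer: {global_pairs:,} pairs")
print(f"local layer:  {local_pairs:,} pairs")
```

At 8,192 tokens a local layer scores roughly 64 times fewer pairs than a global one under these assumptions. The global layers remain quadratic, so the overall saving depends on the ratio of local to global layers.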
ModernBERT-Base (149M parameters) became the first base-size encoder model to beat DeBERTaV3 on the GLUE benchmark while being 2 to 4 times faster and using roughly one-fifth the memory. ModernBERT-Large (395M parameters) achieves state-of-the-art results across retrieval, classification, and code understanding tasks.
Sentence-BERT and embedding applications
While BERT produces contextualized token representations, obtaining a single fixed-size embedding for an entire sentence is not straightforward. Simply averaging BERT's token outputs or using the [CLS] token produces sentence embeddings that perform poorly on semantic similarity tasks.
Sentence-BERT (SBERT), introduced by Reimers and Gurevych (2019), solved this problem by fine-tuning BERT using a siamese network structure. Two sentences are passed through identical BERT models, and the resulting embeddings are trained to be close in vector space when the sentences are semantically similar. SBERT reduced the time needed to find the most similar sentence pair in a collection of 10,000 sentences from roughly 65 hours (using cross-encoding with vanilla BERT) to about 5 seconds, while maintaining comparable accuracy. SBERT and its successors became the foundation for practical semantic search, clustering, and retrieval applications.
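The speedup comes from how many times the model has to run. A cross-encoder needs one forward pass per sentence pair, while a siamese (bi-encoder) setup needs one pass per sentence followed by cheap vector comparisons. The sketch below works through that arithmetic and the comparison step; the pooling choice (mean over token vectors), the toy vectors, and the function names are assumptions for illustration.

```python
import math
from itertools import combinations


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_pair(embeddings):
    """Brute-force search over all pairs of precomputed sentence embeddings."""
    return max(
        combinations(range(len(embeddings)), 2),
        key=lambda pair: cosine_similarity(embeddings[pair[0]], embeddings[pair[1]]),
    )


def cross_encoder_passes(num_sentences):
    """One model pass per unordered sentence pair."""
    return num_sentences * (num_sentences - 1) // 2


def bi_encoder_passes(num_sentences):
    """One model pass per sentence; pairs are compared with cosine similarity afterwards."""
    return num_sentences


# Toy token vectors standing in for encoder outputs of three sentences.
embeddings = [
    mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    mean_pool([[0.0, 1.0], [0.1, 0.9]]),
    mean_pool([[0.9, 0.1], [1.0, 0.0]]),
]
print(most_similar_pair(embeddings))
print(cross_encoder_passes(10_000), "vs", bi_encoder_passes(10_000))
```

For 10,000 sentences the cross-encoder needs 49,995,000 model passes against 10,000 for the bi-encoder, which accounts for the gap between hours and seconds.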
BERT vs. GPT: encoder vs. decoder
BERT and GPT represent two fundamentally different approaches to building language models from the transformer architecture, and understanding the distinction is important for choosing the right model for a given task.
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Attention direction | Bidirectional (all-to-all) | Unidirectional (causal, left-to-right) |
| Pre-training objective | Masked language modeling | Next-token prediction (autoregressive) |
| Primary strength | Understanding and classification | Text generation |
| Input/output | Encodes input into representations | Generates output token by token |
| Typical tasks | Classification, NER, QA extraction, semantic similarity | Text generation, dialogue, summarization, translation |
| Context window (original) | 512 tokens | 512 tokens (GPT-1), up to 128K+ (GPT-4) |
BERT excels at tasks that require deep understanding of the entire input, such as classification, named entity recognition, and extractive question answering. Because it can attend to all tokens simultaneously, it builds richer representations of each token's meaning. GPT excels at generative tasks because its autoregressive design naturally produces fluent, coherent text one token at a time.
The success of both approaches has led some researchers to develop models that combine elements of both, such as encoder-decoder models (T5, BART) that use bidirectional encoding of the input and autoregressive generation of the output.
Applications
Google Search
In October 2019, Google announced that BERT was being applied to English-language search queries in the United States, calling it "the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search." BERT helped Google better understand the nuances of search queries, particularly longer and more conversational ones where prepositions and context words significantly affect meaning. In December 2019, Google expanded BERT to search queries in over 70 languages, and by October 2020 BERT was processing almost every English-language query.
Text classification
BERT has been widely adopted for sentiment analysis, spam detection, topic classification, and content moderation. The [CLS] token representation provides a natural mechanism for mapping entire documents or sentences to class labels.
Named entity recognition (NER)
BERT's token-level representations make it well suited for sequence labeling tasks like NER, where each token must be classified as a person, organization, location, or other entity type. Fine-tuned BERT models significantly outperformed previous feature-engineered NER systems.
Question answering
BERT's architecture is particularly effective for extractive question answering, where the answer is a span within a given passage. The model learns to predict start and end positions of the answer span. BERT-based models became dominant on benchmarks such as SQuAD and Natural Questions.
Biomedical and scientific NLP
Domain-specific variants like BioBERT (Lee et al., 2020) and SciBERT (Beltagy et al., 2019) were pre-trained on biomedical literature and scientific papers respectively, achieving strong results on tasks like biomedical named entity recognition, relation extraction, and question answering.
Limitations
Despite its impact, BERT has several well-known limitations:
- Fixed context window: BERT can process a maximum of 512 tokens per input. Longer documents must be truncated or split into chunks, potentially losing important context. ModernBERT (2024) addresses this by extending the limit to 8,192 tokens.
- Quadratic attention complexity: The self-attention mechanism computes attention between all pairs of tokens, resulting in O(n^2) time and memory complexity with respect to sequence length. This makes scaling to very long sequences computationally expensive.
- Not generative: BERT is an encoder-only model and cannot generate text autoregressively. It is not suitable for tasks like open-ended text generation, dialogue, or summarization without substantial architectural modification.
- Pre-training/fine-tuning mismatch: The [MASK] tokens used during pre-training never appear during fine-tuning, creating a distribution gap. The 80/10/10 masking strategy mitigates but does not eliminate this issue.
- Static masking (original BERT): The original BERT generated its masks once during data preprocessing, so the model saw the same masked versions of each sentence throughout training. RoBERTa showed that dynamic masking, where the mask pattern is resampled each time a sequence is fed to the model, performs comparably or slightly better.
- Computational cost of fine-tuning: While fine-tuning is faster than pre-training, it still requires updating all model parameters for each downstream task, which can be resource-intensive when deploying many task-specific models.
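The first two limitations are often handled together in practice: a long document is split into overlapping windows that each fit the 512-token limit, which also keeps the quadratic attention cost bounded per window. The sketch below shows one way to do this; the function names are hypothetical, the 128-token overlap is an arbitrary choice, and two positions per window are assumed to be reserved for [CLS] and [SEP].

```python
def split_into_chunks(token_ids, max_len=512, stride=128):
    """Split token ids into overlapping windows that fit a max_len-position model.

    `stride` is the number of tokens shared between consecutive windows.
    """
    body_len = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 <= stride < body_len:
        raise ValueError("stride must be non-negative and smaller than the window body")
    if not token_ids:
        return []
    step = body_len - stride
    chunks = []
    start = 0
    while True:
        end = min(start + body_len, len(token_ids))
        chunks.append(list(token_ids[start:end]))
        if end == len(token_ids):
            break
        start += step
    return chunks


def attention_entries(seq_len):
    """Entries in one attention matrix (per head, per layer): quadratic in length."""
    return seq_len * seq_len


chunks = split_into_chunks(list(range(1000)))
print([len(chunk) for chunk in chunks])
print(attention_entries(8192) // attention_entries(512))
```

The overlap gives tokens near a window boundary some context from both sides, at the cost of processing those tokens twice. The second printed value shows why simply lengthening the window is expensive: 16 times more tokens means 256 times more attention entries.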
Historical significance
BERT's release in October 2018 is widely considered a watershed moment in NLP. It demonstrated that large-scale bidirectional pre-training, combined with straightforward fine-tuning, could dramatically improve performance across a diverse set of language understanding tasks. This paradigm, sometimes called the "pre-train, fine-tune" approach, became the dominant methodology in NLP and influenced the broader field of machine learning.
BERT also sparked what some researchers have called a "Cambrian explosion" of pre-trained language models. Within a year of BERT's release, dozens of variants appeared, including RoBERTa, ALBERT, DistilBERT, XLNet, ELECTRA, and many more. The rapid iteration on BERT's ideas pushed NLP benchmarks forward at an unprecedented pace. On the GLUE benchmark, the gap between BERT's score and estimated human performance was largely closed within two years of BERT's publication.
Even as large generative models like GPT-3, GPT-4, and other decoder-only architectures have come to dominate headlines, BERT and its descendants remain widely used in production systems. Encoder models are preferred for many classification, retrieval, and embedding tasks because they are smaller, faster, and cheaper to run than large generative models while delivering comparable or superior accuracy on understanding tasks. The release of ModernBERT in 2024 demonstrates that the encoder-only paradigm continues to evolve and remains relevant.
References
- Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*. https://arxiv.org/abs/1810.04805
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*. https://arxiv.org/abs/1706.03762
- Liu, Y., Ott, M., Goyal, N., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." *arXiv preprint arXiv:1907.11692*. https://arxiv.org/abs/1907.11692
- Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." *NeurIPS 2019 Workshop on Energy Efficient Machine Learning and Cognitive Computing*. https://arxiv.org/abs/1910.01108
- Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." *Proceedings of ICLR 2020*. https://arxiv.org/abs/1909.11942
- He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." *Proceedings of ICLR 2021*. https://arxiv.org/abs/2006.03654
- Reimers, N. & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." *Proceedings of EMNLP-IJCNLP 2019*. https://arxiv.org/abs/1908.10084
- Warner, B., Chaffin, A., Clavie, B., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." *arXiv preprint arXiv:2412.13663*. https://arxiv.org/abs/2412.13663
- Nayak, P. (2019). "Understanding searches better than ever before." *Google Blog*. https://blog.google/products/search/search-language-understanding-bert/
- Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." *Proceedings of ICLR 2019*. https://arxiv.org/abs/1804.07461
- Clark, K., Luong, M.T., Le, Q.V., & Manning, C.D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." *Proceedings of ICLR 2020*. https://arxiv.org/abs/2003.10555
- Peters, M.E., Neumann, M., Iyyer, M., et al. (2018). "Deep contextualized word representations." *Proceedings of NAACL-HLT 2018*. https://arxiv.org/abs/1802.05365