BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google AI Language researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. First published as a preprint in October 2018 and formally presented at NAACL-HLT 2019, BERT introduced a pre-training approach for neural network-based language understanding that processes text bidirectionally, reading each word in the context of all surrounding words rather than left-to-right or right-to-left alone. This innovation enabled BERT to achieve state-of-the-art results across 11 natural language processing (NLP) benchmarks at the time of its release, and it has since become one of the most influential models in the history of the field. The original paper has accumulated over 100,000 citations across various academic databases, making it one of the most cited papers in all of computer science.
Imagine you are trying to understand a word in a sentence, like the word "bank" in "I went to the bank to catch fish." If you only read the words before "bank," you might think it means a place where you keep money. But if you also read the words after it, you realize it means the side of a river. BERT works the same way. Instead of reading a sentence in one direction, it looks at all the words around each word at the same time. This helps it understand what words really mean based on the full picture. First, BERT reads millions of sentences from books and Wikipedia to learn how language works (this is called pre-training). Then, when you give it a specific job, like answering questions or figuring out if a movie review is positive or negative, it only needs a small amount of extra practice to get really good at it.
Before BERT, most approaches to NLP relied on training models from scratch on task-specific labeled data, or on pre-trained word embedding representations like Word2Vec and GloVe that assigned a single fixed vector to each word regardless of context. The word "bank" would receive the same representation whether it appeared in a financial context or a geographical one. Context-dependent models like ELMo (Peters et al., 2018) improved on this by generating different representations based on surrounding text, but ELMo used a shallow concatenation of independently trained left-to-right and right-to-left LSTM layers, limiting the depth of bidirectional interaction.
OpenAI's GPT (Radford et al., 2018) demonstrated that pre-training a transformer model on a large corpus followed by fine-tuning on downstream tasks could yield strong results, establishing the pre-train then fine-tune paradigm. However, GPT used a unidirectional (left-to-right) architecture, meaning each token could only attend to tokens before it. This constraint limited the model's ability to capture context from both directions simultaneously.
BERT addressed these limitations by introducing a training objective, masked language modeling, that allowed a transformer encoder to learn deep bidirectional representations. The result was a general-purpose language representation model that could be fine-tuned for a wide range of tasks with minimal architectural modification, marking a significant step forward for transfer learning in NLP.
BERT uses the encoder portion of the original transformer architecture described in "Attention Is All You Need" (Vaswani et al., 2017). Unlike the full transformer, which includes both encoder and decoder stacks, BERT uses only the encoder. This means every token in the input can attend to every other token through self-attention, without the causal masking used in decoder models that prevents tokens from attending to future positions.
The original paper introduced two model sizes:
| Configuration | Layers (L) | Hidden size (H) | Attention heads (A) | Feed-forward size | Parameters |
|---|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 3,072 | 110M |
| BERT-Large | 24 | 1,024 | 16 | 4,096 | 340M |
BERT-Base was designed to have the same model size as OpenAI's GPT for fair comparison, while BERT-Large demonstrated that scaling up the model yielded further improvements.
BERT uses a WordPiece tokenizer with a vocabulary of approximately 30,000 tokens. Each input sequence begins with a special [CLS] token, whose final hidden state serves as the aggregate sequence representation for classification tasks. Sentences within a pair are separated by a special [SEP] token.
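The following sketch, which assumes the Hugging Face `transformers` package, illustrates how the WordPiece tokenizer wraps a sentence pair with the [CLS] and [SEP] special tokens; the example sentences and printed comments are illustrative only.

```python
# A minimal tokenization sketch (assumes the Hugging Face `transformers` package).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # ~30,000-token WordPiece vocabulary

# Passing two strings produces a single packed sequence: [CLS] sentence A [SEP] sentence B [SEP]
encoded = tokenizer("The river bank was muddy.", "We fished there all day.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'the', 'river', 'bank', ..., '[SEP]', 'we', 'fished', ..., '[SEP]']
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B (segment ids)
```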
The input embedding for each token is constructed by summing three components:
- A token embedding from the WordPiece vocabulary
- A segment embedding indicating whether the token belongs to sentence A or sentence B
- A learned position embedding encoding the token's position in the sequence
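The sketch below (PyTorch) shows how the three embedding tables are summed; the toy token ids and the omission of layer normalization and dropout are simplifications for illustration.

```python
# A minimal sketch of BERT's input representation: token + segment + position embeddings.
# Dimensions follow BERT-Base; layer norm and dropout are omitted.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30_522, 512, 768
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)        # sentence A vs. sentence B
pos_emb = nn.Embedding(max_len, hidden)  # learned absolute positions

input_ids = torch.tensor([[101, 1996, 2314, 2924, 102]])  # toy ids, roughly: [CLS] the river bank [SEP]
segment_ids = torch.zeros_like(input_ids)                  # all tokens belong to sentence A here
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(embeddings.shape)  # torch.Size([1, 5, 768])
```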
Each transformer layer applies multi-head self-attention followed by a position-wise feed-forward network. In self-attention, every token computes query, key, and value vectors, and the attention weight between any two tokens is determined by the dot product of their query and key vectors. Because BERT does not use causal masking, every token can attend to every other token in both directions. This all-to-all attention pattern is what makes BERT "bidirectional" in the deepest sense, and it distinguishes BERT from autoregressive models like GPT.
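A minimal sketch of this attention pattern follows; the `causal` flag shows, by contrast, the masking a decoder-only model would apply. Dimensions and weights are toy values, and multi-head projections are collapsed into a single head for brevity.

```python
# A minimal sketch of scaled dot-product self-attention. BERT applies no causal mask,
# so every token attends to every other token; setting causal=True mimics a GPT-style decoder.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, causal=False):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    if causal:  # decoder-style: each token only sees earlier positions
        mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # BERT: no mask, all-to-all attention

x = torch.randn(5, 64)                        # 5 tokens, toy 64-dim hidden size
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)        # every position attends in both directions
```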
BERT is pre-trained on two self-supervised learning objectives using a large unlabeled text corpus. The pre-training data consisted of the BooksCorpus (800 million words) and English Wikipedia (2,500 million words), totaling approximately 3.3 billion words of text.
The primary pre-training task is masked language modeling. In each training example, 15% of the input tokens are selected for prediction. Of these selected tokens:
- 80% are replaced with the [MASK] token.
- 10% are replaced with a random token from the vocabulary.
- 10% are left unchanged.

The model must predict the original identity of each selected token based on the surrounding context. This objective forces the model to build deep bidirectional representations, because any token might need to be predicted, and the prediction depends on both the left and right context.
The 80/10/10 split was a deliberate design choice. If all selected tokens were simply replaced with [MASK], the model would never see [MASK] tokens during fine-tuning, creating a mismatch between pre-training and fine-tuning. By sometimes keeping the original token or substituting a random one, the model learns to rely on the surrounding context for all tokens, not just masked positions.
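The corruption procedure can be sketched in a few lines. The helper below is a simplified illustration, not the original implementation: special tokens, whole-word masking, and batching are ignored.

```python
# A minimal sketch of the MLM corruption rule: select ~15% of tokens, then apply
# the 80/10/10 replacement scheme described above.
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15):
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = tok                          # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_token            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

vocab = ["the", "river", "bank", "was", "muddy", "fish"]
print(mask_tokens(["the", "river", "bank", "was", "muddy"], vocab))
```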
The second pre-training task is next sentence prediction. The model receives two sentences and must classify whether the second sentence is the actual next sentence in the original document (labeled "IsNext") or a randomly sampled sentence from the corpus (labeled "NotNext"). Training pairs are constructed with a 50/50 split between true and random pairs.
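A simplified sketch of how such pairs can be constructed is shown below; the function name and sampling details are illustrative rather than the original data pipeline.

```python
# A minimal sketch of NSP pair construction: 50% true next sentence ("IsNext"),
# 50% a randomly sampled sentence from the corpus ("NotNext").
import random

def make_nsp_pair(doc_sentences, idx, corpus_sentences):
    sent_a = doc_sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(doc_sentences):
        return sent_a, doc_sentences[idx + 1], "IsNext"
    return sent_a, random.choice(corpus_sentences), "NotNext"
```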
NSP was designed to help the model understand relationships between sentence pairs, which is important for tasks like natural language inference and question answering. However, later research by Liu et al. (2019) in the RoBERTa paper found that removing NSP could actually improve performance, suggesting that the benefits of this objective were less clear than originally believed.
BERT-Base was trained on 4 Cloud TPUs (16 TPU chips) for 4 days, while BERT-Large was trained on 16 Cloud TPUs (64 TPU chips) for 4 days. The estimated training cost for BERT-Base was approximately $500 USD at the time. Both models were trained with a batch size of 256 sequences and a maximum sequence length of 512 tokens, using Adam with learning rate warmup and linear decay.
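The learning-rate schedule can be sketched as follows; the warmup length, total step count, and peak learning rate follow the values reported in the paper, while the stand-in model is purely illustrative.

```python
# A minimal PyTorch sketch of linear warmup followed by linear decay, applied to Adam.
import torch

def lr_lambda(step, warmup_steps=10_000, total_steps=1_000_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                                     # linear warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))       # linear decay

model = torch.nn.Linear(768, 768)  # stand-in for the full BERT encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Training loop: forward/backward, optimizer.step(), then scheduler.step() each step.
```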
One of BERT's most significant contributions was demonstrating that a single pre-trained model could be adapted to a wide variety of NLP tasks by adding a simple task-specific output layer and fine-tuning all parameters end-to-end. This approach required far less task-specific architecture engineering than previous methods.
For each downstream task, BERT's input is formatted slightly differently:
- For classification tasks (single sentences or sentence pairs), the sequence begins with the [CLS] token, and the final hidden state of [CLS] is fed to a classification layer.
- For sentence-pair tasks such as natural language inference and question answering, the two text segments are packed into a single sequence with a [SEP] token between them.

Fine-tuning is typically fast, requiring only 2 to 4 epochs on most tasks, and can be performed on a single GPU in a matter of hours.
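As a concrete illustration of this workflow, the sketch below (assuming the Hugging Face `transformers` package; the example texts, learning rate, and optimizer choice are illustrative) fine-tunes BERT-Base with a two-label classification head on top of the [CLS] representation.

```python
# A minimal fine-tuning sketch for sentence classification (assumes `transformers`).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # typical fine-tuning learning rate

texts, labels = ["a gripping, well-acted film", "dull and far too long"], [1, 0]
batch = tokenizer(texts, padding=True, return_tensors="pt")
labels = torch.tensor(labels)

model.train()
for _ in range(3):  # 2-4 epochs is usually enough; one toy batch stands in for a dataset
    outputs = model(**batch, labels=labels)  # classification head on the [CLS] hidden state
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```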
BERT established new state-of-the-art results on multiple benchmarks when it was released:
| Benchmark | Metric | BERT-Large result | Previous best | Human performance |
|---|---|---|---|---|
| GLUE (overall) | Average score | 80.5 | 72.8 | ~87 |
| MultiNLI | Accuracy | 86.7% | 82.1% | 92.0% |
| SQuAD v1.1 | F1 | 93.2 | 91.7 | 91.2 |
| SQuAD v2.0 | F1 | 83.1 | 78.0 | 86.8 |
| SWAG | Accuracy | 86.3% | 59.9% (ESIM) | 88.0% |
On SQuAD v1.1, BERT-Large's F1 score of 93.2 actually surpassed the estimated human performance of 91.2, marking one of the first times a machine learning model exceeded human-level accuracy on a major reading comprehension benchmark.
BERT's success inspired a large family of derived models, each addressing specific limitations or optimizing for different goals.
RoBERTa (Robustly Optimized BERT Pre-training Approach), developed by Liu et al. at Facebook AI, demonstrated that BERT was significantly undertrained and that simple changes to the training procedure could yield substantial improvements. Key modifications included:
- Training for longer, with larger batches, on roughly ten times more data
- Removing the next sentence prediction objective
- Dynamic masking, so that the masking pattern changes each time a sequence is seen rather than being fixed during preprocessing
RoBERTa achieved a GLUE score of 88.5, substantially outperforming BERT-Large's 80.5, and became a widely used baseline for subsequent encoder models.
DistilBERT (Sanh et al., 2019) from Hugging Face used knowledge distillation to compress BERT into a smaller model that retained 97% of BERT's language understanding capabilities on the GLUE benchmark while being 40% smaller (66 million parameters versus 110 million) and 60% faster at inference. The student model uses 6 transformer layers (half of BERT-Base's 12) and is trained with a combination of language modeling, distillation, and cosine distance losses. DistilBERT made BERT practical for deployment in latency-sensitive and resource-constrained environments.
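The combined objective can be sketched as follows; the temperature and the equal weighting of the three terms are illustrative rather than DistilBERT's exact settings.

```python
# A minimal sketch of a DistilBERT-style training loss: soft-target distillation
# (KL divergence on temperature-scaled logits), the usual masked-LM loss, and a
# cosine loss aligning student and teacher hidden states.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      mlm_labels, temperature=2.0):
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2          # distillation term
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)      # masked-LM term
    cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()  # alignment term
    return soft + mlm + cos  # equal weighting here is illustrative
```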
ALBERT (A Lite BERT), introduced by Lan et al. at Google Research, tackled BERT's parameter inefficiency through two techniques:
- Factorized embedding parameterization, which decouples the size of the vocabulary embeddings from the hidden size of the transformer layers
- Cross-layer parameter sharing, in which all transformer layers share a single set of weights
These techniques allowed ALBERT to use 18x fewer parameters than BERT-Large while training about 1.7x faster. ALBERT also replaced NSP with a sentence order prediction (SOP) task, which proved more effective.
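The savings from the embedding factorization are easy to see with a back-of-the-envelope calculation; the sizes below roughly follow an ALBERT-xxlarge-style configuration and are illustrative.

```python
# A minimal sketch of ALBERT's factorized embedding parameterization: project the
# vocabulary into a small embedding size E, then up to the hidden size H, instead of
# storing a full V x H embedding matrix.
V, H, E = 30_000, 4_096, 128        # vocabulary size, hidden size, embedding size

full = V * H                        # BERT-style embedding table
factorized = V * E + E * H          # ALBERT-style factorization

print(full, factorized)             # 122,880,000 vs. roughly 4.4 million parameters
```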
DeBERTa (Decoding-enhanced BERT with Disentangled Attention), developed by He et al. at Microsoft, introduced two key innovations:
- Disentangled attention, in which each token is represented by separate vectors for content and position, and attention weights are computed from disentangled matrices over content and relative position
- An enhanced mask decoder that incorporates absolute position information when predicting masked tokens
DeBERTa outperformed RoBERTa on a majority of NLU tasks, and in January 2021, DeBERTa became the first model to surpass human performance on the SuperGLUE benchmark.
Google released a multilingual version of BERT (mBERT) pre-trained on the concatenation of Wikipedia text from 104 languages. Despite having no explicit cross-lingual training signal, mBERT demonstrated surprisingly strong zero-shot cross-lingual transfer capabilities, where fine-tuning on English data alone yielded reasonable performance on other languages.
| Model | Year | Parameters | Key innovation | GLUE score (approx.) |
|---|---|---|---|---|
| BERT-Base | 2018 | 110M | Masked LM + bidirectional encoder | 79.6 |
| BERT-Large | 2018 | 340M | Scaled-up BERT | 80.5 |
| DistilBERT | 2019 | 66M | Knowledge distillation from BERT | ~77 |
| RoBERTa | 2019 | 355M | No NSP, dynamic masking, more data | 88.5 |
| ALBERT | 2020 | 12M (base) | Parameter sharing, factorized embeddings | 89.4 (xxlarge) |
| ELECTRA | 2020 | 110M (base) | Replaced-token detection | 89.4 |
| DeBERTa | 2021 | 134M (base) | Disentangled attention | 90.0+ |
In December 2024, Answer.AI and LightOn AI released ModernBERT, described as the first major update to the BERT architecture in years. ModernBERT incorporates several architectural modernizations that had emerged in the decoder-only model line (such as GPT-style models) but had not yet been applied to encoder models.
Key improvements in ModernBERT include:
- Rotary positional embeddings (RoPE) in place of learned absolute position embeddings
- GeGLU activations in the feed-forward layers
- Alternating global and local attention, together with unpadding and Flash Attention, for faster training and inference
- A native sequence length of 8,192 tokens, sixteen times the original BERT's 512
- Pre-training data that includes a large amount of code
ModernBERT-Base (149M parameters) became the first base-size encoder model to beat DeBERTaV3 on the GLUE benchmark while being 2 to 4 times faster and using roughly one-fifth the memory. ModernBERT-Large (395M parameters) achieves state-of-the-art results across retrieval, classification, and code understanding tasks.
While BERT produces contextualized token representations, obtaining a single fixed-size embedding for an entire sentence is not straightforward. Simply averaging BERT's token outputs or using the [CLS] token produces sentence embeddings that perform poorly on semantic similarity tasks.
Sentence-BERT (SBERT), introduced by Reimers and Gurevych (2019), solved this problem by fine-tuning BERT using a siamese network structure. Two sentences are passed through identical BERT models, and the resulting embeddings are trained to be close in vector space when the sentences are semantically similar. SBERT reduced the time needed to find the most similar sentence pair in a collection of 10,000 sentences from roughly 65 hours (using cross-encoding with vanilla BERT) to about 5 seconds, while maintaining comparable accuracy. SBERT and its successors became the foundation for practical semantic search, clustering, and retrieval applications.
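A minimal usage sketch, assuming the `sentence-transformers` package (the model name is illustrative), shows the embed-then-compare workflow that makes large-scale similarity search tractable.

```python
# A minimal sketch of SBERT-style sentence embeddings compared by cosine similarity
# (assumes the `sentence-transformers` package; the checkpoint name is illustrative).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The river bank was muddy.",
                           "The shore of the river was wet."], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity despite different wording
```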
BERT and GPT represent two fundamentally different approaches to building language models from the transformer architecture, and understanding the distinction is important for choosing the right model for a given task.
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Attention direction | Bidirectional (all-to-all) | Unidirectional (causal, left-to-right) |
| Pre-training objective | Masked language modeling | Next-token prediction (autoregressive) |
| Primary strength | Understanding and classification | Text generation |
| Input/output | Encodes input into representations | Generates output token by token |
| Typical tasks | Classification, NER, QA extraction, semantic similarity | Text generation, dialogue, summarization, translation |
| Context window (original) | 512 tokens | 512 tokens (GPT-1), up to 128K+ (GPT-4) |
BERT excels at tasks that require deep understanding of the entire input, such as classification, named entity recognition, and extractive question answering. Because it can attend to all tokens simultaneously, it builds richer representations of each token's meaning. GPT excels at generative tasks because its autoregressive design naturally produces fluent, coherent text one token at a time.
The success of both approaches has led some researchers to develop models that combine elements of both, such as encoder-decoder models (T5, BART) that use bidirectional encoding of the input and autoregressive generation of the output.
In October 2019, Google announced that BERT was being applied to English-language search queries in the United States, calling it "the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search." BERT helped Google better understand the nuances of search queries, particularly longer and more conversational ones where prepositions and context words significantly affect meaning. By October 2020, BERT was processing almost every English-based query. Google also expanded BERT to search queries in over 70 languages in December 2019.
BERT has been widely adopted for sentiment analysis, spam detection, topic classification, and content moderation. The [CLS] token representation provides a natural mechanism for mapping entire documents or sentences to class labels.
BERT's token-level representations make it well suited for sequence labeling tasks like NER, where each token must be classified as a person, organization, location, or other entity type. Fine-tuned BERT models significantly outperformed previous feature-engineered NER systems.
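A minimal sketch of this setup follows, assuming the Hugging Face `transformers` package; the 9-label head matches a CoNLL-2003-style BIO tag set and is randomly initialized here, so the output only illustrates the shapes and flow rather than real predictions.

```python
# A minimal sketch of token-level classification (NER-style) with BERT.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

inputs = tokenizer("Jacob Devlin works at Google", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, num_tokens, num_labels)

predicted = logits.argmax(dim=-1)     # one label id per WordPiece token
```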
BERT's architecture is particularly effective for extractive question answering, where the answer is a span within a given passage. The model learns to predict start and end positions of the answer span. BERT-based models became dominant on benchmarks such as SQuAD and Natural Questions.
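A minimal sketch, assuming the Hugging Face `transformers` package and one of the publicly released SQuAD fine-tuned BERT checkpoints (the name is used here for illustration), shows how the answer span is read off the start and end logits.

```python
# A minimal sketch of extractive QA with BERT: pick the argmax start and end positions
# and decode the tokens in between.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # illustrative checkpoint
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

inputs = tokenizer("Who created BERT?",
                   "BERT was developed by researchers at Google.",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

start = out.start_logits.argmax().item()
end = out.end_logits.argmax().item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))  # the predicted answer span
```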
Domain-specific variants like BioBERT (Lee et al., 2020) and SciBERT (Beltagy et al., 2019) were pre-trained on biomedical literature and scientific papers respectively, achieving strong results on tasks like biomedical named entity recognition, relation extraction, and question answering.
Despite its impact, BERT has several well-known limitations:
- Pre-training/fine-tuning mismatch: the [MASK] tokens used during pre-training never appear during fine-tuning, creating a distribution gap. The 80/10/10 masking strategy mitigates but does not eliminate this issue.
- Fixed input length: the original maximum sequence length of 512 tokens forces long documents to be truncated or processed with sliding windows.
- No generation: as an encoder-only model, BERT produces representations rather than text, so it cannot be used directly for open-ended generation tasks.

BERT's release in October 2018 is widely considered a watershed moment in NLP. It demonstrated that large-scale bidirectional pre-training, combined with straightforward fine-tuning, could dramatically improve performance across a diverse set of language understanding tasks. This paradigm, sometimes called the "pre-train, fine-tune" approach, became the dominant methodology in NLP and influenced the broader field of machine learning.
BERT also sparked what some researchers have called a "Cambrian explosion" of pre-trained language models. Within a year of BERT's release, dozens of variants appeared, including RoBERTa, ALBERT, DistilBERT, XLNet, ELECTRA, and many more. The rapid iteration on BERT's ideas pushed NLP benchmarks forward at an unprecedented pace. On the GLUE benchmark, the gap between BERT's score and estimated human performance was largely closed within two years of BERT's publication.
Even as large generative models like GPT-3, GPT-4, and other decoder-only architectures have come to dominate headlines, BERT and its descendants remain widely used in production systems. Encoder models are preferred for many classification, retrieval, and embedding tasks because they are smaller, faster, and cheaper to run than large generative models while delivering comparable or superior accuracy on understanding tasks. The release of ModernBERT in 2024 demonstrates that the encoder-only paradigm continues to evolve and remains relevant.