A bidirectional language model is a language model that, when computing a representation for a token, conditions on both the tokens that come before it (the left context) and the tokens that come after it (the right context). This contrasts with the more familiar autoregressive or unidirectional language models, which see only the tokens to the left when predicting the next one. Bidirectionality matters because in real text the meaning of a word often depends on what comes after it as well as what comes before. The word "bank" only resolves to "financial institution" or "side of a river" once the sentence keeps going.
Bidirectional models dominate the part of natural language processing that is about understanding text: classification, named entity recognition, question answering, retrieval, and embedding generation. They are not, on their own, well suited to free-form text generation, which is why decoder-only models (the GPT family) won the generation race even as encoder-only models (the BERT family) kept owning the embedding and classification benchmarks.
Given a sequence of tokens t_1, t_2, ..., t_n, a unidirectional (causal) language model factorizes the probability of the sequence as a product of conditionals P(t_i | t_1, ..., t_{i-1}). It only ever looks left. A bidirectional language model instead learns a representation h_i for each token that depends on the entire surrounding sequence, including tokens at positions greater than i. There are two distinct families of bidirectional models, and they are often confused.
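In symbols, before turning to those two families (standard notation, matching the definitions above):

$$P(t_1, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1}) \quad \text{(causal)}, \qquad h_i = f(t_1, \ldots, t_n) \quad \text{(bidirectional)}$$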
The distinction comes from the BERT paper itself, which introduced the now-standard terminology. A shallow bidirectional model, sometimes called a biLM, runs two separate sequence models: one left-to-right and one right-to-left. Their representations are concatenated at each position. The two streams never see each other during their internal processing. A deep bidirectional model uses a single architecture in which every layer can attend to every position in both directions at once, with no separate forward and backward streams. BERT achieves this by replacing the next-token prediction objective with a masked language model objective, which removes the need to factorize probabilities in any single direction.
| Property | Shallow biLM (ELMo, BiLSTM) | Deep bidirectional (BERT, RoBERTa) |
|---|---|---|
| Streams | Two separate, forward and backward | One, fully bidirectional |
| Backbone | LSTM or GRU | Transformer encoder |
| Pretraining objective | Two parallel next-token tasks | Masked language model (MLM) |
| Token sees other side via | Concatenation of final layer outputs | Self-attention at every layer |
| Per-token output | concat(h_forward, h_backward) | Single contextual vector |
| Token can "cheat" by seeing itself | No (factorized) | Prevented by mask token |
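As a concrete illustration of the shallow column above, here is a minimal PyTorch sketch; the dimensions are illustrative, not from any paper. PyTorch's built-in bidirectional LSTM implements exactly the two-stream-plus-concatenation recipe:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, embed_dim)
# bidirectional=True runs two independent LSTMs, one per direction,
# and concatenates their hidden states at every position.
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 12))   # (batch, seq_len)
outputs, _ = bilstm(embed(tokens))

# Each position's vector is concat(h_forward, h_backward); the two
# streams never see each other during their internal processing.
print(outputs.shape)  # torch.Size([1, 12, 512]) == (batch, seq_len, 2 * hidden_dim)
```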
The idea that a sequence model should look in both directions is older than deep learning's modern era. Schuster and Paliwal proposed bidirectional recurrent neural networks in 1997, in a paper in IEEE Transactions on Signal Processing that explicitly framed the problem: a standard RNN only has access to past inputs, but for many sequence tasks (handwriting recognition was the motivating example) the relevant context lies on both sides. Their fix was to run two RNNs, one in each direction, and combine their hidden states.
The approach sat quietly until Alex Graves and Jürgen Schmidhuber combined it with long short-term memory in 2005. Their paper "Framewise phoneme classification with bidirectional LSTM and other neural network architectures" in Neural Networks applied the BRNN trick to LSTM cells and showed clear gains on TIMIT phoneme classification. The bidirectional LSTM became a workhorse architecture for sequence labeling for the next decade. Speech recognition, part-of-speech tagging, and named entity recognition systems built on BiLSTM dominated their respective leaderboards through roughly 2017.
The decisive shift in NLP came with ELMo (Peters et al., NAACL 2018, "Deep contextualized word representations"). ELMo trained two stacked LSTM language models (one left-to-right, one right-to-left), each predicting the next token in its direction, and exposed not just the top layer but a learned weighted combination of all internal layers as the word's contextual embedding. Input tokens were fed into the LSTMs through a character-level convolutional network, which let the model handle out-of-vocabulary words. ELMo improved the state of the art across six diverse NLP tasks and was the proof that contextual word representations beat static embeddings like word2vec and GloVe.
Then came BERT (Devlin et al., 2018, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"). BERT abandoned the LSTM, abandoned the dual-stream concatenation, and used a single transformer encoder trained with a masked language model objective. The result was a model that conditioned every token's representation on the entire sentence at every layer. BERT-large pushed the GLUE benchmark score from 72.8 to 80.5 in one paper, a gain of 7.7 points that more or less ended the era of building NLP systems by stacking custom architectures on top of frozen embeddings. The pretrain-then-finetune recipe took over.
| Year | Model | Authors | Architecture | What it added |
|---|---|---|---|---|
| 1997 | BRNN | Schuster, Paliwal | Two RNNs, opposite directions | The original idea |
| 2005 | BiLSTM | Graves, Schmidhuber | Two LSTMs, opposite directions | LSTM gating, full-gradient training |
| 2018 | ELMo | Peters et al. | Stacked biLM + char-CNN | Deep contextual embeddings, layer mixing |
| 2018 | BERT | Devlin et al. | Transformer encoder + MLM + NSP | Deep bidirectionality through attention |
| 2019 | RoBERTa | Liu et al. | Same as BERT, retrained | Better hyperparameters, no NSP, 160 GB data |
| 2019 | XLNet | Yang et al. | Two-stream transformer + permutation LM | Bidirectionality via permuted factorization |
| 2019 | SpanBERT | Joshi et al. | BERT + span masking + SBO | Span-level pretraining objective |
| 2020 | ELECTRA | Clark et al. | Generator-discriminator MLM | Replaced-token detection, sample efficient |
| 2020 | DeBERTa | He et al. | Disentangled attention | Separate content and position embeddings |
The transformer encoder used by BERT and its descendants is, mechanically, just a stack of self-attention layers. The bidirectionality is not built into the math of attention itself. It is built into what attention is allowed to look at. In a standard decoder-style transformer (the GPT family), each position is allowed to attend only to positions less than or equal to itself. This is the causal mask: a matrix with negative infinity everywhere above the diagonal, added to the attention logits, which zeroes out attention to future positions after the softmax. In a BERT-style encoder, that mask is gone. Every position can attend to every other position, in both directions, at every layer.
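A minimal sketch of that one mechanical difference (PyTorch, shapes illustrative):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention logits: rows = queries, cols = keys

# Decoder (GPT-style): negative infinity above the diagonal removes
# attention to future positions once the softmax is applied.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
causal_attn = torch.softmax(scores + causal_mask, dim=-1)  # row i attends to keys <= i

# Encoder (BERT-style): no mask; every position attends everywhere.
bidirectional_attn = torch.softmax(scores, dim=-1)

print(causal_attn[0])         # only the first entry is nonzero
print(bidirectional_attn[0])  # all entries nonzero
```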
The problem is that without a causal mask, a vanilla token prediction task becomes trivial. If the position asked to predict token 5 can attend to token 5 itself, the model just copies the answer. The masked language model objective fixes this by replacing 15% of the input tokens with a special [MASK] token (or a random token, or leaving them unchanged) and asking the model to recover the original. The model now has to use only context, because the answer it is asked to predict has been deleted from its input. BERT supplements MLM with next sentence prediction, a binary classification task asking whether sentence B follows sentence A in the original text.
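A sketch of that corruption step (PyTorch; the ID values are illustrative, and the 80/10/10 split is BERT's recipe, discussed again under limitations below):

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30_000   # illustrative token IDs

def mlm_corrupt(tokens: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_input, labels) for one masked language model step."""
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob   # pick ~15% of positions
    labels[~selected] = -100          # PyTorch convention: ignored by the loss

    corrupted = tokens.clone()
    roll = torch.rand(tokens.shape)
    corrupted[selected & (roll < 0.8)] = MASK_ID      # 80%: replace with [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: replace with random token
    corrupted[swap] = torch.randint(0, VOCAB_SIZE, tokens.shape)[swap]
    return corrupted, labels          # remaining 10%: unchanged but still predicted
```

The model is then trained with cross-entropy only at the selected positions; everywhere else the label of -100 tells the loss to skip the token.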
BERT-base has 12 transformer layers, 768 hidden dimensions, 12 attention heads, and 110 million parameters. BERT-large has 24 layers, 1024 hidden dimensions, 16 heads, and 340 million parameters. Both versions tokenize input with WordPiece using a 30,000-token vocabulary and were pretrained on roughly 3.3 billion words from BooksCorpus and English Wikipedia.
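Those numbers can be read off the published checkpoints; assuming the Hugging Face transformers library is available, something like:

```python
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size,
          cfg.num_attention_heads, cfg.vocab_size)
# bert-base-uncased  12  768  12 30522
# bert-large-uncased 24 1024 16 30522
```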
A surprising amount of progress in bidirectional pretraining since BERT has come from changing not the architecture but the objective. The most influential variants:
| Model | Objective | What changes from MLM |
|---|---|---|
| BERT | MLM + NSP | The original recipe |
| RoBERTa | MLM only, dynamic masking | Drops NSP; remasks each epoch; trains 10x longer on 10x more data |
| SpanBERT | Span MLM + Span Boundary Objective | Masks contiguous spans, predicts span content from boundary tokens |
| T5 | Span corruption (denoising) | Masks contiguous spans; predicts them as a sequence (encoder-decoder) |
| BART | Token deletion, span masking, shuffling, etc. | Denoises arbitrary corruption with an encoder-decoder |
| XLNet | Permutation LM | Autoregressive but over a randomly permuted token order |
| ELECTRA | Replaced-token detection | A small generator replaces tokens, the discriminator predicts which are fake |
ELECTRA deserves special attention because it broke a quiet assumption. MLM only learns from the 15% of tokens that get masked. The other 85% provide context but receive no gradient signal from the prediction loss. ELECTRA replaces some tokens with plausible alternatives sampled from a small generator network, then trains the main model as a binary discriminator over every position: was this token in the original sentence, or was it swapped? Because the loss applies to all tokens, ELECTRA gets dramatically more bang per training step. A small ELECTRA model trained on a single GPU for four days can outperform GPT trained on roughly 30 times the compute on the GLUE benchmark.
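Schematically, the replaced-token-detection loss looks like this. This is a sketch of the training signal under assumed module interfaces, not the full ELECTRA recipe; `discriminator` is a hypothetical network returning one logit per position:

```python
import torch
import torch.nn.functional as F

def rtd_loss(original_ids, generator_logits, masked_positions, discriminator):
    # 1. The small generator proposes plausible replacements at the masked
    #    positions (generator_logits: one vocab distribution per masked slot).
    sampled = torch.distributions.Categorical(logits=generator_logits).sample()
    corrupted = original_ids.clone()
    corrupted[masked_positions] = sampled

    # 2. The main model scores EVERY position: original or replaced?
    logits = discriminator(corrupted)                 # (batch, seq_len)
    # Positions where the generator happened to sample the original token
    # count as "original", which (corrupted != original_ids) handles for free.
    is_replaced = (corrupted != original_ids).float()

    # Unlike MLM, this binary loss covers all tokens, not just the ~15% masked.
    return F.binary_cross_entropy_with_logits(logits, is_replaced)
```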
For classification, tagging, and span-extraction tasks the asymmetry between past and future context is artificial. When you label "Apple" as ORG or FRUIT in named entity recognition, you naturally use both "the" before it and "announced new earnings" after it. A model that only sees the left context throws away half the available signal. BERT's gains over GPT on the GLUE benchmark, SQuAD, and similar question answering tasks are largely attributable to this.
For generation, the asymmetry is no longer artificial. When a model is generating text one token at a time, the right context literally does not exist yet. A bidirectional model trained with MLM cannot be unrolled into a generator without significant surgery, because its predictions assume access to surrounding context that, in a generation setting, has not been produced. This is why GPT and other decoder-only autoregressive models took over generative use cases, and why encoder-decoder architectures like T5 and BART (bidirectional encoder, causal decoder) became the natural design for tasks that involve both understanding an input and producing an output, such as machine translation, summarization, and rewriting.
BERT essentially closed out the original GLUE benchmark. Within months of BERT's release, the leaderboard saw RoBERTa, XLNet, ALBERT, and ELECTRA trade places at the top. By mid-2019, performance on most GLUE tasks was approaching the human baseline; the field had to introduce SuperGLUE because the original benchmark was saturated.
| Benchmark | Pre-BERT SOTA (~mid 2018) | BERT-large (Oct 2018) | RoBERTa (Jul 2019) | Human |
|---|---|---|---|---|
| GLUE (avg) | ~72.8 | 80.5 | 88.5 | 87.1 |
| SQuAD v1.1 (F1) | 91.7 | 93.2 | 94.6 | 91.2 |
| MultiNLI (acc) | 82.1 | 86.7 | 90.2 | 92.0 |
Numbers are taken from the original BERT paper and the RoBERTa paper. The pattern was the same on more or less every English understanding task tried in this period: bidirectional pretraining, possibly with a tweaked objective, possibly trained longer, set new state of the art.
In production, BERT-style models still underpin most modern systems for sentence and passage embeddings, semantic search, retrieval-augmented generation, classification, sentiment analysis, and ranking. Sentence-BERT, multilingual BERT, BGE, E5, and other encoder-only models continue to dominate the MTEB embedding leaderboard. In retrieval-augmented setups, the decoder-only generative models that get most of the press attention typically call out to a bidirectional encoder under the hood to fetch relevant passages.
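A typical embedding call in such a system, assuming the sentence-transformers package (the model name is one public encoder-only checkpoint among many):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["The river bank was muddy.", "The bank raised interest rates."]
embeddings = model.encode(docs, convert_to_tensor=True)  # one vector per sentence

# The encoder reads each sentence bidirectionally, so each occurrence of
# "bank" is embedded using its full surrounding context.
print(embeddings.shape)                           # e.g. (2, 384) for this model
print(util.cos_sim(embeddings[0], embeddings[1]))
```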
Deep bidirectional models have real costs. The MLM objective creates a pretrain-finetune mismatch: the [MASK] token appears during pretraining but never at inference time, which slightly biases what the model learns. BERT's mitigation (replacing only 80% of selected tokens with [MASK], 10% with random tokens, 10% unchanged) reduces but does not eliminate the gap. XLNet's permutation language model and ELECTRA's replaced-token detection were both partly motivated by removing this mismatch.
Bidirectional models also assume a fixed maximum sequence length, typically 512 tokens for BERT, after which positional embeddings run out. Long-document variants like Longformer and BigBird patch this with sparse attention patterns but inherit the underlying constraint. Causal models with rotary or relative position encodings have generally adapted to long contexts more gracefully.
The biggest limitation is the one already noted: bidirectional models are not generators. A BERT model can fill in a single masked word, but using it to write a paragraph requires hacks (iteratively unmasking left to right, or specialized variants like BERT-Gen) that consistently underperform a properly trained autoregressive model. This is structural. The training objective never asked the model to commit to a left-to-right factorization, so it never learned to produce one.
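The left-to-right unmasking hack can be sketched with the Hugging Face fill-mask pipeline (the checkpoint name is one public example). It runs, but the output quality illustrates the structural point:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

text = "the cat [MASK] [MASK] [MASK] mat."
while "[MASK]" in text:
    preds = fill(text)
    # With several masks present, the pipeline returns one list per mask;
    # commit to the top prediction for the leftmost mask and repeat.
    first = preds[0] if isinstance(preds[0], list) else preds
    text = text.replace("[MASK]", first[0]["token_str"], 1)

print(text)  # fluent-ish, but each choice was made with no joint plan
```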
Imagine you are reading a story and trying to understand what every word means. A regular language model is like reading with a piece of cardboard covering everything you have not gotten to yet: you try to guess what each word means using only what you have already read. That works for some words. For others ("she pulled the bat out of her bag and stepped up to the plate") you really need to peek ahead to know if it is a baseball bat or a small flying mammal.
A bidirectional language model is what happens when you let yourself read the whole sentence first, in both directions, before deciding what each word means. That is great for understanding. It is terrible for finishing the story, because to write the next sentence you can't peek at words you haven't written yet.