A bidirectional language model is a language model that, when computing a representation for a token, conditions on both the tokens that come before it (the left context) and the tokens that come after it (the right context). This contrasts with the more familiar autoregressive or unidirectional language models, which see only the tokens to the left when predicting the next one. Bidirectionality matters because in real text the meaning of a word often depends on what comes after it as well as what comes before. The word "bank" only resolves to "financial institution" or "side of a river" once the sentence keeps going.
Bidirectional models dominate the part of natural language processing that is about understanding text: classification, named entity recognition, question answering, retrieval, and embedding generation. They are not, on their own, well suited to free-form text generation, which is why decoder-only models (the GPT family) won the generation race even as encoder-only models (the BERT family) kept owning the embedding and classification benchmarks.
Given a sequence of tokens t_1, t_2, ..., t_n, a unidirectional (causal) language model factorizes the probability of the sequence as a product of conditionals P(t_i | t_1, ..., t_{i-1}). It only ever looks left. A bidirectional language model instead learns a representation h_i for each token that depends on the entire surrounding sequence, including tokens at positions greater than i. There are two distinct families of bidirectional models, and they are often confused.
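In symbols, before turning to those two families (standard notation, matching the definitions above):

$$P(t_1, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1}) \quad \text{(causal)}, \qquad h_i = f(t_1, \ldots, t_n) \quad \text{(bidirectional)}$$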
The distinction comes from the BERT paper itself, which introduced the now-standard terminology. A shallow bidirectional model, sometimes called a biLM, runs two separate sequence models: one left-to-right and one right-to-left. Their representations are concatenated at each position. The two streams never see each other during their internal processing. A deep bidirectional model uses a single architecture in which every layer can attend to every position in both directions at once, with no separate forward and backward streams. BERT achieves this by replacing the next-token prediction objective with a masked language model objective, which removes the need to factorize probabilities in any single direction.
| Property | Shallow biLM (ELMo, BiLSTM) | Deep bidirectional (BERT, RoBERTa) |
|---|---|---|
| Streams | Two separate, forward and backward | One, fully bidirectional |
| Backbone | LSTM or GRU | Transformer encoder |
| Pretraining objective | Two parallel next-token tasks | Masked language model (MLM) |
| Token sees other side via | Concatenation of final layer outputs | Self-attention at every layer |
| Per-token output | concat(h_forward, h_backward) | Single contextual vector |
| Token can "cheat" by seeing itself | No (factorized) | Prevented by mask token |
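As a concrete illustration of the shallow column above, here is a minimal PyTorch sketch; the dimensions are illustrative, not from any paper. PyTorch's built-in bidirectional LSTM implements exactly the two-stream-plus-concatenation recipe:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 128, 256

embed = nn.Embedding(vocab_size, embed_dim)
# bidirectional=True runs two independent LSTMs, one per direction,
# and concatenates their hidden states at every position.
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 12))   # (batch, seq_len)
outputs, _ = bilstm(embed(tokens))

# Each position's vector is concat(h_forward, h_backward); the two
# streams never see each other during their internal processing.
print(outputs.shape)  # torch.Size([1, 12, 512]) == (batch, seq_len, 2 * hidden_dim)
```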
The idea that a sequence model should look in both directions is older than deep learning's modern era. Schuster and Paliwal proposed bidirectional recurrent neural networks in 1997, in a paper in IEEE Transactions on Signal Processing that explicitly framed the problem: a standard RNN only has access to past inputs, but for many sequence tasks (handwriting recognition was the motivating example) the relevant context lies on both sides. Their fix was to run two RNNs, one in each direction, and combine their hidden states.
The approach sat quietly until Alex Graves and Jürgen Schmidhuber combined it with long short-term memory in 2005. Their paper "Framewise phoneme classification with bidirectional LSTM and other neural network architectures" in Neural Networks applied the BRNN trick to LSTM cells and showed clear gains on TIMIT phoneme classification. The bidirectional LSTM became a workhorse architecture for sequence labeling for the next decade. Speech recognition, part-of-speech tagging, and named entity recognition systems built on BiLSTM dominated their respective leaderboards through roughly 2017.
The decisive shift in NLP came with ELMo (Peters et al., NAACL 2018, "Deep contextualized word representations"). ELMo trained two stacked LSTM language models (one left-to-right, one right-to-left), each predicting the next token in its direction, and exposed not just the top layer but a learned weighted combination of all internal layers as the word's contextual embedding. Input tokens were fed into the LSTMs through a character-level convolutional network, which let the model handle out-of-vocabulary words. ELMo improved the state of the art across six diverse NLP tasks and was the proof that contextual word representations beat static embeddings like word2vec and GloVe.
Then came BERT (Devlin et al., 2018, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"). BERT abandoned the LSTM, abandoned the dual-stream concatenation, and used a single transformer encoder trained with a masked language model objective. The result was a model that conditioned every token's representation on the entire sentence at every layer. BERT-large pushed the GLUE benchmark score from 72.8 to 80.5 in one paper, a gain of 7.7 points that more or less ended the era of building NLP systems by stacking custom architectures on top of frozen embeddings. The pretrain-then-finetune recipe took over.
| Year | Model | Authors | Architecture | What it added |
|---|---|---|---|---|
| 1997 | BRNN | Schuster, Paliwal | Two RNNs, opposite directions | The original idea |
| 2005 | BiLSTM | Graves, Schmidhuber | Two LSTMs, opposite directions | LSTM gating, full-gradient training |
| 2018 | ELMo | Peters et al. | Stacked biLM + char-CNN | Deep contextual embeddings, layer mixing |
| 2018 | BERT | Devlin et al. | Transformer encoder + MLM + NSP | Deep bidirectionality through attention |
| 2019 | RoBERTa | Liu et al. | Same as BERT, retrained | Better hyperparameters, no NSP, 160 GB data |
| 2019 | XLNet | Yang et al. | Two-stream transformer + permutation LM | Bidirectionality via permuted factorization |
| 2019 | SpanBERT | Joshi et al. | BERT + span masking + SBO | Span-level pretraining objective |
| 2020 | ELECTRA | Clark et al. | Generator-discriminator MLM | Replaced-token detection, sample efficient |
| 2020 | DeBERTa | He et al. | Disentangled attention | Separate content and position embeddings |
The transformer encoder used by BERT and its descendants is, mechanically, just a stack of self-attention layers. The bidirectionality is not built into the math of attention itself. It is built into what attention is allowed to look at. In a standard decoder-style transformer (the GPT family), each position is allowed to attend only to positions less than or equal to itself. This is the causal mask: a matrix with negative infinity everywhere above the diagonal, added to the attention logits, which zeroes out attention to future positions after the softmax. In a BERT-style encoder, that mask is gone. Every position can attend to every other position, in both directions, at every layer.
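A minimal sketch of that one mechanical difference (PyTorch, shapes illustrative):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention logits: rows = queries, cols = keys

# Decoder (GPT-style): negative infinity above the diagonal removes
# attention to future positions once the softmax is applied.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
causal_attn = torch.softmax(scores + causal_mask, dim=-1)  # row i attends to keys <= i

# Encoder (BERT-style): no mask; every position attends everywhere.
bidirectional_attn = torch.softmax(scores, dim=-1)

print(causal_attn[0])         # only the first entry is nonzero
print(bidirectional_attn[0])  # all entries nonzero
```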
The problem is that without a causal mask, a vanilla token prediction task becomes trivial. If the position asked to predict token 5 can attend to token 5 itself, the model just copies the answer. The masked language model objective fixes this by replacing 15% of the input tokens with a special [MASK] token (or a random token, or leaving them unchanged) and asking the model to recover the original. The model now has to use only context, because the answer it is asked to predict has been deleted from its input. BERT supplements MLM with next sentence prediction, a binary classification task asking whether sentence B follows sentence A in the original text.
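A sketch of that corruption step (PyTorch; the ID values are illustrative, and the 80/10/10 split is BERT's recipe, discussed again under limitations below):

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30_000   # illustrative token IDs

def mlm_corrupt(tokens: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_input, labels) for one masked language model step."""
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_prob   # pick ~15% of positions
    labels[~selected] = -100          # PyTorch convention: ignored by the loss

    corrupted = tokens.clone()
    roll = torch.rand(tokens.shape)
    corrupted[selected & (roll < 0.8)] = MASK_ID      # 80%: replace with [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: replace with random token
    corrupted[swap] = torch.randint(0, VOCAB_SIZE, tokens.shape)[swap]
    return corrupted, labels          # remaining 10%: unchanged but still predicted
```

The model is then trained with cross-entropy only at the selected positions; everywhere else the label of -100 tells the loss to skip the token.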
BERT-base has 12 transformer layers, 768 hidden dimensions, 12 attention heads, and 110 million parameters. BERT-large has 24 layers, 1024 hidden dimensions, 16 heads, and 340 million parameters. Both versions tokenize input with WordPiece using a 30,000-token vocabulary and were pretrained on roughly 3.3 billion words from BooksCorpus and English Wikipedia.
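Those numbers can be read off the published checkpoints; assuming the Hugging Face transformers library is available, something like:

```python
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size,
          cfg.num_attention_heads, cfg.vocab_size)
# bert-base-uncased  12  768  12 30522
# bert-large-uncased 24 1024 16 30522
```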
A surprising amount of progress in bidirectional pretraining since BERT has come from changing not the architecture but the objective. The most influential variants:
| Model | Objective | What changes from MLM |
|---|---|---|
| BERT | MLM + NSP | The original recipe |
| RoBERTa | MLM only, dynamic masking | Drops NSP; remasks each epoch; trains 10x longer on 10x more data |
| SpanBERT | Span MLM + Span Boundary Objective | Masks contiguous spans, predicts span content from boundary tokens |
| T5 | Span corruption (denoising) | Masks contiguous spans; predicts them as a sequence (encoder-decoder) |
| BART | Token deletion, span masking, shuffling, etc. | Denoises arbitrary corruption with an encoder-decoder |
| XLNet | Permutation LM | Autoregressive but over a randomly permuted token order |
| ELECTRA | Replaced-token detection | A small generator replaces tokens, the discriminator predicts which are fake |
ELECTRA deserves special attention because it broke a quiet assumption. MLM only learns from the 15% of tokens that get masked. The other 85% provide context but receive no gradient signal from the prediction loss. ELECTRA replaces some tokens with plausible alternatives sampled from a small generator network, then trains the main model as a binary discriminator over every position: was this token in the original sentence, or was it swapped? Because the loss applies to all tokens, ELECTRA gets dramatically more bang per training step. A small ELECTRA model trained on a single GPU for four days can outperform GPT trained on roughly 30 times the compute on the GLUE benchmark.
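Schematically, the replaced-token-detection loss looks like this. This is a sketch of the training signal under assumed module interfaces, not the full ELECTRA recipe; `discriminator` is a hypothetical network returning one logit per position:

```python
import torch
import torch.nn.functional as F

def rtd_loss(original_ids, generator_logits, masked_positions, discriminator):
    # 1. The small generator proposes plausible replacements at the masked
    #    positions (generator_logits: one vocab distribution per masked slot).
    sampled = torch.distributions.Categorical(logits=generator_logits).sample()
    corrupted = original_ids.clone()
    corrupted[masked_positions] = sampled

    # 2. The main model scores EVERY position: original or replaced?
    logits = discriminator(corrupted)                 # (batch, seq_len)
    # Positions where the generator happened to sample the original token
    # count as "original", which (corrupted != original_ids) handles for free.
    is_replaced = (corrupted != original_ids).float()

    # Unlike MLM, this binary loss covers all tokens, not just the ~15% masked.
    return F.binary_cross_entropy_with_logits(logits, is_replaced)
```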
For classification, tagging, and span-extraction tasks the asymmetry between past and future context is artificial. When you label "Apple" as ORG or FRUIT in named entity recognition, you naturally use both "the" before it and "announced new earnings" after it. A model that only sees the left context throws away half the available signal. BERT's gains over GPT on the GLUE benchmark, SQuAD, and similar question answering tasks are largely attributable to this.
For generation, the asymmetry is no longer artificial. When a model is generating text one token at a time, the right context literally does not exist yet. A bidirectional model trained with MLM cannot be unrolled into a generator without significant surgery, because its predictions assume access to surrounding context that, in a generation setting, has not been produced. This is why GPT and other decoder-only autoregressive models took over generative use cases, and why encoder-decoder architectures like T5 and BART (bidirectional encoder, causal decoder) became the natural design for tasks that involve both understanding an input and producing an output, such as machine translation, summarization, and rewriting.
BERT essentially closed out the original GLUE benchmark. Within months of BERT's release, the leaderboard saw RoBERTa, XLNet, ALBERT, and ELECTRA trade places at the top. By mid-2019, performance on most GLUE tasks was approaching the human baseline; the field had to introduce SuperGLUE because the original benchmark was saturated.
| Benchmark | Pre-BERT SOTA (~mid 2018) | BERT-large (Oct 2018) | RoBERTa (Jul 2019) | Human |
|---|---|---|---|---|
| GLUE (avg) | ~72.8 | 80.5 | 88.5 | 87.1 |
| SQuAD v1.1 (F1) | 91.7 | 93.2 | 94.6 | 91.2 |
| MultiNLI (acc) | 82.1 | 86.7 | 90.2 | 92.0 |
Numbers are taken from the original BERT paper and the RoBERTa paper. The pattern was the same on more or less every English understanding task tried in this period: bidirectional pretraining, possibly with a tweaked objective, possibly trained longer, set new state of the art.
In production, BERT-style models still underpin most modern systems for sentence and passage embeddings, semantic search, retrieval-augmented generation, classification, sentiment analysis, and ranking. Sentence-BERT, multilingual BERT, BGE, E5, and other encoder-only models continue to dominate the MTEB embedding leaderboard. In retrieval-augmented setups, the decoder-only generative models that get most of the press attention typically call out to a bidirectional encoder under the hood to fetch relevant passages.
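A typical embedding call in such a system, assuming the sentence-transformers package (the model name is one public encoder-only checkpoint among many):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["The river bank was muddy.", "The bank raised interest rates."]
embeddings = model.encode(docs, convert_to_tensor=True)  # one vector per sentence

# The encoder reads each sentence bidirectionally, so each occurrence of
# "bank" is embedded using its full surrounding context.
print(embeddings.shape)                           # e.g. (2, 384) for this model
print(util.cos_sim(embeddings[0], embeddings[1]))
```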
Deep bidirectional models have real costs. The MLM objective creates a pretrain-finetune mismatch: the [MASK] token appears during pretraining but never at inference time, which slightly biases what the model learns. BERT's mitigation (replacing only 80% of selected tokens with [MASK], 10% with random tokens, 10% unchanged) reduces but does not eliminate the gap. XLNet's permutation language model and ELECTRA's replaced-token detection were both partly motivated by removing this mismatch.
Bidirectional models also assume a fixed maximum sequence length, typically 512 tokens for BERT, after which positional embeddings run out. Long-document variants like Longformer and BigBird patch this with sparse attention patterns but inherit the underlying constraint. Causal models with rotary or relative position encodings have generally adapted to long contexts more gracefully.
The biggest limitation is the one already noted: bidirectional models are not generators. A BERT model can fill in a single masked word, but using it to write a paragraph requires hacks (iteratively unmasking left to right, or specialized variants like BERT-Gen) that consistently underperform a properly trained autoregressive model. This is structural. The training objective never asked the model to commit to a left-to-right factorization, so it never learned to produce one.
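The left-to-right unmasking hack can be sketched with the Hugging Face fill-mask pipeline (the checkpoint name is one public example). It runs, but the output quality illustrates the structural point:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

text = "the cat [MASK] [MASK] [MASK] mat."
while "[MASK]" in text:
    preds = fill(text)
    # With several masks present, the pipeline returns one list per mask;
    # commit to the top prediction for the leftmost mask and repeat.
    first = preds[0] if isinstance(preds[0], list) else preds
    text = text.replace("[MASK]", first[0]["token_str"], 1)

print(text)  # fluent-ish, but each choice was made with no joint plan
```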
Imagine you are reading a story and trying to understand what every word means. A regular language model is like reading with a piece of cardboard covering everything you have not gotten to yet: you try to guess what each word means using only what you have already read. That works for some words. For others ("she pulled the bat out of her bag and stepped up to the plate") you really need to peek ahead to know if it is a baseball bat or a small flying mammal.
A bidirectional language model is what happens when you let yourself read the whole sentence first, in both directions, before deciding what each word means. That is great for understanding. It is terrible for finishing the story, because to write the next sentence you can't peek at words you haven't written yet.