See also: Machine learning terms
Bidirectional is a property of a sequence model in which each output position depends on the entire input sequence, not just the tokens that came before it. A bidirectional model processes its input in both forward (left-to-right) and backward (right-to-left) directions, so the representation at any position is informed by both past and future context. The idea predates deep learning and shows up across several architectures, including recurrent neural networks, long short-term memory networks, gated recurrent units, and modern transformer encoders.
This article covers the general concept of bidirectionality across neural architectures and tasks. For the language modeling specialization, where the same idea drives BERT-style pretraining, see bidirectional language model. For the contrast, see unidirectional language model.
A standard sequence model reads its input one position at a time and updates an internal state. In a forward-only (causal) model, the state at position t summarizes everything from position 1 through position t. The state never sees what comes later. That works for prediction tasks where the future genuinely is unknown, like writing the next token of a story, but it throws away information when the future is sitting right there in the input.
The motivating example used in almost every textbook is a homograph: "He went to the bank to deposit his check" versus "He sat by the bank of the river." A forward-only model trying to label "bank" has no access to "deposit" or "river," because those words have not been read yet. A bidirectional model has access to both sides, so it can disambiguate without any extra machinery. The same logic applies to phoneme classification (the next phoneme constrains the current one), named entity recognition ("Apple announced" pulls toward ORG, "Apple pie" toward FOOD), and many other sequence labeling tasks.
There are two technically different ways to make a model bidirectional, and the literature sometimes uses the same word for both. They are worth keeping straight.
The first is sequential bidirectionality, where two separate sequence models scan the input in opposite directions and their outputs are combined. This is the original Schuster and Paliwal recipe from 1997 and applies cleanly to RNNs, LSTMs, and GRUs. The forward and backward streams never interact during their internal computation. They only meet at the end, usually through concatenation.
The second is bidirectional attention, where a single network looks at the whole sequence at once through self-attention. There are no separate streams. Each position simply attends to every other position, in both directions, at every layer. This is how BERT and other transformer encoders work. The bidirectionality is not built into the math of attention; it is built into what the attention mask allows. Removing the causal mask that decoder-style transformers use turns the same architecture into a bidirectional one.
| Property | Sequential bidirectional (BiRNN family) | Bidirectional attention (BERT family) |
|---|---|---|
| Streams | Two networks, opposite directions | One network, full self-attention |
| Backbone | RNN, LSTM, or GRU | Transformer encoder |
| Combination | Concatenation of forward and backward states | Single contextual vector per position |
| Layers see other side via | Final output only | Every layer |
| Compute cost | O(n) per layer, twice | O(n^2) per layer |
| Typical use | Sequence labeling, speech, time series | Pretraining, embeddings, classification |
| Origin paper | Schuster & Paliwal, 1997 | Devlin et al., 2018 |
The practical difference matters. A BiLSTM only mixes forward and backward information once, after the recurrent passes are done. A transformer encoder mixes them at every layer, which is why people sometimes call the transformer version "deep" bidirectionality and the BiRNN version "shallow."
The original idea is the bidirectional recurrent neural network, introduced by Mike Schuster and Kuldip Paliwal in 1997 in a paper in IEEE Transactions on Signal Processing. They were working on speech and handwriting tasks where, as they put it, the relevant context for any given frame lies on both sides of it. Their fix was structural rather than algorithmic: train one RNN that reads the input forward, train a second RNN with separate parameters that reads it backward, and at every position concatenate the two hidden states.
Formally, given an input sequence x_1 through x_n, a forward RNN computes hidden states h^f_t = f(h^f_{t-1}, x_t) for t = 1 to n, and a backward RNN computes h^b_t = f(h^b_{t+1}, x_t) for t = n down to 1. The combined representation at position t is the concatenation [h^f_t ; h^b_t]. Because the two streams are independent, they can be trained with the same backpropagation through time algorithm used for vanilla RNNs, just unrolled in both directions.
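A minimal NumPy sketch of that computation, with illustrative parameter names (Wf, Uf, bf for the forward stream and Wb, Ub, bb for the backward one) rather than any published implementation:

```python
import numpy as np

def birnn(xs, Wf, Uf, bf, Wb, Ub, bb):
    """Bidirectional vanilla RNN over xs of shape (n, d).
    The forward parameters (Wf, Uf, bf) and backward parameters (Wb, Ub, bb)
    are completely separate, as in the original recipe."""
    n = xs.shape[0]
    hidden = bf.shape[0]
    h_f = np.zeros((n, hidden))
    h_b = np.zeros((n, hidden))

    prev = np.zeros(hidden)
    for t in range(n):                 # forward stream, t = 1 .. n
        prev = np.tanh(Wf @ xs[t] + Uf @ prev + bf)
        h_f[t] = prev

    prev = np.zeros(hidden)
    for t in reversed(range(n)):       # backward stream, t = n .. 1
        prev = np.tanh(Wb @ xs[t] + Ub @ prev + bb)
        h_b[t] = prev

    # combined representation [h^f_t ; h^b_t] at every position
    return np.concatenate([h_f, h_b], axis=1)
```

The two loops never exchange information; only the final concatenation brings the directions together, which is exactly the "shallow" mixing discussed below.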
The approach sat relatively quietly for almost a decade because plain RNNs were hard to train on long sequences. The breakthrough came when Alex Graves and Jürgen Schmidhuber combined it with LSTM cells in 2005. Their paper "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," published in Neural Networks, applied the BRNN trick to LSTM and showed that the resulting bidirectional LSTM beat both unidirectional LSTM and standard RNNs on TIMIT phoneme classification by a clear margin. They also introduced a full-gradient training algorithm for LSTM that fixed a subtle bug in earlier implementations. The combination of LSTM gating and bidirectional context made BiLSTM the dominant architecture for sequence labeling for roughly the next decade.
The same wrapper trick works for other recurrent cells. A bidirectional GRU is a GRU running forward and a separate one running backward, with concatenated outputs. A bidirectional vanilla RNN works the same way, although the vanishing gradient problem makes it less useful in practice. The distinction is the cell, not the bidirectionality.
| Variant | Underlying cell | First major use | Typical task |
|---|---|---|---|
| BiRNN | Vanilla RNN | Schuster & Paliwal 1997 | Speech frame classification |
| BiLSTM | LSTM | Graves & Schmidhuber 2005 | Phoneme classification, NER, POS tagging |
| BiGRU | GRU | Bahdanau et al. 2014 (attention NMT encoder) | Machine translation encoders, time series |
| BiQRNN | Quasi-RNN | Bradbury et al. 2017 | Faster language modeling |
| BiSRU | Simple Recurrent Unit | Lei et al. 2018 | Throughput-sensitive sequence tasks |
The transformer architecture, introduced in 2017, originally split into an encoder that read the input bidirectionally and a decoder that produced output causally. The encoder used the standard self-attention mechanism without any mask, so every position could see every other position. This was bidirectional in exactly the sense above, but the encoder had not yet been used as a standalone pretrained model.
BERT, released by Google in 2018, took the transformer encoder, threw away the decoder, and turned the encoder into a general-purpose pretraining model. The key trick was the masked language model objective: randomly select 15 percent of the input tokens, hide them (mostly behind a [MASK] token), and ask the model to recover the originals. Because the answer has been removed from the input, the model is forced to use surrounding context (in both directions) to make its prediction. This sidesteps the obvious problem that a vanilla next-token objective is trivial when the model can see the next token.
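A quick way to see this objective at work is the fill-mask task. The sketch below assumes the Hugging Face transformers library is installed and the bert-base-uncased checkpoint can be downloaded; the exact candidate words and scores will vary.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# The right-hand context ("deposit", "check") is only visible to a model
# that reads the whole sentence, which is exactly what the encoder does.
for candidate in fill("He went to the [MASK] to deposit his check."):
    print(candidate["token_str"], round(candidate["score"], 3))
```

A word like "bank" should rank highly precisely because the model conditions on the words after the mask as well as before it.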
BERT and its descendants (RoBERTa, ALBERT, ELECTRA, DeBERTa) all share the same structural property: a transformer encoder with no causal mask, trained on some variant of MLM. The result is what the BERT paper called "deep bidirectional" representations, contrasted with the "shallow" bidirectionality of older biLM approaches like ELMo that just concatenate a forward and a backward LM. For the full story of how this lineage played out in language modeling specifically, see bidirectional language model.
Note that not every transformer is bidirectional. The GPT family uses the same transformer building blocks but applies a causal mask to the attention layer. That makes it unidirectional. The encoder-decoder family (T5, BART) puts a bidirectional encoder and a unidirectional decoder in the same model: the encoder reads the input both ways, and the decoder generates the output one token at a time while attending to the encoder's bidirectional output.
The reason bidirectional models work so well on understanding tasks is that the asymmetry between past and future is artificial when the entire input is already in front of the model. A model labeling parts of speech in a fixed sentence has no reason to ignore the right context. A model deciding whether two sentences are paraphrases has both sentences available from the start. A model retrieving a relevant passage for a query knows the entire query before it starts.
In each of these cases, half the available signal is downstream of the position the model is currently processing. A unidirectional model has to either store everything in a single forward state and hope the relevant bits survive, or run two passes and stitch them together (which is, again, just being bidirectional). A bidirectional model gets both sides natively.
The empirical gains are easiest to see in sequence labeling. On the CoNLL-2003 English NER dataset, a BiLSTM with character embeddings reached 90.94 F1 in 2016 (Lample et al.), versus roughly 88-89 for the best unidirectional models of the same era. On the same benchmark, a BERT-large model fine-tuned in 2018 reached 92.8 F1, again pulling away from unidirectional baselines. Similar margins held on POS tagging, semantic role labeling, and question answering.
Bidirectionality is a poor fit when the model is supposed to produce text one token at a time, because the future does not exist yet. A bidirectional model's representation of position t was trained assuming access to positions t+1 through n. At inference time those positions have not been generated. There is no clean way to ask such a model for the next token.
This is why BERT and its descendants, despite their dominance on classification benchmarks, never replaced GPT for free-form generation. People have tried (BERT-Gen, masked iterative refinement, conditional MLM) but the results consistently lag a properly trained autoregressive model. The architecture and the objective enforce a structural mismatch with the generation setting that no amount of clever inference fixes.
The encoder-decoder split in T5, BART, and similar models is the standard workaround. The encoder is bidirectional and reads the full input, the decoder is unidirectional and writes the output left to right while attending to the encoder. This gives you the understanding side of bidirectionality without giving up the ability to generate.
Bidirectional models, in one form or another, dominate any task where the input is fully observed and the output is some kind of label, span, or structured prediction over that input.
| Task | Why bidirectional helps | Typical architecture |
|---|---|---|
| Named entity recognition | Entity boundaries depend on surrounding words on both sides | BiLSTM-CRF, BERT |
| Part-of-speech tagging | Word category depends on local context windows | BiLSTM, BERT |
| Semantic role labeling | Argument identification needs the whole predicate-argument structure | BiLSTM, BERT |
| Speech recognition | Phoneme identity depends on surrounding acoustic context | BiLSTM-CTC, Conformer encoder |
| Machine translation (encoder) | Source words need full source-side context for translation | BiGRU encoder, transformer encoder |
| Text classification | Global sentence meaning summarized into a fixed vector | BERT [CLS] token, BiLSTM with pooling |
| Question answering (extractive) | Answer span boundaries depend on both query and surrounding passage | BERT, RoBERTa |
| Sentence embedding | Single vector per sentence, used for retrieval | Sentence-BERT, BGE, E5 |
| Time series anomaly detection | Anomaly score depends on both past and future trend | BiLSTM autoencoder |
| Protein secondary structure prediction | Local fold depends on residues on both sides | BiLSTM, transformer encoder |
| Handwriting recognition | Character identity depends on adjacent strokes in both directions | BiLSTM-CTC |
In some of these cases the bidirectional model is the entire system. In others (machine translation, retrieval-augmented generation) it is one component sitting next to a generator.
Both major deep learning frameworks treat bidirectionality as a flag rather than a separate class.
In PyTorch, the recurrent modules accept a bidirectional=True argument. A bidirectional LSTM is just nn.LSTM(input_size, hidden_size, bidirectional=True). The output tensor has shape (seq_len, batch, 2 * hidden_size), with the forward direction in the first hidden_size slots of each timestep and the backward direction in the last hidden_size slots. The final hidden state h_n has shape (2 * num_layers, batch, hidden_size), with directions interleaved by layer. A common gotcha is that the backward entry of h_n is the backward stream's final state in processing order, which corresponds to timestep 1, whereas the last timestep of output holds the backward stream's first computed state, which has only seen the final input. They are not the same thing.
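A short sketch of those shapes and the h_n-versus-output distinction, using arbitrary illustrative dimensions:

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 7, 3, 10, 16
lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)   # default layout: (seq, batch, dim)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)   # (7, 3, 32): forward in the first 16 slots, backward in the last 16
print(h_n.shape)      # (2, 3, 16): one entry per direction for this single layer

# h_n[1] is the backward stream's final state: it has read the whole sequence
# and corresponds to timestep 0. output[-1, :, 16:] is the backward state at
# the last timestep, which has only seen the final input.
assert torch.allclose(h_n[1], output[0, :, hidden_size:])        # same tensor
print(torch.allclose(h_n[1], output[-1, :, hidden_size:]))       # False in general
```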
In TensorFlow and Keras, bidirectionality is done through a wrapper layer: keras.layers.Bidirectional(keras.layers.LSTM(units)). The wrapper takes any RNN layer (LSTM, GRU, SimpleRNN, or a custom one) and produces a bidirectional version. A merge_mode argument controls how the forward and backward outputs are combined: the default is concat, but sum, mul, ave, and None (return both as a list) are available. A backward_layer argument lets the user pass a different layer for the reverse direction if the two should not share architecture.
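A minimal sketch of the wrapper inside a small tagging model, assuming integer token ids as input and hypothetical vocabulary and tag-set sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, num_tags = 10_000, 9          # hypothetical sizes for illustration

model = keras.Sequential([
    layers.Embedding(vocab_size, 64),
    # merge_mode="concat" is the default; sum, mul, ave, or None are also accepted
    layers.Bidirectional(layers.LSTM(128, return_sequences=True), merge_mode="concat"),
    layers.Dense(num_tags, activation="softmax"),   # per-timestep tag distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```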
For transformer encoders, bidirectionality is implicit. There is no special flag because the encoder simply does not apply a causal mask. The Hugging Face transformers library treats this as part of the model class: BertModel, RobertaModel, and DebertaV2Model are bidirectional encoders, while GPT2Model and LlamaModel are causal decoders. The same multi-head attention computation underlies both; the difference is whether an upper-triangular mask of -inf values is added to the attention scores.
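The sketch below makes that point with PyTorch's generic nn.MultiheadAttention module (used purely for illustration; the Hugging Face models implement their own attention layers): the same module is bidirectional with no mask and causal with an upper-triangular -inf mask.

```python
import torch
import torch.nn as nn

seq_len, batch, d_model = 5, 2, 32
attn = nn.MultiheadAttention(d_model, num_heads=4)   # default layout: (seq, batch, dim)
x = torch.randn(seq_len, batch, d_model)

# Bidirectional (encoder-style): no mask, every position attends everywhere.
bi_out, bi_weights = attn(x, x, x)

# Causal (decoder-style): -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
causal_out, causal_weights = attn(x, x, x, attn_mask=causal_mask)

print(bi_weights[0, 0])      # position 0 puts weight on all 5 positions
print(causal_weights[0, 0])  # position 0 can only attend to itself: [1, 0, 0, 0, 0]
```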
Bidirectional models give up some things in exchange for the contextual gains.
The most fundamental loss is generation. As discussed above, a model trained to use both-side context cannot be used as an autoregressive generator without either accepting a quality drop or layering on a separate causal model. Architectures that need to do both (like sequence-to-sequence translation) end up being hybrids.
Bidirectional attention has a quadratic cost in sequence length. A transformer encoder with full bidirectional self-attention has to compute an n by n attention matrix at every layer. For sequences of a few hundred tokens that is fine. For sequences of tens of thousands of tokens it becomes the bottleneck. Long-context variants (Longformer, BigBird, Performer) sparsify the attention pattern to recover near-linear cost, but they pay for it with slightly weaker bidirectional context. Causal transformers have the same quadratic issue, but the lower-triangular mask makes some optimizations (like KV caching during generation) easier.
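A back-of-the-envelope calculation of the n-by-n attention matrix makes the scaling concrete (fp32, a single head and a single layer; real models multiply this by heads, layers, and batch size):

```python
# Memory held by one full n x n attention matrix in fp32.
for n in (512, 4_096, 32_768):
    mib = n * n * 4 / 2**20
    print(f"n = {n:>6}: {mib:8.1f} MiB")
# n =    512:      1.0 MiB
# n =   4096:     64.0 MiB
# n =  32768:   4096.0 MiB
```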
Masked language model pretraining introduces a small pretrain-finetune mismatch. The [MASK] token appears during pretraining but never at inference time. BERT mitigates this by replacing only 80 percent of the selected tokens with [MASK], swapping 10 percent for random tokens, and leaving the remaining 10 percent unchanged, but the gap is not fully closed. ELECTRA's replaced-token detection objective and XLNet's permutation language modeling objective were both designed in part to avoid this mismatch.
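A sketch of that selection rule over integer token ids, with a hypothetical mask_id and the common convention of marking unselected positions with -100 so the loss ignores them:

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15):
    """BERT-style 80/10/10 selection sketch over an integer-id vocabulary."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue                       # ~85% of positions: not selected at all
        labels[i] = tok                    # the model must recover the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id            # 80% of selected: replaced with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)   # 10%: a random token
        # remaining 10%: left unchanged, but still predicted
    return inputs, labels
```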
Bidirectional RNNs need the entire input before they can produce any output, because the backward pass starts from the end. That makes them awkward for streaming applications where data arrives one frame at a time and predictions are needed with low latency. Streaming speech recognition systems often use unidirectional or chunked-bidirectional variants for this reason.
The big picture is a three-way split rather than a binary one.
| Property | Unidirectional (GPT, causal LSTM) | Bidirectional (BERT, BiLSTM) | Hybrid encoder-decoder (T5, BART) |
|---|---|---|---|
| Sees future tokens | No | Yes | Yes (encoder side only) |
| Can generate token by token | Yes | No | Yes (decoder generates) |
| Best for | Free-form generation, chat, code | Classification, tagging, retrieval | Translation, summarization, QA |
| Pretraining objective | Next-token prediction | Masked language model or denoising | Span corruption, denoising |
| Inference cost | Linear in output length | One forward pass over input | Encoder once, decoder per token |
| Modern flagship example | GPT-4, Llama | RoBERTa, DeBERTa, sentence-transformers | T5, BART, FLAN-T5 |
In the modern LLM era the visible part of the field is dominated by the unidirectional column. Decoder-only generative models are what people interact with when they use a chatbot. The bidirectional column is quieter but still huge. Almost every embedding model, every reranker, every retrieval system, every text classifier in production is some descendant of BERT. Sentence-transformers, the BGE family, E5, and ColBERT all use bidirectional encoders. The MTEB embedding leaderboard remains overwhelmingly bidirectional. When a chatbot retrieves documents to answer a question, the chatbot itself is unidirectional, but the retriever is almost always bidirectional.
Imagine you are filling in a crossword puzzle. To figure out a word in the middle, you look at the letters on both sides of the empty squares: some come from the across clue you already solved, and some come from the down clues. You use both directions of context at once.
That is what a bidirectional model does. When it tries to understand the word "bat" in a sentence, it looks at the words before it and the words after it. If "flapping" came earlier and "cave" comes later, it knows the bat has wings. If "swung" came earlier and "baseball" comes later, it knows the bat is for hitting balls.
This works great for understanding sentences, but it does not work for writing them. To write a story one word at a time, you cannot peek at the words you have not written yet, because they do not exist. So bidirectional models are champions of reading and unidirectional models are champions of writing. The biggest modern systems often use both, with one reading the input and the other writing the answer.