See also: Machine learning terms
Unidirectional is a property of a sequence model in which the representation or output at each position depends only on inputs from one direction of the sequence, almost always the past (left-to-right) but occasionally the future (right-to-left). The model never mixes information from both sides at once. This single-direction structure is also called causal, autoregressive (when the same property is used for generation), forward-only, one-directional, or left-to-right, depending on the subcommunity. The opposite property is bidirectional, where each position can see context on both sides.
Unidirectionality shows up across almost every sequence architecture in modern machine learning. Recurrent networks, convolutional networks, and transformers all have unidirectional and bidirectional variants, and the choice between them is driven by the same handful of considerations: whether the task requires generation, whether inference must be streaming, and whether the future is even available at prediction time. This article covers the general concept across architectures. For the specialization to language modeling, where the constraint defines the entire decoder-only LLM family, see unidirectional language model.
A sequence model takes an ordered input x_1, x_2, ..., x_n and produces some output (a label per position, a single label for the whole sequence, or a generated sequence). It is unidirectional if the computation at any position t depends only on positions in one direction relative to t. In the standard left-to-right case, the state or output at position t is a function of x_1 through x_t and never of x_{t+1} through x_n. The model has no path through which information from the future can flow into position t.
This constraint can come from three different places, and they are worth keeping straight:
- It can be built into the architecture itself, as in a forward-only recurrence that simply has no connection to later inputs.
- It can be imposed on an otherwise symmetric computation by a mask or kernel shift, as in causal convolutions and causal attention masks.
- It can be forced by the deployment setting, where future inputs do not yet exist at the moment a prediction is needed.
In all three cases the effect is the same: a strict information ordering in which the past flows into the present and nothing flows backward.
Different fields use different words for what is essentially the same property. The vocabulary matters because papers from speech, language, and vision often describe identical mechanisms in incompatible terms.
| Term | Field where common | What it emphasizes |
|---|---|---|
| Unidirectional | Speech, sequence modeling | Processing direction (one way through the sequence) |
| Causal | Signal processing, transformers | Respecting time arrow; no influence from the future |
| Autoregressive | Language modeling, generative modeling | Each output conditioned on previous outputs (used at generation time) |
| Forward-only | Older RNN literature | Contrast with the backward pass in BiRNNs |
| Left-to-right | Language modeling | Reading direction in left-to-right scripts |
| Monotonic | Alignment, attention | Output position never moves backward in input position |
| Online | Streaming, real-time inference | Prediction can be made as data arrives, without future context |
| Streaming | Speech, video | Same as online, with emphasis on continuous arrival |
The terms are not perfectly interchangeable. Causal and unidirectional describe the same architectural constraint. Autoregressive describes the same constraint when the model is generating its own future tokens. Online and streaming describe the inference setting that unidirectional models make possible. Monotonic is a more specific alignment property used in attention-based ASR and translation.
Most popular sequence architectures have a unidirectional version. The mechanism by which causality is enforced varies, but the resulting information flow is the same.
The original unidirectional model is the plain recurrent neural network. At each time step the hidden state h_t is a function of the previous hidden state h_{t-1} and the current input x_t, written h_t = f(h_{t-1}, x_t). The output y_t is a function of h_t. Because h_t only references h_{t-1} and x_t, and because h_{t-1} is itself a function of h_{t-2} and x_{t-1}, and so on, the dependency chain only ever moves forward.
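As a concrete illustration, here is a minimal sketch of that recurrence in PyTorch; the weight names W_h, W_x, and b are illustrative rather than taken from any particular library.

```python
# A minimal sketch of a forward-only recurrence, h_t = f(h_{t-1}, x_t).
# Shapes and names are illustrative; this is not a production RNN cell.
import torch

def unidirectional_rnn(x, W_h, W_x, b):
    """x: (seq_len, input_dim); W_h: (hidden, hidden); W_x: (hidden, input_dim)."""
    h = torch.zeros(W_h.shape[0])
    states = []
    for x_t in x:                                   # strictly left to right
        h = torch.tanh(W_h @ h + W_x @ x_t + b)     # depends only on the past
        states.append(h)
    return torch.stack(states)                      # state at t is a function of x_1..x_t
```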
LSTM and GRU cells inherit this property. They add gating mechanisms that decide what to keep, forget, or write into the state, but the recurrence still runs forward only. A unidirectional LSTM is the default in PyTorch (the bidirectional flag is off unless set) and in Keras (where bidirectionality requires wrapping the layer in Bidirectional). These models have been the workhorses of speech recognition, language modeling, and time series forecasting for most of the deep learning era.
A bidirectional RNN, in contrast, runs two independent recurrences in opposite directions and concatenates their hidden states. That construction is impossible to use as a real-time predictor because the backward pass cannot start until the entire sequence has arrived.
A standard 1D convolution at position t mixes inputs from positions t-k through t+k for a kernel of width 2k+1. That violates causality whenever k is positive. The fix is to shift the kernel so it only covers past positions, t-2k through t, or equivalently to pad the input on the left and crop the output on the right. The result is a causal convolution that respects the time arrow.
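A minimal sketch of the left-padding recipe in PyTorch; the CausalConv1d wrapper name is illustrative, not a library class:

```python
# A sketch of a causal 1D convolution: pad (kernel_size - 1) * dilation zeros
# on the left so output position t only sees inputs at positions <= t.
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad the past side only
        return self.conv(x)                    # output length equals input length
```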
The canonical example is WaveNet, introduced by Aaron van den Oord and colleagues at DeepMind in 2016 in the paper "WaveNet: A Generative Model for Raw Audio." WaveNet generates raw audio one sample at a time at 16 kHz or higher, which is a brutally long sequence. Recurrent models could not handle that length efficiently, so the authors stacked dilated causal convolutions whose receptive field grows exponentially with depth. Layer i has a dilation of 2^i in the original recipe, so a stack of ten layers reaches a receptive field of 1024 samples, and three such stacks reach about 3000 samples while keeping the total parameter count modest. The model factorizes the joint distribution over audio samples as a product of conditionals, each computed by the dilated causal stack, and is trained with the next-sample prediction loss.
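The receptive-field arithmetic in that recipe is easy to verify; the helper below is a back-of-the-envelope sketch, not WaveNet code.

```python
# Receptive field of a stack of dilated causal convolutions with kernel size 2
# and dilations 1, 2, 4, ..., 2^(num_layers - 1), as described above.
def receptive_field(num_layers, num_stacks=1, kernel_size=2):
    per_stack = sum((kernel_size - 1) * 2 ** i for i in range(num_layers))
    return num_stacks * per_stack + 1

print(receptive_field(10))      # 1024 samples for one 10-layer stack
print(receptive_field(10, 3))   # 3070 samples for three stacks, roughly 3000
```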
PixelCNN, also from van den Oord and DeepMind in 2016, applies the same idea to images. It models the joint distribution over pixels as a product of conditionals, where each pixel is conditioned on the pixels above and to the left of it (a raster-scan order). Standard convolutions would let each pixel see its neighbors in all directions, so PixelCNN uses masked convolutions that zero out the kernel weights for positions that have not yet been generated. Two mask types are used: type A masks the current center pixel as well (used in the first layer), and type B masks only future pixels (used in subsequent layers). The resulting model is autoregressive over pixels in raster order. The PixelRNN variants introduced in the same work used row LSTMs and diagonal BiLSTMs to model the same conditionals with recurrence, achieving slightly better likelihoods at a much higher training cost.
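A sketch of how such a mask can be built, assuming a square kernel; the helper name is illustrative, and the mask would be multiplied into the convolution weights before each forward pass.

```python
# PixelCNN-style kernel mask for a square kernel. Type 'A' also blocks the
# center position (first layer); type 'B' allows it (later layers).
import torch

def pixelcnn_mask(kernel_size, mask_type="B"):
    mask = torch.ones(kernel_size, kernel_size)
    center = kernel_size // 2
    start = center + (1 if mask_type == "B" else 0)
    mask[center, start:] = 0      # positions right of (and, for type A, at) the center
    mask[center + 1:, :] = 0      # all rows below the current one
    return mask
```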
Causal convolutions are also the foundation of the temporal convolutional network (TCN), which Bai, Kolter, and Koltun proposed in 2018 as a general drop-in replacement for RNNs in many sequence tasks. A TCN is a stack of dilated causal convolutions with residual connections, designed to give RNN-like temporal modeling with the parallelism of a feedforward net.
The transformer, introduced by Vaswani and colleagues in 2017, has a self-attention layer that, by default, lets every position attend to every other position. To make a transformer unidirectional you add a causal attention mask: a matrix with negative infinity everywhere above the diagonal that, when added to the attention logits before the softmax, drives the attention weights at future positions to zero. After masking, position t can only attend to positions 1 through t. The full logit matrix is still computed, but the mask removes the half that would have looked into the future.
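A single-head sketch of the masked attention computation, assuming query, key, and value matrices of shape (seq_len, d); the function is illustrative, not a library API.

```python
# Causal self-attention for one head: an upper-triangular -inf mask added to the
# logits drives every future position's attention weight to zero after softmax.
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    seq_len, d = q.shape
    logits = (q @ k.T) / d ** 0.5
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    weights = F.softmax(logits + mask, dim=-1)   # row t is nonzero only for columns 1..t
    return weights @ v                           # position t mixes only v_1..v_t
```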
The transformer decoder block uses causal self-attention by construction. Decoder-only models like the GPT family, Llama, Mistral, and Claude consist of a stack of these blocks and are unidirectional throughout. Encoder-decoder models like T5 and BART have a bidirectional encoder (no mask) feeding a unidirectional decoder (causal mask plus cross-attention). For more on this lineage as it applies to language models specifically, see unidirectional language model and large language model.
A practical consequence of causal masking is the KV cache: at inference time the keys and values for past positions never change, so they can be stored once and reused for every subsequent token. This optimization is what makes streaming generation efficient at long context lengths. A bidirectional model has no KV cache equivalent because the past attention has to be recomputed every time a new position arrives.
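A sketch of the idea for a single head, assuming per-step projections q_t, k_t, v_t of shape (d,) coming from a hypothetical model; real implementations cache per layer and per head.

```python
# Decoding with a KV cache: past keys and values are appended once and reused,
# so each new token costs one attention row rather than a full recomputation.
import torch
import torch.nn.functional as F

def decode_step(q_t, k_t, v_t, cache):
    cache["k"].append(k_t)
    cache["v"].append(v_t)
    K = torch.stack(cache["k"])                                  # (t, d)
    V = torch.stack(cache["v"])
    w = F.softmax((q_t @ K.T) / K.shape[-1] ** 0.5, dim=-1)      # attend over all past steps
    return w @ V                                                 # cost per step grows as O(t)

cache = {"k": [], "v": []}                                       # filled as tokens are generated
```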
More recent architectures like Mamba (Gu and Dao, 2023) and RWKV use selective state-space layers and linear-attention layers that are unidirectional by construction. They behave like RNNs at inference (constant memory per step, sequential generation) but parallelize during training like transformers. Their causal property is built into the recurrence, not added with a mask.
In online speech recognition, models must produce output as audio arrives. The dominant architecture is the RNN transducer (RNN-T), introduced by Alex Graves in 2012 in "Sequence Transduction with Recurrent Neural Networks." RNN-T has three components: an acoustic encoder that processes input frames, a label predictor that processes previously emitted output tokens, and a joint network that combines them to predict the next output. Because each step depends only on past acoustic frames and past labels, RNN-T can run causally and produce partial transcripts as audio comes in. Google's 2018 paper "Streaming End-to-end Speech Recognition for Mobile Devices" pushed RNN-T into production on Pixel phones and revived broad interest in the architecture.
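As a rough illustration of how the three components meet, here is a simplified joint-network sketch; the additive combination and the extra blank symbol follow the usual RNN-T formulation, but the layer names and sizes are assumptions.

```python
# Simplified RNN-T joint network: combine one encoder frame and one predictor
# state, then produce logits over the vocabulary plus a blank symbol.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, d, vocab_size):
        super().__init__()
        self.out = nn.Linear(d, vocab_size + 1)          # +1 for the blank symbol

    def forward(self, enc_t, pred_u):                    # both of shape (d,)
        return self.out(torch.tanh(enc_t + pred_u))      # next-label logits at (t, u)
```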
Causal Conformer variants extend the same idea to the Conformer architecture (a convolution-augmented transformer encoder for speech). They replace bidirectional self-attention with chunked or fully causal attention and replace standard convolutions with causal convolutions, trading some accuracy for streaming capability. Many production ASR stacks use a two-pass design: a small causal model produces partial results in real time, and a larger non-causal model rescores them once the full utterance has been observed.
Unidirectional models are the default for three distinct reasons, and a given application usually involves at least one of them.
The first reason is generation. A model generating its own output one step at a time, like a GPT-style chat model writing a paragraph or WaveNet synthesizing a syllable, has nothing on the right side to look at. The future tokens have not been produced yet. The factorization P(x_1, ..., x_n) = product of P(x_t | x_{<t}) requires that each conditional only look at the past. Bidirectional models trained with masked language modeling cannot be used as autoregressive generators without significant surgery, and the surgery rarely matches a properly trained unidirectional model on quality.
The second reason is streaming inference. Many real-time applications cannot wait for the full input. Online speech recognition produces a transcript while the user is still talking. Live captioning displays text as it is recognized. Voice assistants need low end-to-end latency from microphone to action. Real-time control systems consume sensor data as it arrives. In all these cases the future does not exist yet at the moment a prediction is needed. A bidirectional model would have to wait for the end of the utterance, segment, or sequence; a unidirectional model can emit a prediction immediately.
The third reason is causality preservation in domains where the time arrow is part of the problem. In time series forecasting, financial modeling, and physical simulation, the model is supposed to predict the future from the past. Letting the future leak into the present during training would produce a model that cheats on the test set and fails in production. Even when the entire training sequence is known offline, enforcing unidirectional information flow keeps the training distribution aligned with the deployment distribution.
The choice between unidirectional and bidirectional is usually presented as a trade-off, with unidirectional better suited to generation and streaming and bidirectional better suited to understanding tasks where the entire input is known up front. The table below sketches the contrast.
| Property | Unidirectional | Bidirectional |
|---|---|---|
| Sees future tokens | No | Yes |
| Native generation | Yes (autoregressive) | No (requires non-autoregressive workarounds) |
| Streaming inference | Natural | Not possible without buffering |
| Training parallelism | Full (with causal mask or causal conv) | Full |
| Inference parallelism | Sequential at generation time | One forward pass over input |
| KV cache reuse | Yes | No |
| Best for | Generation, streaming, time series | Classification, tagging, retrieval, embeddings |
| Pretraining objective | Next-token prediction | Masked language modeling, denoising |
| Sequence-labeling accuracy | Slightly weaker | Slightly better |
| Handles fully observed input | Yes, but throws away right context | Uses both sides natively |
| Flagship example | GPT-4, Llama, WaveNet, RNN-T | BERT, RoBERTa, BiLSTM-CRF, Conformer encoder |
The asymmetry in the last few rows is real but smaller than people once thought. Decoder-only LLMs trained at scale match or beat bidirectional encoders on most NLU benchmarks despite the structural disadvantage, simply because they have more data, more parameters, and richer training signals. The cleanest remaining win for bidirectional models is in the embedding and retrieval space, where almost every leaderboard model is still a BERT descendant.
A unidirectional model has to commit to a direction. The convention varies across modalities and tasks.
| Direction | Common in | Why |
|---|---|---|
| Left-to-right (forward) | Most NLP, audio, video | Matches reading and playback order; matches the time arrow |
| Right-to-left (backward) | Some embedding work, half of ELMo, RTL scripts | Pairs with a forward model to give bidirectional context without true bidirectionality |
| Time-forward | Time series, control, physics | The future genuinely is unknown |
| Raster-scan | PixelCNN, image autoregressive models | Imposes a unidirectional order on a 2D input |
| Outer-to-inner / inner-to-outer | Some autoregressive 3D models | Defines a 1D order over a 3D structure |
For 1D sequences with no preferred direction (DNA reads, some scientific signals), the choice of direction is arbitrary, and often both directions are trained as separate models or fused via concatenation. For text in left-to-right scripts, the forward direction is the obvious default. For text in right-to-left scripts (Arabic, Hebrew), the model still processes tokens in logical reading order, so it remains forward in its internal token order even though the visual layout runs the other way.
Unidirectional models are trained with one universal trick: align the loss at each position with the next-step target so that the architecture's information ordering matches the label ordering. The implementation differs by architecture.
For RNNs and LSTMs the targets are the same as the inputs, shifted by one position. The cross-entropy loss is summed over all positions and gradients flow back through the recurrence (backpropagation through time). Because the recurrence is sequential, training is also sequential within a sequence; only across sequences in a batch can computation be parallelized.
For causal convolutions and causal-masked transformers the situation is much better. The whole sequence can be processed in a single forward pass, the loss at every position is computed simultaneously, and gradients flow back through one giant computational graph. This parallelism is the main reason transformer language models scaled up so much faster than RNN language models.
Across all unidirectional architectures, training uses teacher forcing: the input at each position is the ground-truth previous token, not the model's own prediction. Teacher forcing is fast and stable, but it leaves a small exposure bias because at inference the model has to consume its own outputs. Scheduled sampling and reinforcement-style fine-tuning have been proposed to close this gap; modern LLMs largely tolerate it because the simplicity wins.
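A minimal sketch of the shifted-target objective with teacher forcing, assuming logits of shape (batch, seq_len, vocab) from a causal model and ground-truth token ids of shape (batch, seq_len):

```python
# Next-token training: the prediction at position t is scored against token t+1.
# Inputs are the ground-truth tokens (teacher forcing); every position trains in parallel.
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    pred = logits[:, :-1, :]       # drop the prediction made after the last token
    targets = tokens[:, 1:]        # drop the first token, which has no predecessor
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), targets.reshape(-1))
```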
A unidirectional generative model produces a probability distribution over the next token at every step. Choosing an actual sequence requires a decoding strategy. The common options are listed below; the unidirectional language model article covers them in more depth in the context of LLMs.
| Method | Description |
|---|---|
| Greedy | Pick the highest-probability token at every step |
| Beam search | Maintain k partial sequences, expand each, keep the best k |
| Temperature sampling | Divide logits by T, then sample; T < 1 sharpens, T > 1 flattens |
| Top-k sampling | Restrict to the k most likely tokens, renormalize, sample |
| Nucleus (top-p) sampling | Restrict to the smallest set with cumulative probability >= p |
| Speculative decoding | A small draft model proposes tokens, the large model verifies in parallel |
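Two of the table entries, temperature and nucleus sampling, compose naturally; the sketch below assumes a 1D logits tensor over the vocabulary and illustrative default values.

```python
# Temperature plus nucleus (top-p) sampling over one decoding step's logits.
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    probs = F.softmax(logits / temperature, dim=-1)       # T < 1 sharpens, T > 1 flattens
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p              # smallest set with mass >= p
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()                    # renormalize, then sample
    return sorted_ids[torch.multinomial(sorted_probs, 1)]
```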
Speculative decoding, introduced by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google in their 2023 ICML paper "Fast Inference from Transformers via Speculative Decoding," is worth flagging because it works specifically because the model is unidirectional. A small draft model and a large target model both have causal structure, so any prefix the draft proposes can be evaluated by the large model in one parallel pass over the prefix. The output distribution is mathematically identical to standard sampling from the large model, but throughput improves by 2x to 3x or more depending on the draft model and the target distribution.
Unidirectional models can be deployed in two regimes. Offline inference processes a complete input at once, like a batch transcription job or a document summarization run; the unidirectional structure is used during training, but at inference the whole sequence is available. Online or streaming inference processes input as it arrives, emitting output incrementally with low latency.
Streaming is what most users of voice assistants, live captioning, and chat LLMs experience. A chat model showing tokens as they are generated is doing streaming inference: the unidirectional decoder samples the next token, emits it to the screen, and feeds it back into the context for the next step. The same architecture would not be able to stream if it were bidirectional, because the right-context vectors at any position depend on tokens that have not been generated yet.
Key-value caching is the standard optimization. Because past keys and values never change in a causal transformer, they can be computed once and reused. Generating token n only requires one new query, one new key, one new value, and one attention computation between the new query and all past keys. Without this trick, generating an N-token completion would cost O(N^3) attention ops; with it, the cost is O(N^2). Variants like grouped-query attention and multi-query attention reduce the size of the cache itself, which is critical for serving large batches at long context.
The unidirectional design has real downsides outside its sweet spot.
The most obvious one is the loss of right context for understanding tasks. The textbook example is the homograph: in "He went to the bank to deposit his check," the word "bank" is disambiguated by the later word "deposit." A unidirectional model labeling "bank" at position 5 has no access to "deposit" at position 7, and has to either store everything in a forward state and hope the right signal makes it through, or wait until the whole sentence has been read and then use the final state. Bidirectional models avoid this entirely. For sequence labeling, named entity recognition, sentiment classification, and most retrieval tasks, all else equal, a bidirectional encoder outperforms a unidirectional one of the same size.
A second limitation is the sequential cost of inference. Generation is inherently one step at a time, so latency scales linearly with output length. Speculative decoding helps but does not fundamentally change the serialization. Bidirectional models doing classification or labeling are usually faster at inference because they need only one forward pass, not one per output token.
A third issue is that masked-token completion is not the natural objective. A model trained to predict the next token can be coaxed into filling in a missing token in the middle of a passage (with prefix-suffix prompting or fill-in-the-middle pretraining), but it is doing extra work to overcome its training. BERT-style models do this directly because that is what they were trained to do.
In 2026 the field has settled into a fairly stable division of labor:
| Use case | Architecture | Direction | Examples |
|---|---|---|---|
| Generative chat, code, agents | Decoder-only transformer | Unidirectional | GPT-4, Claude 4, Gemini 3, Llama 4, Mistral, DeepSeek-V3, Qwen 3, Grok |
| Embeddings, retrieval, classification | Encoder-only transformer | Bidirectional | BGE, E5, Sentence-BERT, RoBERTa, DeBERTa, NV-Embed |
| Translation, summarization, structured output | Encoder-decoder transformer | Mixed | T5, BART, FLAN-T5 |
| Streaming speech recognition | RNN-T, causal Conformer | Unidirectional | Production ASR on phones, smart speakers, real-time captioning |
| High-fidelity audio synthesis | Causal CNN, autoregressive transformer, diffusion | Unidirectional (autoregressive variants) | WaveNet, neural codec models |
| Time series forecasting | Causal CNN, LSTM, transformer decoder | Unidirectional | Demand forecasting, energy load, finance |
The overall direction of travel since GPT-3 has been toward unidirectional decoder-only architectures for everything generative, with bidirectional encoders surviving in retrieval and embedding niches. Encoder-decoder models persist in translation and a few academic settings. State-space and linear-attention architectures (Mamba, RWKV, RetNet) are unidirectional by construction and represent the main current challenge to the dominance of causal transformers, although none has displaced them at frontier scale.
"Unidirectional" and "causal" are sometimes used interchangeably in machine learning, which can be confusing because causal inference in statistics refers to something different. Causal inference is about distinguishing correlation from causation: deciding whether intervening on variable X would change variable Y, in the do-calculus sense of Judea Pearl's framework. Causal masking in a sequence model is about respecting the time arrow in the input, not about identifying interventional effects.
The two ideas share an intuition (the past influences the future, not the other way around) but they live at different levels of abstraction. A causal language model and a causal graph have the word "causal" for related but distinct reasons. The sequence-modeling sense is the one used in this article and in almost all deep learning literature.
Imagine walking through a tunnel in which you can see everything you have already passed but nothing around the next corner. A unidirectional model is a computer that has to make decisions while walking through this tunnel. At every step, it can only use what it has already seen, never what lies ahead.
This sounds like a disadvantage, and sometimes it is, but it is also the only way to do certain jobs. If the computer's job is to write a story one word at a time, then the next word does not exist yet, and there is nothing to peek at anyway. If the computer's job is to listen to someone talking and write down the words as they come out, then the rest of the sentence has not been said yet. A model that needs to peek into the future cannot do these jobs at all, because the future is not there.
Unidirectional models are the storytellers and the live transcribers. Bidirectional models are the editors and the proofreaders, who get to read the whole thing before deciding what each word means. Both are useful; the right one depends on whether you are reading a finished book or watching a movie as it plays.