See also: Machine learning terms
Unidirectional is a property of a sequence model in which the representation or output at each position depends only on inputs from one direction of the sequence, almost always the past (left-to-right) but occasionally the future (right-to-left). The model never mixes information from both sides at once. This single-direction structure is also called causal, autoregressive (when the same property is used for generation), forward-only, one-directional, or left-to-right, depending on the subcommunity. The opposite property is bidirectional, where each position can see context on both sides.
Unidirectionality shows up across almost every sequence architecture in modern machine learning. Recurrent networks, convolutional networks, and transformers all have unidirectional and bidirectional variants, and the choice between them is driven by the same handful of considerations: whether the task requires generation, whether inference must be streaming, and whether the future is even available at prediction time. This article covers the general concept across architectures. For the specialization to language modeling, where the constraint defines the entire decoder-only LLM family, see unidirectional language model.
A sequence model takes an ordered input x_1, x_2, ..., x_n and produces some output (a label per position, a single label for the whole sequence, or a generated sequence). It is unidirectional if the computation at any position t depends only on positions in one direction relative to t. In the standard left-to-right case, the state or output at position t is a function of x_1 through x_t and never of x_{t+1} through x_n. The model has no path through which information from the future can flow into position t.
This constraint can come from three different places, and they are worth keeping straight:
- It can be built into the architecture itself, as in a forward-only recurrence that simply has no connection to later inputs.
- It can be imposed on an otherwise symmetric computation by a mask or kernel shift, as in causal convolutions and causal attention masks.
- It can be forced by the deployment setting, where future inputs do not yet exist at the moment a prediction is needed.
In all three cases the effect is the same: a strict information ordering in which the past flows into the present and nothing flows backward.
Different fields use different words for what is essentially the same property. The vocabulary matters because papers from speech, language, and vision often describe identical mechanisms in incompatible terms.
| Term | Field where common | What it emphasizes |
|---|---|---|
| Unidirectional | Speech, sequence modeling | Processing direction (one way through the sequence) |
| Causal | Signal processing, transformers | Respecting time arrow; no influence from the future |
| Autoregressive | Language modeling, generative modeling | Each output conditioned on previous outputs (used at generation time) |
| Forward-only | Older RNN literature | Contrast with the backward pass in BiRNNs |
| Left-to-right | Language modeling | Reading direction in left-to-right scripts |
| Monotonic | Alignment, attention | Output position never moves backward in input position |
| Online | Streaming, real-time inference | Prediction can be made as data arrives, without future context |
| Streaming | Speech, video | Same as online, with emphasis on continuous arrival |
The terms are not perfectly interchangeable. Causal and unidirectional describe the same architectural constraint. Autoregressive describes the same constraint when the model is generating its own future tokens. Online and streaming describe the inference setting that unidirectional models make possible. Monotonic is a more specific alignment property used in attention-based ASR and translation.
Most popular sequence architectures have a unidirectional version. The mechanism by which causality is enforced varies, but the resulting information flow is the same.
The original unidirectional model is the plain recurrent neural network. At each time step the hidden state h_t is a function of the previous hidden state h_{t-1} and the current input x_t, written h_t = f(h_{t-1}, x_t). The output y_t is a function of h_t. Because h_t only references h_{t-1} and x_t, and because h_{t-1} is itself a function of h_{t-2} and x_{t-1}, and so on, the dependency chain only ever moves forward.
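As a concrete illustration, here is a minimal sketch of that recurrence in PyTorch; the weight names W_h, W_x, and b are illustrative rather than taken from any particular library.

```python
# A minimal sketch of a forward-only recurrence, h_t = f(h_{t-1}, x_t).
# Shapes and names are illustrative; this is not a production RNN cell.
import torch

def unidirectional_rnn(x, W_h, W_x, b):
    """x: (seq_len, input_dim); W_h: (hidden, hidden); W_x: (hidden, input_dim)."""
    h = torch.zeros(W_h.shape[0])
    states = []
    for x_t in x:                                   # strictly left to right
        h = torch.tanh(W_h @ h + W_x @ x_t + b)     # depends only on the past
        states.append(h)
    return torch.stack(states)                      # state at t is a function of x_1..x_t
```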
LSTM and GRU cells inherit this property. They add gating mechanisms that decide what to keep, forget, or write into the state, but the recurrence still runs forward only. A unidirectional LSTM is the default in PyTorch (the bidirectional flag is off unless set) and in Keras (where bidirectionality requires wrapping the layer in Bidirectional). These models have been the workhorses of speech recognition, language modeling, and time series forecasting for most of the deep learning era.
A bidirectional RNN, in contrast, runs two independent recurrences in opposite directions and concatenates their hidden states. That construction is impossible to use as a real-time predictor because the backward pass cannot start until the entire sequence has arrived.
A standard 1D convolution at position t mixes inputs from positions t-k through t+k for a kernel of width 2k+1. That violates causality whenever k is positive. The fix is to shift the kernel so it only covers past positions, t-2k through t, or equivalently to pad the input on the left and crop the output on the right. The result is a causal convolution that respects the time arrow.
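A minimal sketch of the left-padding recipe in PyTorch; the CausalConv1d wrapper name is illustrative, not a library class:

```python
# A sketch of a causal 1D convolution: pad (kernel_size - 1) * dilation zeros
# on the left so output position t only sees inputs at positions <= t.
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad the past side only
        return self.conv(x)                    # output length equals input length
```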
The canonical example is WaveNet, introduced by Aaron van den Oord and colleagues at DeepMind in 2016 in the paper "WaveNet: A Generative Model for Raw Audio." WaveNet generates raw audio one sample at a time at 16 kHz or higher, which is a brutally long sequence. Recurrent models could not handle that length efficiently, so the authors stacked dilated causal convolutions whose receptive field grows exponentially with depth. Layer i has a dilation of 2^i in the original recipe, so a stack of ten layers reaches a receptive field of 1024 samples, and three such stacks reach about 3000 samples while keeping the total parameter count modest. The model factorizes the joint distribution over audio samples as a product of conditionals, each computed by the dilated causal stack, and is trained with the next-sample prediction loss.
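The receptive-field arithmetic in that recipe is easy to verify; the helper below is a back-of-the-envelope sketch, not WaveNet code.

```python
# Receptive field of a stack of dilated causal convolutions with kernel size 2
# and dilations 1, 2, 4, ..., 2^(num_layers - 1), as described above.
def receptive_field(num_layers, num_stacks=1, kernel_size=2):
    per_stack = sum((kernel_size - 1) * 2 ** i for i in range(num_layers))
    return num_stacks * per_stack + 1

print(receptive_field(10))      # 1024 samples for one 10-layer stack
print(receptive_field(10, 3))   # 3070 samples for three stacks, roughly 3000
```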
PixelCNN, also from van den Oord and DeepMind in 2016, applies the same idea to images. It models the joint distribution over pixels as a product of conditionals, where each pixel is conditioned on the pixels above and to the left of it (a raster-scan order). Standard convolutions would let each pixel see its neighbors in all directions, so PixelCNN uses masked convolutions that zero out the kernel weights for positions that have not yet been generated. Two mask types are used: type A masks the current center pixel as well (used in the first layer), and type B masks only future pixels (used in subsequent layers). The resulting model is autoregressive over pixels in raster order. The PixelRNN variants introduced in the same work used row LSTMs and diagonal BiLSTMs to model the same conditionals with recurrence, achieving slightly better likelihoods at a much higher training cost.
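A sketch of how such a mask can be built, assuming a square kernel; the helper name is illustrative, and the mask would be multiplied into the convolution weights before each forward pass.

```python
# PixelCNN-style kernel mask for a square kernel. Type 'A' also blocks the
# center position (first layer); type 'B' allows it (later layers).
import torch

def pixelcnn_mask(kernel_size, mask_type="B"):
    mask = torch.ones(kernel_size, kernel_size)
    center = kernel_size // 2
    start = center + (1 if mask_type == "B" else 0)
    mask[center, start:] = 0      # positions right of (and, for type A, at) the center
    mask[center + 1:, :] = 0      # all rows below the current one
    return mask
```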
Causal convolutions are also the foundation of the temporal convolutional network (TCN), which Bai, Kolter, and Koltun proposed in 2018 as a general drop-in replacement for RNNs in many sequence tasks. A TCN is a stack of dilated causal convolutions with residual connections, designed to give RNN-like temporal modeling with the parallelism of a feedforward net.
The transformer, introduced by Vaswani and colleagues in 2017, has a self-attention layer that, by default, lets every position attend to every other position. To make a transformer unidirectional you add a causal attention mask: a matrix with negative infinity everywhere above the diagonal that, when added to the attention logits before the softmax, drives the attention weights at future positions to zero. After masking, position t can only attend to positions 1 through t. The full logit matrix is still computed, but the mask removes the half that would have looked into the future.
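A single-head sketch of the masked attention computation, assuming query, key, and value matrices of shape (seq_len, d); the function is illustrative, not a library API.

```python
# Causal self-attention for one head: an upper-triangular -inf mask added to the
# logits drives every future position's attention weight to zero after softmax.
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    seq_len, d = q.shape
    logits = (q @ k.T) / d ** 0.5
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    weights = F.softmax(logits + mask, dim=-1)   # row t is nonzero only for columns 1..t
    return weights @ v                           # position t mixes only v_1..v_t
```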
The transformer decoder block uses causal self-attention by construction. Decoder-only models like the GPT family, Llama, Mistral, and Claude consist of a stack of these blocks and are unidirectional throughout. Encoder-decoder models like T5 and BART have a bidirectional encoder (no mask) feeding a unidirectional decoder (causal mask plus cross-attention). For more on this lineage as it applies to language models specifically, see unidirectional language model and large language model.
A practical consequence of causal masking is the KV cache: at inference time the keys and values for past positions never change, so they can be stored once and reused for every subsequent token. This optimization is what makes streaming generation efficient at long context lengths. A bidirectional model has no KV cache equivalent because the past attention has to be recomputed every time a new position arrives.
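A sketch of the idea for a single head, assuming per-step projections q_t, k_t, v_t of shape (d,) coming from a hypothetical model; real implementations cache per layer and per head.

```python
# Decoding with a KV cache: past keys and values are appended once and reused,
# so each new token costs one attention row rather than a full recomputation.
import torch
import torch.nn.functional as F

def decode_step(q_t, k_t, v_t, cache):
    cache["k"].append(k_t)
    cache["v"].append(v_t)
    K = torch.stack(cache["k"])                                  # (t, d)
    V = torch.stack(cache["v"])
    w = F.softmax((q_t @ K.T) / K.shape[-1] ** 0.5, dim=-1)      # attend over all past steps
    return w @ V                                                 # cost per step grows as O(t)

cache = {"k": [], "v": []}                                       # filled as tokens are generated
```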
More recent architectures like Mamba (Gu and Dao, 2023) and RWKV use selective state-space layers and linear-attention layers that are unidirectional by construction. They behave like RNNs at inference (constant memory per step, sequential generation) but parallelize during training like transformers. Their causal property is built into the recurrence, not added with a mask.
In online speech recognition, models must produce output as audio arrives. The dominant architecture is the RNN transducer (RNN-T), introduced by Alex Graves in 2012 in "Sequence Transduction with Recurrent Neural Networks." RNN-T has three components: an acoustic encoder that processes input frames, a label predictor that processes previously emitted output tokens, and a joint network that combines them to predict the next output. Because each step depends only on past acoustic frames and past labels, RNN-T can run causally and produce partial transcripts as audio comes in. Google's 2018 paper "Streaming End-to-end Speech Recognition for Mobile Devices" pushed RNN-T into production on Pixel phones and revived broad interest in the architecture.
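As a rough illustration of how the three components meet, here is a simplified joint-network sketch; the additive combination and the extra blank symbol follow the usual RNN-T formulation, but the layer names and sizes are assumptions.

```python
# Simplified RNN-T joint network: combine one encoder frame and one predictor
# state, then produce logits over the vocabulary plus a blank symbol.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, d, vocab_size):
        super().__init__()
        self.out = nn.Linear(d, vocab_size + 1)          # +1 for the blank symbol

    def forward(self, enc_t, pred_u):                    # both of shape (d,)
        return self.out(torch.tanh(enc_t + pred_u))      # next-label logits at (t, u)
```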
Causal Conformer variants extend the same idea to the Conformer architecture (a convolution-augmented transformer encoder for speech). They replace bidirectional self-attention with chunked or fully causal attention and replace standard convolutions with causal convolutions, trading some accuracy for streaming capability. Many production ASR stacks use a two-pass design: a small causal model produces partial results in real time, and a larger non-causal model rescores them once the full utterance has been observed.
Unidirectional models are the default for three distinct reasons, and a given application usually involves at least one of them.
The first reason is generation. A model generating its own output one step at a time, like a GPT-style chat model writing a paragraph or WaveNet synthesizing a syllable, has nothing on the right side to look at. The future tokens have not been produced yet. The factorization P(x_1, ..., x_n) = product of P(x_t | x_{<t}) requires that each conditional only look at the past. Bidirectional models trained with masked language modeling cannot be used as autoregressive generators without significant surgery, and the surgery rarely matches a properly trained unidirectional model on quality.
The second reason is streaming inference. Many real-time applications cannot wait for the full input. Online speech recognition produces a transcript while the user is still talking. Live captioning displays text as it is recognized. Voice assistants need low end-to-end latency from microphone to action. Real-time control systems consume sensor data as it arrives. In all these cases the future does not exist yet at the moment a prediction is needed. A bidirectional model would have to wait for the end of the utterance, segment, or sequence; a unidirectional model can emit a prediction immediately.
The third reason is causality preservation in domains where the time arrow is part of the problem. In time series forecasting, financial modeling, and physical simulation, the model is supposed to predict the future from the past. Letting the future leak into the present during training would produce a model that cheats on the test set and fails in production. Even when the entire training sequence is known offline, enforcing unidirectional information flow keeps the training distribution aligned with the deployment distribution.
The choice between unidirectional and bidirectional is usually presented as a trade-off, with unidirectional better suited to generation and streaming and bidirectional better suited to understanding tasks where the entire input is known up front. The table below sketches the contrast.
| Property | Unidirectional | Bidirectional |
|---|---|---|
| Sees future tokens | No | Yes |
| Native generation | Yes (autoregressive) | No (requires non-autoregressive workarounds) |
| Streaming inference | Natural | Not possible without buffering |
| Training parallelism | Full (with causal mask or causal conv) | Full |
| Inference parallelism | Sequential at generation time | One forward pass over input |
| KV cache reuse | Yes | No |
| Best for | Generation, streaming, time series | Classification, tagging, retrieval, embeddings |
| Pretraining objective | Next-token prediction | Masked language modeling, denoising |
| Sequence-labeling accuracy | Slightly weaker | Slightly better |
| Handles fully observed input | Yes, but throws away right context | Uses both sides natively |
| Flagship example | GPT-4, Llama, WaveNet, RNN-T | BERT, RoBERTa, BiLSTM-CRF, Conformer encoder |
The asymmetry in the last few rows is real but smaller than people once thought. Decoder-only LLMs trained at scale match or beat bidirectional encoders on most NLU benchmarks despite the structural disadvantage, simply because they have more data, more parameters, and richer training signals. The cleanest remaining win for bidirectional models is in the embedding and retrieval space, where almost every leaderboard model is still a BERT descendant.
A unidirectional model has to commit to a direction. The convention varies across modalities and tasks.
| Direction | Common in | Why |
|---|---|---|
| Left-to-right (forward) | Most NLP, audio, video | Matches reading and playback order; matches the time arrow |
| Right-to-left (backward) | Some embedding work, half of ELMo, RTL scripts | Pairs with a forward model to give bidirectional context without true bidirectionality |
| Time-forward | Time series, control, physics | The future genuinely is unknown |
| Raster-scan | PixelCNN, image autoregressive models | Imposes a unidirectional order on a 2D input |
| Outer-to-inner / inner-to-outer | Some autoregressive 3D models | Defines a 1D order over a 3D structure |
For 1D sequences with no preferred direction (DNA reads, some scientific signals), the choice of direction is arbitrary, and often both directions are trained as separate models or fused via concatenation. For text in left-to-right scripts, the forward direction is the obvious default. For text in right-to-left scripts (Arabic, Hebrew), the model still processes tokens in logical reading order, so it remains forward in its internal token order even though the visual layout runs the other way.
Unidirectional models are trained with one universal trick: align the loss at each position with the next-step target so that the architecture's information ordering matches the label ordering. The implementation differs by architecture.
For RNNs and LSTMs the targets are the same as the inputs, shifted by one position. The cross-entropy loss is summed over all positions and gradients flow back through the recurrence (backpropagation through time). Because the recurrence is sequential, training is also sequential within a sequence; only across sequences in a batch can computation be parallelized.
For causal convolutions and causal-masked transformers the situation is much better. The whole sequence can be processed in a single forward pass, the loss at every position is computed simultaneously, and gradients flow back through one giant computational graph. This parallelism is the main reason transformer language models scaled up so much faster than RNN language models.
Across all unidirectional architectures, training uses teacher forcing: the input at each position is the ground-truth previous token, not the model's own prediction. Teacher forcing is fast and stable, but it leaves a small exposure bias because at inference the model has to consume its own outputs. Scheduled sampling and reinforcement-style fine-tuning have been proposed to close this gap; modern LLMs largely tolerate it because the simplicity wins.
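A minimal sketch of the shifted-target objective with teacher forcing, assuming logits of shape (batch, seq_len, vocab) from a causal model and ground-truth token ids of shape (batch, seq_len):

```python
# Next-token training: the prediction at position t is scored against token t+1.
# Inputs are the ground-truth tokens (teacher forcing); every position trains in parallel.
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    pred = logits[:, :-1, :]       # drop the prediction made after the last token
    targets = tokens[:, 1:]        # drop the first token, which has no predecessor
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), targets.reshape(-1))
```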
A unidirectional generative model produces a probability distribution over the next token at every step. Choosing an actual sequence requires a decoding strategy. The common options are listed below; the unidirectional language model article covers them in more depth in the context of LLMs.
| Method | Description |
|---|---|
| Greedy | Pick the highest-probability token at every step |
| Beam search | Maintain k partial sequences, expand each, keep the best k |
| Temperature sampling | Divide logits by T, then sample; T < 1 sharpens, T > 1 flattens |
| Top-k sampling | Restrict to the k most likely tokens, renormalize, sample |
| Nucleus (top-p) sampling | Restrict to the smallest set with cumulative probability >= p |
| Speculative decoding | A small draft model proposes tokens, the large model verifies in parallel |
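Two of the table entries, temperature and nucleus sampling, compose naturally; the sketch below assumes a 1D logits tensor over the vocabulary and illustrative default values.

```python
# Temperature plus nucleus (top-p) sampling over one decoding step's logits.
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    probs = F.softmax(logits / temperature, dim=-1)       # T < 1 sharpens, T > 1 flattens
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p              # smallest set with mass >= p
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()                    # renormalize, then sample
    return sorted_ids[torch.multinomial(sorted_probs, 1)]
```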
Speculative decoding, introduced by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google in their 2023 ICML paper "Fast Inference from Transformers via Speculative Decoding," is worth flagging because it works specifically because the model is unidirectional. A small draft model and a large target model both have causal structure, so any prefix the draft proposes can be evaluated by the large model in one parallel pass over the prefix. The output distribution is mathematically identical to standard sampling from the large model, but throughput improves by 2x to 3x or more depending on the draft model and the target distribution.
Unidirectional models can be deployed in two regimes. Offline inference processes a complete input at once, like a batch transcription job or a document summarization run; the unidirectional structure is used during training, but at inference the whole sequence is available. Online or streaming inference processes input as it arrives, emitting output incrementally with low latency.
Streaming is what most users of voice assistants, live captioning, and chat LLMs experience. A chat model showing tokens as they are generated is doing streaming inference: the unidirectional decoder samples the next token, emits it to the screen, and feeds it back into the context for the next step. The same architecture would not be able to stream if it were bidirectional, because the right-context vectors at any position depend on tokens that have not been generated yet.
Key-value caching is the standard optimization. Because past keys and values never change in a causal transformer, they can be computed once and reused. Generating token n only requires one new query, one new key, one new value, and one attention computation between the new query and all past keys. Without this trick, generating an N-token completion would cost O(N^3) attention ops; with it, the cost is O(N^2). Variants like grouped-query attention and multi-query attention reduce the size of the cache itself, which is critical for serving large batches at long context.
The unidirectional design has real downsides outside its sweet spot.
The most obvious one is the loss of right context for understanding tasks. The textbook example is the homograph: in "He went to the bank to deposit his check," the word "bank" is disambiguated by the later word "deposit." A unidirectional model labeling "bank" at position 5 has no access to "deposit" at position 7, and has to either store everything in a forward state and hope the right signal makes it through, or wait until the whole sentence has been read and then use the final state. Bidirectional models avoid this entirely. For sequence labeling, named entity recognition, sentiment classification, and most retrieval tasks, all else equal, a bidirectional encoder outperforms a unidirectional one of the same size.
A second limitation is the sequential cost of inference. Generation is inherently one step at a time, so latency scales linearly with output length. Speculative decoding helps but does not fundamentally change the serialization. Bidirectional models doing classification or labeling are usually faster at inference because they need only one forward pass, not one per output token.
A third issue is that masked-token completion is not the natural objective. A model trained to predict the next token can be coaxed into filling in a missing token in the middle of a passage (with prefix-suffix prompting or fill-in-the-middle pretraining), but it is doing extra work to overcome its training. BERT-style models do this directly because that is what they were trained to do.
In 2026 the field has settled into a fairly stable division of labor:
| Use case | Architecture | Direction | Examples |
|---|---|---|---|
| Generative chat, code, agents | Decoder-only transformer | Unidirectional | GPT-4, Claude 4, Gemini 3, Llama 4, Mistral, DeepSeek-V3, Qwen 3, Grok |
| Embeddings, retrieval, classification | Encoder-only transformer | Bidirectional | BGE, E5, Sentence-BERT, RoBERTa, DeBERTa, NV-Embed |
| Translation, summarization, structured output | Encoder-decoder transformer | Mixed | T5, BART, FLAN-T5 |
| Streaming speech recognition | RNN-T, causal Conformer | Unidirectional | Production ASR on phones, smart speakers, real-time captioning |
| High-fidelity audio synthesis | Causal CNN, autoregressive transformer, diffusion | Unidirectional (autoregressive variants) | WaveNet, neural codec models |
| Time series forecasting | Causal CNN, LSTM, transformer decoder | Unidirectional | Demand forecasting, energy load, finance |
The overall direction of travel since GPT-3 has been toward unidirectional decoder-only architectures for everything generative, with bidirectional encoders surviving in retrieval and embedding niches. Encoder-decoder models persist in translation and a few academic settings. State-space and linear-attention architectures (Mamba, RWKV, RetNet) are unidirectional by construction and represent the main current challenge to the dominance of causal transformers, although none has displaced them at frontier scale.
"Unidirectional" and "causal" are sometimes used interchangeably in machine learning, which can be confusing because causal inference in statistics refers to something different. Causal inference is about distinguishing correlation from causation: deciding whether intervening on variable X would change variable Y, in the do-calculus sense of Judea Pearl's framework. Causal masking in a sequence model is about respecting the time arrow in the input, not about identifying interventional effects.
The two ideas share an intuition (the past influences the future, not the other way around) but they live at different levels of abstraction. A causal language model and a causal graph have the word "causal" for related but distinct reasons. The sequence-modeling sense is the one used in this article and in almost all deep learning literature.
Imagine walking through a tunnel in which you can see everything you have already passed but nothing around the next corner. A unidirectional model is a computer that has to make decisions while walking through this tunnel. At every step, it can only use what it has already seen, never what lies ahead.
This sounds like a disadvantage, and sometimes it is, but it is also the only way to do certain jobs. If the computer's job is to write a story one word at a time, then the next word does not exist yet, and there is nothing to peek at anyway. If the computer's job is to listen to someone talking and write down the words as they come out, then the rest of the sentence has not been said yet. A model that needs to peek into the future cannot do these jobs at all, because the future is not there.
Unidirectional models are the storytellers and the live transcribers. Bidirectional models are the editors and the proofreaders, who get to read the whole thing before deciding what each word means. Both are useful; the right one depends on whether you are reading a finished book or watching a movie as it plays.