Bahdanau attention is the first attention mechanism proposed for neural machine translation. It was introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. The paper was submitted to arXiv on September 1, 2014 (arXiv:1409.0473) and published as a conference paper at ICLR 2015. It is one of the most cited papers in the history of machine learning, with over 50,000 citations on Google Scholar as of 2025.
The paper addressed a fundamental limitation of encoder-decoder models for sequence-to-sequence tasks: the requirement to compress an entire source sentence into a single fixed-length vector. This fixed-length representation acted as an information bottleneck, causing translation quality to degrade significantly as sentence length increased. Bahdanau and colleagues proposed a solution that allowed the decoder to selectively attend to different parts of the source sentence at each decoding step, rather than relying on a single compressed vector. This "soft search" over encoder hidden states became known as the attention mechanism, and it fundamentally changed how neural networks process sequential data.
Bahdanau attention is also referred to as additive attention or concat attention because of its use of an additive scoring function based on a feedforward neural network. It stands in contrast to the multiplicative (dot-product) attention later proposed by Luong et al. (2015). Together, these two papers laid the groundwork for all subsequent attention-based architectures, including the Transformer (Vaswani et al., 2017), which forms the basis of modern large language models such as GPT-4, LLaMA, and Claude.
Before Bahdanau attention, the dominant approach to neural machine translation was the encoder-decoder framework. This architecture was independently proposed by Cho et al. (2014) in "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" and by Sutskever, Vinyals, and Le (2014) in "Sequence to Sequence Learning with Neural Networks."
In this framework, an encoder recurrent neural network (RNN) reads the source sentence token by token and produces a sequence of hidden states. After processing the entire input, the encoder's final hidden state is used as a fixed-length context vector, which summarizes the meaning of the entire source sentence. A decoder RNN then generates the target sentence one token at a time, conditioned on this fixed-length vector and the previously generated tokens.
Sutskever et al. (2014) demonstrated that this approach could achieve strong results using deep LSTM networks, reaching a BLEU score of 34.8 on the WMT 2014 English-to-French translation task. However, they also found it necessary to reverse the order of the source sentence to improve performance, suggesting that the model struggled to handle long-range dependencies within the fixed-length representation.
The fixed-length context vector was the core weakness of the basic encoder-decoder architecture. Regardless of whether the source sentence contained 5 words or 50 words, all of the information had to be compressed into a single vector of the same dimensionality. For short sentences, this worked reasonably well because the vector had enough capacity to capture the essential meaning. For longer sentences, the vector became an information bottleneck that forced the network to discard or conflate important details.
Cho et al. (2014b), in their paper "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches," provided empirical evidence for this problem. They showed that the performance of the basic encoder-decoder deteriorated rapidly as the length of the input sentence increased, with particularly sharp degradation beyond 20 to 30 tokens. This finding motivated the search for a mechanism that could handle variable-length inputs more gracefully.
Bahdanau et al. (2014) explicitly stated their hypothesis in the paper's abstract: "We conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture." Their proposed solution was to allow the model to "automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly."
Dzmitry Bahdanau is a machine learning researcher who completed his PhD at Mila (Quebec Artificial Intelligence Institute) and the Université de Montréal under the supervision of Yoshua Bengio. He previously studied at Jacobs University Bremen in Germany. Bahdanau is a Canada CIFAR AI Chair at Mila, an adjunct professor at the School of Computer Science at McGill University, and has worked at ServiceNow Research. His research interests include natural language understanding, semantic parsing, language user interfaces, systematic generalization, and hybrid neural-symbolic systems. He invented the content-based neural attention mechanism that became a core building block of modern deep learning-based natural language processing.
Kyunghyun Cho is a South Korean-born computer scientist. He received his Bachelor of Science in Computer Science from KAIST in 2009 and his Doctor of Science from Aalto University in Finland in 2014. He was a postdoctoral fellow at the Université de Montréal under Yoshua Bengio until 2015. In 2015 he joined New York University, where he became a tenured professor of Computer Science and Data Science in 2019. Cho is the co-inventor of the Gated Recurrent Unit (GRU), which was proposed in his earlier 2014 paper on the RNN Encoder-Decoder architecture.
Yoshua Bengio is a Canadian computer scientist and one of the pioneers of deep learning. He is a Full Professor at the Université de Montréal and the founder and scientific advisor of Mila. In 2018, Bengio received the ACM A.M. Turing Award (often called the "Nobel Prize of Computing") jointly with Geoffrey Hinton and Yann LeCun for their foundational work on deep learning. Bengio is one of the most cited computer scientists in the world.
The model architecture proposed by Bahdanau et al. consists of three main components: a bidirectional RNN encoder, an attention mechanism (alignment model), and a GRU-based decoder. The paper referred to this model as RNNsearch, because the decoder "searches" through the source sentence annotations at each time step.
Unlike the basic encoder-decoder, which used a unidirectional RNN that only read the source sentence from left to right, Bahdanau et al. employed a bidirectional RNN (BiRNN) as the encoder. A bidirectional RNN consists of two separate RNNs: a forward RNN that reads the sentence from the first token to the last, and a backward RNN that reads the sentence from the last token to the first.
Given a source sentence x = (x_1, x_2, ..., x_T), the forward RNN produces a sequence of forward hidden states:
h_j (forward) = f(x_j, h_{j-1} (forward))
The backward RNN produces a sequence of backward hidden states:
h_j (backward) = f(x_j, h_{j+1} (backward))
For each position j in the source sentence, the annotation vector h_j is obtained by concatenating the forward and backward hidden states:
h_j = [h_j (forward); h_j (backward)]
This concatenation ensures that each annotation h_j contains information about the words surrounding position j, capturing both preceding and following context. Because the forward hidden state at position j encodes information about all tokens up to and including x_j, and the backward hidden state encodes information about all tokens from x_T down to x_j, the annotation h_j effectively summarizes the entire sentence with a focus on the neighborhood around the j-th word.
Both the forward and backward RNNs used Gated Recurrent Units (GRUs) as the recurrent cell. Each directional RNN had a hidden dimension of 1000 units, so after concatenation, each annotation vector h_j had a dimensionality of 2000.
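The construction of the annotation vectors can be sketched in a few lines of NumPy. This is a toy illustration rather than the paper's model: a plain tanh RNN cell stands in for the GRU, the dimensions are tiny instead of the paper's 1000 hidden units per direction, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h = 5, 4, 3          # toy sizes; the paper used 620-dim embeddings, 1000 hidden units
X = rng.normal(size=(T, d_in))  # embedded source tokens x_1 .. x_T

# A plain tanh RNN cell stands in for the paper's GRU.
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
def rnn_step(x, h_prev):
    return np.tanh(W @ x + U @ h_prev)

# Forward pass: h_j (forward) summarizes x_1 .. x_j.
h_fwd, h = [], np.zeros(d_h)
for j in range(T):
    h = rnn_step(X[j], h)
    h_fwd.append(h)

# Backward pass: h_j (backward) summarizes x_j .. x_T.
h_bwd, h = [None] * T, np.zeros(d_h)
for j in reversed(range(T)):
    h = rnn_step(X[j], h)
    h_bwd[j] = h

# Annotation h_j concatenates both directions (2000-dim in the paper).
annotations = np.stack([np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)])
print(annotations.shape)  # (5, 6)
```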
The attention mechanism is the central innovation of the paper. Instead of relying on a single context vector for the entire translation, the attention mechanism computes a different context vector c_i for each target word position i. This context vector is a weighted sum of all encoder annotations, where the weights reflect how relevant each source position is for generating the current target word.
The computation proceeds in three steps:
Step 1: Compute alignment scores. For each target position i, the alignment model computes an energy score e_{ij} between the previous decoder hidden state s_{i-1} and each encoder annotation h_j:
e_{ij} = a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)
Here, W_a and U_a are weight matrices and v_a is a weight vector; all three are learnable parameters. The alignment model is a single-layer feedforward neural network with a tanh activation function. Its two inputs, the decoder's previous hidden state and one encoder annotation, are combined by summing their linear transformations, which is equivalent to applying a single weight matrix to their concatenation (hence the alternative name "concat attention"). The output is a scalar energy score indicating how well the two states match.
Step 2: Normalize scores with softmax. The energy scores are passed through a softmax function to produce the attention weights alpha_{ij}:
alpha_{ij} = exp(e_{ij}) / sum_{k=1}^{T_x} exp(e_{ik})
The softmax ensures that the attention weights for each decoder time step sum to 1, forming a valid probability distribution over the source positions. A weight alpha_{ij} close to 1 means the model is focusing heavily on source position j when generating the i-th target word, while a weight close to 0 means that position is largely ignored.
Step 3: Compute the context vector. The context vector c_i is computed as the weighted sum of all encoder annotations:
c_i = sum_{j=1}^{T_x} alpha_{ij} * h_j
This context vector is a soft, differentiable approximation of a hard alignment between source and target words. Instead of selecting a single source word to attend to (as in traditional statistical alignment models), the attention mechanism distributes its focus across all source positions, assigning higher weights to the most relevant ones.
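The three steps can be written out directly in NumPy. This is a minimal sketch: the dimensions are toy-sized and the alignment parameters are random placeholders rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(1)

T_x, d_s, d_ann, d_a = 6, 3, 4, 5    # toy sizes; d_a is the alignment model dimension
s_prev = rng.normal(size=d_s)        # previous decoder state s_{i-1}
H = rng.normal(size=(T_x, d_ann))    # encoder annotations h_1 .. h_{T_x}

# Learnable parameters of the alignment model (randomly initialized here).
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_ann))
v_a = rng.normal(size=d_a)

# Step 1: additive alignment scores e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j).
e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # shape (T_x,)

# Step 2: softmax over source positions gives the attention weights alpha_{ij}.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Step 3: the context vector c_i is the alpha-weighted sum of the annotations.
c = alpha @ H                                  # shape (d_ann,)

print(round(alpha.sum(), 6))  # 1.0 — a valid probability distribution
```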
The decoder is a unidirectional GRU-based RNN that generates the target sentence one token at a time. At each time step i, the decoder hidden state s_i is computed as a function of three inputs:
s_i = f(s_{i-1}, y_{i-1}, c_i)
where s_{i-1} is the previous decoder hidden state, y_{i-1} is the embedding of the previously generated word, and c_i is the context vector computed by the attention mechanism.
The decoder hidden state had a dimensionality of 1000 units. The initial decoder hidden state s_0 was computed from the last backward encoder hidden state using a tanh activation:
s_0 = tanh(W_s * h_1 (backward))
The output probability distribution over the target vocabulary was computed using a maxout hidden layer (with l = 500 units) followed by a softmax layer. The maxout layer took as input the decoder hidden state s_i, the context vector c_i, and the embedding of the previous target word y_{i-1}.
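The maxout output layer can be sketched as follows. This is a toy illustration, with small dimensions in place of the paper's l = 500 maxout units and 30,000-word vocabulary, and random matrices standing in for the trained output weights; the defining operation is that each maxout unit takes the maximum over a pair of pre-activation values.

```python
import numpy as np

rng = np.random.default_rng(2)

d_s, d_c, d_y, l, V = 3, 4, 2, 5, 7   # toy sizes; the paper used l = 500 maxout units
s_i = rng.normal(size=d_s)            # decoder hidden state
c_i = rng.normal(size=d_c)            # context vector
y_prev = rng.normal(size=d_y)         # embedding of the previous target word

# The pre-activation has 2l units; each maxout unit takes the max over one pair.
W_o = rng.normal(size=(2 * l, d_s + d_c + d_y))
t_tilde = W_o @ np.concatenate([s_i, c_i, y_prev])
t = t_tilde.reshape(l, 2).max(axis=1)          # maxout: l units

# Softmax over the target vocabulary yields the output distribution.
logits = rng.normal(size=(V, l)) @ t
p = np.exp(logits - logits.max())
p /= p.sum()
print(p.shape, round(p.sum(), 6))  # (7,) 1.0
```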
The complete mathematical formulation of the Bahdanau attention model can be summarized as follows.
Given a source sentence x = (x_1, ..., x_{T_x}) and a target sentence y = (y_1, ..., y_{T_y}), the model defines the conditional probability:
p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
where g is a nonlinear function (the maxout + softmax output layer), s_i is the decoder hidden state, and c_i is the context vector.
The full set of equations is:
| Component | Equation | Description |
|---|---|---|
| Forward encoder | h_j (fwd) = GRU(x_j, h_{j-1} (fwd)) | Reads source left-to-right |
| Backward encoder | h_j (bwd) = GRU(x_j, h_{j+1} (bwd)) | Reads source right-to-left |
| Annotation vector | h_j = [h_j (fwd); h_j (bwd)] | Concatenation of both directions |
| Alignment score | e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j) | Additive scoring function |
| Attention weight | alpha_{ij} = softmax(e_{ij}) = exp(e_{ij}) / sum_k exp(e_{ik}) | Normalized relevance |
| Context vector | c_i = sum_j alpha_{ij} * h_j | Weighted sum of annotations |
| Decoder state | s_i = GRU(y_{i-1}, s_{i-1}, c_i) | Conditioned on attention output |
| Output | p(y_i) = softmax(maxout(s_i, c_i, y_{i-1})) | Word probability distribution |
The attention mechanism proposed by Bahdanau et al. is called "additive" because the alignment score is computed by adding two linear transformations (W_a s_{i-1} and U_a h_j) and passing the result through a tanh nonlinearity. This stands in contrast to "multiplicative" or "dot-product" attention, where the score is computed directly as a dot product between the query and key vectors. The additive formulation introduces more learnable parameters (the matrices W_a, U_a, and the vector v_a) but can be more expressive because it passes the combined representation through a nonlinearity.
A defining characteristic of Bahdanau attention is that it implements a soft alignment between source and target words. Traditional word alignment models in statistical machine translation (such as the IBM alignment models) assigned each target word to exactly one source word (hard alignment). This discrete selection is not differentiable and cannot be trained with standard backpropagation.
In contrast, Bahdanau's soft alignment computes a probability distribution over all source positions using the softmax function. Because softmax is a continuous, differentiable function, gradients flow smoothly through the entire attention mechanism. This allows the alignment model to be trained jointly with the rest of the encoder-decoder network using standard gradient-based optimization, without requiring any external alignment supervision or reinforcement learning techniques.
Xu et al. (2015) later formalized this distinction in the context of image captioning, explicitly defining "soft attention" (differentiable, trained with backpropagation) and "hard attention" (stochastic, trained with REINFORCE). Their soft attention mechanism for images was a direct adaptation of Bahdanau's approach to the visual domain.
The model was trained on the WMT 2014 English-to-French translation task. The authors used a subset of the available parallel corpora, which originally totaled approximately 850 million words from several sources:
| Corpus | Approximate Size |
|---|---|
| Europarl v7 | 61 million words |
| News Commentary | 5.5 million words |
| United Nations | 421 million words |
| Crawled corpus 1 | 90 million words |
| Crawled corpus 2 | 272.5 million words |
| Total (original) | ~850 million words |
| After data selection | ~348 million words |
The data was reduced from 850 million to 348 million words using the data selection method proposed by Axelrod et al. (2011). Both source and target languages used a vocabulary of the 30,000 most frequent words, with all other words replaced by a special [UNK] token.
The model was trained with the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| Word embedding dimension | 620 |
| Encoder hidden units (per direction) | 1000 |
| Decoder hidden units | 1000 |
| Alignment model dimension | 1000 |
| Maxout hidden layer units | 500 |
| Source vocabulary size | 30,000 |
| Target vocabulary size | 30,000 |
| Optimizer | SGD with Adadelta (epsilon=10^-6, rho=0.95) |
| Batch size | 80 sentences |
| Gradient clipping | L2-norm threshold of 1 |
| Training duration | ~5 days per model |
| Decoding strategy | Beam search |
Two sets of models were trained. Models with the suffix "-30" were trained on sentences of up to 30 words in length, while models with the suffix "-50" were trained on sentences of up to 50 words. This allowed the authors to evaluate how well the attention mechanism handled longer sentences compared to the baseline.
The development set was formed by concatenating news-test-2012 and news-test-2013. The test set was news-test-2014, consisting of 3,003 sentences. BLEU scores were computed on this test set using the standard multi-bleu.perl script.
The main results of the paper are summarized in Table 1, which compares the attention-based model (RNNsearch) against the baseline encoder-decoder without attention (RNNencdec) and the phrase-based statistical machine translation system Moses.
| Model | BLEU (All) | BLEU (No UNK) |
|---|---|---|
| RNNencdec-30 | 13.93 | 24.19 |
| RNNsearch-30 | 21.50 | 31.44 |
| RNNencdec-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50* | 28.45 | 36.15 |
| Moses (phrase-based SMT) | 33.30 | 35.63 |
The "All" column reports BLEU scores on all test sentences. The "No UNK" column reports scores only on sentences that did not contain any unknown word tokens, providing a fairer comparison since the neural models had no mechanism for handling out-of-vocabulary words. The asterisk (*) on RNNsearch-50* indicates that this model was trained for an extended period until performance on the development set plateaued.
Several important observations emerged from the results:
Attention dramatically improved performance. RNNsearch-30 achieved a BLEU score of 21.50, compared to 13.93 for RNNencdec-30 without attention. This represented a gain of 7.57 BLEU points, a very large improvement. The gap was even larger for models trained on longer sentences: RNNsearch-50 scored 26.75 versus 17.82 for RNNencdec-50, a difference of 8.93 points.
Attention handled long sentences much better. The performance curves plotted in the paper showed that the baseline RNNencdec model's BLEU score declined sharply for sentences longer than about 20 words. In contrast, RNNsearch maintained much more stable performance across all sentence lengths, confirming the hypothesis that the fixed-length vector was the primary bottleneck.
Competitive with phrase-based SMT. When unknown words were excluded from evaluation, RNNsearch-50* achieved a BLEU score of 36.15, surpassing the Moses phrase-based system's 35.63. This was a remarkable result at the time, as neural machine translation was still in its infancy and phrase-based systems had been the dominant paradigm for over a decade.
The model learned meaningful alignments. Qualitative analysis of the attention weights revealed that the model learned sensible word alignments between English and French. The attention heatmaps (visualized as matrices with source words on one axis and target words on the other) showed that the model correctly learned to focus on the corresponding source word when generating each target word. For example, when generating a French adjective that follows its noun (the opposite word order from English), the attention mechanism correctly shifted its focus to the adjective's position in the English source.
In 2015, Luong, Pham, and Manning published "Effective Approaches to Attention-based Neural Machine Translation" at EMNLP 2015, building directly on Bahdanau's work. Luong et al. proposed several alternative scoring functions for computing attention weights and introduced the distinction between global and local attention.
Bahdanau attention computes alignment scores using an additive (concat) formulation:
score(s, h) = v_a^T tanh(W_a s + U_a h)
Luong attention introduced three alternative scoring functions:
| Scoring Function | Formula | Type |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Multiplicative |
| General | score(s_t, h_j) = s_t^T W_a h_j | Multiplicative |
| Concat | score(s_t, h_j) = v_a^T tanh(W_a [s_t; h_j]) | Additive |
The dot product scoring function is the simplest and requires no additional learnable parameters. The general scoring function introduces a single weight matrix W_a, which allows the model to learn a more flexible similarity measure. The concat scoring function is similar to Bahdanau's formulation.
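The three scoring functions take only a few lines of NumPy each. The dimensions and weights below are toy placeholders, and the dot-product form assumes s_t and h_j share the same dimensionality:

```python
import numpy as np

rng = np.random.default_rng(3)

d = 4                          # toy dimension shared by s_t and h_j
s_t = rng.normal(size=d)       # current decoder state (Luong uses s_t, not s_{t-1})
h_j = rng.normal(size=d)       # one encoder state

# Dot product: no extra learnable parameters.
score_dot = s_t @ h_j

# General: one learned weight matrix W_a between the two states.
W_a = rng.normal(size=(d, d))
score_general = s_t @ W_a @ h_j

# Concat: a feedforward network over the concatenated states,
# closely resembling Bahdanau's additive form.
d_a = 5
W_c = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)
score_concat = v_a @ np.tanh(W_c @ np.concatenate([s_t, h_j]))

print(score_dot, score_general, score_concat)
```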
| Aspect | Bahdanau Attention (2014) | Luong Attention (2015) |
|---|---|---|
| Scoring function | Additive (feedforward network) | Multiplicative (dot product, general, or concat) |
| Decoder state used | Previous state s_{i-1} | Current state s_t |
| Encoder type | Bidirectional RNN | Unidirectional (top layer of stacked LSTM) |
| Attention scope | Global only | Global and local |
| Attention placement | Before decoder RNN step | After decoder RNN step |
| Computational cost | Higher (requires feedforward network) | Lower (dot product is efficient) |
| Number of parameters | More (W_a, U_a, v_a) | Fewer (W_a only, or none for dot product) |
One of the most significant practical differences is which decoder hidden state is used to compute the alignment scores. Bahdanau attention uses the previous decoder hidden state s_{i-1}, meaning the attention is computed before the decoder RNN processes the current step. Luong attention uses the current decoder hidden state s_t, meaning the attention is computed after the decoder RNN step. This design choice affects when the context vector is available during the decoding process.
Bahdanau attention is a global attention mechanism: it computes alignment scores over all source positions for every decoder step. Luong et al. additionally proposed local attention, which restricts the attention to a small window of source positions [p_t - D, p_t + D] centered around an aligned position p_t. Local attention comes in two variants: local-m, which assumes roughly monotonic alignment and simply sets p_t = t, and local-p, which predicts p_t with a small network and reweights the window with a Gaussian centered at p_t.

Local attention reduces the computational cost from O(T_x) to O(2D + 1) per decoder step and can be seen as a blend between Bahdanau's soft global attention and the hard attention approach used in some vision models.
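Luong et al.'s predictive variant (local-p) predicts an aligned position p_t and reweights the windowed softmax with a Gaussian of standard deviation D/2 centered at p_t. A rough sketch, with a hard-coded toy p_t standing in for the model's learned prediction:

```python
import numpy as np

rng = np.random.default_rng(4)

T_x, D = 20, 3                     # source length and window half-width
scores = rng.normal(size=T_x)      # raw alignment scores for one decoder step
p_t = 11.4                         # predicted aligned position (a toy value here)

# Restrict attention to the window [p_t - D, p_t + D].
center = int(round(p_t))
lo, hi = max(0, center - D), min(T_x, center + D + 1)
window = np.arange(lo, hi)

# Softmax within the window only, then reweight by a Gaussian centered
# at p_t with standard deviation D/2, as in the local-p variant.
alpha = np.exp(scores[window] - scores[window].max())
alpha /= alpha.sum()
alpha *= np.exp(-((window - p_t) ** 2) / (2 * (D / 2) ** 2))

print(len(window))  # 7 — only 2D + 1 positions (or fewer at the boundaries)
```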
Bahdanau attention was the catalyst for an entire family of attention-based architectures that transformed deep learning, running from Luong attention and visual attention for image captioning (both 2015) through to the Transformer (2017).
The conceptual lineage from Bahdanau attention to the Transformer is direct. Bahdanau showed that allowing the decoder to dynamically attend to encoder hidden states could eliminate the fixed-length bottleneck. Luong simplified the scoring function to a dot product. The Transformer generalized this idea by (1) applying attention to the input sequence itself (self-attention), (2) using multiple parallel attention heads, (3) introducing the query-key-value framework, and (4) removing the recurrent connections entirely, enabling full parallelization during training.
The Transformer's scaled dot-product attention is:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
This can be understood as a scaled version of Luong's dot-product attention, extended from a sequence-to-sequence setting to a self-attention setting where queries, keys, and values all come from learned linear projections of the same input. The scaling factor 1/sqrt(d_k) prevents the dot products from growing too large in high dimensions, keeping the softmax in a well-behaved gradient regime.
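Scaled dot-product attention is compact enough to write directly in NumPy. This sketch uses toy dimensions and random inputs; in a real Transformer, Q, K, and V would come from learned linear projections of the layer input.

```python
import numpy as np

rng = np.random.default_rng(5)

n, d_k, d_v = 4, 8, 6            # sequence length, key dim, value dim (toy sizes)
Q = rng.normal(size=(n, d_k))    # queries
K = rng.normal(size=(n, d_k))    # keys
V = rng.normal(size=(n, d_v))    # values

def scaled_dot_product_attention(Q, K, V):
    # Scores for every query-key pair, scaled by sqrt(d_k) to keep
    # the softmax in a well-behaved gradient regime.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax: each query gets a distribution over all positions.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 6)
```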
Xu et al. (2015) published "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" at ICML 2015, which directly adapted Bahdanau's attention mechanism to the task of image captioning. In their model, the encoder was a convolutional neural network (CNN) that produced a grid of feature vectors (one per spatial location in the image), and the decoder was an LSTM that generated a caption word by word. At each decoding step, the attention mechanism computed a weighted combination of the spatial feature vectors, allowing the model to "look" at different parts of the image when generating different words.
Xu et al. were the first to formally distinguish between soft attention (the differentiable weighted sum approach from Bahdanau) and hard attention (a stochastic approach that selects a single location, trained with the REINFORCE algorithm). Their soft attention mechanism is mathematically equivalent to Bahdanau attention, with CNN feature vectors replacing RNN hidden states as the set of values being attended to. The paper demonstrated that attention was not limited to sequence-to-sequence tasks but could be applied anywhere a model needed to selectively focus on parts of a structured input.
The Bahdanau attention paper has accumulated over 50,000 citations on Google Scholar, placing it among the most cited papers in all of machine learning and computer science. Its influence extends far beyond machine translation: attention mechanisms are now fundamental components of models for text generation, speech recognition, image classification, protein structure prediction, reinforcement learning, and many other domains.
The following table provides a comprehensive comparison of the major attention mechanism variants, from the original Bahdanau attention through to the Transformer's self-attention.
| Feature | No Attention (Encoder-Decoder) | Bahdanau Attention (2014) | Luong Attention (2015) | Transformer Self-Attention (2017) |
|---|---|---|---|---|
| Context vector | Fixed (final encoder state) | Dynamic (weighted sum of encoder states) | Dynamic (weighted sum of encoder states) | Dynamic (weighted sum of value vectors) |
| Scoring function | N/A | Additive: v^T tanh(Ws + Uh) | Dot, general, or concat | Scaled dot product: QK^T / sqrt(d_k) |
| Encoder type | Unidirectional RNN | Bidirectional RNN | Stacked unidirectional LSTM | No RNN; positional encoding + self-attention |
| Decoder state used | N/A | Previous state s_{i-1} | Current state s_t | Current layer output |
| Attention scope | N/A | Global (all source positions) | Global or local (windowed) | Global (all positions in sequence) |
| Self-attention | No | No | No | Yes |
| Multi-head | No | No | No | Yes (h parallel heads) |
| Parallelizable | Limited (sequential RNN) | Limited (sequential RNN) | Limited (sequential RNN) | Fully parallelizable |
| Handles long sequences | Poor (fixed bottleneck) | Good (dynamic context) | Good (dynamic context) | Excellent (direct token-to-token paths) |
| Number of extra parameters | None | W_a, U_a, v_a | W_a (general) or none (dot) | W_Q, W_K, W_V, W_O per head |
| Training complexity per step | O(1) for context | O(T_x) per decoder step | O(T_x) per decoder step (global) | O(n^2) for self-attention |
| Year | 2014 | 2014 | 2015 | 2017 |
| Paper | Cho et al.; Sutskever et al. | Bahdanau, Cho, Bengio | Luong, Pham, Manning | Vaswani et al. |
One of the most compelling aspects of the Bahdanau attention paper was its qualitative analysis of the learned alignments. The authors visualized the attention weights as heatmap matrices, with source (English) words on one axis and target (French) words on the other. Each cell in the matrix represented the attention weight alpha_{ij}, with brighter cells indicating higher attention.
These visualizations revealed several interesting patterns:
Monotonic alignment for similar word order: When translating between English and French (both Subject-Verb-Object languages), the attention patterns often followed a roughly diagonal path, reflecting the similar word order of the two languages.
Non-monotonic alignment for reorderings: When the two languages required different word orders (for example, adjective placement differs between English and French), the attention mechanism correctly shifted its focus away from the diagonal to attend to the correct source word.
Many-to-one and one-to-many alignments: The model handled cases where multiple target words corresponded to a single source word, or where a single target word required information from multiple source words. For instance, French compound verb forms might attend to a single English verb.
Soft rather than hard focus: Even when one source position dominated the attention distribution, other positions still received small but nonzero weights. This soft attention behavior allowed the model to integrate contextual information from the surrounding words.
These alignment visualizations provided strong evidence that the model was learning meaningful, linguistically plausible correspondences between source and target words, rather than simply memorizing surface patterns.
Despite its groundbreaking contributions, Bahdanau attention had several limitations that subsequent work addressed:
Computational cost at each decoder step. The attention mechanism required computing alignment scores between the current decoder state and every encoder annotation at each time step. For very long source sentences, this O(T_x) computation per decoder step added significant overhead. Luong's local attention addressed this by restricting the window of attended positions.
Sequential decoding. Like all RNN-based encoder-decoder models, the Bahdanau architecture processed tokens sequentially during both encoding and decoding. This limited parallelization during training and inference. The Transformer architecture later solved this by replacing recurrence with self-attention, enabling full parallelization.
No self-attention. The attention mechanism only operated between the encoder and decoder (cross-attention). There was no mechanism for tokens within the source sentence to attend to each other, or for generated target tokens to attend to all previously generated tokens. Self-attention, introduced later, filled this gap.
Unknown word handling. The model used a fixed vocabulary of 30,000 words and mapped all out-of-vocabulary words to a single [UNK] token. This significantly hurt performance, as seen in the large gap between "All" and "No UNK" BLEU scores. Later approaches such as byte-pair encoding (BPE) and subword tokenization largely solved this problem.
Single attention head. The model used a single alignment model, meaning it could only compute one pattern of attention per decoder step. Multi-head attention, introduced by the Transformer, allowed the model to attend to different aspects of the input simultaneously through multiple parallel attention heads.
Imagine you are reading a book written in English and you need to tell the story in French. The old way of doing this was to read the entire book, close it, and then try to retell the whole story from memory. If the book was short, you could remember enough to do a good job. But if the book was long, you would forget important details.
Bahdanau attention is like being allowed to keep the book open while you retell the story. For each French sentence you need to say, you can look back at the English book and find the most relevant part. When translating the word for "cat," you look at the part of the English book that talks about the cat. When translating the word for "garden," you look at the part about the garden. You are not reading the whole book each time; instead, you are quickly scanning it and focusing on the part that helps you right now.