Bahdanau attention is the first attention mechanism proposed for neural machine translation. It was introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. The paper was submitted to arXiv on September 1, 2014 (arXiv:1409.0473) and published as a conference paper at ICLR 2015. It is one of the most cited papers in the history of machine learning, with over 50,000 citations on Google Scholar as of 2025.
The paper addressed a fundamental limitation of encoder-decoder models for sequence-to-sequence tasks: the requirement to compress an entire source sentence into a single fixed-length vector. This fixed-length representation acted as an information bottleneck, causing translation quality to degrade significantly as sentence length increased. Bahdanau and colleagues proposed a solution that allowed the decoder to selectively attend to different parts of the source sentence at each decoding step, rather than relying on a single compressed vector. This "soft search" over encoder hidden states became known as the attention mechanism, and it fundamentally changed how neural networks process sequential data.
Bahdanau attention is also referred to as additive attention or concat attention because of its use of an additive scoring function based on a feedforward neural network. It stands in contrast to the multiplicative (dot-product) attention later proposed by Luong et al. (2015). Together, these two papers laid the groundwork for all subsequent attention-based architectures, including the Transformer (Vaswani et al., 2017), which forms the basis of modern large language models such as GPT-4, LLaMA, and Claude.
Before Bahdanau attention, the dominant approach to neural machine translation was the encoder-decoder framework. This architecture was independently proposed by Cho et al. (2014) in "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" and by Sutskever, Vinyals, and Le (2014) in "Sequence to Sequence Learning with Neural Networks."
In this framework, an encoder recurrent neural network (RNN) reads the source sentence token by token and produces a sequence of hidden states. After processing the entire input, the encoder's final hidden state is used as a fixed-length context vector, which summarizes the meaning of the entire source sentence. A decoder RNN then generates the target sentence one token at a time, conditioned on this fixed-length vector and the previously generated tokens.
Sutskever et al. (2014) demonstrated that this approach could achieve strong results using deep LSTM networks, reaching a BLEU score of 34.8 on the WMT 2014 English-to-French translation task. However, they also found it necessary to reverse the order of the source sentence to improve performance, suggesting that the model struggled to handle long-range dependencies within the fixed-length representation.
The fixed-length context vector was the core weakness of the basic encoder-decoder architecture. Regardless of whether the source sentence contained 5 words or 50 words, all of the information had to be compressed into a single vector of the same dimensionality. For short sentences, this worked reasonably well because the vector had enough capacity to capture the essential meaning. For longer sentences, the vector became an information bottleneck that forced the network to discard or conflate important details.
Cho et al. (2014b), in their paper "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches," provided empirical evidence for this problem. They showed that the performance of the basic encoder-decoder deteriorated rapidly as the length of the input sentence increased, with particularly sharp degradation beyond 20 to 30 tokens. This finding motivated the search for a mechanism that could handle variable-length inputs more gracefully.
Bahdanau et al. (2014) explicitly stated their hypothesis in the paper's abstract: "We conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture." Their proposed solution was to allow the model to "automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly."
Dzmitry Bahdanau is a machine learning researcher who completed his PhD at Mila (Quebec Artificial Intelligence Institute) and the Université de Montréal under the supervision of Yoshua Bengio. He previously studied at Jacobs University Bremen in Germany. Bahdanau is a Canada CIFAR AI Chair at Mila, an adjunct professor at the School of Computer Science at McGill University, and has worked at ServiceNow Research. His research interests include natural language understanding, semantic parsing, language user interfaces, systematic generalization, and hybrid neural-symbolic systems. He invented the content-based neural attention mechanism that became a core building block of modern deep learning-based natural language processing.
Kyunghyun Cho is a South Korean-born computer scientist. He received his Bachelor of Science in Computer Science from KAIST in 2009 and his Doctor of Science from Aalto University in Finland in 2014. He was a postdoctoral fellow at the Université de Montréal under Yoshua Bengio until 2015. In 2015 he joined New York University, where he became a tenured professor of Computer Science and Data Science in 2019. Cho is the co-inventor of the Gated Recurrent Unit (GRU), which was proposed in his earlier 2014 paper on the RNN Encoder-Decoder architecture.
Yoshua Bengio is a Canadian computer scientist and one of the pioneers of deep learning. He is a Full Professor at the Université de Montréal and the founder and scientific advisor of Mila. In 2018, Bengio received the ACM A.M. Turing Award (often called the "Nobel Prize of Computing") jointly with Geoffrey Hinton and Yann LeCun for their foundational work on deep learning. Bengio is one of the most cited computer scientists in the world.
The model architecture proposed by Bahdanau et al. consists of three main components: a bidirectional RNN encoder, an attention mechanism (alignment model), and a GRU-based decoder. The paper referred to this model as RNNsearch, because the decoder "searches" through the source sentence annotations at each time step.
Unlike the basic encoder-decoder, which used a unidirectional RNN that only read the source sentence from left to right, Bahdanau et al. employed a bidirectional RNN (BiRNN) as the encoder. A bidirectional RNN consists of two separate RNNs: a forward RNN that reads the sentence from the first token to the last, and a backward RNN that reads the sentence from the last token to the first.
Given a source sentence x = (x_1, x_2, ..., x_T), the forward RNN produces a sequence of forward hidden states:
h_j (forward) = f(x_j, h_{j-1} (forward))
The backward RNN produces a sequence of backward hidden states:
h_j (backward) = f(x_j, h_{j+1} (backward))
For each position j in the source sentence, the annotation vector h_j is obtained by concatenating the forward and backward hidden states:
h_j = [h_j (forward); h_j (backward)]
This concatenation ensures that each annotation h_j contains information about the words surrounding position j, capturing both preceding and following context. Because the forward hidden state at position j encodes information about all tokens up to and including x_j, and the backward hidden state encodes information about all tokens from x_T down to x_j, the annotation h_j effectively summarizes the entire sentence with a focus on the neighborhood around the j-th word.
Both the forward and backward RNNs used Gated Recurrent Units (GRUs) as the recurrent cell. Each directional RNN had a hidden dimension of 1000 units, so after concatenation, each annotation vector h_j had a dimensionality of 2000.
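The construction of the annotation vectors can be sketched in a few lines of NumPy. This is a toy illustration rather than the paper's model: a plain tanh RNN cell stands in for the GRU, the dimensions are tiny instead of the paper's 1000 hidden units per direction, and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h = 5, 4, 3          # toy sizes; the paper used 620-dim embeddings, 1000 hidden units
X = rng.normal(size=(T, d_in))  # embedded source tokens x_1 .. x_T

# A plain tanh RNN cell stands in for the paper's GRU.
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
def rnn_step(x, h_prev):
    return np.tanh(W @ x + U @ h_prev)

# Forward pass: h_j (forward) summarizes x_1 .. x_j.
h_fwd, h = [], np.zeros(d_h)
for j in range(T):
    h = rnn_step(X[j], h)
    h_fwd.append(h)

# Backward pass: h_j (backward) summarizes x_j .. x_T.
h_bwd, h = [None] * T, np.zeros(d_h)
for j in reversed(range(T)):
    h = rnn_step(X[j], h)
    h_bwd[j] = h

# Annotation h_j concatenates both directions (2000-dim in the paper).
annotations = np.stack([np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)])
print(annotations.shape)  # (5, 6)
```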
The attention mechanism is the central innovation of the paper. Instead of relying on a single context vector for the entire translation, the attention mechanism computes a different context vector c_i for each target word position i. This context vector is a weighted sum of all encoder annotations, where the weights reflect how relevant each source position is for generating the current target word.
The computation proceeds in three steps:
Step 1: Compute alignment scores. For each target position i, the alignment model computes an energy score e_{ij} between the previous decoder hidden state s_{i-1} and each encoder annotation h_j:
e_{ij} = a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)
Here, W_a and U_a are weight matrices and v_a is a weight vector; all three are learnable parameters. The alignment model is a single-layer feedforward neural network with a tanh activation function. Its two inputs, the decoder's previous hidden state and one encoder annotation, are combined by summing their linear transformations, which is equivalent to applying a single weight matrix to their concatenation (hence the alternative name "concat attention"). The output is a scalar energy score indicating how well the two states match.
Step 2: Normalize scores with softmax. The energy scores are passed through a softmax function to produce the attention weights alpha_{ij}:
alpha_{ij} = exp(e_{ij}) / sum_{k=1}^{T_x} exp(e_{ik})
The softmax ensures that the attention weights for each decoder time step sum to 1, forming a valid probability distribution over the source positions. A weight alpha_{ij} close to 1 means the model is focusing heavily on source position j when generating the i-th target word, while a weight close to 0 means that position is largely ignored.
Step 3: Compute the context vector. The context vector c_i is computed as the weighted sum of all encoder annotations:
c_i = sum_{j=1}^{T_x} alpha_{ij} * h_j
This context vector is a soft, differentiable approximation of a hard alignment between source and target words. Instead of selecting a single source word to attend to (as in traditional statistical alignment models), the attention mechanism distributes its focus across all source positions, assigning higher weights to the most relevant ones.
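The three steps can be written out directly in NumPy. This is a minimal sketch: the dimensions are toy-sized and the alignment parameters are random placeholders rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(1)

T_x, d_s, d_ann, d_a = 6, 3, 4, 5    # toy sizes; d_a is the alignment model dimension
s_prev = rng.normal(size=d_s)        # previous decoder state s_{i-1}
H = rng.normal(size=(T_x, d_ann))    # encoder annotations h_1 .. h_{T_x}

# Learnable parameters of the alignment model (randomly initialized here).
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_ann))
v_a = rng.normal(size=d_a)

# Step 1: additive alignment scores e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j).
e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # shape (T_x,)

# Step 2: softmax over source positions gives the attention weights alpha_{ij}.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Step 3: the context vector c_i is the alpha-weighted sum of the annotations.
c = alpha @ H                                  # shape (d_ann,)

print(round(alpha.sum(), 6))  # 1.0 — a valid probability distribution
```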
The decoder is a unidirectional GRU-based RNN that generates the target sentence one token at a time. At each time step i, the decoder hidden state s_i is computed as a function of three inputs:
s_i = f(s_{i-1}, y_{i-1}, c_i)
where s_{i-1} is the previous decoder hidden state, y_{i-1} is the embedding of the previously generated word, and c_i is the context vector computed by the attention mechanism.
The decoder hidden state had a dimensionality of 1000 units. The initial decoder hidden state s_0 was computed from the last backward encoder hidden state using a tanh activation:
s_0 = tanh(W_s * h_1 (backward))
The output probability distribution over the target vocabulary was computed using a maxout hidden layer (with l = 500 units) followed by a softmax layer. The maxout layer took as input the decoder hidden state s_i, the context vector c_i, and the embedding of the previous target word y_{i-1}.
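The maxout output layer can be sketched as follows. This is a toy illustration, with small dimensions in place of the paper's l = 500 maxout units and 30,000-word vocabulary, and random matrices standing in for the trained output weights; the defining operation is that each maxout unit takes the maximum over a pair of pre-activation values.

```python
import numpy as np

rng = np.random.default_rng(2)

d_s, d_c, d_y, l, V = 3, 4, 2, 5, 7   # toy sizes; the paper used l = 500 maxout units
s_i = rng.normal(size=d_s)            # decoder hidden state
c_i = rng.normal(size=d_c)            # context vector
y_prev = rng.normal(size=d_y)         # embedding of the previous target word

# The pre-activation has 2l units; each maxout unit takes the max over one pair.
W_o = rng.normal(size=(2 * l, d_s + d_c + d_y))
t_tilde = W_o @ np.concatenate([s_i, c_i, y_prev])
t = t_tilde.reshape(l, 2).max(axis=1)          # maxout: l units

# Softmax over the target vocabulary yields the output distribution.
logits = rng.normal(size=(V, l)) @ t
p = np.exp(logits - logits.max())
p /= p.sum()
print(p.shape, round(p.sum(), 6))  # (7,) 1.0
```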
The complete mathematical formulation of the Bahdanau attention model can be summarized as follows.
Given a source sentence x = (x_1, ..., x_{T_x}) and a target sentence y = (y_1, ..., y_{T_y}), the model defines the conditional probability:
p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
where g is a nonlinear function (the maxout + softmax output layer), s_i is the decoder hidden state, and c_i is the context vector.
The full set of equations is:
| Component | Equation | Description |
|---|---|---|
| Forward encoder | h_j (fwd) = GRU(x_j, h_{j-1} (fwd)) | Reads source left-to-right |
| Backward encoder | h_j (bwd) = GRU(x_j, h_{j+1} (bwd)) | Reads source right-to-left |
| Annotation vector | h_j = [h_j (fwd); h_j (bwd)] | Concatenation of both directions |
| Alignment score | e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j) | Additive scoring function |
| Attention weight | alpha_{ij} = softmax(e_{ij}) = exp(e_{ij}) / sum_k exp(e_{ik}) | Normalized relevance |
| Context vector | c_i = sum_j alpha_{ij} * h_j | Weighted sum of annotations |
| Decoder state | s_i = GRU(y_{i-1}, s_{i-1}, c_i) | Conditioned on attention output |
| Output | p(y_i) = softmax(maxout(s_i, c_i, y_{i-1})) | Word probability distribution |
The attention mechanism proposed by Bahdanau et al. is called "additive" because the alignment score is computed by adding two linear transformations (W_a s_{i-1} and U_a h_j) and passing the result through a tanh nonlinearity. This stands in contrast to "multiplicative" or "dot-product" attention, where the score is computed directly as a dot product between the query and key vectors. The additive formulation introduces more learnable parameters (the matrices W_a, U_a, and the vector v_a) but can be more expressive because it passes the combined representation through a nonlinearity.
A defining characteristic of Bahdanau attention is that it implements a soft alignment between source and target words. Traditional word alignment models in statistical machine translation (such as the IBM alignment models) assigned each target word to exactly one source word (hard alignment). This discrete selection is not differentiable and cannot be trained with standard backpropagation.
In contrast, Bahdanau's soft alignment computes a probability distribution over all source positions using the softmax function. Because softmax is a continuous, differentiable function, gradients flow smoothly through the entire attention mechanism. This allows the alignment model to be trained jointly with the rest of the encoder-decoder network using standard gradient-based optimization, without requiring any external alignment supervision or reinforcement learning techniques.
Xu et al. (2015) later formalized this distinction in the context of image captioning, explicitly defining "soft attention" (differentiable, trained with backpropagation) and "hard attention" (stochastic, trained with REINFORCE). Their soft attention mechanism for images was a direct adaptation of Bahdanau's approach to the visual domain.
The model was trained on the WMT 2014 English-to-French translation task. The authors used a subset of the available parallel corpora, which originally totaled approximately 850 million words from several sources:
| Corpus | Approximate Size |
|---|---|
| Europarl v7 | 61 million words |
| News Commentary | 5.5 million words |
| United Nations | 421 million words |
| Crawled corpus 1 | 90 million words |
| Crawled corpus 2 | 272.5 million words |
| Total (original) | ~850 million words |
| After data selection | ~348 million words |
The data was reduced from 850 million to 348 million words using the data selection method proposed by Axelrod et al. (2011). Both source and target languages used a vocabulary of the 30,000 most frequent words, with all other words replaced by a special [UNK] token.
The model was trained with the following hyperparameters:
| Hyperparameter | Value |
|---|---|
| Word embedding dimension | 620 |
| Encoder hidden units (per direction) | 1000 |
| Decoder hidden units | 1000 |
| Alignment model dimension | 1000 |
| Maxout hidden layer units | 500 |
| Source vocabulary size | 30,000 |
| Target vocabulary size | 30,000 |
| Optimizer | SGD with Adadelta (epsilon=10^-6, rho=0.95) |
| Batch size | 80 sentences |
| Gradient clipping | L2-norm threshold of 1 |
| Training duration | ~5 days per model |
| Decoding strategy | Beam search |
Two sets of models were trained. Models with the suffix "-30" were trained on sentences of up to 30 words in length, while models with the suffix "-50" were trained on sentences of up to 50 words. This allowed the authors to evaluate how well the attention mechanism handled longer sentences compared to the baseline.
The development set was formed by concatenating news-test-2012 and news-test-2013. The test set was news-test-2014, consisting of 3,003 sentences. BLEU scores were computed on this test set using the standard multi-bleu.perl script.
The main results of the paper are summarized in Table 1, which compares the attention-based model (RNNsearch) against the baseline encoder-decoder without attention (RNNencdec) and the phrase-based statistical machine translation system Moses.
| Model | BLEU (All) | BLEU (No UNK) |
|---|---|---|
| RNNencdec-30 | 13.93 | 24.19 |
| RNNsearch-30 | 21.50 | 31.44 |
| RNNencdec-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50* | 28.45 | 36.15 |
| Moses (phrase-based SMT) | 33.30 | 35.63 |
The "All" column reports BLEU scores on all test sentences. The "No UNK" column reports scores only on sentences that did not contain any unknown word tokens, providing a fairer comparison since the neural models had no mechanism for handling out-of-vocabulary words. The asterisk (*) on RNNsearch-50* indicates that this model was trained for an extended period until performance on the development set plateaued.
Several important observations emerged from the results:
Attention dramatically improved performance. RNNsearch-30 achieved a BLEU score of 21.50, compared to 13.93 for RNNencdec-30 without attention. This represented a gain of 7.57 BLEU points, a very large improvement. The gap was even larger for models trained on longer sentences: RNNsearch-50 scored 26.75 versus 17.82 for RNNencdec-50, a difference of 8.93 points.
Attention handled long sentences much better. The performance curves plotted in the paper showed that the baseline RNNencdec model's BLEU score declined sharply for sentences longer than about 20 words. In contrast, RNNsearch maintained much more stable performance across all sentence lengths, confirming the hypothesis that the fixed-length vector was the primary bottleneck.
Competitive with phrase-based SMT. When unknown words were excluded from evaluation, RNNsearch-50* achieved a BLEU score of 36.15, surpassing the Moses phrase-based system's 35.63. This was a remarkable result at the time, as neural machine translation was still in its infancy and phrase-based systems had been the dominant paradigm for over a decade.
The model learned meaningful alignments. Qualitative analysis of the attention weights revealed that the model learned sensible word alignments between English and French. The attention heatmaps (visualized as matrices with source words on one axis and target words on the other) showed that the model correctly learned to focus on the corresponding source word when generating each target word. For example, when generating a French adjective that follows its noun (the opposite word order from English), the attention mechanism correctly shifted its focus to the adjective's position in the English source.
In 2015, Luong, Pham, and Manning published "Effective Approaches to Attention-based Neural Machine Translation" at EMNLP 2015, building directly on Bahdanau's work. Luong et al. proposed several alternative scoring functions for computing attention weights and introduced the distinction between global and local attention.
Bahdanau attention computes alignment scores using an additive (concat) formulation:
score(s, h) = v_a^T tanh(W_a s + U_a h)
Luong attention introduced three alternative scoring functions:
| Scoring Function | Formula | Type |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Multiplicative |
| General | score(s_t, h_j) = s_t^T W_a h_j | Multiplicative |
| Concat | score(s_t, h_j) = v_a^T tanh(W_a [s_t; h_j]) | Additive |
The dot product scoring function is the simplest and requires no additional learnable parameters. The general scoring function introduces a single weight matrix W_a, which allows the model to learn a more flexible similarity measure. The concat scoring function is similar to Bahdanau's formulation.
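The three scoring functions take only a few lines of NumPy each. The dimensions and weights below are toy placeholders, and the dot-product form assumes s_t and h_j share the same dimensionality:

```python
import numpy as np

rng = np.random.default_rng(3)

d = 4                          # toy dimension shared by s_t and h_j
s_t = rng.normal(size=d)       # current decoder state (Luong uses s_t, not s_{t-1})
h_j = rng.normal(size=d)       # one encoder state

# Dot product: no extra learnable parameters.
score_dot = s_t @ h_j

# General: one learned weight matrix W_a between the two states.
W_a = rng.normal(size=(d, d))
score_general = s_t @ W_a @ h_j

# Concat: a feedforward network over the concatenated states,
# closely resembling Bahdanau's additive form.
d_a = 5
W_c = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)
score_concat = v_a @ np.tanh(W_c @ np.concatenate([s_t, h_j]))

print(score_dot, score_general, score_concat)
```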
| Aspect | Bahdanau Attention (2014) | Luong Attention (2015) |
|---|---|---|
| Scoring function | Additive (feedforward network) | Multiplicative (dot product, general, or concat) |
| Decoder state used | Previous state s_{i-1} | Current state s_t |
| Encoder type | Bidirectional RNN | Unidirectional (top layer of stacked LSTM) |
| Attention scope | Global only | Global and local |
| Attention placement | Before decoder RNN step | After decoder RNN step |
| Computational cost | Higher (requires feedforward network) | Lower (dot product is efficient) |
| Number of parameters | More (W_a, U_a, v_a) | Fewer (W_a only, or none for dot product) |
One of the most significant practical differences is which decoder hidden state is used to compute the alignment scores. Bahdanau attention uses the previous decoder hidden state s_{i-1}, meaning the attention is computed before the decoder RNN processes the current step. Luong attention uses the current decoder hidden state s_t, meaning the attention is computed after the decoder RNN step. This design choice affects when the context vector is available during the decoding process.
Bahdanau attention is a global attention mechanism: it computes alignment scores over all source positions for every decoder step. Luong et al. additionally proposed local attention, which restricts the attention to a small window of source positions [p_t - D, p_t + D] centered around an aligned position p_t. Local attention comes in two variants: local-m, which assumes roughly monotonic alignment and simply sets p_t = t, and local-p, which predicts p_t with a small network and reweights the window with a Gaussian centered at p_t.

Local attention reduces the computational cost from O(T_x) to O(2D + 1) per decoder step and can be seen as a blend between Bahdanau's soft global attention and the hard attention approach used in some vision models.
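Luong et al.'s predictive variant (local-p) predicts an aligned position p_t and reweights the windowed softmax with a Gaussian of standard deviation D/2 centered at p_t. A rough sketch, with a hard-coded toy p_t standing in for the model's learned prediction:

```python
import numpy as np

rng = np.random.default_rng(4)

T_x, D = 20, 3                     # source length and window half-width
scores = rng.normal(size=T_x)      # raw alignment scores for one decoder step
p_t = 11.4                         # predicted aligned position (a toy value here)

# Restrict attention to the window [p_t - D, p_t + D].
center = int(round(p_t))
lo, hi = max(0, center - D), min(T_x, center + D + 1)
window = np.arange(lo, hi)

# Softmax within the window only, then reweight by a Gaussian centered
# at p_t with standard deviation D/2, as in the local-p variant.
alpha = np.exp(scores[window] - scores[window].max())
alpha /= alpha.sum()
alpha *= np.exp(-((window - p_t) ** 2) / (2 * (D / 2) ** 2))

print(len(window))  # 7 — only 2D + 1 positions (or fewer at the boundaries)
```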
Bahdanau attention was the catalyst for an entire family of attention-based architectures that transformed deep learning, running from Luong attention and visual attention for image captioning (both 2015) through to the Transformer (2017).
The conceptual lineage from Bahdanau attention to the Transformer is direct. Bahdanau showed that allowing the decoder to dynamically attend to encoder hidden states could eliminate the fixed-length bottleneck. Luong simplified the scoring function to a dot product. The Transformer generalized this idea by (1) applying attention to the input sequence itself (self-attention), (2) using multiple parallel attention heads, (3) introducing the query-key-value framework, and (4) removing the recurrent connections entirely, enabling full parallelization during training.
The Transformer's scaled dot-product attention is:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
This can be understood as a scaled version of Luong's dot-product attention, extended from a sequence-to-sequence setting to a self-attention setting where queries, keys, and values all come from learned linear projections of the same input. The scaling factor 1/sqrt(d_k) prevents the dot products from growing too large in high dimensions, keeping the softmax in a well-behaved gradient regime.
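Scaled dot-product attention is compact enough to write directly in NumPy. This sketch uses toy dimensions and random inputs; in a real Transformer, Q, K, and V would come from learned linear projections of the layer input.

```python
import numpy as np

rng = np.random.default_rng(5)

n, d_k, d_v = 4, 8, 6            # sequence length, key dim, value dim (toy sizes)
Q = rng.normal(size=(n, d_k))    # queries
K = rng.normal(size=(n, d_k))    # keys
V = rng.normal(size=(n, d_v))    # values

def scaled_dot_product_attention(Q, K, V):
    # Scores for every query-key pair, scaled by sqrt(d_k) to keep
    # the softmax in a well-behaved gradient regime.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax: each query gets a distribution over all positions.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 6)
```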
Xu et al. (2015) published "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" at ICML 2015, which directly adapted Bahdanau's attention mechanism to the task of image captioning. In their model, the encoder was a convolutional neural network (CNN) that produced a grid of feature vectors (one per spatial location in the image), and the decoder was an LSTM that generated a caption word by word. At each decoding step, the attention mechanism computed a weighted combination of the spatial feature vectors, allowing the model to "look" at different parts of the image when generating different words.
Xu et al. were the first to formally distinguish between soft attention (the differentiable weighted sum approach from Bahdanau) and hard attention (a stochastic approach that selects a single location, trained with the REINFORCE algorithm). Their soft attention mechanism is mathematically equivalent to Bahdanau attention, with CNN feature vectors replacing RNN hidden states as the set of values being attended to. The paper demonstrated that attention was not limited to sequence-to-sequence tasks but could be applied anywhere a model needed to selectively focus on parts of a structured input.
The Bahdanau attention paper has accumulated over 50,000 citations on Google Scholar, placing it among the most cited papers in all of machine learning and computer science. Its influence extends far beyond machine translation: attention mechanisms are now fundamental components of models for text generation, speech recognition, image classification, protein structure prediction, reinforcement learning, and many other domains.
The following table provides a comprehensive comparison of the major attention mechanism variants, from the original Bahdanau attention through to the Transformer's self-attention.
| Feature | No Attention (Encoder-Decoder) | Bahdanau Attention (2014) | Luong Attention (2015) | Transformer Self-Attention (2017) |
|---|---|---|---|---|
| Context vector | Fixed (final encoder state) | Dynamic (weighted sum of encoder states) | Dynamic (weighted sum of encoder states) | Dynamic (weighted sum of value vectors) |
| Scoring function | N/A | Additive: v^T tanh(Ws + Uh) | Dot, general, or concat | Scaled dot product: QK^T / sqrt(d_k) |
| Encoder type | Unidirectional RNN | Bidirectional RNN | Stacked unidirectional LSTM | No RNN; positional encoding + self-attention |
| Decoder state used | N/A | Previous state s_{i-1} | Current state s_t | Current layer output |
| Attention scope | N/A | Global (all source positions) | Global or local (windowed) | Global (all positions in sequence) |
| Self-attention | No | No | No | Yes |
| Multi-head | No | No | No | Yes (h parallel heads) |
| Parallelizable | Limited (sequential RNN) | Limited (sequential RNN) | Limited (sequential RNN) | Fully parallelizable |
| Handles long sequences | Poor (fixed bottleneck) | Good (dynamic context) | Good (dynamic context) | Excellent (direct token-to-token paths) |
| Number of extra parameters | None | W_a, U_a, v_a | W_a (general) or none (dot) | W_Q, W_K, W_V, W_O per head |
| Training complexity per step | O(1) for context | O(T_x) per decoder step | O(T_x) per decoder step (global) | O(n^2) for self-attention |
| Year | 2014 | 2014 | 2015 | 2017 |
| Paper | Cho et al.; Sutskever et al. | Bahdanau, Cho, Bengio | Luong, Pham, Manning | Vaswani et al. |
One of the most compelling aspects of the Bahdanau attention paper was its qualitative analysis of the learned alignments. The authors visualized the attention weights as heatmap matrices, with source (English) words on one axis and target (French) words on the other. Each cell in the matrix represented the attention weight alpha_{ij}, with brighter cells indicating higher attention.
These visualizations revealed several interesting patterns:
Monotonic alignment for similar word order: When translating between English and French (both Subject-Verb-Object languages), the attention patterns often followed a roughly diagonal path, reflecting the similar word order of the two languages.
Non-monotonic alignment for reorderings: When the two languages required different word orders (for example, adjective placement differs between English and French), the attention mechanism correctly shifted its focus away from the diagonal to attend to the correct source word.
Many-to-one and one-to-many alignments: The model handled cases where multiple target words corresponded to a single source word, or where a single target word required information from multiple source words. For instance, French compound verb forms might attend to a single English verb.
Soft rather than hard focus: Even when one source position dominated the attention distribution, other positions still received small but nonzero weights. This soft attention behavior allowed the model to integrate contextual information from the surrounding words.
These alignment visualizations provided strong evidence that the model was learning meaningful, linguistically plausible correspondences between source and target words, rather than simply memorizing surface patterns.
Despite its groundbreaking contributions, Bahdanau attention had several limitations that subsequent work addressed:
Computational cost at each decoder step. The attention mechanism required computing alignment scores between the current decoder state and every encoder annotation at each time step. For very long source sentences, this O(T_x) computation per decoder step added significant overhead. Luong's local attention addressed this by restricting the window of attended positions.
Sequential decoding. Like all RNN-based encoder-decoder models, the Bahdanau architecture processed tokens sequentially during both encoding and decoding. This limited parallelization during training and inference. The Transformer architecture later solved this by replacing recurrence with self-attention, enabling full parallelization.
No self-attention. The attention mechanism only operated between the encoder and decoder (cross-attention). There was no mechanism for tokens within the source sentence to attend to each other, or for generated target tokens to attend to all previously generated tokens. Self-attention, introduced later, filled this gap.
Unknown word handling. The model used a fixed vocabulary of 30,000 words and mapped all out-of-vocabulary words to a single [UNK] token. This significantly hurt performance, as seen in the large gap between "All" and "No UNK" BLEU scores. Later approaches such as byte-pair encoding (BPE) and subword tokenization largely solved this problem.
Single attention head. The model used a single alignment model, meaning it could only compute one pattern of attention per decoder step. Multi-head attention, introduced by the Transformer, allowed the model to attend to different aspects of the input simultaneously through multiple parallel attention heads.
Imagine you are reading a book written in English and you need to tell the story in French. The old way of doing this was to read the entire book, close it, and then try to retell the whole story from memory. If the book was short, you could remember enough to do a good job. But if the book was long, you would forget important details.
Bahdanau attention is like being allowed to keep the book open while you retell the story. For each French sentence you need to say, you can look back at the English book and find the most relevant part. When translating the word for "cat," you look at the part of the English book that talks about the cat. When translating the word for "garden," you look at the part about the garden. You are not reading the whole book each time; instead, you are quickly scanning it and focusing on the part that helps you right now.