See also: Machine learning terms, Natural language processing, Recurrent neural network
A sequence-to-sequence (seq2seq) task is any machine learning problem in which a model receives a variable-length input sequence and produces a variable-length output sequence. The input and output sequences may differ in length, vocabulary, and even modality. Common examples include translating a sentence from one language to another, summarizing a document into a few sentences, converting spoken audio into written text, and generating source code from a natural language description. In the field of deep learning, seq2seq tasks are central to a wide range of natural language processing (NLP) and time series prediction applications.
Seq2seq models typically follow an encoder-decoder architecture: the encoder reads the entire input and compresses it into an internal representation, and the decoder generates the output one element at a time based on that representation. The framework was introduced in two independent 2014 papers by Sutskever, Vinyals, and Le at Google and by Cho et al. at the University of Montreal. Since then, seq2seq has grown from a specialized machine translation technique into the dominant paradigm behind modern large language models, dialogue agents, and multimodal systems.
The core idea is deceptively simple, yet it proved remarkably powerful. Within a few years, seq2seq models advanced from a research curiosity to the backbone of production machine translation systems, and the architectural principles they introduced continue to shape modern large language models.
Imagine you have a friend who speaks only French, and you speak only English. You write a letter in English and hand it to a translator. The translator reads your entire letter, thinks about what it means, and then writes a new letter in French for your friend. In a seq2seq model, the "reader" is the encoder. It reads everything you wrote and creates a summary of the meaning in its head. The "writer" is the decoder. It takes that summary and writes the message in the new language, one word at a time, until the letter is finished. The encoder and decoder work as a team: one understands the input, the other produces the output.
Another way to think about it: imagine you have a bunch of colorful building blocks in a row (the input sequence), and you want to arrange them in a different order to make a new row of blocks (the output sequence). A seq2seq task is like teaching a robot to do this for you. The robot has two main parts, the encoder and the decoder. The encoder looks at the row of colorful blocks and remembers the important information about them. Then, the decoder uses that information to create the new row of blocks in the correct order. The robot can do this for different rows of blocks with different colors and lengths. This idea is used in many things, like translating languages, summarizing long texts, or even helping robots talk to people.
Before seq2seq, the dominant approaches to tasks like machine translation relied on statistical methods such as phrase-based statistical machine translation (SMT). These systems decomposed translation into small phrase-level mappings and used language models to stitch results together. While effective, they required extensive hand-engineered features, alignment heuristics, and large phrase tables.
Recurrent neural networks had been studied since the 1980s, and Long Short-Term Memory (LSTM) networks were introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 to address the vanishing gradient problem that plagued simple RNNs. However, it took nearly two decades of hardware improvements (particularly the rise of GPU computing) and algorithmic refinements before RNNs could be scaled to large real-world sequence transduction tasks.
The seq2seq framework crystallized in 2014 with two papers published nearly simultaneously. Kyunghyun Cho and colleagues at the University of Montreal proposed the RNN Encoder-Decoder in June 2014, while Ilya Sutskever, Oriol Vinyals, and Quoc V. Le at Google published their landmark paper in September 2014. Both papers established the encoder-decoder blueprint that would define the field for the next several years.
The seq2seq framework divides the model into two distinct components that are trained jointly end to end.
The encoder processes the input sequence element by element (for example, word by word) and produces a set of hidden states that capture the meaning and structure of the input. In the original RNN-based designs, the encoder was a recurrent neural network (often an LSTM or GRU) that read the input tokens sequentially. The final hidden state of the encoder, often called the context vector or "thought vector", served as a fixed-length summary of the entire input. In Transformer-based designs, the encoder consists of stacked self-attention layers and feed-forward networks that process all input positions in parallel.
Formally, given an input sequence (x_1, x_2, ..., x_T), the encoder computes a sequence of hidden states:
h_t = f(x_t, h_{t-1})
where f is the recurrent function (e.g., an LSTM cell). The context vector c is typically the final hidden state h_T, or in deeper models, the concatenation of final hidden states across layers.
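To make these equations concrete, here is a minimal PyTorch sketch of an RNN encoder (the class and variable names are illustrative, not taken from the original papers): an embedding layer followed by an LSTM that returns both the per-step hidden states and the final state that plays the role of the context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal RNN encoder sketch: embeds tokens and runs them through an LSTM."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, T) integer token ids
        embedded = self.embed(src_tokens)          # (batch, T, embed_dim)
        outputs, (h_T, c_T) = self.rnn(embedded)   # outputs holds every h_t
        # (h_T, c_T) is the final state, i.e. the fixed-length context vector c
        return outputs, (h_T, c_T)
```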
The decoder generates the output sequence one token at a time in an autoregressive fashion. At each step, it takes the previously generated token (or a special start-of-sequence token at the first step), the encoder's representation, and its own internal state to predict the next output token. Generation continues until the model emits an end-of-sequence token or reaches a maximum length. Like the encoder, the decoder can be implemented with RNNs or Transformer layers.
At each decoding step t, the decoder computes:
s_t = g(y_{t-1}, s_{t-1}, c)
p(y_t | y_1, ..., y_{t-1}, x) = softmax(W_s * s_t)
where g is the decoder's recurrent function, s_t is the decoder hidden state, y_{t-1} is the previously generated token, and c is the context vector. The softmax layer produces a probability distribution over the output vocabulary at each step.
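A matching decoder sketch, again an illustrative minimum rather than any paper's exact implementation: each call to step consumes the previously generated token and the running state (initialized from the encoder's final state), and returns a log-probability distribution over the output vocabulary.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Minimal autoregressive LSTM decoder sketch matching the equations above."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)  # plays the role of W_s

    def step(self, y_prev, state):
        # y_prev: (batch, 1) previously generated token (or <sos> at step one)
        # state: decoder state, initialized from the encoder's final state
        embedded = self.embed(y_prev)
        output, state = self.rnn(embedded, state)
        logits = self.proj(output.squeeze(1))          # scores over the vocabulary
        return torch.log_softmax(logits, dim=-1), state
```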
In the original RNN-based seq2seq models, the encoder compressed the entire input into a single fixed-length context vector. This design created an information bottleneck: for long input sequences, the fixed-size vector could not retain all relevant details, and performance degraded as sentence length increased. The attention mechanism, introduced by Bahdanau et al. in 2014, solved this problem by allowing the decoder to look back at all encoder hidden states at every decoding step rather than relying on a single compressed vector.
The seq2seq paradigm evolved through several distinct phases, each addressing limitations of the previous approach.
Two papers published in 2014 independently proposed the encoder-decoder framework for neural network-based sequence transduction.
| Paper | Authors | Venue | Key contribution |
|---|---|---|---|
| Sequence to Sequence Learning with Neural Networks | Ilya Sutskever, Oriol Vinyals, Quoc V. Le | NeurIPS 2014 | Used a 4-layer LSTM encoder-decoder with reversed input sequences; achieved 34.8 BLEU on WMT'14 English-to-French translation |
| Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation | Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio | EMNLP 2014 | Introduced the RNN Encoder-Decoder architecture and proposed the Gated Recurrent Unit (GRU) as a simpler alternative to LSTM |
The paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le was presented at the 28th Conference on Neural Information Processing Systems (NeurIPS) in December 2014. It is one of the most cited papers in deep learning history and established the seq2seq paradigm as a practical approach to machine translation.
Architecture. The authors used a multilayered LSTM to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector. The specific model employed four stacked LSTM layers with 1,000 hidden units per layer and had 384 million parameters in total; the states of the four layers together form the 8,000-dimensional vector used to represent a sentence. The vocabulary contained 160,000 words for the source language and 80,000 words for the target language. The encoder LSTM read the input sentence and produced a fixed-length vector representation at its final time step. This vector initialized the decoder LSTM, which generated the French translation one word at a time using beam search with a beam size of 12.
Reversing the input sequence. One of the most surprising and practically important findings in the paper was that reversing the order of words in the source sentence significantly improved translation quality. Instead of mapping the sentence (a, b, c) to its translation (alpha, beta, gamma), the LSTM was trained to map (c, b, a) to (alpha, beta, gamma). The intuition behind this trick is that reversing the input introduces many short-term dependencies between the source and target sequences. After reversal, the first few words of the source sentence are close to the first few words of the target sentence, making it easier for stochastic gradient descent to "establish communication" between the input and output. While the average distance between corresponding words remains unchanged, the first several words now have very small distances, and this is enough to bootstrap the optimization process. This simple trick improved BLEU scores by several points.
Results. On the WMT 2014 English-to-French translation task, the results were striking:
| Model | BLEU Score |
|---|---|
| Phrase-based SMT baseline (Moses) | 33.3 |
| Single LSTM (reversed input) | 30.6 |
| Ensemble of 5 LSTMs (reversed input) | 34.81 |
| Ensemble of 5 LSTMs + SMT rescoring (1000-best) | 36.5 |
| Previous state of the art (SMT + neural components) | 37.0 |
The ensemble of five deep LSTMs achieved a BLEU score of 34.81, outperforming the phrase-based SMT baseline of 33.3 without using any phrase tables, alignment models, or hand-engineered features. When the LSTM was used to rerank the top 1,000 hypotheses from the SMT system, the combined system reached 36.5 BLEU. These results demonstrated for the first time that a pure neural approach could compete with, and even surpass, traditional statistical translation systems.
Key insights. Beyond raw performance, the paper showed that deep LSTMs could learn to map variable-length input sequences to variable-length output sequences without any task-specific engineering. It also reported that the model handled long sentences surprisingly well, and that the learned sentence representations were sensitive to word order while being relatively invariant to the active versus passive voice.
Published at EMNLP 2014, the paper "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" by Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio independently proposed the encoder-decoder framework for sequence-to-sequence learning.
The RNN Encoder-Decoder. Cho et al. proposed a model in which one RNN encodes a sequence of symbols into a fixed-length vector representation, and another RNN decodes that representation into a second sequence of symbols. The encoder and decoder were jointly trained to maximize the conditional probability of the target sequence given the source sequence. Rather than translating full sentences end-to-end (as Sutskever et al. did), Cho et al. used the encoder-decoder to score phrase pairs, integrating these scores as an additional feature in an existing phrase-based SMT system.
The Gated Recurrent Unit. A major contribution of this paper was the introduction of the Gated Recurrent Unit (GRU), a new type of recurrent unit designed as a simpler alternative to the LSTM. The GRU uses two gates: an update gate (which controls how much of the previous hidden state to retain) and a reset gate (which determines how much of the previous state to forget when computing the candidate activation). Compared to the LSTM, the GRU has no separate cell state and uses fewer parameters, which can make training faster. The GRU achieved comparable performance with fewer parameters and faster training. The authors showed that the RNN Encoder-Decoder with GRUs learned semantically and syntactically meaningful phrase representations. Phrases with similar meanings were mapped to nearby points in the continuous space, even when their surface forms differed.
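The gate equations fit in a few lines. Below is a minimal NumPy sketch of a single GRU step following the formulation in Cho et al. (2014), in which the update gate interpolates between the previous state and the candidate state; the params dictionary of weight matrices is an assumed convention used here for brevity (bias terms are omitted).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    # params: assumed dict of weight matrices, W_* (hidden, input), U_* (hidden, hidden)
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)      # update gate
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)      # reset gate
    h_cand = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_cand  # retain old state vs. adopt candidate
```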
Relationship to Sutskever et al. Although both papers proposed encoder-decoder architectures, they differed in scope and application:
| Aspect | Sutskever et al. (2014) | Cho et al. (2014) |
|---|---|---|
| Recurrent unit | LSTM | GRU |
| Depth | 4 layers | 1 layer |
| Translation mode | End-to-end sentence translation | Phrase pair scoring within SMT |
| Published at | NeurIPS 2014 | EMNLP 2014 |
| Input reversal | Yes | No |
| Vocabulary | 160K source, 80K target | 15K each |
Together, these two papers firmly established the encoder-decoder paradigm and launched a wave of research into neural sequence transduction.
The fixed-length context vector used by the original seq2seq models created a fundamental bottleneck: the entire input sequence, regardless of its length, had to be compressed into a single vector. For short sentences, this compression was manageable, but for longer inputs, the context vector could not preserve all the relevant information. Performance degraded noticeably as sentence length increased.
Bahdanau attention (additive attention). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio addressed this bottleneck in their paper "Neural Machine Translation by Jointly Learning to Align and Translate," submitted to arXiv in September 2014 and published at ICLR 2015. Their solution was the attention mechanism, which allowed the decoder to selectively focus on different parts of the input sequence at each decoding step, rather than relying on a single fixed-length vector. The key innovation was to replace the fixed context vector with a dynamic context vector that is recomputed at every decoding step. At each step, the model computes an alignment score between the current decoder state and every encoder hidden state. These scores are normalized through a softmax function to produce attention weights, which are then used to compute a weighted sum of the encoder hidden states. This weighted sum becomes the context vector for that particular decoding step.
Bahdanau et al. used a bidirectional RNN as the encoder, reading the input sequence both forward and backward. The annotation for each word was the concatenation of the forward and backward hidden states, giving the model access to context from both directions. The attention scoring function was an additive (also called "concat") function:
e_ij = v^T * tanh(W_a * s_{i-1} + U_a * h_j)
where s_{i-1} is the decoder state, h_j is the j-th encoder annotation, and W_a, U_a, v are learned parameters. This is sometimes referred to as additive attention or Bahdanau attention.
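In code, one step of additive attention is a score, a softmax, and a weighted sum. The following NumPy sketch mirrors the equation above; the parameter names match the formula, while the function itself is illustrative.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v):
    # s_prev: (d_dec,) previous decoder state; H: (T, d_enc) encoder annotations
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v   # e_ij for each source position j
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax -> attention weights
    context = weights @ H                              # dynamic context vector for this step
    return context, weights
```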
On English-to-French translation, the results showed significant improvement over the basic encoder-decoder:
| Model | BLEU (all sentences) | BLEU (no unknown words) |
|---|---|---|
| RNNenc-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50* (longer training) | 28.45 | 36.15 |
| Moses (phrase-based SMT) | 33.30 | 35.63 |
The attention-based model (RNNsearch) dramatically outperformed the basic encoder-decoder (RNNenc), especially on longer sentences where the fixed-length bottleneck was most severe. On sentences with no unknown words, RNNsearch-50* actually surpassed the phrase-based SMT system Moses. Perhaps more importantly, the attention mechanism provided interpretability. By visualizing the attention weights, researchers could see which source words the model was focusing on when generating each target word. These soft alignments corresponded closely to human intuitions about word-level translation correspondences.
Luong attention (multiplicative attention). Thang Luong, Hieu Pham, and Christopher D. Manning extended the attention framework in their 2015 EMNLP paper "Effective Approaches to Attention-based Neural Machine Translation." They introduced two classes of attention mechanisms: global attention, which attends to all source positions at every decoding step (similar in spirit to Bahdanau attention but with a simpler computation path), and local attention, which attends only to a small window of source positions around a predicted alignment point, reducing computation on long sequences.
The scoring functions for Luong attention include:
| Score Function | Formula |
|---|---|
| Dot | s_t^T * h_j |
| General (bilinear) | s_t^T * W_a * h_j |
| Concat (additive) | v^T * tanh(W_a * [s_t ; h_j]) |
Dot-product attention became widely adopted due to its computational simplicity. With local attention and input feeding, Luong et al. achieved a gain of 5.0 BLEU points over non-attentional baselines on the WMT'14 English-to-German translation task. Their ensemble reached 25.9 BLEU on WMT'15 English-German, which was a new state-of-the-art result at the time.
| Attention type | Score function | Introduced by | Year |
|---|---|---|---|
| Additive (Bahdanau) | score(s_t, h_i) = v^T tanh(W[s_t; h_i]) | Bahdanau, Cho, Bengio | 2014 |
| Dot-product (Luong) | score(s_t, h_i) = s_t^T h_i | Luong, Pham, Manning | 2015 |
| General (Luong) | score(s_t, h_i) = s_t^T W h_i | Luong, Pham, Manning | 2015 |
| Scaled dot-product | score(Q, K) = QK^T / sqrt(d_k) | Vaswani et al. | 2017 |
Gehring et al. at Facebook AI Research (2017) proposed replacing RNNs with convolutional neural networks in both the encoder and decoder. Because convolutions can be computed in parallel across all positions, this architecture trained significantly faster than RNN-based models while achieving competitive or superior translation quality. The model used gated linear units and multi-step attention. On the WMT'14 English-to-French benchmark, the convolutional seq2seq model matched the accuracy of deep LSTM systems at nine times the training speed.
While attention mechanisms dramatically improved seq2seq models, the underlying RNNs still had a fundamental limitation: they processed tokens sequentially, one at a time. This meant that training could not be fully parallelized across time steps, making it slow and expensive on long sequences.
In 2017, Ashish Vaswani and colleagues at Google published "Attention Is All You Need" at NeurIPS, introducing the Transformer architecture. The Transformer dispensed with recurrence and convolutions entirely and relied solely on attention mechanisms, specifically self-attention (also called intra-attention), to model dependencies between all positions in a sequence. The encoder consists of stacked layers, each containing multi-head self-attention and position-wise feed-forward networks. The decoder has an additional cross-attention sublayer that attends to the encoder output.
The Transformer retains the encoder-decoder structure of seq2seq models but replaces the RNN layers with stacks of multi-head self-attention layers and position-wise feedforward networks. Key components and innovations include scaled dot-product attention, which computes softmax(QK^T / sqrt(d_k))V over query, key, and value projections; multi-head attention, which runs several attention functions in parallel over different representation subspaces; sinusoidal positional encodings added to the input embeddings, since the model has no recurrence or convolution to convey word order; residual connections and layer normalization around every sublayer; and masked self-attention in the decoder, which prevents each position from attending to future positions and preserves the autoregressive property.
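The central operation, scaled dot-product attention, is compact enough to sketch directly. This NumPy version is illustrative (single head, no masking) and computes softmax(QK^T / sqrt(d_k))V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k) queries; K: (T_k, d_k) keys; V: (T_k, d_v) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T_q, T_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T_q, d_v) attended output
```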
The Transformer achieved new state-of-the-art results on major translation benchmarks:
| Task | Transformer BLEU | Previous Best BLEU |
|---|---|---|
| WMT 2014 English-to-German | 28.4 | 26.4 (ensemble) |
| WMT 2014 English-to-French | 41.8 | 41.0 (ensemble) |
On the WMT 2014 English-to-German task, the Transformer outperformed all previously reported models, including ensembles, by more than 2 BLEU points. On English-to-French, it established a new single-model state-of-the-art BLEU score of 41.8 while requiring a fraction of the training cost of previous models. The big Transformer model was trained for 3.5 days on 8 GPUs, compared to weeks of training for competitive RNN-based systems. The Transformer's ability to process all positions in parallel made it far more efficient to train on modern hardware, and its self-attention mechanism allowed it to capture long-range dependencies more effectively than RNNs. This paper triggered a paradigm shift: within two years, virtually all state-of-the-art NLP models adopted the Transformer architecture, and it has since become the default for nearly all seq2seq tasks.
The success of pre-training and transfer learning led to large-scale encoder-decoder Transformer models that are pre-trained on massive corpora and then fine-tuned for specific seq2seq tasks.
| Model | Authors / Organization | Year | Key features |
|---|---|---|---|
| T5 (Text-to-Text Transfer Transformer) | Raffel et al., Google | 2020 | Casts every NLP task as a text-to-text problem with task-specific prefixes; pre-trained on the C4 corpus; sizes from 60M to 11B parameters |
| BART | Lewis et al., Facebook AI | 2020 | Pre-trained by corrupting text (token masking, sentence permutation, span deletion) and learning to reconstruct the original; strong results on summarization and generation |
| mBART | Liu et al., Facebook AI | 2020 | Multilingual extension of BART; pre-trained on monolingual corpora in 25 languages; up to 12 BLEU improvement on low-resource translation |
| mT5 | Xue et al., Google | 2021 | Multilingual T5 covering 101 languages; pre-trained on mC4 |
In 2020, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu published "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" in the Journal of Machine Learning Research. The T5 model pushed the seq2seq paradigm to its logical conclusion by casting every NLP task as a text-to-text problem.
In the T5 framework, tasks such as translation, summarization, classification, regression, and question answering are all framed identically: the model receives a text input (prefixed with a task-specific instruction such as "translate English to German:" or "summarize:") and produces a text output. This unified formulation allowed the researchers to systematically compare pre-training objectives, model architectures, data sources, and transfer strategies across dozens of tasks using a single model architecture. T5 uses the standard Transformer encoder-decoder architecture. The model was pre-trained on the "Colossal Clean Crawled Corpus" (C4), a cleaned version of Common Crawl containing roughly 750 GB of English text. By combining insights from the systematic study with scale (the largest T5 model, T5-11B, has 11 billion parameters), the researchers achieved state-of-the-art results on many benchmarks including GLUE, SuperGLUE, SQuAD, and CNN/DailyMail.
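As an illustration of the text-to-text interface, the sketch below uses the Hugging Face transformers library, assuming it (and its sentencepiece dependency) is installed and the public t5-small checkpoint can be downloaded; the task is selected purely by the text prefix.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task-specific prefix tells the model which seq2seq task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```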
BART (Bidirectional and Auto-Regressive Transformers), proposed by Mike Lewis and colleagues at Facebook AI Research in 2019, is a denoising autoencoder for pre-training seq2seq models. BART combines a bidirectional encoder (similar to BERT) with an autoregressive decoder (similar to GPT). During pre-training, the input text is corrupted using various noising functions (token masking, token deletion, sentence permutation, text infilling), and the model learns to reconstruct the original text. BART achieved state-of-the-art results on abstractive summarization tasks and strong performance on translation, question answering, and comprehension benchmarks. It demonstrated that the seq2seq encoder-decoder framework, when paired with effective pre-training, could match or exceed encoder-only models like BERT on understanding tasks while also excelling at generation tasks.
The text-to-text paradigm exemplified by T5 and BART has become a dominant approach in modern NLP. Other models in this family include the multilingual extensions mBART and mT5 (described above), PEGASUS, which is pre-trained with gap-sentence generation specifically for abstractive summarization, and Flan-T5, an instruction-tuned variant of T5.
These models show that the encoder-decoder seq2seq architecture remains highly competitive, even in an era dominated by decoder-only language models like GPT-4.
Seq2seq models are used across a wide range of tasks where the input and output are both sequences, even if they differ in length, vocabulary, or modality.
Machine translation was the original and most prominent application of seq2seq models. Before seq2seq, translation systems relied on statistical phrase-based methods that required extensive hand-crafted features and alignment tables. Neural seq2seq models replaced these pipelines with a single end-to-end trained network. Starting with Sutskever et al. (2014), neural machine translation (NMT) rapidly overtook phrase-based SMT.
Google deployed its Neural Machine Translation (GNMT) system in production in 2016, reducing translation errors by an average of 60% compared to the previous phrase-based system across more than 100 language pairs. GNMT used a deep seq2seq architecture with 8 encoder layers and 8 decoder layers, residual connections, and attention. It also introduced wordpiece tokenization to handle rare words by splitting them into subword units. Today, all major translation services (Google Translate, DeepL, Microsoft Translator) use Transformer-based seq2seq models.
Text summarization systems use seq2seq models to generate concise summaries of longer documents. There are two approaches: extractive summarization, which selects and concatenates existing sentences from the source, and abstractive summarization, which generates new sentences that may not appear verbatim in the source. Seq2seq models are particularly suited to abstractive summarization because the decoder can produce novel phrasings.
See, Liu, and Manning (2017) introduced the pointer-generator network, a seq2seq model that can both generate words from a fixed vocabulary and copy words directly from the source text via a pointing mechanism. This hybrid approach addressed the problem of out-of-vocabulary words and factual accuracy. They also introduced a coverage mechanism to reduce repetitive output. Models like BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) have achieved strong results on benchmarks such as CNN/DailyMail and XSum.
Speech recognition (automatic speech recognition, or ASR) is a natural seq2seq problem: the input is a sequence of audio frames (such as mel-frequency cepstral coefficients or log-mel spectrograms), and the output is a sequence of characters or words. Traditional ASR systems combined separate acoustic models, pronunciation dictionaries, and language models. Seq2seq approaches unify these components into a single end-to-end model.
Chan et al. (2016) proposed Listen, Attend and Spell (LAS), which used a pyramidal RNN encoder (the "listener") to process audio features and an attention-based RNN decoder (the "speller") to emit characters. LAS achieved a 14.1% word error rate on a Google voice search task without any external language model.
OpenAI's Whisper (Radford et al., 2022) is a Transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio data. It handles transcription in multiple languages as well as translation from other languages into English. Whisper demonstrated that scaling up weakly supervised training data could produce highly robust ASR without task-specific engineering.
Conversational AI and chatbot systems use seq2seq models to generate contextually appropriate responses given conversation history. The encoder processes the dialogue context (previous turns), and the decoder generates the next response. Early neural dialogue systems, such as the work by Vinyals and Le (2015) on a "Neural Conversational Model," showed that seq2seq could produce surprisingly coherent multi-turn conversations when trained on large corpora. Google's Meena (2020), a 2.6 billion parameter seq2seq chatbot trained on 341 GB of social media conversations, demonstrated that scaling seq2seq models improved open-domain conversation quality.
Seq2seq models power code generation tools that translate natural language descriptions or comments into source code. The input is a programming task description or docstring, and the output is the corresponding code. OpenAI's Codex (the model behind GitHub Copilot) and subsequent code-generation models use Transformer-based seq2seq principles. Code translation (converting code from one programming language to another) and code summarization are other seq2seq applications in this domain.
| Application | Input sequence | Output sequence |
|---|---|---|
| Image captioning | Image feature vectors (from a CNN) | Natural language caption |
| Question answering | Question + context passage | Answer text |
| Grammar correction | Sentence with errors | Corrected sentence |
| Data-to-text generation | Structured data (tables, knowledge graphs) | Natural language description |
| Music generation | Symbolic music notation or audio features | New musical sequence |
| Protein structure prediction | Amino acid sequence | 3D structure coordinates |
| Mathematical problem solving | Problem statement | Solution steps |
| Time series prediction | Past observations | Future forecasted values |
In image captioning, a convolutional neural network acts as the encoder to process an image, with an RNN or Transformer decoder generating the textual description. In time series prediction, the encoder reads past observations and the decoder predicts future steps. In protein structure prediction, the model encodes amino acid sequences and predicts structural properties.
During training, the standard approach feeds the ground-truth previous token as input to the decoder at each step, regardless of what the model would have predicted. This technique, called teacher forcing, stabilizes training and speeds up convergence because the decoder always conditions on correct context and does not have to recover from its own mistakes during early training. However, during inference, the model must use its own predictions as inputs, creating a mismatch between training and inference conditions known as exposure bias.
Samy Bengio et al. (2015) proposed scheduled sampling to mitigate exposure bias. During training, the model gradually transitions from using ground-truth tokens to using its own predictions as decoder input. The probability of using a model-generated token increases over the course of training according to a schedule (linear, exponential, or inverse sigmoid decay). This curriculum-based strategy helps the decoder learn to handle its own imperfect predictions and recover from errors.
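The two regimes differ only in where the decoder's next input token comes from. The sketch below assumes the Decoder.step interface from the earlier sketch; it reduces to pure teacher forcing when sampling_prob is 0 and implements scheduled sampling as sampling_prob is annealed upward over training.

```python
import random
import torch

def decode_for_training(decoder, targets, state, sampling_prob=0.0):
    # targets: (batch, T) ground-truth output tokens, beginning with <sos>
    log_probs = []
    y_prev = targets[:, 0:1]                           # start-of-sequence token
    for t in range(1, targets.size(1)):
        step_log_probs, state = decoder.step(y_prev, state)
        log_probs.append(step_log_probs)
        if random.random() < sampling_prob:            # scheduled sampling branch
            y_prev = step_log_probs.argmax(dim=-1, keepdim=True)  # model's own guess
        else:                                          # teacher forcing branch
            y_prev = targets[:, t:t+1]                 # ground-truth previous token
    return log_probs  # compared against targets[:, 1:] with cross-entropy
```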
Standard seq2seq training uses token-level cross-entropy loss, which maximizes the probability of each correct token independently. However, the actual evaluation metrics (such as BLEU or ROUGE) operate at the sequence level. Several methods address this discrepancy, including sequence-level training with policy-gradient reinforcement learning (Ranzato et al., 2016), minimum risk training that directly optimizes the expected value of a sequence-level metric (Shen et al., 2016), and beam search optimization (Wiseman and Rush, 2016).
Handling open vocabularies is a practical challenge for seq2seq models. Early models used fixed vocabularies (often 30,000 to 80,000 words) and replaced unknown words with a special UNK token, which degraded output quality, particularly for morphologically rich languages. Several solutions were developed: character-level models that dispense with a word vocabulary entirely, copy mechanisms that point to rare words in the source, and, most influentially, subword tokenization such as byte-pair encoding (BPE), introduced for NMT by Sennrich et al. (2016), which splits rare words into smaller units that remain in the vocabulary.
Variants of subword tokenization include WordPiece (used in BERT and GNMT) and SentencePiece (used in T5 and mBART). Subword tokenization via BPE became the standard and is used by virtually all modern seq2seq and language models.
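The merge-learning loop at the heart of BPE is short. Here is a toy sketch in the spirit of the reference implementation from Sennrich et al. (2016); production tokenizers add byte-level fallback, frequency thresholds, and far faster data structures.

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    # words: dict mapping space-separated symbol sequences to corpus frequencies,
    # e.g. {"l o w </w>": 5, "l o w e r </w>": 2}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                 # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair to merge
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        words = {pattern.sub("".join(best), w): f for w, f in words.items()}
    return merges
```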
At inference time, the decoder must select output tokens without access to ground-truth sequences. Several strategies exist, each with different trade-offs between quality and computational cost.
The simplest approach selects the highest-probability token at each step. Greedy decoding is fast but often produces suboptimal sequences because a locally optimal choice at one step may lead to a globally poor sequence.
Beam search is the most common decoding strategy for seq2seq models. Rather than greedily selecting the highest-probability token at each step, beam search maintains a fixed number of candidate sequences (the beam width, typically 4 to 12) at each decoding step. At each step, it expands all current candidates by one token, scores them, and retains only the top-scoring candidates. Beam search provides a better approximation of the optimal output than greedy decoding but is more computationally expensive. Sutskever et al. (2014) used a beam size of 12 in their experiments, and beam sizes between 4 and 10 remain typical in practice. Length normalization and coverage penalties are often applied to prevent beam search from favoring short outputs or neglecting parts of the input.
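A simplified beam search fits in a few dozen lines. In the sketch below, next_log_probs is an assumed stand-in for a trained model's one-step predictor, and the final length normalization is the simple average-log-probability variant.

```python
import numpy as np

def beam_search(next_log_probs, sos_id, eos_id, beam_width=4, max_len=50):
    beams = [([sos_id], 0.0)]                 # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_p = next_log_probs(seq)       # log-probs over the vocabulary
            for tok in np.argsort(log_p)[-beam_width:]:   # top continuations
                candidates.append((seq + [int(tok)], score + float(log_p[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:        # keep the best overall
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    pool = finished or beams
    # length normalization counters the bias toward short hypotheses
    return max(pool, key=lambda c: c[1] / len(c[0]))[0]
```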
For tasks where diversity is desired (such as dialogue or creative text generation), sampling methods are used instead of deterministic search: temperature sampling rescales the logits to sharpen or flatten the distribution; top-k sampling restricts sampling to the k most probable tokens; and nucleus (top-p) sampling draws from the smallest set of tokens whose cumulative probability exceeds a threshold p.
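The following NumPy sketch combines the three strategies in a single illustrative function; the names are ours, and real implementations typically operate on batched logits.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=np.random):
    logits = logits / temperature                  # temperature reshapes confidence
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens sorted by probability
    if top_k > 0:
        order = order[:top_k]                      # keep only the k most likely
    if top_p < 1.0:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]                     # smallest set with mass >= top_p
    kept = probs[order] / probs[order].sum()       # renormalize over the kept set
    return int(rng.choice(order, p=kept))
```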
Seq2seq outputs are typically evaluated using automatic metrics that compare generated sequences against reference sequences.
| Metric | Full name | Used for | How it works |
|---|---|---|---|
| BLEU | Bilingual Evaluation Understudy | Machine translation | Measures n-gram precision of the generated text against references; uses a brevity penalty to discourage overly short outputs |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Text summarization | Measures n-gram recall (ROUGE-N), longest common subsequence (ROUGE-L), or skip-bigram overlap (ROUGE-S) between generated and reference summaries |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering | Machine translation | Considers synonyms, stemming, and word order in addition to exact matches |
| Perplexity | N/A | Language model evaluation | Measures how well the model predicts the next token; lower perplexity indicates better predictive performance |
| BERTScore | N/A | General text generation | Computes semantic similarity between generated and reference texts using contextual embeddings from BERT |
| CER / WER | Character / Word Error Rate | Speech recognition | Edit distance between the predicted and reference transcriptions, normalized by the reference length |
Despite their success, seq2seq models face several well-known challenges.
The discrepancy between teacher forcing during training and autoregressive generation during inference means that errors at early decoding steps can compound throughout the sequence. Scheduled sampling, reinforcement learning fine-tuning, and non-autoregressive decoding are among the approaches that attempt to reduce this problem.
Seq2seq models can generate fluent but factually incorrect text, a phenomenon known as hallucination. This is particularly problematic in summarization, where the model may produce details that do not appear in the source document. Faithfulness metrics and constrained decoding methods have been developed to detect and mitigate hallucinations.
Even with attention mechanisms, processing very long input sequences remains challenging. RNN-based models suffer from the vanishing gradient problem over long distances, while Transformer-based models face quadratic memory and computation costs with respect to sequence length. Efficient attention variants such as sparse attention, linear attention, and sliding window attention have been proposed to address this limitation.
Large seq2seq models require substantial computational resources for both training and inference. Training is expensive, and the quadratic complexity of self-attention with respect to sequence length (discussed above) limits the maximum input size. On the output side, autoregressive decoding is inherently sequential, which limits throughput; non-autoregressive translation models attempt to generate all output tokens in parallel, trading some quality for significant speedups.
Automatic metrics like BLEU and ROUGE have known shortcomings. BLEU relies on exact n-gram matching and can penalize valid paraphrases. ROUGE focuses on recall and may not capture semantic correctness. Human evaluation remains the gold standard but is expensive and time-consuming.
| Feature | RNN Encoder-Decoder (2014) | RNN + Attention (2014-2015) | CNN-based (2017) | Transformer (2017) |
|---|---|---|---|---|
| Encoder type | Unidirectional LSTM/GRU | Bidirectional LSTM/GRU | Stacked convolutions | Self-attention + feedforward |
| Decoder type | Unidirectional LSTM/GRU | Unidirectional LSTM/GRU + attention | Stacked convolutions with attention | Self-attention + cross-attention + feedforward |
| Context representation | Fixed-length vector (encoder final state) | Dynamic context vector (weighted sum of encoder states) | Convolutional features with multi-step attention | Full encoder output attended at every layer |
| Parallelization | Sequential (no parallelism across time) | Sequential (no parallelism across time) | Fully parallel during training | Fully parallel across positions |
| Long-range dependencies | Limited by vanishing gradients, even with LSTM/GRU | Improved via direct attention connections | Fixed receptive field (grows with depth) | Excellent (direct attention between all position pairs) |
| Positional information | Implicit in sequential processing | Implicit in sequential processing | Implicit in convolutional structure | Explicit positional encodings required |
| Training speed | Slow (sequential computation) | Slow (sequential computation) | Fast (parallelizable) | Fast (parallelizable on GPUs/TPUs) |
| Interpretability | Low | Moderate (attention weights provide alignment) | Moderate | Moderate (multi-head attention weights) |
| Representative BLEU (En-Fr WMT'14) | 34.81 (Sutskever et al., ensemble) | 36.15 (Bahdanau et al., no UNK) | Competitive with LSTM | 41.8 (Vaswani et al., single model) |
| Representative models | Original seq2seq (2014), GNMT (2016) | Bahdanau (2015), Luong (2015) | ConvS2S (Gehring et al., 2017) | Transformer (2017), T5, BART, mBART |
| Year | Milestone |
|---|---|
| 2013 | Kalchbrenner and Blunsom propose an encoder-decoder model using a CNN encoder and RNN decoder for machine translation |
| 2014 | Sutskever, Vinyals, and Le demonstrate end-to-end seq2seq with deep LSTM networks |
| 2014 | Cho et al. introduce the RNN Encoder-Decoder and the GRU |
| 2014 | Bahdanau, Cho, and Bengio propose the attention mechanism for seq2seq |
| 2015 | Luong, Pham, and Manning explore global and local attention variants |
| 2016 | Google deploys GNMT for production machine translation |
| 2016 | Sennrich et al. introduce BPE for subword tokenization in NMT |
| 2016 | Chan et al. propose Listen, Attend and Spell for end-to-end speech recognition |
| 2017 | Gehring et al. introduce convolutional seq2seq |
| 2017 | Vaswani et al. publish "Attention Is All You Need," introducing the Transformer |
| 2017 | See, Liu, and Manning propose pointer-generator networks for summarization |
| 2019 | Facebook AI demonstrates seq2seq for symbolic mathematics |
| 2020 | Google publishes T5, unifying NLP tasks as text-to-text seq2seq |
| 2020 | Facebook AI publishes BART and mBART for pre-trained seq2seq |
| 2022 | OpenAI releases Whisper for multilingual speech recognition |
| 2022 | Amazon introduces AlexaTM 20B, a 20 billion parameter seq2seq model |
The seq2seq framework has had a profound and lasting influence on the field of artificial intelligence. It replaced hand-engineered pipelines with single end-to-end trained networks, gave rise to the attention mechanism that evolved into the Transformer, and established the encoder-decoder blueprint that underlies today's large language models, speech systems, and multimodal models.