See also: Machine learning terms, Natural language processing, Recurrent neural network
A sequence-to-sequence (seq2seq) task is any machine learning problem in which a model receives a variable-length input sequence and produces a variable-length output sequence. The input and output sequences may differ in length, vocabulary, and even modality. Common examples include translating a sentence from one language to another, summarizing a document into a few sentences, converting spoken audio into written text, and generating source code from a natural language description. In the field of deep learning, seq2seq tasks are central to a wide range of natural language processing (NLP) and time series prediction applications.
Seq2seq models typically follow an encoder-decoder architecture: the encoder reads the entire input and compresses it into an internal representation, and the decoder generates the output one element at a time based on that representation. The framework was introduced in two independent 2014 papers by Sutskever, Vinyals, and Le at Google and by Cho et al. at the University of Montreal. Since then, seq2seq has grown from a specialized machine translation technique into the dominant paradigm behind modern large language models, dialogue agents, and multimodal systems.
The core idea is deceptively simple, yet it proved remarkably powerful. Within a few years, seq2seq models advanced from a research curiosity to the backbone of production machine translation systems, and the architectural principles they introduced continue to shape modern large language models.
Imagine you have a friend who speaks only French, and you speak only English. You write a letter in English and hand it to a translator. The translator reads your entire letter, thinks about what it means, and then writes a new letter in French for your friend. In a seq2seq model, the "reader" is the encoder. It reads everything you wrote and creates a summary of the meaning in its head. The "writer" is the decoder. It takes that summary and writes the message in the new language, one word at a time, until the letter is finished. The encoder and decoder work as a team: one understands the input, the other produces the output.
Another way to think about it: imagine you have a bunch of colorful building blocks in a row (the input sequence), and you want to arrange them in a different order to make a new row of blocks (the output sequence). A seq2seq task is like teaching a robot to do this for you. The robot has two main parts, the encoder and the decoder. The encoder looks at the row of colorful blocks and remembers the important information about them. Then, the decoder uses that information to create the new row of blocks in the correct order. The robot can do this for different rows of blocks with different colors and lengths. This idea is used in many things, like translating languages, summarizing long texts, or even helping robots talk to people.
Before seq2seq, the dominant approaches to tasks like machine translation relied on statistical methods such as phrase-based statistical machine translation (SMT). These systems decomposed translation into small phrase-level mappings and used language models to stitch results together. While effective, they required extensive hand-engineered features, alignment heuristics, and large phrase tables.
Recurrent neural networks had been studied since the 1980s, and Long Short-Term Memory (LSTM) networks were introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 to address the vanishing gradient problem that plagued simple RNNs. However, it took nearly two decades of hardware improvements (particularly the rise of GPU computing) and algorithmic refinements before RNNs could be scaled to large real-world sequence transduction tasks.
The seq2seq framework crystallized in 2014 with two papers published nearly simultaneously. Kyunghyun Cho and colleagues at the University of Montreal proposed the RNN Encoder-Decoder in June 2014, while Ilya Sutskever, Oriol Vinyals, and Quoc V. Le at Google published their landmark paper in September 2014. Both papers established the encoder-decoder blueprint that would define the field for the next several years.
The seq2seq framework divides the model into two distinct components that are trained jointly end to end.
The encoder processes the input sequence element by element (for example, word by word) and produces a set of hidden states that capture the meaning and structure of the input. In the original RNN-based designs, the encoder was a recurrent neural network (often an LSTM or GRU) that read the input tokens sequentially. The final hidden state of the encoder, often called the context vector or "thought vector", served as a fixed-length summary of the entire input. In Transformer-based designs, the encoder consists of stacked self-attention layers and feed-forward networks that process all input positions in parallel.
Formally, given an input sequence (x_1, x_2, ..., x_T), the encoder computes a sequence of hidden states:
h_t = f(x_t, h_{t-1})
where f is the recurrent function (e.g., an LSTM cell). The context vector c is typically the final hidden state h_T, or in deeper models, the concatenation of final hidden states across layers.
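To make these equations concrete, here is a minimal PyTorch sketch of an RNN encoder (the class and variable names are illustrative, not taken from the original papers): an embedding layer followed by an LSTM that returns both the per-step hidden states and the final state that plays the role of the context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal RNN encoder sketch: embeds tokens and runs them through an LSTM."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, T) integer token ids
        embedded = self.embed(src_tokens)          # (batch, T, embed_dim)
        outputs, (h_T, c_T) = self.rnn(embedded)   # outputs holds every h_t
        # (h_T, c_T) is the final state, i.e. the fixed-length context vector c
        return outputs, (h_T, c_T)
```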
The decoder generates the output sequence one token at a time in an autoregressive fashion. At each step, it takes the previously generated token (or a special start-of-sequence token at the first step), the encoder's representation, and its own internal state to predict the next output token. Generation continues until the model emits an end-of-sequence token or reaches a maximum length. Like the encoder, the decoder can be implemented with RNNs or Transformer layers.
At each decoding step t, the decoder computes:
s_t = g(y_{t-1}, s_{t-1}, c)
p(y_t | y_1, ..., y_{t-1}, x) = softmax(W_s * s_t)
where g is the decoder's recurrent function, s_t is the decoder hidden state, y_{t-1} is the previously generated token, and c is the context vector. The softmax layer produces a probability distribution over the output vocabulary at each step.
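A matching decoder sketch, again an illustrative minimum rather than any paper's exact implementation: each call to step consumes the previously generated token and the running state (initialized from the encoder's final state), and returns a log-probability distribution over the output vocabulary.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Minimal autoregressive LSTM decoder sketch matching the equations above."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)  # plays the role of W_s

    def step(self, y_prev, state):
        # y_prev: (batch, 1) previously generated token (or <sos> at step one)
        # state: decoder state, initialized from the encoder's final state
        embedded = self.embed(y_prev)
        output, state = self.rnn(embedded, state)
        logits = self.proj(output.squeeze(1))          # scores over the vocabulary
        return torch.log_softmax(logits, dim=-1), state
```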
In the original RNN-based seq2seq models, the encoder compressed the entire input into a single fixed-length context vector. This design created an information bottleneck: for long input sequences, the fixed-size vector could not retain all relevant details, and performance degraded as sentence length increased. The attention mechanism, introduced by Bahdanau et al. in 2014, solved this problem by allowing the decoder to look back at all encoder hidden states at every decoding step rather than relying on a single compressed vector.
The seq2seq paradigm evolved through several distinct phases, each addressing limitations of the previous approach.
Two papers published in 2014 independently proposed the encoder-decoder framework for neural network-based sequence transduction.
| Paper | Authors | Venue | Key contribution |
|---|---|---|---|
| Sequence to Sequence Learning with Neural Networks | Ilya Sutskever, Oriol Vinyals, Quoc V. Le | NeurIPS 2014 | Used a 4-layer LSTM encoder-decoder with reversed input sequences; achieved 34.8 BLEU on WMT'14 English-to-French translation |
| Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation | Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio | EMNLP 2014 | Introduced the RNN Encoder-Decoder architecture and proposed the Gated Recurrent Unit (GRU) as a simpler alternative to LSTM |
The paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le was presented at the 28th Conference on Neural Information Processing Systems (NeurIPS) in December 2014. It is one of the most cited papers in deep learning history and established the seq2seq paradigm as a practical approach to machine translation.
Architecture. The authors used a multilayered LSTM to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector. The specific model employed four stacked LSTM layers with 1,000 hidden units per layer and had 384 million parameters in total; the states of the four layers together form the 8,000-dimensional vector used to represent a sentence. The vocabulary contained 160,000 words for the source language and 80,000 words for the target language. The encoder LSTM read the input sentence and produced a fixed-length vector representation at its final time step. This vector initialized the decoder LSTM, which generated the French translation one word at a time using beam search with a beam size of 12.
Reversing the input sequence. One of the most surprising and practically important findings in the paper was that reversing the order of words in the source sentence significantly improved translation quality. Instead of mapping the sentence (a, b, c) to its translation (alpha, beta, gamma), the LSTM was trained to map (c, b, a) to (alpha, beta, gamma). The intuition behind this trick is that reversing the input introduces many short-term dependencies between the source and target sequences. After reversal, the first few words of the source sentence are close to the first few words of the target sentence, making it easier for stochastic gradient descent to "establish communication" between the input and output. While the average distance between corresponding words remains unchanged, the first several words now have very small distances, and this is enough to bootstrap the optimization process. This simple trick improved BLEU scores by several points.
Results. On the WMT 2014 English-to-French translation task, the results were striking:
| Model | BLEU Score |
|---|---|
| Phrase-based SMT baseline (Moses) | 33.3 |
| Single LSTM (reversed input) | 30.6 |
| Ensemble of 5 LSTMs (reversed input) | 34.81 |
| Ensemble of 5 LSTMs + SMT rescoring (1000-best) | 36.5 |
| Previous state of the art (SMT + neural components) | 37.0 |
The ensemble of five deep LSTMs achieved a BLEU score of 34.81, outperforming the phrase-based SMT baseline of 33.3 without using any phrase tables, alignment models, or hand-engineered features. When the LSTM was used to rerank the top 1,000 hypotheses from the SMT system, the combined system reached 36.5 BLEU. These results demonstrated for the first time that a pure neural approach could compete with, and even surpass, traditional statistical translation systems.
Key insights. Beyond raw performance, the paper showed that deep LSTMs could learn to map variable-length input sequences to variable-length output sequences without any task-specific engineering. It also reported that the model handled long sentences surprisingly well, and that the learned sentence representations were sensitive to word order while being relatively invariant to the active versus passive voice.
Published at EMNLP 2014, the paper "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" by Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio independently proposed the encoder-decoder framework for sequence-to-sequence learning.
The RNN Encoder-Decoder. Cho et al. proposed a model in which one RNN encodes a sequence of symbols into a fixed-length vector representation, and another RNN decodes that representation into a second sequence of symbols. The encoder and decoder were jointly trained to maximize the conditional probability of the target sequence given the source sequence. Rather than translating full sentences end-to-end (as Sutskever et al. did), Cho et al. used the encoder-decoder to score phrase pairs, integrating these scores as an additional feature in an existing phrase-based SMT system.
The Gated Recurrent Unit. A major contribution of this paper was the introduction of the Gated Recurrent Unit (GRU), a new type of recurrent unit designed as a simpler alternative to the LSTM. The GRU uses two gates: an update gate (which controls how much of the previous hidden state to retain) and a reset gate (which determines how much of the previous state to forget when computing the candidate activation). Compared to the LSTM, the GRU has no separate cell state and uses fewer parameters, which can make training faster. The GRU achieved comparable performance with fewer parameters and faster training. The authors showed that the RNN Encoder-Decoder with GRUs learned semantically and syntactically meaningful phrase representations. Phrases with similar meanings were mapped to nearby points in the continuous space, even when their surface forms differed.
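The gate equations fit in a few lines. Below is a minimal NumPy sketch of a single GRU step following the formulation in Cho et al. (2014), in which the update gate interpolates between the previous state and the candidate state; the params dictionary of weight matrices is an assumed convention used here for brevity (bias terms are omitted).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    # params: assumed dict of weight matrices, W_* (hidden, input), U_* (hidden, hidden)
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)      # update gate
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)      # reset gate
    h_cand = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_cand  # retain old state vs. adopt candidate
```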
Relationship to Sutskever et al. Although both papers proposed encoder-decoder architectures, they differed in scope and application:
| Aspect | Sutskever et al. (2014) | Cho et al. (2014) |
|---|---|---|
| Recurrent unit | LSTM | GRU |
| Depth | 4 layers | 1 layer |
| Translation mode | End-to-end sentence translation | Phrase pair scoring within SMT |
| Published at | NeurIPS 2014 | EMNLP 2014 |
| Input reversal | Yes | No |
| Vocabulary | 160K source, 80K target | 15K each |
Together, these two papers firmly established the encoder-decoder paradigm and launched a wave of research into neural sequence transduction.
The fixed-length context vector used by the original seq2seq models created a fundamental bottleneck: the entire input sequence, regardless of its length, had to be compressed into a single vector. For short sentences, this compression was manageable, but for longer inputs, the context vector could not preserve all the relevant information. Performance degraded noticeably as sentence length increased.
Bahdanau attention (additive attention). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio addressed this bottleneck in their paper "Neural Machine Translation by Jointly Learning to Align and Translate," submitted to arXiv in September 2014 and published at ICLR 2015. Their solution was the attention mechanism, which allowed the decoder to selectively focus on different parts of the input sequence at each decoding step, rather than relying on a single fixed-length vector. The key innovation was to replace the fixed context vector with a dynamic context vector that is recomputed at every decoding step. At each step, the model computes an alignment score between the current decoder state and every encoder hidden state. These scores are normalized through a softmax function to produce attention weights, which are then used to compute a weighted sum of the encoder hidden states. This weighted sum becomes the context vector for that particular decoding step.
Bahdanau et al. used a bidirectional RNN as the encoder, reading the input sequence both forward and backward. The annotation for each word was the concatenation of the forward and backward hidden states, giving the model access to context from both directions. The attention scoring function was an additive (also called "concat") function:
e_ij = v^T * tanh(W_a * s_{i-1} + U_a * h_j)
where s_{i-1} is the decoder state, h_j is the j-th encoder annotation, and W_a, U_a, v are learned parameters. This is sometimes referred to as additive attention or Bahdanau attention.
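In code, one step of additive attention is a score, a softmax, and a weighted sum. The following NumPy sketch mirrors the equation above; the parameter names match the formula, while the function itself is illustrative.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v):
    # s_prev: (d_dec,) previous decoder state; H: (T, d_enc) encoder annotations
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v   # e_ij for each source position j
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax -> attention weights
    context = weights @ H                              # dynamic context vector for this step
    return context, weights
```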
On English-to-French translation, the results showed significant improvement over the basic encoder-decoder:
| Model | BLEU (all sentences) | BLEU (no unknown words) |
|---|---|---|
| RNNenc-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50* (longer training) | 28.45 | 36.15 |
| Moses (phrase-based SMT) | 33.30 | 35.63 |
The attention-based model (RNNsearch) dramatically outperformed the basic encoder-decoder (RNNenc), especially on longer sentences where the fixed-length bottleneck was most severe. On sentences with no unknown words, RNNsearch-50* actually surpassed the phrase-based SMT system Moses. Perhaps more importantly, the attention mechanism provided interpretability. By visualizing the attention weights, researchers could see which source words the model was focusing on when generating each target word. These soft alignments corresponded closely to human intuitions about word-level translation correspondences.
Luong attention (multiplicative attention). Thang Luong, Hieu Pham, and Christopher D. Manning extended the attention framework in their 2015 EMNLP paper "Effective Approaches to Attention-based Neural Machine Translation." They introduced two classes of attention mechanisms: global attention, which attends to all source positions at every decoding step (similar in spirit to Bahdanau attention but with a simpler computation path), and local attention, which attends only to a small window of source positions around a predicted alignment point, reducing computation on long sequences.
The scoring functions for Luong attention include:
| Score Function | Formula |
|---|---|
| Dot | s_t^T * h_j |
| General (bilinear) | s_t^T * W_a * h_j |
| Concat (additive) | v^T * tanh(W_a * [s_t ; h_j]) |
Dot-product attention became widely adopted due to its computational simplicity. With local attention and input feeding, Luong et al. achieved a gain of 5.0 BLEU points over non-attentional baselines on the WMT'14 English-to-German translation task. Their ensemble reached 25.9 BLEU on WMT'15 English-German, which was a new state-of-the-art result at the time.
| Attention type | Score function | Introduced by | Year |
|---|---|---|---|
| Additive (Bahdanau) | score(s_t, h_i) = v^T tanh(W[s_t; h_i]) | Bahdanau, Cho, Bengio | 2014 |
| Dot-product (Luong) | score(s_t, h_i) = s_t^T h_i | Luong, Pham, Manning | 2015 |
| General (Luong) | score(s_t, h_i) = s_t^T W h_i | Luong, Pham, Manning | 2015 |
| Scaled dot-product | score(Q, K) = QK^T / sqrt(d_k) | Vaswani et al. | 2017 |
Gehring et al. at Facebook AI Research (2017) proposed replacing RNNs with convolutional neural networks in both the encoder and decoder. Because convolutions can be computed in parallel across all positions, this architecture trained significantly faster than RNN-based models while achieving competitive or superior translation quality. The model used gated linear units and multi-step attention. On the WMT'14 English-to-French benchmark, the convolutional seq2seq model matched the accuracy of deep LSTM systems at nine times the training speed.
While attention mechanisms dramatically improved seq2seq models, the underlying RNNs still had a fundamental limitation: they processed tokens sequentially, one at a time. This meant that training could not be fully parallelized across time steps, making it slow and expensive on long sequences.
In 2017, Ashish Vaswani and colleagues at Google published "Attention Is All You Need" at NeurIPS, introducing the Transformer architecture. The Transformer dispensed with recurrence and convolutions entirely and relied solely on attention mechanisms, specifically self-attention (also called intra-attention), to model dependencies between all positions in a sequence. The encoder consists of stacked layers, each containing multi-head self-attention and position-wise feed-forward networks. The decoder has an additional cross-attention sublayer that attends to the encoder output.
The Transformer retains the encoder-decoder structure of seq2seq models but replaces the RNN layers with stacks of multi-head self-attention layers and position-wise feedforward networks. Key components and innovations include scaled dot-product attention, which computes softmax(QK^T / sqrt(d_k))V over query, key, and value projections; multi-head attention, which runs several attention functions in parallel over different representation subspaces; sinusoidal positional encodings added to the input embeddings, since the model has no recurrence or convolution to convey word order; residual connections and layer normalization around every sublayer; and masked self-attention in the decoder, which prevents each position from attending to future positions and preserves the autoregressive property.
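The central operation, scaled dot-product attention, is compact enough to sketch directly. This NumPy version is illustrative (single head, no masking) and computes softmax(QK^T / sqrt(d_k))V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k) queries; K: (T_k, d_k) keys; V: (T_k, d_v) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T_q, T_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T_q, d_v) attended output
```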
The Transformer achieved new state-of-the-art results on major translation benchmarks:
| Task | Transformer BLEU | Previous Best BLEU |
|---|---|---|
| WMT 2014 English-to-German | 28.4 | 26.4 (ensemble) |
| WMT 2014 English-to-French | 41.8 | 41.0 (ensemble) |
On the WMT 2014 English-to-German task, the Transformer outperformed all previously reported models, including ensembles, by more than 2 BLEU points. On English-to-French, it established a new single-model state-of-the-art BLEU score of 41.8 while requiring a fraction of the training cost of previous models. The big Transformer model was trained for 3.5 days on 8 GPUs, compared to weeks of training for competitive RNN-based systems. The Transformer's ability to process all positions in parallel made it far more efficient to train on modern hardware, and its self-attention mechanism allowed it to capture long-range dependencies more effectively than RNNs. This paper triggered a paradigm shift: within two years, virtually all state-of-the-art NLP models adopted the Transformer architecture, and it has since become the default for nearly all seq2seq tasks.
The success of pre-training and transfer learning led to large-scale encoder-decoder Transformer models that are pre-trained on massive corpora and then fine-tuned for specific seq2seq tasks.
| Model | Authors / Organization | Year | Key features |
|---|---|---|---|
| T5 (Text-to-Text Transfer Transformer) | Raffel et al., Google | 2020 | Casts every NLP task as a text-to-text problem with task-specific prefixes; pre-trained on the C4 corpus; sizes from 60M to 11B parameters |
| BART | Lewis et al., Facebook AI | 2020 | Pre-trained by corrupting text (token masking, sentence permutation, span deletion) and learning to reconstruct the original; strong results on summarization and generation |
| mBART | Liu et al., Facebook AI | 2020 | Multilingual extension of BART; pre-trained on monolingual corpora in 25 languages; up to 12 BLEU improvement on low-resource translation |
| mT5 | Xue et al., Google | 2021 | Multilingual T5 covering 101 languages; pre-trained on mC4 |
In 2020, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu published "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" in the Journal of Machine Learning Research. The T5 model pushed the seq2seq paradigm to its logical conclusion by casting every NLP task as a text-to-text problem.
In the T5 framework, tasks such as translation, summarization, classification, regression, and question answering are all framed identically: the model receives a text input (prefixed with a task-specific instruction such as "translate English to German:" or "summarize:") and produces a text output. This unified formulation allowed the researchers to systematically compare pre-training objectives, model architectures, data sources, and transfer strategies across dozens of tasks using a single model architecture. T5 uses the standard Transformer encoder-decoder architecture. The model was pre-trained on the "Colossal Clean Crawled Corpus" (C4), a cleaned version of Common Crawl containing roughly 750 GB of English text. By combining insights from the systematic study with scale (the largest T5 model, T5-11B, has 11 billion parameters), the researchers achieved state-of-the-art results on many benchmarks including GLUE, SuperGLUE, SQuAD, and CNN/DailyMail.
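As an illustration of the text-to-text interface, the sketch below uses the Hugging Face transformers library, assuming it (and its sentencepiece dependency) is installed and the public t5-small checkpoint can be downloaded; the task is selected purely by the text prefix.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task-specific prefix tells the model which seq2seq task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```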
BART (Bidirectional and Auto-Regressive Transformers), proposed by Mike Lewis and colleagues at Facebook AI Research in 2019, is a denoising autoencoder for pre-training seq2seq models. BART combines a bidirectional encoder (similar to BERT) with an autoregressive decoder (similar to GPT). During pre-training, the input text is corrupted using various noising functions (token masking, token deletion, sentence permutation, text infilling), and the model learns to reconstruct the original text. BART achieved state-of-the-art results on abstractive summarization tasks and strong performance on translation, question answering, and comprehension benchmarks. It demonstrated that the seq2seq encoder-decoder framework, when paired with effective pre-training, could match or exceed encoder-only models like BERT on understanding tasks while also excelling at generation tasks.
The text-to-text paradigm exemplified by T5 and BART has become a dominant approach in modern NLP. Other models in this family include the multilingual extensions mBART and mT5 (described above), PEGASUS, which is pre-trained with gap-sentence generation specifically for abstractive summarization, and Flan-T5, an instruction-tuned variant of T5.
These models show that the encoder-decoder seq2seq architecture remains highly competitive, even in an era dominated by decoder-only language models like GPT-4.
Seq2seq models are used across a wide range of tasks where the input and output are both sequences, even if they differ in length, vocabulary, or modality.
Machine translation was the original and most prominent application of seq2seq models. Before seq2seq, translation systems relied on statistical phrase-based methods that required extensive hand-crafted features and alignment tables. Neural seq2seq models replaced these pipelines with a single end-to-end trained network. Starting with Sutskever et al. (2014), neural machine translation (NMT) rapidly overtook phrase-based SMT.
Google deployed its Neural Machine Translation (GNMT) system in production in 2016, reducing translation errors by an average of 60% compared to the previous phrase-based system across more than 100 language pairs. GNMT used a deep seq2seq architecture with 8 encoder layers and 8 decoder layers, residual connections, and attention. It also introduced wordpiece tokenization to handle rare words by splitting them into subword units. Today, all major translation services (Google Translate, DeepL, Microsoft Translator) use Transformer-based seq2seq models.
Text summarization systems use seq2seq models to generate concise summaries of longer documents. There are two approaches: extractive summarization, which selects and concatenates existing sentences from the source, and abstractive summarization, which generates new sentences that may not appear verbatim in the source. Seq2seq models are particularly suited to abstractive summarization because the decoder can produce novel phrasings.
See, Liu, and Manning (2017) introduced the pointer-generator network, a seq2seq model that can both generate words from a fixed vocabulary and copy words directly from the source text via a pointing mechanism. This hybrid approach addressed the problem of out-of-vocabulary words and factual accuracy. They also introduced a coverage mechanism to reduce repetitive output. Models like BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) have achieved strong results on benchmarks such as CNN/DailyMail and XSum.
Speech recognition (automatic speech recognition, or ASR) is a natural seq2seq problem: the input is a sequence of audio frames (such as mel-frequency cepstral coefficients or log-mel spectrograms), and the output is a sequence of characters or words. Traditional ASR systems combined separate acoustic models, pronunciation dictionaries, and language models. Seq2seq approaches unify these components into a single end-to-end model.
Chan et al. (2016) proposed Listen, Attend and Spell (LAS), which used a pyramidal RNN encoder (the "listener") to process audio features and an attention-based RNN decoder (the "speller") to emit characters. LAS achieved a 14.1% word error rate on a Google voice search task without any external language model.
OpenAI's Whisper (Radford et al., 2022) is a Transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio data. It handles transcription in multiple languages as well as translation from other languages into English. Whisper demonstrated that scaling up weakly supervised training data could produce highly robust ASR without task-specific engineering.
Conversational AI and chatbot systems use seq2seq models to generate contextually appropriate responses given conversation history. The encoder processes the dialogue context (previous turns), and the decoder generates the next response. Early neural dialogue systems, such as the work by Vinyals and Le (2015) on a "Neural Conversational Model," showed that seq2seq could produce surprisingly coherent multi-turn conversations when trained on large corpora. Google's Meena (2020), a 2.6 billion parameter seq2seq chatbot trained on 341 GB of social media conversations, demonstrated that scaling seq2seq models improved open-domain conversation quality.
Seq2seq models power code generation tools that translate natural language descriptions or comments into source code. The input is a programming task description or docstring, and the output is the corresponding code. OpenAI's Codex (the model behind GitHub Copilot) and subsequent code-generation models use Transformer-based seq2seq principles. Code translation (converting code from one programming language to another) and code summarization are other seq2seq applications in this domain.
| Application | Input sequence | Output sequence |
|---|---|---|
| Image captioning | Image feature vectors (from a CNN) | Natural language caption |
| Question answering | Question + context passage | Answer text |
| Grammar correction | Sentence with errors | Corrected sentence |
| Data-to-text generation | Structured data (tables, knowledge graphs) | Natural language description |
| Music generation | Symbolic music notation or audio features | New musical sequence |
| Protein structure prediction | Amino acid sequence | 3D structure coordinates |
| Mathematical problem solving | Problem statement | Solution steps |
| Time series prediction | Past observations | Future forecasted values |
In image captioning, a convolutional neural network acts as the encoder to process an image, with an RNN or Transformer decoder generating the textual description. In time series prediction, the encoder reads past observations and the decoder predicts future steps. In protein structure prediction, the model encodes amino acid sequences and predicts structural properties.
During training, the standard approach feeds the ground-truth previous token as input to the decoder at each step, regardless of what the model would have predicted. This technique, called teacher forcing, stabilizes training and speeds up convergence because the decoder always conditions on correct context and does not have to recover from its own mistakes during early training. However, during inference, the model must use its own predictions as inputs, creating a mismatch between training and inference conditions known as exposure bias.
Samy Bengio et al. (2015) proposed scheduled sampling to mitigate exposure bias. During training, the model gradually transitions from using ground-truth tokens to using its own predictions as decoder input. The probability of using a model-generated token increases over the course of training according to a schedule (linear, exponential, or inverse sigmoid decay). This curriculum-based strategy helps the decoder learn to handle its own imperfect predictions and recover from errors.
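The two regimes differ only in where the decoder's next input token comes from. The sketch below assumes the Decoder.step interface from the earlier sketch; it reduces to pure teacher forcing when sampling_prob is 0 and implements scheduled sampling as sampling_prob is annealed upward over training.

```python
import random
import torch

def decode_for_training(decoder, targets, state, sampling_prob=0.0):
    # targets: (batch, T) ground-truth output tokens, beginning with <sos>
    log_probs = []
    y_prev = targets[:, 0:1]                           # start-of-sequence token
    for t in range(1, targets.size(1)):
        step_log_probs, state = decoder.step(y_prev, state)
        log_probs.append(step_log_probs)
        if random.random() < sampling_prob:            # scheduled sampling branch
            y_prev = step_log_probs.argmax(dim=-1, keepdim=True)  # model's own guess
        else:                                          # teacher forcing branch
            y_prev = targets[:, t:t+1]                 # ground-truth previous token
    return log_probs  # compared against targets[:, 1:] with cross-entropy
```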
Standard seq2seq training uses token-level cross-entropy loss, which maximizes the probability of each correct token independently. However, the actual evaluation metrics (such as BLEU or ROUGE) operate at the sequence level. Several methods address this discrepancy, including sequence-level training with policy-gradient reinforcement learning (Ranzato et al., 2016), minimum risk training that directly optimizes the expected value of a sequence-level metric (Shen et al., 2016), and beam search optimization (Wiseman and Rush, 2016).
Handling open vocabularies is a practical challenge for seq2seq models. Early models used fixed vocabularies (often 30,000 to 80,000 words) and replaced unknown words with a special UNK token, which degraded output quality, particularly for morphologically rich languages. Several solutions were developed: character-level models that dispense with a word vocabulary entirely, copy mechanisms that point to rare words in the source, and, most influentially, subword tokenization such as byte-pair encoding (BPE), introduced for NMT by Sennrich et al. (2016), which splits rare words into smaller units that remain in the vocabulary.
Variants of subword tokenization include WordPiece (used in BERT and GNMT) and SentencePiece (used in T5 and mBART). Subword tokenization via BPE became the standard and is used by virtually all modern seq2seq and language models.
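The merge-learning loop at the heart of BPE is short. Here is a toy sketch in the spirit of the reference implementation from Sennrich et al. (2016); production tokenizers add byte-level fallback, frequency thresholds, and far faster data structures.

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    # words: dict mapping space-separated symbol sequences to corpus frequencies,
    # e.g. {"l o w </w>": 5, "l o w e r </w>": 2}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                 # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair to merge
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        words = {pattern.sub("".join(best), w): f for w, f in words.items()}
    return merges
```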
At inference time, the decoder must select output tokens without access to ground-truth sequences. Several strategies exist, each with different trade-offs between quality and computational cost.
The simplest approach selects the highest-probability token at each step. Greedy decoding is fast but often produces suboptimal sequences because a locally optimal choice at one step may lead to a globally poor sequence.
Beam search is the most common decoding strategy for seq2seq models. Rather than greedily selecting the highest-probability token at each step, beam search maintains a fixed number of candidate sequences (the beam width, typically 4 to 12) at each decoding step. At each step, it expands all current candidates by one token, scores them, and retains only the top-scoring candidates. Beam search provides a better approximation of the optimal output than greedy decoding but is more computationally expensive. Sutskever et al. (2014) used a beam size of 12 in their experiments, and beam sizes between 4 and 10 remain typical in practice. Length normalization and coverage penalties are often applied to prevent beam search from favoring short outputs or neglecting parts of the input.
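A simplified beam search fits in a few dozen lines. In the sketch below, next_log_probs is an assumed stand-in for a trained model's one-step predictor, and the final length normalization is the simple average-log-probability variant.

```python
import numpy as np

def beam_search(next_log_probs, sos_id, eos_id, beam_width=4, max_len=50):
    beams = [([sos_id], 0.0)]                 # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_p = next_log_probs(seq)       # log-probs over the vocabulary
            for tok in np.argsort(log_p)[-beam_width:]:   # top continuations
                candidates.append((seq + [int(tok)], score + float(log_p[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:        # keep the best overall
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    pool = finished or beams
    # length normalization counters the bias toward short hypotheses
    return max(pool, key=lambda c: c[1] / len(c[0]))[0]
```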
For tasks where diversity is desired (such as dialogue or creative text generation), sampling methods are used instead of deterministic search: temperature sampling rescales the logits to sharpen or flatten the distribution; top-k sampling restricts sampling to the k most probable tokens; and nucleus (top-p) sampling draws from the smallest set of tokens whose cumulative probability exceeds a threshold p.
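The following NumPy sketch combines the three strategies in a single illustrative function; the names are ours, and real implementations typically operate on batched logits.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=np.random):
    logits = logits / temperature                  # temperature reshapes confidence
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens sorted by probability
    if top_k > 0:
        order = order[:top_k]                      # keep only the k most likely
    if top_p < 1.0:
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        order = order[:cutoff]                     # smallest set with mass >= top_p
    kept = probs[order] / probs[order].sum()       # renormalize over the kept set
    return int(rng.choice(order, p=kept))
```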
Seq2seq outputs are typically evaluated using automatic metrics that compare generated sequences against reference sequences.
| Metric | Full name | Used for | How it works |
|---|---|---|---|
| BLEU | Bilingual Evaluation Understudy | Machine translation | Measures n-gram precision of the generated text against references; uses a brevity penalty to discourage overly short outputs |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Text summarization | Measures n-gram recall (ROUGE-N), longest common subsequence (ROUGE-L), or skip-bigram overlap (ROUGE-S) between generated and reference summaries |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering | Machine translation | Considers synonyms, stemming, and word order in addition to exact matches |
| Perplexity | N/A | Language model evaluation | Measures how well the model predicts the next token; lower perplexity indicates better predictive performance |
| BERTScore | N/A | General text generation | Computes semantic similarity between generated and reference texts using contextual embeddings from BERT |
| CER / WER | Character / Word Error Rate | Speech recognition | Edit distance between the predicted and reference transcriptions, normalized by the reference length |
Despite their success, seq2seq models face several well-known challenges.
The discrepancy between teacher forcing during training and autoregressive generation during inference means that errors at early decoding steps can compound throughout the sequence. Scheduled sampling, reinforcement learning fine-tuning, and non-autoregressive decoding are among the approaches that attempt to reduce this problem.
Seq2seq models can generate fluent but factually incorrect text, a phenomenon known as hallucination. This is particularly problematic in summarization, where the model may produce details that do not appear in the source document. Faithfulness metrics and constrained decoding methods have been developed to detect and mitigate hallucinations.
Even with attention mechanisms, processing very long input sequences remains challenging. RNN-based models suffer from the vanishing gradient problem over long distances, while Transformer-based models face quadratic memory and computation costs with respect to sequence length. Efficient attention variants such as sparse attention, linear attention, and sliding window attention have been proposed to address this limitation.
Large seq2seq models require substantial computational resources for both training and inference. Training is expensive, and the quadratic complexity of self-attention with respect to sequence length (discussed above) limits the maximum input size. On the output side, autoregressive decoding is inherently sequential, which limits throughput; non-autoregressive translation models attempt to generate all output tokens in parallel, trading some quality for significant speedups.
Automatic metrics like BLEU and ROUGE have known shortcomings. BLEU relies on exact n-gram matching and can penalize valid paraphrases. ROUGE focuses on recall and may not capture semantic correctness. Human evaluation remains the gold standard but is expensive and time-consuming.
| Feature | RNN Encoder-Decoder (2014) | RNN + Attention (2014-2015) | CNN-based (2017) | Transformer (2017) |
|---|---|---|---|---|
| Encoder type | Unidirectional LSTM/GRU | Bidirectional LSTM/GRU | Stacked convolutions | Self-attention + feedforward |
| Decoder type | Unidirectional LSTM/GRU | Unidirectional LSTM/GRU + attention | Stacked convolutions with attention | Self-attention + cross-attention + feedforward |
| Context representation | Fixed-length vector (encoder final state) | Dynamic context vector (weighted sum of encoder states) | Convolutional features with multi-step attention | Full encoder output attended at every layer |
| Parallelization | Sequential (no parallelism across time) | Sequential (no parallelism across time) | Fully parallel during training | Fully parallel across positions |
| Long-range dependencies | Limited by vanishing gradients, even with LSTM/GRU | Improved via direct attention connections | Fixed receptive field (grows with depth) | Excellent (direct attention between all position pairs) |
| Positional information | Implicit in sequential processing | Implicit in sequential processing | Implicit in convolutional structure | Explicit positional encodings required |
| Training speed | Slow (sequential computation) | Slow (sequential computation) | Fast (parallelizable) | Fast (parallelizable on GPUs/TPUs) |
| Interpretability | Low | Moderate (attention weights provide alignment) | Moderate | Moderate (multi-head attention weights) |
| Representative BLEU (En-Fr WMT'14) | 34.81 (Sutskever et al., ensemble) | 36.15 (Bahdanau et al., no UNK) | Competitive with LSTM | 41.8 (Vaswani et al., single model) |
| Representative models | Original seq2seq (2014), GNMT (2016) | Bahdanau (2015), Luong (2015) | ConvS2S (Gehring et al., 2017) | Transformer (2017), T5, BART, mBART |
| Year | Milestone |
|---|---|
| 2013 | Kalchbrenner and Blunsom propose an encoder-decoder model using a CNN encoder and RNN decoder for machine translation |
| 2014 | Sutskever, Vinyals, and Le demonstrate end-to-end seq2seq with deep LSTM networks |
| 2014 | Cho et al. introduce the RNN Encoder-Decoder and the GRU |
| 2014 | Bahdanau, Cho, and Bengio propose the attention mechanism for seq2seq |
| 2015 | Luong, Pham, and Manning explore global and local attention variants |
| 2016 | Google deploys GNMT for production machine translation |
| 2016 | Sennrich et al. introduce BPE for subword tokenization in NMT |
| 2016 | Chan et al. propose Listen, Attend and Spell for end-to-end speech recognition |
| 2017 | Gehring et al. introduce convolutional seq2seq |
| 2017 | Vaswani et al. publish "Attention Is All You Need," introducing the Transformer |
| 2017 | See, Liu, and Manning propose pointer-generator networks for summarization |
| 2019 | Facebook AI demonstrates seq2seq for symbolic mathematics |
| 2020 | Google publishes T5, unifying NLP tasks as text-to-text seq2seq |
| 2020 | Facebook AI publishes BART and mBART for pre-trained seq2seq |
| 2022 | OpenAI releases Whisper for multilingual speech recognition |
| 2022 | Amazon introduces AlexaTM 20B, a 20 billion parameter seq2seq model |
The seq2seq framework has had a profound and lasting influence on the field of artificial intelligence. It replaced hand-engineered pipelines with single end-to-end trained networks, gave rise to the attention mechanism that evolved into the Transformer, and established the encoder-decoder blueprint that underlies today's large language models, speech systems, and multimodal models.