Text2Text Generation Models
Last reviewed
May 11, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 2,497 words
Text-to-text (text2text) generation models are a family of neural network systems that frame many natural language processing tasks as a single problem: given an input text string, produce an output text string. The category is dominated by encoder-decoder Transformer architectures that are pretrained on large unlabeled corpora with denoising objectives and then either fine-tuned on individual tasks or trained jointly on many tasks using task-specific prefixes. The paradigm was popularized by Google's T5 (Text-to-Text Transfer Transformer), which treats translation, summarization, classification, question answering, and other tasks as variations of the same text-to-text format. Other widely used text2text models include BART, mT5, FLAN-T5, UL2, PEGASUS, LongT5, and CodeT5.
Text2text models are distinguished from decoder-only text generation models (such as the GPT family) by their architecture. A text2text model has a bidirectional encoder that builds contextual representations of the entire input, and a separate autoregressive decoder that attends to those representations through cross-attention while generating the output. Decoder-only models, in contrast, use a single causal stack that conditions only on past tokens.
See also: Natural Language Processing Models and Tasks
The encoder-decoder formulation of text-to-text generation grew out of neural machine translation research in the mid-2010s. Sutskever, Vinyals, and Le (2014) introduced sequence-to-sequence learning with stacked LSTMs, mapping an English sentence to a fixed-dimensional vector and decoding it into French, reaching 34.8 BLEU on WMT'14 English-French. The same year, Bahdanau, Cho, and Bengio added a soft alignment mechanism that lets the decoder attend to different positions in the source at each generation step. That work introduced what is now called Bahdanau or additive attention.
The encoder-decoder Transformer of Vaswani et al. (2017) replaced recurrence with self-attention, parallelized training, and set new BLEU scores on WMT 2014 English-German (28.4) and English-French (41.8). The original Transformer was itself a text-to-text architecture for translation, though the label "text-to-text" became associated with later models that applied the same architecture to many tasks at once.
In October 2019, two papers published within two weeks of each other crystallized the modern text2text paradigm. BART, from Facebook AI Research (now Meta AI), proposed a denoising autoencoder that corrupts text with an arbitrary noising function and learns to reconstruct the original. T5, from Google Research, reframed every supervised task as taking a text string in and producing a text string out, with task-specific prefixes like "translate English to German:" or "summarize:". T5 was pretrained on the Colossal Clean Crawled Corpus (C4), a roughly 750 GB filtered subset of Common Crawl.
Work after T5 extended the recipe along several dimensions. mT5 (Xue et al., 2020) trained the architecture on a corpus covering 101 languages. ByT5 (Xue et al., 2021) replaced the SentencePiece tokenizer with raw UTF-8 bytes, producing a token-free model that handles noisy text and works on any script. PEGASUS (Zhang et al., 2019) introduced gap-sentences generation, a summarization-specific pretraining objective in which whole sentences are masked and reconstructed. LongT5 (Guo et al., 2021) combined T5 with a transient global attention pattern to support inputs up to roughly 16,000 tokens, and PEGASUS-X (Phang et al., 2022) extended PEGASUS to the same regime.
The Switch Transformer (Fedus et al., 2021) applied sparse mixture-of-experts routing to a T5 backbone, reaching trillion-parameter scale at constant per-token compute. UL2 (Tay et al., 2022) unified denoising and causal pretraining into a single Mixture-of-Denoisers objective and was released as a 20-billion-parameter encoder-decoder. FLAN-T5 (Chung et al., 2022) showed that scaling up the number of instruction-tuned tasks plus chain-of-thought training dramatically improves zero-shot and few-shot performance. Flan-UL2, released in early 2023, applied the same recipe to the 20B UL2 model. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) adapted the encoder-decoder design to programming languages.
A text2text model is a Transformer with two stacks. The encoder reads the input token sequence and applies bidirectional self-attention, so each input position can attend to every other. The output is one contextual vector per input token. The decoder generates the output one token at a time using two attention mechanisms per layer: causal self-attention over previously generated tokens, and cross-attention over the encoder output. The two stacks are typically the same depth and width.
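The three attention patterns described above can be made concrete as boolean visibility masks. The following is a minimal sketch in plain Python; the function names are hypothetical and chosen only for illustration, and a real implementation would build these masks as tensors inside the attention layers.

```python
def encoder_mask(n):
    """Bidirectional self-attention: every input position sees every input position."""
    return [[True] * n for _ in range(n)]


def causal_mask(m):
    """Causal self-attention: output position i sees output positions 0..i only."""
    return [[j <= i for j in range(m)] for i in range(m)]


def cross_mask(m, n):
    """Cross-attention: every output position sees all n encoder output vectors."""
    return [[True] * n for _ in range(m)]
```

For an input of 4 tokens and an output of 3 tokens, `encoder_mask(4)` is a full 4x4 grid, `causal_mask(3)` is lower-triangular, and `cross_mask(3, 4)` is a full 3x4 grid. Only the decoder's self-attention restricts what a position can see.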
T5 uses relative position biases inside attention rather than absolute positional embeddings. BART uses learned absolute position embeddings and a GELU activation, following the BERT and GPT conventions of the time. UL2 and LongT5 further modify positional handling and attention to suit their pretraining objectives and longer inputs.
The text-to-text framing introduced by Raffel et al. is the central design idea of the category. Every input begins with a short prefix that names the task. Examples from the original T5 paper include "translate English to German: That is good.", "summarize: state authorities dispatched emergency crews ...", "cola sentence: The course is jumping well.", and "stsb sentence1: ... sentence2: ...". Classification targets are written as words ("acceptable", "entailment") rather than class indices, and regression targets like semantic similarity are rounded to the nearest increment of 0.2 and emitted as text (for example, 2.57 becomes "2.6"). A single set of weights handles every task; at inference time the prefix selects which one.
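A short sketch shows how heterogeneous tasks collapse into (input string, target string) pairs. The helper names are hypothetical. The negative CoLA label string and the 0.2 rounding step for STS-B are assumptions based on the T5 paper's description, not a reproduction of the official preprocessing code.

```python
def translate_example(source, target):
    """Translation: the prefix names the language pair, the target is plain text."""
    return f"translate English to German: {source}", target


def cola_example(sentence, acceptable):
    """Classification: the label is emitted as a word, not a class index."""
    label = "acceptable" if acceptable else "not acceptable"  # assumed label strings
    return f"cola sentence: {sentence}", label


def stsb_example(sentence1, sentence2, score):
    """Regression: the similarity score is rounded and emitted as text."""
    rounded = round(score * 5) / 5  # nearest multiple of 0.2 (assumed step size)
    return f"stsb sentence1: {sentence1} sentence2: {sentence2}", f"{rounded:.1f}"
```

With these helpers, `stsb_example("A man sings.", "A man is singing.", 2.57)` yields the target string `"2.6"`. All three tasks can then be mixed in one training batch and scored with the same token-level loss.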
This framing makes the loss function uniform (cross-entropy over output tokens), simplifies multi-task fine-tuning, and turns evaluation into string matching. It is also the conceptual predecessor of prompt-based usage of decoder-only large language models, although decoder-only models typically learn the prompt format implicitly from web text rather than from explicit task prefixes.
Denoising is the dominant pretraining objective for text2text models. BART pretrains by corrupting text with several noising functions, including text infilling (replacing spans of tokens with a single mask), sentence permutation, token deletion, document rotation, and token masking, then training the model to reconstruct the original document. Lewis et al. found that the combination of text infilling and sentence permutation gave the strongest results.
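The five noising functions can be sketched on token and sentence lists. This is an illustrative simplification rather than BART's actual preprocessing: the function names, the `<mask>` string, and the 0.15 default probabilities are assumptions, and text infilling here takes an explicit span, whereas BART samples span lengths from a Poisson distribution. In every case the training target is the original, uncorrupted document.

```python
import random

MASK = "<mask>"  # assumed mask token string


def token_masking(tokens, rng, p=0.15):
    """Replace each token with a mask independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, rng, p=0.15):
    """Delete each token independently with probability p (positions are not marked)."""
    return [t for t in tokens if rng.random() >= p]


def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single mask.

    A zero-length span inserts a mask without removing anything.
    """
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, rng):
    """Shuffle sentence order; the input list is left unchanged."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def document_rotation(tokens, pivot):
    """Rotate the document so that it begins at tokens[pivot]."""
    return tokens[pivot:] + tokens[:pivot]
```

Because text infilling collapses a whole span into one mask, the model must also infer how many tokens are missing, which token masking does not require.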
T5 uses span corruption. Roughly 15 percent of input tokens are dropped in contiguous spans of average length three, each span is replaced by a single sentinel token, and the target sequence is the dropped spans separated by the same sentinels. This produces short targets, which is cheap relative to BART's full-document reconstruction.
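The construction of the input and target sequences can be sketched as follows. The spans are passed in explicitly so that the example is deterministic, whereas real pretraining samples them at random. The `<extra_id_N>` sentinel names follow a common tokenizer convention and are an assumption here, as is the closing sentinel at the end of the target, which mirrors the worked example in the T5 paper.

```python
def span_corrupt(tokens, spans):
    """Build (inputs, targets) for T5-style span corruption.

    spans: sorted, non-overlapping (start, length) pairs naming the dropped spans.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel replaces the whole span
        targets.append(sentinel)             # same sentinel introduces the span in the target
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])                  # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")      # closing sentinel (assumption)
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 2), (8, 1)])
# inputs:  Thank you <extra_id_0> me to your party <extra_id_1> week .
# targets: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target contains only the three dropped tokens plus sentinels, which is why decoding cost during pretraining stays small compared with reconstructing the full document.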
UL2 unifies several pretraining schemes under a Mixture-of-Denoisers. R-denoising is regular T5-style span corruption with short spans. S-denoising splits the document at a random position and treats the prefix as the input and the suffix as the target, which is essentially causal language modeling cast as a text-to-text problem. X-denoising is extreme span corruption, with longer spans or higher corruption ratios. Each objective is prefixed by a mode token so the model can be steered toward one or another at inference.
PEGASUS uses gap-sentences generation: principal sentences (chosen by a ROUGE-based importance heuristic) are masked out of the document and the model is asked to regenerate them as a pseudo-summary.
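Two of these objectives are simple enough to sketch directly. The code below is a simplified illustration under stated assumptions. The `[S2S]` mode token and `<mask_1>` sentence mask are assumed token strings. The split position is passed in rather than sampled. A plain unigram-overlap F1 stands in for the ROUGE-based heuristic and is not the exact PEGASUS scoring function.

```python
from collections import Counter


def s_denoise(tokens, split, mode_token="[S2S]"):
    """UL2-style S-denoising: prefix (plus mode token) is the input, suffix is the target."""
    return [mode_token] + tokens[:split], tokens[split:]


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings (ROUGE-1 stand-in)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentences(sentences, k=1, mask="<mask_1>"):
    """Mask the k sentences most similar to the rest of the document.

    Returns (inputs, target): the document with masked sentences, and the
    masked sentences in document order as the pseudo-summary.
    """
    scored = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scored.append((unigram_f1(sentence, rest), i))
    # Highest score first; ties go to the earlier sentence.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    chosen = {i for _, i in ranked[:k]}
    inputs = [mask if i in chosen else s for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in chosen]
    return inputs, target
```

In the S-denoising helper, only the mode token distinguishes the example from an ordinary prefix-continuation pair, which is how the same encoder-decoder can be steered between denoising styles at inference.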
| Model | Release | Organization | Sizes | Notes |
|---|---|---|---|---|
| BART | Oct 2019 | Facebook AI Research | 140M (base), 400M (large) | Denoising autoencoder; strong on summarization and dialogue |
| T5 | Oct 2019 | Google Research | 60M, 220M, 770M, 3B, 11B | Span corruption pretraining on C4; unified text-to-text framing |
| PEGASUS | Dec 2019 | Google Research | 568M | Gap-sentences generation objective for abstractive summarization |
| mT5 | Oct 2020 | Google Research | 300M to 13B | Pretrained on mC4 across 101 languages |
| Switch Transformer | Jan 2021 | Google Research | 1.6T (sparse) | Mixture-of-experts on a T5 backbone |
| CodeT5 | Sep 2021 | Salesforce Research | 60M, 220M, 770M | Identifier-aware pretraining on 8.35M functions in 8 languages |
| LongT5 | Dec 2021 | Google Research | up to 3B | Transient global attention for inputs up to 16K tokens |
| UL2 | May 2022 | Google Research | 20B | Mixture-of-Denoisers; SOTA on 50 supervised tasks at release |
| PEGASUS-X | Aug 2022 | Google Research | 272M, 568M | Staggered block-local attention for 16K-token inputs |
| FLAN-T5 | Oct 2022 | Google Research | 80M, 250M, 780M, 3B, 11B | Instruction-tuned T5; 1.8K tasks plus chain-of-thought data |
| Flan-UL2 | Mar 2023 | Google Research | 20B | UL2 with FLAN instruction tuning |
| CodeT5+ | May 2023 | Salesforce Research | 220M to 16B | Multi-objective code LLM; instruction-tuned variant matches open code LLMs on HumanEval |
ByT5, released in May 2021, comes in the same five-size range as mT5 but trades the SentencePiece tokenizer for raw UTF-8 byte input. The T5X framework, written in JAX, reimplemented T5, mT5, UL2, and related models for TPU training and is the reference codebase for most public Google encoder-decoder checkpoints.
Text2text models have set or matched state-of-the-art results across a broad range of NLP tasks. Raffel et al. report that T5-11B reached an average score of 89.3 on SuperGLUE (compared with a human baseline of 89.8 reported with the benchmark), an exact-match score of 90.06 on SQuAD v1.1, a ROUGE-L of 39.65 on CNN/DailyMail summarization, and BLEU scores of 32.1 on WMT English-German and 43.4 on WMT English-French. BART improved on the previous state of the art on XSum summarization by roughly 6 ROUGE points and matched RoBERTa on GLUE and SQuAD despite being a generative model.
The following table summarizes representative results for several models.
| Model | Task | Benchmark | Score |
|---|---|---|---|
| T5-11B | Multi-task | SuperGLUE | 89.3 average |
| T5-11B | Reading comprehension | SQuAD v1.1 | 90.06 EM |
| T5-11B | Translation | WMT En-De | 32.1 BLEU |
| T5-11B | Summarization | CNN/DailyMail | 39.65 ROUGE-L |
| BART-large | Summarization | XSum | 45.14 ROUGE-1 |
| PEGASUS-large | Summarization | XSum | 47.21 ROUGE-1 |
| mT5-XXL | Cross-lingual QA | TyDi QA GoldP | 82.5 F1 |
| FLAN-T5-XXL | Zero-shot reasoning | MMLU | 55.1 percent |
| UL2 20B | Zero-shot generation | SuperGLUE | exceeds 175B GPT-3 |
| CodeT5+ 16B | Code generation | HumanEval pass@1 | 35.0 percent |
Encoder-decoder text2text models and decoder-only language models sit at different points in the design space. The bidirectional encoder lets every input token attend to every other input token, which is well suited to tasks where the model must understand a fixed input before producing a short structured output, such as classification, extractive question answering, and summarization. Cross-attention also separates the cost of reading the input from the cost of generating the output, so a long document with a short summary is cheaper to process than with a decoder-only architecture, which rolls the entire input through its causal stack alongside the output.
Decoder-only models, by contrast, have benefited disproportionately from scale and from prompt-based learning since 2020. A single autoregressive stack, the absence of an architectural prior about which positions are "input" and which are "output", and the ability to interleave instructions, examples, and continuations in one stream made decoder-only the dominant choice for frontier-scale large language models. Open-weight encoder-decoder models above roughly 20B parameters remain rare; UL2 20B, Flan-UL2 20B, and the sparse Switch family are the principal exceptions. Tay et al. nonetheless reported that UL2 outperformed a 175B GPT-3 on zero-shot SuperGLUE while using a small fraction of the compute.
Text2text models cover most generation-flavored NLP tasks. Machine translation was the original target for sequence-to-sequence research and remains a strong fit for the encoder-decoder design. Abstractive text summarization is the area where BART, PEGASUS, and LongT5 are most widely adopted in production, including for news, scientific papers, and long-form documents. Question answering is supported in both extractive and generative forms; T5 introduced the "closed-book" formulation, in which the model answers factual questions without retrieving external passages.
Other common applications include paraphrasing, text simplification, headline generation, grammatical error correction, data-to-text (turning structured tables into prose), dialogue response generation, and natural language to SQL. CodeT5 and CodeT5+ target programming tasks such as code summarization, completion, defect detection, clone detection, and text-to-code retrieval. Multilingual variants (mT5, ByT5) extend each of these applications across languages, often outperforming language-specific baselines on low-resource pairs.
Text2text models share several limitations. Denoising pretraining does not match downstream tasks exactly, so fine-tuning is generally still needed for strong task performance, and instruction tuning for strong zero-shot performance. The fixed split between encoder and decoder makes the architecture less flexible than a single causal stack for tasks where input and output blur (long multi-turn dialogues, code execution traces, agentic loops). Few open-weight encoder-decoder checkpoints exceed 20B parameters, which limits direct comparisons against frontier decoder-only models. Generation diversity can also suffer, because encoder-decoder outputs tend to be more confident and less varied than those of a temperature-sampled decoder-only model.
Like all generative language models, text2text models can hallucinate facts, copy biases from their pretraining data, and produce unsafe content when prompted adversarially. The closed-book question-answering experiments in the original T5 paper made this explicit: even an 11B model misses many factual questions that are easily handled by retrieval-augmented systems.