Text2Text Generation Models

Updated Jul 16, 2026

Text-to-text (text2text) generation models are a family of neural network systems that frame many natural language processing tasks as a single problem: given an input text string, produce an output text string. The category is dominated by encoder-decoder Transformer architectures pretrained on large unlabeled corpora with denoising objectives, and then either fine-tuned on individual tasks or trained jointly on many tasks using task-specific prefixes. The paradigm was popularized by Google's T5 (Text-to-Text Transfer Transformer), which treats translation, summarization, classification, question answering, and other tasks as variations of the same text-to-text format.^[5] Other widely used text2text models include BART, mT5, FLAN-T5, UL2, PEGASUS, LongT5, and CodeT5.

Text2text models are distinguished from decoder-only text generation models (such as the GPT family) by their architecture. A text2text model has a bidirectional encoder that builds contextual representations of the entire input, and a separate autoregressive decoder that attends to those representations through cross-attention while generating the output. Decoder-only models, in contrast, use a single causal stack that conditions only on past tokens.

See also: Natural Language Processing Models and Tasks

Infobox

Text-to-text generation models
Type	Encoder-decoder Transformer family
Core idea	Every NLP task reframed as text in, text out
Defining architecture	Bidirectional encoder + autoregressive decoder with cross-attention
Dominant pretraining objective	Span corruption and denoising
Unifying paper	"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2019)^[5]
Key models	T5, BART, mT5, FLAN-T5, UL2, PEGASUS, LongT5, CodeT5
Key tasks	Machine translation, summarization, question answering, classification-as-generation, paraphrase, NLI
Representative sizes	60M to 20B dense; 1.6T sparse (Switch Transformer)^[8]
Predecessor paradigm	Sequence-to-sequence learning with LSTMs (Sutskever et al., 2014)^[1]

History

The encoder-decoder formulation of text-to-text generation grew out of neural machine translation research in the mid-2010s. Sutskever, Vinyals, and Le (2014) introduced sequence-to-sequence learning with stacked LSTMs, mapping an English sentence to a fixed-dimensional vector and decoding it into French, reaching 34.8 BLEU on WMT'14 English-French.^[1] The same year, Bahdanau, Cho, and Bengio added a soft alignment mechanism that lets the decoder attend to different positions in the source at each generation step.^[2] That work introduced what is now called Bahdanau or additive attention.^[2]

The encoder-decoder Transformer of Vaswani et al. (2017) replaced recurrence with self-attention, parallelized training, and set new state-of-the-art BLEU scores on WMT 2014 English-German (28.4) and English-French (41.8).^[3] The original Transformer was itself a text-to-text architecture for translation, though the label "text-to-text" became associated with later models that applied the same architecture to many tasks at once.

In October 2019, two papers published within two weeks of each other crystallized the modern text2text paradigm. BART, from Facebook AI Research (now Meta AI), proposed a denoising autoencoder that corrupts text with an arbitrary noising function and learns to reconstruct the original.^[4] T5, from Google Research, reframed every supervised task as taking a text string in and producing a text string out, with task-specific prefixes like "translate English to German:" or "summarize:".^[5] T5 was pretrained on the Colossal Clean Crawled Corpus (C4), a roughly 750 GB filtered subset of Common Crawl.^[5]

Work after T5 extended the recipe along several dimensions. mT5 (Xue et al., 2020) trained the architecture on a corpus covering 101 languages.^[7] ByT5 (Xue et al., 2021) replaced the SentencePiece tokenizer with raw UTF-8 bytes, producing a token-free model that handles noisy text and works on any script.^[9] PEGASUS (Zhang et al., 2019) introduced gap-sentences generation, a summarization-specific pretraining objective in which whole sentences are masked and reconstructed.^[6] LongT5 (Guo et al., 2021) combined T5 with a transient global attention pattern to support inputs up to roughly 16,000 tokens,^[11] and PEGASUS-X (Phang et al., 2022) extended PEGASUS to the same regime.^[13]

The Switch Transformer (Fedus et al., 2021) applied sparse mixture-of-experts routing to a T5 backbone, reaching trillion-parameter scale at constant per-token compute.^[8] UL2 (Tay et al., 2022) unified denoising and causal pretraining into a single Mixture-of-Denoisers objective and was released as a 20-billion-parameter encoder-decoder.^[12] FLAN-T5 (Chung et al., 2022) showed that scaling up the number of instruction-tuned tasks plus chain-of-thought training dramatically improves zero-shot and few-shot performance.^[14] Flan-UL2, released in early 2023, applied the same recipe to the 20B UL2 model. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) adapted the encoder-decoder design to programming languages.^[10]^[15]

Architecture

A text2text model is a Transformer with two stacks. The encoder reads the input token sequence and applies bidirectional self-attention, so each input position can attend to every other. The output is one contextual vector per input token. The decoder generates the output one token at a time using two attention mechanisms per layer: causal self-attention over previously generated tokens, and cross-attention over the encoder output.^[3] The two stacks are typically the same depth and width.

T5 uses relative position biases inside attention rather than absolute positional embeddings.^[5] BART uses learned absolute positions and a GeLU activation, following the BERT and GPT conventions of the time.^[4] UL2 and LongT5 introduce additional positional and attention modifications to handle their pretraining objectives and longer inputs.^[11]^[12]

Encoder stack

The encoder applies a stack of identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer.^[3] Because the attention is bidirectional, each token representation integrates information from the full input at every layer. This is a key advantage for tasks where global context matters: a model summarizing a document can, for example, resolve coreference across the full input before generating a single output token. The encoder produces one hidden-state vector per input token, and the full set of these vectors is passed to every decoder layer via cross-attention.

T5's encoder uses relative attention biases rather than absolute positional embeddings.^[5] Relative biases add a learned scalar to each attention logit depending on the distance between the attending token and the attended token, capped at a maximum bucket distance.^[5] This makes T5 more robust to inputs longer than those seen during pretraining, within limits.
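The idea of a distance-dependent bias can be sketched in a few lines. This is a deliberately simplified illustration, not T5's actual scheme: it uses one scalar per clipped signed offset, whereas T5 groups larger distances into coarser buckets. The names `relative_bucket`, `add_relative_bias`, and the numbers in `bias_table` are hypothetical stand-ins for learned parameters of a single attention head.

```python
# Simplified sketch of a relative position bias (assumption: linear clipping
# of the offset; T5 itself uses a coarser bucketing for large distances).

def relative_bucket(query_pos: int, key_pos: int, max_distance: int) -> int:
    """Map a signed offset to a bucket index in [0, 2 * max_distance]."""
    offset = key_pos - query_pos
    clipped = max(-max_distance, min(max_distance, offset))
    return clipped + max_distance


def add_relative_bias(logits, bias_table, max_distance):
    """Return a new matrix with bias_table[bucket(i, j)] added to logits[i][j]."""
    return [
        [
            logit + bias_table[relative_bucket(i, j, max_distance)]
            for j, logit in enumerate(row)
        ]
        for i, row in enumerate(logits)
    ]


# Hypothetical values standing in for learned scalars.
MAX_DISTANCE = 2
bias_table = [-0.2, -0.1, 0.0, 0.1, 0.2]  # index 0..4 <-> offset -2..+2

logits = [[0.0] * 4 for _ in range(4)]
biased = add_relative_bias(logits, bias_table, MAX_DISTANCE)
```

Because every offset beyond `MAX_DISTANCE` shares the last bucket, positions farther apart than anything seen in training still receive a defined bias, which is the intuition behind the robustness claim above.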

Decoder stack and cross-attention

The decoder processes the output sequence token by token. Each decoder layer applies three sublayers in order: causal self-attention (attending only to previously generated output tokens), encoder-decoder cross-attention (attending to the full encoder output), and a position-wise feed-forward network.^[3] The cross-attention keys and values come from the encoder's final hidden states, while the queries come from the decoder's current hidden state. This separates the reading phase (encoder) from the generation phase (decoder), which is the fundamental structural difference from decoder-only large language models.
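The three attention patterns described above differ only in which positions each query may look at. The following sketch builds the corresponding boolean masks with plain Python lists (`True` means "may attend"). The function names are illustrative, and `decoder_only_mask` is included only for contrast with a single causal stack.

```python
def encoder_mask(n_in):
    """Bidirectional self-attention: every input position sees every other."""
    return [[True] * n_in for _ in range(n_in)]


def causal_mask(n_out):
    """Decoder self-attention: position i sees positions 0..i only."""
    return [[j <= i for j in range(n_out)] for i in range(n_out)]


def cross_mask(n_out, n_in):
    """Cross-attention: every output position sees the whole encoder output."""
    return [[True] * n_in for _ in range(n_out)]


def decoder_only_mask(n_in, n_out):
    """For contrast: one causal stack over the concatenated input and output."""
    return causal_mask(n_in + n_out)
```

In the encoder-decoder case the input is covered by `encoder_mask` and `cross_mask`, both fully visible; in the decoder-only case the same input tokens are subject to the causal constraint.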

At each generation step, the decoder's output layer applies a softmax over the full vocabulary to produce a probability distribution, from which the next token is chosen (greedily, by beam search, or by sampling). The chosen token is appended to the output sequence and fed back in as the next input. This autoregressive loop continues until the model emits a special end-of-sequence token or reaches a maximum length.
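A minimal sketch of this loop with greedy selection is shown below. The `toy_step` function is a hypothetical stand-in for a real decoder (which would also take the encoder output); it is there only so the loop can run end to end. The `</s>` end-of-sequence string and the tiny vocabulary are likewise assumptions made for the example.

```python
import math

EOS = "</s>"  # assumed end-of-sequence marker for this sketch
VOCAB = ["Das", "ist", "gut", ".", EOS]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_fn, vocab, max_length):
    """Repeatedly pick the most probable token until EOS or max_length."""
    output = []
    for _ in range(max_length):
        probs = softmax(step_fn(output))
        next_token = vocab[probs.index(max(probs))]
        if next_token == EOS:
            break
        output.append(next_token)  # fed back in on the next iteration
    return output


def toy_step(prefix):
    """Hypothetical decoder: favors the vocabulary entry at index len(prefix)."""
    target = len(prefix)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


translation = greedy_decode(toy_step, VOCAB, max_length=10)
```

Beam search and sampling replace only the line that picks `next_token`; the surrounding loop stays the same.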

Positional encoding variants

Different text2text models use different strategies for encoding token position. The original encoder-decoder Transformer used sinusoidal fixed embeddings.^[3] T5 adopted relative position biases shared across all layers.^[5] BART learned absolute position embeddings separately for the encoder and decoder.^[4] PEGASUS followed BART's convention.^[6] UL2 inherited T5's relative biases and modified them slightly to handle the Mixture-of-Denoisers objective.^[12] LongT5 combined local attention within fixed windows with transient global tokens that aggregate information across the full sequence, making it practical to run the model on inputs with tens of thousands of tokens.^[11]

T5 paradigm

The text-to-text framing introduced by Raffel et al. is the central design idea of the category.^[5] Every input begins with a short prefix that names the task. Examples from the original T5 paper include "translate English to German: That is good.", "summarize: state authorities dispatched emergency crews ...", "cola sentence: The course is jumping well.", and "stsb sentence1: ... sentence2: ...".^[5] Classification targets are written as words ("acceptable", "entailment") rather than class indices, and regression targets like semantic similarity are rounded to the nearest increment of 0.2 and emitted as text.^[5] A single set of weights handles every task; at inference time the prefix selects which one.
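The formatting step can be sketched as follows. The helper names `format_input` and `stsb_target` are hypothetical; the prefixes are the ones quoted above, and the similarity rounding follows the T5 paper's description of mapping STS-B scores to the nearest increment of 0.2 before converting them to strings.

```python
def format_input(prefix, *texts, **named_fields):
    """Join a task prefix with free text and/or 'name: text' fields."""
    parts = [prefix, *texts]
    parts += [f"{name}: {text}" for name, text in named_fields.items()]
    return " ".join(parts)


def stsb_target(score):
    """Round a similarity score to the nearest 0.2 and emit it as text."""
    return f"{round(score * 5) / 5:.1f}"


translation_input = format_input("translate English to German:", "That is good.")
cola_input = format_input("cola", sentence="The course is jumping well.")
stsb_input = format_input(
    "stsb", sentence1="A man is singing.", sentence2="A person sings."
)
```

Every task then shares the same training signal: the model reads the formatted input string and is trained to emit the target string, whether that string is a translation, a label word, or a number such as `stsb_target(2.57)`.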

This framing makes the loss function uniform (cross-entropy over output tokens), simplifies multi-task fine-tuning, and turns evaluation into string matching. It is also the conceptual predecessor of prompt-based usage of decoder-only large language models, although decoder-only models typically learn the prompt format implicitly from web text rather than from explicit task prefixes.

The C4 corpus

Raffel et al. built the Colossal Clean Crawled Corpus (C4) specifically to pretrain T5.^[5] Starting from a Common Crawl snapshot, they applied a pipeline of heuristic filters: remove lines not ending in terminal punctuation, discard pages with fewer than five sentences, remove pages containing offensive words from a fixed list, deduplicate at the three-sentence-consecutive-overlap level, and keep only English content as scored by a language-identification model.^[5] The result was roughly 750 GB of clean English web text, substantially larger than and qualitatively different from earlier English pretraining corpora.^[5] C4 became the standard pretraining corpus for mT5 (where it was extended to 101 languages as mC4), T5 variants, and several third-party reimplementations.^[7]
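A rough sketch of three of these heuristics is given below. It is an illustration under stated assumptions, not the pipeline used to build C4: the sentence splitter is a naive regular expression, `BLOCKLIST` is a one-word placeholder, the sentence threshold is simply the figure quoted above, and the deduplication and language-identification steps are omitted.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_SENTENCES = 5          # threshold as quoted in the text above
BLOCKLIST = {"badword"}    # hypothetical placeholder for the real word list


def clean_page(text):
    """Return the cleaned page text, or None if the page should be dropped."""
    # Keep only lines that end in terminal punctuation (drops menus, buttons).
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
    ]
    cleaned = "\n".join(kept)

    # Drop the whole page if any blocklisted word appears.
    words = set(re.findall(r"[a-z']+", cleaned.lower()))
    if words & BLOCKLIST:
        return None

    # Drop pages that are too short (naive sentence split on . ! ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < MIN_SENTENCES:
        return None
    return cleaned
```

Filters of this kind are cheap to apply at web scale because each one needs only the page itself; deduplication is the step that requires state shared across pages.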

Instruction tuning and FLAN-T5

The FLAN (Fine-tuned LAnguage Net) line of work showed that the T5 architecture, already capable of multi-task generation, benefits dramatically from explicit instruction tuning: fine-tuning the model on a large collection of tasks expressed as natural-language instructions rather than fixed prefixes.^[17] FLAN-T5 (Chung et al., 2022) instruction-tuned T5 on 1,836 tasks drawn from 473 datasets, including chain-of-thought prompts for 9 of those datasets.^[14] The authors reported gains over both the base T5 and GPT-3 on several zero-shot benchmarks, even though FLAN-T5-XXL has roughly 11B parameters (the same scale as T5-11B) against GPT-3's 175B.^[14] By 2023, FLAN-T5 was among the most widely used open-weight instruction-following encoder-decoder models.

Pretraining objectives

Denoising is the dominant pretraining objective for text2text models. BART pretrains by corrupting text with several noising functions, including text infilling (replacing spans of tokens with a single mask), sentence permutation, token deletion, document rotation, and token masking, then training the model to reconstruct the original document.^[4] Lewis et al. found that the combination of text infilling and sentence permutation gave the strongest results.^[4]

T5 uses span corruption. Roughly 15 percent of input tokens are dropped in contiguous spans of average length three, each span is replaced by a single sentinel token, and the target sequence is the dropped spans separated by the same sentinels.^[5] This produces short targets, which is cheap relative to BART's full-document reconstruction.
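The transformation is easiest to see on a concrete sentence. The sketch below takes the spans to drop as explicit arguments instead of sampling them, so the result is deterministic; the `<extra_id_N>` sentinel spelling follows the convention of the public T5 vocabularies and is an assumption here (the paper writes the sentinels as `<X>`, `<Y>`, `<Z>`). The function name `span_corrupt` is illustrative.

```python
def span_corrupt(tokens, spans):
    """Apply span corruption.

    tokens: list of token strings.
    spans:  sorted, non-overlapping (start, end) half-open index pairs to drop.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])   # untouched text before the span
        inputs.append(sentinel)               # whole span collapses to one token
        targets.append(sentinel)              # same sentinel introduces the span
        targets.extend(tokens[start:end])     # dropped tokens go to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
# inputs  -> Thank you <extra_id_0> me to your party <extra_id_1> week .
# targets -> <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target holds only the three dropped tokens plus sentinels, which is why the decoder side of T5 pretraining is short compared with reconstructing the full document.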

UL2 unifies several pretraining schemes under a Mixture-of-Denoisers.^[12] R-denoising is regular T5-style span corruption with short spans. S-denoising splits the document at a random position and treats the prefix as the input and the suffix as the target, which is essentially causal language modeling cast as a text-to-text problem. X-denoising is extreme span corruption, with longer spans or higher corruption ratios. Each objective is prefixed by a mode token so the model can be steered toward one or another at inference.^[12] PEGASUS uses gap-sentences generation: principal sentences (chosen by a ROUGE-based importance heuristic) are masked out of the document and the model is asked to regenerate them as a pseudo-summary.^[6]
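The S-denoiser and the mode token can be sketched in a few lines. The bracketed spellings in `MODE_TOKENS` are an assumption for illustration (released checkpoints may use different strings), and the split position is passed in explicitly instead of being sampled.

```python
# Assumed spellings for the three mode tokens; treat these as placeholders.
MODE_TOKENS = {"R": "[R]", "S": "[S]", "X": "[X]"}


def with_mode(mode, tokens):
    """Prefix an input sequence with the token naming its denoising mode."""
    return [MODE_TOKENS[mode]] + list(tokens)


def s_denoise(tokens, split_at):
    """Split a document into (input prefix, target suffix) for S-denoising."""
    if not 0 < split_at < len(tokens):
        raise ValueError("split_at must leave a non-empty prefix and suffix")
    return with_mode("S", tokens[:split_at]), list(tokens[split_at:])


document = "the quick brown fox jumps over the lazy dog".split()
s_inputs, s_targets = s_denoise(document, split_at=4)
```

R- and X-denoising examples would be produced by a span-corruption routine with different span lengths and corruption rates, then passed through `with_mode("R", ...)` or `with_mode("X", ...)` in the same way.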

Comparison of pretraining objectives

Model	Pretraining objective	Corruption type	Target length	Special design
T5	Span corruption	Contiguous token spans (~15%, avg len 3)	Short (sentinel-separated spans)	Sentinel tokens for each masked span^[5]
BART	Denoising autoencoder	Text infilling + sentence permutation (best combo)	Full document reconstruction	Multiple noising functions explored^[4]
PEGASUS	Gap-sentences generation	Whole sentences (selected by ROUGE importance)	Selected sentences as pseudo-summary	Summarization-specific^[6]
UL2	Mixture-of-Denoisers	R (short spans), S (suffix), X (long spans)	Varies by mode	Mode token steers objective at inference^[12]
ByT5	Span corruption (bytes)	Byte-level spans	Short	Token-free; operates on raw UTF-8^[9]
FLAN-T5	Span corruption + instruction tuning	Same as T5 pretraining, then task instructions	Task-dependent	Chain-of-thought data included^[14]
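The gap-sentences row is the least self-explanatory entry in the table, so a small sketch may help. It scores each sentence by unigram-overlap F1 against the rest of the document, used here as a simple proxy for the ROUGE-based importance heuristic, and masks the top-scoring sentences. The `<mask_1>` string and the function names are assumptions made for the example.

```python
from collections import Counter

MASK = "<mask_1>"  # assumed sentence-mask token for this sketch


def unigram_f1(candidate, reference):
    """F1 over unigram counts, a rough stand-in for ROUGE-1 F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentences(sentences, num_gaps):
    """Mask the num_gaps sentences that best summarize the remainder."""
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(unigram_f1(sentence, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:num_gaps])
    source = [MASK if i in chosen else s for i, s in enumerate(sentences)]
    target = [sentences[i] for i in sorted(chosen)]
    return source, target


doc = [
    "the storm closed the harbor",
    "ferries avoided the storm",
    "the harbor reopened monday",
    "tickets cost five dollars",
]
gsg_source, gsg_target = gap_sentences(doc, num_gaps=1)
```

Here the first sentence shares the most vocabulary with the rest of the document, so it is the one removed from the input and used as the pseudo-summary target.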

Tasks unified by the text-to-text paradigm

One of the most consequential claims of the T5 paper was that a single architecture and training procedure could cover nearly every standard NLP benchmark.^[5] The following tasks were shown to fit the text-to-text format directly.

Machine translation

Machine translation is the original motivation for encoder-decoder sequence-to-sequence research.^[1] T5 handled translation by prepending a prefix like "translate English to German:" to the source sentence and training the decoder to produce the target sentence.^[5] On WMT 2014, T5-11B reached 32.1 BLEU on English-German and 43.4 on English-French, competitive with dedicated translation systems of the time.^[5] mT5 extended the same approach to 101 languages and demonstrated that shared multilingual pretraining transfers well to low-resource language pairs.^[7]

Text summarization

Abstractive text summarization is the area where text2text models have had the clearest commercial impact. PEGASUS introduced a pretraining objective specifically designed to make the model learn to extract and rephrase salient information, and its 568M-parameter variant set new benchmarks on XSum (47.21 ROUGE-1) and CNN/DailyMail (44.17 ROUGE-1) at publication.^[6] BART achieved 45.14 ROUGE-1 on XSum.^[4] T5-11B reached 39.65 ROUGE-L on CNN/DailyMail.^[5] LongT5 and PEGASUS-X extended summarization to long documents up to 16,000 tokens, relevant for scientific papers, legal documents, and financial reports.^[11]^[13]

Question answering

Question answering in text2text models takes two forms. Extractive QA maps naturally to the format: the model receives a passage and a question and generates the answer span as text. T5 was fine-tuned on SQuAD v1.1 and reached 90.06 F1.^[5] Closed-book QA, introduced in a follow-up study by Roberts et al. (2020), is more unusual: the model answers factual questions purely from its pretrained weights, with no retrieved context.^[16] T5-11B answered questions from the TriviaQA and Natural Questions open-domain benchmarks correctly at rates well above previous closed-book systems.^[16] This was an early indication that large pretrained models memorize substantial factual knowledge, though the closed-book approach has since been largely superseded by retrieval-augmented generation.

Natural language inference and classification-as-generation

Natural language inference (NLI) requires predicting whether a hypothesis entails, contradicts, or is neutral with respect to a premise. In the T5 framing, the input is "mnli hypothesis: [H] premise: [P]" and the output is one of the words "entailment", "contradiction", or "neutral".^[5] Classification results such as sentiment labels ("positive", "negative"), grammatical acceptability ("acceptable", "unacceptable"), and textual similarity scores (rounded floats like "3.8") are all emitted as text.^[5] This classification-as-generation approach avoids task-specific output heads entirely and makes multi-task training trivial: the same softmax is used for everything.
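Scoring a classifier that emits free text needs one extra rule: what to do when the output is not a valid label. A minimal sketch is below; it counts any such output as wrong, which is the convention the T5 paper describes. The whitespace and case normalization in `parse_label` is an assumption added for the example, and the function names are illustrative.

```python
MNLI_LABELS = ("entailment", "contradiction", "neutral")


def mnli_input(hypothesis, premise):
    """Format an NLI pair using the prefix layout quoted above."""
    return f"mnli hypothesis: {hypothesis} premise: {premise}"


def parse_label(generated, labels):
    """Map generated text to a label, or None if it matches no label."""
    text = generated.strip().lower()  # normalization is an assumption here
    return text if text in labels else None


def accuracy(predictions, golds, labels):
    """Fraction of outputs that parse to the gold label; unparseable = wrong."""
    correct = sum(parse_label(p, labels) == g for p, g in zip(predictions, golds))
    return correct / len(golds)


example = mnli_input("A person is outdoors.", "A man walks his dog in the park.")
score = accuracy(
    ["entailment", "hamburger", " Neutral "],
    ["entailment", "neutral", "neutral"],
    MNLI_LABELS,
)
```

No output head or label index is involved at any point: the label set exists only in the evaluation code.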

Semantic similarity, paraphrase, and other SuperGLUE tasks

T5 was evaluated on all eight SuperGLUE tasks: BoolQ (yes/no questions), CB and RTE (textual entailment), COPA (causal reasoning), MultiRC (multi-sentence reading comprehension), ReCoRD (commonsense reading comprehension), WiC (word-in-context disambiguation), and WSC (coreference resolution). The 11B model averaged 89.3 on SuperGLUE, just below the human baseline of 89.8 reported with the benchmark.^[5] The same model also handled Winogrande, ANLI, and several GLUE tasks, all through the same text-to-text interface with task prefixes.^[5]

Data-to-text and structured tasks

Data-to-text tasks, in which the model turns a structured table or database record into fluent prose, also fit naturally into the text-to-text format. The input is a linearized representation of the structured data (field names and values concatenated as text), and the output is a sentence or paragraph. This was demonstrated on benchmarks such as WebNLG, MultiWOZ, and ToTTo.^[20] Related structured tasks include text-to-SQL (mapping a natural-language question to an SQL query) and linearized table-to-text transformations used in commercial data analytics contexts.
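Linearization itself is simple string construction. The sketch below shows one possible scheme; the `field: value` layout, the ` | ` separator, and the record contents are all assumptions for illustration, since each benchmark defines its own format.

```python
def linearize(record, separator=" | "):
    """Flatten a mapping of field names to values into one input string."""
    return separator.join(f"{field}: {value}" for field, value in record.items())


# Hypothetical record used only to show the shape of the encoder input.
record = {"name": "Blue Spice", "food": "Italian", "area": "riverside"}
model_input = linearize(record)
# model_input -> "name: Blue Spice | food: Italian | area: riverside"
```

The encoder then reads `model_input` like any other text, and the decoder is trained to produce the corresponding sentence.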

Dialogue and code

BART and its variants have been applied to dialogue response generation, with the encoder reading the conversation history and the decoder generating a response.^[4] CodeT5 adapted the encoder-decoder design to programming tasks. Pretrained with an identifier-aware objective on 8.35 million functions across 8 programming languages (from CodeSearchNet), CodeT5 supported code summarization, code generation from comments, code refinement, and defect detection.^[10] CodeT5+ extended this to 16B parameters and added instruction tuning, matching or exceeding open code LLMs of the same era on HumanEval.^[15]

Notable models

Model	Release	Organization	Sizes	Notes
BART	Oct 2019	Facebook AI Research	140M (base), 400M (large)	Denoising autoencoder; strong on summarization and dialogue^[4]
T5	Oct 2019	Google Research	60M, 220M, 770M, 3B, 11B	Span corruption pretraining on C4; unified text-to-text framing^[5]
PEGASUS	Dec 2019	Google Research	568M	Gap-sentences generation objective for abstractive summarization^[6]
mT5	Oct 2020	Google Research	300M to 13B	Pretrained on mC4 across 101 languages^[7]
Switch Transformer	Jan 2021	Google Research	1.6T (sparse)	Mixture-of-experts on a T5 backbone^[8]
CodeT5	Sep 2021	Salesforce Research	60M, 220M, 770M	Identifier-aware pretraining on 8.35M functions in 8 languages^[10]
LongT5	Dec 2021	Google Research	up to 3B	Transient global attention for inputs up to 16K tokens^[11]
UL2	May 2022	Google Research	20B	Mixture-of-Denoisers; SOTA on 50 supervised tasks at release^[12]
PEGASUS-X	Aug 2022	Google Research	272M, 568M	Staggered block-local attention for 16K-token inputs^[13]
FLAN-T5	Oct 2022	Google Research	80M, 250M, 780M, 3B, 11B	Instruction-tuned T5; 1.8K tasks plus chain-of-thought data^[14]
Flan-UL2	Mar 2023	Google Research	20B	UL2 with FLAN instruction tuning
CodeT5+	May 2023	Salesforce Research	220M to 16B	Multi-objective code LLM; instruction-tuned variant matches open code LLMs on HumanEval^[15]

ByT5, released in May 2021, comes in the same five-size range as mT5 but trades the SentencePiece tokenizer for raw UTF-8 byte input.^[9] The T5X framework, written in JAX, reimplemented T5, mT5, UL2, and related models for TPU training and is the reference codebase for most public Google encoder-decoder checkpoints.
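Byte-level input needs no learned vocabulary, which a short sketch makes concrete. The offset of three reserved ids (padding, end-of-sequence, unknown) and the EOS id of 1 mirror the layout used by the Hugging Face ByT5 tokenizer as far as the author of this sketch is aware; treat both as assumptions. The function names are illustrative.

```python
NUM_SPECIAL = 3  # assumed reserved ids: 0 = pad, 1 = end-of-sequence, 2 = unknown
EOS_ID = 1


def byte_encode(text):
    """Encode text as UTF-8 byte values shifted past the reserved ids."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def byte_decode(ids):
    """Invert byte_encode, skipping any id outside the 256 byte slots."""
    raw = bytes(
        i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < NUM_SPECIAL + 256
    )
    return raw.decode("utf-8", errors="ignore")


ids = byte_encode("hi")
# ids -> [107, 108, 1]
```

The cost is sequence length: a non-ASCII character such as "é" occupies two positions, and scripts outside Latin commonly need three or more bytes per character.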

Benchmarks

Text2text models have set or matched state-of-the-art results across a broad range of NLP tasks. Raffel et al. report that T5-11B reached an average score of 89.3 on SuperGLUE (compared with a human baseline of 89.8 reported with the benchmark), 90.06 F1 on SQuAD v1.1, a ROUGE-L of 39.65 on CNN/DailyMail summarization, and BLEU scores of 32.1 on WMT English-German and 43.4 on WMT English-French.^[5] BART reported gains of up to 6 ROUGE points over the previous state of the art on XSum summarization and matched RoBERTa on GLUE and SQuAD despite being a generative model.^[4]

The following table summarizes representative results for several models.

Model	Task	Benchmark	Score
T5-11B	Multi-task	SuperGLUE	89.3 average^[5]
T5-11B	Reading comprehension	SQuAD v1.1	90.06 F1^[5]
T5-11B	Translation	WMT En-De	32.1 BLEU^[5]
T5-11B	Summarization	CNN/DailyMail	39.65 ROUGE-L^[5]
BART-large	Summarization	XSum	45.14 ROUGE-1^[4]
PEGASUS-large	Summarization	XSum	47.21 ROUGE-1^[6]
mT5-XXL	Cross-lingual QA	TyDi QA GoldP	82.5 F1^[7]
FLAN-T5-XXL	Zero-shot reasoning	MMLU	55.1 percent^[14]
UL2 20B	Zero-shot generation	SuperGLUE	exceeds 175B GPT-3^[12]
CodeT5+ 16B	Code generation	HumanEval pass@1	35.0 percent^[15]

Transfer learning and fine-tuning

Text2text models are canonical examples of pretrain-then-fine-tune transfer learning. The core insight is that pretraining on large unlabeled corpora with a self-supervised objective instills general language understanding that can be adapted to many downstream tasks with relatively small labeled datasets. Raffel et al. studied this extensively in the T5 paper, ablating pretraining objectives, model sizes, pretraining dataset sizes, and fine-tuning strategies.^[5]

Key findings from the T5 transfer learning ablations include: (1) multi-task pretraining (sharing all parameters across tasks via prefixes during pretraining, not just fine-tuning) helped some tasks but hurt others unless task mixing ratios were carefully tuned; (2) pretraining for longer on more data consistently improved downstream performance without saturation up to the largest configurations tested; (3) bigger models were consistently better, with the 11B model outperforming the 3B model across nearly all tasks; (4) span corruption outperformed other pretraining objectives including BERT-style masked language modeling and prefix language modeling.^[5]

Multi-task learning and task mixing

T5 demonstrated that a single model with a task prefix can be fine-tuned jointly on many tasks, but the optimal mixing ratio between tasks was non-trivial. Equal mixing underweighted high-resource tasks; proportional mixing overweighted them. The paper explored several mixing strategies, including temperature-scaled sampling.^[5] FLAN-T5 and subsequent instruction-tuned variants took a more aggressive approach: train on as many diverse tasks as possible, expressed as natural-language instructions, which generalizes better to unseen tasks than proportional mixing on a fixed task set.^[14]^[18]
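Temperature-scaled mixing interpolates between the two extremes. In the formulation used by the T5 paper, each task's rate is proportional to its (optionally capped) example count raised to the power 1/T, then renormalized: T = 1 gives proportional mixing, and large T approaches equal mixing. The sketch below implements that formula; the function name, parameter names, and dataset sizes are illustrative.

```python
def mixing_rates(sizes, temperature=1.0, limit=None):
    """Compute per-task sampling rates.

    sizes:       mapping of task name -> number of training examples.
    temperature: T >= 1; larger values flatten the distribution.
    limit:       optional cap K applied to each size before scaling.
    """
    capped = {
        task: min(n, limit) if limit is not None else n
        for task, n in sizes.items()
    }
    scaled = {task: n ** (1.0 / temperature) for task, n in capped.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}


# Hypothetical dataset sizes, chosen so the arithmetic is easy to follow.
sizes = {"big_task": 10_000, "small_task": 100}
proportional = mixing_rates(sizes, temperature=1.0)  # big_task ~ 0.990
tempered = mixing_rates(sizes, temperature=2.0)      # big_task = 100 / 110
capped = mixing_rates(sizes, limit=100)              # both tasks at 0.5
```

With T = 2 the small task is sampled about nine times more often than under proportional mixing, while the large task still dominates.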

Comparison with decoder-only models

Encoder-decoder text2text models and decoder-only language models sit at different points in the design space. The bidirectional encoder lets every input token attend to every other input token, which is well suited to tasks where the model must understand a fixed input before producing a short structured output, such as classification, extractive question answering, and summarization. Cross-attention also separates the cost of reading the input from the cost of generating the output, so a long document with a short summary is cheaper to process than with a decoder-only architecture, which rolls the entire input through its causal stack alongside the output.

Decoder-only models, by contrast, have benefited disproportionately from scale and from prompt-based learning since 2020. A single autoregressive stack, the absence of an architectural prior about which positions are "input" and which are "output", and the ability to interleave instructions, examples, and continuations in one stream made decoder-only the dominant choice for frontier-scale large language models. Open-weight encoder-decoder models above roughly 20B parameters remain rare; UL2 20B, Flan-UL2 20B, and the sparse Switch family are the principal exceptions. Tay et al. nonetheless reported that UL2 outperformed a 175B GPT-3 on zero-shot SuperGLUE while using a small fraction of the compute.^[12]

Structural comparison

The table below summarizes the key architectural and practical differences between the two families.

Property	Text2text encoder-decoder	Decoder-only LLM
Attention over the input	Bidirectional (full self-attention over input)	Causal (each token attends only to earlier tokens)
Input-output separation	Explicit: encoder reads input, decoder writes output	Implicit: input and output share the same token stream
Cross-attention	Yes: decoder attends to all encoder hidden states at each layer	No
Pretraining objective	Denoising / span corruption / gap-sentence generation	Causal language modeling or a mix with denoising (UL2-style)
Relative strength	Classification, extraction, short structured output, translation, summarization	Open-ended generation, long-form reasoning, instruction following, in-context learning
Largest open-weight dense models	UL2 20B, Flan-UL2 20B	Llama 3 70B, Mistral, Qwen, Falcon and many others
Largest open-weight sparse models	Switch Transformer 1.6T	Mixtral, DeepSeek-V3 671B
Frontier closed models	(largely superseded at frontier scale)	GPT-4o, Gemini, Claude Sonnet, Grok
Per-token inference cost for summarization	Lower (input processed once by encoder; decoder generates short output)	Higher (full input+output causal pass per token)
Common deployment contexts	Summarization APIs, translation, structured extraction, code tasks	Chat, agents, reasoning, open-ended generation

Why decoder-only models came to dominate at scale

The convergence to decoder-only architectures at frontier scale reflects several practical factors. First, parameter budget: an encoder-decoder model with N total parameters splits them between two stacks, each roughly half the size of a decoder-only model with the same budget, and scaling-law studies found that the single decoder-only stack performs comparably on many generation tasks. Second, in-context learning works naturally in a decoder-only model, where examples and a query can be concatenated in a single stream; the encoder-decoder format requires a fixed input-output split that does not accommodate fluid few-shot prompting as gracefully. Third, RL-based alignment (RLHF and its successors) was first scaled successfully on decoder-only models, creating a compounding advantage as the industry invested in those pipelines. The result is that as of 2025 the frontier reasoning models and large language models are almost exclusively decoder-only.

Applications

Text2text models cover most generation-flavored NLP tasks. Machine translation was the original target for sequence-to-sequence research and remains a strong fit for the encoder-decoder design.^[1] Abstractive text summarization is the area where BART, PEGASUS, and LongT5 are most widely adopted in production, including for news, scientific papers, and long-form documents.^[4]^[6]^[11] Question answering is supported in both extractive and generative forms; follow-up work on T5 introduced the "closed-book" formulation, in which the model answers factual questions without retrieving external passages.^[16]

Other common applications include paraphrasing, text simplification, headline generation, grammatical error correction, data-to-text (turning structured tables into prose), dialogue response generation, and natural language to SQL. CodeT5 and CodeT5+ target programming tasks such as code summarization, completion, defect detection, clone detection, and text-to-code retrieval.^[10]^[15] Multilingual variants (mT5, ByT5) extend each of these applications across languages, often outperforming language-specific baselines on low-resource pairs.^[7]^[9]

Text2text models have also been deployed in production settings for specialized document intelligence: contract review (summarizing clauses, extracting obligations), biomedical literature mining (abstracting and indexing clinical trial results), customer support (summarizing tickets, drafting responses), and search (generating extractive snippets or abstractive answers). Their deterministic prefix-to-output interface makes them easier to control than decoder-only models for high-precision production pipelines.

Limitations

Text2text models share several limitations. Denoising pretraining does not match downstream tasks exactly, so fine-tuning or instruction tuning is generally still needed for strong zero-shot performance. The fixed split between encoder and decoder makes the architecture less flexible than a single causal stack for tasks where input and output blur (long multi-turn dialogues, code execution traces, agentic loops). The number of open-weight encoder-decoder checkpoints above 20B parameters is small, which limits direct comparisons against frontier decoder-only models. Generation diversity can also suffer because the encoder-decoder design tends to be more confident and less varied than a temperature-sampled decoder-only model.

Like all generative language models, text2text models can hallucinate facts, copy biases from their pretraining data, and produce unsafe content when prompted adversarially. The closed-book question-answering experiments of Roberts et al. made this explicit: even an 11B model misses many factual questions that are easily handled by retrieval-augmented systems.^[16]

Additional limitations specific to the architecture include:

Fixed input-output split. The encoder-decoder design requires a clear boundary between input and output at every forward pass. This is natural for translation and summarization but awkward for conversational tasks, multi-step reasoning, or any scenario where the model should treat previous output as additional input context without re-encoding the entire conversation.
Context window constraints on the encoder. Although LongT5 and PEGASUS-X extend the effective context to 16K tokens,^[11]^[13] standard T5 and BART encoders are limited to 512 or 1,024 tokens, shorter than the context windows of modern decoder-only models. Processing book-length documents requires chunking, which can lose cross-chunk coherence.
Training instability at large scale. The Switch Transformer (a sparse T5 variant) required additional engineering to train stably and efficiently, including selective float32 precision inside the router, a smaller parameter-initialization scale, and an expert capacity factor that bounds how many tokens each expert processes so that overflow tokens are handled predictably.^[8] These engineering costs partly explain why sparse encoder-decoder research has not scaled as far as sparse decoder-only models.
Ecosystem weight. The open-source and research community has invested heavily in decoder-only infrastructure (fine-tuning frameworks, evaluation harnesses, RLHF pipelines) since 2022. Encoder-decoder models benefit less from this infrastructure, creating a compounding disadvantage even when the underlying model quality is competitive.
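The chunking workaround mentioned under the context window item above is often implemented as a sliding window with overlap, so that text near a boundary appears in two chunks. The following is a minimal sketch under that assumption; the function name and parameters are illustrative, and it operates on an already tokenized sequence.

```python
def chunk_tokens(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


windows = chunk_tokens(list(range(10)), max_len=4, overlap=1)
# windows -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Overlap reduces, but does not remove, the coherence problem: each chunk is still encoded without access to the others, so information that depends on distant chunks has to be recombined in a later step.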

References

1. Sutskever, I., Vinyals, O., and Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." arXiv:1409.3215. https://arxiv.org/abs/1409.3215
2. Bahdanau, D., Cho, K., and Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473. https://arxiv.org/abs/1409.0473
3. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." arXiv:1706.03762. https://arxiv.org/abs/1706.03762
4. Lewis, M., Liu, Y., Goyal, N., et al. (2019). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." arXiv:1910.13461. https://arxiv.org/abs/1910.13461
5. Raffel, C., Shazeer, N., Roberts, A., et al. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv:1910.10683. https://arxiv.org/abs/1910.10683
6. Zhang, J., Zhao, Y., Saleh, M., and Liu, P. J. (2019). "PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization." arXiv:1912.08777. https://arxiv.org/abs/1912.08777
7. Xue, L., Constant, N., Roberts, A., et al. (2020). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer." arXiv:2010.11934. https://arxiv.org/abs/2010.11934
8. Fedus, W., Zoph, B., and Shazeer, N. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961. https://arxiv.org/abs/2101.03961
9. Xue, L., Barua, A., Constant, N., et al. (2021). "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models." arXiv:2105.13626. https://arxiv.org/abs/2105.13626
10. Wang, Y., Wang, W., Joty, S., and Hoi, S. C. H. (2021). "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation." arXiv:2109.00859. https://arxiv.org/abs/2109.00859
11. Guo, M., Ainslie, J., Uthus, D., et al. (2021). "LongT5: Efficient Text-To-Text Transformer for Long Sequences." arXiv:2112.07916. https://arxiv.org/abs/2112.07916
12. Tay, Y., Dehghani, M., Tran, V. Q., et al. (2022). "UL2: Unifying Language Learning Paradigms." arXiv:2205.05131. https://arxiv.org/abs/2205.05131
13. Phang, J., Zhao, Y., and Liu, P. J. (2022). "Investigating Efficiently Extending Transformers for Long Input Summarization." arXiv:2208.04347. https://arxiv.org/abs/2208.04347
14. Chung, H. W., Hou, L., Longpre, S., et al. (2022). "Scaling Instruction-Finetuned Language Models." arXiv:2210.11416. https://arxiv.org/abs/2210.11416
15. Wang, Y., Le, H., Gotmare, A. D., et al. (2023). "CodeT5+: Open Code Large Language Models for Code Understanding and Generation." arXiv:2305.07922. https://arxiv.org/abs/2305.07922
16. Roberts, A., Raffel, C., and Shazeer, N. (2020). "How Much Knowledge Can You Pack Into the Parameters of a Language Model?" arXiv:2002.08910. https://arxiv.org/abs/2002.08910
17. Wei, J., Bosma, M., Zhao, V. Y., et al. (2021). "Finetuned Language Models Are Zero-Shot Learners." arXiv:2109.01652. https://arxiv.org/abs/2109.01652
18. Longpre, S., Hou, L., Vu, T., et al. (2023). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning." arXiv:2301.13688. https://arxiv.org/abs/2301.13688
19. Wang, S., Guo, Y., Shao, Y., et al. (2021). "GLGE: A New General Language Generation Evaluation Benchmark." arXiv:2011.11928. https://arxiv.org/abs/2011.11928
20. Kale, M. and Rastogi, A. (2020). "Text-to-Text Pre-Training for Data-to-Text Tasks." arXiv:2005.10433. https://arxiv.org/abs/2005.10433
