Text2Text Generation Models
Last reviewed
May 31, 2026
Sources
20 citations
Review status
Source-backed
Revision
v3 · 5,087 words
Text-to-text (text2text) generation models are a family of neural network systems that frame many natural language processing tasks as a single problem: given an input text string, produce an output text string. The category is dominated by encoder-decoder Transformer architectures pretrained on large unlabeled corpora with denoising objectives, and then either fine-tuned on individual tasks or trained jointly on many tasks using task-specific prefixes. The paradigm was popularized by Google's T5 (Text-to-Text Transfer Transformer), which treats translation, summarization, classification, question answering, and other tasks as variations of the same text-to-text format. Other widely used text2text models include BART, mT5, FLAN-T5, UL2, PEGASUS, LongT5, and CodeT5.
Text2text models are distinguished from decoder-only text generation models (such as the GPT family) by their architecture. A text2text model has a bidirectional encoder that builds contextual representations of the entire input, and a separate autoregressive decoder that attends to those representations through cross-attention while generating the output. Decoder-only models, in contrast, use a single causal stack that conditions only on past tokens.
See also: Natural Language Processing Models and Tasks
| Text-to-text generation models | |
|---|---|
| Type | Encoder-decoder Transformer family |
| Core idea | Every NLP task reframed as text in, text out |
| Defining architecture | Bidirectional encoder + autoregressive decoder with cross-attention |
| Dominant pretraining objective | Span corruption and denoising |
| Unifying paper | "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2019) |
| Key models | T5, BART, mT5, FLAN-T5, UL2, PEGASUS, LongT5, CodeT5 |
| Key tasks | Machine translation, summarization, question answering, classification-as-generation, paraphrase, NLI |
| Representative sizes | 60M to 20B dense; 1.6T sparse (Switch Transformer) |
| Predecessor paradigm | Sequence-to-sequence learning with LSTMs (Sutskever et al., 2014) |
The encoder-decoder formulation of text-to-text generation grew out of neural machine translation research in the mid-2010s. Sutskever, Vinyals, and Le (2014) introduced sequence-to-sequence learning with stacked LSTMs, mapping an English sentence to a fixed-dimensional vector and decoding it into French, reaching 34.8 BLEU on WMT'14 English-French. The same year, Bahdanau, Cho, and Bengio added a soft alignment mechanism that lets the decoder attend to different positions in the source at each generation step. That work introduced what is now called Bahdanau or additive attention.
The encoder-decoder Transformer of Vaswani et al. (2017) replaced recurrence with self-attention, parallelized training, and set new state-of-the-art BLEU scores on WMT 2014 English-German (28.4) and English-French (41.8). The original Transformer was itself a text-to-text architecture for translation, though the label "text-to-text" became associated with later models that applied the same architecture to many tasks at once.
In October 2019, two papers published within two weeks of each other crystallized the modern text2text paradigm. BART, from Facebook AI Research (now Meta AI), proposed a denoising autoencoder that corrupts text with an arbitrary noising function and learns to reconstruct the original. T5, from Google Research, reframed every supervised task as taking a text string in and producing a text string out, with task-specific prefixes like "translate English to German:" or "summarize:". T5 was pretrained on the Colossal Clean Crawled Corpus (C4), a roughly 750 GB filtered subset of Common Crawl.
Work after T5 extended the recipe along several dimensions. mT5 (Xue et al., 2020) trained the architecture on a corpus covering 101 languages. ByT5 (Xue et al., 2021) replaced the SentencePiece tokenizer with raw UTF-8 bytes, producing a token-free model that handles noisy text and works on any script. PEGASUS (Zhang et al., 2019) introduced gap-sentences generation, a summarization-specific pretraining objective in which whole sentences are masked and reconstructed. LongT5 (Guo et al., 2021) combined T5 with a transient global attention pattern to support inputs up to roughly 16,000 tokens, and PEGASUS-X (Phang et al., 2022) extended PEGASUS to the same regime.
The Switch Transformer (Fedus et al., 2021) applied sparse mixture-of-experts routing to a T5 backbone, reaching trillion-parameter scale at constant per-token compute. UL2 (Tay et al., 2022) unified denoising and causal pretraining into a single Mixture-of-Denoisers objective and was released as a 20-billion-parameter encoder-decoder. FLAN-T5 (Chung et al., 2022) showed that scaling up the number of instruction-tuned tasks plus chain-of-thought training dramatically improves zero-shot and few-shot performance. Flan-UL2, released in early 2023, applied the same recipe to the 20B UL2 model. CodeT5 and CodeT5+ (Wang et al., 2021; 2023) adapted the encoder-decoder design to programming languages.
A text2text model is a Transformer with two stacks. The encoder reads the input token sequence and applies bidirectional self-attention, so each input position can attend to every other. The output is one contextual vector per input token. The decoder generates the output one token at a time using two attention mechanisms per layer: causal self-attention over previously generated tokens, and cross-attention over the encoder output. The two stacks are typically the same depth and width.
T5 uses relative position biases inside attention rather than absolute positional embeddings. BART uses learned absolute positions and a GeLU activation, following the BERT and GPT conventions of the time. UL2 and LongT5 introduce additional positional and attention modifications to handle their pretraining objectives and longer inputs.
The encoder applies a stack of identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer. Because the attention is bidirectional, each token representation integrates information from the full input at every layer. This is a key advantage for tasks where global context matters: a model summarizing a document can, for example, resolve coreference across the full input before generating a single output token. The encoder produces one hidden-state vector per input token, and the full set of these vectors is passed to every decoder layer via cross-attention.
T5's encoder uses relative attention biases rather than absolute positional embeddings. Relative biases add a learned scalar to each attention logit depending on the distance between the attending token and the attended token, capped at a maximum bucket distance. This makes T5 more robust to inputs longer than those seen during pretraining, within limits.
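The bucketing behind relative biases can be sketched in a few lines. The function below is an illustration modeled on publicly available T5 implementations, not the reference code; the defaults of 32 buckets and a maximum distance of 128 are assumptions, and the function name is hypothetical. Each bucket id would index a learned scalar per attention head that is added to the attention logit.

```python
import math


def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map a signed offset (key position minus query position) to a bucket id."""
    bucket = 0
    if bidirectional:
        # Half of the buckets cover keys after the query, half cover keys before it.
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        distance = abs(relative_position)
    else:
        # Causal case: only keys at or before the query are distinguished.
        distance = max(-relative_position, 0)

    max_exact = num_buckets // 2
    if distance < max_exact:
        # Small distances each get their own bucket.
        return bucket + distance

    # Larger distances share logarithmically spaced buckets; everything at or
    # beyond max_distance falls into the last bucket.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)
```

Because every offset past the cap shares one bucket, a token 5,000 positions away receives the same bias as one 128 positions away, which is one reason the scheme degrades gracefully on longer inputs.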
The decoder processes the output sequence token by token. Each decoder layer applies three sublayers in order: causal self-attention (attending only to previously generated output tokens), encoder-decoder cross-attention (attending to the full encoder output), and a position-wise feed-forward network. The cross-attention keys and values come from the encoder's final hidden states, while the queries come from the decoder's current hidden state. This separates the reading phase (encoder) from the generation phase (decoder), which is the fundamental structural difference from decoder-only large language models.
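A minimal sketch of the two decoder attention patterns follows. It is deliberately simplified: a single head, no learned query/key/value projections, and plain Python lists instead of tensors, all of which are assumptions made for readability. The function names are hypothetical.

```python
import math


def causal_mask(target_length):
    """mask[q][k] is True when decoder position q may attend to position k."""
    return [[k <= q for k in range(target_length)] for q in range(target_length)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder."""
    dim = len(encoder_states[0])
    outputs = []
    for query in decoder_states:
        # Every decoder position scores against every encoder position (no mask).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in encoder_states]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, encoder_states))
                        for j in range(dim)])
    return outputs
```

The encoder states are computed once per input and reused at every decoding step, while the causal mask only constrains attention among the generated tokens.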
At each generation step, the decoder produces a softmax distribution over the full vocabulary, and a decoding strategy (greedy selection, beam search, or sampling) picks the next token from it. The chosen token is appended to the output sequence and fed back in as the next decoder input. This autoregressive loop continues until the model emits a special end-of-sequence token or reaches a maximum length.
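The loop can be sketched with greedy selection, the simplest decoding strategy. Everything here is illustrative: `next_token_scores` stands in for a real decoder forward pass, and `toy_copy_model` is a hypothetical scorer that simply copies the source tokens and then emits the end-of-sequence id.

```python
def greedy_decode(next_token_scores, encoder_output, bos_id, eos_id, max_length=20):
    """Generate token ids one at a time until EOS or max_length is reached."""
    generated = [bos_id]
    for _ in range(max_length):
        # Scores over the vocabulary, conditioned on the input and the prefix so far.
        scores = next_token_scores(encoder_output, generated)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated[1:]  # drop the start token


def toy_copy_model(encoder_output, generated, vocab_size=6, eos_id=1):
    """Hypothetical scorer: prefers the next source token, then EOS."""
    step = len(generated) - 1
    target = encoder_output[step] if step < len(encoder_output) else eos_id
    return [1.0 if token_id == target else 0.0 for token_id in range(vocab_size)]
```

Calling `greedy_decode(toy_copy_model, [3, 4, 5], bos_id=0, eos_id=1)` returns `[3, 4, 5, 1]`. Beam search and sampling replace only the line that picks `next_id`.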
Different text2text models use different strategies for encoding token position. The original encoder-decoder Transformer used fixed sinusoidal embeddings. T5 adopted relative position biases shared across all layers. BART learned absolute position embeddings separately for the encoder and decoder. PEGASUS kept the fixed sinusoidal embeddings of the original Transformer. UL2 inherited T5's relative biases and modified them slightly to handle the Mixture-of-Denoisers objective. LongT5 combined local attention within fixed windows with transient global tokens that aggregate information across the full sequence, making it practical to run the model on inputs of up to roughly 16,000 tokens.
The text-to-text framing introduced by Raffel et al. is the central design idea of the category. Every input begins with a short prefix that names the task. Examples from the original T5 paper include "translate English to German: That is good.", "summarize: state authorities dispatched emergency crews ...", "cola sentence: The course is jumping well.", and "stsb sentence1: ... sentence2: ...". Classification targets are written as words ("acceptable", "entailment") rather than class indices, and regression targets like semantic similarity are rounded to the nearest increment of 0.2 and emitted as text with one decimal place. A single set of weights handles every task; at inference time the prefix selects which one.
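A small sketch of how examples might be cast into this format is shown below. The helper names are hypothetical, and the 0.2 rounding step for STS-B follows the description in the T5 paper; treat it as an assumption if a different checkpoint or preprocessing pipeline is in use.

```python
def translation_input(text, source="English", target="German"):
    return f"translate {source} to {target}: {text}"


def stsb_input(sentence1, sentence2):
    return f"stsb sentence1: {sentence1} sentence2: {sentence2}"


def stsb_target(score, step=0.2):
    """Round a similarity score to the nearest step and emit it as a string."""
    return f"{round(score / step) * step:.1f}"


def cola_target(is_acceptable):
    """Classification labels are written out as words, not class indices."""
    return "acceptable" if is_acceptable else "unacceptable"
```

For instance, `translation_input("That is good.")` yields the prefixed string quoted above, and `stsb_target(2.57)` yields `"2.6"`. Inputs and targets are both plain strings, so every task shares the same tokenizer, loss, and decoding code.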
This framing makes the loss function uniform (cross-entropy over output tokens), simplifies multi-task fine-tuning, and turns evaluation into string matching. It is also the conceptual predecessor of prompt-based usage of decoder-only large language models, although decoder-only models typically learn the prompt format implicitly from web text rather than from explicit task prefixes.
Raffel et al. built the Colossal Clean Crawled Corpus (C4) specifically to pretrain T5. Starting from a Common Crawl snapshot, they applied a pipeline of heuristic filters: remove lines not ending in terminal punctuation, discard pages with fewer than five sentences, remove pages containing offensive words from a fixed list, deduplicate at the three-sentence-consecutive-overlap level, and keep only English content as scored by a language-identification model. The result was roughly 750 GB of clean English web text, substantially larger than and qualitatively different from earlier English pretraining corpora. C4 became the standard pretraining corpus for mT5 (where it was extended to 101 languages as mC4), T5 variants, and several third-party reimplementations.
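The filters described above can be approximated with a short script. This is a rough sketch, not the actual C4 pipeline: the sentence splitter is a naive regular expression, `BLOCKLIST` is a hypothetical placeholder for the fixed word list, the language-identification step is omitted, and deduplication drops a whole page when it repeats a three-sentence span, which is a simplification.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')  # assumed set of terminal marks
BLOCKLIST = {"badword"}  # hypothetical stand-in for the fixed offensive-word list


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean_page(text, min_sentences=5):
    """Return the cleaned page text, or None if the page should be discarded."""
    lines = [line.strip() for line in text.splitlines()]
    # Keep only lines that end in terminal punctuation (drops menus, headers).
    kept = [line for line in lines if line.endswith(TERMINAL_PUNCTUATION)]
    cleaned = "\n".join(kept)
    # Discard pages containing any blocklisted word.
    if set(re.findall(r"[a-z']+", cleaned.lower())) & BLOCKLIST:
        return None
    # Discard pages with too few sentences.
    if len(split_sentences(cleaned)) < min_sentences:
        return None
    return cleaned


def deduplicate(pages):
    """Keep a page only if none of its three-sentence spans was seen before."""
    seen, unique = set(), []
    for page in pages:
        sentences = split_sentences(page)
        spans = {tuple(sentences[i:i + 3]) for i in range(len(sentences) - 2)}
        if spans & seen:
            continue
        seen |= spans
        unique.append(page)
    return unique
```

The ordering matters in practice: line-level filtering runs before the sentence count, so a page made mostly of navigation text is discarded even if its raw length looks sufficient.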
The FLAN (Fine-tuned LAnguage Net) line of work showed that the T5 architecture, already capable of multi-task generation, benefits dramatically from explicit instruction tuning: fine-tuning the model on a large collection of tasks expressed as natural-language instructions rather than fixed prefixes. FLAN-T5 (Chung et al., 2022) instruction-tuned T5 on 1,836 tasks drawn from 473 datasets, including chain-of-thought prompts for 9 of those datasets. The authors reported gains over both the base T5 and GPT-3 on several zero-shot benchmarks, even though FLAN-T5-XXL has roughly 11B parameters (comparable to T5-11B) against GPT-3's 175B. By 2023, FLAN-T5 had become the most widely used open-weight instruction-following encoder-decoder model.
Denoising is the dominant pretraining objective for text2text models. BART pretrains by corrupting text with several noising functions, including text infilling (replacing spans of tokens with a single mask), sentence permutation, token deletion, document rotation, and token masking, then training the model to reconstruct the original document. Lewis et al. found that the combination of text infilling and sentence permutation gave the strongest results.
T5 uses span corruption. Roughly 15 percent of input tokens are dropped in contiguous spans of average length three, each span is replaced by a single sentinel token, and the target sequence is the dropped spans separated by the same sentinels. This produces short targets, which is cheap relative to BART's full-document reconstruction.
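The transformation is easy to see on a concrete sentence. The sketch below takes the spans to drop as explicit arguments rather than sampling them, so the result is deterministic. The sentinel spelling `<extra_id_N>` follows a common open-source tokenizer convention and is an assumption here, as is the closing sentinel appended to the target.

```python
def span_corrupt(tokens, spans):
    """Build (input, target) token lists for T5-style span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for index, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{index}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)             # one sentinel replaces the whole span
        targets.append(sentinel)            # the same sentinel introduces the span
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel (assumed convention)
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
# corrupted: Thank you <extra_id_0> me to your party <extra_id_1> week .
# target:    <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target contains only the three dropped tokens plus sentinels, far fewer than the eleven tokens of the original sentence, which is where the training-cost saving relative to full reconstruction comes from.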
UL2 unifies several pretraining schemes under a Mixture-of-Denoisers. R-denoising is regular T5-style span corruption with short spans. S-denoising splits the document at a random position and treats the prefix as the input and the suffix as the target, which is essentially causal language modeling cast as a text-to-text problem. X-denoising is extreme span corruption, with longer spans or higher corruption ratios. Each objective is prefixed by a mode token so the model can be steered toward one or another at inference. PEGASUS uses gap-sentences generation: principal sentences (chosen by a ROUGE-based importance heuristic) are masked out of the document and the model is asked to regenerate them as a pseudo-summary.
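Gap-sentences generation can be illustrated with a toy selector. This sketch scores each sentence by unigram-overlap F1 against the rest of the document, used here as a simple stand-in for the ROUGE-based heuristic; the tokenizer, the function names, and the `<mask_1>` placeholder are all assumptions for illustration.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over unigram overlap between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, num_gaps=1, mask_token="<mask_1>"):
    """Mask the highest-scoring sentences and return (source, target) strings."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    scores = []
    for i, sentence_tokens in enumerate(tokenized):
        rest = [tok for j, other in enumerate(tokenized) if j != i for tok in other]
        scores.append(unigram_f1(sentence_tokens, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_gaps])  # keep document order in the target
    source = " ".join(mask_token if i in chosen else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return source, target


document = [
    "The storm hit the coast.",
    "The storm closed coast roads.",
    "Roads reopen Monday.",
]
source, target = gap_sentence_example(document)
```

Here the second sentence shares the most vocabulary with the rest of the document, so it becomes the pseudo-summary target and is replaced by the mask token in the source.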
| Model | Pretraining objective | Corruption type | Target length | Special design |
|---|---|---|---|---|
| T5 | Span corruption | Contiguous token spans (~15%, avg len 3) | Short (sentinel-separated spans) | Sentinel tokens for each masked span |
| BART | Denoising autoencoder | Text infilling + sentence permutation (best combo) | Full document reconstruction | Multiple noising functions explored |
| PEGASUS | Gap-sentences generation | Whole sentences (selected by ROUGE importance) | Selected sentences as pseudo-summary | Summarization-specific |
| UL2 | Mixture-of-Denoisers | R (short spans), S (suffix), X (long spans) | Varies by mode | Mode token steers objective at inference |
| ByT5 | Span corruption (bytes) | Byte-level spans | Short | Token-free; operates on raw UTF-8 |
| FLAN-T5 | Span corruption + instruction tuning | Same as T5 pretraining, then task instructions | Task-dependent | Chain-of-thought data included |
One of the most consequential claims of the T5 paper was that a single architecture and training procedure could cover nearly every standard NLP benchmark. The following tasks were shown to fit the text-to-text format directly.
Machine translation is the original motivation for encoder-decoder sequence-to-sequence research. T5 handled translation by prepending a prefix like "translate English to German:" to the source sentence and training the decoder to produce the target sentence. On WMT 2014, T5-11B reached 32.1 BLEU on English-German and 43.4 on English-French, competitive with dedicated translation systems of the time. mT5 extended the same approach to 101 languages and demonstrated that shared multilingual pretraining transfers well to low-resource language pairs.
Abstractive text summarization is the area where text2text models have had the clearest commercial impact. PEGASUS introduced a pretraining objective specifically designed to make the model learn to extract and rephrase salient information, and its 568M-parameter variant set new benchmarks on XSum (47.21 ROUGE-1) and CNN/DailyMail (44.17 ROUGE-1) at publication. BART achieved 45.14 ROUGE-1 on XSum. T5-11B reached 39.65 ROUGE-L on CNN/DailyMail. LongT5 and PEGASUS-X extended summarization to long documents up to 16,000 tokens, relevant for scientific papers, legal documents, and financial reports.
Question answering in text2text models takes two forms. Extractive QA maps naturally to the format: the model receives a passage and a question and outputs a span or a generated answer. T5 was fine-tuned on SQuAD v1.1 and reached 90.06 F1. Closed-book QA, introduced by the T5 paper, is more unusual: the model answers factual questions purely from its pretrained weights, with no retrieved context. On the open-domain benchmarks TriviaQA and Natural Questions, T5-11B answered correctly at rates well above previous closed-book systems. This was an early indication that large pretrained models memorize substantial factual knowledge, though the closed-book approach has since been largely superseded by retrieval-augmented generation.
Natural language inference (NLI) requires predicting whether a hypothesis entails, contradicts, or is neutral with respect to a premise. In the T5 framing, the input is "mnli hypothesis: [H] premise: [P]" and the output is one of the words "entailment", "contradiction", or "neutral". Classification results such as sentiment labels ("positive", "negative"), grammatical acceptability ("acceptable", "unacceptable"), and textual similarity scores (rounded floats like "3.8") are all emitted as text. This classification-as-generation approach avoids task-specific output heads entirely and makes multi-task training trivial: the same softmax is used for everything.
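Scoring classification-as-generation reduces to comparing strings. The sketch below assumes that any generated text outside the label set is simply counted as wrong; the helper names are hypothetical.

```python
MNLI_LABELS = ("entailment", "contradiction", "neutral")


def mnli_input(hypothesis, premise):
    return f"mnli hypothesis: {hypothesis} premise: {premise}"


def label_accuracy(generated, gold, label_set=MNLI_LABELS):
    """Exact-match accuracy; outputs that are not valid labels never match."""
    correct = 0
    for prediction, reference in zip(generated, gold):
        prediction = prediction.strip()
        if prediction in label_set and prediction == reference:
            correct += 1
    return correct / len(gold)
```

With predictions `["entailment", "hamburger", " neutral "]` against gold labels `["entailment", "neutral", "neutral"]`, two of three are counted as correct. No output head or label-index mapping is involved, so adding a task only means adding strings.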
T5 was evaluated on all eight SuperGLUE tasks: BoolQ (yes/no questions), CB and RTE (textual entailment), COPA (causal reasoning), WiC (word-in-context disambiguation), WSC (coreference resolution), MultiRC (multi-sentence reading comprehension), and ReCoRD (commonsense reading comprehension). The 11B model averaged 89.3 on SuperGLUE, just below the human baseline of 89.8 reported at benchmark publication. The same model also handled Winogrande, ANLI, and several GLUE tasks, all through the same text-to-text interface with task prefixes.
Data-to-text tasks, in which the model turns a structured table or database record into fluent prose, also fit naturally into the text-to-text format. The input is a linearized representation of the structured data (field names and values concatenated as text), and the output is a sentence or paragraph. This was demonstrated on the WebNLG and DART benchmarks. Related structured tasks include text-to-SQL (mapping a natural-language question to an SQL query) and linearized table-to-text transformations used in commercial data analytics contexts.
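Linearization itself is usually a few lines of string handling. The separator characters and the "describe:" prefix below are arbitrary illustrative choices, not a format prescribed by any particular benchmark.

```python
def linearize_record(record, prefix="describe"):
    """Flatten a dict of field names and values into one input string."""
    fields = " | ".join(f"{name}: {value}" for name, value in record.items())
    return f"{prefix}: {fields}"


def linearize_triples(triples, prefix="describe"):
    """Flatten (subject, predicate, object) triples into one input string."""
    parts = " ; ".join(f"{s} | {p} | {o}" for s, p, o in triples)
    return f"{prefix}: {parts}"
```

For example, `linearize_record({"name": "Blue Spice", "food": "Italian", "area": "riverside"})` produces `"describe: name: Blue Spice | food: Italian | area: riverside"`, which the model would then map to a sentence such as a short restaurant description.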
BART and its variants have been applied to dialogue response generation, with the encoder reading the conversation history and the decoder generating a response. CodeT5 adapted the encoder-decoder design to programming tasks. Pretrained with an identifier-aware objective on 8.35 million functions across 8 programming languages (from CodeSearchNet), CodeT5 supported code summarization, code generation from comments, code refinement, and defect detection. CodeT5+ extended this to 16B parameters and added instruction tuning, matching or exceeding open code LLMs of the same era on HumanEval.
| Model | Release | Organization | Sizes | Notes |
|---|---|---|---|---|
| BART | Oct 2019 | Facebook AI Research | 140M (base), 400M (large) | Denoising autoencoder; strong on summarization and dialogue |
| T5 | Oct 2019 | Google Research | 60M, 220M, 770M, 3B, 11B | Span corruption pretraining on C4; unified text-to-text framing |
| PEGASUS | Dec 2019 | Google Research | 568M | Gap-sentences generation objective for abstractive summarization |
| mT5 | Oct 2020 | Google Research | 300M to 13B | Pretrained on mC4 across 101 languages |
| Switch Transformer | Jan 2021 | Google Research | 1.6T (sparse) | Mixture-of-experts on a T5 backbone |
| CodeT5 | Sep 2021 | Salesforce Research | 60M, 220M, 770M | Identifier-aware pretraining on 8.35M functions in 8 languages |
| LongT5 | Dec 2021 | Google Research | up to 3B | Transient global attention for inputs up to 16K tokens |
| UL2 | May 2022 | Google Research | 20B | Mixture-of-Denoisers; SOTA on 50 supervised tasks at release |
| PEGASUS-X | Aug 2022 | Google Research | 272M, 568M | Staggered block-local attention for 16K-token inputs |
| FLAN-T5 | Oct 2022 | Google Research | 80M, 250M, 780M, 3B, 11B | Instruction-tuned T5; 1.8K tasks plus chain-of-thought data |
| Flan-UL2 | Mar 2023 | Google Research | 20B | UL2 with FLAN instruction tuning |
| CodeT5+ | May 2023 | Salesforce Research | 220M to 16B | Multi-objective code LLM; instruction-tuned variant matches open code LLMs on HumanEval |
ByT5, released in May 2021, comes in the same five-size range as mT5 but trades the SentencePiece tokenizer for raw UTF-8 byte input. The T5X framework, written in JAX, reimplemented T5, mT5, UL2, and related models for TPU training and is the reference codebase for most public Google encoder-decoder checkpoints.
Text2text models have set or matched state-of-the-art results across a broad range of NLP tasks. Raffel et al. report that T5-11B reached an average score of 89.3 on SuperGLUE (compared with a human baseline of 89.8 reported with the benchmark), 90.06 F1 on SQuAD v1.1, a ROUGE-L of 39.65 on CNN/DailyMail summarization, and BLEU scores of 32.1 on WMT English-German and 43.4 on WMT English-French. BART achieved gains of up to 6 ROUGE points over the previous state of the art on XSum summarization and matched RoBERTa on GLUE and SQuAD despite being a generative model.
The following table summarizes representative results for several models.
| Model | Task | Benchmark | Score |
|---|---|---|---|
| T5-11B | Multi-task | SuperGLUE | 89.3 average |
| T5-11B | Reading comprehension | SQuAD v1.1 | 90.06 F1 |
| T5-11B | Translation | WMT En-De | 32.1 BLEU |
| T5-11B | Summarization | CNN/DailyMail | 39.65 ROUGE-L |
| BART-large | Summarization | XSum | 45.14 ROUGE-1 |
| PEGASUS-large | Summarization | XSum | 47.21 ROUGE-1 |
| mT5-XXL | Cross-lingual QA | TyDi QA GoldP | 82.5 F1 |
| FLAN-T5-XXL | Zero-shot reasoning | MMLU | 55.1 percent |
| UL2 20B | Zero-shot generation | SuperGLUE | exceeds 175B GPT-3 |
| CodeT5+ 16B | Code generation | HumanEval pass@1 | 35.0 percent |
Text2text models are canonical examples of pretrain-then-fine-tune transfer learning. The core insight is that pretraining on large unlabeled corpora with a self-supervised objective instills general language understanding that can be adapted to many downstream tasks with relatively small labeled datasets. Raffel et al. studied this extensively in the T5 paper, ablating pretraining objectives, model sizes, pretraining dataset sizes, and fine-tuning strategies.
Key findings from the T5 transfer learning ablations include: (1) multi-task pretraining (sharing all parameters across tasks via prefixes during pretraining, not just fine-tuning) helped some tasks but hurt others unless task mixing ratios were carefully tuned; (2) pretraining for longer on more data consistently improved downstream performance without saturation up to the largest configurations tested; (3) bigger models were consistently better, with the 11B model outperforming the 3B model across nearly all tasks; (4) span corruption outperformed other pretraining objectives including BERT-style masked language modeling and prefix language modeling.
T5 demonstrated that a single model with a task prefix can be fine-tuned jointly on many tasks, but the optimal mixing ratio between tasks was non-trivial. Equal mixing underweighted high-resource tasks; proportional mixing overweighted them. The paper explored several mixing strategies, including temperature-scaled sampling. FLAN-T5 and subsequent instruction-tuned variants took a more aggressive approach: train on as many diverse tasks as possible, expressed as natural-language instructions, which generalizes better to unseen tasks than proportional mixing on a fixed task set.
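Temperature-scaled sampling interpolates between the two extremes. In the sketch below, dataset sizes are optionally capped at a limit, raised to the power 1/T, and renormalized; T = 1 gives proportional mixing and large T approaches equal mixing. The function and parameter names are hypothetical, and the dataset sizes used in the example are made up.

```python
def mixing_rates(dataset_sizes, temperature=1.0, size_limit=None):
    """Return the sampling probability for each task.

    dataset_sizes: mapping from task name to number of examples.
    size_limit: optional cap on the size credited to any single task.
    """
    capped = {name: (min(size, size_limit) if size_limit else size)
              for name, size in dataset_sizes.items()}
    scaled = {name: size ** (1.0 / temperature) for name, size in capped.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


sizes = {"high_resource": 900, "low_resource": 100}  # illustrative sizes only
proportional = mixing_rates(sizes, temperature=1.0)
smoothed = mixing_rates(sizes, temperature=2.0)
```

With these sizes, proportional mixing samples the large task 90 percent of the time, while T = 2 lowers that to 75 percent, giving the small task more exposure without treating the two as equal.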
Encoder-decoder text2text models and decoder-only language models sit at different points in the design space. The bidirectional encoder lets every input token attend to every other input token, which is well suited to tasks where the model must understand a fixed input before producing a short structured output, such as classification, extractive question answering, and summarization. Cross-attention also separates the cost of reading the input from the cost of generating the output, so a long document with a short summary is cheaper to process than with a decoder-only architecture, which rolls the entire input through its causal stack alongside the output.
Decoder-only models, by contrast, have benefited disproportionately from scale and from prompt-based learning since 2020. A single autoregressive stack, the absence of an architectural prior about which positions are "input" and which are "output", and the ability to interleave instructions, examples, and continuations in one stream made decoder-only the dominant choice for frontier-scale large language models. Open-weight encoder-decoder models at or above roughly 20B parameters remain rare; UL2 20B, Flan-UL2 20B, and the sparse Switch family are the principal exceptions. Tay et al. nonetheless reported that UL2 outperformed a 175B GPT-3 on zero-shot SuperGLUE while using a small fraction of the compute.
The table below summarizes the key architectural and practical differences between the two families.
| Property | Text2text encoder-decoder | Decoder-only LLM |
|---|---|---|
| Encoder attention | Bidirectional (full self-attention over input) | Causal (attends only to past tokens) |
| Input-output separation | Explicit: encoder reads input, decoder writes output | Implicit: input and output share the same token stream |
| Cross-attention | Yes: decoder attends to all encoder hidden states at each layer | No |
| Pretraining objective | Denoising / span corruption / gap-sentence generation | Causal language modeling or a mix with denoising (UL2-style) |
| Relative strength | Classification, extraction, short structured output, translation, summarization | Open-ended generation, long-form reasoning, instruction following, in-context learning |
| Largest open-weight dense models | UL2 20B, Flan-UL2 20B | Llama 3 70B, Mistral, Qwen, Falcon and many others |
| Largest open-weight sparse models | Switch Transformer 1.6T | Mixtral, DeepSeek-V3 671B |
| Frontier closed models | (largely superseded at frontier scale) | GPT-4o, Gemini, Claude Sonnet, Grok |
| Per-token inference cost for summarization | Lower (input processed once by encoder; decoder generates short output) | Higher (full input+output causal pass per token) |
| Common deployment contexts | Summarization APIs, translation, structured extraction, code tasks | Chat, agents, reasoning, open-ended generation |
The convergence to decoder-only architectures at frontier scale reflects several practical factors. First, scaling-law studies found that, at the same total parameter count N, a decoder-only model performs comparably to an encoder-decoder model on many generation tasks; the encoder-decoder splits its N parameters between two stacks, each about half the size of the single decoder-only stack, so the extra architectural structure does not translate into a clear quality advantage. Second, in-context learning works naturally in a decoder-only model, where examples and a query can be concatenated in a single stream; the encoder-decoder format requires a fixed input-output split that does not accommodate fluid few-shot prompting as gracefully. Third, RL-based alignment (RLHF and its successors) was first scaled successfully on decoder-only models, creating a compounding advantage as the industry invested in those pipelines. The result is that as of 2025 the frontier reasoning models and large language models are almost exclusively decoder-only.
Text2text models cover most generation-flavored NLP tasks. Machine translation was the original target for sequence-to-sequence research and remains a strong fit for the encoder-decoder design. Abstractive text summarization is the area where BART, PEGASUS, and LongT5 are most widely adopted in production, including for news, scientific papers, and long-form documents. Question answering is supported in both extractive and generative forms; T5 introduced the "closed-book" formulation, in which the model answers factual questions without retrieving external passages.
Other common applications include paraphrasing, text simplification, headline generation, grammatical error correction, data-to-text (turning structured tables into prose), dialogue response generation, and natural language to SQL. CodeT5 and CodeT5+ target programming tasks such as code summarization, completion, defect detection, clone detection, and text-to-code retrieval. Multilingual variants (mT5, ByT5) extend each of these applications across languages, often outperforming language-specific baselines on low-resource pairs.
Text2text models have also been deployed in production settings for specialized document intelligence: contract review (summarizing clauses, extracting obligations), biomedical literature mining (abstracting and indexing clinical trial results), customer support (summarizing tickets, drafting responses), and search (generating extractive snippets or abstractive answers). Their deterministic prefix-to-output interface makes them easier to control than decoder-only models for high-precision production pipelines.
Text2text models share several limitations. Denoising pretraining does not match downstream tasks exactly, so fine-tuning or instruction tuning is generally still needed for strong zero-shot performance. The fixed split between encoder and decoder makes the architecture less flexible than a single causal stack for tasks where input and output blur (long multi-turn dialogues, code execution traces, agentic loops). The number of open-weight encoder-decoder checkpoints above 20B parameters is small, which limits direct comparisons against frontier decoder-only models. Generation diversity can also suffer because the encoder-decoder design tends to be more confident and less varied than a temperature-sampled decoder-only model.
Like all generative language models, text2text models can hallucinate facts, copy biases from their pretraining data, and produce unsafe content when prompted adversarially. The closed-book question-answering experiments in the original T5 paper made this explicit: even an 11B model misses many factual questions that are easily handled by retrieval-augmented systems.
Additional limitations specific to the architecture include: