T5 (Text-to-Text Transfer Transformer) is a transformer-based language model developed by researchers at Google AI. Introduced in a paper first posted to arXiv in October 2019 and published in the Journal of Machine Learning Research (JMLR) in 2020, T5 proposed a unified framework in which every natural language processing (NLP) task is cast as a text-to-text problem. Classification, translation, summarization, question answering, and even regression tasks are all reformulated so that both the input and the output are text strings. This simple but powerful idea allowed the same model architecture, training procedure, loss function, and hyperparameters to be applied across dozens of different tasks without modification.
The paper, titled "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," was authored by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Beyond introducing the T5 model itself, the paper presented an extensive empirical study comparing pre-training objectives, architectures, unlabeled datasets, transfer learning approaches, and scaling strategies. This systematic investigation made the paper one of the most cited works in the field, and its conclusions shaped subsequent research on pre-training and fine-tuning language models.
Google open-sourced all T5 model checkpoints and code, making the models freely available for research and commercial use.
By 2019, transfer learning had become the dominant paradigm in NLP. Models like ELMo, GPT, and BERT demonstrated that pre-training on large amounts of unlabeled text, then fine-tuning on downstream tasks, consistently outperformed training from scratch. However, these models used different architectures and different task-specific output heads. BERT used an encoder-only architecture and required adding a classification layer for each task. GPT used a decoder-only architecture and generated text autoregressively. Both approaches worked well for certain categories of tasks but required custom modifications depending on the downstream application.
The T5 authors recognized that this diversity of approaches made it difficult to compare transfer learning methods fairly. Different papers used different architectures, objectives, datasets, and evaluation protocols, making it hard to isolate which factor was responsible for observed improvements. T5 was designed as a common framework that could serve as both a practical model and a controlled experimental testbed.
The key insight was that virtually any NLP task can be framed as taking text as input and producing text as output. For a sentiment classification task, the model receives the text "classify: This movie is great" and outputs the text "positive." For English-to-German translation, it receives "translate English to German: That is good" and outputs "Das ist gut." For summarization, it receives a document prefixed with "summarize:" and outputs a shorter version. Even regression tasks can be handled by having the model output a string representation of a number. This unified text-to-text format eliminates the need for task-specific architectures or output heads.
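Concretely, the framing amounts to nothing more than string construction. The sketch below mirrors the examples in this paragraph; `make_example` and its prefixes are illustrative helpers, not part of any released T5 API.

```python
def make_example(task_prefix: str, source: str, target: str) -> tuple[str, str]:
    """Cast an NLP task instance as an (input text, output text) pair."""
    return (f"{task_prefix} {source}", target)

# Sentiment classification: the class label is itself a text string
clf = make_example("classify:", "This movie is great", "positive")

# Translation: same helper, same loss, different prefix
mt = make_example("translate English to German:", "That is good", "Das ist gut")
```

Because both examples are ordinary (string, string) pairs, a single cross-entropy loss over output tokens covers both tasks.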
T5 follows the original encoder-decoder transformer architecture proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need," with several modifications. The encoder processes the input text bidirectionally, and the decoder generates the output text autoregressively, attending to both the encoder's output and the previously generated tokens.
T5 differs from the original transformer in three main ways:
Relative positional embeddings: Instead of the sinusoidal absolute positional encodings used in the original transformer, T5 uses a learned relative position bias. Each possible offset between two token positions is assigned a scalar bias that is added to the attention logits before the softmax. The model uses 32 learned embedding buckets with ranges that increase logarithmically up to an offset of 128 positions, beyond which all positions map to the same bucket. This scheme allows the model to generalize to sequence lengths not seen during training. Position embedding parameters are shared across all layers, though each attention head within a layer learns its own set of biases.
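The bucketing scheme can be sketched as a scalar function. This is a simplified version modeled on the description above; the released implementations vectorize it and handle the decoder's causal (unidirectional) case separately.

```python
import math

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    """Map a (key_pos - query_pos) offset to one of `num_buckets` buckets:
    exact buckets for small offsets, logarithmically wider ones after that,
    and a single shared bucket beyond `max_distance`."""
    half = num_buckets // 2          # one half of the buckets per direction
    bucket = half if rel_pos > 0 else 0
    rel_pos = abs(rel_pos)
    max_exact = half // 2            # small offsets each get their own bucket
    if rel_pos < max_exact:
        return bucket + rel_pos
    # larger offsets share logarithmically spaced buckets, capped at the end
    log_bucket = max_exact + int(
        math.log(rel_pos / max_exact) / math.log(max_distance / max_exact)
        * (half - max_exact)
    )
    return bucket + min(log_bucket, half - 1)
```

Every offset past 128 lands in the final bucket for its direction, which is what lets the model run on sequences longer than those seen in training.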
Pre-layer normalization: T5 places layer normalization before each sub-layer (attention or feed-forward) rather than after, a configuration known as "pre-norm." This improves training stability. T5 also uses a simplified "RMSNorm" variant of layer normalization that only rescales activations without recentering them (no bias term and no subtraction of the mean).
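This simplified layer norm is what is now commonly called RMSNorm. A minimal NumPy sketch (learned scale `weight`, no bias, no mean subtraction; the epsilon value is illustrative):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Rescale by the root-mean-square of the activations; unlike standard
    # LayerNorm there is no mean subtraction and no bias term
    variance = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight
```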
No bias terms: T5 removes bias terms from all dense layers and layer normalization throughout the model. This is a minor simplification that reduces parameter count slightly without harming performance.
The activation function in the feed-forward sub-layers is ReLU in the original T5 models. Later variants such as T5 v1.1 switched to GeGLU (a gated linear unit with GeLU activation) for improved performance.
T5 uses SentencePiece tokenization with a Unigram subword model. The vocabulary contains 32,128 tokens, including 100 sentinel tokens used during the span corruption pre-training objective. The SentencePiece model was trained on a mixture of C4's English text together with German, French, and Romanian data, so that it could also cover the translation tasks studied in the paper.
The original T5 paper released five model sizes, from 60 million to 11 billion parameters. All variants use the same vocabulary and the same encoder-decoder structure; they differ only in depth, width, and the number of attention heads. The encoder and decoder share the same depth and hidden dimension in all variants.
| Variant | Parameters | Encoder layers | Decoder layers | d_model | d_ff | Attention heads | d_kv |
|---|---|---|---|---|---|---|---|
| T5-Small | 60M | 6 | 6 | 512 | 2,048 | 8 | 64 |
| T5-Base | 220M | 12 | 12 | 768 | 3,072 | 12 | 64 |
| T5-Large | 770M | 24 | 24 | 1,024 | 4,096 | 16 | 64 |
| T5-3B | 3B | 24 | 24 | 1,024 | 16,384 | 32 | 128 |
| T5-11B | 11B | 24 | 24 | 1,024 | 65,536 | 128 | 128 |
In this table, d_model is the hidden dimension, d_ff is the inner dimension of the feed-forward layers, and d_kv is the dimension of each attention head's key and value projections. The T5-3B and T5-11B models achieve their larger sizes primarily by widening the feed-forward layers and increasing the number of attention heads, rather than by adding more layers.
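As a sanity check, the table's dimensions roughly reproduce the quoted parameter counts. The sketch below counts only the dominant terms, assuming tied input/output embeddings (as in the original T5) and the ReLU feed-forward variant, and ignoring the small relative-position and layer-norm parameters:

```python
def approx_t5_params(vocab, layers, d_model, d_ff, heads, d_kv):
    """Rough parameter count for a T5-style encoder-decoder."""
    d_attn = heads * d_kv                # inner attention dimension
    embed = vocab * d_model              # tied input/output embeddings
    attn = 4 * d_model * d_attn          # Q, K, V, O projection matrices
    ffn = 2 * d_model * d_ff             # up- and down-projection (ReLU variant)
    enc = layers * (attn + ffn)          # self-attention + FF per encoder layer
    dec = layers * (2 * attn + ffn)      # self- + cross-attention + FF per layer
    return embed + enc + dec

base = approx_t5_params(32128, 12, 768, 3072, 12, 64)   # lands near 220M
```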
To pre-train T5, the authors created a new dataset called the Colossal Clean Crawled Corpus (C4). C4 was derived from the April 2019 snapshot of Common Crawl, a publicly available archive of web pages. The raw Common Crawl data contained roughly 1.4 trillion tokens, but the vast majority of this text was low-quality, duplicated, or not natural English.
The cleaning pipeline applied the following filters:
- Retained only lines ending in a terminal punctuation mark (period, exclamation mark, question mark, or closing quotation mark).
- Discarded pages with fewer than five sentences, and removed lines with fewer than three words.
- Removed pages containing any word from a blocklist of obscene or offensive terms.
- Removed lines containing the word "Javascript" and pages containing the placeholder text "lorem ipsum" or a curly brace (a signal of source code).
- Deduplicated the corpus by discarding all but one occurrence of any three-sentence span.
- Kept only pages classified as English with probability at least 0.99 by the langdetect language classifier.
After this cleaning process, C4 contained approximately 750 gigabytes of "reasonably clean and natural English text," roughly two orders of magnitude larger than Wikipedia. The dataset was made publicly available to the research community, and it has since been widely used for pre-training other language models. A later audit by Dodge et al. (2021) documented that C4 spans content from over 365 million internet domains.
T5 uses a denoising pre-training objective called span corruption (sometimes called "span masking" or "fill-in-the-blank"). During pre-training, 15% of the tokens in each input sequence are selected for corruption. Rather than masking individual tokens independently (as in BERT's masked language modeling), T5 groups consecutive corrupted tokens into spans. Each span is replaced with a single unique sentinel token (e.g., <extra_id_0>, <extra_id_1>, etc.). The target output consists of the original tokens from each corrupted span, delimited by the corresponding sentinel tokens.
For example, if the input is "The quick brown fox jumps over the lazy dog" and the spans "quick brown" and "lazy" are selected for corruption, the model's input becomes:

The <extra_id_0> fox jumps over the <extra_id_1> dog

And the target output is:

<extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>

where the final sentinel token marks the end of the target sequence.
The average span length is 3 tokens. The authors found that this span corruption approach outperformed other pre-training objectives they tested, including standard language modeling (predicting the next token), BERT-style masked language modeling (predicting individual masked tokens), and deshuffling (reconstructing the original order of a shuffled input). The 15% corruption rate was chosen based on historical precedent from BERT, and the authors confirmed that the model's performance was not highly sensitive to this parameter; corruption rates of 10%, 15%, and 25% all yielded similar results.
An important advantage of span corruption is that the target sequences are much shorter than the full input, since only the corrupted tokens need to be predicted. This makes pre-training more computationally efficient than objectives that require the model to reconstruct the entire input.
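A toy version of the corruption step, using explicit span indices over whitespace tokens rather than random sampling over subwords (real pre-training samples spans randomly, with a mean span length of 3); the trailing sentinel that terminates the target follows the paper's format:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span in `tokens` with a sentinel token and
    collect the dropped-out tokens, sentinel-delimited, as the target.
    `spans` must be sorted and non-overlapping."""
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel ends the target
    return " ".join(inp), " ".join(tgt)
```

Running it on the example above reproduces the corrupted input and its sentinel-delimited target.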
All T5 models were pre-trained with a maximum sequence length of 512 tokens for both the encoder input and decoder target. Sequences were packed so that multiple shorter examples could be combined into a single 512-token sequence, improving training efficiency. The batch size was 128 sequences, meaning each batch contained approximately 65,536 tokens (2^16).
The models were trained for 2^19 steps (524,288 steps), exposing them to a total of approximately 2^35 tokens, or roughly 34 billion tokens. Since C4 contains far more than 34 billion tokens, the model never repeated any training data during pre-training.
The learning rate followed an inverse square root schedule: lr = 1/sqrt(max(n, k)), where n is the current training step and k = 10,000 is a warmup constant. Training used the AdaFactor optimizer rather than Adam, as AdaFactor uses less memory by maintaining moving averages of row and column sums of the squared gradients rather than the full matrix.
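The schedule itself is a one-liner; `warmup` is the constant k = 10,000 from the formula above:

```python
import math

def inverse_sqrt_lr(step, warmup=10_000):
    # lr = 1 / sqrt(max(n, k)): constant at 1/sqrt(k) for the first k steps,
    # then decaying as 1/sqrt(n)
    return 1.0 / math.sqrt(max(step, warmup))
```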
The largest T5-11B model was trained on TPU v3 pods. All models were implemented using the Mesh TensorFlow library, which supports efficient model parallelism across multiple TPU chips.
A defining feature of the T5 paper is its comprehensive empirical study. Rather than simply proposing a new model, the authors systematically evaluated how different design choices affect transfer learning performance. They used the T5-Base model (220M parameters) for most experiments, training each variant on C4 and evaluating on a suite of benchmarks.
The study compared the following factors:
| Factor | Variations tested |
|---|---|
| Architecture | Encoder-decoder, decoder-only, prefix language model |
| Pre-training objective | Language modeling, span corruption, BERT-style MLM, deshuffling |
| Corruption rate | 10%, 15%, 25%, 50% |
| Corruption span length | Individual tokens (i.i.d.), mean span length 2, 3, 5, 10 |
| Dataset | C4, unfiltered Common Crawl, Wikipedia + BooksCorpus, WebText-like |
| Dataset size | Full C4, 2^29 tokens, 2^27 tokens, 2^25 tokens, 2^23 tokens |
| Fine-tuning strategy | Full model fine-tuning, adapter layers, gradual unfreezing |
| Multi-task learning | Pre-train then fine-tune, multi-task pre-training, leave-one-out multi-task |
| Scaling | More parameters vs. more training steps vs. ensembling |
The study produced several influential conclusions:
Encoder-decoder models outperform decoder-only models when both have the same total number of parameters. The encoder-decoder architecture provides a natural separation between understanding the input and generating the output, which benefits seq2seq tasks like translation and summarization.
Span corruption outperforms other pre-training objectives. Denoising objectives (predicting corrupted tokens) consistently beat autoregressive language modeling for downstream task performance, while also being more computationally efficient because the decoder targets are shorter.
Pre-training on in-domain data helps, but dataset size matters more. Models pre-trained on a small domain-specific dataset (like Wikipedia) could outperform models pre-trained on a larger but noisier dataset, but the best results came from a large, cleaned dataset (C4).
Pre-training then fine-tuning outperforms multi-task learning for most tasks, although multi-task pre-training followed by fine-tuning can match or exceed pure pre-training in some settings.
Scaling up model size, training data, and training time all improve performance, with model size having the largest impact per additional FLOP. However, ensembling multiple models also provides significant gains.
Fine-tuning all model parameters works best. While adapter layers (which freeze the pre-trained weights and only train small additional layers) are more parameter-efficient, full fine-tuning consistently produced better results.
These findings guided the design of the final T5 models and influenced subsequent work on scaling language models.
The largest T5-11B model achieved state-of-the-art results on a wide range of NLP benchmarks when it was released.
T5-11B set new state-of-the-art scores on the GLUE benchmark (a collection of nine sentence-level classification tasks) and the SuperGLUE benchmark (a harder successor to GLUE with eight tasks). On SuperGLUE, T5-11B achieved an average score of 88.9, approaching the human baseline of 89.8. This performance on SuperGLUE was a significant milestone, as the benchmark had been designed specifically to be challenging for contemporary models.
On the Stanford Question Answering Dataset (SQuAD), T5-11B achieved strong results by framing the reading comprehension task as text generation. Rather than predicting start and end positions in the passage (as BERT-style models do), T5 simply generates the answer text directly.
On the CNN/Daily Mail summarization benchmark, T5-11B set a new state-of-the-art ROUGE score, demonstrating the effectiveness of the text-to-text approach for abstractive summarization.
In a follow-up study by Roberts et al. (2020), T5 was evaluated on closed-book question answering, where the model must answer factual questions without access to any external documents, relying entirely on knowledge stored in its parameters. T5-11B achieved 50.1% exact match accuracy on TriviaQA and 34.5% on Natural Questions in the closed-book setting, demonstrating that large language models can store and recall substantial amounts of world knowledge.
The text-to-text framework is the conceptual core of T5. Each task is converted to a text-to-text format by prepending a task-specific text prefix to the input. The prefixes used during fine-tuning include:
| Task | Input prefix | Example input | Example output |
|---|---|---|---|
| English-German translation | "translate English to German:" | "translate English to German: That is good." | "Das ist gut." |
| English-French translation | "translate English to French:" | "translate English to French: That is good." | "C'est bien." |
| Sentiment classification | "sst2 sentence:" | "sst2 sentence: This movie is great." | "positive" |
| Sentence similarity | "stsb sentence1: ... sentence2:" | "stsb sentence1: The cat sat. sentence2: A cat is sitting." | "4.2" |
| Summarization | "summarize:" | "summarize: [long article text]" | "[short summary]" |
| Question answering | "question: ... context:" | "question: Who wrote Hamlet? context: ..." | "William Shakespeare" |
This design has several practical advantages. A single model checkpoint can be fine-tuned (or even used zero-shot) on any task by simply changing the prefix. There is no need to design task-specific output heads or loss functions. The cross-entropy loss over generated tokens serves as the universal training objective.
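The one non-obvious row in the table is STS-B, a regression task: the paper rounds every similarity score to the nearest increment of 0.2 and uses the string form of the result as the target, a conversion that can be sketched as:

```python
def stsb_target(score: float) -> str:
    """Serialize an STS-B similarity score (0.0-5.0) as a T5 target string,
    rounded to the nearest 0.2 increment as in the T5 paper."""
    return str(round(round(score / 0.2) * 0.2, 1))
```

This leaves only a small, fixed set of target strings, so "regression" reduces to generating one of roughly 26 short token sequences.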
A potential disadvantage is that for classification tasks, the model must generate an entire text label (e.g., "positive" or "entailment") token by token, which is less efficient than producing a single logit. In practice, the overhead is small because most label strings are just one or two tokens long.
T5 occupies a distinct position in the landscape of pre-trained language models. Whereas BERT uses only the encoder half of the transformer and GPT uses only the decoder half, T5 uses the full encoder-decoder architecture. This has implications for what types of tasks each model handles well.
| Feature | BERT | GPT | T5 |
|---|---|---|---|
| Architecture | Encoder-only | Decoder-only | Encoder-decoder |
| Attention pattern | Bidirectional (full) | Causal (left-to-right) | Bidirectional encoder, causal decoder |
| Pre-training objective | Masked language modeling | Autoregressive language modeling | Span corruption (denoising) |
| Task adaptation | Add task-specific output head | Prompt + generation | Text prefix + generation |
| Natural fit | Classification, token labeling, extraction | Text generation, dialogue | Seq2seq tasks: translation, summarization, QA |
| Output format | Class labels or span pointers | Free-form text | Free-form text |
| Example sizes | 110M (Base), 340M (Large) | 117M (GPT-1), 1.5B (GPT-2) | 60M (T5-Small) to 11B (T5-11B) |
BERT's encoder-only architecture makes it strong for understanding tasks (classification, named entity recognition, extractive QA) but unable to generate free-form text. GPT's decoder-only architecture excels at text generation but processes input only from left to right, limiting its ability to fully attend to bidirectional context. T5's encoder-decoder architecture provides bidirectional encoding of the input and autoregressive generation of the output, making it naturally suited for tasks that require both understanding an input and producing a structured output.
The T5 paper's systematic comparison confirmed that encoder-decoder models tend to outperform decoder-only models of the same total parameter count on the benchmarks studied. However, as decoder-only models have continued to scale (with GPT-3 reaching 175 billion parameters and later models growing even larger), the decoder-only paradigm has become dominant for general-purpose language models, partly because it simplifies the training pipeline and scales more efficiently for pure generation tasks.
T5's design has been extended, refined, and adapted in numerous follow-up models.
Google released T5 v1.1 as an improved version of the original T5 with several changes: (1) the feed-forward activation function was changed from ReLU to GeGLU (Gated Linear Unit with GeLU activation), which improved quality; (2) dropout was disabled during pre-training; (3) the model was pre-trained on C4 only, without mixing in any downstream task data; and (4) the parameter sharing between the input embedding matrix and the final output (classifier) layer was removed, so the two are trained independently. These changes produced better downstream performance without changing the model sizes.
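The GeGLU feed-forward block can be sketched as follows. This is a minimal NumPy illustration using the tanh approximation of GeLU; the weight names are illustrative rather than taken from any released checkpoint:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_down):
    # GeGLU feed-forward: a GeLU-activated "gate" projection multiplied
    # elementwise with a linear "up" projection, then projected back down.
    # This replaces the single ReLU projection of the original T5.
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down
```

Note that GeGLU uses three weight matrices where the ReLU block uses two, which slightly shifts where a given parameter budget is spent.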
Flan-T5 is an instruction-tuned version of T5 released by Google in 2022 as part of the Flan 2022 collection. The model was initialized from a pre-trained T5 checkpoint and then fine-tuned on a curated mixture of over 1,800 tasks drawn from multiple sources, including Flan 2021, P3++, Super-Natural Instructions, and custom additions spanning question answering, natural language inference, code, dialogue, and chain-of-thought reasoning.
The instruction-tuning process used a mix of three prompting formats: zero-shot (instructions only), few-shot (instructions with examples), and chain-of-thought (instructions that ask for step-by-step reasoning). Training with all three formats yielded roughly 2% higher accuracy across all evaluation settings compared to training with any single format.
Flan-T5 comes in five sizes, matching the T5 v1.1 variants:
| Variant | Parameters |
|---|---|
| Flan-T5-Small | 80M |
| Flan-T5-Base | 250M |
| Flan-T5-Large | 780M |
| Flan-T5-XL | 3B |
| Flan-T5-XXL | 11B |
Flan-T5 outperformed the original T5 by 3% to 17% or more across various evaluation settings. It also demonstrated that instruction-tuned models converge faster and reach higher accuracy when further fine-tuned on individual downstream tasks, making them more computationally efficient starting checkpoints. Flan-T5 became one of the most widely used open-source language models for research and production applications, and it played an important role in popularizing instruction tuning as a training methodology.
mT5 (Multilingual T5) is a multilingual variant of T5 introduced by Xue et al. in 2020. It follows the same architecture and pre-training objective as T5 but was pre-trained on mC4, a multilingual version of C4 covering 101 languages. The mC4 corpus was created by applying the same cleaning heuristics used for C4 to Common Crawl data in each language, using the cld3 library for language identification. The resulting corpus contains approximately 6.3 trillion tokens across all languages.
mT5 was released in five sizes mirroring the T5 variants (Small through XXL; each mT5 model is somewhat larger than its T5 counterpart, with mT5-XXL at 13 billion parameters, largely because of its roughly 250,000-token multilingual vocabulary) and achieved state-of-the-art results on several multilingual benchmarks, including XTREME and XNLI. The authors also described a technique to prevent "accidental translation" in zero-shot cross-lingual transfer, where a generative model might produce output in the wrong language.
ByT5 (Byte-level T5), introduced by Xue et al. in 2022, operates directly on raw UTF-8 byte sequences rather than using a subword tokenizer. This eliminates the need for a tokenizer entirely and makes the model robust to misspellings, character-level noise, and morphologically rich languages where subword tokenization can be suboptimal. ByT5 was particularly effective on tasks involving word-internal phenomena such as spelling correction, pronunciation prediction, and morphological analysis. The trade-off is that byte sequences are typically longer than subword token sequences, increasing computational cost.
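A byte-level "tokenizer" is only a few lines; the +3 offset below, reserving the first ids for pad/eos/unk, mirrors the layout of the released ByT5 vocabulary, but treat that detail as an assumption:

```python
def byt5_encode(text: str) -> list[int]:
    # UTF-8 bytes shifted by 3, reserving ids 0-2 for pad/eos/unk (assumed layout)
    return [b + 3 for b in text.encode("utf-8")]

def byt5_decode(ids: list[int]) -> str:
    return bytes(i - 3 for i in ids).decode("utf-8")
```

Note how a non-ASCII character costs more than one id, which is exactly the sequence-length trade-off described above.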
LongT5, introduced by Guo et al. in 2022, extends T5 to handle long input sequences efficiently. Standard transformer attention has quadratic time and memory complexity in the sequence length, which makes processing long documents prohibitively expensive. LongT5 addresses this with two efficient encoder attention variants: local attention (a sliding-window pattern) and transient global attention (TGlobal), which augments the local pattern with dynamically constructed global summary tokens; both reduce the encoder's complexity to linear in the input length. LongT5 was pre-trained using the PEGASUS principle-sentence-generation objective (designed for summarization) and demonstrated strong performance on long-document summarization tasks.
CodeT5, developed by Salesforce Research, adapted the T5 architecture for programming tasks. It was pre-trained on code from multiple programming languages and could perform code summarization, code generation, code translation, and defect detection. CodeT5+ (2023) extended this further with models ranging from 220 million to 16 billion parameters, employing a flexible architecture that could operate in encoder-only, decoder-only, or encoder-decoder mode depending on the task.
UL2 (Unifying Language Learning Paradigms), introduced by Yi Tay, Mostafa Dehghani, and colleagues at Google in 2022, built directly on the T5 architecture but proposed a fundamentally different pre-training objective. Rather than using a single denoising task, UL2 uses a Mixture-of-Denoisers (MoD) approach that combines three classes of denoising:
- R-denoising (regular): standard T5-style span corruption with short spans and a low corruption rate.
- S-denoising (sequential): prefix language modeling, in which the model sees the beginning of a sequence and must generate the continuation.
- X-denoising (extreme): aggressive corruption using long spans and/or high corruption rates, forcing the model to reconstruct long stretches of text from limited context.
UL2 also introduced the concept of "mode switching," where special tokens in the input signal which denoising mode was used, allowing the model to adapt its behavior at inference time. The UL2-20B model outperformed both T5 and GPT-style models of comparable size across 50 supervised NLP benchmarks, demonstrating that the mixture-of-denoisers approach provides a better pre-training signal than any single objective.
T5X is a modular, composable, and research-friendly framework developed by Google for training, evaluating, and serving sequence models. Built on top of JAX and Flax, T5X provides the reference implementation for T5 and its variants and was designed to replace the original Mesh TensorFlow-based implementation.
T5 and its variants have been applied across a broad range of NLP tasks in both research and industry.
T5's encoder-decoder architecture makes it a natural fit for abstractive summarization, where the model must read a long document and generate a concise summary. T5-11B set state-of-the-art results on the CNN/Daily Mail benchmark, and Flan-T5 models have been widely adopted for summarization in production systems.
Although T5 was primarily designed for English, the text-to-text format handles translation naturally. The mT5 variant extended this capability to 101 languages, and fine-tuned mT5 models have been used in multilingual translation systems.
T5 can perform both extractive and generative question answering. In the extractive setting, the model generates the answer span from the context. In the closed-book setting, the model generates answers purely from its parametric knowledge. The closed-book capability demonstrated by Roberts et al. (2020) showed that T5-11B can store and retrieve factual knowledge effectively.
By framing classification as text generation (outputting a label string), T5 can handle any classification task without a task-specific head. This approach has been used for sentiment analysis, topic categorization, natural language inference, and content moderation.
The CodeT5 family demonstrated that the T5 architecture transfers well to programming tasks. These models can summarize code, generate code from natural language descriptions, translate between programming languages, and identify bugs.
T5 has been applied to structured data-to-text tasks, such as generating natural language descriptions from tables, knowledge graphs, or database records. The encoder processes the structured input, and the decoder generates fluent text describing it.
Fine-tuned T5 models have been used for named entity recognition, relation extraction, and event extraction by casting these structured prediction tasks as sequence-to-sequence problems where the model generates a structured text output.
T5's impact extends beyond its own model family. Several ideas introduced or popularized by the T5 paper have become standard practices in NLP research.
Text-to-text framing: T5 demonstrated that framing all tasks as text generation is both practical and effective. This idea influenced later models, including GPT-3 (which uses text-to-text prompting for few-shot learning) and instruction-tuned models like InstructGPT and ChatGPT.
Systematic empirical studies: The T5 paper set a standard for rigorous empirical comparison of design choices. Its methodology of isolating individual factors while holding everything else constant has been adopted by subsequent large-scale studies, including the Chinchilla scaling laws paper and the LLaMA technical report.
Instruction tuning: The Flan-T5 line of work was instrumental in demonstrating that instruction-tuning on a diverse mixture of tasks dramatically improves a model's ability to follow natural language instructions. This approach directly influenced the development of InstructGPT, ChatGPT, and other instruction-following models.
Encoder-decoder vs. decoder-only debate: T5's finding that encoder-decoder models outperform decoder-only models at the same parameter count sparked ongoing discussion about optimal architectures. While decoder-only models have become dominant at the largest scales (partly for simplicity and scaling efficiency), encoder-decoder models like T5 remain competitive for tasks where input comprehension and output generation are distinct steps.
Open-source model ecosystem: Google's decision to release T5 checkpoints, code, and training data freely helped build the open-source language model ecosystem. T5 models are among the most downloaded on Hugging Face, and they serve as the foundation for hundreds of fine-tuned models across domains.
T5 has several known limitations.
English-centric: The original T5 models were trained exclusively on English text. While mT5 addresses multilingual coverage, the English-only C4 dataset limits the original model's applicability to other languages.
Fixed sequence length: T5's maximum sequence length of 512 tokens limits its ability to process long documents without truncation or chunking. LongT5 was developed specifically to address this limitation.
Computational cost: The T5-11B model requires substantial compute for both training and inference. Fine-tuning the 11B model requires multiple GPUs or TPUs, and serving it at scale is expensive. Smaller variants trade off quality for efficiency.
Autoregressive decoding for classification: For simple classification tasks, generating a text label token by token is less efficient than computing a single logit. While the overhead is small in practice, it adds latency compared to encoder-only models like BERT.
Pre-training data concerns: The C4 dataset, while cleaned, still contains biases, inaccuracies, and potentially harmful content from the internet. An audit by Dodge et al. (2021) found that C4 underrepresents content from minority groups and certain geographic regions.
All original T5 model checkpoints are available through the Hugging Face Transformers library under identifiers such as google-t5/t5-small, google-t5/t5-base, google-t5/t5-large, google-t5/t5-3b, and google-t5/t5-11b. Flan-T5 checkpoints are available as google/flan-t5-small through google/flan-t5-xxl. mT5 checkpoints cover the same size range under google/mt5-small through google/mt5-xxl.
The original implementation was in Mesh TensorFlow, available at the google-research/text-to-text-transfer-transformer GitHub repository. The newer T5X framework provides a JAX/Flax implementation. The C4 dataset is available through TensorFlow Datasets and on Hugging Face Datasets.
All models are released under the Apache 2.0 license, allowing free use for research and commercial purposes.